## Abstract

We present *distinct*, a general method for differential analysis of full distributions that is well suited to applications on single-cell data, such as single-cell RNA sequencing and high-dimensional flow or mass cytometry data. High-throughput single-cell data reveal an unprecedented view of cell identity and allow complex variations between conditions to be discovered; nonetheless, most methods for differential expression target differences in the mean and struggle to identify changes where the mean is only marginally affected. *distinct* is based on a hierarchical non-parametric permutation approach and, by comparing empirical cumulative distribution functions, identifies both differential patterns involving changes in the mean and more subtle variations that do not involve the mean. We performed extensive benchmarks across both simulated and experimental datasets from single-cell RNA sequencing and mass cytometry, where *distinct* shows favourable performance, identifies more differential patterns than competitors, and displays good control of false positive and false discovery rates. *distinct* is available as a Bioconductor R package.

## Background

Technology developments in the last decade have led to an explosion of high-throughput single-cell data, such as single-cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry data, allowing researchers to investigate biological mechanisms at single-cell resolution. Single-cell data have also extended the canonical definition of differential expression by displaying cell-type specific responses across conditions, known as differential state (DS) [28], where genes or proteins vary in specific sub-populations of cells (e.g., a cytokine response in myeloid cells but not in other leukocytes [10]). Classical bulk differential expression methods have been shown to perform well when used on single-cell measurements [22, 23, 27] and on aggregated data (i.e., averages or sums across cells), also referred to as pseudo-bulk (PB) [5, 28]. However, most bulk and PB tools focus on shifts in the mean, and may conceal information about cell-to-cell heterogeneity. Indeed, single-cell data can show more complex variations (Figure 1 and Supplementary Figure 1); such patterns can arise from increased stochasticity and heterogeneity, for example owing to oscillatory and unsynchronized gene expression between cells, or when some cells respond differently to a treatment than others [12, 27]. In addition to bulk and PB tools, other methods have been specifically proposed to perform differential analyses on single-cell data (notably: *scDD* [12], *SCDE* [11], *MAST* [8], *BASiCS* [26] and mixed models [24]). Nevertheless, they all present significant limitations: *BASiCS* does not perform cell-type specific differential testing between conditions, *scDD* does not directly handle covariates and biological replicates, while PB, *SCDE*, *MAST* and mixed models performed poorly in previous benchmarks when detecting differential patterns that do not involve the mean [5, 12].

## Results

### *distinct*’s full distribution approach

To overcome these challenges, we developed *distinct*, a flexible and general statistical methodology to perform differential analyses between groups of distributions. *distinct* is particularly suitable to compare groups of samples (i.e., biological replicates) on single-cell data.

Our approach computes the empirical cumulative distribution function (ECDF) from the individual (e.g., single-cell) measurements of each sample, and compares ECDFs to identify changes between full distributions, even when the mean is unchanged or marginally involved (Figure 1 and Supplementary Figure 1). First, we compute the ECDF of each individual sample; then, we build a fine grid and, at each cut-off, we average the ECDFs within each group, and compute the absolute difference between such averages. A test statistic, *s*^{obs}, is obtained by adding these absolute differences.

More formally, assume we are interested in comparing two groups, that we call *A* and *B*, for which *N*_{A} and *N*_{B} samples are available, respectively. The ECDF for the *i*-th sample in the *j*-th group is denoted by $\hat{F}_{j,i}$, for *j* ∈ {*A*, *B*} and *i* = 1, … , *N*_{j}. We then define *K* equally spaced cut-offs between the minimum, *min*, and maximum, *max*, values observed across all samples: *b*_{1}, … , *b*_{K}, where *b*_{k} = *min* + *k* × *l*, for *k* = 1, … , *K*, with *l* = (*max* − *min*)/(*K* + 1) being the distance between two consecutive cut-offs. We exclude *min* and *max* from the cut-offs because, trivially, $\hat{F}_{j,i}(min) \approx 0$ and $\hat{F}_{j,i}(max) = 1$, ∀ *j*, *i*. At every cut-off, we compute the absolute difference between the mean ECDF in the two groups; our test statistic, *s*^{obs}, is obtained by adding these differences across all cut-offs:

$$s^{obs} = \sum_{k=1}^{K} \left| \frac{1}{N_A} \sum_{i=1}^{N_A} \hat{F}_{A,i}(b_k) - \frac{1}{N_B} \sum_{i=1}^{N_B} \hat{F}_{B,i}(b_k) \right|.$$

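The construction above can be sketched in a few lines of Python (a simplified, single-feature illustration with illustrative function names; the actual implementation is the Bioconductor R package):

```python
import numpy as np

def ecdf_at(x, cutoffs):
    """Empirical CDF of sample x evaluated at each cut-off."""
    return np.searchsorted(np.sort(x), cutoffs, side='right') / x.size

def s_obs(group_A, group_B, K=25):
    """Sum, over K interior cut-offs, of the absolute difference
    between the two groups' average ECDFs."""
    vals = np.concatenate(group_A + group_B)
    spacing = (vals.max() - vals.min()) / (K + 1)   # distance l between cut-offs
    b = vals.min() + spacing * np.arange(1, K + 1)  # exclude min and max
    mean_A = np.mean([ecdf_at(x, b) for x in group_A], axis=0)
    mean_B = np.mean([ecdf_at(x, b) for x in group_B], axis=0)
    return np.abs(mean_A - mean_B).sum()
```

Each sample's ECDF is computed separately and only then averaged within its group, so within-group variability is preserved, as discussed below.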
Note that in differential state analyses, these operations are repeated for every gene-cluster combination.

Intuitively, *s*^{obs}, which ranges in [0, ∞), approximates the area between the average ECDFs, and represents a measure of distance between two groups of densities: the bigger *s*^{obs}, the greater the distance between groups. The number of cut-offs *K*, which can be defined by users, is set to 25 by default, because no detectable difference in performance was observed when further increasing it (data not shown). Note that, although at each cut-off we compute the average across each group’s curves, ECDFs are computed separately for each individual sample, therefore our approach still accounts for the within-group variability; indeed, at a given threshold, the average of the sample-specific ECDFs differs from the group-level ECDF (i.e., the curve based on all individual measurements from the group). The null distribution of *s*^{obs} is then estimated via a hierarchical non-parametric permutation approach (see Methods). A major disadvantage of permutation tests, which often restricts their use on biological data, is that too few permutations are available from small samples. We overcome this by permuting cells, which is still possible in small samples, because there are many more cells than samples. In principle, this may lead to an inflation of false positives due to lack of exchangeability (see Methods); nonetheless, in our analyses, *distinct* provides good control of both false positive and false discovery rates.

Importantly, *distinct* is general and flexible: it targets complex changes between groups, explicitly models biological replicates within a hierarchical framework, does not rely on asymptotic theory, avoids parametric assumptions, and can be applied to arbitrary types of data. Additionally, *distinct* can adjust for sample-level cell-cluster specific covariates (i.e., whose effect varies across cell clusters), such as batch effects: *distinct* fits a linear model with the input data (e.g., CPMs or log2-CPMs) as response variable, and the covariates as predictors; the method then removes the estimated effect of covariates, and performs differential testing on these normalized values (see Methods).

Furthermore, to enhance the interpretability of differential results, *distinct* provides functionalities to compute (log) fold changes between conditions, and to plot densities and ECDFs, both for individual samples and at the group-level.
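As a sketch of the fold-change functionality, a log2 fold change between group means can be computed as below; the function name and the pseudo-count are hypothetical illustrations, not *distinct*'s actual interface:

```python
import numpy as np

def log2_fold_change(group_A, group_B, pseudocount=1e-8):
    """Hypothetical sketch: log2 ratio of mean expression in group A over
    group B, pooling cells across each group's samples. The pseudo-count
    guards against division by zero and is an assumption, not a value
    taken from the distinct package."""
    mean_A = np.mean(np.concatenate(group_A))
    mean_B = np.mean(np.concatenate(group_B))
    return np.log2((mean_A + pseudocount) / (mean_B + pseudocount))
```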

Note that, although *distinct* and the Kolmogorov-Smirnov [15] (KS) test share similarities (they both compare distributions via non-parametric tests), the two approaches present several conceptual differences. Firstly, the KS test considers the maximum distance between two ECDFs, while our approach estimates the overall distance between ECDFs, which in our view is a more appropriate way to measure the difference between distributions. Secondly, the KS test only compares two individual densities, while our framework compares groups of distributions. Thirdly, while the KS statistic relies on asymptotic theory, our framework uses a permutation test. Finally, a comparison between *distinct* and *scDD* [12] based on the KS test (labelled *scDD-KS*) shows that our method, compared to the KS test, has greater statistical power to detect differential effects and leads to fewer false discoveries (see Simulation studies).
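The first difference can be made concrete with a toy computation: a KS-style statistic keeps only the single largest ECDF gap, while a *distinct*-style statistic accumulates the gaps across all cut-offs (the example data below are arbitrary):

```python
import numpy as np

a = np.array([0.1, 0.4, 0.5, 0.9, 1.3])
b = np.array([0.2, 0.6, 1.0, 1.1, 1.4])
grid = np.linspace(0.1, 1.4, 25)                 # evaluation cut-offs

# ECDFs of the two samples evaluated on the common grid
Fa = np.searchsorted(np.sort(a), grid, side='right') / a.size
Fb = np.searchsorted(np.sort(b), grid, side='right') / b.size

ks_like = np.abs(Fa - Fb).max()    # KS-style: one maximum deviation
summed = np.abs(Fa - Fb).sum()     # distinct-style: overall distance
```

Two pairs of distributions can share the same maximum deviation yet differ substantially in the summed (area-like) distance, which is the information the KS statistic discards.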

### Simulation studies

We conducted an extensive benchmark, based on scRNA-seq and mass cytometry simulated and experimental datasets to investigate *distinct*’s ability to identify differential patterns in sub-populations of cells.

First, we simulated droplet scRNA-seq data via *muscat* [5] (see Methods). We ran five simulation replicates for each of the differential profiles in Figure 1, with 10% of the genes being differential in each cluster, where DE (differential expression) indicates a shift in the entire distribution, DP (differential proportion) implies two mixture distributions with different proportions of the two components, DM (differential modality) assumes a unimodal and a bimodal distribution, DB (both differential modality and different component means) compares a unimodal and a bimodal distribution with the same overall mean, and DV (differential variability) refers to two unimodal distributions with the same mean but different variance (Figure 1 and Supplementary Figure 1). Each individual simulation consists of 4,000 genes, 3,600 cells, separated into 3 clusters, and two groups of 3 samples each, corresponding to an average of 200 cells per sample in each cluster.
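As an illustration of the DB pattern, which is the hardest for mean-based tools, a toy bimodal mixture can match a unimodal group's overall mean while differing sharply in shape (the parameters below are arbitrary, not *muscat*'s simulation settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# unimodal group: a single Gaussian centred at 0
unimodal = rng.normal(0.0, 1.0, n)

# DB group: an equal-weight mixture of two components at -2 and +2,
# so the overall mean is also (approximately) 0
component = rng.integers(0, 2, n)
bimodal = rng.normal(np.where(component == 1, 2.0, -2.0), 0.5, n)
```

A mean-based test sees almost no signal between these two groups, while their ECDFs differ markedly across most thresholds.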

We considered three different normalizations: counts per million (CPMs), logarithm of CPMs to base 2 (log2-CPMs) and residuals from *sctransform*’s variance stabilizing normalization (vstresiduals) [9]. We compared *distinct* to several PB approaches from *muscat*, based on *edgeR* [21], *limma-voom* and *limma-trend* [20], which emerged among the best performing methods for differential analyses from scRNA-seq data [5,23]. We further considered three methods from *muscat* based on mixed models (MM), namely *MM-dream2*, *MM-vstresiduals* and *MM-nbinom* (see Methods). Finally, we included *scDD* [12], which is conceptually similar to our approach: *scDD* implements a non-parametric method to detect changes between individual distributions from scRNA-seq, based on the Kolmogorov-Smirnov test, *scDD-KS*, and on a permutation approach, *scDD-perm*. For *scDD-perm* we used 100 permutations to reduce the computational burden.
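For reference, CPMs and log2-CPMs have simple closed forms; a minimal sketch (the analyses themselves used *scater*; the pseudo-count of 1 for log2-CPMs is an assumption of this sketch):

```python
import numpy as np

def cpm(counts):
    """Counts per million: counts is a genes x cells matrix;
    each cell (column) is scaled by its library size."""
    lib_sizes = counts.sum(axis=0)
    return counts / lib_sizes * 1e6

def log2_cpm(counts, pseudocount=1.0):
    # pseudo-count assumed here to avoid log(0)
    return np.log2(cpm(counts) + pseudocount)
```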

In all scenarios and on all three input datasets, *distinct* shows favourable performance: it has good statistical power while controlling for the false discovery rate (FDR) (Figure 2). In particular, for DE, DP and DM, *distinct* has similar performance to the best performing competitors (*edgeR.counts* and *limma-trend.log2-CPMs*), while for DB and DV, it achieves significantly higher true positive rate (TPR), especially when using *log2-CPMs*. PB methods in general perform well for differential patterns involving changes in the mean (DE, DP and DM), but struggle to identify DB and DV patterns. *scDD* provides good TPR across all patterns when using the KS test on vstresiduals (*scDD-KS.vstresiduals*), while the TPR is significantly reduced when using *log2-CPMs* and with the permutation approach (*scDD-perm*); however, *scDD* methods also show a significant inflation of the FDR. In contrast, MM methods provide good control of the FDR but have low statistical power in all differential scenarios.

We further simulated five null simulation replicates with no differential patterns; again with each simulation having 4,000 genes, 3,600 cells, 3 cell clusters and two groups of 3 samples each. In the null simulated data, no method presents an inflation of false positives, with *distinct, edgeR, limma-trend* and *scDD* showing approximately uniform p-values for all types of input data (Figure 3).

We also extended the previous simulations to add a cell-type specific batch effect (i.e., a batch effect that affects each cell type differently) [5,14]. In particular, we simulated 2 batches, which we call *b*_{1} and *b*_{2}, with one group of samples having two samples associated with *b*_{1} and one with *b*_{2}, and the other group having two samples from batch *b*_{2} and one from *b*_{1}. Differential results are substantially unchanged (Supplementary Figure 2), which shows that *distinct* can effectively remove nuisance confounders. Furthermore, by varying the number of cells in the simulated data, we show that, compared to PB, MM and *scDD* methods, *distinct* achieves higher overall TPR, while controlling for the FDR, regardless of the number of available cells (Figure 5 and Supplementary Figure 3).

From a computational perspective, *distinct* required an average time of 3.4 to 4.5 minutes per simulation, which is higher than PB methods (0.1 to 0.2 minutes) and *scDD-KS* (0.4 to 0.5 minutes), but significantly lower than MM approaches (29.4 to 297.3 minutes) and *scDD-perm* (447.5 to 1970.1 minutes) (Figure 4 and Supplementary Table 1). All methods were run on 3 cores, except PB approaches, which used a single core, because they do not allow for parallel computing.

We further considered the semi-simulated mass cytometry data from Weber *et al.* [28] (labelled *diffcyt* simulation), where spike-in signals were computationally introduced in experimental data [3], hence maintaining the properties of real biological data while also embedding a known ground truth signal. We evaluated *distinct* and two methods from *diffcyt*, based on *limma* [20] and linear mixed models (LMM), which outperformed competitors on these same data [28]. In particular, we considered three datasets from Weber *et al.* [28]: the main DS dataset and two more where differential effects were diluted by 50 and 75%. Each dataset consists of 24 protein markers, 88,435 cells, and two groups (with and without spike-in signal) of 8 samples each. Measurements were first transformed, and then cells were grouped into sub-populations with two separate approaches (see Methods): i) similarly to the *muscat* simulation study, cell labels were defined based on 8 manually annotated cell types [28] (Figure 6a), and ii) as in the original *diffcyt* study from Weber *et al.* [28], cells were grouped into 100 high-resolution clusters (based on 10 cell-type markers, see Methods) via unsupervised clustering (Figure 6b). In the main simulation, *distinct* achieves higher TPR when considering cell-type labels (Figure 6a, ‘main’), while all methods exhibit substantially overlapping performance when using unsupervised clustering (Figure 6b, ‘main’). In both clustering approaches, as the magnitude of the differential effect decreases, the distance between methods increases: *diffcyt* tools show a significant drop in TPR, whereas *distinct* maintains a higher TPR while effectively controlling for the FDR (Figures 6a-b and Supplementary Figure 4). This indicates that *distinct* has good statistical power to detect even small changes between conditions.
We also considered the three replicate null datasets from Weber *et al.* [28] (i.e., with no differential effect), containing 24 protein markers and 88,438 cells across 8 cell types, and found that all methods display approximately uniform p-values (Figure 6c).

### Experimental data analyses

In order to investigate false positive rates (FPRs) in real data, we considered two experimental scRNA-seq datasets where no differential signals were expected, by comparing samples from the same experimental condition. Given the high computational cost and low power of MM, and the high FDR of *scDD* models, for the real data analyses we only included *distinct* and PB methods. We considered gene-cluster combinations with at least 20 non-zero cells across all samples. The first dataset (labelled *T-cells*) consists of a Smart-seq2 scRNA-seq dataset of 23,459 genes and 11,138 T cells isolated from peripheral blood from 12 colorectal cancer patients [30]. We automatically separated cells into clusters (via *igraph* [1,6]), and generated replicate datasets by randomly splitting, three times, the 12 patients into two groups of size 6. The second dataset (labelled *Kang*) contains 10x droplet-based scRNA-seq peripheral blood mononuclear cell data from 8 Lupus patients, before (controls) and after (stimulated) 6h-treatment with interferon-*β* (IFN-*β*), a cytokine known to alter the transcriptional profile of immune cells [10]. The full dataset contains 35,635 genes and 29,065 cells, which are separated (via manual annotation [10]) into 8 cell types. One of the 8 patients was removed as it appeared to be a potential outlier (Supplementary Figures 5-7). Here we only included singlet cells and cells assigned to a cell population, and considered control samples only, resulting in 11,854 cells. Again, we artificially created three replicate datasets by randomly assigning the 7 retained control samples to two groups of size 3 and 4. In both null analyses, we found that *limma-trend* leads to a major increase of FPRs, *distinct*’s p-values are only marginally inflated towards 0, while *edgeR* and *limma-voom* are the most conservative methods and provide the best control of FPRs (Figure 7a and Supplementary Tables 2-3).

We then considered again the *Kang* dataset, and performed a DS analysis between controls and stimulated samples. Again, we removed one potential outlier patient, and only considered singlet cells and cells assigned to a cell population, resulting in 35,635 genes, 23,571 cells across 8 cell types and 14 samples; we further filtered gene-cluster combinations with less than 20 non-zero cells across all samples. We found that *distinct* identifies more differential patterns than PB methods, with *edgeR* and *limma-voom* being the most conservative methods, and that its results are very coherent across different input data (Supplementary Figure 8). When visually investigating the gene-cluster combinations detected by *distinct* (adjusted p-value < 0.05), on all input data (CPMs, log2-CPMs and vstresiduals), and not detected by any PB method (adjusted p-value > 0.05), we found several interesting non-canonical differential patterns (Figure 7b and Supplementary Figures 9-17). In particular, gene MARCKSL1 displays a DB pattern, with stimulated samples having higher density on the tails and lower in the centre of the distribution, gene RPL13 mirrors classical DE, while the other genes seem to emulate DP profiles. Interestingly, eight out of nine of these genes are known tumor prognostic markers: EIF3K for cervical and renal cancer, SRSF9 for liver cancer and melanoma, NDUFA4 for renal cancer, RPL24 for renal and thyroid cancer, HNRNPA0 for renal and pancreatic cancer, MARCKSL1 for liver and renal cancer, GTF3C6 for liver cancer and RPL13 for endometrial and renal cancer [25]. This is an interesting association, considering that IFN-*β* stimulation is known to inhibit and interfere with tumor progression [7,19]. Finally, Supplementary Figures 9-17 show how *distinct* can identify differences between groups of distributions even when only a portion of the ECDF varies between conditions.

## Discussion

High-throughput single-cell data can display complex differential patterns; nonetheless, most methods for differential expression fail to identify changes where the mean is not affected. To overcome present limitations, we have introduced *distinct*, a general method to identify differential patterns between groups of distributions, which is particularly well suited to perform differential analyses on high-throughput single-cell data. We ran extensive benchmarks on both simulated and experimental datasets from scRNA-seq and mass cytometry data, where our method exhibits favourable performance, provides good control of the FPR and FDR, and is able to identify more patterns of differential expression compared to canonical tools, even when the overall mean is unchanged. Furthermore, *distinct* allows for biological replicates, can adjust for covariates (e.g., batch effects), and does not rely on asymptotic theory. Finally, note that *distinct* is a very general test that, due to its non-parametric nature, can be applied to various types of data, beyond the single-cell applications shown here.

## Availability

*distinct* is freely available as a Bioconductor R package at: https://bioconductor.org/packages/distinct. The scripts used to run all analyses are available on GitHub (https://github.com/SimoneTiberi/distinct_manuscript, version v2) and Zenodo (DOI: 10.5281/zenodo.4739098). The *diffcyt* simulated data are available via FlowRepository (accession ID FR-FCM-ZYL8 [28]) and the *HDCytoData* R Bioconductor package [29]; the *Kang* dataset can be accessed via the *muscData* R Bioconductor package [4]; the *T-cells* dataset is deposited at the European Genome-phenome Archive (accession ID EGAD00001003910 [30]).

## Author contributions

ST conceived the method, implemented it, performed all analyses and wrote the manuscript. ST and MDR designed the study. HLC and LMW contributed to *muscat* and *diffcyt* simulation studies, respectively. PS contributed to the computational development of *distinct*. All authors read, contributed to, and approved the final article.

## Competing interests

The authors declare no competing interests.

## Methods

### Permutation test

In order to test for differences between groups, we employ a hierarchical permutation approach: to estimate the null distribution of *s*^{obs}, we permute the individual observations (e.g., single-cell measurements) instead of the samples. Note that this violates the exchangeability assumption of permutation tests and, hence, p-values are not guaranteed to be uniformly distributed under the null hypothesis; nonetheless, in our simulated and experimental analyses, we empirically show that *distinct* provides good control of both false positive and false discovery rates. We randomly permute individual observations *P* times across all samples and groups, by retaining the original sample sizes. We denote by *s*_{p} the test statistic computed at the *p*-th permutation, *p* = 1, … , *P*. A p-value, $\tilde{p}$, is obtained as [18]:

$$\tilde{p} = \frac{1 + \sum_{p=1}^{P} \mathbf{1}\left(s_p \geq s^{obs}\right)}{P + 1},$$

where **1**(*cond*) is 1 if *cond* is true, and 0 otherwise. In order to accurately infer small p-values, when $\tilde{p}$ falls below pre-defined thresholds, the number of permutations is automatically increased and $\tilde{p}$ is re-computed. By default, *distinct* initially computes 100 permutations; these are successively increased to 500, then to 2,000, and finally to 10,000 as $\tilde{p}$ falls below each threshold. Note that the number of permutations (i.e., 100, 500, 2,000 and 10,000) can be specified by the user.
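The scheme can be sketched in Python as follows (simplified to a single feature, with the summed-ECDF statistic recomputed inline and the adaptive increase of permutations omitted; names and defaults are illustrative, not *distinct*'s actual interface):

```python
import numpy as np

def perm_pvalue(group_A, group_B, P=100, K=25, seed=1):
    """Hierarchical permutation p-value: cells are shuffled across all
    samples (keeping the original sample sizes) and the summed-ECDF
    statistic is recomputed at each permutation."""
    sizes = [x.size for x in group_A + group_B]
    n_A = len(group_A)

    def stat(samples_A, samples_B):
        vals = np.concatenate(samples_A + samples_B)
        spacing = (vals.max() - vals.min()) / (K + 1)
        b = vals.min() + spacing * np.arange(1, K + 1)   # interior cut-offs
        ecdf = lambda x: np.searchsorted(np.sort(x), b, side='right') / x.size
        mean_A = np.mean([ecdf(x) for x in samples_A], axis=0)
        mean_B = np.mean([ecdf(x) for x in samples_B], axis=0)
        return np.abs(mean_A - mean_B).sum()

    s_observed = stat(list(group_A), list(group_B))
    pooled = np.concatenate(group_A + group_B)
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(P):
        # shuffle cells, then re-split them into samples of the original sizes
        parts = np.split(rng.permutation(pooled), np.cumsum(sizes)[:-1])
        exceed += stat(parts[:n_A], parts[n_A:]) >= s_observed
    return (1 + exceed) / (P + 1)   # never exactly 0, as in the formula above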

### Covariates

Assume we observe *Z* nuisance covariates, and that *N* samples are available across all groups, where for the *i*-th sample we observe *C*_{i} values (e.g., single-cell measurements). We fit the following linear model:
where represents the *c*-th observation for the *i*-th sample, *β*_{0} is the intercept of the model, indicates the *z*-th covariate in the *i*-th sample, *β*_{z} represents the coefficient for the *z*-th covariate, and is the residual for the *c*-th observation in the *i*-th sample. We infer rameters *β*_{0}, … ,*β*_{Z} via least squares regression, with the estimated values denoted by . We then remove the estimated effect of covariates as ; differential testing is performed, as described above, on these normalized values. For DS analyses, model (3) is fit, separately, for every gene-cluster combination, hence accommodating for cell-type specific effects of covariates.
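A minimal sketch of this residualization step (a hypothetical helper using ordinary least squares via NumPy; *distinct*'s actual implementation is in the R package):

```python
import numpy as np

def remove_covariates(Y, X):
    """Fit Y = beta_0 + X @ beta + eps by least squares, then subtract the
    estimated covariate effects (the intercept is retained).
    Y: 1-D array of cell-level values for one gene-cluster combination;
    X: covariates, one row per cell (each cell carries its sample's values)."""
    X = np.atleast_2d(X).reshape(len(Y), -1)
    D = np.column_stack([np.ones(len(Y)), X])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return Y - X @ beta[1:]
```

Fitting this model separately per gene-cluster combination is what lets the covariate (e.g., batch) effect differ across cell clusters.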

### Normalization

In scRNA-seq datasets, CPMs and log2-CPMs were computed via the *scater* Bioconductor R package [16], while vstresiduals were calculated via the *sctransform* R package [9] (except for the *T-cells* data, where, due to a failure of *sctransform*’s variance stabilizing normalization, we used *DESeq2*’s vst transformation [13]).

In mass cytometry datasets, measurements were transformed via *diffcyt*’s *transformData* function, which applies an *arcsinh* transformation.
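The transformation itself is one line; a sketch, assuming the cofactor of 5 conventionally used for CyTOF data:

```python
import numpy as np

def arcsinh_transform(x, cofactor=5.0):
    """Variance-stabilizing arcsinh transform commonly applied to mass
    cytometry intensities; a cofactor of 5 is the usual CyTOF choice
    (assumed here, not read from diffcyt's source)."""
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)
```

The transform is roughly linear near zero and logarithmic for large intensities, which tames the heavy right tail of cytometry measurements.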

### *diffcyt* simulation

The *diffcyt* semi-simulated data originates from a real mass cytometry dataset of healthy peripheral blood mononuclear cells from two paired groups of 8 samples each [3]; one group contains unstimulated cells, while the other was stimulated with B cell receptor/Fc receptor cross-linker. The original dataset contains a total of 172,791 cells and 24 protein markers: 10 of these are cell-type markers used for cell clustering, while 14 are cell state markers used for differential state analyses; the distinction between cell state and cell-type markers is based on prior biological knowledge [28]. In Weber *et al.* [28], semi-simulated data were generated by separating the cells of each unstimulated sample in two artificial samples; a differential signal was then computationally introduced by replacing, in one group, unstimulated B cells with B cells from stimulated samples. Measurements were transformed and cells clustered via *diffcyt*’s *transformData* (which applies an *arcsinh* transformation) and *generateClusters* functions, respectively. For the DS simulation in Figure 6b, as in Weber *et al.* [28], we evaluated methods’ performance in terms of detecting DS for phosphorylated ribosomal protein S6 (pS6) in B cells, which is the strongest differential signal across the cell types in this dataset [17,28]. For the DS simulation in Figure 6a, we considered previously manually annotated cell types [28] and included all 14 cell state markers. *diffcyt*’s *limma* and LMM methods were applied via *diffcyt*’s *testDS_limma* and *testDS_LMM* functions, respectively [28]. We accounted for the paired design by modelling the patient id as a covariate.

### *muscat* simulation and *Kang* data

In all *muscat* simulations, we used the control samples of the *Kang* dataset as anchor data; as in the real data analyses, we excluded one sample as it emerged as a potential outlier (Supplementary Figures 5-7), and only considered singlet cells and cells assigned to a cell population. In *muscat*’s simulation studies, we considered gene-cluster combinations with simulated expression mean greater than 0.2; for DB patterns, we increased this threshold to 1 because, at low expression values, differences are not visible by eye. For each simulation, five replicates were simulated, and results were averaged across replicates. In the main simulation (Figure 2) and the batch effect simulation (Supplementary Figure 2), we simulated, from a paired design, 2 groups of 3 samples each, with 4,000 genes, and 3,600 cells distributed in 3 clusters (corresponding to an average of 200 cells per sample in each cluster). For the simulation study when varying the number of cells (Figure 5 and Supplementary Figure 3), the total numbers of available cells were 900, 1,800, 3,600 and 7,200, corresponding to an average of 50, 100, 200 and 400 cells per sample in every cluster. For the differential simulations, we used log2-FC values of 1 for DE, 1.5 for DP and DM, and 3 for DB and DV. For the batch effect simulation study we used a modified version of *muscat*, developed by Almut Luetge at the Robinson lab (available at: https://github.com/SimoneTiberi/distinct_manuscript), which allows simulating cluster-specific batch effects [5,14]. All *muscat* simulation studies, as well as the *Kang* non-null data analysis, were performed by editing the original snakemake workflow from Crowell *et al.* [5]. PB methods were applied to aggregated data obtained by summing cell-level measurements; for differential testing, we used *muscat*’s *pbDS* function [5].
Mixed model methods were implemented, via *muscat*’s *mmDS* function, using the same approaches as in Crowell *et al.* [5]: in *MM-dream2* and *MM-vstresiduals* linear mixed models were applied to log-normalized data with observational weights and variance-stabilized data, respectively, while in *MM-nbinom* generalized linear mixed models were fitted directly to raw counts. In the *muscat* simulations and in the *Kang* non-null data analysis, we accounted for the paired design by modelling the patient id as a covariate in all methods that allow for covariates (i.e., *distinct*, PB and MM).

### P-value adjustment

All p-values were adjusted via Benjamini-Hochberg correction [2]. In *diffcyt* simulations we used globally adjusted p-values for all methods, i.e., p-values from all clusters are jointly adjusted once. However, since PB methods were found to be over-conservative when globally adjusting p-values [5], in *muscat* simulations and *Kang* discovery analyses, we used locally adjusted p-values for all methods.
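The Benjamini-Hochberg procedure can be sketched as follows (a standard implementation, not *distinct*-specific; local vs. global adjustment simply changes which set of p-values is passed in together):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (standard step-up procedure)."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)      # p_(k) * n / k
    # enforce monotonicity: running minimum from the largest rank down,
    # capped at 1
    adjusted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
    out = np.empty(n)
    out[order] = adjusted
    return out
```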

### Software versions

All analyses were performed via R software version 4.0.0, with Bioconductor packages from release 3.11.

## Acknowledgements

We acknowledge Almut Luetge and the entire Robinson lab for precious comments and suggestions. This work was supported by Forschungskredit to ST (grant number FK-19-113) as well as by the Swiss National Science Foundation to MDR (grants 310030_175841, CR-SII5 177208). MDR acknowledges support from the University Research Priority Program Evolution in Action at the University of Zurich.

## Footnotes

* e-mail: Simone.Tiberi{at}uzh.ch