## Abstract

Single-cell RNA sequencing (scRNA-seq) experiments often measure thousands of genes, making them high-dimensional data sets. As a result, dimensionality reduction (DR) algorithms such as t-SNE and UMAP are necessary for data visualization. However, the use of DR methods in other tasks, such as for cell-type detection or developmental trajectory reconstruction, is stymied by unquantified non-linear and stochastic deformations in the mapping from the high- to low-dimensional space. In this work, we present a statistical framework for the quantification of embedding quality so that DR algorithms can be used with confidence in unsupervised applications. Specifically, this framework generates a local assessment of embedding quality by statistically integrating information across embeddings. Furthermore, the approach separates biological signal from noise via the construction of an empirical null hypothesis. Using this approach on scRNA-seq data reveals biologically relevant structure and suggests a novel “spectral” decomposition of data. We apply the framework to several data sets and DR methods, illustrating its robustness and flexibility as well as its widespread utility in a range of quantitative applications.

## Introduction

Recent advances in high-throughput measurement techniques have revolutionized cellular and molecular biology. In particular, the advent of single-cell RNA sequencing (scRNA-seq) has made it possible to ask detailed questions about cellular differentiation, patterning, signaling, and variation at a single-cell resolution [1–13]. However, transcriptional sequencing’s attempt to characterize the entire cellular transcriptome at once means that thousands of genes must be measured simultaneously, making the data inherently high-dimensional and subject to the “curse of dimensionality” [14]. As a result, sophisticated methods must be employed in order to make statistical inferences from the data (e.g., the DESeq2 algorithm to infer simple differences in fold-change expression [15]).

Traditionally, a researcher would seek to find a reduced set of features (genes) or combination of features on which statistical methods can be applied with more power; this is known as **dimensionality reduction** (DR), and when done correctly, can be used to find a more “natural” description of a system [16]. Significant effort has been put into the development and application of DR algorithms such as PCA [17], t-SNE [18], UMAP [19], and others [20–34], which attempt to find lower-dimensional (usually two- or three-dimensional) representations of the data that preserve some aspect of the original structure (for a review, see [35–37]; in application to -omics data, see [38]). However, regardless of the choice of algorithm, DR will always incur a loss of information [39, 40], which manifests itself as distortions of high-dimensional structure in the lower-dimensional embedding [27, 36, 41] (See Figure S1 for an example). Furthermore, it is impossible to detect “by eye” which parts of an embedding are signal and which are noise, since these methods are often non-linear or stochastic, and therefore do not homogeneously distort the data [42, 43].

As a result, the use of DR algorithms in scRNA-seq analysis may be treated skeptically [9, 12, 44] or require additional guidance [45]. To underscore these concerns, consider Figure 1, where scRNA-seq data from over 5000 bone-marrow cells that were collected by the Tabula Muris Consortium [8] have been embedded in two dimensions using several DR algorithms. As noted earlier, the detection of different cell types in a heterogeneous tissue is a biologically interesting task, so the annotated cell types from [8] have been used to color the embeddings (the legend can be found in S2). However, a quick examination of the embeddings in panels **B** and **E** reveals that the arrangement and shape of the clusters differ between two runs of the t-SNE algorithm. As a result, even though algorithms such as t-SNE are *provably good* at clustering [46], they cannot reliably be used by themselves for unsupervised clustering due to these non-linear and stochastic effects. The rest of Figure 1 underscores that the addition of algorithmic hyperparameters and the choice of algorithms only serve to complicate this process.

To address these issues, much work has been done to provide guidelines on how to use these algorithms [10, 44, 45, 47] and to make improvements to the algorithms themselves [46, 48–54] that aim to correct or account for these distortions. At the same time, an entire set of methods for quantitatively assessing the *quality* of DR methods has been developed [37, 43, 55]. These metrics can roughly be categorized as being global [27, 49, 56–64] or local [26, 42, 43, 65, 66] in scope, and either based on preserving distances [64], neighborhoods [27, 43, 55–57, 59, 67, 68], or topology [63, 69, 70], but in all cases, they attempt to summarize the extent to which a given DR algorithm preserves some aspect of the original data’s structure. These metrics have been successfully used to compare DR methods and optimize hyperparameters, and a recent comprehensive benchmarking of these algorithms noted that t-SNE and UMAP were, in fact, consistently high-quality methods across many data sets and metrics [37].

It is in this context that we propose a *statistical* framework for characterizing the stability and variability of embedding quality by posing a point-wise metric as an **Empirical Embedding Statistic**. We propose this approach to address several aspects of scRNA-seq data that have limited the direct application of many of the tools in the quality assessment literature. Specifically, we note that any assessment methodology for scRNA-seq should **(1)** measure quality *locally*, not globally across an embedding, **(2)** estimate the *variation* in embeddings that is introduced by the DR algorithms themselves, and **(3)** estimate where embeddings show structure consistent with actual high-dimensional structure and not noise. In the rest of this section, we explain why these criteria are necessary for a useful DR assessment framework. We then outline the approach in the next section before demonstrating its application and utility on several data sets.

First, we note that while global assessment of DR methods for hyperparameter optimization is important, the direct use of DR output for clustering or lineage reconstruction is limited by concerns about the *local* quality of the samples within an embedding. (Heuristics for maximizing global quality have been proposed [45], and human-optimized parameters are often close to those selected algorithmically [49].) That is, it is much more important to know the answer to “*Is cell A close to cell B in the embedding because they are similar in gene-space?*” than it is to know whether the average cell is well-embedded. Therefore, we want to make use of the extensive work on local quality metrics [41, 42, 47, 66–69, 71] in developing our approach.

In addition, the non-linearity and stochasticity of popular DR methods are well known to cause large variation in the arrangement and shape of embedded structures [45], as can be seen in Figure 1 for t-SNE. (Similar results can be shown for other common algorithms, such as UMAP [19] and PHATE [34].) As a result, we should expect this variation to propagate into any quality metrics, and it should be incorporated into any downstream analysis. That is, we don’t just want to know whether a cell is well embedded once, but whether it is *consistently* well embedded. While there is a large body of work on ensemble visualization [72, 73], only recently [74] has there been an attempt to apply this theory to assess the variability of DR embeddings. Our approach differs in that it proposes a *statistical* framework in which to consider quality-metric variability. Specifically, we can consider each cell’s local quality score to be a measurement of a quality *statistic*, and we then want to assess the distribution of this statistic across embeddings.

This approach then allows us to incorporate concerns about signal and noise as a statistical *hypothesis test*, where we can use consistently elevated embedding quality as evidence of real biological structure. To perform this test, we propose the use of resampling to generate “null” data sets that contain no biological structure. These null data are then embedded to provide a null distribution for local quality scores. Combining this null distribution with the actual quality metrics from the data, the output of our method is a *p*-value assigned to each sample, indicating the likelihood that it was embedded better than noise; no further corrections are required. This assessment of the presence of biological structure is especially useful in the context of scRNA-seq data, which are notoriously noisy and sparse [13, 44].

In this way, the statistical approach to dimensionality reduction provides a biologically relevant quantification of DR quality. The approach, outlined in the next section, addresses several unique concerns that arise with scRNA-seq data including the local fidelity of embeddings, and the variability in embedding that is due to both biological noise and the DR algorithm. In our results, we show that the application of this approach indicates that heterogeneity in embedding quality is generic across data sets and DR algorithms. We then show that examining this cell-wise embedding variability across scale parameters reveals a spectral view of the data. We demonstrate that the method can be used to rigorously compare DR methods and data sets, allowing the user to untangle analysis choices like those presented in Figure 1. Finally, we show that the approach may have utility in downstream analyses such as unsupervised clustering that can incorporate uncertainty in embedding.

### The Statistical Approach

The statistical approach to dimensionality reduction consists of three components: **(1)** the embedding of the data, **(2)** the construction and embedding of the null data, and **(3)** the calculation of the embedding statistic and performance of a hypothesis test. These are illustrated heuristically in Figure 2 and more technically in S3. These steps are centered on the calculation of a local quality statistic, the Empirical Embedding Statistic (EES), for each sample (cell) in the data set. To clarify the notation used throughout the rest of this paper: consider a data set *X* to be a collection of *N*_{Cells} vectors, where each cell contains measurements for each of *D* genes. Recalling that the data can be embedded multiple times to yield different embeddings, we denote the position of the *i*^{th} cell in the *n*^{th} embedded space by *y*_{i,n}, where the number of embeddings is *N*_{Embed}. For each cell, in each embedding, we calculate the embedding statistic, which we denote *EES*_{i,n}. We use an * to indicate null data generated by resampling, so that a resampled high-dimensional data vector is *x**_{i} and its position in the embedded space is *y**_{i,n}. The final step of the hypothesis-test process involves calculating the *p*-value *p*_{i,n} = *P*(*EES** ≤ *EES*_{i,n}) using the empirically generated distribution of *EES**. We elaborate on each of these three steps below.
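For concreteness, the empirical *p*-value defined above can be computed by pooling the null statistics and counting how often the null meets or falls below each observed score. The sketch below is our own illustration (the function name and the add-one smoothing are our choices, not taken from the paper):

```python
import numpy as np

def empirical_p_values(ees_data, ees_null):
    """Empirical p-value P(EES* <= EES_i) for each observed statistic.

    ees_data : (N_cells,) observed embedding statistics (lower = better here)
    ees_null : (N_null,) pooled statistics from embeddings of the null data
    """
    ees_null = np.sort(np.asarray(ees_null))
    # count how many null scores fall at or below each observed score
    counts = np.searchsorted(ees_null, np.asarray(ees_data), side="right")
    # add-one smoothing keeps p-values strictly positive; it also makes the
    # finite resolution explicit: the smallest attainable value is 1/(N_null + 1)
    return (counts + 1) / (len(ees_null) + 1)
```

A small *p*-value indicates that a cell's embedding statistic is better (lower) than almost all of the null scores, i.e., that the cell is embedded better than noise.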

**Data Embedding:** The first step in our approach is simply to embed the data one or more times, and to calculate a sample-wise quality metric on each sample (single cell) for each embedding, *EES*_{i,n}. Depending on which aspects of data structure the researcher wants to assess, several of the quality metrics mentioned earlier may be adapted for this purpose. In this work, we calculate a similarity score between each pair of samples in the high-dimensional space and again in the low-dimensional space. For similarity in the high-dimensional space, we calculate a Gaussian probability for two points being a certain Euclidean distance apart, where the scale is set by fixing the overall entropy of a sample’s similarities, as in [25]. The low-dimensional similarity is given by the likelihood of a certain distance under a Student’s *t*-distribution with *ν* = 1 degree of freedom. We then compare these similarity distributions using the Kullback-Leibler divergence *D*_{KL} [77], and we use *D*_{KL} ≡ *EES* as our embedding statistic. This choice of statistic is justified in that it directly relates the quality statistic to the similarities on which t-SNE operates, and because it can be seen as a continuous generalization of *recall*, which is the likelihood that a sample’s neighbors in the embedded space are also neighbors in the original data [27]. (See Figure S4 for example distributions of the EES.) Choosing a neighborhood-preservation metric is also consistent with our goal of assessing local quality in the embeddings. As shown in the results section, this also gives our quality metric the same scale parameter as t-SNE, making the results easily interpretable. However, the choice of local quality metric is flexible, and other metrics may be preferable in different contexts.

**Null Construction and Embedding:** The most crucial step in our process is to generate a biologically realistic synthetic data set that has no biological structure, which we define as having zero inter-gene correlation. This is achieved via **marginal resampling**, where genes in the null data are independently drawn from the original data’s gene distributions (see Figure S5 for an illustration of this process). In this way, the null data contain biologically realistic distributions of individual genes, but no grouping of the samples as a function of these genes. This provides a basis on which to empirically generate a distribution of quality scores by embedding the null data multiple times. Figure S6 shows that the null distribution is generally stable across embeddings, so that only 5–20 null embeddings are needed to generate a sufficient distribution.

**Empirical Hypothesis Test:** Once the null data have been created and the embedding statistic *EES** has been calculated for every point over several embeddings, each of the data statistics *EES*_{i,n} can be compared to the aggregated distribution of null statistics, as illustrated in Figure 2. This yields an empirical *p*-value, which can be summarized across the *N*_{Embed} embeddings [75, 78, 79] to give a single quality metric, *p*_{i}, for each cell.
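The per-sample statistic used in the data-embedding step above can be sketched directly. The version below is a simplification we provide for illustration: it uses a single fixed Gaussian bandwidth `sigma`, whereas the paper tunes a per-point bandwidth by fixing the entropy (perplexity) of each sample's similarities:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def pointwise_kl(X_high, Y_low, sigma=1.0):
    """Per-sample KL divergence between high- and low-dimensional
    similarity distributions (a simplified stand-in for the EES)."""
    D2_high = squareform(pdist(X_high, "sqeuclidean"))
    D2_low = squareform(pdist(Y_low, "sqeuclidean"))

    P = np.exp(-D2_high / (2.0 * sigma**2))  # Gaussian affinities (high-dim)
    Q = 1.0 / (1.0 + D2_low)                 # Student's t with nu = 1 (low-dim)
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    np.fill_diagonal(Q, 0.0)
    P /= P.sum(axis=1, keepdims=True)        # conditional distribution per sample
    Q /= Q.sum(axis=1, keepdims=True)

    eps = 1e-12                              # guard against log(0)
    return np.sum(P * np.log((P + eps) / (Q + eps)), axis=1)
```

Lower values indicate that a sample's embedded neighborhood better matches its high-dimensional neighborhood.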

We should note that while others have performed calculations for random rank orderings to adjust quality statistics [26, 57, 60], it is not immediately clear that a generic DR method will produce completely randomized neighborhoods when applied to noise. Furthermore, these calculations do not describe the *spread* with which we expect to observe neighborhood preservation from noise, so that unlike our proposed method, they cannot evaluate the likelihood of extreme values.
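The marginal resampling used to build the null is straightforward to implement. A minimal sketch (here sampling with replacement; a column-wise permutation would preserve the marginals equally well):

```python
import numpy as np

def marginal_resample(X, rng=None):
    """Generate a null data set by independently resampling each gene (column).

    Per-gene marginal distributions are preserved, but any correlation
    *between* genes -- i.e., structure grouping the cells -- is destroyed.
    """
    rng = np.random.default_rng(rng)
    X_null = np.empty_like(X)
    for j in range(X.shape[1]):
        # draw with replacement from the observed values of gene j
        X_null[:, j] = rng.choice(X[:, j], size=X.shape[0], replace=True)
    return X_null
```

Embedding several such null data sets with the same DR algorithm and hyperparameters then yields the empirical distribution of *EES**.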

## Results

### Embedding Quality is Heterogeneous Across an Embedding

As noted earlier, there is no good reason to assume *a priori* that a generic data set has any uniformity (in terms of density, continuity, topology, etc.) in the high-dimensional space. This lack of uniformity is one of the central difficulties of dimensionality reduction and the analysis of high-dimensional data. There have been many methods proposed to address this heterogeneity, from t-SNE’s scale-sensitive kernel [25] to Isomap’s method of making local graph approximations [22], and yet even the most advanced algorithms necessarily end up setting global hyperparameters. As a result, we expect that even for parameters that are *globally optimal* there will be regions that are poorly embedded compared to the rest of the data (we know that global quality is bounded [40], but not the variance in quality within an embedding).

This local heterogeneity in embedding quality has previously been shown [42, 43, 66, 68, 71, 80], and we also recover this phenomenon, as shown in Figures 2, 3, and 4. Using the bottom panels of Figure 3 as an example, we note that embedding quality varies considerably across an embedding, and not necessarily with any generic pattern. Especially in comparison to an unmarked embedding, it is impossible to deduce which parts of an embedding are interpretable features, and which parts are noise, without the use of a local quality metric.

### Sweeping Across Scale Reveals Spectral Structures in the Data

The previous result highlights one of the difficulties of working with real data: that we don’t know whether the data are spread throughout the high-dimensional space with a uniform spatial scale. To deal with this, t-SNE applies a similarity measure between data points where the scale of that measure is tuned to the size of each individual point’s local neighborhood. Similarly, UMAP builds a similarity graph by only considering the distances to a sample’s *k* nearest neighbors. The effect of these choices is to maintain an equal weighting of local scales across an embedding, which is ideal because it allows densely packed regions of the data space to be examined on an equal footing with more diffuse regions. However, this balancing of scales comes at the cost of specifying a neighborhood size in the form of an algorithmic hyperparameter: either perplexity for t-SNE or n_neighbors for UMAP. Considerable effort has been dedicated to disentangling the effect of this hyperparameter or eliminating it altogether [29, 30, 45, 48, 49, 81], but we demonstrate in Figure 3 that examining the embedding quality over many scales reveals important structure in the data.
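Such a scale sweep can be sketched in a runnable form. For brevity, the sketch below substitutes a simple *k*-nearest-neighbor recall for the full EES-plus-null machinery; the function names and the recall stand-in are our simplifications, not the paper's method:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

def knn_recall(X, Y, k=10):
    """Fraction of each sample's k nearest neighbors in X that are
    also among its k nearest neighbors in the embedding Y."""
    idx_x = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    idx_y = NearestNeighbors(n_neighbors=k).fit(Y).kneighbors(return_distance=False)
    return np.array([len(set(a) & set(b)) / k for a, b in zip(idx_x, idx_y)])

def perplexity_sweep(X, perplexities, k=10, seed=0):
    """Mean local embedding quality as a function of t-SNE's scale parameter."""
    scores = {}
    for perp in perplexities:
        Y = TSNE(perplexity=perp, random_state=seed, init="pca").fit_transform(X)
        scores[perp] = knn_recall(X, Y, k=k).mean()
    return scores
```

In the full framework, each per-cell score at each perplexity would additionally be compared against scores from embeddings of the marginally resampled null data.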

First, we note that sweeping across t-SNE’s perplexity parameter can be used to find a scale at which most of the samples in a data set are well embedded. In the case of the Tabula Muris FACS Marrow data, 3A shows that this occurs at a perplexity of approximately 1200, which is considerably higher than most recommendations for the hyperparameter [25, 45]. Neighborhood sizes considerably smaller or larger than this result in embeddings whose neighborhoods are indistinguishable from noise (3A, left and right insets).

While 3A shows how we can summarize the effect of choosing a neighborhood size, our local statistical approach also allows for the examination of different portions of the data independently. Examining cells according to their annotated labels, as in 3B-E, shows that different cell types might have different characteristic structures that are better represented at certain neighborhood sizes than others (see Figure S8 for all annotated cell types). This suggests that examining embedding “power” as a function of scale - a sort of spatial power spectrum - might be a useful way to explore scRNA-seq data even if cell type annotations are not available.

We can use these spectra to select interesting scales at which to examine the data. For example, (3B, D) suggest that these cell types are best embedded at large perplexity (∼1500), while (3C, E) suggest that perplexity ∼100 may be more appropriate. A more rigorous approach that applies PCA to the scale spectra is shown in Figure S7, and suggests that perplexity ∼100 and ∼2200 may be more “natural” scales for the data. Applying perplexity = 97 and 2197 yields Figures 3F and 3G, respectively.
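A minimal sketch of such a spectral analysis, assuming a per-cell quality matrix (here called `quality`) has already been computed across a grid of perplexities; the function name and the centering choice are ours:

```python
import numpy as np
from sklearn.decomposition import PCA

def scale_spectrum_pca(quality, n_components=2):
    """PCA over per-cell quality-vs-scale curves.

    quality : (N_cells, N_scales) array of local quality scores, one column
              per scale (e.g., per perplexity value).
    """
    # center each cell's spectrum so PCA captures curve shape, not offset
    centered = quality - quality.mean(axis=1, keepdims=True)
    pca = PCA(n_components=n_components).fit(centered)
    return pca, pca.transform(centered)
```

Peaks in the leading components' loadings then suggest candidate "natural" scales at which to re-examine the embedding.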

Interestingly, the large-perplexity embedding in 3G indicates that three large clusters of cells are well embedded, and the labels in the right half confirm that these are biologically consistent clusters corresponding to B Cells, Progenitor Cells, and Granulocytes. However, examination of these clusters at a smaller neighborhood scale in 3F shows that much of the apparent structure in the B cells and progenitor cells is not as well resolved as at the higher scale, except for the Late Pro-B cells, which have broken off into their own well-embedded cluster. On the other hand, the granulocytes have a different apparent structure at the smaller perplexity, with the granulopoietic cells breaking off into their own cluster and the granulocytes separating into a well-embedded hourglass shape. In this way, we can examine how meaningful biological differences show up at different scales: the B cells seem to be similar at a wider scale, but are less distinguishable at a narrower resolution. Meanwhile, granulocytes are also well grouped at a large scale, but might actually be composed of several distinct sub-types, as suggested by examination at a smaller scale.

### A Statistical Approach Allows for Comparisons of Data and Algorithms

The statistical approach to dimensionality reduction also provides a rigorous method to evaluate and compare data analysis protocols or the performances of dimensionality reduction algorithms on specific data sets. These topics have been the subject of significant debate [10, 11, 13, 37, 44, 45, 82] because, as Figure 1 shows, these choices can significantly impact analysis and interpretation.

In Figure 4, embeddings of the Tabula Muris marrow tissue generated by PCA, t-SNE, and UMAP at their default parameters are compared. Rather than relying on heuristic arguments, it can now be seen quantitatively that, at default parameters, the structures in the PCA and t-SNE embeddings are better representations of the original data’s structure than those in the UMAP embedding. Applying the procedure in Figure 3A suggests a methodology for choosing algorithmic parameters, regardless of the specific algorithm being used.

### A Statistical Approach Can Be Used Upstream of Unsupervised Clustering

Since the statistical approach is able to summarize each cell’s neighborhood fidelity with a *p*-value, we can leverage this quantity in downstream applications that depend on the geometry of the low-dimensional representations, such as unsupervised clustering. In the bottom of Figure 4, the results of applying DBSCAN [83] to embeddings at the two scales from Figure 3 are compared to the cell ontology annotations generated by the Tabula Muris Consortium. In generating these clusterings, the *p*-values, *p*_{i}, for each cell were leveraged in two ways: first, cells that are never embedded well at any scale are omitted from the clustering process; second, the inverses of the *p*-values, 1/*p*_{i}, are used as *weights* in the DBSCAN algorithm, allowing higher-fidelity regions of the embedding to take priority in the clustering process. Comparing the unsupervised clustering to the expert annotations reiterates some of the observations made about Figure 3 concerning the annotated cell types that appear to be part of larger, coherent structures detected by the statistical method applied to t-SNE. On a more fundamental level, however, Figure 4 indicates the immediate utility of our statistical approach in a variety of contexts.
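The two uses of the *p*-values described above can be sketched with scikit-learn's DBSCAN, whose `fit` method accepts per-sample weights; the cutoff `p_max` and the parameter values below are illustrative assumptions, not values from the paper:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def p_weighted_dbscan(Y, p_values, eps=0.5, min_samples=5, p_max=0.95):
    """Cluster an embedding, letting well-embedded cells take priority.

    Y        : (N_cells, 2) embedding coordinates
    p_values : (N_cells,) per-cell empirical p-values (small = well embedded)
    """
    keep = p_values < p_max            # drop cells never embedded better than noise
    labels = np.full(len(Y), -1)       # dropped cells are marked as noise
    db = DBSCAN(eps=eps, min_samples=min_samples)
    # weight each retained cell by 1/p so high-fidelity regions dominate
    # DBSCAN's density threshold
    db.fit(Y[keep], sample_weight=1.0 / p_values[keep])
    labels[keep] = db.labels_
    return labels
```

Because DBSCAN compares the *summed* weight within each eps-neighborhood against `min_samples`, up-weighting low-*p* cells makes well-embedded regions more likely to seed clusters.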

## Discussion

Dimensionality reduction is a complicated procedure, even in the best of circumstances. Single-cell RNA sequencing offers a path towards untold biological discovery, but its high-dimensional nature and relatively noisy measurements require the careful application of dimensionality reduction algorithms in order to make progress. Unfortunately, the state of the art in dimensionality reduction currently rests on ever-changing heuristics to a degree that limits data analysis.

The statistical approach presented in this work provides a rigorous way to evaluate these heuristics, and at the same time unearths information about data sets that is of immediate utility to biological researchers. The statistical approach is relatively simple (Figure 2), and can be applied in a variety of contexts (Figure 4). Perhaps more importantly, statistically analyzing the EES promises to reveal previously hidden structures and scales in data sets (Figure 3). This paper presents a broad view of the approach and its applications, but there are a few limitations that will require further consideration. Most practically, the code as written depends on the speed of current implementations of DR algorithms, which must be run many times to generate the data and null embeddings. This is somewhat slow, requiring several hours to run a full scale-parameter sweep; however, the structure of the method is naturally parallelizable, so there is reason to expect that this can be improved.

This efficiency concern relates directly to the fact that, because the statistics are computed empirically, the calculated *p*-values have a finite resolution. This is of limited practical concern, except that it also imposes a lower bound on the *p*-values, as there will often be cells whose embedding statistics lie entirely outside the support of the null distribution. Beyond improved computational efficiency, remedies may include theoretical work to describe the tails of these null distributions or a principled method for parameterizing the null distribution. Because of this effect, we have refrained in this work from making more precise interpretations of these *p*-values (we make no significance assessments or cutoffs, nor any ordered analysis of the cells by *p*-value), instead leveraging the fact that the *p*-values convey strong *relative* information within the context of a data set and DR algorithm.

Moving forward, it is clear that this information can be leveraged in a variety of ways not presented in this work. Several of these directions are suggested in Figure 4, where more comprehensive efforts could be undertaken to assess the quality of DR algorithms generically, such as in [36, 37], or to incorporate the statistical approach into an unsupervised clustering algorithm more directly. Non-computationally, Figure 3 suggests that this approach may be of widespread utility in the analysis of high-dimensional biological data sets in order to detect and to assess the stability of biologically relevant structures. The ability of the method to form model-free, non-parametric scale spectra presents a new way to look at these data sets that may reveal heretofore unseen phenomena.

## Supplemental Materials

## References
