A Statistical Approach to Dimensionality Reduction Reveals Scale and Structure in scRNA-seq Data

Single-cell RNA sequencing (scRNA-seq) experiments often measure thousands of genes, making them high-dimensional data sets. As a result, dimensionality reduction (DR) algorithms such as t-SNE and UMAP are necessary for data visualization. However, the use of DR methods in other tasks, such as for cell-type detection or developmental trajectory reconstruction, is stymied by unquantified non-linear and stochastic deformations in the mapping from the high- to low-dimensional space. In this work, we present a statistical framework for the quantification of embedding quality so that DR algorithms can be used with confidence in unsupervised applications. Specifically, this framework generates a local assessment of embedding quality by statistically integrating information across embeddings. Furthermore, the approach separates biological signal from noise via the construction of an empirical null hypothesis. Using this approach on scRNA-seq data reveals biologically relevant structure and suggests a novel “spectral” decomposition of data. We apply the framework to several data sets and DR methods, illustrating its robustness and flexibility as well as its widespread utility in several quantitative applications.


Introduction
As a result, sophisticated methods must be employed in order to make statistical inferences from the data (e.g., the DESeq2 algorithm to infer simple differences in fold-change expression [15]), and the choice of algorithms only serves to complicate this process. To address these issues, much work has been done to provide guidelines on how to use these algorithms [10,44,45,47] and to make improvements to the algorithms themselves [46,48-54] that aim to correct or account for these distortions.

In addition, the non-linearity and stochasticity of popular DR methods are well known to cause large variation in the arrangement and shape of embedded structures [45], as can be seen in Figure 1 for t-SNE. (Similar results can be shown for other common algorithms, such as UMAP [19] and PHATE [34].) As a result, we should expect this variation to appear in any quality metrics, and it should be incorporated into any downstream analysis. That is, we don't just want to know whether a cell is well embedded once, but whether it is consistently well embedded. While there is a large body of work on ensemble visualization [72,73], only recently [74] has there been an attempt to apply this theory to assess the variability of DR embeddings. Our approach differs in that it proposes a statistical framework in which to consider quality metric variability. Specifically, we consider each cell's local quality score to be a measurement of a quality statistic, and we then assess the distribution of this statistic across embeddings.

This approach allows us to frame concerns about signal and noise as a statistical hypothesis test, in which consistently elevated embedding quality serves as evidence of real biological structure. To perform this test, we propose the use of resampling to generate "null" data sets that contain no biological structure. These null data are then embedded to provide a null distribution for local quality scores. Combining this null distribution with the actual quality metrics from the data, the output of our method is a p-value assigned to each cell, requiring no further corrections, that indicates the likelihood that the cell was embedded better than noise. This assessment of the presence of biological structure is especially useful in the context of scRNA-seq data, which is notoriously noisy and sparse [13,44].

In this way, the statistical approach to dimensionality reduction provides a biologically relevant quantification of DR quality. The approach, outlined in the next section, addresses several unique concerns that arise with scRNA-seq data, including the local fidelity of embeddings and the variability in embedding that is due to both biological noise and the DR algorithm. In our results, we show that the application of this approach indicates that heterogeneity in embedding quality is generic across data sets and DR algorithms. We then show that examining this cell-wise embedding variability across scale parameters reveals a spectral view of the data. We demonstrate that the method can be used to rigorously compare DR methods and data sets, allowing the user to untangle analysis choices like those presented in Figure 1. Finally, we show that the approach may have utility in downstream analyses, such as unsupervised clustering, that can incorporate uncertainty in embedding.

The Statistical Approach

The statistical approach to dimensionality reduction consists of three components: (1) the embedding quality statistic, (2) the generation of null data by resampling, and (3) the hypothesis test. For each cell $i$ in each embedding $j$, we calculate the embedding statistic, which we denote $s_{i,j}$. We use an asterisk to indicate null data generated by resampling, so that a resampled high-dimensional data vector is $\vec{x}^*$ and its position in the embedded space is $\vec{y}^*$. The final step of the hypothesis-test process involves calculating the p-value, $p_{i,j} = P(s^* \leq s_{i,j})$, using the empirically generated distribution of $s^*$. We elaborate on each of these three steps below.

Null data sets are generated via marginal resampling, in which "null" cells are created by sampling from the measured distribution of each gene independently (see Figure S5 for an illustration of this process). In this way, the null data contain biologically realistic distributions of individual genes, but no grouping of the samples as a function of these genes.
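To make the resampling step concrete, the following is a minimal sketch of marginal resampling, assuming the data are stored as a cells-by-genes NumPy array; the function name and interface are ours for illustration, not taken from the paper's code.

```python
import numpy as np

def marginal_resample(X, rng=None):
    """Generate one "null" data set by resampling each gene independently.

    X : (n_cells, n_genes) expression matrix.
    Returns an array of the same shape whose columns have the same
    marginal distributions as X, but whose gene-gene correlations
    (and thus any grouping of cells) have been destroyed.
    """
    rng = np.random.default_rng(rng)
    n_cells, n_genes = X.shape
    X_null = np.empty_like(X)
    for g in range(n_genes):
        # Sample each gene's observed values with replacement,
        # independently of every other gene.
        X_null[:, g] = rng.choice(X[:, g], size=n_cells, replace=True)
    return X_null
```

Each call produces one null data set; repeating the call with different seeds yields the handful of null replicates used below.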

This provides a basis on which to empirically generate a distribution of quality scores by embedding the null data multiple times. Figure S6 shows that the null distribution is generally stable across embeddings, so that only 5-20 null embeddings are needed to generate a sufficient distribution.
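Given the pooled statistics from these null embeddings, per-cell p-values can be computed by simple counting. The sketch below assumes, for illustration, that lower statistic values indicate a better embedding; the helper name is hypothetical.

```python
import numpy as np

def empirical_pvalues(s_data, s_null):
    """Empirical p-value for each cell's embedding statistic.

    s_data : (n_cells,) statistic for each cell in the real embedding
             (assumed here: lower values = better embedded).
    s_null : (m,) pooled statistics from the null embeddings.
    Returns p[i] = P(s* <= s_data[i]) under the empirical null, so
    small p-values mean the cell is embedded better than noise.
    """
    s_null = np.sort(np.asarray(s_null))
    # For each cell, count how many null statistics fall at or below it.
    counts = np.searchsorted(s_null, s_data, side="right")
    # The +1 keeps p-values strictly positive (finite resolution).
    return (counts + 1) / (s_null.size + 1)
```

The (count + 1)/(N + 1) convention keeps p-values away from exactly zero, which is the finite-resolution lower bound discussed in the limitations below.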

Embedding Quality is Heterogeneous Across an Embedding

As noted earlier, there is no good reason to assume a priori that a generic data set has any uniformity (in terms of density, continuity, topology, etc.) in the high-dimensional space. This lack of uniformity is one of the central difficulties of dimensionality reduction and the analysis of high-dimensional data.

There have been many methods proposed to address this heterogeneity, starting from t-SNE's scale-sensitive perplexity parameter.

The previous result highlights one of the difficulties of working with real data: that we don't know 182 whether the data are spread throughout the high-dimensional space with a uniform spatial scale.

To deal with this, t-SNE applies a similarity measure between data points where the scale of that measure is set by the perplexity parameter.

Figure 3: Setting the scale parameter too large or too small results in embeddings that are not much better structured than noise (left and right insets). Setting the perplexity to the value with the smallest average p-value indicates a potential consensus scale at which to examine the embedding (middle inset). Examining previously annotated cell types' embedding quality as a function of perplexity suggests that cell types might be characterized by their scale spectra (Panels B-E). PCA analysis of the scale spectra in Figure S7 indicates two spectral modes, with peaks near perplexity ∼100 and ∼2200. Applying the statistical approach at each mode (left sides of Panels F and G, respectively) reveals new biologically relevant structure, especially when compared to the expert annotations from the Tabula Muris project [8] (right sides of Panels F and G).
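As an illustration of how such a perplexity sweep might be organized, here is a sketch using scikit-learn's t-SNE. The per-cell quality statistic is left as a user-supplied callable, since its exact form is not specified in this excerpt, and the sketch reuses the hypothetical empirical_pvalues helper from above; none of this is the paper's own implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

def sweep_perplexity(X, X_nulls, perplexities, statistic, n_embed=5, seed=0):
    """Average p-value as a function of t-SNE perplexity (scale sweep).

    X         : (n_cells, n_genes) data matrix.
    X_nulls   : list of marginally resampled null data sets.
    statistic : callable (X_high, Y_low) -> (n_points,) per-cell score.
    Returns {perplexity: mean p-value}; the minimizing perplexity is a
    candidate "consensus" scale, as in the middle inset of Figure 3.
    """
    mean_p = {}
    for perp in perplexities:
        def tsne(Z, s):
            return TSNE(perplexity=perp, random_state=s).fit_transform(Z)

        # Pool the statistic over embeddings of each null data set.
        s_null = np.concatenate(
            [statistic(Xn, tsne(Xn, seed + k)) for k, Xn in enumerate(X_nulls)]
        )
        # Average the statistic over several embeddings of the real data.
        s_data = np.mean(
            [statistic(X, tsne(X, seed + k)) for k in range(n_embed)], axis=0
        )
        mean_p[perp] = empirical_pvalues(s_data, s_null).mean()
    return mean_p
```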
We can use these spectra to select interesting scales at which to examine the data. For example, Panels 3B and 3D suggest that these cell types are best embedded at large perplexity (∼1500), while Panels 3C and 3E suggest that perplexity ∼100 may be more appropriate. A more rigorous approach that applies PCA to the scale spectra is shown in Figure S7, and suggests that perplexity ∼100 and ∼2200 may be natural scales at which to examine these data.

Discussion

Dimensionality reduction is a complicated procedure, even in the best of circumstances. Single-cell RNA sequencing offers a path towards untold biological discovery, but its high-dimensional nature and relatively noisy measurements require the careful application of dimensionality reduction algorithms in order to make progress. Unfortunately, the state of the art in dimensionality reduction currently rests on ever-changing heuristics to a degree that limits data analysis.

The statistical approach presented in this work provides a rigorous means of evaluating these heuristics, and at the same time unearths information about data sets that is of immediate utility to biological researchers. The approach is relatively simple (Figure 2) and can be applied in a variety of contexts (Figure 4). Perhaps more importantly, statistically analyzing the EES promises to reveal previously hidden structures and scales in data sets (Figure 3).

This paper presents a broad view of the approach and its applications, but there are a few limitations that will require further consideration. Most practically, the code as written rests on the speed of current implementations of DR algorithms, which must be chained together to generate many (null) embeddings of the same data. This is somewhat slow, requiring several hours to run a full scale-parameter sweep; however, the structure of the method is obviously parallelizable, so there is some optimism that this can be improved.
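Because each (null) embedding is independent of the others, the sweep parallelizes trivially across processes. A minimal sketch with joblib, assuming the X_nulls list from the resampling sketch above (one possible arrangement, not the paper's implementation):

```python
from joblib import Parallel, delayed
from sklearn.manifold import TSNE

def embed(Z, perplexity, seed):
    return TSNE(perplexity=perplexity, random_state=seed).fit_transform(Z)

# Null embeddings are mutually independent, so they can be dispatched
# across all available cores (n_jobs=-1) without any coordination.
null_embeddings = Parallel(n_jobs=-1)(
    delayed(embed)(Xn, 30, k) for k, Xn in enumerate(X_nulls)
)
```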

This efficiency concern directly relates to the fact that, because the statistics are performed empirically, there is a finite resolution to the calculated p-values. This is not a huge practical concern, except that it also places a lower bound on the p-values, as there will often be cells whose embedding statistics are more extreme than any value observed in the null distribution.

Non-computationally, Figure 3 suggests that this approach may be of widespread utility in the analysis of high-dimensional biological data sets in order to detect, and to assess the stability of, biologically relevant structures. The ability of the method to form model-free, non-parametric scale spectra presents a new way to look at these data sets that may reveal heretofore unseen phenomena.

Figure S5: Null data sets are generated via marginal resampling, a process by which "null" cells are created by randomly sampling from the distribution of each measurement (gene) independently. This results in data sets that have the same marginal distributions for each gene as the original data, but no correlative structure (compare the left and right panels). The left panel shows a random selection of these distributions, while the right panel shows the p-value from a Kolmogorov-Smirnov test between each pair of null distributions. Most pairs display a high p-value, suggesting close agreement between these distributions. Based on this and other tests (not shown), we recommend combining between 5 and 20 null data sets in order to form a stable null distribution for the EES.

Figure S8: Sweeping Across Perplexity Reveals Different Natural Scales for Different Cell Types. Examining the scale spectra of each cell type as in Figure 3 suggests that some cell types may have characteristic spectra, or characteristic scales at which they are well resolved in a 2D embedding.
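As a sketch of the stability check described in the Figure S5 caption, the pairwise Kolmogorov-Smirnov comparison can be run with SciPy; the function name and input format here are ours for illustration.

```python
from itertools import combinations
from scipy.stats import ks_2samp

def null_stability(null_statistic_sets):
    """Pairwise KS tests between null statistic distributions.

    null_statistic_sets : list of 1D arrays, one per null data set.
    Consistently high p-values across pairs indicate the null
    distribution is stable, supporting the use of only 5-20 null
    data sets.
    """
    return [
        ks_2samp(a, b).pvalue
        for a, b in combinations(null_statistic_sets, 2)
    ]
```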