## Summary

High-dimensional data, such as those generated using single-cell RNA sequencing, present challenges in interpretation and visualization. Numerical and computational methods for dimensionality reduction allow for low-dimensional representation of genome-scale expression data for downstream clustering, trajectory reconstruction, and biological interpretation. However, a comprehensive and quantitative evaluation of the performance of these techniques has not been established. We present an unbiased framework that defines metrics of global and local structure preservation in dimensionality reduction transformations. Using discrete and continuous scRNA-seq datasets, we find that input cell distribution and method parameters are largely determinant of global, local, and organizational data structure preservation by eleven published dimensionality reduction methods. Code available at github.com/KenLauLab/DR-structure-preservation allows for rapid evaluation of further datasets and methods.

## Introduction

Single-cell RNA sequencing (scRNA-seq) offers parallel, genome-scale measurement of tens of thousands of transcripts for thousands of cells (Klein *et al*., 2015; Macosko *et al*., 2015). Data of this magnitude provide powerful insight toward cell identity and developmental trajectory – states and fates – which are used to interrogate tissue heterogeneity and characterize disease progression (Regev *et al*., 2017; Wagner *et al*., 2019). Yet, extracting meaningful information from such high-dimensional data presents a massive challenge. Numerical and computational methods for dimensionality reduction have been developed to reconstruct underlying distributions from native “gene space” and provide low-dimensional, latent representations of single-cell data for more intuitive downstream interpretation. Basic clustering methods and linear transformations such as principal component analysis (PCA) have proven to be valuable tools in this field (Sorzano, Vargas and Montano, 2014; Levine *et al*., 2015; Kiselev *et al*., 2017; Tsuyuzaki *et al*., 2019). However, given the distribution and sparsity of scRNA-seq data, complex, nonlinear transformations are often required to capture and visualize expression patterns. Unsupervised machine learning techniques and, more recently, deep learning methods, are being rapidly developed to assist researchers in single-cell transcriptomic analysis (Van der Maaten and Hinton, 2008; Pierson and Yau, 2015; Linderman *et al*., 2017; Wang *et al*., 2017; Becht *et al*., 2018; Ding, Condon and Shah, 2018; Lopez *et al*., 2018; Risso *et al*., 2018; Eraslan *et al*., 2019; Townes *et al*., 2019). Because these techniques condense cell features in the native space to a small number of latent dimensions for visualization, lost information can result in exaggerated or dampened cell-cell similarity. 
Furthermore, depending on input data and user-defined parameters, the structure of resulting embeddings can vary greatly, potentially altering biological interpretation (Kobak and Berens, 2019).

With a deluge of computational techniques for dimensionality reduction, the field lacks a comprehensive assessment of the distortion of native data organization that results from such methods. Characterization of these tools would enable end users to confidently determine the most suitable methods for their application. We present an unbiased, quantitative framework for evaluation of data structure preservation by dimensionality reduction transformations. We propose three metrics for broad characterization of strengths and weaknesses of these methods based on cell-cell distance in native gene space. Initial benchmarking of eleven published software tools on discrete and continuous cell distributions shows global, local, and organizational data structure conservation under different parameter and input conditions. Interpretation and best practices, as well as extensibility of this method to other dimensionality reduction tools, are discussed.

## Results

### Cell distance distributions describe global structure of high-dimensional datasets

To evaluate dimensionality reduction techniques, Euclidean cell-cell distance in native, high-dimensional space is used as a quantitative standard. Counts of unique molecular identifiers (UMI) for each gene make up the features, or columns, of the dataset, while every observation, or row, represents a single cell (Figure 1A). In this way, transcriptomic data can be represented as an *m × n* matrix (cells × genes).

The global data structure in the native space can be constructed by first calculating an *m × m* matrix that contains the pairwise distances between all observations in *n* dimensions (Figure 1B, top). The upper triangle of this distance matrix contains all unique cell distances in the dataset, which can then be represented by a probability density distribution as in Figure 1D. From these distances, local “neighborhoods” can be defined in the form of a K nearest-neighbor (Knn) graph. The Knn graph is represented by a binary *m × m* matrix that defines the K cells with the shortest distances from each cell in the dataset (Figure 1B, bottom). Similarly, a distance matrix, distance distribution, and Knn graph can be constructed from a low-dimensional latent space resulting from dimensionality reduction (Figure 1C).
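The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy counts matrix, not the code from the project repository; the function names (`pairwise_distances`, `unique_distances`, `knn_graph`) and the toy data are assumptions made for this example.

```python
import numpy as np


def pairwise_distances(X):
    """m x m Euclidean distance matrix for an m x n array (cells x genes).

    Broadcasting keeps the sketch short; memory grows as O(m^2 * n),
    so this form only suits small toy inputs.
    """
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


def unique_distances(D):
    """Flatten the strict upper triangle: one value per unordered cell pair."""
    return D[np.triu_indices(D.shape[0], k=1)]


def knn_graph(D, k):
    """Binary m x m matrix; row i marks the k nearest neighbors of cell i."""
    m = D.shape[0]
    masked = D.copy()
    np.fill_diagonal(masked, np.inf)  # a cell is not counted as its own neighbor
    nearest = np.argsort(masked, axis=1)[:, :k]
    G = np.zeros((m, m), dtype=int)
    G[np.arange(m)[:, None], nearest] = 1
    return G


# Toy stand-in for a counts matrix: 50 cells x 20 genes (hypothetical data).
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 20)).astype(float)

D_native = pairwise_distances(counts)
dists = unique_distances(D_native)   # input to the distance distribution
G_native = knn_graph(D_native, k=5)  # local neighborhoods
```

The same three calls apply unchanged to a low-dimensional embedding, which is what makes the native and latent structures directly comparable.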

Overall distance preservation following dimensionality reduction is measured by Mantel correlation (Figure 1D, right). This method was designed for symmetrical matrices that represent element-wise similarity between two vectors, and appropriately accounts for multiple testing of distances (Mantel, 1967). Structural alteration of the cell distance distribution – constructed from the upper triangle of the distance matrix – is quantified by the Wasserstein metric, or Earth-Mover’s Distance (EMD) (Figure 1D, left). Widely applied to image processing, this metric determines the energy associated with shifting one distribution to another, and can also be defined as the area between two cumulative probability distributions (Werman, Peleg and Rosenfeld, 1985; Rubner, Tomasi and Guibas, 1998, 2000; Levina and Bickel, 2001). Finally, preservation of the Knn graph before and after low-dimensional embedding can also be quantified as the percentage of total binary matrix elements conserved in order to describe maintenance of local substructures in the data.
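As a rough sketch of the three metrics, the snippet below computes the Mantel test statistic (the Pearson correlation between matched unique distances), a one-dimensional EMD, and Knn preservation on toy data. Several choices here are assumptions for illustration only: the min-max rescaling of distances before computing EMD, the equal-sample-size shortcut for EMD (the Methods cite `scipy.stats.wasserstein_distance` for the actual analysis), and the use of two retained coordinates as a stand-in "embedding".

```python
import numpy as np


def pairwise_distances(X):
    """m x m Euclidean distance matrix (small inputs only)."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


def upper(D):
    """Unique cell-cell distances: strict upper triangle, flattened."""
    return D[np.triu_indices(D.shape[0], k=1)]


def minmax(v):
    """Rescale to [0, 1]; an assumption so that the two spaces share a scale."""
    return (v - v.min()) / (v.max() - v.min())


def distance_correlation(D_a, D_b):
    """Pearson r between matched unique distances (the Mantel statistic)."""
    return float(np.corrcoef(upper(D_a), upper(D_b))[0, 1])


def emd_equal_size(a, b):
    """Wasserstein-1 distance between two equal-length 1D samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values.
    """
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))


def knn_graph(D, k):
    """Binary matrix marking the k nearest neighbors of each cell."""
    masked = D.copy()
    np.fill_diagonal(masked, np.inf)
    nearest = np.argsort(masked, axis=1)[:, :k]
    G = np.zeros(D.shape, dtype=int)
    G[np.arange(D.shape[0])[:, None], nearest] = 1
    return G


def knn_preservation(G_a, G_b):
    """Percentage of binary Knn-matrix elements that agree."""
    return 100.0 * float(np.mean(G_a == G_b))


rng = np.random.default_rng(1)
native = rng.normal(size=(60, 25))  # toy "native" space
latent = native[:, :2]              # stand-in embedding: keep two coordinates
D_nat, D_lat = pairwise_distances(native), pairwise_distances(latent)

r = distance_correlation(D_nat, D_lat)
emd = emd_equal_size(minmax(upper(D_nat)), minmax(upper(D_lat)))
knn_pct = knn_preservation(knn_graph(D_nat, 10), knn_graph(D_lat, 10))
```

Because Knn preservation counts every element of the binary matrix, agreement on zeros contributes to the score; values are therefore high whenever K is small relative to the number of cells, and differences between methods are best read relative to one another.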

### Discrete and continuous cell distributions exemplify common biological patterns *in vivo*

A major consideration for testing dimensionality reduction techniques is the true structure of the input data in native, high-dimensional space. For the scope of our evaluation applied to single-cell transcriptomics, we identify two overarching classes of scRNA-seq data for proof-of-principle: discrete and continuous. Discrete single-cell data are composed of differentiated cell types with unique, highly discernible gene expression profiles. These data include classic PBMC experiments and neuronal datasets which can be easily clustered into distinct cell types (Zeisel *et al*., 2015; Rheaume *et al*., 2018). Conversely, continuous data represent multi-faceted expression gradients present during cell development and differentiation, and are commonly associated with dynamic systems such as erythropoiesis or embryonic development (Tusi *et al*., 2018; Wagner *et al*., 2019). Computational tools for trajectory inference and lineage reconstruction from these data are being rapidly developed to query differential expression and gene-regulatory events involved in cell fate decisions (Qiu *et al*., 2017; Herring, Chen, *et al*., 2018; Saelens *et al*., 2019; Van den Berge *et al*., 2019).

Mouse retina cells, analyzed using Drop-seq by Macosko and coworkers, provide a discrete cell distribution for our analysis (Macosko *et al*., 2015) (GEO accession ID GSM1626793). Counts data from 20,478 genes for 1,326 cells were analyzed using PhenoGraph to determine “ground-truth” cell clusters by the Louvain algorithm (Figure 2A) (Levine *et al*., 2015). We performed relatively coarse clustering, ignoring subtype heterogeneity in favor of clusters reflecting principal cell identity amenable to our downstream analyses (see Methods). A t-SNE projection primed with 100 principal components (PCs) of all transcript counts allows for visualization of the data structure and represented cell types (Figure 2B). As evident from the 2D embedding, these data are highly discrete, and constituent cell clusters are easily distinguished by gene expression (Figure 2C, Figure S2A).

Mouse colon data, representing a continuous distribution of actively differentiating cells along the crypt-villus axis of the colonic epithelium, were generated using inDrops scRNA-seq (Herring, Banerjee, *et al*., 2018) (GEO accession ID GSM2743164). Counts data from 25,504 genes for 1,117 cells were similarly analyzed by PhenoGraph and t-SNE to visualize continuous data structure (Figure 2D,E). The six clusters form a branching continuum of cell states identified by expression markers (Figure 2F, Figure S2B), resolving two major lineages in the colon: absorptive colonocytes and secretory goblet cells (Lepourcelet *et al*., 2005; Tamura *et al*., 2007; Larsson *et al*., 2012). These clusters are linked together by pseudotemporal trajectories and thus their arrangement is expected to be conserved upon low-dimensional embedding into a latent space.

### Input cell distribution determines performance of global structure preservation

Using the metrics outlined in Figure 1, we compared eleven published dimensionality reduction techniques applied to continuous and discrete datasets. To allow for direct input to dimensionality reduction tools, raw counts for both datasets were feature-selected to the 500 most variable genes. This “brute force” feature selection technique may not be optimal for enriching rare cell types for higher-level analyses (Chen, Herring and Lau, 2018). However, for our purpose of proof-of-principle evaluation of dimensionality reduction methods, it easily provides a high level of differential information for unsupervised algorithms without biasing data input. Calculating our metrics on all cells in the dataset, we first assess global structure preservation following transformation to a dimension-reduced latent space. Representative examples of two-dimensional projections and their corresponding distance distributions and correlations using SIMLR for the retina dataset and UMAP for the colon dataset are shown in Figure 3A and Figure 3E, respectively. Notably, the largest discrepancy in structural preservation is between the two datasets, highlighting the significance of input cell distribution to overall method performance. Intuitively, Knn preservation is higher for all eleven methods when applied to the colon dataset, reflecting the notion of continuous neighborhoods – a moving window of expression gradients – connecting all cells through developmental pseudotime. Another important observation regarding the dimension-reduced latent spaces involves the directionality of the cell distance distribution shift. A compression of distances from native to latent space is indicated by a shift left in the cumulative distance distribution (Figure 3B,F, Figure S1A) or below the identity line in the unique distance correlation (Figure 3D,H, Figure S1B).
Alternatively, a shift right in the cumulative distance distribution or above the identity line of the distance correlation signifies an exaggeration of native distances (Figure S1). These phenomena are important in the context of global versus local structure preservation. For example, UMAP appears to compress small, local distances to a greater extent than t-SNE, while both methods maintain relative global structure as indicated by a favorable correlation of large distances between clusters. Although this characteristic of UMAP embeddings causes greater information loss reflected in less favorable preservation metrics (Figure 3C,G), clusters within the resulting projections tend to be highly condensed and perhaps more easily interpreted (Figure S3A,B).
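The direction of the shift can be summarized numerically. The sketch below is one possible way to do so and is not taken from the project code: it assumes distances are min-max rescaled in each space, then reports the difference in mean normalized distance (equal to the signed area between the two cumulative distributions) and the fraction of cell pairs that fall below the identity line. The function name `shift_direction` and the toy inputs are hypothetical.

```python
import numpy as np


def minmax(v):
    """Rescale to [0, 1]; an assumption so both spaces share a scale."""
    return (v - v.min()) / (v.max() - v.min())


def shift_direction(native_d, latent_d):
    """Summarize compression vs. expansion of matched cell-cell distances.

    Returns (signed_shift, frac_below):
      signed_shift < 0  -> latent distribution shifted left (compression)
      signed_shift > 0  -> latent distribution shifted right (expansion)
      frac_below        -> share of pairs under the identity line
    """
    nat, lat = minmax(native_d), minmax(latent_d)
    signed_shift = float(np.mean(lat) - np.mean(nat))
    frac_below = float(np.mean(lat < nat))
    return signed_shift, frac_below


# Toy distance vectors: squaring values in [0, 1] shrinks them,
# taking the square root enlarges them.
native_d = np.linspace(0.0, 1.0, 101)
compressed = native_d ** 2
expanded = np.sqrt(native_d)

shift_c, below_c = shift_direction(native_d, compressed)
shift_e, below_e = shift_direction(native_d, expanded)
```

In this toy case the compressed vector yields a negative shift with nearly all pairs below the identity line, and the expanded vector yields a positive shift with none below it.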

### Parameter optimization plays key role in structural preservation

User-defined parameters for unsupervised algorithms often present themselves as “black-box” knobs with unknown consequences. Tuning these parameters can be a daunting task for the single-cell analyst, but is known to be crucial to algorithm performance (Belkina *et al*., 2018; Kobak and Berens, 2019; Tsuyuzaki *et al*., 2019). Using our proposed metrics, we evaluated global structure preservation across a range of perplexity values for t-SNE and UMAP algorithms applied to both discrete and continuous data. Through a balance of distance correlation, EMD, and Knn preservation, we can identify an initial range of optimal perplexity values between 2 and 10 % of the total number of cells in the dataset (Figure S3C).
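As a worked example of the suggested starting range, the snippet below converts 2 to 10 % of the cell count into absolute parameter values for the two datasets used here (1,326 retina cells and 1,117 colon cells). The helper name `perplexity_bounds` is hypothetical, and rounding to the nearest integer is an assumption. For UMAP, the corresponding neighborhood-size parameter is `n_neighbors`.

```python
def perplexity_bounds(n_cells, low=0.02, high=0.10):
    """Lower and upper bounds of a starting search range,
    expressed as fractions of the number of cells."""
    return round(low * n_cells), round(high * n_cells)


for name, n_cells in [("retina", 1326), ("colon", 1117)]:
    lower, upper = perplexity_bounds(n_cells)
    print(f"{name}: search from about {lower} to {upper}")
```

For the retina dataset this gives roughly 27 to 133, and for the colon dataset roughly 22 to 112. These bounds are a starting point for a sweep, not fixed optima.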

### Substructure analysis illuminates contribution to global performance

To corroborate results of global structure preservation and dissect contribution of local (within cluster) and organizational (between cluster) distances to overall dimensionality reduction performance, clusters were isolated for targeted substructure quantification. Here, we can measure distance preservation of individual clusters as well as distances between clusters to emphasize local arrangement (Figure S1C,D).

Retinal cone cells (Figure 2A, cluster 4, *n* = 94) were used as an example of local distances in the discrete dataset, while mature colonocytes (Figure 2D, cluster 1, *n* = 273) were isolated in the colon dataset (Figure 4A,B,E,F). Local distance compression represents the overarching trend for the eleven evaluated tools, indicated by a correlation shift below the identity line. The latent spaces from scVI and 10-component PCA are notable exceptions, yielding the two lowest EMD values for each dataset (Figure S4A). This most likely results from the 10-dimensional latent spaces of these methods capturing more cellular variability than 2D projections. Added noise in the SIMLR latent space of mouse retina cells indicates a disagreement with Louvain cluster membership, and may be attributed to the truncated, 500-feature input used for our analysis (Figure 4B). Moreover, this observation suggests that discrete, “on-off” expression patterns are less robust to dropouts that cause misassignment of cell type than continuous gradients of gene expression.

Besides maintenance of intra-cluster local structure, dimensionality reduction methods are also tasked with preserving cellular neighborhoods, or relationships between clusters. By calculating distance distributions from cells in one cluster to those in another, we can evaluate these associations. Furthermore, we can analyze pairwise cluster-cluster distances to investigate organization of data substructures (Figure S1C). In the mouse retina dataset, distances between bipolar cells, rod cells, and amacrine cells (Figure 2A, clusters 0, 1, 2, *n* = 309, 281, 258) are marked largely by compression, with some tools altering the arrangement of the three clusters (Figure 4D, red boxes). For example, the bipolar and amacrine clusters are closest to one another in the native gene space, but the bipolar cell cluster is closer to the rod cell cluster in the UMAP embedding, as indicated by the ordering of each distribution on the axes of the 2D histogram plot. Conversely, relative distances between three adjacent clusters along the goblet cell lineage (Figure 2D, clusters 0, 3 and 4, *n* = 274, 140 and 135) are more highly conserved by all dimensionality reductions. These results confirm that related cells in continuous scRNA-seq data are tethered to their neighbors through intermediate expression states, resulting in improved local structure preservation upon latent projection (Figure S4).
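The cluster-arrangement check described above can be illustrated with a small sketch. It assumes that each pair of clusters is summarized by the median of its between-cluster distances, which is a simplification of the full distance distributions shown in the figures. The function names and the synthetic three-cluster data are hypothetical.

```python
import numpy as np


def cross_distances(A, B):
    """|A| x |B| matrix of Euclidean distances between two groups of cells.

    Unlike a within-dataset distance matrix, this is not symmetric,
    so the whole matrix (not just an upper triangle) is used.
    """
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)


def cluster_pair_order(X, labels, clusters):
    """Rank cluster pairs from closest to farthest by median distance."""
    pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
    medians = {
        (a, b): np.median(cross_distances(X[labels == a], X[labels == b]))
        for a, b in pairs
    }
    return sorted(pairs, key=medians.get)


# Synthetic example: three tight clusters whose arrangement differs
# between the "native" and "latent" spaces.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1, 2], 40)
native_centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
latent_centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 8.0]])
native = native_centers[labels] + rng.normal(scale=0.1, size=(120, 3))
latent = latent_centers[labels] + rng.normal(scale=0.1, size=(120, 2))

native_order = cluster_pair_order(native, labels, [0, 1, 2])
latent_order = cluster_pair_order(latent, labels, [0, 1, 2])
rearranged = native_order != latent_order  # True when the ordering changes
```

Here clusters 0 and 2 are closest in the native space but clusters 0 and 1 are closest in the latent space, which is the kind of rearrangement flagged for the retina dataset.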

## Discussion

Single-cell RNA sequencing (scRNA-seq) allows for high-throughput, genome-scale measurement of mRNA expression in individual cells. Interpretation, pattern detection, and visualization of such high-dimensional observations present major challenges, and current datasets are constantly expanding in breadth and resolution. Systems biologists have derived and adapted numerical and computational methods for dimensionality reduction to allow for low-dimensional representation of single-cell data and deduction of cell states and fates (Van der Maaten and Hinton, 2008; Pierson and Yau, 2015; Linderman *et al*., 2017; Wang *et al*., 2017; Becht *et al*., 2018; Ding, Condon and Shah, 2018; Lopez *et al*., 2018; Risso *et al*., 2018; Eraslan *et al*., 2019; Townes *et al*., 2019). As software tools for single-cell analysis become widely available to lab-based researchers, there is a need to thoroughly understand how underlying biological information is maintained or distorted by these techniques. Two scRNA-seq datasets, representing discrete, differentiated cell types (Macosko *et al*., 2015) and continuous, hierarchical differentiation states (Herring, Banerjee, *et al*., 2018) were used to investigate cell distance preservation by eleven published dimensionality reduction methods. These dichotomous data offer insight into strengths and weaknesses of these tools for different applications. Distance correlation, EMD between distance distributions, and nearest-neighbor preservation were assessed to quantify cell dispersion, global data structure preservation, and neighborhood maintenance.

We identified dispersion trends in local and global distance distributions that denote expansion and contraction of native cell distances (Figure S1). This allowed us to evaluate general performance of dimensionality reduction methods on entire single-cell datasets (Figure 3), and take a deeper dive to examine how local distances – within or between clusters – contribute to the global structure of a low-dimensional latent space (Figure 4). With a goal of grouping cells by their gene expression profiles, most dimensionality reduction tools evaluated herein compressed local distances, embellishing cluster similarity, while maintaining or expanding global distances, exaggerating cluster distinction (Figure 3, Figure 4). These characteristics of dimensionality reduction methods are desirable for most applications. However, resolution of rare cell types and sub-cluster heterogeneity may be lost, stressing the importance of input data quality, feature selection, and user-defined parameters.

Discrete scRNA-seq data are more susceptible to structural perturbation by downstream dimensionality reduction, as indicated by larger EMD values and lower distance correlations in the retina dataset than in the colon dataset (Figure 3, Figure S4). The possibility of cell type misclassification, or “noisy” cluster membership in low-dimensional embeddings compared to consensus Louvain clustering, is exacerbated in discrete data. This noise is intensified by gene dropouts, and is therefore sensitive to sequencing depth and capture efficiency. We also observed cluster rearrangement within the retina dataset, suggesting that relative substructure organization is poorly defined for discrete datasets (Figure 4D, Figure S4). On the other hand, continuous cell distributions are more robust to these effects. In a continuum of gene expression, as in the actively differentiating cells along the crypt-villus axis of the colonic epithelium, cell clusters are tethered to one another through intermediate states. In this way, preservation of local and relative substructures is built into dimensionality reduction analysis of continuous datasets, and results in more credible representations of native data structure (Figure 4F,H, Figure S4). Finally, cursory exploration of the perplexity parameter in t-SNE and UMAP reveals a range of optimal values that yield favorable structure preservation metrics, endorsing the need for parameter optimization for dimensionality reduction of scRNA-seq datasets (Figure S3C).

As high-dimensional datasets become increasingly pervasive in systems biology, computational tools for reliable and reproducible analysis of these data become tremendous assets to discovery. Dimensionality reduction techniques allow for embedding cellular observations with tens of thousands of gene features into a low-dimensional space for visualization and downstream processing. Many such methods exist, calling on concepts from mathematics and computer science to aggregate underlying biological patterns into a latent representation of the data. We present an unbiased, quantitative framework based on native cell distance to evaluate data structure preservation by dimensionality reduction tools. All code associated with this project is available at github.com/KenLauLab/DR-structure-preservation, and is readily extensible to additional scRNA-seq datasets and dimensionality reduction methods.

## Author Contributions

CNH and KSL conceived of the study. CNH developed methodology, analyzed the data, and generated visualizations. CNH wrote the manuscript. KSL participated in the writing of the manuscript and interpretation of results.

## Declaration of Interests

The authors declare no competing interests.

## Methods

### Cell Filtering

Raw counts expression matrices downloaded from GEO (accession IDs GSM1626793, GSM2743164) were filtered for high-quality cells prior to downstream analysis. The cumulative sum of total UMI counts for each cell was plotted along with the slope of the secant line to the curve as a function of rank-ordered cell. The distance between these two curves was used as a metric for determining the rate of diminishing cell quality. The cell number at which this distance was 50 % of its maximum was chosen as a cutoff, and cells contributing fewer UMI counts were removed. Next, a 100-component PCA and UMAP with n_neighbors value of 0.5 % of the total cells in the dataset were used to visualize cell populations and manually gate out clusters containing high mitochondrial counts, indicating dead cells.

Process shown in: github.com/KenLauLab/DR-structure-preservation/dev/QC.ipynb
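The UMI-based cutoff can be sketched as follows. This is only one possible reading of the description above and is not the notebook's code. It assumes cells are ranked from highest to lowest total UMI count, that the secant is the straight line from the origin to the end of the cumulative curve, and that the cutoff is the first rank after the peak at which the gap between the two curves falls to 50 % of its maximum. The function name `umi_cutoff` and the toy counts are hypothetical.

```python
import numpy as np


def umi_cutoff(total_counts, frac=0.5):
    """Return (cutoff_rank, kept_counts) under the assumptions stated above."""
    ranked = np.sort(total_counts)[::-1]       # highest-count cells first
    cum = np.cumsum(ranked)                    # cumulative UMI curve
    ranks = np.arange(1, len(cum) + 1)
    secant = cum[-1] * ranks / len(cum)        # straight line, origin to end
    gap = cum - secant                         # distance between the curves
    peak = int(np.argmax(gap))
    # First rank at or after the peak where the gap has decayed to
    # frac * max. The gap is zero at the last rank, so a match exists.
    below = np.nonzero(gap[peak:] <= frac * gap[peak])[0]
    cutoff = peak + int(below[0])
    return cutoff, ranked[:cutoff]


# Toy data: 100 high-quality cells (1,000 UMIs each) followed by
# 900 low-quality barcodes (10 UMIs each).
toy_counts = np.concatenate([np.full(100, 1000), np.full(900, 10)])
cutoff, kept = umi_cutoff(toy_counts)
```

In the toy example every high-count cell is retained, and the cutoff falls partway into the low-count tail. A stricter or looser threshold follows from changing `frac`.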

### Clustering

PhenoGraph (Levine *et al*., 2015) was used to perform Louvain clustering on both datasets in Python. To create coarse, ground-truth clusters, the algorithm was run on 100 principal components of all genes in each dataset. For the retina data, 100 PCs of 20,478 genes explained 33.5 % of the variance in the dataset. For the colon data, 100 PCs of 25,505 genes explained 54.0 % of the variance. *k* values of 50 and 100 for generating the Knn graph to seed the Louvain algorithm for the retina and colon datasets, respectively, were chosen to provide coarse clustering of major cell types. Nine resulting clusters for the retina dataset and six resulting clusters in the colon dataset were analyzed by Seurat’s FindAllMarkers and DoHeatmap functions (Butler *et al*., 2018) to obtain visualizations of up- and down-regulated genes in each cluster (Figure 2A,D).
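To make the "variance explained by 100 PCs" figures concrete, the following sketch computes the fraction of total variance captured by the leading principal components from the singular values of the centered data matrix. It uses random toy data and a hypothetical helper name, and does not reproduce the percentages reported above.

```python
import numpy as np


def explained_variance(X, n_components):
    """Fraction of total variance captured by the top principal components."""
    Xc = X - X.mean(axis=0)                  # center each gene (column)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2                             # proportional to PC variances
    return float(var[:n_components].sum() / var.sum())


# Toy matrix: 200 cells x 50 genes with a few dominant directions.
rng = np.random.default_rng(3)
scales = np.concatenate([np.full(5, 10.0), np.ones(45)])
X = rng.normal(size=(200, 50)) * scales

frac_5 = explained_variance(X, 5)
frac_all = explained_variance(X, 50)
```

A lower explained fraction for the same number of components, as reported for the retina data relative to the colon data, means that more of the variance lies outside the retained PCs.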

Process shown in:

### Dimensionality Reduction

All dimensionality reduction was performed on feature-selected data containing the most variable genes in each dataset. Genes were rank-ordered by variance using the Pandas (version 0.22.0) DataFrame.var function in Python, and the top 500 were chosen. Each dimensionality reduction technique was run “out-of-the-box” with default parameters on the feature-selected data. DCA, scvis, scVI, ZINB-WaVE and GLM-PCA take raw, unnormalized counts as input. Developers of ZIFA recommend a log2 transformation of counts, which we first normalized to the maximum UMI count within each cell. Arcsinh-transformed counts normalized to the maximum UMI count in each cell were used for all other methods (t-SNE, FIt-SNE, UMAP, SIMLR, PCA).
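A minimal sketch of the feature selection and transformations described above follows, using NumPy rather than Pandas. The exact order of operations is an interpretation of the text: counts are divided by each cell's maximum UMI count and then transformed. The pseudocount of 1 in the log2 variant is an assumption, since it is not specified here. All function names are hypothetical.

```python
import numpy as np


def top_variable_genes(counts, n_top=500):
    """Column indices of the n_top genes with the highest variance.

    ddof=1 matches the default of pandas DataFrame.var.
    """
    variances = counts.var(axis=0, ddof=1)
    return np.argsort(variances)[::-1][:n_top]


def scale_by_cell_max(counts):
    """Divide each cell (row) by its maximum UMI count."""
    cell_max = counts.max(axis=1, keepdims=True).astype(float)
    cell_max[cell_max == 0] = 1.0  # guard against empty cells
    return counts / cell_max


def arcsinh_norm(counts):
    """Arcsinh of max-scaled counts (one reading of the Methods text)."""
    return np.arcsinh(scale_by_cell_max(counts))


def log2_norm(counts, pseudocount=1.0):
    """log2 of max-scaled counts; the pseudocount is an assumption."""
    return np.log2(scale_by_cell_max(counts) + pseudocount)


# Toy counts: 30 cells x 40 genes with gene-specific expression rates.
rng = np.random.default_rng(4)
rates = rng.uniform(0.5, 20.0, size=40)
counts = rng.poisson(rates, size=(30, 40)) + 1

selected = top_variable_genes(counts, n_top=5)
arcsinh_data = arcsinh_norm(counts[:, selected])
log2_data = log2_norm(counts[:, selected])
```

Methods that model raw counts (DCA, scvis, scVI, ZINB-WaVE, GLM-PCA) would instead receive `counts[:, selected]` without transformation.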

Process shown in:

### Distance Metric Calculations

Mantel test for correlation between symmetric Euclidean distance matrices (Figure 3C,D,G,H) was performed using the skbio.stats.distance.mantel function from the scikit-bio package (version 0.5.4). Pearson correlation was performed for local distance preservation analysis between clusters, as the resulting cell distance matrices are not symmetrical (Figure 3C,D,G,H). The scipy.stats.pearsonr function from the scipy package (version 1.1.0) was used. The scipy.stats.wasserstein_distance function from the scipy package (version 1.1.0) was used to calculate Earth Mover’s Distance between the flattened vectors containing unique distances between all cells in the dataset (upper triangle of distance matrix, Figure 3B,C,F,G). The exception was local comparisons between clusters, where the entire flattened matrix was used because the cell-cell distance matrices are not symmetrical (Figure 3C,D,G,H). A Knn graph with K = 30 was constructed using the sklearn.neighbors.kneighbors_graph function from the scikit-learn package (version 0.20.0). Knn preservation was calculated as the percentage of elements in the Knn graph matrix that are conserved.
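For readers unfamiliar with the Mantel test, the sketch below shows its permutation logic in plain NumPy. It is an illustration of the general technique, not a reimplementation of the scikit-bio function used in this work: the number of permutations, the two-sided comparison, and the function name `mantel_sketch` are all choices made for this example.

```python
import numpy as np


def mantel_sketch(D_a, D_b, permutations=199, seed=0):
    """Pearson correlation between two symmetric distance matrices,
    with a permutation p-value.

    Rows and columns of D_b are permuted together, so each permuted
    matrix is still a valid distance matrix over relabeled cells.
    """
    m = D_a.shape[0]
    iu = np.triu_indices(m, k=1)
    a = D_a[iu]
    r_obs = float(np.corrcoef(a, D_b[iu])[0, 1])

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(permutations):
        p = rng.permutation(m)
        r_perm = np.corrcoef(a, D_b[np.ix_(p, p)][iu])[0, 1]
        hits += abs(r_perm) >= abs(r_obs)
    p_value = (hits + 1) / (permutations + 1)
    return r_obs, float(p_value)


# Toy example: distances in a 5D space vs. its first two coordinates.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))
D_full = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
D_two = np.linalg.norm(X[:, None, :2] - X[None, :, :2], axis=-1)

r_obs, p_value = mantel_sketch(D_full, D_two)
```

Permuting whole rows and columns, rather than shuffling individual distances, respects the fact that the entries of a distance matrix are not independent observations.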

Functions used for above calculations can be found in:

### Visualization

Cumulative cell distance distributions were plotted from the upper triangle of symmetrical cell distance matrices (using the triu_indices function from the numpy Python package (version 1.16.3)). The histogram and cumsum functions from the numpy Python package (version 1.16.3) were used to plot cumulative distribution functions using n/100 bins, where n is the length of the flattened distance vector. Unique distance correlation was visualized using the JointGrid and kdeplot functions from the seaborn package (version 0.9.0), as well as the pyplot.hist2d function from the matplotlib package (version 3.0.1).
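The cumulative distribution calculation can be sketched with the same numpy functions named above. Plotting is omitted, and the helper name, the integer division used for n/100, and the toy data are assumptions made for this example.

```python
import numpy as np


def cumulative_distribution(distances):
    """Empirical CDF of a flattened distance vector using n/100 bins.

    Returns the right edge of each bin and the cumulative fraction
    of distances at or below that edge.
    """
    n_bins = max(1, len(distances) // 100)
    hist, edges = np.histogram(distances, bins=n_bins)
    cdf = np.cumsum(hist) / hist.sum()
    return edges[1:], cdf


# Toy data: 50 cells give 50 * 49 / 2 = 1,225 unique distances.
rng = np.random.default_rng(6)
X = rng.normal(size=(50, 10))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
unique_d = D[np.triu_indices(D.shape[0], k=1)]

x, cdf = cumulative_distribution(unique_d)
```

Plotting `cdf` against `x` for the native and latent spaces on shared axes gives the kind of comparison in which left or right shifts become visible.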

Functions used for above visualizations can be found in:

### Lead Contact and Code Availability

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, KSL (ken.s.lau@vanderbilt.edu).

All code for this project is available at github.com/KenLauLab/DR-structure-preservation.

Original data for this project is available on GEO:

Accession ID GSM1626793 (mouse retina, Macosko *et al*., 2015)

Accession ID GSM2743164 (mouse colon, Herring, Banerjee, *et al*., 2018)

## Key Resources Table

## Acknowledgments

The authors would like to acknowledge the Vanderbilt Epithelial Biology Center and the Quantitative Systems Biology Center for helpful discussions. CNH and KSL are funded by NIH grants R01DK103831, U2CCA233291, and R01CA238553.
