Abstract
Dimensionality reduction is standard practice for filtering noise and identifying relevant dimensions in large-scale data analyses. In biology, single-cell expression studies almost always begin with reduction to two or three dimensions to produce ‘all-in-one’ visuals of the data that are amenable to the human eye, and these are subsequently used for qualitative and quantitative analysis of cell relationships. However, there is little theoretical support for this practice. We examine the theoretical and practical implications of low-dimensional embedding of single-cell data, and find extensive distortions incurred on the global and local properties of biological patterns relative to the high-dimensional, ambient space. In lieu of this, we propose semi-supervised dimension reduction to higher dimension, and show that such targeted reduction guided by the metadata associated with single-cell experiments provides useful latent space representations for hypothesis-driven biological discovery.
Introduction
The high-dimensionality of “big data” genomics datasets is a considered a practical nuisance as it requires dimension reduction to filter noise, enable tractable computation, and to facilitate exploratory analysis. Trial and error application of common techniques has resulted in a currently accepted workflow combining initial dimension reduction to a few dozen dimensions using principal component analysis (PCA) with further non-linear reduction to two dimensions using t-SNE [1] or UMAP [2–5].
The methods used are largely unsupervised, with dimension reduction applied directly to the data modality of interest, consisting of expression, genetic, or other genomic data, without regard to metadata that can include cell type in the case of single-cell experiments, spatial information, environmental conditions, etc. Instead, data is reduced to two dimensions with metadata layered on the image to confirm hypotheses and to visually detect interesting patterns or anomalies.
While it is often claimed that quantitative properties of genomics data are recapitulated in such two-dimensional spaces [4, 5], there is surprisingly little supporting theory for this. For example, while the popular t-SNE and UMAP methods were intended to faithfully represent local and global structure of high-dimensional data in two or three dimensions, there is evidence they fail in this regard [3, 6], and theorems providing guarantees on the embeddings rely on numerous assumptions unlikely to hold in practice and ignore the coupling of PCA to non-linear methods [7]. Yet, in single-cell gene expression analysis, PCA pre-conditioned t-SNE and UMAP visuals are often used to infer or confirm relationships between cells where apparent closeness in two-dimensional spaces are assumed to translate to transcriptional compatibility in higher dimensions. Such visuals are used in qualitative and quantitative manners to ‘validate’ clustering [3, 7], cell trajectories [8, 9], data integration [10–12], and to direct focus of downstream analysis or experimentation [4, 8, 13].
Dimension reduction methods are used across many disciplines, but single-cell gene expression analyses are particularly well-suited to explore the properties of various low-dimensional reduction methods. Analyses of single-cell RNA-seq data with biological ground truth can lead to testable hypotheses and rigorous benchmarks of predictions [14]. However, the organization of tissues as composites of distinct cell types is not well understood [15], and thus meaningful reduction of singlecell data for facilitating analyses is of great relevance, creating a pressing need to assess the validity of omnipresent techniques such as t-SNE and UMAP. On a more fundamental level, estimates of the intrinsic dimension of transcriptomes may yield insights into the nature of transcriptional programs.
We therefore investigate dimensionality reduction for single-cell gene expression, focusing on fundamental obstacles in embedding high-dimensional data in two dimensions and on the scale of subsequent distortions. Our results lead us to consider semi-supervised reduction into higher dimensions as a way to circumvent problems with current approaches.
Results
Distortion by Projection to Low Dimensions
To understand the effects of standard two-dimensional reduction techniques on single-cell RNA-seq data, we began by examining the extent of distortion in embedded cells. A key result on linear low-distortion embedding of points in Euclidean space is the Johnson-Lindenstrauss Lemma [16], which provides a sufficiency condition for low-distortion dimension reduction: the preservation of pairwise distances of m points within a factor of 1 ± ϵ can be accomplished with order log(m)/ϵ2 dimensions. While this shows that dimension reduction can preserve the structure of high dimensional data, the number of required dimensions is much higher than two or three. Using the constant factor from [17], distortion of pairwise distances within 20% for a dataset of 10,000 points can be achieved with at least 1,842 dimensions.
To further understand the extent of distortion in two dimensions, we first focused on a particularly difficult case: the embedding of equidistant points. It is impossible to embed greater than n + 1 equidistant points in for k ≤ n (Supplementary Note 1), but even relaxing the equidistance constraint to near-equidistance, where for any three points in a pair will be at unit distance, one can only accommodate seven points in or ten in [18]. Even settling for near-equidistance is impossible: the ratio of the maximum distance, D, to minimum distance, d, among n points in two dimensions grows as [19] (Supplementary Note 2). Moreover, the distortion of equidistant points can be particularly acute with PCA, often used to “pre-condition” data. PCA of equidistant points is tantamount to applying a random projection (Supplementary Note 3), and as a result projected points display numerous mirages of structure in two dimensions (Supplementary Fig. 1).
We then asked whether equidistant points are present in biological data, and if dimension reduction methods such as t-SNE and UMAP applied to PCA preconditioned data induce distortion. We scanned the Seurat-integrated [10] ex- and in-utero mouse embryo dataset (at the E10.5 stage) from [20] for cells with pairwise distances close to the same value (see Methods). This data was chosen because the structure of the two-dimensional embedding was used as part of the validation for ex-utero cultured mouse embryos precisely recapitulating in-utero development. To limit the search space, we examined cells only within the Chondrocytes and Osteoblasts (Fig. 1a) (see Methods). We found visible distortion of equidistant points in 1,511,502 distinct ‘near and equidistant’ groups of cells (Fig. 1b) and 1,020,120 distinct ‘far and equidistant’ groups (Fig. 1c), with both appearing similarly clustered or distributed in two-dimensions, despite their differing properties in ambient space. We measured changes in variance within these groups, ranging from 3 to 9 equidistant cells, as well as distortion of distances, i.e. the ratio of the maximum to minimum (max/min) pairwise distances (Supplementary Fig. 2a,e). For ‘near and equidistant’ cells, variance of pairwise distances in the UMAP space increased, on average, 135-to 1,040-fold in comparison to the high dimensional variance. Distortion of distances increased 25-to 95-fold (Supplementary Fig. 2b,c). Though large, distortion in the higher 15-dimension PCA space was less pronounced, with a 214-fold increase in variance and a 4.5-fold increase in distortion (Supplementary Fig. 2b,c). ‘Far and equidistant’ cells showed similar trends, with 321-to 1029-fold larger variance and 22-to 39-fold larger distortion ratios (Supplementary Fig. 2f,g). We then examined 3,763,130 groups of equidistant cells with distances centered around the mode of all pairwise distances. We found groups ranging from 3 to 10 cells per group (Supplementary Fig. 3a,b), and found sizeable changes for these large numbers of groups: 221-fold to 1,226-fold increases in variance and 25-fold to 68-fold distance distortion ratios (Supplementary Fig. 3a,b).
Importantly, this extent of distortion is not specific to the data in [20]. We separately analyzed a (10x sequenced) mouse ventromedial hypothalamus (VMH) neuron dataset [21], consisting of a large number of cell types and a rich experimental design. Within the Estrogen Receptor 1-expressing cells (Esr1_6) we uncovered 10,375,096 distinct groups of equidistant cells, with up to 14 cells per group (Supplementary Fig. 4a,b). We found variance increases ranging from 42-fold to 77-fold in UMAP space, and 443-fold to 625-fold in t-SNE space (Supplementary Fig. 4c). Distortion ratios ranged from 104-to 154-fold (Supplementary Fig. 4d). Zooming out to cell types, distortion was also present in distances between their equidistant centroids (Supplementary Fig. 4f), with 3.1-to 4.3-fold distortion in both UMAP and t-SNE space (Supplementary Fig. 4g). These 110 distinct groups of cell type centroids encompassed up to 7 types, highlighting that even large-scale structure of cell types is not accurately reflected with t-SNE or UMAP. For both datasets, we also identified large distortion ratios within groups of nearest neighbors. The distortion ratios of distances between each cell’s 10 nearest neighbors ranged from 6.5-fold to 7.2-fold that of the high-dimensional space for the integrated embryo data (Supplementary Fig. 2d,h, 3e), and 17-fold to 64-fold in the VMH data (Supplementary Fig. 4e). Thus, somewhat contrary to conventional wisdom on the preservation of high-level properties by these methods, inaccurate neighborhood representations occur locally and globally across these projections.
As implied by the distortion of nearest neighbor distances, we find a warping of cell relationships beyond the ‘extreme’ case of equidistant cells. In the integrated ex- and in-utero E10.5 dataset, we assessed the fraction of each cell’s nearest neighbors with the same label as the cell, measuring whether the embeddings accurately reflect the extent of mixing of ex- and in-utero cells in the integrated data (Fig. 1d,e). For the ‘Variance-stabilized and Scaled’ ambient data, the UMAP embedding to two-dimensions displayed a unimodal distribution with a mode of 0.53 (approximately half of a cell’s neighbors share the same condition) while the ambient space reflected a bimodal distribution with extreme modes near 0.1 and 1.0, denoting less ‘mixing’ of neighbor conditions (Fig. 1d). The ‘Log-Normalized’ ambient data (counts prior to stabilization and scaling) revealed the opposite trend, with the mode of the UMAP embedding distribution at 1.0, in contrast with the 0.5 mode for the ambient data itself, indicating less ‘mixing’ in the UMAP space (Fig. 1e). This suggests that methods like UMAP can create illusions of more or less ‘mixing’ compared to the ambient space. This observation is concordant with previous results highlighting largely nonoverlapping nearest neighbors between ambient and reduced spaces [3, 6].
These drawbacks of two-dimensional embeddings suggest their visual utility as ‘faithful’ data representations is limited. Nevertheless, one might imagine that (PCA pre-conditioned) t-SNE or UMAP embeddings may yield qualitatively meaningful representations, thanks to the optimizations they perform. To assess whether this is the case, we constructed metrics to quantify the ability of embeddings to reveal relevant biological characteristics i.e. local and global properties often inferred from the visuals (see Methods). We measured inter- and intra-distances with respect to biological labels in two-dimension and ambient spaces, examining the correlation of these distances to assess whether embeddings capture important relative relationships between cells (see Methods) (Supplementary Fig. 5). We examined these properties in three datasets, ex-utero cultured mouse embryos (E8.5 stage) [20] (Fig. 2a), (SMART-seq) mouse VMH neurons [21] (Fig. 2b), and mouse primary motor cortex (MOp) cells [22] (Fig. 2c) once embedded with t-SNE or UMAP, pre-conditioned with PCA. On average the t-SNE and UMAP embeddings respectively displayed low maximum correlations of 0.57 and 0.52 for inter-type (inter-’cell type’) distances, and 0.10 and 0.21 for intra-type distances, 2.7-fold lower than the inter-correlations (Fig. 2).
As a control experiment, we developed an autoencoder framework to fit cells from any dataset to an arbitrary shape (defined by user-specified (x, y) coordinates) while preserving cell-to-cell distances (see Methods). The latter was achieved by including in the neural network loss function, a term minimizing reconstruction error between the ambient space and the decoder output (see Methods). This creates two-dimensional representations for the cells that attempt to recapitulate the ambient data under the constraint of approximating a user-specified shape. We call this method Picasso in homage to the eponymous artist’s skill in imitating other artistic works. We compared correlations of inter- and intra-distances between Picasso embeddings with those of t-SNE, UMAP and PCA, for each of the three datasets. For the ex-utero dataset, data was fit to the shape of a world map and the ‘von Neumann elephant’ (‘Map’ and ‘Elephant’, Fig. 2a), and to the ‘von Neumann elephant’ for the VMH and MOp datasets (‘Elephant’, Fig. 2b,c). Each Picasso embedding demonstrated comparable metrics to the respective t-SNE and UMAP projections, even improving upon the t-SNE/UMAP intra-type correlations by 2-3% in both ex-utero shapes (Fig. 2a), and up to 4.6% in the MOp embedding (Fig. 2c).
Thus, Picasso can represent datasets in a way that is quantitatively similar to, or better, than the respective t-SNE/UMAP embeddings, while producing user-specified shapes. A single dataset, e.g. the ex-utero cultured mouse embryos, can be equally well represented as a world map or elephant (Fig. 3a,b), while a single shape, e.g. von Neumann’s elephant, can represent two different datasets (mouse VMH neurons vs. MOp cells) (Fig. 3 c,d). This shows that two-dimensional visualizations, obtained with Picasso, t-SNE or UMAP, are not in any sense canonical and often quantitatively poor in absolute terms, casting doubt on their value for biological inference, particularly for understanding patterns of variation within, rather than between cell types (intra-distances, Fig. 2; Supplementary Fig. 5,6).
The arbitrary nature of two-dimensional embedding was further highlighted by our ability to easily embed all the datasets as another shape: the ‘flower’ (Supplementary Fig. 6a,e). Many of the Picasso-generated latent spaces also displayed comparable correlations to the densVis algorithms [23] designed to improve spatial distribution of cells in t-SNE/UMAP embeddings (dens-Sne/densMAP) (Supplementary Fig. 6c,d,f,h). Interestingly, even untrained Picasso was able to produce a two-dimensional embedding with comparable performance to existing methods. That is, the He initialization [24] of the neural network is competitive with t-SNE, UMAP and densVis (Supplementary Note 4). This is not surprising as the He Gaussian initialization of weights [24] mimics the structure-preserving properties of Gaussian, random projections of data which provide the constructive proof of the Johnson-Lindenstrauss lemma [25–30] (Supplementary Note 4).
Semi-Supervision for Targeted Latent Space Construction
Given the limits on two-dimensional embedding of high-dimensional data and the resultant pitfalls of relying on such, largely arbitrary, visualizations, there is a need to rethink the role of such embeddings in single-cell expression analysis. Rather than relying on unsupervised two-dimensional embedding and subsequent visualization for validation, the goal of dimension reduction should be to improve results of analysis by virtue of filtering noise and extracting relevant features. In this regard, unsupervised dimensionality reduction, that does not account for the increasingly complex nature of multi-labeled genomics data including competing features in varying abundance, is likely to be suboptimal. Several publications have demonstrated advantages of supervised dimension reduction in constructing interpretable and separable latent space structures for marker gene extraction [31] and cell type label prediction [32], and we hypothesized that an extension of this work to multi-class, multi-label (MCML) data where each class (e.g. cell type, experimental condition, spatial location) contains a label for each cell, would improve upon unsupervised methods.
Building on the linear-decoded autoencoder framework [33] as adapted in the Picasso algorithm, we implemented a label-based cost defined by the Neighborhood Component Analysis algorithm (NCA) [34], which optimizes the likelihood that cells of the same label are near each other (‘pushing’ equivalently labeled cells together) in the latent space without overfitting (Fig. 4a; Supplementary Fig. 7) [34]. This algorithm, which we call ‘MCML’, combines reconstruction error with the label-aware cost to optimize nearest neighbor structure of a latent space while maintaining the higher dimensional input structure (Fig. 4a) (see Methods). Within this supervised, or semi-supervised, framework we incorporate multiple labels across the discrete and continuous spectrum, and predict labels for unlabeled cells (Fig. 4a) (see Methods).
With the embedding-based distortions of the ‘mixed’ nature of cells between the ex- and in-utero conditions (Fig. 1d,e), we sought to more quantitatively determine which cell types contained the greatest differences between the growth conditions in higher dimensions. We applied MCML to ‘push’ together cells within the same cell type and growth condition, reducing the stabilized and scaled count matrix to 15 dimensions (following the original study). From this we determined that the myocytes and hepatocytes contained the largest distances between their ex- and in-utero cells (see Methods), concordant with the findings of [20] that myocytes contained greater disagreement in marker genes between the conditions (Extended Data Fig. 8 in [20]), but also highlighting hepatocytes as a cell type of interest. We then extracted differentially expressed (DE) genes between the conditions, which contribute to these internal separations, finding 398 DE genes for the Myocytes including upregulation of the GAPDH housekeeping gene in ex-utero cells, and 59 genes for the Hepatocytes including downregulation of the hepatokine LECT2 in ex-utero cells (Fig. 4b) (see Methods).
Given increasing numbers of available experimental labels and gold-standard cell type annotations across single-cell datasets, we also applied MCML to the prediction of unlabeled cells. We found comparable cell type prediction to SCANVI, a single-cell annotation package in the widely used scVI framework [35], and netAE which meshes modularity and separation of a priori defined clusters into a label-aware cost for latent space construction and label prediction [32] (Fig. 4c; Supplementary Fig. 8a). We were unable to run netAE for the two larger datasets as it attempted to allocate over 2TB of RAM for each (see Methods). We also measured classification accuracy for LDVAE [33], PCA, and Recon MCML (MCML with reconstruction error only) as non-label based baselines (Fig. 4c), all in 50 dimensions. Cell type labels were predicted on the 30% of unlabeled cells, for SMART-seq and 10x sequenced mouse VMH neurons [21], and 10x sequenced mouse developing brain [36], using a KNN classifier where the prediction was made according to the majority label from a cell’s 50 nearest neighbors (Fig. 4c) (see Methods). Across these datasets ‘Cell Type MCML’, MCML with cell type labels, demonstrated 12.9% improvement in accuracy over LDVAE, a 2.0% improvement over SCANVI, and a 14.6% improvement on netAE. However, the maximum accuracy was only 0.56 for the developing brain data (Fig. 4c; Supplementary Fig. 8b). Accuracy of 50-dimensional PCA remained within 3% of the cell type MCML score, possibly reflecting how cell types/states can represent the greatest directions of variance in the data (Fig. 4c).
We then extended the cost function to integrate cells’ continuous positions, from zero to one in pseudotime for developing C. elegans neurons [13], or spatial locations (physical two-dimensional coordinates) from the MERFISH MOp dataset [22], using pseudotime or spatial pairwise cell distances to weight cell-to-cell relationships in the latent space, in addition to reconstruction error (see Methods). This resulted in smaller Jaccard distances (dissimilarity) [6] of the MCML nearest neighbors to the ambient (continuous) neighbors, with average dissimilarities of 0.67 and 0.81 in MCML representations compared to averages of 0.88 and 0.96 in the baselines for the C. elegans and MOp embeddings (Fig. 5a) (see Methods). 1.0 denotes completely non-overlapping sets of neighbors. From here we combined discrete (cell type) and continuous (spatial location) class labels into the latent space construction, retaining the predictive properties of each of the classes individually when tested with 20% of cells unlabeled. Both the ‘Type-Spatial’ and ‘Spatial-Only’ MCML representations reduced the distance of predicted locations from the actual coordinates by 10.9% and 14.6% respectively (by 77.3 μm and 103.8 μm) compared to the baselines (Fig. 5b). The ‘Type-Spatial’ and ‘Type-Only’ representations also demonstrated comparable cell type prediction, with overall accuracy of 0.91 and 0.92 (Fig. 5c). These results show that our framework enables prediction of type, space, and time for single cells.
Additionally, we defined a label-aware loss to bias decoder reconstruction towards improving metrics of interest, such as the recapitulation of intra-label variances. We denote this method as ‘bMCML’ (‘biased MCML’) (see Methods). As a proof-of-concept we optimized latent space construction for high correlation of intra-sex (Supplementary Fig. 9a,b) or intra-type (Supplementary Fig. 9c,d) distances to the ambient space, as intra-correlations were lower than inter-correlations for most latent representations (Supplementary Fig. 9). For the SMART-seq and 10x VMH neuron datasets, intra-sex correlations were increased by at least 126% and 26%, while intra-type correlations were increased by at least 91% and 12% as compared to the PCA and Recon MCML baselines. Although bMCML may reduce the accuracy of other metrics not in the loss, e.g. inter-distances (Supplementary Fig. 9), this demonstrates targeted preservation of desired patterns and an alternative to unsupervised reconstruction which may not capture these specific properties.
Other benefits of the MCML framework include the ability of reconstruction loss to better preserve metric correlations in rare and orthogonal cells (i.e. gene expression not shared by other cell types) as opposed to PCA (Supplementary Fig. 10) (see Methods), as PCA is designed to find directions of maximal variance which may suppress low, orthogonal expression when only top PCs are selected. The linear decoder layer also provides interpretability, analogous to the linear transformation of data with PCA, for easy extraction of genes which contribute to each of the latent dimensions (Supplementary Fig. 11) [33]. With the use of a nonlinear encoder, despite the linear decoder, and mini-batch training, MCML provides a faster and more accurate implementation of NCA itself as compared to the sklearn implementation (Supplementary Fig. 12).
Discussion
Despite claims that common dimensionality reduction techniques for single-cell genomics data preserve local [4] and/or global [5] cell relationships, our work demonstrates that blind applications of such heuristic transformations can result in significant distortions at multiple scales. Although popular two-dimensional embeddings can reflect the broader strokes of the data such as cell type inter-distances, or highlight correlations between features [37], we find that quantitative relationships between cells, nearest neighbors, and cell types are highly distorted. Researchers are therefore tasked with navigating multiple, possibly contradictory interpretations of the same data. Additionally though current methods preserve some qualitative properties of datasets, these properties can be recapitulated in an arbitrary manner, bringing into question the biological meaning of widely utilized two-dimensional representations.
We therefore believe in lessened reliance on two-dimensional artwork for the purposes of identifying biological patterns. At least if visualizations are used, they should be presented alongside the kinds of metrics we and others have proposed for quantitative assessment of ‘global’ and ‘local’ scale [6]. In particular, we urge researchers to exercise caution in assigning biological interpretations to images with no theoretical guarantees, or “canonical” properties. There is an opportunity to develop two-dimension embedding methods that, with theoretical guarantees, could provide meaningful visualizations of high-dimensional data. The lower bounds on distortion that we have derived leave room for “reasonable” embeddings, and it is an interesting open problem to achieve optimal low-distortion [38, 39]. Some promising directions include work to define more robust distortion metrics and unified embedding frameworks [40], and to preserve equidistance as possible [41] (Supplementary Note 5). Generally, for tasks such as cluster validation and trajectory inference where t-SNE and UMAP are commonly employed, quantitative/statistical metrics on marker gene specificity and strength of expression (usually employed regardless of the visual embedding) [42, 43] provide more reliable bases for analyses. Furthermore, higher dimensional inference of differentiation trajectories [44, 45], and incorporation of probabilistic inference methods [46, 47], offer meaningful analysis approaches sans visualization.
Beyond the goals of two-dimensional visualization, higher dimensional representations offer spaces for multiple tasks to be performed, partial to researchers’ interests. To better adapt these spaces for biological investigation, we have presented a semi-supervised framework for direct incorporation of biological features into latent space structure, as opposed to unsupervised approaches unaware of the task goals. The semi-supervised MCML methodology expands the domain of latent space structure and prediction to discrete and continuous properties of cells, offers a targeted alternative to unsupervised reconstruction with bMCML, and maintains linear interpretability between the latent space and the input features [33]. Our results are based on limited parameter optimization of our multi-objective optimization schemes, however methods such as grid-search could be implemented to determine parameters e.g. the ‘best’ fractional weighting between label-aware and reconstruction costs (see Methods). MCML could also be extended to filter for labels which contribute to explaining variance in the data or to the accuracy of a specific task (e.g. spatial location prediction, likely dependent on multiple covariates [48]), or to parametric models of single-cell count data utilizing existing variational autoencoder models [33, 49].
Finally, we note that our work on distortion in low dimension embeddings and our framework for semi-supervised multi-class multi-label dimensionality reduction demonstrates a step towards developing more precise answers to questions about the dimensionality of transcriptomes. Identification of groups of equidistant cells provides weak lower bounds on the dimension, which appears to be much higher than two. Our results demonstrating the advantages of semi-supervised reduction suggest such methods could be utilized to refine upper bounds on the dimension of transcriptomes. Moreover, these questions and results are relevant to multi-faceted datasets outside of single-cell genomics, such as in phylogenetics or population genetics, where UMAP/t-SNE are used to explore structure of genetic interactions and evolutionary relationships [50]. Our findings should also be of interest beyond the biological sciences, e.g. in chemistry [51], geology [52], astronomy [53], and the social sciences [54], where dimensionality reduction is used to find latent representations capturing key structural features of data.
Methods
Datasets and Pre-processing
All datasets used in this study are listed in Table 1, and were chosen to cover a range of sequencing platforms, experiment sizes, and experimental designs.
For the SMART-seq and 10x mouse VMH datasets, cells were filtered according to the steps outlined in [21]. Unless already provided, the top 2000 highly-variable genes (HVGs) were found for all datasets using Scanpy’s highly_variable_genes [43]. The top 300 genes were used for the C. elegans neural lineage cells as there were only 1000 cells after selecting for the ASE, ASJ, and AUA neurons [13]. Counts were log-normalized, if not already provided, with the log-count matrices representing the ‘ambient’ data for metric comparisons (see below). Unless otherwise indicated, ‘ambient’ space refers to the log-normalized count matrices filtered for HVGs. All count matrices were zero-centered and scaled before application of the Picasso, MCML methods, or PCA. All PCA analysis was performed using sklearn TruncatedSVD to 50 dimensions by default. 15 dimensions was used for the PCA of the integrated mouse embryo E10.5 dataset to facilitate direct comparison to the original study [20].
The t-SNE and UMAP algorithms were applied to the 50 (or 15 in the case of the integrated mouse embryo E10.5 dataset) dimensional PCA embeddings with default settings. As per the discussion in [6], though slight changes in parameters can drastically impact low-dimensional embeddings, the choice of parameters for tuning is often informed by empirical observations/prior knowledge leaving open the question of which metric(s) to use for determining ‘optimal’ parameters. This tuning is additionally contradictory to the common use or desire of such techniques to produce ‘unsupervised’ representations of the data [6]. [Code]
Determining Groups of Equidistant Cells
To find equidistant cells within cell types, we selected cells from within sizeable cell types to narrow the search space, as the algorithm we used, namely clique detection in undirected graphs, is NP-complete. The cell types we investigated were ‘Esr1_6’ in the 10x VMH dataset and ‘Chondrocytes and Osetoblasts’ in the integrated embryo E10.5 dataset. We calculated all pairwise distances between the cells in the ambient space, and using those defined a graph where two cells were adjacent if the cell-cell distance was within a small fraction of the standard deviation around the mean, and the 0.1 and 0.9 quantile marks. The filtering for distances within a particular range helps to limit the size of the search space as well as produce a range of mutually equidistant cells. We used the sklearn pairwise_distances for the pairwise calculations. From the graph of ‘connected’ cells we looked for cliques, namely subsets of cells in which all cells are connected (adjacent) to each other. This same strategy was employed to determine equidistant centroids of cell types for the 10x VMH data. Equidistant cell type centroids were identified by constructing a graph where two nodes, associated with centroids, were adjacent if their distance was close to the average pairwise distance. To find cliques we used the find_cliques function from the networkx package, which employs a variant [55] of the Bron & Kerbosch algorithm [56], to detect cliques in undirected graphs. [Code]
Metrics for Correlation and Distortion of Ambient Space Properties
Distortion Metrics for Equidistant Cells
We used two metrics to assess distortion of equidistant cells in two dimensions. The first is the variance of the pairwise distances between cells (or centroids) in each group, as compared to the low variance in the distances in the ambient space. We also calculated the ratio of the maximum to minimum distance between cells in each group (the ‘max/min ratio’), a quantity for which we derived a lower bound (see Theorem 1 in Supplementary Note 2):
All variance and min/max comparisons were done in the ambient space, the PCA-reduced spaced, and the UMAP/t-SNE spaces, which were generated from the PCA-space. The ambient space for the integrated embryo E10.5 data is the ‘Variance-Stabilized and Scaled’ (Fig. 1a-d) data (as opposed to solely log-normalized counts in Fig. 1e), as this was used as input for the original UMAP embedding in [20]. [Code]
These distortion metrics were also measured between every cell and its 10 nearest neighbors to demonstrate distortion outside of groups of necessarily equidistant cells. The sklearn Nearest-Neighbors function was used to find these 10 neighboring cells as well as for Fig. 1, to extract each cells’ 30 nearest neighbors, and compare neighbor labels, in the UMAP versus ambient space (using L1 distance). [Code]
Inter- and Intra-Label Distances
To assess the relative differences and similarities within and across biological properties of interest we defined inter- and intra-label distance metrics. Inter-label distances (Supplementary Fig. 5) are calculated as pairwise L1 (defined below) distances between the centroids of each label within a class (e.g. between centroids of each ‘cell type’ label). These represent the relative distances, or closeness, between labels. For two vectors x, y the L1 distance is defined as the absolute value of their differences:
For this we used sklearn pairwise_distances. For more ‘internal’ labels such as sex, inter-label distances were calculated as the mean pairwise distance between the cells of each label (i.e. ‘male’ and ‘female’) within each cell type. Intra-label distances are the means of all pairwise distances within each label (Supplementary Fig. 5). These represent the relative, internal variances among the cells within each label. For sex labels i.e. ‘male’ and ‘female’, these means were calculated within each cell type.
We used L1 distance as it is suitable for measuring distance between points in high dimensions, particularly in comparison to other L-norms [57, 58], and is comparable to the probabilistic Jensen-Shannon divergence in single-cell (transcriptomic) distance calculations [57]. The correlation of these metrics, in the latent space, to their values in the ambient space were then calculated by Pearson correlation. The latent spaces represent PCA, t-SNE, UMAP, Picasso, or MCML embeddings.
To test recapitulation of the relative distances of ‘rare’ and orthogonal cells to other cell types, using PCA and the MCML framework below, we added the expression of five new ‘cells’ in the MERFISH MOp data with expression (single gene counts) in three gene dimensions not expressed in other cells (Supplementary Fig. 10). We then ran Recon MCML (MCML with only reconstruction error) and PCA on this new matrix, and calculated the inter-distances for this group of cells to all other cell types (Supplementary Fig. 10).
General Autoencoder Architecture
The autoencoder network used in the Picasso and MCML algorithms is outlined below. The structure of the neural network remains the same between algorithms though each has a unique set of cost functions for network optimization.
The input is a centered/scaled count matrix , n cells by g genes. For MCML embeddings C is the set containing label vectors for each class k, C: {c1,…, ck}. Classes can be discrete or continuous, and multi-dimensional in the case of continuous classes (e.g. cell type, sex, location).
The input is passed through two fully-connected layers of 128 nodes and d nodes respectively with d = 50 by default. Batch normalization, the ReLU activation function, and dropout regularization are applied between the layers. The second layer represents the latent representation in denoted as Z. The final linear, decoder layer produces . No activation function or bias terms are used between the latent and decoder layer as the decoder output solely represents a linear transform of the latent space.
Mini-batch training was employed for all algorithms, with a default batch size of 128, though larger batch sizes were used for Picasso embeddings. Adam optimization [59] was used for network training with a default learning rate of 10−3 and weight-decay term of 10−5.
Picasso Cost Function for Shape Imitation
We defined two loss functions: LShapeAware and LReconstruction, which balance the fit of the input points to the desired shape and reconstruction error in the decoder output as compared to the input. represents the coordinates comprising the desired shape, where d = 2 and p ≥ n. The latent space Z is also limited to d = 2 dimensions. The pairwise distance matrix represents Euclidean distances between the cell coordinates in Z and shape coordinates S such that
Using D, we define a Boolean, n × p adjacency matrix A, where ∑ Ai = 1. This matrix uniquely specifies an adjacent coordinate point for every cell, in a bipartite graph mapping the n cells to the p coordinates. A is determined by the linear_sum_assignment scipy package, which assigns a shape coordinate to each cell by solving the minimization where aij = 1 iff row i is assigned to column j. Thus, which we attempt to minimize i.e. map cells to their closest, unique shape coordinates. The reconstruction loss is the L2 norm of the difference between the reconstructed and input data:
The total loss then incorporates both loss functions, balancing their contributions with f, a user-defined fraction weighting the effect of each term on the resulting embedding:
Correlation metrics, as defined above, are measured for the output Z, PCA to two dimensions (PCA 2D), and 2D t-SNE/UMAP (PCA t-SNE and PCA UMAP) which are run on the output of PCA to 50D by default. Picasso was tested on the SMART-seq VMH neurons [Code], the ex-utero mouse embryo E8.5 data [Code], and the MERFISH MOp dataset [Code].
MCML Framework with Cost Function for Discrete and Continuous Properties
We use the acronym ‘MCML’ (multi-class multi-label) to denote the semi-supervised, label-aware methodology which directly incorporates the label-aware cost into the latent space structure (Fig. 4a). For MCML we use two loss functions: LLabelAware and LReconstruction, where LReconstruction is as defined in (1). For LLabelAware, we utilize the Neighborhood Component Analysis (NCA) algorithm from [34]. For all cells a probability matrix is created where
For discrete labeled data (e.g. cell type names) we define LDiscrete for all pairs of cells i, j where
Only the probabilities of cell pairs which are of the same label, for each class k, are summed and normalized to the total number of these cell pairs (which represents the maximum value of the numerator). For continuous classes of labels, such as spatial coordinates or pseudotime values, we use a separate loss function, LContinuous. A probability weight matrix is generated for every pair of cells such that
In place of the indicator function, the weights bias the masking of the original probability matrix P towards higher-weighted (‘closer’) pairs of cells. Probabilities are also normalized to the maximum of the numerator (treating the weights W as constants):
The final loss function is
LLabelAware is negated for minimization, as opposed to maximization of the positive probabilities, and is additionally weighted by a constant factor of 10 in comparison to LReconstruction. For all datasets, excluding the integrated mouse embryo E10.5 dataset, the latent space Z is in d = 50 dimensions, and d = 15 for the Integrated data.
Measuring Distance between Cells of Differing Conditions
We applied MCML to the integrated mouse embryo E10.5 dataset, including both cell type and condition (ex- or in-utero) labels, for dimensionality reduction. We then measured the pairwise L1 distances between the centroids of the ex- and in-utero cells within each cell type, as a measure of ‘internal’ distance. Within the cell types with the largest distances, we extracted differentially expressed (DE) genes between the conditions, following the metrics in the original study (genes with log2 fold-change greater than 1 and adjusted p-values greater than 0.05) [20]. Here we utilized the log-normalized data only, as it represents the counts prior to scaling and filtering for highly variable genes. We used the non-parametric Wilcoxon test to identify DE genes, with p-values adjusted with Benjamini-Hochberg correction. [Code]
Prediction Accuracy for Unlabeled Cells
To assess the ability of semi-labeled, MCML-reduced latent spaces for class-specific label prediction, we measured their accuracy in continuous and discrete label prediction for unlabeled cells.
For the comparative cell type label analysis in Fig. 4c, 70% of each dataset was used as training and the remaining for testing/prediction for SCANVI, netAE, and MCML which are label-aware methods. netAE was unable to run on the 10x VMH and 10x developing brain datasets as it attempted to allocate over 2TB of RAM. [Code] All of the data was used to train LDVAE and Recon MCML (reconstruction error only, f = 0) as baseline prediction comparisons. For the continuous and mixed label annotation in Fig. 5, 80% of each dataset was used as training and the remaining for testing/prediction. The full count matrices were input into the MCML framework, but only the denoted percentages of cells were labeled.
We applied sklearn’s KNNClassifier with 50 nearest neighbors, weighted by their distance, for discrete label prediction in each latent space, and used the accuracy_score function from sklearn to determine the fraction of correct labels predicted (Fig. 4c). [Code 10x VMH] [Code SMART-seq VMH] [Code Developing Brain]
The KNNRegressor from sklearn was used to predict continuous values in the same manner. We used Jaccard distance/dissimilarity [6], defined as where A, B represent the sets of 50 nearest neighbors in the ambient and latent spaces respectively, and Euclidean distance between each cell’s predicted and true value, to assess the efficacy of the continuous predictions (Fig. 5a,b). Continuous labels were either one-dimensional pseudotime values, or two dimensional coordinates for each cell’s spatial location. Pseudotime values were generated using the diffusion map-based ‘dpt’ methods from Scanpy [43], on the PCA-reduced C. elegans dataset. The ‘ambient’ space for determining continuous-label nearest neighbors was the n × 1 or n × 2 matrix containing the original values for all n cells. Confusion matrices, produced by sklearn plot_confusion_matrix, were also generated to compare true and false positive cell type label predictions for MCML-generated embeddings with and without dual incorporation of discrete (cell type) and continuous (spatial coordinate) label classes. [Code]
Runtime and sklearn Comparisons
For runtime comparisons between the various cell type prediction/annotation methods in Fig. 4c (see Supplementary Fig. 8), we timed all methods on a range of datasets, processed with 1 GPU over 5 cores each with 40G of memory. [Code]
To compare the capabilities of the NCA algorithm by MCML to the standard sklearn NCA implementation, MCML was run with f = 1 (no reconstruction error) and sklearn’s NCA with default settings, to produce 50 dimensional latent space representations incorporating cell type labels only (see Supplementary Fig. 11). The NCA loss, represented by LDiscrete, was measured for the generated latent spaces. The GPU was not utilized for these comparisons to accommodate the sklearn implementation. [Code]
Biased MCML (bMCML) with Targeted Reconstruction Cost Function
Here we denote ‘bMCML’ as the label-aware, biased reconstruction methodology which adapts the original MCML cost functions in (2). This targeted reconstruction loss utilizes only one term in its loss. Here L is defined by the correlation of the inter- or intra-distances (as described above) of a particular class to the ambient data X, b represents the vector of the specified inter-/intra-distances in the ambient space and represents those same distances calculated for the reconstruction . L is then defined in (3) as the negation of the Pearson correlation of these two vectors. Negation, again, facilitates minimization.
In Supplementary Fig. 9, we demonstrate the implementation of either intra-sex (Supplementary Fig. 9a,b) or intra-type (cell type) (Supplementary Fig. 9c,d) distance correlation in the loss and the effect of these targeted losses on the resulting correlation metrics. This was tested on the SMART-seq [Code] and 10x mouse VMH neurons [Code].
Data Availability
Accession links for the original data used to generate the figures and results in the paper are listed in Table 1. Processed and normalized versions of the count matrices are available on CaltechData, with links provided in Supplementary Table 1.
Code Availability
All analysis code used to generate the figures and results in the paper is available at https://github.com/pachterlab/CBP_2021 with Picasso and MCML analyses provided in notebooks which can be run on Google Colab. Picasso is also available at https://github.com/pachterlab/picasso. The MCML method as well as tools for quantitative analysis are available via a Python pip installable package from https://github.com/pachterlab/MCML.
Acknowledgements
Some of the computations presented here were conducted using machines in the Resnick High Performance Center, a facility supported by the Resnick Sustainability Institute at the California Institute of Technology. We thank Gennady Gorin and Benjamin Rivière for helpful discussions regarding the MCML and Picasso analyses, Sina Booeshaghi for helpful discussions regarding NCA and dimensionality reduction, Ingileif Hallgrímsdottir for valuable feedback on the manuscript, and Pall Melsted for useful insights regarding Theorem 1. The work was supported in part by NIH grant U19MH114830 and Joeyta Banerjee was supported in part by the Caltech Summer Undergraduate Research Fellowship (SURF).