TopOMetry systematically learns and evaluates the latent dimensions of single-cell atlases

A core task in single-cell data analysis is recovering the latent dimensions encoding the genetic and epigenetic landscapes inhabited by cell types and lineages. However, consensus is lacking for optimal modeling and visualization approaches. Here, we propose these landscapes are ideally modeled as Riemannian manifolds and present TopOMetry, a computational toolkit based on Laplacian-type operators to learn these manifolds. TopOMetry systematically learns and evaluates dozens of possible representations, eliminating the need to choose a single dimensional reduction method a priori. The learned visualizations preserve more original information than current PCA-based standards across single-cell and non-biological datasets. TopOMetry allows users to estimate intrinsic dimensionalities and visualize distortions with the Riemannian metric, among other challenging tasks. Illustrating its hypothesis generation power, TopOMetry suggests the existence of dozens of novel T cell subpopulations consistently found across public datasets that correspond to specific clonotypes. TopOMetry is available at https://github.com/davisidarta/topometry.


Introduction
Analyzing high-dimensional data (i.e., data in which the number of variables per observation reaches hundreds or thousands) is a challenge shared among fields as diverse as signal processing, economics, and structural chemistry, and, more recently, the life sciences, particularly with the development of single-cell experiments. Single-cell assays have inaugurated a new era in the unbiased molecular sampling of the cellular diversity of biological systems, from cell lines to whole organisms, leading to an ongoing revolution across the life sciences 1,2. In such high-dimensional single-cell data (sc-data), information regarding cellular phenotype (e.g., gene expression, chromatin accessibility) is gathered individually for every cell, resulting in matrices containing up to millions of cells and tens of thousands of variables each. This property creates challenges in data analysis, collectively known as the curse of dimensionality, and requires sophisticated statistical and geometric methods to allow meaningful exploration and understanding of the latent features underlying the data.
These are broadly referred to as dimensionality reduction techniques and include matrix decomposition methods, autoencoders, and graph-layout optimization algorithms.
Many dimensionality reduction approaches have been developed and were recently reviewed elsewhere 3-5; here, we briefly describe the strategies behind these algorithms and outline their current use standards in sc-data. Briefly, there are two main ways to perform dimensionality reduction (DR): i) decomposing matrices; and ii) optimizing graph layouts. Matrix decomposition DR methods embed data into an arbitrary number of latent dimensions and include the classic Principal Component Analysis (PCA) 6,7, a linear method that involves the eigendecomposition of the covariance matrix. More recent non-linear dimensionality reduction (NLDR) methods are based on manifold-learning techniques and assume that the input data lie on or near a low-dimensional manifold, which is almost always the case in real-world datasets.
In particular, spectral methods (i.e., those based on the decomposition of Laplacian-type operators), such as Laplacian Eigenmaps (LE) 8 and Diffusion Maps (DM) 9,10, have arisen as robust non-linear alternatives, in which the eigendecomposition of Laplacian-type operators 8,9,11-13 yields orthonormal components that encode the latent graph structure of the manifold. The other broad class of DR techniques consists of graph-layout optimization (GLO) methods. Their approach involves learning a similarity graph from samples and then minimizing a loss function to optimize the layout of this graph in 2-D or 3-D for visualization. GLO methods include the popular t-distributed Stochastic Neighbor Embedding (t-SNE) 14 and Uniform Manifold Approximation and Projection (UMAP) 15. Notably, the current gold standard of experimentally validating selected hypotheses drawn from the learned representations cannot be considered a validation of the whole representation per se. Moreover, each algorithm's assumptions (often implicit and by construction) regarding the data's statistical and geometrical properties are disturbingly challenging to assess in practice. For example, the universally used PCA holds, by definition, the strong assumption that the data are uniformly distributed and lie approximately on a series of hyperplanes (the principal components), which generates misleading results in the presence of non-uniform or non-linear structure by missing local information and introducing linear distortions 6,14,15,34-37. The popular UMAP 15, while less restrictive in assuming that the data are dispersed over or close to a locally connected Riemannian manifold, further assumes that this distribution is uniform across the manifold, with a locally constant metric. As such, representations of non-uniformly sampled data can be filled with artifacts without any warning to users.
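To make the spectral strategy concrete, the following minimal sketch (plain NumPy/SciPy, not TopOMetry's implementation; the toy matrix and kernel bandwidth are arbitrary choices) builds a Gaussian affinity graph, forms its normalized Laplacian, and keeps the smallest non-trivial eigenvectors as Laplacian Eigenmaps coordinates:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))  # toy "cells x genes" matrix

# Gaussian affinity kernel on squared pairwise distances
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / np.median(d2))
np.fill_diagonal(W, 0.0)

# Symmetric normalized graph Laplacian; its eigendecomposition
# approximates the Laplace-Beltrami Operator at large sample sizes
L = laplacian(W, normed=True)
evals, evecs = eigh(L)  # eigenvalues in ascending order

# Laplacian Eigenmaps: the smallest non-trivial eigenvectors
# serve as low-dimensional coordinates (the first is constant-like)
embedding = evecs[:, 1:4]
```

The same eigenvectors reweighted by powers of their eigenvalues give Diffusion Maps, which is why both methods are described as decompositions of Laplacian-type operators.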
As most downstream analyses and biological conclusions drawn from sc-data rely on the lower-dimensional mappings learned by DR, it is crucial for these mappings to be formally guaranteed to represent the latent biology. Such guarantees should be independent of the data distribution or the underlying geometry, which should be considered unknown even for extensively investigated systems.
This is not the case, however, with the current standard approach (i.e., applying UMAP or t-SNE to the top principal components); in fact, its assumptions likely do not hold for all datasets, in which case artifactual distortions can be introduced into the learned representations.
Inferences made from these representations propagate into downstream analyses, raising concerns about the credibility of the vast literature in which this practice was adopted.
Despite the importance of minimizing assumptions regarding the latent geometry underlying high-dimensional data, an approach that fully harnesses the power of topological Laplacian-type operators is yet to be proposed. These operators have been widely and successfully used in many applications, from computer vision to natural language processing, but not in single-cell biology. They are particularly advantageous because, under the extremely loose assumption that the data lie on or near a Riemannian manifold, they approximate the Laplace-Beltrami Operator (LBO) in the limit of large data (i.e., when the number of samples is very large, as in sc-data).
It is a fundamental result in mathematics that the LBO encodes all the local geometric information of a Riemannian manifold of unknown shape 8,9,11,36,38. The top eigenfunctions of this operator can then be used to approximate the original data 9, so that each eigenfunction carries a fundamental harmonic of the data 8,9,13. In addition, because current theoretical biology points to an underlying Waddington landscape within the sampled biological system 20, such landscapes can be naturally modeled as Riemannian manifolds, which we refer to as phenotypic manifolds. We reasoned that recovering these manifolds by approximating the LBO would render optimal sc-data representations while holding no prior assumptions about the underlying data geometry. Here, we show how combining LBO eigenfunctions with existing graph-layout optimization (GLO) techniques yields representations of high-dimensional data that preserve more original information than those learned with the current PCA-based standards or with GLO alone.
This strategy makes the Riemannian metric 36 a natural metric for evaluating the distortion of these visualizations, a critical feature that the field currently lacks. We present TopOMetry, a Python toolkit to combinatorially obtain numerous LBO approximations using different computational approaches and to systematically evaluate them from quantitative and qualitative perspectives. We show this strategy is remarkably superior to the current PCA-based approach and to stand-alone GLO implementations when analyzing single-cell data.

A framework to systematically learn and evaluate manifolds in single-cell data
To approximate the LBO and thereby recover the phenotypic manifolds underlying sc-data, we developed TopOMetry (Topologically-Optimized geoMetry), a toolkit that combines and enhances state-of-the-art techniques to approximate the LBO through weighted graph Laplacian-type operators. TopOMetry efficiently obtains the eigenspectra of these operators and uses them to approximate the LBO once again. Users can then visualize its graph representation with any existing graph-layout optimization (GLO) algorithm. This modular approach allows combinatorial evaluation of different kernels, decomposition, post-processing, and GLO algorithms. It eliminates the need to choose a single approach a priori and allows users to objectively decide which of the possible representations best describes each dataset.
TopOMetry takes as input a data matrix of variables per observation (e.g., genes per cell; Figure 1A). It then computes a k-nearest-neighbors graph to build a kernel matrix with state-of-the-art similarity functions (the default adaptive bandwidth estimation is related to the local intrinsic dimensionality). TopOMetry includes a new family of manifold-adaptive kernels closely related to the intrinsic dimensionalities of the underlying manifold 39. These kernels are then used to learn Laplacian-type operators, which asymptotically approximate the LBO (Figure 1B). Next, the eigenspectra of the learned Laplacian-type operators are obtained, allowing the estimation of the data's global intrinsic dimensionality (Figure 1C). The resulting orthonormal eigenbasis is related to a Fourier basis and encodes all latent signals present in the original data 8,9,13. This eigenbasis can then be used to compute a new weighted kernel (referred to as the graph kernel) to obtain a second LBO approximation encoded as a similarity graph (the associated topological graph; Figure 1D), which can then be visualized with existing GLO algorithms (Figure 1E). These graphs can also be used for downstream tasks such as clustering, signal filtering, and pseudotime estimation (Figure 1F). In TopOMetry, dozens of visualizations obtained from the graphs or directly from the eigenbasis can be computed with a single line of code. The preservation of local 40 and global structure by these representations is then systematically evaluated (Figure 1G), alleviating the practitioner's burden of choosing a single similarity-learning or GLO method. The top-performing projections can then be qualitatively evaluated using the Riemannian metric to assess the distortion induced by the mapping (Figure 1G).
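The first stage of this pipeline, an adaptive-bandwidth kNN kernel, can be sketched as follows. This is an illustrative implementation in the spirit of the adaptive kernels described above, not TopOMetry's actual code; the function name and parameter choices are our own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_gaussian_kernel(X, k=15):
    """kNN affinity matrix with a per-point bandwidth set to the distance
    to the k-th neighbor. Illustrative only: in the spirit of the adaptive
    kernels described above, not TopOMetry's actual implementation."""
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    sigma = dist[:, -1]  # local bandwidth adapts to neighborhood density
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        # affinity scaled by the bandwidths of both endpoints
        W[i, idx[i]] = np.exp(-(dist[i] ** 2) / (sigma[i] * sigma[idx[i]]))
    np.fill_diagonal(W, 0.0)
    return np.maximum(W, W.T)  # symmetrize the directed kNN graph

X = np.random.default_rng(1).normal(size=(200, 30))
W = adaptive_gaussian_kernel(X)
```

Scaling each affinity by the bandwidths of both endpoints is what makes the resulting graph comparatively insensitive to sampling density, which matters for the subsequent Laplacian-type normalization.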
The outlined approach has numerous advantages over the current standards. First, it holds an extremely loose assumption regarding the underlying data geometry compared to PCA or UMAP (that the data lie on or close to a Riemannian manifold). Second, this assumption is mathematically solid and intrinsically related to the core biological goal of sc-data analysis (i.e., recovering the Waddington landscape or an extended version thereof). Third, it involves estimating the manifold's local and global dimensionality, which informs the similarity-learning process and eliminates the need to guess a number of latent dimensions to keep for downstream analysis (as needed when using PCA or autoencoders). Another major advantage is the use of the Riemannian metric 36 as a "distortion ruler," allowing qualitative assessment of the distortion present in two-dimensional mappings and robust estimation of the global and local dimensionality of the underlying manifold. Finally, TopOMetry allows users to obtain and evaluate dozens of possible representations with a single command, eliminating the need to choose a pre-defined method (e.g., UMAP on PCA) and allowing users to pick which performs best in each case. This is the toolkit's main advantage and represents a fundamental change of paradigm in sc-data analysis: instead of a priori electing a single method to be used at all times across all datasets, it allows users to discover which particular model works best for representing each individual dataset before other downstream tasks.

Figure 1. (A) Single-cell experiments are preprocessed into high-dimensional matrices of cells and measured variables (e.g., gene expression, protein content, chromatin accessibility) and require dimensional reduction tools for their analysis. Most tools assume unknown aspects of the underlying geometry (e.g., linearity) or distribution (e.g., uniformity). Instead, relying on the Laplace-Beltrami Operator (LBO) and its eigenfunctions allows learning such geometry while assuming only the manifold hypothesis. (B) Approximating the LBO involves constructing a k-nearest-neighbors (kNN) graph using adaptive affinity estimation, which can be done through several kernels so as to make the affinity graph insensitive to neighborhood densities. The Laplacian-type operators (particularly the anisotropic diffusion operator) are approximations of the LBO. The resulting matrices can be used for several tasks, such as imputation of missing data, signal filtering and interpolation, and graph sparsification and coarsening. (C) The eigendecompositions of the LBO approximations yield eigenvectors and eigenvalues that are weighted or multiscaled to form a new orthogonal eigenbasis. The eigenvalues can be used to find an eigengap or spectral gap that estimates the intrinsic dimensionality of the data. The intrinsic dimensionality can also be estimated using neighborhood-based methods (e.g., FSA and MLE) or ad hoc inspection of eigenvectors for discriminative potential. (D) A second neighborhood graph is learned from the eigenbasis, and its Laplacian-type operators are used to obtain a new LBO approximation, rendering 'topological graphs'. (E) From a spectral initialization obtained from the topological graph, any graph layout optimization (GLO) algorithm can be used on the topological graph or the eigenbasis for visualization. (F) Downstream tasks such as clustering, RNA velocity estimation, and imputation can be performed with the learned topological graphs and layouts. (G) The learned eigenbases and visualizations can be evaluated regarding the preservation of global and local structure, and the Riemannian metric can be used to visualize distortions in layouts to aid biological interpretation.
Computationally, TopOMetry is easy to use and centered around a single Python class, the TopOGraph object, which orchestrates the analyses (Supplementary Figure 1).

Representations learned with TopOMetry preserve local and global structures across single-cell datasets
We first performed a qualitative sanity test of TopOMetry by running it on synthetic toy data. To determine TopOMetry's performance on real-world sc-data, we then curated a diverse set of 20 single-cell datasets from different biological systems with varying numbers of cells and variables per cell across diseases, species, and technologies (Table 1). The datasets were processed following current standards in scRNA-seq analysis 24: i) library size normalization; ii) logarithmic transformation; iii) selection of highly variable genes (HVG); and iv) scaling (Z-score normalization, i.e., mean centering and variance clipping). For each dataset, we computed the first 100 principal components and obtained two-dimensional (2-D) projections with three existing DR methods: i) t-SNE 14; ii) UMAP 15; and iii) PaCMAP 17. Projections were computed using the principal components (the current standard) or the full data matrices. We then used TopOMetry to obtain 2-D projections using similar graph-layout optimization (GLO) methods. Each visualization was then scored regarding the preservation of global and local structure.
The Mean Reconstruction Error (MRE) was used to quantify the preservation of global structure, as recently proposed 41, with PCA being a reference by construction (Figure 3A). It has been shown that the preservation of global structure is mainly driven by the initialization given to GLO algorithms 42, making the preservation of local structure the more important factor when choosing between options. The trustworthiness score 40 was used to assess the preservation of local structure (Figure 3B). In addition, we propose a score based on the correlation of geodesic distances in the high- and low-dimensional spaces (Figure 3C), drawing from previous approaches 33,43; however, its high computational cost makes it impractical for most real-world datasets.
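Both classes of scores can be computed with standard tooling. The sketch below uses scikit-learn's `trustworthiness` for local structure and the Spearman correlation of graph shortest-path (geodesic) distances for the proposed geodesic score; the swiss-roll data and PCA embedding merely stand in for a real dataset and layout:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.stats import spearmanr
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=500, random_state=0)
emb = PCA(n_components=2).fit_transform(X)  # stand-in for any 2-D layout

# Local-structure score: how well k-neighborhoods are preserved (0 to 1)
local_score = trustworthiness(X, emb, n_neighbors=15)

# Geodesic score: Spearman R between graph shortest-path distances in the
# high- and low-dimensional spaces; the O(n^2) distance matrices explain
# why this score is impractical for large datasets
d_hi = shortest_path(kneighbors_graph(X, 15, mode="distance"), directed=False)
d_lo = shortest_path(kneighbors_graph(emb, 15, mode="distance"), directed=False)
mask = np.isfinite(d_hi) & np.isfinite(d_lo)  # ignore disconnected pairs
rho, _ = spearmanr(d_hi[mask], d_lo[mask])
```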
Across all datasets, the eigenbases (Figure 2D) and 2-D representations (Figure 2E) that yielded the highest local scores (LS) were generated with TopOMetry. Global scores (GS) were as high as PCA's for TopOMetry's eigenbases (Supplementary Figure 3A) and slightly lower for projections using TopOMetry compared to GLO on data or on top principal components for some datasets, with a maximum difference of 0.06 for the AgeingMouseBrain dataset (Supplementary Figure 3B), further suggesting that the preservation of local structure is a more robust metric for discriminating between visualizations. We found that PCA yielded lower LS for all datasets (Figure 2D), as expected from a global and linear method. Strikingly, we also found that projections derived from applying GLO methods to the top principal components had lower LS than those from GLO alone (Figure 2E), further suggesting that this practice could be inadequate despite being currently held as a default approach. The opposite happens when using TopOMetry's Laplacian-type eigenbases to build the neighborhood graphs projected with PaCMAP, or the diffusion potential from these eigenbases with MAP, as these approaches achieve higher LS than GLO alone (Figure 2E). Within TopOMetry, the variation of LS differed mostly across eigenmap methods, with LE scoring lower than DM and msDM, and little to no variation across different kernel methods (Figure 2E). These results show the superiority of TopOMetry over the standard workflow or GLO methods alone and highlight the value of critically evaluating methods before electing one for downstream analysis.

(C) Schematic overview of the assessment of distance preservation in manifolds: the pairwise geodesic distances in the high- and low-dimensional spaces are computed, and the Spearman R correlation between the rank of neighbors for each cell is obtained as a score. (D) Annotated heatmap of local scores for TopOMetry's eigenbases and PCA for 20 single-cell datasets (higher is better). (E) Annotated heatmap of local scores for projections learned using expression data, the first 100 principal components, and TopOMetry's eigenbases or topological graphs for 20 single-cell datasets (higher is better).

Estimating and visualizing intrinsic dimensionalities with TopOMetry
Intrinsic dimensionality (i.d.) can be loosely defined as the minimum number of parameters needed to accurately describe a multiparameter (high-dimensional) system 44. Applied to the MNIST handwritten-digits dataset, TopOMetry found an eigengap at 19 dimensions using the eigenvalues' first derivatives (Figure 3A). The corresponding eigenfunctions separate the ground-truth classes well, in contrast to the principal components (Figure 3B). Furthermore, the local i.d. estimates obtained with neighborhood-based algorithms (MLE and FSA) are highly correlated (Figure 3D), reinforcing their reliability. To interrogate whether i.d. estimates translated to actual insights about the data, we used a subsample of the dataset and visualized the digits with the highest and lowest i.d. Interestingly, the 100 digits with the highest local i.d. were predominantly 3's, 5's, and 8's written with varying calligraphy, indicating an association between local i.d. and intra-cluster heterogeneity (Figure 3E, left). Conversely, the 100 digits with the lowest local i.d. were predominantly 2's, 6's, and 1's written with consistent calligraphy, further supporting this association (Figure 3E, right).
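The two estimators discussed here are easy to sketch: a first-difference eigengap for the global i.d. and the Levina-Bickel maximum-likelihood estimator (MLE) for local i.d. Both implementations below are illustrative (the parameter choices and toy spectrum are ours, and TopOMetry's exact criteria may differ):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def eigengap(evals):
    """Global i.d. as the largest drop between consecutive eigenvalues
    (first-difference heuristic). `evals` must be sorted in decreasing order."""
    return int(np.argmax(-np.diff(evals))) + 1

def mle_id(X, k=20):
    """Per-point Levina-Bickel maximum-likelihood i.d. estimate."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist = dist[:, 1:]  # drop each point's zero distance to itself
    # m_hat(x) = [ (1/(k-1)) * sum_{j<k} log(T_k / T_j) ]^{-1}
    log_ratio = np.log(dist[:, -1][:, None] / dist[:, :-1])
    return (k - 1) / log_ratio.sum(axis=1)

# A flat 2-D manifold embedded in 10-D: local estimates should center near 2
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 10))
local_id = mle_id(X)

# A decaying spectrum with a clear gap after the 3rd eigenvalue
spectrum = np.array([1.0, 0.9, 0.85, 0.2, 0.15, 0.1])
global_id = eigengap(spectrum)
```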
We next examined the PBMC 3k dataset following the same approach to test whether these results translated to real-world sc-data. This dataset comprises ~2,700 peripheral blood mononuclear cells (PBMC) from a healthy human donor. Once again, the global estimate obtained with the eigengap method (72; Figure 3F) was reasonably similar to those obtained with MLE (122-59) and FSA (122-74) (Figure 3G). As in the MNIST dataset, the local i.d. estimates obtained with MLE and FSA were highly correlated (Figure 3H) but varied strikingly across different regions of the represented manifold (Figure 3I), which is inhabited by discrete cell types (Figure 3J, K). In this dataset, we found that megakaryocytes, natural killer

TopOMetry allows precise manifold learning of small and large-scale datasets
We next aimed to explore how TopOMetry can be useful for inferring cellular development trajectories. For a first demonstration, we obtained scRNA-seq data from the developing murine pancreas 48. The known differentiation trajectory that gives rise to alpha cells is faithfully represented by the RNA velocity vector field based on the TopOMetry model (Figure 3F). Finally, the latent dimensions learned by TopOMetry (the multiscaled diffusion components) are associated with particular regions of the underlying manifold (Figure 4G) corresponding to specific cellular states, effectively reconstructing the manifold's topology one region at a time (i.e., one identity per component). These components can also be used for downstream tasks, such as inferring changes in gene signatures along several developmental axes.
Next, we show how TopOMetry can be used for lineage inference in organism-wide cell atlases to learn detailed cellular hierarchies. For this, we harnessed 1.3 million high-quality pre-filtered single-cell transcriptomes from the mouse embryo during organogenesis 50. Standard processing was performed, and gene expression data were then represented with PCA (Figure 3H) or UMAP on the top 300 PCs (Figures 3I and 3J; annotations are from the original study). TopOMetry analysis was then performed using its default model (Figure 3K, topoMAP for brevity), and the Leiden community detection algorithm was employed on its diffusion potential graph. Both approaches successfully mapped major cell types as distinct regions of the manifold without apparent connections between them. However, TopOMetry revealed considerably finer-grained cellular lineages (Figure 3K) than those represented with UMAP on PCA (Figure 3I). In total, we found approximately 380 subpopulations, while only 56 were originally described (and labeled as subtrajectories). Most of these were neuronal populations, which is consistent with the expected diversity of the central nervous system. Some of the populations found when clustering with TopOMetry are also suggested by the UMAP on PCA representation (Figure 3I), albeit with less clearly defined trajectories. Hence, TopOMetry excels at faithfully representing vast datasets comprising hundreds of coexisting lineages, making it ideal for representing whole-organism data and performing lineage inference due to its intrinsic relation to the Waddington epigenetic landscape.

TopOMetry augments representations with the Riemannian metric to visualize distortions
A central issue when learning low-dimensional representations from sc-data is the introduction of distortions. Distortion is inevitably introduced when a high-dimensional manifold is embedded into a lower-dimensional space. As an illustrative example, consider the Earth and the planar projections used for making two-dimensional maps (e.g., the Mercator projection): in these maps, there is always some distortion, which can affect distances, angles, shapes, and areas within the projection. However, while Earth's true geometry can be physically visualized from space, the epigenetic manifolds 20 of phenotypic identities from which sc-data is sampled are unknown and cannot be visually assessed, making it extremely challenging to qualitatively evaluate how a low-dimensional mapping distorted them. The theoretical model of cellular diversity as Riemannian manifolds proposed here allows a direct connection with a unique method for evaluating these distortions: expressing the Riemannian metric of the original manifold 36 in any given 2-D mapping. The Riemannian metric (RM) is expressed in 2-D coordinates so as to make the represented distances and angles approximately equal to their original manifold counterparts. In practice, the RM can be visualized as ellipses whose axes indicate the magnitude and directions in which the manifold was distorted. Visually, these ellipses have eccentricity equal to zero if the distortion is uniform in all directions, rendering a circle, with a radius approaching zero in the special case of no distortion (i.e., the representation of a given point fully preserves its geometric properties on the manifold). Such an approach allows users to consider how the representation distorted different regions of the manifold and to qualitatively compare representations for hypothesis generation, effectively working as a "distortion ruler."
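Concretely, at each embedded point the RM is summarized by a 2x2 symmetric positive-definite matrix, and the plotted ellipse is that matrix's unit ball. The sketch below (with invented metric matrices for illustration) shows how the ellipse axes and eccentricity follow from an eigendecomposition:

```python
import numpy as np

def distortion_ellipse(H):
    """Semi-axes, axis directions, and eccentricity of the unit ball
    {dx : dx^T H dx = 1} of a 2x2 SPD metric matrix H.
    Eccentricity 0 means isotropic distortion (a circle)."""
    evals, evecs = np.linalg.eigh(H)  # ascending eigenvalues
    a, b = 1.0 / np.sqrt(evals)       # semi-major, semi-minor axes
    ecc = np.sqrt(1.0 - (b / a) ** 2)
    return a, b, evecs, ecc

# Isotropic metric: circle, eccentricity 0
_, _, _, e_iso = distortion_ellipse(np.eye(2))
# Anisotropic metric: ellipse stretched along one axis, high eccentricity
_, _, _, e_aniso = distortion_ellipse(np.diag([4.0, 1.0]))
```

Color-coding cells by this eccentricity, as done later in the text, reduces each ellipse to a single per-cell distortion value that can be compared across embedding methods.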
To illustrate these concepts, we used a toy dataset with three randomly distributed clusters with distinct variances (Figure 4A) and learned their representations with PCA, DM (within TopOMetry), UMAP, and a TopOMetry model (MAP on the diffusion potential of a diffusion map eigenbasis). The RM was then estimated for each embedding and visualized using ellipses. These ellipses present high eccentricity when using the first two principal components, indicating distortion towards the periphery of the embedding, and reasonably high eccentricities across all three clusters when using UMAP. In contrast, when using the first two diffusion components learned with TopOMetry, the distortion was negligible for one of the clusters and occurred along the components themselves for the other two. This was expected, as each cluster is encoded by a separate diffusion component in this scenario. When using the TopOMetry MAP on DM potential model, we found a distortion pattern similar to that observed with UMAP.
Next, to demonstrate its usability in sc-data analysis, we estimated the RM for representations of the classic PBMC 3k dataset learned with PCA, UMAP, UMAP on PCA, and a TopOMetry model (Figure 4B). As represented by the ellipses, each representation distorts specific regions of the manifold differently. To allow a more uniform comparison, we color-coded the eccentricity of the ellipse corresponding to each cell across methods and visualized them within the TopOMetry layout (Figure 4C) and with violin plots (Figure 4D). The populations associated with the greatest distortion were reasonably consistent across representations: CD4 T cells, CD14+ monocytes, B lymphocytes, and NK cells. These findings suggest that all representations preferentially misrepresent some cell types, often in different ways depending on the method, and that this could be related to the intrinsic geometry of the underlying manifold. Albeit illustrative, this example is limited by the small number of cells available. As CD4+ T cells were more favorably represented within the TopOMetry model and are known for their extreme diversity, we asked whether analyzing a larger number of cells could harbor additional insights into their biology.

The Riemannian metric can be used to estimate distortions in two-dimensional visualizations and can be represented by ellipses. If no distortion is present in any preferential direction, the ellipses have zero eccentricity and correspond to circles. If distortion is present in a preferential direction, the ellipse is aligned in that direction, and its eccentricity indicates the degree of distortion.

TopOMetry reveals novel transcriptionally defined T cell populations
To further investigate CD4+ T cells and distortions of the underlying manifold, we harnessed a public dataset comprising approximately 68,000 PBMCs from a healthy donor (10X Genomics, PBMC68K dataset). Representations were obtained with UMAP (using either the top 50 or 300 PCs or the highly variable genes) and TopOMetry models. Cells were then clustered with the Leiden community detection algorithm using weighted neighborhood graphs learned from: i) the top 50 PCs; ii) the top 300 PCs; and iii) TopOMetry's diffusion potential. The average number of cells in these clusters (Figure 6E) was markedly smaller using HVGs (608) and TopOMetry (320) than when using 50 PCs (2,489). In addition, the adjusted mutual information (AMI) score for different clustering strategies followed a pattern similar to that observed with the PBMC68K dataset, in which clustering results using the expression data are more similar to those found with TopOMetry than to those of the standard PCA-based workflow (Supplementary Figure 5G). Notably, different TopOMetry models presented high AMI scores with one another, demonstrating their successful convergence towards the LBO in describing the underlying phenotypic manifold.
To explore whether the identified T cell subpopulations were functionally distinct from a T cell receptor (TCR) perspective, we harnessed an additional dataset, the T cell compartment of the Tissue Immune Cell Atlas 51, comprising scRNA-seq and scVDJ-seq of T cells obtained from multiple human donors and tissues, including blood, that were jointly analyzed using the standard PCA-based workflow (Figure 6F). When analyzing these data with TopOMetry, we found a mismatch between the clusters it identified and the clusters described in the publication, particularly around CD4+ cells (Figure 6G, Supplementary Figures 6A and 6B), in a similar fashion to the previous datasets. T cell clonotypes were identified based on amino-acid sequence similarity using Scirpy 60 and were found to match the clusters found with TopOMetry (Supplementary Figure 6C) better than the originally described clusters (Supplementary Figure 6D). When analyzing TCR clonal expansion, we found that subpopulations with larger expansion corresponded to NK and CD8+ cells, as expected and previously described 51 (Figure 6H). However, populations with modest expansion corresponded to the CD4+ T cell clusters identified with TopOMetry (Figure 6H), suggesting these clusters correspond to specific clonal identities. To further explore this hypothesis, the 30 clonotypes with the most cells were visualized on the original UMAP embedding (Figure 6J), on which they are sparsely and randomly distributed, and on the TopOMetry embedding (topoMAP, Figure 6K), on which they are remarkably localized alongside the clusters found with TopOMetry. Consistent with these findings, the clonotype modularity (how densely connected the mRNA-derived neighborhood graph underlying the cells of a given clonotype is) was higher when using TopOMetry than with the standard approach (Supplementary Figures 6E and 6F). Next, we cross-referenced the TCR structures against VDJdb, a public database of antigen epitopes 60,61, and identified two transcriptionally defined clusters presenting TCRs that bind SARS-CoV-2 and EBV epitopes (Figures 6L and 6M); these populations are well-defined in the TopOMetry analyses but are unidentifiable in the original UMAP. These results suggest that the clusters of T cells identified by TopOMetry correspond to specific clonal populations. This hypothesis is confirmed by the analysis of TCR repertoire overlap across clustering results: while the clusters defined in the publication present a high repertoire overlap (Figure 6N), those identified using HVGs are markedly better separated (Figure 6O), and those obtained with TopOMetry present no overlap at all (Figure 6P). Indeed, we found that cells belonging to specific clonotypes defined by amino-acid sequence similarity also expressed the transcripts for the TCR chains identified by VDJ-seq (Supplementary Figures 6G-J).

The arrows indicate populations recognizing specific epitopes that precisely match the manifold structure uncovered by TopOMetry. (N) Heatmap of T cell repertoire overlap between the clusters found in the original study, (O) using expression data, and (P) TopOMetry. Note how the original clusters present high repertoire overlap, which is less evident in clusters found using expression data and nearly absent from clusters found with TopOMetry.
These results confirm that TopOMetry is able to identify clonal populations from mRNA expression alone and strongly suggest that the T cell clusters found in the other analyzed datasets (for which VDJ data is not available) also correspond to specific clones.
Throughout the analyzed datasets containing T cells, it was noticeable that the major cell populations (CD4+ and CD8+ T, NK, and B cells, monocytes, and megakaryocytes) were similarly represented by the default workflow using 50 or 300 PCs, by stand-alone UMAP, and by clustering on HVGs and on TopOMetry representations. Two main reasons could underlie the discrepancy observed for T cell subpopulations: i) high intrinsic dimensionality (i.d.) and ii) highly non-linear geometry at a local level. To test the former, we used the FSA and MLE i.d. estimation methods and found that T cells did not present particularly high i.d. in any of the tested datasets (Supplementary Figure 7A). To test the latter, we computed PCA with up to 300 PCs and calculated the total explained covariance: 50 and 300 PCs explained an average of only 20.09% and 41.89% of the total covariance for these datasets when using the default parameters in scanpy 52, with an ad hoc 'elbow point' around 30 PCs (Supplementary Figures 7B-F), at which only 17.7% of covariance was explained, on average. Remarkably, no HVG selection approach yielded orthogonal bases that explained more than 55% of covariance at 300 PCs. Such low values of explained covariance are generally found in highly non-linear systems. Taken together, these findings and the lack of specific marker genes for clusters learned from a PCA-based neighborhood graph strongly suggest that the phenotypic manifolds underlying T cell identity are highly non-linear and poorly approximated by hyperplanes of maximum covariance, and point to the introduction of linear distortion by PCA preprocessing.
Accordingly, we found that distortion was significantly higher in the PCA-based UMAP representations of T cells than in the topoMAP representation across all datasets (p < 10^-90, two-sided Wilcoxon test), as represented by ellipse eccentricities when projecting the Riemannian metric (Supplementary Figure 7G). Thus, it is clear that the CD4+ T cell clusters found with the standard workflow are PCA-derived artifacts rather than real biological subpopulations, and that the clusters found using HVGs or TopOMetry better correspond to actual biological subpopulations.

Discussion
Finding low-dimensional spaces encoding the phenotypic manifolds underlying single-cell data is a fundamental task in single-cell analysis and a prerequisite for most algorithms used in downstream analyses (e.g., clustering, marker genes, RNA velocity). However, how to properly find and evaluate such spaces has remained elusive. Here, we draw from recent advances in spectral graph theory and high-dimensional data analysis to show that such spaces are best modeled as Riemannian manifolds within a comprehensive theoretical framework, and present TopOMetry, a robust, user-friendly, and efficient toolkit that allows users to fully explore such manifolds in single-cell atlases with minimal assumptions about the data (i.e., the manifold hypothesis 62). The first of its kind, the toolkit addresses numerous gaps in high-dimensional and single-cell data analysis, providing means to estimate intrinsic dimensionalities, learn consistent topological representations 13, and evaluate subspaces from global and local perspectives. The ability to learn and evaluate dozens of representations with a single line of code empowers single-cell analysis, as one cannot be sure whether a particular computational algorithm will perform well on any given dataset: extending conclusions obtained from even dozens of datasets to the vast and exponentially growing corpora of single-cell data would be a dangerous assumption. Instead of relying on the presumed supremacy of any given algorithm or assuming data structures, users can test different approaches that pursue the same unbiased mathematical goal (LBO approximation) and decide which to adopt based on quantitative and qualitative evaluations.
Remarkably, it allows assessing distortion in visualizations for the first time by harnessing the Riemannian metric, addressing current criticism 63 of the biological interpretation of embeddings from single-cell data. Its modular design, extensive documentation, and interoperability with the larger Python ecosystem for single-cell analyses ought to make its adoption easy and seamless for most users. Like any other tool, it presents limitations, the major one being that, in its current form, it does not allow online or inverse learning, although this may be bypassed by the use of recently developed geometrically regularized autoencoders 64. Another limitation is the high computational cost of evaluating subspaces, a feature shared with existing tools 33,43, as it involves operating on dense pairwise-distance matrices in memory. We expect that our results will inspire the development of numerous other approaches harnessing the power of Laplacian-type operators and consistent manifold learning for the analysis of sc-data.

Public data acquisition and availability
Public single-cell RNA-seq datasets (Table 1) were obtained from various sources and processed following the standard data analysis workflow. The sources and processing parameters, alongside additional information for each dataset, are available in Table 2.

Code availability
TopOMetry is freely accessible as a Python library. The source code is available at https://github.com/davisidarta/topometry and can be easily installed through the Python Package Index (PyPI) at https://pypi.org/project/topometry/. The library is extensively documented at https://topometry.readthedocs.io. Code used to generate the exposed examples is available at https://github.com/davisidarta/topometry-manuscript. Code used for the benchmark shown in Figure 2 and Supplementary Figure 3 is available at https://github.com/davisidarta/topometry-dr-benchmark. Any additional information can be requested from the authors and will be promptly made available.

Standard data analysis
Count matrices were used to create AnnData objects within scanpy 52, which were filtered to recommended thresholds. Data were then processed with the following workflow: i) data were library-size normalized to a factor of 10,000 and variance-stabilized by log transformation; ii) highly variable genes (HVGs) were identified and kept for downstream analysis, with selection parameters adjusted so that each dataset had between 1,500 and 4,000 HVGs; iii) HVG data were then centered and scaled (i.e., standardization or z-score transformation); iv) dimensionality was reduced to 50 or 300 dimensions with PCA; v) neighborhoods were computed using UMAP fuzzy simplicial sets with the cosine distance; vi) the neighborhood graph was projected with UMAP and t-SNE; and vii) the neighborhood graph was clustered with the Leiden community-detection algorithm, with the resolution parameter set to 2. These steps correspond to the workflow applied in most manuscripts in which single-cell data is analyzed, being the norm in the field and regarded as the current gold standard 65. Thus, we refer to them as the standard workflow throughout the manuscript. Unless otherwise stated, these steps were performed with default parameters within the scanpy toolkit for single-cell analysis.
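The normalization and scaling steps (i and iii above) amount to simple arithmetic, sketched here in plain NumPy as a minimal illustration (function names are ours, not the scanpy API):

```python
import numpy as np

def normalize_counts(X, target_sum=10_000):
    """Library-size normalize each cell (row) to target_sum, then log1p."""
    size = X.sum(axis=1, keepdims=True)
    return np.log1p(X / size * target_sum)

def scale_genes(X):
    """Z-score each gene (column) across cells: zero mean, unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # guard genes with zero variance
    return (X - mu) / sd

counts = np.array([[1.0, 2.0, 3.0],
                   [4.0, 0.0, 6.0]])
norm = normalize_counts(counts)
scaled = scale_genes(norm)
```

In scanpy, these steps correspond to the normalization, log-transformation, and scaling functions applied with default parameters.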
When stated that stand-alone UMAP and clustering was performed on HVGs, we refer to the scaled data matrix of cells vs. HVGs. In these cases, all analyses were still performed within scanpy, but setting the parameters use_rep='X' and n_pcs=0 in the scanpy.pp.neighbors() function, which causes the neighborhood graph to be computed on the scaled data instead of top principal components. UMAP projections and Leiden clustering results were then obtained normally using scanpy built-in functions.
Cluster marker genes were found with logistic regression 66.

Approximating the Laplace-Beltrami Operator
The LBO is a powerful descriptor of data geometry under the manifold assumption: on a Riemannian manifold (a smooth manifold ℳ with a Riemannian metric g defined at every point p ∈ ℳ), the LBO entirely determines the Riemannian metric - and, therefore, all topological properties of the manifold. This remarkable feature was previously explored by methods like Laplacian Eigenmaps (LE) and Diffusion Maps (DM) 9,10,68, in which Laplacian-type operators obtained from affinity matrices approximate the LBO. The eigendecomposition of these operators is then used to find latent subspaces that completely preserve the geometry of the underlying manifold. This approach has been shown to be particularly robust when using affinity matrices learned with variable bandwidth kernels 68, continuous k-nearest-neighbors (CkNN) 13, or fuzzy simplicial sets 15,69.
TopOMetry combines and expands these approaches into a comprehensive framework that approximates the LBO iteratively to learn faithful low-dimensional representations of high-dimensional data. First, a k-nearest-neighbors (kNN) graph is learned from the standardized data (the base kNN graph) and used to estimate affinity matrices with different kernels (the base kernel). Laplacian-type operators are obtained from these matrices (the 'geometric' Laplacian with DM, also referred to as the diffusion operator, and the 'random-walk' Laplacian with LE), and their eigenvectors are used to form a new orthogonal basis in which the latent geometry of the original data is preserved. The learned orthogonal bases (the eigenbases) can be used for numerous downstream tasks, such as computing specialized neighborhood graphs for graph-layout optimization (GLO) algorithms like t-SNE and PaCMAP. It has been previously shown that multiple eigenvectors can be needed to encode data heterogeneity 43,70,71, each encoding a different aspect of data structure (i.e., distinct cellular trajectories), and that dozens or hundreds of diffusion components may be needed to describe large-scale, whole-organism single-cell data. Thus, these orthogonal bases in which all latent geometry is encoded are still high-dimensional and, as such, can be described through a second round of LBO approximation. In this step, TopOMetry learns a new kNN graph from these bases, from which new affinity matrices and Laplacian-type operators can be computed; these operators encode the latent geometry at a higher level and can be used for other downstream tasks such as clustering and GLO. The diffusion potential is used by default for clustering (e.g., with the Leiden community-detection algorithm) and GLO (e.g., with MAP, a simplified version of UMAP), and its eigendecomposition is used as initialization for GLO algorithms.
As a result, TopOMetry harnesses LBO approximations at each step of dimensional reduction, offering a pipeline that is completely unsupervised and dependent uniquely on the intrinsic geometry of the underlying manifold, while also allowing the flexibility of testing multiple options of kernels and types of Laplacian operators.

Computational implementation
TopOMetry was designed to be easy to use, and as computationally efficient as possible.

Construction of k-nearest-neighbor graphs
One of the most fundamental challenges for machine-learning algorithms dealing with high-dimensional data is the curse of dimensionality: in higher-dimensional spaces (HDS), the capacity of distance metrics to discriminate dissimilarity decays, i.e., as dimensionality increases, the Euclidean distance between the most similar samples tends to approach the distance between the most dissimilar ones 42. Thus, accurately and efficiently computing distances between samples in a large HDS is a challenge in itself, and several algorithms and tools have been developed to address it 72. The most commonly used metric in single-cell data analysis is the Euclidean distance, followed by the cosine and Manhattan distances - yet, in principle, any metric distance could be applied. Recent benchmarks 73,74 have provided initial insight into the suitability of different metrics for estimating distances between single cells, pointing towards correlation metrics as a favorable choice, although these were limited to a handful of single-cell RNA sequencing datasets in PCA-based workflows, and further studies are needed.
This guided the choice of the cosine distance as the default in TopOMetry and in all presented analyses, as it trivially corresponds to the Pearson correlation distance when using standardized data (the current standard in single-cell analysis).
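The equivalence is easy to check numerically: for mean-centered vectors, cosine similarity and the Pearson correlation coefficient coincide. A minimal NumPy illustration (not TopOMetry code):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Center each vector; standardized (z-scored) data is centered by construction
xc, yc = x - x.mean(), y - y.mean()

# Cosine similarity of the centered vectors...
cosine_sim = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
# ...equals the Pearson correlation of the raw vectors
pearson_r = np.corrcoef(x, y)[0, 1]
```

The corresponding distances (1 minus similarity) are therefore also identical in this setting.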
TopOMetry can employ scikit-learn and various approximate nearest-neighbor (ANN) libraries to construct kNN graphs. Guided by a recent benchmark of distance estimation algorithms 75, we set Hierarchical Navigable Small Worlds (HNSW) 76,77 as the default. HNSW has been shown to achieve fast and near-optimal results while supporting a variety of distance metrics (e.g., euclidean, cosine, manhattan). As in other non-linear algorithms, the first step of the diffusion basis is to find a k-nearest-neighbors (kNN) graph. Because previous studies have shown that using only highly variable features (HVF) for computations increases the power of single-cell analysis 65,78, we estimated distances from the cell-by-HVF matrix in the shown experiments. As a reference, ~1,500 to ~4,000 genes are generally considered a reasonable range of HVFs in single-cell RNA sequencing data analysis.

Affinity estimation

Variable bandwidth kernels
The implemented kernels draw from recent advances in kernel operators and expand them with two additional options. Instead of the classic Gaussian kernel with a fixed bandwidth ε, several studies have described variable bandwidth kernels 13,68,70,79 that act as manifold-adaptive estimators, rendering more robust results. These were adapted into TopOMetry as a bandwidth-adaptive kernel 68,71,79 in which the distance between two samples (cells) is normalized by each sample's distance to its median k-nearest neighbor (i.e., its k/2-nearest neighbor). As an extension of this concept, we introduce a novel kernel that includes a neighborhood expansion procedure, in which the original kNN graph is expanded to additional neighbors using a data-derived estimate of neighborhood sparseness. In addition, we introduce a novel adaptively decaying kernel, in which the rate of similarity decay is adjusted by each cell's neighborhood density instead of by a fixed hyperparameter, as otherwise proposed by PHATE's 'alpha-decaying kernel' 43.
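A bandwidth-adaptive kernel of this kind can be sketched in a few lines of NumPy. This is a minimal illustration of the general idea, assuming a precomputed pairwise distance matrix, not TopOMetry's implementation:

```python
import numpy as np

def adaptive_kernel(D, k=10):
    """Bandwidth-adaptive kernel: affinity(i, j) = exp(-d_ij^2 / (sigma_i * sigma_j)),
    where sigma_i is the distance from sample i to its median k-nearest neighbor
    (the k/2-th neighbor). D is an (n, n) pairwise distance matrix."""
    sorted_D = np.sort(D, axis=1)           # column 0 is the self-distance (0)
    sigma = sorted_D[:, max(1, k // 2)]     # distance to the k/2-th neighbor
    return np.exp(-(D ** 2) / np.outer(sigma, sigma))
```

Because each pair is normalized by both samples' local bandwidths, the resulting affinity matrix is symmetric and adapts to variations in sampling density.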
Guided by previous approaches, we first construct a k-nearest-neighbors graph for a given dataset X, with N samples (cells) and M measures (genes), using a given distance metric as a local similarity measure. Let ℳ ⊂ ℝ^n be a Riemannian manifold with unknown sampling distribution (not necessarily uniform) in ambient dimension n. For a given sample x_i ∈ X (under the manifold hypothesis, data points are assumed to lie on or near ℳ), consider N_k(x_i) to be the ordered set of its k-nearest neighbors under some distance metric d(x_i, x_j):

$$N_k(x_i) = \{x_{(1)}, x_{(2)}, ..., x_{(k)}\}$$

A scaling factor σ_i, corresponding to the distance of sample x_i to its median nearest neighbor (its k/2-nearest neighbor) in the high-dimensional space, is then defined:

$$\sigma_i = d(x_i, x_{(k/2)})$$

Drawing from previous work on adaptive bandwidth diffusion kernels, the following kernel function can then be used to convert distances into affinities:

$$K(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{\sigma_i \sigma_j}\right)$$

It is of note that the inverse of the scaling factor σ_i is employed in the manifold-adaptive FSA method of intrinsic dimensionality estimation 39.
To minimize the impact of the choice of the number-of-neighbors hyperparameter k on the learned graph structure, we harness the potential of the scaling factor σ_i to carry information about the latent neighborhood density.

Continuous k-nearest neighbors
Continuous k-nearest neighbors (CkNN) is a recently introduced 80 method that constructs a single unweighted graph from which the underlying manifold's structure (i.e., topological information) can be retrieved, given any compact Riemannian manifold. In its introductory manuscript, CkNN was shown to be the only unweighted graph construction strategy for which the graph Laplacian converges spectrally to the Laplace-Beltrami operator. CkNN connects samples x_i and x_j whenever

$$d(x_i, x_j) < \delta \sqrt{d(x_i, x_i^k)\, d(x_j, x_j^k)}$$

where x_i^k is the k-nearest neighbor of x_i and δ is a unitless scale parameter. There are two main advantages to this approach: first, it allows one to interpret the CkNN graph construction as a kernel method, enabling interpretation of the graph Laplacian in terms of δ; second, it allows fixing the δ hyperparameter, enabling the use of d(x_i, x_i^k) as a local density estimate, similarly to the approximation of geodesics with fuzzy simplicial sets and diffusion harmonics. In practice, this captures the natural multiscale topology of data, wielding smooth geometries where data is sparsely sampled and fine-grained details where data is densely sampled.
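The CkNN rule above can be sketched directly. This is a minimal NumPy illustration under our reading of the construction, with a precomputed distance matrix:

```python
import numpy as np

def cknn_graph(D, k=5, delta=1.0):
    """Continuous k-NN: connect i and j iff
    d(i, j) < delta * sqrt(d_k(i) * d_k(j)),
    where d_k(i) is the distance from i to its k-th nearest neighbor.
    Returns a boolean (n, n) adjacency matrix."""
    sorted_D = np.sort(D, axis=1)
    dk = sorted_D[:, k]                       # k-th neighbor (rank 0 is self)
    A = D < delta * np.sqrt(np.outer(dk, dk))
    np.fill_diagonal(A, False)                # no self-loops
    return A
```

Because the threshold is the geometric mean of both points' local scales, the adjacency matrix is symmetric by construction, and δ tunes the overall connectivity without re-estimating neighborhoods.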

Fuzzy simplicial sets
The application of 1-skeleton fuzzy simplicial sets as topological metrics in TopOMetry draws entirely from previous work, particularly from UMAP 15 theory and the work of David Spivak 69.
Our innovation is the use of these metrics to construct new orthogonal bases that capture latent topology, and subsequently to construct topological graphs prior to layout optimization. When using this method to learn metrics from some data X ∈ ℝ^{N×M}, it is assumed that X is uniformly distributed on a locally connected Riemannian manifold ℳ with a locally constant metric g. UMAP originally addresses the uniformity assumption by creating a custom distance for each sample, normalizing distances with respect to their k-th nearest neighbor to effectively approximate geodesics. To merge each of the discrete metric spaces generated per sample into a global informational space, the metric spaces are converted into fuzzy simplicial sets. A consensus metric can then be obtained by taking a fuzzy union across all metric spaces encoded as fuzzy simplicial sets, rendering a single fuzzy simplicial set that is guaranteed to capture relevant topological metrics on ℳ. The Laplace-Beltrami Operator acting on ℳ can then be approximated from the normalized Laplacian of this fuzzy graph representation.
The aforementioned approach is grounded on solid category theory, and no major modifications were made to the fuzzy simplicial set algorithms defined and implemented in UMAP, aside from improving computational performance with approximate-nearest-neighbors libraries. For brevity, we direct the reader to UMAP's manuscript 15 for the complete treatment; in essence, the constructions exploit the fact that the realization and singular set functors form an adjunction, translating between topological spaces and (fuzzy) simplicial sets - for this, a categorical presentation of fuzzy sets was outlined.
A fuzzy set can be defined as a set for which membership is not binary - rather, it is a fuzzy property represented by continuous values in the unit interval I = (0, 1] ⊆ ℝ, with topology given by intervals of the form [0, a) for a ∈ (0, 1]. This allows one to define presheaves: a presheaf P on I is a functor from I^op to Sets. A fuzzy set is a presheaf on I such that all maps P(a ≤ b) are injections; P([0, a)) is the set of all elements with membership strength at least a. With this concept in mind, the category Fuzz of fuzzy sets is defined as the full subcategory of sheaves on I spanned by fuzzy sets. Consequently, defining fuzzy simplicial sets (FSS) corresponds to considering presheaves of Δ valued in Fuzz instead of Sets. The objects of the category of FSS are thus functors from Δ^op to Fuzz, and its morphisms are given by natural transformations. Conversely, FSS can be viewed as sheaves over the product topology Δ × I, allowing one to apply them to a broad category of extended-pseudo-metric spaces. This is useful because the initial geodesic normalization step creates a unique metric space for each sample. The realization functor is then defined in terms of standard fuzzy simplices, similarly to the classic realization functor of simplicial sets, with its action on morphisms given by the corresponding non-expansive maps.

Learning a latent basis from Laplacian-type operators
TopOMetry employs the eigendecomposition of Laplacian-type operators to learn latent bases. These operators can be parametrized by the anisotropy value α, which controls how much the data distribution is allowed to bias the obtained results, so that only the underlying geometry is considered and the LBO is approximated when α = 1. For that, given an affinity matrix W with degree matrix D, we first obtain an α-normalized affinity matrix:

$$W^{(\alpha)} = D^{-\alpha} W D^{-\alpha}$$

and form a new degree matrix D^{(α)} from W^{(α)}, which is used to obtain the anisotropic diffusion operator:

$$P^{(\alpha)} = \left(D^{(\alpha)}\right)^{-1} W^{(\alpha)}$$

Given the intrinsic dimensionality (number of non-trivial eigenvalues) of ℳ, one can achieve the spectral decomposition with eigenvalues λ_l and eigenvectors ψ_l, in which the diffusion distance D_t(x_i, x_j)^2 between samples x_i and x_j along the diffusion process at scale t is encoded. We next multiscale the diffusion distances to account for all possible scales, summing over all values of t and rendering multiscale diffusion distances:

$$D_{ms}(x_i, x_j)^2 = \sum_{l} \left( \sum_{t=1}^{\infty} \lambda_l^t \right)^2 \left( \psi_l(x_i) - \psi_l(x_j) \right)^2$$

which can be reformulated as:

$$D_{ms}(x_i, x_j)^2 = \sum_{l} \left( \frac{\lambda_l}{1 - \lambda_l} \right)^2 \left( \psi_l(x_i) - \psi_l(x_j) \right)^2$$

The eigenvalues of the resulting decompositions are useful for empirically estimating the data's intrinsic dimensionality during computations: past a certain component, the numerical values hit the limits of floating-point precision, which is signaled either by a sign flip or by the decomposition algorithm outputting null eigenvalues beyond this limit. We found this to be a useful way to set the key hyperparameter n_eigs, which defines a fixed number of latent dimensions to be computed, and recommend users increase this parameter up to the point where such a spectral gap (or eigengap) can be observed in scree plots. As shown in Figure 3, the identified spectral gap often corresponds to external estimates of intrinsic dimensionality.
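The α-normalization and eigendecomposition above can be sketched in a few lines of NumPy. This is a minimal illustration of the operator construction, not TopOMetry's implementation:

```python
import numpy as np

def diffusion_operator(K, alpha=1.0):
    """Anisotropic diffusion operator from a symmetric affinity matrix K.
    alpha=1 removes the sampling-density bias so that the operator
    approximates the Laplace-Beltrami operator."""
    d = K.sum(axis=1)
    K_alpha = K / np.outer(d ** alpha, d ** alpha)  # W_alpha = D^-a K D^-a
    d_alpha = K_alpha.sum(axis=1)
    return K_alpha / d_alpha[:, None]               # row-stochastic Markov matrix

def diffusion_maps(P, n_eigs=5):
    """Eigendecompose P; the leading eigenvectors give diffusion coordinates."""
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)
    return evals.real[order][:n_eigs], evecs.real[:, order][:, :n_eigs]
```

The first (trivial) eigenvalue of a Markov matrix is 1; the subsequent eigenpairs, optionally rescaled by λ/(1-λ) as in the multiscale formulation, form the latent basis.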

Graph-layout optimization
Graph layout optimization (GLO) techniques aim to optimize a lower-dimensional representation (usually in two or three dimensions, for visualization) that preserves most of some property of the original data by minimizing a given cost function. Differences between these methods are mostly owed to how similarities are learned from distances and to the number and design of cost functions 82.

t-Distributed Stochastic Neighbor Embedding (t-SNE)
In SNE, the similarity between samples x_i and x_j in the high-dimensional space is encoded as a conditional probability, using a Gaussian centered at x_i whose bandwidth σ_i is set to achieve a user-defined perplexity:

$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}$$

A similar conditional probability q_{j|i} can be computed for the points in the low-dimensional embedding. t-SNE's superiority to SNE is due to two differences: first, the implementation of a symmetrized version of the SNE cost function; second, the use of heavy-tailed distributions to address the crowding problem. To symmetrize the cost function, instead of minimizing the sum of all Kullback-Leibler (KL) divergences between high-dimensional and low-dimensional conditional distributions, a single KL divergence between the joint probability distributions P (high-dimensional) and Q (low-dimensional) is minimized:

$$C = KL(P\|Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}}, \quad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}$$

In t-SNE, a Student t-distribution with one degree of freedom (equivalent to a Cauchy distribution) is used as the heavy-tailed distribution for the low-dimensional mapping, defining the joint probabilities q_{ij}:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$

The gradient of the KL divergence between P and Q is:

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}$$

and the embedding update with a momentum term (to avoid local minima) is given by:

$$Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)$$

Significant advances have also been made regarding the optimization of the t-SNE cost function to avoid local minima and increase performance. There are numerous further modifications and adaptations of the t-SNE algorithm (e.g., heavy tailing, density preservation, perplexity combination), which exceed the scope of this manuscript.
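The Student-t similarities and the KL gradient can be sketched as follows (a minimal NumPy illustration assuming a precomputed joint probability matrix P; not the optimized implementation used in practice):

```python
import numpy as np

def tsne_grad(P, Y):
    """Gradient of KL(P||Q) for t-SNE with a Student-t (df=1) kernel.
    P: (n, n) symmetrized high-dimensional joint probabilities (sums to 1).
    Y: (n, 2) current embedding. Returns the gradient and Q."""
    diff = Y[:, None, :] - Y[None, :, :]    # pairwise differences y_i - y_j
    dist2 = (diff ** 2).sum(-1)
    W = 1.0 / (1.0 + dist2)                 # heavy-tailed similarities
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                         # low-dimensional joint probabilities
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^-1
    G = 4.0 * ((P - Q) * W)[:, :, None] * diff
    return G.sum(axis=1), Q
```

A gradient-descent loop with the momentum update above would repeatedly call this function and adjust Y.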
The version of t-SNE included in TopOMetry relies on the MulticoreTSNE 84,85 package.

Uniform Manifold Approximation and Projection (UMAP) and Manifold Approximation and Projection (MAP)
The similarity metric when using the 'fuzzy' algorithm in TopOMetry corresponds to the 1-skeleton of the learned fuzzy simplicial sets, that is, a fuzzy graph (i.e., a fuzzy set of edges). To generate visualizations, the graph layout is optimized by minimizing the cross-entropy C between the high-dimensional fuzzy set (A, μ) and its low-dimensional counterpart (A, ν), in terms of the reference set A:

$$C\big((A, \mu), (A, \nu)\big) = \sum_{a \in A} \left[ \mu(a) \log \frac{\mu(a)}{\nu(a)} + \big(1 - \mu(a)\big) \log \frac{1 - \mu(a)}{1 - \nu(a)} \right]$$

The MAP approach included in TopOMetry corresponds to a near-identical implementation of UMAP's optimization procedure, but using arbitrary graphs that are directly used for layout optimization (i.e., the manifold has already been approximated), thus assuming only that the cross-entropy optimization finds global near-optima solutions.

Pairwise Controlled Manifold Approximation and Projection (PaCMAP)
PaCMAP holds significant similarities with UMAP and TriMAP but introduces its own unique insights. The core idea of PaCMAP is establishing robust loss functions that take into account not only a sample's local neighborhood, but also how that neighborhood relates to closely located neighborhoods and to faraway points. First, the neighborhood affinity graph is built in a manner identical to TriMAP's triplet weighting procedure, in which squared distances between samples are scaled by each sample's local scale:

$$\tilde{d}_{ij} = \frac{d(x_i, x_j)^2}{\sigma_i \sigma_j}$$

where σ_i is the average distance between x_i and its 4th to 6th nearest neighbors. Next, pairs of samples are selected as follows:
• Near pairs - a sample's k-nearest neighbors, based on the scaled distance.
• Mid-near pairs - for each sample, a small set of samples is drawn at random, and the second-closest of them is paired to the sample.
• Further pairs - pairs of samples drawn at random.
The loss function weighs the contribution of each kind of pair, with a weight w_NB given to a sample's near pairs, w_MN to its mid-near pairs, and w_FP to its further pairs. The optimization is performed in three stages, each with a number of iterations t that can be controlled by the user as hyperparameters, with the weights adjusted across stages. The uniqueness and novelty of PaCMAP lie in the fact that local and global structure are balanced throughout the optimization process by attending to neighbors, mid-near pairs, and further pairs simultaneously. When employed in TopOMetry, PaCMAP uses a denoised topological orthogonal basis instead of the raw data; similarly to TriMAP, feeding it the downstream topological graph may be counterproductive due to its peculiar neighborhood sampling process.

Estimation of intrinsic dimensionalities
Intrinsic dimensionalities were estimated using the Maximum-Likelihood Estimator (MLE) 47, whose local estimate for a sample x and neighborhood size k is:

$$\hat{m}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1}$$

in which T_j(x) represents the distance between x and its j-th nearest neighbor. The global estimate is obtained by averaging all local i.d. estimates.

Manifold adaptive dimensionality estimation (FSA)
The FSA (Farahmand, Szepesvári & Audibert) method 39,86 is extremely simple: it uses two neighborhoods around a data point x_i to estimate its local intrinsic dimensionality δ(x_i). In particular, it uses the ratio between the distances from x_i to its k-th and (k/2)-th nearest neighbors:

$$\hat{\delta}(x_i) = \frac{\log 2}{\log \big( r_k(x_i) \,/\, r_{k/2}(x_i) \big)}$$

Note that this ratio corresponds to the scaling factor used in the bandwidth-adaptive kernel.
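Both estimators are straightforward to sketch in NumPy (minimal illustrations of the formulas above, assuming a precomputed pairwise distance matrix; aggregation choices are ours):

```python
import numpy as np

def mle_id(D, k=10):
    """Levina-Bickel maximum-likelihood i.d. estimate, averaged over samples.
    D is an (n, n) pairwise distance matrix."""
    knn = np.sort(D, axis=1)[:, 1:k + 1]        # T_1 ... T_k (self excluded)
    Tk = knn[:, -1][:, None]
    local = 1.0 / np.log(Tk / knn[:, :-1]).mean(axis=1)
    return float(local.mean())

def fsa_id(D, k=10):
    """FSA estimate: delta = log(2) / log(r_k / r_{k/2}). We aggregate the
    local estimates with the median, for robustness to near-unit ratios."""
    sorted_D = np.sort(D, axis=1)
    rk, rk2 = sorted_D[:, k], sorted_D[:, k // 2]
    return float(np.median(np.log(2.0) / np.log(rk / rk2)))
```

On points sampled from a two-dimensional plane embedded in a higher-dimensional ambient space, both estimators recover values close to 2.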

Estimating and representing distortions with the Riemannian metric
We draw from the seminal work of Perrault-Joncas and Meila 36, who first used the Riemannian metric to visualize distortions in the data and to learn locally isometric embeddings by correcting projections, using fixed bandwidth kernels. In that regard, our contribution is restricted to extending their work to kNN graphs and variable bandwidth kernels and to applying these principles to single-cell data.
The Riemannian metric g is a symmetric positive definite tensor field that defines an inner product on the tangent space T_p ℳ for every p ∈ ℳ, (ℳ, g) being a Riemannian manifold. The Laplace-Beltrami operator (LBO) Δ can be expressed by means of g:

$$\Delta f = \frac{1}{\sqrt{\det g}} \sum_{i,j} \partial_i \left( \sqrt{\det g} \; g^{ij} \, \partial_j f \right)$$

Because Δ is approximated by the random-walk graph Laplacian and the anisotropic diffusion operator, one can use it to estimate the inverse (dual) of the Riemannian metric, h, from the embedding coordinate functions f_i:

$$h^{ij} = \frac{1}{2} \left[ \Delta(f_i f_j) - f_i \, \Delta f_j - f_j \, \Delta f_i \right]$$

h can then be eigendecomposed, with eigenvalues ordered from largest to smallest, and the embedding metric is given by the pseudoinverse of h. For each point, an ellipsoid representation can then be obtained from this eigendecomposition. When using two dimensions, the ellipse orientation is given by the arctangent of the ratio between the components of the leading eigenvector, and the sizes of the major and minor axes are given by 2 × κ × λ for the largest and smallest eigenvalues λ, respectively, κ being a scaling factor.
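Following Perrault-Joncas and Meila, the dual metric estimate can be sketched as below. This is a minimal NumPy illustration under our reading of the formula (sign conventions depend on the Laplacian used), not TopOMetry's implementation:

```python
import numpy as np

def embedding_metric(L, Y):
    """Estimate the dual Riemannian metric h at each embedded point:
    h_ij = 0.5 * [L(f_i f_j) - f_i L f_j - f_j L f_i],
    where L is a graph Laplacian approximating the LBO and f_i are the
    embedding coordinate functions (columns of Y). Returns (n, d, d)."""
    n, d = Y.shape
    H = np.empty((n, d, d))
    for i in range(d):
        for j in range(d):
            fi, fj = Y[:, i], Y[:, j]
            H[:, i, j] = 0.5 * (L @ (fi * fj) - fi * (L @ fj) - fj * (L @ fi))
    return H

def ellipse_params(h):
    """Ellipse orientation and axis scales from one 2x2 dual metric h."""
    g = np.linalg.pinv(h)                  # embedding (Riemannian) metric
    evals, evecs = np.linalg.eigh(g)
    angle = np.arctan2(evecs[1, -1], evecs[0, -1])
    return angle, evals
```

Plotting one such ellipse per cell on a 2D embedding visualizes where (and in which direction) the projection distorts the manifold.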

Quantitative evaluations
Global score

To evaluate the preservation of global structure by embeddings and layouts, we elected the PCA resemblance score, first introduced in the TriMAP manuscript 41, which compares the reconstruction error of the evaluated embedding to that obtained with PCA. Interestingly, as exposed in the results, the global score remains rather stable and close to 1.0 across all embedding and layout algorithms that were tested, given a PCA or spectral initialization. We empirically show that this score is approximately the same using either PCA or a spectral layout (LE) and implement functions to perform such tests within TopOMetry.

Trustworthiness
For evaluating the preservation of local structure from an isomorphic perspective, we harnessed the well-known trustworthiness 40 metric available in the scikit-learn machine-learning toolkit.
Given a small number k of nearest neighbors, this metric is defined as:

$$T(k) = 1 - \frac{2}{Nk(2N - 3k - 1)} \sum_{i=1}^{N} \sum_{j \in U_k(i)} \big( r(i, j) - k \big)$$

in which, for each sample i, U_k(i) is the set of its k-nearest neighbors in the low-dimensional space that are not among its k-nearest neighbors in the original high-dimensional space, and r(i, j) is the rank of j among the nearest neighbors of i in the original space. Essentially, this penalizes samples (cells) that are nearest neighbors in the low-dimensional space but not in the original space, in proportion to their original rank. A score of 0 implies complete loss of local geometry, and a score of 1 implies complete preservation of local geometry at the rank of k.
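The definition above can be sketched directly in NumPy (a minimal illustration of the formula with precomputed distance matrices, not the scikit-learn implementation):

```python
import numpy as np

def trustworthiness(D_high, D_low, k=5):
    """Trustworthiness T(k): penalizes points that are k-NN in the embedding
    but not in the original space, proportionally to their original rank."""
    n = D_high.shape[0]
    order_high = np.argsort(D_high, axis=1)
    # ranks[i, j] = rank of j among i's neighbors in the original space
    ranks = np.empty_like(order_high)
    ranks[np.arange(n)[:, None], order_high] = np.arange(n)[None, :]
    knn_high = order_high[:, 1:k + 1]
    knn_low = np.argsort(D_low, axis=1)[:, 1:k + 1]
    penalty = 0.0
    for i in range(n):
        # neighbors in the embedding that are not neighbors originally
        U = set(knn_low[i]) - set(knn_high[i])
        penalty += sum(ranks[i, j] - k for j in U)
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty
```

An embedding whose k-nearest neighborhoods exactly match the original space scores 1.0.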

Geodesic correlation
For evaluating the preservation of local structure from an isometric perspective, we built upon previous approaches that explored geodesic distances and the Spearman correlation between original high-dimensional and downstream low-dimensional distances 43,87. This task is computationally very intensive and thus was not performed by default in the benchmark. kNN graphs are built from both the original high-dimensional data matrix and the representations to be scored. Next, the geodesic distances (i.e., the weighted shortest-path pairwise distances) are obtained for each of these graphs, and their distributions are then scored with the Spearman correlation (R). Cells can then be ranked according to their neighborhood indices - let N be the number of cells and d_i be the difference between the ranks of cell i in the original high-dimensional space and in the latent low-dimensional space. Then, the Spearman correlation is:

$$R = 1 - \frac{6 \sum_{i} d_i^2}{N(N^2 - 1)}$$

The Spearman rank-order correlation coefficient (R) is a nonparametric measure of association between two sets of data, superior to the Pearson correlation in this case because it assumes only monotonicity, not linearity. Intuitively, this score tells whether increasing distances in the low-dimensional embedding are related to increasing distances in the high-dimensional data: a score of 0 implies no association at all, and a score of 1.0 implies perfect association.
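The rank-difference formula above is a one-liner in NumPy (a minimal illustration assuming no ties; not the implementation used in the benchmark):

```python
import numpy as np

def spearman_r(a, b):
    """Spearman correlation via the rank-difference formula
    R = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), assuming no ties."""
    ra = np.argsort(np.argsort(a))   # ranks of a
    rb = np.argsort(np.argsort(b))   # ranks of b
    d = ra - rb
    n = len(a)
    return 1.0 - 6.0 * (d ** 2).sum() / (n * (n ** 2 - 1))
```

Any monotone transformation of the distances leaves R at 1.0, which is exactly the property that makes it preferable to the Pearson correlation here.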

Benchmark
For the benchmark presented in Figure 2 and Supplementary Figure 3, we used the global score and the trustworthiness score. 200 principal components or eigencomponents were used for each non-biological dataset, and 600 were used for each single-cell dataset. For all datasets, the cosine distance was used with 30 nearest neighbors. All other hyperparameters used in t-SNE, UMAP, PaCMAP, and TopOMetry were set to defaults. We restricted the tested GLO techniques to t-SNE, UMAP, and PaCMAP due to practical limitations in computing results for additional methods (e.g., including TriMAP and MDE, or the plethora of available options, would lead to an unreasonable increase in runtime).

RNA velocity analysis and lineage inference
For RNA velocity analysis of the murine pancreas development dataset 48, we used the Python toolkit scvelo 26. Analysis was performed with standard parameters, following the scvelo documentation. TopOMetry analysis was performed using the msDM eigenbasis (obtained with the bandwidth_adaptive kernel), and the MAP projection method was used on the diffusion potential learned from this eigenbasis with standard parameters. RNA velocity was then recomputed using the msDM representation to compute neighborhood graphs with the scvelo.pp.moments function, with all other hyperparameters identical to the standard analysis.
For the MOCA dataset, the TopOMetry msDM eigenbasis was used to decompose the data into 800 eigenvectors (of which 796 were positive). Clustering with the Leiden community detection algorithm and projection with MAP were then carried out using the diffusion potential learned from this eigenbasis, with default parameters.
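The eigendecomposition step can be illustrated with a plain diffusion-maps operator; this is a simplified fixed-bandwidth stand-in for TopOMetry's adaptive-bandwidth kernels, showing why the operator's top eigenvalue is 1 and how the count of numerically positive eigenvalues can fall short of the number requested (as with the 800 requested / 796 positive above):

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))  # stand-in data matrix

# Gaussian affinity kernel (fixed bandwidth; TopOMetry adapts it per cell)
D2 = squareform(pdist(X, metric="sqeuclidean"))
K = np.exp(-D2 / np.median(D2))

# Symmetrized diffusion operator D^{-1/2} K D^{-1/2}; its spectrum equals
# that of the row-stochastic Markov operator D^{-1} K
d = K.sum(axis=1)
S = K / np.sqrt(np.outer(d, d))

evals, evecs = eigsh(S, k=50, which="LM")
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort descending

n_positive = int((evals > 0).sum())  # usable eigencomponents
```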

Analysis of T cell diversity and clonality
The datasets used to explore T cell diversity were cell-type annotated with CellTypist, using the graph connectivities learned from the scaled HVG data matrix for overclustering and majority voting 51 . The Immune_All_High model was used to annotate main cell types, and the Immune_All_Low model to annotate subtypes. To quantify the number of detected T cell clusters, a prediction-probability threshold of 0.8 was used.
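The majority-voting idea (each fine-grained overcluster inherits the most frequent per-cell prediction) can be sketched independently of CellTypist's implementation; the cluster IDs and labels below are toy values:

```python
import numpy as np
from collections import Counter

def majority_vote(clusters, predicted_labels):
    """Assign each (over)cluster the most frequent per-cell label,
    mirroring the idea behind CellTypist's majority voting."""
    consensus = {}
    for c in np.unique(clusters):
        members = predicted_labels[clusters == c]
        consensus[c] = Counter(members).most_common(1)[0][0]
    # every cell inherits its cluster's consensus label
    return np.array([consensus[c] for c in clusters])

clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
labels = np.array(["T", "T", "B", "NK", "NK", "T", "T", "B", "T"])
voted = majority_vote(clusters, labels)
```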
All datasets were analyzed following the standard workflow within scanpy, using 50 or 300 principal components and the Leiden clustering algorithm with resolution=2. All other hyperparameters were set to defaults. Marker genes were found using logistic regression 66 with max_iter=1000 and all other hyperparameters set to defaults. TopOMetry analysis was performed using the msDM eigenbasis with the bandwidth-adaptive kernel to calculate 500 eigencomponents for all datasets, from which the diffusion potential was used to compute MAP projections. All other TopOMetry hyperparameters were set to defaults.
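Marker detection by logistic regression amounts to fitting a classifier of cluster labels on the expression matrix and ranking genes by their per-class coefficients. A minimal sklearn sketch on synthetic counts (an illustration of the idea, not scanpy's `rank_genes_groups` code; the planted marker gene is an assumption of the toy data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_cells, n_genes = 300, 100
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
labels = rng.integers(0, 3, size=n_cells)  # toy cluster assignments
X[labels == 0, 0] += 5.0                   # plant gene 0 as a cluster-0 marker

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# genes with the largest positive coefficients for a class are its markers
top_gene_cluster0 = int(np.argmax(clf.coef_[0]))
```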
For the analysis of the TICA dataset 51 , an AnnData object containing scRNA-seq and VDJ information for the T cell compartment was obtained from tissueimmunecellatlas.org (GEX+TCRab data). We then reanalyzed these data with scirpy 60 , following the steps exactly as detailed in its documentation, with no parameter changes other than using TopOMetry latent representations and clustering results. TopOMetry analysis used the same parameters as for the other PBMC datasets.

Computational Environment
The benchmark presented in Figure 2

[Figure legend, workflow overview:] (A) Schematic overview of the current standard workflow for single-cell analysis, starting from a high-dimensional matrix generated from raw sequencing data that undergoes quality control (QC), cell filtering, and library-size normalization. Highly variable genes are selected and used to compute PCA, from which the top principal components are used to build k-nearest-neighbors (kNN) graphs and affinity matrices that are used for clustering and for projection into two-dimensional visualizations. (B) Schematic overview of the TopOMetry workflow, which consists of the same default processing steps, after which kNN graphs and specialized kernels are used to learn Laplacian-type operators that consist of topological affinity matrices. The eigendecomposition of such an operator yields eigenbases that preserve the latent topological information of the original high-dimensional manifold. The eigenbasis can then be used to learn new topological graphs using the same kernels and operators, which are used for clustering or for learning two-dimensional projections through graph-layout optimization techniques. The learned projections are then evaluated for the preservation of global and local structure and the amount of introduced distortion. (C) Schematic overview of the computational implementation of TopOMetry, which is centered around the TopOGraph class.

[Figure legend, T cell clonality:] (C) Clonotype network for T cells with a detectable TCR in this dataset, based on amino acid sequence similarity. Each dot represents an individual clonotype, and its size represents the number of cells belonging to that clonotype. Dots are colored by the clusters from the original study. Note how different clones are grouped together under the same cluster assignment. (D) Same as in (C), but with dots colored by the clusters found with TopOMetry. Note how most clonotypes belong to a single cluster assignment, without overlap. (E) PCA-based UMAP and (F) topoMAP projections of the same data, colored by clonotype modularity. Clonotype modularity was calculated using a neighborhood graph learned from the top 50 principal components (PCs) or from the diffusion potential graph. Note how modularities are higher with the latter, indicating a better connection between the manifold learned from RNA-seq and the clonotypes learned from VDJ-seq. (G-J) Violin plots of comparative gene expression between cells belonging to a clone and the rest, detected akin to marker genes using the Wilcoxon test, showing expression of the receptor chains that define the clonotypes identified by VDJ-seq.