A comprehensive dimensional reduction framework to learn single-cell phenotypic topology uncovers T cell diversity

The core task when analyzing single-cell data (SCD) is dimensional reduction (DR), which aims to find latent signals encoding biological heterogeneity. Here, we dissected DR steps to build TopOMetry, a machine learning framework that learns latent data topology to perform DR in a modular fashion and show that current analysis practices are biased due to the non-uniformity and non-linearity of the geometry underlying SCD. We used TopOMetry to analyze SCD from peripheral blood mononuclear cells (PBMC), and consistently found a plethora of unreported T CD4 subpopulations that are missed by current workflows across datasets and diseases. We also found comparable T CD4 diversity in the human cerebrospinal fluid. The proposed framework scalates to millions of cells, excels at discovering fine-grained local structure, and holds natural connections to clustering and trajectory inference. We hope this powerful tool will accelerate biomedical discovery and inspire new methods to learn and explore phenotypic topology.


Introduction
Finding meaningful, latent information from high-dimensional data (HDD) is a central concern in several research areas and particularly critical when analyzing single-cell data (SCD).
Single-cell assays have inaugurated a new era in the unbiased molecular sampling of the cellular diversity present in biological systems, from cell lines to whole organisms 1,2 . However, the complexity and vastitude of the data generated by such experiments require sophisticated computational frameworks for its analysis 3,4 . In such data, information regarding cellular phenotype (e.g. gene expression, chromatin accessibility, etc.) is individually gathered for every cell, resulting in matrices containing up to millions of cells and dozens of thousands of observations. Thus, a core task in the analysis of SCD is dimensional reduction (DR), which aims to embed these cell-specific observations into a latent representation lying on a lower-dimensional space (usually dozens of latent dimensions, or in 2-or 3-D for visualization) while preserving as much original information as possible 4,5 . Essentially, this attempts to find latent signals encoding phenotypic diversity. Furthermore, DR is indispensable for downstream tasks in SCD analysis 6 , such as clustering, RNA velocity 7,8 , batch-correction 9,10 , lineage inference, and pseudotime estimation 11,12 . Thus, it is crucial to obtain representations that faithfully represent the complex topologies underlying phenotypic diversity in SCD.
The concept of finding latent signals representative of cellular identity is intrinsically related to defining a finite number of cell lineages and types within a given dataset. There is also broad interest in the somewhat familiar property of DR of frequently revealing the dynamics underlying cellular differentiation when applied to SCD, as numerous cells asynchronously undergoing differentiation trajectories are sampled simultaneously. Given adequate low-dimensional mappings, moving through continuous topologies in the form of connected populations often corresponds to differentiation or stimulus-response processes, a feature explored by trajectory inference (TI) algorithms 11,13 . This property is particularly noticeable in Diffusion Maps (DM), a DR method broadly employed for pseudotime estimation, due to its intrinsic relation to the heat kernel and Markov chains, and is extensively used to model cellular development 6,12,14,15 . The essence of TI through DR is also intrinsically related to a theoretical epigenetic landscape where cell states co-exist through development trajectories (i.e. an extended Waddington's landscape 16 ).
Numerous DR methods have been developed, and many have been applied to SCD. The seminal Principal Component Analysis (PCA) 17,18 algorithm performs linear transformations to find hyperplanes of maximum covariance to denoise redundantly correlated features that are named Principal Components, and is a linear method. More recent and sophisticated algorithms most often obtain non-linear embeddings by assuming that data exists in or close to an underlying lower-dimensional manifold (i.e. the manifold hypothesis). In practice, such methods rely on first building k-nearest-neighbors graphs by computing distances (i.e., dissimilarities) between cells in the high-dimensional space, although random projections and neural-network methods can also be employed. A similarity weighting function (i.e., a kernel) can also be used to balance initially measured distances to their neighborhoods, usually to account for non-linearity, such as in kernel-PCA. The eigendecomposition of such similarity matrices or their graph Laplacian can then be used to explore latent aspects of data geometry, such as in Laplacian networks to find latent signals, and UMAP to optimize layouts. Another popular method, PHATE (Potential of Heat Affinity and Transition Estimation) 6 , uses an adaptation of DM internal Markov chains to learn similarities with a diffusion operator and then a modified version of MDS to optimize a 2-or 3-D layout. Further exceptions are ZIFA 37 and ZINB-WAVE 38 , which are tailored to deal with the dropout phenomenon observed in single-cell RNA sequencing (scRNAseq) -albeit others claim these could be rather informational 39 , thus distorted by these methods. Other alternatives encompass advanced matrix factorization methods, such as non-negative matrix factorization (NMF) 40 , which can also be helpful for transfer learning and data integration. A systematic review of existing methods surpasses the scope of this manuscript; however, we note that most DR methods fall into two broad categories depending on their focus: embedding and layout algorithms, used respectively for finding latent embeddings or optimizing a layout (usually into 2-or 3-D for visualization) from an initialization. UMAP and MDE are corner cases that can also learn latent embeddings through layout optimization to an arbitrary number of dimensions. Embedding methods tend to hold a greater focus on similarity learning, a feature generally overseen in layout methods in favor of sophisticated optimization functions tailored for visualization.
A particularity of methods grounded on discrete differential geometry such as LE, DM, and UMAP is their ability to independently achieve discrete approximations of the Laplace-Beltrami Operator (LBO) 19,22,25,41 , a generalization of the graph Laplacian and an optimal tool to recover information about data intrinsic structure 19,42 . Methods that approximate the LBO are advantageous due to their natural connection to spectral theory and clustering and are intuitively related to the graph smoothing process employed by these methods 19,31,32,41,43 , such as in continuous k-nearest-neighbors (CkNN), which can also approximate higher-order operators (the 0-th Laplace-de Rham Operator is the Laplace-Beltrami Operator) 32 .
Some work has been done in comparing the performance of distinct methods for DR in single-cell data [44][45][46] -yet, these are limited to quality metrics in benchmarking pipelines and lack both the needed tuning of each algorithm to the analyzed data sets and the qualitative assessment of each algorithm's topology preservation of actual ground-truth biological landmarks, such as the cell cycle as a closed logic loop. Even more complex is validating the faithfulness of the learned topologies themselves. Even when downstream hypotheses are experimentally validated, these can hardly be extended as direct proofs of the learned representations per se 33 . Each algorithm's assumptions regarding data structure and distribution are also disturbingly challenging to assess in practice. The universally used PCA holds the relatively strong assumption that hyperplanes (the PCs) can approximate data geometry and can reach misleading results when faced with non-linear structure, missing local information, and introducing linear distortions. The popular UMAP, while more general by loosely assuming that data is distributed in or near a locally connected Riemannian manifold, assumes manifold uniformity and a locally constant metric. An alternative to assuming data topological properties would be to hold only the manifold hypothesis by seeking approximations of the LBO 32 . It would only be needed to assume that data lies in or close to a set of topological manifolds, possibly with non-linear and non-uniform data structure and unknown sampling distribution. In practice, this is true when the relation of neighborhood size S to the total number N of cells approaches zero, that is, when it is small enough for one to reasonably explore differential calculus (i.g. this ratio is , using 3 × 10 −4 45,000 cells and 15 k-nearest neighbors). Assuming such naturally-occurring phenotypic manifolds exist within SCD holds strong connections to the concept of cell types, identities, and cellular differentiation along phenotypically-defined trajectories in the high-dimensional observation-derived topological space. In such an informational space, continuous submanifolds can naturally represent the concept of cellular trajectories, while discrete submanifolds with little to no edges between their graph representation would represent discrete, unrelated cellular populations. Multiple continuous and discrete subspaces must exist in this space to allocate the plethora of cellular diversity of unknown shapes. In essence, such subspaces represent an extension of the Waddington's landscape, as cells progress along multiple discrete differentiation trees 16 . We name this property phenotypic topology, i.e., the geometric structure underlying cellular identities, and use approximations of the LBO to extract it from scRNAseq data.
Multiple algorithms have been developed for this task, and it would be naive to elect a single one for each and every data set. A solution would be to use different algorithms to approximate the LBO, which could naturally converge to a common goal. Qualitatively evaluating how these algorithms perform in representing single-cell data through clustering and visualization and quantitatively scoring global and local structure preservation would allow one to select the best method for their particular data, rather than naively electing a single method a priori.
Grounded on these insights, we dissected the steps of DR (similarity learning, learning a new latent orthogonal basis, graph learning, and layout initialization and optimization) to design TopOMetry(Topologically Optimized geoMetry), a general and modular framework for topological data analysis. Each step independently approximates the LBO except for layout optimization, leading to dozens of possible combinations that converge to the same mathematical operator. These combinations include numerous methods unused in SCD, such as similarity learning with CkNN, building a new orthogonal basis with multiscale diffusion maps, and graph layout optimization with PaCMAP. This framework was implemented in the python package TopOMetry for high-dimensional data analysis. Its companion single-cell package, CellTOMetry, was designed to be user-friendly and function seamlessly within the python single-cell analysis environment. CellTOMetry systematically runs several possible algorithmic combinations for DR, including the default PCA-primed UMAP workflow, and quantitatively evaluates all results, allowing users to interpret which models to best trust and if different high-scoring models converge to the same topological representation. Here, we use it to show how single-cell data is most commonly non-linear and non-uniform by revealing dozens of previously hidden T CD4 subtypes in various public scRNAseq datasets of circulating blood cells from healthy donors and patients with systemic lupus erythematosus, dengue fever, and multiple sclerosis. We also show that such T CD4 diversity can be partly found in the cerebrospinal fluid. This approach escalates to whole-organism data composed of millions of cells, revealing fine-grained differentiation trajectories that would otherwise remain hidden. Quantitatively, we show that TopOMetry's topological models outperform the standard PCA-centered approach for single-cell analysis, defining new directions for developing computational biology tools.

Results
A modular framework for learning topological metrics, graphs, and layouts from high-dimensional single-cell data.
Unlike most methods that assume unknown aspects of data topology, we built a framework to explore sequential approximations of the Laplace-Beltrami Operator (LBO) ( Figure 1A). The LBO represents latent topology when the data is or lies near a manifold. In other words, knowledge about the LBO means knowledge about data topology 19,22,32 -nevertheless, its approximation for discrete, finite, real-world datasets can be challenging. Diffusion Maps (DM) 22 , Laplacian Eigenmaps (LE) 19 , continuous k-Nearest-Neighbors (CkNN), and UMAP's construction of fuzzy simplicial sets (FSS) have explored LBO approximations with distinct -yet related -mathematical and computational means, thus being useful for different tasks, such as metric-learning (CkNN, diffusion harmonics, FSS), denoising (DM, LE) and projection (spectral decomposition of the graph Laplacian, FSS cross-entropy optimization). To recover high-dimensional topology with high resolution without preference for a particular model, we adapted these previous approaches into a comprehensive framework that aims to obtain discrete approximations of the LBO at each step of the dimensionality reduction procedure with different algorithms ( Figure 1B). From a given high-dimensional data matrix, TopOMetry first learns a topological metric with either multiscale self-adaptive diffusion harmonics (DM), CkNN, or FSS (Fig. 1B) to build a topological graph from data that approximates the LBO; then, the learned metric is used to build a first, non-linear topology-preserving orthogonal basis with an eigendecomposition (Fig. 1C)  Processed single-cell data from a given modality in the form of a data matrix composed of features per cell (i.e., gene expressions per cell) is fed to the algorithm. Its main advantages are looser assumptions regarding data topology than current standards, focusing on approximating the Laplace-Beltrami Operator (LBO), which can be done with different algorithms at each step. After learning latent topology (i.e., the extended Waddington's landscape), it can be visualized with any existing algorithms (t-SNE, MAP, PaCMAP, TriMAP, NCVis, and MDE are the current options, but more can be easily done externally due to TopOMetry modular design). (B) First, cell-cell similarity metrics are learned with algorithms that extract data topology; currently available algorithms are diffusion harmonics, continuous k-nearest-neighbors, and fuzzy simplicial sets. The advantage of these methods is adaptively estimating affinities based on neighborhood sparsity. It has been independently shown that these metrics approximate the LBO. (C) Next, spectral decompositions of these graph Laplacians are used to obtain a latent lower-dimensional embedding -the dataset intrinsic dimensionality can also be indirectly estimated by the number of non-negative eigenvalues (up to the 64-bit precision limit). (D) A second topological graph is obtained by applying the metrics used in (B) to the topological basis learned in (C) -note that different metrics can be combined. (E) The topological graph is used to obtain a spectral layout initialization, which is then optimized to match the graph by a layout optimization algorithm. particular advantage of TopOMetry models is that they frequently have discrete eigenspectra: as the precision of eigenvalues and eigenvectors reach increasingly high values, the bit limit is frequently hit, which causes null or negative values to emerge as one reaches computational precision. This property and visualization of the eigenspectra can be used to obtain an indirect, empirical estimation of data intrinsic dimensionality (Fig. 1C). This is a significant advantage over current approaches in which users need to select a certain number of dimensions to use in downstream tasks. Next, a further refined topological graph approximates the LBO of this new orthogonal topological basis (Fig. 1D). A lower-dimensional initialization that preserves global structure 47 is then generated with a spectral decomposition of this topological graph. Finally, the layout can then be optimized using either an adapted version of UMAP (MAP -Manifold Approximation Projection) that minimizes the cross-entropy between the topological graph and the embedding or any other graph layout optimization algorithms; TopOMetry built-ins are tSNE, MAP, MDE, TriMAP, NCVis and PaCMAP (Fig. 1E). All steps were designed to approximate the LBO independently, except the graph layout optimization, which minimizes the differences between the learned topological graph and its lower-dimensional layout and should ideally be executed with multiple algorithms for visual comparisons. TopOMetry latent orthogonal bases can also be used in downstream analysis of SCD to alleviate the computational burden (instead of PCA) while providing denoised phenotypical signals, meaning they can be used to enhance the results of existing methods that rely on latent, denoised signals for clustering, RNA velocity estimation, batch correction, and multi-omics data integration.
In summary, by investigating data heterogeneity at multiple levels with sequential approximations of the LBO with topological metrics, graphs, and spectral layout initialization followed by optimization with state-of-the-art algorithms, TopOMetry wraps previous advances in DR into a unified framework focused on preserving data topology. These steps are all illustrated in Figure 1, and the differences to the current workflow are highlighted in Suppl. Identifying the non-linear and non-uniform topology of circulating T CD4

cells.
We first explored our algorithm in a famous real-world scRNAseq experiment: the pbmc68k dataset from 10X Genomics, which contains the transcriptome from~68,000 peripheral blood mononuclear cells (PBMCs) from healthy human donors. pbmc68k is an exhaustively explored dataset, commonly used for teaching single-cell analysis and developing new tools, and its clustering results, PCA embeddings, and t-SNE and UMAP layouts are considered well known 2A-C), a feature we associate with the assumptions of data linearity and uniformity being broken for this system. Furthermore, we show that layout algorithms such as t-SNE, PHATE, PaCMAP, and TriMAP fail to reveal such heterogeneity when performed on PCs or directly on the data matrix (Suppl. Fig. 2A-F). Disturbingly, priming embeddings with PCA seems to introduce linear distortion, creating artificial clusters that do not have any specific marker genes (Fig 2B-C, G). However, these algorithms perform similarly in representing T CD4 heterogeneity when primed with either the diffusion or fuzzy orthogonal bases and their topological graphs (Suppl.  Fig. 2G-R), pointing towards the convergence of these two models to the LBO. As will be discussed further on, the current implementation of the continuous model is computationally inefficient for large datasets (>15,000 cells) and therefore was excluded from most analyses.
An empirical explanation for the distortion exhibited by current approaches is the lack of meaningful latent information extracted by PCA after a dozen or so dimensions (Suppl. Fig. 3A, B), a feature also related to data non-uniformity and non-linearity. Conversely, TopOMetry diffusion and fuzzy models lead to uncorrelated orthogonal components that keep extracting latent information until a numerical bit limit is reached at relatively similar number of dimensions (Suppl. Fig. 3A, D). Jointly, this points to topology non-linearity. Furthermore, computing layouts directly from data leads to severe underappreciation of the heterogeneity of T CD4 substates, which is also related to most algorithms' assumption of topology uniformity, as, in this scenario, the linearity and uniformity assumptions are violated by a classic biological hallmark encoded within the data geometry: T CD4 diversity.
T CD4 cells are considered uniquely diverse and heterogeneous and are central for the maintenance of the immune repertoire and the regulation of the immune response. This prior-known ground-truth biological diversity is encoded in data by numerous sharp, non-uniform, and non-linear variations in data structure learned by TopOMetry and can be directly visualized as an intricate 'blooming tree' of cellular subpopulations. To assess if these new models effectively led to the identification of cell clusters defined by exclusive, specific marker genes, we harnessed the clusters found with the Leiden community detection algorithm from SCANPY default affinity graph (PCA-primed fuzzy simplicial sets) and with the four topological graphs. Clusters' marker genes were then surveyed with logistic regression 48 .
Coincidently with the visualization and clustering results, we show that the standard PCA-derived approach fails to identify exclusive, specific marker genes across distinct T CD4 subpopulations (Fig. 2G, Suppl. Fig. 4A-B). In contrast, CellTOMetry successfully identifies around a hundred novel T CD4 subpopulations defined by very specific marker genes (Fig 2H, Suppl. Fig. 4A, C-F). Collectively, these observations provide first-of-its-kind evidence that phenotypic topology (i.e., SCD underlying structure representing biological information) is potentially non-linear and non-uniform -assuming otherwise a priori could represent a dangerously overseen bias in existing approaches. It also sheds light on a previously unappreciated degree of T CD4 cellular heterogeneity.

Circulating T CD4 diversity is present across diseases and in the cerebrospinal fluid
To evaluate whether the reported T CD4 diversity is reproducible and if such diversity could be detected in human disease, we harnessed three additional PBMC datasets from donors with 1) dengue-virus (DENV) infection (dengue72k) 49 , 2) systemic lupus erythematosus (SLE; lupus61k) 50 and 3) multiple sclerosis (MS; ms47k) 51 . For the lupus61k and the ms47k datasets, data from healthy controls were also available, and data was available both for PBMCs and for Additionally, we show that this new level of definition allows identifying highly-specific marker genes across models and datasets (Suppl. Fig. 8, Suppl. Fig. 9, Suppl. Fig 10).
Interestingly, for the dengue72k dataset, T CD4 marker genes corresponded to unique TCR genes, effectively demonstrating that individual clusters can be related to specific clonal lineages. To the best of our knowledge, this is the first analysis to identify T CD4 clones from gene expression data alone, without considering V(D)J sequences or TCR-sequencing data.
These findings confirm a strikingly higher transcriptional diversity of circulating T CD4 cells than previously thought (approximately by an order of magnitude, with a total of approximately a hundred conserved T CD4 states across datasets). The similarity between the general structure

Figure 3 -T CD4 diversity is consistent across diseases and partly present in the CSF.
Single-cell RNA sequencing (scRNAseq) data from the three discussed datasets: dengue72k, lupus61k, and ms47k. Data were preprocessed accordingly to the current standard workflow, and clustering was performed with the Leiden algorithm on either the graph obtained with fuzzy simplicial sets from the results of PCA (left columns, standard approach) or the diffusion graph from the diffusion basis (right columns, topological approach). ms47k was filtered to contain only clusters annotated as T cells prior to analysis. results between models and datasets is remarkable, and suggests that the most if not all T CD4 clusters identified are clonal lineages. Additionally, we show that such striking diversity is also present in the cerebrospinal fluid of healthy donors and MS patients. These results highlight the significance of T CD4 transcriptional heterogeneity to homeostasis and disease, and critically exemplify the limitations imposed by the currently held assumptions.

Phenotypic topology representations enable high-resolution trajectory inference and visualization of the cell cycle.
To further explore representations of ground-truth biology within single-cell data, we harnessed ungated bone marrow scRNAseq data from the Human Cell Atlas project, in which both PBMCs and hematopoietic stem cells (HSC) are classically described ( Fig. 4A-C). When exploring these data with CellTOMetry, a similarly unreported transcriptional heterogeneity of T CD4+ cells was found ( Fig. 4D-F), holding a strong graphic similarity to those observed in the pbmc68k, dengue72k, sle61k, and ms47k datasets across models (Suppl. Fig. 11). Consistently with these results, we found that the topological models could identify significantly more T CD4 and HSC subtypes than the standard approach ( Fig. 4G-H). Similar to previous results, TopOMetry led to the identification of a more specific gene signature for discrete T CD4 subpopulations across models (Suppl. Fig. 12), a feature exemplified with several marker genes (Suppl. Fig. 13).
Interestingly, the topological approach was able to map clusters expressing MKI67 (a mitotic marker) and CD34 (an HSC marker) to closed-loop trajectories that could arguably represent the structure of the cell cycle itself (Fig. 4I). To validate this hypothesis, we scored cells along the cell cycle using a standard reference 52 to show that variations in a cell's learned position within the cell cycle correspond to smooth topological transitions in the detected loops and trajectories ( Fig. 4I). These results agree with recent work on cell-cycle topology 53,54 ; with canonical 55 views on the very concept of cellular types (i.e., topology discreteness), identities (i.e., topological branching), and states (i.e., topological neighborhoods); and with the proposed unifying concept of phenotypic topology. By noticing known biological hallmarks are missing from standard embeddings and yet present in some topological layouts, we argue that TopOMetry models and layouts effectively recover phenotypic topology. This association between canonical gene signatures, dynamic biological processes, and the represented phenotypic topology is fundamental during the inspection of results and their interpretation to generate new hypotheses.
TopOMetry scales to whole-organism data and reveals fine-grained developmental hierarchies during mouse organogenesis.
The introduction of new assays that can profile hundreds of thousands of single-cells drives future analysis to ever-growing datasets, often comprising whole organisms and millions of cells -thus, proposed algorithms must be scalable to such massive information. To evaluate TopOMetry scalability to large data and its usefulness to represent these as detailed atlases of cellular hierarchies, we harnessed 1.3 million high-quality pre-filtered single-cell transcriptomes from the mouse embryo during organogenesis 56  diffusion model, which performs similarly to state-of-the-art PCA and SVD (Single-Value Decomposition) implementations (Suppl. Fig. 14B). These results elect the diffusion model as the most computationally efficient TopOMetry approach for approximating the LBO. It is also shown that the diffusion model computation time escalates linearly with the number of analyzed samples (cells), even when considering up to a million cells (Suppl. Fig. 14C). It still performs slower than PCA and SVD, but this was largely expected, given the higher complexity of operating on large kNN graphs than performing simple linear operations. We note that this difference is primarily due to the computation of multiple kNN graphs, thus dependent on the performance of nearest-neighbors (NN) algorithms, a prolific subfield of study 57 . The presented version of TopOMetry uses Hierarchical Navigable Small Worlds (HNSW) 58,59 , a current top performer on speed and accuracy among approximate NN methods, and also offers a flexible and user-friendly module for its use and prediction of its accuracy. Despite the accounted limitations, we note TopOMetry runtime and memory usage was amenable to commercial laptops when analyzing up to ~200,000 cells, thus being accessible to most researchers.

Quantitative assessment of embedding quality.
Objectively evaluating the performance of DR algorithms is a problem of its own due to the inevitable need for assumptions on underlying data structure and distribution (i.e., what is the reference if the data structure is unknown?). Nevertheless, guided by recent studies 28,46 , we used general quantitative metrics to assess the preservation of global and local structures (Fig. 6A)  PCA, based on previous work 28 , as it intrinsically preserves most of the global structure by design. To evaluate local structure preservation, we designed a score based on Spearman's correlations between the pairwise geodesic distances on the high-dimensional ambient space and the low-dimensional latent orthogonal bases and layouts. This scoring system is independent of TopOMetry models. It is important to note that, in this system, scores are computed using the input data for each algorithm (e.g., PCs for standard layouts such as UMAP, the diffusion basis for diffusion-based layouts) to isolate the preservation of the local structure of the orthogonal basis from that obtained by the layout technique (Fig. 6A). It is also important to note that the computations involved in evaluating the results of DR are far more expensive than performing DR itself. The evaluation pipeline included in TopOMetry demands less memory, but is only slightly faster than current methods. Computing scores for MOCA, for instance, was unfeasible in our computational settings.
We applied this scoring system to all exposed datasets, except for the MOCA dataset.
Regarding the global score, we found that all orthogonal bases preserve nearly all global structure present in the high-dimensional data ( Fig. 6B-G) and that all layout methods also perform similarly given a PCA or spectral initialization (Suppl. Fig. 15). The main differences between methods were related to changes in the local score, for which the diffusion basis was the top performer, and PCA was the worst method across all datasets (the fuzzy and continuous basis, when computed, varied). This was expected given the locality-preserving nature of TopOMetry's models and quantitatively confirms their superiority to the standard PCA approach.
The graphs obtained from these bases were also evaluated for local structure preservation (Suppl. Regarding the visualization (layout) algorithms, we found MAP, t-SNE, and PaCMAP to be the top performers when used with topological models, and MDE and TriMAP to be the best alternatives when used with PCA, although this varied across datasets (Suppl. Fig. 15). The high variability between the performance of different layout algorithms across datasets and orthogonal bases highlights the need to obtain, inspect and compare multiple visualization results before interpretation and hypothesis generation. This is opposite to the current practice in single-cell analysis, during which a single orthogonal basis is used (PCA), and then only one or two methods are used for layout -usually t-SNE or UMAP, thus reinforcing the need to carefully reassess currently held assumptions and standards.

Discussion
Here, we employed existing state-of-the-art algorithms to construct a framework that efficiently extracts meaningful information from data through the approximation of the Laplace-Beltrami Operator (LBO). Despite its known power to describe high-dimensional data, the approximation of LBO has remained unexplored in single-cell data analysis. We show that by encoding data's phenotypic topology through different metrics, bases, and graphs through sequential LBO approximations, it is possible to identify several more cellular subtypes in single-cell data that are defined by the selective expression of marker genes, as seen in the previously undescribed and rather prodigious T CD4 transcriptional heterogeneity we herein report. Furthermore, LBO's geometrical properties are harnessed throughout the exposed examples to represent known biological hallmarks faithfully and intuitively into graphic topological structures, such as the cell cycle as a closed-loop. The underlying geometry of such biological hallmarks is referred to as the phenotypic topology of single-cells.
From a knowledge representation perspective, these remarkable results point to the qualitative superiority of dissecting the steps of dimensional reduction (i.e., metric and graph learning, layout optimization) to approximate the LBO at each step. We acknowledge that only a fraction of the employed theory is novel -yet, by standing on the shoulders of researchers who developed earlier methods, we show that a carefully designed framework that aims to obtain sequential approximations of the LBO can result in striking new biological insights. The fact that visual representations carry properties from the graph layout optimization algorithms is also a notable feature and an argument contrary to the consistent superiority of a given optimization method over others and to the choice of a single method throughout all analysis (i.e., UMAP vs. tSNE)instead, the choice of which similarity metrics, orthogonal bases, and neighborhood graphs to feed these optimization methods with appears to be far more critical. For example, PCA seemingly introduces linear distortions in data throughout the analyses, which can be problematic as most available toolkits perform it by default. Additionally, PCA also fails to denoise biological signals as linearly uncorrelated latent dimensions, leading to poorly-performing neighborhood graphs compared to the LBO-approximating orthogonal bases and graphs, which preserve most of the global and significantly more of the local information in high-dimensional spaces. By allowing control of each step of DR, TopOMetry empowers scientists to obtain biological insights from the visualization of common patterns and structures across different models and visualizations. We hope that these insights may prove helpful in the design of future machine-learning frameworks.
Our results also impact the life sciences as a whole. A fundamental definition in biology is what is a cell type. By introducing the concept of phenotypic topology, we tie the concepts of cell types, identities, and states to differential geometry operators that mathematically describe discontinuous, contiguous, or fine-grained cellular populations in the form of latent topological signals across a set of phenotypic manifolds. This idea makes the task of trajectory inference (TI) as simple as walking through topology-oriented graphs and embeddings, as had been widely suggested. Even though the graph-walking concept is in accordance with current TI algorithms and reported trajectories, most approaches rely on PCA and UMAP representations to generate results. Thus, we point out that these could also be further enhanced by using topological representations. It is also noticeable that most TI methods struggle with the coexistence of discrete, continuous, tree-shaped, and cyclic topologies within the same dataset and often hold strong assumptions on the data structure (e.g., Monocle2 60 assumes data is entirely tree-shaped).
The contribution of the proposed models to clustering is even more direct and leads to the efficient identification of cell types, developing identities, and states.
Although not yet

Data acquisition, availability, and processing
pbmc68k Single-cell RNA sequencing (scRNAseq) data from peripheral blood mononuclear cells (PBMCs) was obtained from https://www.10xgenomics.com/resources/datasets/. After filtering, 66,834 cells were retained, and the top 2,318 highly variable genes were selected for downstream analysis. The selection of highly variable genes was performed with default parameters. Data were processed and analyzed accordingly to the standard workflow described in the subsection 'Standard data processing and analysis'. TopOMetry was executed with the parameter n_eigs set to 150, and the remaining parameters were set to defaults.
dengue72k scRNAseq data from PBMCs from two dengue-infected subjects were obtained from a previously published study 49 . After filtering, 72,412 cells were retained, and the top 1,924 highly variable genes were selected for downstream analysis. The selection of highly variable genes was performed with default parameters. Data were processed and analyzed according to the standard workflow described further in the 'Standard data processing and analysis' subsection.
TopOMetry was executed with the parameter n_eigs set to 300, and the remaining parameters were set to defaults. were processed and analyzed accordingly to the standard workflow described further in the subsection 'Standard data processing and analysis', and the top 2,358 highly variable genes were selected for downstream analysis. The standard workflow was performed with the following modifications: 1) the number of PCs was set to 300, and UMAP parameters were customized (min_dist was set to 0.3, spread was set to 2, alpha was set to 2 and maxiter was set to 1000).
This choice of hyperparameters tends to make UMAP represent more details and spread clusters more evenly across the embedding when analyzing a large dataset such as this. Similarly, for fairness of comparison, TopOMetry was executed with the following modifications: 1) n_eigs was set to 600; 2) base_knn and graph_knn were both set to 10; and 3) the hyperparameters of MAP layout optimization were modified to exactly match the UMAP hyperparameters used in the modified standard workflow (min_dist was set to 0.3, spread was set to 2, alpha was set to 2 and maxiter was set to 1000).

Standard data analysis
Count matrices were used to create AnnData objects within SCANPY 62 , which were filtered to recommended thresholds. Data were then processed with the following workflow: 1) first, data were normalized and variance stabilized by log normalization; 2) next, highly variable genes (HVG) were identified and kept for downstream analysis; 3) HVG-data was then centered and scaled and; 4) dimensionality reduced to 100 dimensions with PCA; 5) neighborhoods were computed using UMAP fuzzy simplicial sets as per SCANPY's defaults; 6) the neighborhood graph was projected with UMAP and t-SNE; 7) the neighborhood graph was clustered with the Leiden community detection algorithm. These steps correspond to the workflow applied in most manuscripts in which single-cell data is analyzed, being the norm in the field and regarded as the current gold-standard 63 . Thus, we refer to these as the standard workflow throughout the manuscript. Unless otherwise stated, this was performed with default parameters within the SCANPY toolkit for single-cell analysis.
Cluster marker genes were found with logistic regression 48 by using scanpy's built-in functions. The maximum number of iterations was increased until convergence, i.e., until error messages claiming results did not converge with the maximum number of iterations stopped being triggered. Anecdotally, we noticed that the number of iterations required for convergence increased with the number of cells and detected clusters, which is rather intuitive. The maximum number of iterations used for each dataset is explicitly defined in our analysis code that can be freely reproduced.

Topologically Optimized geoMetry
Fundamentally, TopOMetry is based on the discrete approximation of the Laplace-Beltrami Operator (LBO), formally defined as the divergence of the gradient: The LBO is a powerful descriptor of data geometry and has been approximated by some previous approaches, namely Diffusion Maps (DM), Laplacian Eigenmaps (LE), continuous-k-nearest-neighbors (CkNN), and fuzzy simplicial sets used for metric learning in UMAP. TopOMetry explores the power of each algorithm by combining them into a comprehensive framework that iteratively approximates the LBO during metric learning and affinity estimation, decomposition of a lower-dimensional basis, and building a new graph from such a basis. Layout optimization algorithms do not intrinsically harness the ability to approximate the LBO, but they can be offered a LE initialization, which powers them with such information. Likewise, most algorithms operate on iteratively constructing a layout that minimizes some differences to the informed high-dimensional graph, corresponding to an LBO approximation in TopOMetry's models. It is of note that this carries the manifold assumption, as on a Riemannian manifold, the LBO entirely determines the Riemannian metric -and therefore, all topological properties of the manifold.
In an overview, TopOMetry dissects the problem of graph-based dimensionality reduction (DR) into four steps, aiming to approximate the LBO at each step. The first consists of learning k-nearest-neighbor (kNN) similarities from distances (i.e., metric learning) and can be performed The widespread use of DM in single-cell biology associates the proposed concept of phenotypic topology with existing trajectory inference and pseudotime estimation frameworks.
DM has been shown to describe cellular state dynamics in scRNAseq data and is useful for pseudotime estimation and dynamical modeling of cellular differentiation. However, their direct use in visualization is unsuitable for complex high-dimensional data such as those generated by single-cell assays since its results remain high-dimensional, rendering poor visualizations (similarly to PCA). Accordingly, it has been previously shown that multiple diffusion components can be needed to encode data heterogeneity 6,12,15 , each encoding a different aspect of data structure, i.e., distinct cellular trajectories, and dozens or hundreds of diffusion components could be needed to describe large-scale, whole-organism single-cell data. We noticed that these high-dimensional, phenotypic encoding components that approximate the LBO can be seen as lying in a phenotypic manifold of their own, being thus suitable for learning new graphs and optimizing layouts. Because of this property, lower-dimensional orthogonal bases that encode phenotypic topology can be obtained with DM or with Laplacian Eigenmaps built from CkNN or Similar to PCA preprocessing, a new similarity graph needs to be constructed from the learned topological bases to perform clustering and optimizing layouts (all layout optimization algorithms operate on similarity graphs). This graph is also needed to obtain a further LE of the learned basis that can be used as initialization for layout algorithms 47 . For this step, the three previously used metric-learning algorithms are further employed, rendering nine possible combinations of topological models, referred to by their basis prefix and graph suffix (e.g., fb_diff, db_cknn).
Noticeably, the final, layout optimization step is independent from the adopted upstream topological approach; thus, established (tSNE, MAP) and recent (densMAP, PaCMAP, TriMAP, MDE) methods can also be employed by the user to explore different visualizations of the same data structure, resulting in a family of topology-preserving layouts with different metric models for obtaining lower-dimensional basis and graphs. Within TopOMetry, the layout optimization method is added as a suffix to graph names (e.g. db_diff_MAP).

Distances estimation
One of the most fundamental aspect of machine-learning algorithms dealing with high dimensional data is the curse of dimensionality: in higher-dimensional spaces (HDS), the capacity of distance metrics to discriminate dissimilarity decays, i.e. the measured distance of the most similar samples tends to approach that of the most dissimilar ones as dimensionality increases 47 . The task of accurately and efficiently computing distances between samples in a large HDS is a challenge by itself, and several algorithms and tools have been developed to address it 64 . The most commonly used metric in single-cell data analysis is the Euclidean distance, followed by the cosine and the manhattan distances -yet, in principle, any metric distance could be applied. Recent benchmarks 65,66 have achieved initial insight on the suitability of different metrics to estimate distances between single-cells, pointing towards correlation metrics as a favorable choice, although these were limited to only some single-cell RNA sequencing datasets in PCA-based workflows, and further studies are needed. This guided the choice of the cosine distance metric as a default in TopOMetry and in all presented analysis. Of note, the distance metric used in learning an orthogonal basis and in learning a new graph from it can differ, and assessing which particular combinations work best to recover latent information is a task yet to be done.
Guided by a recent benchmark between distance estimation algorithms 57 , we employed the Hierarchical Navigable Small Worlds (HNSW) 58,59 algorithm to power the computational efficiency of finding k-nearest-neighbors of large single-cell datasets. HNSW has been shown to achieve fast and near-optimal results while being useful to compute a variety of distance metrics (e.g. euclidean, cosine, manhattan). Similarly to other non-linear algorithms, the first step of the diffusion basis is to find a k-nearest-neighbors graph (kNN) graph. As previous studies have shown that using only highly variable features (HVF) for computations increases the power of single-cell analysis 63,67 , we estimated distances from the cell by HVF matrix for the shown experiments. As a reference,~2,000 genes are generally considered a reasonable number of HVFs in single-cell RNA sequencing data analysis.

Diffusion Harmonics
The implemented diffusion operator includes many recent theoretical and empirical advances in diffusion harmonics. Of note, 1) an anisotropic diffusion kernel is used to individualize the diffusion metric to a cell-by-cell basis 21,22 ; 2) the anisotropic kernel is adapted by each cell neighborhood density, being normalized by each cell's distance to its median k-nearest-neighbor 14,15,68 ; 3) neighborhood search expansion, a process in which the original kNN graph is expanded to additional neighbors using a data-derived estimate on neighborhood sparseness 15,32 . Additionally, we introduce an adaptive kernel decay rate, in which the rate of similarity learning is adaptively adjusted by each cell neighborhood density instead of by a fixed hyperparameter as otherwise proposed 6 .
Guided by previous approaches, we first construct a -nearest-neighbors graph for a given dataset, with N samples (cells) and M measures (genes), using a given distance metric as a local similarity measure. Let be a Riemannian manifold with unknown sampling ℳ⊂ℝ distribution (not necessarily uniformly distributed) in ambient dimension n. For a given sample (the manifold hypothesis -data is assumed to be contained in the proximities ∈ × ⊂ ℳ of the manifold), consider to be the ordered set of its k-nearest-neighbors under some ( ) distance metric : ( , ) ( ) = { (1) , (2) ,..., ( 2 ) , ..., ( )

} | < , ⊂ ℕ
Given that the vector of sample observations , a scaling factor corresponding to the σ distance of sample i to its median nearest-neighboring sample(s) in the high-dimensional space is defined: ) Drawing from previous work on adaptive bandwidth diffusion kernels, the following kernel function could be initially used to convert distances into affinities: In which setting yields a differential operator that encodes the probabilities of reaching = 1 a given sample from another given sample and approximates the Laplacian-Beltrami Operator independently of the data sampling distribution:

Continuous k-nearest neighbors
Continuous k-nearest neighbors (CkNN) is a recently introduced 32 method that constructs a single unweighted graph from which the underlying manifold's structure (i.e. topological information) can be retrieved, given any compact Riemannian manifold. In its introductory manuscript, it was also shown that CkNN is the only unweighted graph construction strategy for which the graph Laplacian converges to the Laplace-Beltrami Operator (LBO). Conceptually, CkNN points to the superiority of consistent homology over persistent homology. However, differently from diffusion harmonics, LE and fuzzy simplicial sets, this is achieved through an unweighted graph rather than through a weighted graph. density estimate, similarly to the approximation of geodesics with fuzzy simplicial sets and diffusion harmonics. In practice, this captures the natural multiscale topology of data, wielding smooth geometries where data is sparsely sampled and fine-grained details where data is densely sampled.

Fuzzy simplicial sets
The application of 1-skeleton fuzzy simplicial sets as topological metrics on TopOMetry draws entirely from previous work, particularly from UMAP 25 theory and the work of David Spivak 69 . Our innovation is represented by the use of these metrics to construct new orthogonal bases that capture latent topology, and subsequently to construct topological graphs prior to layout optimization. When using this method to learn metrics from some data , it is × assumed is uniformly distributed on a locally connected Riemannian manifold with a locally ℳ constant metric . UMAP originally addresses the uniformity assumption creating a custom distance for each sample by normalizing distances with respect to their k-th nearest-neighbor to effectively approximate geodesics. To merge each of the discrete metric spaces generated per sample into a global informational space, the metric spaces are converted into fuzzy simplicial sets. A consensus metric can then be obtained by taking a fuzzy union across all metric spaces encoded as fuzzy simplicial sets, rendering a single fuzzy simplicial set which is granted to capture relevant topological metrics on . The Laplace-Beltrami Operator acting on can then ℳ ℳ be approximated from the normalized laplacian of this fuzzy graph representation.
The aforementioned approach is grounded on solid category theory and no major modifications to the fuzzy simplicial set algorithms defined and implemented in UMAP were made, aside from improving its computational performance with approximate-nearest-neighbors with HNSW. For brevity, we direct the reader to UMAP's 25  form an adjunction, translating between topological spaces and simplicial sets -for this, a categorical presentation of fuzzy sets was outlined.
A fuzzy set can be defined as a set for which membership is not binary -rather, it is a fuzzy property represented by continuous values in the unit interval (with , and = (0, 1] ⊆ℝ with topology given by intervals of the form ). This allows one to define [0, ) ∈ (0, 1] presheafs: a presheaf P on is a functor from to . A fuzzy set is a presheaf on such that ∆ all maps P are injections. P is the set of all elements with at least membership strength. With this concept in mind, the category of fuzzy sets is defined as the full subcategory of sheaves on spanned by fuzzy sets. Consequently, defining fuzzy simplicial sets (FSS) corresponds to considering presheaves of to be valued in instead of . The ∆ objects of the category of FSS are thus functors from to , and its morphisms are ∆ given by natural transformations. Conversely, FSS can be viewed as sheafs over the product topology , allowing one to apply them to a broad category of extended-pseudo-metric ∆ × spaces. This will be useful as the initial geodesic normalization step involves creating a unique metric space for each sample.
Extended-pseudo-metric-spaces are defined as sets and maps ( , ) , being objects of the category , which has non-expansive maps :

Learning a latent basis
Once an adequate affinity graph has been constructed, obtaining it's weighted graph Laplacian is trivial, albeit extremely useful. The idea of using the spectral decomposition of graph Laplacians is not new in machine learning 19

Multiscale Diffusion Maps
Besides providing adaptations to the diffusion operator described in the Diffusion Harmonics subsection, we adapted the diffusion maps algorithm for better resolution through multiscaling. This includes component multiscaling, which avoids setting a fixed number of t diffusion steps hyperparameter, rendering multiscale diffusion maps. Instead, diffusion distances are weighted across all possible timescales and jointly considered to build multiscale distances 15,41 .. Although our adaptations add computational performance and discriminative power to the DM algorithm, core steps remained unaltered in order to preserve this key property.
Given a symmetric differential operator with a defined number of steps and a given estimate L of the intrinsic dimensionality (number of non-trivial eigenvalues) of , one can ℳ achieve the spectral decomposition: In which encodes the diffusion distance between samples and , along the ϕ ( ) = ϕ( ) 2 We next multiscale distances to account for all possible scales, rendering multiscale diffusion distances: Which can be reformulated as: Resulting in an orthogonal basis that is robust to the choice of L components and density differences.

Laplacian Eigenmaps
Laplacian Eigenmaps (LE) is a seminal geometrically-inspired method to perform DR through spectral decomposition that drew attention due to its natural connections to clustering and locality-preserving properties 19 . In their manuscript introducing LE 19 , Belkin and Niyogi explore the concept of computing a neighborhood graph and harnessing its Laplacian to obtain a lower-dimensional representation of data that preserves local information. They explicitly approximate the LBO by using the weighted Laplacian of the neighborhood adjacency graph. To appropriately choose weights, the connection between the LBO and the heat kernel 70 was originally explored to learn a similarity metric in a geometrically principled fashion. As a consequence, the new orthogonal basis provided by solving the sparse eigenvalue problem approximates the eigenmaps of the LBO. Interestingly, they also address the connection of preserving local information to clustering, whereas global methods such as Isomap 29 that try to preserve pairwise geodesic distances do not wield similar results to that of clustering algorithms.
In summary, given a nearest neighbors graph built from samples, a given similarity metric is learned (originally, LE similarities are learned with a standard diffusion kernel on a k-NN graph; here, we extend it to those learned with CkNN and fuzzy simplicial sets). This leads to an affinity matrix where each element corresponds to a weighted

Topological Graphs
The construction of new, denoised orthogonal bases from which one can learn new similarity graphs is a hallmark of TopOMetry. These bases are still highly-dimensional, and ideally, each component is associated with a specific cellular identity topology. This property leads to the poor performance of DM and LE when used exclusively for visualization: these algorithms are optimized to encode latent information from high-dimensional data in the form of denoised dimensions; however, the number of dimensions on which data may be located is unknown and definitely far beyond only two or three. Conceptually, we consider these bases as high-dimensional spaces encoding phenotypic topology, and aim to encode them into new, refined graphs. For that task, the aforementioned manifold learning techniques (diffusion harmonics, fuzzy simplicial sets, continuous k-nearest neighbors) are harnessed to learn new similarity metrics on top of the extracted latent information. These methods are employed within TopOMetry to learn topological graphs identically to when learning topological metrics directly from data; however, in this step, computations are performed on top of the denoised bases. This naturally leads to the efficiency shortcoming of needing to compute additional neighborhood graphs; however, the computational cost of this second neighborhood search is negligible compared to the first, full-data neighborhood search. From a practical point of view, this step corresponds to the neighborhood search employed in several algorithms and pipelines (i.e. UMAP, PaCMAP, Scanpy, Seurat), with the difference that similarity metrics are learned with multiple topological algorithms and that these metrics are learned from a topological orthogonal basis rather than naively from Principal Components.

Graph layout optimization
Graph layout optimization techniques are methods that aim to optimize a lower-dimensional representation (usually in two or three dimensions, for visualization) that preserves most of some properties from the original data by minimizing a given cost function.
Differences between these methods are mostly ough to the different methods to learn similarities from distances and to the number and design of cost functions 31 . The seminal t-SNE method, for instance, employs t-Student scores to estimate similarities between samples, and minimizes the Kullback-Leibler divergence between the high-dimensional and lower-dimensional representations. UMAP, which followed t-SNE as the de facto gold standard for visualization, learns similarities with fuzzy simplicial sets (FSS), and then minimizes the (fuzzy) cross-entropy between the FSS representations of the high and lower-dimensional topologies. PaCMAP has shed additional light on how these methods work by analyzing the asymptotics of cost functions and empirically testing and defining the characteristics of efficiently designed cost functions; it also introduced the use of more than one cost function during the layout optimization process SNE aims to find a low-dimensional representation that minimizes the difference between these two probability distributions by considering the sum of all Kullback-Leibler (KL) divergences between the high-and low-dimensional representation of each data point.
Considering as the conditional probability distribution of samples in their ambient high-dimensional space and its low-dimensional counterpart, the sum of KL divergences for all samples is then minimized through gradient descent, with a cost function defined by: The gradient of of this cost function has the form: And its update update with a momentum term (to avoid local minima) is given by: ) t-SNE's superiority to SNE is due to two differences: first, the implementation of a symmetrized version of the SNE cost function; second, the use of heavy-tailed distributions to address the crowding problem. Significant advances have also been made regarding the optimization of the t-SNE cost function to avoid local minima and increase performance. To symmetrize the cost function, instead of minimizing the sum of all KL divergences between high-dimensional and low-dimensional probability distributions, a single KL divergence between the joint probabilities distributions P (high-dimensional) and Q (low-dimensional) is used: In t-SNE, a Student t-distribution with one degree of freedom (the same as a Cauchy distribution) is used as a heavy-tailed distribution for the low-dimensional mapping. This defines the joint probabilities : The gradient of the KL divergence between P and Q (computed using the above equation) is given by: With an update with momentum identical to that of SNE.
There are numerous modifications and adaptations of the t-SNE algorithm (e.g. heavy tailing, density preservation, perplexity combination), and these exceed the scope of this manuscript.
The version of t-SNE included in TopOMetry relies on the MulticoreTSNE 72,73 package.

Uniform Manifold Approximation and Projection (UMAP)
The similarity metric when using the 'fuzzy' algorithm in TopOMetry corresponds to this 1-skeleton of the fuzzy simplicial sets, that is, a fuzzy graph (i.e. a fuzzy set of edges to generate visualizations, the graph layout is optimized by minimizing the cross-entropy between the high-dimensional and low-dimensional fuzzy sets in terms of the ( , µ) ( , ) reference set : TriMAP is a graph layout optimization method focused on preserving global structure.
Differently than nearly all other methods, instead of relying on pairwise (dis)similarities between samples, TriMAP incorporates triplets to encode global structure. Saying that is a triplet The weight is defined first in the unnormalized form , which is a function of distance measures between samples: In the TriMAP algorithm, the distance measure is normalized by a scaling factor defined σ by the average distances between a target sample and its 4-th to 6-th nearest neighbors: To optimize an embedding for each sample, a given number of triplets (n_outliers, default 5) are sampled from its k-nearest-neighbors (n_inliers, default 10). Additionally, some randomly sampled triplets containing the target sample are considered (n_random, default 5), which arguably helps preserve global structure. The final loss is defined as the sum of the losses of each sampled triplet in : When used in TopOMetry for graph-layout, the TriMAP embedding is initialized with LE, and a topological orthogonal basis is used for sampling triplets. Because of its peculiar sampling process, providing it with a downstream topological graph could be counterproductive.

Pairwise Controlled Manifold Approximation and Projection (PaCMAP)
PaCMAP holds significant similarities with UMAP and TriMAP, but introduces its own unique insights. The core idea of PaCMAP is establishing robust loss functions that take into account not only a sample's local neighborhood, but how that neighborhood relates to closely located neighborhoods and with far away points. First, the neighborhood affinity graph from data is built in a manner identical to TriMAP triplet weighting procedure, in which distances between samples are scaled by the distance between the target sample and its 4th to 6th nearest neighbor: Next, pairs of samples are selected as follows: • Near pairs -a sample's k-nearest-neighbors, based on the scaled distance.
• Mid-near pairs -for each sample i, 6 other samples are randomly selected, and the second-closest to i is paired to it in a mid-near pair. The number of mid-near pairs to be sampled is k times a hyperparameter (MN_ratio, default 0.5).
• Further pairs -uniformly sample non-neighbors of i. The number of further pairs to be sampled is k times a hyperparameter (FP_ratio, default 2).
Given an initialization (PCA in default PaCMAP, LE in TopOMetry), the embedding is then optimized in three different stages, each with different weights for near-pairs, mid-near pairs and further pairs. Consider a sample i, its neighbor j, its mid-near pair k and its further pair l. Given Where is a weight given to i neighbors, a weight given to its mid-near pairs and a weight given to its further pairs. The optimization is performed in three stages, each with a number of iterations t that can be controlled by the user as hyperparameters. In the first stage (i.e. first 100 t iterations), , , and . In The uniqueness and novelty of PaCMAP lies in the fact that local and global structure are balanced throughout the optimization process by the choice of sampling neighbors, mid-near pairs and further pairs simultaneously. When employed in TopOMetry, PaCMAP uses a denoised topological orthogonal basis instead of the raw data; similarly to TriMAP, feeding it the downstream topological graph may be counterproductive due to its peculiar neighborhood sampling process.

Minimum Distortion Embedding
Recently, the problem of mapping high-dimensional information to a lower-dimensional embedding under a set of constraints (i.e. preserving distances, preserving neighborhoods, etc) was approached in a more generalized way with the introduction of Minimum Distortion Embedding (MDE). Embedding constraints are encoded by distortion functions associated with embedding distances , which are assumed to be differentiable. These functions derive from affinity (or dissimilarity) measurements between samples, or from their deviations, or from graphs on samples. For example, given nonzero weights , graph edges can be partitioned ∈ ℜ into similar items (positive weights) and dissimilar items (negative weights), and a distortion function derived from these weights have the form: Nevertheless, it can still be challenged by big, non-linear, non-uniform data, and therefore benefit from TopOMetry topological analysis for data denoising and similarity learning. Given a topological graph and an orthogonal basis, TopOMetry harnesses MDE to learn layouts that can be representative of the underlying data structure. Noticeably, these layouts are highly customizable, and comparisons between various distortion functions and constraints that can be used for their generation exceed the scope of the present work.

Global score
To evaluate preservation of global structure by embeddings and layouts, we elected the PCA resemblance score. It was first introduced in the TriMAP manuscript 28  Interestingly, as exposed in the results, the global score remains rather stable and close to 1.0 across all embedding and layout algorithms that were tested, given a PCA or spectral initialization. We empirically show that this score is approximately the same using either PCA or a spectral layout (LE) and implement functions to perform such tests within TopOMetry.

Local scores
For evaluating the preservation of local structure, we built upon previous approaches that explored geodesic distances and the Spearman correlation between original high-dimensional and downstream low-dimensional distances 6,46 . This task is very computationally intensive. The included evaluation pipeline demands less memory, but is only slightly faster than current methods. Computing scores for MOCA, for instance, were unfeasible in our computational settings. As depicted in Suppl. Fig. 1B, kNN graphs are built from both the original high-dimensional data matrix and the representations to be scored. Next, the geodesic distances (i.e. the weighted shortest-path pairwise distances) are obtained for each of these graphs, and their distributions are then scored accordingly to the Spearman correlation (R). Cells can then be ranked accordingly to their neighborhood indices -let be the number of cells and be the difference between ranks for cell in the original high-dimensional space and in the latent low-dimensional space. Then, we can calculate the Spearman correlation as: The Spearman rank-order correlation coefficient (R) is a nonparametric measure of association between two sets of data, and superior to the Pearson correlation in this case because it assumes only monotonicity, not linearity. Intuitively, this score tells whether increasing distances in the low-dimensional embedding is related to increasing distances in the high-dimensional data: a score of 0 implies no association at all, and a score of 1.0 implies perfect association. As shown in the results section, real-world applications usually lie somewhere between these two extremes.

Marker gene scores
To evaluate how marker genes found for each clustering result (i.e. leiden clustering on the PCA graph or TopOMetry's graphs) perform at predicting a cluster identity, we developed a simple R package, markeRscore. Given clustering labels and a list of marker genes for each cluster, markeRscore computes the sensibility, specificity and positive likelihood ratio of the top k marker genes for each cluster, for each clustering result. The aggregated values are then visualized using violin plots to assess how well each set of marker genes perform. Naturally, this can be also used to compare methods that find marker genes given a single clustering result; however, here we opted to implement it to compare different clustering results using a single method to find marker genes.

Runtime
The MOCA dataset was randomly sampled into three groups of datasets, each with increasing number of cells. The datasets from the first group ranged from 50,00 to 15,000 total cells, while the datasets from the second group ranged from 1,000 to 50,000 total cells and those from the third group ranged from 1,000 to 1 million total cells. All models were performed for the first group, while only the diffusion and fuzzy models were used in the second group, and only the diffusion model was used in the third group. The PCA and the truncated single-value decomposition (SVD) implementations used were the default scikit-learn 74 algorithms sklearn.decomposition.PCA() and sklearn.decomposition.TruncatedSVD(). The number of eigencomponents was fixed to 100 for all comparisons.

Computational Environment
All analyses were performed on a six-core Lenovo ThinkPad P52 Intel® Xeon(R) E-2176M CPU @ 2.70GHz × 12 threads, with a total of 128GB of ERCC DDR SDRAM running the 64-bit 21.10 Pop!OS distribution of the Linux operating system.

Code availability
TopOMetry is freely accessible as a python library at https://github.com/davisidarta/topometry and extensively documented at https://topometry.readthedocs.io . Any additional information can be requested and will be promptly made available by the authors.