Abstract
Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions (https://github.com/theislab/paga). PAGA maps provide interpretable discrete and continuous latent coordinates for both disconnected and continuous structure in data, preserve the global topology of data, allow analyzing data at different resolutions and result in much higher computational efficiency of the typical exploratory data analysis workflow — one million cells take on the order of a minute, a speedup of 130 times compared to UMAP. We demonstrate the method by inferring structure-rich cell maps with consistent topology across four hematopoietic datasets, confirm the reconstruction of lineage relations of adult planaria and the zebrafish embryo, benchmark computational performance on a neuronal dataset and detect a biological trajectory in one deep-learning processed image dataset.
Introduction
Single-cell RNA-seq offers unparalleled opportunities for comprehensive molecular profiling of thousands of individual cells, with expected major impacts across a broad range of biomedical research. The resulting datasets are often discussed using the term transcriptional landscape. However, the algorithmic analysis of cellular heterogeneity and patterns across such landscapes still faces fundamental challenges, for instance, in how to explain cell-to-cell variation. Current computational approaches attempt to achieve this usually in one of two ways [1]. Clustering assumes that data is composed of biologically distinct groups such as discrete cell types or states and labels these with a discrete variable — the cluster index. By contrast, inferring pseudotemporal orderings or trajectories of cells [2–4] assumes that data lie on a connected manifold [5] and labels cells with a continuous variable — the distance along the manifold. While the former approach is the basis for most unsupervised analyses of single-cell data, the latter enables a better interpretation of continuous phenotypes and processes such as development, dose response and disease progression. Here, we unify both viewpoints.
A central example of dissecting heterogeneity in single-cell experiments concerns data that originate from complex cell differentiation processes. However, analyzing such data using pseudotemporal ordering [2, 6–10] faces the problem that biological processes are usually incompletely sampled. As a consequence, experimental data do not conform with a connected manifold and the modeling of data as a continuous tree structure, which is the basis for existing algorithms, has little meaning. This problem exists even in clustering-based algorithms for the inference of tree-like processes [11–13], which make the generally invalid assumption that clusters conform with a connected tree-like topology. Moreover, they rely on feature-space based inter-cluster distances, like the Euclidean distance of cluster means. However, such distance measures quantify biological similarity of cells only at a local scale and are fraught with problems when used for larger-scale objects like clusters. Efforts for addressing the resulting high non-robustness of tree-fitting to distances between clusters [11] by sampling [12, 13] have only had limited success.
Partition-based graph abstraction (PAGA) [14] resolves these fundamental problems by generating graph-like maps of cells that preserve both continuous and disconnected structure in data at multiple resolutions. The data-driven formulation of PAGA allows robust reconstruction of branching gene expression changes across different datasets and, for the first time, enabled reconstructing the lineage relations of a whole adult animal [15]. Furthermore, we show that PAGA-initialized manifold learning algorithms converge faster and produce embeddings that are more faithful to the global topology of high-dimensional data, and we introduce an entropy-based measure for quantifying such faithfulness. Finally, we show how PAGA abstracts transition graphs, for instance, from RNA velocity, and compare it to previous trajectory-inference algorithms.
Results
PAGA maps discrete disconnected and continuous connected cell-to-cell variation
Both established manifold learning techniques and single-cell data analysis techniques represent data as a neighborhood graph of single cells G = (V, E), where each node in V corresponds to a cell and each edge in E represents a neighborhood relation (Figure 1) [3, 16–18]. However, the complexity of G and noise-related spurious edges make it hard both to trace a putative biological process from progenitor cells to different fates and to decide whether groups of cells are in fact connected or disconnected. Moreover, tracing isolated paths of single cells to make statements about a biological process comes with too little statistical power to achieve an acceptable confidence level. Gaining power by averaging over distributions of single-cell paths is hampered by the difficulty of fitting realistic models for the distribution of these paths.
We address these problems by developing a statistical model for the connectivity of groups of cells, which we typically determine through graph-partitioning [18–20] or alternatively through clustering or experimental annotation. This allows us to generate a simpler PAGA graph G* (Figure 1) whose nodes correspond to cell groups and whose edge weights quantify the connectivity between groups. The statistical model considers groups as connected if their number of inter-edges exceeds a fraction of the number of inter-edges expected under random assignment. The connection strength can be interpreted as confidence in the presence of an actual connection and allows discarding spurious, noise-related connections (Supplemental Note 1). While G represents the connectivity structure of data at single-cell resolution, the PAGA graph G* represents the connectivity structure of data at the chosen coarser resolution of the partitioning and allows identifying connected and disconnected regions of data. Following paths along nodes in G* means following an ensemble of single-cell paths that pass through the corresponding cell groups in G. By averaging over such an ensemble of single-cell paths, it becomes possible to trace a putative biological process from a progenitor to fates in a way that is robust to spurious edges, provides statistical power and is consistent with basic assumptions on a biological trajectory of cells (Supplemental Note 2). Note that by varying the resolution of the partitioning, PAGA generates PAGA graphs at multiple resolutions, which enables a hierarchical exploration of data (Figure 1, Supplemental Note 1.3).
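The connectivity test described above can be illustrated with a minimal sketch. It assumes a simplified null model in which edge endpoints fall into groups in proportion to the summed degree of each group; this is an illustrative choice and not necessarily the model of Supplemental Note 1. The function name group_connectivity and its threshold parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations


def group_connectivity(edges, labels, threshold=0.0):
    """Return {(group_a, group_b): observed / expected inter-edge ratio}.

    edges  : iterable of undirected (cell_i, cell_j) pairs from a kNN graph
    labels : dict mapping each cell to its group
    Only pairs whose ratio exceeds `threshold` are kept.
    """
    edges = list(edges)
    n_edges = len(edges)
    degree = Counter()    # summed degree of all cells in a group
    observed = Counter()  # number of edges running between two groups
    for i, j in edges:
        a, b = labels[i], labels[j]
        degree[a] += 1
        degree[b] += 1
        if a != b:
            observed[tuple(sorted((a, b)))] += 1

    connectivity = {}
    for a, b in combinations(sorted(degree), 2):
        # Assumed null model: endpoints are assigned to groups at random,
        # in proportion to group degree (configuration-model style).
        expected = degree[a] * degree[b] / (2 * n_edges)
        ratio = observed[(a, b)] / expected if expected > 0 else 0.0
        if ratio > threshold:
            connectivity[(a, b)] = ratio
    return connectivity


# Toy graph: two triangles joined by one bridge, plus an isolated pair.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (6, 7)]
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B", 6: "C", 7: "C"}
conn = group_connectivity(edges, labels)
```

In this toy graph only the pair (A, B) receives a nonzero weight, while group C has no inter-edges and stays disconnected. Raising the threshold would discard weakly supported connections.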
To trace gene dynamics at single-cell resolution, we extended existing random-walk based distance measures (Supplemental Note 2, Reference [8]) to the realistic case that accounts for disconnected graphs. By following high-confidence paths in the abstracted graph G* and ordering cells within each group in the path according to their distance d from a progenitor cell, we trace gene changes at single-cell resolution (Figure 1). Hence, PAGA covers both aspects of clustering and pseudotemporal ordering by providing a coordinate system (G*, d) that allows us to explore variation in data while preserving its topology (Supplemental Note 1.6). PAGA can thus be viewed as an easily-interpretable and robust way of performing topological data analysis [10, 21] (Supplemental Note 3).
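As a small illustration of the coordinate system (G*, d), the sketch below orders cells first by the position of their group along a chosen path in G* and then by their distance d from the progenitor cell. The group names, cell names and distance values are made up for the example, and the helper order_cells_along_path is hypothetical.

```python
def order_cells_along_path(path, labels, distance):
    """Order cells by the rank of their group in `path`, then by distance d.

    path     : list of group names, a path through the abstracted graph
    labels   : dict mapping each cell to its group
    distance : dict mapping each cell to its distance d from the root cell
    Cells whose group is not on the path are left out.
    """
    rank = {group: k for k, group in enumerate(path)}
    cells = [c for c in labels if labels[c] in rank]
    return sorted(cells, key=lambda c: (rank[labels[c]], distance[c]))


# Illustrative values only.
labels = {"c1": "stem", "c2": "stem", "c3": "ery", "c4": "ery", "c5": "mono"}
distance = {"c1": 0.0, "c2": 0.3, "c3": 0.9, "c4": 0.7, "c5": 0.8}
ordering = order_cells_along_path(["stem", "ery"], labels, distance)
```

Plotting a gene's expression against this ordering would give a single-cell resolution trace along the chosen path.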
PAGA-initialized manifold learning produces topology-preserving single-cell embeddings
The computationally almost cost-free coarse-resolution embeddings of PAGA can be used to initialize established manifold learning and graph drawing algorithms like UMAP [22] and ForceAtlas2 (FA) [23]. This strategy is used to generate the single-cell embeddings throughout this paper. In contrast to the results of previous algorithms, PAGA-initialized single-cell embeddings are faithful to the global topology, which greatly improves their interpretability. To quantify this claim, we took a classification perspective on embedding algorithms and developed a cost function KLgeo (Box and Supplemental Note 4), which captures faithfulness to global topology by incorporating geodesic distance along the representations of data manifolds in both the high-dimensional and the embedding space, respectively. Independent of this, PAGA-initialized manifold learning converges about 6 times faster with respect to established cost functions in manifold learning (Supplemental Figure 10).
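The initialization idea can be sketched as follows: every cell starts at the position of its group in the coarse layout, with a small random offset, and a layout algorithm then refines these positions. The Gaussian jitter, its scale and the function name init_from_groups are assumptions made for illustration and are not taken from the implementation.

```python
import random


def init_from_groups(labels, group_pos, jitter=0.05, seed=0):
    """Place each cell at its group's coarse position plus Gaussian jitter.

    labels    : dict mapping each cell to its group
    group_pos : dict mapping each group to an (x, y) position
    """
    rng = random.Random(seed)
    init = {}
    for cell, group in labels.items():
        x, y = group_pos[group]
        init[cell] = (x + rng.gauss(0, jitter), y + rng.gauss(0, jitter))
    return init


labels = {"c1": "g0", "c2": "g0", "c3": "g1"}
group_pos = {"g0": (0.0, 0.0), "g1": (1.0, 2.0)}
init = init_from_groups(labels, group_pos)                # jittered start
exact = init_from_groups(labels, group_pos, jitter=0.0)   # no jitter
```

The resulting dictionary would be passed as the starting configuration to an iterative layout method such as UMAP or ForceAtlas2 instead of a random or spectral start.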
Box: Taking a classification view on embedding algorithms, we quantify how faithful an embedding is to the global topology of the high-dimensional data by comparing the distributions P and Q of edges in the high-dimensional and embedding spaces using a weighted Kullback-Leibler divergence, KLgeo. Here, p_e and q_e are the probabilities for an edge e being present in the kNN graphs in the high-dimensional and embedding spaces, respectively. Analogously, the weights of the divergence are random-walk based estimators of geodesic distances on the manifolds in these spaces, respectively. E_fc denotes the edge set of the fully connected graph (Supplemental Note 4, Supplemental Figure 10).
PAGA consistently predicts developmental trajectories and gene expression changes in datasets related to hematopoiesis
Hematopoiesis represents one of the most extensively characterised systems involving stem cell differentiation towards multiple cell fates and hence provides an ideal scenario for applying PAGA to complex manifolds. We applied PAGA to simulated data (Supplemental Note 5) for this system and three experimental datasets: 2,730 cells measured using MARS-seq [24], 1,654 cells measured using Smart-seq2 [25] and 44,802 cells from a 10x Genomics protocol [26]. These data cover the differentiation from stem cells towards cell fates including erythrocytes, megakaryocytes, neutrophils, monocytes, basophils and lymphocytes.
The PAGA graphs (Figure 2) capture known features of hematopoiesis, such as the proximity of megakaryocyte and erythroid progenitors and strong connections between monocyte and neutrophil progenitors. Under debate is the origin of basophils. Studies have suggested that basophils originate either from a basophil-neutrophil-monocyte progenitor or, more recently, from a shared erythroid-megakaryocyte-basophil progenitor [27, 28]. The PAGA graphs of the three experimental datasets highlight this ambiguity. While the dataset of Paul et al. falls in the former category, Nestorowa et al. falls in the latter, and Dahlin et al., which has by far the highest cell numbers and the densest sampling, allows us to see both trajectories. Aside from this ambiguity, which can be explained by insufficient sampling in Paul et al. and Nestorowa et al., the PAGA graphs show consistent topology between the three datasets, even with the very different experimental protocols and vastly different cell numbers. Beyond consistent topology between cell subgroups, we find consistent continuous gene expression changes across all datasets: we observe changes of erythroid maturity marker genes (Gata2, Gata1, Klf1, Epor and Hba-a2) along the erythroid trajectory through the PAGA graphs and observe sequential activation of these genes in agreement with known behaviour. Activation of neutrophil markers (Elane, Cebpe and Gfi1) and monocyte markers (Irf8, Csf1r and Ctsg) is seen towards the end of the neutrophil and monocyte trajectories, respectively. While PAGA is able to capture the dynamic transcriptional processes underlying multi-lineage hematopoietic differentiation, previous algorithms fail to produce robust or meaningful results (Supplemental Figures 8 and 9).
PAGA maps single-cell data of whole animals at multiple resolutions
Recently, Plass et al. [15] reconstructed the first cellular lineage tree of a whole adult animal, the flatworm Schmidtea mediterranea, using PAGA on scRNA-seq data from 21,612 cells. While Plass et al. focussed on the tree-like subgraph that maximizes overall connectivity (the minimum spanning tree of G* weighted by inverse PAGA connectivity), here we show how PAGA can be used to generate maps of data at multiple resolutions (Figure 3a). Each map preserves the topology of data, in contrast to state-of-the-art manifold learning where connected tissue types appear as either disconnected or overlapping (Figure 3b). PAGA’s multi-resolution capabilities directly address the typical practice of exploratory data analysis, in particular for single-cell data: data is typically reclustered in certain regions where a higher level of detail is required.
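The tree-like subgraph mentioned above, the minimum spanning tree of G* under inverse-connectivity weights, can be computed with any standard algorithm. The following is a minimal sketch using Kruskal's algorithm with a union-find structure; the group names and connectivity values are invented for the example, and connectivity_spanning_tree is a hypothetical helper that assumes strictly positive connectivities.

```python
def connectivity_spanning_tree(connectivity):
    """Kruskal's algorithm on edge weights 1 / connectivity.

    connectivity : dict {(group_a, group_b): positive connectivity}
    Returns the chosen edges in the order they were added. For a
    disconnected graph the result is a spanning forest.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # Ascending inverse connectivity equals descending connectivity.
    for (a, b), w in sorted(connectivity.items(), key=lambda kv: 1.0 / kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # edge joins two components, so keep it
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Illustrative connectivities between four groups.
connectivity = {
    ("g0", "g1"): 0.9,
    ("g0", "g2"): 0.8,
    ("g1", "g2"): 0.2,
    ("g2", "g3"): 0.5,
}
tree = connectivity_spanning_tree(connectivity)
```

The weak edge between g1 and g2 closes a cycle and is therefore dropped, which is the sense in which the tree keeps the most strongly supported connections.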
PAGA abstracts information from RNA velocity
Even though the connections in PAGA graphs often correspond to actual biological trajectories, this is not always the case. This is a consequence of PAGA being applied to kNN graphs, which solely contain information about the topology of data. Recently, it has been suggested to also consider directed graphs that store information about cellular transition based on RNA velocity [29]. To include this additional information, which can add further evidence for actual biological transitions, we extend the undirected PAGA connectivity measure to such directed graphs (Supplemental Note 1.2) and use it to orient edges in PAGA graphs (Figure 3c). Due to the relatively sparsely sampled, high-dimensional feature space of scRNA-seq data, both fitting and interpreting an RNA velocity vector without including information about topology (connectivity of neighborhoods) is practically impossible. PAGA provides a natural way of abstracting both topological information and information about RNA velocity.
Next, we applied PAGA to 53,181 cells collected at different developmental time points (embryo days) from the zebrafish embryo [30]. The PAGA graph for partitions corresponding to embryo days accurately recovers the chain topology of temporal progression, whereas the PAGA graph for cell types provides easily interpretable overviews of the lineage relations (Figure 4a). Initializing a ForceAtlas2 layout with PAGA coordinates from fine cell types automatically produced a corresponding, interpretable single-cell embedding (Figure 4a). Wagner et al. [30] both applied an independently developed computational approach with similarities to PAGA (Supplemental Note 3) to produce a coarse-grained graph and experimentally validated inferred lineage relations. Comparing the PAGA graph for the fine cell types to the coarse-grained graph of Wagner et al. reproduced their result with high accuracy (Figure 4b).
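To make the idea of orienting edges concrete, the sketch below aggregates directed cell-to-cell transitions into group-level flows and orients each undirected edge of the abstracted graph in the direction of the larger flow. This simple majority rule is an assumption for illustration and is not the directed connectivity measure of Supplemental Note 1.2; the function orient_edges and the toy data are hypothetical.

```python
from collections import Counter


def orient_edges(transitions, labels, undirected_edges):
    """Orient group-level edges by the net direction of cell transitions.

    transitions      : iterable of directed (source_cell, target_cell) pairs
    labels           : dict mapping each cell to its group
    undirected_edges : iterable of (group_a, group_b) pairs
    Edges with equal flow in both directions are left out.
    """
    flow = Counter()
    for src, dst in transitions:
        a, b = labels[src], labels[dst]
        if a != b:  # transitions within a group carry no direction here
            flow[(a, b)] += 1

    oriented = []
    for a, b in undirected_edges:
        forward, backward = flow[(a, b)], flow[(b, a)]
        if forward > backward:
            oriented.append((a, b))
        elif backward > forward:
            oriented.append((b, a))
    return oriented


labels = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
transitions = [(0, 2), (1, 2), (3, 1), (2, 4), (4, 3)]
oriented = orient_edges(transitions, labels, [("A", "B"), ("B", "C")])
```

Here the edge between A and B is oriented from A to B (two transitions against one), while the edge between B and C has balanced flow and remains unoriented.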
PAGA increases computational efficiency and interpretability in general exploratory data analysis and manifold learning
Comparing the runtimes of PAGA with the state-of-the-art UMAP [22] for 1.3 million neuronal cells of 10x Genomics [31], we find a speedup of about 130, which enables interactive analysis of very large-scale data (90 s versus 191 min on 3 cores of a small server; tSNE takes about 10 h). For complex and large data, the PAGA graph generally provides a more easily interpretable visualization of the clustering step in exploratory data analysis, where the limitations of two-dimensional representations become apparent (Supplemental Figure 12). PAGA graph visualizations can be colored by gene expression and covariates from annotation (Supplemental Figure 13) just as any conventional embedding method.
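The quoted speedup follows directly from the two runtimes given above. As a quick check of the arithmetic:

```python
# Runtimes quoted in the text, converted to seconds.
umap_seconds = 191 * 60   # 191 min
paga_seconds = 90         # 90 s

speedup = umap_seconds / paga_seconds
print(f"speedup: {speedup:.1f}x")
```

The ratio is roughly 127, which is consistent with the stated speedup of about 130.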
PAGA is robust and qualitatively outperforms previous lineage reconstruction algorithms
To assess how robustly graph and tree-inference algorithms recover a given topology, we developed a measure for comparing the topologies of two graphs by comparing the sets of possible paths on them (Supplemental Note 1.3 and 1.4, Supplemental Figure 4). Sampling widely varying parameters, which leads to widely varying clusterings, we find that the inferred abstraction of topology of data within the PAGA graph is much more robust than the underlying graph clustering algorithm (Supplemental Figure 5). While graph clustering alone is, as any clustering method, an ill-posed problem in the sense that many highly degenerate quasi-optimal clusterings exist and some knowledge about the scale of clusters is required, PAGA is not affected by this.
Several algorithms [6, 11–13] have been proposed for reconstructing lineage trees (Supplemental Note 3, [4]). The main caveat of these algorithms is that they, unlike PAGA, try to explain any variation in the data with a tree-like topology. In particular, any disconnected distribution of clusters is interpreted as originating from a tree. This produces qualitatively wrong results already for simple simulated data (Supplemental Figure 6) and only works well for data that clearly conforms with a tree-like manifold (Supplemental Figure 7). To establish a fair comparison on real data with the recent popular algorithm, Monocle 2, we reinvestigated the main example of Qiu et al. [6] for a complex differentiation tree. This example is based on the data of Paul et al. [24] (Figure 2), but with cluster 19 removed. While PAGA identifies the cluster as disconnected with a result that is unaffected by its presence, the prediction of Monocle 2 changes qualitatively if the cluster is taken into account (Supplemental Figure 8). The example illustrates the general point that real data almost always consists of dense and sparse (connected and disconnected) regions, some tree-like, some with more complex topology.
Discussion
In view of an increasing number of large datasets and analyses for even larger merged datasets, PAGA fundamentally addresses the need for scalable and interpretable maps of high-dimensional data. In the context of the Human Cell Atlas [32] and comparable databases, methods for their hierarchical, multi-resolution exploration will be pivotal in order to provide interpretable accessibility to users. PAGA makes it possible, for the first time, to present information about clusters or cell types in an unbiased, data-driven coordinate system by representing these in PAGA graphs. In the context of the recent advances in the study of simple biological processes that involve a single branching [7, 8], PAGA provides a similarly robust framework for arbitrarily complex topologies. In view of the fundamental challenges of single-cell resolution studies due to technical noise, transcriptional stochasticity and computational burden, PAGA provides a general framework for extending studies of the relations among single cells to relations among noise-reduced and computationally tractable groups of cells. This could facilitate obtaining clearer pictures of underlying biology.
In closing, we note that PAGA not only works for scRNA-seq based on distance metrics that arise from a sequence of chosen preprocessing steps, but can also be applied to any learned distance metric. To illustrate this point, we applied PAGA to single-cell imaging data on the basis of a deep-learning based distance metric. Eulenberg et al. [33] showed that a deep learning model can generate a feature space in which distances reflect the continuous progression of cell cycle. Using this, PAGA correctly identifies the biological trajectory through the interphases of cell cycle while ignoring a cluster of damaged and dead cells (Supplemental Note 5.6).
Code and Data availability
PAGA as well as all processing steps used within the analyses are available within Scanpy [34]: https://github.com/theislab/scanpy. The analyses and results of the present paper are available from https://github.com/theislab/paga. Data is linked from https://github.com/theislab/paga.
Acknowledgements
We thank N. Yosef and D. Wagner for stimulating discussions, S. Tritschler for valuable feedback when testing the code and M. Luecken for comments on graph partitioning algorithms. F.A.W. acknowledges support by the Helmholtz Postdoc Programme, Initiative and Networking Fund of the Helmholtz Association. J.S.D. is supported by a grant from the Swedish Research Council. Work in B.G.’s laboratory is supported by grants from Wellcome, Bloodwise, Cancer Research UK, NIH-NIDDK, and core support grants by Wellcome to the Cambridge Institute for Medical Research and Wellcome-MRC Cambridge Stem Cell Institute. F.K.H. is the recipient of a Medical Research Council PhD Studentship. The work from M.P., J.S., and N.R. was funded by the German Center for Cardiovascular Research (DZHK BER 1.2 VD) and the DFG (grant RA 838/5-1). F.J.T. is supported by the German Research Foundation (DFG) within the Collaborative Research Centre 1243, Subproject A17.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].