Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells

F. Alexander Wolf; Fiona Hamey; Mireya Plass; Jordi Solana; Joakim S. Dahlin; Berthold Göttgens; Nikolaus Rajewsky; Lukas Simon; Fabian J. Theis

doi:10.1101/208819

Abstract

Single-cell RNA-seq quantifies biological heterogeneity across both discrete cell types and continuous cell transitions. Partition-based graph abstraction (PAGA) provides an interpretable graph-like map of the arising data manifold, based on estimating connectivity of manifold partitions (https://github.com/theislab/paga). PAGA maps provide interpretable discrete and continuous latent coordinates for both disconnected and continuous structure in data, preserve the global topology of data, allow analyzing data at different resolutions and result in much higher computational efficiency of the typical exploratory data analysis workflow — one million cells take on the order of a minute, a speedup of 130 times compared to UMAP. We demonstrate the method by inferring structure-rich cell maps with consistent topology across four hematopoietic datasets, confirm the reconstruction of lineage relations of adult planaria and the zebrafish embryo, benchmark computational performance on a neuronal dataset and detect a biological trajectory in one deep-learning processed image dataset.

Introduction

Single-cell RNA-seq offers unparalleled opportunities for comprehensive molecular profiling of thousands of individual cells, with expected major impacts across a broad range of biomedical research. The resulting datasets are often discussed using the term transcriptional landscape. However, the algorithmic analysis of cellular heterogeneity and patterns across such landscapes still faces fundamental challenges, for instance, in how to explain cell-to-cell variation. Current computational approaches attempt to achieve this usually in one of two ways [1]. Clustering assumes that data is composed of biologically distinct groups such as discrete cell types or states and labels these with a discrete variable — the cluster index. By contrast, inferring pseudotemporal orderings or trajectories of cells [2–4] assumes that data lie on a connected manifold [5] and labels cells with a continuous variable — the distance along the manifold. While the former approach is the basis for most unsupervised analyses of single-cell data, the latter enables a better interpretation of continuous phenotypes and processes such as development, dose response and disease progression. Here, we unify both viewpoints.

A central example of dissecting heterogeneity in single-cell experiments concerns data that originate from complex cell differentiation processes. However, analyzing such data using pseudotemporal ordering [2, 6–10] faces the problem that biological processes are usually incompletely sampled. As a consequence, experimental data do not conform to a connected manifold, and modeling the data as a continuous tree structure, which is the basis for existing algorithms, has little meaning. This problem exists even in clustering-based algorithms for the inference of tree-like processes [11–13], which make the generally invalid assumption that clusters conform to a connected tree-like topology. Moreover, they rely on feature-space based inter-cluster distances, like the Euclidean distance of cluster means. However, such distance measures quantify biological similarity of cells only at a local scale and are fraught with problems when used for larger-scale objects like clusters. Efforts to address the resulting lack of robustness of tree-fitting to distances between clusters [11] by sampling [12, 13] have had only limited success.

Partition-based graph abstraction (PAGA) [14] resolves these fundamental problems by generating graph-like maps of cells that preserve both continuous and disconnected structure in data at multiple resolutions. The data-driven formulation of PAGA allows robust reconstruction of branching gene expression changes across different datasets and, for the first time, enabled reconstructing the lineage relations of a whole adult animal [15]. Furthermore, we show that PAGA-initialized manifold learning algorithms converge faster and produce embeddings that are more faithful to the global topology of high-dimensional data, and we introduce an entropy-based measure for quantifying such faithfulness. Finally, we show how PAGA abstracts transition graphs, for instance, from RNA velocity, and compare it to previous trajectory-inference algorithms.

Results

PAGA maps discrete disconnected and continuous connected cell-to-cell variation

Both established manifold learning techniques and single-cell data analysis techniques represent data as a neighborhood graph of single cells G = (V, E), where each node in V corresponds to a cell and each edge in E represents a neighborhood relation (Figure 1) [3, 16–18]. However, the complexity of G and noise-related spurious edges make it hard both to trace a putative biological process from progenitor cells to different fates and to decide whether groups of cells are in fact connected or disconnected. Moreover, tracing isolated paths of single cells to make statements about a biological process comes with too little statistical power to achieve an acceptable confidence level. Gaining power by averaging over distributions of single-cell paths is hampered by the difficulty of fitting realistic models for the distribution of these paths.

Figure 1 Partition-based graph abstraction generates a topology-preserving map of single cells

High-dimensional gene expression data is represented as a kNN graph by choosing a suitable low-dimensional representation and an associated distance metric for computing neighborhood relations (in most of the paper we use PCA-based representations and Euclidean distance). The kNN graph is partitioned at a desired resolution, where partitions represent groups of connected cells. For this, we usually use the Louvain algorithm; however, partitions can be obtained in any other way, too. A PAGA graph is obtained by associating a node with each partition and connecting each node by weighted edges that represent a statistical measure of connectivity between partitions, which we introduce in the present paper. By discarding spurious edges with low weights, PAGA graphs reveal the denoised topology of the data at a chosen resolution and reveal its connected and disconnected regions. Combining high-confidence paths in the PAGA graph with a random-walk based distance measure on the single-cell graph, we order cells within each partition according to their distance from a root cell. A PAGA path then averages all single-cell paths that pass through the corresponding groups of cells. This makes it possible to trace gene expression changes along complex trajectories at single-cell resolution.

We address these problems by developing a statistical model for the connectivity of groups of cells, which we typically determine through graph-partitioning [18–20] or alternatively through clustering or experimental annotation. This allows us to generate a simpler PAGA graph G^* (Figure 1) whose nodes correspond to cell groups and whose edge weights quantify the connectivity between groups. The statistical model considers groups as connected if their number of inter-edges exceeds a fraction of the number of inter-edges expected under random assignment. The connection strength can be interpreted as confidence in the presence of an actual connection and allows discarding spurious, noise-related connections (Supplemental Note 1). While G represents the connectivity structure of data at single-cell resolution, the PAGA graph G^* represents the connectivity structure of data at the chosen coarser resolution of the partitioning and makes it possible to identify connected and disconnected regions of data. Following paths along nodes in G^* means following an ensemble of single-cell paths that pass through the corresponding cell groups in G. By averaging over such an ensemble of single-cell paths, it becomes possible to trace a putative biological process from a progenitor to fates in a way that is robust to spurious edges, provides statistical power and is consistent with basic assumptions on a biological trajectory of cells (Supplemental Note 2). Note that by varying the resolution of the partitioning, PAGA generates PAGA graphs at multiple resolutions, which enables a hierarchical exploration of data (Figure 1, Supplemental Note 1.3).
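The comparison of observed inter-edges with the number expected under random assignment can be illustrated with a small sketch. This is not the statistical model of Supplemental Note 1: the function name `group_connectivity`, the edge-list input and the particular null model (every edge endpoint of a group attaches to a uniformly random other cell) are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def group_connectivity(edges, labels):
    """Ratio of observed to expected inter-group edges for every pair of groups.

    edges  : iterable of (i, j) pairs, undirected kNN-graph edges between cells
    labels : labels[i] is the group (partition) of cell i
    """
    n_cells = len(labels)
    sizes = Counter(labels)   # number of cells per group
    degree = Counter()        # number of edge endpoints per group
    observed = Counter()      # observed inter-group edge counts

    for i, j in edges:
        gi, gj = labels[i], labels[j]
        degree[gi] += 1
        degree[gj] += 1
        if gi != gj:
            observed[frozenset((gi, gj))] += 1

    connectivity = {}
    for a, b in combinations(sorted(sizes), 2):
        # Assumed null model: each endpoint attaches to a random other cell;
        # the factor 2 avoids counting every inter-edge from both of its ends.
        expected = (degree[a] * sizes[b] + degree[b] * sizes[a]) / (2 * (n_cells - 1))
        ratio = observed[frozenset((a, b))] / expected if expected > 0 else 0.0
        connectivity[(a, b)] = min(ratio, 1.0)  # capped so it reads as a confidence
    return connectivity


# Toy graph: two triangles joined by one edge, plus an isolated pair of cells.
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 2
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (6, 7)]
weights = group_connectivity(edges, labels)

# Groups count as connected if the ratio exceeds a chosen fraction (here 0.1).
connected = {pair for pair, w in weights.items() if w > 0.1}
```

In this toy graph, groups A and B end up connected while C stays disconnected from both, which is the kind of distinction the PAGA graph is meant to expose.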

To trace gene dynamics at single-cell resolution, we extended existing random-walk based distance measures (Supplemental Note 2, Reference [8]) to the realistic case of disconnected graphs. By following high-confidence paths in the abstracted graph G^* and ordering cells within each group in the path according to their distance d from a progenitor cell, we trace gene changes at single-cell resolution (Figure 1). Hence, PAGA covers both aspects of clustering and pseudotemporal ordering by providing a coordinate system (G^*, d) that allows us to explore variation in data while preserving its topology (Supplemental Note 1.6). PAGA can thus be viewed as an easily interpretable and robust way of performing topological data analysis [10, 21] (Supplemental Note 3).
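The coordinate system (G^*, d) can be made concrete with a short sketch: cells are visited group by group along a path in G^*, and within each group they are sorted by d. The function name `order_cells_along_path` and the plain-list inputs are hypothetical, and the distances are assumed to have been computed beforehand by some random-walk based measure.

```python
def order_cells_along_path(path, labels, distance):
    """Return cell indices ordered along a path of groups.

    path     : sequence of group names, e.g. ["Stem", "Ery"]
    labels   : labels[i] is the group of cell i
    distance : distance[i] is the distance d of cell i from the root cell
    """
    ordered = []
    for group in path:
        members = [i for i, g in enumerate(labels) if g == group]
        # Within a group, cells are sorted by their distance from the root.
        members.sort(key=lambda i: distance[i])
        ordered.extend(members)
    return ordered


# Toy example with five cells in three groups.
labels = ["Stem", "Ery", "Stem", "Mk", "Ery"]
distance = [0.0, 0.9, 0.2, 0.5, 0.6]
expression = [1.0, 7.5, 1.4, 3.0, 5.2]  # hypothetical values of one gene

order = order_cells_along_path(["Stem", "Ery"], labels, distance)
profile = [expression[i] for i in order]  # gene change along the path
```

Cells of groups that are not on the path (here the single Mk cell) are simply left out, so each path yields its own ordering.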

PAGA-initialized manifold learning produces topology-preserving single-cell embeddings

The computationally almost cost-free coarse-resolution embeddings of PAGA can be used to initialize established manifold learning and graph drawing algorithms like UMAP [22] and ForceAtlas2 (FA) [23]. This strategy is used to generate the single-cell embeddings throughout this paper. In contrast to the results of previous algorithms, PAGA-initialized single-cell embeddings are faithful to the global topology, which greatly improves their interpretability. To quantify this claim, we took a classification perspective on embedding algorithms and developed a cost function KL_geo (Box and Supplemental Note 4), which captures faithfulness to global topology by incorporating geodesic distance along the representations of data manifolds in both the high-dimensional and the embedding space, respectively. Independent of this, PAGA-initialized manifold learning converges about 6 times faster with respect to established cost functions in manifold learning (Supplemental Figure 10).
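One way to picture the initialization is the following sketch, in which every cell starts at the position of its group's node in the coarse layout plus a small random offset. The function name `init_positions_from_groups`, the uniform jitter and its default scale are assumptions for illustration; the resulting coordinates would then be handed to a layout algorithm as starting positions.

```python
import random


def init_positions_from_groups(labels, group_pos, jitter=0.05, seed=0):
    """Place each cell near the 2D position of its group's node.

    labels    : labels[i] is the group of cell i
    group_pos : dict mapping a group name to its (x, y) position in the coarse layout
    jitter    : half-width of the uniform random offset added per coordinate
    """
    rng = random.Random(seed)  # fixed seed for a reproducible initialization
    positions = []
    for group in labels:
        x, y = group_pos[group]
        positions.append((x + rng.uniform(-jitter, jitter),
                          y + rng.uniform(-jitter, jitter)))
    return positions


# Coarse layout of three groups and six cells assigned to them.
group_pos = {"Stem": (0.0, 0.0), "Ery": (1.0, 0.5), "Neu": (1.0, -0.5)}
labels = ["Stem", "Stem", "Ery", "Ery", "Neu", "Neu"]
init = init_positions_from_groups(labels, group_pos)
```

Because the starting positions already reflect the coarse topology, the subsequent optimization only has to refine local structure.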

Box Taking a classification view on embedding algorithms, we quantify how faithful an embedding is to the global topology of the high-dimensional data by comparing the distributions P and Q of edges in the high-dimensional and embedding spaces using a weighted Kullback-Leibler divergence, KL_geo, where p_e and q_e are the probabilities for an edge e being present in the kNN graphs in the high-dimensional and embedding spaces, respectively. The weights are random-walk based estimators of geodesic distances on the manifolds in these two spaces. E_fc denotes the edge set of the fully connected graph (Supplemental Note 4, Supplemental Figure 10).

PAGA consistently predicts developmental trajectories and gene expression changes in datasets related to hematopoiesis

Hematopoiesis represents one of the most extensively characterised systems involving stem cell differentiation towards multiple cell fates and hence provides an ideal scenario for applying PAGA to complex manifolds. We applied PAGA to simulated data (Supplemental Note 5) for this system and to three experimental datasets: 2,730 cells measured using MARS-seq [24], 1,654 cells measured using Smart-seq2 [25] and 44,802 cells from a 10x Genomics protocol [26]. These data cover the differentiation from stem cells towards cell fates including erythrocytes, megakaryocytes, neutrophils, monocytes, basophils and lymphocytes.

The PAGA graphs (Figure 2) capture known features of hematopoiesis, such as the proximity of megakaryocyte and erythroid progenitors and strong connections between monocyte and neutrophil progenitors. The origin of basophils is under debate. Studies have suggested that basophils originate either from a basophil-neutrophil-monocyte progenitor or, more recently, from a shared erythroid-megakaryocyte-basophil progenitor [27, 28]. The PAGA graphs of the three experimental datasets highlight this ambiguity. While the dataset of Paul et al. falls in the former category, Nestorowa et al. falls in the latter and Dahlin et al., which has by far the highest cell numbers and the densest sampling, allows us to see both trajectories. Aside from this ambiguity, which can be explained by insufficient sampling in Paul et al. and Nestorowa et al., the PAGA graphs show consistent topology between the three datasets, even with the very different experimental protocols and vastly different cell numbers. Beyond consistent topology between cell subgroups, we find consistent continuous gene expression changes across all datasets: we observe changes of erythroid maturity marker genes (Gata2, Gata1, Klf1, Epor and Hba-a2) along the erythroid trajectory through the PAGA graphs and observe sequential activation of these genes in agreement with known behaviour. Activation of neutrophil markers (Elane, Cebpe and Gfi1) and monocyte markers (Irf8, Csf1r and Ctsg) is seen towards the end of the neutrophil and monocyte trajectories, respectively. While PAGA is able to capture the dynamic transcriptional processes underlying multi-lineage hematopoietic differentiation, previous algorithms fail to produce robust or meaningful results (Supplemental Figures 8 and 9).

Figure 2 PAGA consistently predicts developmental trajectories and gene expression changes across datasets for hematopoiesis

The three columns correspond to PAGA-initialized single-cell embeddings, PAGA graphs and gene changes along PAGA paths. The four rows of panels correspond to simulated data (Supplemental Note 5) and data from Paul et al. [24], Nestorowa et al. [25] and Dahlin et al. [26], respectively. The arrows in the last row mark the two trajectories to basophils. One observes both consistent topology of PAGA graphs and consistent gene expression changes along PAGA paths for 5 erythroid, 3 neutrophil and 3 monocyte marker genes across all datasets. The cell type abbreviations are: Stem for stem cells, Ery for erythrocytes, Mk for megakaryocytes, Neu for neutrophils, Mo for monocytes, Baso for basophils, B for B cells, Lymph for lymphocytes.

PAGA maps single-cell data of whole animals at multiple resolutions

Recently, Plass et al. [15] reconstructed the first cellular lineage tree of a whole adult animal, the flatworm Schmidtea mediterranea, using PAGA on scRNA-seq data from 21,612 cells. While Plass et al. focussed on the tree-like subgraph that maximizes overall connectivity (the minimum spanning tree of G^* weighted by inverse PAGA connectivity), here we show how PAGA can be used to generate maps of data at multiple resolutions (Figure 3a). Each map preserves the topology of data, in contrast to state-of-the-art manifold learning, where connected tissue types appear as either disconnected or overlapping (Figure 3b). PAGA’s multi-resolution capabilities directly address the typical practice of exploratory data analysis, in particular for single-cell data: data is typically reclustered in certain regions where a higher level of detail is required.
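The tree-like subgraph mentioned above, the minimum spanning tree of G^* weighted by inverse connectivity, can be sketched with Kruskal's algorithm. The function name `max_connectivity_tree`, the dictionary input and the toy weights are hypothetical; pairs with zero connectivity are treated as disconnected and skipped, so the result can be a forest.

```python
def max_connectivity_tree(connectivity):
    """Minimum spanning tree (or forest) of a graph weighted by 1 / connectivity.

    connectivity : dict mapping (a, b) group pairs to a non-negative weight
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Ascending inverse connectivity is the same as descending connectivity.
    candidates = sorted(
        (1.0 / w, a, b) for (a, b), w in connectivity.items() if w > 0
    )

    tree = []
    for _, a, b in candidates:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding the edge does not close a cycle
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Toy connectivities between four groups; "B" has no connection to the rest.
connectivity = {
    ("Stem", "Ery"): 0.9,
    ("Stem", "Neu"): 0.8,
    ("Ery", "Neu"): 0.1,
    ("Stem", "B"): 0.0,
}
tree = max_connectivity_tree(connectivity)
```

The weak Ery-Neu edge is dropped because the two groups are already linked through Stem by stronger connections.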

Figure 3 PAGA applied to a whole adult animal

a, PAGA graphs for data for the flatworm Schmidtea mediterranea [15] at tissue, cell type and single-cell resolution. Only by initializing a single-cell embedding with the embedding of the cell-type PAGA graph is it possible to obtain a topologically meaningful embedding. Note that the PAGA graph is the same as in Reference [15], only that here, we neither highlight a tree subgraph nor use the corresponding tree layout for visualization. b, Established manifold learning for the same data. c, d, Predictions of RNA velocity evaluated with PAGA for two example lineages: epidermis and muscle. We show the RNA velocity arrows plotted on a single-cell embedding, the standard PAGA graph representing the topological information (only epidermis) and the PAGA graph representing the RNA velocity information.

PAGA abstracts information from RNA velocity

Even though the connections in PAGA graphs often correspond to actual biological trajectories, this is not always the case. This is a consequence of PAGA being applied to kNN graphs, which solely contain information about the topology of data. Recently, it has been suggested to also consider directed graphs that store information about cellular transitions based on RNA velocity [29]. To include this additional information, which can add further evidence for actual biological transitions, we extend the undirected PAGA connectivity measure to such directed graphs (Supplemental Note 1.2) and use it to orient edges in PAGA graphs (Figure 3c). Due to the relatively sparsely sampled, high-dimensional feature space of scRNA-seq data, both fitting and interpreting an RNA velocity vector without including information about topology, that is, the connectivity of neighborhoods, is practically impossible. PAGA provides a natural way of abstracting both topological information and information about RNA velocity.
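How a directed single-cell graph might be abstracted to oriented edges between groups can be sketched as follows. This is a simplification rather than the measure of Supplemental Note 1.2: the function name `orient_group_edges` and the rule of orienting by the net count of directed transitions are assumptions for illustration.

```python
from collections import Counter


def orient_group_edges(transitions, labels):
    """Orient edges between groups by the net number of directed cell transitions.

    transitions : iterable of (i, j) pairs, a directed edge from cell i to cell j
    labels      : labels[i] is the group of cell i
    """
    flow = Counter()
    for i, j in transitions:
        gi, gj = labels[i], labels[j]
        if gi != gj:  # transitions within a group carry no direction between groups
            flow[(gi, gj)] += 1

    oriented = {}
    for (a, b), forward in flow.items():
        backward = flow.get((b, a), 0)
        net = forward - backward
        if net > 0:  # keep only the dominant direction
            oriented[(a, b)] = net
    return oriented


# Toy example: three transitions from Stem to Ery, one in the opposite direction.
labels = ["Stem", "Stem", "Ery", "Ery"]
transitions = [(0, 2), (1, 2), (1, 3), (3, 0), (0, 1)]
arrows = orient_group_edges(transitions, labels)
```

Pairs of groups with balanced flow in both directions receive no arrow, which leaves the corresponding edge undirected.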

Next, we applied PAGA to 53,181 cells collected at different developmental time points (embryo days) from the zebrafish embryo [30]. The PAGA graph for partitions corresponding to embryo days accurately recovers the chain topology of temporal progression, whereas the PAGA graph for cell types provides an easily interpretable overview of the lineage relations (Figure 4a). Initializing a ForceAtlas2 layout with PAGA coordinates from fine cell types automatically produced a corresponding, interpretable single-cell embedding (Figure 4a). Wagner et al. [30] both applied an independently developed computational approach with similarities to PAGA (Supplemental Note 3) to produce a coarse-grained graph and experimentally validated inferred lineage relations. The PAGA graph for the fine cell types reproduces the coarse-grained graph of Wagner et al. with high accuracy (Figure 4b).

Figure 4 PAGA applied to zebrafish embryo data of Wagner et al. [30]

a, PAGA graphs obtained after running PAGA on partitions corresponding to embryo days and coarse cell types and a PAGA-initialized single-cell embedding colored by the same quantities. b, Performance measurements of the PAGA prediction compared to the reference graph of Wagner et al. show high accuracy. False positive edges and false negative edges for the threshold indicated by a vertical line in the left panel are also shown.
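The comparison against a reference graph at a given threshold can be sketched as a set difference between edge sets. The function name `edge_errors`, the toy weights and the toy reference are hypothetical and are not taken from the zebrafish analysis.

```python
def edge_errors(predicted, reference, threshold):
    """False-positive and false-negative edges of a thresholded weighted graph.

    predicted : dict mapping (a, b) pairs to a connectivity weight
    reference : iterable of (a, b) pairs regarded as true edges
    threshold : predicted edges with a weight below this value are discarded
    """
    # frozenset makes the comparison independent of the order within a pair.
    pred = {frozenset(e) for e, w in predicted.items() if w >= threshold}
    ref = {frozenset(e) for e in reference}
    false_pos = pred - ref   # predicted but absent from the reference
    false_neg = ref - pred   # present in the reference but not predicted
    return false_pos, false_neg


predicted = {("a", "b"): 0.9, ("b", "c"): 0.4, ("a", "c"): 0.05, ("c", "d"): 0.3}
reference = [("b", "a"), ("c", "b"), ("d", "e")]

fp, fn = edge_errors(predicted, reference, threshold=0.1)
```

Sweeping the threshold over a range of values and recording both error sets gives the kind of performance curve described in panel b.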

PAGA increases computational efficiency and interpretability in general exploratory data analysis and manifold learning

Comparing the runtimes of PAGA with the state-of-the-art UMAP [22] for 1.3 million neuronal cells of 10x Genomics [31], we find a speedup of about 130, which enables interactive analysis of very large-scale data (90 s versus 191 min on 3 cores of a small server; tSNE takes about 10 h). For complex and large data, the PAGA graph generally provides a more easily interpretable visualization of the clustering step in exploratory data analysis, where the limitations of two-dimensional representations become apparent (Supplemental Figure 12). PAGA graph visualizations can be colored by gene expression and covariates from annotation (Supplemental Figure 13), just like any conventional embedding.

PAGA is robust and qualitatively outperforms previous lineage reconstruction algorithms

To assess how robustly graph and tree-inference algorithms recover a given topology, we developed a measure for comparing the topologies of two graphs by comparing the sets of possible paths on them (Supplemental Notes 1.3 and 1.4, Supplemental Figure 4). Sampling widely varying parameters, which leads to widely varying clusterings, we find that the inferred abstraction of the topology of data within the PAGA graph is much more robust than the underlying graph clustering algorithm (Supplemental Figure 5). While graph clustering alone is, like any clustering method, an ill-posed problem in the sense that many highly degenerate quasi-optimal clusterings exist and some knowledge about the scale of clusters is required, PAGA is not affected by this.

Several algorithms [6, 11–13] have been proposed for reconstructing lineage trees (Supplemental Note 3, [4]). The main caveat of these algorithms is that they, unlike PAGA, try to explain any variation in the data with a tree-like topology. In particular, any disconnected distribution of clusters is interpreted as originating from a tree. This produces qualitatively wrong results already for simple simulated data (Supplemental Figure 6) and only works well for data that clearly conforms to a tree-like manifold (Supplemental Figure 7). To establish a fair comparison on real data with the recent, popular algorithm Monocle 2, we reinvestigated the main example of Qiu et al. [6] for a complex differentiation tree. This example is based on the data of Paul et al. [24] (Figure 2), but with cluster 19 removed. While PAGA identifies the cluster as disconnected with a result that is unaffected by its presence, the prediction of Monocle 2 changes qualitatively if the cluster is taken into account (Supplemental Figure 8). The example illustrates the general point that real data almost always consists of dense and sparse (connected and disconnected) regions, some tree-like, some with more complex topology.

Discussion

In view of an increasing number of large datasets and analyses for even larger merged datasets, PAGA fundamentally addresses the need for scalable and interpretable maps of high-dimensional data. In the context of the Human Cell Atlas [32] and comparable databases, methods for their hierarchical, multi-resolution exploration will be pivotal in order to provide interpretable accessibility to users. PAGA makes it possible, for the first time, to present information about clusters or cell types in an unbiased, data-driven coordinate system by representing these in PAGA graphs. In the context of the recent advances in the study of simple biological processes that involve a single branching [7, 8], PAGA provides a similarly robust framework for arbitrarily complex topologies. In view of the fundamental challenges of single-cell resolution studies due to technical noise, transcriptional stochasticity and computational burden, PAGA provides a general framework for extending studies of the relations among single cells to relations among noise-reduced and computationally tractable groups of cells. This could facilitate obtaining clearer pictures of the underlying biology.

In closing, we note that PAGA not only works for scRNA-seq based on distance metrics that arise from a sequence of chosen preprocessing steps, but can also be applied to any learned distance metric. To illustrate this point, we applied PAGA to single-cell imaging data on the basis of a deep-learning based distance metric. Eulenberg et al. [33] showed that a deep learning model can generate a feature space in which distances reflect the continuous progression of the cell cycle. Using this, PAGA correctly identifies the biological trajectory through the interphases of the cell cycle while ignoring a cluster of damaged and dead cells (Supplemental Note 5.6).

Code and Data availability

PAGA as well as all processing steps used within the analyses are available within Scanpy [34]: https://github.com/theislab/scanpy. The analyses and results of the present paper are available from https://github.com/theislab/paga, which also links to the data.

Acknowledgements

We thank N. Yosef and D. Wagner for stimulating discussions, S. Tritschler for valuable feedback when testing the code and M. Luecken for comments on graph partitioning algorithms. F.A.W. acknowledges support by the Helmholtz Postdoc Programme, Initiative and Networking Fund of the Helmholtz Association. J.S.D. is supported by a grant from the Swedish Research Council. Work in B.G.’s laboratory is supported by grants from Wellcome, Bloodwise, Cancer Research UK, NIH-NIDDK, and core support grants by Wellcome to the Cambridge Institute for Medical Research and Wellcome-MRC Cambridge Stem Cell Institute. F.K.H. is the recipient of a Medical Research Council PhD Studentship. The work from M.P., J.S., and N.R. was funded by the German Center for Cardiovascular Research (DZHK BER 1.2 VD) and the DFG (grant RA 838/5-1). F.J.T. is supported by the German Research Foundation (DFG) within the Collaborative Research Centre 1243, Subproject A17.

Footnotes

† fabian.theis{at}helmholtz-muenchen.de
1 This is a simple expression in the bipartitioned case h = h_i + h_j

References

[1] Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nature Biotechnology 34, 1145–1160 (2016).
[2] Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology 32, 381–386 (2014).
[3] Bendall, S. C. et al. Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development. Cell 157, 714–725 (2014).
[4] Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods: towards more accurate and robust tools (2018).
[5] A set of data points can never be an algebraic manifold as it lacks any smoothness property and the notion (algebraic) variety would be more appropriate. However, the notion manifold has been established both in machine learning and in single-cell biology as one often thinks of data as arising from a noisy measurement of the smooth manifold of a dynamical system.
[6] Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nature Methods 14, 309–315 (2017).
[7] Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nature Biotechnology 34, 637–645 (2016).
[8] Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs branching cellular lineages. Nature Methods 13, 845–848 (2016).
[9] Street, K. et al. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2017).
[10] Rizvi, A. H. et al. Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nature Biotechnology 35, 551–560 (2017).
[11] Qiu, P. et al. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nature Biotechnology 29, 886–891 (2011).
[12] Giecold, G., Marco, E., Garcia, S. P., Trippa, L. & Yuan, G.-C. Robust lineage reconstruction from high-dimensional single-cell data. Nucleic Acids Research 44, e122–e122 (2016).
[13] Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
[14] We borrowed the term “graph abstraction” from the class of “pattern-based graph abstraction” algorithms [66, 67]. Their idea is to compute a simple abstraction of a complicated graph based on a set of fixed rules, for example, the contraction of a chain of edges to a single edge, similar to graph coarsening. As applying these exact-rule based algorithms to single-cell data is impractical, confusion with PAGA is unlikely and we will often use “graph abstraction” as a synonym for PAGA.
[15] Plass, M. et al. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science 360, eaaq1723 (2018).
[16] van der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008).
[17] Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Research 21, 1160–1167 (2011).
[18] Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell 162, 184–197 (2015).
[19] Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
[20] Xu, C. & Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31, 1974–1980 (2015).
[21] Singh, G., Mémoli, F. & Carlsson, G. E. Topological methods for the analysis of high dimensional data sets and 3d object recognition. In Eurographics Symposium on Point-Based Graphics (2007).
[22] McInnes, L. & Healy, J. arXiv 1802.03426 (2018).
[23] Jacomy, M., Venturini, T., Heymann, S. & Bastian, M. ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software. PLoS ONE 9, e98679 (2014).
[24] Paul, F. et al. Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors. Cell 163, 1663–1677 (2015).
[25] Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
[26] Dahlin, J. S. et al. A single cell hematopoietic landscape resolves eight lineage trajectories and defects in Kit mutant mice. Blood blood–2017–12–821413 (2018).
[27] Görgens, A. et al. Multipotent hematopoietic progenitors divide asymmetrically to create progenitors of the lymphomyeloid and erythromyeloid lineages. Stem Cell Reports 3, 1058–1072 (2014).
[28] Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
[29] Manno, G. L. et al. RNA velocity in single cells (2017).
[30] Wagner, D. E. et al. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science eaar4362 (2018).
[31] 10X Genomics. 1.3 Million Brain Cells from E18 Mice.
[32] Regev, A. et al. Science Forum: The Human Cell Atlas. eLife 6 (2017).
[33] Eulenberg, P. et al. Reconstructing cell cycle and disease progression using deep learning. Nature Communications 8, 463 (2017).
[34] Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19 (2018).
[35] Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
[36] Newman, M. E. J. Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103, 8577–8582 (2006).
[37] David, G. & Averbuch, A. Hierarchical data organization, clustering and denoising via localized diffusion folders. Applied and Computational Harmonic Analysis 33, 1–23 (2012).
[38] Weinreb, C., Wolock, S., Tusi, B. K., Socolovsky, M. & Klein, A. M. Fundamental limits on dynamic inference from single-cell snapshots. Proceedings of the National Academy of Sciences 115, E2467–E2476 (2018).
[39] Fruchterman, T. M. J. & Reingold, E. M. Graph drawing by force-directed placement. Software: Practice and Experience 21, 1129–1164 (1991).
[40] Weinreb, C., Wolock, S. & Klein, A. M. SPRING: a kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics btx792 (2017).
[41] Amir, E.-a. D. et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature Biotechnology 31, 545–552 (2013).
[42] Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology 33, 495–502 (2015).
[43] Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049 (2017).
[44] Wang, B., Zhu, J., Pierson, E., Ramazzotti, D. & Batzoglou, S. Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning. Nature Methods 14, 414–416 (2017).
[45].↵
Lopez, R., Regier, J., Cole, M. B., Jordan, M. & Yosef, N. Bayesian Inference for a Generative Model of Transcriptome Profiles from Single-cell RNA Sequencing (2018).
[46].↵
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single cell RNA-seq denoising using a deep count autoencoder (2018).
[47].↵
Traag, V. Louvain. GitHub (2017).
[48].↵
Pons, P. & Latapy, M. Computing communities in large networks using random walks. Computer and Information Sciences - ISCIS 284 (2005).
[49].↵
Farrell, J. A. et al. Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis. Science eaar3131 (2018).
[50].↵
Lovász, L. Random Walks on Graphs: A Survey. Combinatorics, Paul Erdös is Eighty 2, 1 (1993).
OpenUrl
[51].
von Luxburg, U. A Tutorial on Spectral Clustering. Statistics and Computing 17, 395 (2007).
[52].
Coifman, R. R. et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences 102, 7426–7431 (2005).
[53].
Safro, I., Sanders, P. & Schulz, C. (2012).
[54].
Fouss, F., Pirotte, A., Renders, J.-M. & Saerens, M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering 19, 355–369 (2007).
[55].
Mao, Q., Wang, L., Tsang, I. & Sun, Y. Principal Graph and Structure Learning Based on Reversed Graph Embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence PP, 1–1 (2017).
[56].
Ji, Z. & Ji, H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Research 44, e117 (2016).
[57].
Chen, J., Schlitzer, A., Chakarov, S., Ginhoux, F. & Poidinger, M. Mpath maps multi-branching single-cell trajectories revealing progenitor cell progression during development. Nature Communications 7, 11988 (2016).
[58].
van Unen, V. et al. Visual analysis of mass cytometry data by hierarchical stochastic neighbour embedding reveals rare cell types. Nature Communications 8 (2017).
[59].
Herring, C. A. et al. Unsupervised Trajectory Analysis of Single-Cell RNA-Seq and Imaging Data Reveals Alternative Tuft Cell Origins in the Gut. Cell Systems 6, 37–51.e9 (2018).
[60].
Velten, L. et al. Human haematopoietic stem cell lineage commitment is a continuous process. Nature Cell Biology 19, 271–281 (2017).
[61].
Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nature Methods 14, 979–982 (2017).
[62].
Venna, J., Peltonen, J., Nybo, K., Aidos, H. & Kaski, S. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research 11, 451–490 (2010).
[63].
Wittmann, D. M. et al. Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling. BMC Syst. Biol. 3, 98 (2009).
[64].
Krumsiek, J., Marr, C., Schroeder, T. & Theis, F. J. Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network. PLoS ONE 6, e22649 (2011).
[65].
Moignard, V. et al. Decoding the regulatory network of early blood development from single-cell gene expression measurements. Nature Biotechnology 33, 269–276 (2015).
[66].
Boneva, I., Rensink, A., Kurban, M. & Bauer, J. Graph Abstraction and Abstract Graph Transformation. Tech. Rep., Centre for Telematics and Information Technology, University of Twente, Enschede (2007).
[67].
Rensink, A. & Zambon, E. Pattern-Based Graph Abstraction, 66–80 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012).

Posted November 04, 2018.


Subject Area: Bioinformatics


[1] [1].↵
Wagner, A., Regev, A. & Yosef, N. Revealing the vectors of cellular identity with single-cell genomics. Nature Biotechnology 34, 1145–1160 (2016).
OpenUrl CrossRef PubMed

[2] [2].↵
Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology 32, 381–386 (2014).
OpenUrl CrossRef PubMed

[3] [3].↵
Bendall, S. C. et al. Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development. Cell 157, 714–725 (2014).
OpenUrl CrossRef PubMed Web of Science

[4] [4].↵
Saelens, W., Cannoodt, R., Todorov, H. & Saeys, Y. A comparison of single-cell trajectory inference methods: towards more accurate and robust tools (2018).

[5] [5].↵
A set of data points can never be an algebraic manifold as it lacks any smoothness property and the notion (algebraic) variety would be more appropriate. However, the notion manifold has been established both in machine learning and in single-cell biology as one often thinks of data as arising from a noisy measurement of the smooth manifold of a dynamical system.

[6] [6].↵
Qiu, X. et al. Single-cell mRNA quantification and differential analysis with Census. Nature Methods 14, 309–315 (2017).
OpenUrl

[7] [7].↵
Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nature Biotechnology 34, 637–645 (2016).
OpenUrl CrossRef PubMed

[8] [8].↵
Haghverdi, L., Büttner, M., Wolf, F. A., Buettner, F. & Theis, F. J. Diffusion pseudotime robustly reconstructs branching cellular lineages. Nature Methods 13, 845–848 (2016).
OpenUrl

[9] [9].
Street, K. et al. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics 19, 477 (2017).
OpenUrl

[10] [10].↵
Rizvi, A. H. et al. Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nature Biotechnology 35, 551–560 (2017).
OpenUrl CrossRef PubMed

[11] [11].↵
Qiu, P. et al. Extracting a cellular hierarchy from high-dimensional cytometry data with SPADE. Nature Biotechnology 29, 886–891 (2011).
OpenUrl CrossRef PubMed

[12] [12].↵
Giecold, G., Marco, E., Garcia, S. P., Trippa, L. & Yuan, G.-C. Robust lineage reconstruction from high-dimensional single-cell data. Nucleic acids research 44, e122–e122 (2016).
OpenUrl CrossRef PubMed

[13] [13].↵
Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).
OpenUrl CrossRef PubMed

[14] [14].↵
We borrowed the term “graph abstraction” from the class of “pattern-based graph abstraction” algorithms [66, 67]. Their idea is to compute a simple abstraction of a complicated graph based on a set of fixed rules, for example, the contraction of a chain of edges to a single edge — similar to graph coarsening. As applying these exact-rule based algorithms to single-cell data is impractical, confusion with PAGA is unlikely and we will often use “graph abstraction” as a synonym for PAGA.

[15] [15].↵
Plass, M. et al. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science 360, eaaq1723 (2018).
OpenUrl Abstract/FREE Full Text

[16] [16].↵
van der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008).
OpenUrl

[17] [17].
Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Research 21, 1160–1167 (2011).
OpenUrl Abstract/FREE Full Text

[18] [18].↵
Levine, J. H. et al. Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis. Cell 162, 184–197 (2015).
OpenUrl CrossRef PubMed

[19] [19].↵
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
OpenUrl CrossRef

[20] [20].↵
Xu, C. & Su, Z. Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31, 1974–1980 (2015).
OpenUrl CrossRef PubMed

[21] [21].↵
Singh, G., Mémoli, F. & Carlsson, G. E. Topological methods for the analysis of high dimensional data sets and 3d object recognition. In Eurographics Symposium on Point-Based Graphics (2007).

[22] [22].↵
McInnes, L. & Healy, J. arXiv 1802.03426 (2018).

[23] [23].↵
Jacomy, M., Venturini, T., Heymann, S. & Bastian, M. ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software. PLoS ONE 9, e98679 (2014).
OpenUrl CrossRef PubMed

[24] [24].↵
Paul, F. et al. Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors. Cell 163, 1663–1677 (2015).
OpenUrl CrossRef PubMed

[25] [25].↵
Nestorowa, S. et al. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20–e31 (2016).
OpenUrl Abstract/FREE Full Text

[26] [26].↵
Dahlin, J. S. et al. A single cell hematopoietic landscape resolves eight lineage trajectories and defects in Kit mutant mice. Blood blood–2017–12–821413 (2018).

[27] [27].↵
Görgens, A. et al. Multipotent hematopoietic progenitors divide asymmetrically to create progenitors of the lymphomyeloid and erythromyeloid lineages. Stem cell reports 3, 1058–1072 (2014).
OpenUrl
