## Abstract

Single-cell transcriptomic assays have enabled the de novo reconstruction of lineage differentiation trajectories, along with the characterization of cellular heterogeneity and state transitions. Several methods have been developed for reconstructing developmental trajectories from single-cell transcriptomic data, but efforts on analyzing single-cell epigenomic data and on trajectory visualization remain limited. Here we present STREAM, an interactive pipeline capable of disentangling and visualizing complex branching trajectories from both single-cell transcriptomic and epigenomic data.

## Main text

STREAM (Single-cell Trajectories Reconstruction, Exploration And Mapping) can accurately recover complex developmental trajectories and provide informative and intuitive visualizations to highlight important genes that define subpopulations and cell types. STREAM reliably reconstructs trajectories and pseudotime (the distance from the start of a developmental trajectory) when multiple branching points are present, assumes no prior knowledge about the structure or the number of trajectories, and does not require extensive bioinformatics knowledge thanks to a user-friendly and interactive web interface. Additionally, STREAM has four innovations compared to other existing methods: 1) a novel density-level trajectory visualization useful to study subpopulation composition and cell-fate genes along branching trajectories, 2) a documented end-to-end pipeline to reconstruct trajectories from chromatin-accessibility data, 3) the first interactive database focused on single-cell trajectory visualization for several published studies, and 4) a trajectory mapping procedure to readily map new cells to precomputed structures without pooling data and recomputing trajectories. This last innovation allows facile analysis of data from genetic perturbation studies or to assign diseased/stimulated cells to a normal/resting developmental hierarchy. STREAM has been extensively tested using several published datasets from different organisms (zebrafish, mouse, human) and single-cell technologies (qPCR, scRNA-seq, scATAC-seq). It also has been compared to 10 other methods on both synthetic and real datasets.

STREAM takes as input a single-cell gene expression or epigenomic profile matrix and approximates the data in three or more dimensions with a structure called the principal graph, a set of curves that naturally describe the cells’ pseudotime, trajectories and branching points **(Fig. 1a)**. STREAM first identifies informative features such as variable genes or top principal components. Using these features, cells are then projected to a lower dimensional space using a non-linear dimensionality reduction method called Modified Locally Linear Embedding (MLLE), which preserves distances within local neighborhoods. In the MLLE embedding, STREAM infers cellular trajectories using a novel **El**astic **P**r**i**ncipal **Graph** implementation called ElPiGraph^{1}. ElPiGraph is a completely redesigned algorithm for elastic principal graph optimization introducing the elastic matrix Laplacian, trimmed mean square error, explicit control for topological complexity and scalability to millions of points. In STREAM, ElPiGraph is integrated with a heuristic graph structure seeding and several graph grammars rules optimized for single-cell data. In contrast to the majority of existing methods, ElPiGraph does not rely on kNN graphs or minimum spanning trees. ElPiGraph is very robust to background noise, does not require pre-clustering, can work in multidimensional space, and is able to manage large-scale datasets on an ordinary laptop **(Online Methods)**.

To illustrate STREAM, we first reanalyzed scRNA-seq data from Nestorowa et al. 2016^{2}, which sorted and profiled 1,656 single mouse hematopoietic stem and progenitor cells. Starting from the hematopoietic stem cells (HSCs), STREAM accurately recapitulates known bifurcation events in lymphoid, myeloid and erythroid lineages and positions the multipotent progenitors before the first bifurcation event. To facilitate the exploration of the inferred structure, STREAM includes a *flat tree plot* that intuitively represents trajectories as linear segments in a 2D plane. In this representation, the lengths of tree branches are preserved from the MLLE embedding **(Fig. 1b)**. In addition, cells are projected onto the tree according to their pseudotime locations and the distances from their assigned branches. If the process under study has a natural starting point (for example a known origin in a developmental hierarchy), the user can specify a root node. This allows easy re-organization of the tree using a breadth-first search to obtain a *subway map plot* that better represents pseudotime progression from a selected starting node **(Fig. 1c)**. Although these visualizations capture trajectories and branching points, they are not informative on the density and composition of cell types along pseudotime, a common challenge when modeling large datasets. To solve this problem, we propose a novel trajectory visualization method called the *stream plot*. This compact representation summarizes cellular developmental trajectories, user-defined annotations, branching points, cell density, and gene expression patterns **(Fig. 1d)**. Density information, an aspect overlooked by other methods, is very important to track how the composition of subpopulations changes along a trajectory or gets partitioned around branching events. Additionally, STREAM detects potential marker genes of different types: *diverging genes*, i.e. genes important in defining branching points that are differentially expressed between diverging branches, and *transition genes*, i.e. genes for which the expression correlates with the cell pseudotime on a given branch. The expression patterns of the discovered genes can then be visualized using either subway maps or stream plots **(Fig. 1e-f, Supplementary Fig. 1-2, Supplementary Note 1)**.

STREAM is the only trajectory inference method that explicitly implements a *mapping* procedure, which allows reusing a previously inferred principal graph as reference to map new cells not included in the original fitting procedure. This avoids pooling old and new cells and re-computing trajectories from scratch, a computationally intensive operation that also perturbs the original structure and complicates the interpretation of the pseudotime. This feature is particularly helpful when studying genetic perturbation data or exploring unlabeled data. To show the utility of the mapping feature, we applied STREAM to scRNA-seq data from Olsson et al.^{3}. This study focused on the mouse hematopoietic system, specifically on the consequences of cell determination within the granulocyte-monocyte progenitors (GMP) population after the transcription factors Gfi1 and/or Irf8 are knocked out. STREAM recovers the correct trajectories for the wild-type cells and, using the mapping feature, also predicts and effectively visualizes the consequences of the genetic perturbation as validated in the original study **(Supplementary Fig. 3-4, Supplementary Note 2)**.

To test the robustness and scalability of STREAM, we next explored data derived from different platforms and organisms. We used two recently published zebrafish datasets obtained with single-cell qPCR^{4} and inDrop^{5} (profiling ∼10000 cells) assays. These data provided the first comprehensive model of the zebrafish hematopoiesis system without biases introduced by FACS sorting subpopulations. Our analyses successfully recovered developmental trajectories at unprecedented resolution compared to previous analysis **(Supplementary Fig. 5-6, Supplementary Note 3-4)**.

We next systematically compared STREAM with 10 other state of the art methods for pseudotime inference on three different datasets^{6-14}. First, we assessed the quality of the topology using a previously proposed synthetic dataset^{14}. Second, we assessed the pseudotime accuracy using known marker genes on scRNA-seq data for myoblast differentiation, a classic dataset to compare trajectory inference methods^{15}. Finally, we quantitatively compared the number and quality of trajectories in terms of precision and recall for known marker genes that completely diverge during development. In all the comparisons, STREAM consistently outperforms the other methods in inferring the correct topology, provides smooth pseudotime for myoblast differentiation and reconstructs the most balanced branching structure (avoiding under/over branching) in terms of precision and recall (F1-score) among the methods **(Supplementary Figs. 7-14, Supplementary Note 5)**.

Importantly, we extended STREAM to infer trajectories from human single-cell epigenomic data. This task is particularly challenging since the number of chromatin peaks (∼450,000 peaks across hematopoiesis) far exceeds the number of genes and the accessibility at each peak is sparse, often containing only 0, 1, or 2 reads. Additionally, trajectory reconstruction based on scATAC-seq human data is more difficult than the recently obtained trajectories in non-mammalian organisms^{16} with much smaller genomes. STREAM is able to perform pseudotime ordering on human cell chromatin-accessibility data without relying on accessibility of known transcription factor binding sites^{17} or *a priori* knowledge of sampling time^{18}, hence providing a truly unbiased approach. STREAM in fact uses an unbiased set of DNA sequence features (7-mers), scoring each cell with chromVAR^{19} based on its accessibility deviations across cells **(Fig. 2a)**.

To test the effectiveness of STREAM, we examined open chromatin profiles of > 2,000 cells profiled by scATAC-seq in known human hematopoietic lineages^{20}. STREAM not only accurately reconstructs cellular developmental trajectories of the human blood system, but also recovers key sequence features and master regulators that have been implicated in differentiation and lineage commitment for different subpopulations **(Supplementary Fig. 15, Supplementary Note 6)**. For example, two of the detected 7-mer sequences match binding models for the transcription factors *GATA1* and *CEBPA*, which regulate differentiation towards erythroid and myeloid lineages, respectively **(Fig. 15b-c)**.

STREAM is available as user-friendly open source software and can be used interactively to explore several precomputed datasets and to compute new trajectories at stream.pinellolab.org **(Supplementary Fig. 16)**, or as a standalone command-line tool using Docker (github.com/pinellolab/stream) **(Supplementary Note 7-8)**.

## ONLINE METHODS

### STREAM framework

#### Trajectories inference

**Feature selection:** For transcriptomic data (single-cell RNA-seq or qPCR), the input of STREAM is a gene expression matrix, where rows represent genes, columns represent cells. Each entry contains an adjusted gene expression value (library size normalization and log2 transformation) **(Supplementary Note 7)**. The most variable genes are selected as features, using a procedure we have previously proposed^{1}. Briefly, for each gene, its mean value and standard deviation are calculated across all the cells. Then a non-parametric local regression method (LOESS) is used to fit the relationship between mean and standard deviation values. Genes above the curve that diverge significantly are selected as variable genes.

**Dimensionality reduction:** Each cell can be thought of as a vector in a multidimensional vector space in which each component is the expression level of a gene. Typically, even after feature selection, each cell still has hundreds of components, making it difficult to reliably assess similarity or distances between cells, a problem often referred to as the *curse of dimensionality*^{2}. To mitigate this problem, starting from the genes selected in the previous step we project cells to a lower dimensional space using a non-linear dimensionality reduction method called Modified Locally Linear Embedding (MLLE)^{3}. MLLE takes into account local similarity of each cell with its neighbors and addresses the regularization problem of standard LLE by introducing multiple weight vectors in each neighborhood. The neighborhood size is chosen based on the number of cells and is set by default to 10% of the total number of cells. The number of MLLE components to use depends on the number of branches and on the complexity of the structure to learn. Typically, three components capture the main structure for most datasets; increasing them may recover finer structures (although we observed no benefit from selecting more than 5 components in any of the datasets tested).

#### ElPiGraph: structure learning and fitting

##### Seeding initial tree structure

To create an initial seed structure for the principal graph learning by ElPiGraph we first used the affinity propagation^{4} method to cluster cells in the MLLE space. Affinity propagation is based on the idea of *message-passing* between sample points, and finds a small set of exemplars which are considered to be most representative of the other samples. For all our tests we used the scikit-learn implementation^{5} with a damping factor set to 0.75. Based on the exemplars obtained by the affinity propagation procedure, a minimum spanning tree (MST) was constructed using the Kruskal’s algorithm. The obtained tree is then used as initial tree structure for the ElPiGraph procedure.
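The tree-seeding step can be illustrated with a minimal sketch. It assumes the exemplars have already been obtained (affinity propagation itself is not reimplemented here) and builds the minimum spanning tree over them with Kruskal's algorithm. The function name `kruskal_mst` and the exemplar coordinates are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def kruskal_mst(points):
    """Return the MST edges (i, j, distance) over points via Kruskal's algorithm."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, sorted from shortest to longest.
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for d, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((i, j, d))
    return tree


# Hypothetical exemplar coordinates in a three-component embedding.
exemplars = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.5, 0.0)]
seed_tree = kruskal_mst(exemplars)
```

The resulting edge list plays the role of the initial tree structure that is then handed to the principal graph optimization.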

##### Elastic principal graph method (ElPiGraph)

Elastic principal graphs are structured data approximators^{6-8}, consisting of vertices and edges. The vertices are embedded into the space of the data, minimizing the mean squared distance (MSD) to the data points, similarly to *k*-means. Unlike unstructured *k*-means, the edges connecting the vertices are used to define an elastic energy term. The elastic energy term and MSD are used to create penalties for graph edge stretching and the bending of branches. To find the optimal graph structure, ElPiGraph uses a *graph grammar approach*, which is described below. This approach allows an effective exploration of the graph structure space via a gradient descent-like search. In STREAM, the set of graph grammars used^{9} always results in the construction of a principal tree (i.e., a graph without cycles), but alternative graph grammars can produce more complex (e.g., circular) topologies.

Let *G* be a simple undirected graph with a set of vertices *V* and a set of edges *E*, and let ϕ: *V* → **R**^{m} be a map that describes an embedding of the graph into the multidimensional space **R**^{m} by mapping a node of the graph to a point in the data space. Let a *k*-star be a subgraph of *G* with *k* + 1 vertices *v*_{0}, *v*_{1}, …, *v*_{k} ∈ *V* and *k* edges {(*v*_{0}, *v*_{i}) | *i* = 1, …, *k*}. Let *E*^{(i)}(0), *E*^{(i)}(1) denote the two ends of the graph edge *E*^{(i)}, and let *S*^{(j)}(0), …, *S*^{(j)}(*k*) denote the vertices of a *k*-star *S*^{(j)} (where *S*^{(j)}(0) is the central vertex, to which all other vertices are connected). Let deg(*v*_{i}) denote a function returning the order *k* of the star with the central vertex *v*_{i}, and zero if there is no star centered in *v*_{i}.

The *elastic energy of the graph embedment* is defined as the sum of squared edge lengths (weighted by the elasticity moduli *λ*_{i} and a penalty for excessive branching *α*) and the sum of squared *deviations from harmonicity* for each star (weighted by *μ*_{j}). The second term (the deviation from star harmonicity) in the case of a 2-star is a simple surrogate for minimizing the local curvature. In the case of *k*-stars with *k* > 2 it can be considered as a generalization of local curvature defined for a branching point^{8,10}.

Let *K* be a partition of all the data points under consideration (*X*_{1}, *X*_{2}, …, *X*_{N}) such that *K*(*i*) = argmin_{j = 1…|V|} ‖*X*_{i} − ϕ(*V*_{j})‖^{2} returns the index of the vertex in the graph which is the closest to the *i*th data point among all graph vertices. The objective function that we want to minimize, *U*^{ϕ}(*X*, *G*), combines the weighted, trimmed mean squared distance between each data point and its closest vertex with the elastic energy of the graph, where *w*_{i} is the weight of the data point *i* (can be unity for all points), |*V*| is the number of vertices, ‖..‖ is the usual Euclidean distance and *R*_{0} is a trimming radius that can be used to limit the effect of *points distant from the graph* (and hence to enforce a *local construction* that is more robust to noise)^{11}.
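For reference, one way to write the objective that is consistent with the verbal definitions above is sketched below. This is a reconstruction based on the description in the text rather than a formula quoted from it. The symbols *U*_{E} (edge stretching term) and *U*_{R} (star harmonicity term) are labels introduced here for readability, and *S*^{(j)} denotes a star with central vertex *S*^{(j)}(0) and leaves *S*^{(j)}(1), …, which is also an assumed notation.

```latex
\begin{aligned}
U^{\phi}(X,G) &= \frac{1}{\sum_{i} w_i}\sum_{j=1}^{|V|}\;\sum_{i:\,K(i)=j}
   w_i \,\min\!\left(\lVert X_i-\phi(V_j)\rVert^{2},\; R_0^{2}\right)
   \;+\; U_E^{\phi}(G) \;+\; U_R^{\phi}(G), \\[4pt]
U_E^{\phi}(G) &= \sum_{E^{(i)}}
   \left[\lambda_i + \alpha\left(\max\{2,\ \deg(E^{(i)}(0)),\ \deg(E^{(i)}(1))\}-2\right)\right]
   \lVert \phi(E^{(i)}(0))-\phi(E^{(i)}(1))\rVert^{2}, \\[4pt]
U_R^{\phi}(G) &= \sum_{S^{(j)}} \mu_j\,
   \Bigl\lVert \phi(S^{(j)}(0)) - \frac{1}{\deg(S^{(j)}(0))}
   \sum_{i=1}^{\deg(S^{(j)}(0))}\phi(S^{(j)}(i)) \Bigr\rVert^{2}.
\end{aligned}
```

With *R*_{0} = ∞ and *α* = 0 the first term reduces to an ordinary weighted mean squared distance and every edge is weighted by *λ*_{i} alone.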

Given a graph topology for approximating a set of vectors *X*, our goal is to find a map ϕ:*V*→ **R**^{m} such that *U*^{ϕ}(*X*,*G*) → min over all possible elastic graph *G* embedment in **R**^{m}. The local minimum of *U*^{ϕ}(*X*,*G*) is found by applying the usual splitting type algorithm:

1. Given the partition *K* of the data points by proximity to the graph vertices, we minimize *U*^{ϕ}(*X*,*G*). Note that this functional is quadratic if *K* is fixed; therefore, the solution can be found very fast by solving a system of |*V*| linear equations.
2. Update *K* using the new vertex positions. This simple step can also be implemented very fast.
3. Repeat 1) and 2) until a convergence criterion is met (i.e., the vertices are being displaced by less than a fixed threshold). Note that convergence is guaranteed by the form of *U*^{ϕ}(*X*,*G*), which is a Lyapunov function with respect to the iterations 1-2.
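The alternation between the partition and the quadratic minimization can be sketched in a few lines. The example below is a deliberately simplified illustration, not the ElPiGraph implementation: it assumes unit weights, no trimming (*R*_{0} = ∞), no harmonicity term (*μ* = 0) and no branching penalty (*α* = 0), so the energy is the mean squared distance plus *λ* times the sum of squared edge lengths. All function names and the toy data are hypothetical.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def energy(X, nodes, edges, lam):
    """Simplified energy: MSD to the closest node plus edge stretching."""
    msd = sum(min(math.dist(x, v) ** 2 for v in nodes) for x in X) / len(X)
    stretch = lam * sum(math.dist(nodes[a], nodes[b]) ** 2 for a, b in edges)
    return msd + stretch


def fit(X, nodes, edges, lam=0.01, tol=1e-6, max_iter=100):
    """Alternate partition update and linear solve until nodes stop moving."""
    nodes = [list(v) for v in nodes]
    n, dim = len(nodes), len(X[0])
    history = [energy(X, nodes, edges, lam)]
    for _ in range(max_iter):
        # Partition: assign every point to its closest node.
        K = [min(range(n), key=lambda j: math.dist(x, nodes[j])) for x in X]
        # With K fixed the energy is quadratic, so its minimum solves A phi = b.
        A = [[0.0] * n for _ in range(n)]
        for j in K:
            A[j][j] += 1.0 / len(X)
        for a, b in edges:
            A[a][a] += lam
            A[b][b] += lam
            A[a][b] -= lam
            A[b][a] -= lam
        new = [[0.0] * dim for _ in range(n)]
        for d in range(dim):
            rhs = [0.0] * n
            for x, j in zip(X, K):
                rhs[j] += x[d] / len(X)
            for j, value in enumerate(solve(A, rhs)):
                new[j][d] = value
        shift = max(math.dist(u, v) for u, v in zip(nodes, new))
        nodes = new
        history.append(energy(X, nodes, edges, lam))
        if shift < tol:
            break
    return nodes, history


# Toy data: noisy points along a line, fitted by a three-node chain.
X = [(0.0, 0.0), (1.0, 0.1), (2.0, -0.1), (3.0, 0.0), (4.0, 0.1)]
fitted, history = fit(X, [(0.0, 1.0), (2.0, 1.0), (4.0, 1.0)], [(0, 1), (1, 2)])
```

The recorded `history` never increases from one iteration to the next, which is the Lyapunov property referred to above; the linear system is well posed as long as the graph is connected.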

A graph grammar-based approach for simultaneous learning of the graph topology and embedment of the graph into the data space starts from a seed graph *G*_{0} and a map ϕ_{0}(*G*_{0}). A set of grammar operations is then applied iteratively to transform the graph topology, and hence the map, starting from a given pair {*G*_{i}, ϕ_{i}(*G*_{i})}^{12}. Each grammar operation *Ψ*^{p} produces a set of *s* new candidate graph topologies *D*^{k} (*k* = 1, …, *s*), possibly taking into account the dataset *X*. Given the pair {*G*_{i}, ϕ_{i}(*G*_{i})} characterizing the *i*^{th} step of the algorithm, a set of *r* different graph operations {Ψ^{1}, …, Ψ^{r}} (which we call a “graph grammar”), and an energy function *U*^{ϕ}(*X*, *G*), the algorithm applies all the selected grammar operations, fits the newly derived graph topologies to the data, and chooses the most energetically favorable embedment as the principal graph of the (*i*+1)^{th} step, i.e. the candidate {*D*^{k}, ϕ(*D*^{k})} with the lowest *U*^{ϕ}(*X*, *D*^{k}), where {*D*^{k}, ϕ(*D*^{k})} is supposed to be fit to the data after the application of a graph grammar.

In order to produce principal trees, one defines two operations for graph growth (‘bisect an edge’ and ‘add a node’) and one operation for graph shrinking (‘remove an edge’). Two applications of growth operations are then followed by one of shrinking. Such an approach helps avoid local minima in the structure space of all possible tree topologies.
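To make the two growth operations concrete, the sketch below enumerates the candidate topologies they generate for a small tree and picks the candidate with the lowest score. It only manipulates topology: node positions, the fit of each candidate to the data and the shrinking operation are omitted. The scoring function `n_branch_points` is a toy stand-in for the energy and, like the other names, is a hypothetical choice made for illustration.

```python
def bisect_edge(n_nodes, edges):
    """Candidates obtained by inserting a new node in the middle of each edge."""
    candidates = []
    for k, (a, b) in enumerate(edges):
        new = n_nodes  # index of the inserted node
        rest = edges[:k] + edges[k + 1:]
        candidates.append((n_nodes + 1, rest + [(a, new), (new, b)]))
    return candidates


def add_node(n_nodes, edges):
    """Candidates obtained by attaching a new leaf to each existing node."""
    return [(n_nodes + 1, edges + [(v, n_nodes)]) for v in range(n_nodes)]


def n_branch_points(graph):
    """Toy score (stand-in for the energy): number of nodes with degree >= 3."""
    n_nodes, edges = graph
    degree = [0] * n_nodes
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sum(1 for d in degree if d >= 3)


# A three-node chain 0 - 1 - 2 as the current tree.
chain = (3, [(0, 1), (1, 2)])
candidates = bisect_edge(*chain) + add_node(*chain)
best = min(candidates, key=n_branch_points)
```

In the actual procedure each candidate would first be fitted to the data and then ranked by its energy instead of this toy score.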

The ElPiGraph algorithm has four parameters with clear meaning and effect on the final result:

- *λ* = *λ*_{i} controls the total length of the graph and, at the same time, promotes equal distance between neighbouring graph nodes in the data space.
- *μ* = *μ*_{i} controls the smoothness of the graph embedment (for the tree, smoothness of tree branches and harmonicity of graph stars).
- *α* controls for excessive branching, such that a sufficiently large *α* (e.g., *α* = 1) strongly penalizes any branching while smaller values (e.g., *α* = 0.01) lead to keeping essential branches.
- *R*_{0} is the trimming radius, allowing robust estimation of node positions. ElPiGraph implements simple scaling statistics that allow *R*_{0} to be estimated automatically if needed.

In the simplest case, *R*_{0} = ∞, *α* = 0, and it is recommended to keep λ ≈ μ/10. For all the single-cell datasets in this paper it is desirable to set *α* = 0.02 and sometimes use trimming (automatically defined finite value for *R*_{0}).

##### Adjusting the final tree structure

The resulting principal graph is refined based on the following procedures: 1) Principal tree branches can be extrapolated from the terminal vertices, i.e. a branch can grow, if necessary, to better fit cells that may fall far away from a terminal node. This allows a smoother pseudotime mapping and a more reliable characterization of cells close to initial or terminal points. 2) Branches not supported by at least *n*_{minload} data points can be removed or shrunk. 3) A *k*-star node (a node with connectivity *k* > 2) can be rewired to another graph node if the latter has a higher local density or a larger number of cells projected onto it, to improve the positioning of candidate branching points.

In this paper, the *ElPiGraph.R* R package has been used, available at https://github.com/sysbio-curie/ElPiGraph.R. Implementations of ElPiGraph are also available in other programming languages (Matlab, Java, Python, Scala)^{13}.

#### Visualization

**Flat Tree Plot:** The tree structure learned in the 3D space (or higher dimensional space), is first approximated by linear segments (each representing a branch) and mapped to a 2D plane based on a modified version of the force-directed layout Fruchterman-Reingold algorithm^{14}. In particular, we adjust each edge length in order to preserve the lengths of the branches of the original tree. Finally, using both the pseudotime location on the assigned branch and the distance from it in the MLLE space, we map cells to the obtained tree in the 2D plane. Cells are represented as dots and randomly placed to either side of the assigned branches. Each node in the tree indicates one cell state (cell states are sequentially named S0, S1, … starting from a randomly selected node) and the resulting structure is called *flat tree plot.*

**Subway map plot:** Starting from the flat tree plot and with a designated root or start node, breadth-first search is used to order and arrange nodes and edges horizontally on a 2D plane. Because we preserve the branch lengths of the original tree, the x-axis represents the distance (namely pseudotime) from the start node along the different branches. Cells are then mapped to the obtained structure, called the *subway map plot*, with the same strategy used for the flat tree plot. To display gene expression, each cell is colored according to its gene expression (the maximum value in the colormap is set as the 90th percentile of gene expression values across all cells).
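The horizontal arrangement described for the subway map can be sketched with a plain breadth-first search. The example assumes the tree is given as a dictionary of branches with their lengths and returns, for each node, its x-coordinate (the distance from the chosen root, i.e. pseudotime) together with the visiting order. Vertical placement of branches and the mapping of cells are left out; the function name and the node labels are illustrative.

```python
from collections import deque


def subway_x_positions(branches, root):
    """Breadth-first traversal; x[node] is the path length from the root."""
    adjacency = {}
    for (u, v), length in branches.items():
        adjacency.setdefault(u, []).append((v, length))
        adjacency.setdefault(v, []).append((u, length))

    x = {root: 0.0}
    order = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, length in adjacency[u]:
            if v not in x:  # not visited yet
                x[v] = x[u] + length  # branch lengths are preserved
                queue.append(v)
    return x, order


# Hypothetical tree with one branching point at S1.
branches = {("S0", "S1"): 1.0, ("S1", "S2"): 2.0, ("S1", "S3"): 0.5}
x_from_s0, order_from_s0 = subway_x_positions(branches, "S0")
x_from_s2, _ = subway_x_positions(branches, "S2")
```

Changing the root only re-reads the same tree from a different start node, which is why the user can re-organize the plot without recomputing the trajectory.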

**Stream plot:** Starting from the subway map plot, for each cell type (if cell labels are provided), using a sliding window approach, we first calculate the number of cells in each window along a developmental branch. To provide smooth transitions around the branching nodes, in those regions the sliding window spans both parent branch and children branches and then proceeds independently on each branch. Then, the numbers of cells in all sliding windows are normalized based on the length of the longest path in the tree. The vertical layout of different branches is optimized by taking into consideration normalized numbers of cells to make sure there will not be overlap between branches. Based on the normalized sliding window values, we first use linear interpolation to construct a set of supporting points. Then the Savitzky-Golay filter (a smoothing filter able to preserve well the signal and avoid oscillations)^{15} is applied to create smooth curves based on the set of supporting points. Finally, the obtained curves polygons (one for each cell type) are assembled to form the *stream plot.* On stream plot, the length of each branch is the same as in the subway map plot and represents pseudotime, whereas the width is proportional to the number of cells at a given position. To display gene expression, we consider, for each sliding window, not only the number of cells but also their average gene expression values smoothed by bicubic interpolation (the maximum value is set as the 90th percentile of the average gene expression values from all the sliding windows).

#### Discovery of marker genes

**Diverging gene detection:** For each pair of branches *B*_{i} and *B*_{j}, and for the gene *E*, the gene expression values across cells from both branches are scaled to the range [0,1]. For gene expression *E*_{i} from *B*_{i} and gene expression *E*_{j} from *B*_{j}, we first calculate their mean values. Then, we check the difference between mean values to make sure it is above a specified threshold (the default value is 0.2). The Mann-Whitney U test is then used to test whether *E*_{i} is greater than *E*_{j} or *E*_{i} is less than *E*_{j}. Since the statistic *U* can be approximated by a normal distribution for large samples, and *U* depends on specific datasets, we standardize *U* to a Z-score to make it comparable between different datasets. For small samples where this test is underpowered (<20 cells per branch), we report only the fold change to qualitatively evaluate the differences between *E*_{i} and *E*_{j}. Genes with Z-score or fold change greater than the specified threshold (2.0 by default) are considered as differentially expressed genes between branches. Formally, *Z* = (*U* − *m*_{U}) / *σ*_{U}, where *m*_{U} = *n*_{i}*n*_{j}/2 and *σ*_{U} = √( (*n*_{i}*n*_{j}/12) · ((*n* + 1) − Σ_{ℓ=1…k} (*t*_{ℓ}^{3} − *t*_{ℓ}) / (*n*(*n* − 1))) ) are the mean and standard deviation of *U*, *n* = *n*_{i} + *n*_{j}, *n*_{i} and *n*_{j} are the numbers of cells in each branch, *t*_{ℓ} is the number of cells sharing rank *ℓ* and *k* is the number of distinct ranks.
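The standardization of *U* can be sketched as follows, using the usual normal approximation of the Mann-Whitney statistic with a tie correction: *Z* = (*U* − *m*_{U}) / *σ*_{U}, with *m*_{U} = *n*_{i}*n*_{j}/2 and a standard deviation corrected by the sizes of the groups of tied values. This is a minimal standard-library sketch, assuming expression values have already been scaled; the function name is hypothetical and the fold-change fallback for small samples is not shown.

```python
import math
from collections import Counter


def mann_whitney_z(expr_i, expr_j):
    """Z-score of the Mann-Whitney U statistic of expr_i versus expr_j."""
    n_i, n_j = len(expr_i), len(expr_j)
    n = n_i + n_j

    # Midranks: tied values share the average of the ranks they occupy.
    counts = Counter(expr_i + expr_j)
    rank = {}
    position = 0
    for value in sorted(counts):
        t = counts[value]
        rank[value] = position + (t + 1) / 2
        position += t

    rank_sum_i = sum(rank[v] for v in expr_i)
    u = rank_sum_i - n_i * (n_i + 1) / 2

    m_u = n_i * n_j / 2
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    sigma_u = math.sqrt(n_i * n_j / 12 * ((n + 1) - tie_term))
    if sigma_u == 0:  # all values identical: no evidence of a difference
        return 0.0
    return (u - m_u) / sigma_u


high = [0.9, 0.8, 0.7]
low = [0.1, 0.2, 0.3]
z = mann_whitney_z(high, low)
```

A positive score indicates higher expression on the first branch, and swapping the two branches flips the sign.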

**Transition gene detection:** For each branch *B*_{i} and for each gene *E* we first scale the gene expression values to [0,1] for convenience. Then we check if the candidate gene has a reasonable dynamic range considering cells close to the start and end points. To this end, we consider the difference in average gene expressions of the first 20% and the last 80% of the cells based on the inferred pseudotime. If the difference is greater than a specified threshold (the default value is 0.2), we then calculate Spearman’s rank correlation between inferred pseudotime and gene expression of all the cells along *B*_{i}. Genes with Spearman’s correlation coefficient above a specified threshold (0.4 by default) are identified and reported as transition genes.
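A minimal sketch of this test for a single gene on a single branch is given below. Two points are assumptions rather than statements from the text: the dynamic-range check is read as comparing the cells before the 20th percentile of pseudotime with the cells after the 80th percentile, and the correlation threshold is applied to the absolute value so that decreasing genes also qualify. Function names are hypothetical.

```python
import math


def midranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(midranks(x), midranks(y))


def is_transition_gene(pseudotime, expr, diff_cutoff=0.2, corr_cutoff=0.4):
    lo, hi = min(expr), max(expr)
    if hi == lo:  # constant gene, nothing to detect
        return False
    scaled = [(e - lo) / (hi - lo) for e in expr]  # scale to [0, 1]
    order = sorted(range(len(scaled)), key=pseudotime.__getitem__)
    n_edge = max(1, round(0.2 * len(order)))
    early = [scaled[i] for i in order[:n_edge]]
    late = [scaled[i] for i in order[-n_edge:]]
    diff = abs(sum(late) / len(late) - sum(early) / len(early))
    if diff <= diff_cutoff:
        return False
    return abs(spearman(pseudotime, scaled)) > corr_cutoff


pseudotime = [float(t) for t in range(10)]
rising = [float(t) for t in range(10)]
flat = [1.0, 0.0] * 5
```

Here `is_transition_gene(pseudotime, rising)` passes both checks, whereas the alternating `flat` profile is rejected at the dynamic-range step.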

**Mapping procedure:** For a set of unmapped cells X = {*x*_{i} | *i* = 1, …, *M*} and a fitted tree *T* built using the set of cells Y = {*y*_{j} | *j* = 1, …, *N*}, we currently assume that X and Y have the same measured genes and are sequenced using the same experimental protocol. Both X and Y are library size normalized and log2 transformed. To map cell *x*_{i} into the embedding, we first find its nearest *K* neighbors in *Y*, based on the same feature genes and *K* used to build *T*. The largest distance between *x*_{i} and its *K* neighbors is then chosen as the radius *r*. Then all the cells in *Y* within the radius, *J*_{i} = {*y*_{j} | *d*(*x*_{i}, *y*_{j}) ≤ *r*}, are used to compute a set of weights *W*_{i} = {*w*_{ji}, *j* ∈ *J*_{i}} as defined in the original MLLE procedure. Finally, using the MLLE embedding vectors *V* = {*v*_{1}, …, *v*_{N}}, the new cell position *x*′_{i} is calculated in the embedding as the weighted combination of the embedding vectors of its neighbors: *x*′_{i} = Σ_{j ∈ J_{i}} *w*_{ji} *v*_{j}.

After mapping, each cell is assigned to its closest branch in T.
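The neighborhood selection and the weighted placement of a new cell can be sketched as follows. Note the substitution: normalized inverse-distance weights are used here as a simple stand-in for the MLLE reconstruction weights, which are more involved to compute. The function name, the value of `k` and the toy data are hypothetical, and the final assignment to the closest branch is not shown.

```python
import math


def map_cell(x_new, Y, V, k=3, eps=1e-12):
    """Place x_new in the embedding from its neighbors in the reference set.

    Y: reference cells in feature space; V: their embedding vectors.
    Inverse-distance weights stand in for the MLLE weights (assumption).
    """
    distances = [math.dist(x_new, y) for y in Y]
    radius = sorted(distances)[k - 1]  # distance to the k-th nearest neighbor
    neighbors = [j for j, d in enumerate(distances) if d <= radius]

    raw = [1.0 / (distances[j] + eps) for j in neighbors]
    total = sum(raw)
    weights = [w / total for w in raw]  # weights sum to one

    dim = len(V[0])
    return [
        sum(w * V[j][c] for w, j in zip(weights, neighbors))
        for c in range(dim)
    ]


# Four reference cells with a one-dimensional embedding.
Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
V = [(0.0,), (1.0,), (2.0,), (9.0,)]
on_reference = map_cell((0.0, 0.0), Y, V, k=1)
between = map_cell((0.5, 0.0), Y, V, k=2)
```

Because the reference embedding `V` is never modified, mapping new cells leaves the original structure and its pseudotime untouched.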

**STREAM analysis on scATAC-seq data:** For the scATAC-seq analysis, a total of 3,072 cells were profiled using FACS to isolate 9 distinct populations from CD34+ human bone marrow, which encompassed progenitors for four well-defined lineages^{16}. 2,034 high-quality cells passed quality control filtering and were used in the downstream analysis with STREAM. Specifically, cells were filtered so that 1000 unique nuclear fragments were observed for each cell and at least 60% of these reads aligned in open chromatin peaks. After filtering low quality cells, the mean intensity and GC content for each peak that was called for this dataset were computed using the addGCBias function for the hg19 genome using the BSgenome.Hsapiens.UCSC.hg19 package available through chromVAR^{17}. These two coordinates were used to infer an empirically-defined set of background peaks to compute accessibility deviations, which have been described elsewhere^{16,18}. As features we used an unbiased k-mer scoring, which is naive to any known transcription factor motif and thus generalizable to other systems. We used the matchKmers function in chromVAR with parameters k = 7 and genome = BSgenome.Hsapiens.UCSC.hg19, which returns a matrix of dimension number of peaks by number of k-mers where a 1 indicates that the peak contains the k-mer sequence. The output of this function was then included in the computeDeviations function to compute chromatin accessibility z-scores for each of the k-mers in our dataset. This matrix of cells by k-mer accessibility z-scores serves as a data-driven dimensionality reduction of the chromatin accessibility profiles of these cells. Based on the z-score matrix of k-mer DNA sequences, all the 7-mer features are standardized to have zero mean and unit variance. PCA is performed on the scaled matrix to convert z-scores to principal components. According to the variance ratio elbow plot we selected the top 15 PCs, but excluded the first component since it captured technical noise (dropout and number of reads). The obtained matrix is used to reconstruct trajectories as previously described. Diverging and transition k-mers were selected with the same procedures used for gene selection. Finally, detected k-mers were mapped to known transcription factors using Tomtom^{19} (http://memesuite.org/tools/tomtom) and a previously assembled motif database, [chromvar_and_hocomoco.meme](https://github.com/buenrostrolab/chromVARmotifs)^{16}.

#### Comparison of methods for trajectory inference

**Simulated datasets:** Given a set of *n* cells and assuming we know their developmental/sampling time and topological organization, i.e. how they are organized in branches, we can easily evaluate a generic reconstruction method with the following two metrics:

1. Difference between the number of inferred and true branches.
2. Correlation between the true sampling time *X* and the inferred pseudotime *Y*. For the pseudotime we use either the proposed ranking or the actual distance from the starting point as provided by each method. We used 3 different measures of correlation: Pearson correlation *r*, Spearman correlation *ρ* and Kendall’s tau correlation *τ*. The Spearman correlation is calculated as *ρ* = *cov*(*rg*_{X}, *rg*_{Y}) / (*σ*_{rgX} *σ*_{rgY}), where *rg*_{X} and *rg*_{Y} are the ranks of cells, *cov*(*rg*_{X}, *rg*_{Y}) is the covariance of the rank variables, and *σ*_{rgX} and *σ*_{rgY} are the standard deviations of the rank variables. Note that since both Spearman correlation *ρ* and Kendall’s tau correlation *τ* are rank-based methods, the correlation between *X* and *Y* and the correlation between *X* and *rg*_{Y} are the same, so we consider only the correlation between *X* and *Y*.
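For reference, the textbook definitions of the three coefficients are collected below. They are given under two assumptions: the sample means are written as x̄ and ȳ, and Kendall's tau is shown in its simplest form (tau-a), where *n*_{c} and *n*_{d} are the numbers of concordant and discordant pairs of cells; the text does not say whether a tie-corrected variant was used.

```latex
\begin{aligned}
r    &= \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
             {\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}, \\[4pt]
\rho &= \frac{\operatorname{cov}(rg_X,\, rg_Y)}{\sigma_{rg_X}\,\sigma_{rg_Y}}, \\[4pt]
\tau &= \frac{n_c - n_d}{n(n-1)/2}.
\end{aligned}
```

Both *ρ* and *τ* depend only on the ordering of the values, which is why replacing *Y* by its ranks leaves them unchanged.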

**Real datasets:** To evaluate the quality of reconstruction in real datasets in which we do not have the real developmental time and topological information, we used the following two metrics:

1. *Path-specific marker gene correlation analysis:* In real datasets we often do not have the sampling time along a branch. In this case it is instead helpful to evaluate how the inferred pseudotime recapitulates the progressive activation or repression of an important gene along that branch. The main idea here is to treat the ordering of cells based on a marker gene, which is important in defining a developmental trajectory, as a reasonable surrogate for the correct pseudotime ordering. As in the simulation case we computed 4 correlation coefficients using marker gene expression *X* and the inferred pseudotime *Y*.
2. *F*_{1} *score analysis on diverging or mutually exclusive marker genes:* Let us consider a pair of diverging or mutually exclusive marker genes, *G*_{i} and *G*_{j}. These genes should be highly expressed on different committed branches and rarely co-expressed in the same cell. We define *B*_{i} as the branch which contains the most cells expressing *G*_{i}. Then we define as true positive (TP) the number of cells expressing *G*_{i} on *B*_{i}. The number of cells expressing *G*_{i} on the other branches is defined as false negative (FN). The number of cells expressing *G*_{j} on *B*_{i} is defined as false positive (FP). Similarly, for *G*_{j}, *B*_{j} is the branch which has the most cells expressing *G*_{j}. TP is the number of cells expressing *G*_{j} on *B*_{j}. FN is the number of cells expressing *G*_{j} on the other branches. FP is the number of cells expressing *G*_{i} on *B*_{j}. Recall, precision and F1 score are then calculated for *G*_{i} and *G*_{j} respectively as recall = TP / (TP + FN), precision = TP / (TP + FP) and *F*_{1} = 2 · precision · recall / (precision + recall).
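The counting scheme behind the F1 analysis can be sketched for one marker of a diverging pair. The example assumes each cell has already been assigned to a branch and that "expressing" has been reduced to a boolean flag per cell (how that threshold is chosen is not specified here); precision, recall and F1 follow their standard definitions. The function name and the toy data are hypothetical.

```python
from collections import Counter


def marker_f1(branch_of_cell, expresses_marker, expresses_partner):
    """Precision, recall and F1 for one marker of a diverging gene pair."""
    # Branch holding the most cells that express the marker.
    per_branch = Counter(
        b for b, e in zip(branch_of_cell, expresses_marker) if e
    )
    best_branch = per_branch.most_common(1)[0][0]

    tp = per_branch[best_branch]           # marker cells on its own branch
    fn = sum(per_branch.values()) - tp     # marker cells on other branches
    fp = sum(                              # partner-gene cells on that branch
        1
        for b, e in zip(branch_of_cell, expresses_partner)
        if e and b == best_branch
    )

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return best_branch, precision, recall, f1


# Six cells on two branches with boolean expression calls for G_i and G_j.
branches = ["A", "A", "A", "B", "B", "B"]
g_i = [True, True, True, True, False, False]
g_j = [False, True, False, False, True, True]
result_i = marker_f1(branches, g_i, g_j)
result_j = marker_f1(branches, g_j, g_i)
```

Under-branching merges both markers onto one branch and lowers precision, while over-branching splits a marker across branches and lowers recall, so the F1 score rewards a balanced structure.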

#### Website and code availability

STREAM is available as a user-friendly open source software and can be used interactively as a web-application at http://stream.pinellolab.org or as a standalone command-line tool: https://github.com/pinellolab/STREAM.

#### Data availability

All the data used in this study have been deposited at https://github.com/pinellolab/STREAM or are available as supplementary information.

## Footnotes

**Contact:** lpinello{at}mgh.harvard.edu, gcyuan{at}jimmy.harvard.edu