## Abstract

Trajectory inference methods analyze thousands of cells from single-cell sequencing technologies and computationally infer their developmental trajectories. Though many tools have been developed for trajectory inference, most of them lack a coherent statistical model and reliable uncertainty quantification. In this paper, we present VITAE, a probabilistic method combining a latent hierarchical mixture model with variational autoencoders to infer trajectories from posterior approximations. VITAE is computationally scalable and can adjust for confounding covariates to integrate multiple datasets. We show that VITAE outperforms other state-of-the-art trajectory inference methods on both real and synthetic data under various trajectory topologies. We also apply VITAE to jointly analyze two single-cell RNA sequencing datasets on mouse neocortex. Our results suggest that VITAE can successfully uncover a shared developmental trajectory of the projection neurons and reliably order cells from both datasets along the inferred trajectory.

## 1 Introduction

Single-cell genomics technologies have become immensely popular over the past five years and are becoming indispensable tools for biologists to understand cellular diversity and cell activities (Tang et al., 2011; Tanay & Regev, 2017; Hovestadt et al., 2019). Single-cell RNA sequencing (scRNA-seq), which simultaneously measures tens of thousands of RNAs inside individual cells, is one of the most mature and widely used technologies in single-cell genomics. The development of scRNA-seq technology has enabled new scientific discoveries in many fields, including immunology, cancer studies, neurology, and developmental and stem cell biology (Chen et al., 2019; Fan et al., 2020; Cembrowski, 2019; Griffiths et al., 2018; Kumar et al., 2017).

Many biological processes, such as differentiation, immune response, or cancer expansion, can be described and represented as continuous dynamic changes in cell-type or cell-state space (Lähnemann et al., 2020). Instead of belonging to distinct cell types, the cells can exhibit a continuous spectrum of states and be involved in transitions between different cellular states. As scRNA-seq measures cells at single-cell resolution, it has become a valuable tool to understand these dynamic processes. Trajectory inference (TI) for scRNA-seq aims to study the contiguous states of the measured cells, infer the underlying developmental trajectories, and computationally order the cells along these trajectories (Trapnell et al., 2014). Rather than assigning a discrete group ID to each cell, TI methods position each cell on the inferred continuous trajectories.

Researchers have developed various computational tools for TI. Many of them take a graphical point of view, exploiting similarities between cells calculated from their scRNA-seq gene expressions (Qiu et al., 2017; Street et al., 2018; Wolf et al., 2019). The directions along trajectories can also be determined when the trajectories’ roots are given based on external biological knowledge. However, since scRNA-seq only captures a static snapshot of each cell’s transcriptome at some specific time points, the true temporal lineages or trajectories of the cells that biologists aim to find are not always identifiable from scRNA-seq (Weinreb et al., 2018; Tritschler et al., 2019). Even though more than 70 computational methods have been proposed for TI (Saelens et al., 2019), to the best of our knowledge, most of them are not based on an explicit or coherent statistical model, and there is not even a clear definition of the “trajectory” that we can identify and estimate from scRNA-seq. As a consequence, efficiency and accuracy in TI cannot be discussed in a well-defined way, let alone useful statistical inference and uncertainty quantification for the estimated trajectories and cell orderings. In addition, as scRNA-seq datasets continue to expand (Svensson et al., 2020), multiple individuals and datasets are becoming available for the same or closely related biological processes. There is not yet a TI method that can simultaneously integrate multiple scRNA-seq datasets and infer a shared underlying trajectory.

In this paper, we build a statistical framework to approximate the distribution of scRNA-seq data when the cells evolve along some underlying trajectories. We propose a new method, **VITAE** (**V**ariational **I**nference for **T**rajectory by **A**uto**E**ncoder), to perform TI. VITAE combines a hierarchical mixture model, describing the trajectory structure in a low-dimensional space, with deep neural networks that nonlinearly map the low-dimensional space to the high-dimensional observed RNA counts. Our framework provides an explicit definition of the trajectory backbone and the “pseudotime” of each cell, which are identifiable from scRNA-seq. We can infer all types of trajectories, allowing loops and disconnected states. With the approximate variational inference, we also provide scalable uncertainty quantification on both the inferred trajectory backbone and the estimated cell positions. Finally, we can simultaneously adjust for known confounding variables, such as the cell-cycle and batch effects, and reliably infer a shared trajectory from multiple scRNA-seq datasets.

The structure of the paper is as follows. Section 2 provides a review of existing TI methods. Section 3 describes the VITAE framework and model training. Section 4 discusses the approximate variational inference with VITAE for trajectory inference and uncertainty quantification. Section 5 evaluates the performance of VITAE on both real and synthetic data. Section 6 studies the developmental trajectory of cells in the mouse neocortex by a joint analysis of two scRNA-seq datasets. A Python package implementing VITAE is available at https://github.com/jaydu1/VITAE.

## 2 Review of Existing TI Methods

Since the first TI method for scRNA-seq was proposed by Trapnell et al. (2014), many more have been developed in the past few years. Recently, Saelens et al. (2019) performed a comprehensive evaluation comparing and benchmarking 45 TI methods on both real and synthetic datasets with various evaluation metrics. Some of these methods can only infer specific types of trajectory topology, and some may not scale to large datasets. Among the 45 TI methods, PAGA (Wolf et al., 2019), Slingshot (Street et al., 2018), and the Monocle series (Trapnell et al., 2014; Qiu et al., 2017; Cao et al., 2019) performed the best overall, and are among the most popular TI methods in practice.

Many TI methods, including Monocle, Slingshot, and PAGA, infer the trajectories from a graphical perspective. These methods typically start with some dimension reduction of the observed data and then fit “trees” or “curves” in the reduced space that can best “connect” the cells. They differ in the algorithmic choices used to perform dimension reduction and to construct the connectivity graph. For example, Monocle 1 (Trapnell et al., 2014) uses independent component analysis for dimension reduction, then infers the trajectory by building a minimum spanning tree (MST) directly on the cells. Monocle 2 (Qiu et al., 2017) later improves Monocle 1 by introducing cell centroids and a reversed graph embedding on the centroids, with the cells projected onto the embedded space. Similarly, Slingshot (Street et al., 2018) starts with a given dimension reduction and cell clustering; however, it finds an MST on the cluster centers and simultaneously fits principal curves to infer each cell’s pseudotime. PAGA (Wolf et al., 2019) also starts with clustering and dimension reduction. Instead of building an MST, it then partitions, prunes, and connects the clusters based on a generic statistical model of connectivity among the clusters. Building on PAGA, Monocle 3 (Cao et al., 2019) was recently proposed to learn a principal graph on each of the PAGA partitions to obtain a finer trajectory. These graph-based TI methods have been shown to be successful algorithms that provide biologically meaningful trajectory estimates on many datasets. A drawback of these approaches is that they lack a coherent model for the data distribution and an explicit definition of the “trajectory” and “pseudotime” that they aim to estimate.

In contrast, some other methods use a Gaussian process probabilistic model (Campbell & Yau, 2016; Ahmed et al., 2019) to explicitly model the data distributions with a pseudotemporal ordering of the cells. Though these methods have a clear objective and can perform uncertainty quantification, they are restricted to inferring only linear trajectories, where cells have a total order along the linear path. As they do not allow for any branches, they can only be applied to specific dynamic processes. Another type of TI method (Marco et al., 2014; Schiebinger et al., 2019; Lin & Bar-Joseph, 2019; Tran & Bader, 2020) makes use of the temporal information of the cells with a time-series model when the cells are collected from a sequence of time points. For instance, Waddington-OT (Schiebinger et al., 2019) infers developmental trajectories by temporal couplings of the cells with optimal transport, and CSHMM (Lin & Bar-Joseph, 2019) uses a continuous-state hidden Markov model to assign cells to developmental paths. Though the collection time of the cells is often informative for their pseudotime ordering, the two may not necessarily be positively correlated. In addition, it may be hard to estimate the temporal coupling or the hidden Markov model with satisfactory accuracy, given the technical noise, batch effects, sampling bias, and high dimensionality of scRNA-seq data.

Recently, the RNA velocity, which is a vector representing the transcriptional dynamics of each cell, has shown its great potential in inferring the cellular differentiation and thus the developmental trajectory of single cells (La Manno et al., 2018; Bergen et al., 2020). Instead of only using the matrix of total RNA counts in scRNA-seq, the estimation of RNA velocities is through both the matrices of spliced and unspliced RNAs. The ratio between spliced and unspliced RNAs provides exciting extra information in identifying the underlying temporal structure of the cells’ dynamic process. However, using such information needs a completely different framework and is beyond the scope of this paper.

Motivated by the graph-based TI methods, we propose a probabilistic model that endows the whole procedure with statistical inference, and we estimate the model with variational autoencoders, which are computationally efficient and flexible enough to simultaneously adjust for confounding covariates. Deep learning has received tremendous attention across research communities, and generative modeling with specific priors is a large and well-developed branch of it (Kingma & Welling, 2014; Sohn et al., 2015; Jiang et al., 2017; Zong et al., 2018). For scRNA-seq, deep generative models are also widely applied for various purposes, such as cell clustering (Grønbech et al., 2020; Li et al., 2020), batch correction (Huang et al., 2020), data denoising (Wang et al., 2019; Eraslan et al., 2019) and many other tasks.

## 3 Model and Estimation

### 3.1 A hierarchical mixture model for the trajectory structure

Inspired by the common trajectory model proposed in Saelens et al. (2019), we start with the trajectory backbone defined on a complete graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ with vertices $\mathcal{N}$ and edges $\mathcal{E}$ (Figure 1a). Let the $k$ vertices in $\mathcal{N}$ be the distinct cell states; we use an edge between two vertices to represent the transition between the two states. To model the scenario in which a cell either belongs to a specific state or is developing from one state to another, we assume that each cell is positioned on either a vertex or an edge of $\mathcal{G}$. Specifically, let $\tilde{w}_i \in [0, 1]^k$ be the position of cell $i$ on the backbone; we have

$$\tilde{w}_i = \begin{cases} e_j & \text{if cell } i \text{ is on vertex } j \in \mathcal{N}, \\ w_i e_{j_1} + (1 - w_i) e_{j_2} & \text{if cell } i \text{ is on the edge between vertices } j_1 \text{ and } j_2, \end{cases}$$

where $e_j$ is a one-hot vector with $j$th element 1 and all other elements 0, and the scalar $w_i \in [0, 1]$ describes the relative position of cell $i$ if it is on an edge.

The basic goal of our trajectory inference is to infer the trajectory backbone $\mathcal{B}$, a subgraph of $\mathcal{G}$ whose vertices and edges have positive proportions of cells:

Meanwhile, we also aim to estimate the relative positions of each cell on this trajectory backbone, and as in other TI methods, the pseudotime order of the cells along the trajectory. Intuitively, the pseudotime corresponds to the biological progression of a cell through the dynamic process (such as cell differentiation) that results in the trajectory structure (Trapnell et al., 2014). In TI, it refers to the cell’s geodesic distance to the root along the trajectory.

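Here we give a formal definition of the pseudotime in our framework. Given a root vertex $k_0$, we first define the pseudotime for each vertex. Assume that each edge $\ell$ is associated with a duration $b_\ell$. Without external information, we set $b_\ell = 1$. The pseudotime for the vertex $j$ can be defined as

$$o_j = \min_{\mathcal{P}} \sum_{\ell \in \mathcal{P}} b_\ell, \tag{1}$$

where the minimum is taken over all paths $\mathcal{P}$ from $k_0$ to $j$ in $\mathcal{B}$. Let $\boldsymbol{o} = (o_1, \cdots, o_k)$, then the pseudotime of the cell $i$ is defined as

$$T_i = \boldsymbol{o}^\top \tilde{w}_i. \tag{2}$$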

Since $\tilde{w}_i$ has at most two nonzero elements, the pseudotime $T_i$ of a cell equals the pseudotime of the vertex that it is on, or the weighted average of the pseudotimes of the two vertices of the edge that it belongs to. The pseudotime is only well-defined when there are no loops in $\mathcal{B}$. When there are loops, though the pseudotime $T_i$ is still computable under our definition, we do not think that it is a biologically meaningful quantity to pursue.
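
The pseudotime definition can be illustrated with a short Python sketch. It is not taken from the VITAE package: the function names are hypothetical, vertices are 0-indexed, the vertex pseudotime is computed as a shortest-path distance from the root (our reading of the geodesic-distance definition), and vertices unreachable from the root are assigned infinity by assumption.

```python
import heapq

def vertex_pseudotime(k, edges, root, durations=None):
    """Shortest-path pseudotime o_j of every vertex from the root (Dijkstra).

    edges: iterable of (j1, j2) pairs (0-indexed) in the backbone.
    durations: optional dict {(j1, j2): b}; defaults to b = 1 for every edge.
    Unreachable vertices keep float('inf') (an assumption of this sketch).
    """
    adj = {j: [] for j in range(k)}
    for (a, b) in edges:
        w = 1.0 if durations is None else durations[(a, b)]
        adj[a].append((b, w))
        adj[b].append((a, w))
    o = [float("inf")] * k
    o[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        dist, j = heapq.heappop(heap)
        if dist > o[j]:
            continue  # stale heap entry
        for nxt, w in adj[j]:
            if dist + w < o[nxt]:
                o[nxt] = dist + w
                heapq.heappush(heap, (o[nxt], nxt))
    return o

def cell_pseudotime(o, w_tilde):
    """T_i = o^T w_tilde; zero entries are skipped so that inf * 0 never occurs."""
    return sum(o_j * w_j for o_j, w_j in zip(o, w_tilde) if w_j != 0.0)

# Backbone 0 - 1 - 2 with a branch 1 - 3, rooted at vertex 0.
o = vertex_pseudotime(4, [(0, 1), (1, 2), (1, 3)], root=0)
t_vertex = cell_pseudotime(o, [0.0, 0.0, 1.0, 0.0])   # cell sitting on vertex 2
t_edge = cell_pseudotime(o, [0.0, 0.25, 0.0, 0.75])   # cell on edge (1, 3)
```

A cell on an edge receives the weighted average of the pseudotimes of the edge's two endpoints, as stated above.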

Now we relate the observed scRNA-seq data to the underlying trajectory backbone graph $\mathcal{B}$. Let $Y_i = (y_{i1}, \cdots, y_{iG})$ be the observed gene expressions of $G$ genes in cell $i$ from scRNA-seq. Though $Y_i$ is a vector of observed counts with complicated dependence across genes, we assume that the dependence can be explained by latent Gaussian variables $Z_i \in \mathbb{R}^d$, which lie in a space of much lower dimension and are associated with the trajectory backbone $\mathcal{B}$. We also take into account the effects of known confounding covariates $X_i \in \mathbb{R}^s$ (such as cell-cycle or batch effects, which we take as deterministic variables) on $Y_i$. Thus, we assume the following latent variable model on the observed counts:

$$Z_i \mid \tilde{w}_i \sim \mathcal{N}_d(\boldsymbol{U} \tilde{w}_i, I_d), \qquad y_{ig} \mid Z_i \overset{\text{ind.}}{\sim} \mathrm{NB}\big(l_i f_g(Z_i, X_i), \theta_g\big), \quad g = 1, \cdots, G. \tag{3}$$

Here $\boldsymbol{U} \in \mathbb{R}^{d \times k}$ is the unknown matrix of the vertex positions which, together with the cells’ relative positions $\tilde{w}_i$ on $\mathcal{G}$, determines the means of $Z_i$. Then, we assume that the observed counts $Y_i$ depend nonlinearly on the latent variables $Z_i$. The scalar $l_i$ is the known library size of cell $i$, and each $f_g: \mathbb{R}^{d+s} \to \mathbb{R}$, $g = 1, 2, \cdots, G$, is an unknown nonlinear function involving the confounding covariate vector $X_i$. The unknown parameters $\theta_g$ are gene-specific dispersion parameters of the Negative Binomial (NB) distribution. Notice that though the edges are assumed to be straight lines in the latent space, they are likely curves in the observed data space via the nonlinear mappings $\{f_g(\cdot), g = 1, 2, \cdots, G\}$.

The assumption of using the NB distribution to model scRNA-seq data is based on a detailed review and discussion in Agarwal et al. (2020). Specifically, for scRNA-seq with unique molecular identifier (UMI) counts, the Negative Binomial distribution can describe the stochasticity in scRNA-seq well, accounting for both biological and technical noise. However, for scRNA-seq data without UMIs, a Negative Binomial distribution may not be adequate. In that scenario, we assume that the observed counts follow a zero-inflated Negative Binomial (ZINB) distribution as in Risso et al. (2018):

$$y_{ig} \mid Z_i \overset{\text{ind.}}{\sim} \phi_{ig}\, \delta_0 + (1 - \phi_{ig})\, \mathrm{NB}\big(l_i f_g(Z_i, X_i), \theta_g\big), \tag{4}$$

where $\delta_0$ is a point mass at zero and $\phi_{ig}$ is the zero-inflation probability modeled by a nonlinear mapping function $h_g(Z_i, X_i)$ for $g = 1, \cdots, G$.
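
To make the two count likelihoods concrete, the following Python sketch evaluates the NB and ZINB log-probabilities for a single count. It assumes the common mean-dispersion parameterization, in which a count with mean $\mu$ and dispersion $\theta$ has variance $\mu + \mu^2/\theta$; the text above does not spell out the parameterization, so this is an assumption, and the function names are hypothetical.

```python
import math

def nb_logpmf(y, mu, theta):
    """log P(Y = y) for an NB count with mean mu > 0 and dispersion theta > 0."""
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))

def zinb_logpmf(y, mu, theta, phi):
    """log P(Y = y) for a zero-inflated NB with zero-inflation probability phi."""
    nb = math.exp(nb_logpmf(y, mu, theta))
    point_mass = 1.0 if y == 0 else 0.0
    return math.log(phi * point_mass + (1.0 - phi) * nb)

# Sanity checks: both distributions sum to one and the NB mean equals mu.
mu, theta, phi = 3.0, 2.0, 0.2
nb_total = sum(math.exp(nb_logpmf(y, mu, theta)) for y in range(500))
nb_mean = sum(y * math.exp(nb_logpmf(y, mu, theta)) for y in range(500))
zinb_total = sum(math.exp(zinb_logpmf(y, mu, theta, phi)) for y in range(500))
```

In the model, `mu` would be the library size times the decoder output for gene $g$, and `phi` the output of $h_g$.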

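The above model defines the trajectory backbone $\mathcal{B}$, cell positions $\tilde{w}_i$, and pseudotime $T_i$ that are identifiable from scRNA-seq data. However, estimating these quantities is still challenging as the vertices and edges are defined on a latent space. Thus, we further impose a hierarchical model on $\tilde{w}_i$ to simplify model estimation. First, we introduce a latent categorical variable $c_i$ as the index of all edges and vertices. Specifically, let $c_i$ take values in $\{1, 2, \dots, K\}$ where $K = k(k + 1)/2$ is the number of all possible edges and vertices in $\mathcal{G}$. It is associated with $\tilde{w}_i$ through the following symmetric categorical assignment matrix:

$$C = \begin{pmatrix} 1 & 2 & \cdots & k \\ 2 & k + 1 & \cdots & 2k - 1 \\ \vdots & \vdots & \ddots & \vdots \\ k & 2k - 1 & \cdots & K \end{pmatrix}. \tag{5}$$

When $c_i$ equals $C_{j_1 j_2}$ for $j_1 \neq j_2$, the cell $i$ is on the edge between vertex $j_1$ and vertex $j_2$. When $c_i$ equals $C_{jj}$, the cell $i$ is at the vertex $j$. Then we assume the following mixture prior on $\tilde{w}_i$:

$$w_i \sim \mathrm{Uniform}(0, 1), \qquad c_i \sim \mathrm{Multinomial}(1, \boldsymbol{\pi}), \qquad \tilde{w}_i = w_i a_{c_i} + (1 - w_i) b_{c_i}, \tag{6}$$

where $\boldsymbol{\pi} = (\pi_1, \cdots, \pi_K)$ and $(a_c, b_c) = (e_{j_1}, e_{j_2})$ if $c = C_{j_1 j_2}$ for $j_1 \leq j_2$.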
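
The hierarchical prior can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the categories are numbered along the upper triangle of the assignment matrix row by row, the relative position on an edge is drawn uniformly on $(0, 1)$, vertices are 0-indexed in the code, and the function names are hypothetical.

```python
import random

def assignment_matrix(k):
    """Symmetric k x k matrix with entries 1..K, K = k(k+1)/2 (row-major upper triangle)."""
    C = [[0] * k for _ in range(k)]
    c = 0
    for j1 in range(k):
        for j2 in range(j1, k):
            c += 1
            C[j1][j2] = C[j2][j1] = c
    return C

def sample_position(C, pi, rng):
    """Draw (c, w, w_tilde): c ~ Categorical(pi), w ~ Uniform(0, 1)."""
    k = len(C)
    c = rng.choices(range(1, len(pi) + 1), weights=pi)[0]
    # Locate the pair j1 <= j2 whose category index is c.
    j1, j2 = next((a, b) for a in range(k) for b in range(a, k) if C[a][b] == c)
    w = rng.random()
    w_tilde = [0.0] * k
    w_tilde[j1] += w          # contribution of a_c = e_{j1}
    w_tilde[j2] += 1.0 - w    # contribution of b_c = e_{j2}; a vertex (j1 == j2) sums to 1
    return c, w, w_tilde

rng = random.Random(0)
C = assignment_matrix(3)
pi = [1.0 / 6] * 6            # uniform prior over 3 vertices and 3 edges
draws = [sample_position(C, pi, rng) for _ in range(1000)]
```

Every sampled position lies on the simplex and has at most two nonzero entries, which is exactly the constraint that a cell sits on a vertex or on an edge.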

To summarize, combining models (3) and (6), we obtain a hierarchical mixture model for scRNA-seq with an underlying trajectory structure. The unknown parameters are the mean positions $\boldsymbol{U}$ of the $k$ cell types in the latent space, the prior probabilities $\boldsymbol{\pi}$ of the $K$ categories, the mapping functions $f_g(\cdot)$, $g = 1, 2, \cdots, G$, and the dispersion parameters $\theta_g$ (and also $h_g(\cdot)$ for non-UMI counts).

### 3.2 Variational autoencoder for approximating the posterior distributions

Our hierarchical mixture model is well suited for complex scRNA-seq data as it connects the Gaussian latent variable $Z_i$ with the observed sparse counts $Y_i$ through nonlinear mapping functions $f_g(\cdot)$. To allow a wide class of mapping functions, we model $f_g(\cdot)$ by a neural network and further combine our model with the variational autoencoder (VAE) (Kingma & Welling, 2014) for approximating the posterior distributions. Following the variational Bayes approach in VAE and the conditional VAE (Sohn et al., 2015), we use Gaussian distributions to approximate the intractable posterior distribution of $Z_i$ given the observed data $Y_i$ and confounding variables $X_i$:

$$q(Z_i \mid Y_i, X_i) = \mathcal{N}_d\big(\mu(Y_i, X_i), \Sigma(Y_i, X_i)\big). \tag{7}$$

Here the “posterior” mean $\mu(Y_i, X_i)$ and covariance $\Sigma(Y_i, X_i)$ are functions of $(Y_i, X_i)$. To guarantee flexibility, they are also modeled by a neural network. As shown in Figure 1b, the neural network for $\mu(\cdot)$ and $\Sigma(\cdot)$ is the encoder, and the network for $f_g(Z_i, X_i)$ is the decoder. For non-UMI data, as in Eraslan et al. (2019), the functions $h_g(\cdot)$ for the zero-inflation parameters are modeled by a neural network in the decoder and share the same hidden layers as $f_g(\cdot)$.

With all the above setup, we can lower bound the log-likelihood of each observation $Y_i$ as

$$\log p(Y_i \mid X_i) \geq \mathbb{E}_{q(Z_i \mid Y_i, X_i)}\big[\log p(Y_i \mid Z_i, X_i)\big] - D_{\mathrm{KL}}\big(q(Z_i \mid Y_i, X_i) \,\|\, p(Z_i)\big), \tag{8}$$
where $p(Y_i \mid Z_i, X_i)$ is the conditional distribution in model (3) or (4), and $q(Z_i \mid Y_i, X_i)$ is the Gaussian approximation of the posterior distribution in (7). This lower bound is often referred to as the evidence lower bound (ELBO) (Kingma & Welling, 2014), where the first term is the reconstruction likelihood and the second term $-D_{\mathrm{KL}}(q(Z_i \mid Y_i, X_i) \,\|\, p(Z_i))$ behaves as a regularizer. The difference between (8) and the regular conditional VAE is that we have a mixture model for the trajectory structure encoded in the term $p(Z_i)$, which encourages the posteriors of $Z_i$ to lie along linear edges and vertices. The resulting loss function for one cell is defined as the negative of a modified version of the ELBO in (8),

$$\mathcal{L}_i(\Theta) = -\mathbb{E}_{q(Z_i \mid Y_i, X_i)}\big[\log p(Y_i \mid Z_i, X_i)\big] + \beta\, D_{\mathrm{KL}}\big(q(Z_i \mid Y_i, X_i) \,\|\, p(Z_i)\big), \tag{9}$$

where $\Theta$ is the set of trainable parameters in our model and $\beta$ is a tuning parameter to balance the reconstruction error and regularization, which is introduced in Higgins et al. (2016) as the $\beta$-VAE. To better adjust for confounding covariates $X_i$, we add another penalty, as introduced in Huang et al. (2020), to obtain our final loss function aggregated over all $N$ cells:

$$\mathcal{L}(\Theta) = \sum_{i=1}^{N} \Big[\mathcal{L}_i(\Theta) - \alpha \log p(Y_i \mid Z_i = 0_d, X_i)\Big], \tag{10}$$

where $\alpha$ is another tuning parameter. Setting $\alpha > 0$ encourages our decoder to reconstruct $Y_i$ using only information from $X_i$, which can help to decorrelate $X_i$ from $Z_i$ so that one can remove the confounding effects of $X_i$ in $Z_i$ more thoroughly.
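
The sign conventions of the loss are easy to get wrong, so a minimal Monte Carlo sketch for one cell may help. It assumes that the three terms are the negative reconstruction log-likelihood, the $\beta$-weighted KL divergence, and an $\alpha$-weighted reconstruction term evaluated with the latent input set to zero; the function name and the toy numbers are hypothetical, and the real log-densities would come from the encoder, decoder, and mixture prior.

```python
def monte_carlo_loss(log_lik, log_q, log_p, log_lik_zero, beta=2.0, alpha=0.1):
    """Single-cell loss estimate from M draws z_1..z_M ~ q(Z | Y, X).

    log_lik[m]   = log p(Y | z_m, X)   (reconstruction term)
    log_q[m]     = log q(z_m | Y, X)   (approximate posterior density)
    log_p[m]     = log p(z_m)          (mixture prior density)
    log_lik_zero = log p(Y | Z = 0, X) (covariate-only reconstruction, assumed form)
    """
    M = len(log_lik)
    recon = sum(log_lik) / M
    # KL(q || p) = E_q[log q - log p], estimated with the same draws.
    kl = sum(q - p for q, p in zip(log_q, log_p)) / M
    return -recon + beta * kl - alpha * log_lik_zero

# Toy numbers: two posterior draws for one cell.
loss = monte_carlo_loss(log_lik=[-9.0, -11.0], log_q=[-1.0, -2.0],
                        log_p=[-1.5, -2.5], log_lik_zero=-12.0)
```

With the defaults $\beta = 2$ and $\alpha = 0.1$ this evaluates to $10 + 2 \times 0.5 + 0.1 \times 12 = 12.2$; setting `beta=0` recovers a pure reconstruction objective.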

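All three terms in (10) can be approximated efficiently by the Monte Carlo method, as we show in Appendix B. Specifically, though the marginal density function $p(Z_i)$ involves the complex hierarchical mixture model (6), it still has a closed-form representation. Define $\alpha_{zc} = \boldsymbol{U}(b_c - a_c)$ and $\beta_{zc} = \boldsymbol{z} - \boldsymbol{U} b_c$ for any $c = 1, 2, \cdots, K$, where $a_c$ and $b_c$ are constants defined in (6). Then the marginal density of $Z_i$ has the form

$$p(\boldsymbol{z}) = \sum_{c=1}^{K} \pi_c\, h(\boldsymbol{z} \mid c)$$

with

$$h(\boldsymbol{z} \mid c) = \begin{cases} \varphi_d(\beta_{zc}) & \text{if } \alpha_{zc} = 0, \\[4pt] \dfrac{\sqrt{2\pi}}{\|\alpha_{zc}\|_2}\, \varphi_d\!\left(\beta_{zc} - \dfrac{\alpha_{zc}^\top \beta_{zc}}{\|\alpha_{zc}\|_2^2}\, \alpha_{zc}\right) \left[\Phi\!\left(\dfrac{\alpha_{zc}^\top \beta_{zc} + \|\alpha_{zc}\|_2^2}{\|\alpha_{zc}\|_2}\right) - \Phi\!\left(\dfrac{\alpha_{zc}^\top \beta_{zc}}{\|\alpha_{zc}\|_2}\right)\right] & \text{otherwise,} \end{cases}$$

where $\varphi_d(\cdot)$ and $\Phi(\cdot)$ are the probability density function of the $d$-dimensional standard Gaussian and the cumulative distribution function of the univariate standard Gaussian, respectively. With these derivations, the optimization can be done efficiently via stochastic backpropagation (Rezende et al., 2014) on mini-batches of data, with the commonly used amortized variational inference (Stuhlmüller et al., 2013) for VAE.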
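
As a sanity check on the closed form, the sketch below evaluates the density contributed by a single category and compares it with a brute-force numerical integral. It assumes an identity covariance for $Z_i$ given the cell position and a uniform prior on the relative position along an edge; under those assumptions the edge component is the integral of $\varphi_d(\beta_{zc} + w\,\alpha_{zc})$ over $w \in (0, 1)$, which follows by completing the square in $w$. The function names are hypothetical.

```python
import math

def phi_d(v):
    """Density of the d-dimensional standard Gaussian at v."""
    d = len(v)
    return math.exp(-0.5 * sum(x * x for x in v)) / (2.0 * math.pi) ** (d / 2.0)

def Phi(x):
    """Univariate standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def h(z, ua, ub):
    """Density of z for one category, with ua = U a_c and ub = U b_c in latent space."""
    alpha = [b - a for a, b in zip(ua, ub)]      # U (b_c - a_c)
    beta = [x - b for x, b in zip(z, ub)]        # z - U b_c
    norm2 = sum(a * a for a in alpha)
    if norm2 == 0.0:                             # vertex category: a_c == b_c
        return phi_d(beta)
    norm = math.sqrt(norm2)
    ab = sum(a * b for a, b in zip(alpha, beta))
    resid = [b - ab / norm2 * a for a, b in zip(alpha, beta)]
    return (math.sqrt(2.0 * math.pi) / norm * phi_d(resid)
            * (Phi((ab + norm2) / norm) - Phi(ab / norm)))

def h_numeric(z, ua, ub, n=20000):
    """Midpoint-rule integral of phi_d(z - U w_tilde) over w in (0, 1)."""
    total = 0.0
    for i in range(n):
        w = (i + 0.5) / n
        mean = [w * a + (1.0 - w) * b for a, b in zip(ua, ub)]
        total += phi_d([x - m for x, m in zip(z, mean)])
    return total / n

z, ua, ub = [0.3, -0.2], [0.0, 0.0], [2.0, 1.0]
closed, numeric = h(z, ua, ub), h_numeric(z, ua, ub)
```

The marginal density is then the $\boldsymbol{\pi}$-weighted sum of `h` over all $K$ categories, so each evaluation costs $O(Kd)$ and needs no sampling of the cell position.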

### 3.3 Model estimation with practical considerations

For the gene expressions used as input, by default, we select the $G = 2000$ most highly variable genes following the procedure in Seurat (Stuart et al., 2019), one of the most widely used software packages for scRNA-seq analysis. We also preprocess the gene expressions following Seurat (see Appendix A) and use the normalized, log-transformed, and scaled gene expressions as the inputs of our neural network. By default, we set $\beta = 2$ and $\alpha = 0.1$ (when there are confounding covariates), which we find to give the best results in practice. We use $\beta > 1$ to strengthen the regularization on $Z_i$, especially when its dimension is much smaller than the dimension of $Y_i$.

As the stochastic gradient descent algorithm optimizing our loss (10) generally results in a locally optimal solution, we need a good initialization of our parameters, especially for $\boldsymbol{U}$ and $\boldsymbol{\pi}$ defined in (3) and (6). Also, our framework requires a pre-determined number of states $k$. Inspired by other existing TI methods, we design a three-step algorithm for model initialization and estimation.

The first step is pretraining, where we train the model with $\beta = 0$ to minimize only the reconstruction loss, which does not involve the unknown parameters $\boldsymbol{\pi}$ and $\boldsymbol{U}$. The purpose of this step is to obtain better weights for the encoder and the decoder and to get an initial low-dimensional representation, which can be used to initialize $\boldsymbol{U}$ and determine $k$ in the next step.

The second step is to initialize the latent space. We perform cell clustering with the Louvain algorithm (Blondel et al., 2008) on the estimated posterior means of $Z_i$ after pretraining. With a default resolution of 0.8 as in Seurat, we let Louvain automatically determine the number of clusters, which we set as the value of $k$, the number of states in our framework. As in Seurat and other clustering algorithms, we suggest that users try different resolutions (typically between 0.4 and 1.2) in practice to obtain satisfactory clustering results. We initialize $\boldsymbol{U}$ with the cluster centers. As $\boldsymbol{\pi}$ involves both the $k$ vertices and the $k(k - 1)/2$ edges, we have no information yet and simply initialize $\boldsymbol{\pi}$ uniformly in this step.

The last step is to train our whole network, optimizing the loss function (10). To obtain a better initialization of $\boldsymbol{\pi}$, we warm up our optimization algorithm for the first 5 epochs, where we only train the parameters $\boldsymbol{U}$ and $\boldsymbol{\pi}$ in the latent space while keeping the parameters of the encoder and the decoder frozen. We find the warmup step a useful trick for complicated large datasets with many cell types, such as our real data example in Section 6. During both the pretraining and training steps, the optimization is stopped early when the evaluation loss decreases sufficiently slowly.

Finally, to make VITAE scale to large datasets at a computational cost comparable to other TI methods, we accelerate VITAE by reducing the dimension of the input. We replace the expression of $G$ genes with its top $R$ ($R = 64$ by default) principal component (PC) scores $F_i$. The output of the decoder is also replaced by the reconstruction of $F_i$, and the likelihood $p(Y_i \mid Z_i, X_i)$ in the loss function (10) is replaced by Gaussian densities assumed on $F_i$ (Appendix B). As shown in Section 5, the accelerated VITAE with Gaussian likelihood generally provides results comparable to our original Negative Binomial likelihood-based VITAE, while it can be much faster than the original scheme. By default, both the encoder and decoder have one hidden layer, with 32 and 16 units for the accelerated and likelihood-based VITAE respectively. The bottleneck layer has a dimension of 8. The hidden layers are all fully connected with the leaky rectified linear activation function (Maas et al., 2013; Huang et al., 2020) and Batch Normalization (Ioffe & Szegedy, 2015).

## 4 Trajectory Inference from Posterior Approximations

After the training step, the model returns the estimated parameters, including $\hat{\boldsymbol{U}}$ and $\hat{\boldsymbol{\pi}}$, the encoder $\hat{q}(Z_i \mid Y_i, X_i)$, and the decoder $\hat{f}_g(\cdot)$. Replacing the true posterior density $p(Z_i \mid Y_i, X_i)$ with the approximate posterior density $\hat{q}(Z_i \mid Y_i, X_i)$, we can also obtain an approximation of the posterior distribution of $(w_i, c_i)$ for each cell. Specifically, let $Z_i^{(1)}, \cdots, Z_i^{(L)}$ be $L$ ($L = 300$ by default) random samples from $\hat{q}(Z_i \mid Y_i, X_i)$. Using the fact that $(w_i, c_i) \perp\!\!\!\perp (Y_i, X_i) \mid Z_i$, we can approximate the posterior density of $(w_i, c_i)$ as

$$\hat{p}(w_i, c_i \mid Y_i, X_i) = \frac{1}{L} \sum_{m=1}^{L} \hat{p}\big(w_i, c_i \mid Z_i^{(m)}\big),$$

where $\hat{p}(w_i, c_i \mid Z_i)$ is obtained by plugging $\hat{\boldsymbol{U}}$ and $\hat{\boldsymbol{\pi}}$ into the true posterior density of $(w_i, c_i)$ given $Z_i$. Notice that the approximate posterior distribution only depends on posterior samples of the latent variables $Z_i$. Similarly, we can get the approximate posterior distribution of $c_i$, the edge or vertex the cell belongs to, as

$$\hat{p}(c_i \mid Y_i, X_i) = \frac{1}{L} \sum_{m=1}^{L} \hat{p}\big(c_i \mid Z_i^{(m)}\big).$$

For the cell position $\tilde{w}_i$ on the graph $\mathcal{G}$, as it is a function of $w_i$ and $c_i$ as defined in (6), we can also efficiently obtain the mean and the diagonal elements of the covariance matrix of its posterior distribution $\hat{p}(\tilde{w}_i \mid Y_i, X_i)$. For details on calculating the approximate posterior distributions, see Appendix C. Now we discuss how to infer the trajectory backbone and cell positions along the trajectory with these posterior approximations.

### 4.1 Infer the trajectory backbone

The total number of categories $K = O(k^2)$ can be large, even for a moderate $k$. While the trajectory backbone $\mathcal{B}$ typically involves only a few edges, our estimated $\hat{\boldsymbol{\pi}}$ is often a dense vector. Inspired by Frühwirth-Schnatter & Malsiner-Walli (2019) on Bayesian Gaussian mixture models, to encourage sparsity, we infer the nonzero edges a posteriori from the data. Specifically, we define a score for each edge, quantifying the strength of the evidence that the edge exists:
where the denominator is added to make sure that we can capture the continuous transitions between cell states even when these states only involve a small proportion of cells in the cell population. Even though $\hat{\boldsymbol{\pi}}$ is dense, the edge score can be significantly nonzero for far fewer edges. From another point of view, an edge score is a “test statistic” to evaluate whether the edge exists. To make our algorithm scalable, in practice, instead of obtaining the posterior distribution of each edge score to determine whether it is significantly nonzero, we simply use a deterministic version of the edge score as
where
and is the approximate posterior mean of . Other edge scores choices are discussed in Appendix D.

A larger edge score indicates higher confidence that the edge exists in the trajectory backbone $\mathcal{B}$. In practice, we include an edge $(j_1, j_2)$ in the estimated backbone $\hat{\mathcal{B}}$ if its score exceeds the cutoff $s_0$, which is 0.01 by default. When we are certain that there are no loops in the trajectory, inspired by Trapnell et al. (2014), we further prune $\hat{\mathcal{B}}$ as the MST of the unpruned graph with the edge scores as edge weights. This typically results in a cleaner shape of our estimated trajectory.
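
The thresholding and pruning steps can be sketched as follows. The sketch takes the edge scores as given and makes two assumptions that the text does not state explicitly: an edge is kept when its score is strictly greater than the cutoff, and the spanning tree is the maximum-weight one, so that high-score edges survive the pruning. The function name and the toy scores are hypothetical.

```python
def infer_backbone(scores, s0=0.01, no_loops=False):
    """scores: dict {(j1, j2): edge score}. Returns the sorted list of kept edges."""
    kept = {e: s for e, s in scores.items() if s > s0}
    if not no_loops:
        return sorted(kept)

    # Kruskal's algorithm on descending scores: maximum-weight spanning forest.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for (a, b), _ in sorted(kept.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:                        # adding the edge creates no loop
            parent[ra] = rb
            tree.append((a, b))
    return sorted(tree)

scores = {(0, 1): 0.40, (1, 2): 0.30, (0, 2): 0.05, (2, 3): 0.20, (0, 3): 0.001}
graph = infer_backbone(scores)                  # thresholding only
tree = infer_backbone(scores, no_loops=True)    # thresholding + pruning
```

Here the weak edge `(0, 3)` is removed by the cutoff, and the loop `0-1-2` is broken by dropping its lowest-scoring edge `(0, 2)`. If the thresholded graph is disconnected, the procedure returns a spanning forest, which keeps disconnected states separate.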

### 4.2 Project cells onto the inferred trajectory backbone

To obtain the position of each cell on the inferred trajectory, we further project $\tilde{w}_i$ onto $\hat{\mathcal{B}}$. Given the approximate posterior distribution of $\tilde{w}_i$, we find a point estimate for the best position of each cell $i$ on $\hat{\mathcal{B}}$. Specifically, for each cell, we aim to solve the following optimization problem:

$$\hat{\tilde{w}}_i = \arg\min_{w \in \hat{\mathcal{B}}} \mathbb{E}_{\hat{p}(\tilde{w}_i \mid Y_i, X_i)} \big\|\tilde{w}_i - w\big\|_2^2. \tag{11}$$

Since the support of $\hat{\tilde{w}}_i$ is restricted to the inferred trajectory backbone $\hat{\mathcal{B}}$, only one or two entries of $\hat{\tilde{w}}_i$ can be nonzero, depending on whether the cell is at a vertex or on an edge of $\hat{\mathcal{B}}$. Though the $L_2$ loss in the objective function of optimization problem (11) is not the only choice, it results in a closed-form solution that allows fast computation.

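**Proposition 1.** *The optimization problem (11) is equivalent to finding*

$$\arg\min_{w \in \hat{\mathcal{B}}} \|w - \mu\|_2^2, \tag{12}$$

*where* $\mu$ *is the mean of* $\hat{p}(\tilde{w}_i \mid Y_i, X_i)$. *Denote the $j$th component of* $\mu$ *as* $\mu_j$ *and let* , *then the best projection is given by*
*which reduces to be a vertex if* , *and the corresponding solution* $\hat{\tilde{w}}_i$ *has entries*
*for* $j = 1, \dots, k$.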

The proof of Proposition 1 is included in Appendix E. Intuitively, finding the optimal projection of $\tilde{w}_i$ that minimizes the $L_2$ loss is equivalent to simply projecting the posterior mean $\mu$. In addition, as shown in our proof of Proposition 1, most cells will project either onto an edge or onto an isolated vertex. Notice that the solution of the optimization problem (12) may not be unique. However, it is generally unique in practice due to floating-point computation.
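
The projection of a posterior mean onto the backbone can be sketched with elementary geometry. This is not the closed form stated in Proposition 1 but a brute-force equivalent under the same $L_2$ criterion: on an edge $(j_1, j_2)$ the candidate is $t\, e_{j_1} + (1 - t)\, e_{j_2}$, the squared distance is quadratic in $t$ with unconstrained minimizer $t = (\mu_{j_1} - \mu_{j_2} + 1)/2$, and clipping $t$ to $[0, 1]$ is optimal because the objective is convex. The function name is hypothetical and vertices are 0-indexed.

```python
def project_onto_backbone(mu, edges, isolated=()):
    """Closest point (in L2) to mu among all points on the edges and isolated vertices."""
    k = len(mu)
    candidates = []
    for (j1, j2) in edges:
        # Minimizer of (t - mu[j1])^2 + (1 - t - mu[j2])^2, clipped to the segment.
        t = min(1.0, max(0.0, (mu[j1] - mu[j2] + 1.0) / 2.0))
        w = [0.0] * k
        w[j1], w[j2] = t, 1.0 - t
        candidates.append(w)
    for j in isolated:
        w = [0.0] * k
        w[j] = 1.0
        candidates.append(w)
    best, best_dist = None, float("inf")
    for w in candidates:
        dist = sum((a - b) ** 2 for a, b in zip(w, mu))
        if dist < best_dist:
            best, best_dist = w, dist
    return best, best_dist

# Posterior mean mostly on vertex 1, leaning towards vertex 2; backbone 0 - 1 - 2.
best, dist = project_onto_backbone([0.1, 0.6, 0.3], edges=[(0, 1), (1, 2)])
```

The cost is linear in the number of backbone edges per cell, so the projection remains cheap even when the backbone is not pruned to a tree.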

Next, we also want to quantify the uncertainty of the projected position $\hat{\tilde{w}}_i$ as an estimate of $\tilde{w}_i$. A general metric to evaluate the uncertainty is given by

$$\mathbb{E}_{\hat{p}(\tilde{w}_i \mid Y_i, X_i)}\big[d(\tilde{w}_i, \hat{\tilde{w}}_i)\big],$$

where $d(\cdot, \cdot)$ is a metric or distance function. There can be many choices of $d(\cdot, \cdot)$. When $d(\cdot, \cdot)$ is also the $L_2$ loss, we obtain the projection mean square error (MSE) as

$$\mathbb{E}_{\hat{p}(\tilde{w}_i \mid Y_i, X_i)} \big\|\tilde{w}_i - \hat{\tilde{w}}_i\big\|_2^2,$$

which is easily computable (Appendix C). Notice that our projection MSE ignores the uncertainty in $\hat{\mathcal{B}}$ and the approximation error in $\hat{p}(\tilde{w}_i \mid Y_i, X_i)$, so it is an underestimate of the true uncertainty. However, we think that the relative magnitude of our projection MSE is still a useful quantity to compare the projection accuracy across cells. Another pattern, as we will observe later in Section 6, is that the projection MSE is typically smaller when the cells are near vertices. We provide our understanding and a detailed discussion of this pattern in Appendix F.
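
The reason the projection MSE needs only the posterior mean and the diagonal of the posterior covariance can be seen from a standard bias-variance decomposition. The following is a sketch of that argument rather than the derivation in Appendix C; $\mu$ denotes the posterior mean of $\tilde{w}_i$ and $w$ is any fixed point on the backbone:

$$\mathbb{E}\,\|\tilde{w}_i - w\|_2^2 = \mathbb{E}\,\|\tilde{w}_i - \mu\|_2^2 + 2\,\mathbb{E}\big[(\tilde{w}_i - \mu)^\top\big](\mu - w) + \|\mu - w\|_2^2 = \sum_{j=1}^{k} \mathrm{Var}(\tilde{w}_{ij}) + \|\mu - w\|_2^2 .$$

The cross term vanishes because $\mathbb{E}[\tilde{w}_i - \mu] = 0$. The first term does not depend on $w$, so minimizing the expected $L_2$ loss over the backbone is the same as projecting $\mu$, and evaluating the right-hand side at the projected position gives the projection MSE as the squared distance from $\mu$ to the projection plus the trace of the posterior covariance.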

Finally, we obtain the point estimate of each cell’s pseudotime $T_i$ defined in (2). Given the inferred trajectory $\hat{\mathcal{B}}$, the user can assign a root vertex based on prior biological knowledge. We also provide an automatic root selection step following Tempora (Tran & Bader, 2020) when cells are collected from a series of time points. The idea is to choose the vertex with the earliest collection time as the root. Specifically, let $r_i$ be the collection time of cell $i$; we calculate the “collection time” $\gamma_j$ of vertex $j$ as a weighted average of the collection times of the cells near the vertex. A vertex is chosen as the root if it has the smallest collection time $\gamma_j$ and is not an isolated vertex.

Once the root vertex of the trajectory backbone is obtained, it is straightforward to obtain $\hat{\boldsymbol{o}}$, the estimated pseudotime of the vertices, by plugging $\hat{\mathcal{B}}$ into the definition of $\boldsymbol{o}$ in (1). Then a point estimate of the pseudotime $T_i$ of cell $i$ can be given as

$$\hat{T}_i = \hat{\boldsymbol{o}}^\top \hat{\tilde{w}}_i.$$

### 4.3 Differential gene expression along the trajectory

We also provide a simple linear regression approach to find differentially expressed genes along our inferred trajectory backbone. Compared with existing approaches (Van Buren et al., 2020) to find differentially expressed genes in trajectory inference, we focus on finding genes that are positively or negatively associated with the pseudotime ordering after adjusting for confounding covariates and provide a more scalable way to obtain the *p*-values of the genes.

Specifically, to find genes that are differentially expressed along the pseudotime ordering for a subset of cells $\mathcal{S}$, we work with the linear regression for each gene $g$:

$$Y_{ig} = \beta_{0g} + \beta_{1g} T_i + X_i^\top \beta_{2g} + e_{ig}, \qquad i \in \mathcal{S},$$

where $Y_{ig}$ is the log-transformed and normalized count for cell $i$ and gene $g$, and the linear term in $X_i$ adjusts for the confounding effects of known covariates. We allow the $e_{ig}$’s to have unequal variances as the $Y_{ig}$’s are log-transformed and normalized counts. Intuitively, this linear regression estimates the correlation between gene $g$’s expression and the pseudotime ordering after correcting for confounding effects, which is defined as

For each gene, we aim to test the null $H_{0g}: \rho_g = 0$, or equivalently $H_{0g}: \beta_{1g} = 0$.

The challenge here is that the true pseudotime *T*_{i} is not observed. Instead, we only have the estimated pseudotime from the data, which has unknown uncertainty and is correlated with *Y*_{ig}. As a consequence, obtaining a valid *p*-value for *H*_{0g} is impossible at this stage. For simplicity, we plug in the estimated pseudotime and estimate *β*_{1g} via ordinary least squares. Treating the estimated pseudotime as the true *T*_{i}, we obtain the variance of the estimated *β*_{1g} through the sandwich estimator, which allows for heterogeneity in *e*_{ig}. The *t*-statistic *t*_{g} is then defined as the ratio of the estimated *β*_{1g} to its estimated standard error. To make a simple correction for the post-estimation bias caused by replacing *T*_{i} with its estimate, we take the *p*-value calibration approach of Wang et al. (2017). Specifically, instead of assuming that *t*_{g} follows a standard normal distribution under the null, we assume that it follows a normal distribution with variance *σ*^{2}, where *σ*^{2} is estimated from the median absolute deviation (MAD) of {*t*_{g}, *g* = 1, 2, · · · , *G*}. To select differentially expressed genes, the calibrated *p*-values are further adjusted with the Benjamini-Hochberg procedure for multiple testing correction.
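The testing pipeline above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the VITAE implementation: it omits the covariate adjustment (the regression has only an intercept and the pseudotime), uses the HC0 form of the sandwich estimator, centers the MAD at the median, assumes a zero-mean null, and converts the MAD to a standard deviation with the usual normal-consistency constant 1.4826. All function names are hypothetical.

```python
import math
from statistics import median


def slope_t_statistic(y, t):
    """OLS slope of y on t (with intercept) divided by its HC0 sandwich s.e."""
    n = len(y)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    beta1 = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sxx
    beta0 = y_bar - beta1 * t_bar
    resid = [yi - beta0 - beta1 * ti for ti, yi in zip(t, y)]
    # Heteroscedasticity-robust variance of the slope estimate.
    var_beta1 = sum(((ti - t_bar) * e) ** 2 for ti, e in zip(t, resid)) / sxx ** 2
    return beta1 / math.sqrt(var_beta1)


def calibrated_p_values(t_stats):
    """Two-sided p-values under N(0, sigma^2), with sigma taken from the MAD."""
    med = median(t_stats)
    sigma = 1.4826 * median(abs(x - med) for x in t_stats)
    return [math.erfc(abs(x) / (sigma * math.sqrt(2))) for x in t_stats]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Because the calibration divides each *t*-statistic by a scale estimated from all genes, multiplying every *t*-statistic by the same constant leaves the calibrated *p*-values unchanged.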

Unlike our approach here, tradeSeq (Van Buren et al., 2020) fits a generalized linear regression on the raw scRNA-seq counts assuming a Negative Binomial distribution of the data. In order to find more complicated gene expression patterns along the estimated pseudotime, they regress *Y*_{ig} on smoothing splines of the estimated pseudotime. We only focus on finding genes that are positively or negatively associated with *T*_{i}. With a linear model, we can find differentially expressed genes more efficiently. Without parallelization, finding the differentially expressed genes in Section 6 takes less than 4 minutes on 5000 cells and 15k genes, while tradeSeq can take more than 20 minutes as evaluated in Van Buren et al. (2020).

## 5 Validation and Benchmarking with Real and Synthetic Datasets

In this section, we evaluate the performance of VITAE on diverse datasets. These datasets contain both real and synthetic data, with UMI and non-UMI counts and various trajectory topologies. We also use 5 different metrics to measure our accuracy in trajectory inference. Our validations mostly follow the third-party benchmarking paper Saelens et al. (2019) so that we can have a fair evaluation of VITAE.

### 5.1 Datasets

Following the settings in Saelens et al. (2019), which provided a comprehensive overview and guideline for TI methods, we evaluate VITAE’s performance in recovering six types of trajectory topologies, as shown in Figure 2. Our benchmarking datasets include 10 real scRNA-seq datasets and 15 synthetic datasets, summarized in Table 1.

The real scRNA-seq datasets include 9 datasets from Saelens et al. (2019). Among them, 5 datasets have “gold standard” labels according to Saelens et al. (2019) and are included to cover all trajectory types with at least 200 cells. As most “gold standard” datasets are small, we also include 4 extra datasets (*dentate*, *fibroblast*, *planaria_muscle*, *planaria_full*) with “silver standard” labels. The datasets and labels are all extracted from the Dyno platform^{1} (Saelens et al., 2019), except for the dataset *dentate*, whose cells are mislabeled; for this dataset, we extract the labels directly from the GEO database (accession number: GSE95315). For each dataset, the Dyno platform also provides its reference trajectory backbone and cell positions, from which we calculate the reference pseudotime by our definition. In addition, to evaluate TI methods on a dataset with disconnected states, we create the dataset *immune* by combining 10085 purified B cells, 8385 CD56 NK cells, and 2612 CD14 monocytes from Zheng et al. (2017).

The drawback of using real datasets for benchmarking is that only discrete cell labels are available, even though the cells are experiencing continuous transitions. In other words, the true cell positions and ordering along the trajectory are only known at a low resolution. To better evaluate VITAE’s performance in estimating the cell positions and pseudotime, we also include synthetic datasets. For these, we consider two different simulation approaches. One simulator we use is dyngen (Cannoodt et al., 2020), a multi-modal simulation engine for studying dynamic cellular processes at single-cell resolution. dyngen is also used in Saelens et al. (2019) and provides a delicate way to generate scRNA-seq data starting from gene regulation and transcription factors. However, it is limited to generating only a few hundred genes. Thus, we only generate 1000 genes for each dyngen dataset and treat the dyngen datasets as non-UMI data, as the generated counts are typically large.

We generate four synthetic datasets from our own model framework. First, we train the model on a real dataset to obtain an estimated decoder and an estimate of ***U***. Then, we generate the latent variables of each cell, including *Z*_{i}, and the observed UMI counts *Y*_{i} following the hierarchical model (3) and (6). Specifically, we treat the estimated decoder and the estimate of ***U*** as the true *f*_{g}(·) and ***U***, and design a trajectory backbone by connecting some edges between the vertices. In the four generated datasets, we do not include any confounding covariates *X*_{i}. We use the real dataset *dentate* to generate the synthetic datasets *linear* and *bifurcation*, and the real dataset *fibroblast* to generate the synthetic datasets *multifurcating* and *tree*.

### 5.2 Evaluation metrics

We use five different scores to measure TI methods’ accuracy in recovering the true trajectory, cell positions, and pseudotime. All these scores range from 0 to 1, and a larger value indicates better performance.

First, to measure the difference between an estimated trajectory backbone and the true trajectory backbone, we compute two scores: the GED score and the IM score, both of which evaluate the difference between two graphs and are invariant to the permutation of vertices. The graph edit distance (GED; Abu-Aisheh et al., 2015) is defined as the minimum total cost, summed over the operations *e*_{i} of an edit path, among all edit paths transforming the estimated backbone into the reference backbone, where *c*(*e*_{i}) ≥ 0 is the cost of an operation *e*_{i}. In other words, GED is a symmetric distance measure that quantifies the minimum number of graph edit operations needed to transform one backbone into a graph isomorphic to the other. We then standardize the GED into a GED score that ranges between 0 and 1. Besides the GED score, following Saelens et al. (2019), we also compute the IM score based on the Ipsen-Mikhailov (IM) distance (Jurman et al., 2015) between the estimated and reference trajectory backbones. The IM distance is symmetric, measures the dissimilarity between the spectra of the two graphs’ adjacency matrices, and is bounded between 0 and 1. Our IM score is then defined from the IM distance so that, like the GED score, a larger value indicates a better match.
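For small backbones, the graph edit distance with unit costs can be computed by brute force, as in the sketch below. It assumes unlabeled, undirected graphs without self-loops and a cost of 1 for every vertex or edge insertion or deletion; under these assumptions it suffices to search over injective vertex mappings from the smaller graph into the larger one. The function name is hypothetical, and the standardization into a GED score is not reproduced here.

```python
from itertools import permutations


def graph_edit_distance(n1, edges1, n2, edges2):
    """Exact unit-cost GED between two small unlabeled undirected graphs.

    n1, n2         : numbers of vertices (vertices are 0, ..., n - 1)
    edges1, edges2 : lists of vertex pairs
    """
    if n1 > n2:  # the distance is symmetric, so put the smaller graph first
        n1, edges1, n2, edges2 = n2, edges2, n1, edges1
    target = {frozenset(e) for e in edges2}
    best = None
    # Try every injective mapping of the smaller graph's vertices.
    for perm in permutations(range(n2), n1):
        mapped = {frozenset((perm[a], perm[b])) for a, b in edges1}
        # Unmatched vertices are inserted; mismatched edges are edited.
        cost = (n2 - n1) + len(mapped ^ target)
        if best is None or cost < best:
            best = cost
    return best
```

The search is factorial in the number of vertices, which is consistent with the remark that exact GED becomes costly when *k* is large.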

Next, we measure the error of the estimated position of each cell *i*. We use two scores, the adjusted Rand index (ARI) and the generalized Rand index (GRI), both of which are invariant under the permutation of vertices and allow unequal numbers of vertices. The ARI (Hubert & Arabie, 1985) is a commonly used symmetric measure for the similarity between two clustering results. Following Saelens et al. (2019), each cell is assigned to its nearest discrete state based on its estimated and its reference position, and the two groupings are compared using ARI. Since such discretization only compares the accuracy of the estimated positions at a low resolution, we also define a GRI score that directly compares the estimated positions with the reference positions.

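Our GRI metric is an extension of the Rand index (RI). For any pair of cells *i*_{1} and *i*_{2}, we define their “true” similarity from the element-wise square roots of the two cells’ reference position vectors, where the square root of a vector ***v*** of length *k* is taken entry by entry. The square-root operation guarantees that the similarity attains its maximum if and only if the two cells have the same position. Similarly, we define the estimated similarity between the two cells from their estimated positions. Then, to compare the similarity between the estimated and reference cell positions on the trajectory backbone, the GRI is defined by aggregating the agreement between the true and estimated similarities over all pairs of cells.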

The rand index is a special case of GRI when all cells are positioned exactly on vertices.
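One way to make this construction concrete is sketched below. Both formulas are assumptions chosen to match the properties stated above, not a verbatim reproduction of the definition: the pairwise similarity is taken as the inner product of the element-wise square roots of two position vectors (each non-negative and summing to one), and the GRI as the average over pairs of `d * d_hat + (1 - d) * (1 - d_hat)`.

```python
import math
from itertools import combinations


def similarity(w1, w2):
    """Inner product of element-wise square roots; equals 1 iff w1 == w2."""
    return sum(math.sqrt(a * b) for a, b in zip(w1, w2))


def generalized_rand_index(w_ref, w_est):
    """Average pairwise agreement between reference and estimated similarities.

    w_ref[i], w_est[i] : position vectors of cell i (weights on vertices);
    the two sets of vectors may use different numbers of vertices.
    """
    pairs = list(combinations(range(len(w_ref)), 2))
    total = 0.0
    for i, j in pairs:
        d = similarity(w_ref[i], w_ref[j])
        d_hat = similarity(w_est[i], w_est[j])
        total += d * d_hat + (1 - d) * (1 - d_hat)
    return total / len(pairs)
```

With one-hot position vectors every similarity is 0 or 1, so this quantity counts the fraction of pairs on which the two groupings agree, which is the Rand index.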

Finally, we also measure the similarity between the reference and estimated pseudotime of cells along the trajectory using the Pearson correlation between the estimated pseudotime and the reference pseudotime *T*_{i} across cells. Specifically, we define the PDT score as a rescaling of this correlation that has a range between 0 and 1.
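A minimal sketch of such a score is given below. The affine rescaling (1 + *r*)/2 of the Pearson correlation *r* is an assumption, chosen because it is the simplest map from [−1, 1] to [0, 1]; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pdt_score(t_est, t_ref):
    """Assumed rescaling of the correlation to the unit interval."""
    return (1 + pearson(t_est, t_ref)) / 2
```

Under this rescaling, a perfectly reversed ordering scores 0 and a perfectly preserved linear ordering scores 1.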

### 5.3 Evaluation results

We evaluate the performance of two versions of VITAE. VITAE_NB is VITAE with the likelihood-based loss (NB for scRNA-seq with UMI and ZINB for scRNA-seq with non-UMI). The other is VITAE_Gauss, the accelerated version of VITAE introduced in Section 3.3, where the inputs are the PC scores and the Normal density replaces the likelihood in the loss function. For benchmarking, we compare with three state-of-the-art TI methods: PAGA (Wolf et al., 2019), Monocle 3 (Cao et al., 2019), and Slingshot (Street et al., 2018).

To give the other TI methods a comparable form of output, we run PAGA, Monocle 3, and Slingshot via the Dyno platform, which converts these TI methods’ outputs into an estimated trajectory backbone, estimated cell positions, and estimated pseudotime. For a fair comparison, the true number of states *k*, clustering labels, and root are provided as prior information to all TI methods. For Slingshot, which requires an extra two-dimensional projection of cells as input, the UMAP (McInnes et al., 2018) coordinates of cells computed following Seurat’s (Stuart et al., 2019) default steps are used.

The evaluation results are summarized in Figure 3 across all 25 datasets on 5 evaluation scores. For the two versions of VITAE, these scores are averages over 100 different random seeds. As computing the exact GED is NP-hard and costly when *k* is large, the GED score is not calculated for the *planaria_full* data with *k* = 33. In addition, the PDT score is not meaningful for the *cycle* and *disconnected* structures, so it is also not computed for these trajectory types.

As shown in Figure 3, both versions of VITAE generally have larger scores than the other three TI methods on all five metrics. Specifically, VITAE provides a more accurate recovery of the trajectory backbones, and thus an improved estimation of the pseudotime. In terms of the ARI and GRI, we see consistent improvement of VITAE in all synthetic data, though our improvements are subtle in some real datasets. A possible explanation is that the “true” cell positions and pseudotime in real datasets are discretized, assuming that cells are only placed on vertices, so they might not be sensitive enough to evaluate the accuracy of the estimated cell positions. In addition, the accelerated version of VITAE performs comparably to our original scheme, showing that VITAE_Gauss can well approximate the likelihood-based loss and makes VITAE scalable to large datasets.

## 6 An Application to the Developing Mouse Neocortex

Finally, we apply VITAE to analyze the developing mouse neocortex. The six-layered neocortex is evolutionarily unique in mammals and forms the physical center for the highest cognitive functions (Geschwind & Rakic, 2013). It has been shown that Neuroepithelial cells (NECs) transition into the radial glial cells (RGCs), while cortical projection neuron types are generated sequentially by RGCs and intermediate progenitor cells (IPCs) (Lui et al., 2011; Götz & Huttner, 2005; Greig et al., 2013). A few scRNA-seq datasets have been generated (Pollen et al., 2014; Yuzwa et al., 2017; Loo et al., 2019; Ruan et al., 2020+) to reveal and understand the developmental process and the dynamic gene expressions in the mouse neocortex. Here, we aim to perform a joint trajectory analysis using both cells from Yuzwa et al. (2017) and Ruan et al. (2020+).

As shown in Table 2, Dataset *A* (Ruan et al., 2020+) contains 10261 cortical cells collected from E10.5, E12.5, E14.5, E15.5, E16.5, and E18.5 mouse embryos and Dataset *B* (Yuzwa et al., 2017) includes 6390 cortically derived cells from E11.5, E13.5, E15.5, and E17.5 mouse embryos. Both datasets sample cells from the same region in the mouse cortex and a joint analysis of them will enable analyzing the dynamic process from a full spectrum of mice from E10.5 to E18.5. We keep the genes that are measured in both datasets (14707 genes), merge all 16651 cells, and preprocess these cells following our default procedure described in Appendix A. As expected, the raw observed gene counts from the two datasets do not merge together due to lab-specific batch effects and the non-overlapping of days in sample collection (Figure 4a).

In order to adjust for confounding batch effect, we include the dataset ID of each cell (whether it is from Dataset *A* or *B*) in *X*_{i}. In addition, as the NECs and RGCs are experiencing cell division, we also adjust for the cell-cycle effects by adding the cell-cycle scores, computed using Seurat, into *X*_{i}. To perform the joint trajectory analysis, we apply the accelerated VITAE (the VITAE_Gauss) with the adjustment of *X*_{i}. To determine *k*, clustering is performed after the pretraining step. We also compare with an alternative approach combining Seurat data integration (Stuart et al., 2019) with Slingshot. Specifically, Seurat is first used to integrate both datasets, regress out cell-cycle scores, and cluster the cells with integrated data. Then Slingshot is applied to the Seurat results to infer the trajectories and pseudotime.

As shown in Figure 4bc, both the Seurat integration and VITAE can successfully remove lab-specific batch effects. To understand whether they retain meaningful biological signals, we then color the cells by their reference major cell types given in the original papers (Figure 5ab). As Yuzwa et al. (2017) did not provide cell type labels, the reference labels of cells in Dataset *B* are obtained by a separate clustering of Dataset *B* using Seurat, and the clusters are annotated using the gene markers provided by Ruan et al. (2020+). Note that these reference labels are only used for evaluation and are not used during the trajectory inference of both approaches. Though both approaches can retain meaningful biological signals and merge only the cells in the same major cell type, VITAE can provide a cleaner integration, leading to the recovery of a more biologically meaningful trajectory shared in both datasets (Figure 5b). VITAE successfully identifies the *NEC-RGC-IPC-Neuron* major cell lineage and disconnects the microglia cells, pericytes, Pia, and interneurons in Dataset *A* that are not derived from the major trajectory. In contrast, the trajectory given by Slingshot after Seurat integration is messy and hard to interpret (Figure 5a).

In terms of uncertainty quantification, VITAE returns the edge score of each edge in the inferred trajectory (Figure 5b, shown as the width of each edge), indicating our confidence in whether an edge exists or not. In addition, we also quantify the uncertainty of each cell’s position along the trajectory by the projection MSE defined in (13) (Figure 5c). As discussed in more detail in Appendix F, the projection MSE tends to be smaller for cells that are near the vertices. Comparing the projection MSEs of the cells on different edges, we have larger uncertainty in the cell positions for the IPCs and immature neurons, most of which are actively involved in dynamic transitions.

Now, we further examine the change in gene expression along sub-trajectories. Figure 6 shows the top marker genes for two sub-trajectories. Using a *p*-value threshold of 0.05, we identify 790 differentially expressed genes along the sub-trajectory *RGC*→*IPC*→*Immature Neuron* and 813 differentially expressed genes along the sub-trajectory *NEC*→*RGC*→*OPC*. The differentially expressed genes are further ranked by their estimated covariate-adjusted correlation defined in (15). For each sub-trajectory, we visualize the expressions of the top 2 genes positively correlated with the pseudotime and the top 2 genes that are negatively correlated. We fit a LOESS curve of each gene for each dataset separately, and find that the gene expression patterns are consistent in both datasets, indicating that our pseudotime ordering is biologically meaningful. All these marker genes have been shown to potentially play essential roles in the development of mouse neocortical neurons (Ruan et al., 2020+). Notice that the pseudotime along the sub-trajectory *NEC*→*RGC*→*OPC* is positively correlated with the cell collection time, while there is no clear correlation between the pseudotime and cell collection time in the other sub-trajectory *RGC*→*IPC*→*Immature Neuron* (Figure 5d, Figure 6). In both cases, the gene expression patterns across the two datasets are consistent, despite the lab-specific batch effects and the non-overlapping cell collection days between these two datasets.

Finally, in terms of computational cost, our approach takes less than 3 minutes with 1 GPU and less than 10 minutes with 8 CPU cores to infer the underlying trajectory. As a comparison, the data integration and cell-cycle adjustment in Seurat take 11 minutes, and Slingshot takes a further 1 minute using its accelerated version. The computational cost of VITAE is thus comparable to that of these commonly used computational tools for scRNA-seq, while VITAE performs better in recovering the underlying trajectories.

## 7 Discussion

In this paper, we propose VITAE, a hierarchical mixture model combined with variational autoencoders, for trajectory inference. Compared with existing TI methods, VITAE provides a coherent probabilistic framework for TI in scRNA-seq, to explicitly define the trajectories and pseudotime that are estimable from scRNA-seq, to handle various trajectory topologies, and to simultaneously adjust for confounding covariates. VITAE leverages variational autoencoders to approximate posterior quantities, and infers the trajectory backbone and cell positions from posterior approximations. VITAE is scalable to handle large datasets by the automatic parallelization of deep learning models and the reduction of input dimension with PC scores.

Though there are many available tools for integrating multiple scRNA-seq datasets, it is typically challenging to integrate datasets when the cells have a continuous trajectory structure and do not fall into distinct cell types. As shown in Section 6, our approach, which simultaneously performs trajectory analysis with data integration, can better regularize the latent space to infer a cleaner shared trajectory while retaining biologically meaningful differences in each dataset.

Though our approach is powerful for TI, there are still some limitations. First, the approximate variational inference we use to infer the trajectory and quantify cell position uncertainties ignores the estimation uncertainties of our encoder and decoder. As a consequence, though our uncertainty quantification is informative, there is no guarantee that it is close to the true posterior uncertainties. In addition, we still need a more principled statistical framework to find differentially expressed genes along the trajectory, to account for the post-estimation bias and adjust for the confounding effects of known covariates. Since using scRNA-seq counts alone is often not enough to identify the true temporal lineage of cells, we also believe that extending our framework to model both the spliced and unspliced RNA counts with RNA velocities can provide more information in understanding the true dynamic process of cells.

## Acknowledgement

This work was completed in part with resources provided by the University of Chicago Research Computing Center. We thank Xiaochang Zhang for sharing the mouse neocortex dataset *A* and helpful suggestions on our analysis of the mouse neocortex.

## A Data preprocessing

We follow the procedure in Seurat (Stuart et al., 2019) to preprocess the raw scRNA-seq count matrix. First, we remove cells and genes whose expressions are all zero and obtain the raw count data for cells *i* = 1, …, *N* and genes *g* = 1, …, *G*^{raw}. Then, we normalize each cell by its library size and log-transform the normalized counts. Specifically, we compute the library size of each cell from its total raw count and a fixed number *N*_{C} (with the default value 10^{4} as in Seurat), and transform the raw counts to obtain the log-normalized counts.
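The normalization step can be sketched as follows, assuming the log-normalization used by Seurat's `LogNormalize` method: each count is divided by the cell's total count, multiplied by the scale factor, and passed through log(1 + x). The function name and the list-of-lists layout (cells by genes) are illustrative choices.

```python
import math


def log_normalize(counts, scale_factor=1e4):
    """Library-size normalization followed by a log(1 + x) transform.

    counts[i][g] : raw count of gene g in cell i; every cell is assumed to
    have a positive total count, since all-zero cells are removed beforehand.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)  # total raw count of the cell
        normalized.append(
            [math.log1p(c / total * scale_factor) for c in cell]
        )
    return normalized
```

Zero counts stay at zero after the transform, so the sparsity pattern of the count matrix is preserved.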

Next, we find highly variable genes following the procedure described in Stuart et al. (2019). Specifically, we calculate the mean *μ*_{g} and variance of each gene on the raw count data, and then fit a local polynomial regression with degree 2 and span 0.3 to regress the log_{10} variance on log_{10} *μ*_{g}. The fitted variance is then used to compute the standardized gene expressions.

For each gene, we calculate a feature importance score FI_{g} from its standardized expressions.

A larger value of FI_{g} indicates a more highly variable gene. By default, we select the first *G* = min{2000, *G*^{raw}} genes for downstream analyses. The input of VITAE is obtained by further scaling the expression of each selected gene across cells so that each gene has mean 0 and variance 1.

## B Loss Function

Note that the loss for each cell in (10) consists of several terms. Next we show how to compute these terms efficiently.

The term involving *p*(*Y*_{ig}|*Z*_{i}, *X*_{i}) is approximated using Monte Carlo samples from the approximate posterior distribution *q*(*Z*_{i}|*Y*_{i}, *X*_{i}). Here *p*(*Y*_{ig}|*Z*_{i}, *X*_{i}) depends on the distributional assumptions on *Y*_{ig}.

For scRNA-seq with UMI, *Y*_{ig}|*Z*_{i}, *X*_{i} follows a Negative Binomial distribution for *i* = 1, …, *N*. Note that only *λ*_{ig} depends on *Z*_{i} and *X*_{i}.

For non-UMI data, *Y*_{ig}|*Z*_{i}, *X*_{i} follows a zero-inflated Negative Binomial distribution, where *p*_{NB}(*y*; *λ*_{ig}, *θ*_{g}) denotes the Negative Binomial probability mass function with parameters *λ*_{ig} and *θ*_{g}.

For the accelerated VITAE with a Gaussian model, *Y*_{i} is replaced by the PC scores *F*_{i}. We assume that each *F*_{ig}|*Z*_{i}, *X*_{i} follows a Gaussian distribution.

When the category *c* corresponds to a vertex, the conditional density of *Z*_{i}|*c*_{i} = *c* is expressed through *φ*_{d}(·), the probability density function of the standard *d*-dimensional multivariate Gaussian distribution. When *c* corresponds to an edge, the conditional density of *Z*_{i}|*c*_{i} = *c* additionally involves Φ(·), the cumulative distribution function of the standard Gaussian distribution. The marginal density of *Z*_{i} then follows by combining these conditional densities over all categories.

Finally, since the approximate posterior density *q*(*Z*_{i}|*Y*_{i}, *X*_{i}) has a known parametric form, the corresponding term can be computed directly.
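To make the two count likelihoods concrete, the sketch below evaluates their log probability mass functions. It assumes the common mean-dispersion parameterization of the Negative Binomial (mean `lam`, dispersion `theta`) and a zero-inflation probability `pi`; these parameter names, and the exact parameterization, are assumptions for illustration.

```python
import math


def nb_log_pmf(y, lam, theta):
    """Log pmf of a Negative Binomial with mean lam > 0 and dispersion theta > 0."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + lam))
        + y * math.log(lam / (theta + lam))
    )


def zinb_log_pmf(y, lam, theta, pi):
    """Log pmf of a zero-inflated NB with zero-inflation probability pi < 1."""
    if y == 0:
        # A zero comes either from the inflation component or from the NB.
        return math.log(pi + (1 - pi) * math.exp(nb_log_pmf(0, lam, theta)))
    return math.log(1 - pi) + nb_log_pmf(y, lam, theta)
```

In a Monte Carlo approximation of the loss, such log-likelihoods would be evaluated at each sampled latent variable and averaged over the samples.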

## C Posterior Estimation

*Z*_{i}|*Y*_{i}, *X*_{i}. Given the approximate posterior distribution of *Z*_{i}|*Y*_{i}, *X*_{i}, we can use its mean as the latent representation of *Y*_{i}.

*c*_{i}|*Y*_{i}, *X*_{i}. A useful property of our model is *Y*_{i} ⫫ (*w*_{i}, *c*_{i})|*Z*_{i}, *X*_{i}, which is due to the fact that the probability mass function of *Y*_{ig}|*Z*_{i}, *w*_{i}, *c*_{i}, *X*_{i} is *p*(*y*|*Z*_{i}, *w*_{i}, *c*_{i}, *X*_{i}) = *p*(*y*|*Z*_{i}, *X*_{i}), which only depends on *Z*_{i} and *X*_{i}. Since we have already derived the formula for *p*(*z*|*c*_{i} = *c*) in Appendix B, the approximate posterior distribution of *c*_{i} follows, and we can obtain the predicted category of cell *i* by computing the MAP estimator.

For the estimated posterior distribution of the cell position, we compute its mean and covariance matrix. Here, as the integration over *w* is intractable, we approximate the integral by the rectangular rule with *M* equally spaced points in [0, 1].
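The rectangular rule can be sketched as below. The use of midpoints, the default *M* = 50, and the example integrand are assumptions for illustration. The example treats the density of a point *z* given an edge as the average, over a uniformly distributed *w*, of a standard Gaussian density centered at `w * u1 + (1 - w) * u2`, where `u1` and `u2` are hypothetical vertex positions in the latent space.

```python
import math


def rectangular_rule(g, M=50):
    """Approximate the integral of g over [0, 1] with M equally spaced midpoints."""
    return sum(g((m + 0.5) / M) for m in range(M)) / M


def standard_gaussian_density(x):
    """Density of the standard d-dimensional Gaussian at the point x."""
    d = len(x)
    return math.exp(-0.5 * sum(v * v for v in x)) / (2 * math.pi) ** (d / 2)


def edge_density(z, u1, u2, M=50):
    """Illustrative density of z on the edge between u1 and u2."""
    def integrand(w):
        center = [w * a + (1 - w) * b for a, b in zip(u1, u2)]
        return standard_gaussian_density([zi - ci for zi, ci in zip(z, center)])
    return rectangular_rule(integrand, M)
```

Increasing *M* trades computation for accuracy; for smooth integrands such as the one above, the midpoint version converges quadratically in 1/*M*.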

## D Edge Score

For the complete graph, recall the definition of the categorical assignment matrix ***C*** in (5). Four kinds of edge scores are presented below.

Edge Score Based on Posterior Mean of *c*. The first edge score is based on the posterior mean of the categorical assignment. For a given edge, we first restrict attention to a subset of cells selected with the threshold parameter *t*_{1} = 0.5. We consider this subset instead of the full data set because not every cell contributes to the existence of the edge. Within this subset, the proportion of cells at the edge gives the first score.

Edge Score Based on Modified Posterior Mean of *c*. A modified version of the first score is obtained by using a first-order approximation.

Edge Score Based on MAP of *c*. Alternatively, we can also compute the edge score based on the MAP estimates of the *c*_{i}’s. An edge score can then be defined as the ratio of the number of cells at the edge to the number of cells at the vertices *j*_{1}, *j*_{2} or the edge.

Edge Score Based on Modified MAP of *c*. Since the categories assigned by the MAP of *c* are generally at the edges instead of the nodes, the estimated denominator will be smaller than the real one, and the score will be large when the numbers of cells at nodes *j*_{1} and *j*_{2} are small. To address this problem, we further modify the denominator to get the modified MAP edge score.
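The MAP-based edge score described in words above can be sketched as a simple count. The encoding of categories as tuples, with `(j, j)` for vertex *j* and `(j1, j2)` with `j1 < j2` for an edge, is a hypothetical convention introduced for this example, and the modified denominator of the last score is not reproduced.

```python
def map_edge_score(map_categories, j1, j2):
    """Share of cells on edge (j1, j2) among cells on that edge or its endpoints.

    map_categories[i] : MAP category of cell i, encoded as (j, j) for
    vertex j or (a, b) with a < b for the edge between vertices a and b.
    """
    edge = (min(j1, j2), max(j1, j2))
    relevant = {edge, (j1, j1), (j2, j2)}
    on_edge = sum(1 for c in map_categories if c == edge)
    nearby = sum(1 for c in map_categories if c in relevant)
    return on_edge / nearby if nearby else 0.0
```

When few cells are assigned to the two endpoint vertices, the denominator is small and the score is close to 1, which is the behavior the modified score is designed to temper.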

## E Proof of Proposition 1

**Proof.** Decomposing the objective in (11), we obtain an expression whose last two terms do not involve ***w***. Minimizing the objective in (11) is therefore equivalent to minimizing the objective in (12) over ***w***, and the optimization problem (11) is thus equivalent to the optimization problem (12).

We divide the problem into two cases: finding the best projection onto edges and the best projection onto vertices.

For the projection onto edges, notice that for any edge the objective function is quadratic in the two weights on that edge, so its minimizer and the corresponding minimum projection error are available in closed form. If we want edge (*j*_{1}, *j*_{2}) to have the smallest projection error, its minimum projection error must be no larger than that of any other edge. Thus the best projected edge can be obtained by comparing the minimum projection errors across edges.

For the projection onto vertices, the problem reduces to evaluating the projection error at each vertex, and the vertex with the smallest projection error gives the best projection onto vertices.

Finally, comparing the best projection onto edges with the best projection onto vertices and taking the one with the smaller projection error gives the solutions to the optimization problems (11) and (12).
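The two-case argument can be illustrated with a short sketch. It assumes that the target `m` is a vector of vertex weights (for instance a posterior mean), that the projection is Euclidean, and that a point on edge (*j*_{1}, *j*_{2}) has weight `a` on *j*_{1}, weight `1 - a` on *j*_{2}, and zero elsewhere. These assumptions and the function name are for illustration and are not taken verbatim from the proposition.

```python
def project_onto_backbone(m, edges):
    """Euclidean projection of a weight vector onto vertices and edges.

    m     : list of k weights, one per vertex
    edges : list of vertex pairs (j1, j2) in the backbone
    Returns the projected weight vector and its squared projection error.
    """
    k = len(m)
    total = sum(x * x for x in m)
    best_err, best_w = None, None
    # Case 1: projection onto each vertex (all weight on one vertex).
    for j in range(k):
        err = total - m[j] ** 2 + (m[j] - 1.0) ** 2
        if best_err is None or err < best_err:
            best_err = err
            best_w = [1.0 if i == j else 0.0 for i in range(k)]
    # Case 2: projection onto each edge; the objective is quadratic in a.
    for j1, j2 in edges:
        a = min(1.0, max(0.0, (m[j1] - m[j2] + 1.0) / 2.0))
        err = (total - m[j1] ** 2 - m[j2] ** 2
               + (m[j1] - a) ** 2 + (m[j2] - (1.0 - a)) ** 2)
        if err < best_err:
            best_err = err
            best_w = [0.0] * k
            best_w[j1], best_w[j2] = a, 1.0 - a
    return best_w, best_err
```

Vertices are handled separately because an isolated vertex lies on no edge; for a vertex that does belong to an edge, the clipped solution `a` in {0, 1} recovers the same point.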

## F Discussion on Uncertainty Quantification

We observed in Section 6 that our calculated projection MSE is typically smaller for the cells near vertices. Why do we observe such a pattern? Is it due to the estimation bias in our approximation of the posterior distribution, or is it an intrinsic property of the *L*_{2} loss under our mixture model? In order to answer these questions, we consider the case where we observe the latent space *Z*_{i}, so that we can compute the true posterior distributions of the cell positions. Also, we focus on the sum of the posterior variances of the cell position, as from our observations it is typically the leading term in (13). Specifically, we define our hierarchical model with observed *Z*_{i} following (3) and (6) by setting *d* = 2 and *k* = 3. The defined backbone is shown in Figure 7a and we choose vertex 1 as the root of the trajectory.

First, we choose the category probabilities so that the edges and vertices have equal probabilities. As *Z*_{i} is observed, we can compute the true posterior mean and variances from Appendix C for any given *Z*_{i}. Here we show the posterior variances for values of *Z*_{i} that lie exactly on the trajectory and discuss how the variances change with the pseudotime *T*_{i}. The cell *i* is at a vertex when *T*_{i} = 0, 1 or 2. Figure 7b shows how the posterior variance of each component of the cell position changes with *T*_{i}. The leading term of the projection MSE, which sums up the three terms in Figure 7b, has an M-shape as shown in Figure 7c, indicating that the cells indeed have smaller projection uncertainties when they are near the vertices.

One explanation of the above phenomenon is that, as there are nonzero probabilities exactly at the vertices, the distribution is denser near the vertices. As a consequence, there is less uncertainty in the position of a cell if it is closer to a vertex. However, what we find surprising is that when we set the vertex probabilities to zero, so that the cells can only be on the edges, we still observe an M-shape for the change of the leading term along *T*_{i}. In other words, the projection MSE from true posteriors is smaller near the vertices no matter whether the vertices have nonzero probabilities or not, though the difference is smaller when the vertices have zero probabilities. As a consequence, we believe that this pattern is an intrinsic property of the *L*_{2} loss under our mixture model.