## ABSTRACT

RNA velocity analysis of single cells promises to predict temporal dynamics from gene expression. Indeed, in many systems, it has been observed that RNA velocity produces a vector field which qualitatively reflects known features of the system. Despite this observation, the limitations of RNA velocity estimates are poorly understood. Using real data and simulations, we dissect the impact of different steps in the RNA velocity workflow on the estimated vector field. We find that the process of mapping RNA velocity estimates into a low-dimensional representation, such as those produced by UMAP, has a large impact on the result. The RNA velocity vector field is strongly dependent on the k-NN graph of the data. This dependence lead to large estimator errors when the k-NN graph is not a faithful representation of the true data structure; a feature which cannot be known for most real datasets. Finally, we establish that RNA velocity does not estimate expression speed neither at the gene nor at the cellular level. We propose that RNA velocity is best considered a smoothed interpolation of the observed k-NN structure, as opposed to an extrapolation of future cellular states, and that the use of RNA velocity as a validation of latent space embedding structures is circular.

## INTRODUCTION

RNA velocity analysis is a widely used method to infer temporal dynamics in single-cell gene expression data. In it’s original definition, RNA velocity is the time derivative of gene expression state (*ds/dt* with *s* representing the high-dimensional expression state and t time)(La Manno et al., 2018). Based on earlier work of Zeisel et al. (2011), La Manno et al. (2018) propose a model built using an ordinary differential equation model of transcription and the assumption that the relationship between spliced mRNA expression and unspliced pre-mRNA expression can be used to infer whether a gene is in the process of being up- or down-regulated or in a steady expression state. Using this model, they derive an estimator for velocity using single-cell expression data. This model and estimation procedure was later extended by Bergen et al. (2020). These two variants are known as the steady state model (La Manno et al., 2018) and the dynamical model (Bergen et al., 2020).

The primary output of an RNA velocity analysis for single-cell gene expression data is a vector field visualized on a low-dimensional embedding, usually constructed using UMAP or t-SNE. This vector field is supposed to represent local changes in expression state; one might describe the output as an alternative to trajectory inference analysis (Weiler et al., 2021). This output is often used to provide evidence for the correctness of the embedding. An intermediate output towards the low-dimensional vector field is the high-dimensional estimation of *ds/dt*; we refer to these estimates as “velocities” (or gene-specific rates of change). This estimation is performed gene-by-gene using the biophysical ordinary differential equations. We will use “velocity vector field” to refer to the low-dimensional vector field which is obtained by mapping the velocities into a low dimensional space. In many datasets, the visualization of the velocity vector field appear to reflect what is known about the biological system and the relationship between cell states/types in the system.

Despite the widespread use and popularity of RNA velocity, studies that validate the highdimensional RNA velocity estimates at the gene level are lacking. A major reason is that it is extremely hard to measure the instantaneous rate of change of expression in a single cell. Q Qiu et al. (2020) used metabolic labeling coupled with scRNA-seq to distinguish between newly synthesized mRNA and older mRNA. They directly compared the velocity estimates from metabolic labeling with the splicing kinetics based RNA velocity of 3 genes and found a poor correspondence between the two types of velocity estimates. It can be argued, however, that the explicit aspects of mRNA biogenesis estimated by metabolic labeling are subtly different from the models used in RNA velocity.

The potential of predicting future state of individual cells has spurred huge interest in RNA velocity among the single-cell research community, resulting in many reviews and work building off the velocity framework. On the review side, Bergen et al. (2021) highlight examples where RNA velocity vector is not compatible with the known biology of the system, and propose that this is due to assumptions of multiple kinetic regimes, transcriptional boosts, high noise, or time-constant rates of transcription, splicing, and degradation. They further envision the use of gene regulatory network and multimodal omics to expand RNA velocity models. Gorin et al. (2022) is an in-depth discussion and critique of RNA velocity, and complements the work we present here. The paper covers the underlying mathematics of RNA velocity in detail and advocates for a more rigorous approach to RNA velocity respecting the discrete nature of transcription in single cells. Although the work covers all aspects of RNA velocity, there is particular focus on the inherent discrete nature of the problem and its implications for preprocessing and biophysical modeling. The manuscript primarily discusses the python implementation of velocyto.

While numerous tools seek to improve on velocity, many methods build directly off the original framework. For example, Dynamo integrates metabolic labeling into the splicing-unspliced dynamics based model, and tries to mathematically recover the whole velocity vector field in the lowdimensional embedding even for regions without any cells existing (X Qiu et al., 2022). CellRank combines the k-NN graph built from the expression profile with the RNA velocity transition probability to detect initial, intermediate and terminal population during differentiation (Lange et al., 2022). Marot-Lassauzaie et al. (2022) first discusses several theoretical and computational problems, and then propose two alternative RNA velocity or RNA velocity flavored approaches. First, *κ*-velo tries to address the scale invariance problem by incorporating cell densities with the assumption that cell density is inversely proportional to the average velocity for each gene. Second, eco-velo is a heuristic RNA velocity flavored approach which does not infer highdimensional RNA velocity. Instead, eco-velo uses the first mutual nearest neighbor (MNN) of a cell in the unspliced matrix space to the spliced matrix space as the proxy for future state in the spliced matrix space. A vector field is created based on the displacements between future state and current state in the spliced matrix space. Gao et al. (2022) proposes a revised high-dimensional RNA velocities estimation model UniTVelo which impose a gene-shared cell latent time to circumvent the independent estimation issue of other RNA velocity estimations approaches.

Here, we deconstruct the underlying workflow by separating the (gene-level) velocity estimation from the vector field visualization. We then analyze how the methods for mapping and visualizing the vector field impact the interpretation of RNA velocity and discover the central role played by the k-NN graph in both velocity estimation and vector field visualization. By using both simulations and real data, we identify situations where RNA velocity estimates are accurate and evaluate the extent to which the visualizations allow us to discover new structure in the data. We also explore whether – as their name suggests – velocity estimates can provide quantitative information about the speed at which cells progress along a trajectory.

## RESULTS

### RNA velocity analysis and its implementations

RNA velocity analysis consist of two primary steps:

Spliced and unspliced counts are preprocessed (smoothed/imputed) and a cell-specific velocity is estimated separately for each gene.

The high-dimensional velocity estimates are mapped into a low-dimensional embedding and the resulting vector field is visualized on this embedding.

Assessments of RNA velocity analysis results are usually qualitative: does the resulting vector field visualization accurately capture known cellular dynamics of the biological system? This endeavour centers the resulting vector field visualization (step 2) and we will discuss its properties in detail.

The two primary approaches to RNA velocity are the steady-state model (La Manno et al., 2018) and the dynamical model (Bergen et al., 2020). A main differences between the two are the specific assumptions about the parameters in the biophysical models of transcript abundance that are used to estimate cell-specific velocities (step 1 above), although there are many other differences which impacts the result (Supplementary Note 3.1). Here, we primarily focus on the newer dynamic model (Bergen et al., 2020), as implemented in scVelo. We also occasionally make comparisons to the steadystate model, also using its scVelo implementation (which we chose over other implementations of the steady-state model to ensure that we can keep other parameters of the workflow constant)

The complete series of operations involved in an RNA velocity analysis workflow is outlined in Figure 1 using the specific implementation for scVelo (note that Gorin et al. (2022) has a similar velocyto-centric figure). We divide Step 1 (inference of highdimensional RNA velocity values) into two parts: preprocessing and gene-level velocity estimation (Figure 1a). In the preprocessing step the raw count data is smoothed by a k-NN graph constructed exclusively from the spliced counts. A number of additional steps, such as library size adjustment and log-transformation, are performed and discussed in Gorin et al. (2022). Following this, cell-specific velocities are estimated by fitting the smoothed spliced and unspliced counts to the appropriate biophysical model described by either the the steady state or dynamical model for each gene independently (Methods). Importantly, valid velocity estimations are only retained from a reduced set of genes where the model is considered to be well fit.

In the second step, after estimation of the highdimensional velocities, they are then visualized^{1} on a low-dimensional embedding, usually constructed using UMAP or t-SNE from principal components learned from the spliced count matrix. This consists of two steps (Figure 1b). First, the velocity estimates are mapped into the existing lowdimensional embedding. The most common solution is to represent the velocities in this reduced dimensional space using transition probabilities, a method developed by La Manno et al. (2018) and further modified by Bergen et al. (2020). After mapping, we are left with low-dimensional velocity vectors that are visualized using approaches such as streamline plots or gridding average vectors (Supplementary Note 3.2). In practice, this last visualization step almost always includes an additional smoothing, such as kernel smoothing over the embedding; furthermore the lengths of summarized vectors are usually rescaled.

Most attention in the literature has been given to a qualitative visual assessment of the RNA velocity vector field. We therefore start by examining how parts of the workflow impacts visualization.

### Mapping velocities into a low-dimensional embedding

To visualize scRNA-seq data on a nonparametric embedding, such as t-SNE or UMAP, mapping velocity estimates into an existing embedding is a nontrivial problem (unlike recomputing a new embedding based on the expression data supplemented with the predictions), and we briefly describe existing approaches to this problem. La Manno et al. (2018) introduced a method we will refer to as “velocity transition probabilities” (later modified by Bergen et al. (2020) which is the version we focus on here). This method has a central role in RNA velocity analysis, since it is used in most RNA velocity vector field visualizations. The method also serves as the basis for the CellRank method for predicting fate specification (Lange et al., 2022). Velocity transition probabilities are claimed to provide a solution to the mapping problem which is compatible with *any* type of low dimensional embedding including UMAP, t-SNE and PCA, provided the embedding is constructed using expression data from the same cells yielding the velocity estimates. In addition to velocity transition probabilities, we have embedding-specific approaches such as an orthogonal projection operator for principal component plots and UMAP-transform for mapping into UMAP space (a method supplied by the UMAP authors) (McInnes,2021).

After estimation of high-dimensional velocity by either steady model or dynamical model, we have the observed current expression state *s* and the estimated velocity *v* (both high-dimensional) (Methods). A first-order Taylor expansion of the expression state yields the following approximate relationship between current expression level *s*, velocity *v* = *ds/dt* and future expression level *s**:
which requires a choice of time step Δ*t*. For a fixed Δ*t*, there is a one-to-one relationship between the velocity and the future expression.

If we have the ability to compute *E*(*y*) where *y* is any point in the input state, we can map velocities as

Here we are using *E** to indicate that the left hand side is a new operator defined by the right hand side. If the mapping operator is linear (which is the case for principal component analysis), we get
and there is no difference between *E**(*v*) and *E*(*v*), and the impact of choosing a time step Δ*t* is just an overall scaling. If the mapping operator is not linear (which we believe is the case for UMAP-transform), there is a difference between *E**(*v*) and *E*(*v*), and the time step matters.

These computations require us to be able to compute *E*(*y*) where *y* is an arbitrary new point in the expression space. The velocity transition probabilities were designed to work in cases where this is not directly obvious, as in the case of nonlinear embeddings. The only points we have available in the embedding space are the mapping of the observed cells, denoted as *E*(*s*_{1}),…, *E*(*s*_{n}). If we focus on a specific cell *i,* the idea is to represent the velocyto embedding as a weighted sum of empirical differences:
for suitable choices of weights (depending on the cell in question). The weights are constructed to give higher weights to cells that have a higher cosine similarity between *S _{j}* –

*S*and

_{i}*v*(an approach that bypasses the choice of a time step Δ

_{i}*t*). In other words, you compute a future expression state in which high weights are given to cells that have closer neighbors on the k-NN and to which you believe the cell is transitioning towards. In scVelo, this sum is restricted to include only terms from neighbors of neighbors of the cell

*i*in the k-NN graph of the spliced counts. A consequence of this approach is that the direction of the embedded velocity must be towards the (convex hull) of neighbours of neighbours of the cell. This has the undesirable effect of assuring that the direction of the velocity vector for a given cell is entirely dependent on the expression states of its nearest neighbors, with consequences to constrain the vectors pointing to other existing cells around the query cell (detailed below).

### Transition probabilities imposes directional constraints

To illustrate the critical influence of neighbouring cells on the resulting vector field, we consider the pancreas dataset featured in Bergen et al. (2020) using the scVelo package for velocity estimates (Figure 2a, b). As previously described, a standard velocity analysis suggests that pre-endocrine cells (orange) flow towards beta cells (light green) as ex-pected. To demonstrate the dependence on nearest neighbor expression estimates, we then fix the highdimensional velocity estimates (which are based on the full dataset) but remove the beta cells (light green) from the embedding step. Because the beta cells are removed, the recomputed velocity transition probabilities force the vectors to point to another part of the available embedding: the alpha cells (dark blue) (Figure 2c and Supplementary Figure S1a) resulting in a dramatic difference in the interpretation of the future state of the pre-endocrine cells. Figure 2d is a quantitative display of the change in low-dimensional vector directions for all cell-specific vectors of pre-endocrine cells in the red rectangle as a result of removing the beta cells from the embedding.

This example illustrates a significant caveat of using transition probabilities for low-dimensional estimation of velocity vectors, in that it is impossible for velocity transition probabilities to represent unseen parts of an expression state space. Importantly, this suggest that the visualization of RNA velocities is more akin to an interpolation of observed expression states and less of an extrapolation of future state. An example in which this impacts data analysis is the comparison of multiple samples with sample-specific cell types or states, perhaps as a consequence of spatial heterogeneity (i.e., technical). In this case, the missing cell types or states will locally warp the vector field when using transition probabilities.

### Visualizing velocity without noise

To further examine the impact of velocity transition probabilities, we next employ a simulation experiment in which we control and account for varying levels of noise. We follow the simulation strategy of Bergen et al. (2021) and generate 500 cells with 10 genes following the dynamic model with low noise level. Figure 3a depicts one of these genes and shows that the dynamics of the simulated gene fits well with the dynamical model. To focus on visualization instead of high-dimensional velocity estimation, we use our simulation to define true velocities (defined as the velocity of the expression state prior to adding random noise). Our simulated data have a single trajectory without cycles or bifurcations, and this fact is captured by our UMAP representation (Figure 3b, c). Using velocity transition probabilities for visualization, we obtain a vector field which flows in the correct direction (color indicates time). There is one issue with this vector field: At the start of the field, all vectors are close to zero with a seemingly random direction; this is not supported by the simulations, as the true velocities of these cells are relatively uniform in their direction and magnitude. In contrast, using UMAP-transform for visualization, we obtain a vector field largely similar to the one created using velocity transition probabilities, with the exception around the start of the trajectory, where we observe substantially longer vectors with the correct direction.

We next assessed the impact of velocity transition probabilities when mapping velocities to a linear embedding (PCA). For our simulated data, principal component analysis reveals the expected single trajectory in the first two dimensions with a topology which is highly similar to that of the UMAP plot (Figure 3d, e). The median of cosine similarities of cell-level vectors produced by transition probability and direct projection is as high as 0.984, representing a high level of agreement in directions. However, in this linear space as well, the velocity transition probabilities have issues correctly mapping the vectors around the start of the trajectory. Using the PCA projection operator, which can be considered the truth in this particular embedding, reveals vectors of substantial length and the correct direction.

To bypass any kind of visualization summarization that might mask local discrepancies, we compare the vector lengths obtained by the two approaches for each cell and confirm that the start of the trajectory is an area of high discordance between the two mapping methods (Figure 3f). Furthermore, we can quantify the relationship between high-dimensional velocities (true velocities in this simulation) and embedded velocities. This reveals the length of the PCA-projected velocities to be related to the length of the true high-dimensional velocities (Figure 3g). In contrast, the lengths of the embedded vectors using velocity transition probabilities do not have any meaningful relationship with the high-dimensional vectors (Figure 3h). Note that, in physics, speed is defined as the length of the velocity vector.

In conclusion, in this simulation experiment, the different mapping approaches yield a qualitatively similar vector field with an exception around the start of the trajectory where the velocity transition probabilities fail. Importantly, the reasons for why this particular portion of the transition probability mapping fails is unclear, but this failure is not observed when using ‘projection-based’ methods for low-dimensional velocity vector field mapping. Furthermore, for all simulated cells, there is little to no relationship between the length of the vectors mapped using velocity transition probabilities and the true high-dimensional velocities. Using a PCA-based projection does a substantially better job at reflecting velocity vector lengths in high-dimensional space.

### Velocity estimation and visualization is strongly dependent on the k-NN graph

Single-cell data are known to be noisy. We next asked how increasing noise might affect lowdimensional vector field visualization. In the previous section we examined issues with vector field visualization in a simulation setting with (unrealistic) low noise. When we use the same dynamics but increase the noise by 5x, the resulting PCA changes from a one-dimensional manifold to a ball (Figure 4a). Different time points are located roughly in distinct quadrants of the ball and the time progression is clockwise. The similar ball-like observation is made when we visualize the data using UMAP (Supplementary Figure S2). Again, we project the true velocities using the PCA projection matrix and the resulting vector field broadly reflects the time progression (as expected) (Figure 4b). Note that neighboring vectors belonging to different quadrants point in very different directions, giving the impression of a very noisy visualization unless the true time progression is already known for reference. Notably, using a visualization method such as a streamline plot will have a smoothing effect that hides this behaviour. We consider this vector field to be the gold standard for this embedding.

If we estimate velocities using the dynamical model and use the velocity transition probability method for visualization, we obtain a vector field with a smooth progression from left to upper right (Figure 4c). Although visually pleasing (with apparent consistency in direction amongst neighbors), more than 50% of the vectors are strongly dissimilar from the gold standard (Figure 4g). Note that this comparison is made at the cell level; the visualizations in Figure 4 use the gridding approach. If we keep the velocity estimates from the dynamical model, but map them using the PCA projection, we obtain a flow going from the edge of the circle towards its center (Figure 4d); not reflecting the true time progression. Because neither visualization method compares favorably to the gold standard, it is tempting to conclude that this is a result of noise overwhelming the velocity estimates (we observe similar failures for steady model in the Supplementary Figure S3).

However, using the right k-NN graph is critical for correct inference. When we increase the noise level, two things are happening simultaneously. Gene-level measurements have added noise, and the k-NN graph learned from the data is perturbed away from the true graph; the latter is reflected by the significant change in the PCA layout. To investigate why the vector fields are wrong, we obtain the true k-NN graph and use this k-NN graph to preprocess both the spliced and unspliced matrices and to construct the velocity transition probabilities. Using the true k-NN graph yields a vector field that is substantially more aligned to the gold-standard; there is a slight improvement when using the PCA projection matrix compared to velocity transition probabilities (Figures 4e-g). The gridding visualization of the velocity transition probabilities gives a noisy impression, which is not fully reflected by the cell-level comparisons in Figure 4g.

Is the improvement in using the true k-NN graph driven by its usage in preprocessing or in mapping? When we attempt combinations of true and learned k-NN graphs for the two steps, we observe the largest improvement by using the true k-NN graph in the preprocessing step (Supplementary Figure S3).

We draw several conclusions from this example. First, it reveals that RNA velocity is critically dependent on the k-NN graph for vector field visualization. Second, we see how the true k-NN graph can be distorted by noise, an observation that is likely to be relevant outside of RNA velocity given the importance of k-NN graphs in single-cell expression analysis. Third, we observe that smoothness of the vector field does not imply correctness. Fourth, we are intrigued by the similarity of the “ball-of-cells”-like embedding to some existing single-cell embeddings: we refer to embeddings that show a dense structure where different parts of the structure appear to consist of distinct cell types. An example is the data depicted in Figure 4h which is comprehensively discussed in Bergen et al. (2021) as an example where the RNA velocity fails. Our simulation experiment suggest this could arise primarily from noise deforming the k-NN graph.

### Evaluating gene-level RNA velocities using simulations

So far, we have focused on the low-dimensional velocity vector field on a fixed embedding. It is natural to ask to what extent the gene-level estimates are accurate. To do so, we turn to our simulations. Using a noise level of 3, the topology of the PCA plot is a 1-dimensional trajectory (Supplementary Figure S4), with some noise. Using a k-NN learned from the data, the estimated velocities are quite inaccurate when considered as a function of the true time (Figure 5a-c, Supplementary Figure S5 for additional quantities). When switching to the true k-NN graph, the dynamical model works substantially better, although there is still some discrepancy between the true velocity and the estimated velocity (Figure 5d-f). Supplementary Figure S5 expands on the quantities depicted in Figure 5 and Supplementary Figure S6 depicts the situation when we increase the noise level to 5, the situation where the PCA plot changes from a 1-dimensional trajectory to a ball.

Supplementary Figure S7a,b depicts a comprehensive evaluation across all genes and across many noise settings for both the dynamical and the steady state model. As the noise level increases, the dynamical model cannot be fitted to most of the genes, and the performance measures are restricted to the few genes that provide a passable fit (as determined by the software). We make several observations. First, the performance is markedly better when using the true k-NN graph compared to the learned k-NN graph. Second, the steady state model performs better when the assessment criteria is independent of scale (such as PCC) whereas the dynamical model performs better when the assessment criteria is scale-dependent (such as NRMSE).

Aside from comparing the velocity values gene-by-gene, we can compare the speed and direction of the high-dimensional velocity vector cell-by-cell. Speed is the length of the velocity vector. Using a noise level of 3, we find poor concordance between the estimated true speed when using the learned k-NN (Figure 5g). This is substantially improved by using the true k-NN (Figure 5h). The steady state model shows a similar behaviour (Supplementary Figure S8), although the scale of the length is wrong. We can compare the speed estimates with the truth using both absolute difference (a measure dominated by the scale of the length) and Pearson correlation (unaffected by the scale) (Supplementary Figure S9a,b). Together, these two measures reveal that the estimates of speed from both the dynamical and the steady state model fails to reflect the truth, at least when the learned k-NN graph is used. Direction is harder to assess in high-dimensions; the cosine similarity is a step towards this goal, and it suggests poor concordance between estimated and true directions when using the learned k-NN graph (Supplementary Figure S9c). Using the true k-NN graph leads to substantial performance improvements in estimating both speed and direction, but the overall performance is still poor for high noise levels.

In conclusion, the gene level RNA velocity estimations are highly dependent on the k-NN graph used to smooth the data. Even in a simulated setting where the observed k-NN graph reflects the true underlying structure (the PCA plot shows a 1-dimensional trajectory), using the observed k-NN graph results in substantial errors in estimated ve-locities.

### Correct estimate of the vector field does not imply accurate high-dimensional velocity estimation

Our ability to assess RNA velocity is limited by the technical difficulties in measuring the instantaneous rate of change in expression for many genes in single cells. To evaluate the RNA velocity estimates in real data, we consider a recent dataset on the cell cycle measured using the FUCCI system combined with scRNA-seq (Mahdessian et al., 2021). The cell cycle is a well understood periodic process (Whitfield et al., 2002), where cell cyclerelated genes go through phases of up- and downregulation. In this dataset, we can place each cell in a continuum that represents the cell cycle. One approach to this goal is to take advantage of the FUCCI system which tracks cell cycle time by the protein levels of two key cell cycle regulators, CDT1 and GEMININ. However, in our recent work, we have established that it is possible to improve this cell cycle time by projecting this data into a lowdimensional space representing the cell cycle (SC Zheng et al., 2022). We refer to these two time representations as the tricycle cell cycle time (position) and the FUCCI pseudotime.

First, we note that cell cycle genes have a specific role in RNA velocity estimation. Due to the periodicity, cell cycle genes go through both an up- and a down-regulation phase, but, perhaps surprisingly, they dominate among such genes in existing singlecell datasets. In practice this suggests that dynamically regulated cell cycle genes *should* sample both the increasing and decreasing portions of the RNA velocity biophysical models, providing for a potential better fitting of these models. Indeed, if we look at the top ten best fitting genes across ten difference single-cell expression datasets, the only genes having a “complete” phase portrait are cell cycle genes (Supplementary Figure S10). Non cell-cycle genes are either in the induction or repression phase. For this reason, we believe that cell cycle datasets such as the FUCCI dataset we consider here are amongst the most suitable for RNA velocity analysis using velocity estimates derived from current biophysical models.

Using the FUCCI data, the visualized vector field reflects cell cycle time (Figure 6a), and so does the phase plot of the 10 best fitting cell cycle genes (Figure 6b and Supplementary Figure S11).

Using our estimated cell cycle time from tricycle, we can define the up- and down-regulation phases for each gene, by smoothing expression using the cell cycle time loess fitted line (Figure 6c). Comparing these phases with the sign of the estimated velocities using the dynamical model reveals a high concordance (Figure 6d). A feature of the dynamical model is the estimate of a cell-specific latent time which has good concordance with cell cycle time for *MELK* (Figure 6e). Here, we use the gene-specific latent time, which is different from the gene-shared latent time(Bergen et al., 2020).

If we use the steady-state model, we get essentially the same vector field (Supplementary Figure S12a). However, there are non-negligible in-consistencies between gene-level RNA velocity estimations from the steady state and the dynamical model. For example, for the *MELK* gene 24% cells exhibit different directions of change (Supplementary Figure S12b-c) between the two models; the dynamical model agrees better with the direction of change inferred using cell cycle position. For each gene, we compute the PCC between the estimated velocities from the two models as well as the number of cells with inconsistent directions (Supplementary Figure S12d). Across genes, the median PCC between the models is only 40% with 30% of cells showing inconsistent direction of change. It is intriguing that the two models yield qualitatively similar vector fields despite the substantial differences in estimated velocities. This, again, raises the issue of to what extent the velocities actually impact the vector fields. We emphasize that this comparison only focuses on direction of change and not rate of change.

If we take all the cycle genes with a *R*^{2} ≥ 0.5 for Ms over the cell cycle position, we observe that the dynamical model is more consistent in the inference of velocity direction (Supplementary Figure S13).

Using the tricycle cell time can be criticized because it is inferred using the expression data. If we replace tricycle cell cycle time with FUCCI pseudo-time, we observe a higher degree of discordance with FUCCI pseudo-time (Supplementary Figure S14) compared to tricycle time. Supplementary Figure S15 shows 10 additional genes with similar behavior.

In summary, when it comes to the inference of direction of expression change, RNA velocity appears to work well in this cell cycle system, likely due to the fact that dynamic genes in this system experience both induction and decrease over the course of the cell cycle and potentially a better fitting to the underlying models of transcriptional regulation. We note the special role of the cell cycle in velocity analysis: the periodic nature of the cell cycle fits well with the biophysical model of transcription, and cell cycle genes often amongst the best fitting genes across biological systems. As a result, the velocity vector field is correct and the direction of change of the gene-level velocities are largely correct for the dynamical model. We find it noteworthy that the steady state model also yields a correct velocity vector field despite the two models yielding high-dimensional velocity estimates with substantial inconsistencies. Based on this, we believe the cell cycle represents a system which is particularly well suited to RNA velocity and we caution against generalizing its performance to other systems.

### A quality control measure for RNA velocity model

A central question is when to trust RNA velocity vector fields. To assess this, we focus on how well the estimated latent time explains variation in the expression estimates. In the dynamical model, we estimate a gene-specific latent time representing the unknown time parameter in the differential equations; this quantity is not available from the steady model. In a subsequent step, the gene-specific latent time is summarized into a gene-shared latent time, by taking the quantile (across genes) of the estimated gene-specific latent times. We ask how much variation in expression is explained by a loess fit on the gene-specific latent time (Figure 7a,b). As an example, for one specific gene, for simulated data, we observe *R*^{2} = 0.73 using the estimated latent time compared to *R*^{2} = 0.52 using the true time. This suggests some degree of overfitting which could be caused by either the preprocessing step or the model fit itself. Fortunately, in addition to the spliced counts, we also get unspliced counts, which are also smoothed but using the k-NN learned from the spliced counts. By comparing the difference between the spliced and unspliced expression matrices as a function of estimated latent time or the true time (Figure 7c,d compared to Figure 7a,b), we conclude that the source of the overfitting is the preprocessing step of smoothing using the spliced k-NN graph. We argue that this can be avoided by using the unspliced instead of the spliced matrix to assess how well RNA velocity model fitting works across genes, because the pre-crocessing is done using the spliced k-NN graph and not the unspliced.

We applied this to the top 300 genes (ranked by the likelihoods of the dynamical model) in ten datasets (Figure 7e). The PBMC68k data – highlighted by Bergen et al. (2021) as a dataset where RNA velocity fails – is an outlier in this display, with a very high discrepancy between *R*^{2} for the spliced and unspliced expression values, and a very low *R*^{2} for the unspliced values. The two gastrulation data sets that have the second and third lowest *R*^{2} values have been reported to show unexpected directions in their RNA velocity vector fields that oppose the true differentiation path (Barile et al., 2021); the authors speculate that these inconsistencies are driven by time-dependent changes in expression dynamics. We interpret a small value of *R*^{2} for the unspliced matrix as evidence of a poor fit of the RNA velocity model. And we interpret a discrepancy between thee *R*^{2} of the spliced and unspliced matrix to be evidence of overfitting arising from the k-NN smoothing. Altogether, this suggests that our measure has some functionality in the sense that extreme behavior on this measure should inspire low confidence in the vector fields.

## DISCUSSION

In physics, velocity is the combination of direction and speed and is defined as v =ds/ dt, the derivative of position s with respect to time t. RNA velocity promises to bring temporal dynamics into gene expression analysis by estimating the derivative of expression with respect to time. When assessing this promise, it is important to separately consider direction and speed (the length of the vector). For both direction and speed, RNA velocity produces two quantities: high-dimensional velocity vectors (each entry corresponds to a gene) and representations of these vectors in a suitable low-dimensional space, such as a UMAP embedding. To distinguish between these quantities, we refer to the low-dimensional representation as the “velocity the vector field”, which is always relative to the choice of low-dimensional space and is easy to visualize.

### Why does RNA velocity appear to work?

The low-dimensional directions inferred by RNA velocity appear to be successful at describing known biology in many systems. “Describing known biology” is a qualitative statement reflecting that the visualized vector field reflects known progression through a system; an example is the pancreas dataset (Figure 2). We emphasize that this qualitative assessment is always exclusively in reference to the direction of the vector field; we will discuss speed below.

Most single-cell expression analyses – and all existing RNA velocity analyses – start by constructing a low dimensional embedding (usually) using UMAP. It is an open question under which conditions UMAP is ever guaranteed to reflect biological truth. Despite this lack of guarantees, it is indisputable that UMAP often achieves a representation which qualitatively reflects existing knowledge (related cell types and states are placed close together). We argue that any successful RNA velocity analysis starts with a UMAP which is deemed to represent existing or anticipated knowledge about the system. RNA velocity then produces a vector field overlay on this low-dimensional representation.

Here, we show that the direction of RNA velocity is strongly determined by the observed k-NN graph of the data. This k-NN graph is also directly reflected by the UMAP layout (indeed, one can view a UMAP layout as a representation of the k-NN graph). Together, this guarantees a *compatibility* between the UMAP and the vector field: vectors will always point towards neighbors on the UMAP.

Importantly, this occurs regardless of whether the UMAP is in some sense “true”. Therefore, the directions inferred by RNA velocity cannot be interpreted as additional evidence for the correctness of the UMAP. For example, it is impossible for the RNA velocity vector field to construct a vector showing a transition between two cell populations which are distant on the UMAP. Instead, RNA velocity is more similar to a smoothing of the k-NN (and therefore UMAP) structure, and cannot reveal new structure, but only depict structure which is already present in the UMAP.

These statements about the direction of the RNA velocity vector field is supported by our various experiments; we next discuss these points and the experiments in details.

### What influences the direction of RNA velocity?

First, we discuss the low-dimensional velocity vector field and focus on the special case where mapping is done using velocity transition probabilities, as is common when visualizing RNA velocity on a UMAP, t-SNE, or other nonlinear embedding. We show that the resulting vector field is strongly dependent on the structure of the k-NN graph (Figure 4). The use of the k-NN graph to preprocess (smooth) the spliced and unspliced matrix has a larger impact on the resulting vector field than the use of the k-NN graph to estimate velocity transition probabilities (Supplementary Figure S3). Additionally, we show how the resulting vectors are constrained to point exclusively toward sampled cell locations (Figure 2).

Second, we consider direction of lowdimensional RNA velocity vector field when the UMAP and k-NN graph are perturbed away from the “true” structure. In our simulation results we show that increasing noise can significantly perturb the observed (“learned”) k-NN graph away from the true neighbourhood structure. This is best visualized by comparing the UMAP layouts at different noise levels (Supplementary Figure S2). While our observation is based on a specific simulation result, we hypothesize that this is a common phenomenon in single-cell expression data, where technical noise can be the greatest source of variation. In our simulations with moderate to high noise, we observe that a true one-dimensional manifold gets represented as a dense mass of cells on PCA or UMAP embeddings, where the quadrants of the mass are still broadly reflect specific time points, conditions, or cell states (Figure 4 and Supplementary Figure S2). Because specific cell states appear to occupy distinct regions of the dense mass of cells, and related cell types/states are placed close together, this is often interpreted as signal in the data. Our simulations suggest that such a dense mass, caused by the noise of the data, might obscure a simple trajectory. In this scenario, RNA velocity produces a pleasing, smoothed, low-dimensional velocity vector field, which nevertheless is a wrong representation of the structure of the system (Figure 4). Given the importance of the observed k-NN graph in many methods in single-cell expression analysis, this observation may have far reaching implications beyond RNA velocity analysis.

We now turn to direction in high-dimensional space. First, we consider the relationship between high-dimensional velocities and the lowdimensional vector field. In our analysis of the FUCCI data we show that both the dynamical model and the steady state model yields overall similar vector fields, but that the inferred direction of change of expression is qualitatively different between the two models (Figure 6 and Supplementary Figure S12). This highlights that different high-dimensional velocities can result in the same low-dimensional vector field, and cautions us against using the low-dimensional vector field to draw conclusions about the high-dimensional velocities. In our simulations, we observe that the estimated high-dimensional velocities are strongly dependent on the k-NN graph. Using the observed k-NN graph, there is little concordance of directions between true high-dimensional velocities and the high-dimensional velocity estimates (Supplementary Figure S9). Together, this suggests that high-dimensional velocities can be highly misleading. We note that high-dimensional velocities are not directly used in most RNA velocity analyses.

### RNA velocity does not estimate expression speed

We now turn our attention to expression speed: the length of the *v* = *ds/dt* vector. Speed is seldom directly analyzed, but precocious differentiation in disease may be resulting from changes in speed. Again, there is a length of the high-dimensional vector and the length of the low-dimensional vector resulting from mapping into a low-dimensional embedding.

We show conclusively that there is little-to-no relationship between high-dimensional vector length and length in the low-dimensional velocity vector field when velocities are mapped using transition probabilities (Figure 3). This suggests that vector length (speed) in the low-dimensional embedding is meaningless (in this scenario). This is explained by considering the velocity transition probabilities: high-dimensional vector length is simply not used to determine the low-dimensional vector length. The consistency between high- and lowdimensional vector lengths is substantially better when velocities are projected onto a PCA plot using an orthogonal projection operator. However, it is well appreciated that PCA plots in 2 dimensions regularly only capture part of the multiple biological processes occurring simultaneously in singlecell data.

Our simulation experiments show that highdimensional velocity length estimates have poor concordance with true velocity vector lengths, although the concordance can be improved by using the true k-NN graph (Figure 5 and Supplementary Figure S8, S9). However, since the true k-NN graph is always unknown, it is of limited practical utility beyond the illustration of the discordance.

In summary, there is no reason to believe that current RNA velocity workflows are capable of estimating gene expression speed, neither in high nor in low dimensions. We make two comments related to speed. First, there is a significant gap in the ability to experimentally validate the speed of global gene expression transitions. If accurate es-timation of speed is of interest, it is critical to develop experimental approaches that will directly measure this property to provide a ground truth against which validation is possible. Second, there are issues with the theoretical definition of speed of gene expression changes. When computing a vector length, coordinates (genes) contribute with equal weight. This makes sense in 3D physics, but is less intuitively clear in expression analysis: for example, one might want to consider the expression level and intrinsic variation of each gene. In practice, we believe that a more useful concept of speed is one that is coupled to specific biological processes, such as cell cycle speed, differentiation speed, or transcriptional response to perturbation.

### Limitations to our simulations

Following Bergen et al. (2021), we use a very simple simulation strategy where all genes are velocity genes and the dynamics fit the underlying ODE perfectly well. Furthermore, the gene-specific parameters are identical between the genes. This is highly unrealistic, but provides an over-optimistic best case scenario. Given the model failures in this simulation setup, we would expect even larger discrepancies with more real-world models. We have criticized – but not critically evaluated – the aggregation of genespecific latent times into a cell-specific latent time. Because the simulation setup imposes the same ordering of cells for every gene, the aggregation step works well. To investigate issues with inferring a cell-specific latent time, we suggests one would need a simulation design where multiple processes are happening across multiple time scales, for example cell cycling and differentiation happening at the same time.

### The exception that proves the rule: the FUCCI data

Our analysis of the FUCCI data adds to the short list of attempts at validating RNA velocity using experimental data. We take advantage of cell cycle time (which is a useful time concept that is different from wall time). Using this approach, we find relatively good performance of the dynamical model on cell cycle related genes in predicting the direction of gene expression changes (Supplementary Figure S13). Note that this is a partial validation: we are only considering the direction of change of expression (whether a gene is up- or down-regulated) and not whether the “speed” nor whether the predicted new state is correct. There is a substantial discrepancy between the dynamical and the steady state model and both models arrive at the same vector field.

In our work we have substantially criticized RNA velocity. Why do we maintain our criticism in light of our moderately successful validation attempt? We believe that the FUCCI data is rare experimental system which fits the underlying RNA velocity model well. First, there is only a single process underlying processes happening (cell cycle). This is substantial simplification compared to many datasets where multiple processes occur simultaneously. Second, gene expression in this process follows a cyclic pattern with both up- and downregulation. This is in contrast to other processes such as differentiation, in which most dynamic genes are either exclusively down- or up-regulated along the process. An example of such a system is red blood cell development, where the regulation of the some key genes, such as hemoglobin genes, is monotonic (Sankaran and Orkin,2013). Interestingly, RNA velocity was recently reported to result in a reverse (wrong) direction in this system, due to errors in assigning the correct expression phase (up/down) to key genes (Barile et al., 2021). Third, the UMAP suggests that the observed k-NN graph accurately reflects the underlying biology. Our criticism of RNA velocity is about the use of the method to “validate” a given embedding, and the FUCCI data does not really address this problem since we know the initial UMAP accurately reflects cell cycle progression.

### Summary

In light of our results, we believe that RNA velocity has far from achieved its stated goals: quantifying expression dynamics. Indeed, most applications of RNA velocity to date have exclusively relied on an qualitative interpretation of RNA velocity vector field estimates to ‘reinforce’ the validity of learned trajectories in a reduced dimensional embedding. We provide evidence here, that this validation exercise is at best a circular logic, and at worst, potentially inaccurate and misleading. The promise of RNA velocity as a quantitative tool to examine expression dynamics further falls short when the validity of these estimates is explored. Speed is especially problematic, and has received little attention in the literature. At its best, RNA velocity provides a potentially useful visualization tool, conceptually similar to pseudo-time ordering.

## METHODS

### Review of the dynamical model of RNA velocity for scRNA-seq data

RNA velocity is usually introduced through a pair of differential equations for the amount of spliced and unspliced RNA depending on the cell-specific latent time *t* for each gene independently (Bergen et al., 2020):

There are three unknown (constant) parameters for each gene: the transcription rate (*α*), the splicing rate (*β*), and the degradation rate (*γ*). Unlike the steady model, which only searches for the degradation rate *γ*, the dynamical model solves for the three parameters. The analytical solutions to equation 1 and equation 2, as given in Bergen et al. (2020), are the key parts for the estimation of the parameters:

Here, *u*_{0} and *s*_{0} denotes the initial unspliced counts and spliced counts, which are both set to 0 as default in implementation. And *τ* is the difference between the latent time *t* and the time point at which the phase change occurs. The greatest advancement of the dynamical model is that we get the cell-specific latent time *t* of each gene. Calculating the cell-specific velocity is performed by inserting the calculated values of *u*(*t*) and *s*(*t*) into the equation 2 instead of taking the residuals of the quantile regression as done by the steady model. We note that one of the important characteristics of the equation system 3 is that both *s*(*t*) and *u*(*t*) are univariate functions of cell-specific latent time *t* given all es-timated parameters. It follows that equation 2 can also be expressed as an univariate function of the cell-specific latent time *t*.

### Real scRNA-seq datasets filtering and normalization

We got the Forebrain data (La Manno et al., 2018), the Bonemarrow data (Setty et al., 2019), the Dentategyrus LaManno data (Hochgerner et al., 2018; La Manno et al., 2018), the Pancreas data (Bastidas-Ponce et al., 2019; Bergen et al., 2020), the Gastrulation erythroid data (Pijuan-Sala et al., 2019), the Dentategyrus Hochgerner data (Hochgerner et al., 2018; Bergen et al., 2020), the Gastrulation E7.5 data (Pijuan-Sala et al., 2019), and the PBMC68k data (GXY Zheng et al., 2017) from the scVelo package directly (https://scvelo.readthedocs.io/api/#datasets). We got the Chromaffin data (Furlan et al., 2017; La Manno et al., 2018) from http://pklab.med.harvard.edu/velocyto/notebooks/R/chromaffin2.nb.html and the FUCCI data (Mahdessian et al., 2021) from https://drive.google.com/file/d/149ICTtieYjuKWZoLwRLzimwff0n6eWqw/view?usp=sharing. Usually, we did the following procedures for all real dataset: we first filtered genes with more than 20 counts across cells in both spliced and unspliced count matrices; and we only retained cells with more than 200 counts across genes in both spliced and unspliced count matrices; the filtered spliced count matrix was library size normalized across cells and log_{2} transformed by function normalizeCounts, and then used as traditional expression matrix for PCA; we only used the top 2000 highly variable genes for PCA, and the top 30 principal components for UMAP. All cell-type labels were included with the downloaded data.

For the following datasets, special treatments were applied. **FUCCI dataset**: For convenience, we scaled the FUCCI pseudotime to the range [0, 1] by dividing the pseudotime of each cell with the maximum pseudotime between cells. Cell cycle positions are estimated using tricycle Bioconductor package (SC Zheng et al., 2022). **Dentategyrus LaManno and PBMC68k datasets**: Instead of using the sum of spliced and unspliced across cells threshold 20, we used threshold 30 as the number of cells is big for these two datasets.

### Construction of the k-NN graph

The k-NN graph is constructed using the pp.neighbors function in the scVelo package with all default parameters. Internally, it runs functions from the Python umap package, which searches for nearest neighbors in Euclidean distance of the top 30 PCs (default setting) and assigns weights to each edge (McInnes et al., 2018). (For simulations with a number of genes less than 500, the scVelo package will use the spliced counts matrix instead of PCs.) This results in an undirected weighted graph. We could use a symmetric *n* × *n* matrix **W**, of which column sums are normalized to 1, to represent such a k-NN graph.

### Smoothing of the counts matrices

The smoothed count matrices Ms and Mu are calculated using the pp.moment function in the scVelo package with all default parameters. The smoothed count of a cell *i* for a particular gene *g* are given by the weighted sums of raw counts of k nearest neighbors:

Note that the spliced and unspliced counts are smoothed independently, but we use the same weighted k-NN graph for both.

### Gene-level velocity estimation

We use the scVelo tl.velocity function with mode = ‘deterministic’ for steady model velocity estimation. For the dynamical model velocity estimation, we run the scVelo tl.recover_dynamics function to recover dynamics and then get the velocity values by the scVelo tl.velocity function with mode =‘dynamical’. All parameters are default as in the scVelo package. In brief, the steady model fits an extreme quantile regression line and calculate the residuals for each gene, which are returned as the gene-level velocity values. Along with the velocity matrix, we also get the estimated degradation rate *γ* and the coefficient of determination for each gene. Note that only genes with a coefficient of determination greater than a pre-set threshold (0.01 used as default) will be labeled as velocity genes. Only these velocity genes will be used for visualization. Unlike the steady model, the dynamical model simultaneously solves the transcription rate *α*, the splicing rate *β* and the degradation rate *γ*. Inference is also made for each gene independently. For each gene, the direct output of the dynamical model is the gene parameters and a vector of latent time *t*, whose length is the number of cells *n*. The genelevel velocity value is then calculated using the system of equations 3 and the equation 2. Note that we can only recover the parameters for a subset of genes, within which passes the pre-set threshold of coefficient of determination and likelihood are labeled as the velocity genes.

### Simulation settings

The simulation strategy is exactly the same as described in Bergen et al. (2021), as we used the simulation function in the scVelo package. Briefly, we simulate the spliced count and unspliced count matrices based on the system of equations 3. Note that the simulation process is independent for each gene, and it is trivial to get *s*(*t*) and *u*(*t*) as long as we have assigned other parameters. We use the transcription rate *α* = 5, the splicing rate *β* = 0.3, and the degradation rate *γ* = 0.5 for all genes, as used in Bergen et al. (2021). To simulate data with number of cells *n*, a cell-specific pseudotime vector is generated log uniformly distributed and scaled to the range [0, *t _{max}*], with

*t*always set to 25 in our simulations. To make the counts different across genes, we further rescale the cell-specific time vector

_{max}*t*to some interval within [0,

*t*]. This step will keep the orders of cells, but make all cells only cover some part of the full dynamics for a given gene. The same procedure is repeated for each gene to obtain a latent time matrix with

_{max}*m*rows (

*m*genes) and

*n*columns (

*n*cells). We can then plug the gene latent time matrix into the system of equations 3 to obtain the spliced count and unspliced count matrices. The true RNA velocity matrix comes naturally from the equation 2. After the generation of spliced and unspliced matrices, we add Gaussian noise with mean 0 and standard deviation

*σ*to the theoretical spliced and unspliced matrices to make the simulated data used in our analyses. The standard deviation

*σ*is equal to the noise level multiplied by the 99% percentile of the spliced or unspliced counts divided by 10 in our manuscript and in Bergen et al. (2021). Note that, as in Bergen et al. (2021), we do not perform library size normalization and log

_{2}transformation on the spliced count matrix. The k-NN graph construction and smoothing of counts matrices are performed as described previously. In the simulation where we use another “true” k-NN graph, the k-NN is calculated before we add Gaussian noise to the raw spliced and unspliced counts. Both the steady model RNA velocity and the dynamical model RNA velocity are inferred using the scVelo package, resulting in a velocity matrix

*V*to be used later. Also, we force the labels of the “velocity gene” of all genes to be true, as in Bergen et al. (2021). For steady model, estimated velocity of all

*m*genes will be used for later visualization. For dynamical model, some of the genes are still excluded for visualization due to missing velocity estimations from the dynamical model. Coercion could potentially improve visualization results, since we know that all genes are true “velocity genes”.

### Mapping high-dimensional velocities into a low-dimensional embedding

After getting the velocity matrix, which contains a velocity value for *m** (*m** is the “velocity gene” that passes predefined thresholds) genes and n cells, we need to map the high-dimensional velocity into the same low-dimensional embedding of expression (spliced counts). In our manuscript, we use the following three methods to map high-dimensional velocities, with the first two used by La Manno et al. (2018). We use **S** to represent the expression matrix (raw spliced counts matrix), which has the shape of m rows (genes) and n columns (cells). Note that we describe the precise procedures here, which might look a little different from the simplified version in Results section.

#### Direct projection in PCA

We have used this method only on simulated data with known true velocity. Theoretically, this method could be used in any embedding methods with a linear operator . Specifically for PCA, we have
where **R** represents the m-by-p rotation matrix; is a m-by-n spliced count matrix with row-means centered. The resulting n-by-p **P** is the cell-level principal components matrix. We can map velocities as

The formulation here is much simpler than the application to real data, as we omit library size normalization, log_{2} transformation, and highly variable gene selection. Note that in the simple simulation setting and PCA space, we project the velocity vectors directly, which is (almost) equivalent to taking the difference between the future state (**S*** denotes the future expression state matrix) and current state:
*f* (**S***) – *f* (**S**) = *f* (**V**) when Δ*t* = 1. It is clear that Δ*t* is a trivial scalar, since all vectors are rescaled in the final visualization. However, the choice of Δ*t* would matter in real data as non-linear operations are involved, such as normalization of the library size and transformation of log_{2}.

#### Velocity transition probability method

The velocity transition probability method was first introduced by La Manno et al. (2018), and was reused by Bergen et al. (2020) with some modifications. We use the implementation by Bergen et al. (2020) with all default parameters in scVelo package.

Briefly, for a given cell *i*, we have its velocity vector *V _{i}* and its expression vector

*M*(the length for both two is

_{si}*m*

^{*}since we only consider the “velocity” genes filtered by scVelo). Also, note that it is the Ms matrix that is being used for the transition probability instead of the raw spliced count matrix. The first step to get the velocity transition probability matrix is the calculation of a n-by-n cosine similarity matrix. The advantage of using cosine similarity or PCC to quantify the relationship between cells is that we do not need to choose Δ

*t*. For a cell

*i*, we consider cells which are cell

*i*’s k-nearest neighbors and recursive k-nearest neighbors, denoted as {

*n*(

*n*(

*i*,

*k*),

*k*)}. The k-NN graph is precalculated on the PCA of the raw spliced count matrix and was used for Ms and Mu. The cosine similarity between cell

*i*and cell

*j*is given as

The exponential kernel is then applied on the cosine similarity matrix to get the velocity transition matrix. Specifically, the transition probability from cell *i* to cell *j* is given as
with *z _{i}* as the cell normalization factor and the constant kernel width parameter

*λ*. There are optional variance stabilization transformations mentioned in La Manno et al. (2018) and Bergen et al. (2020), but we use the default parameter in scVelo, which does not perform any variance stabilization transformations. Given a embedding

**Q**, the normalized location difference between cell

*i*and

*j*is . The mapped low-dimensional vector for cell

*i*is calculated as where

*f*

^{*}is a non-existing symbolic function. The idea behind the velocity transition probability is to weight the likelihood that the cell

*i*will become the cell

*j*in the future state.

#### UMAP transform method

The UMAP transform method is an experimental method that has not been used in previous papers on RNA velocity (La Manno et al., 2018; Bergen et al., 2020). Unfortunately, we were unable to systematically confirm the correctness of the method since the UMAP transform function itself has not been systematically evaluated. To illustrate, we use *f* to represent the function that projects data into the PCA space. Since we use the PCA results as the input of UMAP function, the UMAP embedding is given as *g*(*f* (**S**)) = **Q**. Note that we use the function to represent the UMAP embedding process, but *g* is not as straightforward as the function *f*. Before we transform new points into the existing UMAP space, we need to get new PCA coordinates, which requires us to choose a Δ*t*. The coordinates of the future states in PCA are given by *f* (**S***) = *f* (**S** + **V** · Δ*t*) = *f* (**S**) + *f* (**V** · Δ*t*). Thus, the coordinates of future states in UMAP are given as

Note that *g* is not a linear function, so we could not expand the right part. Finally, we map the highdimensional velocities into the low-dimensional UMAP space as

In our analysis, we always use Δ*t* = 1. We admit that this is a somewhat random choice, but it is hard to argue the choice because the fuzzy definition and poor intepretability of pseudotime/latent time *t* inherits from the RNA velocity models.

### Vector field visualization

Note that after getting the low-dimensional celllevel vector field, we need to further process it as we could not show that many vectors (arrows) due to overplotting. We use two visualization strategies for the final visualization: gridding method and streamline plot. While the streamline plot is aesthetically more appealing, we turn to the gridding method whenever we want to highlight more details. Again, as we have mentioned previously, the same vector field may look quite different when comparing the two methods. For all vector fields in the manuscript, we adapt the streamline plot and the giridding method implemented in the velociraptor R package (Rue-Albrecht et al., 2021) (Supplementary Figure S17 is an exception). The streamline plot connects vectors flowing towards and into a similar direction. For the streamline plot, which is used for real datasets, we use a resolution between 13 to 20. The gridding method takes the average of all vectors in the grid box. The averaged vectors are further scaled to look good based on the range of axis. For the gridding method, we use a resolution of 20 for simulated data and 30 for the pancreas data. We note that there is not an existing metric to guide us choosing the best resolution, so we have to choose a resolution which, we think, makes sense based on the number of data points and the embedding structure to balance the details and overplotting.

### Calculation of the vector length (speed)

The vector length (speed) in both high-dimensional space and low-dimensional space is defined as the *ℓ*-2 norm of the vector of each cell. Specifically, for high-dimensional velocities, the vector length of cell *i* is with *v _{gi}* the velocity estimation of gene

*g*. For low-dimensional embedding, the vector length of cell

*i*is with

*o*

_{1}

*i*the mapped vector in the first dimension and

*o*

_{2}

*i*the mapped vector in the second dimension.

### Calculation of explained variation *R*^{2} of the fitted loess model

The calculation of the coefficient of determination *R*^{2} of the fitted loess model is given by

Here and . For the FUCCI pseudotime or tricycle cell cycle position, as we they track a full cell cycle, we fit a periodic loess model *y* ~ *t*, with *y* as any response variable, such as the Ms, or Mu, by concatenating triple *y* and triple FUCCI pseudotime *t* with one period shift to form [*y*, *y*, *y*] and [*t* – 1, *t*, *t* + 1] (or [*t* – 2*π*, *t*, *t* + 2*π*] for cell cycle position). Note that instead of using all three copies of data points, we restrict the calculation of *SS*_{res} and *SS*_{total} on the original data points (the middle copy). For the velocity latent time, we could not decide whether the latent time spans a complete period. Thus, we fit normal loess model for the velocity latent time.

### Calculation of Normalized Root Mean Square Error (NRMSE)

In the simulations, we know the true values, such as the true RNA velocity at the gene level. We use both PCC and NRMSE to quantify how good the estimations are. The NRMSE is calculated as

Here, *y _{i}* represents the true values; are the estimated values;

*n*is the number of observations. The standard deviation of the true values

*σ*is used to normalize the root mean square error for making different simulations comparable.

## Data and code availability

The Forebrain data (La Manno et al., 2018), the Bonemarrow data (Setty et al., 2019), the Dentategyrus LaManno data (Hochgerner et al., 2018; La Manno et al., 2018), the Pancreas data (Bastidas-Ponce et al., 2019; Bergen et al., 2020), the Gastrulation erythroid data (Pijuan-Sala et al., 2019), the Dentategyrus Hochgerner data (Hochgerner et al., 2018; Bergen et al., 2020), the Gastrulation E7.5 data (Pijuan-Sala et al., 2019), and the PBMC68k data (GXY Zheng et al., 2017) are all available from the scVelo package directly at https://scvelo.readthedocs.io/api/#datasets (Bergen et al., 2020). The Chromaffin data (Furlan et al., 2017; La Manno et al., 2018) is available at http://pklab.med.harvard.edu/velocyto/notebooks/R/chromaffin2.nb.html. The FUCCI data (Mahdessian et al., 2021) at https://drive.google.com/file/d/149ICTtieYjuKWZoLwRLzimwff0n6eWqw/view?usp=sharing. All the code to analyze the data and generate figures is available at https://github.com/hansenlab/RNAVelocityCode (SC Zheng, 2022).

## Funding

This project has been made possible in part by grant number CZF2019-002443 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. The research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under the award R01GM121459. This work was additionally supported by awards from the National Science Foundation (IOS-1665692), the National Institute of Aging (R01AG066768), and the Maryland Stem Cell Research Foundation (2016-MSCRFI-2805). GSO is supported by postdoctoral fellowship awards from the Kavli Neurodiscovery Institute, the Johns Hopkins Provost Award Program, and the BRAIN Initiative in partnership with the National Institute of Neurological Disorders (K99NS122085).

## Disclaimer

The content is solely the authors’ responsibility and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

## Conflict of Interest

None declared.

## 1 SUPPLEMENTARY FIGURES

## 2 SUPPLEMENTARY TABLES

## 3 SUPPLEMENTARY NOTES

### 3.1 The differences between implementations of RNA velocity analysis

Apart from the facts that scVelo implements several models for high dimensional RNA velocity estimations and velocyto (Python and R packages) only implements steady model, the default workflows of the three packages differ in many steps. Here, we discuss some key differences that we think are important.

In the high-dimensional gene-level RNA velocity estimation step, all implementations smooth (also referred to as imputation) the spliced and unspliced counts using a k-NN graph, but the details differ. The scVelo package builds a k-NN graph using the Euclidean distance between cells in a low-dimensional PCA space (derived from the spliced matrix), followed by a weighted smoothing step with weights computed as in McInnes et al. (2018). Both parameter estimation and velocity computation use these smoothed values. The python implementation of velocyto builds a k-NN graph using Euclidean distance in a low dimensional PCA space (using a different k-NN implementation from scVelo), but then performs unweighted smoothing. Both parameter estimation and velocity computation use these smoothed values. In contrast, the R implementation of velocyto uses an unweighted k-NN graph using the correlation distance on the high-dimensional spliced counts matrix. Neighboring cells are aggregated into pseudocells, and the quantile regression line is fitted on these pseudocells to get the estimated degradation rate. Using the estimated degradation rate, velocities are computed for each gene by plugging in the unspliced counts and the spliced counts (prior to forming pseudo-cells) in Equation 2; a substantial deviation from the Python implementation.

Apart from the procedure differences described above, the difference in choice of number of *k* for the k-NN graph also caught our attention. The default parameters used in the scVelo package are *k* = 30 neighbours. For velocyto, there are no defaults for the k but the analyses presented in La Manno et al. (2018) sometimes use very large values (e.g. k =550 for a forebrain dataset with less than 2,000 cells). As the *k* is the most important parameter for all k-NN graph, which plays the central in the high-dimensional RNA velocity estimation step, the choice of *k* might substantially affect the visualized low-dimensional vector field in the end. As an example, we compare the RNA velocity vector field of the Forebrain data by using *k* = 30 and *k* = 550 for both steady and dynamical model (we use scVelo implementation for both models here to avoid differences caused by other steps). Comparing Supplementary Figure S16a to Supplementary Figure S16b, we see that the use of a larger k seems to give us a more smooth vector field using the steady model. Nevertheless, we observe a fairly striking difference for the dynamical model where the low-dimensional vectors of neuroblast cells (purple and red cells) showing opposite direction when using k =30 (Supplementary Figure S16c to using k =550 (Supplementary Figure S16d). Given there is no guide on how to choose the *k* in a new dataset, we have been consistently using the *k* = 30, the default in scVelo package.

In the low-dimensional velocity vector field visualization step, there are also differences from implementation to implementation. A prominent example is that another k-NN graph is constructed and used in velocyto to get the Pearson correlation matrix (the cosine similarity matrix in scVelo), while scVelo uses the same k-NN graph for processing (smoothing) and low-dimensional velocity vector field visualization. In velocyto, as the new k-NN graph is constructed using the low-dimensional embedding, the resulting transition probability matrix would be embedding dependent. The embedding dependent property of a transition probability matrix seems to be counter intuitive to us.

Needless to say, the three implementations diverge sharply in some key steps, with no examination of how these choices affect the resulting output, including both high-dimensional velocities and lowdimensional vector field. We won’t pursue the issue further, but we only use the scVelo package for both steady model and dynamical model to avoid implementation differences.

### 3.2 Visualization of 2d vector fields

Because of overplotting problem, visualizing a 2d vector field has to involve some level of summarization, which can hide local detail. One approach is to grid the 2d embedding and compute the average vectors at each grid location. An alternative is streamline plots which have multiple distinct implementations (The Matplotlib development team, 2022; Campitelli, 2022). Supplementary Figure S17a shows a streamline plot of a forebrain dataset previously discussed in the literature on RNA velocity (La Manno et al., 2018; Gorin et al., 2022). The gridding display of the exact same vector field produces a very different impression (Supplementary Figure S17b). If we examinie the embedding by left part, middle part and right part, the streamline plot hides that the length of vectors in the left part are much shorter than the length of vectors in the middle part.

In addition to the choice of visualization approaches, choices such as resolution also matter. Not only does the different choice of resolution change the aesthetic impression, but it also affect the interpretation of the vector fields (Supplementary Figure S18), especially locally for some regions in the embedding, such as the top right part of the pancreas data.

These illustrate how the qualitative impression of a vector field depends on the visualization method. As the qualitative impression is usually subjective and highly parameter dependent, we will not pursue this important point further. But we strongly recommend using the same visualization tool with choices of appropriate parameters to compare different vector fields. We tend to favor the gridding approach, as it more faithfully depicts local vector fields compared to streamline plots and is easier to reason about.

## Acknowledgements

## Footnotes

↵1 There are many different terminologies for the RNA velocity vector field visualization. For example, Bergen et al. (2020) used “project” to describe transforming the gene-level RNA velocity into an embedding. We use “map”to describe the process of transforming the high-dimensional gene-level RNA velocities into the low-dimensional cell-specific vectors that can be visualized in the embedding space. And we reserve the word “project” to describe a mathematical projection onto a linear subspace. The large number of cells in most datasets implies that the final visualizations depict summarized/smoothed vector fields from the low-dimensional cellspecific vectors, to avoid overplotting. Hence, we use “visualized” to described the final visualized vector field.