ABSTRACT
The availability of high throughput single-cell RNA-Sequencing data allows researchers to study the molecular mechanisms that drive the temporal dynamics of cells during differentiation or development. Recent computational methods that build upon single-cell sequencing technology, such as trajectory inference or RNA-velocity estimation, provide a way for researchers to analyze the state of each cell during a continuous dynamic process. However, with the surge of such computational methods, there is still a lack of simulators that can model the cell temporal dynamics, and provide ground truth data to benchmark the computational methods.
Hereby we present VeloSim, a simulation software that can simulate the gene-expression kinetics in cells along continuous trajectories. VeloSim is able to take any trajectory structure composed of basic elements including “linear” and “cycle” as input, and outputs unspliced mRNA count matrix, spliced mRNA count matrix, cell pseudo-time and true RNA velocity of the cells. We demonstrate how VeloSim can be used to benchmark trajectory inference and RNA-velocity estimation methods with different amounts of biological and technical variation within the datasets. VeloSim is implemented into an R package available at https://github.com/PeterZZQ/VeloSim.
1 Introduction
Applying Single-cell RNA-Sequencing (scRNA-Seq) on cell populations during cell differentiation or development helps researchers to capture each individual cell from different developmental stages within the population at the same time. Computational methods such as trajectory inference1–3 and RNA-velocity4, 5 are used to recover the states of cells and infer the trajectory of the cell dynamic processes in the population, using scRNA-seq data. Trajectory inference reconstruct the trajectory structure in the dataset, assign cells onto different trajectory paths and order cells according to their developmental stages. RNA-velocity, on the other hand, provides a short term prediction of gene expression profile for each cell using unspliced and spliced mRNA count. However, it is a challenging problem to learn the correct trajectories or to predict future states from scRNA-seq data due to the high amount of noise in the data and the complex nature of cell dynamics.
Simulated datasets have been used to benchmark the trajectory inference methods and learn the strength and weakness of each method6, 7. However, there are less existing benchmarking work for methods which involve RNA velocity, as most of the simulators, including Splatter8, SymSim7, powSimR9.
Dyngen10 and SERGIO11 are able to simulate both spliced and unspliced mRNA counts of cells in a continuous dynamic process, but they each has its own limitations in practice. Dyngen provides predefined trajectory structure types in the package, such as bifurcating or cycle, but it is not flexible enough for more customized trajectory structure where user need to specify the exact developmental tree structure and the length of each branch. SERGIO cannot generate the ground truth pseudo-time for the cells, and the cell developmental backbone tends to exhibit strong granularity between branches.
Hereby we present VeloSim, an R package that can simulate both unspliced and spliced mRNA counts of single cells along continuous trajectories, as well as the true RNA velocity. It also outputs the assignment of cells to each trajectory lineage, and the pseudo-time of each cell. To simulate the unspliced and spliced mRNA counts, VeloSim uses the two-state kinetic model, where a gene is considered to be in an on or off state, and when in an on state, the unspliced mRNAs are being produced, and whenever there are unspliced mRNAs, spliced mRNAs are produced using the unspliced mRNAs. VeloSim allows users to provide any trajectory structure that is made of basic elements of “cycle” and “linear”. A tree structure can be generated using multiple linear structures. VeloSim can be used to benchmark different computational methods designed for continuous developmental dataset including trajectory inference and RNA-velocity estimation. The schematics of VeloSim is shown in Fig. 1.
2 Results
2.1 VeloSim models time-series gene expression using the kinetic model
Following SymSim7, VeloSim models the mRNA transcription process with two-state kinetic model12, 13, which incorporates on state and off state of a gene in a cell. The mRNAs are transcribed if the corresponding gene is at the on state, and are inhibited if the gene is at the off state. The transition of each gene between those two states are modeled by a two state markov process, with the transition probability calculated from kon, the rate at which the a gene becomes active, and koff, the rate at which the gene is inhibited. We generalized the kinetic model used in7 and13 to further incorporate the splicing process of mRNA, in order to record the unspliced mRNA counts. The mRNA molecules before the splicing process is called nascent mRNA or unspliced mRNA, and after the process is called mature mRNA or spliced mRNA. The whole process can be modeled using a group of first order differential equation (ODE) (Methods)4, 14 with additional parameters s, β and γ that correspond to transcription rate, splicing rate and degradation rate of the mRNA molecules. Similar modeling was used in the original RNA velocity paper by La Manno et al4.
To introduce how VeloSim generates time-series unspliced counts and spliced counts for a given gene along a trajectory lineage, we assume that we know the kinetic parameters, kon, koff, s, β and γ, of this gene’s expression level in these cells. Then for a gene in each cell, we can generate how its unspliced and spliced counts change for a period of time (Methods), then we move on to the expression of this gene in the next cell. Fig. 2 shows the change of unspliced and spliced counts of a gene in a sequence of cells, with different resolutions. To mimic the snapshot property of scRNA-seq data, for each cell, we randomly select a time point within its time range, and take the value of unspliced and spliced counts for all genes at this time point, as the true counts for all genes in this cell.
2.2 VeloSim models the gradual change of cell identities using EVFs
The kinetic parameters used to generate the dynamic process described above are generated based the cell identities and gene identities. We assume that kon, koff and s are generated differently for each gene in each cell, and the splicing rate β and degradation rate γ, are gene-specific parameters and are shared between cells. The kinetic parameters kon, koff and s are estimated using the dot product of cell identity vectors and gene identity vectors. β and γ are sampled independent and identically from normal distribution with user-defined mean and variance.
Cell identity vectors, or we termed extrinsic variability factors(EVF)s, are low dimensional vectors unique for each cell. It serves as a latent representation of each cell within the cell population, which can models multiple biological factors, such as morphology, maturity, functionality and micro-environment. We dichotomize the factors into two subsets, one include the the factors that is driving the differentiation, termed differential EVF, another include the factors that is similar for all cells within the cell population, termed non-differential EVF. Totally three vectors are generated for each cell, with one vector for each kinetic parameter: kon, koff and s. In order to simulate a continuous cell population where cells from different developmental stages are incorporated, we generate differential EVF using the Brownian motion procedure along the differentiating backbone15, such as tree or cycle. The non-differential EVFs, on the other hand, are sampled directly from an i.i.d normal distribution with mean and variance provided by the user.
The gene identity vectors, also called gene effects, are vectors of the same length as the EVF vectors, and for a given gene and a given cell, the gene effect vector represent how much this gene is affected by different EVFs in the EVF vector of the cell. In VeloSim, we first set a majority of values in the gene effect vectors to be 0, then generate the non-zero values from a Gaussian distribution. In order to produce a more realistic kinetic estimation, we further match the parameter distribution of the inferred kinetic to the kinetic from real world dataset with quantile normalization.
With all the kinetic parameters being estimated, we generate the unspliced and spliced mRNA count using differential equations that model the gene expression dynamics(Fig. 2). We generate one linear trajectory with one simulation on the equations given the starting states and kinetic parameters. The complex tree lineage topology is generated by running the linear trajectory generation process for multiple times with different kinetic parameters for the lineage and starting states extracted from the ending states of previous linear trajectory segment. For cycle topology, the generation process is exactly the same as the generation process of linear trajectory, but we use a different Brownian motion process such that to generate the EVFs such that cells at the end of the cycle have similar identity as the cells at the beginning of the cycle (Methods). We dichotomies the kinetic parameter generation process into forward and backward parts, where cell develops forward to the turning point and backward to the original state. The generated sample scRNA-Seq data of binary-tree topology and cycle-topology is shown in Fig. 3.
In addition to simple cycle and tree topology, VeloSim is able to generate the complex cycle-tree structured dataset, where cell-cycle process coexists with the tree-like cell differentiation process16. Such trajectory structure exists in real life dataset5, but there are limited simulators that can simulate such trajectory structure. VeloSim generate the structure by first simulating the kinetic parameters a multi-cycle cell cycling process, then a tree like branching process. The parameter distribution of the inferred kinetic are matched to the kinetic from real world dataset jointly, and cells in the topology are generated using the inferred kinetic parameters.
After the count matrix is generated, VeloSim simulate the dropout effect of real life sequencing process by sampling the count matrix with small molecule capture rate (Methods).
2.3 Using VeloSim to benchmark trajectory inference methods
VeloSim can be used to benchmark different trajectory inference methods that require either root cell or RNA-velocity information. We benchmark slingshot2 and latent time5 methods, one requires root cell information, another uses RNA velocity information. We provide ground truth root cell cluster for slingshot and RNA velocity information for latent time. We test both methods on multiple dataset with tree-like topology, and measure the performance of different methods using kendall rank correlation coefficient (Methods), the result (Fig. 4) shows that if given correct root cell, slingshot generally performs better compared to latent time in tree-like trajectory structure.
2.4 Using veloSim to benchmark RNA-velocity estimation methods
Providing unspliced, spliced and true RNA-velocity count matrix at the same time, VeloSim is able to benchmark different RNA-velocity inference methods easily. We use VeloSim to benchmark different modes of velocyto4 and scVelo5. We use a bifurcating trajectory structure, and generated 27 datasets with different parameters including random seed, number of cells, number of genes. We test the performance of different RNA velocity inference methods with different molecular capture rate, which correspond to the level of dropout effect induced by real-life sequencing process, and the result(Fig. 5) shows that all current RNA velocity inference methods are sensitive to dropout effect within the dataset, and the stochastic mode of scVelo outperform all the other methods under all scenarios with different capture rates. The RNA velocity inference accuracy of each cell is measured using cosine similarity (Methods).
3 Discussion
We proposed VeloSim, an R package that generates unspliced and spliced mRNA counts for single cells along any continuous trajectories. It is an easy-to-use package which can generate ground truth trajectory topology and cell pseudo-time, and can be used to benchmark trajectory inference methods and RNA velocity inference methods.
4 Methods
4.1 Generating time-series unspliced and spliced counts using kinetic model
In4, the authors modeled the generation of unspliced and spliced counts as follows:
Denoting the transcription rate as α, the amount of unspliced mRNA as u, the splicing rate (the rate of unspliced mRNA turning into spliced mRNA) as s, and the degradation rate as γ, then we have
In VeloSim, we know the kinetic parameters kon, koff and s. 1/kon represents the average waiting time of the gene in the off state before it is switched on; 1/koff represents the average waiting time of the gene in the on state before it is switched off. We aim to calculate the transition probability between on and off states using the kon and koff. First, we define a cycle of a gene as Tc = 1/kon + 1/koff. Then we divide this time window into npart parts, where npart = max{Tc/ min(1/kon, 1/koff), Tc} * 2, then each part has length Δ t = Tc/npart, and this is also what we call stepsize. With this stepsize length, the transition probability matrix Ptran between the on and off states can be calculated as: where Ptran [1, 1] represents the probability of going to an on state from an on state after time of stepsize, Ptran [1, 1] represents the probability of going to an off state from an on state after time of stepsize, etc.
We start with assigning the gene a random state (on or off), then simulate its gene-expression in terms both unspliced and spliced counts from cell to cell. For each cell, we run the time of a full cycle Tc, depending on the kinetic parameters of the gene in this cell. At every step (with stepsize Δt), we apply the transition probability matrix Ptran to determine the next state. If the gene is now in an on state at time t, the unspliced counts at time t is calculated as:
If at time t, the gene is in an off state, we have:
Then we update the amount of spliced counts:
In this fashion, we can simulate the unspliced and spliced counts over a time course of Tc for each cell. Then, a random time point tr is selected during the time course of Tc to extract the corresponding unspliced and spliced counts at time tr, as the data for this gene in this cell. This mimics the snapshot property of single cell RNA-seq data.
4.2 Cell EVF and gene effect vector generation
The cell EVFs are generated along the given trajectory structure with a Brownian motion process, as used in SymSim7. This process mimics that the cells’ identities change gradually along a continuous trajectory. This framework also allows users to input any trajectory, consisting of basic components like edges in a tree and cycles.
Generating a cycle structure is a bit more tricky. In a cycle structure, we need to ensure that the last cell “connect” to the first cell. To realize this, we use the following strategy. Suppose we would like to generate nc cells along a cycle structure. First we set e1 = 1, then for cells 2 to ⌈nc/2⌉, we generate the EVF values using the standard process, which is where ei represents the EVF value of cell i, and N(0, σ2) means a value sampled from a Gaussian distribution with mean 0 and standard deviation σ. σ is a parameter which can adjust how close the cells are to the backbone trajectory structure. Smaller σ makes the cells deviates less from the backbone structure.
While we perform Eq. 6 for i = 2,…, ⌈nc/2⌉, we record the value sampled from N(0, σ2) and set the corresponding values as d2,…, d⌈nc/2⌉. Now to generate ej for k = ⌈nc/2⌉ +,…,nc, for every j, we randomly select a value from d2,…,d⌈nc/2⌉ (withoutreplacement), say we select dk, then we do:
In this fashion, we can ensure that ek is very close or equal to e1, with small difference (sampled from N(0, σ2)) between any two consecutive EVF values.
Gene effect vectors represent the identity of genes. In VeloSim we do not incorporate gene regulatory networks or coexpression gene modules, so the gene effect vectors of genes are generated in the same way: in each gene effect vector, we first set certain positions to be 0s, and for the non-zero entries, we sample
We also models the highly expressed genes that does not follows the general kinetic value distribution. We define a user-specified parameter prophge, which correspond to the proportion of the highly expressed genes, and adjust the transcription rate s for those genes to increase their final count.
4.3 Adding technical noise to unspliced and spliced counts
Real life count matrix obtained from single-cell RNA Sequencing technology usually exhibit high sparsity due to strong molecular dropout effect. We simulate the sequencing process by sampling the true mRNA count with molecular capture probability c(c = 0.2). With each mRNA molecule, there is a probability c = 0.2 that this molecule is captured in the sequenced count.
4.4 Comparing different RNA velocity inference methods
We test the performance of different RNA velocity inference methods with multiple datasets generated with different kinetic parameters. We generate totally 27 single-cell datasets with number of cells set to 500, 750, 1000, number of genes set to 100, 200, 500, and random seeds set to 2, 3 and 4. With different each dataset, we test how the performance of RNA velocity inference varies under different capture rate (c = 0.2, 0.4, 0.6, 0.8 and 1). Totally four different methods are tested, including velocyto, scVelo-deterministic mode, scVelo-stochastic mode and scVelo-dynamical mode. The inference accuracy is measured using cosine similarity, which measures the angle between two inferred velocity directions. For cell i, denote the ground truth RNA velocity as vector vt(j) and the inference RNA velocity as vector vi(j), and the cosine similarity can be calculated as
The score lies within the range between –1 and 1, with higher score correspond to the case where the inferred velocity direction is highly similar to the ground truth velocity direction.
4.5 Comparing different trajectory inference methods
We measure the pseudo-time inference accuracy using kendall rank correlation coefficient between inferred pseudo-time and ground truth pseudo-time. Given , the kendall rank correlation coefficient can be calculated as