Kinetic foundation of the zero-inflated negative binomial model for single-cell RNA sequencing data

Single-cell RNA sequencing data have complex features such as dropout events, over-dispersion, and high-magnitude outliers, resulting in complicated probability distributions of mRNA abundances that are statistically characterized in terms of a zero-inflated negative binomial (ZINB) model. Here we provide a mesoscopic kinetic foundation of the widely used ZINB model based on the biochemical reaction kinetics underlying transcription. Using multiscale modeling and simplification techniques, we show that the ZINB distribution of mRNA abundance and the phenomenon of transcriptional bursting naturally emerge from a three-state stochastic transcription model. We further reveal a nontrivial quantitative relation between dropout events and transcriptional bursting, which provides novel insights into how and to what extent the burst size and burst frequency could reduce the dropout rate. Three different biophysical origins of over-dispersion are also clarified at the single-cell level.


Introduction
Gene expression in living cells is a complex stochastic process, resulting in spontaneous random fluctuations in mRNA and protein abundances [1]. Recent technological advances in single-cell RNA sequencing (scRNA-seq) have made it possible to measure mRNA expression and provide transcriptome profiles at the single-cell level [2][3][4][5]. Compared with traditional bulk RNA sequencing which measures the average mRNA expression levels across millions of cells, scRNA-seq enables the dissection of gene expression heterogeneity in different cell populations and tissues, and thus allows the investigation of many fundamental biological questions such as the identification of novel cell types, the classification of cell subtypes, and the reconstruction of cellular developmental trajectories [6].
Stochasticity in gene expression measurements has two fundamental origins: (i) the intrinsic noise due to small copy numbers of biochemical molecules and random collisions between them, giving rise to various probabilistic chemical reactions [1], and (ii) the extrinsic noise due to limitations of current experimental techniques. Although scRNA-seq provides a new level of data resolution, it also produces a much higher noise level than bulk-level measurements. A remarkable characteristic of scRNA-seq data is the high frequency of zero read counts [7,8]. Given the excessive amount of zero observations in scRNA-seq data, it is important to distinguish between (i) the structural (true) zeros where the genes are truly unexpressed and (ii) the dropout (false) zeros where the genes are actually expressed but fail to be detected [9][10][11][12][13]. While the former is due to intrinsic biological variability, the latter, which is referred to as dropout events, is due to extrinsic technical reasons.
Due to the tiny amount of mRNA in an individual cell, the input material needs to be captured with low efficiency and go through many rounds of amplification before being sequenced. This results in low mRNA capture rate and strong amplification bias, as well as dropout events [14]. To be more specific, the workflow of scRNA-seq experiments includes the following steps: isolation of single cells, cell lysis while preserving mRNA, mRNA capture, reverse transcription of primed RNA into cDNA, cDNA amplification, library preparation, and sequencing [15]. During these steps, possible technical reasons leading to dropouts include mRNA degradation after cell lysis, low efficiency of mRNA capture, reverse transcription, and cDNA amplification, library size differences, and sequencing depth [10]. Recent studies [15] have shown that the efficiency for poly-adenylated mRNA species to be captured, converted into cDNA, and amplified can range between 10% and 40%, depending on the study. This means that if the starting transcripts in an individual cell are in low amount, there is a certain probability that they will not be detected by current scRNA-seq methods.
Besides the dropout effect, other characteristics of scRNA-seq data include over-dispersion [16] and high-magnitude outliers [17] due to the stochastic nature of gene expression at the single-cell level and the related phenomenon of transcriptional bursting [15]. Given these complex features of scRNA-seq data, recent studies have highlighted the need to develop novel statistical and computational methods in data analysis, especially differential expression analysis [5]. When handling dropout events, a popular perspective held by the bioinformatic field is that the complicated probability distributions of mRNA abundances in a cell population need to be explicitly characterized by a global zero-inflation parameter, resulting in various zero-inflated models [18,19]. Among these statistical models, the zero-inflated negative binomial (ZINB) model is the most widely used [20][21][22][23][24][25][26][27][28][29][30][31], where the zero-inflated part describes dropouts and the negative binomial part accounts for over-dispersion. Some other commonly used models are listed in Sec. 7.
Modern sciences emphasize quantitative characterization of experimental observations, which is widely known as mathematical modeling. Along this line, two types of modeling methods should be distinguished: data-driven and mechanism-based modeling [32]. The former explains experimental phenomena in terms of data analysis based on various mathematical formulas and statistical models, while the latter understands the world in terms of mathematical deductions based on various mechanisms and scientific laws. The ZINB model of scRNA-seq data proposed in previous studies belongs to the former category.
In the present work, we provide a mesoscopic kinetic foundation of the widely used ZINB model based on the stochastic biochemical reaction kinetics underlying transcription. In fact, many stochastic models of transcription dynamics have been proposed [33][34][35][36][37][38][39][40]. Although some previous models could provide a clear explanation of over-dispersion, very few of them have incorporated the dropout effect into their model assumptions. So far, there is still a lack of a kinetic basis for the ZINB distribution of mRNA abundance. In addition, it is widely believed that the complex features of scRNA-seq data are closely related to the phenomenon of transcriptional bursting. However, the quantitative relationship among dropout events, over-dispersion, and transcriptional bursting still remains unclear. The present paper addresses these issues.

A novel three-state model of transcription
Based on the central dogma of molecular biology, the transcription of a gene in an individual cell has a standard two-stage representation involving the switching of the gene between an active and an inactive epigenetic state and the synthesis of the mRNA from the gene [1]. In the active state, the gene produces the mRNA. When the gene is inactive, the process of mRNA synthesis is terminated. Due to various technical factors in scRNA-seq experiments such as low mRNA capture rate, amplification bias, and sequencing depth, at a particular time, the mRNA expression in a single cell can be either detectable or undetectable [31]. As a result, it is reasonable to assume that the gene of interest can exist in a third state, referred to as the dropout state, where the mRNA expression of this gene cannot be detected due to technical reasons. Here the dropout state should not be regarded as an epigenetic conformation of the gene. Instead, it characterizes an undetectable state where the transcriptional signal of the gene is missing. These considerations lead to the three-state transcription model illustrated in Fig. 1(a), where a transcript can be synthesized with rate s or be degraded with rate v, and the gene can switch among the active, inactive, and dropout states with certain switching rates a i and b i , i = 1, 2, 3. Compared with the classical two-state transcriptional model without the dropout state [1], the cyclic structure of gene state switching will remarkably increase the theoretical complexity, as we shall see. From the chemical perspective, the microstate of the gene of interest can be described by an ordered pair (i, m): the state i of the gene and the copy number m of detectable transcripts, where i = 1, 2, 3 correspond to the active, inactive, and dropout states, respectively. Then the stochastic dynamics of our three-state transcription model can be described by the Markov jump process (continuous-time Markov chain) with transition diagram illustrated in Fig. 1(b). Since the transcriptional signal is missing when a dropout occurs, it is reasonable to assume that the dropout state can only exist with zero detectable transcript, described by the microstate (3, 0).
Experimentally, it was widely observed that the dropout rate for a given cell strongly depends on its expression level, with dropouts being more frequent for cells with low mRNA expression levels [17]. In general, the total content of mRNA in a single cell is low (0.01-2.5pg per cell) [41] and most genes only transcribe a small copy number of mRNA [42]. Due to the tiny amount of mRNA in an individual cell, the input material needs to be captured with low efficiency and go through many rounds of amplification before being sequenced. This results in low mRNA capture rate and strong amplification bias, as well as dropout events [26]. As a result, microstates (1, m) and (2, m) with tiny mRNA abudance m are more likely to convert to the dropout microstate (3,0). In our Markovian model, for simplicity, we assume that (3, 0) can only be reached from (1, m) and (2, m) with m = 0 ( Fig. 1(b)). In Sec. 7, a removal of this assumption will be discussed and a more realistic model will be given.
There is another reason leading us to consider the three-state transcription model. Recent single-cell experiments have provided evidence that for many genes, more than two states may participate in the transcription process [43][44][45][46][47]. In fact, if a gene can only switch between the active and inactive states, then the sojourn times in the active and inactive states should be exponentially distributed. However, recent single-cell time-lapse measurements in eukaryotic cells [43,44] have indicated that the sojourn time in the inactive state may have a non-exponential peaked distribution. This indicates that the gene dynamics in the inactive state may contain at least two exponential steps, so that in sum the gene would undergo a three-state switching process.
In particular, in two recent studies, the authors monitored gene expression dynamics in mouse fibroblasts [43] and Chinese hamster ovary cells [47] using single-cell time-lapse microscopy and found that both data sets were well described by a three-state gene expression model involving gene switching among an active, an inactive (reversibly silent), and a refractory (irreversibly silent) state. The difference between the inactive and refractory states is that the former has a good chance to switch back to the active state, while the possibility for the latter to switch back is much lower. In the inactive or refractory state, RNA polymerases could either be absent from the promoter or present in a paused state. Therefore, the dropout state in our three-state transcription model may have two different interpretations: It may either correspond to an undetectable state due to purely technical factors or correspond to a refractory state due to real biological factors.
Let p i,m (t) denote the probability of having m detectable transcripts at time t when the gene is in state i. Then the evolution of the Markovian model is governed by the chemical master equation Here s is the transcription rate; v is the degradation rate of the mRNA; a i and b i , i = 1, 2, 3 are the switching rates of the gene among the three states. Since (i, m) represents the microstate of having m transcripts in a single cell when the gene is in state i and each transcript can be degraded with rate v, the transition rate from microstate (i, m) to microstate (i, m−1), which represents the total degradation rate of the m transcripts, should be mv ( Fig. 1(b)) [1]. In addition, since the dropout state could describe a refractory state, which has a lower chance to switch back to the active state than the inactive state, it is natural to assume a 1 > a 3 in our model.

Model simplification via decimation
One of the most important reasons for over-dispersion of bulk and single-cell RNA-seq data is transcriptional bursting, also known as transcriptional pulsing [15], which describes the phenomenon of relatively short transcriptionally active and high expression periods followed by longer transcriptionally silent and low expression periods [43,48], resulting in spontaneous temporal fluctuations of transcript levels ( Fig. 2(a)).
In general, transcriptional bursting results from multiple time scales underlying the transcription process [49]. In fact, the mechanism of transcriptional bursting has been described by Paulsson in his review paper [1], "If genes are mostly inactive but transcribe a large number of mRNAs while in the active state, transcription could occur in bursts". Intuitively, if we require the gene to be mostly inactive, the switching rate b 1 of the gene from the active to the inactive state should be much larger than the reverse switching rate a 1 from the inactive to the active state. On the other hand, if we require the gene to transcribe a large number of transcripts while in the active state, the transcription rate s should be very large, at least at the same order of magnitude as the switching rate b 1 . These considerations lead to the following biochemical conditions for transcriptional bursting: b 1 a 1 and s/b 1 is finite. Here, by saying that s/b 1 is finite, we mean that s and b 1 are roughly at the same order of magnitude. When the gene is active, the large transcription rate s will give rise to fast accumulation of mRNA. Once the gene becomes inactive, the transcription process is terminated and transcripts will be degraded until the gene becomes active again. We stress here that the above biochemical conditions imposed on the rate constants are consistent with a recent single-cell experiment on transcriptional bursting [43], where the authors monitored the transcription kinetics in mouse fibroblasts using time-lapse bioluminescence imaging and found that the three rate constants a 1 , b 1 , and s across different genes are typically at the magnitude of 0.01/min, 0.1/min, and 1/min, respectively.
Due to the timescale separation of the underlying biochemical reaction kinetics, our Markovian model can be simplified to a much simpler one. To see this, let β = b 1 /a 1 1 denote the ratio of the switching rates between the active and inactive states. Moreover, let q (i,m),(i ,m ) denote the transition rate of the Markovian model from microstate (i, m) to microstate (i , m ) and let denote the rate at which the system leaves microstate (i, m), which is defined as the sum of transition rates from (i, m) to other microstates. Since β 1, we say that (i, m) is a fast state if If (i, m) is a fast state, then the time that the system stays in this state will be very short. Since b 1 a 1 and s/b 1 is finite, we write b 1 = βa 1 and s = βa 1 (s/b 1 ), where β 1 and we treat a 1 and s/b 1 as constants. Here b 1 and s are the only two model parameters depending on β and all other model parameters are independent of β. It is easy to check that the leaving rates of all microstates are given by which shows that Therefore, all active microstates (1, m) are fast states and all other microstates (2, m) and (3, m) are slow states ( Fig. 3(a)). By using a classical simplification method of two-time-scale Markov jump processes called decimation [50][51][52][53][54][55], the original Markovian model can be simplified to a reduced one by removal of all fast states. The remaining question is to determine the transition diagram and effective transition rates of the reduced model. This process is described as follows. Suppose that the original model jumps from microstate (i, m) to another microstate a particular time. When β 1, the transition probability from microstate (i, m) to another microstate (i , m ) is given by Let (i 1 , m 1 ), · · · , (i n , m n ) be a sequence of microstates. We say that m 1 ), · · · , (i n , m n ) are all fast states. Moreover, the probability weight w c of the fast transition path c is defined as According to the decimation theory [50][51][52][53][54][55], the effective transition rate from (i, m) to (i , m ) is given byq where c ranges over all fast transition paths from (i, m) to (i , m ). This formula shows that the effective transition rate from (i, m) to (i , m ) is the sum of two parts: the direct transition rate q (i,m),(i ,m ) and the contribution of indirect transitions via all fast transition paths, as illustrated in Fig. 3 By using this criterion, it is easy to see that the original model only has two types of fast transition paths with positive probability weights, which are given by and as illustrated by the red arrows in Fig. 3(a). To proceed, we defined two constants p and q as When β 1, the transition probabilities along the above two fast transition paths are given by The above two formulas indicate that the reduce model may produce large jumps of mRNA abundance within a very short period, which correspond to transcriptional bursts. Each random burst corresponds to a fast transition path of the original model (see the red arrows in Fig. 3(a)). So far, we have obtained all effective transition rates of the reduced model, whose transition diagram is depicted in Fig. 3(c). The above calculations can be understood intuitively as follows. Since b 1 a 1 and s/b 1 is finite, the process of mRNA synthesis followed by gene silencing is essentially instantaneous. Once the gene becomes active, it can either produce a transcript with probability p = s/(s + b 1 ) or switch to the inactive state with probability q = 1 − p = b 1 /(s + b 1 ). Therefore, the probability that the gene produces k transcripts in a single burst before it is finally silenced will be p k q, which follows a geometric distribution. This consideration again leads to the reduced model illustrated in Fig. 3(c). The evolution of the reduced model is governed by the chemical master equation Since the burst size of the mRNA, which is defined as the number of transcripts produced in a single burst ( Fig. 2(a)), is geometrically distributed, its expected value is given by

Theoretical foundation for the ZINB model
Although the topological structure of the reduced model is complicated, its steady-state probability distribution can be solved analytically. To see this, let p ss (i,m) denote the steady-state probability of microstate (i, m). At the steady state, the probabilities of all microstates are time-independent and thus the left side of (3) must equal zero, giving rise to a set of linear equations. Interestingly, this set of linear equations can be solved explicitly with its solution being given by (see Appendix) , where A is a normalization constant,ã 1 is a constant given bỹ and (x) m = x(x + 1) · · · (x + m − 1) is the Pochhammer symbol. Since all steady-state probabilities add up to one, the normalization constant A can be determined as Let p ss m denote the steady-state probability of having m copies of detectable transcripts. Then we obtain Since the probabilities p and q can be represented by the mean burst size h as the steady-state distribution of mRNA abundance can be written in a unified way as where δ 0 (m) is Kronecker's delta function which takes the value of 1 when m = 0 and takes the value of 0 otherwise, and λ > 0 and 0 < w < 1 are two constants given by Here 0 < w < 1 is a result of our model assumption a 1 > a 3 . This is exactly the ZINB distribution of mRNA abundance widely used in scRNA-seq data analysis [20][21][22][23][24][25][26][27][28][29][30][31]. Specifically, the ZINB distribution is the mixture of two distributions: the zero-inflated part is a single-point distribution concentrated at zero and the negative binomial part is a negative binomial distribution. The ZINB distribution is determined by three parameters with clear biological implications: the dropout rate w which characterizes the proportion of the zero-inflated part due to both technical and biological effects, the mean burst size h which describes the average number of transcripts synthesized in a single burst, and the maximum burst frequency λ which represents the maximum number of occurrence of random bursts per mRNA lifetime. A more detailed discussion on the burst frequency will be given in the next section.
The ZINB distribution can exhibit three different types of shapes, as illustrated in Fig. 4. To clarify the conditions for the three types of shapes, we notice that the mode (maximum point) of the negative binomial part p NB m is given by where [x] denotes the integer part of x. In fact, the first type of shape occurs when p ss 0 ≤ p ss 1 , that is, In this case, the dropout rate is small and the mode of the negative binomial part is large. Then the ZINB distribution peaks at the non-zero mode [(λ − 1)h] with no zero-inflation ( Fig. 4(a)). The second type of shape occurs when p ss 0 > p ss 1 and µ mode ≤ 1, that is, In this case, the dropout rate is large and the mode of the negative binomial part is small. Then the ZINB distribution peaks at zero with apparent or inapparent zero-inflation ( Fig. 4(b)). The third type of shape occurs when p ss 0 > p ss 1 and µ mode ≥ 2, that is, In this case, both the dropout rate and the mode of the negative binomial part are large. Then the ZINB distribution becomes bimodal and peaks at both zero and the non-zero mode [(λ − 1)h] with apparent or inapparent zero-inflation (Fig. 4(c)). Three special cases should be paid special attention to. The first case occurs when the mean burst size h → 0 and the maximum burst frequency λ → ∞, while λh = γ is kept as a constant. In this case, we have Then the ZINB distribution of mRNA abundance reduces to the zero-inflated Poisson (ZIP) distribution In fact, the ZIP distribution is also extensively applied in scRNA-seq data analysis [27] and its kinetic mechanism has been clarified in previous studies [37,40]. Our analytic theory shows that the ZIP model also naturally emerges from our three-state transcription model. The second special case occurs when b 2 = b 3 = 0, which means that the switching from the active or inactive state to the dropout state is forbidden. In this case, the three-state model reduces to the classical two-state model without the dropout state [1]. It is easy to verify thatã 1 = a 1 and w = 0 in this regime. This shows that the dropout rate will vanish in the absence of the dropout state.
The last special case occurs when a 3 = 0, which means that the switching from the dropout state to the active state is forbidden. This is especially biologically relevant when the dropout state is understood to be the refractory (irreversibly silent) state found in recent single-cell experiments [43,47]. In this case, we also haveã 1 = a 1 and thus the dropout rate is given by where K 2 = b 2 /a 2 is the equilibrium constant of gene switching between the inactive and dropout states. An increased equilibrium constant K 2 will result in a larger fraction of cells being in the dropout state and thus is expected to enhance the dropout rate w. Interestingly, our theory reveals a nontrivial quantitative relation between dropout events and transcriptional busting: an increased mean burst size h or maximum burst frequency λ will give rise to a decline in the dropout rate w. This relation provides novel insights into how and to what extent the burst size and burst frequency of the mRNA could reduce the dropout rate. The basic reason of such dependency is that an increase in the burst size and burst frequency will both promote rapid accumulation of mRNA from a low to a higher level, which is unfavorable to the occurrence of dropouts.

Mean burst duration and burst frequency
It has been shown that the mean burst size of the mRNA is given by h = s/b 1 . Here we present a more detailed discussion on the burst frequency. In this section, we assume that the time-dependent mRNA abundance in an individual cell could be measured at a series of successive time points, and due to various technical factors, the mRNA expression is undetectable during some periods. Recall that each transcriptional burst is featured by a short transcriptionally active period followed by a long transcriptionally silent period. Mathematically, the mean burst duration, which is defined as the average time needed to complete a single burst ( Fig. 2(a)), can be computed as the inverse of the total probability flux between the active microstates and other (inactive and dropout) microstates [56,57]. From (4), the total flux J between the active microstates and other microstates is given by where 0 < α ≤ 1 is a constant given by and thus the mean burst duration is given by Since the mRNA lifetime is the inverse of the mRNA degradation rate v, the mean burst frequency λ 0 of the mRNA, which is defined as the average number of occurrence of random bursts per mRNA lifetime, is given by the quotient of the mRNA lifetime 1/v and the mean burst duration τ burst : where λ = a 1 /v is the maximum burst frequency defined previously. Since 0 < α ≤ 1, the true mean burst frequency λ 0 is always smaller than the maximal burst frequency λ. We next focus on three special cases. In the limiting case of h → 0 and λ → ∞, while λh = γ is kept as a constant, we have λ 0 → ∞. In this regime, random bursts occur very frequently but each burst only contributes a very small burst size. Due to large burst frequencies, the gene switches very rapidly between the active and inactive states, giving rise to a large number of "futile" switches ( Fig. 2(b)).
In the special case of b 2 = b 3 = 0, the three-state model reduces to the classical two-state model without the dropout state [1]. In this regime, we have α = 1 and the mean burst frequency attains its maximum λ 0 = λ. In the presence of the dropout state, we have b 2 > 0 and α < 1. This shows that dropout events will lead to a reduction of the burst frequency by prolonging the transcriptionally silent periods.
The last special case occurs when a 3 = 0, which means that the switching from the dropout state to the active state is forbidden. In this case, we haveã 1 = a 1 and is the proportion of the negative binomial part. Then the mean burst frequency is given by This quantitative relation reveals how the dropout rate could affect the burst frequency.

Over-dispersion of scRNA-seq data
The simplest kinetic model of transcription is the classical birth-death process, which describes the synthesis and degradation of the mRNA. The steady-state distribution of the birth-death process turns out to be a Poisson distribution, whose mean and variance are equal. In bulk or single-cell RNA-seq experiments, read counts are always over-dispersed relative to Poisson: the variance is higher than the mean [16,58].
In the literature, the dispersion, sometimes referred to as noise, in mRNA abundance within a cell population is often characterized by the Fano factor η = σ 2 / m , which is defined as the ratio of the variance σ 2 and the mean m . A dispersion greater than one reveals a deviation from the Poisson distribution and thus serves as a characteristic signal of over-dispersion. Strictly speaking, the dispersion captures all sources of variation between samples, including contributions from technical factors leading to dropouts as well as real biological variation.
To calculate the mean and variance of mRNA abundance, we consider the generating function of the ZINB distribution: Then the mean and variance can be recovered by taking the derivatives of the generating function: Therefore, the dispersion in mRNA abundance is given by where the constant term 1 is the dispersion of a Poisson distribution arising from individual births and deaths of the mRNA, the middle term h describes the dispersion due to transcriptional burst sizes, and the last term wλh characterizes the dispersion due to the interaction between dropout events and transcriptional bursting. When there are no dropouts, the dispersion reduces to η = 1 + h, which does not depend on the burst frequency [1]. Interestingly, in the presence of dropout events, the dispersion positively depends on the three parameters: the dropout rate w, mean burst size h, and maximum burst frequency λ. This clearly reveals three different biophysical origins of over-dispersion. Statistically, the three parameters involved in the ZINB distribution can be estimated in several different ways. The maximum likelihood estimation has been discussed in [30]. Here we provide two additional approaches. In fact, the first three moments of the ZINB distribution can be recovered from the generating function as By analyzing scRNA-seq data, the first three moments of mRNA abundance can be estimated. Solving the above set of polynomial equations give the moment estimates of w, h, and λ.
When the mRNA levels across cells are relatively high, there is still another method to estimate the three parameters. From (4), when m 1, we have This suggests that for any k ≥ 1, Taking logarithm on both sides gives rise to log p ss m+k ≈ k log which is a linear relation with respect to k. Therefore, we only need to calculate the logarithm of the steady-state probabilities at large mRNA copy numbers and then carry out a linear regression analysis with respect to the copy number difference k. The slope of the linear regression provides an estimate of the mean burst size h. Once h is known, we can solve (6) to obtain the estimates of the dropout rate w and maximum burst frequency λ: where η = σ 2 / m is the dispersion.

Discussion
In this work, we present a comprehensive analysis of a three-state transcription model with dropout events and over-dispersion based on the biochemical reaction kinetics underlying transcription. Using the multiscale simplification technique of decimation, we simplify the original Markovian model to a reduced one by removal of all fast states. It turns out that transcriptional bursts exactly correspond to the fast transition paths of the original model. Although the reduced model has a complicated topology, we obtain its steady-state analytic solution. The widely used ZINB or ZIP model of scRNA-seq data naturally emerges as the steady-state distribution of the reduced model. This provides a mesoscopic kinetic foundation of these statistical models. We further clarify the biological implications of the three parameters involved in the ZINB distribution: the dropout rate w, mean burst size h, and maximum burst frequency λ. In addition, we discover a nontrivial relation between dropout events and transcriptional bursting, which quantitatively reveals how and to what extent the burst size and burst frequency could reduce the dropout rate. Another relation reveals how dropout events could lower the burst frequency by prolonging the transcriptionally silent periods. The dispersion of scRNA-seq data is also investigated at the single-cell level and three different biophysical origins of over-dispersion are found. Finally, two statistical methods are given to estimate the three parameters involved in the ZINB distribution.
Our three-state transcription model is a minimal kinetic model that could account for the ZINB distribution of mRNA abundance. Recently, there has been some discussion on the role of various technical and biological effects on the apparent zero-inflation in scRNA-seq data [27,59,60]. In our minimal three-state model, zero-inflation is realized by the introduction of a dropout state, which may be either interpreted as an undetectable state due to technical factors or interpreted as a refractory state due to biological factors. In other words, our three-state model cannot distinguish whether zero-inflation is a consequence of technical or biological effects. If we would like to empirically decide between the two interpretations, a more realistic model that takes into account more complex features of stochastic transcription dynamics must be developed.
If the dropout state is interpreted as a refractory state due to biological factors, then a more realistic model would be the Markovian model illustrated in Fig. 5(a), where microstates (3, m), m ≥ 1 are incorporated and transitions between microstates (1, m), (2, m), and (3, m) are allowed. Here (3, m) represents the microstate of having m transcripts in an individual cell when the gene is in the refractory state. In fact, the minimal kinetic model depicted in Fig. 1(b) can be viewed as an approximation of the more realistic model when a 2 , a 3 v. This can be understood intuitively as follows. Since a 2 , a 3 v, the degradation of the mRNA is fast and the switching of the gene from the refractory state to the active or inactive state is slow. Once the gene is in the refractory state, before it could switch to the active or inactive state, the microstates (3, m), m ≥ 0 are already in rapid pre-equilibrium due to fast mRNA degradation and thus most of the probability is concentrated on microstate (3, 0). If the dropout state is interpreted as an undetectable state due to technical factors, then a more realistic model would be the Markovian model illustrated in Fig. 5(b), where the microstate of the gene of interest is described by an ordered triple (i, m, k): the activity i of the gene with i = 1, 2 corresponding to the active and inactive states, respectively, the copy number m of the mRNA, and the detection state k of the transcriptional signal with k = 1, 0 corresponding to the detectable and undetectable states, respectively. In scRNA-seq experiments, the variable of interest is the copy number  This Markovian model allows transitions between detectable microstates (i, m, 1) and undetectable microstates (i, m, 0). Since dropouts are more frequent for cells with low mRNA expression levels [17], the transition rate from (i, m, 1) to (i, m, 0) should be a decreasing function of m and the transition rate from (i, m, 0) to (i, m, 1) should be an increasing function of m. Within this framework, the minimal kinetic model depicted in Fig. 1(b) can be roughly viewed as an approximation of the more realistic model with all undetectable microstates (i, m, 0) combined as a single microstate (3,0).
Besides the ZINB and ZIP models discussed in the present work, many other statistical models have also been proposed to analyze scRNA-seq data. Some commonly used models include but not limited to the Gaussian mixture model [61], Poisson-negative binomial mixture model [17,62], Poisson-gamma mixture model [63], Hurdle model [64], zero-inflated log-normal model [18], zero-inflated Gaussian mixture model [19], and Bayesian mixture model [65,66]. We anticipate that the mesoscopic kinetic mechanisms for these models could be clarified. A deeper understanding of the connection between the kinetic approach and the statistical approach is expected.
Cancer Institute of New Jersey, and Xuegong Zhang and Jiaqi Li at Tsinghua University for stimulating discussions. The author is also grateful to the anonymous reviewers for their valuable comments and suggestions which help the author greatly in improving the quality of this paper.

Appendix
Here we provide the detailed derivation of the steady-state probability distribution of the reduced model depicted in Fig. 3(c). At the steady state, the steady-state probabilities of all microstates satisfy the following set of linear equations: By the second equation in (7), we have (a 2 + a 3 )p ss 3,0 = b 2 p ss 2,0 .
Inserting this equation into the first and third equations in (7) whereã 1 is the constant defined in (5). For convenience, set w 0 =ã 1 a 1 p ss 2,0 , w m = p ss 2,m , m ≥ 1.
To proceed, we introduce the generating function Then the algebraic equation (9) can be converted into the ordinary differential equation vF (z) = a 1 p 1 − pz F (z), whose solution is given by where A is a constant. Therefore, w m can be recovered from the generating function F as This shows that p ss 2,0 = A · a 1 a 1 , p ss 2,m = A · p m (a 1 /v) m m! , m ≥ 1.