Urn models for regulated gene expression yield physically intuitive solutions for probability distributions of single-cell counts

Fitting the probability mass functions from analytical solutions of stochastic models of gene expression to the count distributions of mRNA and protein molecules in single cells can yield valuable insights into mechanisms of gene regulation. Solutions of chemical master equations are available for various kinetic schemes but, even for the models of regulation with a basic ON-OFF switch, they take complex forms with generating functions given as hypergeometric functions. Gene expression studies that have used these to fit the data have interpreted the parameters as burst size and frequency. However, this is consistent with the hypergeometric functions only if a gene stays active for short time intervals separated by relatively long intervals of inactivity. Physical insights into the probability mass functions are essential to ensure proper interpretations but are lacking for models of gene regulation. We fill this gap by developing urn models for regulated gene expression, which are of immense value to interpret probability distributions. Our model consists of a master urn, which represents the cytosol. We sample RNA polymerases and ribosomes from it and assign them to recipient urns of two or more colors, which represent time intervals with a homogeneous propensity for gene expression. Colors of the recipient urns represent sub-systems of the promoter states, and the assignments to urns of a specific color represent gene expression. We use elementary principles of discrete probability theory to derive the solutions for a range of kinetic models, including the Peccoud-Ycart model, the Shahrezaei-Swain model, and models with an arbitrary number of promoter states. For activated genes, we show that transcriptional lapses, which are events of gene inactivation for short time intervals separated by long active intervals, quantify the transcriptional dynamics better than bursts. Our approach reveals the physics underlying the solutions, which has important implications for single-cell data analysis.

Gene expression occurs in multiple steps [1]. The biochemical mechanisms of its steps are of great interest [2][3][4]. 2 In particular, a majority of studies have focused on gene regulation, transcription, and translation [5]. Genes 3 might be expressed at a uniform rate or transition between two or more states with different rates of 4 expression [6]. In the latter case, the transition kinetics might be regulated by gene-specific mechanisms 5 such as interactions of the promoters with specific transcription factors or gene-independent mechanisms 6 such as DNA supercoiling [7][8][9][10][11][12][13]. When genes are in transcriptionally active states, mRNA molecules an urn model, whereby the kinetic scheme of mRNA production (which, say, occurs with rate constant v 0 ) 17 and degradation (say, with rate constant d 0 ) is mapped to an urn scheme. To this end, one considers an 18 urn with balls of two colors -black and white. As time progresses, we sample balls one-at-a-time from the 19 urn, i.e., we perform Bernoulli trials [51]. The outcome of each trial, a black or white ball, corresponds to an 20 outcome of transcription or no transcription in physical terms, respectively. Each trial consists of drawing a 21 ball, recording its color, replacing the ball in the urn, and mixing the urn to prepare for the next trial. Let us 22 say that we draw balls without taking any break and the time duration per trial, ∆t is infinitesimal. The 23 kinetic scheme is mapped to the urn scheme by defining that the proportion of black balls in the urn is the 24 same as the probability of transcription in ∆t time, which is v 0 ∆t. Finally, the probability of observing m 25 copies of mRNA molecules is obtained as the probability of drawing m black balls in infinitely many trials 26 during the mean lifetime of mRNAs, d −1 0 . This is done by appealing to intuitive principles of combinatorics 27 concerning Bernoulli trials and the binomial distribution. Poisson distribution is the special case of the 28 binomial distribution when ∆t → 0 [51]. The steps of transcription and translation are mechanistically similar 29 and hence, the urn scheme for the Poisson process applies to both. The stationary state count of proteins 30 in a cell is given by a sum of random variables denoting the number of translations per mRNA molecule 31 that is produced in the time needed to reach stationarity. This results in the negative binomial and Neyman 32 type A distributions for the count depending on whether the noise in transcriptions can be ignored or not, 33 respectively [16,36,38,51]. 34 The models for constitutive expression have been extended to include gene regulation. Peccoud and 35 Ycart studied a gene whose promoter switches between active and inactive states [32]. Shahrezaei and Swain 36 extended the Peccoud-Ycart model by accounting for translation and solved it assuming d 0 d 1 , where d 1 is 37 the rate constant for protein degradation [16]. Numerous generalizations and extensions of these models exist 38 and many have been solved analytically, e.g., the leaky two-state model where the promoter switches between 39 two states with different levels of activity [22][23][24], multi-state models that consider a promoter with more 40 than two states [27,31,34,35,52], models with auto-regulation [25,26,30,33], etc. All of these models result 41 in probability generating functions that are related to the Kemp families of distributions, which have been 42 derived using various urn models of contagion and population heterogeneity, and as compound or mixture 43 distributions [53][54][55]. Notably, in each of their applications, the urn model has a distinct design, which is 44 systematically developed to capture the physical characteristics of the natural system under study. Their 45 distinctive features provide an intuitive mapping to the mechanisms behind their respective systems. These 46 urn models have proven fundamental to studies of their intended systems, have immense pedagogical value 47 and have been called a "standard expression" in statistical language [56][57][58]. For the system of regulated 48 gene expression, while an approach of solving chemical master equations can provide analytical solutions for 49 probability distributions, physical insight into the solutions can be greatly facilitated by the application of 50 urn models. Yet, to the best of our knowledge, urn schemes with well-defined mapping to models of regulated 51 gene expression are still lacking. 52 In this article, we develop an urn model to address this gap and demonstrate its utility by applying it to 53 solve diverse models of gene regulation for stationary state probability distributions. Central to our approach 54 are two principles. First, while transcriptions and translations are regulated by promoter state transitions, 55 arrivals of RNA polymerases or ribosomes occur with fixed propensities independent of the transitions, and 56 can be modeled as unregulated processes. In other words, gene regulation happens because the promoter 57 state determines whether these arrivals result in gene expression, but none of the promoter states exclude 58 polymerases or ribosomes from arriving (Fig. 1a). Second, while there is heterogeneity in promoter activity 59 over long time intervals, i.e., a promoter switches between active and inactive states, in short intervals, the 60 activity is homogeneous. Hence, we map each kinetic model that we consider to an urn model with a master 61 urn (cytosol) and a set of recipient urns of two or more colors (time intervals; see Fig. 1b,c). Each trial in 62 our urn scheme consists of two steps. By virtue of the first principle, the first step of sampling balls (RNA 63 polymerases or ribosomes) from the master urn is done independently of considerations for promoter state 64 transitions. By utilizing the second principle, we devise recipient urns such that each of them represents a 65 time interval with homogeneous promoter activity. In the second step, we assign the balls sampled from the 66 master urn to the recipient urns. We show that the probability distributions of counts of the mRNA and 67 protein molecules from a broad range of models are identical to that of the balls in the recipient urns of a 68   specific color, say red, which represents the active time intervals (Fig. 1d). If the sampling distribution of 69 balls from the master urn is a Poisson distribution and there are recipient urns of two colors, our urn scheme 70 yields the solution of Peccoud and Ycart. If the sampling distribution is negative binomial instead, the urn 71 scheme yields the solution of Shahrezaei and Swain. If there are urns of more than two colors, it yields the 72 solutions for models with multiple promoter states. In all cases, our urn model yields intuitive solutions for 73 the probability distributions and physical interpretations of their parameters with implications for analysis 74 and interpretation of single-cell data.

76
The urn scheme 77 Sampling from the master urn 78 The master urn represents the cytosol and contains balls of two colors -black and white (Fig. 1b). Each 79 black ball represents a birth factor, which we define to be the RNA polymerase for mRNAs and the ribosome 80 for proteins. The white balls represent solutes other than the birth factor. From this urn, we sample black 81 balls over one mean lifetime of the mRNA or protein depending on the time scale, T stat for the intended 82 solution to reach stationarity. The sampling process is defined by the kinetic process under consideration. 83 Most models of transcription implicitly assume that the propensity for RNA polymerases to collide with the 84 promoter site does not vary with time but whether a colliding polymerase successfully binds the promoter 85 3/15 and transcribes the gene depends on the promoter state. Hence, the number of arrivals follows the Poisson 86 distribution, say with mean µ per mRNA lifetime and the urn scheme for Poisson process applies for sampling 87 balls that represent RNA polymerases. We denote the probability distribution of the count, m 1 of polymerase 88 arrivals in one mRNA lifetime as Pois (G m1 |µ), which also denotes the sampling distribution of black balls 89 from the master urn for mRNA distributions (see Supplementary Section 1.1 for a derivation of the Poisson 90 distribution). On the other hand, most models of translation assume that an mRNA molecule is degraded 91 much faster than its protein counterparts, and that in its negligibly short lifetime, it binds a geometrically 92 distributed number of ribosomes resulting in a burst of proteins. In our urn scheme, we draw a geometrically 93 distributed number of ribosomes for each potential event of transcription. Hence, the cumulative count, m 2 94 of ribosome arrivals on mRNAs over one protein lifetime follows the negative binomial distribution with 95 parameters α and β, which represent the number of potential transcriptions in one protein lifetime and the 96 mean number of ribosomes that bind to each mRNA, respectively (the interpretation of α is given in more 97 detail later). We denote this NB (G m2 |α, β) (see Supplementary Section 1.2).

98
Assignment to the recipient urns 99 Each recipient urn represents a time interval with a uniform propensity for gene expression. Each urn has 100 one of two or more colors, with the number of colors dependent on the number of promoter states (Fig. 1c). 101 We defer the mathematical exposition of our urn scheme to a later section. Here, it suffices that at stationary 102 state, the time intervals captured by urns are equal and represent the characteristic time scale of promoter 103 state transitions. Furthermore, the total time captured by the set of recipient urns equals T stat . For a genetic 104 switch with two states, let grey and red urns represent the inactive and active states, respectively. Also, say 105 n grey and n red represent their numbers, respectively. At stationary state, n red and n grey are fixed and their 106 permutations represent samples of state-transition trajectories. The interaction of the time points of arrivals 107 of birth factors with the promoter state transitions determines the outcome of interest, which is the number 108 of births (Fig. 1d). In the urn model parlance, we capture this by assigning the black balls from the master 109 urn to the recipient urns. If a ball is assigned to a red urn, a birth is observed. The outcome of interest 110 is the number of balls assigned to the red urns collectively. Since the times of arrivals of birth factors are 111 independent of the trajectory of promoter state transitions, the process of assigning balls to the recipient 112 urns allows equal likelihood of assignment to all urns. Let us say that in an experiment, we sample m + i 113 black balls from the master urn. The probability that a random assignment to the recipient urns results in 114 exactly m balls assigned to the red urns is given by a negative hypergeometric distribution. To see this, note 115 that the number of ways to divide m + i balls in n red + n grey urns is n red +ngrey+m+i−1 m+i , which is the same as 116 the number of ways to permute m + i identical balls and n red + n grey − 1 identical dividers [51]. Of these, 117 are such that exactly m balls are assigned to the red urns. Hence, the probability of 118 the outcome m is the ratio of this quantity to the total number of ways to assign m + i balls and we denote 119 the distribution as NH (G m → n red |G m+i → {n red , n grey }), where ' →' represents the process of assignment 120 (see Supplementary Section 2).

121
Application to the Peccoud-Ycart model

122
The Peccoud-Ycart model for probability distribution of mRNA counts considers a promoter that can exist 123 in active and inactive states (Fig. 2a). Using the urn model parlance here, the black balls represent RNA 124 polymerases. To observe an outcome of m 1 transcriptions, the sample drawn from the master urn must 125 contain m 1 or more black balls, say m 1 + i 1 with i 1 ≥ 0. The probability of this event is Pois (G m1+i1 |µ). 126 Given m 1 + i 1 balls, n red red recipient urns and n grey grey recipient urns, the probability that exactly m 1 127 balls are assigned to the red urns is NH (G m1 → n red |G m1+i1 → {n red , n grey }). The joint probability of m 1 128 balls assigned to the red urns and i 1 to the grey urns is given by the product of the said Poisson and negative 129 hypergeometric distributions. A marginal of the joint probability yields the probability of m 1 transcriptions, 130 P (G m1 → n red |µ, n red , n grey ) =  In state g1, neither the transcription factor nor the RNA polymerase are bound to the promoter. In state g2, the transcription factor is bound to the promoter and in state g3, both the polymerase and the transcription factor are bound. State g3 releases the polymerase with propensity v0, which results in transcription and translations as well as a transition to g2. In addition, the model allows reversible transitions between g1 and g2 with propensities k1 and k2, transition from g2 to g3 with propensity k3 and transition from g3 to g1 with propensity k2.
Its generating function is given in terms of the Kummer's hypergeometric function of the first kind, 133 1 F 1 n red n red +ngrey ; µ(z − 1) , which is formally identical to the solution of Peccoud and Ycart (see Supple-134 mentary Section 3.1 for the proof). We defer the mapping of the kinetic parameters to parameters of the urn 135 model to a later section.

136
Application to the Shahrezaei-Swain model 137 Similarly to the Peccoud-Ycart model, the Shahrezaei-Swain model considers transcriptionally inactive and 138 active states but also accounts for translation to solve for the probability distribution of protein counts 139 (Fig. 2b). Hence, for urn modeling in this case, we let the black balls represent ribosomes. To observe 140 m 2 translations, the sample drawn from the master urn must contain m 2 + i 2 black balls with i 2 ≥ 0, 141 which happens with probability NB (G m2+i2 |α, β). Once again, we consider n red red recipient urns and n grey 142 grey recipient urns, but in this case they represent translationally active or inactive intervals, respectively. 143 Note that a subset of recipient urns that represented transcriptionally active time intervals in the case of 144 Peccoud-Ycart model might not receive any polymerase arrivals, which renders them translationally inactive. 145 Hence, n red and n grey have a different mapping to the kinetic parameters than in case of the Peccoud-Ycart 146 model, as we show later. Regardless, given n red and n grey , the probability of m 2 translations follows from 147 5/15 similar arguments as before, The generating function for this distribution is given in terms of the Gaussian hypergeometric function, 151 2 F 1 α, n red n red +ngrey ; β(z − 1) , which is formally identical to the solution of Shahrezaei and Swain (see Supple-152 mentary Section 3.2 for the proof).

153
Application to the leaky two-state model 154 The leaky two-state model for mRNAs considers two states with propensities of transcription differing by a 155 constant factor, say λ (Fig. 2c). Let the expected numbers of arrivals of RNA polymerase in one mRNA 156 lifetime be µλ and µ in the leaky and fully active states, respectively, with 0 < λ < 1. Since we consider a 157 model for probability distribution of mRNA counts, the black balls drawn from the master urn represent 158 RNA polymerases. In this case, we consider two parallel experiments in our urn scheme, which represent the 159 contributions of a constitutive component with the expected value of µλ and a regulated component with the 160 expected value of µ (1 − λ). In the first experiment, we draw a sample of m 1,a black balls from the master 161 urn with the probability Pois G m1,a |µλ , which represents the leakage that is unaffected by promoter state 162 transitions. Hence, with respect to this sample, all recipient urns are red and all of it is counted towards 163 transcriptions. In the second experiment, we draw a sample of m 1,b + i 1 balls from the master urn with the 164 probability Pois G m 1,b +i1 |µ − µλ , which represents the regulated component of polymerase arrivals. This 165 experiment proceeds as described earlier for the Peccoud-Ycart model and allows us to derive the probability 166 of m 1,b transcriptions from the regulated component by substituting m 1 with m 1,b and µ with µ (1 − λ) in 167 Eq. 1. The overall outcome of interest is m 1 = m 1,a + m 1,b . Its probability is given by the convolution 168 rule and has a generating function that is the product of the generating functions for Pois G m1,a |µλ and 169 that for the regulated component, i.e., e µλ(z−1) , which is consistent with 170 the solution of Cao and Grima [22]. Note that previously, we have shown in the context of a model for the 171 lac operon of E. coli that this generating function can be viewed as a convolution of contributions from a 172 leaky sub-system of Lac repressor-bound states and regulated transitions to the repressor-free state [24]. A 173 recent preprint also utilizes an identical concept [59]. In this section, we have shown that the same concept is 174 easily accommodated in the framework of our urn model. Further, the distribution of proteins from a leaky 175 two-state model can be derived similarly.

176
Application to models with multiple states 177 Analytical solutions are available for models with more than two but a fixed number of promoter states [27,34] 178 as well as those with an arbitrary number of states [31]. Next, we consider the model by Cao et al., where the 179 promoter exists in three states, say g 1 , g 2 and g 3 with g 3 being active (Fig. 2d). For the mRNA counts, the 180 sampling distribution of m 2 + i 2 + j 2 balls (RNA polymerases) from the master urn is Pois (G m1+i1+j1 |µ), 181 where i 2 , j 2 ≥ 0. We account for the additional promoter state by adding another layer of recipient urns. 182 First, we divide the balls between urns that correspond to the sub-system of state g 1 and the sub-system of g 2 183 and g 3 collectively. Let there be n blue blue and n ppl purple urns for these sub-systems, respectively. Then, the 184 probability of assigning m 2 + i 2 balls to the purple urns is NH (G m2+i2 → n ppl |G m2+i2+j2 → {n ppl , n blue }). 185 The balls assigned to the purple urns are further re-assigned to another layer of red and grey recipient 186 urns -n red and n grey in number, respectively. Let the grey and red urns represent the states g 2 and g 3 187 respectively. The outcome of interest, assignment of m 2 balls to the red recipient urns has the probability 188 6/15 NH (G m2 → n red |G m2+i2 → {n red , n grey }). Hence, the probability of m 2 transcriptions is 189 P (G m2 → n red |µ, n red , n grey , n ppl , n blue ) = Relationship between the kinetic and urn model parameters 202 We have shown that our urn model yields probability distributions that are formally identical to those from 203 kinetic models. In this section, we obtain the relationship between parameters of the kinetic and urn models. 204 To this end, say, we follow a cell in real time starting at time t = 0 when the gene system under consideration 205 is at stationary state (e.g., Fig. 1a shows trajectories of three cells). We define that event E m occurs when m 206 births are observed in a time interval given by the time scale, T stat for reaching stationarity (e.g., the top 207 panel in Fig. 1a illustrates the event E 2 ). Let M be a random variable representing the number of arrivals of 208 birth factors in T stat time (the number of black balls in any trajectory shown in Fig. 1a), {T 1 , T 2 , ..., T M } be 209 the random variables representing the time points of arrivals (x-coordinates of the spikes in Fig. 1a), and 210 R ON be a random variable representing the set of time points when the promoter is active (set of all the 211 time points in red colored segments along the x-axis in any trajectory shown in Fig. 1a). Then, E m occurs 212 when M =0 1 T ∈RON = m, where 1 T ∈RON is an indicator variable that equals 1 if T ∈ R ON and 0 otherwise. 213 For this to happen, there must be at least m arrivals of the birth factors in total, i.e., M = m + i such that 214 i ≥ 0. Given that this condition is met, there must be exactly m arrivals during the active time intervals, 215 i.e., m+i =1 1 T ∈RON = m. Essentially, we find that the event of m births can be decomposed into two simpler 216 events, whose probabilities can be derived separately. Next, let us define R red as the counterpart of R ON in 217 the urn space (Fig. 1d). In other words, R red is the subset of time points from the set [0, T stat ] that fall in 218 red urns for a sample permutation of the recipient urns. Say, E urn,m represents the event M =1 1 T ∈R red = m. 219 Then, we must choose n red and n grey such that P (E m ) = P (E urn,m ).

220
For the Peccoud-Ycart model, T stat = d −1 0 and RNA polymerases function as the birth factors. As we 221 mentioned, the sampling of polymerases from the master urn can be modeled as a Poisson process, which 222 yields µ = v0 /d0. Next, we solve for n red and n grey . Since the T 's are independent of each other, P (T 1 ∈ R ON ) 223 is independent of P (T 2 ∈ R ON ) for 1 = 2 . Hence, P (E m ) = P (E urn,m ) if P (T ∈ R ON ) = P (T ∈ R red ) 224 for all . Let the propensities for promoter state transitions be k 0 and k 1 (Fig. 2a). At stationary state, 225 P (T ∈ R ON ) = k0 /k0+k1. To derive P (T ∈ R red ), let w red and w grey represent the time duration captured 226 by each of the red and grey urns, respectively, and min (w red , w grey ) be the minimum of the two. Then, for T 227 close to the boundaries of the set [0, T stat ], i.e. for T less than min (w red , w grey ) away from the boundaries, 228 P (T ∈ R red ) = n red/n red +ngrey. This is because all of the duration from t = 0 to min (w red , w grey ) is contained 229 within a single urn, which can either be red or grey. On the other hand, for an arbitrary choice of T , 230 P (T ∈ R red ) = n red w red/(n red w red +ngreywgrey). In other words, the probability that a randomly chosen time 231 point falls in a red urn is given by the fraction of time in the interval [0, T stat ] that is covered by the red urns. 232 At stationary state, whether T falls in R red should be the same for all T , which requires w red = w grey . Let us 233 replace w red , w grey with w. Now, if we were to arbitrarily pick a polymerase arrival and shift the corresponding 234 T to the left or right by a fixed amount (say, by grabbing one of the spikes in Fig. 1d), whether the shifted 235 7/15 spike still falls in an urn of the same color depends on w. In simple words, while there is temporal heterogeneity 236 in transcriptional activity over long periods, in short time windows around any event of polymerase arrival, 237 transcriptional activity is homogeneous. Given a suitable choice of w, these time windows can be modeled as 238 if they were composed of urns of the same color. w is determined by the transient time scale for switching 239 of transcriptional activity, which is w = (k 0 + k 1 ) −1 (see the section on reversible unimolecular reactions 240 in McQuarrie [60] for derivation of the transient time scale). Finally, for P (T ∈ R ON ) = P (T ∈ R red ), 241 n red/n red +ngrey = k0 /k0+k1. Since, w (n red + n grey ) = d −1 0 , we obtain n red = k0 /d0 and n grey = k1 /d0. Essentially, 242 we find that n red and n grey are equal to the normalized propensities for the promoter to switch to the active 243 and inactive states, respectively. By replacing these parameter values in Eq. 1, we retrieve the analytical 244 solution of Peccoud and Ycart. 245 Shahrezaei and Swain solved a kinetic model for protein production assuming d 0 d 1 , which implies 246 T stat = d −1 1 . In this case, ribosomes function as the birth factors and for the system to be active for protein 247 production, the gene must be transcriptionally active and RNA polymerase arrivals must occur. In other 248 words, if we start our observation with the promoter in the transcriptionally inactive state, the waiting time 249 for transition to the translationally active state is greater than k −1 0 . In the remaining part of this section, we 250 first derive the propensity for translational activity, and then apply the approach described above to obtain 251 n red and n grey for the Shahrezaei-Swain model. To this end, let us pick a time point randomly as t = 0. Next, 252 we define p 0 (t) as the probability that the gene is transcriptionally inactive at time t and no mRNA has 253 been produced in time [0, t]. Similarly, p 1 (t) is the probability that the gene is transcriptionally active at 254 time t but no mRNA has been produced in time [0, t]. Then, We are interested in the time scale at which p (t) = p 0 (t) + p 1 (t) approaches 0, i.e. the time scale at which 258 the marginal probability of no mRNA production decays to 0. To this end, Eqs. 4-5 can be combined to get 259 a second order differential equation, Inverse of the larger of the roots, i.e. the slow time scale (say, τ s ) gives the waiting time for mRNA production 265 and the inverse of the smaller one yields the fast time scale (say, τ f ). We interpret d 1 τ −1 f as the propensity 266 for occurrence of any event, i.e., promoter state transition or polymerase arrival. For example, if we know 267 that a polymerase arrival occurs at any time point t, it is likely that there will be no arrival in the time 268 window t, t + d −1 1 τ f because the polymerase arrival occurs as fast as the fast time scale of the system allows. 269 Now, we can derive the relationship between the kinetic and urn model parameters in terms of these time 270 scales. First, we sample ribosomes from the master urn, such that for each d −1 1 τ f time window in T stat , we 271 draw a geometrically distributed number of ribosomes. The geometric distribution has the mean β = v1 /d0 272 and we draw α = τ −1 f geometrically distributed samples. Hence, the probability distribution for a sample of 273 m 2 + i 2 ribosomes is given by the negative binomial distribution with the said values for α and β. Next, we 274 distribute these in the red and grey recipient urns. Similarly to arrivals of polymerases, arrivals of ribosomes 275 are independent of each other. We can use the same method as described for Peccoud-Ycart model to derive 276 n red and n grey . The difference is that here, red urns represent translationally active time windows. Hence, 277 their number is given by n red = τ −1 s and n grey = k0+k1 /d1 − τ −1 s , which are the normalized propensities for 278 the system to switch to the translationally active and inactive states, respectively. By using these parameter 279 values in Eq. 2, we retrieve the solution of Shahrezaei and Swain. In the previous sections, we have written the probability of exactly m 1 assignments to the red urns out of a 282 sample of m 1 + i 1 balls from the master urn directly as a negative hypergeometric distribution. The same 283 probability could be written from an alternative perspective, where we start with the probabilities for at 284 least i 1 assignments to the grey urns, at least i 1 + 1 assignments to the grey urns, . . . , at least i 1 + m 1 − 1 285 assignments to the grey urns and all i 1 + m 1 assignments to the grey urns. Then, we combine these as an 286 alternating sum using the principle of inclusion-exclusion to write the probability of exactly m 1 assignments to 287 the red urns (see Supplementary Section 4). For the Peccoud-Ycart model, this approach yields an expression 288 for P (G m1 → n red |µ, n red , n grey ) that is the m th 1 coefficient in the Maclaurin series expansion of Kummer 289 transformation of the Peccoud-Ycart solution, e µ(z−1) 1 F 1 ngrey n red +ngrey ; −µ(z − 1) with µ = v0 /d0, n red = k0 /d0 290 and n grey = k1 /d0. This way of representing the Peccoud-Ycart solution leads to an interpretation of the 291 transcriptional dynamics in terms of transcriptional lapses as we show next, and contrast with the dynamics 292 of transcriptional bursts.

293
For a repressed gene with k1 /d0 1, the hypergeometric function, 1 F 1 n red n red +ngrey ; µ(z − 1) reduces to 294 the generating function for the negative binomial distribution, [1 − µ(z−1) /ngrey] −n red [61]. Here, µ /ngrey is 295 interpreted as the mean number of mRNAs produced when the promoter switches to the active state (burst 296 size), n red as the frequency of switching to the active state (burst frequency), and the hypergeometric 297 function is the generating function for the Peccoud-Ycart solution. In Supplementary Section 5.1, we use 298 a perturbation theoretic approach to show that this is valid for k 1 k 0 . Using this limiting form of the 299 hypergeometric function, when k 0 k 1 , i.e., for an activated gene, e µ(z−1) 1 F 1 ngrey n red +ngrey ; −µ(z − 1) reduces 300 to e µ(z−1) [1 + µ(z−1) /n red ] −ngrey (see Supplementary Section 5.2 for proof). Hence, in both the activated and 301 repressed scenarios, the limiting form of the Peccoud-Ycart solution has a term of the kind [1 − β (z − 1)] −α , 302 which is the generating function for a negative binomial distribution if α, β > 0. However, β ≡ µ /ngrey > 0 in 303 the repressed limit but β ≡ − µ /n red < 0 in the activated limit. Hence, this term is not consistent with the 304 interpretation of transcriptional dynamics in the activated case as being bursty and [1 + µ(z−1) /n red ] −ngrey is 305 not a probability distribution. We interpret the dynamics in the activated case as one that is composed of 306 transcriptional lapses, whereby the gene is expressed with uniform propensity v 0 for majority of the time and 307 for short-lived intervals, there is a lapse in transcriptional activity brought about by gene regulation (Fig. 308  3a). In Fig. 3b, we show that for a gene in the activated case, the approximation in terms of transcriptional 309 lapses is in significantly better agreement with the exact solution than that in terms of transcriptional bursts. 310 This approximation might be useful in fitting single-cell data for activated genes. All problems concerning probabilities could be addressed using urn models [57]. However, it is a non-trivial 314 task to develop physically meaningful urn schemes for any given system, even if solutions for the probability 315 distributions of interest were known. This is because many different models can give rise to identical 316 distributions [56]. Indeed, this is true for distributions resulting from models of regulated gene expression, 317 which are related to the Kemp families of distributions that are well-known in the urn modeling literature [62]. 318 A general feature of some of the models that lead to such distributions is that in a series of trials, multiple 319 instances of the same outcome tend to occur in close succession. This could result due to the presence of 320 contagion or population heterogeneity in the stochastic process under consideration [54]. We consider a known 321 model for each and discuss how they differ from models of regulated gene expression. The Pólya-Eggenberger 322 urn model studies the spread of contagious diseases [56]. In this model, one considers a finite urn with a 323 pre-specified number of white and black balls, and draws balls from the urn randomly and one at a time. The 324 drawn ball is replaced along with additional balls of the same color, which increases the probability of drawing 325 this color in the next trial. This process simulates the spread of pathogens once they appear in a population. 326 However, in the case of regulated gene expression, such contagion does not exist because the concentrations of 327 birth factors do not change with time. Hence, the process of replacing a ball with additional balls of the same 328 color is not physically meaningful. Alternatively, Gurland considered a compound Poisson process, which is a 329 Poisson process with a rate parameter that is also a random variable due to heterogeneity of the population 330 in consideration [54]. For example, consider a constitutively expressed gene in a population of cells such that 331 the rate of RNA polymerase arrival at the promoter varies from cell to cell. If the rate parameter depended 332 on a random variable with Beta distribution, the probability distribution of the corresponding mRNA counts 333 would be identical to the solution of the Peccoud-Ycart model. It is noteworthy that this interpretation has 334 been recently utilized to fit single-cell RNA-seq data [45,46]. However, the Peccoud-Ycart model, like most 335 mechanistic models of gene expression, is solved for a homogeneous population with a fixed rate of polymerase 336 arrivals. Hence, the interpretation as a compound Poisson process is not consistent with its kinetic scheme. 337 Implications for analysis of single-cell data 338 Our study provides helpful insights for single-cell data analysis. First, we find that the urn model parameters 339 are related to the time scales of switching between sub-systems of the promoter states. Particularly, in the 340 case of the Shahrezaei-Swain model, the numerator parameters of the Gaussian hypergeometric generating 341 function are related to the fast time scale and the waiting time for transcriptions. To the best of our 342 knowledge, these parameters have not been ascribed physical interpretations previously. Our interpretations 343 would facilitate mechanistic understanding from analysis of single-cell data for proteins if the data were 344 fit using the Shahrezaei-Swain solution. Second, the field has recently witnessed a rapid development of 345 single-cell technologies and parallel advances in analytical solutions of models incorporating mechanistic 346 details of gene expression [27,63,64]. As such, solutions of the detailed models would likely be utilized for 347 analyzing single-cell data in the future. We find that solutions of disparate mechanistic models (e.g., the 348 models of Cao et al. and Karmakar [27,34]) may result in identical distributions if they involve the same 349 number of timescales for switching between the sub-systems of promoter states. Hence, if a probability 350 distribution yields good fits to data on mRNA or protein counts, it indicates that the number of time scales 351 represented in the distribution is adequate to describe the gene system. Additional data must be collected 352 to distinguish between the mechanistic models leading to these time scales as well as to assess whether the 353 probability distribution might have resulted from the presence of features such as contagion or population 354 heterogeneity in the system. Further insights in this direction could be gained from existing statistical 355 literature to distinguish between such features [65]. Finally, we note that in some cases, the researchers 356 utilize the negative binomial distribution to fit their data [11]. This could be motivated by the challenges 357 of computing hypergeometric functions for the complete range of parameter values [66]. Furthermore, the 358 parameters of negative binomial distribution have well-understood interpretations as the burst size and 359 frequency of expression. We find that at least for activated genes, the transcriptional dynamics are better 360 described in terms of transcriptional lapses. In the general case, it might be worthwhile to consider a 361 distribution with the generating function e µ(z−1) [1 − β (z − 1)] −α to fit the data. If all the parameters, α, β 362 and µ were non-negative, this is a Delaporte distribution, which could be interpreted in terms of a leaky 363 two-state model with transcriptional bursts (see Supplementary Section 5) [62]. If β were allowed to be 364 negative, the parameters could be interpreted in terms of a leaky two-state model with transcriptional lapses. 365

366
Limitations of our approach are worth considering. First, as the mechanistic models grow in complexity, 367 identifying the relationship between the urn model and kinetic parameters becomes a challenging task. 368 Despite this limitation, our approach complements the existing approaches for solving the stochastic models 369 by revealing the origins of the solutions in terms of the physics underlying the dynamics of gene regulation. 370 Second, our model assumes that birth factors arrive independently of each other and that there are no 371 feedback loops. If these assumptions do not hold, the birth factors cannot be assigned to the recipient urns 372 independently of each other and the numbers of recipient urns of different colors are not independent of the 373 counts of births. Nevertheless, the solution for a model with feedback could be viewed as correction terms 374 combined with the solution for its simplified version that ignores feedback. Further, we note that, like most 375 models that have been solved analytically, our model implicitly assumes that the reactants and catalysts 376 needed for gene expression such as tRNAs, amino acids, ribonucleotides, etc. are available in sufficient 377 quantity [5]. In the future, generalizations of our model might be able to relax these assumptions. Third, 378 our model makes physical sense only if the parameters such as n red and n grey are integer-valued. This might 379 not be true for arbitrary values of the kinetic parameters. In this case, the factorial functions appearing in 380 our expressions must be replaced by gamma functions as done for the urn model of the negative binomial 381 distribution [16].

383
We developed an urn model and applied it to study a broad range of stochastic models of regulated gene 384 expression. Our urn model generalizes the classical birth-death model by considering the regulation of births. 385 The classical model makes no distinction between arrivals of birth factors and births, as all arrivals cause 386 births. However, in the presence of regulation, as in the case of gene expression, arrivals of birth factors 387 do not result in births if they occur when genes are inactive. Hence, we reinterpret the classical scheme of 388 births as a scheme to sample birth factors from a master urn, which represents the cytosol of a cell. Next, we 389 note that despite temporal heterogeneity in expression activity of genes over long time intervals, there is 390 homogeneity in short intervals. We use this concept to devise recipient urns of two or more colors, which are 391 discretized time intervals with the promoter existing in a single sub-system of its states for the duration of 392 each urn. Then, we assign the birth factors to the recipient urns and count the assignments to urns of a 393 specific color as births. Given physically intuitive choices of sampling distribution from the master urn and 394 the numbers of recipient urns for each color, our model yields probability distributions that are identical 395 to solutions of a range of kinetic models. We describe the physical principles that lead to our urn scheme 396 and provide kinetic interpretations for the urn model parameters. Finally, we discuss our approach in the 397 broader context of urn models and single-cell data analysis as well as highlight its limitations. We conclude 398 by noting that the solutions from chemical master equations are obtained in terms of generating functions 399 and physical intuition into origins of the expressions for probability distributions have thus far been limited 400 for models with regulation. Our model facilitates direct interpretation of the probability expressions, which 401 underscores its significance for pedagogical purposes and to interpret results from data fitting. The physical 402 insights developed in this work will facilitate the adoption of the analytical solutions in single-cell studies.

404
Supplementary information for this article is available online.

405
Declaration of interest 406 The authors declare that they have no competing interests.

407
Author contributions 408 KC and AN conceived the project. KC designed the research, developed the method, generated the figures and 409 wrote the manuscript. AN provided critical feedback and helped shape the research, analysis and manuscript. 410 KC did the proofs other than the perturbation theoretic proofs, which were done by AN. Both the authors 411 read and approved the manuscript.