Stochastic simulation platform for visualization and estimation of transcriptional kinetics

We present an implementation of the Gillespie algorithm that simulates the stochastic kinetics of nascent and mature RNA. Our model includes two-state gene regulation, RNA synthesis initiation and stepwise elongation, release to the cytoplasm, and stepwise degradation, a granular description currently tractable only by simulation. To facilitate comparison with experimental data, the algorithm predicts fluorescent probe signals measurable by single-cell RNA imaging. We approach the inverse problem of estimating underlying parameters in a five-dimensional parameter space and suggest optimization heuristics that successfully recover known reaction rates from simulated gene expression turn-on data. The simulation framework includes a graphical user interface, available as a MATLAB app at https://data.caltech.edu/records/1287.


11
Transcription has been the focus of intensive study due to its cornerstone role in cell activity regulation. 12 Recent advances in fluorescent imaging have enabled mRNA detection at single-molecule resolution in 13 individual cells, in both live and fixed samples (1,2). Spatial analysis of mRNA signals allows the 14 identification (3,4) and quantification (5) of nascent (actively transcribed) mRNA, which offers a direct 15 window into the kinetics of gene transcription, with minimal interference from downstream effects (5), at 16 the level of a single gene copy (6). 17 Converting high-resolution experimental data into theoretical understanding of transcription requires 18 simultaneous modeling of both nascent and mature species of mRNA. Particularly, since at any given 19 moment an mRNA molecule may be in a partially transcribed and/or degraded state, a good model should 20 be able to capture the submolecular features of mRNA. However, current computational models of 21 transcription present challenges for integration with the new wealth of microscopy data. Most models do 22 not distinguish between nascent and mature mRNA or model the transcript length (7)(8)(9)(10)(11). As recently noted 23 (5), several mechanistic models do describe the elongation of nascent mRNA, but do not consider the 24 mature mRNA population and require additional processing for comparison to microscopy data (4,(12)(13)(14). 25 Further, studies using these models tend to predict low-order statistics (7,13), which paint a limited picture 26 at biologically low molecule numbers (4,15). Recent methods based on directly solving the chemical master 27 equation (CME) (5,15,16), using the finite state projection (FSP) algorithm [13], yield distributions of the 28 number of molecules. However, integrating the discrete CME with submolecular features of mRNA is 29 nontrivial, and has only recently been accomplished on a model with a deterministic elongation process (5). 30 A stochastic stepwise model of transcription, more faithful to the mechanistic details, is not currently 31 tractable using FSP (5) due to exponential growth in the size of the state space with increasing resolution. 32 Here we present a stochastic simulation platform that aims to capture the complexities of RNA processing. 33 The platform consists of a submolecular implementation of the Gillespie algorithm (17), simulating the gene switching, transcription, and degradation expected in a prokaryotic system. Transcription and 35 degradation occur in a stochastic fashion, where the initiation and individual steps of elongation are Poisson 36 processes. The algorithm outputs time-dependent fluorescent probe signals, calculated from the overlap of 37 intact RNA and probe-covered regions. The probe signals are provided as cell-specific readouts and as 38 aggregated histograms, mimicking live-cell (MS2) and fixed-cell (smFISH) fluorescence data, respectively 39 (1,2). Using a GUI, a user can input simulation parameters and examine time-dependent statistics, as well 40 as animate the instantaneous molecule states. 41 We use the platform to approach the inverse problem of biological parameter estimation. A recent 42 investigation demonstrated that entire distributions are required to reliably estimate parameter values from 43 single-cell mRNA data (15). To perform parameter estimation based on these empirical distributions, we 44 implement a heuristic approach based on iteratively minimizing mean squared errors and Wasserstein 45 distances of different observables (18). This approach represents a novel method of estimating plausible 46 regions for multiple parameters using time-series data with multiple observables, without making 47 assumptions regarding the functional form of the distributions. Thus, the platform provides a flexible 48 simulation environment to implement reaction mechanisms as well as a search algorithm designed to 49 directly test those mechanisms' parameters against experimental data. The GUI and search algorithm are 50 available at https://data.caltech.edu/records/1287. 51

Model and simulation platform 53
Our platform models a common formalism for the mRNA transcription process (5,7), with a series of 54 stochastic reactions, including promoter turn-on and turn-off, transcription initiation, elongation, RNase 55 (ribonuclease) binding, and degradation (S1 Table). Specifically, promoter activity is represented as a two-56 state switch. In the active ("on") state, transcription can be initiated. The nascent mRNA strand elongates 57 from the 5' to the 3' end, in a series of discrete steps. Upon reaching the end of the template gene, the mature mRNA molecule is released from the gene. Regardless of RNA maturity, RNase can bind to the 5' 59 end of the mRNA, causing the strand to begin stepwise degradation at an average rate assumed to be 60 identical to the elongation speed (19). The process is depicted in Fig 1A. The physiology of the transcribed 61 gene is parametrized by the turn-on rate , the turn-off rate , the transcription initiation rate , the 62 degradation initiation rate , the elongation speed , and the gene length . The experimental 63 parameters include the timespan of the experiment , as well as the probe span vector ( 3 , 5 ) defining 64 its 3' and 5' limits of coverage with respect to the length of the gene, as shown in Fig 1A (5). 65 The platform performs stochastic simulation of the model using the Gillespie algorithm (17,20), then 66 estimates the fluorescence of each mRNA molecule from the size of its region targeted by fluorescent 67 probes. Specifically, we simulate the production and degradation of each mRNA molecule in the cell, whose 68 status can be defined by four variables, i.e. two integers that define 5'-and 3'-most nucleotides of the 69 transcript and two Boolean variables that define whether the mRNA is polymerase-bound (nascent) and/or 70

RNase-bound (degrading). The gene state (on or off) is defined by a single Boolean variable. To convert 71
the simulated mRNA molecule ensemble (Fig 1B) to the experimentally observed fluorescent signal, we 72 calculate the overlap between the intact RNA and the probe coverage (single realization shown in Fig 1C); 73 the probe readout is rescaled to molecule number using the fluorescence of a single intact molecule (16). 74 The resolution of the simulation is determined by the number of cells and the number of steps taken to fully 75 elongate or degrade each molecule. 76 Model simulation is implemented in MATLAB 2018a (21). A simple graphical user interface (GUI), 77 provided as a MATLAB app at https://data.caltech.edu/records/1287, runs the simulation for a user-defined 78 parameter set defining the physical parameters and simulation precision. Upon completion, the GUI outputs

Parameter estimation 102
Given single-cell time-series fluorescence data that describes nascent and mature mRNA, we seek to 103 estimate the underlying model parameters. We would like to approach this inverse problem by simulating 104 mRNA number distributions for the experimentally available timepoints, evaluating an error metric that 105 maps the divergence between the target distribution and each trial distribution to a single number, then 106 minimizing this error by using it as an objective function. To test the algorithm's ability to recover known parameters, we generated synthetic data for the turn-on 121 experiment using the following ground truth parameters: = 95 min -1 , = 1 min -1 , = 10 min -1 , 122 = 0.5 min -1 , = 41.5 nt s -1 , = 15 min, = 5300 nt, 10,000 cells, and 15 steps of elongation 123 to complete transcription. The procedure used to convert these rates into reaction propensities is described 124 in the Supplementary Information. Relatively coarse simulation quality was used as a proof of concept.
The simulations were parallelized across 90 AWS processors. The process of parameter identification is 126 visualized in Fig 2B. We found that the one-sigma interval around the mean estimate included the ground 127 truth parameters (Fig 2C). The convergence of , , and throughout the search is relatively well-128 behaved and close to monotonic; however, and are far more challenging to estimate (Fig 2D). 129 We compare the mean signals of nascent and total RNA simulated using the one-sigma estimate interval 130 (Fig 2E), as well as the corresponding distributions simulated using the mean estimate (Fig 2F), to the 131 synthetic ground truth data. Comparison at both levels demonstrates convergence. To cross-validate the 132 search, we compare repression simulations generated from the ground truth and estimated parameters. The 133 nascent and total means are consistent (Fig 2G). To test the robustness of the fitting algorithm, we apply 134 the search procedure to the turn-on data generated using a range of and values, mimicking the 135 regulatory parameter modulation hypothesized to occur in vivo (28). The results suggest consistent 136 performance throughout the parameter space, although identifiability of high is poor (Fig 2H). 137 Encouragingly, all one-sigma intervals include the ground truth parameters. 138 the reaction propensity calculations are given in S1 Table. 165 To perform parameter estimation on turn-on synthetic data, we use a heuristic iterative method based on 166 the genetic algorithm (29). We alternate between optimizing mean signals and entire distributions. The error metric for the mean signal is the mean squared error. Due to the limited support of empirical 168 distributions, the commonplace minimization of Kullback-Leibler divergence between target and test 169 distributions (30) is inappropriate for comparing distributions (25). Instead, we use the absolute difference 170 between the target and test cumulative distribution functions (CDFs), which tends to be more robust to 171 noise and sparsity (25); this metric is commonly known as the Wasserstein or earth mover's distance (18). 172 We aggregate different time points' Wasserstein distances by weighing them using a uniform or exponential 173 function of time, as described in Supplementary Information.  174 Empirically, the parameter identifiability is far from uniform throughout the simulated time-series, and 175 different metrics provide sensitivity to different parameters. Further, it is computationally prohibitive to 176 simulate entire trajectories at the beginning of the parameter search, when the relevant region of the five-177 dimensional search space is not yet known. Therefore, we take an ad hoc iterative approach, which 178 incrementally narrows the region of parameters consistent with the observed signals. This heuristic 179 approach is chosen for computational convenience and is not guaranteed to the global parameter optimum.  convergence. The method demonstrates that parameter estimation from a time series of multiple 206 observables is tractable by heuristic likelihood-free methods. The validation we perform suggests that, by 207 using simulations to generate empirical distributions, this approach is more effective to fit experimental 208 signals than traditional methods when no closed-form solutions or approximations are available; further, 209 the visualization capabilities would be useful for the qualitative description and understanding of such 210 complex systems. 211 Our platform allows numerical solution of detailed transcription model for both nascent and mature mRNA 212 species, whose CME may not be solved exactly. However, since the approach is simulation-based, the 213 steady state of the system needs to be computed asymptotically from a non-steady state, which may be 214 time-consuming. Specifically, simulating and fitting the steady-state and turn-off experiments may be computationally prohibitive if the scales of kinetic rates have substantial difference. Alternatively, it may 216 be possible to use analytical solutions (31,32) to approximate an equilibrium distribution; however, this 217 approach is challenging to generalize and the resulting simulation would no longer be exact. 218 The parameter identification process may be facilitated by parameter constraints from analytical solutions. 219 For example, if the steady-state solution for the total mean is known, the and parameters can be 220 fixed for the optimization procedure, reducing the parameter estimation to the simpler problem of 221 optimization in three-dimensional space of , , and . However, we anticipate that the value of this 222 heuristic method rests in applications to models with ad hoc mechanisms that do not have easily tractable 223 analytical solutions. 224 The system is currently limited to modeling prokaryotic transcription. The implementation of eukaryotic 225 transcription would require making significant changes to the reaction schema, such as disabling nuclear 226 nascent degradation and adding a kinetic model of a transport process after the release of the newly 227 transcribed mRNA. The implementation of more complex transcription schema, such as disallowing 228 polymerase progression past another polymerase (33), which are challenging to theoretically treat using the 229 CME framework, are relatively straightforward to achieve by checking the distance to the nearest neighbor 230 in front of each polymerase and making the propensity of elongation zero if the distance is sufficiently 231 small. The enforcement of delays at particular locations along the gene is likewise straightforward, 232 specifically by replacing the constant elongation speed with a speed dependent on the polymerase location. 233