Approximate Bayesian computation of transcriptional pausing mechanisms

At a transcriptional pause site, RNA polymerase (RNAP) takes significantly longer than average to transcribe the nucleotide before moving on to the next position. At the single-molecule level this process is stochastic, while at the ensemble level it plays a variety of important roles in biological systems. The pause signal is complex and invokes interplay between a range of mechanisms. Among these factors are: non-canonical transcription events – such as backtracking and hypertranslocation; the catalytically inactive intermediate state hypothesised to act as a precursor to backtracking; the energetic configuration of basepairing within the DNA/RNA hybrid and of those flanking the transcription bubble; and the structure of the nascent mRNA. There are a variety of plausible models and hypotheses but it is unclear which explanations are better. We performed a systematic comparison of 128 kinetic models of transcription using approximate Bayesian computation. Under this Bayesian framework, models and their parameters were assessed by their ability to predict the locations of pause sites in the E. coli genome. These results suggest that the structural parameters governing the transcription bubble, and the dynamics of the transcription bubble during translocation, play significant roles in pausing. This is consistent with a model where the relative Gibbs energies between the pre and posttranslocated positions, and the rate of translocation between the two, is the primary factor behind invoking transcriptional pausing. Whereas, hypertranslocation, backtracking, and the intermediate state are not required to predict the locations of transcriptional pause sites. Finally, we compared the predictive power of these kinetic models to that of a non-explanatory statistical model. The latter approach has significantly greater predictive power (AUC = 0.89 cf. 0.73), suggesting that, while current models of transcription contain a moderate degree of predictive power, a much greater quantitative understanding of transcriptional pausing is required to rival that of a sequence motif. Author summary Transcription involves the copying of a DNA template into messenger RNA (mRNA). This reaction is implemented by RNA polymerase (RNAP) successively incorporating nucleotides onto the mRNA. At a transcriptional pause site, RNAP takes significantly longer than average to incorporate the nucleotide. A model which can not only predict the locations of pause sites in a DNA template, but also explain how or why they are pause sites, is sought after. Transcriptional pausing emerges from cooperation between several mechanisms. These mechanisms include non-canonical RNAP reactions; and the thermodynamic properties of DNA and mRNA. There are many hypotheses and kinetic models of transcription but it is unclear which hypotheses and models are required to predict and explain transcriptional pausing. We have developed a rigorous statistical framework for inferring model parameters and comparing hypotheses. By applying this framework to published pause-site data, we compared 128 kinetic models of transcription with the aim of finding the best models for predicting the locations of pause sites. This analysis offered insights into mechanisms of transcriptional pausing. However, the predictive power of these models lacks compared with non-explanatory statistical models - suggesting the data contains more information than can be satisfied by current quantitative understandings of transcriptional pausing.

The reaction pathways (Fig 1) of transcription elongation have been studied extensively. 2 RNA polymerase (RNAP) exists inside a transcription bubble that translocates along 3 the double stranded DNA. Within the polymerase, the messenger RNA (mRNA) and 4 template DNA form a DNA/RNA hybrid [1][2][3][4][5][6][7]. Throughout the main transcription 5 elongation pathway, RNAP successively alternates between the pretranslocated and 6 posttranslocated positions, employing a Brownian ratchet mechanism [8,9]. In order 7 to bind and incorporate the next nucleotide onto the 3 end of the mRNA, the active 8 site must be accessible and RNAP must be in the posttranslocated position. and its off-pathway events. RNA polymerase (grey rectangle) translocates the template in the 3 → 5 direction. State notation described in main text. Inhibitory effects that RNA secondary structures have on translocation are displayed. B: Schematic depiction of pausing, including mechanisms and effectors of pausing and pause recovery described in A. C: Pause sites are analogous to traffic lights. If the light is green, the RNA polymerase may go. If the light is red it must wait an indefinite amount of time for the light to turn green. The colour of the upcoming traffic light and the time until it turns green are random variables.
However, RNAP does not always conform to this catalytically productive pathway. 10 As off-pathway events, RNAP can backtrack or hypertranslocate. When 11 backtracking (Fig 1A, LHS), RNAP translocates upstream (in the 3 direction along the 12 template) [9][10][11]. This causes the 3 end of the mRNA to exit the RNA polymerase has taken many forms in the literature and exists as an intermediate between the 26 pretranslocated and backtracked states (Fig 1). Entry into this state from a 27 pretranslocated complex is achieved by rearrangement of the RNAP trigger-loop and 28 fraying of the 3 mRNA, in a manner that inhibits elongation but not translocation [24]. 29 For prokaryotes and eukaryotes, transcription elongation rates range from 20-120 30 bp/s [25][26][27]. However individual RNAP molecules proceed quite erratically along their 31 template. Approximately once every 100 bp [23,28] there exists a pause site which takes 32 significantly longer to transcribe, oftentimes on the timescale of seconds or 33 minutes [19,29,30]. Extended pauses can lead to RNA polymerase traffic jams [31,32]. 34 For the most part, pause sites are likely to be detrimental to the organism. 35 Nonetheless, transcriptional pausing plays a range of important biological roles in 36 certain systems [33]. 37 1. Gene expression. For example, the 5 UTR of HIV-1 contains a pause site 38 immediately downstream from the TAR hairpin [19,34]. TAR is a regulatory element 39 that upregulates transcription upon binding to viral protein Tat [35]. It is therefore 40 beneficial for the virus if there exists a pause site downstream from TAR, as this gives 41 Tat a greater temporal opportunity to bind to TAR before RNAP has left the proximity. 42 2. Modulating RNA folding. For example, the ribDEAHT operon of Bacillus 43 subtilis encodes genes involved in riboflavin synthesis [36]. The ribD riboswitch, which 44 manifests in the nascent mRNA, can adopt either the terminator fold (eliciting 45 termination) or the antiterminator fold (enabling transcription). The former is favoured 46 in the presence of flavin mononucleotide. Transcriptional pause sites flank the 47 riboswitch, thus providing this ligand a greater opportunity to apply its effect [37]. 48 3. RNA splicing. RNA splicing involves the pairing of a donor and an acceptor 49 splice site. As this process occurs cotranscriptionally [38], the chance of the cellular 50 splicing machinery selecting any given donor-acceptor pair is dependent on transcription 51 elongation velocity and transcriptional pausing [39][40][41]. The positioning and strength of 52 pause sites therefore contributes to the proportions of splice variants in systems which 53 employ alternative splicing. 54 The dwell time at a pause site is approximately exponentially distributed [19,23,30] 55 and subsequently any given pause site can be quantified by how likely the RNAP is to 56 pause, and an escape time half-life for when it does pause [19,23]. While comparisons 57 can be made between pause sites and stop signs or speed bumps, due to their stochastic 58 nature pause sites are more akin to traffic lights ( Fig 1C). 59 However the mechanisms that elicit pausing to occur in the first place are complex 60 and multipartite [19,22,24]. A range of sequence-dependent factors contribute to the 61 pause signal. Among the known effectors: 1) when the DNA/RNA hybrid of a 62 posttranslocated state is weak relative to that of the pretranslocated state, RNAP may 63 favour the pre state over the post state thus leading to a pause [21]. This delay can A consensus sequence of the pause site for the E. coli RNAP has recently been 78 identified [28,50]. This motif reveals that the nucleotides at the 3 and 5 ends of the 79 hybrid are important, as well as the incoming NTP which is usually a GTP. By 80 comparing a nucleotide sequence to a motif [50], one may be able to predict the 81 locations of pause sites. However it still leaves much unsaid about the physical processes 82 that govern pausing. Although sequence-dependent explanatory models have been 83 applied [1,45,46,51], a systematic and large scale analysis of the accuracy of this 84 approach has not yet been done.

85
It would be greatly beneficial to have a model for both predicting the locations of 86 and explaining the mechanisms behind pause sites, for any arbitrary gene sequence.

87
There are still uncertainties pertaining to the mechanism behind transcriptional pausing 88 we would like to resolve. 1) To what extent does mRNA secondary structure inhibit the 89 translocation thereby modifying the pausing behaviour? Does using a prediction of the 90 mRNA structure enhance the model [1,44]? 2) Does utilising knowledge of the gated 91 tyrosine residue that inhibits translocation between the backtrack-1 and backtrack-2 92 states improve the model [11]? It could be the case that the backtrack-1 state is readily 93 accessible and incorporating this feature into the model improves its accuracy. 3)

94
Some [1,[18][19][20][21][22], but not all [44,51,52], models have been built with the inclusion of the 95 IS that RNAP must enter before backtracking. Is this model feature essential to explain 96 the sequence-specific properties of pausing or is it a redundant feature that introduces 97 unnecessary complexity into the model?

98
By virtue of the availability of data from a high-throughput detection of pause sites 99 across the entire E. coli transcriptome by Larson et al. 2014 [28], we were able to 100 explore these model variants. In this study we used a Bayesian approach to interrogate 101 this dataset. The volume of this dataset allowed us to 1) evaluate how reliable this 102 modelling approach can be for the prediction of pause sites, and 2) select the best model, 103 and its parameters, to better understand the mechanics of transcriptional pausing.

105
We explored two approaches for predicting pause sites: 1) the simulation of kinetic 106 models as continuous-time Markov processes (based on the kinetic scheme shown in Fig 107  1), and 2) by using a simple naive Bayes classifier. Both models were trained on the 108 aforementioned dataset [28].

109
The first approach involves stochastic simulation of transcription at the 110 single-molecule level using the Gillespie algorithm [53,54]. This is done in a similar 111 fashion to our previous work [55], but here we have used the model to predict the dwell 112 time at each site instead of mean velocity under force.

114
Let S(l, t) denote a state, where l is the current length of the mRNA and t ∈ Z is the 115 position of the polymerase active site with respect to the 3 end of the mRNA (Fig 1A). 116 t = 0 when pretranslocated and t = 1 when posttranslocated. t < 0 denotes backtracked 117 states and t > 1 denotes hypertranslocated states.

118
Standard Gibbs free energies ∆ r G 0 (= ∆G) involved in duplex formation are used to 119 calculate forward and backward translocation rates. These terms are approximated 120 using nearest neighbour models. The total Gibbs energy of state S -arising from 121 nucleotide basepairing and dangling ends -is hybrid . For the latter, 124 dangling end energies are estimated as described by Bai et al. 2004 [51] and only apply 125 to the posttranslocated position. ∆G terms are expressed relative to the thermal energy 126 of the system, in units of k B T , where k B T = 0.6156 kcal/mol at T = 310 K.

127
Comparing kinetic models 128 To better understand the mechanisms that govern transcriptional pausing, we not only 129 estimated the kinetic model parameters but also the kinetic model itself. In this section 130 we describe six different model settings. Each model setting has a discrete set of values 131 giving a cross-product of 2 × 2 × 4 × 2 × 2 × 2 = 128 models to compare. There has been discussion concerning whether there exists an IS that acts as an entry 134 point for backtracking from the elongation pathway [1,22]. The IS is catalytically 135 inactive and can act as a prolonged pause state regardless of whether backtracking is 136 subsequently instigated [42]. While incorporating this physical process may offer 137 additional explanatory power to the model, two additional parameters k U and k A must 138 be estimated. We therefore compared two models: a general model (IS = 1) where the 139 IS exists and is necessary for backtracking , and a simpler model (IS = 0) where there is 140 no IS and RNAP can backtrack freely (and there are two fewer parameters to estimate). 141 Inclusion of the gating tyrosine, GT

142
The crystal structure of the S. cerevisiae Pol II complex by Cheung et al. 2011 [11] 143 reveals a gating tyrosine that may inhibit backtracking. While back translocation into 144 the S(l, −1) position may be permitted, further backtracking into S(l, −2) is likely 145 delimited by this amino acid. We were interested whether the gating tyrosine plays is an 146 effector of transcriptional pausing, so we compared two models: one model (GT = 1)

147
where RNAP can readily translocate between S(l, 0) and S(l, −1) but translocation 148 between S(l, −1) and S(l, −2) is much less favourable, and a simpler model (GT = 0) 149 where the effects of the gating tyrosine are ignored (same as Fig 1). 150 Estimating the translocation transition state, TS

151
As the transcription bubble migrates along the gene, so too do the basepairing 152 configurations within the DNA/RNA hybrid and the DNA/DNA gene. In order for 153 RNAP to translocate from one position into the next, disruption of one hybrid basepair 154 and one gene basepair may be required. This translocation is assumed to occur through 155 a translocation transition state [55]. Four sequence-dependent methods for estimating 156 this transition state are described (Fig 2A).

157
The DNA/RNA basepairs within the hybrid of the translocation transition state are 158 assumed to be a subset of the basepairs within the two neighbouring states. These    To determine which, if any, of these four models are the most suitable we estimated 170 the value of TS.

171
Inclusion of RNA secondary structure blockades, RB

172
The simplest models of incorporating RNA secondary structure as a translocation 173 inhibitor make the assumption that transcription is sufficiently slow for mRNA to fold 174 into its minimum free energy structure within the same timescale [1,44]. As a more 175 complex model, one could invoke a kinetic model of cotranscriptional folding derived 176 from something to the likes of Kinfold [58]. It was of interest how much this first model, 177 and its questionable assumptions, contributed to predictive and explanatory power with 178 respect to transcriptional pausing. We therefore compared two models of RNA folding: 179 the model RB = 0 with no RNA folding and the model RB = 1 where the minimum free 180 energy structure, as predicted by ViennaRNA suite [59,60], is used as a translocation real observed phenomena, it may be the case that they are not required to adequately 186 explain pausing. To elucidate this, we explored models in which either backtracking, or 187 hypertranslocation, or both, or neither, are illegal pathways.  parameters for calculating these rates were in some cases held constant based off 191 previous estimates, while in other cases were estimated from the data. Where estimated, 192 a prior distribution is required. Table 1 summarises these constants and priors.
Transcription bubble Nucleotide incorporation µM 567 For full parameter descriptions refer to the corresponding main body subsection. Where a parameter is estimated, a prior probability distribution is specified, and where a parameter is held constant, its value is left in place. Normal distribution priors were used for energy terms while lognormal priors were used for rates (parameterised such that the mean and standard deviation specified are those in natural log space). See S1 Appendix for justifications behind these prior distributions.
respectively, and the number of paired template nucleotides in the DNA/mRNA hybrid, 197 h [1]. These three parameters are to be estimated from the data and expected to have a 198 profound effect on the sequence-dependent properties of translocation [51].  [44] where, provided that RB = 1, the RNA minimum free energy 202 (MFE) structure is predicted using the ViennaRNA suite [59,60], and these structures 203 can block translocation. However, the λ b nucleotides proximal to the mRNA entry and 204 exit channels respectively are assumed to be unable to adopt a secondary structure due 205 to steric collisions with the enzyme, and are therefore unable to block translocation.

206
Suppose that (r 1 , r 2 , . . . r l ) is the mRNA sequence of state S(l, t). Let R(i, j), where 207 1 ≤ i ≤ j ≤ l + 1, be the subsequence of nucleotides (r i , . . . , r j−1 ) which are basepaired 208 in the MFE structure of mRNA subsequence. R(i, i) is the empty set. Let U (l, t) and Then, the backwards translocation rate out of state S(l, t) is equal to zero, due to Similarly, the forward 213 translocation rate is equal to zero, due to downstream collisions, if r l+t+λ b +1 ∈ D(l, t). 214 Translocation 215 The computation of translocation rates invokes transition state theory and is 216 parameterised as an extension to our previous work [55]. Translocation from state S is 217 described by the rate of backward translocation k bck (S) and the rate of forward  Let T (l, t) be the translocation transition state between states S(l, t) and S(l, t + 1). 231 Using transition state theory, the rates of forward and backwards translocation out of for some pre-exponential constant A, arbitrarily set here to 10 6 s −1 [55]. Two 234 energetic terms are required to compute these rates; the Gibbs energy of the 235 translocation state ∆G S(l,t) , and the Gibbs energy of the translocation transition state 236 ∆G T (l,t) . We will describe these two components separately.
First, calculating ∆G S(l,t) is straightforward (Fig 2C). It requires one translocation 238 parameter -∆G τ 1 . ∆G S(l,t) is primarily computed from DNA/DNA and DNA/mRNA 239 Gibbs energies ∆G where ∆G τ 1 is a term added onto the Gibbs energy of basepairing of the 241 posttranslocated state. ∆G τ 1 was found to be a necessary parameter to describe 242 elongation sufficiently for the E. coli RNAP and was estimated as ∼ −2 k B T [55] and 243 is set accordingly throughout this study (Table 1).

244
Second, calculating ∆G T (l,t) is more complex (Fig 2C). It requires a method for 245 estimating the nucleic acid energies of the transition state, and three translocation and is assumed to be a fast reaction no different to that between S(l, 0) and S(l, 1).

286
Backtracking beyond this point is still assumed to be slow (Case 6).

287
In summary, the rates of translocation are calculated using transition state theory parameters [56,57] and four model parameters -G τ 1 , G ‡ τ , G ‡ τ − , and G ‡ τ + . An estimate 292 of the Gibbs energy of the translocation transition state TS is also required (Fig 2A), as 293 well as the values of RB, HT, BT, and GT.

299
Let T (l, t) be the translocation transition state between neighbouring states S(l, t) 300 and S(l, t + 1). The set of basepairs in H(T (l, t)) and G(T (l, t)) depend on the current 301 transition state model TS.
Once the set of basepairing nucleotides comprising the gene G(T (l, t)) and the T (l,t) is readily computed [56,57].

305
However, because the number of basepairs in each of the four transition state models 306 TS differ, and their corresponding energies therefore systematically differ, ∆G T (l,t) must 307 be normalised such that the four models achieve the same average translocation rates. 308 This was achieved by calculating the mean value of ∆G   Nucleotide incorporation 332 Nucleotide incorporation is described by the rate of catalysis k cat , the second order rate 333 constant of NTP binding k bind , and the NTP dissociation constant K D = k rel k bind , where 334 k rel is the rate of NTP release. Partial equilibrium approximations were not made in 335 the NTP binding step.

344
The baseline Gibbs energy barrier ∆G ‡ τ = 21 k B T and the hypertranslocation barrier 345 ∆G ‡ τ + = 2 k B T . All other parameters used in this example are specified in Table 1.

346
In order to stochastically sample the next state (using the Gillespie 347 algorithm [53,55]), three rates must be calculated.

348
First, computing the rate of NTP binding is straightforward.   It is clear that Case 2 of Equation 7 will fail and Case 5 will instead apply (t > 0 and 361 HT = 1). Therefore, in order to calculate the rate of hypertranslocation, we must first 362 compute the Gibbs energies of S(31, 1) and T (31, 1). The former term has Gibbs energy 363 ∆G S(31,1) = ∆G  Classifying site X l into class P or N is achieved by simulating the kinetic model using 398 the Gillespie algorithm [53,55]. Let f P (l) be the median time that the length of the 399 mRNA is exactly l nucleotides in length.

403
Site l is classified as a pause site if and only if f P (l) > θ for some threshold θ.

404
Applying changes to the value of θ is required to build a ROC curve.

405
Naive Bayes classifier 406 The naive Bayes classifier is a simple probabilistic classifier derived from Bayes' 407 theorem [63] and makes a suitable bioinformatics algorithm for sequence-based 408 prediction [64][65][66]. Classification of site l into P or N is computed through comparison 409 of the respective log posterior probabilities: = log P (X|C l = P)P (C = P) P (X|C l = N )P (C = N ) (12) The naive property of the NBC invokes the assumption of independence between 411 sites, allowing the likelihood to be computed as the product of likelihoods P (X j |C l = c) 412 across all sites in a window around l, where the window size is w 1 + 1 + w 2 (for w 1 = 10 413 and w 2 = 4). Log probabilities are trained using Laplace smoothing.

414
Site l is classified as a pause if g P (l) > θ for some θ.

416
Searching the space of kinetic models and parameters

417
Our aim was to 1) use Bayesian inference to select the best of 128 models (Fig 3); and 418 2) estimate the parameters. This was achieved using the rejection approximate Bayesian 419 computation algorithm (R-ABC) [67,68]. parameters. There is evidence that HT and RB may also be critical. We wanted to 434 confirm that these variables are indeed important for the prediction of pause sites, and 435 to quantify how much predictive power is contained in each variable. To do this, we 436 Fig 4. Posterior distribution of kinetic model parameters. Posterior distributions (coloured bars) and prior distributions (black curves) for the 9 estimated parameters are shown. Posterior distributions reported are conditional on the models which use the parameters. For example ∆G ‡ τ + is conditional on HT = 1, while k U and k A estimates are conditional on IS = 1. The geometric median point-estimate, the 95% highest posterior density (HPD) interval (calculated using Tracer 1.6 [69]), and posterior distribution sample size n are displayed above each plot (3 sf). These results reveal that h, β 2 , ∆G ‡ τ , and ∆G ‡ τ + are informed by the pause site data, while the remaining parameters are largely uninformed. Sequence logos built using B: known pause sites, C: the subset of known pause sites which are correctly classified by the KMC (true positives), and D: the subset of known pause sites which are not classified as pause sites by the KMC (false negatives). The true positives and false negatives collectively comprise the true pause sites. The nucleotide window used by the NBC and the estimated hybrid length are displayed. All logos are generated using WebLogo [70] and trained on test set sequences.
recomputed the AUC of models, by sampling from the posterior distribution, using 437 samples that differ in the described variables (Table 2). 438 These results confirm that TS, h, and β 2 are indeed critical variables. In the most 439 extreme case, changing the transition state model TS from H I G I to H U G U reduced the 440 AUC from 0.73 to 0.43; the latter corresponding to a predictive model that is worse 441 than assigning pause sites at random. In the least extreme case, changing h from 11 nt 442 to 10 nt reduced the AUC from 0.73 to 0.70.

443
Whereas, adjusting sampling from the posterior distribution, conditional on HT and 444 RB, did not yield any significant AUC changes across the four pairwise combinations of 445 HT and RB (other than a minor decrease in AUC for {HT = 0, RB = 1}). It is therefore 446 likely that these two model settings are not offering any further predictive power.

447
Naive Bayes model 448 We trained a naive Bayes classifier to predict pause sites using the same dataset [28] as 449 the kinetic model. This model enabled an estimation of the amount of information 450 available in the data, without the constraint of being physically plausible.

451
Similar to the kinetic model, we performed a ROC analysis on the NBC to evaluate 452 how accurately it can predict pause sites (Fig 5). The AUC of the NBC was 0.888 on 453 the test set (and 0.895 on training set), suggesting that the sequence within this window 454 contains a large amount of information about transcriptional pausing, even though the 455 sites are assessed independently. This model has significantly better prediction power 456 than the kinetic model. The extent of overfitting is minimal.

458
In this study we inferred structural, kinetic, and thermodynamic parameters (Fig 4) of 459 kinetic models for the prediction of transcriptional pausing. We also inferred the kinetic 460 model itself to provide evidence for or against various kinetic model variants that have 461 been described in the literature (Fig 3). The posterior distribution of kinetic models is 462 interpreted using the Bayes factor, K, following the general guidelines by Kass and 463   Raftery 1995 [71]. We compared the predictive power of the kinetic models to that of a 464 statistical technique: the naive Bayes classifier (Fig 5). We estimate that the transcription bubble contains anĥ = 11 nt hybrid and aβ 2 = 0 nt 476 gap between the RNAP and the downstream gene region (Fig 4). These estimates each 477 have a posterior probability of 1.00, indicative of high certainty that these are the best 478 estimates for the data. In contrast the gap between the RNAP and the upstream 479 dsDNAβ 1 was not informed by the data.

480
Furthermore, the translocation transition model TS is an extremely important model 481 setting. We estimate that the best transition model estimate isTS = H I G I (Fig 3). This variant has a marginal posterior probability P (TS = H I G I |D) = 1.0, indicating 483 high confidence that this is the preferred transition state model. Our critical parameter 484 analysis reveals that by changing the value of TS, while holding all else constant, in the 485 extreme case the AUC declines from 0.73 down to 0.48 (Table 2). 486 Under the H I G I model, the transition state between two translocation states 487 contains the intersection between the two sets of DNA/RNA hybrid basepairs, and the 488 intersection between the two sets of DNA/DNA gene basepairs.

489
This model corresponds to a physical mechanism of translocation akin to a bubble 490 melting process (see Fig 2A). First, one DNA/DNA basepair within the gene and one  Examining the sequence logo generated from known pause sites (Fig 5B), we can see 500 that the nucleotides 1-2 positions proximal to the pause site are important, as are the 501 two nucleotides 9-10 positions upstream from the pause site. Pausing usually occurs at 502 a cytidine or uridine and the incoming NTP is usually guanosine. Comparing this with 503 the sequence logo generated from the pause sites which were correctly classified by the 504 kinetic model (Fig 5C), we can see that some of this pattern is captured by the kinetic 505 model. The two nucleotides 9-10 positions upstream from the pause site exist at the tend to be G-C rich, which, under the nearest neighbour models, correspond to stronger 512 basepairs that require more energy to break than A-U or A-T basepairs [56,57]. 513 However, the kinetic model is unable to explain pause sites which do not have strong 514 basepairs at the upstream end of the hybrid. The kinetic model also places too much 515 weight on the position two nucleotides downstream from the pause site (position 23 in 516 the sequence logos). The reasonably large differences in positional information between 517 these two sequence logos suggests that the kinetic model is unable to model pausing 518 when the sequence composition even slightly deviates from this motif.

519
Estimates of β 1 and β 2 are consistent with previous estimates from crystal 520 structures [4][5][6][7]. However, based off these same structures the pretranslocated hybrid 521 length is estimated as 10 bp, which is inconsistent with our estimate ofĥ = 11 bp. This 522 is a peculiar contradiction, especially considering that our Bayesian protocol is 100% 523 certain thatĥ = 11 bp, so it may be more suitable to consider h not as being the true 524 hybrid size but rather the effective hybrid size during transcription.

525
Overall, there is strong evidence that by having accurate estimates for the 526 parameters which govern the size of the transcription bubble, and a good model of the 527 translocation transition state, the sequence-specific properties that emerge are hypothesised role the gating tyrosine plays in eliciting pauses, by permitting rapid 538 translocation into S(l, −1) while delimiting further backstepping, was incorporated with 539 the backtracking model (Fig 2D). Catalytically inactive intermediate state models were 540 assessed and the rates of entry k U and exit k A to and from this state were estimated.

541
And yet, none of these eight variants were notably superior in their ability to predict 542 the locations of pause sites. These three mechanisms have marginal posterior 543 probabilities ranging from 0.44 to 0.69 (Fig 3), to be interpreted as 'not worth more 544 than a bare mention' [71]. The four related parameters -∆G ‡ τ − , k cleave , k U , and k A -545 have posterior distributions almost the same as their prior distributions, consistent with 546 the data providing no information about these parameters.

547
A posterior probability of intermediate magnitude is indicative of the model setting 548 being neither necessary nor detrimental. Instead the feature is unnecessary.

549
Our findings are consistent with the hypothesis that the relative stability between 550 the pre and posttranslocated states chiefly facilitates the occurrence of pausing [21,51], 551 as opposed to the stability of those relative to the backtracked states [18,44,72]. Rather, 552 most pauses are brief (averaging 3 s) and do not involve backtracking [42]. It is likely 553 that backtracking occurs on a timescale so slow [73] that by the time RNAP has had 554 sufficient time to sample the energy landscape of backtracked states, the pause has 555 already begun, and therefore these energies are of little use for predicting the frequency 556 of pausing.

557
Pauses are hypothesised to occur largely through the conformational rearrangement 558 of the enzyme into a catalytically inactive form -the IS [1,[18][19][20][21][22]. This may occur in a 559 sequence-dependent manner where weaker DNA/RNA hybrids are more likely to invoke 560 this transition [18,19,34]. The sequence-independent model of this transition used in 561 this study was also of no use for predicting the locations of pause sites.

562
Although, backtracking, the gating tyrosine, and the IS were not able to explain the 563 frequency of pausing, they may be able to explain the duration of pausing. Treating this 564 system as a regression problem -by fitting to known pause durations -as opposed to a 565 classification problem -where pausing is viewed as a binary trait -may offer further 566 insights into these model features.

567
Hypertranslocation and mRNA folding are not necessary to 568 predict pausing 569 Our initial analysis suggested that hypertranslocation HT could be a necessary model 570 feature; with marginal posterior probability P (HT = 1|D) = 0.93 -corresponding to a 571 Bayes factors of 13.3 -therefore providing 'positive evidence' thatĤT = 1 [71].

572
Accordingly, the posterior distribution of hypertranslocation Gibbs barrier ∆G ‡ τ + is 573 quite different to its prior distribution (Fig 4).

574
There was also evidence that incorporating RNA folding into the model may be 575 necessary; P (RB = 1|D) = 0.87 and a Bayes factor of 6.7; 'positive evidence' [71]. In 576 the RB = 1 model, upstream (downstream) RNA secondary structures inhibit backward 577 (forward) translocation.

578
However, when these two model settings were systematically evaluated on the test 579 set, they were not found to be necessary. Subsampling from the posterior distribution 580 conditional on HT and RB did not yield any significant change in AUC across the four 581 pairwise combinations of HT and RB, aside from a trivial decrease in AUC for 582 {HT = 0, RB = 1} (Table 2). It is therefore likely that although HT = 1 and RB = 1 583 are associated with higher posterior probabilities, the model which has these settings 584 enabled {HT = 1, RB = 1} does not contain any predictive power beyond that of 585 {HT = 0, RB = 0}, so the hypertranslocation and RNA blockade models are of little to 586 no use for the prediction of pause sites.

587
These results are consistent with the findings of Levint et al. 1987, who found no 588 correlation between the locations of pause sites and upstream RNA secondary 589 structures [74], and Dalal et al. 2006 [75], who used optical tweezers to inhibit mRNA 590 folding during transcription, and found that the kinetics of pausing were not affected by 591 the perturbation.

592
This is not to say that these mechanisms are not involved in pausing. RNA folding 593 and hypertranslocation cooperatively induce pausing at the his leader pause site (a 594 Class I pause), for instance. This is achieved by direct interaction between an upstream 595 hairpin and the RNAP [18]. However, on average, the described models of RNA folding 596 and hypertranslocation are of no assistance for the prediction of pause sites. that perform even better than the NBC [50], this provides a lower-bound of the amount 606 of information contained in the data. With a sufficient understanding of the kinetics of 607 transcription, the kinetic model classifier should perform at least as well as the NBC did. 608 Gibbs energies of basepairing can only take the kinetic model so far. Physically 609 informed features that could extract the full potential of the kinetic model classifier 610 include: the effects of double-stranded DNA bending upstream and downstream of the 611 polymerase [19,43]; a sequence-dependent model of entry into the IS [18,19,34]; specific 612 interactions between the DNA/RNA hybrid and the protein [22]; effects that the 613 promoter have on the way the polymerase interacts with the gene [76]; differential rates 614 of NTP binding, release, and catalysis across the four nucleotides [46]; effects that the 615 nucleotide context around the 3 mRNA have on the rates of NTP incorporation [77]; or 616 the effects of local NTP depletion [78]. A mathematical understanding of such processes, 617 among others, may be necessary for the kinetic model to perform as well as a sequence 618 motif and bridge the gap between the two sequence logos (Fig 5).

619
Conclusion 620 In this study, we quantified the predictive power that standard kinetic models of 621 transcription elongation have with respect to identifying transcriptional pause sites.

622
Transcriptional pausing is not sufficiently understood to capture the signal from the 623 data as thoroughly as a non-physical statistical model. We suggest that the relative 624 stability between the pre and posttranslocated states, and the estimated energy 625 required to translocate between them, is the primary effector of pausing.