Controlling E. coli Gene Expression Noise

Intracellular protein copy numbers show significant cell-to-cell variability within an isogenic population due to the random nature of biological reactions. Here we show how the variability in copy number can be controlled by perturbing gene expression. Depending on the genetic network and host, different perturbations can be applied to control variability. To understand more fully how noise propagates and behaves in biochemical networks we developed stochastic control analysis (SCA) which is a sensitivity-based analysis framework for the study of noise control. Here we apply SCA to synthetic gene expression systems encoded on plasmids that are transformed into Escherichia coli. We show that (1) dual control of transcription and translation efficiencies provides the most efficient way of noise-versus-mean control. (2) The expressed proteins follow the gamma distribution function as found in chromosomal proteins. (3) One of the major sources of noise, leading to the cell-to-cell variability in protein copy numbers, is related to bursty translation. (4) By taking into account stochastic fluctuations in autofluorescence, the correct scaling relationship between the noise and mean levels of the protein copy numbers was recovered for the case of weak fluorescence signals.


I. INTRODUCTION
C ELL-TO-CELL variability in protein copy numbers within isogenic populations are typically observed in various types of cells due to underlying random biochemical reaction processes [1], [2], [3]. The variability can lead to noise-induced cellular phenotypes such as cellular differentiation [3], multiple stability [4], and either sensitivity enhancement or suppression [5], [6]. Here we investigate the ability to differentially control the noise and mean levels of gene expression in E. coli.
Such differential control in gene expression has been achieved in different organisms such as yeast [7], [8], [9], soil bacteria [10], and mammalian cell lines [11]. In E. coli, systematic noise control have not been performed by perturbing promoter DNA sequences and ribosome binding sites, while most studies have been focused on genome-wide expression without such perturbations [16], [21], [2]. Here we aim to understand differential control of mean and noise levels of protein concentrations at the single cell levels of E. coli.
The approach we use is based on stochastic control analysis (SCA) [13], [12], a body of theory we developed and reported in previous publications. SCA is a sensitivity analysis framework, that is a direct extension of metabolic control analysis [14], [15] to the stochastic regime [13]. This approach is based on a local sensitivity analysis that can be applied to study first-order effects of finite-size perturbations. SCA can identify which parameters in stochastic systems -here, gene regulatory circuits -need to be varied by how much to achieve a desired control aim. This includes orthogonal control of noise levels with respect to mean levels, and simultaneous changes in noise and mean levels in the same or opposite directions for the same or different protein species. SCA can provide control efficiency and strength to identify the most effective control schemes that are experimentally relevant [12]. Here, we apply SCA experimentally to E. coli genetic systems.
In this paper, gene circuits are encoded on plasmid backbones, which are transformed into E. coli MG1655. The circuits express green fluorescent proteins (GFP) under the lacpromoter. We perturbed the expression system by inducing the promoter with isopropyl β-D-1-thiogalactopyranoside (IPTG) and using a library of ribosome binding sites (RBS). We found that by taking into account stochastic fluctuations in autofluorescence, scaling relationship between GFP signal noise and mean levels can be extended to weak signal regions, where autofluorescence becomes moderately strong. This implies that when fluorescent signals are not strong enough compared to autofluorescence, stochasticity in autofluorescence can be systematically taken into account to characterize cellular systems. In addition, we aimed to understand what the major sources of GFP signal noise are by investigating the scaling relationship between the GFP signal noise and mean levels via promoter induction and RBS perturbation. We found that one of the major noise sources is bursty translation.
II. STOCHASTIC CONTROL ANALYSIS: REVIEW SCA [12], [13] is a local sensitivity analysis based on control coefficients, which are defined approximately as percentage change in a response signal (y) divided by the percentage change in a system parameter (p): We note that the slope in the log-log plot of y vs. p corresponds to C y p . The response signal can be the mean or noise levels of mRNAs or proteins. The parameters can include transcription and translation efficiencies, degradation rates of mRNAs and proteins, dilution rate due to cell growth, and reaction rates of transcription-factor binding and unbinding from promoter regions, etc. Another important quantity in SCA, is the control vector, each element of which corresponds to a control coefficient for a given response signal (y): where N defines the number of parameters (dimension of the parameter space) that will be varied to control the value of y. In this paper, we are mostly interested in dual control of transcription and translation efficiencies, i.e., N = 2. One of the important properties of the control vector is that its innerproduct with a parameter perturbation vector δp becomes the amount of change in the response signal δy, We can quantify which parameter value, and by how much it should be controlled to achieve specific control aims. For example, consider a case where the noise level of a protein needs to be reduced by 9%, while its mean level should remain the same. Here, the noise level (n) is defined by the variance divided by the squared mean value, i.e., squared coefficient of variation. Two control vectors for the noise and mean levels, C n p and C m p , need to be computed based on a given mathematical model. System parameters need to be perturbed while satisfying The perturbation vector δp/p satisfying these two equations can be solved, but the solutions can be infinite. In that case, it is important to select the optimal control scheme (perturbation vector) among the possible solutions. For this, the control efficiency and strength were introduced [12]. Based on these two quantities, one can choose desired control schemes that are appropriate to systems of interest with the maximum control strength and/or efficiency.

III. SCA FOR A SINGLE GENE EXPRESSION CASSETTE
We constructed plasmid expression systems that express green fluorescent protein (GFP) under lac-promoters in E. coli (Fig. 1A). The plasmid copy number in a single cell fluctuates in time because a set of plasmids are randomly partitioned during cell division and are synthesized in a stochastic fashion. Thus, the copy number of lac-promoters per cell fluctuates. For simplicity, we will assume that the plasmid copy number is tightly controlled, i.e., constant at the first level of approximation. The total number of plac will be the sum of the number of inactive and active lac-promoters (Fig. 1B), which will be set to a constant, N p . We call this the two-state model. The plasmid backbone that we used is pGA3K3 with the replication origin, p15A (N p = 10 − 30). Based on this two-state model, we computed control vectors for the mean and noise levels of GFP fluorescence as shown in Fig. 1C (refer to the Materials and Methods and [12] for the control vector computation).
The computed control coefficients show that noise can be controlled efficiently by varying k on , k of f , α m or γ p ; C n kon = C n αm = −1.00, C n k of f = 1.00 and C n γp = 0.97, indicating that, for example, with an increase in α m by 10%, n will reduce by 10% (this is a first-order approximation, because control coefficients are defined locally). Similarly, with an increase in γ p by 10%, n will increase by 9.7%. For the mean level (m), any model parameter will efficiently change m, because the absolute values of all the control coefficients for m are equal to one.

IV. MEAN LEVEL CONTROL
From the computed control coefficients, the mean protein levels can be controlled without changing the noise level (with a minor change, ∼10 folds less than the change in the mean levels) by varying either α p or γ m . To confirm this theoretical prediction, we changed the translation efficiency α p by using a library of both ribosome binding sites (RBSs) and spacer sequences as shown in Fig. 2. Among them, four different spacers -TACTAG, AAAAAA=(A) 6 , (A) 10 , and (A) 13 -that are placed between B0034 and the start codon showed distinct GFP expression levels when plac is fully active ([IPTG] = 1 mM). Here, the introduced spacer sequences are presumed to change ribosome binding affinity, in particular, translation initiation -the limiting step for a translation rate [23], [24]. We note that strong RBSs can recruit many ribosomes to mRNAs, causing an implication depending on the availability of ribosomes. This can apply an upper limit in the value of α p .
Based on our flow cytometry data, the mean level was successfully varied by using different spacer sequences as shown in Fig. 3 and Fig. 4A. We compared three different cases: Points A, B, and C in Fig. 3, corresponding to [IPTG]=1 mM. As shown in Fig. 4A, the rescaled probability density functions (pdfs) were overlapped with a minor discrepancy. This scale invariance confirms that the noise levels of all the points are the same.
Scale invariance in the gamma distribution: Furthermore, the observed invariance implies a special property that we need to consider carefully. This invariance property is satisfied by the gamma distribution function as shown in the Materials and Method section when the burst size is rescaled together. This implies that the difference between the system parameters of Points A, B, and C is only the burst size. For these Points, different spacer sequences were used between B0034 and the start codon, while the lac-promoter was fully induced (saturated). Thus, the translation rate constant α p is expected to be varied for these three Points and the burst size must be closely related to α p , which is consistent with theoretical prediction based on our model (Eq. (3)). This result supports that the observed distribution functions are the gamma distributions (confer to [25], [26] about claims for other types of distribution functions). To confirm this, we fit the GFP pdfs to the gamma distribution functions as shown in Fig. 5. We confirmed that the pdfs follow the gamma distributions well.

V. NOISE LEVEL CONTROL
As discussed above, the noise level can be efficiently controlled by varying k on , k of f , α m and γ p . However, when these parameters are changed, the mean level also changes with the same fold difference but in the opposite direction; for example, in Fig. 1C, C n αm and C m αm are −1.00 and 1.00, meaning that when α m is increased by x%, the noise level decreases by x%, while the mean level increases by x%. Thus, to change the noise level without changing the mean level, we must vary at least two different parameters simultaneously.
Since the mean level can be controlled almost independently of the noise level by changing α p , we will vary α p along with one of the parameters in {k on , k of f , α m } to compensate for the change. The reason that we did not choose to vary γ p is that this parameter is highly dependent on cell growth rate, rather than protein degradation in E. coli; GFP lifetime is much longer than the cell doubling time ∼ 1 hr in M9 media.
Based on the SCA, an individual change in k on , k of f , and α m and any combination of the individual changes can vary the noise and mean levels while satisfying the same  scaling law: n = c/m with c a constant (not varied). We note that the ratio of control coefficients for n and m for a given parameter, e.g., k on , has a graphical meaning: In Fig. 3, when [IPTG] is varied, the corresponding data point shifts (e.g., Point C → D) and the slope of the shift in the loglog plot corresponds to the ratio of the control coefficients:  the same: This implies that the directions of data point shifts in the log-log plot of n vs. m are identical for each individual perturbation of k on , k of f , and α m , and thus for any combination of these three individual perturbations. Therefore, the shift of data points with the slope of −1, observed when varying IPTG concentrations as shown in Fig. 3, cannot determine which parameters among k on , k of f , and α m were affected by IPTG concentration changes. We note that in [21] promoter perturbations in E. coli was claimed to affect k of f only.
As shown in Fig. 3, by using a library of RBS as well as different concentrations of IPTG, the noise level was controlled and ∼ 10 fold change in the noise level was achieved without changing the mean level. What is the biological reason that noise can be increased in this way? In other words, what causes to increase the value of the Fano factor? In the scaling relationship, n = c/m, c is the Fano factor, which is expressed for the case of E. coli (Materials and Methods): b is the translational burst size, quatifying the number of proteins that are synthesized from a single mRNA during the mRNA lifetime (1/γ m ). b m is the transcriptional burst size, quantifying the number of mRNA that are synthesized per plasmid during the time-scale (1/(k on + k of f )) of gene switching (refer to the Materials and Methods). The Fano factor depends on both transcriptional and translational bursts. In our case of lacI q expression of LacI, we can neglect the transcriptional burst (Materials and Methods). Thus, the Fano factor becomes c = 1 + b, and n can be expressed as The Fano factor can be increased by applying stronger translation efficiencies (from Point C to A) and remains the same by decreasing [IPTG] (from Point A to E), leading to the increase in the noise level without changing the mean level. The translational bursts lead to longer-tail pdfs, more precisely, higher cutoff values in the pdfs (in the gamma distribution, there is an exponential factor e −x/b and b acts as a cutoff value): Figure 4B shows that a longer tail in the GFP pdf can be generated by using stronger translation efficiency (Point D → Point E).
Another interpretation for the observed longer tail in Fig. 4B is that the major source of fluctuations in the protein copy numbers is in mRNA copy numbers, which merely get amplified by the translation rate in a non-bursty way: The variance in protein expression levels becomes Variance(N pr ) ∼ (α p /γ p ) 2 Variance(N rna ), resulting in that the noise level does not depend on α p : Since the protein mean level m must be proportional to the translation rate constant α p (i.e., m = βα p with β a constant), we obtain again similar scaling relationship: The Fano factor again increases with α p . Therefore, based on the scaling relationship alone, it is difficult to differentiate whether the translation is bursty or not. VI. TRANSLATION IS BURSTY. We claim that translation processes are bursty. The data points in bold in Fig. 6 correspond to different RBS strength but the same level of [IPTG] equal to 1 mM, where the lacpromoter becomes constitutively active. The noise values for TACTAG, (A) 6 , and (A) 10 were similar, but the noise level for (A) 13 was higher than the rest. This difference cannot be explained in the non-bursty translation scenario, because n should be independent of α p , i.e., RBS strength. In the bursty translation scenario, the noise level can be dependent on the value of α p , especially when the value of α p is similar to that of γ m . Since m is proportional to α p (m = βα p ), Eq. (1) becomes For a strong RBS such as the TACTAG case, α p can be roughly around 1400 hr −1 (Materials and Methods). In this case, α p /γ m ∼ 1400/30 47, i.e. much larger than 1. Thus, Eq. (2) becomes n αp/γm βαp = 1 γmβ , resuling in that n is independent of α p , which is what we observed from Point A to C. For the case of (A) 13 (Point F), the RBS strength is reduced by ∼ 60 times (by comparing the mean levels of Point A and F) and α p /γ m 47/60 0.8. Thus, n becomes dependent on α p (Eq. (2)). As α p decreases, n increases. This is consistent with our observation. Figure 7 shows that the scaling relationship n = c/m can be observed after autofluorescence was systematically removed, even for the small mean value region, where the autofluorescence interferes with the true GFP signals. We took into account the stochasticity in autofluorescence and assumed that the fluctuations in the autofluorescence signals are statistically independent of the true GFP signals. Under this A.

MEAN LEVELS
B. assumption, we compensated for the autofluorescence effect in two different ways: (1) direct noise level correction (red squares in Fig. 7) and (2) fluorescence histogram correction (blue triangles in Fig. 7; an example is presented in Fig. 8) (Materials and Methods). This implies that it is important to take into account the stochasticity of the autofluorescence signals when characterizing the systems by using the pdfs of fluorescence signals.
Fluorescence (AU) alpha = 3 Fig. 8. Autofluorescence compensation: The green dots correspond to autofluorescence (IPTG=0 case), the blue dots to the measured GFP signals, and the red dots to the optimized solution S, i.e., the pdf of the true GFP signals. The black line is to verify the optimized solution S can generate the measured GFP pdfs via convolution (refer to the Materials and Methods).

VIII. CONCLUSIONS
In summary, we perturbed the strength of ribosome binding sites and investigated scaling relationship between the mean and noise levels of the expressed proteins. We confirmed that translational bursts are one of the important sources of noise at the protein level by using our numerical sensitivity analysis method, SCA, and the analytical structure of noise propagation. To investigate the scaling relationship further in detail, we compensated the effect of autofluorescence by taking into account stochasticity in the autofluorescence and recovered the expected scaling relationship even when autofluorescence becomes moderately strong. This shows that the autofluorescence can be systematically removed and its compensation can be applied to characterize cellular systems.

A. GFP expression circuits and strains
All genetic components used in this manuscript are BioBrick parts, from which genetic circuits were constructed by using the Gibson assembly method [30]. The constructed circuits were integrated into a low-to-medium copy number plasmid pGA3K3 with a Kanamycin resistance gene and Escherichia coli MG1655 Z1 was transformed with the plasmids. The strain (lacI q ) constitutively overexpresses LacI from its chromosome.

B. Cell Growth and Flow Cytometry Measurements
E. coli strains were grown to OD600∼0.2 in 2 mL Luria-Bertani (LB) media (Becton Dickinson) with kanamycin 50 µg/mL at 37 • C and 300rpm in a shaker. The cultures were diluted 1:200 into 200 µL prewarmed fresh M9 media (Teknova 2M1990) in 96-well plates (Costar 3904) with kanamycin 50 µg/mL. 12 different IPTG concentrations (0 mM, 0.02∼1 mM) were used for each well (refer to the Supplementary Notes for more detailed information on IPTG concentrations) and grown to OD600=0.3-0.4 in a shaker (37 • C, 300 rpm). For the flow cytometry measurements, the grown cultures were diluted 1:4 in 1xPBS. A Sony Biotechnology ec800 flow cytometer was used with a 525 nm filter and a 488 nm excitation laser for GFP fluorescence. 100,000 events were collected for each sample and gated by using a 2-D normal distribution (Bioconductor flowCore norm2filter function with scale.factor=1) [31] within the R software environment as well as by using python package FlowCytometryTools (http://gorelab.bitbucket.org/flowcytometrytools/#). To prevent well-well contamination we executed a Medium Flush cycle after each sample well. When computing the mean and noise levels of GFP signals, background fluorescence was removed by using the mean and noise levels of GFP signals, or the signal histogram for the case without IPTG for each different gene circuit.

C. Mathematical Model
A two-state model [27], [16], [28] is introduced to describe active and inactive states of a promoter along with transcription and translation processes: where N i denotes the number of inactive promoters, N a that of active promoters, N rna the RNA copy number, and N pr the protein copy number. All the above reaction events are generated stochastically. The noise level, n, of N pr can be analytically solved (refer to the Supplementary Notes of [27] for all the detailed derivation): with m denoting the mean value of N pr and the Fano factor c is expressed as Here, we assumed that the protein lifetime τ p , defined by 1/γ p , is much larger than the mRNA lifetime τ m . We note that control coefficients for n can be computed from this equation analytically. We assume that the inactive and active promoter states switch back and forth many times during the protein lifetime (k on + k of f γ p ). We believe this is the case of our experiments and refer to the parameter value estimation described below. In this case, the above equation can be further approximated: Here, 1/(k on + k of f ) is the time scale of the promoter state switching and k of f /(k on + k of f ) is a suppression weight because the promoter state follows the binomial distribution (due to the fact that the total promoter number is constant) instead of the Poisson distribution. Thus, the second term in the parenthesis can be considered as a transcriptional burst size b m . In our case, α m /(k on + k of f ) ∼ 160/(50 + 51000) 0.003. Thus, c can be further simplified: implying that the change in the IPTG concentration has no effect on the Fano factor, c, thus moving along the line of slope −1 as shown in Fig. 3.

D. Model parameter estimation
Transcription rate constant, α m = 160 hr −1 : The lacpromoter strength, when fully induced with IPTG, was shown ∼1.5 time stronger than J23101 by directly measuring the transcript levels with our malachite-green aptamer probes (refer to Figure 6.3 of [22]). For J23101, α m was estimated at 0.03 sec −1 =110 hr −1 [32]. Thus, α m for our lac-promoter can be estimated at 160 hr −1 .
Gene inactivation, N a k of f Na − −−−− → N i : The number of the inactive promoters is denoted by N i and that of the active promoters, N a . The sum of N i and N a is equal to the copy number of the plasmids, N p (considering that one lacpromoter is included per plasmid). Here, we used N p ∼ 10 (http://parts.igem.org/Part:pSB3K3; pGA3K3 is a variant of pSB3K3). k of f is related to the search time for LacI to find lac-promoter. When there exist one LacI molecule and one lac-promoter within an E. coli cell, the search time is less than 6 min = 0.1 hr [21]. In the case of N a unoccupied lac-promoters and N lacI copies of LacI, the search time becomes 0.1/N lacI N a , which is equal to the inverse of the inactivation rate (1/k of f N a ). Therefore, k of f is estimated to be 10N lacI hr −1 . N lacI can be roughly estimated from the fact that the strength of the lacI q is similar to the promoter J23101 (α m of J23101 is 110 hr −1 ) [32]. Thus, the genomic expression level of LacI, N lacI , becomes α p [mRNA] lacI /γ p = α p α m /γ p γ m = 1400 · 110/1 · 30 5100. Thus, k of f can be roughly estimated as 5.1 × 10 4 hr −1 .
Gene activation, N i konNi − −−− → N a : The activation rate constant k on is related to how fast the genomically-expressed LacI detached from its specific promoter plac (BBa R0010). Considering that the dissociation constant is in the range of 0.1 − 1 pM = 10 −4 − 10 −3 (copy number unit; here we used 1 nM corresponds to roughly 1 molecule number in the volume of E. coli) [34], [35], k on can be in the range of k of f × (10 −4 − 10 −3 ) = 5.1 − 51 hr −1 .

E. Noise level correction
The mean level was corrected with a simple subtraction. The noise level was corrected by using the property that the observed variance (Variance o ) is the sum of the GFP variance (Variance g ) and the background signal variance (Variance b ) under the assumption that the GFP signals are statistically independent of the background signals. More precisely, the noise level of GFP signals, defined by the square coefficient of variation, can be obtained by where the subscripts o and b denote observed and background signals, respectively.

F. Fluorescence histogram correction
The effect of autofluorescence was removed from the GFP signal histogram, more precisely probability mass function (pmf), by assuming that the autofluorescence is statistically independent of the true GFP signals [36]. Under this assumption, the pmf of the measured GFP signals, T (ν), is related to both the autofluorescence pmf C(ν) and the true GFP signal pmf S(ν) via convolution: S(ν) is obtained by minimizing the fitness function: Here, a is the value of ν beyond which T (ν) is essentially zero, and in our study, we used the entire range of pmf. S(ν) is a regularization term, defined as  where g 0 and g 1 are positive regularization constants. The optimized solution of S(ν) is obtained by following the procedure described below.
1) Remove background noise that is equipment-specific.
Fluorescence signals of strength 0 and 1 (Sony Biotechnology ec800) were considered as background noise and removed. Then, T (ν) (for the induction case of [IPTG] > 0) and C(ν) (for the case of [IPTG]=0) were computed from the fluorescence signals using 1000 equal-width bins to obtain individual bin-sizes. Here, the bin-size of T is larger than that of C. 2) To compute the convolution, we will set the bin-size of C equal to that of T . Compute C again from the raw data using the bin-size of T , and append an array of zero at the end of C to make the total bin number equal to 1000. 3) Set the initial values of S equal to T . 4) Generate two different random numbers ν 1 and ν 2 in the range of [0, 999]. S(ν 1 ) and S(ν 2 ) were added and subtracted, respectively, by a constant dS = 10 −6 : S(ν 1 ) → S(ν 1 ) + dS and S(ν 2 ) → S(ν 2 ) − dS. When S(ν 2 ) − dS is less than zero, set S(ν 1 ) equal to S(ν 1 ) + S(ν 2 ) and then S(ν 2 ) equal to 0. In this way, the new S is automatically re-normalized and guaranteed to be non-negative. 5) Compute M α . If M α decreases, we accept the change and, otherwise, reject it and revert S(ν) to the old S values before the update. 6) Repeat the steps 4 and 5 until M α converges and compare ν 0 C(ν − ν )S op (ν )dν and T (ν), where S op is the obtained optimized solution of S. If S op (ν) shows oscillation, reduce the value of α while rebalancing g 0 and g 1 and go back to the step 4. If S op is noisy, increase the value of α while rebalancing g 0 and g 1 and go back to the step 4.
The constants used for the optimization are listed in Table I

G. Nonlinear Regression
The gamma distribution function was used to fit our flow cytometry data. Protein copy number N pr can be converted to fluorescence signal intensity x: N pr = c s x with c s a scaling constant. The gamma distribution function can be rescaled: Here, Γ is a gamma function, and is the number of mRNA produced per cell doubling time, called burst frequency with N a the number of active promoters, and b ≡ α p γ m is the number of proteins produced during the mRNA lifetime, called burst size. Therefore, the fluorescence intensity should also follow the gamma distribution if its corresponding copy number follows the gamma distribution, with the burst size rescaled with c s . Nonlinear regression was carried by using the Scipy curve fit function (http://www.scipy.org/), which employs the Levenberg-Marquardt algorithm for the least squares fitting to estimate a and b.
ACKNOWLEDGMENT This work was supported by the National Science Foundations (NSF MCB 1158573).

Autofluorescence compensation in fluorescence histograms
Depending on the value of the regularization constant α, the optimized solution S(ν) can show oscillation. In this section, we provide its sample pictures. There is a trade-off between the strength of noise in S and the fitting accuracy (comparison between the black and the blue dots in Fig. 2).
The python code used for the compensation is provided below.