Systematic Dissection of Sequence Elements Controlling σ70 Promoters Using a Genomically-Encoded Multiplexed Reporter Assay in E. coli

Promoters are the key drivers of gene expression and are largely responsible for the regulation of cellular responses to time and environment. In E. coli, decades of studies have revealed most, if not all, of the sequence elements necessary to encode promoter function. Despite our knowledge of these motifs, it is still not possible to predict the strength and regulation of a promoter from primary sequence alone. Here we develop a novel multiplexed assay to study promoter function in E. coli by building a site-specific genomic recombination-mediated cassette exchange (RMCE) system that allows for the facile construction and testing of large libraries of genetic designs integrated into precise genomic locations. We build and test a library of 10,860 σ70 promoter variants consisting of all combinations of a set of eight −35 elements, eight −10 elements, three UP elements, eight spacers, and eight backgrounds. We find that the −35 and −10 sequence elements can explain approximately 73% of the variance in promoter strength using a simple log-linear model. Using a simple neural network, we can build models that explain greater than 95% of the variance, and show the increased power is due to nonlinear interactions of the other elements such as the spacer, background, and UP elements.


Introduction
Promoters are the key regulators of gene expression and largely control developmental and environmental responses in all living organisms [1][2][3] . Decades of studies on bacterial and phage promoters have elucidated most if not all essential proteins and basic sequence motifs necessary for initiating transcription. In E. coli , transcription requires the polymerase holoenzyme which consists of a core set of five subunits, as well as one of seven known sigma factors [4][5][6] . Sigma factors provide much of the sequence specificity for bacterial promoters, and the prevalence of each factor varies based on environmental conditions 7,8 . Under standard growth conditions, most active promoters are transcribed by σ70 5 . σ70-dependent promoters are composed of discrete sequence elements that cooperatively determine expression, including two conserved hexamers centered -10 and -35 bases upstream of the transcription start site that directly interact with the σ70 subunit 9 . Other sequence elements are known to be important including the sequence content and size of the spacer region between the -10 and -35 10 , an UP element upstream of the -35 that can anchor the RNA polymerase (RNAP) α-subunit to enhance promoter recognition 11 , and the local sequence context surrounding these elements 12 .
Despite the apparent simplicity of the process and decades of genetic and biochemical dissection, we still lack answers to basic questions surrounding bacterial transcription. For example, given an arbitrary sequence, we largely do not have the ability to know (1) if it is a promoter, (2) its strength, and (3) its regulation. Thus far most approaches try to understand the relationships between sequence elements that comprise the full promoter using reverse genetic approaches where they characterize multiple variants of an element in a single promoter context 10,13 . Although these studies have revealed the contributions of individual sequence elements, the effects of these variants are often inconsistent between promoters, conceivably due to higher order relationships between sequence elements 14 . Conceptually, a simple way to tease apart these relationships is to test a wide variety of element combinations across a variety of backgrounds. However, increasing the number of element combinations and backgrounds quickly surpasses the number of constructs that can be practically tested using traditional means. Developing novel approaches to test vastly more designs will help us understand the behavior of sequence elements across different contexts, and more broadly allow us to explore relationships between promoter sequence and function.
Massively Parallel Reporter Assays (MPRAs) are a new class of experiments that test large numbers of designed genetic variants for functional activity 15,16 . We and others have used MPRAs to quantify expression of large promoter or enhancer libraries for both transcriptional and translational activities [17][18][19][20][21][22][23] . However, there are several limitations when using these systems to study promoter function. First, since many current systems rely on flow cytometry, reporter expression levels must be relatively high to detect signal, limiting the quantitative range of promoter studies. Second, these systems often measure protein production rather than RNA, making it difficult to decouple transcriptional and translational processes. Third, these previous systems are universally encoded on plasmids to increase signal, but this leads to issues when trying to understand promoter function. In bacteria, plasmids exist at variable copy number, which can both contribute to expression noise 24 and saturate endogenous transcriptional cofactors 25 . Fourth, libraries of synthesized oligos characterized in these assays contain considerable amounts of sequence errors, and in certain MPRAs imperfect oligos cannot be distinguished from perfect sequences. Finally, the protocols for many MPRAs are laborious and require access to flow-sorting facilities, restricting their widespread use.
Here we present a new MPRA system for studying promoter function in E. coli . To do this, we develop a new high-efficiency, site-specific genomic recombination-mediated cassette exchange (RMCE) system capable of integrating large libraries of genetic designs into precise genomic locations. We then build a reporter system capable of exploring transcriptional activity using RNASeq while maintaining our ability to differentiate perfect from imperfect reporter sequences. We use this new MPRA system to integrate over 300,000 reporters and measure expression of 10,860 σ70 promoter variants, which are specifically designed to explore the relationships between the -10, -35, UP, and spacer elements across different sequence backgrounds. We show that this genomically-based reporter assay achieves robust, quantitative measurements of promoter strength and use these measurements to develop models that predict promoter strength based on sequence element composition. Furthermore, we leverage the insight gained from these models to identify and dissect higher order interactions between σ70 sequence elements.

Results & Discussion
Design and Testing of High-efficiency Genomic RMCE.
We designed our MPRA system to be genomically encoded at a defined genomic loci while at the same time allowing for easy construction of large libraries. To this end, we first used lambda-red recombination to insert an engineered landing pad into a target locus 26 ( Figure S1). This landing pad contains an engineered operon encoding both red fluorescence (mCherry) and chloramphenicol resistance ( cat R ). The operon is flanked by the mutant loxP sites, loxm2/66 and lox71 [27][28][29] , which allow for the subsequent cassette exchange with a vector containing complementary lox sites, loxm2/71 and lox66 . Lox66 and lox71 sites are capable of undergoing Cre-mediated cassette exchange, and their recombination irreversibly produces the inactive lox site, lox72. Furthermore, the m2 mutation alters the spacer sequence to make them incompatible with natural spacer sequences, thereby preventing cis-recombination events in our system 29 . We used six intergenic loci distributed at different distances and orientations from the origin of replication that were previously identified as potential landing pad insertion sites 26 . To direct RMCE to the landing pad, we designed and constructed an integration vector composed of an arabinose-inducible Cre recombinase 30 , a temperature-sensitive origin-of-replication 31 , and a modular payload flanked by loxm2/71 and lox66 sites complementary to those in the landing pad ( Figure S1). We can replace the landing pad cassette with our engineered design by transforming the integration vector into a strain engineered with the landing pad, inducing Cre-mediated recombination, and selecting for integrated cells while simultaneously removing unintegrated plasmid with heat-curing. We used flow cytometry to track integration of an sfGFP fluorescent marker into the nth-ydgR locus at high efficiency ( Figure 1A). Initially, cells only express mCherry, but upon introduction of a constitutive donor plasmid as a library, and induction of Cre recombinase, we find that almost 2/3rds of the cellular population undergo cassette exchange. Since the donor cassette also includes a resistance marker, when we subsequently apply selection 94.3% of the population contained the reporter and had lost expression genomic mCherry indicating proper cassette exchange.
To evaluate potential location-specific effects on expression, we measured mCherry expression from different integrated loci within the E. coli genome using flow cytometry ( Figure  1B). We used six previously described location with two flanking the origin of replication ( atpI-gidB , yieN-trkD ), two located midreplichore ( ybbD-ylbG , essQ-cspB ), and two near the terminus ( nth-ydgR , ygcE-ygcF ). All six pairs are located on opposite sides of the genome, and face the direction of DNA replication. We found that the two sites near the origin varied greatly, with the atpI-gidB locus having 426% higher median expression than the yieN-trkD locus ( Figure  1B). This may be due to differences in copy number around the origin 32 or collisions with replication machinery 33 . The pairs of mid-replichore and termini loci only varied by 4.2% and 4.6% between each other respectively, and the median expression was approximately 27% higher at the termini location as compared to the mid-replichore locations. Finally, we tested expression of landing pads engineered in both orientations at the nth-ydgR locus and observed little difference in expression as has been previously observed 32,34 ( Figure 1C).

Reporter and Library Design, Construction and Testing.
To test the transcriptional output of large libraries of reporters in multiplex we designed and built a reporter construct. The final RMCE donor cassette ( Figure S1) contains the promoter to be tested, a RiboJ self-cleaving ribozyme sequence 35 , and an sfGFP reporter with a random 20nt barcode in the 3' UTR that uniquely identifies the promoter variant followed by a transcriptional terminator 36 . The RiboJ sequence standardizes the 5' UTR of the reporter, decoupling transcriptional activity from any potential stability effects different UTR regions might have 35 . Immediately downstream, a constitutive promoter drives expression of the kanamycin resistance gene ( kan R ), allowing for selection of the RMCE donor cassette. The entire RMCE donor cassette is flanked by transcriptional terminators that isolate the reporter from local transcription events that might occur outside the reporter cassette. The barcodes are constructed by first amplifying the promoter library with a primer that adds a random 20nt barcode downstream. We subsequently clone the library of barcoded promoters into the RMCE donor plasmid and we use paired-end, next-generation sequencing to map the relationships between barcodes and variants. This approach allows us to identify promoters that contain sequence errors so that we may filter them out of downstream analyses. Finally, we clone the constant RiboJ::sfGFP sequence between the promoter and barcode. We engineered several other aspects of this cassette including sequences to allow for high-efficiency cloning, facile reverse transcription of the barcode, and the subsequent mRNA and DNA sequencing.
We designed a library of 12,288 σ70 promoters to explore every possible combination of a set of 3 UP elements, 8 -35 regions, 8 spacer sequences, 8 -10 regions, and 8 background sequences (Figure 2A). The UP 13 , -35, and -10 elements 37 selected span a large range of previously characterized activities. We designed the spacer sequences to have variable GC content and flexibility, which have been shown to influence promoter expression 10 . We extracted the backgrounds from non-promoter, intergenic regions of the E. coli genome that vary in GC content. Finally, we included 470 negative controls, which are intergenic regions that appear to be transcriptionally quiescent in RNA-Seq studies [38][39][40] .
This library was synthesized, cloned, and integrated the library using this RMCE method into the nth-ydgR locus in the same direction as the DNA replication ( Figure 2B). We chose this locus due to its proximity to the replication terminus which is present at a lower copy-number in rapidly dividing E. coli 41,42 . During the barcode mapping stage, we found 351,275 unique promoter-barcode combinations. After RMCE, we detect 312,053 (89%) of those barcodes using RNA and DNA-Seq. We also did not see large distortions in the overall distribution of barcodes per variant or the number of DNA reads per barcode before and after integration ( Figure 2C, S2).
To measure the expression of each promoter, we grew the library to exponential phase before extracting and sequencing both RNA and DNA barcodes. To account for differences in the abundance of each barcoded promoter, we calculated expression by normalizing the number of RNA counts to the number of DNA counts for all barcodes mapped to a single promoter ( Figure 2D). Expression of our promoter variants spanned a 100-fold range and was highly consistent between technical replicates (R 2 =.998, p < 2.2 x 10 -16 ) ( Figure 2E). In addition, we observed a predictable segregation between our negative controls and promoter variants containing consensus -10 and -35 elements. We find a large spread in promoter activity even amongst those promoters containing consensus -10 and -35 sequences, with the strongest promoter containing consensus -10 and -35 regions having 32.9-fold higher activity than the weakest ( Figure 3A). In general, for all the data we see strong trends of closer to consensus -10 and -35 regions generally increases transcription strength of the promoters ( Figure 3B). However, there are many exceptions and the variance between promoters with identical UP, -10, and -35 regions can vary dramatically depending upon the the spacer and background sequences.

Promoter Activities can be Predicted by Sequence Element Combinations.
Our approach provides us with a unique, large-scale training set of robust, quantitative measurements of promoter strength. We used this data to learn a model predictive of promoter expression based on combinations of sequence element variants. We first trained a multiple linear model on 50% of the data with the identities of the -10, -35, UP, Spacer, and Background as categorical variables and achieved an R 2 of 0.409 (p < 2.2 x 10 -16 ) on the remaining data ( Figure S4A). Based on previous studies showing interaction between the -10 and -35 regions 14 , we also added an interaction term between the -10 and -35 region, and the R 2 increased to 0.623 ( p < 2.2 x 10 -16 ) ( Figure S4B). Previous studies of σ70 element interactions have shown that these elements primarily modulate expression in a multiplicative manner 43 , consistent with simple thermodynamic binding models of the -10 and -35 regions with RNA polymerase 37 . Therefore, we hypothesized a multiple linear regression on log-transformed data may adequately capture the relationship between promoter element composition and their resulting expression. Indeed, the log-linear models worked better both with (R 2 = 0.797, p < 2.2 x 10 -16 ) and without the -10 and -35 interaction term (R 2 = 0.604, p < 2.2 x 10 -16 ) ( Figures 4A & S4C). ANOVA revealed that a vast majority of expression variance (73.1%) and the model's power (91%) could be explained solely by the identity of the -10 and -35 elements and the interactions between them ( Figure 4B). The remaining terms added only 7.2% of the variance explained, with approximately 19% of the variation remaining unexplained. While this may indicate these other elements affect promoter activity very little, the overall dataset shows some patterns and the unexplained variance may be due to more complex cooperative relationships between elements.
To address the possibility for more complex nonlinear interactions, we implemented a simple neural network model ( Figure S5) with the hypothesis that these types of networks may pull out more subtle effects of element combinations. When trained on 50% of our data, the model was able to explain 95% ( p < 2.2 x 10 -16 ) of the variance ( Figure 4C). Surprisingly, by training on various proportions of our data we found that this neural network would predict over 93.6% when trained on 20% of the data ( Figure 4D), and even training on 5% of the data produced models on par with our log-linear models ( Figure S4D). The success of these models confirmed our suspicions that while the -10 and -35 regions behave in a fairly predictable manner, the remaining unexplained variance is likely due to more complex relationships than can be easily modeled linearly.

Identifying Complex Interactions Between Sequence Elements
Because of the difficulty in interpreting neural networks, we focused on specific combinations of elements to better understand these more complex relationships. Based on our results from the linear model, we analyzed the relationship between -10 and -35 variants. A clear trend emerged where the overall expression of the library increased as the -10 variant approached the consensus sequence ( Figure 5A). Despite this trend, the most active promoters consisted of variants with consensus -35 sequences whose -10 sequence deviated from the consensus by a single nucleotide. Pairwise analysis of all -10 and -35 combinations confirmed that for both elements, expression was highest when one, but not both elements matched the consensus ( Figures 3B & 5B).
In addition to the -10 and -35 elements, the UP element is another point of physical contact between RNAP and the promoter. Considering this, we postulated that addition of an UP element would serve a compensatory role for promoters with weak -10 and -35 elements. We observed a clear trend where weaker promoters received the greatest benefit upon addition of the consensus UP element ( Figure 5C). In addition, we found several cases where promoters decreased in expression upon receiving the UP element, and a majority of these promoters contained the consensus -10 variant ( Figure S6). It has been proposed that a strong UP element may decrease transcription of some promoters by inhibiting promoter escape 13,44 ,which may explain our observation. Despite the clear trend we observed, the consensus UP element had highly variable effects when added to different combinations of -10 and -35 elements ( Figure 5D). The weakest combinations of -10 and -35 elements did not receive the greatest increase in expression upon addition of the consensus UP element. Instead, the strongest non-consensus -10 and -35 elements had the greatest increase and this effect weakened as these core elements deviated further from the consensus. Also, addition of the consensus UP element enabled expression from promoters with otherwise inactive -35 variants, which has not been observed in the absence of an extended -10 motif ( Figure S7) 43,44 .
The background and spacer sequences have been suggested to influence promoter expression despite not directly interacting with RNAP. Although our log-linear model indicated that these elements do not play a large role in promoter expression, we explored relationships between these elements and those that directly interact with RNAP. We began by examining the role that promoter background sequence plays in promoter expression. We found modest differences in the distribution of expression for each background, and this appeared to be unrelated to background GC content ( Figure 6A). Surprisingly, expression of promoters with consensus -10 and -35 elements varied between backgrounds suggesting that there exists context-specific behavior amongst different compositions of core promoter elements. To investigate whether the background exhibited nonlinear interactions with promoter elements, we trained our neural network as before, but ignoring background. Although background sequence only accounted for .8% of the variance according to the log-linear model, the performance of the neural network trained on 50% of the data dropped notably (R 2 = .862, p < 2.2 x 10 -16 ) ( Figure  S8). This suggests that nonlinear interactions with background sequences contribute a considerable amount to overall promoter expression.
Finally, the spacer element partitions the -10 and -35 elements and has been suggested to contribute to expression through its GC nucleotide content. In agreement with previous studies 10 , we found that GC content in the -20 to -13 region of the spacer was negatively correlated with promoter expression ( r = -.716 , p = .046) ( Figure 6B). This may be due to reduced flexibility of the spacer inhibiting RNAP association or GC hydrogen-bonding impeding promoter melting. One spacer variant, ECK726, had a unique effect in which it could stimulate transcription amongst promoters with otherwise unviable -10 and -35 combinations ( Figure S9). Ultimately, we find evidence indicating that the -10 proximal region of the spacer influences promoter activity and there are likely sequence-specific effects involving particular segments or nucleotides within the spacer.

Conclusions
In summary, our results show that expression of σ70 promoters is primarily dictated by the quality of their -35 and -10 elements, yet nonlinear effects arise from interactions with other sequence elements. These findings highlight the complexity of promoter sequence-function relationships, but also show for at least this library, predictive models can be constructed with very high accuracy. However, what we do not show is how much closer we are to prediction of promoter strengths for arbitrary sequences. This is a much more difficult problems as full combinatorial exploration of all possible elements used in natural systems would be far larger than can ever be explored experimentally. However, systems such as those developed here will allow us to test hypotheses and models at unprecedented ease and scale.
The platform we developed here enables site-specific integration of large libraries of reporter constructs in E. coli . Though we integrated over 300,000 unique promoter barcode combinations, we expect the methods are easily scalable and compatible with interrogating millions of variants. Furthermore, this approach should work for any of the many model systems in which Cre-recombination is available and for a wide variety of reporter assay formats, especially those which require single variants per cell. Currently the process of cloning, integrating, and measuring the expression of large promoter libraries can be completed at a casual pace in six weeks and is amenable to rapid iteration. This platform should serve as an efficient and versatile means to conduct multiplexed reporter assays in the future to deconstruct the complex relationship between sequence and function.   We designed and constructed a σ70 promoter library using an oligonucleotide microarray, and cloned the library into a custom-made reporter construct. The reporter contains a promoter to be tested, a RiboJ self-cleaving ribozyme sequence to standardize the reporter 5' UTR, and an sfGFP coding sequence followed by a 20 nt barcode in the 3' UTR that identifies the promoter variant. The exchange cassette also includes a constitutive kanamycin resistance marker downstream of the reporter for selection purposes. B) Pooled promoters are uniquely barcoded using PCR, cloned into the exchange vector, and integrated into the E. coli nth-ydgR locus as a library. C) Pre-integration barcodes are identified during mapping stage and integrated barcodes are identified when quantifying promoter strength using RNAseq and DNAseq. We found 89% of the barcodes that were observed in the mapping stage (blue histogram), were later observed in the integrated library (red histogram), and the overall distributions remained similar. D) Expression of each promoter is calculated as the sum of all RNA counts divided by the sum of all DNA counts for all barcodes mapped to a given promoter. E) Promoter strength measurements are highly correlated (R 2 =.998, p < 2.2x10 -16 ) between technical replicates and discriminate between negative controls and promoters with consensus core elements.

Figure 3. Expression levels for thousands of promoters A)
We plot the expression of all the promoters containing consensus -10 and -35 elements we measured in the library ( red to blue is an estimated 100 fold decrease in measured expression). Each block of 48 squares displays six different backgrounds vertically against eight different spacer sequences horizontally. The three blocks represent the UP element choices used. We did not display two backgrounds for space and because they contained the most missing data ( grey ). The expression levels vary up to 32.9-fold based based on different background, spacer, and UP element choices. B) We plot expression of 3,072 promoters with the 136x UP element in blocks of 48 measurements (as in 3A), but now with all -10 (horizontal) and -35 (vertical) choices we measured in our assay. Expression generally increases as the -10 and -35 elements approach the consensus, yet like the consensus, there is variance amongst promoters with the same -10 and -35 elements. . C) We also trained a simple neural network model and found that the resultant predictions captured an estimated 95% of the promoter variance, indicating that these models are better able to capture more complex interactions between sequence elements. D) We trained the same neural network models with 10-fold cross-validation and show that we can effectively predict promoter expression when trained on as little as 5% of the data. In 4A, 4C, and 4D, R 2 is the coefficient of determination between predicted and actual expression values on the held-out datasets.

Landing Pad Generation
Landing pad strains were constructed using lambda-RED recombination in previously identified safe loci 26,45 . The landing pad construct ( Figure S1) was assembled and flanked with one of six pairs of homology arms, corresponding to each of the 6 landing pad locations. At the nth-ydgR locus, two sets of homology arms were used to generate landing pads in the forward and reverse orientation. The linear landing pad DNA was genomically engineered into E. coli MG1655 K.12 harboring pTKRED (Genbank GU327533) following the published protocol 26 , with the exception of chloramphenicol being used for selection of successful recombinants. The pTKRED plasmid was heat-cured from identified recombinants by growth at 42ºC on LB plates with chloramphenicol (34 ug/mL).

Plasmid construction
The integration vector, pLibacceptorV2 ( Figure S1), was designed to include three primary components:

1) A library cloning site containing a selectable marker and flanked by mutant loxP sites 2) An arabinose-inducible Cre-Recombinase 30 3) A heat-sensitive origin of replication 32
The library cloning site ( Figure S2B) was ordered from IDT as a G-Block. The arabinose-inducible Cre system was amplified from pARC8-Cre 30 and the temperature-sensitive origin of replication (tsORI) was amplified from pTKRED 26,30 . Fragments were assembled using a SGI-DNA Gibson Assembly® HiFi HC 1-Step MasterMix (# GA1100-4X10M).

Landing Pad Integration Demonstration Using Flow Cytometry
Landing pad integration in Figure 1A was demonstrated by integrating a constitutively expressed sfGFP into the E. coli nth-ydgR locus. First, a landing pad was engineered in the nth-ydgR locus of K12 MG1655 E. coli in the reverse orientation following the landing pad generation protocol detailed above. A constitutively expressed sfGFP was cloned using restriction ligation into the pLibacceptorV2 RMCE cassette and 100 ng of this plasmid was transformed into the aforementioned landing pad strain and grown overnight at 30 ºC in 100 mL of LB + kanamycin (25 ug/mL). The following day, 200 million cells (estimated by OD) were inoculated in 200 mL LB + kanamycin (25 ug/ml) + .2% (g/mL) Arabinose and grown at 30ºC for 24 hours. From the 24 hours induced culture, 400 million cells were inoculated into separate 80 mL of LB + kanamycin (25 ug/mL). Once culture was grown at 30 ºC for 2 hours (temperature permissive) and the other was grown at 42 ºC (temperature impermissive) for approximately 1.5 hours before both reached an OD 600 ≈ .5. Similarly, 10 uL of the landing pad strain grown overnight was seeded into 1 mL of LB + chloramphenicol (34 ug/mL) and grown at 30 ºC for 2 hours. After reaching OD 600 ≈ .5, each culture was placed at 4 ºC for 30 minutes before being diluted 1:100 in phosphate buffered saline pH 7.4 (Thermo Scientific # 10010023) and analyzed by flow cytometry using a BIO-RAD S3 Cell Sorter.

Landing Pad Location Effects on Expression
To evaluate whether landing pad choice affects expression of constitutive promoters, several landing pads ( Figure S1) were engineered into six loci previously characterized by Kuhlman T. and Cox E. 26 . To characterize expression at each landing pads, strains were grown overnight in 1 mL LB + chloramphenicol (34 ug/mL) at 37ºC. The following day, 10 uL of each culture was seeded into separate 1 mL LB + chloramphenicol (34 ug/mL) cultures and grown for 1.5 hours at 37 ºC before reaching OD 600 ≈ .5. Upon reaching OD 600 ≈ .5, cells were placed at 4 ºC for 30 minutes before being diluted 1:100 in 1 mL PBS and analyzed using flow cytometry.

Minimal Library Design
We designed a library of 12,288 σ70 promoters to explore every possible combination of a set of 3 UP elements, 8 -35 regions, 8 spacer sequences, 8 -10 regions, and 8 background sequences. A complete list of sequences is listed in Supplementary Table 1.
The UP elements were identified by a modification of the SELEX procedure to identify protein-binding sites 11 . Two elements were selected which increased transcription 136 and 326-fold in vivo relative to the natural rrnb P1 UP element. An additional null (zero length) UP element was used. We incorporated extra bases from the flanks of the background sequence to maintain constant length for the entire library. We used eight -35 regions and eight -10 regions that were previously designed to span a wide range of promoter activity for the E. coli lac promoter 37 . These motifs were designed based on "information footprints" detailing the contribution of each nucleotide at each position to promoter strength, learned from a library of approximately 200,000 lac promoters mutagenized in a 75 bp region containing the cAMP Receptor Protein and RNAP binding sites 18 .
Spacer sequences were designed to span a range of GC content and flexibility, which both have been shown to influence promoter expression. Flexibility was calculated based on trinucleotide parameters learned from nucleosome-binding data 46 . All spacers are 17 bp in length which is considered to be the optimal length for σ70-dependent promoters 10 .
Backgrounds were extracted from the non-promoter regions of the genome. We randomly selected eight 200 bp genomic regions that were at least 200 bp away from a transcription start site on either strand. In addition, 470 negative controls were included -intergenic regions that appear to be transcriptionally quiescent in RNA-seq studies [38][39][40] .
We synthesized 120 bp of upstream promoter region, 1 bp for the transcription start site (TSS) and 29 bp of the initial transcribed region (ITR) downstream of the TSS. Previous work has studied the preferred starting nucleotide and location relative to the -10 region 47 . Based on this, we used an "A" at the TSS and required 5 bp between the end of the -10 region and the TSS. The ITR is taken from the 200 bp background sequence and not a specific promoter.
All combinations of the elements described above were synthesized, in addition to the negative controls, resulting in 12,288 σ70 promoters. Relative motif location was maintained for each sequence. Several restriction enzymes and priming sites were added to the termini for library amplification and cloning. Any assembled promoter sequences containing these restriction enzymes were removed. Table 1. Minimal library design organized by sequence elements. The library consisted on 12,288 unique σ70 promoters that consisted of a set of all possible combinations for the following sequence elements: three UP elements, eight -35 regions, eight spacer sequences, eight -10 regions, and eight background sequences.

Minimal Library Cloning
The oligonucleotide library was constructed by Twist Biosciences and delivered lyophilized as a 26 pmol pool. The library was resuspended in 100 uL of TE pH 8.0 and 1 uL was amplified for 12 cycles using GU72 and GU116 with NEB Q5 High-Fidelity 2x Master Mix (# M0492L ). Unless otherwise stated, all amplifications were performed using this polymerase mixture. This product was then ran on a 2% TAE agarose gel and approximately 200 bp amplicons were extracted using a Zymoclean Gel DNA Recovery Kit (#D4008). For barcoding, 1 ng of this eluate was amplified for 10 cycles using primers GU72 and GU73. Following cleaning using a Zymo Clean and Concentrator Kit (#D40140), the library was digested using NEB's SbfI-HF and XhoI.
The plasmid backbone, pLibacceptorV2 was digested using SbfI-HF and SalI-HF with the addition of rSAP (NEB #M0371S). The digested library was ligated into pLibacceptorV2 using T7 DNA Ligase (NEB #M0318S), cloned into 5-alpha Electrocompetent E. coli (NEB #C2989K), and plated on LB + kanamycin (25 ug/mL) yielding approximately 1.1 million colonies estimated by plating concomitant dilution plates. After allowing for 24 hours of growth on plates, the library was scraped and resuspended in LB, and then 800 million cells (based on OD 600 ) were inoculated in 450 mL LB + kanamycin (25 ug/mL) overnight. Unless stated otherwise, all plasmids were isolated using a Qiagen Plasmid Plus Maxiprep Kit (#12963) and concentrated using a Promega Wizard SV Gel and PCR Clean-up System (#A9281).
In order to clone the RiboJ::sfGFP reporter construct, the library was digested using NEB's BsaI-HF and NheI-HF with the addition of rSAP. The reporter construct was digested using NEB's BsaI-HF and NcoI-HF. Similarly to the previous cloning step, the reporter was cloned into the library using T7 DNA Ligase, cloned into 5-alpha electrocompetent E. coli , and plated on LB + kanamycin (25 ug/mL), yielding 8 x 10 5 colonies. The completed plasmid library was isolated as stated above.

Barcode Mapping
After cloning the barcoded library into pLibacceptorV2, we used Next-Generation Sequencing (NGS) to map promoters to their respective barcodes. The barcoded library was amplified for 12 cycles from 10 ng of isolated plasmid using primers GU79 and GU60. Following a DNA Clean-Up using a Zymo Clean and Concentrator Kit, a second PCR was performed to add flow cell adapters to the amplified library. This PCR used 1 ng of the previously amplified library and was for 8 cycles with primers GU70 and one of either GU80 or GU83 for separately indexing replicates. Samples were submitted to the UCLA Technology Center for Genomics and Bioinformatics for sequencing using a 2x150 bp NextSeq 500.
We next used the sequencing data to computationally map each promoter variant to its corresponding barcodes. Demultiplexed reads were paired using Paired-End reAd mergeR (PEAR, default settings). Custom python code was used to identify reads corresponding to perfectly synthesized promoters and their respective barcodes. Briefly, this code searched the first 150 bp of each read for perfect matches to library variants. For reads with perfect matches, the last 20 bp of each read (the barcode) was extracted and a list was compiled mapping each barcode to the most frequently associated library variant. A single barcode appears many times in the sequencing data, and we took steps to ensure a barcode consistently mapped to the same variant. We required that all variants mapped to a single barcode be within an edit distance (Levenshtein distance) of 5 from one another (five single bp changes between the two sequences). We determined this number by bootstrapping a distribution of the edit distance between any two random sequences in our variant library, and setting the threshold to the first percentile (1%) of this bootstrapped distribution. Additionally, each barcode had to appear at least three times in order to be considered for downstream analysis. This step hopefully eliminates barcodes which contained sequencing errors.

Library Integration
The isolated plasmid library was digested with SalI-HF and NheI-HF to eliminate incompletely cloned plasmid before transformation into electrocompetent MG1655 with a landing pad engineered in the nth-ydgR locus and plating on LB + kanamycin (25 ug/mL), resulting in 16 Million colonies. Colonies were resuspended in LB and 800 million cells were inoculated into 250 mL LB + kanamycin (25 ug/mL) and grown overnight. Several 2 mL frozen aliquots were made of this overnight culture.
The library was integrated into the nth-ydgR locus as follows. A frozen aliquot of MG1655 with a landing pad engineered in the reverse orientation at the nth:ydgR locus was transformed with the library and grown overnight in 200 mL LB + kanamycin (25 ug/mL). Following overnight growth, 400 million cells of this culture were seeded into 250 mL LB + kanamycin (25 ug/mL) + .2% arabinose (g/mL) and grown for 24 hours. After integration of the library, the plasmid backbone was removed through heat-curing. From the 24 hour induced culture, 800 million cells were inoculated into 80 mL of LB + kanamycin (25 ug/mL) and grown at 42 ºC for approximately 1.5 hours before reaching an OD 600 =.3. Upon reaching exponential growth, 200 million cells from this culture library were plated and grown for 16 hours at 42 ºC. Heat-cured plates were scraped and resuspended in LB and 400 million cells were inoculated into 200 mL LB + kanamycin (25 ug/mL). This culture, consisting of our integrated and heat-cured library, was grown overnight at 37 ºC and several frozen 2 mL aliquots were made.

Library Growth and Sequencing Library Preparation
A 2 mL frozen aliquot of the library was inoculated in 200 mL MOPS EZ-Rich Media (TEKNOVA #M2105) with .2% glucose (g/mL) and 25 ug/mL of kanamycin and grown at 30 ºC overnight. The overnight culture was used to seed a new culture at OD 600 = .0005 and grown for approximately 5.5 hours at 37 ºC to an OD 600 = .5. The culture was rapidly cooled to 0 ºC in an ice slurry for two minutes. Three 50 mL aliquots were pelleted at 4 ºC by centrifugation at 13,000xg for two minutes and the supernatant was poured out before snap-freezing the pellets in liquid nitrogen. Three 5 mL aliquots were prepared using the same approach.

RNA and DNA library preparation
RNA was extracted from two 50 mL library pellets using a Qiagen RNEasy Midi kit (#75142) and 45 ug of each extract was concentrated using a Qiagen Minelute Cleanup Kit (# 74204) . Barcoded cDNA was generated from 25 ug of each concentrated RNA extract using Thermo Fisher SuperScript IV (#18090010) primed with GU101. The manufacturer's protocol was followed aside from extending the reaction time to 1 hour at 52 ºC. The cDNA reaction was cleaned using a Zymo Research DNA Clean and Concentrator kit (#D40140) before amplification. Barcoded cDNA was amplified via PCR for 13 cycles using primers GU59 and GU102. This reaction was cleaned using a Zymo Research DNA Clean and Concentrator Kit and 1 ng of this reaction was used in a second PCR for indexing and addition of flow cell adapters. The second PCR was for 8 cycles and utilized primers GU102 and either GU61 or GU62.
gDNA was extracted from two 5 mL cell library pellets using a Qiagen Gentra Puregene kit (# 158567). Barcoded DNA was amplified from 1 ug of gDNA via PCR for 14 cycles using primers GU59 and GU60. The reaction was subsequently cleaned using a Zymo Research DNA Clean and Concentrator kit. To add sequencing adapters and indices to the library, 1 ng of this reaction was subject to a second PCR for 8 cycles using primers GU70 and either GU63 or GU64.
RNA and DNA sequencing libraries were cleaned using a Zymo Research Clean and Concentrator Kit before quantification using an Agilent Tapestation. All libraries were submitted to the Broad Stem Cell Research Center at UCLA for sequencing on a HiSeq2000.

Barcode Measurement Processing
Barcode counts were extracted from demultiplexed replicate RNA and DNA reads using a custom bash script. From each sequencing file, the first 20 nucleotides (containing the barcode) were extracted, reverse complemented, and the counts for each unique sequence were determined. Each file was read into R studio (Version 1.0.153). Read counts were normalized using the following formula: Normalized files were merged together based on common barcodes ( dplyr package Version 0.7.2), generating a dataset containing normalized read counts for each barcode in each sample. This file was subsequently merged into the mapping file containing the list of barcodes and their mapped promoter (barcode_statistics.txt).

Promoter Expression Quantification
Barcodes mapped to common promoters were aggregated and promoters that had fewer than four barcodes detected in either RNA or DNA sample of the technical replicates were removed from analysis. Of the 12,849 mapped promoters, 11,328 passed this threshold. To calculate promoter expression, promoter expression in each replicate RNA extraction was calculated as the sum of all RNA counts divided by the sum of all combined DNA counts for all barcodes mapped to that promoter.

Modeling
First, we randomly split our library of promoter variants into 50% training and 50% testing data sets. We fit the linear model using the lm() function in R (stats package version 3.3.3). We modeled expression based on the variant identify of each element and included an interaction term for the -10 and -35 elements. We used the aov() function (stats package) to calculate the model variance and the variance explained by each sequence element was calculated as the percentage of the sum of the squared deviation.
We used the R package 'nnet' to fit our data to a single-hidden-layer neural network. The network topology was structured as 30 input nodes, each representing an element variant, 10 nodes in the hidden layer, and a single linear output node. The network was trained for 300 iterations with weight decay set to .01. We performed 10-fold cross-validation using the same network trained on different proportions of the training data, resampled between each fold and tested on the remaining proportion of the training data.