Abstract
Predictable tuning of gene expression is essential for engineering genetic circuits and for optimising enzyme levels in metabolic engineering projects. In bacteria, gene expression can be tuned at the stage of transcription, by exchanging the promoter, or at stage of translation by altering the ribosome binding site sequence. In eukaryotes, however, only promoter exchange is regularly used, as the tools to modulate translation are lacking. Working in S. cerevisiae yeast, we here describe how hairpin RNA structures inserted into the 5’ untranslated region (5’UTR) of mRNAs can be used to tune expression levels by altering the efficiency of translation initiation. We demonstrate a direct link between the calculated free energy of folding in the 5’UTR and protein abundance, and show that this enables rational design of hairpin libraries that give predicted expression outputs. Our approach is modular, working with different promoters and protein coding sequences, and it outperforms promoter mutation as a way to predictably generate a library where a protein is induced to express at a range of different levels. With this tool, computational RNA sequence design can be used to predictably fine-tune protein production, providing a new way to modulate gene expression in eukaryotes.
Introduction
The rate of production of a given protein in a cell is determined by its gene expression, which in its most simple form is a two-step process, with the gene first transcribed into an mRNA as directed by the promoter, and this mRNA then translated by the ribosome to make the protein. Altering the expression levels of genes is a crucial tool for modern bioscience research, for synthetic biology and for many biotechnology applications [1, 2, 3], and in all organisms this is most-readily achieved by intervening in the first step of gene expression, by changing or regulating the promoter in order to change the rate of transcription of the mRNA [4, 5]. Natural systems, however, also regularly modify protein production by altering the second stage of gene expression where the mRNA is translated [6]. In model bacterial systems, changing the gene sequence in order to modify the rate of translation is a well-established method for predictably tuning the rate of protein production from a gene. However, this is a rarely-used strategy for engineering changes in gene expression in eukaryotic systems.
In bacteria the rate of translation of an mRNA and thus the expression level of a gene can be tuned by changing the sequence of bases immediately upstream of the AUG start codon that are known to recruit the ribosome to initiate translation through RNA:RNA base-pairing [7]. This region, known as the Ribosome Binding Site (RBS) consists of a core sequence that directly base-pairs with the 16S RNA of the ribosome, and surrounding sequences that modify the efficiency of this interaction by forming local secondary structures via mRNA folding. As the efficiency of ribosome recruitement at the RBS defines the translation initiation rate of an mRNA, extensive research has been undertaken to determine how changes to the RBS sequence can be used to tune gene expression [8]. This has led to several sequence-to-output predictive tools that use thermodynamic models of nucleic acid pairing to predict the binding efficiency of ribosomes to any given bacterial mRNA [7]. These are enabled by the multiple software packages that predict nucleic acid secondary structures and determine their Minimum Free Energy (MFE) of folding by summing the thermodynamic contributions of all base-pairing interactions [9, 10, 11, 12, 13, 14].
The most advanced RBS prediction tool, the RBS Calculator, uses secondary-structure calculations to predict the strengths of the various RNA:RNA interactions that occur during ribosome binding and converts these into a predicted translation rate [8]. It can also forward-design new 5’UTR sequences in order to produce desired gene expression levels and can design libraries of 5’UTR sequences in which the careful placement of a few degenerate bases leads to a diverse, yet bounded, variation in the resulting expression levels from the library of DNA expression constructs that encodes these [8, 15]. This enables researchers working with bacteria to be able to fine-tune the expression of genes of interest and design graded expression libraries in a predictable manner, greatly accelerating progress in synthetic biology and biotechnology applications such as metabolic engineering.
In eukaryotes, translation initiation follows a different mechanism to bacteria, where only part of the ribosome, the 40S subunit, initially binds the mRNA [16]. After binding the 5’ cap, it scans along the 5’UTR of the mRNA until reaching the first AUG start codon, which is usually preceded by an A or G/C rich upstream motif known as the Kozak sequence [17]. No direct RNA:RNA base pairing is seen between the ribosome and the mRNA, and as such changing the bases within this region is a rarely-used mechanism for altering gene expression.
However, as with bacteria, it is well-established that the folding of secondary structures in the mRNA 5’UTR can affect the rate of translation and typically when this occurs it inhibits protein expression [18, 19, 20]. Three genome-wide studies in S. cerevisiae have shown that a negative correlation exists between the efficiency with which an mRNA is translated and the secondary structure around its start codon [21, 22, 23], and in a recent study where the 10 bases upstream of the start codon on an mRNA were randomised, a significant association between thermodynamically stable secondary structures and reduced protein levels was found [24]. Secondary structures present in 5’UTRs in higher eukaryotes, such as mammalian cells [25, 26] and plants [27], have also been shown to lead to a similar reduction in gene expression.
In several past experiments in yeast, secondary structures have been added to mRNAs to regulate and tune their expression. Lamping et al. recently showed that GC-rich sequences encoding hairpins can be used as modules to down-regulate expression [20], while others have combined hairpins with other forms of RNA-based translational regulation or with RNA-binding proteins that bind these motifs [28, 29, 30, 31]. No approach, however, exists where gene expression can be predictably tuned at the translation step in eukaryotes as it can in bacteria. Achieving tools equivalent to the RBS Calculator in eukaryotes like yeast would be highly-desirable, especially for metabolic engineering projects where enzyme expression levels need to be optimised, or in synthetic biology where precise and efficient gene expression is typically desired [2, 3].
Towards this objective, we describe here a design-led approach to predictably tune the expression of genes in S. cerevisiae by repressing translation of mRNAs through the introduction of synthetic hairpin secondary structures within the 5’UTR. Following the strategy established by the RBS Library Designer tool [15], we use MFE-based prediction of RNA secondary structures to design sequences with degen-erate bases that can be inserted into the 5’UTR to yield a defined range of expression levels within a population of yeast. We derive a mathematical model that links the predicted MFE of folding for each hairpin sequence with resulting protein expression levels measured in vivo and show that sequences can be designed to fine-tune gene expression libraries as desired. We show that these hairpin-encoding se-quences are modular, working as predicted when paired with different promoters or alternative protein-coding sequences, and are able to alter expression from regulated promoters without impairing their regulation. Our approach greatly simplifies the production of graded expression libraries in S. cerevisiae compared to existing methods and offers a valuable new tool for predictably tuning gene expression in eukaryotes.
Materials & Methods
Strains and media
BY4741 (MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0) was used for all yeast transformations, using a high efficiency yeast transformation protocol [32]. Standard practice in yeast genetics was followed [33]. When used, IPTG was supplied at a concentration of 10mM. For selection in E. coli, antibiotics were added at the following concentrations: Ampicillin: 100 μg/ml, Kanamycin: 50 μg/ml, Chloramphenicol: 33 μg/ml.
Library cloning
The Yeast ToolKit (YTK) cloning system was used for library plasmid construction [34]. NEB Turbo chem-ically competent E. coli (Catalog #C2984I) were used for transformation of library constructs, according to the manufacturer’s protocol. Supplementary table 1 lists all multigene level plasmids constructed for the libraries in this study and their constituent cassette-level plasmids. Contrary to normal YTK protocol, one of the cassettes is an in vitro product created via PCR, rather than through a cloning step. The cor-responding PCR reactions for each cassette part in these libraries are also shown in Supplementary table 1 and primers are defined in Supplementary table 4. For a detailed description of the cloning method we refer to the supplementary material and Supplementary figure 1.
Cloned cassette-level plasmids used in the multigene assembly reactions and as template for the PCRs are listed in Supplementary table 2, along with part plasmids used in their assemblies. Any used parts that were not defined as a standard part in the YTK kit are listed in Supplementary table 3.
Flow cytometry analysis
Libraries were tested using flow cytometry performed with the Attune NxT Acoustic Focusing Cytometer (Invitrogen), with accompanying 96-well plate reader. Two lasers are installed in this machine: blue 488 nm and a 561 nm yellow laser. Green fluorescence (blue laser) was detected at a voltage of 450, red fluorescence (yellow laser) at a voltage of 480, while forward and side scatter were detected at voltages of 40 and 340, respectively.
Strains were picked and grown to saturation in 700μl YEP-Dextrose in 2ml 96-deepwell plates (VWR, cat no 732-0585). Cultures were grown overnight in a shaking incubator (Infors HT multitron MTP shaker) at 800rpm at 30°C, with breathe-easy film (sigma, cat no Z380059) covering the plate to prevent evaporation. This plate was then diluted 500 times into a new deepwell plate containing 700μl minimal dropout media with 2% galactose for induction. After overnight incubation of at least 12 hours, the cultures were backdiluted into a Costar 96 round-well flatbottom plate (VWR, cat no 3596). Dilution was 10-100 fold, depending on culture density, using the same media in a total volume of 300μl per well. Cultures were grown for a minimum of 4 hours before initiation of the measurements, to ensure logarithmic growth. For autofluorescence measurements the same incubation protocol was followed, exclusively with dextrose as the carbon source.
10,000 events were collected for each of the 3 biological repeats per sample. Only events with for-ward and side scatter values greater than 103 were counted. Populations were tightly gated around the median of forward-and side-scatter, in order to limit the effect of cell size on the measurements. Popula-tions were subsequently gated for sufficient mRuby2 expression in strains that contained a constitutively expressed red fluorescent control. Conversely, in constructs with constitutively expressed yEGFP, pop-ulations were gated for sufficient green fluorescence. Gating and exporting was done using FlowJo 10.0.7r2. We defined normalised fluorescence as the fold increase over median autofluorescence lev-els. Accordingly, raw fluorescence values of tested strains were divided by the median fluorescence of unmodified BY4741 cells, grown under identical conditions. Matlab 2016b was used for the visualisation of the resulting histograms.
Library member sequence determination
Hairpin sequences of individual library members were determined through yeast colony PCR. Single colonies were resuspended in 50 μl 0.02 M NaOH with a sterile toothpick. After incubating this solution at 99°C for 10 minutes, a 2 μl aliquot was used as template in a 50 μl PCR reaction. Primers TW149 and TW188 were used for the reaction and for sequencing. They are defined in Supplementary table 4. Isolates with DNA sequences that did not match the relevant library sequence space (i.e. mutations or sequencing errors) were discarded from the analysis. 5’UTR sequences of all isolated library members are listed in Supplementary table 6.
Reverse transcription and qPCR
Total RNA was isolated from yeast using the YeaStar RNA Kit (Zymo Research, Cat No R1002), ac-cording to the manufacturer’s instructions. Cultures were grown to saturation overnight and backdiluted 1:100 the next morning. Cultures were then grown to an O.D.600 of approximately 2, to ensure logarith-mic growth. 1.5 ml of the culture was used for RNA isolation. This volume was adjusted to ensure that the same amount of biomass was used for every sample.
400 ng of total RNA of each of the samples was used in the reverse transcription (RT) reaction to produce cDNA that could be used for qPCR. The RT reaction was performed in a total volume of 10μl, using the Tetro cDNA synthesis kit (Bioline, Cat No BIO-65043) according to the manufacturer’s in-structions. For each reaction, a negative control lacking the reverse transcriptase was included. Specific primers were used for the RT step, which are listed in Supplementary table 4.
cDNA obtained in the RT reaction was diluted 300x and used for qPCR. 4.6μl of diluted cDNA was used in a total reaction volume of 10μl. The Kapa universal qPCR 2x mastermix kit (KAPA biosystems, Cat No kk4601) was used according to the manufacturer’s instructions. The primers (0.2μl per primer per reaction) for each of the screened targets are listed in Supplementary table 4. Measurements were performed with the Eppendorf MasterCycler RealPlex qPCR thermocycler and accompanying software. The following cycling program was used: denaturation for 10 min at 95°C followed by 50 cycles of 15 s at 95°C, 1 min annealing and extension at 60°C.
Three technical replicates were performed for every biological sample. The data were analysed using the 2-ΔΔCT (also ‘dd-Ct’) method [35]. TPI1 was used as a reference gene [36]. The error was calculated as the standard deviation of the replicates with propagation of the error in the reference gene measurements. For each qRT-PCR experiment, two controls were included to monitor the level of DNA contamination of the cDNA and the used reagents. For every target a triplicate measurement of ddH2O was included. Secondly, for every target in every strain we included the -RT control samples produced during the cDNA synthesis.
Results
Altering expression strength with 5’UTR RNA structures
We set out to build on recent work in S. cerevisiae showing that hairpin structures within the 5’UTR of mRNAs decrease expression by inhibiting translation [20]. To verify this finding and explore the relationship in more depth, we used one-pot cloning methods to introduce a library of hairpin structures into the 5’UTR of an mRNA encoding expression of green fluorescent protein (GFP). This mRNA is expressed from the GAL1 promoter and upon galactose induction yields a strong and easily-detectable green fluorescent signal from yeast cells. If RNA structures placed within the 5’UTR of this mRNA lead to changes in the efficiency of its translation, then the fluorescent signal of the yeast will be altered and the expression level can be quantified by analytical flow cytometry.
Using a strategy of PCR amplification with degenerate oligonucleotide primers, followed by cloning with the modular YTK system, we introduced a hairpin library spaced 15 bases upstream of the start codon of GFP, as summarised in Figure 1 and detailed in Supplementary figure 1. To design the hairpin-encoding sequences, we used RNAfold from the ViennaRNA suite of tools, as it allows the RNA-folding algorithms to run on a local computer and the thermodynamic parameters can be adjusted to work for 30°C [37, 14]. As an initial test, two pairs of degenerate oligonucleotide primers were designed to create two alternative secondary structure libraries. The primer pairs each encoded the introduction of 60 bases of sequence predicted by RNAfold to fold into a strong hairpin motif consisting of a 27 basepair stem and 4 base loop (Figure 1A).
Degenerate nucleotides were designed into 10 positions within the primers so that the different con-structs produced by the cloning would have variation in the bases within the hairpin stem, and therefore a range of strengths for the resulting RNA secondary structure. Using RNAfold, the distribution of the predicted minimum free energies of folding for each of the two hairpin libraries could be determined by calculating the folding strengths for all possible combinations of introduced degenerate bases (Figure 1B). The two initial libraries, each with 10 degenerate bases were designed to give a normal distribu-tion of predicted secondary structure strengths with average minimum free energy (MFE) of folding of -32.2 kcal/mol and -28.8 kcal/mol (libraries HL1 and HL2, respectively - see Supplementary table 5 for sequences).
Following plasmid construction to introduce the hairpin-encoding sequences with degenerate bases, E. coli were transformed and all resulting colonies were pooled for each library. Plasmid libraries were extracted from these two pools and then transformed into S. cerevisiae cells. All yeast colonies from each library were pooled and then grown in galactose media to induce gene expression (Figure 1C).
The green fluorescence from each induced pool of yeast colonies was measured at the single-cell level by flow cytometry and used to quantify GFP expression from the library constructs. For both HL1 and HL2 libraries, the normalised green fluorescence was seen to vary across the population over three orders of magnitude (Figure 1D), indicating that the introduced hairpin sequences were indeed altering the expression of GFP in the cells. The shape and peak of the fluorescence histograms in the two cases also differed, with more cells exhibiting low amounts of GFP expression when the library with stronger predicted secondary structure (HL1) was used.
Matching expression levels to predicted folding energies
We next sought to determine if there was a mathematical relationship between the predicted MFE of the encoded 5’UTR secondary structures and the resulting in vivo GFP expression levels in S. cerevisiaeTo do this we selected 31 individual colonies from libraries HL1 and HL2 (Figure 1), determined the se-quence of their 5’UTR regions and used flow cytometry to characterise their green fluorescence per cell upon galactose induction. We used RNAfold to predict the MFE for the hairpin structures in each isolate by inputting their 5’UTR sequences and plotted the relationship between the predicted MFE and the normalised green fluorescence measurements (Figure 2A).
This plot revealed a clear relationship between the predicted UTR folding energy and the GFP ex-pression for each isolated colony. A steep decline in the expression level is seen as the MFE approaches more negative values, in line with the hypothesis that increasingly strong secondary structures within the 5’UTR inhibit gene expression. However, this decline does not start until the MFE reaches approximately -22 kcal/mol, which indicates that the structures weaker than -22 kcal/mol are not strong enough to inhibit expression in our system. Presumably these weaker structures do not cause a sufficient roadblock to the initiation of translation by the ribosome. To confirm that the measured decrease in gene expression was indeed due to reduced translation rather than reduced transcription, we used quantitative PCR (qPCR) to verify that our transcript levels per cell remain the same, despite different 5’UTR hairpin sequences being introduced (Supplementary figure 3).
Using the gene expression data, we fitted a model to predict the gene expression output from the calculated MFE values. As the relationship in (Figure 2A) represents a sigmoidal curve, we modified the equation for logistic growth to produce a new equation that predicts gene expression (as normalised fluorescence) from the MFE of folding of the 5’UTR sequence and the maximum output of the promoter used in the gene expression construct (Figure 2B). The fitted curve does not fall within the 95% confi-dence interval of all the measured data points, suggesting that some relevant properties remain that are not captured by this model. In the most extreme cases, the predicted expression value is 3-fold over-or underestimated compared to the observed value. This compares favourably to the RBS-calculator, which showed deviations of up to 10-fold when it was first published[8].
To test whether this equation has predictive power, we designed, built and tested four further libraries as before by introducing hairpin sequences with degenerate bases between the GAL1 promoter and the GFP coding region (Figure 2C). These four additional libraries (HL3-6) were designed to have weaker average structure strengths within their distributions compared to HL1 and HL2, so that when all six libraries are combined they cover a range from a weak average MFE of -8.0 kcal/mol (HL6) through to a strong average MFE of -32.2 kcal/mol (HL1). The focus was on average MFE values between -32 and -20 kcal/mol, as higher values were predicted to fall within the plateau region of maximum expression and would therefore not lead to significant diversity within the library. The tight peak at maximum expression in library HL6 confirms this.
For each library, the minimal folding energies of all possible members were calculated by RNAfold and then converted into predicted gene expression levels using the established equation and the Pmax value of the GAL1 promoter. The predicted distributions were then compared to experimentally-obtained distributions of green fluorescence measurements for each constructed library of yeast cells. An overview of these comparisons is shown in (Figure 2D) and demonstrates a good qualitative agreement between the model predictions and the resulting experimental data.
Generally, the histograms for the predicted normalised expression match those for the measured normalised GFP expression in both their spread (the range of expression levels) and the position of the peak (the average expression). Some peak-broadening is seen in the flow cytometry data compared to the predictions, likely as small and large cells within the measured population cause intrinsic deviations in the data despite cells having the same relative expression. However for the goal of forward engineer-ing, these libraries and the model-based predictions serve their purpose, allowing a user to design a sequence that predictably alters gene expression from a promoter of known strength.
Interestingly, a notable exception in our ability to predict expression from a designed 5’UTR hairpin sequence occurs if the design encodes a tetraloop in the 4 base loop sequence of the hairpin. RNA tetraloop motifs, such as those encoded by the sequence UUCG, are known to significantly stabilise hairpin sequences. In hairpin library designs with tetraloop-encoding bases at the loop region, the measured GFP expression from cells was dramatically reduced compared to the predicted expression from the MFE calculations (Supplementary figure 3). Introducing a tetraloop sequence into an other-wise identical library dramatically reduced expression levels and introduced a severe mismatch between predicted and measured expression profiles. This effect was especially pronounced when the stem se-quence directly adjacent to the tetraloop was kept devoid of degenerate bases and thereby perfectly complementary. Until RNA folding models can accurately predict the MFE contribution of tetraloops or until the effect of tetraloop inclusion can be accurately modelled, we recommend avoiding tetraloop-encoding bases within hairpin designs.
5’UTR hairpins as modular parts
Having predictably altered the expression of GFP from the GAL1 promoter in S. cerevisiae with 5’UTR hairpin libraries, we next sought to show that the approach is modular, i.e. that the introduction of de-signed structures can predictably alter gene expression when combined with other modular DNA parts, such as different promoters. To demonstrate this we first examined the effect of exchanging the open reading frame (ORF) sequence encoding the protein produced in our constructs. We replaced the GFP-encoding sequence with that encoding the red fluorescent protein (RFP) mRuby2 in the constructs for libraries HL4, HL2 and HL1. The sequence identity between the two ORFs encoding these fluores-cent proteins is as little as 11.5%, as identified by a BLAST alignment optimised for somewhat similar sequences (blastn) [38].
After cloning the libraries into S. cerevisiae we measured the red flourescence per cell of the pooled yeast colonies after galactose induction using flow cytometry and compared the fluorescence distribu-tions to those predicted by our mathematical model (Figure 3A). As seen previously with GFP, the RFP data closely matched the predicted expression levels for all three libraries tested, demonstrating that the 5’UTR libraries remain predictable when the context of the downstream sequence is changed.
Next, we tested the effect of varying the promoters within our constructs. To do this we designed a 5’UTR encoding a new hairpin library with an average MFE of -28.9 kcal/mol. This library, called HC1, was designed to produce a uniform distribution between autofluorescence levels and the maximum expression levels of the various mid-to high-strength promoters. This was cloned upstream of the GFP-encoding ORF and downstream of five constitutive promoters known to have different expression strengths. We then measured the GFP expression from these five promoters in the absence of any 5’UTR structures to obtain Pmax values and used these data with the calculated MFE values for the HC1 design to predict the expression profiles of the five new libraries. Histograms of these predictions are shown alongside the equivalent measured data for all five promoters in Figure 3B and show a clear qualitative match.
While this is only a small sample of possible promoters, this result implies that the 5’UTR library approach is modular with respect to upstream promoters (i.e. there is no context-dependency), mean-ing that the approach could likely be applied to alter the expression from most, if not all promoters in S. cerevisiae. However, when the results are inspected with more detail, it is noticeable that the ex-pression level distribution becomes progressively less uniform as weaker promoters are used. For weak promoters, the experimentally measured distribution skews slightly towards lower expression levels.
This phenomenon can be clarified by understanding the use of the logistic fit in Figure 2. An even distribution of expression strengths is only found in the range of values that form the straight part of the curve when the data are plotted against a logarithmic scale (as in Figure 2A). This range becomes narrower for weaker promoters with lower Pmax values. Thus, when a 5’UTR library spanning a MFE range of -44 to -24 kcal/mol (i.e. HC1) is used with a weaker promoter, a larger proportion of library members will obtain a sequence that fully-represses detectable expression. This in turn skews the expression level distribution towards the lower end of the spectrum.
However, by simply measuring the normalised fluorescence of a promoter prior to library design and creation, the Pmax can be determined and the library MFE spread can be intentionally designed to take into account the promoter strength. Going forward this will allow expression libraries to be created with more precision, regardless of the nature or sequence of the upstream promoter or the downstream ORF.
Predictable tuning of regulated expression with 5’UTR hairpins
Precise tuning of gene expression is important in many applications of biotechnology, synthetic biology and metabolic engineering. The approach developed here of placing designed secondary structure within the mRNA 5’UTRs offers a new modular tool to achieve this in yeast. Currently in S. cerevisiae the most commonly-used method for altering the strength of gene expression is to replace the promoter, and multiple promoter libraries have been described that are intended to enable users to alter the amount of transcription and thus expression of any gene.
Most promoter libraries use sets of constitutive promoters of different strengths, however when gene regulation is required (e.g. for inducible expression) these are not suitable. A small number of promoter libraries based around regulated promoters have been described and used in yeast, but within these libraries the efficiency of regulation often varies considerably as the promoter strengths change. This is because mutating bases within the promoter in order to change the transcription strength often inadver-tently alters the efficiency with which transcription factors bind their cognate sites on the promoter and regulate promoter expression.
The 5’UTR hairpin approach developed here is a promising solution to this problem, because the fine-tuning of gene expression output is achieved by altering bases away from the promoter sequence where the regulation is encoded. It therefore offers a new way to modify protein levels from gene expression constructs without affecting their regulation characteristics at the promoter level. To demonstrate this, we directly compared our approach to an equivalent artificially regulated GAL1 promoter library produced previously using targeted mutagenesis (Figure 4).
In both cases, expression of the Lac Inhibitor protein (LacI) in yeast inhibits expression of GFP by binding to an integrated Lac operator site (LacO) within the core of the GAL1 promoter. In galactose media, expression is repressed unless the inducer IPTG is added which blocks the action of LacI and permits full-strength expression.
The 21-member promoter library was previously made by targeted promoter mutagenesis, followed by selection and characterisation experiments requiring a two-week worflow [39]. When measured for GFP expression in induced and repressed conditions (Figure 4A) a desirable range of maximum outputs is seen within the library but the efficiency of repression varies considerably, with several promoters being especially leaky (i.e. not well-repressed).
In contrast, in under a week using our 5’UTR hairpin method we were able to generate a graded library of 45 constructs covering the full range of maximum outputs, while all maintaining strong re-pression when uninduced (Figure 4B). To do this we simply paired the strongest member of the LacI-repressed GAL1 promoter library (LX) with a library of 5’UTR hairpins (named HG1) designed to have an average folding strength of -28.8 kcal/mol. Upon cloning into S. cerevisiae a total of 48 transformants were picked and characterised by flow cytometry for GFP expression in induced and repressed condi-tions. Only 3 selected colonies with aberrant expression were discarded to yield the 45-member library, whereas the classic promoter mutagenesis method used previously had required screening over 300 colonies to isolate the 21-member graded library.
The use of designed 5’UTR hairpins thus outperforms promoter mutagenesis methods for expression library creation in terms of both ease and speed to create the library and in terms of the resulting constructs showing the desired expression characteristics and no unwanted leaky expression. The increased predictability afforded by this new approach can also aid a priori modelling efforts, because there is greatly decreased risk that the promoter with the required maximum expression also has leaky expression or unanticipated impaired regulation. As leakiness typically needs to be as low as possible whenever regulated expression is desired (for example in genetic circuits or biosensors) we anticipate that our 5’UTR hairpin method for tuning expression will be widely-applied.
Discussion
In this work we have determined how protein expression in S. cerevisiae can be predictably tuned through the incorporation of 5’UTR secondary structures. By placing a library of hairpins of different strengths into the 5’UTR region of a gene, we can downregulate translation efficiency and the magni-tude of the inhibition can be predicted using well-understood principles of RNA base-pairing. The result is a system for eukaryotes that is similar to the extensively-used RBS Calculator that is available for use in bacteria. We verified the modularity of this system by testing a variety of libraries with two different ORFs and testing one hairpin library with a variety of constitutive promoters. The predictions were shown to hold in all of these cases. In synthetic biology and other disciplines that rely on the precise engineer-ing of gene expression in living cells, the regulation of protein production is of central importance and so being able to fine-tune the efficiency of mRNA translation is an important contribution. Importantly, employing the approach described here requires only a few cloning steps and can be incorporated as a routine part of gene construction and optimisation.
Our results showed that the folding strength of the introduced hairpin determines the expression level of the associated protein with stronger hairpin structures leading to lower expression levels likely through increased interference with the translational machinery of the host during the scanning step of translation initiation. Corroborating this finding, our qPCR experiments showed no difference in the transcript levels when mRNAs contained highly-structured or weakly-structured 5’UTRs, and yet these mRNAs expressed protein at greatly different levels.
Our calibration curve for the mathematical model determines how the folding strength of the hairpin correspond with the translation of the protein and is to our knowledge the most detailed report of this relationship to date. It shows that a MFE difference of 10 kcal/mol is required to impart a 10-fold dif-ference in gene expression, which is more than 7 times the energy that would be expected simply from thermodynamic predictions of the energy required to change the folded to unfolded ratio by 10-fold [40]. This finding is strong evidence that RNA helicase activity in translation initiation partially counteracts the effects of hairpin structure in mRNA. The shape of the curve is effectively a function of the properties of the translation machinery of the host, in particular the processivity of the RNA helicases, notably eIF4A, which are associated with the 40S ribosomal subunit during translation initiation [16, 41].
Furthermore, we also observed that expression was not affected by hairpin structures weaker than -18 kcal/mol. This indicates the translation initiation machinery and associated RNA helicases are likely to be able to fully-denature structures of these strengths. However, hairpins stronger than -44 kcal/mol were not translated at all, suggesting that the native machinery is completely incapable of unfolding these. Precisely determining the limits of the native machinery is important, as it opens up the possibil-ity for determining the impact of individual components of the translation machinery affecting helicase activity, such as eIF4A, eIF4B, eIF4G. Each of these have been shown to play a role in RNA-helicase processivity and observing how overexpression and elimination of these components affect the limits and shape of the curve may provide valuable insights to their function[42, 43, 44].
Interestingly, it was found that weak promoters are especially sensitive to strong 5’UTR structure and thus require appropriate design of libraries with weaker average folding strengths. Another unex-pected outcome was the severe impact of tetraloops on translation inhibition. Tetraloops combined with a perfectly paired stem exhibited a far greater inhibitive strength than their net MFE contribution would suggest. While this finding requires more investigation, one possible explanation would be if these motifs were bound and stabilised by other local factors present during translation initiation. Tetraloops such as those tested here, are known to be conserved folding motifs found in ribosomal RNA, which could lead to them being bound by ribosome-associated proteins[45, 46].
Tetraloops are one example of motif that challenges the predictive power of our approach, and other, as yet unidentified features may exist that also affect predictability. Features that attract RNA bind-ing proteins, target the mRNA for degradation or interfere with transport through the nuclear pore are also possibilities. Indeed, in our measurements for the calibration curve, we noted that expression can deviate by up to three-fold from the predicted value in certain cases. This may be caused by the acci-dental inclusion of sequences encoding unknown motifs and would affect accurate predictions if trying to achieve a single 5’UTRs for a set expression level. For this reason, we prefer the library approach with degenerate bases, which will almost always yield at least one yeast colony exhibiting the desired expression level. As a workaround, hairpins from the pre-characterised list of 5’UTRs shown in Supplementary table 6 can be chosen instead. A further limitation of our approach is that our sequences do not increase expression and only lower it from what is seen normally with the promoter of choice. In bacteria, changing the RBS sequence can often be used to increase expression, however, as yeast 5’UTRs rarely contain rate-limiting secondary structures, the introduction hairpin motifs only reduces the expression levels.
Adapting hairpin design to incorporate active regulation is a promising route for future work. Previous work has shown that aptamer motifs can be designed into a 5’UTR to fold into inhibitory hairpin structures when they bind a specific inducer molecule translation [47]. Approaches such as riboswitches or recruit-ing RNA-binding proteins such as MS2 coat protein are other routes to regulation. Another interesting potential improvement would be to increase the prediction accuracy by taking into account the dynamic properties of RNA folding. mRNAs with 5’UTR structures have been shown to have biphasic polysome distributions in yeast[19], which indicates that once the structure is unwound it stays unwound, as it reforms slowly relative to the rate of translation initiation. Taking (re-)folding speed into account could therefore lead to improvements in prediction accuracy as has recently been done for the bacterial RBS Calculator [48].
A further improvement to our system would be to make the developed method applicable in all eu-karyotes. A new calibration curve will need to be made for each organism that this method is imple-mented in, since the cellular machinery will behave slightly differently in each case due to the properties of the translation initiation machinery. Interestingly, for higher eukaryotes, such as plants and mam-malian cells, the ideal placement of the hairpin module within the 5’UTR may not be the same as in yeast. For work in S. cerevisiae we recommend placing the hairpin just upstream of the AUG, as the further upstream it lies, the more likely the encoding-sequence could influence the upstream promoter. However, in higher eukaryotes, the amount of inhibition of translation is thought to be highest when sec-ondary structures are closest to the 5’ cap [18, 49, 19, 21]. This may be because higher eukaryotes have the DHX29 RNA helicase that is absent in S. cerevisiae and this is thought to enable them to tolerate much longer, structured 5’UTRs [20].
In terms of applications, we anticipate that this approach will be useful broadly but especially in synthetic biology where exploring and tuning gene expression strength is critical. Already a study on gene expression noise in genetic circuits in S. cerevisiae has demonstrated the use of 5’UTR hairpins to modify translation rates of a transcript [50]. We expect our hairpin library system to be ideal for optimising the simultaneous expression of multiple genes. Efforts to create heterologous metabolic pathways in yeast are common in synthetic biology and metabolic engineering [51] and efficient ways of optimising enzyme expression levels in these pathways are needed. Because the fraction of functional library members in a transformed population using our method is high, we expect that multiple libraries can be inserted simultaneously during cloning of a pathway, while at the same time ensuring that functional expression of each gene always occurs.
Due to the different underlying principles of translation initiation in eukaryotes, there are some dif-ferences between the approach we present here and the popular RBS Calculator tool for bacteria. In a typical case, the RBS Calculator can be used to both up-and down-regulate, while yeast hairpins only downregulate. However, modularity for bacterial RBS sequences is poor, usually requiring a custom sequence to be designed for each gene. But because hairpin interactions are more specific and pre-dictable this is much less of an issue in our approach, allowing the designed sequences to be used as modules. Already a set of 16 hairpin modules for S. cerevisiae 3’UTR regions has been described which modifies gene expression by altering mRNA degradation rates [31]. However, this complementary ap-proach does not allow a priori predictions of protein expression levels based on sequence as achieved here.
Taken together, we have developed here a novel sequence-to-output design strategy for the creation of yeast gene expression libraries, and our approach significantly outperforms existing protocols for library creation both in terms of predictability and the speed and ease of the method. With a one-week turnaround time and a maximum of three days of cloning, the use of computationally-designed 5’UTR hairpin libraries is a fast, cheap and accessible way to tune expression levels and to specifically alter the rate of the translation step in eukaryotic systems.
Funding
Imperial College PhD Studentship [to T.W.] Engineering and Physical Sciences Research Council [EP/J021849/1 to T.W., T.E.]; Biotechnology and Biological Sciences Research Council [BB/K019791/1 to R.M.M., T.E.]. Funding for open access charge: Research Council UK, Open Access Fund.
Conflict of interest statement
None declared.
Acknowledgements
The authors thank Charlie Gilbert for providing critical feedback during the manuscript drafting stage and for Robert Chen for help with the YTK system.