Highly accessible translation initiation sites are predictive of successful heterologous protein expression

Recombinant protein production in microbial systems is well-established, yet half of these experiments have failed in the expression phase. Failures are expected for ‘difficult-to-express’ proteins, but for others, codon bias, mRNA folding, avoidance, and G+C content have been suggested to explain observed levels of protein expression. However, determining which of these is the strongest predictor is still an active area of research. We used an ensemble average of energy model for RNA to show that the accessibility of translation initiation sites outperforms other features in predicting the outcomes of 11,430 experiments of recombinant protein production in Escherichia coli. We developed TIsigner and showed that synonymous codon changes within the first nine codons are sufficient to improve the accessibility of translation initiation sites. Our software produces scores for both input and optimised sequences, so that success/failure can be predicted and prevented by PCR cloning of optimised sequences.


Introduction
Recombinant protein expression has numerous applications in biotechnology and biomedical research. Despite extensive refinements in protocols over the past three decades, half of the experiments have failed in the expression phase (http://targetdb.rcsb.org/metrics/ ). Notable problems are the low expression of 'difficult proteins' such as membrane proteins, and the poor growth of the expression hosts, which may relate to the toxicity of heterologous proteins 1 (reviewed in detail elsewhere 2,3 ). If these issues are factored out, we expect a strong correlation between mRNA and protein levels. However, this assumption oversimplifies the complexity of translation and turnover of biomolecules because mRNA abundance can only explain up to 40% of the variation in protein abundance 4-10 . Furthermore, the strong promoters used in expression vectors do not always lead to a desirable level of protein expression 11 .
For Escherichia coli , two main models were proposed to explain the low correlation between mRNA and protein levels, which are based on either codon or mRNA folding analysis. Codon analysis measures a bias in codon usage using codon adaptation index (CAI) 12 or tRNA adaptation index (tAI) 13,14 whereas mRNA folding analysis predicts the presence of RNA secondary structures and their folding stability. Codon usage bias is thought to correlate with tRNA abundance, translation efficiency and protein production 12-16 but its usefulness has been questioned upon [17][18][19][20] . In contrast, many findings support the model based on mRNA folding in which the stability of RNA structures around the Shine-Dalgarno sequence and/or translation initiation sites inversely correlates with protein expression 17,18,20-23 . We recently proposed a third model in which the avoidance of inappropriate interactions between mRNAs and non-coding RNAs has a strong effect on protein expression 24 . The roles of these models in protein expression is still an active area of research.
The common algorithms of gene optimisation samples synonymous protein-coding sequences using 'fitness' models based on CAI, tAI, mRNA folding, and/or G+C content (%) [25][26][27][28][29] . However, these 'fitness' models are usually based on some of the above findings that relied on either endogenous proteins, reporter proteins or a few other proteins with their synonymous variants. It is unclear whether these features are generalisable to explain the expression of various heterologous proteins. To address this question, we studied multiple large datasets across species in order to extract features that allow us to predict the outcomes of 11,430 experiments of recombinant protein expression in E. coli . With this information, we propose how such features can be exploited to fine-tune protein expression at a low cost.

Accessibility of translation initiation sites strongly correlates with protein abundance
To explore new features that could explain the expression of heterologous proteins, we first examined an E. coli expression dataset of green fluorescent protein (GFP) fused in-frame with a library of 96-nt upstream sequences (n=244,000) 20 . We clustered these 96-nt upstream sequences using CD-HIT-EST 30,31 , giving rise to 14,425 representative sequences. We calculated the accessibility that represents the opening energy for all possible sub-sequences of these sequences (see Methods). For each sub-sequence region, we examined the correlation between the opening energy and GFP levels. We found that the opening energy of translation initiation sites, in particular from the nucleotide positions −30 to 18 (−30:18), showed a maximum correlation with protein abundance (Fig 1A; R s =−0.65, P<2.2×10 −16 ). This is stronger than the correlation between the minimum free energy −30:30 and protein abundance, which was previously reported as the highest rank feature (Fig 1A; R s =0.51, P<2.2×10 −16 ). The P-values of multiple testing were adjusted using Bonferroni's correction and reported to machine precision. The datasets used and results were summarised in Supplementary Table S1.
We repeated the analysis for a dataset of yellow fluorescent protein (YFP) expression in Saccharomyces cerevisiae 22 . This dataset corresponds to a library of 5′UTR variants, in which the 10-nt sequences preceding the YFP translation initiation site were randomly substituted (n=2,041). In this case, the opening energy −7:89 showed a stronger correlation with protein abundance than that of the minimum free energy −15:50 reported previously (Fig 1B; R s =−0.55 versus 0.46).
To examine the usefulness of accessibility in complex eukaryotes, we analysed a dataset of GFP expression in Mus musculus 32 . The reporter library was originally designed to measure the strength of translation initiation sequence context, in which the 6-and 2-nt sequences upstream and downstream of the GFP translation initiation site were randomly substituted, respectively (n=65,536). Here the opening energy −8:11 showed a maximum correlation with expressed proteins, which again, is stronger than that of the minimum free energy −30:30 (Fig 1C; R s =−0.28 versus 0.12).
Taken together, our findings suggest that the accessibility of translation initiation sites strongly correlates with protein abundance across species. Interestingly, our findings also suggest that E. coli tends to have a longer accessible 5′UTR region than that of S. cerevisiae and H. sapiens (−30 versus −7 and −8; see Fig 1). This can be explained by the presence of the Shine-Dalgarno sequence 33 at the region −13:−8, which should be accessible to recruit ribosomes.

Accessibility predicts the outcome of recombinant protein expression
We investigated how accessibility performs in the real world in prediction of recombinant protein expression. For this purpose, we analysed 11,430 expression experiments in E. coli from the 'Protein Structure Initiative:Biology' (PSI:Biology) 34-36 . These PSI:Biology targets were expressed using the pET21_NESG expression vector that harbours the T7lac inducible promoter and a C-terminal His tag 36 .
We split the experimental results of the PSI:Biology targets into protein expression 'success' and 'failure' groups (n=8,780 and 2,650, respectively; see Supplementary Fig S2). These PSI:Biology targets spanned more than 189 species and the failures are representative of various problems in heterologous protein expression. Only 1.6% of the experiments belong to homologous protein expression, which is negligible (n=179; see Supplementary Fig S2).
We calculated the opening energy for all possible sub-sequences of the PSI:Biology targets as above (Fig 2). For each sub-sequence region, we used the opening energy levels to predict the expression outcome and computed the prediction accuracy using the area under the receiver operating characteristic curve (AUC; see Fig 2C). A closer look into the correlations and AUC scores calculated for the sub-sequence regions reveals a strong accessibility signal of translation initiation sites (Fig 2B and C, Cambray's GFP and PSI:Biology datasets, respectively). Although the sequences of the Cambray's GFP and PSI:Biology datasets are different, we reasoned that the correlations and AUC scores can be compared by the sub-sequence regions that are in common (see Fig 2A for an example of a sub-sequence region). Based on this idea, we matched the correlations and AUC scores by sub-sequence region and confirmed that sub-sequence regions that have strong correlations are likely to have high AUC scores (Fig 2D). In contrast, the sub-sequence regions that have zero correlations are not useful for predicting the expression outcome (AUC approximately 0.5).
We then asked how accessibility manifests in the endogenous mRNAs of E. coli , for which we studied the proteomics dataset of 3,725 proteins consolidated in the PaxDb 37 . As expected, we observed a similar accessibility signal, with the region −25:16 correlated the most with protein abundance (Fig 2E). However, the correlation was rather low (R=−0.17, P<2.2×10 −16 ), which might be due to the limitations of mass spectrometry 38,39 . Furthermore, the endogenous promoters have variable strength, which gives rise to a broad range of mRNA and protein levels 40,41 . Taken together, our results show that the accessibility signal of translation initiation site is surprisingly consistent across various datasets analysed ( Supplementary Fig S1 and Fig 2).

Accessibility outperforms other features in prediction of recombinant protein expression
To choose an accessibility region for subsequent analyses, we selected the top 200 regions from the above correlation analysis on Cambray's dataset ( Fig 2B) and ranked their Gini importance scores in prediction of the outcomes of the PSI:Biology targets. The region −24:24 was ranked first, which is nearly identical to the region −23:24 with the top AUC score (Fig 2C, AUC=0.70). We therefore used the opening energy at the region −24:24 in subsequent analysis.
We asked how the other features perform compared to accessibility in prediction of heterologous protein expression, for which we analysed the same PSI:Biology dataset. We first calculated the minimum free energy and avoidance at the regions −30:30 and 1:30, respectively. These are the local features associated with translation initiation rate. We also calculated CAI 12 , tAI 42 , codon context (CC) 43 , G+C content (%), and Iχnos scores 44 . CC is similar to CAI except it takes codon-pair usage into account, whereas the Iχnos scores are translation elongation rates predicted using a neural network model trained with ribosome profiling data. These are the global features associated with translation elongation rate. The AUC scores for the local features were 0.70, 0.67 and 0.62 for the opening energy, minimum free energy and avoidance, respectively, whereas the global features were 0.58, 0.57, 0.54, 0.54 and 0.51 for Iχnos, G+C content (%), CAI, CC and tAI, respectively ( Fig 3A). The local features outperform the global features, suggesting that effects on translation initiation can predict the outcome of heterologous protein expression. Our findings support previous reports that the effects on translation initiation are rate-limiting 17,23 which, interestingly, correlate with the binary outcome of recombinant protein expression ( Fig 3B). Importantly, accessibility outperformed all other features.
To identify a good opening energy threshold, we calculated positive likelihood ratios for different opening energy thresholds using the cumulative frequencies of true negative, false negative, true positive and false positive derived from the above ROC analysis (Fig 4, top  panel). Meanwhile, we calculated the 95% confidence intervals of these positive likelihood ratios using 10,000 bootstrap replicates. We reasoned that there is an upper and lower bound on translation initiation rate, therefore the relationship between translation initiation rate and accessibility is likely to follow a sigmoidal pattern. We fit the positive likelihood ratios into a four-parametric logistic regression model (Fig 4). As a result, we are 95% confident that an opening energy of 10 or below at the region −24:24 is about two times more likely belongs to the sequences which are successfully expressed than those that failed.

Accessibility can be improved using a simulated annealing algorithm
The above results suggest that accessibility can, in part, explain the low expression problem of heterologous protein expression, we sought to exploit this idea in gene optimisation. We developed a simulated annealing algorithm to maximise the accessibility at the region −24:24 using synonymous codon substitution (see Methods). Previous studies have found that full-length synonymous codon-substituted transgenes may produce unexpected results, in particular a reduction in mRNA level 24, 44,45 . Therefore, we sought to determine the minimum number of codons needed for synonymous substitutions in order to achieve near optimum accessibility. For this purpose, we used the PSI:Biology targets that failed to be expressed. As a control, we first applied our simulated annealing algorithm such that synonymous substitutions can happen at any codon of the sequences except the start and stop codons (see Methods). Although full-length synonymous codon substitution was allowed, the changes may not necessarily happen to all codons due to the stochastic nature of our optimisation algorithm. Next, we constrained synonymous codon substitution to the first 14 codons and applied the same procedure (Supplementary Fig S3). Therefore, the changes may only occur at any or all of the first 14 codons. We repeated the same procedure for the first nine and also the first four codons. Thus a total of four series of codon-substituted sequences were generated. We then compared the distributions of opening energy −24:24 for these series using the Kolmogorov-Smirnov statistic (D KS ; see Fig  5A). The distance between the distributions of the nine and full-length codon-substituted series was significantly different yet sufficiently close (D KS =0.09, P=3.3 10 -8 ), suggesting × that optimisation of the first nine codons is sufficient in most cases to achieve an optimum accessibility of translation initiation sites. We named our software as T ranslation I nitiation coding region des igner (TIsigner), which by default, allows synonymous substitutions up to the first nine codons.
We asked to what extent the existing gene optimisation tools modify the accessibility of translation initiation sites. For this purpose, we first submitted the PSI:Biology targets that failed to be expressed to the ExpOptimizer webserver from NovoPro Bioscience (see Methods). We also optimised the PSI:Biology targets using the standalone version of Codon Optimisation OnLine (COOL) 28 . We found that both tools increase accessibility indirectly even though their algorithms are not designed as such (i.e., the 5′UTR sequence is not taken into account). In fact, a purely random synonymous codon substitution on these PSI:Biology targets using our own script resulted in a similar increase in accessibility ( Fig 5B). These results may explain some indirect benefits from the existing gene optimisation tools.

Discussion
Our findings show that the accessibility of translation initiation sites is the best predictor of heterologous protein expression in E. coli, as originally proposed in the 1970s/80s 46 . Increasing the accessibility of the 5′ region, including the Shine-Dalgarno sequence, facilitates the recruitment of ribosomes and therefore increases the translation initiation rate and protein level. In a landmark study, Salis et al. designed a total of 132 synthetic ribosome binding sites using minimum free energy models 26 . They found that weakly structured ribosome binding sites result in high red fluorescent protein levels. This was supported by recent studies using the endogenous folA and adk genes 47 and a dual-reporter system in E. coli 48 . These studies, and many others, support our finding that optimisation of the accessibility of translation initiation sites is a key to improve heterologous protein production.
Previous studies have used minimum free energy models to define the accessibility of a region of interest 26,47,48 . However, we have discovered that the opening energy is a better choice for modelling accessibility (see Fig 1A for example). Opening energy is an ensemble average of energy that takes into account of suboptimal RNA structures that are not reported by minimum free energy models by default 49,50 . Currently, the modelling of accessibility using opening energy is largely used for the prediction of RNA-RNA intermolecular interactions, for example, as implemented in RNAup and IntaRNA 51,52 . Our study has shown that this approach can be used to identify the key accessibility regions that are consistent across multiple large expression datasets. We have implemented our findings in TIsigner webserver, which currently supports recombinant protein expression in E. coli and S. cerevisiae (optimisation regions −24:24 and −7:89, respectively; see Fig 1). An independent yet similar implementation is available in XenoExpressO webserver with the purpose of optimising protein expression for an E. coli cell-free system 53 . The authors showed that an increase in accessibility of a 30 bp region from the Shine-Dalgarno sequence enhances the expression level of human voltage dependent anion channel, which supports our timely findings.
The strengths of our approach (implemented in the TIsigner webservice and software tool) are four-fold. Firstly, the likelihood of success or failure can be assessed prior to running an experiment. Users can compare the opening energy calculated for the input and optimised sequences and the distributions of the 'success' and 'failure' of the PSI:Biology targets. We also introduced a scoring scheme to score the input and optimised sequences based upon how likely they are to be expressed (Fig 4; also see Methods). Secondly, optimised sequences can have up to the first nine codons substituted (by default), meaning that gene optimisation using a standard PCR cloning method is feasible. For cloning, we propose a nested PCR approach, in which the final PCR reaction utilises a forward primer designed according to the optimised sequence 54 (Fig 5C). Thirdly, the cost of gene optimisation can be reduced dramatically as gene synthesis is replaced with PCR using our approach. This enables high-throughput protein expression screening using the optimised sequences, generated at a low cost. Finally, tunable expression is possible, i.e. high, intermediate or even low expression 5′ codon sequences can be designed, allowing for more control over heterologous protein production. Although our study focuses largely on the expression of recombinant proteins without an N-Terminal fusion tag, our findings might give meaningful insights to other systems.

Sequence features analysis
Minimum free energy, opening energy and avoidance were calculated using RNAfold, RNAplfold and RNAup from ViennaRNA package (version 2.4.11), respectively [49][50][51][55][56][57][58] . RNAfold was run with default parameters. For RNAplfold, sub-sequences were generated from the input sequences to calculate opening energy (using the parameters -W 210 -u 210). For RNAup, we examined the stochastic interactions between the region 1:30 of each mRNA and 54 non-coding RNAs (using the parameters -b -o). RNAup reports the total interaction between two RNAs as the sum of energy required to open accessible sites in the interacting molecules and the energy gained by subsequent hybridisation 49 . For the G Δ u G Δ h interactions between each mRNA and 54 non-coding RNAs, we chose the most stable mRNA:ncRNA pair to report an inappropriate mRNA:ncRNA interaction, i.e. the pair with the strongest hybridisation energy, . ΔG ) ( h min CAI, tAI and CC were calculated using the reference weights from Sharp and Li 12 , Tuller et al. 42 and Ang et al. 43 , respectively. Translation elongation rate was predicted using Iχnos 44 trained with ribosome profiling data (SRR7759806 and SRR7759807) 59 . See Supplementary  Table S1 for the datasets used in this study.

TIsigner
Finding a synonymous sequence with a maximum accessibility is a combinatorial problem that spans a vast search space. For example, for a protein-coding sequence of nine codons, assuming an average of 3 synonymous codons per amino acid, we can expect a total of 19,682 unique synonymous coding sequences. This number increases rapidly with increasing number of codons. Heuristic optimisation approaches are preferred in such situations because the search space can be explored more efficiently to obtain nearly optimal solutions.
To optimise the accessibility of a given sequence, TIsigner uses a simulated annealing algorithm [60][61][62][63] , a heuristic optimisation technique based on the thermodynamics of a system settling into a low energy state after cooling. A simulated annealing algorithm has been used to solve several combinatorial optimisation problems in bioinformatics. For example, we previously applied this algorithm to align and predict non-coding RNAs from multiple sequences 64 . Other studies use this algorithm to find consensus sequences 62 and optimise the ribosome binding sites 26 and mRNA folding 65 using minimum free energy models.
According to statistical mechanics, the probability of a system occupying energy state p i with temperature follows a Boltzmann distribution of the form , which gives a set , E i , T e −E /T i of probability mass functions along every point in the solution space. Using a Markov i chain sampling, these probabilities are sampled such that each point has a lower temperature then the previous one. As the system is cooled from high to low temperatures ( , the samples converge to a minimum of , which in many cases might be the global ) algorithm in which a 'bad' move from initial state such that , is accepted if where is a uniformly random number between 0 and 1.
In our implementation, each iteration consists of a move that may involve multiple synonymous codon substitutions. The algorithm begins at a high temperature where the first move is drastic, synonymous substitutions occur in all replaceable codons. At the end of the first iteration, a new sequence is accepted if the opening energy is smaller than that of the input sequence. However, if the opening energy of a new sequence is greater than that of the input sequence, acceptance depends on the Metropolis-Hastings criteria. The accepted sequence is used for the next iteration, which repeats the above process. As the temperature cools, the moves get milder with fewer synonymous codon changes (Supplementary Fig S3). Simulated annealing stops upon reaching a near optimum solution.
For the web version of TIsigner, the default number of replaceable codons is restricted to the first nine codons. However, this default setting can be reset to range from the first four to nine codons, or the full length of the coding sequence. Furthermore, TIsigner runs multiple simulated annealing instances, in parallel, to obtain multiple possible sequence solutions. There is a possibility to select tunable expression levels when the T7lac promoter is selected (as the expression scores were calculated based on the PSI:Biology dataset; see below). Among the solutions, the sequence that matches most closely to the users' selected target expression score is chosen as the optimum. The option for tunable expression is not available for custom UTRs, the sequence with minimum opening energy is chosen as the optimum.
We allow users to select desirable target expression scores for the experiments using the T7lac inducible promoter. To implement this criterion, the posterior probabilities of success for input and optimised sequences are evaluated using the following equations from Bayesian statistics: (2) The fitted positive likelihood ratios in equation (1) were obtained from the following 4-parametric logistic regression equation: with parameters a, b, c, and d. The prior probability was set to 0.49, which is the proportion of 'Expressed' (n=21,046) divided by 'Cloned' (n=42,774) of the PSI:Biology targets reported as of 28 June 2017 66 . Posterior probabilities were scaled as percentages to score the input and optimised sequences.
The presence of terminator-like elements 67 in the protein-coding region may result in expression of truncated mRNAs due to early transcription termination. Therefore, we implemented an optional check for putative terminators in the input and optimised sequences by cmsearch (INFERNAL version 1.1.2) 68 using the covariance models of terminators from RMfam 69,70 . We also allow users to filter the output sequences for the presence of restriction sites. Restriction modification sites (AarI, BsaI, and BsmBI) are avoided by default.

Sequence optimisation
We submitted the PSI:Biology targets that failed to be expressed (n=2,650) to the ExpOptimizer webserver from NovoPro Bioscience ( https://www.novoprolabs.com/tools/codon-optimization ). A total of 2,573 sequences were optimised. The target sequences were also optimised using a local version of COOL 28 and TIsigner using default settings. We also ran a random synonymous codon substitution as a control for these 2,573 sequences.

Statistical analysis
AUC and Gini importance scores were calculated using scikit-learn (version 0.20.2) 71 . The 95% confidence intervals for AUC scores were calculated using DeLong's method 72 . Spearman's correlation coefficients and Kolmogorov-Smirnov statistics were calculated using Pandas (version 0.23.4) 73 and scipy (version 1.2.1) 74,75 , respectively. Positive likelihood ratios with 95% confidence intervals were calculated using bootLR package 76,77 . The P-values of multiple testing were adjusted using Bonferroni's correction and reported to machine precision. Plots were generated using Matplotlib (version 3.0.2) 78 and Seaborn (version 0.9.0) 79 .

Code and data availability
Our code and data can be found in our GitHub repository ( https://github.com/ /Gardner-BinfLab/TIsigner_paper_2019 ). These include the scripts and Jupyter notebooks to reproduce our results and figures. TIsigner is written in Python 3.6 and the source code is available on ( https://github.com/Gardner-BinfLab/TIsigner ). The public web version of this tool runs at https://tisigner.otago.ac.nz .   inducible promoter and oFAB1173/BCD7, respectively. (C) Prediction accuracy of the expression outcomes of the PSI:Biology targets using opening energy (n=11,430). The opening energy at the region −23:24 (sub-sequence l=47 at position i=24, green crosshair) shows the highest prediction accuracy score (AUC=0.70). For this dataset, the expression vector used is pET21_NESG, in which the promoter and fusion tag are T7lac and C-terminal His tag, respectively. (D) Comparison between the correlations and AUC scores by sub-sequence region taken from the above analyses. The sub-sequence regions that have strong correlations are likely to have high AUC scores, whereas the sub-sequence regions that have no correlations are likely not useful in prediction of the expression outcome. (E) Correlation between the opening energy for the sub-sequences of E. coli transcripts and protein abundance. The transcripts used for this analysis are protein-coding sequences concatenated with 50 and 10 nt located upstream and downstream, respectively. The opening energy at the region −25:16 (sub-sequence l=41 at position i=16, green crosshair) shows the strongest correlation with protein abundance (R s =−0.17; n=3,725, PaxDb integrated proteomics dataset). R s , Spearman's rho.  Opening energy of 10 or below at the region − 24:24 is about two times more likely to come from the target genes that are successfully expressed than those that failed (with 95% confidence).
Cumulative frequency distributions of the true positive and false positive (less than type), and true negative and false negative (more than type) derived from the ROC analysis in Fig 2A  (left panel, opening energy −24:24). These values were used to estimate positive likelihood ratios with 95% confidence intervals using 10,000 bootstrap replicates. The estimated ratios and/or confidence intervals are inaccurate at low numbers of true positives or true negatives. Therefore, a four-parameter logistic curve was fitted to the positive likelihood ratios. Fitted values are useful to estimate the posterior probability of protein expression. The PSI:Biology targets that failed to be expressed were optimised using simulated annealing (n=2,650). The Kolmogorov-Smirnov distance between the distributions of '9' and 'full-length' was significantly different but sufficiently close (D KS =0.09, P<10 -7 ), indicating that optimisation of the first nine codons can achieve nearly optimum accessibility. For comparison, the distribution of the PSI:Biology targets that were successfully expressed are shown (n=8,780). (B) Accessibility of translation initiation sites can be increased indirectly using the existing gene optimisation tools and random synonymous codon substitution. 'TIsigner (9)' refers to the default settings of our tool, which allows synonymous substitutions up to the first nine codons (as above). (C) Accessibility of translation initiation sites can be optimised using PCR cloning. The forward primer should be designed according to TIsiger optimised sequences. For example, using a nested PCR approach, the optimised sequence can be produced using the forward primer designed with appropriate mismatches (gold bulges) to amplify the amplicon from the initial PCR reaction.