Abstract
The degeneracy of the genetic code allows nucleic acids to encode amino acid identity as well as non-coding information for gene regulation and genome maintenance. The rare arginine codons AGA and AGG (AGR) present a case study in codon choice, with AGRs encoding important transcriptional and translational properties distinct from the other synonymous alternatives (CGN). We created a strain of Escherichia coli with all 123 instances of AGR codons removed from all essential genes. We readily replaced 110 AGR codons with the synonymous CGU, but the remaining thirteen “recalcitrant” AGRs required diversification to identify viable alternatives. Successful replacement codons tended to conserve local ribosomal binding site-like motifs and local mRNA secondary structure, sometimes at the expense of amino acid identity. Based on these observations, we empirically defined metrics for a multi-dimensional “safe replacement zone” (SRZ) within which alternative codons are more likely to be viable. To further evaluate synonymous and non-synonymous alternatives to essential AGRs, we implemented a CRISPR/Cas9-based method to deplete a diversified population of a wild type allele, allowing us to exhaustively evaluate the fitness impact of all 64 codon alternatives. Using this method, we confirmed relevance of the SRZ by tracking codon fitness over time in 14 different genes, finding that codons that fall outside the SRZ are rapidly depleted from a growing population. Our unbiased and systematic strategy for identifying unpredicted design flaws in synthetic genomes and for elucidating rules governing codon choice will be crucial for designing genomes exhibiting radically altered genetic codes.
Significance Statement
This work presents the genome-wide replacement of all rare AGR arginine codons in the essential genes of Escherichia coli with synonymous CGN alternatives. Synonymous codon substitutions can lethally impact non-coding function by disrupting mRNA secondary structure and ribosomal binding site-like motifs. Here we quantitatively define the range of tolerable deviation in these metrics and use this relationship to provide critical insight into codon choice in recoded genomes. This work demonstrates that genome-wide removal of AGR is likely to be possible, and provides a framework for designing genomes with radically altered genetic codes.
Main Text
The genetic code possesses inherent redundancy (1), with up to six different codons specifying a single amino acid. While it is tempting to approximate synonymous codons as equivalent (2), most prokaryotes and many eukaryotes (3, 4) display a strong preference for certain codons over synonymous alternatives (5, 6). While different species have evolved to prefer different codons, codon bias is largely consistent within each species (5). However, within a given genome, codon bias differs among individual genes according to codon position, suggesting that codon choice has functional consequences. For example, rare codons are enriched at the beginning of essential genes (7, 8), and codon usage strongly affects protein levels (9-11), especially at the N-terminus (12). This suggests that codon usage plays a poorly understood role in regulating protein expression. Several hypotheses attempt to explain how codon usage mediates this effect, including but not limited to: facilitating ribosomal pausing early in translation to optimize protein folding (13); adjusting mRNA secondary structure to optimize translation initiation or to modulate mRNA degradation; preventing ribosome stalling by co-evolving with tRNA levels (6); providing a “translational ramp” for proper ribosome spacing and effective translation (14); and providing a layer of translational regulation for independent control of each gene in an operon (15). Additionally, codon usage may impact translational fidelity (16), and the proteome may be tuned by fine control of the decoding tRNA pools (17). Although Quax et al. provide an excellent review of how biology chooses codons, systematic and exhaustive studies of codon choice in whole genomes are lacking (18). Studies have only begun to empirically probe the effects of codon choice in a relatively small number of reporter genes (12, 19-22). 
Several important questions must be answered as a first step towards designing custom genomes exhibiting new functions—How flexible is genome-wide codon choice? How does codon choice interact with the maintenance of cellular homeostasis? What heuristics can be used to predict which codons will conserve genome function?
Replacing all essential instances of a codon in a single strain would provide valuable insight into the constraints that determine codon choice and aid in the design of recoded genomes. Although the UAG stop codon has been completely removed from Escherichia coli (23), no genome-wide replacement of a sense codon has been reported. While the translation function of the AGG codon has been shown to permit efficient suppression with nonstandard amino acids (24-26), AGG necessarily remains translated as Arg in each of these studies. No study has yet demonstrated that all AGR codons (or all instances of any sense codon) can be removed from essential genes, nor explained why certain AGR codons could not be changed successfully. These insights are crucial for unambiguously reassigning AGR translation function.
We chose to study the rare arginine codons AGA and AGG (termed AGR according to IUPAC conventions) because the literature suggests that they are among the most difficult codons to replace and that their similarity to ribosome binding sequences underlies important non-coding functions (8, 27-30). Furthermore, their sparse usage (123 instances in the essential genes of E. coli MG1655 and 4228 instances in the entire genome (Table 1, S1)) made replacing all AGR instances in essential genes a tractable goal, with essential genes serving as a stringent test set for identifying any fitness impact from codon replacement (31). Additionally, recent work has shown the difficulty of directly mutating some AGR codons to other synonymous codons (25), although the authors do not explain the mechanism of failure or report successful implementation of alternative designs. We attempted to remove all 123 instances of AGR codons from essential genes by replacing them with the synonymous CGU codon. CGU was chosen to maximally disrupt the primary nucleic acid sequence (AGR→CGU). We hypothesized that this strategy would maximize design flaws, thereby revealing rules for designing genomes with reassigned genetic codes. Importantly, individual codon targets were not inspected a priori in order to ensure an unbiased empirical search for design flaws.
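As an illustration of how AGR instances can be tallied in a gene, the following minimal Python sketch scans a coding sequence in frame and reports each AGA or AGG codon together with its normalized position. The function name and the toy sequence are hypothetical and are not taken from the pipeline used in this study. As a simplification, the stop codon is counted in the ORF length.

```python
def find_agr_codons(cds):
    """Return (codon_number, codon, normalized_position) for each in-frame AGR codon.

    cds: DNA coding sequence beginning at the start codon.
    Codons are numbered from 1; normalized position is codon_number / total codons.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    n_codons = len(cds) // 3
    hits = []
    for i in range(n_codons):
        codon = cds[3 * i : 3 * i + 3]
        # Only in-frame AGA/AGG triplets count; out-of-frame matches are ignored.
        if codon in ("AGA", "AGG"):
            hits.append((i + 1, codon, (i + 1) / n_codons))
    return hits


# Toy ORF (hypothetical): ATG AGA GCT AGG TAA
toy_orf = "ATGAGAGCTAGGTAA"
for number, codon, position in find_agr_codons(toy_orf):
    print(number, codon, round(position, 2))
```

Scanning in frame matters here because a sequence such as GAG AGC contains the string AGA without containing an AGA codon.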
Summary of AGR codons changed by location in the genome, and failure rates by pool.
To construct this modified genome, we used co-selection multiplex automatable genome engineering (CoS-MAGE) (32, 33) to create an E. coli strain (C123) with all 123 AGR codons removed from its essential genes (Figure 1A; see Table S1 for a complete list of AGR codons in essential genes). CoS-MAGE leverages lambda red-mediated recombination (34, 35) and exploits the linkage between a mutation in a selectable allele (e.g., tolC) and nearby edits of interest (e.g., AGR conversions), thereby enriching for cells with those edits (Figure S1). To streamline C123 construction, we chose to start with E. coli strain EcM2.1, which was previously optimized for efficient lambda red-mediated genome engineering (33, 36). CoS-MAGE in EcM2.1 improves allele replacement frequency 10-fold over MAGE in non-optimized strains, but it performs optimally when all edits are on the same replichore and within 500 kilobases of the selectable allele (33). To accommodate this requirement, we divided the genome into 12 segments containing all 123 AGR codons in essential genes. A tolC cassette was moved around the genome to enable CoS-MAGE in each segment, allowing us to rapidly prototype each set of AGR→CGU mutations across large cell populations in vivo. (See the ‘General Replacement Strategy’ and ‘Troubleshooting Strategy’ sections of the Materials & Methods for a more detailed discussion.) Of the 123 AGR codons in essential genes, 110 could be changed to CGU by this process (Figure 1), revealing considerable flexibility of codon usage for most essential genes. Allele replacement (in this case, AGR→CGU codon substitution) frequency varied widely across these 110 permissive codons, with no clear correlation between allele replacement frequency and normalized position of the AGR codon in a gene (Figure 2A).
Workflow used to create and analyze strain C123. The DESIGN phase involved identification of 123 AGR codons in the essential genes of Escherichia coli. MAGE oligos were designed to replace all instances of these AGR codons with the synonymous CGU codon. The BUILD phase used CoS-MAGE to convert 110 AGR codons to CGU and to identify 13 AGR codons that required additional troubleshooting. The in vivo TROUBLESHOOTING phase resolved the 13 codons that could not be readily converted to CGU and identified mechanisms potentially explaining why AGR→CGU was not successful. In the STUDY phase, next-generation sequencing, evolution, and phenotyping were performed on strain C123. (outer) Schematic of the C123 genome (Nucleotide 0 oriented up; numbering according to strain MG1655). Exterior labels indicate the set groupings of AGR codons. Successful AGR→CGU conversions (110 instances) are indicated by radial green lines, and recalcitrant AGR codons (13 instances) are indicated by radial red lines.
The remaining 13 AGR→CGU mutations were not observed, suggesting a codon substitution frequency below our detection limit of 1% of the bacterial population (Materials & Methods, Table S6). These “recalcitrant codons” were assumed to be deleterious or non-recombinogenic and were triaged into a troubleshooting pipeline for further analysis (Figure 1). Interestingly, all but one of the thirteen recalcitrant codons were located near the termini of their respective genes, suggesting the importance of codon choice at these positions: seven were at most 30 nucleotides (nt) downstream of the start codon, while five were at most 30 nt upstream of the stop codon (Figure 2A, lower panel, Table S8). Due to our unbiased design strategy, we anticipated that several AGR→CGU mutations would present obvious design flaws such as introducing non-synonymous mutations (two instances) or RBS disruptions (four instances) in overlapping genes. For example, ftsI_AGA1759 overlaps the second and third codons of the essential gene murE, so an AGA→CGU substitution introduces a missense mutation (murE D3V) that may impair fitness. Replacing ftsI_AGA with CGA removed the forbidden AGA codon while conserving the primary amino acid sequence of MurE, with minimal impact on fitness (Figure 3A, Table S6). Similarly, holB_AGA4 overlaps the upstream essential gene tmk, and replacing AGA with CGU converts the tmk stop codon to Cys, adding 14 amino acids to the C-terminus of tmk. While some C-terminal extensions are well-tolerated in E. coli (37), extending tmk appears to be deleterious. We successfully replaced holB_AGA with CGC by inserting three nucleotides comprising a stop codon before the holB start codon. This reduced the tmk/holB overlap and preserved the coding sequences of both genes (Figure S2A).
(A) AGR recombination frequency (MASC-PCR, n=96 clones per cell population) was plotted versus the normalized ORF position (residue number of the AGR codon divided by the total length of the ORF). Failed AGR→CGU conversions are indicated using vertical red lines below the x-axis. (B) Doubling time of strains in the C123 lineage in LBl media at 34 °C was determined in triplicate on a 96-well plate reader. Colored bars indicate which set of codons was under construction when a doubling time was determined (coloring based on Figure 1). Each data point represents a different stage of strain construction. Alternative codons were identified for 13 recalcitrant AGR codons in our troubleshooting pipeline, the optimized replacement sequences were incorporated into the final strain (gray section at right, labeled with a ’*’), and the resulting doubling times were measured. Error bars represent standard error of the mean in doubling time from at least three replicates of each strain.
Wild type AGR codons are indicated in bold black letters, design flaws are indicated in red letters, and optimized replacement genotypes are indicated in green letters. (A) Genes ftsI and murE overlap with each other. An AGA→CGU mutation in ftsI would introduce a non-conservative Asp3Val mutation in murE. The amino acid sequence of murE was preserved by using an AGA→CGA mutation. (B) Gene secE overlaps with the RBS for downstream essential gene nusG. An AGG→CGU mutation is predicted to diminish the RBS strength by 97% (47). RBS strength is preserved by using a non-synonymous AGG→GAG mutation. (C) Gene ssb has an internal RBS-like motif shortly after its start codon. An AGA→CGU mutation would diminish the RBS strength by 94%. RBS strength is preserved by using an AGA→CGA mutation combined with additional wobble mutations indicated in green letters. (D) Gene rnpA has a defined mRNA structure that would be changed by an AGG→CGU mutation. The original RNA structure is preserved by using an AGG→CGG mutation. The RBS (green), start codon (blue) and AGR codon (red) are annotated with like-colored boxes on the predicted RNA secondary structures.
Additionally, the four remaining C-terminal failures included AGR→CGU mutations that disrupt RBS motifs belonging to downstream genes (secE_AGG376 for nusG, dnaT_AGA532 for dnaC, and folC_AGAAGG1249,1252 for dedD, the latter constituting two codons). Both nusG and dnaC are essential, suggesting that replacing AGR with CGU in secE and dnaT lethally disrupts translation initiation and thus expression of the overlapping nusG and dnaC (Figure 3B, S2B). Although dedD is annotated as non-essential (31), we hypothesized that replacing the AGR with CGU in folC disrupted a portion of dedD that is essential to the survival of EcM2.1 (E. coli K-12). In support of this hypothesis, we were unable to delete the 29 nucleotides of dedD that were not deleted by Baba et al. (31) and did not overlap with folC, suggesting that this sequence is essential in our strains. The unexpected failure of this conversion highlights the challenge of predicting design flaws even in well-annotated organisms. Consistent with our observation that disrupting these RBS motifs underlies the failed AGR→CGU conversions, we overcame all four design flaws by selecting codons that conserved RBS strength, including a non-synonymous (Arg→Gly) conversion for secE.
These lessons, together with previous observations that ribosomes pause during translation when they encounter ribosome binding site motifs in coding DNA sequences (20), provided key insights into the N-terminal AGR→CGU failures. Three of the N-terminal failures (ssb_AGA10, dnaT_AGA10 and prfB_AGG64) had RBS-like motifs either disrupted or created by CGU replacement. While prfB_AGG64 is part of the ribosomal binding site motif that triggers an essential frameshift mutation in prfB (21, 38, 39), pausing-motif-mediated regulation of ssb and dnaT expression has not been reported. Nevertheless, ribosomal pausing data (20) showed that ribosomal occupancy peaks are present directly downstream of the AGR codons for ssb and absent for dnaT (Figure S3); meanwhile, unsuccessful CGU mutations were predicted to weaken the RBS-like motif for prfB and ssb and strengthen the RBS-like motif for dnaT (Figure 3C, S2C), suggesting a functional relationship between RBS occupancy and cell fitness. Consistent with this hypothesis, successful codon replacements from the troubleshooting pipeline conserve predicted RBS strength compared to the large predicted deviation caused by unsuccessful AGR→CGU mutations (Figure 4, y-axis and comparison between orange asterisks and green dots). Interestingly, attempts to replace dnaT_AGA10 with either CGN or NNN failed; only by manipulating the wobble position of surrounding codons and conserving the arginine amino acid could dnaT_AGA10 be replaced (Figure S2C). These wobble variants appear to compensate for the increased RBS strength caused by the AGA→CGU mutation: RBS motif strength with wobble variants deviated 8-fold from the unmodified sequence, whereas RBS motif strength for AGA→CGU alone deviated 27-fold.
Scatter plot showing predicted RBS strength (y-axis, calculated with the Salis ribosome binding site calculator (47)) versus deviations in mRNA folding (x-axis, calculated at 37°C by UNAFold Calculator (41)). Small gray dots represent non-essential genes in E. coli MG1655 that have an AGR codon within the first 10 or last 10 codons. Large gray dots represent successful AGR→CGU conversions in the first 10 or last 10 codons of essential genes. Orange asterisks represent unsuccessful AGR→CGU mutations (recalcitrant codons) in essential genes. Green dots represent optimized solutions for these recalcitrant codons. The “safe replacement zone” (blue shaded region) is an empirically defined range of mRNA folding and RBS strength deviations, based on the successful AGR→CGU replacement mutations observed in this study. Most unsuccessful AGR→CGU mutations (orange asterisks) cause large deviations in RBS strength or mRNA structure that are outside the “safe replacement zone.” Genes holB and ftsI are two notable exceptions because their initial CGU mutations caused amino acid changes in overlapping essential genes. Gene folC corresponds to 2 AGRs. Arrows for four examples of optimized replacement codons (ftsA, folC, rnpA, rpsJ) show that deviations in RBS strength and/or mRNA structure are reduced. Arrows are omitted for the remaining 8 optimized replacement codons to improve readability.
To better understand several remaining N-terminal failure cases that did not exhibit considerable RBS strength deviations (rnpA_AGG22, ftsA_AGA19, frr_AGA16, and rpsJ_AGA298), we examined other potential nucleic acid determinants of protein expression. Based on the observation that mRNA secondary structure near the 5’ end of open reading frames (ORFs) strongly impacts protein expression (12), we found that these four remaining AGR→CGU mutations changed the predicted folding energy and structure of the mRNA near the start codon of target genes (Figure 3D, S4). Successful codon replacements obtained from degenerate MAGE oligos reduced the disruption of mRNA secondary structure compared to CGU (Figure 4, green dots). For example, rnpA has a predicted mRNA loop near its RBS and start codon that relies on base pairing between both guanines of the AGG codon and nearby cytosines (Figure 3D, S5A). Importantly, only AGG22CGG was observed out of all attempted rnpA AGG22CGN mutations, and the fact that only CGG preserves this mRNA structure suggests that it is physiologically important (Figure 3D, S5B-C). In support of this, we successfully introduced a rnpA AGG22CUG mutation (Arg→Leu) only when we changed the complementary nucleotides in the stem from CC (base pairs with AGG) to CA (base pairs with CUG), thus preserving the natural RNA structure (Figure S5D) while changing both RBS motif strength and amino acid identity. Our analysis of all four optimized gene sequences showed reduced deviation in computational mRNA folding energy (computed with UNAFold (40)) compared to the unsuccessful CGU mutations (Figure 4, x-axis, orange asterisks and green dots). Similarly, predicted mRNA structure (computed with a different mRNA folding software, NUPACK (41)) for these genes was strongly changed by CGU mutations and corrected in our empirically optimized solutions (Figure S4).
Troubleshooting these 13 recalcitrant codons revealed that mutations causing large deviations from natural mRNA folding energy or RBS strength are associated with failed codon substitutions. By calculating these two metrics for all attempted AGR→CGU mutations, we empirically defined a safe replacement zone (SRZ) inside which most CGU mutations were tolerated (Figure 4, shaded area). The SRZ is defined as the largest multi-dimensional space that contains none of the mRNA folding energy or RBS strength associated AGR→CGU failures (Figure 4, orange asterisks). It comprises deviations in mRNA folding energy of less than 10% with respect to the natural codon and deviations in RBS-like motif scores of less than a half log with respect to the natural codon, providing a quantitative guideline for codon substitution. Notably, the optimized solutions used to replace the 13 recalcitrant codons always exhibited reduced deviation for at least one of these two parameters compared to the deviation seen with a CGU mutation. Furthermore, solutions to the 13 recalcitrant codons overlapped almost entirely with the empirically defined SRZ. These results suggest that computational predictions of mRNA folding energy and RBS strength can be used as a first approximation to predict whether a designed mutation is likely to be viable. Developing in silico heuristics to predict problematic alleles streamlines the use of in vivo genome engineering methods such as MAGE to empirically identify viable replacement codons. Therefore, these heuristics reduce the search space required to redesign viable genomes, raising the prospect of creating radically altered genomes exhibiting expanded biological functions.
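The two SRZ thresholds stated above can be expressed as a simple membership test. The sketch below is illustrative only and rests on several assumptions that are not specified in the text: folding deviation is taken as the relative change in predicted folding energy, the "half log" is interpreted in base 10, and the input values are assumed to come from external prediction tools. All function names and numbers are hypothetical.

```python
import math

FOLDING_LIMIT = 0.10  # less than 10% deviation in mRNA folding energy
RBS_LOG_LIMIT = 0.5   # less than half a log unit (base 10 assumed here)


def folding_deviation(dg_wt, dg_mut):
    """Relative change in predicted folding energy versus the natural codon."""
    if dg_wt == 0:
        raise ValueError("wild-type folding energy must be non-zero")
    return abs(dg_mut - dg_wt) / abs(dg_wt)


def rbs_deviation(rbs_wt, rbs_mut):
    """Absolute log10 fold change in predicted RBS-like motif strength."""
    if rbs_wt <= 0 or rbs_mut <= 0:
        raise ValueError("RBS strengths must be positive")
    return abs(math.log10(rbs_mut / rbs_wt))


def in_srz(dg_wt, dg_mut, rbs_wt, rbs_mut):
    """True if both deviations fall inside the safe replacement zone."""
    return (folding_deviation(dg_wt, dg_mut) < FOLDING_LIMIT
            and rbs_deviation(rbs_wt, rbs_mut) < RBS_LOG_LIMIT)


# Hypothetical values: a 5% folding change with a 1.5-fold RBS change is inside.
print(in_srz(-20.0, -19.0, 100.0, 150.0))
# Hypothetical values: a 97% drop in RBS strength (100 to 3) is outside.
print(in_srz(-20.0, -19.0, 100.0, 3.0))
```

Under this reading, a half-log limit corresponds to roughly a 3.2-fold change in predicted RBS strength in either direction.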
Once we had identified viable replacement sequences for all 13 recalcitrant codons, we combined the successful 110 CGU conversions with the 13 optimized codon substitutions to produce strain C123, which has all 123 AGR codons removed from all of its annotated essential genes. C123 was then sequenced to confirm AGR removal and analyzed using Millstone, a publicly available genome resequencing analysis pipeline (42). Two spontaneous AAG (Lys) to AGG (Arg) mutations were observed in the essential genes pssA and cca. While attempts to revert these mutations to AAG were unsuccessful—perhaps suggesting functional compensation—we were able to replace them with CCG (Pro) in pssA and CAG (Gln) in cca using degenerate MAGE oligos. The resulting strain, C123a, is the first strain completely devoid of AGR codons in its annotated essential genes (Sequences available online). Although some AGR codons in non-essential genes could unexpectedly prove to be difficult to change, our success at replacing all 123 instances of AGR codons in essential genes provides strong evidence that the remaining 4105 AGR codons can be completely removed from the E. coli genome, permitting the unambiguous reassignment of AGR translation function (23).
Kinetic growth analysis showed that the doubling time increased from 52.4 (+/− 2.6) minutes in EcM2.1 (0 AGR codons changed) to 67 (+/− 1.5) minutes in C123a (123 AGR codons changed in essential genes) in lysogeny broth (LB) at 34 °C in a 96-well plate reader (see Materials and Methods). Notably, fitness varied significantly during C123 strain construction (Figure 2B). This may be attributed to codon deoptimization (AGR→CGU) and compensatory spontaneous mutations to alleviate fitness defects in a mismatch repair deficient (mutS-) background. Overall, the reduced fitness of C123a may be caused by on-target (AGR→CGU) or off-target (spontaneous) mutations that occurred during strain construction. In this way, mutS inactivation is simultaneously a useful evolutionary tool and a liability. Final genome sequence analysis revealed that along with the 123 desired AGR conversions, C123a had 419 spontaneous non-synonymous mutations not found in the EcM2.1 parental strain (Figure S10). Of particular interest was the mutation argU_G15A, located in the D arm of tRNAArg (argU), which arose during CoS-MAGE with AGR set 4. We hypothesized that argU_G15A compensates for increased CGU demand and decreased AGR demand, but we observed no direct fitness cost associated with reverting this mutation in C123, and argU_G15A does not impact aminoacylation efficiency in vitro or aminoacyl-tRNA pools in vivo (Figure S6). Consistent with Mukai et al. and Baba et al. (25, 31), argW (tRNAArgCCU; decodes AGG only) was dispensable in C123a because it can be complemented by argU (tRNAArgUCU; decodes both AGG and AGA). However, argU is the only E. coli tRNA that can decode AGA and remains essential in C123a, probably because it is required to translate the AGR codons for the rest of the proteome (23).
To evaluate the genetic stability of C123a after removal of all AGR codons from all the known essential genes, we passaged C123a for 78 days (640 generations) to test whether AGR codons would recur and/or whether spontaneous mutations would improve fitness. After 78 days, no additional AGR codons were detected in a sequenced population (sequencing data available at https://github.com/churchlab/agr_recoding) and doubling time of isolated clones ranged from 22% faster to 22% slower than C123a (n=60).
To gain more insight into how local RBS strength and mRNA folding impact codon choice, we performed an evolution experiment to examine the competitive fitness of all 64 possible codon substitutions at each of the AGR codons tested (Table S2). While MAGE is a powerful method to explore viable genomic modifications in vivo, we were interested in mapping the fitness cost associated with less-optimal codon choices, requiring codon randomization depleted of the parental genotype, which we hypothesized to be at or near the global fitness maximum. To do this, we developed a method called CRAM (Crispr-Assisted MAGE). First, we designed oligos that not only changed the target AGR codon to NNN, but also made several synonymous changes at least 50 nt downstream that would disrupt a 20 bp CRISPR target locus. MAGE was used to replace each AGR with NNN in parallel, and CRISPR/Cas9 was used to deplete the population of cells with the parental genotype. This approach allowed exhaustive exploration of the codon space, including the original codon, but without the preponderance of the parental genotype. Following CRAM, the population was passaged 1:100 every 24 hours for six days, and sampled prior to each passage using Illumina sequencing (see Table S2 & Figure 5).
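To make the size of the NNN search space concrete, the following minimal Python sketch enumerates all 64 codons under the standard genetic code and groups them relative to arginine. It is an illustration only and not part of the CRAM analysis; the variable and function names are hypothetical, and codons are written with DNA letters (T rather than U).

```python
from collections import Counter
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(codon):
    """Group a codon by its relationship to an arginine (AGR) codon."""
    aa = CODON_TABLE[codon]
    if aa == "R":
        return "arginine"
    if aa == "*":
        return "stop"
    return "other amino acid"


counts = Counter(classify(codon) for codon in CODON_TABLE)
arg_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "R")
print(counts)
print(arg_codons)
```

Six of the 64 codons encode arginine (AGA, AGG, and the four CGN codons), so an NNN library at an AGR position contains 58 codons that do not encode arginine, three of which are stop codons.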
CRAM (Crispr-Assisted MAGE) was used to explore codon preference for several AGR codons located within the first 10 codons of their CDS. Briefly, MAGE was used to diversify a population by randomizing the AGR of interest, then CRISPR/Cas9 was used to deplete the parental (unmodified) population, allowing exhaustive exploration of all 64 codons at a position of interest. Thereafter codon abundance was monitored over time by serially passaging the population of cells and sequencing using an Illumina MiSeq. The left y-axis (Codon Frequency) indicates relative abundance of a particular codon (stacked area plot). The right y-axis indicates the combined deviations in mRNA folding structure (red line) and internal RBS strength (blue line) in arbitrary units (AU) normalized to 0.5 at the initial timepoint. 0 means no deviation from wild type. The horizontal axis indicates the experimental time point in hours at which a particular reading of the population diversity was obtained. Genes bcsB and chpS are non-essential in our strains and thus serve as controls for AGR codons that are not under essential gene pressure.
Sequencing 24 hours after CRAM showed that all codons were present (including stop codons) (Figure S7), validating the method as a technique to generate massive diversity in a population. All sequences for further analysis were amplified by PCR with allele-specific primers containing the changed downstream sequence. Subsequent passaging of these populations revealed many gene-specific trends (Figure 5, S7, S8). Notably, all codons that required troubleshooting (dnaT_AGA10, ftsA_AGA19, frr_AGA16, rnpA_AGG22) converged to their wild-type AGR codon, suggesting that the original codon was globally optimized. For all cases in which an alternate codon replaced the original AGR, we computed the predicted deviation in mRNA folding energy and local RBS strength (as a proxy for ribosome pausing) for these alternative codons and compared these metrics to the evolution of codon distribution at this position over time. We also computed the fraction of sequences that fall within the SRZ inferred from Figure 4 (see Methods). CRAM initially introduced a large diversity of mRNA folding energies and RBS strengths, but these genotypes rapidly converged toward parameters that are similar to the parental AGR values in many cases (Figure 5, overlays). Codons that strongly disrupted predicted mRNA folding and internal RBS strength near the start of genes were disfavored after several days of growth, suggesting that these metrics can be used to predict optimal codon substitutions in silico. In contrast, non-essential control genes bcsB and chpS did not converge toward codons that conserved RNA structure or RBS strength, supporting the conclusion that the observed conservation in RNA secondary structure and RBS strength is biologically relevant for essential genes. Interestingly, tilS_AGA19 was less sensitive to this effect, suggesting that codon choice at that particular position is not under selection. 
Additionally, the average internal RBS strength for the ispG populations converged towards the parental AGR values whereas mRNA folding energy averages did not, suggesting that this position in the gene may be more sensitive to RBS disruption than to changes in mRNA folding. Gene lptF followed the opposite trend.
Interestingly, several genes (lptF, ispG, tilS, gyrA and rimN) preferred codons that changed the amino acid identity from Arg to Pro, Lys, or Glu, suggesting that non-coding functions trump amino acid identity at these positions. Importantly, all successful codon substitutions in essential genes fell within the SRZ (Figure 6), validating our heuristics based on an unbiased test of all 64 codons. Meanwhile, non-essential control gene chpS exhibited less dependence on the SRZ. Based on these observations, while global codon bias may be affected by tRNA availability (6, 43-45), codon choice at a given position may be defined by at least 3 parameters: (1) amino acid sequence, (2) mRNA structure near the start codon and RBS, and (3) RBS-mediated pausing. In some cases, a subset of these parameters may not be under selection, resulting in an evolved sequence that only converges for a subset of the metrics. In other cases, all metrics may be important, but the primary nucleic acid sequence might not have the flexibility to accommodate all of them equally, resulting in codon substitutions that impair cellular fitness.
Scatter plot showing the results of the CRAM experiment (Figure 5). Each panel represents a different gene. The y-axis represents RBS strength deviation (calculated with the Salis ribosome binding site calculator (47)), while the x-axis shows deviation in mRNA folding energy (calculated at 37°C by UNAFold Calculator (41)). Codon abundance at the intermediate time point (t = 72 h, chosen to show maximal diversity after selection) is represented by the dot size. Green dots represent the WT codon. Blue dots represent synonymous AGR codons. Orange dots represent the remaining 58 non-synonymous codons, which may introduce non-viable amino acid substitutions. Black squares represent unsuccessful AGR→CGU conversions observed in the genome-wide recoding effort (Table 1, Figure 1). The “safe replacement zone” (blue shaded region) is the empirically defined range of mRNA folding and RBS strength deviations, based on the successful AGR→CGU replacement mutations observed in this study (Figure 4). Genes bcsB and chpS are non-essential in our strains and thus serve as controls for AGR codons that are not under essential gene pressure.
These rules were used to generate a draft genome in silico with all AGR codons replaced genome-wide, reducing by almost fourfold the number of predicted design flaws (e.g., synonymous codons with metrics outside of the SRZ) compared to the naive replacement strategy (Figure 7, Figure S9, Table S7, see Methods). Furthermore, predicting recalcitrant codons provides hypotheses that can be rapidly tested in vivo using MAGE. Successful replacement sequences can then be implemented together in a redesigned genome. Encouragingly, since all newly predicted design flaws occur in non-essential genes, they would be less likely to impact fitness unless (1) despite the “non-essential” annotation, the gene is actually essential or quasi-essential (i.e., inactivation would impair growth), or (2) the codon in a non-essential gene impacts the expression of a neighboring essential gene (e.g., impacts an RBS motif or RNA structure). While incorrect genome annotations can only be addressed empirically (as demonstrated with gene dedD), further analysis reveals that AGR codons in non-essential genes should rarely impact annotated essential genes. In E. coli MG1655, only three AGR codons in non-essential genes overlap with the initial mRNA and RBS motifs of essential genes, and at least one synonymous CGN codon is predicted to obey the SRZ for all three cases. Furthermore, even if all synonymous mutations were to disobey the SRZ, since disruption of non-essential gene function should not compromise viability, it is expected that non-synonymous mutations in non-essential genes would be viable as long as they conserve crucial motifs impacting expression of the essential gene. Importantly, we confirmed by MAGE that AGR→CGU codon replacement was possible for 2 of these 3 cases and that an alternative synonymous solution could be found in the remaining case (see Methods).
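The informed replacement strategy described above can be illustrated with a minimal Python sketch. It assumes that the deviations in mRNA folding energy and RBS strength have already been computed for each CGN candidate at a given AGR position. The threshold values, the `Candidate` type, the symmetric treatment of the bounds, and the scoring used to rank in-zone candidates are all hypothetical placeholders for illustration; they are not the empirically defined SRZ bounds or the exact selection procedure used in this work.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical SRZ bounds (placeholders only). The actual SRZ is defined
# empirically from the successful AGR->CGU replacements and may be asymmetric.
SRZ_MAX_FOLDING_DEVIATION = 0.1  # assumed bound on mRNA folding energy deviation
SRZ_MAX_RBS_DEVIATION = 0.5      # assumed bound on RBS strength deviation


@dataclass(frozen=True)
class Candidate:
    """One CGN alternative with precomputed deviations from the wild-type sequence."""
    codon: str
    folding_deviation: float
    rbs_deviation: float


def in_srz(c: Candidate) -> bool:
    """True if both metrics fall inside the (assumed, symmetric) SRZ bounds."""
    return (abs(c.folding_deviation) <= SRZ_MAX_FOLDING_DEVIATION
            and abs(c.rbs_deviation) <= SRZ_MAX_RBS_DEVIATION)


def choose_replacement(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Pick the in-SRZ candidate closest to wild type; None marks a predicted recalcitrant codon."""
    safe = [c for c in candidates if in_srz(c)]
    if not safe:
        return None
    # Assumed ranking: sum of deviations, each normalized by its SRZ bound.
    return min(
        safe,
        key=lambda c: abs(c.folding_deviation) / SRZ_MAX_FOLDING_DEVIATION
        + abs(c.rbs_deviation) / SRZ_MAX_RBS_DEVIATION,
    )


if __name__ == "__main__":
    # Made-up deviations for one AGR position, for illustration only.
    position = [
        Candidate("CGU", 0.30, 0.10),  # outside the assumed folding bound
        Candidate("CGC", 0.05, 0.20),
        Candidate("CGA", 0.02, 0.40),
        Candidate("CGG", 0.50, 0.90),  # outside both assumed bounds
    ]
    best = choose_replacement(position)
    print(best.codon if best else "predicted recalcitrant")
```

Under this sketch, a position is counted as a predicted design flaw only when `choose_replacement` returns `None`, i.e., when no CGN alternative falls within the zone; such positions would be candidates for empirical testing with degenerate MAGE oligos.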
(A) Empirical data from the construction of C123. 110 AGR codons were successfully recoded to CGU (green), and 13 recalcitrant AGR codons required troubleshooting (red, striped). (B) Predicted recalcitrant codons (codons for which no CGN alternatives fall within the SRZ in Figure 4) for replacing all instances of the AGR codons genome-wide. The reference genome used for this analysis had insertion elements and prophages removed (48) to reduce total nucleotides synthesized and to increase genome stability, leaving 3222 AGR codons to be replaced (see Methods). Our analysis predicts that replacing all instances of AGR with CGU would result in 229 failed conversions (‘Naive Replacement’, red striped). However, implementing the rules from this work (‘Informed Replacement’) to identify the best CGN alternative reduces the predicted failure rate from 7.1% (229/3222) to 2.0% (64/3222), of which only a small subset is expected to have a direct impact on fitness because of their location in non-essential genes. In such cases, MAGE with degenerate oligos could be used to empirically identify replacement codons, as we have demonstrated herein. Each specific synonymous CGN is identified with a unique shade of green and is labeled inside of its respective section.
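The failure rates quoted in this caption follow directly from the stated counts. The short Python check below reproduces the arithmetic; the variable and function names are illustrative only.

```python
# Counts taken from the caption above.
TOTAL_AGR = 3222          # AGR codons in the reduced reference genome
NAIVE_FAILURES = 229      # predicted failures when every AGR is replaced with CGU
INFORMED_FAILURES = 64    # predicted failures when the best CGN is chosen per position


def failure_rate_percent(failures: int, total: int) -> float:
    """Predicted failed conversions as a percentage of all AGR codons."""
    return 100.0 * failures / total


naive_rate = failure_rate_percent(NAIVE_FAILURES, TOTAL_AGR)
informed_rate = failure_rate_percent(INFORMED_FAILURES, TOTAL_AGR)
fold_reduction = NAIVE_FAILURES / INFORMED_FAILURES

print(f"Naive replacement:    {naive_rate:.1f}%")    # 7.1%
print(f"Informed replacement: {informed_rate:.1f}%")  # 2.0%
print(f"Fold reduction:       {fold_reduction:.1f}")  # 3.6
```

The ratio of about 3.6 corresponds to the "almost fourfold" reduction in predicted design flaws stated in the main text.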
To conclude, comprehensively removing all instances of AGR codons from all E. coli essential genes revealed 13 design flaws that could be explained by a disruption in coding DNA sequence, RBS-mediated translation initiation, RBS-mediated translation pausing, or mRNA structure. While the importance of each factor has been reported, our work systematically explores to what extent and at what frequency they impact genome function. Furthermore, our work establishes quantitative guidelines to reduce the chance of designing non-viable genomes. Although additional factors undoubtedly impact genome function, the fact that these guidelines captured all instances of failed synonymous codon replacements (Figure 4) suggests that our genome design guidelines provide a strong first approximation of acceptable modifications to the primary sequence of viable genomes. These design rules, coupled with inexpensive DNA synthesis, will facilitate the construction of radically redesigned genomes exhibiting useful properties such as biocontainment, virus resistance, and expanded amino acid repertoires (46).
Materials and Methods
Supplementary Materials
Figures S1-S10
Tables S1-S5
Acknowledgments
This work was supported by the US Department of Energy [DE-FG02-02ER63445]; by the US Defense Advanced Research Projects Agency [N66001-12-C-4211 to F.J.I. and D.S.]; by the National Institute of General Medical Sciences [GM22854-42 to D.S.]; by a US Department of Defense NDSEG Fellowship (to M.J.L. and G.K.); by a US National Science Foundation Graduate Research Fellowship (to D.B.G.); by the Lynch Foundation (to M.M.L.); by an Amazon AWS in Education Grant Award (to G.K.); and by the Arnold & Mabel Beckman Foundation and DuPont, Inc. (to F.J.I.). Funding for open access charge: US Department of Energy [DE-FG02-02ER63445]. Stephanie Yaung generously provided CRISPR plasmids and technical support. A patent application has been filed by Harvard University relating to aspects of the work described in this manuscript. In the interests of transparency, we wish to mention that G.M.C. is a founder of Enevolv Inc. and Gen9bio (neither of which was involved in this study). Other potentially relevant financial interests are listed at http://arep.med.harvard.edu/gmc/tech.html.