Abstract
Extreme engineering of an organism’s genetic code could impart true genetic incompatibility, even blocking effects of horizontal gene transfer and viral infection. Recent experiments exploring this possibility demonstrate that such radical genome engineering achievements are plausible. However, it is unclear when the modifications will compromise the fitness of an organism. Efforts to reformat an entire genome are difficult and expensive; computational methods predicting fruitful experimental trajectories could play a pivotal role in advancing such efforts. We present a framework for building in silico models to assist genome-scale engineering. Genetic code engineering requires choosing from many possible codon-usage schemes to find a design that is viable and effective. We use machine learning to identify which alternative codon-usage schemes are likely to result in no observed viable cells. Our data-driven approach employs observations of how modifying codon usage in individual genes impacted observed viability in E. coli, revealing salient features for early identification of problematic genetic code designs. We achieved an average area under the receiver operating characteristic curve (AUC) of 0.72 on out-of-sample data.
Author Summary
As machine learning and artificial intelligence play an increasingly central role in science and engineering, it will be important to establish standardized techniques that facilitate the dialogue between experimentation and modeling. Biological experimental techniques are concurrently evolving at a rapid pace, providing unique opportunities to collect high-quality, novel information that was previously unobtainable. This work navigates the landscape of this vast, new territory, identifies interesting landmarks for exploration and posits new approaches towards advancing our research efforts in these areas. In this work, we show that, using a small dataset of 47 observations and rigorous nested cross-validation techniques, we can build a model that makes better-than-random predictions of how codon-usage changes in essential genes influence viability in E. coli. These predictions can be used to inform experimental trajectories in both genetic code and codon optimization experiments. We discuss ways to iteratively improve this model by performing high-value experiments that decrease uncertainty in predictions and extrapolation error. Finally, we present novel visualization methods to aid in developing intuitions for how re-coding impacts groups of genes. These methods are also useful tools in building important insights into how well machine learning algorithms can generalize to new data.
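To make the evaluation terms above concrete, the following is a minimal sketch of how nested cross-validation splits and a ROC AUC score can be computed for a dataset of 47 observations. It is illustrative only and is not the authors' pipeline: the function names (`roc_auc`, `k_folds`, `nested_cv_splits`), the fold counts (5 outer, 4 inner), and the random seed are all assumptions, and no stratification by class is attempted.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    An AUC of 0.5 corresponds to random ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def k_folds(indices, k, rng):
    """Shuffle the indices and deal them into k near-equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n, outer_k=5, inner_k=4, seed=0):
    """Yield (inner_splits, outer_train, outer_test) for each outer fold.

    inner_splits is a list of (train, validation) index pairs drawn only
    from outer_train, so model selection never sees the outer test fold.
    The fold counts and seed are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    outer = k_folds(range(n), outer_k, rng)
    for i, test in enumerate(outer):
        train = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner = k_folds(train, inner_k, rng)
        inner_splits = [
            ([j for g, fold in enumerate(inner) if g != v for j in fold], inner[v])
            for v in range(inner_k)
        ]
        yield inner_splits, train, test


if __name__ == "__main__":
    # Toy scores: three of the four positive/negative pairs are ordered correctly.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
    for inner_splits, train, test in nested_cv_splits(47):
        print(len(train), len(test), len(inner_splits))
```

In this arrangement, hyperparameters would be chosen using only the inner splits, and the AUC would be computed once per outer test fold and then averaged, which is one common way to arrive at an "average AUC on out-of-sample data" for a small dataset.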
Footnotes
Approved for public release: distribution unlimited.
This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering and MIT under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering and MIT.