Abstract
Many applications of synthetic biology involve engineering microbial strains to express high-value proteins. Thanks to advances in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain design and optimization. Such models, however, require large amounts of training data that are costly to acquire, which creates substantial entry barriers for many laboratories. Here, we study the relation between model accuracy and data efficiency in a large panel of machine learning models of varied complexity, from penalized linear regressors to deep neural networks. Our analysis is based on data from a large genotype-phenotype screen in Escherichia coli, which was generated with a design-of-experiments approach to balance coverage and depth of the genotypic space. We sampled these data to emulate scenarios with a limited number of DNA sequences for training, as commonly encountered in strain engineering applications. Our results suggest that classic, non-deep models can achieve good prediction accuracy with much smaller datasets than previously thought, and provide robust evidence that convolutional neural networks further improve performance with the same amount of data. Using methods from Explainable AI and model benchmarking, we show that convolutional neural networks have an improved ability to discriminate between input sequences and extract sequence features that are highly predictive of protein expression. We moreover show that controlled sequence diversity leads to substantial gains in data efficiency, and validate this principle in a separate genotype-phenotype screen in Saccharomyces cerevisiae. These results provide practitioners with guidelines for designing experimental screens that strike a balance between cost and quality of training data, laying the groundwork for wider adoption of deep learning across the biotechnology sector.
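To make the data-efficiency setup concrete, the sketch below illustrates the general idea of subsampling a genotype-phenotype screen and measuring prediction accuracy as a function of training set size. It is not the authors' pipeline: the sequences and expression values are random placeholders, and Ridge regression stands in for the penalized linear regressors (a convolutional network would be swapped in for the deep baseline and evaluated in the same loop).

```python
# Hypothetical illustration (not the authors' code): emulate limited-data
# scenarios by subsampling sequence-expression pairs and tracking how the
# test accuracy of a penalized linear model changes with training size.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def one_hot(seq, alphabet="ACGT"):
    """Flatten a DNA sequence into a one-hot feature vector."""
    idx = {c: i for i, c in enumerate(alphabet)}
    x = np.zeros((len(seq), len(alphabet)))
    x[np.arange(len(seq)), [idx[c] for c in seq]] = 1.0
    return x.ravel()

# Placeholder data: random 50-nt sequences with synthetic expression values;
# in practice these would come from the experimental screen.
seqs = ["".join(rng.choice(list("ACGT"), 50)) for _ in range(2000)]
y = rng.normal(size=len(seqs))
X = np.stack([one_hot(s) for s in seqs])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Learning curve: accuracy as a function of the number of training sequences.
for n in (100, 250, 500, 1000, len(X_train)):
    sub = rng.choice(len(X_train), size=n, replace=False)
    model = Ridge(alpha=1.0).fit(X_train[sub], y_train[sub])
    print(f"n={n:5d}  test R^2 = {r2_score(y_test, model.predict(X_test)):.3f}")
```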
Competing Interest Statement
The authors have declared no competing interest.