ABSTRACT
With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to designing libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy to use, generalizable, and scalable. With accompanying software (https://github.com/jsunn-y/DeCOIL), DeCOIL can be readily implemented to generate desired informed libraries.
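To illustrate what a DC library encodes, the following minimal sketch expands a degenerate codon into its constituent codons using the standard IUPAC nucleotide codes and the standard genetic code, then tallies the amino acids sampled. It is not part of the DeCOIL software: the function names (`expand_degenerate_codon`, `amino_acid_counts`, `library_size`) are hypothetical, and the four-site library in the usage note is chosen only for illustration.

```python
from collections import Counter
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand_degenerate_codon(dc):
    """Return every exact codon encoded by a 3-letter degenerate codon."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in dc.upper()))]


def amino_acid_counts(dc):
    """Count how many codons in the degenerate codon encode each residue."""
    return Counter(CODON_TABLE[c] for c in expand_degenerate_codon(dc))


def library_size(dcs):
    """Number of distinct DNA sequences in a multi-site DC library."""
    size = 1
    for dc in dcs:
        size *= len(expand_degenerate_codon(dc))
    return size


if __name__ == "__main__":
    for dc in ("NNK", "NDT"):
        counts = amino_acid_counts(dc)
        print(dc, len(expand_degenerate_codon(dc)), dict(counts))
    # Illustrative four-site library using NNK at every site.
    print(library_size(["NNK"] * 4))
```

For example, `NNK` expands to 32 codons covering all 20 amino acids plus one stop codon, whereas `NDT` expands to 12 codons encoding 12 amino acids with no stop. Choosing a different degenerate codon at each targeted site therefore changes both the size of the library and which protein variants it samples, which is the design space a DC optimization method searches.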
Competing Interest Statement
The authors have declared no competing interest.
Abbreviations
- DeCOIL: Degenerate codon optimization for informed libraries
- DC: Degenerate codon
- DE: Directed evolution
- ML: Machine learning
- MLPE: Machine learning-assisted protein engineering
- MLDE: Machine learning-assisted directed evolution
- ftMLDE: Focused training machine learning-assisted directed evolution
- GB1: Protein G subdomain B1
- TrpB: β-subunit of tryptophan synthase from Thermotoga maritima
- evSeq: Every variant sequencing