Abstract
Enzymes, nature’s catalysts, possess remarkable properties such as high stereo-, regio-, and chemo-specificity. These properties allow enzymes to greatly simplify complex synthetic processes, resulting in improved yields and reduced manufacturing costs compared to traditional chemical methods. However, experimental characterization of enzyme substrates is scarce: only a few thousand of the tens of millions of known enzymes in UniProt have annotated substrates, which severely limits the ability of chemists to repurpose enzymes for industrial applications. Previous machine learning models for predicting enzyme substrates have been hampered by poor generalization to new substrates. Here, we introduce VIPER (Virtual Interaction Predictor for Enzyme Reactivity), a model that achieves an average 30% improvement over the previous state-of-the-art model (ProSmith) in reaction prediction for unseen substrates. Furthermore, we reveal flaws in previous enzyme-substrate reaction datasets and introduce a novel, high-quality enzyme-substrate reaction dataset to alleviate these issues.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
- Included analysis of dataset quality
- Switched from using deep ensemble to Monte Carlo dropout for confidence estimation
- Updated OOD enzyme family section to use in-distribution validation set
- Now using pre-trained version of ProSmith
- Performed ablation study
- Included ROC AUC, PR AUC, F1 score, and accuracy as metrics
- Minor changes to introduction, discussion, and abstract to reflect above changes
- Added figure titles where they were missing
- Expanded on figure text for Fig. 1
- Added line numbers
- Performed extra OOD evaluation on imine reductases
- Added new sections to methods to reflect above changes
- Performed hyperparameter optimization for ESP model
- Appended Supplementary to main text