Improving enzyme optimum temperature prediction with resampling strategies and ensemble learning

Japheth E. Gado; Gregg T. Beckham; Christina M. Payne

doi:10.1101/2020.05.06.081737

ABSTRACT

Accurate prediction of the optimal catalytic temperature (T_opt) of enzymes is vital in biotechnology, as enzymes with high T_opt values are desired for enhanced reaction rates. Recently, a machine-learning method (TOME) for predicting T_opt was developed. TOME was trained on a normally distributed dataset with a median T_opt of 37°C and fewer than 5% of T_opt values above 85°C, which limits the method’s predictive capability for thermostable enzymes. Because of this distribution of the training data, the mean squared error on T_opt values greater than 85°C is nearly an order of magnitude higher than the error on values between 30 and 50°C. In this study, we apply ensemble learning and resampling strategies that address the data imbalance, decreasing the error on high T_opt values (>85°C) by 60% and increasing the overall R² value from 0.527 to 0.632. The revised method, TOMER, and the resampling strategies applied in this work are freely available to other researchers as a Python package on GitHub.
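As a rough illustration of how resampling and ensemble learning can be combined for an imbalanced regression target, the sketch below draws bootstrap samples that contain equal numbers of rare (T_opt > 85°C) and common examples, fits one base learner per resample, and averages the predictions. It is not the TOMER implementation: the function names, the single input feature, the one-variable least-squares base learner, and the equal-bin resampling rule are all assumptions chosen to keep the example short and dependency-free.

```python
import random


def balanced_bootstrap(targets, threshold=85.0, rng=None):
    """Return indices drawn with replacement, half from each side of the threshold.

    Hypothetical resampling rule: the rare bin (> threshold) is oversampled
    until it matches the size of the common bin.
    """
    if rng is None:
        rng = random.Random(0)
    rare = [i for i, t in enumerate(targets) if t > threshold]
    common = [i for i, t in enumerate(targets) if t <= threshold]
    if not rare or not common:
        raise ValueError("both bins must contain at least one sample")
    n = len(common)
    return ([rng.choice(common) for _ in range(n)]
            + [rng.choice(rare) for _ in range(n)])


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate resample (constant feature): fall back to the mean.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_ensemble(xs, ys, n_models=10, threshold=85.0, seed=0):
    """Fit one base learner per balanced resample (a bagging-style ensemble)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = balanced_bootstrap(ys, threshold, rng)
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Average the predictions of all base learners."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)


# Toy data (made up for illustration): one feature, T_opt in °C,
# with a single example above the 85°C threshold.
features = [10, 15, 20, 25, 30, 45]
t_opt = [21, 31, 41, 51, 61, 91]

models = fit_ensemble(features, t_opt, n_models=5)
print(predict(models, 40))
```

Because every resample gives the rare bin the same weight as the common bin, each base learner is fit on data in which high-T_opt examples are no longer a small minority; averaging over several resamples then reduces the variance that oversampling a few rare points would otherwise introduce.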

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.