ABSTRACT
Accurate prediction of the optimal catalytic temperature (Topt) of enzymes is vital in biotechnology, because enzymes with high Topt values are desired for enhanced reaction rates. A machine-learning method for predicting Topt, TOME, was recently developed. TOME was trained on a normally distributed dataset with a median Topt of 37°C in which fewer than 5% of Topt values exceed 85°C, which limits the method's predictive capability for thermostable enzymes. Because of this distribution of the training data, the mean squared error on Topt values above 85°C is nearly an order of magnitude higher than the error on values between 30 and 50°C. In this study, we apply ensemble learning and resampling strategies that address the data imbalance, decreasing the error on high Topt values (>85°C) by 60% and increasing the overall R² value from 0.527 to 0.632. The revised method, TOMER, and the resampling strategies applied in this work are freely available to other researchers as a Python package on GitHub.
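To make the idea of resampling an imbalanced regression target concrete, the following is a minimal sketch and not the TOMER implementation. It assumes a simple strategy in which Topt values are grouped into fixed-width temperature bins and rare bins are oversampled with replacement until every occupied bin is equally represented. It also includes a helper for the range-restricted mean squared error used to compare performance above 85°C with performance between 30 and 50°C. The function names, the 10°C bin width, and the toy data are illustrative assumptions.

```python
import random
from collections import defaultdict


def oversample_by_bin(y, bin_width=10.0, seed=0):
    """Return indices into y, resampled so that every occupied bin
    holds as many samples as the most populated bin (hypothetical helper)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for i, value in enumerate(y):
        bins[int(value // bin_width)].append(i)
    target = max(len(members) for members in bins.values())
    indices = []
    for key in sorted(bins):
        members = bins[key]
        indices.extend(members)  # keep every original sample once
        # top up rare bins by drawing with replacement
        indices.extend(rng.choices(members, k=target - len(members)))
    return indices


def mse_in_range(y_true, y_pred, low, high):
    """Mean squared error over samples whose true value lies in [low, high)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if low <= t < high]
    if not pairs:
        return float("nan")
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


# Toy Topt values (degrees C): mostly mesophilic, two thermostable.
topt = [30, 35, 37, 37, 40, 42, 45, 90, 95]
idx = oversample_by_bin(topt, bin_width=10.0, seed=0)
n_high = sum(1 for i in idx if topt[i] > 85)
```

In an ensemble setting, one plausible design is to call `oversample_by_bin` with a different `seed` for each base regressor and average their predictions, so that each member sees a different balanced view of the data. Whether this matches the specific strategies evaluated in this work is not implied here.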
Competing Interest Statement
The authors have declared no competing interest.