ABSTRACT
Background The link between protein or nucleic acid sequence and biochemical or organismal phenotype is essential for understanding the molecular mechanisms of evolution, for reverse ecology, and for designing proteins and genes with specific properties. However, this link is difficult to exploit in practice because the relationship between sequence and folding or activity is complex.
Results Here, we use trained machine learning models to predict the optimal growth temperature of the originating species from individual protein sequences. Both multilayer perceptron and k-nearest neighbor regression outperformed linear regression and could predict the originating species’ optimal growth temperature from protein sequences, achieving a root mean squared error of 3.6 °C. Similar machine learning models could predict organismal optimal growth pH and oxygen tolerance, as well as the quantitative properties of individual proteins or nucleic acids.
Conclusions Using multilayer perceptron and k-nearest neighbor regression, we were able to build models specific to individual protein or nucleic acid families that can predict a variety of quantitative phenotypes. This methodology will be useful for the in silico screening of individual mutations for particular properties, and for predicting the phenotypes of uncharacterized biological sequences and organisms.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Revisions to methods and analysis targets based on feedback.
List of Abbreviations
- ADK: Adenosine Kinase
- CSP: Cold Shock Protein
- kNN: k-Nearest Neighbor
- MLP: Multi-layer Perceptron
- RF: Random Forest
- SOD: Superoxide Dismutase