Abstract
Predicting prokaryotic phenotypes (observable traits that govern functionality, adaptability, and interactions) holds significant potential for fields such as biotechnology, environmental sciences, and evolutionary biology. This study leverages machine learning to explore the relationship between prokaryotic genotypes and phenotypes. Taking advantage of the highly standardized datasets in the BacDive database, we model eight physiological properties based on protein family inventories, discuss the evaluation metrics, and explore the biological implications of our models. The high confidence values of our predictions highlight the importance of data quality and quantity for the reliable inference of bacterial phenotypes. Our approach yielded nearly 55,000 new data points for approximately 20,000 strains, which are published openly in the BacDive database, enriching existing phenotypic datasets and paving the way for future research and analysis. The open-source software we developed can readily be applied to other datasets, such as the IMG/M system for metagenomics, and to other applications, such as assessing the potential of soil bacteria for bioremediation projects.
Competing Interest Statement
The authors have declared no competing interest.