%0 Journal Article %A Bethany M. Moore %A Peipei Wang %A Pengxiang Fan %A Bryan Leong %A Craig A. Schenck %A John P. Lloyd %A Melissa D. Lehti-Shiu %A Robert L. Last %A Eran Pichersky %A Shin-Han Shiu %T Robust predictions of specialized metabolism genes through machine learning %D 2018 %R 10.1101/304873 %J bioRxiv %P 304873 %X Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, co-expressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a well performing prediction model was established with a true positive rate of 0.87 and a true negative rate of 0.71. In addition, 86% of known SM genes not used to create the machine learning model were predicted as SM genes, further demonstrating its accuracy. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways. Application of the prediction model led to the identification of 1,217 A. thaliana genes with previously unknown functions, providing a global, high-confidence estimate of SM gene content in a plant genome.Significance Specialized metabolites are critical for plant-environment interactions, e.g., attracting pollinators or defending against herbivores, and are important sources of plant-based pharmaceuticals. However, it is unclear what proportion of enzyme-encoding genes play roles in specialized metabolism (SM) as opposed to general metabolism (GM) in any plant species. This is because of the diversity of specialized metabolites and the considerable number of incompletely characterized pathways responsible for their production. In addition, SM gene ancestors frequently played roles in GM. We evaluate features distinguishing SM and GM genes and build a computational model that accurately predicts SM genes. Our predictions provide candidates for experimental studies, and our modeling approach can be applied to other species that produce medicinally or industrially useful compounds. %U https://www.biorxiv.org/content/biorxiv/early/2018/10/15/304873.full.pdf