Abstract
Machine learning provides a probabilistic framework for metabolic pathway inference from genomic sequence information at different levels of complexity and completion. However, several challenges can limit prediction performance, including pathway feature engineering, enzymatic reactions that map to multiple pathways, and emergent or distributed metabolism within populations or communities of cells. In this paper, we present triUMPF (triple non-negative matrix factorization (NMF) with community detection for metabolic pathway inference), which combines three stages of NMF to capture the many relationships between enzymes and pathways within a graph network. Community detection then extracts higher-order structure by clustering vertices that share similar statistical properties. We evaluated triUMPF on experimental datasets with diverse multi-label properties, including Tier 1 genomes from the BioCyc collection of organismal Pathway/Genome Databases and low-complexity microbial communities. The resulting performance metrics equaled or exceeded those of other prediction methods on organismal genomes, with improved precision on multi-organismal datasets.
Availability and implementation The software package and installation instructions are published on github.com/triUMPF
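For readers unfamiliar with NMF, the factorization that triUMPF builds on can be illustrated with a minimal sketch. This is not the triUMPF implementation: it shows a single generic NMF stage using the standard multiplicative update rules for the Frobenius loss, and the function name `nmf`, its parameters, and the toy pathway-by-enzyme matrix are all assumptions made for illustration only.

```python
import numpy as np


def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix X (m x n) as W @ H.

    Uses multiplicative updates for the Frobenius loss. This is a
    generic single-stage NMF, not the three-stage triUMPF procedure.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Random non-negative initialisation of both factors.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio,
        # so W and H remain non-negative throughout.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Hypothetical toy data: rows are pathways, columns are enzymes,
# and a 1 marks an enzyme that participates in a pathway.
X = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

W, H = nmf(X, rank=2)
error = np.linalg.norm(X - W @ H)
print("reconstruction error:", error)
```

In this sketch the rows of `W` can be read as low-dimensional pathway features and the columns of `H` as enzyme features; how triUMPF couples its three factorizations and feeds them into community detection is described in the paper itself and is not reproduced here.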
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Contact: shallam@mail.ubc.ca
This version updates the structure and title of the paper.