Abstract
The human microbiome is increasingly mined for diagnostic and therapeutic biomarkers using machine learning (ML). However, metagenomics-specific software is scarce, and overoptimistic evaluation and limited cross-study generalization remain prevalent problems. To address these issues, we developed SIAMCAT, a versatile R toolbox for ML-based comparative metagenomics. We demonstrate its capabilities in a meta-analysis of fecal metagenomic studies (10,803 samples). When naively transferred across studies, ML models lost accuracy and disease specificity; both losses could be resolved by a novel training set augmentation strategy. This analysis revealed that some biomarkers are disease-specific, whereas others are shared across multiple conditions. SIAMCAT is freely available from siamcat.embl.de.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
jakob.wirbel{at}embl.de, konrad.zych{at}embl.de, morgan.essex{at}mdc-berlin.de, nicolai.karcher{at}embl.de, ece.kartal{at}embl.de, guillems{at}ethz.ch, bork{at}embl.de, ssunagawa{at}ethz.ch