Abstract
Microbial communities are highly dynamic and sensitive to changes in the environment. Thus, microbiome data are highly susceptible to batch effects, defined as sources of unwanted variation that are not related to, and obscure any factors of interest. Existing batch correction methods have been primarily developed for gene expression data. As such, they do not consider the inherent characteristics of microbiome data, including zero inflation, overdispersion and correlation between variables. We introduce a new multivariate and non-parametric batch correction method based on Partial Least Squares Discriminant Analysis. PLSDA-batch first estimates treatment and batch variation with latent components to then subtract batch variation from the data. The resulting batch effect corrected data can then be input in any downstream statistical analysis. Two variants are also proposed to handle unbalanced batch x treatment designs and to include variable selection during component estimation. We compare our approaches with existing batch correction methods removeBatchEffect and ComBat on simulated and three case studies. We show that our three methods lead to competitive performance in removing batch variation while preserving treatment variation, and especially when batch effects have high variability. Reproducible code and vignettes are available on GitHub.
Competing Interest Statement
The authors have declared no competing interest.