Abstract
Batch effects, undesirable sources of variance across multiple experiments, present significant challenges for scientific and clinical discoveries. Specifically, batch effects can introduce spurious findings and obscure genuine signals, contributing to the ongoing reproducibility crisis. Typically, batch effects are treated as associational or conditional effects, despite their potential to causally impact downstream inferences due to variations in experimental design and population demographics. In this study, we propose a novel framework to formalize batch effects as causal effects. Motivated by this perspective, we develop straightforward procedures to enhance existing approaches for batch effect detection and correction. We illustrate via simulation the utility of this perspective, finding that causal augmentations of existing approaches adequately remove batch effects in intuitively simple settings where conditional approaches struggle. By applying our approaches to a large neuroimaging study, we show that modeling batch effects as causal, rather than associational, effects leads to disparate downstream scientific conclusions. Together, we believe that this work provides a framework for, and highlights potential limitations of, the collection, harmonization, and subsequent analysis of multi-site scientific mega-studies.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
The manuscript has been heavily updated to reflect changes requested for submission. Almost every content figure has been expanded to cover new methods, and we added a comprehensive simulation suite illustrating the advantages of causal cComBat over cComBat.
http://fcon_1000.projects.nitrc.org/indi/CoRR/html/concept.html