Fed-ComBat: A Generalized Federated Framework for Batch Effect Harmonization in Collaborative Studies

In neuroimaging research, multi-centric analyses are crucial for obtaining sufficient sample sizes and representative clinical populations. Data harmonization techniques are typically part of the pipeline in multi-centric studies to address systematic biases and ensure the comparability of the data. However, most multi-centric studies require centralized data, which may expose individual patient information. This poses a significant challenge for data governance, leading to regulations such as the GDPR and the CCPA, which attempt to address these concerns but also hinder data access for researchers. Federated learning offers a privacy-preserving alternative in machine learning, enabling models to be trained collaboratively on decentralized data without centralizing or sharing the data. In this paper, we present Fed-ComBat, a federated framework for batch effect harmonization on decentralized data. Fed-ComBat extends existing linear methods, such as centralized ComBat and its distributed counterpart d-ComBat, and nonlinear approaches such as ComBat-GAM, by accounting for potentially nonlinear and multivariate covariate effects. By doing so, Fed-ComBat preserves nonlinear covariate effects without requiring data centralization and without prior knowledge of which variables should be treated as nonlinear or how they interact, which differentiates it from ComBat-GAM. We assessed Fed-ComBat and existing approaches on simulated data and multiple cohorts comprising healthy controls (CN) and subjects with various disorders such as Parkinson’s disease (PD), Alzheimer’s disease (AD), and autism spectrum disorder (ASD). Results indicate that Fed-ComBat outperforms centralized ComBat in the presence of nonlinear effects and is comparable to centralized methods such as ComBat-GAM.
On synthetic data, Fed-ComBat reconstructs the target unbiased function with a 35% lower error (RMSE = 0.5952) than d-ComBat (RMSE = 0.9162) and a 12% lower error than d-ComBat-GAM, our proposal to federate ComBat-GAM (RMSE = 0.6751). On MRI-derived phenotypes, it yields results comparable to centralized methods such as ComBat-GAM without requiring prior knowledge of potential nonlinearities.


With the vast generation of neuroimaging data across multiple institutions, concerns seeking to protect sensitive data have been issued as the [...]

[...] (e.g., a brain region) indexed by $g \in \{1, 2, \ldots, G\}$. Each batch contains $n_i$ observations, and the total number of observations is $N = \sum_{i=1}^{S} n_i$. For simplicity, $S$ can denote the number of sites in the study, but it can also be extended to the total number of scanners across sites or to any other number of batch effects. We can model a specific phenotype $g$ observed in the $j$-th patient who belongs to the $i$-th site, denoted by $y_{ijg}$, as follows:

$$y_{ijg} = \alpha_g + \phi(x_{ij}; \theta_g) + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg} \tag{1}$$

where $x_{ij}$ denotes the covariates whose effects are expected to be preserved after removing the batch effects (e.g., sex and age), $\alpha_g$ acts as a global fixed intercept (i.e., the mean), and $\gamma_{ig}$ is a random intercept that accounts for the site-specific shift. $\varepsilon_{ijg} \sim \mathcal{N}(0, \sigma_g^2)$ is a noise term that captures the variability of each phenotype, and $\delta_{ig}$ is a multiplicative effect that scales the "unbiased" phenotype variability to fit the one at each site.
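The generative model above can be made concrete with a short simulation. The following is a minimal sketch for a single phenotype $g$, not the simulation protocol used in this work: the choice of $\phi$ as a sine, the uniform covariate, and all parameter values are hypothetical and serve only to show how the additive ($\gamma_{ig}$) and multiplicative ($\delta_{ig}$) site effects enter the data.

```python
import numpy as np

def simulate_phenotype(n_per_site, alpha, gamma, delta, sigma, phi, rng):
    """Draw y_ij = alpha + phi(x_ij) + gamma_i + delta_i * eps_ij for one phenotype."""
    xs, ys, sites = [], [], []
    for i, n_i in enumerate(n_per_site):
        x = rng.uniform(0.0, 1.0, size=n_i)     # covariate, e.g. rescaled age (assumption)
        eps = rng.normal(0.0, sigma, size=n_i)  # eps_ij ~ N(0, sigma^2)
        y = alpha + phi(x) + gamma[i] + delta[i] * eps
        xs.append(x)
        ys.append(y)
        sites.append(np.full(n_i, i))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(sites)

rng = np.random.default_rng(0)

def phi(x):
    # Hypothetical nonlinear covariate effect; it satisfies phi(0) = 0.
    return np.sin(2.0 * x)

x, y, site = simulate_phenotype(
    n_per_site=[200, 300, 500],   # n_i for S = 3 sites, so N = 1000
    alpha=1.0,                    # global intercept
    gamma=[-0.5, 0.0, 0.5],       # additive site shifts
    delta=[0.5, 1.0, 2.0],        # multiplicative site scalings
    sigma=0.1,
    phi=phi,
    rng=rng,
)
```

After removing $\alpha_g$ and $\phi(x_{ij})$, the residuals at site $i$ are centered at $\gamma_{ig}$ with standard deviation $\delta_{ig}\sigma_g$, which is exactly the site-specific bias that harmonization aims to remove.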
and

$$\phi(x, \theta_g)\big|_{x=0} = 0 \tag{4}$$

A first constraint in Equation (3) [...] and ComBat-GAM a particular case of the proposed formulation in this work.

For a centralized setup, the estimation of all these parameters is performed in three steps: i) maximum likelihood estimation (MLE) of the parameters $\hat{\alpha}_g$, $\hat{\theta}_g$, $\hat{\gamma}_{ig}$ (see Equation (2)) and of the phenotype variance

$$\hat{\sigma}_g^2 = \frac{1}{N} \sum_{ij} \left( y_{ijg} - \hat{\alpha}_g - \phi(x_{ij}; \hat{\theta}_g) - \hat{\gamma}_{ig} \right)^2,$$

ii) residual standardization, mapping the residuals to satisfy the form $y_{ijg} \to z_{ijg} \sim \mathcal{N}(\gamma_{ig}, \delta_{ig}^2)$ as follows:

$$z_{ijg} = \frac{y_{ijg} - \hat{\alpha}_g - \phi(x_{ij}; \hat{\theta}_g)}{\hat{\sigma}_g},$$

and iii) estimation of the additive and multiplicative batch effects $\gamma^*_{ig}$ and $\delta^*_{ig}$ as in Equation (6), using empirical Bayes (EB) with priors on $\gamma_{ig}$ and $\delta_{ig}^2$ to iteratively estimate these parameters as in Equation (7). Lastly, phenotypes can be harmonized while preserving the covariate effects of interest as follows:

$$y^{\mathrm{ComBat}}_{ijg} = \frac{\hat{\sigma}_g}{\delta^*_{ig}} \left( z_{ijg} - \gamma^*_{ig} \right) + \hat{\alpha}_g + \phi(x_{ij}; \hat{\theta}_g).$$

In the following section, we will discuss how this formulation facilitates the incorporation of harmonization within the federated learning framework.

Considering the formulation previously presented in Equation (1), the parameters $\alpha_g$, $\theta_g$, and $\gamma_{ig}$ can be optimized by minimizing an objective function $F$. As data is now siloed, the cost function can only be evaluated locally at each site as $F_i$, thus defining the federated optimization problem as: [...] where $F(\alpha_g, \theta_g, \gamma_{ig})$: [...]

Result: Harmonized phenotypes with siloed data.
x ← FederatedStandardization(x);
Estimation of fixed effects and random intercept:
    // Partial local optimization using SGD.
    foreach local gradient step t do
        [...]
    // Aggregate and update every parameter using FedAVG.
    [...]
[...]; // Equations (6) and (7)
// Correct data
[...]

Starting from the panel formulation, where $y_g$ is the stacked vector of all subjects across all sites $i$, and $X$ is a design matrix containing the covariate effects to be preserved, indexed by $c$ in subject $j$ from site $i$ ($x_{ijc}$), an indicator matrix encoding the site, and a column of ones to capture $\alpha_g$, we can estimate the augmented parameter matrix $\hat{\Theta}_g$ using maximum likelihood.

The estimator can be decomposed as a sum of the covariance and the cross-covariance [...].
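To illustrate why such a decomposition removes the need to pool subject-level data, here is a minimal sketch assuming the ordinary least-squares form of the estimator, $\hat{\Theta} = (\sum_i X_i^\top X_i)^{-1} \sum_i X_i^\top Y_i$. The function names and the synthetic data are hypothetical. As a simplification of the design described above, the column of ones is omitted and the intercept is absorbed into the site indicators, so that the summed matrix is invertible without an extra constraint.

```python
import numpy as np

def local_statistics(X_i, Y_i):
    """Computed at site i; only these two aggregate matrices leave the site."""
    return X_i.T @ X_i, X_i.T @ Y_i

def aggregate_and_solve(site_stats):
    """Sum the site-wise matrices and solve the normal equations centrally."""
    gram = sum(g for g, _ in site_stats)
    cross = sum(c for _, c in site_stats)
    return np.linalg.solve(gram, cross)

rng = np.random.default_rng(0)
n_sites, n_phenotypes = 3, 2
designs, phenotypes = [], []
for i, n_i in enumerate([40, 60, 80]):
    age = rng.normal(size=(n_i, 1))          # covariate to preserve (synthetic)
    indicator = np.zeros((n_i, n_sites))     # site indicator columns
    indicator[:, i] = 1.0
    X_i = np.hstack([age, indicator])
    # Synthetic phenotypes: shared age effect plus a site-specific shift.
    Y_i = 0.8 * age + float(i) + rng.normal(scale=0.1, size=(n_i, n_phenotypes))
    designs.append(X_i)
    phenotypes.append(Y_i)

theta_distributed = aggregate_and_solve(
    [local_statistics(X_i, Y_i) for X_i, Y_i in zip(designs, phenotypes)]
)

# Reference: the same estimator computed on pooled (centralized) data.
theta_pooled = np.linalg.lstsq(
    np.vstack(designs), np.vstack(phenotypes), rcond=None
)[0]
```

Under these assumptions the distributed and pooled estimates agree up to numerical precision, since both solve the same normal equations; the sites exchange only matrices whose size depends on the number of covariates and phenotypes, not on the number of subjects.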
Seeking a federated nonlinear approach, we here propose d-ComBat-GAM [...] We can consider d-ComBat-GAM as a particular case of Fed-ComBat where $\phi(\cdot)$ becomes a linear combination parametrized by $\theta$. Letting $b(\cdot)$ be an arbitrary basis representation, the approximation of the covariate function is then defined as follows: [...] Note that Equation (14) [...]

We evaluated Fed-ComBat on synthetic data accounting for different sources of batch- and covariate-wise heterogeneity. We also benchmarked this approach on a collection of nine cohorts corre-

Results
Two versions of Fed-ComBat were used for comparison: a first one defin-

Brain MRI-data
Seeking evidence of nonlinear covariate effects of age on brain phenotypes, we compared the goodness of fit of two models: a linear model and a GAM. The criterion used to evaluate the presence of nonlinearities was the difference [...] shows the AIC metric across the regions of the brain using the Desikan-Killiany parcellation and cortical thickness as the endogenous variable and the ASEG atlas for subcortical volumes. The data was controlled for sex, diagnosis, ICV, and group, and age was considered an exogenous variable.
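The model comparison described above can be sketched as follows, assuming the criterion is the difference in AIC between the linear and the nonlinear fit for a given region. This is an illustration only: the data are synthetic, a cubic polynomial basis stands in for the smooth term of a GAM, and the adjustment for sex, diagnosis, ICV, and group is omitted for brevity.

```python
import numpy as np

def gaussian_aic(y, fitted, n_coefficients):
    """AIC of a Gaussian least-squares fit, up to a constant shared by all models."""
    n = y.size
    rss = float(np.sum((y - fitted) ** 2))
    # The +1 counts the estimated noise variance as a parameter.
    return n * np.log(rss / n) + 2 * (n_coefficients + 1)

def fit_least_squares(B, y):
    """Return fitted values of y regressed on the columns of B."""
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return B @ coef

rng = np.random.default_rng(0)
age = rng.uniform(20.0, 90.0, size=500)
# Synthetic cortical thickness with a quadratic age effect (hypothetical values).
thickness = 3.0 - 1e-4 * (age - 20.0) ** 2 + rng.normal(scale=0.05, size=age.size)

a = (age - age.mean()) / age.std()                      # standardized age
B_linear = np.column_stack([np.ones_like(a), a])        # intercept + linear term
B_smooth = np.column_stack([a ** k for k in range(4)])  # cubic basis, GAM stand-in

aic_linear = gaussian_aic(thickness, fit_least_squares(B_linear, thickness), B_linear.shape[1])
aic_smooth = gaussian_aic(thickness, fit_least_squares(B_smooth, thickness), B_smooth.shape[1])

# Positive values favor the nonlinear model for this region.
delta_aic = aic_linear - aic_smooth
```

Repeating this comparison per region would give one AIC difference per phenotype; a consistently positive difference would suggest that a linear age term is insufficient for that region.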