1 Abstract
To acquire larger samples for answering complex questions in neuroscience, researchers have increasingly turned to multi-site neuroimaging studies. However, these studies are hindered by differences in images acquired across multiple scanners. These effects have been shown to bias comparison between scanners, mask biologically meaningful associations, and even introduce spurious associations. To address this, the field has focused on harmonizing data by removing scanner-related effects in the mean and variance of measurements. Contemporaneously with the increase in popularity of multi-center imaging, the use of multivariate pattern analysis has also become commonplace. These approaches have been shown to provide improved sensitivity, specificity, and power due to their modeling the joint relationship across measurements in the brain. In this work, we demonstrate that the currently available methods for removing scanner effects are inherently insufficient for MVPA. This stems from the fact that no currently available harmonization approach has addressed how correlations between measurements can vary across scanners. Data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is used to show that considerable differences in covariance exist across scanners and that the state-of-the-art harmonization techniques do not address this issue. We also propose a novel methodology that harmonizes covariance of multivariate image measurements across scanners and demonstrate its improved performance in data harmonization, which further facilitates more power for detection of clinically relevant associations.
2 Introduction
The need for larger samples in human subjects research has led to a growing number of multi-site studies that aggregate data across multiple locations. This trend is especially prevalent in neuroimaging research where the reliability and generalizability of findings from conventional single-site studies are often limited by the ability to recruit and study sufficiently large and representative samples from the population. Many consortia have been formed to address such issues Mueller et al. (2005); Sudlow et al. (2015); Trivedi et al. (2016); Van Essen et al. (2013). The larger samples obtained through these efforts promote greater power to detect significant associations as well as better generalizability of results. However, these study designs also introduce heterogeneity in acquisition and processing that, if not appropriately addressed, may impact study findings.
Several researchers have determined that variability driven by scanner, often called scanner effects, reduces the reliability of derived measurements and can introduce bias. Neuroimaging measurements have been repeatedly shown to be affected by scanner manufacturer, model, magnetic field strength, head coil, voxel size, and acquisition parameters Han et al. (2006); Kruggel et al. (2010); Reig et al. (2009); Wonderlick et al. (2009). Yet even in scanners of the exact same model and manufacturer, differences still exist for certain neuroimaging biomarkers Takao et al. (2011).
Until recently, neuroimaging analyses primarily involved mass univariate testing which treats features as independent. Under this paradigm, the impact of scanner effects is through changes in the mean and variance of measurements. Increasingly, researchers have used sets of features as patterns for prediction algorithms in a framework called multivariate pattern analysis (MVPA). This approach has become a powerful tool in diverse research topics including pain perception Smith et al. (2017), neural representations Haxby et al. (2014), and psychiatric illnesses Koutsouleris et al. (2014). One of the major benefits of MVPA is that it leverages the joint distribution and correlation structure among multivariate brain features in order to better characterize a phenotype of interest O’Toole et al. (2007). As a result, scanner effects on the covariance of measurements are likely to impact findings substantially. In fact, a recent investigation showed that MVPA was able to detect scanner with high accuracy and that the detection of sex depended heavily on the scanners included in the training and test data Glocker et al. (2019).
The major statistical harmonization techniques employed in neuroimaging have generally corrected for differences across scanners in mean and variance, but not covariance Fortin et al. (2016, 2018); Rao et al. (2017); Yamashita et al. (2019). Increasingly, the ComBat model Johnson et al. (2007) has become a popular harmonization technique in neuroimaging and has been successfully applied to structural and functional measures Bartlett et al. (2018); Fortin et al. (2017, 2018); Marek et al. (2019); Yu et al. (2018). However, this model does not address potential scanner effects in covariance. Recently, another stream of data-driven harmonization methods has aimed to apply machine learning algorithms such as generative adversarial networks (GANs) or distance-based methods to unify distributions of measurements across scanners, but these methods shift the original data distributions implicitly and have not been tested for their potential influence on MVPA Nguyen et al. (2018); Zhou et al. (2018).
In this paper, we examine whether scanner effects influence MVPA results. In particular, we study the cortical thickness measurements derived from images acquired by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and demonstrate the existence of scanner effects in covariance of structural imaging measures. We then propose a novel harmonization method called Correcting Covariance Batch Effects (CovBat) that removes scanner effects in mean, variance, and covariance. We apply CovBat and show that within-scanner correlation matrices are successfully harmonized. Furthermore, we find that machine learning methods are unable to distinguish scanner manufacturer after our proposed harmonization is applied, and that the CovBat-harmonized data facilitate more accurate prediction of disease group. We also assess the performance of the proposed method in simulated data, and again find that the method mitigates scanner effects and improves detection of meaningful associations. Our results demonstrate the need to consider covariance in harmonization methods, and suggest a novel procedure that can be applied to better harmonize data from multi-site imaging studies.
Scanner Identified Despite Existing Harmonization
We first examine a subset of the ADNI dataset to determine if covariance among cortical thickness measurements differs across scanners. Our data were obtained from the ADNI-1 database which included a magnetization-prepared rapid gradient echo (MP-RAGE) sequence from Siemens and Philips scanners and a similar works-in-progress MP-RAGE sequence for General Electric (GE) scanners Jack et al. (2010). All of these scans were collected at a magnetic field strength of 1.5 tesla and there were a total of 90 scanners across 58 collaborating institutions. The T1 images were processed using the ANTs cross-sectional cortical thickness pipeline Tustison et al. (2019). Lastly, the cortical thickness measures were derived from the pre-processed images as the average thickness in 62 cortical regions, 31 in each hemisphere, defined through the Desikan-Killiany Atlas Klein & Tourville (2012). For additional information, see SI Appendix.
To investigate the potential impact of scanner differences in covariance using MVPA, we conducted an experiment to predict scanner manufacturer labels using data harmonized with existing methods. In particular, we use a Monte Carlo split-sample experiment including data acquired on any scanner used to image three or more subjects. This sample consists of 505 subjects across 64 scanners, with 213 subjects acquired on scanners manufactured by Siemens, 70 by Philips, and 222 by GE. Using the 62 cortical thickness values as inputs, we i) randomly split the sample into 50% training data and 50% testing data; ii) train a random forests algorithm to recognize if a scanner was manufactured by Siemens, and iii) assess predictive performance on the testing data. We train it using unharmonized data, data harmonized via the existing ComBat method, and data harmonized via our proposed method CovBat. For all tests, we accounted for the possibility that scanner could be detected through the covariates age, sex, and disease status by residualizing out those variables. We repeat steps (i)-(iii) 100 times and report the area under the receiver operating characteristic curve (AUC) values. Figure 1 shows that Siemens scanners are identifiable based on unharmonized cortical thickness measurements (AUC 0.89±0.02), which is consistent with recent findings Glocker et al. (2019). We also note that scanner manufacturer is still detectable after ComBat is applied (AUC 0.66±0.03). After using the proposed method, however, the performance for distinguishing Siemens scanners is close to random (AUC 0.54±0.03).
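The split-sample procedure in steps (i)-(iii) can be sketched as follows. This is a minimal illustration rather than the code used for the reported results: the function names are hypothetical, the AUC is computed from ranks (the Mann-Whitney statistic), and a simple mean-difference projection stands in for the random forests classifier so that the example depends only on NumPy.

```python
import numpy as np

def auc_from_scores(y_true, scores):
    """Rank-based AUC (Mann-Whitney U), with average ranks for tied scores."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average the ranks of tied scores
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def mean_difference_scorer(X_train, y_train, X_test):
    """Stand-in classifier (NOT random forests): project onto the class-mean difference."""
    w = X_train[y_train == 1].mean(axis=0) - X_train[y_train == 0].mean(axis=0)
    return X_test @ w

def split_sample_auc(X, y, scorer, n_repeats=100, seed=0):
    """Repeat a random 50/50 split, train on one half, report AUC on the other."""
    rng = np.random.default_rng(seed)
    n = len(y)
    aucs = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        train, test = perm[: n // 2], perm[n // 2:]
        scores = scorer(X[train], y[train], X[test])
        aucs.append(auc_from_scores(y[test], scores))
    return float(np.mean(aucs)), float(np.std(aucs))
```

Any classifier that returns a continuous score for the test half can be passed as `scorer`; a random forest would be substituted there to mirror the experiment described above.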
Statistical Harmonization of Covariance
Combatting Batch Effects
An increasingly popular method for harmonization of neuroimaging measures is ComBat Fortin et al. (2017, 2018); Johnson et al. (2007). This method seeks to remove the mean and variance scanner effects in the data in an empirical Bayes framework. Let $y_{ij} = (y_{ij1}, y_{ij2}, \ldots, y_{ijp})^T$, $i = 1, \ldots, M$, $j = 1, \ldots, n_i$ denote the $p \times 1$ vectors of observed data where $p$ is the number of features. Our goal is to harmonize these vectors across the $M$ scanners indexed by $i$. ComBat assumes that the features indexed by $v$ follow
$$y_{ijv} = \alpha_v + x_{ij}^T \beta_v + \gamma_{iv} + \delta_{iv} e_{ijv},$$
where $\alpha_v$ is the intercept, $x_{ij}$ is the vector of covariates, $\beta_v$ is the vector of regression coefficients, $\gamma_{iv}$ is the mean scanner effect, and $\delta_{iv}$ is the variance scanner effect. The errors $e_{ijv}$ are assumed to follow $e_{ijv} \sim N(0, \sigma_v^2)$. ComBat then finds empirical Bayes point estimates $\gamma_{iv}^*$ and $\delta_{iv}^*$ and then residualizes with respect to these estimates to obtain
$$y_{ijv}^{ComBat} = \frac{y_{ijv} - \hat{\alpha}_v - x_{ij}^T \hat{\beta}_v - \gamma_{iv}^*}{\delta_{iv}^*} + \hat{\alpha}_v + x_{ij}^T \hat{\beta}_v.$$
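To make the location and scale adjustment concrete, the sketch below applies the same kind of correction using ordinary least squares and plain per-scanner moments. It is a simplified illustration and not the ComBat implementation: there is no empirical Bayes shrinkage of the scanner estimates, and the function name and argument layout are assumptions made for this example.

```python
import numpy as np

def location_scale_harmonize(Y, X, scanner):
    """Simplified location/scale harmonization (no empirical Bayes step).

    Y: (n, p) features; X: (n, c) covariates; scanner: (n,) scanner labels.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    labels, idx = np.unique(np.asarray(scanner).ravel(), return_inverse=True)
    onehot = np.eye(len(labels))[idx]                   # (n, M) scanner indicators
    design = np.hstack([X, onehot])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)   # (c + M, p) coefficients
    beta, scanner_means = coef[: X.shape[1]], coef[X.shape[1]:]
    alpha = onehot.mean(axis=0) @ scanner_means         # sample-size-weighted intercept
    resid = Y - design @ coef                           # removes alpha, X beta, and gamma_i
    pooled_sd = np.sqrt((resid ** 2).mean(axis=0))
    adjusted = np.empty_like(Y)
    for i in range(len(labels)):
        rows = idx == i
        delta = resid[rows].std(axis=0) / pooled_sd     # multiplicative scanner effect
        adjusted[rows] = resid[rows] / delta
    return adjusted + alpha + X @ beta                  # add intercept and covariates back
```

After this adjustment every scanner shares the same residual mean and variance per feature, but nothing constrains the correlations between features, which is the gap the next section addresses.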
Correcting Covariance Batch Effects
We propose the CovBat algorithm by accounting for the joint distribution of ComBat-adjusted observations as follows:
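Step 1. We first perform ComBat to remove the mean and variance shifts in the marginal distributions of the cortical thickness measures and additionally residualize with respect to the covariates to obtain ComBat-adjusted residuals $e_{ij}^{ComBat} = (e_{ij1}^{ComBat}, e_{ij2}^{ComBat}, \ldots, e_{ijp}^{ComBat})^T$ where
$$e_{ijv}^{ComBat} = \frac{y_{ijv} - \hat{\alpha}_v - x_{ij}^T \hat{\beta}_v - \gamma_{iv}^*}{\delta_{iv}^*}.$$
Here hats denote estimated regression parameters and $\gamma_{iv}^*$, $\delta_{iv}^*$ are the empirical Bayes estimates of the mean and variance scanner effects.
Step 2. The $e_{ij}^{ComBat}$ are assumed to have mean 0; their covariance matrices, which we denote by $\Sigma_i$, however, may differ across scanners. We thus perform principal components analysis (PCA) on the full dataset to obtain the full-data covariance matrix as
$$\Sigma = \sum_{k=1}^{q} \lambda_k \phi_k \phi_k^T,$$
where $q$ is the rank of $\Sigma$. The ComBat-adjusted residuals can then be expressed as
$$e_{ij}^{ComBat} = \sum_{k=1}^{q} \xi_{ijk} \phi_k,$$
where the $\lambda_k$ are the eigenvalues of $\Sigma$, $\phi_k$ are the principal components obtained as the eigenvectors of $\Sigma$, $\xi_{ijk}$ are the principal component scores, and $K < q$ is chosen to capture the majority of the variation in the observations. After applying this decomposition, the scanner-specific covariance matrices can be expressed using the full data eigenvectors as
$$\Sigma_i = \sum_{k=1}^{K} \lambda_{ik} \phi_k \phi_k^T,$$
where $\lambda_{ik}$ are scanner-specific eigenvalues. This model assumes that the covariance scanner effect is contained within the $\lambda_{ik}$, which can be approximated as the sample variance of the principal component scores $\xi_{ijk}$.
Step 3. Thus, we posit:
$$\xi_{ijk} = \mu_{ik} + \rho_{ik} \epsilon_{ijk},$$
where $\epsilon_{ijk} \sim N(0, \tau_k^2)$ and $\mu_{ik}$, $\rho_{ik}$ are the center and scale parameters corresponding to each principal component indexed by $k$. Note that this is analogous to the ComBat model, applied to each of the $K$ principal component scores instead of the original measures. After imposing an analogous prior on each parameter, we can then estimate each of the $K$ pairs of center and scale parameters by finding the values that bring each scanner’s mean and variance in scores to the pooled mean and variance. We then remove the scanner effect in the scores via
$$\xi_{ijk}^{CovBat} = \frac{\xi_{ijk} - \hat{\mu}_{ik}}{\hat{\rho}_{ik}}.$$
Step 4. Finally, we obtain CovBat-adjusted residuals $e_{ij}^{CovBat} = (e_{ij1}^{CovBat}, e_{ij2}^{CovBat}, \ldots, e_{ijp}^{CovBat})^T$ by projecting the adjusted scores back into the residual space via
$$e_{ij}^{CovBat} = \sum_{k=1}^{K} \xi_{ijk}^{CovBat} \phi_k + \sum_{l=K+1}^{q} \xi_{ijl} \phi_l.$$
We then add the intercepts and covariates effects estimated in Step 1 to obtain CovBat-adjusted observations
$$y_{ijv}^{CovBat} = e_{ijv}^{CovBat} + \hat{\alpha}_v + x_{ij}^T \hat{\beta}_v.$$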
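Steps 2 through 4 can be sketched in a few lines of NumPy. This is an illustrative simplification and not the released CovBat code: it takes residuals that have already been location/scale adjusted, uses plain per-scanner moments of the scores in place of the empirical Bayes estimates, and the function name and the `var_explained` argument are assumptions for this example.

```python
import numpy as np

def covbat_residuals(E, scanner, var_explained=0.80):
    """Harmonize the covariance of residuals E (n, p) across scanners.

    Returns the adjusted residuals and the number of PCs (K) that were adjusted.
    """
    E = np.asarray(E, dtype=float)
    scanner = np.asarray(scanner).ravel()
    center = E.mean(axis=0)
    Ec = E - center
    # Step 2: PCA on the full data via the pooled covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(Ec, rowvar=False))
    order = np.argsort(eigvals)[::-1]                   # sort PCs by decreasing variance
    eigvals = np.clip(eigvals[order], 0, None)
    eigvecs = eigvecs[:, order]
    cum = np.cumsum(eigvals) / eigvals.sum()
    K = min(int(np.searchsorted(cum, var_explained)) + 1, E.shape[1])
    scores = Ec @ eigvecs                               # xi_ijk for every subject and PC
    # Step 3: bring each scanner's score mean and variance to the pooled values
    adjusted = scores.copy()
    pooled_sd = scores[:, :K].std(axis=0)
    for s in np.unique(scanner):
        rows = scanner == s
        block = scores[rows, :K]
        mu = block.mean(axis=0)                         # scanner-specific center
        rho = block.std(axis=0) / pooled_sd             # scanner-specific scale
        adjusted[rows, :K] = (block - mu) / rho
    # Step 4: project adjusted scores (first K) and untouched scores (rest) back
    return adjusted @ eigvecs.T + center, K
```

Only the first `K` score columns are modified; the remaining components pass through unchanged, matching the two sums in Step 4.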
Harmonization Evaluation
We focus our evaluation framework on removal of scanner effects in covariance rather than mean and variance which have been shown to be addressed by ComBat in previous papers Fortin et al. (2017, 2018). Hence, we propose tests that directly assess harmonization of correlation matrices across scanners. Furthermore, we assess the degree to which residual scanner effects can affect comparison between scanners and clinically meaningful associations.
To assess scanner effects in covariance, we examine the correlation matrices before and after harmonization. Additionally, we quantify the similarity of correlation matrices between scanners by computing, for each pair of scanners, the Frobenius norm of the difference between their correlation matrices. For this measure, lower values indicate greater harmonization in covariance.
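As an illustration of this metric, the following sketch computes the Frobenius norm of the difference between the correlation matrices of every pair of scanners. The function name and input layout are assumptions for this example.

```python
import numpy as np
from itertools import combinations

def pairwise_frobenius(X, scanner):
    """Frobenius distance between correlation matrices for each pair of scanners.

    X: (n, p) features (ideally residualized on covariates); scanner: (n,) labels.
    """
    X = np.asarray(X, dtype=float)
    scanner = np.asarray(scanner).ravel()
    corr = {s: np.corrcoef(X[scanner == s], rowvar=False)
            for s in np.unique(scanner).tolist()}
    return {(a, b): float(np.linalg.norm(corr[a] - corr[b], ord="fro"))
            for a, b in combinations(corr, 2)}
```

A value of zero means the two scanners have identical sample correlation matrices, so comparing these distances before and after harmonization summarizes how much of the covariance scanner effect has been removed.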
We also evaluate if the harmonization procedures affect the results of MVPA. Similar to the earlier experiments in Section 4 for classifying scanner, we i) randomly split the subjects into 50% training set and 50% validation set; ii) train a random forests algorithm to detect a binary clinical covariate, and iii) assess predictive performance on the validation set via AUC. We train separate models for unharmonized, ComBat-harmonized, and CovBat-harmonized data where both harmonization methods are performed including age, sex, and diagnosis status as covariates. We perform these steps (i)-(iii) 100 times and again report the AUC values. For these experiments, higher AUC would indicate greater ability to recover biologically meaningful associations.
Material and Methods
Here, we provide an overview of the materials and methods used in this paper. More details can be found in SI Appendix, and the code for executing the analyses described here is available at (https://github.com/andy1764/CovBatHarmonization).
ADNI Data
All data for this paper are obtained from ADNI (http://adni.loni.usc.edu/) and processed using the ANTs longitudinal cortical thickness pipeline Tustison et al. (2018) with code available on GitHub (https://github.com/ntustison/CrossLong). We briefly summarize the steps involved. First, raw MP-RAGE images, or images from the equivalent sequence for GE scanners, are downloaded from the ADNI-1 database. The images are first processed using the ANTs cross-sectional cortical thickness pipeline Tustison et al. (2014), which involves N4 bias correction, brain extraction, Atropos n-tissue segmentation, and registration-based cortical thickness estimation. Then, a single-subject template is created for each individual using all of their repeated scans, and the template is subsequently used in rigid registration of the subject’s images. For our analyses, we only use the cortical thickness values of the baseline scans.
We define scanner based on information contained within the Digital Imaging and Communications in Medicine (DICOM) files for each scan. Specifically, subjects are considered to be acquired on the same scanner if they share the same location of scan, scanner manufacturer, scanner model, head coil, and magnetic field strength. In total, this definition yields 142 distinct scanners, of which 78 had fewer than three subjects and were removed from analyses.
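The scanner definition amounts to grouping scans by a composite key and dropping small groups. The sketch below shows one way to do this with the standard library; the dictionary keys in `SCANNER_FIELDS` are hypothetical names chosen for illustration and are not actual DICOM tag names.

```python
from collections import Counter

# Hypothetical field names for the five attributes that define a scanner.
SCANNER_FIELDS = ("site", "manufacturer", "model", "head_coil", "field_strength")

def assign_scanners(records, min_subjects=3):
    """Group subject records by scanner key and keep scanners with enough subjects.

    records: list of dicts, one per subject, containing SCANNER_FIELDS.
    Returns (kept (record, key) pairs, number of distinct scanners, number kept).
    """
    keys = [tuple(r[f] for f in SCANNER_FIELDS) for r in records]
    counts = Counter(keys)
    kept = {k for k, n in counts.items() if n >= min_subjects}
    pairs = [(r, k) for r, k in zip(records, keys) if k in kept]
    return pairs, len(counts), len(kept)
```

With the threshold of three subjects used here, the counts reported above correspond to 142 distinct keys of which 64 are retained.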
Simulation Design
Let $y_{ij}$, $i = 1, 2, 3$, $j = 1, 2, \ldots, 100$ be vectors of length 62 representing the simulated outcome for cortical thickness values in 62 regions. The $y_{ij}$ are generated using the following model:
$$y_{ij} = \alpha + x_{ij} \beta + \gamma_i + \delta_i \circ e_{ij},$$
where $\circ$ denotes the element-wise product, $x_{ij}$ is a single binary covariate drawn from a Bernoulli random variable with probability 0.5, $\alpha$ is the vector of intercepts, $\beta$ is the vector of coefficients, $\gamma_i = (\gamma_{i1}, \gamma_{i2}, \ldots, \gamma_{i62})^T$ are vectors of region-specific mean shifts drawn from independently and identically distributed (i.i.d.) standard normal distributions and $\delta_i = (\delta_{i1}, \delta_{i2}, \ldots, \delta_{i62})^T$ are vectors of region-specific scale shifts drawn from i.i.d. scanner-specific inverse gamma distributions with chosen parameters. For our simulations, we chose to sufficiently distinguish the scanner-specific scaling factors by assuming $\delta_{1v} \sim \text{Inverse Gamma}(2, 0.5)$, $\delta_{2v} \sim \text{Inverse Gamma}(3, 1)$, and $\delta_{3v} \sim \text{Inverse Gamma}(4, 2)$ for $v = 1, 2, \ldots, 62$. The error terms follow $e_{ij} \sim N(0, \Sigma + \Omega_i + x_{ij} \Psi)$ where $\Sigma$ is the sample covariance matrix of Scanner B in the ADNI analyses, $x_{ij}$ is the same binary covariate, $\Omega_i$ are scanner-specific covariance shift matrices, and $\Psi$ is a chosen covariance shift matrix which can be similar to any of the $\Omega_i$. To ensure that the covariance matrices are positive semi-definite, we set the negative eigenvalues equal to a small constant, $10^{-12}$. This method for inducing covariance scanner effects ensures flexibility in the direction, complexity, and confounding of the effect. For additional details and results of the specific simulation settings considered in this paper, see SI Appendix.
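A minimal sketch of this data-generating process for a single scanner is given below. It is an illustration under stated assumptions rather than the simulation code itself: the inverse gamma distribution is taken to be parameterized by shape and scale, the scale shifts are applied to the errors element-wise, and all function and argument names are hypothetical.

```python
import numpy as np

def nearest_psd(A, floor=1e-12):
    """Replace negative eigenvalues with a small constant to obtain a PSD matrix."""
    A = (A + A.T) / 2                        # symmetrize before decomposing
    vals, vecs = np.linalg.eigh(A)
    vals = np.where(vals < 0, floor, vals)
    return (vecs * vals) @ vecs.T            # V diag(vals) V^T

def simulate_scanner(n, alpha, beta, Sigma, Omega, Psi, ig_shape, ig_scale, rng, p_cov=0.5):
    """Draw n subjects from one scanner; returns outcomes Y (n, p) and covariate x (n,)."""
    p = len(alpha)
    x = rng.binomial(1, p_cov, size=n)                      # binary covariate
    gamma = rng.normal(size=p)                              # mean scanner effect
    delta = ig_scale / rng.gamma(ig_shape, 1.0, size=p)     # inverse gamma scale shifts
    # Error covariance depends on the scanner (Omega) and on the covariate (Psi)
    covs = [nearest_psd(Sigma + Omega + v * Psi) for v in (0, 1)]
    E = np.stack([rng.multivariate_normal(np.zeros(p), covs[v]) for v in x])
    Y = alpha + x[:, None] * beta + gamma + delta * E
    return Y, x
```

Calling `simulate_scanner` once per scanner with a different `Omega`, `ig_shape`, and `ig_scale`, then stacking the outputs, would give a three-scanner dataset of the kind described above.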
CovBat Reduces Covariance Scanner Effect
We apply CovBat to observations acquired on the three scanners with the largest number of subjects. Scanner A is a Siemens Symphony 1.5T scanner while Scanners B and C are GE Signa Excite 1.5T scanners. See SI Appendix for demographic details. We observe that demographic variables differ across scanners, so we residualize each cortical thickness measure on age, sex, and diagnosis status to obtain the correlation structure independent of these clinical covariates. Figure 2 shows the correlation matrices for each scanner using the residualized cortical measures both before and after CovBat. The differences between the unharmonized correlation matrices are striking. Especially notable are the increased positive correlations across most pairs of cortical regions in Scanner A and the weakened right-left correlations in Scanner C, visible as the diagonal line in the top-left and bottom-right quadrants. Visually, the correlation structures are considerably more similar across scanners after CovBat; the correlation structures of Scanners A and B are almost indistinguishable after this adjustment.
We also compare with harmonization via ComBat, and report our quantitative results for ComBat-adjusted as well as CovBat-adjusted correlation matrices in Table 1. A tuning parameter of the CovBat model is the desired proportion of variance explained in the dimension reduction space, which we selected at 80% (26 PCs). To ensure that our results do not depend strongly on the choice of tuning parameter, we also report the minimum and maximum of the pairwise Frobenius norms after applying CovBat with percent variation explained ranging from 50% (10 PCs) to 99% (53 PCs). We report the results of this sensitivity analysis in parentheses. We find that ComBat adjustment does not harmonize the correlation matrices whereas CovBat adjustment shows large reductions in the between-scanner distances across a range of tuning parameter choices.
CovBat Recovers Biological Associations
It is well-known that cortical thickness differs substantially by sex and Alzheimer’s disease status Lerch et al. (2005); Sowell et al. (2007). To assess whether CovBat maintains biological associations of interest, we perform two MVPA experiments using random forests to classify healthy versus Alzheimer’s disease (AD) and to differentiate patients by sex. Figure 3 shows that detection of these biological differences is considerably improved by either harmonization method, but the proposed CovBat approach shows an even greater performance improvement. For detection of AD, the mean AUC increases from 0.74 (±0.03) in raw data to 0.78 (±0.03) in ComBat-harmonized data to 0.79 (±0.03) in CovBat-harmonized data. Similarly, the mean AUC for detection of sex increased from 0.66 (±0.03) to 0.69 (±0.03) to 0.70 (±0.04). These findings suggest that CovBat not only provides thorough removal of scanner effects, but also helps to recover clinical associations.
Findings Replicated in Simulations
To test our harmonization method, we create simulated datasets based on a modified version of the ComBat model which includes scanner effects in covariance. We impose mean and variance scanner effects on a ground truth multivariate normal distribution and additionally modify the covariance matrix of this distribution by scanner. To achieve the latter, we add high-rank scanner-specific matrices to the underlying true covariance matrix so that the scanner effect can be corrected through adjustment of PC scores but requires harmonization of a sufficiently high number of PCs. To test detection of a simulated covariate, we let the distribution of the outcome measures depend on the presence of a binary covariate drawn from a Bernoulli distribution. Our simulations consisted of three scanners each with 100 simulated subjects with a binary covariate drawn from a Bernoulli(0.25) distribution. We further simulate a covariate associated with a decrease in the mean of the cortical thickness values for 15 ROIs in both hemispheres. Additional details are available in SI Appendix.
Covariate Effect on Mean
In a first scenario where the covariate does not influence the covariance, we anticipate that harmonization of mean and variance is sufficient to remove any confounding of scanner effect with the association of interest. We also anticipate that harmonization of covariance should make scanner harder to detect via MVPA. We perform experiments to test these hypotheses under the same paradigm as the MVPA experiments implemented on the ADNI dataset and report the results in Figure 4. The results show that both ComBat (AUC 0.59±0.04) and CovBat (AUC 0.54±0.04) underperform compared with the raw data (AUC 0.63±0.05) for detection of the simulated covariate. This result could be attributable to covariate effects remaining after the ComBat residualization step, which would then be removed through the harmonization of mean across scanners. As for detection of scanner, we find that Scanner 1 is almost perfectly detected in the raw data (AUC 0.999±0.001), obscured after ComBat (AUC 0.58±0.05), and nearly impossible to detect after CovBat (AUC 0.53±0.03).
Covariate Effect on Covariance
In a second simulation scenario, we study the impact of an additional covariate effect on variance and covariance that is confounded with the scanner effects. To achieve this, we allowed the covariate effect on covariance to be proportional to a chosen scanner’s covariance shift (see SI Appendix for details). This scenario represents a situation where detection of the covariate using MVPA would be highly influenced by the presence of scanner effects. Without harmonization of covariance, observations from the chosen scanner resemble observations obtained from subjects with the covariate. Consequently, we expect that ComBat alone would be insufficient to recover the covariate association and that CovBat would outperform it on this metric. Since we made no changes to the scanner effects, we again anticipate that detection of scanner should become less accurate after CovBat. The results of the MVPA experiments are shown in Figure 4. As anticipated, we observe that the mean AUC using the raw data is the lowest (0.82±0.03), ComBat shows some performance increase (0.85±0.03), and CovBat performs the best (0.88±0.02). Detection of scanner also follows our observations in ADNI data, with Scanner 1 almost perfectly detected in the raw data (AUC 0.999±0.001), difficult to detect after ComBat (AUC 0.58±0.04), and nearly impossible to detect after CovBat (AUC 0.53±0.03).
Discussion
The growing number of multi-site studies across diverse fields has spurred the development of harmonization methods that are general, but also account for field-specific challenges. In neuroimaging research, the rise of MVPA has established an unmet need for harmonization of covariance. We demonstrated that strong scanner effects in covariance exist, that they remain after state-of-the-art harmonization, and that they could influence downstream MVPA experiments. We then proposed a novel method demonstrated to be effective in removing scanner differences in covariance and improving the detection of biological associations via MVPA. Simulation studies further replicated these observations, and suggest that the improvement in covariate detection could be linked to confounding between scanner effect and covariate effect on the covariance between multivariate measurements. This finding suggests that future work could aim to control for covariate effects on variance and covariance so that harmonization does not remove desired properties of the data. While our study focused on structural neuroimaging data, our findings extend directly to functional, metabolic, and other imaging modalities. Further studies should also determine the extent to which multivariate statistical and machine learning studies of genomic data are susceptible to the biases documented.
Supporting Information Appendix (SI)
ADNI Dataset Demographics
The subsample of 505 subjects included in the study has a mean age of 75.3 (SD 6.70) and comprises 278 (55%) males, 115 (22.8%) Alzheimer’s disease (AD) patients, 239 (47.3%) late mild cognitive impairment (LMCI) patients, and 151 (29.9%) cognitively normal (CN) individuals. Demographics for the subsample drawn from the three largest sites in the dataset are listed in Table 3. Since the correlations between cortical thickness values are of primary interest in our study, we display the correlation matrices annotated with the 62 regions of interest (ROIs) in Figure 6.
Harmonization using Subset
Both ComBat and CovBat estimate and residualize out the covariate effects using the full data; however, there are cases where only a subset of the data is available when performing harmonization. For instance, if a group of subjects has already been acquired, prediction on subjects subsequently acquired on the same scanners could only leverage data from the original sample. In this scenario, the new sample can be harmonized using ComBat or CovBat by estimating the covariate effect using the original sample, then proceeding with subsequent steps as usual.
We evaluate this modification by repeating our main MVPA analyses using ADNI data with different subsampling of the patients. Specifically, we replace step (i) in both analyses by instead splitting the sample into 270 training subjects and 235 testing subjects such that both the train and test sets contain at least one subject acquired on each scanner. We then apply ComBat and CovBat by estimating the βv, v = 1, 2, …, 62 using only the training subjects. We report the results in Figure 7. The results appear quite similar to harmonization using the full dataset, except with additional variance in the AUC values for detection of male. Detection of site still worsens after ComBat (AUC 0.67±0.03) and is almost at chance after CovBat (AUC 0.54±0.03). For detection of AD, improvements are demonstrated after ComBat adjustment (AUC 0.76±0.03) and greater improvements after CovBat (AUC 0.77±0.03). For detection of male, lesser improvement is observed from ComBat (AUC 0.68±0.03) to CovBat (AUC 0.68±0.03).
Simulation Settings
In the first simulation setting, we assume that the covariate only affects the mean of the measurements. We choose βv = −0.5 for 15 regions of interest in both the left and right hemispheres to impose that about half of the ROIs are negatively associated with the covariate. We also choose Ψ = 0 where 0 is a 62 × 62 zero matrix to ensure that each site-specific covariance matrix only depends on the underlying true covariance matrix Σ and the chosen Ωi matrices. The covariance matrices across sites are shown in Figure 5 with associated pairwise Frobenius norms listed in Table 2.
In the second simulation setting, we assume that the covariate affects not only mean, but also variance and covariance. We use the same β value as above but choose Ψ to be related to Ω2 to force confounding of site and covariate effects on covariance. Specifically, we set Ψi,i = Ωi,i and Ψi,j = Ωi,j/2 for i ≠ j and i = 1, 2, …, 62, j = 1, 2, …, 62, where Ωi,j here denotes the (i, j) entry of Ω2. The simulation findings are shown in the main paper to be consistent with findings from the ADNI data application. To better illustrate the effects of harmonization, we plot the stratified covariance matrices for subjects whose binary covariate equals 0 or 1, before and after CovBat in Figure 8. We observe that CovBat harmonization leads to better differentiation between the two subject groups. Meanwhile, the differences across sites are much smaller after CovBat, as evident in Figure 9 and Table 4.
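The construction of Ψ from a scanner's covariance shift matrix can be written in two lines. The sketch below is illustrative only, and the function name is an assumption.

```python
import numpy as np

def confounded_shift(Omega):
    """Build Psi: same diagonal as Omega, off-diagonal entries halved."""
    Omega = np.asarray(Omega, dtype=float)
    Psi = Omega / 2.0                          # halve every entry (new array)
    np.fill_diagonal(Psi, np.diag(Omega))      # restore the original diagonal
    return Psi
```

Because Ψ shares its structure with the chosen scanner's shift matrix, subjects with the covariate resemble subjects from that scanner in their covariance, which is what produces the confounding studied in this setting.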
Acknowledgements
The majority of the data used in this paper are derived from the ADNI study. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.;Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.;Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute
Footnotes
3 Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf