Global Signal Regression Strengthens Association between Resting-State Functional Connectivity and Behavior

Global signal regression (GSR) is one of the most debated preprocessing strategies for resting-state functional MRI. GSR effectively removes global artifacts driven by motion and respiration, but also discards globally distributed neural information and introduces negative correlations between certain brain regions. The vast majority of previous studies have focused on the effectiveness of GSR in removing imaging artifacts, as well as its potential biases. Given the growing interest in functional connectivity fingerprinting, here we considered the utilitarian question of whether GSR strengthens or weakens associations between resting-state functional connectivity (RSFC) and multiple behavioral measures across cognition, personality and emotion. By applying the variance component model to the Brain Genomics Superstruct Project (GSP), we found that behavioral variance explained by whole-brain RSFC increased by an average of 47% across 23 behavioral measures after GSR. In the Human Connectome Project (HCP), we found that behavioral variance explained by whole-brain RSFC increased by an average of 40% across 58 behavioral measures, when GSR was applied after ICA-FIX de-noising. To ensure generalizability, we repeated our analyses using kernel regression. GSR improved behavioral prediction accuracies by an average of 64% and 12% in the GSP and HCP datasets respectively. Importantly, the results were consistent across methods. A behavioral measure with greater RSFC-explained variance (using the variance component model) also exhibited greater prediction accuracy (using kernel regression). A behavioral measure with greater improvement in behavioral variance explained after GSR (using the variance component model) also enjoyed greater improvement in prediction accuracy after GSR (using kernel regression). Furthermore, GSR appeared to benefit task performance measures more than self-reported measures. Since GSR was more effective at removing motion-related and respiratory-related artifacts, GSR-related increases in variance explained and prediction accuracies were unlikely the result of motion-related or respiratory-related artifacts. However, it is worth emphasizing that the current study focused on whole-brain RSFC, so it remains unclear whether GSR improves RSFC-behavioral associations for specific connections or networks. Overall, our results suggest that at least in the case for young healthy adults, GSR strengthens the associations between RSFC and most (although not all) behavioral measures. Code for the variance component model and ridge regression can be found here: https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/preprocessing/Li2019_GSR. Highlights Global signal regression improves RSFC-behavior associations Global signal regression improves RSFC-based behavioral prediction accuracies Improvements replicated across two large-scale datasets and methods Task-performance measures enjoyed greater improvements than self-reported ones GSR beneficial even after ICA-FIX


Introduction
Resting-state functional connectivity (RSFC) is a powerful tool for measuring the synchronization of fMRI signals between brain regions, while participants are lying at rest without any "extrinsic" task (Biswal et al., 1995;Fox and Raichle 2007;Buckner et al., 2013).
One of the most contentious rs-fMRI preprocessing steps is global signal regression (GSR; Uddin 2017; Power et al., 2017a). Although the exact implementation varies across studies (Liu et al., 2017), GSR typically involves the regression of whole brain (or gray matter or cortical) fMRI signal from every brain voxel (Power et al., 2014;Burgess et al., 2016;Power et al., 2018). The global signal is associated with head motion, respiration and cardiac rhythms (Birn et al., 2006;Power et al., 2014;Liu et al., 2017). Consequently, GSR is highly effective at removing global artifacts arising from motion and other physiological sources. More specifically, studies have shown that GSR reduces the correlation magnitude between quality control metrics and RSFC (Satterthwaite et al., 2013;Power et al., 2014;Burgess et al., 2016;Ciric et al., 2016;Parkes et al., 2018), as well as removes prominent increases and/or decreases in signal intensities lasting many TRs (Power et al., 2014;Byrge et al., 2017;Glasser et al., 2018). GSR also improves neuronalhemodynamic correspondence (Keller et al., 2013).
However, GSR has been criticized to introduce negative correlations, so the signs of the resulting correlations might be hard to interpret, i.e., negative correlations after GSR might be a mathematical consequence of GSR, rather than actual inhibitory interactions between brain regions (Murphy et al., 2009;Weissenbacher et al., 2009). GSR might also distort group differences (Saad et al., 2012). For example, the global signal differs between certain populations, so GSR might remove RSFC differences between diseased populations and healthy controls (Yang et al., 2014; but see Parkes et al., 2018). Furthermore, because the global signal itself contains neuronal information (Schölvinck et al., 2010;Matsui et al., 2016;Wen and Liu, 2016) and is associated with vigilance and arousal (Wong et al., 2013(Wong et al., , 2016Yeo et al., 2015b;Chang et al., 2016), some have argued against removing such information (Cole et al., 2010;Ben Simon et al., 2017;Glasser et al., 2018). Finally, although GSR reduces the correlation magnitude between quality control metrics and RSFC, this reduction is distance dependent, which might potentially introduce biases in certain analyses (Power et al., 2014;Parkes et al., 2018).
A different, but potentially useful, perspective is that GSR simply changes the "reference" signal, and hence what is being measured. More specifically, analyses without GSR evaluates total fMRI fluctuation (global signal fluctuations plus fluctuations relative to the global signal), while analyses with GSR examines fMRI fluctuations relative to the global signal (Yeo et al., 2015b). Therefore, whether GSR should be performed or not depends on the scientific or clinical question being asked. Indeed, when using electroencephalogram (EEG) to detect seizures, the "best" reference electrode(s) is the one that makes the epileptiform discharge most visible to the clinician (Murphy and Fox, 2017). Similarly, removing the global gene expression pattern might also be useful for genetic analyses (Krienen et al., 2016;Anderson et al., 2018;Filbin et al., 2018). Therefore, under this perspective, whether GSR should be performed depends on its utility in the problem being studied.
Given significant interest in the relationship between behavior and functional brain architecture revealed by RSFC (Fox et al., 2012;Mueller et al., 2013;Smith et al., 2015;Finn et al., 2015;Rosenberg et al., 2016;Dubois et al., 2016;Bertolero et al., 2018;Kong et al., 2018), here we explored the utilitarian question of whether GSR strengthens or weakens the association between RSFC and behavior in young healthy adults. In the literature, results on whether GSR strengthens or weakens RSFC-behavioral association are mixed. While some studies have suggested that GSR leads to stronger relationships between RSFC and behavior (Hampson et al., 2010;Kruschwitz et al., 2015;Yeo et al., 2015b), others have suggested that GSR weakens relationship between RSFC and behavior, especially when comparing healthy controls and diseased populations Yang et al., 2014; but see Parkes et al., 2018).
In this work, we utilized the variance component model (Yang et al., 2011;Sabuncu et al., 2016) to quantify the association between behavior and RSFC with and without GSR. The variance component model has been widely used to estimate the variance explained by genome-wide genetic variants for a complex trait (Yang et al., 2011), and was recently applied to associating neuroimaging data with behavioral measurements . To ensure our conclusions were robust to the choice of analysis strategy, we repeated the analyses using kernel ridge regression for RSFC-based behavioral prediction (He et al., 2018). Kernel ridge regression provides a fast and effective way for predicting behavioral phenotype in individual subjects using rs-fMRI (He et al., 2018;Kong et al., 2018). Here, we applied both approaches (variance component model and kernel ridge regression) to two large-scale datasets of young healthy adults: the Brain Genomics Superstruct Project (GSP; Holmes et al., 2015) and the Human Connectome Project (HCP; Van Essen et al., 2013).
Because the two datasets were collected using different scanner type (e.g., Trio versus Skyra) and acquisition sequence (e.g., multiband versus non-multiband), this allows us to test whether the effects are robust across acquisition parameters. In the case of the HCP dataset, it has previously been denoised using ICA-FIX Griffanti et al., 2014). This allows us to test whether GSR was useful above and beyond ICA-FIX.

Datasets and preprocessing
Two publicly available datasets were examined: Brain Genomics Superstruct Project (GSP; Holmes et al., 2015) and Human Connectome Project (HCP) S1200 release . Both datasets contained structural MRI, rs-fMRI, and multiple behavioral measures for each subject.
All participants in the GSP dataset were healthy and young (ages 18-35). MR images were collected on matched 3T Tim Trio scanners (Siemens Healthcare, Erlangen, Germany) at Harvard University and Massachusetts General Hospital using the vendor-supplied 12channel phased-array head coil. The T1-weighted structural images were 1.2mm isotropic.
Most participants (N = 1490) had one session of rs-fMRI data, while a small number of participants (N = 69) had two sessions of data collected less than 6 months apart. Each session included one or two rs-fMRI runs. The behavioral data were collected from computer-based cognitive tasks and personality assessment. Behavioral measurements of non-compliant participants were removed (Holmes et al., 2015). Further details can be found elsewhere (Holmes et al., 2015).
HCP participants (N = 1094) were drawn from a population of twins and siblings.
Participants were healthy and young (ages 22-37). All imaging data were acquired on a customized Siemens 3T Skyra at Washington University (St Louis) using a multi-band sequence. The structural images were 0.7mm isotropic. The rs-fMRI data were 2mm isotropic with TR = 0.72s. Two sessions of rs-fMRI data were collected in consecutive days for each subject, and each session consisted of one or two runs. The length of each rs-fMRI scan was 14.4 min (1200 frames). Details of the data collection can be found elsewhere (Van Essen et al., 2012;Smith et al., 2013). Details about behavioral measures can be found in HCP S1200 Data Dictionary and Barch et al. (2013).
Each subject's cortical surface mesh was registered to a common spherical coordinate system (Fischl et al., 1999a(Fischl et al., , 1999b. Two preprocessing pipelines were applied to rs-fMRI data: GSP-Baseline and GSP-Baseline+GSR. The pipelines consisted of a combination of FreeSurfer 5.3.0 (Fischl et al., 2012), FSL 5.0.8 (Jenkinson et al., 2012;Smith et al., 2004) and in-house Matlab functions.
Framewise displacement (FD; Jenkinson et al., 2002) and root-mean-square of voxelwise differentiated signal (DVARS) (Power et al., 2012) were estimated using fsl_motion_outliers. Volumes with FD > 0.2mm or DVARS > 50 were marked as outliers (censored frames). One frame before and two frames after these volumes were flagged as censored frames. Uncensored segments of data lasting fewer than five contiguous volumes were also labeled as censored frames (Gordon et al., 2016). BOLD runs with more than half of the volumes labeled as censored frames were removed.
Linear regression of multiple nuisance regressors was applied. In the GSP-Baseline pipeline, nuisance regressors consisted of: (1) a vector of ones and linear trend, (2) six motion correction parameters, (3) averaged white matter signal, (4) averaged ventricular signal, and (5) temporal derivatives of (2) -(4). The white matter mask was obtained from FreeSurfer's segmentation of the anatomical T1 image, followed by three rounds of erosion before projection to the subject's native rs-fMRI space. The ventricle mask was obtained in a similar fashion, but only with one round of erosion. If there were less than 100 voxels (in anatomical space) after erosion, then no erosion was performed. In the GSP-Baseline+GSR pipeline, in addition to the previously mentioned regressors, mean whole brain signal and its temporal derivative were also removed. Across both pipelines, censored frames were ignored when computing the regression coefficients (Power et al., 2014).
A total of 1551 subjects survived the fMRI preprocessing quality control (e.g., motion censoring). 23 behavioral phenotypes, including cognitive and personality measures, were considered (Table S1). 689 participants were further excluded because they did not have all 23 behavioral measurements, yielding a final set of 862 participants.

HCP preprocessing and behavioral data
Details of the HCP preprocessing can be found elsewhere (HCP S1200 manual; Van Essen et al. 2012;Glasser et al. 2013;Smith et al. 2013). Of particular importance is that both cortical and subcortical data were denoised with ICA-FIX Griffanti et al., 2014) and saved in the CIFTI grayordinate format. The surface (fs_LR) data were aligned with MSM-All (Robinson et al., 2013).
Recent studies have suggested that ICA-FIX does not fully eliminate global motionrelated and respiratory-related artifacts (Burgess et al., 2016;Siegel et al., 2017), so further censoring was performed. Volumes with FD > 0.2mm or DVARS > 75 were marked as outliers (censored frames). One frame before and two frames after these volumes were flagged as censored frames. Uncensored segments of data lasting fewer than five contiguous volumes were also labeled as censored frames. BOLD runs with more than half of frames flagged as censored were removed. We refer to this data as being processed by the HCP-Baseline pipeline.
To examine the effects of global signal regression, we additionally regressed the global signal obtained by averaging across all cortical vertices 1 and its temporal derivative (Power et al., 2018). Like before, censored frames were ignored when regression coefficients were computed. We refer to this processing pipeline as HCP-Baseline+GSR.
FC was computed among 419 ROIs using Pearson's correlation. The 400 cortical ROIs were the same as before, but in fs_LR space (Schaefer et al., 2017; https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/brain_parcellation/Scha efer2018_LocalGlobal). The 19 subcortical ROIs were defined based on grayordinate data structure. Pearson's correlations were calculated only with the uncensored frames. For each subject, the correlation matrix was computed for each run, Fisher's z-transformed, and then averaged across runs and sessions, yielding one final 419 x 419 RSFC matrix for each subject. 58 behavioral measures across cognition, personality and emotion were selected (Table S2; Kong et al., 2018). Of the 1029 subjects who survived the motion censoring, an additional 76 participants were excluded due to missing data for the behavioral measures, resulting in 953 subjects.

QC-FC correlations for preprocessed rs-fMRI
We first replicated previous literature showing that GSR reduces imaging-related artifacts, while increasing distance-dependent biases. In addition to the individual subject QC plots pioneered by Power and colleagues (Power et al., 2014;Power, 2017), we also computed QC-FC correlation metrics widely used in the literature (Satterthewaite et al., 2012;Power et al., 2014;Burgess et al., 2016;Ciric et al., 2017). For each ROI pair, the QC-FC correlation was defined as the across-subject correlation between the functional connectivity of the ROI pair and the number of censored frames. The number of censored frames was chosen as the QC measure because it reflected both FD and DVARS, thus taking into account both head motion and other noise sources causing frame-to-frame signal intensity fluctuations (e.g. respiration).

Variance component model
The association between functional connectivity and a behavioral phenotype (e.g., fluid intelligence) was captured by a variance component model (Yang et al., 2011;Sabuncu et al., 2016): where is an × 1 vector of behavioral phenotype and is the number of subjects in the dataset. is an × nuisance covariate matrix. denote the regression coefficients. Age, sex, and motion (FD and DVARS) were included as covariates. Subject-specific effect follows a multivariate normal distribution ∼ ( , 2 ), where 2 is a scaling constant (estimated from data) and is the × functional connectivity similarity matrix among subjects. More specifically, the -th row and -th column of corresponds to the functional connectivity similarity between the -th and -th subjects. In the variance component model, the diagonal entries of the matrix are assumed to be one (Yang et al., 2011;Sabuncu et al., 2016). Inspired by previous RSFC fingerprinting studies that identified individuals based on the Pearson's correlation between their RSFC matrices (Finn et al., 2015), here we defined the functional connectivity similarity between the -th and -th subjects to be the Pearson's correlation between their 419 x 419 RSFC matrices (considering only lower triangular entries since the matrices are symmetric).
Overall, the intuition behind the variance component model is that (2) A larger implies that a larger portion of the individual differences in the behavioral phenotype can be explained by individual differences in RSFC. In practice, 2 and 2 were iteratively estimated using the restricted maximum likelihood (ReML) algorithm .
We applied the variance component model to all 23 behavioral measures, age and sex in the GSP dataset. All participants (N = 862) were utilized. We also applied the variance component model to all 58 behavioral measures, age and sex in the HCP dataset. Because dealing with both RSFC and family structure within the variance component model is tricky, we followed the strategy of choosing one subject from each family , resulting in 419 unrelated subjects. When analyzing age (or sex), age (or sex) was not included as a nuisance covariate.
Since the variance component model assumes the behavioral scores are Gaussian distributed, each behavioral phenotype was quantile normalized before the variance component model was fitted (Elliott et al., 2018). To perform quantile normalization for a particular behavioral phenotype, the scores of all participants were sorted from low to high.
The scores were then replaced by samples from a standard Gaussian distribution (also sorted from low to high). Thus, the resulting behavioral distribution follows a Gaussian distribution.
We note that not performing quantile normalization yielded very similar results.
A jackknife procedure was utilized to quantify the uncertainty of the variance estimates . For each behavioral phenotype, half the participants were randomly removed. The variance component model was fitted for the remaining participants.
This jackknife procedure was repeated 1000 times, resulting in 1000 jackknife estimates of the explained behavioral variance. Because half the participants were deleted in each jackknife sample, the sample variance of the 1000 jackknife estimates can be used as an estimate of the uncertainty of the explained behavioral variance (Shao, 1989;Shao and Tu, 2012).

Kernel ridge regression
To ensure our conclusions are robust to the choice of analysis strategy, we utilized kernel ridge regression (Murphy, 2012)  Finally, we note that linear ridge regression was also considered. The results were consistent with all conclusions in this paper. Because of the current length of this manuscript and because kernel ridge regression accuracies were better than linear ridge regression, we decided to focus on kernel ridge regression for the remainder of this paper.

Consistency of results across methods
We first explored whether the results were consistent between variance component model and kernel ridge regression. For each dataset, we correlated the RSFC-explained variances and prediction accuracies across behavioral measures. In other words, we wanted to explore whether a behavioral measure better explained by RSFC (using the variance component model) was also more easily predicted by kernel ridge regression.
Second, for each dataset, we correlated changes in explained variances (with and without GSR) and changes in prediction accuracies (with and without GSR) across behavioral measures. In other words, we wanted to explore whether a behavioral measure with a larger improvement in RSFC-explained variance after GSR (using the variance component model) will also enjoy greater improvement in prediction accuracy after GSR.

Relationship with baseline associations and prediction accuracies
To explore which behavioral measures benefited most from GSR, we hypothesized that GSR-related changes might be related to baseline RSFC-behavioral associations or prediction accuracies. For example, we could imagine that behavioral measures that were more poorly predicted using the baseline preprocessing strategy might benefit more from GSR and vice versa. Thus, for each dataset, we correlated changes in explained variances (with and without GSR) with the explained variances of the baseline preprocessing pipeline across behavioral measures. The analysis was repeated with prediction accuracies.

Task-performance and self-reported measures
To further explore which behavioral measures benefited most from GSR, a second hypothesis was that GSR-related improvements might be related to whether a behavioral measure involved a cognitive task or was a self-reported measure. Thus, the 58 HCP behavioral measures were grouped into 27 "task-performance" measures, 24 "self-reported" measures and 7 unclassified measures. The task-performance measures were defined as the tasks that assess cognitive performance satisfying two criteria. First, there were some cognitive processes involved when the participants performed the task. Second, the task was performance-based, i.e. the behavioral scores reflected whether the participants performed the task well or not. On the other hand, self-reported measures were those in which participants provided answers to the surveys themselves. The remaining measures were considered as unclassified. For example, the delay discounting scores were collected based on a task, but a subject with higher score does not mean "better" performance, thus delay discounting was considered unclassified. Similarly, the odor identification measure was performance-based, but there was little cognitive processing in odor identification, so the measure was also considered unclassified. For the HCP dataset, we then explored whether GSR-related improvements (in explained variance and prediction accuracies) differed between taskperformance and self-reported measures. In the case of the GSP dataset, there were too few task-performance measures to perform a similar analysis.

Further motion analyses
In this section, we discuss further motion controls and analyses. First, in the original analyses (Section 2.2), FD and DVARS were included as nuisance covariates in the variance component model. We explored the effects of not including FD and DVARS as covariates.
One hypothesis is that not including FD and DVARS as covariates might increase the explained variance since some behavioral measures are correlated with motion (Siegel et al., 2016), which might in turn correlate with residual motion artifacts in the rs-fMRI data.
Second, we computed correlations between all behavioral measures and motion (FD or DVARS). We then compared the magnitude of the correlations with the explained behavioral variance (variance component model) and the behavioral prediction accuracies (kernel ridge regression), in order to show that the explained variance and prediction accuracies cannot be entirely explained by motion.

Data and code availability
Code for this work is freely available at the GitHub repository maintained by the

Overview
In Section 3.2, we replicated results in the literature, showing that GSR introduces negative correlations, while reducing imaging artifacts. This is followed by variance component results and behavioral prediction results in Sections 3.3 and 3.4 respectively. In Section 3.5, we explored consistencies and heterogeneities across approaches and behavioral measures. Finally, motion-related effects were investigated in Section 3.6. Figure 1A shows the 400-area cortical parcellation (Schaefer et al., 2017). The colors corresponded to the 17 large-scale networks from Yeo et al. (2011). Figure 1B shows the 19

Consistent with the literature, GSR reduces RSFC strength
subcortical ROIs (Fischl et al., 2002). Figures 1C and 1D show the functional connectivity among the 419 ROIs averaged across participants with and without GSR in the GSP dataset.
The results for the HCP dataset are shown in Figure S1.
In both GSP and HCP datasets, the RSFC patterns were highly similar with and without GSR. Correlations between the 419 x 419 RSFC matrices (excluding diagonals and repeated entries) with and without GSR were 0.95 in the GSP dataset and 0.82 in the HCP dataset. However, consistent with previous work (Murphy et al., 2009;Fox et al., 2009;Weissenbacher et al., 2009), GSR reduces functional connectivity across almost all ROI pairs.
However, also consistent with previous work (Murphy et al., 2009;Gotts et al., 2013), this shift in functional connectivity was not uniform across all edges and participants. In the GSP dataset, the RSFC shift (averaged across participants) ranged from -0.58 to 0 across ROI pairs, while the RSFC shift (averaged across ROI pairs) ranged from -0.63 to -0.08 across participants. On the other hand, in the HCP dataset, the RSFC shift (averaged across participants) ranged from -0.40 to 0 across ROI pairs, while the RSFC shift (averaged across ROI pairs) ranged from -0.43 to -0.04 across participants. Consistent with previous literature (Power et al., 2014;Byrge et al., 2017;Power et al., 2018), without GSR, there were multiple bands of signal increase or decrease across most brain voxels that extended for many time points. A particularly salient global decrease in signal intensity appeared right after a peak in FD, which was mostly attenuated by GSR (black arrow in Figure 2A). Similar observations could be made for a mid-motion HCP participant ( Figure S2A).  Furthermore, replicating previous work, GSR exacerbated the distance-dependent biases in QC-FC correlations. In the case of the GSP dataset ( Figure 2B), GSR introduces a negative bias to distance-dependent QC-FC correlations with slope decreasing from -0.0001 to -0.0003, consistent with previous work (Power et al., 2014;Burgess et al., 2016;Ciric et al., 2017). However, in the HCP dataset ( Figure S2B), GSR introduces a positive bias to distance-dependent QC-FC correlations with slope increasing from 0.0001 to 0.0002.  Figure 3B shows the RSFC-explained variance for age and sex. Figure 3C shows the RSFC-explained variance for 9 behavioral measures in the GSP dataset. Figure S3 shows the RSFC-explained variance for the remaining 14 behavioral measures. Like Figure Figure 4A shows behavioral variance explained by RSFC averaged across 58 behavioral measures in the HCP dataset. Similar to before, the blue and green boxplots corresponded to HCP-Baseline+GSR and HCP-Baseline respectively. The non-overlapping

Behavioral-RSFC association in the HCP dataset
IQRs suggest a robust increase in explained behavioral variance with GSR. Compared to baseline, the mean explained behavioral variance increased by 40.4% with GSR. Figure 4B shows the RSFC-explained variance for age and sex. Figures 4C, S4, and S5 show the RSFC-explained behavioral variance for individual behavioral measures. Figure   4C contains 13 HCP cognitive measures. Figures S4 and S5 contain the remaining 45 behavioral measures, including alertness, motor, sensory, personality, emotion, and other inscanner measures. Like before, the blue and green boxplots correspond to the HCP-Baseline+GSR and HCP-Baseline preprocessing pipelines respectively. Differences between the two pipelines are illustrated by the red boxplots.
For 40 of the 60 measures, the red "boxes" were entirely above zero, implying that for the 40 measures, GSR increases the explained variance in at least 75% of the jackknife samples. For 47 measures, GSR increased the explained behavioral variance, although the red "boxes" were not entirely above zero (e.g., Walking Memory (N-back)). Conversely, there were 5 behavioral measures for which the baseline processing pipeline (non-GSR) exhibited greater explained variance in at least 75% of the jackknife samples. Recog. -Sad, Anger -Hostility, Fear -Somatic Arousal, and Self -Efficacy). Thus, the overall increase in behavior-RSFC association ( Figure 4A) was not driven by specific behavioral measures, but was the result of robust improvements across most of the behavioral measures examined.
It is worth noting that the behavioral variance explained by RSFC was higher in the HCP dataset compared with the GSP dataset. One reason could be the significantly greater amount of data per participant in the HCP dataset (1 hour) compared with the GSP dataset (at most 12 minutes). Figure 5 shows the RSFC-based prediction accuracies with and without GSR in the GSP dataset. Figure 5A shows the mean prediction accuracy averaged across 23 behavioral measures. Like previous figures, the blue and green boxplots correspond to GSP-Baseline+GSR and GSP-Baseline preprocessing pipelines respectively with the bars representing the IQR across random cross-validation splits. The non-overlapping IQRs suggest a robust increase in behavioral prediction accuracies with GSR. The mean improvement was 64.3%.

RSFC prediction of behavior in the GSP dataset
Figures 5B shows the prediction accuracies for age and sex. Figure 5C shows the prediction accuracies for 9 behavioral measures in the GSP dataset. Figure S6 shows the prediction accuracies for the remaining 14 behavioral measures. Like before, the blue and green boxplots correspond to the GSP-Baseline+GSR and GSP-Baseline preprocessing pipelines respectively. Differences between the two pipelines are illustrated by the red boxplots.  Figure 6 shows the RSFC-based prediction accuracies with and without GSR in the HCP dataset. Figure 6A shows the mean prediction accuracy averaged across 58 behavioral measures. Like previous figures, the blue and green boxplots correspond to HCP-Baseline+GSR and HCP-Baseline preprocessing pipelines respectively. The non-overlapping

RSFC prediction of behavior in the HCP dataset
IQRs suggest a robust increase in behavioral prediction accuracies with GSR. The mean improvement was 11.6% using kernel regression. Figure 6. GSR improves RSFC-based behavioral prediction accuracies in the HCP dataset using kernel ridge regression. (A). Test accuracies averaged across all 58 behaviors for two preprocessing pipelines: HCP-Baseline+GSR (blue) and HCP-Baseline (green). The "boxes" show the median and interquartile range (IQR) of test accuracies across 20 random cross-validation splits. The whisker length is 1.5 IQR. Black circles indicate mean. Outliers are shown by grey dots. (B). Test accuracies for age and sex. (C). Test accuracies for 13 cognitive measures. For each behavioral measure, the accuracies of the HCP-Baseline+GSR pipeline, HCP-Baseline pipeline, and difference between the two pipelines are shown in blue, green and red respectively.
Figures 6B shows the prediction accuracies for age and sex. Figure 6C shows the prediction accuracies for 13 HCP cognitive measures. Figures S7 and S8 show the prediction accuracies for the remaining 45 behavioral measures. Like before, the blue and green boxplots correspond to the HCP-Baseline+GSR and HCP-Baseline preprocessing pipelines respectively. Differences between the two pipelines are illustrated by the red boxplots.
The red boxes were entirely above zero for 33 of the 60 demographic and behavioral measures ( Figures 6B, 6C, S7, and S8), implying that for the 33 measures, GSR increased the prediction accuracy in at least 75% of the random cross-validation splits. Conversely, there were 14 behavioral measures for which the baseline preprocessing pipeline (non-GSR) exhibited greater prediction accuracies in at least 75% of the random data splits. Thus, the overall increase in behavioral prediction accuracy ( Figure 6A) was not driven by specific measures, but was the result of robust improvements across many of the behavioral measures examined.

Consistencies between RSFC-behavioral associations and RSFC-behavioral prediction accuracies
To investigate whether RSFC-behavioral associations and RSFC-behavioral prediction accuracies are consistent, Figure 7 shows

Relationship with baseline associations and prediction accuracies
In the case of the variance component model, GSR-related improvements in RSFCexplained variance were weakly correlated with the RSFC-explained variance of the baseline preprocessing pipelines for both the GSP (r = 0.23) and HCP (r = 0.24) datasets.
In the HCP dataset, GSR-related improvements in behavioral prediction accuracies were also weakly correlated with the prediction accuracies of the baseline preprocessing pipeline (r = 0.10). On the other hand, in the GSP dataset, GSR-related improvements in behavioral prediction accuracies were moderately negatively correlated with the prediction accuracies of the baseline preprocessing pipeline (r = -0.51). In other words, if a behavioral measurement was poorly predicted with the baseline preprocessing pipeline, then the GSRrelated improvement in behavioral prediction would be greater.

GSR preferentially improves RSFC-explained behavioral variance and behavioral
prediction accuracies for task performance measures Figure 9A shows the mean GSR-related improvements in RSFC-explained variance averaged across 27 HCP task-performance measures (green) and 24 self-reported measures (red). On average, task-performance measures enjoyed greater GSR-related improvement (7.9%) compared with self-reported measures (4.4%). Figure 9B shows the GSR-related improvements in RSFC-explained variance for individual behavioral measures sorted by the magnitude of improvement. Figure S9A shows the mean improvement in GSR-related behavioral prediction accuracies averaged across task performance measures (green) and self-reported measures (red). On average, task-performance measures enjoyed greater GSR-related improvement (0.016) compared with self-reported measures (0.0084). Figure S9B shows the GSR-related improvements in behavioral prediction accuracies for individual behavioral measures sorted by the magnitude of improvement.

GSR improves RSFC-explained variance even without including FD & DVARS as
covariates in the variance component model Figure 10A shows the behavioral variance explained by RSFC averaged across 23 behavioral measures in the GSP dataset for both processing pipelines, with and without including FD and DVARS as nuisance covariates in the variance component model.
Interestingly, the variance explained were unchanged regardless of whether including FD and DVARS were included as covariates. Figure 10B shows the behavioral variance explained by RSFC averaged across 58 behavioral measures in the HCP dataset for both processing pipelines, with and without including FD and DVARS as nuisance covariates in the variance component model. For both pipelines, the explained variance was higher when FD and DVARS were not included as nuisance covariates. Nevertheless, the explained variance was still higher for the baseline+GSR pipeline than the baseline pipeline.

RSFC-explained behavioral variance cannot be explained by motion
Tables S3 and S4 show the correlations of behavioral measures with FD (and DVARS) in the GSP and HCP datasets respectively. Figure 11A shows that the mean behavioral variances explained by RSFC in the GSP dataset for both processing pipelines were much greater than the absolute correlation between the behavioral measures and FD (expressed in terms of percentage variance), as well as the absolute correlation between the behavioral measures and DVARS (expressed in terms of percentage variance). The same was true for the HCP dataset ( Figure 11B). Therefore, RSFC-explained behavioral variance for both baseline and baseline+GSR processing pipelines cannot be simply due to motion. Figure 12A shows that the mean RSFC-based behavioral prediction accuracy in the GSP dataset for the baseline+GSR processing pipeline was greater than the mean absolute correlation between the behavioral measures and FD, as well as the mean absolute correlation between the behavioral measures and DVARS. However, in the case of the baseline processing pipeline, the absolute correlation between the behavioral measures and DVARS was bigger than the prediction accuracy.

RSFC-based behavioral prediction accuracies cannot be explained by motion
On the other hand, Figure 12B shows that the average RSFC-based behavioral prediction accuracies in the HCP dataset for both processing pipelines were greater than the absolute correlation between the measures and FD, as well as the absolute correlation between the measures and DVARS.
Overall, RSFC-based behavioral prediction accuracies for the baseline+GSR preprocessing pipeline cannot be simply due to motion in both GSP and HCP datasets. In the case of the baseline preprocessing pipeline, this was only true for the HCP dataset.

Discussion
Using the variance component model, we showed that GSR strengthened the association between behavior and RSFC in young healthy adults from two large-scale Therefore, across both methods and datasets, we found that GSR improved behavioral associations and prediction accuracies. Furthermore, both approaches were consistent in the sense that a behavioral measure well explained by RSFC using the variance component model was also better predicted by kernel ridge regression (Figure 7). In addition, a behavioral measure with a larger improvement in RSFC-explained variance after GSR also tended to enjoy greater improvement in prediction accuracy after GSR (Figure 8). Further analysis suggests that task performance measures appeared to benefit more from GSR than self-reported measures (Figures 9 and S9).

Interpretations and implications
We note that our results could have been in the opposite direction. After all, it is well known that the global signal contains neural information (Schölvinck et al., 2010;Matsui et al., 2016), so regressing the global signal might potentially reduce the associations between RSFC and behavioral traits. On the other hand, GSR is known to reduce imaging artifacts (Satterthwaite et al., 2013;Power et al., 2014;Burgess et al., 2016;Ciric et al., 2016;Parkes et al., 2018), so using the cleaner data might strengthen RSFC-behavior association. Yet another consideration is that because certain behavioral traits (e.g., reading pronunciation) are correlated with motion (Siegel et al., 2017), more effective removal of motion-related artifacts via GSR might weaken the RSFC-behavior association (which might be desirable).
Given that our experiments suggest that GSR strengthens RSFC-behavior association (on average), one possibility is that the neural component within the global signal might not be associated with the examined behavioral traits. For example, the neural component might reflect transient neural activity, which might not relate to stable behavioral traits measured outside the scanner. A second possibility is that having cleaner data (from removing the noise component contained in the global signal) outweighs any negative effect from losing the neural component embedded in the global signal, so that overall, GSR improves behavior-RSFC association. Dissociating these two possibilities might not be easy, but a recent study suggested that GSR reduced task-relevant information in task-fMRI, and instead advocated the use of temporal ICA to remove global artifacts, while retaining neural information (Glasser et al., 2018). Thus, we hope to evaluate the effects of temporal ICA on behavior-RSFC association in the future.
Given that GSR is more effective at removing global motion and respiratory artifacts than baseline processing (Figures 2; S2; Burgess et al., 2016;Ciric et al., 2017), the improved explained behavioral variance and prediction accuracies (with GSR) were not confounded by imaging-related artifacts. If the reverse were true (i.e., non-GSR explained greater behavioral variance and improved behavioral prediction accuracies compared with GSR), the interpretation would be considerably trickier because it would be hard to rule out the possibility that the greater explained behavioral variance might be due to imaging artifacts correlated with behavioral measurements (Siegel et al., 2017).
The variance component model and the kernel ridge regression were set up, so that inter-subject similarity was defined based on the correlation of whole-brain RSFC matrices.
This choice was motivated by seminal work on functional connectivity fingerprints (Finn et al., 2015). Work from our lab also suggests that this particular choice of inter-subject similarity yields competitive behavioral prediction accuracy (He et al., 2018). We note that our experiments with linear ridge regression (not shown) yielded similar conclusions to the variance component model and kernel ridge regression, suggesting that the benefits of GSR were not limited to this particular choice of similarity metric.

Limitations, methodological considerations and future work
Because our study is focused on young healthy adults, our conclusions might not hold for other cohorts. Indeed, some studies have suggested that GSR weakens differences between healthy controls and diseased populations, such as autism spectrum disorder  and schizophrenia (Yang et al., 2014). On the other hand, Parkes and colleagues showed that group differences between healthy controls and schizophrenia were only seen with GSR (Parkes et al., 2018). However, given that GSR is known to more effectively remove global motion and respiratory artifacts compared with other de-noising approaches (Ciric et al., 2017;Power et al., 2017b), it is not easy to rule out that any group differences without GSR were not due to imaging artifacts. Nevertheless, given the availability of largescale datasets of diseased and lifespan cohorts (Milham et al., 2012;Di Martino et al., 2014;Poldrack et al., 2016), we hope to extend the current study to lifespan and diseased datasets.
Previous work on rs-fMRI behavioral prediction has suggested that whole-brain RSFC would more effectively explain behavioral variance than a more restricted set of networks (Finn et al., 2015;Kong et al., 2018), so our analyses were limited to whole brain RSFC.
However, a drawback of defining inter-subject similarity based on the whole-brain RSFC matrices was that we were unable to distinguish whether certain connections (e.g., within specific networks or between certain networks) were more important for explaining behavioral variance. Furthermore, it remains unclear whether GSR improves RSFCbehavioral associations for specific connections or networks. Therefore, future work might involve extending this study to investigate the effect of GSR on RSFC-behavior association for specific networks (Nostro et al., 2018) or at the voxel level (Shehzad et al., 2014;Gong et al, 2018).
Despite the effectiveness of GSR in removing imaging artifacts (Ciric et al., 2017), there were likely residual artifacts. For example, when FD and DVARS were not included as nuisance covariates in the HCP dataset, the explained behavioral variances increased by about the same amount for both baseline and baseline+GSP pipelines ( Figure 10B), suggesting the presence of residual artifacts for both pipelines. Interestingly, excluding FD and DVARS as nuisance covariates has almost no impact on explained behavioral variances in the GSP dataset ( Figure 10A). One possible explanation is that the behavioral measures were more correlated with motion in the HCP dataset. Indeed, if we only considered 29 of the 58 HCP behavioral measures that were the least associated with FD, then including or excluding FD and DVARS as nuisance covariates has almost no impact on explained behavioral variances in the HCP dataset. Nevertheless, the good news is that in both GSP and HCP datasets, both RSFC-explained behavioral variance and RSFC-based prediction accuracies after GSR were much greater than the magnitude of correlations between motion (FD/DVARS) and behavior (Figures 11 and 12), suggesting that motion cannot fully account for the explained behavioral variance and prediction accuracies.
Finally, it is worth noting that while the preprocessing pipelines considered in this study are representative of most published papers, the exact processing choices are variable across the literature, so we have taken the pragmatic approach of being consistent with our previous work (Yeo et al., 2011;Yeo et al., 2015a;He et al., 2018;Kong et al., 2018). For example, to generate the nuisance white matter signal, the white matter mask was defined anatomically and eroded three times. However, some have suggested that further erosion might be necessary to avoid the white matter signal being a hidden proxy for global signal (Power et al., 2017b). Indeed, in the GSP dataset, the white matter signal was quite correlated with the global signal with r = 0.77 ± 0.14 (mean ± std) across participants. Similarly, there is no consensus on selecting the DVARS threshold. We opted to select the DVARS threshold such that the number of censored frames due to DVARS was roughly the same as FD, which is the approach we considered in our previous studies (Kong et al., 2018), but might not be optimal (Afyouni and Nichols, 2018). Overall, there remains further room for exploring how the exact set of preprocessing hyperparameters might affect the final RSFC-behavior associations and behavioral prediction accuracies.

Conclusion
By applying the variance component model, we showed that GSR greatly increased the associations between behavior and RSFC in two large datasets of young, healthy adults.
The results were replicated using kernel ridge regression and linear ridge regression (not shown). Given that GSR significantly reduced imaging-related artifacts, GSR-related improvements could not be simply explained by imaging artifacts being correlated with the behavioral measurements. The benefits of GSR on behavior-RSFC associations and prediction accuracies persisted even after ICA-FIX.