A reproducible benchmark of resting-state fMRI denoising strategies using fMRIPrep and Nilearn

Reducing contributions from non-neuronal sources is a crucial step in functional magnetic resonance imaging (fMRI) analyses. Many viable strategies for denoising fMRI are used in the literature, and practitioners rely on denoising benchmarks for guidance in selecting an appropriate strategy for their study. However, fMRI denoising software is ever evolving, and benchmarks can quickly become obsolete as techniques or implementations change. In this work, we present a fully reproducible denoising benchmark featuring a range of denoising strategies and evaluation metrics, built primarily on the fMRIPrep and Nilearn software packages. We apply this reproducible benchmark to investigate the robustness of the conclusions across two different datasets and two versions of fMRIPrep. The majority of benchmark results were consistent with prior literature. Scrubbing, a technique which excludes time points with excessive motion, combined with global signal regression, is generally effective at noise removal. Scrubbing, however, disrupts the continuous sampling of brain images and is incompatible with some statistical analyses, e.g. auto-regressive modeling. In this case, a simple strategy using motion parameters, average activity in select brain compartments, and global signal regression should be preferred. Importantly, we found that certain denoising strategies behave inconsistently across datasets and/or versions of fMRIPrep, or had a different behavior than in previously published benchmarks, especially ICA-AROMA. These results demonstrate that a reproducible denoising benchmark can effectively assess the robustness of conclusions across multiple datasets and software versions. Technologies such as BIDS-Apps, Jupyter Book, and NeuroLibre provided the infrastructure to publish the metadata and report figures. Readers can reproduce the figures of this report, and explore results beyond the ones presented in the published manuscript.
With the denoising benchmark, we hope to provide useful guidelines for the community, and that our software infrastructure will facilitate continued development as the state-of-the-art advances.


Introduction
Resting-state functional magnetic resonance imaging (fMRI) is a tool for studying human brain connectivity (Biswal et al., 2010;Fox & Greicius, 2010) which comes with many analytical challenges (Cole et al., 2010;Satterthwaite et al., 2012). One such key challenge is the effective correction of non-neuronal sources of fluctuations (called confounds), known as denoising, which is important to reduce bias when studying the association between connectomes and behavioral measures of interest (Chyzhyk et al., 2022). A wide range of denoising strategies have been proposed in the literature, with no approach emerging as a clear single best solution. Denoising benchmarks (Ciric et al., 2017;Parkes et al., 2018) have thus become an important resource for the community to understand which denoising strategy is most appropriate in a given study. Denoising benchmarks are however at a constant risk of becoming obsolete, with new strategies being regularly developed or revised, as well as an ever-expanding scope of populations being enrolled in research studies. The main objective of the present work is to develop a fully reproducible fMRI denoising benchmark that enables testing the robustness of conclusions across multiple software versions and evaluation datasets.
Reproducible and robust results have become a recurring interest in the neuroimaging community (Botvinik-Nezer et al., 2020;Niso et al., 2022). The popular package fMRIPrep (Esteban et al., 2019) is a prominent solution for fMRI preprocessing designed with reproducibility in mind, and we decided to build upon that software for our benchmark. However, fMRIPrep only performs minimal preprocessing while generating a broad list of potential confounds, intentionally leaving the selection of the exact denoising strategy to end-users. The connectivity metrics are also not included as part of fMRIPrep outputs, and users rely on additional software to apply denoising to time series and generate connectivity measures. One popular open-source Python library for this purpose is Nilearn (Abraham et al., 2014). Yet, until recently, there was no straightforward way to incorporate fMRIPrep outputs into Nilearn in order to reproduce the most common denoising strategies. This lack of integration represented a major barrier to the exploration of denoising tools, both for cognitive neuroscientists who were required to develop custom code, and for computer scientists who had to develop a detailed understanding of the inner workings of denoising strategies and fMRIPrep.
The main references for denoising benchmarks (Ciric et al., 2017;Parkes et al., 2018) did not use the then-novel fMRIPrep. Whether the results of these benchmarks remain consistent with fMRIPrep outputs is an open question. Different fMRI preprocessing software packages provide largely similar results, but noticeable differences are still present (Bowring et al., 2019;Li et al., 2021). Other computational factors can possibly impact the conclusions of a benchmark, such as the versions of software and operating system (Gronenschild et al., 2012). Recent research has also demonstrated that, given one fMRI dataset and similar research goals, different researchers will select a wide variety of possible analytical paths (Botvinik-Nezer et al., 2020). The lack of standard integration between fMRIPrep and Nilearn could lead to differences (and errors) in the implementation of the same denoising strategies by researchers, which can in turn lead to marked differences in the impact of denoising methods.
In this work, we propose to address the issues of robustness and reproducibility of denoising benchmarks by building a fully reproducible solution. This reproducible benchmark will allow the research community to consolidate past knowledge on technical advances, examine computation instability across different software versions, and provide guidance for practitioners. In order to create this benchmark, we implemented a series of specific objectives: • First, we developed a standardized application programming interface (API) to extract nuisance regressors from fMRIPrep. The robust API, which was added to Nilearn release 0.9.0, can be used to flexibly retrieve a subset of fMRIPrep confounds for denoising and precisely replicate nuisance regressors based on denoising strategies proposed in the literature. • Our second objective was to implement a denoising benchmark to provide recommendations on the choice of denoising strategies for fMRIPrep users. We used easily fetchable open access data, specifically two datasets on OpenNeuro  with diverse participant profiles: ds000228 (Richardson et al., 2019) and ds000030 (Bilder et al., 2020). ds000228 contains adult and child samples, and ds000030 includes psychiatric conditions. The benchmark systematically evaluates the impact of denoising choices using a series of metrics based on past research (Ciric et al., 2017;Parkes et al., 2018). • Our third objective was to turn this benchmark into a fully reproducible and interactive research object. We combined a series of technologies, including software containers (Gorgolewski et al., 2017), the Jupyter Book project (Granger & Perez, 2021), and the NeuroLibre preprint service (Karakuzu et al., 2022) in order to create the first fully reproducible benchmark of denoising strategies for fMRI resting-state connectivity. 
• Our fourth and last objective was to demonstrate that our approach can be used to evaluate the robustness of the benchmark, by identifying possible differences across multiple versions of fMRIPrep.

Software implementation
We designed two APIs for users to perform denoising of fMRI time series using Nilearn, based on fMRIPrep outputs. The APIs are maintainable, i.e., composed of modular and well-tested code, and user-friendly, i.e., the syntax is standard and robust to errors. The confounds are loaded by the APIs in a format compatible with downstream Nilearn analysis functions.

load_confounds: basic noise components

The following Python code snippet demonstrates the basic usage of load_confounds.
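A minimal sketch of this usage, assuming Nilearn ≥ 0.9.0; the image path is a hypothetical placeholder, and the matching fMRIPrep confounds file is expected to sit alongside it:

```python
from nilearn.interfaces.fmriprep import load_confounds

# Path to an fMRIPrep-preprocessed functional image (hypothetical placeholder);
# the matching *_desc-confounds_timeseries.tsv is located automatically.
func_img = "sub-01_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz"

confounds, sample_mask = load_confounds(
    func_img,
    strategy=["motion", "high_pass", "wm_csf"],  # categories of noise components
    motion="basic",   # six basic motion parameters (further tunable)
    wm_csf="basic",   # mean white-matter and CSF signals (further tunable)
)
# `confounds` is a table ready for Nilearn maskers; `sample_mask` indexes
# the volumes to keep (None when no volume is censored).
```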
• strategy: A list defining the categories of confound variables to use. Amongst the three in this example, motion and wm_csf are further tunable.
• motion and wm_csf: additional parameters, each with four options (the basic parameters only, their squares, their temporal derivatives, or the full expansion including both).

load_confounds_strategy: pre-defined strategies

The following code snippet demonstrates the basic usage of load_confounds_strategy. This snippet retrieves the same confound variables as described in the example for load_confounds.
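A minimal sketch, again assuming Nilearn ≥ 0.9.0 and a hypothetical image path:

```python
from nilearn.interfaces.fmriprep import load_confounds_strategy

# Same kind of fMRIPrep-preprocessed image path as in the load_confounds
# example (hypothetical placeholder).
func_img = "sub-01_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz"

# The "simple" pre-defined strategy retrieves motion parameters, high-pass
# filtering terms, and tissue (white matter / CSF) signals in one call.
confounds, sample_mask = load_confounds_strategy(
    func_img,
    denoise_strategy="simple",
)
```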
• denoise_strategy: The name of a predefined strategy (see Table 1).
load_confounds_strategy provides an interface to select a complete set of curated confounds reproducing a common strategy used in the literature, with limited parameters for user customisation. There are four possible strategies that can be implemented from fMRIPrep confounds:
• simple: motion parameters and tissue signal
• scrubbing: volume censoring, motion parameters, and tissue signal
• compcor (Behzadi et al., 2007): anatomical CompCor and motion parameters
• ica_aroma (Pruim, Mennes, van Rooij, et al., 2015): ICA-AROMA based denoising and tissue signal
All strategies, except compcor, provide an option to add global signal to the confound regressors. The predefined strategies and associated noise components are listed in Table 1. Parameters that can be customized are indicated with a *. See the Nilearn documentation 2 for more details. See Annex B for a more in-depth review of common denoising strategies in the literature and Annex C for a summary of evaluation benchmarks using these strategies.

Denoising workflow
The denoising workflow is implemented through Nilearn. Figure 1 presents a graphical summary of the workflow. An fMRI dataset in the Brain Imaging Data Structure (BIDS) standard was first passed to fMRIPrep. Brain parcellation atlases were retrieved through the TemplateFlow (Ciric et al., 2022) Python client (see https://www.templateflow.org/usage/client/). In cases where an atlas was absent from TemplateFlow, it was converted to the TemplateFlow naming convention to enable use of the Python client. Each atlas was passed to a NiftiLabelsMasker or NiftiMapsMasker for time series extraction. fMRIPrep outputs were input to a Nilearn-based connectome-generating workflow using load_confounds_strategy. The filtered confounds and the corresponding preprocessed NIfTI images were then passed to the Nilearn masker generated with the atlas. The time series and connectomes were saved as the main outputs for further analysis. The Python-based workflow describes the basic procedure to generate functional connectomes from fMRIPrep outputs with a Nilearn data loading routine (e.g., NiftiMapsMasker or NiftiLabelsMasker), the fMRIPrep confounds retrieval function (e.g., load_confounds_strategy), and a connectome generation routine (ConnectivityMeasure). The path to the preprocessed image data is passed to load_confounds_strategy, and the function fetches the associated confounds from the .tsv file. The path of an atlas and the path of the preprocessed image file are then passed to the masker, along with the confounds, for time series extraction. The time series are then passed to ConnectivityMeasure for generating connectomes.
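The three steps above can be sketched in a few lines of Python. The atlas and image paths below are hypothetical placeholders, and the calls assume Nilearn ≥ 0.9.0:

```python
from nilearn.maskers import NiftiLabelsMasker
from nilearn.connectome import ConnectivityMeasure
from nilearn.interfaces.fmriprep import load_confounds_strategy

# Hypothetical paths: an atlas in TemplateFlow naming convention and an
# fMRIPrep-preprocessed BOLD image.
atlas_img = "tpl-MNI152NLin2009cAsym_res-02_atlas-Schaefer2018_desc-100Parcels7Networks_dseg.nii.gz"
func_img = "sub-01_task-rest_space-MNI152NLin2009cAsym_desc-preproc_bold.nii.gz"

# 1. Retrieve confounds implementing a pre-defined denoising strategy.
confounds, sample_mask = load_confounds_strategy(func_img, denoise_strategy="simple")

# 2. Extract denoised, parcel-wise time series with a masker built on the atlas.
masker = NiftiLabelsMasker(labels_img=atlas_img, standardize=True)
time_series = masker.fit_transform(func_img, confounds=confounds, sample_mask=sample_mask)

# 3. Compute the functional connectome from the time series.
connectome = ConnectivityMeasure(kind="correlation").fit_transform([time_series])[0]
```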

Benchmark workflow
OpenNeuro datasets were retrieved through DataLad and fMRIPrep images were pulled from DockerHub. SLURM job submission scripts to process the fMRI data were generated with the Python tool fMRIPrep-SLURM (https://github.com/SIMEXP/fmriprep-slurm). The fMRIPrep derivatives and atlases retrieved from the TemplateFlow archive were passed to the connectome workflow described in Figure 1. We extracted the signals using a range of atlases at various resolutions (see Materials and Methods for details). For each parcellation scheme and each fMRI dataset, 11 sets of time series were generated, including one baseline and 10 different denoising strategies (see Table 2). We report the quality metrics and break down the effect on each dataset, preprocessed with fMRIPrep 20.2.1 long-term support branch (LTS). Motion characteristics were also generated per dataset and used to exclude fMRI runs with excessive motion from entering the benchmark. Trends in each atlas were similar, so we combined all atlases for the following report. The detailed breakdown by parcellation scheme can be found in the associated Jupyter Book (https://simexp.github.io/fmriprep-denoise-benchmark/). Figure 2 presents a graphical summary of the benchmark workflow.

Benchmark results from fMRIPrep 20.2.1 LTS
We reported the demographic information and the gross mean framewise displacement before and after excluding subjects with high motion. We then aimed to assess the overall similarity between connectomes generated from each denoising strategy, and evaluated the denoising strategies using four metrics from Ciric and colleagues' benchmark (2017):
1. Loss of temporal degrees of freedom: sum of the number of regressors used and the number of volumes censored.
2. Quality control / functional connectivity (QC-FC): partial correlation between mean framewise displacement and functional connectivity, with age and sex as covariates.
3. Distance-dependent effects of motion (DM-FC): correlation between QC-FC and the Euclidean distance separating the centers of mass of each pair of parcels.
4. Network modularity: graph community detection based on the Louvain method, implemented in the Brain Connectivity Toolbox.
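The first metric is a simple count. As an illustrative sketch (with hypothetical toy numbers, mirroring the shape of an fMRIPrep confounds table and a Nilearn sample_mask):

```python
import numpy as np

def loss_of_degrees_of_freedom(confounds, sample_mask, n_volumes):
    """Temporal degrees of freedom spent by a denoising strategy:
    one per confound regressor plus one per censored volume."""
    n_regressors = confounds.shape[1]
    n_censored = n_volumes - len(sample_mask)
    return n_regressors + n_censored

# Toy example: 24 motion-related regressors, 10 of 200 volumes scrubbed.
confounds = np.zeros((200, 24))
sample_mask = np.arange(190)          # indices of the retained volumes
loss = loss_of_degrees_of_freedom(confounds, sample_mask, n_volumes=200)  # 24 + 10 = 34
```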

Significant differences in motion levels existed both between datasets, and within-dataset, across clinical and demographic subgroups
We applied a motion threshold to exclude subjects with marked motion in the two OpenNeuro datasets: dataset ds000228 (N=155) (Richardson et al., 2019) and dataset ds000030 (N=212) (Bilder et al., 2020). Table 3 shows the demographic information of subjects in each dataset after the automatic motion quality control. Following this, we checked the difference in the mean framewise displacement of each sample and the sub-groups (Figure 3). In ds000228, there was still a significant difference between groups after exclusion (t(73)). In summary, children moved more than adults, and subjects with schizophrenia moved more than controls.
Due to the imbalanced samples per group and the low number of subjects in certain groups after the automatic motion quality control, we collapsed all groups within each dataset to avoid speculation on underpowered samples in the results. For a breakdown of each metric by atlas, please see the supplemental Jupyter Book 3.

Figure 3. Mean framewise displacement of each dataset. To evaluate the metrics in a practical analytic scenario, we excluded subjects with high motion while preserving at least 1 minute of data for functional connectivity calculation: gross mean framewise displacement > 0.25 mm, or more than 80.0% of volumes removed when scrubbing with a 0.2 mm threshold. In ds000228, the child group still had higher motion compared to the adult group. In ds000030, where all subjects were adults, the control group showed significant differences in motion only with the schizophrenia group. In both datasets, the sample sizes of the groups were highly imbalanced (see Table 3), hence no between-group differences were assessed in the quality metrics analysis.
Most denoising strategies converged on a consistent average connectome structure

With the benchmark workflow in place, we first aimed to assess the overall similarity between connectomes generated from each denoising strategy. We calculated Pearson's correlations between connectomes generated from all strategies presented in the benchmark (Figure 4). The connectome correlation pattern across denoising strategies was similar in both datasets. Overall, the strategies displayed at least moderate similarity with each other, with Pearson's correlations above 0.6. There were two large clusters of highly-related strategies, driven by the presence (or lack) of global signal regression. Within each cluster of strategies, the correlations amongst the strategies were strong, with values above 0.9. baseline, aroma, and aroma+gsr did not fit well in either of the two clusters, indicating that denoising generally impacts the connectome structure, and that AROMA might be sensitive to different sources of noise, compared to those captured by other strategies in the benchmark.
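The similarity computation can be sketched with plain NumPy: vectorize the unique edges of each connectome and take a Pearson correlation. The matrices below are synthetic, not the benchmark data:

```python
import numpy as np

def connectome_similarity(conn_a, conn_b):
    """Pearson correlation between the unique edges (lower triangle,
    diagonal excluded) of two symmetric connectomes."""
    tri = np.tril_indices(conn_a.shape[0], k=-1)
    return np.corrcoef(conn_a[tri], conn_b[tri])[0, 1]

# Synthetic example: a second connectome that is a noisy copy of the first,
# standing in for two mildly different denoising strategies.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 30))
conn_a = (base + base.T) / 2                            # symmetrize
conn_b = conn_a + rng.normal(scale=0.1, size=conn_a.shape)
similarity = connectome_similarity(conn_a, conn_b)      # close to, but below, 1
```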

Loss in temporal degrees of freedom varied markedly across strategies and datasets
In previous research, the loss of temporal degrees of freedom has been shown to impact subsequent data analysis. A higher loss in temporal degrees of freedom can spuriously increase functional connectivity (Yan et al., 2013). Volume censoring-based and data-driven strategies (ICA-AROMA and some variations of CompCor) introduce variability in the degrees of freedom and can bias group-level comparisons (Ciric et al., 2017).
The loss of temporal degrees of freedom is the sum of the number of regressors used and the number of censored volumes lost. Depending on the length of the scan, the number of discrete cosine-basis regressors can differ given the same repetition time (TR).

Figure 5. The loss of degrees of freedom per strategy varied across the two datasets shown in the benchmark.

The two datasets showed different losses of degrees of freedom in scrubbing-based strategies, while using the same gross motion-based exclusion criteria. This was expected, as the amount of motion between time points was higher in ds000228, a dataset consisting mostly of children. Data-driven denoising strategies (based on AROMA and CompCor), however, did not always have a lower loss of degrees of freedom in ds000030.

Quality control / functional connectivity (QC-FC) showed a heterogeneous impact of denoising strategies based on data profile

Denoising methods should aim to reduce the impact of motion on the data. To quantify the remaining impact of motion in connectomes, we adopted a metric proposed by Power and colleagues (2015) named quality control / functional connectivity (QC-FC). QC-FC is a partial correlation between mean framewise displacement and functional connectivity, with age and sex as covariates. Significance tests associated with the partial correlations were performed, and p-values below the threshold of α = 0.05 were deemed significant. In both datasets, aroma+gsr showed more edges with residual motion than the baseline, which should not be the case for a denoising tool. Scrubbing-based strategies consistently performed better than the baseline in both datasets. In ds000228, the most effective method according to QC-FC was scrubbing.5 (scrubbing at a liberal threshold), followed by scrubbing.2 and simple. All the GSR counterparts of these methods had slightly higher residual motion. Amongst all the data-driven methods, compcor performed the best.
compcor6 and aroma performed close to baseline. In ds000030, the best performing method was compcor, followed by scrubbing.2 (aggressive scrubbing). The simple and scrubbing.5 methods performed similarly, as very few volumes were censored with a liberal threshold, and the GSR variations (simple+gsr and scrubbing.5+gsr) performed better than baseline (see Figure 6). simple performed close to the baseline in terms of the number of edges correlated with motion. The aroma and compcor6 strategies were better than baseline. The average percentage of significant QC-FC correlations and the average median absolute value of QC-FC are presented in Figure 6 and Figure 7. In summary, based on the QC-FC evaluation, the strategies performed quite differently depending on the dataset used for evaluation.
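A sketch of the QC-FC computation, in plain NumPy on synthetic subjects (the per-edge significance test is omitted here): the partial correlation is obtained by residualizing both variables against the covariates and correlating the residuals.

```python
import numpy as np

def partial_corr(x, y, covariates):
    """Partial correlation of x and y controlling for covariates:
    correlate the residuals after regressing out the covariates
    (with an intercept)."""
    design = np.column_stack([np.ones(len(x)), covariates])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

# Synthetic subjects: mean framewise displacement, connectivity edges,
# and age/sex covariates. Edge 0 deliberately carries residual motion.
rng = np.random.default_rng(0)
n_subjects, n_edges = 50, 10
mean_fd = rng.gamma(2.0, 0.1, n_subjects)
fc = rng.normal(size=(n_subjects, n_edges))
fc[:, 0] += 10.0 * mean_fd
covariates = np.column_stack([
    rng.integers(8, 40, n_subjects),   # age
    rng.integers(0, 2, n_subjects),    # sex
])

# QC-FC: one partial correlation per connectome edge.
qcfc = np.array([partial_corr(mean_fd, fc[:, j], covariates) for j in range(n_edges)])
```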

Scrubbing-based strategies decreased distance-dependent effects of motion
The impact of motion on functional connectivity has been reported to be higher for brain parcels closer to each other in space. To determine the residual distance-dependent effects of subject motion on functional connectivity (DM-FC), we calculated a correlation between the Euclidean distance between the centers of mass of each pair of parcels and the corresponding QC-FC correlations. We reported the absolute DM-FC correlation values and expected to see a general trend toward zero correlation after denoising.
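DM-FC reduces to a single correlation over edges. A sketch with synthetic parcel coordinates and toy QC-FC values that build in a distance-dependent motion effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n_parcels = 20
# Synthetic parcel centers of mass, in MNI-like millimeter coordinates.
centroids = rng.uniform(-60, 60, size=(n_parcels, 3))

# Pairwise Euclidean distances; keep the unique edges (lower triangle).
diff = centroids[:, None, :] - centroids[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))
tri = np.tril_indices(n_parcels, k=-1)
edge_distance = distances[tri]

# Toy QC-FC values: short-range edges carry more residual motion.
qcfc = 0.5 * np.exp(-edge_distance / 60.0) + rng.normal(0.0, 0.02, edge_distance.size)

# DM-FC: correlation between inter-parcel distance and QC-FC
# (negative here, as closer parcels have higher QC-FC).
dmfc = np.corrcoef(edge_distance, qcfc)[0, 1]
```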
All strategies performed better than the baseline in both datasets (Figure 8). We observed a trend consistent across both datasets, whereby strategies scrubbing.2 and scrubbing.2+gsr were the most effective in reducing the correlation. In ds000228, simple was the least effective strategy for reducing distance dependency. Data-driven methods showed similar results to each other, with aroma+gsr performing the best.
scrubbing.5 and simple greatly benefited from adding GSR to the regressors. In ds000030, the difference between scrubbing.2 and the other strategies was bigger than in ds000228, with the remaining strategies performing similarly to each other. The impact of GSR was small, with the exception of scrubbing.2+gsr. In summary, we observed similar trends across strategies between the two datasets, yet with differences in the magnitude of correlations. All strategies reduced the correlation below the baseline. Consistent with the literature, scrubbing strategies were the best at reducing distance dependency.

Global signal regression increases network modularity
Confound regressors have the harmful potential to remove real signals of interest as well as motion-related noise. To evaluate this possibility, we examined the impact of denoising strategies on a common graph feature, network modularity, generally regarded as a key feature of biological network organization. Network modularity was quantified using the Louvain method for graph community detection (Rubinov & Sporns, 2010). We computed the partial correlation between subjects' modularity values and mean framewise displacement, using age and sex as covariates, following the implementation of Power and colleagues (2015).
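Louvain community detection itself is provided by the Brain Connectivity Toolbox; as a sketch, the modularity Q that it optimizes can be computed for a given partition with NumPy, here on a toy network with two densely connected modules:

```python
import numpy as np

def modularity(adj, labels):
    """Newman modularity Q of a partition of a weighted, undirected graph:
    observed within-community weight minus the weight expected by chance,
    normalized by total weight."""
    k = adj.sum(axis=1)                     # node strengths
    two_m = adj.sum()                       # total weight, both triangles counted
    expected = np.outer(k, k) / two_m       # chance-level weight per node pair
    same_community = labels[:, None] == labels[None, :]
    return ((adj - expected) * same_community).sum() / two_m

# Toy network: two fully connected triads joined by one weak bridge.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 3] = A[3, 0] = 0.1

q_good = modularity(A, np.array([0, 0, 0, 1, 1, 1]))   # matches the two triads
q_bad = modularity(A, np.array([0, 1, 0, 1, 0, 1]))    # scrambled partition
```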
The inclusion of global signal regressors increased average Louvain network modularity in both datasets (Figure 9, top panel). The remaining strategies performed as follows in both datasets, from best to worst: compcor, scrubbing.2, scrubbing.5, simple, compcor6, and aroma. In both datasets, aroma performed at almost the same level as the baseline. We found mixed results on the ability of denoising to reduce the impact of motion on modularity (Figure 9, lower panels). In ds000228, we see simple and scrubbing.5 reducing the impact of motion. In ds000030, only scrubbing.2 performed better than baseline. In both datasets, the data-driven strategies and strategies with GSR performed consistently worse than baseline. The overall trend across strategies is similar to QC-FC, with the exception of the baseline strategy (see Figures 6 and 7). The reason behind this observation could be a reduction of variance in the Louvain network modularity metric for GSR-based denoising strategies. We plotted the correlations of baseline, scrubbing.2, and scrubbing.2+gsr from one parcellation scheme (DiFuMo 64 components) from ds000030 to demonstrate this lack of variance (see Figure 10).

Data-driven denoising strategies showed inconsistent evaluation outcomes between two fMRIPrep versions

Different versions of the same software could produce differences in the outcomes of our denoising evaluation. To gain insight into the stability of fMRIPrep, we examined whether a few key observations from fMRIPrep 20.2.1 LTS remained salient in fMRIPrep 20.2.5 LTS, specifically:
1. High loss of temporal degrees of freedom for scrubbing.2 in ds000228 and compcor for ds000030.
2. aroma performed close to baseline in QC-FC for ds000228.
3. simple performed close to baseline in QC-FC for ds000030.
The results of QC-FC demonstrated similar overall trends in 20.2.5 LTS, but with aroma performing worse than baseline for ds000228 (observation 2) and simple performing better than baseline for ds000030 (observation 3) (see Figure 11). Inconsistencies in outcomes across the two fMRIPrep versions were found in strategies with data-driven noise components. In version 20.2.5 LTS, and unlike 20.2.1 LTS, compcor6 performed worse than the baseline on the QC-FC metric for both datasets. In ds000228, aroma was the second worst performing strategy. For ds000030, the strategies with no data-driven noise components showed better performance in 20.2.5 LTS (Figure 11) than in 20.2.1 LTS (see Figure 6).

Discussion
We aimed to create a re-executable benchmark to provide guidelines and accessible tools for denoising fMRI data. The re-executable benchmark showed that most denoising strategies, such as scrubbing-based strategies, simple, and strategies with GSR, performed in line with the literature. However, results for one strategy, aroma+gsr, departed from previous literature. The metrics were consistent across software versions, with marked exceptions for the data-driven denoising strategies (compcor, aroma). This result demonstrates the necessity of distributing an executable research object for methods development and software testing, and for providing accurate guidelines to users over time.
The load_confounds and load_confounds_strategy APIs

The standardized APIs load_confounds and load_confounds_strategy are the core elements of the re-executable denoising benchmark. The APIs provide an easy way to implement classic denoising strategies from the literature, and can reduce the effort required, as well as errors, when using these strategies. Having clear and concise code also facilitates re-use and sharing of the denoising strategy used in a particular study, which improves the reproducibility of science.
The new APIs developed for this project have been integrated in an established, popular software library, Nilearn (Abraham et al., 2014). The implementation of these APIs required other contributions to Nilearn and introduced new modules, in order to streamline the compatibility between the APIs and other data processing utilities. Specifically, we introduced a new module nilearn.interfaces dedicated to interacting with other neuroimaging software libraries and BIDS. We refactored the parameter sample_mask in all masker modules to allow volume censoring in the signal.clean function 4. The masker modules implement a series of methods to convert 3D or 4D neuroimaging data into numerical arrays, for example extracting average time series from a brain parcellation. As a result, the outputs from load_confounds and load_confounds_strategy, as well as volume censoring information, can be directly ingested into all Nilearn masker objects. Thanks to these contributions, it is now possible to construct a complete Python-based fMRIPrep post-processing workflow with very concise code. Documentation for this workflow can be found in the Nilearn User Guide 5, and users can adapt code from the Nilearn tutorial to implement denoising strategies with ease.
Similar functionality to the load_confounds and load_confounds_strategy APIs is included in other fMRIPrep-compatible fMRI processing software, such as C-PAC (Li et al., 2021), XCP-D (Adebimpe et al., 2023), and ENIGMA HALFpipe (Waller et al., 2022). Unlike our APIs, which focus on retrieving denoising regressors only, these packages provide denoising utilities bundled in a full preprocessing workflow. The denoising regressor retrieval steps in those packages are therefore not reusable and are more difficult to reproduce. Our APIs provide the advantage that users can easily reuse the denoising strategies. In fact, XCP-D has adopted our APIs in its code base. A limitation of our APIs is that the implemented denoising strategies are limited to those covered by the regressors included in fMRIPrep. With the constant development of denoising strategies, what the APIs provide will always lag behind the advancement of the field. However, as a trade-off, we can ensure the quality and robustness of the implementation.

Denoising strategy
In order to summarize our results, we created a table ranking strategies from best to worst, based on four benchmark metrics, across datasets and fMRIPrep versions (see Figure 12). The baseline strategy consistently performs the worst, as expected, with the notable exception of aroma+gsr performing worst on QC-FC.

Figure 12. Ranking of all denoising strategies. We ranked four metrics from best to worst. Larger circles with brighter color represent a higher ranking. The metric "correlation between network modularity and motion" has been excluded from the summary as it is potentially a poor measure. Loss of temporal degrees of freedom was also excluded, as this metric should not be ranked. However, it is a crucial measure that should be taken into account alongside the rankings.
The simple+gsr strategy is not the best for any particular individual evaluation metric, but it performed consistently well across metrics, datasets and software versions. In terms of loss in degrees of freedom, simple (26 regressors plus the cosine terms) and simple+gsr (27 regressors plus the cosine terms) used slightly more regressors than aroma, and had a markedly smaller loss than the scrubbing methods. simple+gsr is consistently better than the data-driven strategies, which makes it the best choice for analyses that require a low loss of degrees of freedom and also preserve continuously sampled time series (which scrubbing breaks).
Scrubbing-based strategies are the best when it comes to minimizing the impact of motion, at the cost of a higher loss in degrees of freedom. We found that scrubbing with an aggressive 0.2 mm threshold (scrubbing.2) consistently mitigates distance dependency well, regardless of the group of subjects. Despite excluding data with the same standard on both datasets, the child-dominant sample (ds000228) showed more volumes censored with the scrubbing strategy, and a liberal framewise displacement threshold showed sufficient ability to reduce the distance dependency of motion, as observed in the original study of the strategy. In a sample with higher motion, such as ds000228, a liberal scrubbing threshold reduced the impact of motion and performed similarly to a more stringent threshold. Taking the loss of degrees of freedom into consideration, we recommend a liberal scrubbing threshold rather than a stringent one for datasets with marked motion.
For the two anatomical CompCor strategies, compcor performs better than compcor6.
However, compcor introduces large variability into the loss of degrees of freedom. In ds000228, the loss in temporal degrees of freedom is even higher than scrubbing with a stringent threshold. This result is consistent with the observation of Parkes and colleagues (2018) that anatomical CompCor is not sufficient for high-motion data. Moreover, this observation calls into question one of the rationales of the original study, i.e., reducing the loss in degrees of freedom (Behzadi et al., 2007). In the absence of physiological recordings, our benchmark is not suitable for examining another property of CompCor, namely the ability to remove unwanted physiological signals (Behzadi et al., 2007). The datasets do not include the physiological measures needed for alternative strategies such as RETROICOR, which mitigate physiological signals explicitly.
In our results, aroma shows, at best, similar performance to the simple strategy across various metrics. Surprisingly, the addition of GSR worsens the results of ICA-AROMA (aroma+gsr), as determined by QC-FC, consistently across the datasets. Our findings on ICA-AROMA are inconsistent with the past literature. ICA-AROMA has been presented as the best strategy for trading off the ability to remove the impact of motion against the number of temporal degrees of freedom lost, and adding GSR to the regressors was recommended (Ciric et al., 2017;Parkes et al., 2018). There is a possibility that the global signal regressor reintroduced motion to the data. In fMRIPrep, the whole-brain global signal regressor and the estimated head-motion parameters are calculated on the output of the regular pipeline (i.e., before denoising), which is inconsistent with the original proposal (Pruim, Mennes, van Rooij, et al., 2015). We strongly recommend that fMRIPrep users avoid GSR when using the ICA-AROMA strategy; understanding the discrepancy between our results and prior benchmarks should be an area of future work. It is also worth noting that fMRIPrep will drop support for ICA-AROMA from version 23.1.0, due to the lack of maintenance of the upstream software 6. Researchers should consider other denoising strategies for consistency in data analysis.
Strategies including GSR produced connectomes with higher network modularity than their counterparts without GSR. There is no systematic trend as to whether GSR improves denoising in terms of the remaining impact of motion. This result is consistent with the fact that global signal regression increases the number of negative connections in a functional connectome (see the Nilearn examples visualizing connectomes with and without global signal regression 7) by shifting the distribution of correlation coefficients to be approximately zero-centered (Murphy & Fox, 2017). A clear benefit of GSR is thus to emphasize the network structure, but its benefits for denoising can vary. Some strategies, such as simple, seem to benefit greatly from the addition of GSR.

Re-executable research object
We created a re-executable denoising benchmark with two main outcomes. First, we created a reusable code base that ensures the robustness of the procedure. The current benchmark includes several parameters, from the choice of atlas and denoising strategy to the fMRIPrep version and dataset. The code for connectome generation and denoising metric calculation is written as an installable Python library (https://github.com/SIMEXP/fmriprep-denoise-benchmark). Customized scripts to deploy the process for each combination of parameters are also generated by reusable Python functions. The full workflow can be executed on the two benchmark datasets preprocessed by any release from the fMRIPrep LTS series. Full documentation to re-execute the workflow, from fetching datasets to running the analysis, is available as part of the research object 8 . Second, we created an interactive Jupyter Book (Granger & Perez, 2021) hosted on NeuroLibre (Karakuzu et al., 2022) for users to freely browse the results in finer detail. All figures in this report can be rebuilt with the provided Makefile, which handles data download and report generation. Taken together, it is possible to reproduce the results of this manuscript from raw data down to final figures, and to update the entire manuscript on future releases of fMRIPrep, turning this research object into a living publication rather than a snapshot of current software destined for quick deprecation.
There are additional benefits to creating a re-executable denoising benchmark. The code for the current project is a good prototype for different post-processing BIDS-Apps (Gorgolewski et al., 2017): a connectome-generation BIDS-App and a denoising-metric-generation BIDS-App. A BIDS-App eases user adoption under the BIDS convention and can expand the scope of the benchmark from the two datasets shown here to any BIDS-compliant dataset. The process of creating this benchmark also provided valuable first-hand information about runtime and the impact of atlas choice on computational cost, which we did not cover here but which has major practical implications. High-dimensional probabilistic atlases require four times more RAM than discrete segmentation atlases. For metric generation, high-dimensional atlases can have a runtime of up to 24 hours, compared to 1 hour for low-dimensional atlases. There is thus a very concrete "reproducibility" cost that comes with high-resolution and probabilistic atlases. This issue is rarely mentioned in discussions of the reproducibility of science, yet it can be a real obstacle to actual reproduction. Future editions of the workflow will be built with runtime optimization in mind and may improve the code base of upstream projects, such as fMRIPrep.

Evaluate software version changes
Our benchmark results on two versions of the long-term support (LTS) release of fMRIPrep reveal similar trends in the metrics, but also some inconsistencies. Between the two datasets, ds000228 showed more consistent results than ds000030 across the two LTS releases (see Figure 12). The marked difference in ds000030 was likely the result of a bug fix implemented in 20.2.2LTS (see https://github.com/nipreps/fmriprep/issues/2307 and #2444 in the fMRIPrep changelog, https://fmriprep.org/en/stable/changes.html#july-16-2021); ds000030 had been reported as an affected dataset. The data-driven strategies in both datasets showed inconsistent relative differences from the baseline strategy across releases. This work adds to the existing literature on the heterogeneity of results found through research software testing (Bowring et al., 2019; Gronenschild et al., 2012). The inconsistency highlights the importance of testing the numerical stability of research software at each major step of its life cycle.
Rebuilding this paper on future fMRIPrep releases can serve as a stability test of preprocessed results at the dataset level. This benchmark is thus a hybrid contribution, as much a research paper as a software development tool. We nevertheless recommend several improvements for future efforts of this kind. First, the API will need to be kept up to date with fMRIPrep releases. The current code is applicable to the 20.2.x series up to September 2024. For fMRIPrep releases beyond the LTS version, as long as the API in Nilearn is maintained, the code used to generate all current reports can be applied to the same two datasets. Given the high number of tunable parameters (denoising strategies, atlases, software versions), a framework allowing parameter configuration, such as Hydra (https://github.com/facebookresearch/hydra), would help better manage and expand the benchmark. The current benchmark generates jobs through metadata stored in Python dictionaries. By adopting a framework like Hydra, one could deploy the benchmark analysis through a simplified interface.
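As a minimal illustration of the dictionary-driven approach described above, job generation can be sketched as the expansion of a parameter grid. All names below are illustrative, not the benchmark's actual API:

```python
from itertools import product

# Hypothetical benchmark parameter grid; the real code stores similar
# metadata in Python dictionaries to generate one job per combination.
grid = {
    "dataset": ["ds000228", "ds000030"],
    "fmriprep_version": ["20.2.1", "20.2.5"],
    "strategy": ["baseline", "simple", "simple+gsr", "scrubbing.5", "aroma"],
}

def expand_jobs(grid):
    """Yield one {parameter: value} dict per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

jobs = list(expand_jobs(grid))  # 2 datasets x 2 versions x 5 strategies = 20 jobs
```

A framework such as Hydra would replace this hand-rolled expansion with configuration files and command-line overrides, avoiding changes to the code when adding new parameter values.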

Conclusions
This work introduces new software libraries to systematically evaluate the impact of a wide range of denoising strategies across datasets and versions of the fMRIPrep preprocessing pipeline. We used this software infrastructure to implement a fully reproducible benchmark of denoising strategies on two datasets with varied characteristics, including age, motion level, and the presence of clinical diagnoses. We provide two strategy recommendations based on this benchmark, depending on a key consideration: whether preserving continuously sampled time series is needed (e.g., to train auto-regressive models) or not (e.g., to generate correlation coefficients across brain parcels). To preserve the continuous sampling property of time series, simple+gsr is the recommended strategy, especially for datasets with low motion, and it appears to be robust across software versions. If continuous temporal sampling is not a priority, scrubbing.5 was the best strategy for datasets with marked motion, where denoising quality can be favored over the loss of temporal degrees of freedom. aroma+gsr should be avoided, at least as implemented in fMRIPrep, as multiple metrics departed from the conclusions of previous denoising benchmark works. We suggest avoiding data-driven denoising strategies in general if stability of performance across fMRIPrep versions is a concern. The denoising benchmark demonstrated that standardized and reusable tools are key to reliable assessment of denoising strategies across multiple fMRIPrep versions. We hope that our benchmark provides useful guidelines for the community, and that our software infrastructure will enable us to provide timely updates in the years to come.
Our approach demonstrates the importance of assessing the stability of computational performance in methods development and provides an example for future researchers to adopt in similar work.
Dataset ds000030 includes multiple tasks collected from subjects with a variety of neuropsychiatric diagnoses, including ADHD, bipolar disorder, and schizophrenia, as well as healthy controls. The current analysis focused on the resting-state scans only. Scans with an instrumental artifact (flagged under column ghost_NoGhost in participants.tsv) were excluded from the analysis pipeline. Of 272 subjects, 212 entered the preprocessing stage. Demographic information per condition can be found in Table 2. T1w images were collected with the following parameters: TR = 2530 ms, TE = 3.31 ms, Flip Angle = 7°, 1 mm isotropic voxels. BOLD images were collected with the following parameters: TR = 2000 ms, TE = 30 ms, Flip Angle = 90°, 3 x 3 x 4 mm voxels. All images were acquired on a 3T Siemens Trio Tim Scanner.

fMRIPrep was run with the following command-line options:

```
--use-aroma \
--omp-nthreads 1 \
--nprocs 1 \
--random-seed 0 \
--output-spaces MNI152NLin2009cAsym MNI152NLin6Asym \
--output-layout bids \
--notrack \
--skip_bids_validation \
--write-graph \
--resource-monitor
```

For the full description generated by fMRIPrep, please see the supplemental Jupyter Book 12 . We report the primary outcomes using outputs from fMRIPrep 20.2.1LTS, and then investigate whether the same conclusions hold with 20.2.5LTS.

Choice of atlases
We extracted time series with regions of interest (ROI) defined by the following atlases: Gordon atlas (Gordon et al., 2016), Schaefer 7 network atlas (Schaefer et al., 2018), Multiresolution Intrinsic Segmentation Template (MIST) (Urchs et al., 2019) and Dictionary of Functional Modes (DiFuMo) (Dadi et al., 2020). All atlases were resampled to the resolution of the preprocessed functional data.
Since the DiFuMo and MIST atlases can include networks with disjoint regions under the same label, we carried out further ROI extraction. Labels are presented with the original number of parcels, and we denote the number of extracted ROIs in brackets. The Gordon and Schaefer atlas parcels are already isolated ROIs, hence no further extraction was done. The Schaefer 1000-parcel atlas was excluded; its regions were small enough that not all could be consistently resolved after resampling the atlas to the shape of the processed fMRI data.

Participant exclusion based on motion
We performed data quality control to exclude subjects whose excessive motion made their data unusable. In the current report, we use framewise displacement as the metric to quantify motion. Framewise displacement indexes the movement of the head from one volume to the next, combining the translations along the three axes ($x$, $y$, $z$) and the respective rotations ($\alpha$, $\beta$, $\gamma$). Rotational displacements are calculated as the displacement on the surface of a sphere of radius 50 mm. fMRIPrep generates the framewise displacement based on the formula proposed by Power and colleagues (2012). The framewise displacement at time point $t$, denoted $FD_t$, is expressed as:

$$FD_t = |\Delta d_{x,t}| + |\Delta d_{y,t}| + |\Delta d_{z,t}| + |\Delta \alpha_t| + |\Delta \beta_t| + |\Delta \gamma_t|$$

where $\Delta$ denotes the backward difference between volumes $t-1$ and $t$.

To ensure the analysis is performed in a realistic scenario, we excluded subjects with high motion (Parkes et al., 2018) while retaining at least 1 minute of scan for functional connectome construction, using the following exclusion criteria: mean framewise displacement > 0.25 mm, or more than 80.0% of volumes removed when scrubbing with a 0.2 mm threshold.

We evaluated common confound regression strategies that are possible through fMRIPrep-generated confound regressors. The connectome generated from high-pass filtered time series served as a baseline comparison. Confound variables were accessed using the API load_confounds_strategy. The detailed 11 strategies and a full breakdown of the parameters used in these strategies are presented in Table 3.
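The framewise displacement computation and exclusion rule described above can be sketched as follows. This is an illustrative NumPy reimplementation, not the fMRIPrep code, and it assumes motion parameters are ordered as three translations in mm followed by three rotations in radians:

```python
import numpy as np

def framewise_displacement(motion, radius=50.0):
    """Framewise displacement from 6 rigid-body motion parameters.

    motion : array of shape (n_volumes, 6); columns are translations
             (x, y, z, in mm) then rotations (alpha, beta, gamma, in
             radians). Rotations are converted to arc length on a
             sphere of the given radius (50 mm by default).
    """
    params = np.asarray(motion, dtype=float).copy()
    params[:, 3:] *= radius                      # radians -> mm on the sphere
    diffs = np.abs(np.diff(params, axis=0))      # backward differences
    return np.hstack([[0.0], diffs.sum(axis=1)])  # FD of first volume is 0

def exclude_subject(fd, mean_fd_cutoff=0.25, scrub_thresh=0.2, max_removed=0.8):
    """Apply the exclusion criteria described in the text (illustrative)."""
    prop_removed = np.mean(fd > scrub_thresh)
    return bool(fd.mean() > mean_fd_cutoff or prop_removed > max_removed)
```

For example, a 0.2 mm translation on x followed by a 0.004 rad rotation yields FD values of 0.2 mm and 0.4 mm for the affected transitions, and a subject with such a trace would pass both exclusion criteria.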

Evaluation of the outcome of denoising strategies
We first performed Pearson's correlations to understand the overall numerical similarities of the denoised connectomes across different strategies. For each parcellation scheme, we computed a correlation matrix across the thirteen strategies. These correlation matrices were then averaged across the parcellation schemes within each dataset. The averaged correlation matrices were reordered into blocks of clusters with the function scipy.cluster.hierarchy.linkage. The aim was to provide an overview of the similarity of connectomes generated with the strategies.
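As a sketch of this step with toy data (the real benchmark operates on vectorized connectomes per strategy and parcellation), the correlation matrix across strategies and its hierarchical reordering could be computed as:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(0)
# Toy stand-in: one vectorized connectome (lower-triangle edges) per strategy.
n_edges = 100
strategies = ["baseline", "simple", "simple+gsr", "scrubbing.5"]
connectomes = {s: rng.normal(size=n_edges) for s in strategies}

# Pearson correlation between every pair of strategies.
stacked = np.vstack([connectomes[s] for s in strategies])
corr = np.corrcoef(stacked)

# Reorder strategies into blocks of similar results via hierarchical
# clustering (average linkage on a correlation-based distance).
condensed_dist = 1.0 - corr[np.triu_indices_from(corr, k=1)]
order = leaves_list(linkage(condensed_dist, method="average"))
ordered = corr[np.ix_(order, order)]
```

The reordered matrix `ordered` groups strategies producing numerically similar connectomes into adjacent rows and columns, giving the block structure described above.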
We then used selected metrics described in the previous literature to evaluate the denoising results (Ciric et al., 2017;Parkes et al., 2018). After investigating the metrics with fMRIPrep version 20.2.1 LTS, we assessed whether the conclusions were consistent in 20.2.5 LTS.
Loss in temporal degrees of freedom (Ciric et al., 2017; Yan et al., 2013)
Common analysis and denoising methods are based on linear regression. Using more nuisance regressors can capture additional sources of noise-related variance in the data and thus improve denoising. However, this comes at the expense of a loss of temporal degrees of freedom for statistical inference in further analysis. This may be an important point to consider alongside denoising performance for researchers who wish to perform analyses based on the general linear model. A higher loss in temporal degrees of freedom can spuriously increase functional connectivity (Yan et al., 2013). Volume-censoring-based and data-driven strategies (ICA-AROMA and some variations of CompCor) introduce variability into the degrees of freedom and can bias group-level comparisons (Ciric et al., 2017). We calculated the number of regressors used and the number of volumes censored. Depending on the length of the scan, the number of discrete cosine-basis regressors can differ; it is denoted as $c$ in the report ($c_{ds000228} = 4$, $c_{ds000030} = 3$). Simple, simple+gsr, and compcor6 are the strategies with a fixed loss of degrees of freedom. The scrubbing, compcor, aroma, and aroma+gsr strategies show variability depending on the number of volumes censored or noise components detected.
Quality control / functional connectivity (QC-FC)
QC-FC quantifies the correlation between mean framewise displacement and functional connectivity. It is calculated as a partial correlation between mean framewise displacement and connectivity, with age and sex as covariates. Denoising methods should aim to reduce QC-FC values. Significance tests associated with the partial correlations were performed, and correlations with P-values below the threshold of $\alpha = 0.05$ were deemed significant. A version of this analysis corrected for multiple comparisons using the false discovery rate (Benjamini & Hochberg, 1995) is available in the appendix.
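A minimal sketch of the QC-FC computation for a single connectivity edge, assuming a residualization-based partial correlation (the benchmark's exact implementation may differ in details):

```python
import numpy as np
from scipy import stats

def qcfc(edge_values, mean_fd, age, sex):
    """Partial correlation of mean framewise displacement and one
    connectivity edge across subjects, controlling for age and sex.

    Implemented by residualizing both variables against the covariates
    and correlating the residuals (equivalent to a partial correlation).
    Returns (r, p).
    """
    covariates = np.column_stack(
        [np.ones_like(np.asarray(age, dtype=float)), age, sex]
    )

    def residualize(y):
        # Ordinary least squares fit, then subtract the fitted values.
        beta, *_ = np.linalg.lstsq(covariates, np.asarray(y, dtype=float),
                                   rcond=None)
        return y - covariates @ beta

    return stats.pearsonr(residualize(edge_values), residualize(mean_fd))
```

In the benchmark, this is evaluated for every edge in the connectome, producing one QC-FC distribution per denoising strategy.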

Distance-dependent effects of motion on connectivity
To determine the residual distance-dependence of subject movement, we first calculated the Euclidean distance between the centers of mass of each pair of parcels. Closer parcels generally exhibit a greater impact of motion on connectivity. We then correlated the distance separating each pair of parcels with the QC-FC correlation of the edge connecting those parcels. We report the absolute correlation values and expect to see a general trend toward zero correlation after confound regression.
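This metric can be sketched as follows, assuming parcel centers of mass are available as coordinates and QC-FC values are supplied in matching (upper-triangle, row-major) edge order; this is illustrative code, not the benchmark implementation:

```python
import numpy as np

def distance_dependence(centroids, qcfc_edges):
    """Correlation between inter-parcel Euclidean distance and QC-FC.

    centroids : array of shape (n_parcels, 3), centers of mass in mm.
    qcfc_edges : condensed vector with one QC-FC value per parcel pair,
                 in upper-triangle, row-major edge order.
    """
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise distance matrix
    iu = np.triu_indices(len(centroids), k=1)    # unique edges only
    return np.corrcoef(dist[iu], qcfc_edges)[0, 1]
```

A value near zero indicates that residual motion effects no longer depend on the distance between parcels, the desired outcome after denoising.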

Network modularity
Confound regressors have the potential to remove real signals in addition to motion-related noise. To evaluate this possibility, we computed modularity quality, an explicit quantification of the degree to which there are structured subnetworks in a given network, in this case the denoised connectome. Modularity quality is quantified by graph community detection based on the Louvain method, implemented in the Brain Connectivity Toolbox (Rubinov & Sporns, 2010). If confound regression and censoring removed real signals in addition to motion-related noise, we would expect modularity to decline. To understand the extent of the correlation between modularity and motion, we computed the partial correlation between subjects' modularity values and mean framewise displacement, with age and sex as covariates.
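As an illustrative sketch of this metric, using NetworkX's Louvain implementation rather than the Brain Connectivity Toolbox, and simplifying by discarding negative correlations (the signed variant used in BCT keeps them):

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

def modularity_quality(connectome, seed=0):
    """Louvain modularity of a denoised connectome (positive edges only).

    connectome : (n_parcels, n_parcels) symmetric correlation matrix.
    As a simplification this sketch zeroes negative correlations, which
    a signed modularity implementation would instead handle explicitly.
    """
    mat = np.asarray(connectome, dtype=float).copy()
    np.fill_diagonal(mat, 0.0)           # ignore self-connections
    mat[mat < 0] = 0.0                   # keep positive weights only
    graph = nx.from_numpy_array(mat)
    parts = louvain_communities(graph, weight="weight", seed=seed)
    return modularity(graph, parts, weight="weight")
```

For a toy connectome with two fully connected blocks and no between-block edges, the modularity approaches 0.5, the maximum for two equally sized communities.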

Contributions
The initial API was started by Hanad Sharmarke and Pierre Bellec. The implementation was completed by Hao-Ting Wang, Steven Meisler, François Paugam, and Pierre Bellec. Hao-Ting Wang migrated the code base to Nilearn. Nicolas Gensollen and Bertrand Thirion reviewed the code migrated to Nilearn. We thank Chris Markiewicz for feedback related to fMRIPrep.
Hao-Ting Wang and Pierre Bellec drafted the initial version of the paper, with critical feedback from Natasha Clarke. All co-authors contributed to and approved the final version of the manuscript.
Please see the original repository (https://github.com/SIMEXP/load_confounds#contributors-) for a history of initial development and contributors, and this issue (https://github.com/nilearn/nilearn/issues/2777) for a history of the integration in Nilearn and all the linked Pull Requests.
• The principal component-based method CompCor (Behzadi et al., 2007; Muschelli et al., 2014) extracts reduced components from white matter and cerebrospinal fluid masks to estimate non-neuronal activity.
• Independent component analysis-based methods estimate spatially independent components representing brain activity and noise. To identify the components related to head motion, researchers use a data-driven classifier (ICA-FIX; Salimi-Khorshidi et al., 2014) or a pre-trained model (ICA-AROMA; Pruim, Mennes, van Rooij, et al., 2015).
Despite the abundance of measures for non-neural signals, there is no one class of measures that can capture all known noise. A combination of nuisance regressors is thus needed to address the different sources of noise, and different common strategies have been put forth in the literature.
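The CompCor idea described above can be sketched as a principal component extraction from noise-ROI signals. This illustrative version only demeans the voxel time series, whereas real implementations (e.g., in fMRIPrep) additionally detrend and variance-normalize them:

```python
import numpy as np
from sklearn.decomposition import PCA

def acompcor(voxel_timeseries, n_components=5):
    """Anatomical CompCor sketch: principal components of noise-ROI signals.

    voxel_timeseries : array of shape (n_volumes, n_voxels), BOLD signals
    extracted from white matter and CSF masks. Returns an array of shape
    (n_volumes, n_components) to be used as nuisance regressors.
    """
    demeaned = voxel_timeseries - voxel_timeseries.mean(axis=0)
    # The leading components capture the dominant shared fluctuations in
    # the non-neuronal compartments, taken as a proxy for noise.
    return PCA(n_components=n_components).fit_transform(demeaned)
```

The resulting components are mutually uncorrelated and are entered into the confound regression alongside the head-motion parameters, as in the strategies described in Annex B.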

Annex B. Common denoising strategies
Researchers have developed and proposed different strategies with the aim of improving denoising approaches. A denoising strategy typically involves the selection of a subset of nuisance regressors from the broad list presented above. Head motion combined with non-grey-matter tissue signals is one of the most basic approaches. Scrubbing combines this basic approach with volume censoring, which can be implemented either by removing time points entirely or by adding "spike" nuisance regressors (Lemieux et al., 2007; Siegel et al., 2014). Anatomical CompCor regressors are usually applied along with the basic head-motion parameters (Muschelli et al., 2014). ICA-AROMA requires two steps of denoising: (i) a partial regression to remove variance associated with the noise ICA-AROMA components inside a full ICA decomposition, preceded by (ii) a linear regression including the basic average white matter and cerebrospinal fluid signal regressors (Pruim, Mennes, van Rooij, et al., 2015). As multiple variants of most categories of confounds exist (e.g., the degree of expansion of the motion parameters), a large number of specific variants exist for each strategy through combinatorial effects. To provide objective guidance to practitioners in their selection of an effective denoising strategy, comprehensive denoising benchmarks have emerged in the functional connectivity research literature.

Annex C. Evaluation of denoising strategies
Denoising benchmarks use several metrics to evaluate the effectiveness of denoising strategies on functional connectivity, commonly including the loss of temporal degrees of freedom, residual motion effects, and the impact on network properties such as modularity. The ability of different strategies to remove confounds is heterogeneous (Ciric et al., 2017), and no single strategy dominates across all evaluation metrics. Several benchmarks still recommended ICA-AROMA due to its limited loss of degrees of freedom compared to scrubbing-based methods (Parkes et al., 2018), possibly in combination with global signal regression for maximal reduction of motion artifacts (Burgess et al., 2016; Ciric et al., 2017). However, a denoising strategy can perform differently due to factors that strongly correlate with motion, such as psychiatric conditions or age, or due to the choice of subsequent analytical technique. For example, CompCor may only be maximally effective in low-motion data (Behzadi et al., 2007). Volume-censoring-based strategies are unsuitable for time series analyses that require uniform sampling of signals, such as most time-frequency analysis implementations. Researchers thus need to choose the best denoising strategy based on the available benchmarks, the profile of their data, and their choice of analytical techniques.
The existing benchmarks have a few pitfalls that limit their integration into each researcher's unique needs. Benchmark research and denoising method development are conducted with in-house preprocessing solutions on different datasets. From a research standpoint, this is not necessarily a pitfall, as it shows that workflows built on different software, following shared general principles for preprocessing, reach converging conclusions. However, it is a problem for users wishing to adopt a recommended strategy within their choice of preprocessing workflow. To correctly implement the best approach for their study, researchers need to understand the extensive literature and then construct the workflow themselves. Another pitfall is that benchmark results are limited to the scope of the datasets evaluated. Because researchers use in-house tools and analysis code, there is no direct way to apply a suggested strategy, or to generate the benchmark statistics for evaluation, on a new dataset. The denoising benchmark literature provides a good overview of methodological progress in the field of resting-state functional connectivity, but falls short of providing a reproducible path for the general community to adapt the results to modern preprocessing solutions, such as fMRIPrep.