Statistical significance in DTI group analyses: How the choice of the estimator can inflate effect sizes

Diffusion magnetic resonance imaging (dMRI) is one of the most prevalent methods to investigate the micro- and macrostructure of the human brain in vivo. Prior to any group analysis, dMRI data are generally processed to alleviate adverse effects of known artefacts such as signal drift, data noise and outliers, subject motion, and geometric distortions. These dMRI data processing steps are often combined in automated pipelines, such as the one of the Human Connectome Project (HCP). While improving the performance of processing tools has clearly shown its benefits at each individual step along the pipeline, it remains unclear whether – and to what degree – choices for specific user-defined parameter settings can affect the final outcome of group analyses. In this work, we demonstrate how making such a choice for a particular processing step of the pipeline drives the final outcome of a group study. More specifically, we performed a dMRI group analysis on gender using HCP data sets and compared the results obtained with two diffusion tensor imaging estimation methods: the widely used ordinary linear least squares (OLLS) and the more reliable iterative weighted linear least squares (IWLLS). Our results show that the effect sizes for group analyses are significantly smaller with IWLLS than with OLLS. While previous literature has demonstrated higher estimation reliability with IWLLS than with OLLS using simulations, this work now also shows how OLLS can produce a larger number of false positives than IWLLS in a typical group study. We therefore highly recommend using the IWLLS method. By raising awareness of how the choice of estimator can artificially inflate effect size and thus alter the final outcome, this work may contribute to improvement of the reliability and validity of dMRI group studies.

for, for instance, a typical diffusion tensor imaging (DTI) study in which two groups of subjects (e.g., healthy controls vs. patients) are compared. This lack of agreement is reinforced by our limited understanding of whether a specific processing method contributes significantly to the reliability of the outcome of the subsequent group analysis. In this context, one could state that, in practice, the added benefit of a particular data correction procedure is nullified if there are other data aspects with a much higher variability. As an example, the decrease in diffusion parameter estimation bias due to Gibbs ringing correction may be completely swamped by the high noise levels in low-SNR dMRI data, obviating the relevance of performing this processing step.

In general, the relative improvement of one processing step depends not only on the intrinsic quality of the data, but also on the performance of the other processing steps used in the dMRI pipeline. Correcting spatial misalignment across multiple diffusion-weighted images (DWIs) due to subject motion, for instance, may benefit from preceding denoising of these images. In addition, after the data have been corrected for artifacts, strategies to further analyze the data (e.g., using fiber tractography, histograms, ROIs, voxel-based approaches, or network graphs) may differ in their sensitivity to the benefit of some of the individual processing steps and potentially generate differences in the final outcome of a group study.

While many steps in a dMRI processing pipeline can be considered optional, for several diffusion approaches such as DTI or diffusion kurtosis imaging (DKI), there is the mandatory step of choosing the diffusion estimation method to obtain model parameters. Over the last decade, a plethora of such [...]

[...] proven to be more efficient in producing fewer false positives than parametric methods (Eklund et al., 2016). Significance was determined at p_corr < 0.05 using family-wise error rate (FWER) adjustment to correct for multiple comparisons after applying threshold-free cluster enhancement (TFCE) (Smith and Nichols, 2009). Calculation speed was accelerated using the tail approximation (Winkler et al., 2016). A Dell server with 72 Intel Xeon E7-8870 v3 @ 2.10 GHz dual cores with 1 TB RAM was used for calculations.

Effect of tensor estimator
For each participant, there are two FA maps: one obtained from the diffusion tensor estimated with OLLS and one with IWLLS. To investigate potential differences in FA (regardless of gender) between the OLLS and IWLLS pipelines, we used a paired two-sample t-test. This procedure tests whether the choice of tensor estimation method has a significant effect on FA, without considering whether the participant is female or male.
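As a rough illustration of the paired comparison described above, the sketch below computes a paired t-statistic on per-subject FA differences and derives a p-value by random sign-flipping, which is the permutation scheme that applies to paired designs. It is a minimal single-voxel sketch on simulated values, not the PALM analysis used in the study; the function names, the sample size, and the simulated estimator offset are all assumptions made for illustration.

```python
import numpy as np

def paired_t(diff):
    """One-sample t-statistic of per-subject differences (equals the paired t-test)."""
    n = diff.shape[0]
    return diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

def sign_flip_pvalue(diff, n_perm=2000, seed=0):
    """Two-sided permutation p-value obtained by randomly flipping the sign of each difference."""
    rng = np.random.default_rng(seed)
    t_obs = paired_t(diff)
    count = 1  # the observed statistic counts as one permutation
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=diff.shape[0])
        if abs(paired_t(diff * signs)) >= abs(t_obs):
            count += 1
    return t_obs, count / (n_perm + 1)

# Simulated FA values for one voxel (hypothetical numbers, not HCP data).
rng = np.random.default_rng(42)
n_subjects = 40
fa_iwlls = rng.normal(0.45, 0.03, n_subjects)
fa_olls = fa_iwlls + rng.normal(0.005, 0.004, n_subjects)  # assumed estimator offset

t_stat, p_val = sign_flip_pvalue(fa_olls - fa_iwlls)
print(f"paired t = {t_stat:.2f}, permutation p = {p_val:.4f}")
```

Because the test operates on within-subject differences, between-subject variability in FA (here a standard deviation of 0.03) does not enter the statistic, which is why a small systematic estimator offset can still be detected.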

Effect of Gender
Differences in FA values between males and females (denoted as FA_m and FA_f) were investigated using an unpaired two-sample t-test for the OLLS and IWLLS pipelines separately. A further correction was applied via the "-corrcon" option in PALM, which accounts for the multiple contrasts during the FWER correction.
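The idea of an unpaired group comparison with FWER control across voxels can be sketched with a plain max-statistic permutation test. This is a deliberate simplification: it uses neither TFCE nor the tail approximation nor the correction across contrasts that the study applied through PALM, and the data, group sizes, and function names are assumptions for illustration only.

```python
import numpy as np

def two_sample_t(data, is_male):
    """Pooled-variance two-sample t-statistic per voxel (males minus females)."""
    a, b = data[is_male], data[~is_male]
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(axis=0, ddof=1) + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1 / na + 1 / nb))

def max_t_fwer(data, is_male, n_perm=1000, seed=0):
    """FWER-corrected p-values from the permutation distribution of the maximum |t|."""
    rng = np.random.default_rng(seed)
    t_obs = two_sample_t(data, is_male)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(is_male)  # reassign group labels at random
        max_null[i] = np.abs(two_sample_t(data, shuffled)).max()
    # Fraction of permutations whose maximum |t| reaches the observed |t| of each voxel.
    exceed = (max_null[:, None] >= np.abs(t_obs)[None, :]).sum(axis=0)
    return t_obs, (1 + exceed) / (n_perm + 1)

# Simulated FA maps: 60 subjects, 50 voxels, group effect in the first 5 voxels only.
rng = np.random.default_rng(0)
n_subjects, n_voxels = 60, 50
is_male = np.arange(n_subjects) < n_subjects // 2
fa = rng.normal(0.45, 0.02, size=(n_subjects, n_voxels))
fa[is_male, :5] += 0.04  # assumed FA_m > FA_f effect

t_obs, p_corr = max_t_fwer(fa, is_male)
print("voxels with p_corr < 0.05:", np.flatnonzero(p_corr < 0.05))
```

Comparing each voxel against the null distribution of the maximum statistic is what makes the resulting p-values family-wise corrected; running the same routine once per pipeline mirrors the separate OLLS and IWLLS analyses.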

Pipeline dependent gender differences
To test whether gender differences depend on the tensor estimation method, we performed a two-sample t-test on gender, where the tested variable is the difference in FA, denoted as ΔFA, between the IWLLS and OLLS pipelines:

ΔFA = FA_IWLLS − FA_OLLS

More specifically, we evaluated with this test whether the ΔFA values for males, denoted as ΔFA_m, differ significantly from the ΔFA values for females, denoted as ΔFA_f. Statistically, this procedure is the same as the interaction part of a two-group analysis of variance (ANOVA) test with two levels per participant. A significant effect means that the gender differences are solely driven by the choice of estimation method. Independent and symmetric errors were assumed to boost the statistical power of the test, by using the option "-ise" in PALM. Effect sizes and their distributions were analyzed in detail within the regions of significance.
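The equivalence between this test and an estimator-by-gender interaction can be made concrete with a small numerical sketch. It assumes the sign convention ΔFA = FA_IWLLS − FA_OLLS (an assumption inferred from the later statement that ΔFA_f > ΔFA_m when FA_m increases more under OLLS) and uses simulated single-voxel values; none of the numbers come from the study.

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled-variance two-sample t-statistic for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled * (1 / na + 1 / nb))

rng = np.random.default_rng(1)
n_m, n_f = 50, 50
fa_iwlls_m = rng.normal(0.46, 0.03, n_m)
fa_iwlls_f = rng.normal(0.45, 0.03, n_f)
# Simulated assumption: OLLS raises FA slightly more in males than in females.
fa_olls_m = fa_iwlls_m + rng.normal(0.010, 0.004, n_m)
fa_olls_f = fa_iwlls_f + rng.normal(0.006, 0.004, n_f)

# Per-subject estimator difference (assumed convention: IWLLS minus OLLS).
delta_m = fa_iwlls_m - fa_olls_m
delta_f = fa_iwlls_f - fa_olls_f

# Two-sample test on the differences, i.e. the interaction contrast.
t_interaction = two_sample_t(delta_f, delta_m)

# The same contrast expressed as a change in the male-female gap between pipelines.
gap_olls = fa_olls_m.mean() - fa_olls_f.mean()
gap_iwlls = fa_iwlls_m.mean() - fa_iwlls_f.mean()
print(f"t = {t_interaction:.2f}, gap(OLLS) - gap(IWLLS) = {gap_olls - gap_iwlls:.4f}")
```

The identity mean(ΔFA_f) − mean(ΔFA_m) = (FA_m − FA_f)_OLLS − (FA_m − FA_f)_IWLLS holds algebraically, so a positive contrast indicates that the male-female gap is larger under OLLS than under IWLLS.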

Effect size
The practical significance of the findings was further evaluated by reporting effect sizes, as suggested [...]

[...] (Fig. 6c), it can be readily seen that FA_m increased more than FA_f when changing the estimator from IWLLS to OLLS. The FA_m − FA_f difference is plotted for each decile with the bootstrapped confidence intervals as a function of male deciles, indicating that the increase in FA_m was systematically larger than the increase in FA_f by 0.5-2% due to this change (Fig. 6d). Note that if a confidence interval does not include zero, one may also conclude that the difference between the changes of these ratios is significant.

[...]

The area of investigation is located where ΔFA_f > ΔFA_m is significant, as shown in Fig. 4, but within that region it is limited to voxels where FA_m > FA_f. Fig. 10 shows the differences of the effect sizes as a function of OLLS-based effect sizes. For the sake of simplicity, the spatial distribution of the voxels in MNI space is not shown.
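To illustrate how an effect size and a bootstrapped confidence interval of the kind mentioned above can be obtained, the following sketch computes Cohen's d with a percentile bootstrap interval. The choice of Cohen's d here is an assumption (the exact effect-size definition is not recoverable from this part of the text), and the group sizes and FA values are simulated.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Cohen's d (resampling within each group)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        resampled_a = rng.choice(a, size=len(a), replace=True)
        resampled_b = rng.choice(b, size=len(b), replace=True)
        stats[i] = cohens_d(resampled_a, resampled_b)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Simulated regional FA values (hypothetical numbers, not HCP data).
rng = np.random.default_rng(7)
fa_m = rng.normal(0.47, 0.03, 200)
fa_f = rng.normal(0.45, 0.03, 200)

d = cohens_d(fa_m, fa_f)
ci_low, ci_high = bootstrap_ci(fa_m, fa_f)
print(f"Cohen's d = {d:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Running the same computation on OLLS-based and IWLLS-based FA values for the same region would give the two effect sizes whose difference is of interest; an interval that excludes zero corresponds to the significance reading given in the text.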

4 Discussion
In this work, we investigated how making a different choice for a specific data processing step can affect the outcome in a typical DTI group study. More specifically, we performed a voxel-based analysis, comparing FA values between males and females using HCP data, and revealed that a higher effect size was obtained with the OLLS diffusion tensor estimator than with its IWLLS counterpart. If we consider that the IWLLS estimator has a higher accuracy, we can conclude that OLLS overestimates the observed FA-based gender differences. With the majority of published DTI studies having used the OLLS estimator, it is not hard to imagine that the lack of general agreement in findings for several research topics (both in neuroscience and clinical applications) could also be partly attributed to the higher number of false positives introduced by the OLLS estimator as compared with the IWLLS estimator. In the following paragraphs, we will discuss how our findings relate to what is known in functional MRI (fMRI) and we will place our results in the context of other dMRI studies.

The term 'blobology' (Poldrack, 2012) corresponds to the colorful patches, the 'blobs', of fMRI brain studies, summarizing the localization of the results after processing and statistical thresholding. The phrase reflects an inherent frustration within the neuroimaging community, partly due to the lack of effect size reports. In dMRI studies, unfortunately, effect sizes are rarely reported. Researchers often spend most of their efforts on reporting statistically significant results from the data, while the extent of these effects, which is highly complementary, is hardly considered.

[...] dtifit. In all of the aforementioned large-scale cohorts (ADNI, HCP, UK BioBank, Whitehall study), OLLS is also used which, in light of our findings, may adversely affect the reliability of the final outcome in a group study.
Generally, lower-quality dMRI data in terms of effective SNR or CNR benefit more from using an estimator with better performance characteristics such as the IWLLS approach (Veraart et al., 2013a, 2013b). In this work, we used HCP data, which are among the highest quality data available in current large-scale cohorts (Bastiani et al., 2019). Given the lower number of DWIs, the lower SNR and CNR, and the higher amount of physiological artifacts in more conventional neuroimaging studies, especially in a clinical setting, one can expect even more inflated effect sizes by using the OLLS estimator than those observed in this work.

Researchers often justify the choices made for specific processing steps in their data processing pipeline by referring to previous peer-reviewed studies that used the same settings or algorithms, despite the availability of more reliable alternatives. In addition, as OLLS generates an artificially higher effect size than IWLLS, it stimulates the positive bias in publications (Rothstein et al., 2006) and contributes to "the natural selection of bad science" (Smaldino and McElreath, 2016). To some extent, following the implementation of "registered reports" may mitigate this concern as the processing pipeline can be reviewed and scrutinized before starting the actual analysis (Nosek and Lakens, 2014).

In a recent review paper (Poldrack et al., 2017), the lack of consensus in processing and analysis was showcased for fMRI. With common fMRI software packages, it was shown that the number of possible analysis workflows can be as high as 69,120. For DTI, it is not hard to reach the same order of magnitude for the number of workflows, given the vast number of options and parameter settings one can think of.
In this work, we specifically investigated the effect of choosing between the OLLS and the IWLLS estimator on the outcome of the analysis, as using a diffusion tensor estimator is mandatory. Other processing steps, such as denoising and correcting for artifacts, are not per se necessary (although highly recommended, of course) to continue with performing an actual group study. In this context, there may be several aspects of a typical processing or analysis workflow for DTI that may result in much larger effects than shown in this work.

Eklund et al. (2016) used resting-state fMRI to obtain "null data", i.e., truly negative data, to test the false-positive rates for task fMRI. Unfortunately, for DTI, such an experimental testing setup to evaluate statistical inferences related to methodological factors is not trivial. However, without loss of generality, in this work, we performed a standard group study on gender as the framework to evaluate the effect of using different diffusion tensor estimation approaches. We used HCP data because of the excellent data quality and the large number of subjects with proper male-female balance, thereby eliminating issues related to small sample size and low power during statistical inference (Button et al., 2013).

In this work, we did not opt for analyzing the "statistical" significance (i.e., p-values) of our findings, but rather considered the difference in effect sizes that can be observed. In a similar context, shifting the focus from p-values to effect sizes was also recently presented by Ritchie et al. (2018). They compared volumes and DTI-based metrics of cortical, subcortical, and WM regions between females and males from the UK BioBank for more than 5000 participants. The comparison of the right CST revealed that males have larger FA values than females, with a p-value of 4×10^-65 and Cohen's d = 0.54.
After adjusting for total brain volume, the values changed to 8×10^-12 with Cohen's d = 0.22. While these p-values are indeed very significant, they do not contain any useful information. On the other hand, the effect size measures provide more practical information. That is, adding another 5000 or more participants to the analysis will not result in any meaningful change in terms of the effect size, as this investigation is already statistically well-powered, while the p-value would decrease further. For the same reason, i.e., avoiding an under-powered study design, we used HCP data for our group comparison, allowing us to focus on the performance of the DTI estimators.

Despite the efforts of optimizing the dMRI processing pipeline, it is often not clear what the benefits of new developments are for group-based studies. In this work, however, we showed that the application of IWLLS should be preferred over OLLS for diffusion tensor estimation. The current framework can be easily extended to examine effects of modifying other processing elements, but also to investigate choices in algorithms and settings for specific analysis strategies, like tractography and connectomics, further improving the reliability and validity of future dMRI group studies.
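For readers unfamiliar with the two estimators compared in this work, the sketch below fits the log-linear diffusion tensor model to a single simulated voxel with OLLS and with an iteratively reweighted variant in which the weights are the squared signals predicted by the previous iteration. It is a minimal sketch of the general technique under stated assumptions (one b = 0 volume, 30 random directions at b = 1000 s/mm^2, Rician noise, a fixed number of iterations); it is not the implementation used in the study, and all function names are hypothetical.

```python
import numpy as np

def design_matrix(bvals, bvecs):
    """Design matrix of ln(S) = ln(S0) - b * g^T D g for [lnS0, Dxx, Dyy, Dzz, Dxy, Dxz, Dyz]."""
    gx, gy, gz = bvecs.T
    return np.column_stack([np.ones_like(bvals), -bvals * gx**2, -bvals * gy**2, -bvals * gz**2,
                            -2 * bvals * gx * gy, -2 * bvals * gx * gz, -2 * bvals * gy * gz])

def fit_olls(X, signal):
    """Ordinary linear least squares on the log-transformed signal."""
    beta, *_ = np.linalg.lstsq(X, np.log(signal), rcond=None)
    return beta

def fit_iwlls(X, signal, n_iter=5):
    """Iteratively reweighted fit: each row is scaled by the signal predicted in the last pass."""
    y = np.log(signal)
    beta = fit_olls(X, signal)
    for _ in range(n_iter):
        w = np.exp(X @ beta)  # row scaling w gives least-squares weights w**2
        beta, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return beta

def fractional_anisotropy(beta):
    """FA computed from the eigenvalues of the fitted tensor."""
    dxx, dyy, dzz, dxy, dxz, dyz = beta[1:]
    tensor = np.array([[dxx, dxy, dxz], [dxy, dyy, dyz], [dxz, dyz, dzz]])
    ev = np.linalg.eigvalsh(tensor)
    return np.sqrt(1.5 * np.sum((ev - ev.mean())**2) / np.sum(ev**2))

# Simulated acquisition (assumed protocol, not the HCP protocol).
rng = np.random.default_rng(3)
n_dirs = 30
bvecs = rng.normal(size=(n_dirs, 3))
bvecs /= np.linalg.norm(bvecs, axis=1, keepdims=True)
bvecs = np.vstack([np.zeros(3), bvecs])                   # first volume is b = 0
bvals = np.concatenate([[0.0], np.full(n_dirs, 1000.0)])  # s/mm^2

X = design_matrix(bvals, bvecs)
true_beta = np.array([np.log(1000.0), 1.7e-3, 0.3e-3, 0.3e-3, 0.0, 0.0, 0.0])
clean = np.exp(X @ true_beta)
true_fa = fractional_anisotropy(true_beta)

# Rician-distributed magnitude signal with an assumed noise level.
sigma = 50.0
noisy = np.sqrt((clean + rng.normal(0, sigma, clean.shape))**2 + rng.normal(0, sigma, clean.shape)**2)

fa_olls = fractional_anisotropy(fit_olls(X, noisy))
fa_iwlls = fractional_anisotropy(fit_iwlls(X, noisy))
print(f"true FA = {true_fa:.3f}, OLLS FA = {fa_olls:.3f}, IWLLS FA = {fa_iwlls:.3f}")
```

The reweighting reflects that the variance of the log-transformed signal grows as the signal drops, so strongly attenuated measurements are down-weighted. On noiseless data both fits recover the true tensor; differences appear only once noise is added, and a single voxel such as this one cannot by itself show which estimator is better.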

Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.