NOMIS: Quantifying morphometric deviations from normality over the lifetime of the adult human brain

We present NOMIS (https://github.com/medicslab/NOMIS), a comprehensive open MRI tool to assess morphometric deviation from normality in the adult human brain. Based on MR anatomical images from 6,909 cognitively healthy individuals aged 18-100 years, we modeled 1,344 measures computed using the open-access FreeSurfer pipeline, taking into account personal characteristics (age, sex, intracranial volume) and image quality (resolution, contrast-to-noise ratio and surface reconstruction defect holes), and providing expected values for any new individual. Then, for each measure, the NOMIS tool was built to generate Z-score effect sizes denoting the extent of deviation from the normative sample. Depending on user needs, NOMIS offers four versions of the Z-score, each adjusted for a different set of variables. While all versions consider head size and image quality, they can also incorporate age and/or sex, thereby facilitating multi-site neuromorphometric research across adulthood.

Introduction
Despite the popularity of magnetic resonance imaging (MRI) to examine abnormalities in brain morphometry, tools quantifying normality are lacking. While age, sex and intracranial volume are well-known to influence brain volume and shape [1,2], the determination of whether an individual's brain region measurements are within normality faces multiple major challenges such as the lack of normative data across appropriate age groups, the influence of the MRI processing pipeline, the variety in neuroanatomical atlases used for parcellation and the uniqueness of the image acquisition itself [3,4]. We made previous attempts [5-8] to produce such normative data in adulthood based on FreeSurfer, an open-access and fully automated segmentation software (http://freesurfer.net), for two specific brain atlases, namely Desikan-Killiany [9] (DK) and Desikan-Killiany-Tourville [10] (DKT). This initial foray allowed for the quantification of the extent of deviation from normality for a given individual, according to [...]

[...] and with probable Alzheimer's disease (AD), for a total of 273 participants. While CIMA-Q and COMPASS-ND were acquired at 18 different sites, we selected only data from scanners that had at least three participants other than SIMON, which resulted in a total of 547 images (300 CU, 193 MCI, 54 AD) from 12 different scanners, each with 7 to 145 participants. On those 12 scanners, SIMON was scanned 48 times and was aged 42-46 years during that time.

Brain segmentation
Brain segmentation was conducted using FreeSurfer version 6.0, a widely used and freely available automated processing pipeline that quantifies brain anatomy (http://freesurfer.net). All raw T1-weighted images were processed using the "recon-all -all" FreeSurfer command with the fully-automated directive parameters (no manual intervention or expert flag options) on the [...] white and pial surface areas for all atlases included in FreeSurfer 6.0: the default subcortical atlas [14] (aseg.stats), the Desikan-Killiany atlas [9] [...] Briefly, this processing includes motion correction, removal of non-brain tissue using a hybrid [...] FreeSurfer output file. We added the total ventricle volume (labeled as "ventricles") using the sum of all ventricles and the corpus callosum (labeled as "cc") using the sum of all corpus callosum segments.
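The two derived measures described above amount to simple sums over segmentation volumes. The sketch below is illustrative only: the structure names follow labels commonly found in aseg.stats, but which structures NOMIS actually includes in each sum is an assumption, and the volumes are made-up numbers rather than real data.

```python
# Hypothetical aseg-style volumes in mm^3 (made-up values for illustration).
aseg_volumes = {
    "Left-Lateral-Ventricle": 7000.0,
    "Right-Lateral-Ventricle": 6500.0,
    "Left-Inf-Lat-Vent": 300.0,
    "Right-Inf-Lat-Vent": 280.0,
    "3rd-Ventricle": 900.0,
    "4th-Ventricle": 1600.0,
    "CC_Posterior": 950.0,
    "CC_Mid_Posterior": 450.0,
    "CC_Central": 480.0,
    "CC_Mid_Anterior": 470.0,
    "CC_Anterior": 900.0,
}

# Assumption: the lists of structures entering each sum are guesses based on
# aseg label names, not taken from the NOMIS source code.
VENTRICLE_LABELS = [
    "Left-Lateral-Ventricle", "Right-Lateral-Ventricle",
    "Left-Inf-Lat-Vent", "Right-Inf-Lat-Vent",
    "3rd-Ventricle", "4th-Ventricle",
]
CC_LABELS = [
    "CC_Posterior", "CC_Mid_Posterior", "CC_Central",
    "CC_Mid_Anterior", "CC_Anterior",
]


def add_summary_volumes(volumes):
    """Return a copy of `volumes` with the summed "ventricles" and "cc" entries."""
    out = dict(volumes)
    out["ventricles"] = sum(volumes[name] for name in VENTRICLE_LABELS)
    out["cc"] = sum(volumes[name] for name in CC_LABELS)
    return out


summary = add_summary_volumes(aseg_volumes)
print(summary["ventricles"], summary["cc"])
```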

Image quality predictors
Image quality predictors included voxel size (resolution) and two measures of image quality, one global and one local. The first was the total number of defect holes over the whole cortex, i.e. topological errors in the initial cortical surface reconstructions. This total number correlated well with visual inspection of the whole image by trained raters [11]. This measure was extracted from the aseg.stats FreeSurfer output file. The second measure was the contrast-to-noise ratio (CNR), assessed in each region (R) and therefore used as a regional measure of image quality. For each region, CNR was calculated after FreeSurfer preprocessing using gray matter (GM) and cerebral white matter (WM) intensities from the brain.mgz file and the following formula:

$$\mathrm{CNR}_R = \frac{\left(\mathrm{GM}_{R\,\mathrm{mean}} - \mathrm{WM}_{\mathrm{mean}}\right)^2}{\mathrm{GM}_{R\,\mathrm{variance}} + \mathrm{WM}_{\mathrm{variance}}}$$

[...] 3.29 and higher than 3.29 (p < .001) were removed before computing the statistical model. This procedure allowed the identification of brain regions that were either very small or very large when compared to the rest of the sample and thus might not be representative of normality. For volumes and surfaces, this procedure was applied in proportion to eTIV (i.e. regional measure/eTIV). Since cortical thickness is not affected by eTIV, the outlier screening procedure was applied directly on the raw values. The number of outliers was below 1% for all regions (mean ± sd over all atlases: 0.45% ± 0.10%) except the right long insular gyrus and central sulcus of the insula white surface (1.1%) and the pericallosal sulcus volume (1.1%) of the Destrieux atlas. Detailed results can be found in the supplementary material as csv files.
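As a minimal sketch of these two steps, the code below computes a regional CNR and screens outliers at |Z| > 3.29. It rests on several assumptions: that CNR is the squared difference between the regional GM mean and the WM mean divided by the sum of the two variances, that population (rather than sample) variances are used, and that Z scores for screening are computed against the whole sample. The function names and all values are hypothetical.

```python
from statistics import fmean, pstdev, pvariance


def regional_cnr(gm_region, wm):
    """Assumed form: (GM_R mean - WM mean)^2 / (GM_R variance + WM variance)."""
    diff = fmean(gm_region) - fmean(wm)
    return diff ** 2 / (pvariance(gm_region) + pvariance(wm))


def screen_outliers(values, etiv=None, threshold=3.29):
    """Return the indices of values whose Z score lies beyond +/- threshold.

    Volumes and surfaces are screened in proportion to eTIV (value / eTIV);
    thickness is screened on raw values (etiv=None).
    """
    if etiv is not None:
        scaled = [v / e for v, e in zip(values, etiv)]
    else:
        scaled = list(values)
    mean, sd = fmean(scaled), pstdev(scaled)
    return [i for i, x in enumerate(scaled) if abs((x - mean) / sd) > threshold]


# Made-up intensities for one region and for cerebral white matter.
gm = [80.0, 90.0, 100.0]
wm = [108.0, 110.0, 112.0]
print(regional_cnr(gm, wm))

# Made-up thickness values: 19 typical regions and one extreme value.
thickness = [2.5] * 19 + [5.0]
print(screen_outliers(thickness))
```

Note that with very small samples no value can reach |Z| > 3.29, so the screening is only meaningful on samples of realistic size.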

Regression models and statistical analyses
For each brain region measure, the normative values were produced using two successive linear regression models. First, Model 1 was fitted with the image quality predictors (voxel size, surface defect holes and CNR) and eTIV. Then, Model 2, with age and sex, was applied to the residuals of Model 1. To respect the normality of the residuals, surface holes and all ventricle variables except the 4th (i.e. the 3rd, lateral, inferior lateral and the sum of all ventricles) were log transformed. For ventricles and white matter regions, the CNR of the total brain gray matter was used, while for the brainstem subregions and hippocampal subfields, the CNR from the whole brainstem and whole hippocampus were used, respectively. Quadratic and cubic terms for age, CNR and surface holes were included. Since voxel size has relatively limited variability (mean: 1.02, std: 0.24, range: 0.18-2.2), we chose not to include quadratic and cubic terms for this variable. We also included all interactions except those involving voxel size (Model 1: eTIV × surface holes, eTIV × CNR, CNR × surface holes; Model 2: age × sex). Feature selection was conducted with a 10-fold cross-validation [58] backward elimination procedure, retaining the model with the subset of predictors that produced the lowest predicted residual sum of squares. For each selected final model, the fit of the data was assessed using the R² coefficient of determination:

$$R^2 = 1 - \frac{\sum_i (Y_i - f_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

where the numerator is the residual sum of squares (Y is the value of the variable to be predicted and f is the predicted value), the denominator is the total sum of squares (Ȳ is the mean) and i is the index over subjects. To assess the unique contribution of each predictor, we used the lmg [...]. This metric is an R² partitioned by averaging sequential sums of squares over all orderings of the predictors. Brain figures were made using the ggseg R package [61].
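The two-step structure (Model 1 on image quality and eTIV, then Model 2 on the residuals with age and sex) can be sketched as follows. This is a simplified illustration on synthetic data, not the NOMIS implementation: it omits the log transforms, cubic terms, interactions and the cross-validated backward elimination, and the final Z score (residual divided by the residual standard deviation of Model 2) is an assumed definition. All function and variable names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_fit(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, rows):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]


def r_squared(y, fitted):
    """R^2 = 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Synthetic sample (made-up effect sizes; eTIV in litres, age in decades).
rng = random.Random(0)
n = 200
etiv = [rng.uniform(1.2, 1.8) for _ in range(n)]
cnr = [rng.uniform(1.0, 3.0) for _ in range(n)]
age = [rng.uniform(1.8, 10.0) for _ in range(n)]
sex = [float(rng.randint(0, 1)) for _ in range(n)]
volume = [4.0 + 2.0 * e + 0.3 * c - 0.25 * a + 0.2 * s + rng.gauss(0.0, 0.3)
          for e, c, a, s in zip(etiv, cnr, age, sex)]

# Model 1: image quality and head size.
x1 = [[e, c] for e, c in zip(etiv, cnr)]
b1 = ols_fit(x1, volume)
res1 = [y - f for y, f in zip(volume, predict(b1, x1))]

# Model 2: age (linear and quadratic) and sex, fitted on Model 1 residuals.
x2 = [[a, a * a, s] for a, s in zip(age, sex)]
b2 = ols_fit(x2, res1)
fit2 = predict(b2, x2)
res2 = [r - f for r, f in zip(res1, fit2)]
r2_model2 = r_squared(res1, fit2)

# Assumed Z-score scale: residual standard deviation of Model 2.
resid_sd = (sum(r * r for r in res2) / (n - len(b2))) ** 0.5


def expected_value(e, c, a, s):
    """Expected measure for a new individual: Model 1 plus Model 2 predictions."""
    return predict(b1, [[e, c]])[0] + predict(b2, [[a, a * a, s]])[0]


def z_score(observed, e, c, a, s):
    return (observed - expected_value(e, c, a, s)) / resid_sd


print(round(r2_model2, 3), round(z_score(7.0, 1.5, 2.0, 7.0, 1.0), 2))
```

Because Model 2 is fitted on the residuals of Model 1, its R² describes the share of the remaining variance explained by age and sex, not the share of the total variance of the raw measure.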
To compare the effects of each predictor, the sum of all relaimpo R² terms related to each variable was computed (i.e. including quadratic and cubic terms and half of each interaction value). For example, the variance explained by age is the sum of the R² values for age, age², age³ and half of that for age × sex. When a term was not included in a model, its R² value was set to 0.

The models were verified by examining the difference between the R² of the training sample and the R² of the independent test sample of healthy controls. It was expected that the test R² would be within 10% of the training R². Then, patterns of normality deviations were examined with the Z-score effect sizes using the validation samples of healthy individuals and of individuals with AD and SZ.

While the goal of NOMIS differs, we compared its results with two harmonization procedures, NeuroCombat [16] and NeuroHarmonize [17], on the aseg volume and DKT cortical volume and thickness measures (matrix of 146 brain measures) from the harmonization dataset (SIMON, CIMA-Q and COMPASS-ND). We used the scanner identification number as the "batch" (i.e. site) variable. For NeuroCombat, we also specified age and eTIV as covariates to preserve their effects. To compare the harmonization procedures with NOMIS, eTIV was regressed out from the brain volume measures after harmonization. Finally, to compare them on the same scale for statistical analyses on the variance and figure presentations, all measures were transformed into T and Z scores, respectively (see Supplementary Fig 2 as an example).

We had three expectations following harmonization procedures. Compared to raw data, these [...]

[...] The R² for Model 2 ranged from 0.02 to 0.51, with a mean ± sd of 0.23 ± 0.14, 0.08 ± 0.04 and 0.11 ± 0.07 for subcortical volumes, neocortical volumes and thicknesses, respectively.
One should note that the R² in Model 2 cannot be compared to that of Model 1, since the total variance in Model 2 is the variance remaining after Model 1 (residuals). The highest R² values were observed in the largest regions and ventricles (i.e. all ventricles volume 0.51, brain segmentation [...]

[...] One should note that the models for these measures appear to be slightly less generalizable than the others.
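The rule used to compare predictors (summing the R² shares of a variable's linear, quadratic and cubic terms plus half of each interaction it takes part in) can be illustrated with a short sketch. The term-naming scheme ("age^2" for powers, "age:sex" for interactions) and the R² values are hypothetical and chosen only to make the arithmetic visible.

```python
def variance_by_predictor(term_r2):
    """Aggregate per-term R^2 shares into one value per predictor.

    Power terms ("age^2") count fully toward their base variable; an
    interaction ("age:sex") is split equally between its two variables.
    """
    totals = {}
    for term, r2 in term_r2.items():
        parts = term.split(":")
        for part in parts:
            base = part.split("^")[0]
            totals[base] = totals.get(base, 0.0) + r2 / len(parts)
    return totals


# Made-up relative-importance values for a Model 2; age^3 was not selected
# in this hypothetical model, so its R^2 is set to 0.
model2_terms = {
    "age": 0.10,
    "age^2": 0.04,
    "age^3": 0.0,
    "sex": 0.03,
    "age:sex": 0.02,
}

totals = variance_by_predictor(model2_terms)
print(totals)
```

In this example, age receives 0.10 + 0.04 + 0 + 0.02/2 and sex receives 0.03 + 0.02/2, so the per-predictor totals still sum to the R² of the full model.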

Clinical validation
We validated the normative values in individuals with clinically ascertained Alzheimer's disease and schizophrenia, who showed the expected patterns of mean deviations from otherwise cognitively/behaviorally healthy individuals (Fig 9). In the Alzheimer's disease group, the mean deviations from normality covered the frontal, temporal and parietal cortices, with enlarged ventricles, but were especially pronounced in the hippocampus and entorhinal cortex. In schizophrenia, atrophy extended to nearly all of the cortex. Supplementary Fig 3 displays the variance of the scores in those two groups.

[...]

Despite these strengths, users should keep in mind that before using NOMIS, it is mandatory to verify FreeSurfer segmentations and that, while NOMIS will remove part of the variance due to head size and image quality, it will not correct for segmentation errors or image artefacts. Moreover, the normative sample, composed essentially of research volunteers in academic-led environments, was recruited using a non-probability sampling method and may not be representative of the population targeted by the user. [...] are applicable for the sites/scanners included in the analysis and not for future sites/scanners or data. This makes such post-hoc correction analysis-specific, and it needs to be conducted again each time data are removed from or added to an analysis. Such an approach can be very useful for large multi-centric studies but is not applicable for generating normative values intended to be applied to future data. It is also vulnerable to selection bias since the scaling factors are not based on the image or scanner characteristics, but on the differences in data between sites/scanners [18,19].
Thus, distinct characteristics of the participants at a given site can affect the scaling factors, and post-hoc scaling factors should be used only when the aim of a study is not vulnerable to sources of variance between sites that are unrelated to image acquisition.

We compared NOMIS values to two post-hoc harmonization procedures, namely NeuroCombat [16] and NeuroHarmonize [17]. While NOMIS globally slightly lowered the variance of the values from the same individuals originating from 12 different scanners, these two procedures performed worse than NOMIS and did not significantly reduce the true variance induced by different scanners. We also verified the effect sizes of well-established effects in MCI and AD participants, and once again the harmonization procedures were either similar to or worse than NOMIS. NeuroCombat and NeuroHarmonize systematically lowered the morphometric differences between CU, MCI and AD participants, while NOMIS lowered the entorhinal volume and thickness effect sizes and increased the hippocampal volume differences between these groups. These results suggest that caution should be exercised when using post-hoc harmonization; the use of a calibration technique (e.g. repeated scans of human volunteers as part of the study) is strongly encouraged.

Using NOMIS
The NOMIS tool is a user-friendly automated Python script, freely accessible at https://github.com/medicslab/NOMIS. Users only need to pre-process their images with [...] choose the version of the Z-score by including in the csv file only the variables that need to be adjusted for, and the script automatically selects the appropriate version of predictors. The predictive models and all statistical parameters are provided along with the script. We anticipate that this tool will be of broad interest to the neuroscientific community.

[...] contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be [...]