High resolution nonlinear registration with simultaneous modelling of intensities

This paper describes and evaluates FMRIB’s nonlinear image registration tool (FNIRT), that is part of the FMRIB software library (FSL). It is a small deformation framework using sum of squared differences (SSD) as its cost function and Gauss-Newton for minimisation. The framework uses a joint shape and intensity model that attempts to explain the observed differences between two images in terms of having different shape and/or contrast, being differently affected by intensity bias-fields etc. Thus the estimation of the warps will be relatively unaffected by intensity differences that would otherwise violate the assumptions behind the SSD cost function. It uses a projection onto a manifold defined by a specified range of allowed Jacobian determinants to ensure that the warps are diffeomorphic. The utility of the model is demonstrated on a variety of simulated and experimental data with good results. FNIRT is also quantitatively evaluated using previously published datasets consisting of scans from multiple subjects, all with anatomically defined brain regions that are manually outlined. In this evaluation FNIRT performs well in comparison to previously published results with other registration algorithms.

Introduction information on the uncertainty of the estimated parameters (deformation fields).
In what follows we describe a framework where differences in shape and intensity are 97 modelled simultaneously, thereby preventing the latter to influence or be interpreted as 98 the former. This is done by introducing additional parameters that model intensity 99 differences and by inferring on these at the same time as the parameters for shape. A 100 series of intensity models will be introduced and demonstrated and the context in which 101 each one is suitable is described. Within this framework pairs of images with differing 102 contrast or differentially affected by bias can be registered by minimisation of the 103 residual error, i.e. the SSD. This enables us to use Gauss-Newton type optimisation to 104 find the parameters in an efficient way. We further suggest a multi-stage process where 105 shape and intensity are jointly modelled up to certain warp resolution after which the The registration is performed by warping one image (volume) to another. We will refer 113 to the image we want to warp as the "input" image and the stationary image (the 114 image we want to warp it to) as the "reference" image. The reference image will be 115 denoted by f when referred to as a whole and can interchangeably be an lmn × 1 116 column vector or an l × m × n volume. When referring to a single (scalar) value at a 117 specified index (or voxel coordinate) i = [i j k] we will use f i . The collection of all 118 indicies [i j k] of f is denoted by I so an alternative notation for f would be f I . The 119 warps (the nonlinear transformation) will be parametrised (the details of which can be 120 found in Appendix S1) by a parameter vector w. The input image will be denoted by g 121 when referred to as a whole in its original form. We use g i (w) to denote the value of g 122 at the index i (as defined in the space of f ) when warped by the parameters w. when applying the transform given by w. The functions g i (w) and g(w) are predicated 125 on a specific interpolation model and they would for example be subtly different when 126 spline interpolation is used compared to when using tri-linear interpolation. 127 Registration as a nonlinear optimisation problem 128 Our optimisation problem can formally be described as i.e. given g and f , find the set of parameters w that yield the minimum value for the Since h(w) is an approximation of O(w) it is likely that 145 ∇h(w 0 + ∆w) = ∇O(w 0 + ∆w), especially if ∆w is large (which implies that we are a 146 long way away from w 0 around which the approximation is valid). Therefore, to find 147 the point where ∇O truly is 0 one uses an iterative scheme of updates to w where This is the update rule for Newton's method for nonlinear optimisation, and the 149 theoretical foundation for a multitude of derived methods. 150 From this point on we will simplify the notation by removing the | wi whenever it is 151 obvious from the context at which point the pertinent entity has been calculated.

152
The Gauss-Newton approximation 153 There is a modification of Newton's method, known as the Gauss-Newton method, that 154 where a line-search is performed along each direction p i .

174
The Scaled Conjugate Gradient (SCG) method [36] is a further development that 175 calculates also a step-length along p k , eliminating the need for a line-search. 176 Analogously to the Levenberg-Marquardt algorithm it adjusts the step-length based on 177 the outcome of the previous iteration.

178
When is it an advantage to actually know/calculate H? The sum of squared differences cost function 216 The sum of squared differences (SSD) cost function is predicated on the model which gives a likelihood that is maximised when is minimised. To minimise this with the Gauss-Newton method we need to be able to 219 calculate ∇O at any point w in parameter space. If we assume that the displacements 220 are modelled by some basis-set parametrised by w then ∇O is a vector of the same size 221 May 14, 2019 9/44 as w where the jth element is given by In vector-matrix notation ∇O can be written as where J is the Jacobian matrix of g(w), i.e. the matrix whose ijth element is ∂g i /∂w j . 224 Using the same notation we can express the Gauss-Newton approximation of the 225 Hessian as The details of how to calculate J depends on the basis-set that is used to represent the 227 warps and will be relegated to Appendix S1.

228
Simultaneous modelling of shape and intensities

229
The SSD has the advantage that it makes it straightforward to calculate ∇O and H, 230 but it has the disadvantage that the underlying assumption that f and g are identical 231 save for some difference in shape is almost always wrong. The assumptions of other cost 232 functions, like e.g. Mutual Information, on the other hand are much more likely to be 233 fulfilled. The disadvantage of those is that the elements of ∇O and H will have to be 234 calculated numerically making the former very time consuming and the latter in 235 practice infeasible.

236
The main contribution of this paper is to suggest a cost function that retains the implemented a hierarchy of such intensity models where the "lower level" models are 243 special cases of those higher on the hierarchy. These are: As indicated by the name this implies no intensity modelling at all, i.e.
This may be useful for registration of "quantitative" images such as e.g. Fractional

248
Global linear scaling 249 Here where B i (b) refers to the value of some scalar field B(b) at the location given by i. The 256 mapping b → B is in principle arbitrary but will in our case be a set of continuous 257 basis-functions with b as the coefficients. This is useful when f and g have similar 258 contrast but are affected by RF inhomogeneity in different ways.

259
Global nonlinear scaling 260 A polynomial of order n − 1 is used to model the intensities so that This can, within reason, model differences in amount of T1/T2-weighting between the 262 two images.
This is a combination of the two preceding models where i.e. has been divided into a local and a global part. This models 266 both differences in T1/T2-weighting and a bias field caused by differences in RF 267 inhomogeneity.

268
Local nonlinear scaling 269 This model encompasses all the previous ones and models where b has now been divided into n sub-vectors, i. the intensities in f and g) for different voxels.

275
It may seem superfluous with more than one model given that the model given by permissive". As we will see later the opposite is also true, i.e. if the intensity model fails 281 to explain the "true" intensity differences it will be modelled as geometric differences 282 leading to non-sensical warps.

283
It should further be noted that even though the b → B mapping (in Eqs 14, 16 and 284 17) is in principle arbitrary it has to be such that the resulting field B is smooth 285 compared to the resolution of (the image) f .
When one simultaneously models shape and intensity there are two sets of parameters; 288 w which parametrises the geometric warps and b which models intensity differences.

289
Therefore ∇O changes so that where dO/dw is a columnvector whose ith element is ∂O/∂w i and dO/db is one whose 291 ith element is ∂O/∂b i and H changes so that where J w is the Jacobian of the w → g mapping g(w) and J b is the Jacobian of the 293 b → E mapping E(b; f ). In Appendix S1 we derive expressions for ∇O and H for the 294 different intensity models.

295
It should also be noted that the magnitude of the elements of dO/dw and dO/dp 296 are very different, and hence J T Gauss-Newton-based algorithm, with "sufficient accuracy" together with a 313 low/medium-resolution set of warps (w). These (b) could then be used as constants in 314 a subsequent estimation of w at higher resolution. The elements of w are now the only 315 parameters and, with a suitable representation of the warps, this means that H has a 316 structure that is "close to" spherical and that it is feasible to use e.g. the SCG 317 algorithm described above to find the warps. deciding which of those is the "best". This is obtained by combining the SSD cost 325 function with a regularisation term that is a scalar function S(w), which has a higher 326 value for "less likely" fields, i.e. one that penalises non-smooth warp fields. This is 327 equivalent to using a Normal prior on w so that w ∼ N (0, λ −1 C −1 ) where 328 S(w) = λw T Cw. Hence, the cost function becomes where σ 2 is the sample variance and λ is an arbitrary factor that determines the relative 330 weight of the regularisation. In the present paper σ 2 could have been lumped into λ 331 since it is an empirically determined "fudge-factor". It is, however, still convenient to 332 keep as a separate factor since it facilitates empirically determined values for λ that are 333 valid over a range of different σ 2 (i.e. different SNR). It also means that at earlier 334 iterations when the estimated σ 2 is large the regularisation is given more weight which 335 helps avoiding local minima.
where S(b) is a function of the same form as S(w) operating on one (Eqs 14 or 16) or 340 more (Eq 17) fields. As implemented in FNIRT S(w) and S(b) can be either 341 "membrane energy" or "bending energy" (see e.g. [11]), though all experiments in this 342 paper was performed using "bending energy". (and is typically ignored). The regularisation helps, but does not ensure that this 352 condition is fulfilled. By giving the regularisation a large weight one can be almost 353 guaranteed a diffeomorphic mapping, but that will, on the other hand, mean that the 354 warped image (g(w)) will still be quite dissimilar to f . We will have what is commonly 355 known as a "small deformation transform".

356
Projection onto a diffeomorphic surface 357 There are methods that (almost) guarantee diffeomorphic mappings by virtue of the 358 way in which they combine many small (diffeomorphic) deformations (see 359 e.g. [15], [16], [18] or [35]). In the present paper we opted for a different solution, as 360 proposed by [37]. We allow the field to become non-diffeomorphic for a limited number 361 of iterations after which it is projected onto the "closest" diffeomorphic field. The 362 partial derivatives of the fields (i.e. ∂φ x /∂x, ∂φ x /∂y,..., ∂φ z /∂z) are explicitly It has a command-line interface and is, in addition, an integral part of FEAT, FSL-VBM, 377 TBSS and FDT. Some implementation details are given in Appendix S1. Here we will 378 limit the description to say that there is a choice between the Levenberg-Marquardt

379
(LM) and the Scaled Conjugate Gradient (SCG) methods for minimisation. The former 380 explicitly calculates and uses the Hessian (J T J) and is hence quite memory hungry for 381 high resolution (many elements in w). The latter does not explicitly use the Hessian 382 and therefore needs less memory. The flip-side is that the SCG method cannot be used 383 for simultaneous estimation of warp (w) and intensity (b) parameters since the implicit 384 approximation of H is too poor in that case. FNIRT is able to take w and b as input, 385 providing an initial guess for both. Therefore our strategy for achieving high resolution 386 warps is to run FNIRT to a medium resolution ( 8-10mm knot spacing) with 387 simultaneous estimation of intensity parameters b. A second run is then performed 388 using the warp and intensity parameters from the first run as input, refining the former 389 by going to a higher warp resolution but keeping the latter fixed.

Demonstration of intensity models 392
A set of simple simulations was performed to demonstrate the utility of the different 393 intensity models. The "phantom" consisted of a set of geometric shapes: spheres, 394 circular disks and a circular bowl. These were combined in such a way that when cut at 395 a certain level in the z-direction it would resemble either a smiley or grumpy face when 396 viewed in the xy-plane. These shapes were "imaged" using a 64 × 64 × 64 matrix with a 397 1 × 1 × 1mm voxel size. Intensities were assigned to the face, eyes and mouth to mimic 398 those of white and grey matter and CSF respectively. To ensure that we assigned 399 "realistic" intensities to the different tissue types, and also to introduce realistic intensity 400 differences due to sequence differences and variable flip-angle, we used the signal 401 equations for an MPRAGE sequence as outlined in [33]. for 1.5T according to [38]). The second grumpy face was created using these parameters 411 with homogeneous B 1− (receive B 1 field) and B 1+ (transmit B 1 field) fields. An 412 inhomogeneous field was created as a quadratic spline field with a knot spacing of 413 20mm where selected coefficients were set to 0.5 while the remaining ones were kept a 1. 414 Quadratic splines were used so that the field should not be created with exactly the 415 same basis as we use to model it (though it is admittedly very similar). This field was 416 applied as a B 1− field to the third grumpy face (by element-wise multiplication) and as 417 a B 1+ field to the fourth grumpy face (by letting it modulate α in the signal equations 418 in [33]). In all, this gives one smiley face to use as the target for all registrations and 419 four grumpy faces whose intensities are successively more different from that of the The face consists of "white matter", the eyes of "gray matter" and the mouth of "CSF". The data was simulated with parameters as described in the main text. In short the two leftmost phantoms were simulated with parameters as used at the FMRIB and the three rightmost with parameters as described in [33]. The middle images is simulated with homogeneous transmit and receive B 1 fields, the second from right is simulated with an inhomogeneous receive B 1 and the rightmost with an inhomogeneous B 1 transmit field (leading to a spatially varying flip-angle). The considerably larger head of this subject (and also slight, unintentional, rotation of 455 the head around the z-axis) resulted in a different bias field where the intensities were 456 highest in the lower-left and the upper-right quadrants (as seen in an axial slice) of the 457 brain. This was a consequence of these two parts being closest to the coil-array. F was 458 registered to M using a configuration file for T1 to T1 registration. The registration was 459 run to a knot spacing of 10mm and to a knot spacing of 2mm, both with intensity 460 modelling given by Eq 16 and by Eq 13. The registration was also reversed so that M 461 was registered to F using the same parameters as above. As in the case above warps 462 were allowed to become non-diffeomorphic.  Other scanning parameters (TI, TE and TR) were not completely consistent across the 481 group, but were in the same general domain as described for the female volunteer above. 482 All scans were affected by B 1− bias fields with gradients primarily in the y-and 483 z-directions ranging from moderate (a factor of 1.5 across the brain) to severe (a factor 484 of 3 across the brain). Generation of the template was started by registering all subjects 485 to the MNI152 template after which the initial a group template was created as the 486 average of these initial registered images. This template was refined by four more field. All scans were first registered, with a 5mm knot spacing, to the MNI Macaque atlas [41] after which an initial population mean was created. The scans were 507 re-registered to the population mean using the same registration parameters as for the 508 registration to the MNI space. Following this five more iterations were performed of 509 registering all scans to the population mean and re-generating the population mean. At 510 each new iteration the knot spacing was decreased and/or the regularisation was 511 decreased until the final iteration was run with a 1mm knot spacing. One such set of 512 iterations was performed using Eq 13 ("standard" scaled sum-of-squared differences) 513 and one set using Eq 16 (global nonlinear intensity mapping with a multiplicative bias 514 field) for intensity mapping. the Relative Overlap Metric (or the Jaccard coefficient) as suggested in [42] was 534 calculated (see definition in Appendix S1) and reported for each region. In addition we 535 computed a modified version of the Inverse Consistency Measure also suggested by [42]. 536 It was modified to use the norm, rather than the square of the norm, of the differences 537 between the forward and backward mappings thus yielding a measure in mm that is 538 more intuitive to interpret. The details of the modification are described in Appendix 539 S2.

540
The regions appear to be very carefully defined and compared to, for example, the 541 LPBA40 dataset the regions are more narrow, being restricted to gray matter voxels, 542 yielding relatively large surface areas to volume ratios which should be more difficult to 543 register and hence better discriminate between different registration algorithms. to what [44] reported for an early beta version. We have relegated most of the methods 547 and results of this to Appendix S4. In short, in [44] the information contained in the 548 pial surface of the cortex was not used by FNIRT in contrast to all the other methods, 549 which used this information (see footnote 4 on page 799 of the [44] paper). Therefore we 550 re-ran the experiments in [44] on the data sets LPBA40 [45] and MGH10 [44] with 551 identical settings except for now using the information from the pial surface of the 552 cortex. In addition, we ran it to a 1mm knot spacing, a resolution that was not possible 553 with the beta release that was evaluated in [44].

555
Demonstration of intensity models 556 The results of the registrations of the simulated phantoms are summarised in compare the two leftmost images to the two rightmost in either of the lower two rows) 575 where vast misregistration results from not modelling the bias field.  Registration of a single subject to the MNI152 template  The upper left panel shows the MNI152 standard brain. The lower left panel shows the brain of a healthy female volunteer affinely registered to the MNI152 brain. Note that there is an appreciable intensity difference between the frontal and occipital parts of the brain. The lower middle panel shows the same volunteer after FNIRT registration with a 2mm knot spacing using the intensity model given by Eq 16. The image in the lower right panel has been registered to the same resolution (2mm) but using a simpler intensity model (global linear scaling). Note that the intensity model has no been used to "correct" the images, which are shown with their original intensities to demonstrate the level of inhomogeneity. The upper right panel shows the distribution of Jacobian determinants (in log-lin scale) for the intensity model given by Eqs 16 (blue) and 13 (red). N.B. that in order to demonstrate the effect of the intensity modelling the projection onto a diffeomorphic surface was turned off for this registration, and that normally there would be no Jacobian values outside the prescribed range.

Registration of one subject to another 584
The subject→subject registration demonstrates (Fig 4) a similar behaviour with respect 585 to intensity modelling as did the subject→MNI152 registration. The presence of a bias 586 field severely disrupts the registration leading to superfluous warps when not explicitly 587 modelled. In contrast the warps look sensible and the range of Jacobians is much 588 reduced (reducing the number of voxels with a Jacobian determinant≤ 0 by 50%) when 589 modelling a nonlinear global intensity relationship and a multiplicative bias field (Eq 590 16).

591
For the subject↔subject registration we also demonstrate the non-symmetrical 592 results that FNIRT yields by showing subject F transformed to the space of subject M 593 using the warps calculated from an F→M registration (Fig 5, middle column) and also 594 using the inverse of the warps obtained from an M→F registration (see Appendix S1 for 595 a description of how we calculated the inverse). It is easily seen that the two solutions 596 are slightly different, though surprisingly it is not obvious which of the two is "better". 597  appear reasonable to a visual inspection. It should also be noted that this subject is in 606 no way "selected", but was obtained by asking around the lab for "the most atrophied 607 subject from your study". Our desire to stay with this "random" selection meant that it 608 was included even though it has quite pronounced intensity changes from small vessel

Generation of a group template 613
The group template started drifting and did not converge when using linear scaling. In 614 contrast it converged to a plausible group average when modelling the bias field (using 615 Eq 16), and this was true regardless of if we used the bias corrected images for 616 re-calculating the average or not.
warped images from three of the subjects constituting this average. As can be seen, the 619 visual agreement is very good. 620 Fig 7. On the left is the group average with a red square indicating the part that is shown in the subsequent panels. The first of those shows again the group average, zoomed in to concentrate on a part containing ACC, Caudate, Insula and prefrontal cortex. This area was chosen as it contains the relatively "easy" structures Caudate and insular cortex, but also structures of intermediate and high difficulty such as ACC and PFC respectively. The rightmost three panels show the warped scans of three of the 57 subjects constituting the average. These were the three first subjects in alphabetical order and have not been selected as being particularly well registered. It should also be noted that the dark streak across the white matter between the Caudate and the ACC in the third subject does not imply a non-diffeomorphic warp, but is clearly visible in the original scan of this subject.
In When using linear scaling (Eq 13) the registered images were severely distorted and the 626 population average did not converge. This was a consequence both of the severe B 1− 627 field and also of the hyper-intense vessels that were present in all the scans. In contrast, 628 when using nonlinear scaling with a multiplicative bias field (Eq 16) the registration was 629 successful and the population mean converged to a very sharp image (Fig 9). The 630 success of the registration can also be seen from the individual images after registration 631 (Fig 9). One can deduce the approximate placement of the four coils simply by looking at this image. The second panel from the left in the top row shows a partial axial slice of the population average of the six monkeys after registration. The remaining panels show the same area of the six individual scans after registration. It can be seen that even with a limited FOV within a slice the intensity variations are severe. In each of the displayed sections there are gray matter voxels with higher intensity than some white matter voxels, and in four of them there are gray matter voxels with intensity more than twice as high as some white matter voxels. Despite this it can be seen that the registration is good across the six subjects. Jaccard coefficient (Relative Overlap Metric) for the 32 regions of the NIREP database, averaged across left and right sides, for affine (red) and nonlinear registration (green: 10mm knot spacing, blue: 1mm knot spacing). The line in the boxes denotes the median, the boxes extend from the 25th to 75th percentile, the whiskers indicate the total range of "non-outlier" data and outliers (approximately defined as outside ±2.7σ) are represented as red crosses. Each box represents 240 values (registrations). It can be seen that for all regions the nonlinear registration is a considerable improvement over the linear, ranging from 50% relative improvement for less convoluted regions with thick cortex (e.g. Insula and Parahippocampal gyri) to 100% for more difficult regions (e.g. pre-and postcentral gyri). There is also a consistent, albeit smaller, improvement when increasing the warp resolution from 10mm to 1mm.

Demonstration of projection onto a diffeomorphic manifold
registration. It should be kept in mind though that performing no registration at all 666 would result in a perfect consistency. It is interesting to note that the M ICE actually 667 decreases (for most regions) when going from 10 to 1mm knot spacing. This, together 668 with the higher overlap scores, is a strong indication that the increased resolution really 669 results in a more "correct" registration. The voxel wise M ICE averaged across all 670 registrations is shown in Fig 14. Note how the M ICE is largest in the parietal cortex. 671 This is consistent with the subjective impression that the variability in location (and 672 even existence) of sulci is greater here than in the rest of the brain. It can also be seen that for the "easier" regions such as the Insula and Parahippocampal regions both the errors and the difference between linear and nonlinear registration is small, while it is large for "difficult" regions. However, the pattern is not "just" the opposite of that for the overlap criteria in that a "difficult" area such as the pre-or post-central gyrus has a smaller M ICE than e.g. the Superior or Inferior parietal lobules. Note also that the M ICE tends to be smaller for registration with 1mm knot spacing than with 10mm knot spacing. ART, SyN and JRD-fluid refer to the metods described in [47], [29] and [48] respectively. 683 FNIRT-HR and FNIRT-LR refer to FNIRT run to a knot spacing of 2mm and 10mm 684 respectively. For the MGH10 data the top methods were ranked SyN/FNIRT-HR and 685 ART/FNIRT-LR with no other method achieving a rank among the top four.

686
The full set of results is presented in Appendix S4. it is necessary to make the distinction between the effects of a B 1− and a B 1+ field (i.e. 697 if it is generally useful to use Eq 17 instead of Eq 16) and as of present the 698 recommendation is to use Eq 16. More experience of data from very high field scanners 699 (where B 1+ inhomogeneity is a greater problem) will allow us to investigate this in more 700 detail in the future.

701
Our evaluations indicate that the simultaneous modelling renders the method very 702 robust to the severe bias fields resulting from "sum-of-squares" reconstruction of data 703 acquired with coil arrays (as opposed to bird cage coils that result in more homogeneous 704 images). This is an increasingly important property of a registration algorithm as coil pre-processing such as histogram equalisation [49] and bias correction (e.g. [50] it would be easy to find the intensity mapping. Hence none of the problems can be said 712 to precede the other in any meaningful sense and they benefit from being addressed 713 simultaneously.

714
Comparison to other methods

715
The published strategy that is most similar to ours is the combined 716 segmentation-registration framework in SPM [32]. Similar to our model, theirs includes 717 an explicit representation and modelling of a bias field. In the segmentation model each 718 brain tissue class (gray and white matter and CSF), and some classes intended to the other hand the segmentation does potentially discard useful information from 725 differential intensities within a given tissue class that FNIRT is able to utilize. For 726 example subcortical gray matter structures will have a different T1 compared to cortical 727 gray matter [38], and the central sulcus has been implicated as having a different T1 728 than other cortical areas [51]. Additionally, some subcortical areas have structure that 729 can potentially be used to guide registration (see figure 8) but that is unused in a joint 730 segmentation-registration framework. It will also be a disadvantage to have to perform 731 two registrations to a set of tissue priors when registering two scans of the same subject 732 in longitudinal studies. A consequence of the choice of basis-function in [32] is that they 733 are limited to a rather crude warp resolution ( 20mm), but when a higher resolution is 734 called for they augment that with a DARTEL registration [18] of gray and white matter 735 segments. This is similar to our strategy of fixing the intensity model for the final 736 resolution steps when estimating the warps.
"global" cost function has been rendered local by sampling it locally. Both these groups 739 show promising results, and the latter has additionally performed very well in a recent 740 comparison of nonlinear image registration methods (see [44] and section ) though the 741 specific problem of locally varying intensities (bias field) was not addressed in that 742 study.

743
Quantitative evaluation: The NIREP data 744 The main quantitative evaluation was performed on the NIREP data [42], though we 745 did also use the LPBA40 [45] and MGH10 [44] data. The regions constituting the 746 NIREP database appear (to a visual inspection) to be very carefully defined and follow 747 both the pial cortical and the white matter surface faithfully. It is our subjective 748 impression that it is superior to other databases of the same type (e.g. LPBA40 [45] or 749 MGH10 [44]) which is the reason it was our main choice for the quantitative evaluation. 750 There is a limited amount of reported results using the NIREP database, but our results 751 compare favourably to those we have found (i.e. [52]). In the main body of the paper we 752 graphically present some results for the Jaccard coefficient (Relative Overlap in [42]) 753 and for our modified ICE index. In the Appendix S3 we also present tabulated values 754 for overlaps and consistency error that might be useful for comparison to future 755 studies/methods.

756
Limitations of datasets for quantitative evaluation 757 These datasets (NIREP, LPBA40 and MGH10) had all been highly "massaged" and 758 were all bias field corrected (though this is not explicitly stated for the NIREP data) 759 and skull-stripped. This means that data that is used for evaluation of registration is that the intensity in these parts can vary considerably between scans from different 769 scanners and/or sequences. This is especially problematic for the meninges since that 770 intensity is right next to the pial surface of the brain. In different T1-weighted scans we 771 have seen intensities in this tissue ranging from much lower than in gray matter to 772 roughly equal to gray matter to higher than gray matter. This causes problems when

779
This means that quantitative results from data like these will reflect a best case 780 scenario and that both overlap and inverse consistency are worse in the "normal" 781 clinical or scientific laboratory setting. It also means that when comparing methods, 782 that aspect of how they deal with these kind of real world problems is not addressed.

783
What resolution warps should FNIRT be run at? to produce large deformations without incurring a non-invertable field. Another 826 difference between the two strategies is that in "diffeomorphic by construction" 827 methods regularisation is typically applied to each update instead of to the resulting field. This means that one does not "build up tensions" in the warps which can therefore grow to any magnitude without incurring any cost. In contrast, small 830 deformation methods typically regularise the warps themselves, using that as a means 831 to prevent non-invertible warps but at the same time limiting the warps that can be 832 accommodated.

833
Despite these purported limitations we opted to develop our method within a small 834 deformation framework. The reason for this was our desire to include a simultaneous 835 intensity model that would allow us to address imperfections in real data such as bias 836 fields and the difficulties in estimation the gradient and Hessian of such a model within 837 a "diffeomorphic by construction" framework. So, in order to attempt to overcome the 838 disadvantages of the small deformation model we implemented a multi-resolution, 839 multi-regularisation scheme with projection onto a diffeomorphic warp after every few 840 iterations.

841
In the present paper we demonstrate that this has yielded a method that achieves volunteers. Our results indicate that for registration within these kinds of populations a 851 small deformation framework along with careful implementation and projection onto a 852 diffeomorphic manifold is sufficient to obtain good results. It is possible that for more 853 difficult applications, e.g. for registering severely atrophied to healthy brains, or 854 monkeys to humans [16], diffeomorphic by construction algorithms may have an construction" methods, such as the patch-to-a-C example [9] or moving a control point 860 through a line/plane defined by other control points [16], are somewhat contrived and 861 not necessarily representative of anything one would ever encounter while registering 862 scans of brains.

863
Inverting and composing warps 864 When calculating a diffeomorphic warp in an Eulerian framework it is trivial to 865 calculate also its inverse and, possibly more important, one can also calculate the 866 gradient and Hessian of the cost function for both forward and backward warps [18].

867
This means that it is feasible to estimate exactly inverse consistent warps (e.g. [18] 868 and [29]), something that would not be possible in a small deformation framework. This 869 is a potentially important point since the constraint imposed by inverse consistency 870 when warping one scan to another is a highly reasonable and unequivocal one, unlike, 871 for example, the regularisation model that is arbitrary (i.e. membrane energy, bending 872 energy or linear elastic energy) and chosen more for computational convenience than for 873 empirical reasons.

874
It is, however, feasible to calculate a good approximation the inverse of warps 875 obtained with a small deformation framework [54], and it can be done in a few tens of 876 seconds (see Appendix S1). Hence, the implicit assumption in [18] that the inverse Our re-evaluation of the data used in [44] confirmed our hypothesis that FNIRT would 881 perform better if using the information in the pial edge of the cortex. All the evaluations 882 in [44] were performed on skull stripped data. Skull stripping introduces an artificial 883 sharp edge of non-zero to zero voxels where the skull stripping algorithm/observer 884 locates the cut, which means that the accuracy of a subsequent registration will depend 885 crucially on the accuracy of the skull stripping. For this reason we usually recommend 886 that FNIRT is run on non-skull stripped (original) data unless one has a very good 887 reason to believe that the skull stripping has performed perfectly. If, for some reason, only skull stripped data is available we usually recommend that one runs FNIRT with a 889 switch that tells it to regard zero values as missing data, thus ignoring the artificial edge 890 (as long as one is not 100% confident in the accuracy of the stripping). For these 891 reasons we initially recommended these settings (ignoring zeros) for the evaluations 892 performed by [44], unlike all the other methods that used the artificial edge.

893
What we did not consider at the time was that it didn't matter if the brains were 894 poorly extracted or not since the extraction was based on the union of all manually 895 labeled regions. This means that if one wants to perform well in terms of aligning these 896 regions one should definitely use the information in the artificial edge since it pertains 897 directly to the measure that is used for the evaluation. Unfortunately this creates a 898 strong dependence between the "data" (the skull stripped scans) and the "ground truth" 899 (the manually defined labels). These should of course ideally be independent.

900
Paradoxically this means that the it would be more "valuable" to use the information 901 contained in a brain↔zero junction than to use any other internal junction regardless of 902 how "objectively wrong" the former is. wishing to confirm our findings.

909
It should be noted that FNIRT was not the only method that was a beta release at 910 the time and it is conceivable that some of the other methods would also perform better 911 if their finished, released, versions were used (see e.g. [35] for a re-evaluation of 912 DARTEL).

914
We have described the implementation of a method/framework for small-displacement 915 nonlinear registration of brain MR images. The results look promising and on "ideal" 916 data compare favorably to the methods included in [44]. However, we believe that the 917 main advantage of the presented method is on the "imperfect" data commonly encountered in a clinical or neuroscience setting, especially with respect to bias fields 919 commonly seen in modern coil array acquisitions.

920
Future work will focus on a multivariate framework where the intensity modelling is 921 extended to a general model-driven transfer function.

923
Appendix S1 Additional information about the method. Additional re-analysis of two of the datasets used in [44] (the MGH10 and LPB40 data sets) using 932 different parameters for FNIRT.