Harmonised segmentation of neonatal brain MRI

Deep learning based medical image segmentation has shown great potential in becoming a key part of the clinical analysis pipeline. However, many of these models rely on the assumption that the train and test data come from the same distribution. This means that such methods cannot guarantee high quality predictions when the source and target domains are dissimilar due to different acquisition protocols, or biases in patient cohorts. Recently, unsupervised domain adaptation (DA) techniques have shown great potential in alleviating this problem by minimizing the shift between the source and target distributions, without requiring the use of labelled data in the target domain. In this work, we aim to predict tissue segmentation maps on T2-weighted (T2w) magnetic resonance imaging (MRI) data of an unseen preterm-born neonatal population, which has both different acquisition parameters and population bias when compared to our training data. We achieve this by investigating two unsupervised DA techniques with the objective of finding the best solution for our problem. We compare the two methods with a baseline fully-supervised segmentation network and report our results in terms of Dice scores obtained on our ground truth test dataset. Moreover, we analyse tissue volumes and cortical thickness (CT) measures of the harmonised data on a subset of the population matched for gestational age (GA) at birth and postmenstrual age (PMA) at scan. Finally, we demonstrate the applicability of the harmonised cortical gray matter maps with an analysis comparing term and preterm-born neonates and a proof-of-principle investigation of the association between CT and a language outcome measure.


INTRODUCTION
Medical image deep learning has made incredible advances in solving a wide range of scientific problems, including tissue segmentation or image classification (Miotto et al., 2018). However, one major drawback the two cohorts matched for GA and PMA, for which we analyse tissue volumes and global and local CT 48 measures. Finally, we perform an analysis comparing term and preterm-born neonates on the harmonised 49 cortical gray matter maps and we show the importance of harmonising the data by a proof-of-principle 50 investigation of the association between cortical thickness and a language outcome measure.

52
The T 2 w MRI data used in this study was collected as part of two independent projects: the developing 53 Human Connectome Project (dHCP 1 ), and the Evaluation of Preterm Imaging (ePrime 2 ) study. The dHCP 54 data was acquired using a Philips 3T scanner and a 32-channels neonatal head coil (Hughes et al., 2017), 55 using a T 2 w turbo spin echo (TSE) sequence with parameters: repetition time T R = 12 s, echo time 56 T E = 156 ms, and overlapping slices with resolution 0.8 × 0.8 × 1.6 mm 3 . All data was motion corrected 57 (Cordero-Grande et al., 2018;Kuklisova-Murgasova et al., 2012) and resampled to an isotropic voxel size 58 of 0.5 mm 3 . The ePrime dataset was acquired with a Philips 3T system and an 8-channel phased array head 59 coil, using a T 2 w fast-spin echo (FSE) sequence with parameters: repetition time T R = 14.73 s and echo Irina Grigorescu et al.

Harmonised segmentation of neonatal brain MRI
time T E = 160 ms (Ball et al., 2017). Images were acquired with a voxel size of 0.86 × 0.86 × 2 mm, with 61 1 mm overlap. 62 Our two datasets comprise of 402 MRI scans of infants born between 23 − 42 weeks GA at birth and 63 scanned at term-equivalent age (after 37 weeks PMA) as part of the dHCP pipeline, and a dataset of 485 64 MRI scans of infants born between 23 − 33 weeks GA and scanned at term-equivalent age as part of the 65 ePrime project. Figure 1 shows their age distribution. Both datasets were pre-processed prior to being used by the deep learning algorithms. The ePrime 67 volumes were linearly upsampled to 0.5 mm 3 isotropic resolution to match the resolution of our source 68 (dHCP) dataset. Both dHCP and ePrime datasets were rigidly aligned to a common 40 weeks gestational 69 age atlas space (Schuh et al., 2018) using the MIRTK (Rueckert et al., 1999) software toolbox. Then, 70 skull-stripping was performed on all of our data using the brain masks obtained with the Draw-EM pipeline 71 for automatic brain MRI segmentation of the developing neonatal brain (Makropoulos et al., 2018). Ground 72 truth tissue segmentation maps were obtained using the same pipeline (Draw-EM) for both cohorts. 73 To train our networks, we split our datasets into 80% training, 10% validation and 10% test (see Table 1), 74 keeping the distribution of ages at scan as close to the original as possible. We used the validation sets to 75 keep track of our models' performance during training, and the test sets to report our final models' results 76 and showcase their capability to generalize.

78
To investigate the best solution for segmenting our target dataset (ePrime), we compared three 79 independently trained deep learning models: -Net (Ronneberger et al., 2015) trained on the source dataset (dHCP) only and used 81 as a baseline segmentation network (see Figure 2).

82
• Adversarial domain adaptation in the latent space. A 3D U-Net segmentation network trained 83 on source (dHCP) volumes, coupled with a discriminator trained on both source (dHCP) and target (ePrime) datasets (see Figure 3). This solution is similar to the one proposed by (Kamnitsas et al.,85 2017) where the aim was to train the segmentation network such that it becomes agnostic to the data 86 domain.

87
• Adversarial domain adaptation in the image space. Two 3D U-Nets, one acting as a generator, and 88 a second one acting as a segmentation network, coupled with a discriminator trained on both real 89 and fake ePrime volumes. The segmentation network is trained to produce tissue maps of the fake 90 ePrime-like volumes created by the generator (see Figure 4). The normalised cross correlation (NCC) 91 loss is added to the generator network to enforce image similarity between real and synthesised images, 92 a solution which was previously proposed by (Grigorescu et al., 2020).

93
To further validate the harmonised tissue maps, we trained an additional network (a 3D U-Net) to segment 94 binary cortical tissue maps into 11 cortical substructures (see Table 2) based on anatomical groupings 95 of cortical regions derived from the Draw-EM pipeline. The key reasons for training an extra network 96 are: first, we avoid the time consuming task of label propagation between our available dHCP ground 97 truth segmentations and predicted ePrime maps, and second, we can train this network using ground truth 98 cortical segmentations, and apply it on any brain cortical gray matter maps as in this case there will be no 99 intensity shift between target and source distributions.  The cortical parcellation network has the same architecture as the tissue segmentation network, but 116 outputs a 12-channel 3D volume corresponding to the following cortical substructures: frontal left, frontal where w l = 1/( n t ln ) 2 is the weight of the l th tissue type, p ln is the predicted probabilistic map of the 123 l th tissue type at voxel n, t ln is the target label map of the l th tissue type at voxel n, and M is the number 124 of tissue classes. While training, we used the Adam optimizer with its default parameters and a decaying 125 cyclical learning rate scheduler (Smith, 2015) with a base learning rate of 2 · 10 −6 and a maximum learning 126 rate of 2 · 10 −3 .   . The image space domain adaptation setup uses a generator network to produce ePrime-like T 2 w MRI images (marked with T), which are then used as input into the segmentation network. The discriminator is trained to distinguish between real (ePrime) and fake (ePrime-like) volumes, while the generator is trained to produce realistic images in order to fool the discriminator. The NCC loss enforces image similarity between real and synthesised volumes.
generalised Dice loss and an adversarial loss: where α was a hyperparameter increased linearly from 0 to 0.05 starting at epoch 20, and which remained where a signified the 147 label for fake volumes and b was the label for real volumes. The generator and the segmentation network 148 are trained together using the following loss: ]. An additional NCC loss was used between the real and the generated 150 volumes in order to constrain the generator to produce realistic looking ePrime-like images. Without the 151 additional NCC loss, the generator tends to produce synthesized images with an enlarged CSF boundary in 152 order to match the preterm-only born distribution found in the ePrime dataset, as we previously shown in 153 (Grigorescu et al., 2020).

154
These three methods were trained with and without data augmentation for 100 epochs, during which 155 we used the validation sets to inform us about our models' performance and to decide on the best 156 performing models. For data augmentation we applied: random affine transformations (with rotation angles  the harmonisation was successful. We also assessed the association between PMA and cortical thickness in 168 the two cohorts.
169 Figure 5. The results on our dHCP test dataset for all six methods. Models with non-significant differences in mean Dice Scores when compared to the baseline with augmentation method are shown above each pair. The yellow diamond highlights the model which obtained the highest mean Dice score for its respective tissue type.

dHCP test dataset 170
Baseline and domain adaptation models. Figure  Cortical parcellation network. Table 3 Table 1). For these two cohort subsamples with similar GA  197 Volumes. Figure 6 shows the tissue volumes for both the original and the predicted segmentations.   results as well as the example above suggest that the image with augmentation method was more robust. cohorts. In addition, we use the trained cortical parcellation network to produce cortical substructure masks.

237
We perform a term vs preterm analysis on the harmonised cortical gray matter maps and we show the 238 importance of harmonising the data with a proof-of-principle application setting where we investigate the 239 association between cortical thickness and a language outcome measure. preterm-born neonates) happens at roughly the same age (PMA = 38.5 weeks) as in the dHCP-only dataset.
260 Figure 10. Mean cortical thickness measures in our dHCP dataset (first column), and in both of our cohorts before (second column) and after (third column) harmonising the tissue segmentation maps. The first row plots the cortical thickness measures against GA, while the second row plots the cortical thickness measures against PMA, with individual regression lines on top.
We extended the term vs preterm analysis on cortical thickness substructures. Figure 11 shows the results

261
of applying a linear model regressing mean cortical thickness measures on PMA, GA, sex, birth weight and prematurity, where significant differences (p < 0.05) between the two cohorts (term and preterm-born 263 neonates) are highlighted in the image.
264 Figure 11. Comparison of cortical thickness measures for the whole cortex and for each of the 11 cortical subregions between term and preterm-born neonates. The results of the linear regression are reported in the table in terms of differences between term and preterm-born neonates.
Behavioural outcome association. As a final proof-of-principle, we demonstrate the importance of data p < 0.05), suggesting that greater frontal cortical thickness at term-equivalent age is associated with 278 reduced language abilities at toddler age.

279
However, as can be seen in Figure 12, this is likely a spurious effect due to (artefactually) heightened

306
Using the cortical parcellation network, we also produced cortical thickness measures for the 11 cortical 307 subregions (see Table 2). Again, the models trained with augmentation performed better than their 308 no augmentation counterparts (see Figure 8). However, our proposed image with augmentation model 309 performed best, whereby ePrime values, tending towards higher values before harmonisation, were brought 310 downwards into a comparable range of values to dHCP, for 10 out of 11 cortical subregions (see Figure 8 311 last column). For the right parietal lobe, our proposed method outperformed the original segmentations 312 and the other 5 models, but did not manage to bring the values down to a non-significant range. One 313 potential reason for this is that, on a visual insepction, the ePrime cohort appears to suffer from more partial 314 volume artifacts than its dHCP counterpart, which can confuse the segmentation network and can lead to 315 overestimation of the cortical gray matter / cerebrospinal fluid boundary. Moreover, a close inspection of 316 the predicted tissue segmentation maps (see Figure 9) also showed that our proposed model (image with 317 augmentation) corrected misclassified voxels which were prevalent in the other 3 methods.
cortical thickness measures between term and preterm-born neonates. We showed in Figure 11 that our 320 harmonised cortical gray matter maps resulted in global thickness measures which were comparable with 321 the dHCP-only neonates, while also revealing a significant effect of GA and the interaction between age at 322 scan and at birth. We performed a similar analysis on the local cortical thickness measures and highlighted 323 three regions of interest (frontal left, frontal right, and parietal left) which showed significant differences 324 between the two cohorts (see Figure 11). These regions are consistent with previous studies (Nagy et al., showed no significant effect of cortical thickness (second column of Figure 12), consistent with the ground-331 truth results seen in each cohort individually. This analysis demonstrates that without data harmonisation, 332 pooling images from separate datasets can lead to spurious findings that are driven by systematic differences 333 in acquisitions rather than by true underlying effects. Our harmonisation allows for our two datasets to 334 be combined into joint analyses while preserving the underlying structure of associations with real-world 335 outcomes.

336
Our study was focused on unsupervised domain adaptation approaches; in future we would like to 337 investigate semi-supervised approaches as well by including reliable segmentations of the ePrime cohort.

338
Moreover, the latent based domain adaptation method was trained using the features at each layer of the 339 decoding branch, without analysing different combinations of the encoding-decoding layers. In future, we 340 aim to extend our work to harmonise diffusion datasets.

CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial 342 relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS
I.G. prepared the manuscript, implemented the code for the domain adaptation models and the analysis. L.V.

344
participated in the implementation of the analysis code, the study design and interpretation of the results.  are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

ACKNOWLEDGMENTS
We thank everyone who was involved in acquisition and analysis of the datasets. We thank all participants 360 and their families. The views expressed are those of the authors and not necessarily those of the NHS, the 361 NIHR or the Department of Health. This paper is an extension of our previous work (Grigorescu et al., .

SUPPLEMENTAL DATA DATA AVAILABILITY STATEMENT
The dHCP datasets analyzed for this study will become available after the public release of the dHCP data.

364
The code developed for this study will become available online after publication of the article.