Abstract
Previous studies have indicated that white matter hyperintensities (WMH) may evolve, i.e., shrink, grow, or stay stable, over a period of time. However, predicting the evolution of WMH is challenging because the rate and direction of WMH evolution vary greatly across studies. Evolution of WMH also has a non-deterministic nature because some clinical factors that possibly influence it are still not known. In this study, we attempt to predict the evolution of WMH from a baseline image while addressing the non-deterministic nature of this process by proposing an end-to-end training model that uses deep learning and an auxiliary input module. We name this proposed model “Disease Evolution Predictor” (DEP). The DEP model receives a baseline image as input and produces a map called “Disease Evolution Map” (DEM), which represents the evolution of WMH from baseline to follow-up. Two DEP models are proposed, i.e., DEP-UResNet and DEP-GAN, which represent supervised and unsupervised deep learning algorithms respectively. DEP-UResNet uses baseline T2-FLAIR as its main input while DEP-GAN uses either a baseline irregularity map (IM) or probability map (PM) instead. The generated DEM can be compared with the real DEM, which is obtained by a simple subtraction between the baseline and follow-up assessments. In this study, labels of WMH manually produced by an expert are used as the reference standard for evaluation. To simulate the non-deterministic and unknown parameters involved in WMH evolution, we propose modulating the DEP model with a Gaussian noise array as auxiliary input. This forces the DEP model to imitate a wider spectrum of alternatives in the results. Alternative auxiliary inputs, such as baseline WMH and stroke lesion loads, were also tested. Note that baseline WMH load is the strongest known risk factor of WMH evolution. 
Based on our experiments, DEP-GAN using PM and Gaussian noise as auxiliary input yielded one of the best results in almost all evaluations, including clinical plausibility. The DEP-UResNet regularly performed better than the DEP-GAN using PM and Gaussian noise in some evaluations, but eventually it did not show promise in the clinical evaluation. Moreover, supervised DEP-UResNet also requires manual WMH labels on two MRI scans for training, which, in many scenarios, is not applicable. To the best of our knowledge, this is the first extensive study on modelling WMH evolution using deep learning algorithms and dealing with the non-deterministic nature of WMH evolution.
1. Introduction
White matter hyperintensities (WMH) are neuroradiological features seen in T2-weighted and fluid attenuated inversion recovery (T2-FLAIR) brain magnetic resonance images (MRI). Clinically, WMH have been commonly associated with stroke, ageing, and dementia progression (Wardlaw et al., 2013; Prins and Scheltens, 2015). Furthermore, recent studies have shown that WMH may decrease (i.e., shrink/regress), stay unchanged (i.e., stable), or increase (i.e., grow/progress) over a period of time (Ramirez et al., 2016; Chappell et al., 2017; Wardlaw et al., 2017). In this study, we refer to these changes as “evolution of WMH”.
In early studies, evolution of WMH and its severity were presumed to be linearly progressing over time with age (Veldink et al., 1998; Schmidt et al., 2003) due to lack of data from more than one follow-up assessment (van Leijsen et al., 2017b). Moreover, in the early longitudinal studies of WMH, reduction (i.e., regression) of WMH load was only observed in a small number of patients (Schmidt et al., 2003, 2005; Sachdev et al., 2007; Gouw et al., 2008b; Maillard et al., 2009; Prins et al., 2004; Rovira Cañellas et al., 2007). Hence, most earlier studies regarded the regression of WMH as either measurement error (Sachdev et al., 2007; Maillard et al., 2009; Schmidt et al., 2003, 2005) or “no progression” (Prins et al., 2004; van Dijk et al., 2008; Gouw et al., 2008b). In addition, bias in manual delineation of WMH towards progression cannot be overlooked when the raters were aware of the scans’ time sequence (Schmidt et al., 1999, 2005).
With increasing longitudinal data over the years, recent studies have suggested that evolution of WMH may be a non-linear process over time (Wardlaw et al., 2017; van Leijsen et al., 2017b) and may have different dynamics in each patient (Ramirez et al., 2016). For example, WMH volume of one subject may grow in the first follow-up assessment, and shrink in the second follow-up assessment (or vice versa). Also, in an individual patient, different WMH clusters may simultaneously either grow, shrink, or remain stable (see the bottom-right panel of Figure 1 as an example).
Predicting the evolution of WMH is challenging because the rate and direction of WMH evolution vary considerably across studies (Schmidt et al., 2016; van Leijsen et al., 2017a,b) and several risk factors, some commonly known and others not fully known, could be involved in their progression (Wardlaw et al., 2017). For example, some risk factors and predictors that have been commonly associated with WMH progression are baseline WMH volume (Schmidt et al., 1999, 2002a,b, 2003; Sachdev et al., 2007; van Dijk et al., 2008; Wardlaw et al., 2017; Chappell et al., 2017), blood pressure or hypertension (Veldink et al., 1998; Schmidt et al., 1999, 2002b; van Dijk et al., 2008; Godin et al., 2011; Verhaaren et al., 2013), age (van Dijk et al., 2008), current smoking status (Power C et al., 2015), previous stroke and diabetes (Gouw et al., 2008a; Wardlaw et al., 2017), and genetic properties (Schmidt et al., 2002a, 2011; Godin et al., 2009; Luo et al., 2017). Surrounding regions of WMH that may appear like normal appearing white matter (NAWM) with less structural integrity, usually called the “penumbra of WMH” (Maillard et al., 2011), have also been reported as having a high risk of becoming WMH over time (Maillard et al., 2014; Pasi et al., 2016). On the other hand, regression of WMH volume has been reported in several radiological observations on MRI, such as after cerebral infarction (Moriya et al., 2009), strokes (Durand-Birchenall et al., 2012; Cho et al., 2015; Wardlaw et al., 2017), improved hepatic encephalopathy (Mínguez et al., 2007), lower blood pressure (Wardlaw et al., 2017), liver transplantation (Rovira Cañellas et al., 2007), and carotid artery stenting (Yamada et al., 2010). While a recent study suggested that areas of shrinking WMH were actually still damaged (Jiaerken et al., 2018), a more recent study showed that WMH regression did not accompany brain atrophy and suggested that WMH regression follows a relatively benign clinical course (van Leijsen et al., 2019).
In this study, we propose an end-to-end training model for automatically predicting and spatially estimating the dynamic evolution of WMH from baseline to the following time point using deep neural networks called “Disease Evolution Predictor” (DEP) model (discussed in Section 3). The DEP model produces a map named “Disease Evolution Map” (DEM) which characterises each voxel of WMH or brain tissues as progressing, regressing, or stable WMH (discussed in Section 2). For this study we have chosen deep neural networks due to their exceptional performance on WMH segmentation (Rachmadi et al., 2017), which reportedly have produced better results than the conventional machine learning algorithms. We use a Generative Adversarial Network (GAN) (Goodfellow et al., 2014) and the U-Residual Network (UResNet) (Guerrero et al., 2018) as base architectures for the DEP model. These architectures represent the state-of-the-art unsupervised and supervised deep neural network models, respectively.
This study differs from previous studies on predictive modelling in that we are interested in predicting the evolution of specific neuroradiological MRI features (i.e., WMH in T2-FLAIR), not the progression of a disease as a whole and/or its effect. For example, previous studies have proposed methods for predicting the progression from mild cognitive impairment to Alzheimer’s disease (Spasov et al., 2019) and progression of cognitive decline in Alzheimer’s disease patients (Choi et al., 2018). Instead, our proposed DEP model generates three outcomes: 1) prediction of WMH volumetric changes (i.e., either progressing or regressing), 2) estimation of WMH spatial changes, and 3) spatial distribution of white matter evolution at voxel-level precision. Thus, using the DEP model, clinicians can estimate the size, extent, and location of WMH in time to study their progression/regression in relation to clinical health and disease indicators, to ultimately design more effective therapeutic interventions (Rachmadi et al., 2019b). Results and evaluations can be seen in Section 8.
This study is an extension of our previous work (Rachmadi et al., 2019b). The main contributions of this study that have not been done in our previous work are as follows.
We propose a standard training scheme to predict the evolution pattern of WMH between two time points of assessment. The proposed scheme consists of two parts: 1) generation of the spatial WMH representation and 2) generation of the “Disease Evolution Map” (DEM) using deep neural networks, namely “Disease Evolution Predictor” (DEP).
We propose and evaluate the use of three different modalities to produce the DEM: irregularity map (IM) (Rachmadi et al., 2019b), probability map (PM), and binary WMH label (LBL). The generation of the DEM per se, subtracting the IM from two time points, was proposed in the previous work (Rachmadi et al., 2019b), and is explained further in Section 2.
We propose three different end-to-end DEP learning approaches: 1) unsupervised learning, 2) indirectly supervised learning, and 3) supervised learning. Unsupervised and indirectly supervised learning approaches are based on GAN (i.e., DEP-GAN (Rachmadi et al., 2019b)) whereas the supervised learning approach is based on UResNet (i.e., DEP-UResNet). DEP-GAN and DEP-UResNet are discussed in Sections 3.1 and 3.2 respectively.
We performed an ablation study of four different auxiliary inputs to the DEP model: 1) no auxiliary input, 2) Gaussian noise (Rachmadi et al., 2019b), 3) baseline WMH load, and 4) baseline WMH and stroke lesions (SL) loads. Further explanation can be read in Section 4 while the results can be seen in Section 8.2.
We performed analysis of plausibility of the WMH volumetric changes predicted by the DEP models and risk factors of WMH evolution using analysis of covariance (ANCOVA). The results can be seen in Section 8.4.
2. Disease Evolution Map (DEM)
To produce a standard representation of WMH evolution, a simple subtraction operation between two irregularity maps from two time points (i.e., baseline assessment from follow-up assessment) named “Disease Evolution Map” (DEM) was proposed in our previous work (Rachmadi et al., 2019b). In this study, we evaluate the use of three different modalities in the subtraction operation: irregularity map (i.e. as per (Rachmadi et al., 2019b)), probability map, and binary WMH label.
An irregularity map (IM) is a map/image that describes the “irregularity” level of each voxel with respect to the normal brain tissue using real values between 0 and 1 (Rachmadi et al., 2018b). The IM is advantageous as it retains some of the original MRI textures (e.g., from the T2-FLAIR image intensities), including gradients of WMH. The IM is also independent of a human rater or training data, as it is produced using an unsupervised method (i.e., LOTS-IM) (Rachmadi et al., 2019a). The DEM resulting from the subtraction of two IMs has values ranging from −1 to 1 (first row of Figure 1). Note how both regression and progression (i.e. dark blue pixels from negative values and bright yellow pixels from positive values) are well represented at voxel-level precision on the DEM obtained from IMs.
A probability map (PM) in this study refers to the WMH segmentation output from a supervised machine learning method. Similar to IM, PM has real values between 0 and 1 which describe the probability for each voxel of being WMH. However, PM differs from IM in that PM only has WMH gradients on the borders of WMH (note that the centres of (big) WMH clusters mostly have a probability of 1). Thus, the DEM produced from the subtraction of two PMs also has values ranging from −1 to 1 representing regression and progression respectively, but these are usually located on the WMH clusters’ borders and/or represent small WMH. On the other hand, the rest of the DEM’s regions (i.e., the centres of big WMH and non-WMH regions) have a value of 0 (see the second row of Figure 1). Another caveat is that the quality (i.e., accuracy and meaning) of DEM from PM depends on the performance of the automatic WMH segmentation method that generated the PM.
Lastly, binary WMH label (LBL) refers to the WMH label produced by an expert’s manual segmentation, which is often considered the gold standard (Valdés Hernández et al., 2015). The DEM from LBL can be produced by subtracting the baseline LBL from the follow-up LBL, and each voxel of the resulting image is then labelled as either “Shrink” if it has a value below zero, “Grow” if it has a value above zero, or “Stable” if it has a value of zero. We refer to this DEM as the three-class DEM label (LBL-DEM), and its depiction can be seen in the bottom-right of Figure 1.
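The two kinds of DEM described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names and the integer class codes are hypothetical. Note that a plain subtraction of two binary labels gives zero both for stable WMH and for background, so the sketch uses both labels to separate the “Stable” class from the background, in line with the four-colour description of LBL-DEM in Section 3.2.

```python
import numpy as np

# Hypothetical integer codes for the LBL-DEM classes (not from the paper).
BACKGROUND, SHRINK, GROW, STABLE = 0, 1, 2, 3

def dem_from_maps(baseline, follow_up):
    """Real-valued DEM from two IMs or PMs: follow-up minus baseline, in [-1, 1]."""
    return np.asarray(follow_up, dtype=float) - np.asarray(baseline, dtype=float)

def lbl_dem(baseline_lbl, follow_up_lbl):
    """Three-class DEM (plus background) from two binary WMH labels."""
    b = np.asarray(baseline_lbl).astype(bool)
    f = np.asarray(follow_up_lbl).astype(bool)
    out = np.full(b.shape, BACKGROUND, dtype=np.uint8)
    out[b & ~f] = SHRINK   # WMH at baseline only
    out[~b & f] = GROW     # WMH at follow-up only
    out[b & f] = STABLE    # WMH at both time points
    return out

baseline = np.array([[1, 1, 0], [0, 0, 0]])
follow_up = np.array([[1, 0, 1], [0, 0, 0]])
print(lbl_dem(baseline, follow_up))
```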
3. Disease Evolution Predictor (DEP) Model using Deep Neural Networks
In this study, two different Disease Evolution Predictor (DEP) models are proposed and evaluated: 1) non-supervised DEP model based on generative adversarial networks (DEP-GAN) (Rachmadi et al., 2019b) and 2) supervised DEP model based on UResNet (DEP-UResNet). Each DEP model’s workflow consists of two parts: 1) construction of the WMH spatial representation and 2) generation of the predicted DEM. The general flow of the DEP model is depicted in Figure 2.
DEP-GAN uses either IM or PM to represent the WMH while DEP-UResNet uses T2-FLAIR and three-class label DEM (LBL-DEM). DEP-GAN using IM is categorised as unsupervised learning as the input modality of IM is produced by an unsupervised method: LOTS-IM. DEP-GAN using PM is categorised as indirectly supervised learning because the PM is produced by a supervised deep learning algorithm (i.e., UResNet). Finally, DEP-UResNet is categorised as supervised learning as it simply learns DEM labels from LBL-DEM.
3.1. DEP Generative Adversarial Network (DEP-GAN)
DEP Generative Adversarial Network (DEP-GAN) (Rachmadi et al., 2019b) is based on a GAN, a well established unsupervised deep neural network model commonly used to generate fake natural images (Goodfellow et al., 2014). Thus, DEP-GAN is mainly proposed to predict the evolution of WMH when there are no longitudinal WMH labels available. This model (i.e. DEP-GAN) is based on a visual attribution GAN (VA-GAN) originally proposed to detect atrophy in T2-weighted MRI of Alzheimer’s disease (Baumgartner et al., 2017). DEP-GAN consists of a generator based on U-Net (Ronneberger et al., 2015), a commonly used deep neural network model in medical imaging, and two separate convolutional networks used as discriminators (hereinafter referred to as critics). The schematic of DEP-GAN can be seen in Figure 3.
Let x0 be the baseline (year-0) image and x1 be the follow-up (year-1) image. Then, the “real” DEM (y) can be produced by a simple subtraction (y = x1 − x0). To generate the “fake” DEM (y′), i.e. without x1, a generator function (M(x)) is used: y′ = M(x0). Thus, a “fake” follow-up image (x1′) can be produced by x1′ = x0 + y′ = x0 + M(x0). Once M(x) is fully trained, the “fake” follow-up (x1′) and the “real” follow-up (x1) should be indistinguishable by a critic function D(x), while the “fake” DEM (y′) and the “real” DEM (y) should also be indistinguishable by another critic function C(x). The full schematic of DEP-GAN’s architecture (i.e., its generator and critics) can be seen in Figure 4.
The DEP-GAN’s U-Net-based generator (M(x)) has two parts: an encoder, which encodes the input image information into a latent representation, and a decoder, which decodes the image information back from the latent representation. The baseline IM/PM (x0) is fed forward through this generator to generate a “fake” DEM (y′). There is also an auxiliary input modulated into the generator using a FiLM layer (Perez et al., 2018) inside the residual block (ResBlock) to deal with non-deterministic factors of WMH evolution. This auxiliary input and its modulation will be fully discussed in Section 4. The architecture of the DEP-GAN’s generator is depicted in the upper side of Figure 4 (with “A”, “B”, and “E” annotations for U-Net-based generator of M(x), auxiliary input, and residual block (ResBlock) respectively).
Unlike VA-GAN that uses only one critic (i.e., only D(x)) (Baumgartner et al., 2017), DEP-GAN uses two critics (i.e., D(x) and C(x)) to enforce both anatomically realistic modifications to the follow-up images (Baumgartner et al., 2017) and encode realistic plausibility in the modifier (i.e., DEM) (Rachmadi et al., 2019b). Anatomically realistic modifications to the follow-up images can be achieved by optimising the critic D(x) and the anatomically realistic plausibility of the modifier can be achieved by optimising the critic C(x). In other words, we argue that an anatomically realistic DEM is also important and essential to produce anatomically realistic (fake) follow-up images. The architecture of the DEP-GAN’s critics and their connection to the generator are depicted in the lower side of Figure 4 (with “C” and “D” annotations for critic C(x) and D(x) respectively).
The DEP-GAN’s optimisation process is the same as that of VA-GAN, which follows the optimisation process of the Wasserstein GAN with gradient penalty (WGAN-GP) using a gradient penalty factor of 10 (Gulrajani et al., 2017). The optimisation of M(x) is given by Equation 1, where x0 is the baseline image that has an underlying distribution ℙ0, x1 is the follow-up image that has an underlying distribution ℙ1, M(x0) represents the “fake” DEM, x0 + M(x0) is the “fake” follow-up image, D and C are the critics (i.e. members of a set of 1-Lipschitz functions (Baumgartner et al., 2017; Gulrajani et al., 2017)), and ∥·∥1 and ∥·∥2 are the L1 and L2 norms respectively. The optimisation was performed by updating the parameters of the generator and critics alternately, where each critic is updated 5 times per generator update. Also, for the first 25 iterations and at every 100th iteration, the critics were updated 100 times per generator update (Baumgartner et al., 2017; Gulrajani et al., 2017).
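One way to write out Equations 1 to 4 that is consistent with the definitions in this section and with the WGAN-GP formulation is sketched below. It is a reconstruction under stated assumptions rather than the authors' exact notation: the symbols for the critic objectives and the regulariser, the operators seg(·) and vol(·), and the use of 1 − DSC as the spatial term are all introduced here for illustration, and the gradient penalty terms of the critics are omitted for brevity.

```latex
\begin{align}
M^{*} &= \arg\min_{M} \Big[ \max_{D \in \mathcal{D}} \mathcal{L}_{D}
        + \max_{C \in \mathcal{C}} \mathcal{L}_{C}
        + \mathcal{L}_{\mathrm{reg}} \Big] \tag{1} \\
\mathcal{L}_{D} &= \mathbb{E}_{x_1 \sim \mathbb{P}_1}\big[D(x_1)\big]
        - \mathbb{E}_{x_0 \sim \mathbb{P}_0}\big[D(x_0 + M(x_0))\big] \tag{2} \\
\mathcal{L}_{C} &= \mathbb{E}_{(x_0, x_1)}\big[C(x_1 - x_0)\big]
        - \mathbb{E}_{x_0 \sim \mathbb{P}_0}\big[C(M(x_0))\big] \tag{3} \\
\mathcal{L}_{\mathrm{reg}} &= \lambda_1 \lVert x_1 - (x_0 + M(x_0)) \rVert_1
        + \lambda_2 \big(1 - \mathrm{DSC}(\mathrm{seg}(x_1), \mathrm{seg}(x_0 + M(x_0)))\big) \notag \\
        &\quad + \lambda_3 \lVert \mathrm{vol}(x_1) - \mathrm{vol}(x_0 + M(x_0)) \rVert_2 \tag{4}
\end{align}
```

In this form, the critics try to separate real from fake follow-up images (Equation 2) and real from fake DEMs (Equation 3), while the generator minimises both gaps together with the three regularisation terms.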
In summary, to optimise the generator M(x), we need to optimise Equation 1, which optimises both critics (i.e., D(x) and C(x)) using Equations 2 and 3 respectively based on WGAN-GP’s optimisation process (Gulrajani et al., 2017), and use the regularisation function described in Equation 4. The terms of the regularisation function (Equation 4) state that:
Intensities of “fake” follow-up images (x0 + M(x0)) have to be similar to the “real” follow-up images (x1) based on the L1 norm.
The WMH segmentation estimated from x0 + M(x0) has to be spatially similar to the WMH segmentation estimated from x1 based on the Dice similarity coefficient (DSC) (see Equation 6).
The WMH volume (in ml) estimated from x0 + M(x0) has to be similar to the WMH volume estimated from x1 based on the L2 norm.
The WMH segmentation of x0 + M(x0) and x1 is estimated by thresholding either IM values (i.e., irregularity values) to be above 0.178 (Rachmadi et al., 2019a) or PM values (i.e., probability values) to be above 0.5. Furthermore, the terms in Equation 4 are weighted by λ1, λ2, and λ3, which equal 100 (Baumgartner et al., 2017), 1, and 100 respectively.
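The three regularisation terms listed above can be illustrated with a forward-only NumPy sketch. Several choices here are assumptions rather than details from the paper: the L1 term is averaged over voxels, the spatial term is taken as 1 − DSC, the volume term is the absolute difference of two scalar volumes, and the function names are hypothetical. Hard thresholding is not differentiable, so an actual training loss would need a differentiable surrogate.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice similarity coefficient between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def regulariser(x0, dem_fake, x1, threshold, voxel_ml, lam=(100.0, 1.0, 100.0)):
    """Weighted sum of the intensity, spatial and volume terms (illustrative)."""
    x1_fake = x0 + dem_fake                       # "fake" follow-up image
    l_int = np.abs(x1 - x1_fake).mean()           # L1 intensity term (mean assumed)
    seg_real, seg_fake = x1 > threshold, x1_fake > threshold
    l_dsc = 1.0 - dice(seg_real, seg_fake)        # spatial term (1 - DSC assumed)
    # L2 norm of a scalar difference reduces to its absolute value.
    l_vol = abs(float(seg_real.sum()) - float(seg_fake.sum())) * voxel_ml
    return lam[0] * l_int + lam[1] * l_dsc + lam[2] * l_vol

x0 = np.array([[0.0, 0.5], [0.25, 1.0]])
x1 = np.array([[0.5, 0.5], [0.0, 0.75]])
print(regulariser(x0, x1 - x0, x1, threshold=0.5, voxel_ml=1.0))
print(regulariser(x0, np.zeros_like(x0), x1, threshold=0.5, voxel_ml=1.0))
```

With a perfect DEM (the first call) every term vanishes; with an all-zero DEM (the second call) only the intensity term contributes in this toy example, because both images have the same thresholded mask.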
3.2. DEP U-Residual Network (DEP-UResNet)
In case WMH binary labels (LBL) for the two time points (i.e., longitudinal data set) are available, a simple supervised deep neural network method can automatically estimate WMH evolution. As previously described in Section 2, DEM produced from LBL (i.e., three-class DEM label (LBL-DEM)) consists of 3 foreground labels (i.e., “Grow” (green), “Shrink” (red), and “Stable” (blue)) and 1 background label (black). An example of LBL-DEM can be seen in the bottom-right figure of Figure 1.
In this case, the DEP-GAN’s generator is detached from the critics and modified into DEP U-Residual Network (DEP-UResNet) by changing the last non-linear activation layer of tanh (i.e., for regression) to softmax (i.e., for multi-label segmentation). The DEP-UResNet’s schematic can be seen in the upper right side of Figure 4 (with “F” annotation). DEP-UResNet uses T2-FLAIR as input and LBL-DEM as target output. Note that the auxiliary input proposed in this study can be also applied to the DEP-UResNet. See Section 4 for full explanation about auxiliary input in DEP model.
4. Auxiliary Input in DEP Model
The biggest challenge in modelling the evolution of WMH is its non-deterministic nature and the number of factors involved. In our previous work, we proposed an auxiliary input module which modulates random noise from a normal (Gaussian) distribution into every layer of the DEP-GAN’s generator to simulate the non-deterministic property of WMH evolution and the unknown/missing factors (i.e., non-image features) involved in WMH evolution (Rachmadi et al., 2019b). To modulate the auxiliary input into every layer of the DEP-GAN’s generator, a Feature-wise Linear Modulation (FiLM) layer (Perez et al., 2018) is used. The FiLM layer is depicted as the green block inside the residual block (ResBlock) in Figure 4 (annotated as “E”). In the FiLM layer, γm and βm modulate feature maps Fm, where subscript m refers to the mth feature map, via the affine transformation FiLM(Fm | γm, βm) = γm Fm + βm, where γm and βm for each ResBlock in each layer are automatically determined by convolutional layers (depicted as yellow blocks in Figure 4 with “B” annotation). Please note that the proposed auxiliary input module can be easily applied to any deep neural network model. Thus, we applied the auxiliary input module to both DEP models proposed in this study: DEP-GAN and DEP-UResNet.
In this study, we performed an ablation study of auxiliary input modalities for the DEP model by using: 1) no auxiliary input, 2) Gaussian noise, 3) baseline WMH volume, and 4) both baseline WMH and SL volumes. Firstly, we tested DEP models without any auxiliary input. Secondly, we added as input an array of 32 random noise values which follow a Gaussian distribution, as per our previous work (Rachmadi et al., 2019b). Hereinafter, Gaussian noise in this study refers to this array. Thirdly, instead of the random noise, we used some risk factors that have been commonly associated with WMH evolution. Note that while not all factors which influence WMH evolution are known, baseline WMH load (i.e., cited as the most common and strongest predictor) (Schmidt et al., 2003; Sachdev et al., 2007; van Dijk et al., 2008; Wardlaw et al., 2017; Chappell et al., 2017) and baseline stroke lesions (SL) load (Gouw et al., 2008a; Wardlaw et al., 2017) have been found to be strongly associated with WMH evolution over time. The WMH and SL volumes were obtained from WMH and SL labels/masks. Please see Section 5 for a full explanation of how WMH and SL masks were produced. It is worth mentioning that changing the auxiliary input modality from Gaussian noise to WMH and SL loads changes the nature of the DEP model from non-deterministic to deterministic, influenced by factors of WMH evolution (i.e., baseline WMH and SL loads).
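The feature-wise modulation performed by a FiLM layer, which scales each feature map Fm by γm and shifts it by βm, can be sketched with NumPy broadcasting. This is a minimal illustration under assumptions: the random linear maps below are hypothetical stand-ins for the learned layers that predict γm and βm from the auxiliary input, the noise is assumed to have zero mean and unit variance, and the feature map sizes are arbitrary.

```python
import numpy as np

def film(features, gamma, beta):
    """Apply gamma_m * F_m + beta_m to each feature map F_m.

    features: array of shape (M, H, W); gamma, beta: arrays of shape (M,).
    """
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
n_maps, noise_len = 4, 32

# Auxiliary input: 32 Gaussian values (zero mean, unit variance assumed here).
noise = rng.standard_normal(noise_len)

# Hypothetical linear maps standing in for the learned layers that predict
# gamma and beta from the auxiliary input.
w_gamma = 0.1 * rng.standard_normal((n_maps, noise_len))
w_beta = 0.1 * rng.standard_normal((n_maps, noise_len))
gamma = 1.0 + w_gamma @ noise
beta = w_beta @ noise

features = rng.standard_normal((n_maps, 8, 8))
modulated = film(features, gamma, beta)
print(modulated.shape)
```

Because γm and βm depend only on the auxiliary input, the same image features can be steered towards different outputs by changing the noise array or by replacing it with subject-level values such as baseline volumes.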
5. Subjects and Data
We used MRI data from stroke patients (n = 152) enrolled in a study of stroke mechanisms, for which the full recruitment and assessment procedures have been published (Wardlaw et al., 2017). Written informed consent was obtained from all patients on protocols approved by the Lothian Ethics of Medical Research Committee (REC 09/81101/54) and NHS Lothian R+D Office (2009/W/NEU/14), on the 29th of October 2009. In the clinical study that provided the data, patients were imaged at three time points (i.e., first (baseline) 1-4 weeks after presenting to the clinic with stroke symptoms, then at approximately 3 months, and finally a year after (follow-up)). All images were acquired on a GE 1.5T MRI scanner following the same imaging protocol (Valdés Hernández et al., 2015). Ground truth segmentations were performed using a multispectral semi-automatic method (Valdés Hernández et al., 2015) only from baseline and 1-year follow-up scan visits in the image space of the T1-weighted scan of the second visit, in n = 152 (out of 264) patients. T2-weighted, FLAIR, gradient echo, and T1-weighted structural images at baseline and 1-year scan visits were rigidly and linearly aligned using FSL-FLIRT (Jenkinson et al., 2002). The resulting images have a size of 256 × 256 × 42 voxels with a voxel size of 0.9375 × 0.9375 × 4 mm. We used data from all patients who had the three scan visits and ground truth generated as per above. Hence, our sample consists of MRI data (i.e., s = n × 2 = 304 MRI scans) for baseline and 1-year follow-up. Of all patients, 70 (46%) had the lacunar stroke subtype, with a median small vessel disease (SVD) score of 1. Other demographics and clinical characteristics of the patients that provided data for this study can be seen in Table 1.
The primary study that provided the data used a semi-automatic multispectral method to produce several brain masks including intracranial volume (ICV), cerebrospinal fluid (CSF), stroke lesions (SL), and WMH, all of which were visually checked and manually edited by an expert (Valdés Hernández et al., 2015). The image processing protocol followed to generate these masks is fully explained in Valdés Hernández et al. (2015). Extracranial tissues, SL, and skull were removed from the baseline and follow-up T2-FLAIR images using the SL and ICV binary masks from previous analyses (Chappell et al., 2017; Wardlaw et al., 2017).
In this study, binary WMH labels produced for the primary study that provided the data (Valdés Hernández et al., 2015) were used as the gold standard (i.e. ground truth) for evaluating the DEP models. As per these labels, 98 and 54 out of the 152 subjects have increasing (i.e., progression) and decreasing (i.e., regression) volume of WMH respectively. However, as the DEP models depend on the input, we also generated WMH binary labels from IM (i.e., cutting off IM’s values to be above 0.178 (Rachmadi et al., 2019a)) and PM (i.e., cutting off PM’s values to be above 0.5) and used them as a second reference. Hereinafter, these WMH labels will be referred to as “WMH label from original input modality” (IM/PM-LBL-DEM). Note that IM/PM-LBL-DEM will be used only for evaluating DEP-GAN using IM and PM respectively.
As previously explained, IM and PM are needed for DEP-GAN (i.e., the non-supervised learning approach of the DEP model). We used LOTS-IM with 128 target patches (Rachmadi et al., 2019a) to generate an IM from each MRI scan. To generate PM, we trained a 2D UResNet (Guerrero et al., 2018) with gold standard WMH and SL masks for WMH and SL segmentation. For this training, we used all subjects in our data set and a 4-fold cross validation training scheme. Thus, out of 304 MRI data (152 subjects × 2 scans), 228 MRI data (114 subjects × 2 scans) were used for training and 76 MRI data (38 subjects × 2 scans) were used for testing in each fold. Note that this UResNet, which was previously proposed for WMH and stroke lesions segmentation (Guerrero et al., 2018), is different from the DEP-UResNet, which is newly proposed in this study. We affix the “DEP” keyword to the name of any model used for prediction and delineation of WMH evolution.
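The thresholding step and the resulting direction of volumetric change can be sketched as follows. This is an illustration under assumptions: the voxel dimensions are taken as 0.9375 × 0.9375 × 4 mm, the function names are hypothetical, and a “stable” outcome is added for the case of equal volumes.

```python
import numpy as np

VOXEL_MM = (0.9375, 0.9375, 4.0)   # assumed voxel dimensions in mm
IM_THRESHOLD, PM_THRESHOLD = 0.178, 0.5

def wmh_volume_ml(value_map, threshold, voxel_mm=VOXEL_MM):
    """WMH volume in ml from an IM or PM thresholded at the given value."""
    mask = np.asarray(value_map) > threshold
    voxel_ml = float(np.prod(voxel_mm)) / 1000.0   # 1 ml = 1000 mm^3
    return float(mask.sum()) * voxel_ml

def volumetric_change(baseline_ml, follow_up_ml):
    """Direction of WMH volumetric change between two assessments."""
    if follow_up_ml > baseline_ml:
        return "progression"
    if follow_up_ml < baseline_ml:
        return "regression"
    return "stable"

pm_baseline = np.array([[0.9, 0.6], [0.2, 0.1]])
pm_follow_up = np.array([[0.9, 0.7], [0.8, 0.1]])
v0 = wmh_volume_ml(pm_baseline, PM_THRESHOLD)
v1 = wmh_volume_ml(pm_follow_up, PM_THRESHOLD)
print(v0, v1, volumetric_change(v0, v1))
```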
6. Experiment Setup
In this study, we opted to use 2D architectures for all our networks rather than 3D ones. This includes the DEP models (i.e., DEP-GAN and DEP-UResNet) for estimating WMH evolution and UResNet for WMH and stroke lesions segmentation. The main reason for this decision was the limited amount of data available (i.e. only 152 subjects) in this study. VA-GAN (i.e., the GAN scheme used as basis for DEP-GAN) used roughly 4,000 subjects for training its 3D network architecture, yet there was still evidence of over-fitting (Baumgartner et al., 2017). The 2D version of VA-GAN has been previously tested on synthetic data (Baumgartner et al., 2017) and is available from the project’s GitHub page.
To train DEP models (i.e., DEP-GAN and DEP-UResNet), 4-fold cross validation was performed. Note that cross validation was not used in the previous study that introduced DEP-GAN (Rachmadi et al., 2019b). In each fold, out of 304 MRI data (152 subjects × 2 scans), 228 MRI data (114 subjects × 2 scans) were used for training and 76 MRI data (38 subjects × 2 scans) were used for testing. Note that DEP models are subject-specific models, so pairwise MRI scans (i.e., baseline and follow-up) are required for both training and testing. Out of all slices from the training set in each fold (i.e., 114 pairwise MRI scans), 20% were randomly selected for validation. Thus, around 4,000 slices were used in the training process in each fold. Values of IM and PM did not need to be normalised as these are between 0 and 1. Finally, each DEP model was trained for 200 epochs (i.e., 200 generator updates for DEP-GAN).
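The subject-level split described above can be sketched with the standard library. This is an illustration only: the shuffling seed and the function name are hypothetical, and the source does not say how subjects were assigned to folds. Splitting by subject rather than by scan keeps the baseline and follow-up scans of one patient in the same partition.

```python
import random

def four_fold_subject_splits(n_subjects=152, n_folds=4, seed=0):
    """Return a list of (train_ids, test_ids) pairs, split by subject."""
    ids = list(range(n_subjects))
    random.Random(seed).shuffle(ids)
    fold_size = n_subjects // n_folds      # 152 // 4 = 38 test subjects per fold
    splits = []
    for k in range(n_folds):
        test = ids[k * fold_size:(k + 1) * fold_size]
        test_set = set(test)
        train = [i for i in ids if i not in test_set]
        splits.append((train, test))
    return splits

for train, test in four_fold_subject_splits():
    # Each subject contributes two scans (baseline and follow-up).
    print(len(train), "train subjects,", len(test), "test subjects,",
          2 * len(train), "train scans,", 2 * len(test), "test scans")
```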
In this study, we evaluated and compared the performances of DEP-UResNet, DEP-GAN using IM, and DEP-GAN using PM. Furthermore, we also performed an ablation study to see the effect of an auxiliary input on each DEP model. The procedure of using auxiliary input depends on the input modality and training/testing process. If SL and WMH volumes were used as auxiliary input, these (i.e., not the volumes per slice, but the volumes per subject) were fed forward together with one MRI slice. Thus, all slices from one subject used the same WMH and SL volume values. Note that WMH and SL loads for the whole data set (i.e., all subjects) were first normalised to zero mean and unit variance before their use in training/testing. If Gaussian noise was used as auxiliary input, an array of Gaussian noise was fed forward together with an MRI slice in the training process as follows: 10 different sets of Gaussian noise were first generated and only the “best” set (i.e., the set that yielded the lowest M* loss (Equation 1)) was used to update the DEP model’s parameters. Note that this approach is similar to and inspired by the Min-of-N loss in 3D object reconstruction (Fan et al., 2017) and the variety loss in Social GAN (Gupta et al., 2018). In the testing process, 10 different sets of Gaussian noise were also generated first and only the “best” prediction of WMH evolution based on the Dice similarity coefficient (DSC) was used in evaluation.
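The “best of 10” noise selection can be sketched as a small helper that draws several noise arrays and keeps the one with the best score. This is a minimal sketch under assumptions: `evaluate` is a hypothetical callable standing in for either the generator loss (training, lower is better) or the DSC of the prediction (testing, higher is better), and the noise is assumed to have zero mean and unit variance.

```python
import random

def best_of_n(evaluate, n_sets=10, noise_len=32, seed=0, lower_is_better=True):
    """Draw n_sets Gaussian noise arrays and return the best one with its score."""
    rng = random.Random(seed)
    best_noise, best_score = None, None
    for _ in range(n_sets):
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_len)]
        score = evaluate(noise)
        if best_score is None:
            better = True
        elif lower_is_better:
            better = score < best_score
        else:
            better = score > best_score
        if better:
            best_noise, best_score = noise, score
    return best_noise, best_score

# Toy stand-in for a loss: the squared norm of the noise array.
toy_loss = lambda noise: sum(v * v for v in noise)
noise, loss = best_of_n(toy_loss)
print(len(noise), loss)
```

In training, only the selected set would be used for the parameter update; in testing, the same helper with `lower_is_better=False` would keep the prediction with the highest DSC.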
7. Evaluation Metrics
In this study, we used four tests to assess the performance of DEP models:
Prediction error of WMH volumetric change (i.e., whether WMH volume in a subject will increase or decrease).
Volumetric agreement between ground truth and predicted WMH volumes of the follow-up assessment using Bland-Altman plot (Bland and Altman, 1986).
Spatial agreement of the automatic map of WMH evolution in a patient (i.e. after binarisation) using Dice similarity coefficient (DSC) (Dice, 1945).
Clinical plausibility test between the out-come of DEP models in relation with baseline WMH load and clinical risk factors of WMH evolution suggested in clinical studies.
Prediction error is a simple metric to assess how good a DEP model can predict the WMH evolution in the future follow-up assessment (i.e., increasing or decreasing). On the other hand, volumetric agreement using Bland-Altman plot presents the mean volumetric difference and upper/lower limit of agreements (i.e., mean ± 1.96 × standard deviation) between ground truth and predicted WMH volumes of the follow-up assessment. Whereas, spatial agreement of Dice similarity coefficient (DSC) is used to measure spatial agreement between ground truth and automatic delineation results. Higher DSC means better performance. The DSC itself can be computed as follow: where TP is true positive, FP is false positive and FN is false negative.
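The Dice similarity coefficient, computed as 2 × TP / (2 × TP + FP + FN), and the Bland-Altman limits of agreement can be sketched with the standard library. The function names are hypothetical, the sample standard deviation is assumed for the limits of agreement, and returning 1.0 when both delineations are empty is a convention chosen here rather than taken from the source.

```python
from statistics import mean, stdev

def dsc(tp, fp, fn):
    """Dice similarity coefficient from true/false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0   # both masks empty: treated as agreement

def bland_altman(reference, predicted):
    """Mean difference and lower/upper limits of agreement (mean -/+ 1.96 * SD)."""
    diffs = [p - r for r, p in zip(reference, predicted)]
    m, sd = mean(diffs), stdev(diffs)
    return m, m - 1.96 * sd, m + 1.96 * sd

print(dsc(tp=8, fp=2, fn=2))
print(bland_altman([10.0, 20.0, 30.0], [11.0, 22.0, 33.0]))
```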
In addition, we performed a clinical plausibility test, which evaluates the outcome of DEP models in relation to the baseline WMH load and the clinical risk factors of WMH change and evolution suggested in clinical studies. For this, analyses of covariance (ANCOVA) were performed as follows:
The WMH volume at follow-up, predicted from each of the schemes evaluated, was used as the outcome (dependent) variable.
The baseline WMH volume was the independent variable or predictor.
After running Belsley collinearity diagnostic tests, the covariates in the models were: 1) type of stroke (i.e. lacunar or cortical), 2) basal ganglia perivascular spaces (BG PVS) score, 3) presence/absence of diabetes, 4) presence/absence of hypertension, 5) recent or current smoker status (yes/no), 6) volume of the index stroke lesion (abbreviated as “index SL”), and 7) volume of old stroke lesions (abbreviated as “Old SL”).
The outcome from an ANCOVA model using the baseline and follow-up WMH volumes of the gold-standard expert-delineated binary masks was used as reference to compare the outcome of the ANCOVA models that used the volumes generated by thresholding the input and output of the DEP models. All volumetric measurements involved in the ANCOVA models were previously adjusted by patient’s head size. Therefore, all ANCOVA models used the percentage of these volumetric measurements in ICV rather than the raw volumes.
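Written out, and assuming the usual linear form of an ANCOVA, each model can be read as the regression below. The symbols are introduced here for illustration only: V^FU and V^BL denote the follow-up and baseline WMH volumes of subject i expressed as a percentage of ICV, c_ik denotes the k-th of the seven covariates listed above, and ε_i is the residual.

```latex
\[
V^{\mathrm{FU}}_{i} = \beta_0 + \beta_1 V^{\mathrm{BL}}_{i}
  + \sum_{k=1}^{7} \gamma_k \, c_{ik} + \varepsilon_i ,
\qquad
V_{i} = 100 \times \frac{\mathrm{WMH\ volume}_{i}}{\mathrm{ICV}_{i}}
\]
```

Under this reading, a scheme is clinically plausible when the strength and significance of β_1 and of each γ_k resemble those obtained with the reference WMH masks.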
8. Results and Discussion
8.1. Comparison against the Gold Standard WMH Labels
Table 2 shows the results of predicting and estimating WMH volumetric changes in the sample against the WMH labels generated from expert-delineated WMH binary masks (i.e., considered the gold standard reference). Table 3 shows the spatial agreement, measured with the Dice similarity coefficient (DSC), achieved by the automatic delineation of WMH evolution using the different DEP models, including the regions of shrinking, growing, and stable WMH.
As Table 2 shows, DEP-UResNet with WMH and stroke lesion (SL) volumes as auxiliary inputs had the best average performance for the prediction of WMH volumetric changes in the whole sample (i.e., average rate 77.76%). However, the number of patients that overall experienced growth was better predicted by DEP-UResNet with Gaussian noise as auxiliary input (81.63%), and the number of patients that experienced shrinkage was better predicted by DEP-GAN using probability map (PM) and Gaussian noise as inputs (88.89%).
With respect to the volumetric agreement, DEP-GAN using irregularity map (IM) and Gaussian noise as inputs produced the best estimation of WMH volumetric changes, with a mean difference of 0.4420 ml with respect to the gold standard, followed by DEP-UResNet with Gaussian noise (i.e., −0.7926 ml) and DEP-UResNet with WMH and SL loads (i.e., 0.8114 ml). Interestingly, DEP-GAN using PM, which seemingly had some of the worst agreements of volumetric change estimation, had better lower and upper limits of agreement (LoA) than the other models. Furthermore, DEP-GAN using PM and Gaussian noise as inputs was the only scheme that showed plausible results with respect to the clinical parameters evaluated (see Table 6 and Section 8.4 for further explanation). In the Bland-Altman plots (Figure 5), the volumetric agreement of DEP-GAN using PM is more similar to that of DEP-UResNet than to that of DEP-GAN using IM. Observe that the black dashed lines depicting the LoA in the plots of DEP-GAN using PM are closer to 0 and more similar to those of the DEP-UResNet schemes.
On the automatic delineation of the boundaries of WMH change at the follow-up year, DEP-UResNet using Gaussian noise and DEP-GAN using PM and Gaussian noise produced the best mean DSC for the entire WMH: 0.6162 and 0.6155 respectively (see the 2nd column of Table 3). DEP-UResNet with Gaussian noise also outperformed the rest of the models on average (i.e., average DSC in the delineation of shrinking, growing, and stable WMH clusters) with 0.3200 (i.e., the right-hand side column of Table 3). In general, DEP-GAN, especially DEP-GAN using IM (unsupervised learning), performed worse than DEP-UResNet (Table 3). Only the (indirectly supervised) DEP-GAN using PM could compete with the (supervised) DEP-UResNet scheme, especially when Gaussian noise was used as auxiliary input. We therefore conducted Wilcoxon and Kruskal-Wallis tests to evaluate whether the medians and distributions of the DSC scores produced by the two methods (i.e., DEP-UResNet using Gaussian noise and DEP-GAN using PM and Gaussian noise) were significantly different from each other. The results are listed in Table 4. From Table 4 we can see that the performances of DEP-UResNet with Gaussian noise and DEP-GAN using PM with Gaussian noise did not differ from each other in the estimation of the whole volume of WMH at follow-up, in the delineation of the stable and shrinking WMH clusters, or in the average spatial estimation of shrinking, growing, and stable WMH clusters (p-value > 0.05). However, they differed in estimating the WMH clusters that experienced growth, which is influential in the overall estimation of change. It is worth noting that the regions that grew and shrank are considerably smaller than those that remained unchanged, and that when stroke lesions coalesce with WMH it is very difficult to discern the borders between them. As previously explained, stroke lesions were removed from the analysis. Nevertheless, inaccuracies in the ground truth when determining the borders between coalescent WMH and stroke lesions, and the small size of the volume changes in each WMH cluster (Rachmadi et al., 2018a), might have contributed to the low DSC values obtained in the regions that experienced change, as seen in Table 3.
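To make the per-region evaluation concrete, the sketch below derives shrinking, growing, and stable labels by comparing a baseline mask with a follow-up mask, and then computes one DSC per label. It assumes flattened binary masks given as plain sequences; the function names, the label strings, and the choice to return `None` for a class absent from both maps are illustrative assumptions rather than the procedure used in this study.

```python
CLASSES = ("shrink", "grow", "stable")


def change_labels(baseline, follow_up):
    """Label each voxel by comparing baseline and follow-up binary masks."""
    labels = []
    for b, f in zip(baseline, follow_up):
        if b and f:
            labels.append("stable")      # WMH at both time points
        elif b:
            labels.append("shrink")      # WMH at baseline only
        elif f:
            labels.append("grow")        # WMH at follow-up only
        else:
            labels.append("background")  # no WMH at either time point
    return labels


def per_class_dice(true_labels, predicted_labels):
    """DSC = 2TP / (FP + 2TP + FN), computed separately for each change class."""
    scores = {}
    for name in CLASSES:
        tp = fp = fn = 0
        for t, p in zip(true_labels, predicted_labels):
            if t == name and p == name:
                tp += 1
            elif p == name:
                fp += 1
            elif t == name:
                fn += 1
        denominator = fp + 2 * tp + fn
        # Assumed convention: undefined (None) when the class is absent from both maps.
        scores[name] = 2 * tp / denominator if denominator else None
    return scores


baseline = [1, 1, 0, 0, 1]
true_map = change_labels(baseline, [1, 0, 1, 0, 1])       # real follow-up mask
predicted_map = change_labels(baseline, [1, 1, 1, 0, 0])  # predicted follow-up mask
scores = per_class_dice(true_map, predicted_map)
# scores == {"shrink": 0.0, "grow": 1.0, "stable": 0.5}
```

Because the changing regions contain few voxels, a single misclassified voxel moves the per-class DSC substantially, as the shrinking class in this toy example shows.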
8.2. Ablation Study of Auxiliary Input in DEP Models
From the ablation study of auxiliary inputs, we can see that auxiliary inputs improved the performance of DEP models, especially Gaussian noise in DEP-GAN (see Tables 2 and 3). DEP-GAN with Gaussian noise generally performed better than DEP-GAN with the other auxiliary input options (i.e., no auxiliary input, WMH load, and both WMH and stroke lesion loads), whichever input modality was used (i.e., IM or PM), in almost all evaluations (i.e., prediction and estimation of WMH volumetric changes and delineation of dynamic WMH evolution). Moreover, DEP-UResNet with Gaussian noise performed the best on automatic estimation of WMH volumetric changes and delineation of dynamic WMH evolution. However, Gaussian noise notably failed to improve the performance of DEP-UResNet on the prediction of WMH volumetric changes (columns 3 and 4 of Table 2).
Furthermore, we can also see that DEP models with auxiliary input, either Gaussian noise or known risk factors of WMH evolution (i.e., WMH and SL loads), performed better than the DEP models without any auxiliary input. These results show the importance of the auxiliary input, especially Gaussian noise, which simulates the non-deterministic nature of WMH evolution.
8.3. Comparison to WMH Label from Original Input Modality (PM or IM)
From Table 3, we can see that DEP-GAN using IM did not perform well on automatic spatial delineation of the WMH clusters for the estimated follow-up visit. We believe this happened because DEP-GAN had to regress IM values for the whole brain tissue. In contrast, both DEP-UResNet and DEP-GAN using PM performed delineation and regression only around regions of WMH based on three-class DEM labels and PM’s values respectively.
To corroborate this hypothesis, we performed another experiment where we compared the performances of DEP models to the WMH volume and labels produced from the original input modality (i.e., IM/PM-LBL-DEM). In other words, we wanted to see how well DEP-GAN learns the available information (i.e., IM/PM values and their changes) based on the original modality of either IM or PM. The results are listed in Table 5, and the volumetric agreement can be seen in the Bland-Altman plots depicted in red in Figure 5. Note that only results from DEP-GAN that uses IM or PM were reassessed in this experiment.
From the analysis of volumetric agreement depicted in the Bland-Altman plots of Figure 5, we can see that the estimation of WMH volumetric changes using DEP-GAN and IM/PM agreed better with the WMH volume produced from the original modality of IM/PM (depicted by red dots and lines) than with the WMH volume produced from the gold standard WMH masks (depicted by black dots and lines). In other words, Figure 5 shows that the mean difference of WMH volume (depicted as solid lines) between DEP-GAN’s results and the ones produced from the original modality (in red) is closer to 0 than the mean difference between DEP-GAN’s results and the gold standard masks (in black). Similarly, the lower and upper limits of agreement (LoA) between DEP-GAN’s results and the WMH volumes produced from the original modality (depicted as red dashed lines) are closer to 0 than those between DEP-GAN’s results and the gold standard masks (depicted as black dashed lines). Note that the exact values of this “volumetric bias” and of the lower/upper LoA for all DEP models depicted in Figure 5 are listed in Tables 2 and 5.
From this experiment, we can conclude that DEP-GAN learns reasonably well from either IM or PM. However, the information contained in IM is more complex than that in PM, which makes it more challenging to estimate the evolution of WMH in an unsupervised manner. Furthermore, from Table 5, we can see that Gaussian noise as auxiliary input improved the performance of DEP-GAN. This once again emphasises the importance of Gaussian noise as auxiliary input in the DEP model.
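For reference, the quantities plotted as solid and dashed lines (mean difference and mean ± 1.96 × standard deviation) can be computed as in the following sketch. The sign convention (predicted minus reference) and the use of the sample standard deviation are assumptions of this example, and the volumes shown are arbitrary toy numbers, not data from this study.

```python
from statistics import mean, stdev


def bland_altman(reference, predicted):
    """Return (bias, lower LoA, upper LoA) for paired volume measurements."""
    if len(reference) != len(predicted):
        raise ValueError("both series must contain one value per subject")
    # Assumed sign convention: a positive difference means over-estimation.
    differences = [p - r for r, p in zip(reference, predicted)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)  # sample standard deviation (n - 1)
    return bias, bias - half_width, bias + half_width


# Toy volumes in ml: differences are [1, -1, 3], so the bias is 1.0, the
# standard deviation is 2.0 and the limits of agreement are 1.0 -/+ 3.92.
bias, lower_loa, upper_loa = bland_altman([10.0, 20.0, 30.0], [11.0, 19.0, 33.0])
```

A scheme agrees better with its reference when the bias is close to 0 and the two limits of agreement lie close together around it.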
8.4. Clinical Plausibility of DEP Models’ Results
From Table 6, we can see that the use of expert-delineated binary WMH masks and of WMH maps obtained from thresholding IM or PM (see the second to the fourth rows) all produced the same ANCOVA results: none of the covariates of the model had an effect on the 1-year WMH volume change, and the numerical results were almost identical in the first two decimal places. Therefore, the use of LOTS-IM and UResNet, generators of the IM and PM respectively, for producing WMH maps in clinical studies of mild to moderate stroke seems plausible.
As discussed in Section 1, baseline WMH volume has been recognised as the main predictor of WMH change over time (Chappell et al., 2017; Wardlaw et al., 2017), although the existence of previous stroke lesions (SL) and hypertension have been acknowledged as contributing factors. However, from the results of the ANCOVA models (Table 6), none of the DEP models that used these (i.e., WMH and/or SL volumes) as auxiliary inputs showed similar performance (i.e., in terms of strength and significance of the effect of all the covariates on the WMH change) to the reference WMH maps. The only DEP model that showed promise in reflecting the effect of the clinical factors selected as covariates on WMH progression was the DEP-GAN that used the PM of baseline WMH and Gaussian noise as input (i.e., underlined in the left-hand side column of Table 6).
Some factors might have adversely influenced the performance of these predictive models. First, all deep-learning schemes require a very large amount of balanced data (e.g., in terms of the appearance, frequency and location of the feature of interest, i.e., WMH in this case), which is generally not available. The lack of available data imposed the use of 2D model configurations, which introduced imbalance in the training. For example, not all axial slices have the same probability of WMH occurrence. In addition, WMH are known to be less frequent in the temporal lobes, and the temporal poles are a common site of artefacts affecting the IM and PM; such errors might propagate or even be accentuated when these modalities are used as inputs. Second, the combination of hypertension, age, and the extent, type, location and time since occurrence of the stroke might influence WMH evolution. Therefore, incorporating a model that combines these factors, rather than a single value, would be beneficial. However, such a model is still to be developed, again owing to the lack of available data. Third, tissue properties have not been considered. A model that reflects brain tissue properties in combination with vascular and inflammatory risk factors is still to be developed. Lastly, deep-learning models as we know them, although promising, are reproductive, not creative. The development of more advanced inference systems is paramount before these schemes can be used in clinical practice.
9. Conclusion and Future Work
In this study, we proposed a deep-learning training scheme to predict the evolution of WMH, called the Disease Evolution Predictor (DEP) model. To the best of our knowledge, this is the first extensive study on modelling WMH evolution using deep learning algorithms. Furthermore, we evaluated different configurations of DEP models: unsupervised, indirectly supervised, and supervised (i.e., DEP-GAN using irregularity map (IM), DEP-GAN using probability map (PM), and DEP-UResNet respectively) with different auxiliary inputs (i.e., Gaussian noise, WMH load, and WMH and stroke lesion (SL) loads). These configurations were designed and evaluated to find the best approach to automatically predict and delineate the evolution of WMH from a baseline measurement to a follow-up visit.
From our experiments, on average, the supervised DEP-UResNet yielded the best results in almost every evaluation metric. However, it did not perform well in the clinical plausibility test. The indirectly supervised DEP-GAN yielded an average performance similar to that of the supervised DEP-UResNet, and yielded the best results of all schemes in the clinical plausibility test. Moreover, the results from DEP-UResNet (using Gaussian noise) and DEP-GAN (using PM and Gaussian noise) were not statistically different from each other in delineating the whole WMH, the clusters that were unchanged, and those that shrank. Thus, DEP-GAN using PM and Gaussian noise has the greatest potential for further improvement, which could lead to its use in clinical settings.
If we consider the results, time, and resources spent in this study, then DEP-GAN using PM showed the biggest and strongest potential of all DEP models. Not only did it perform similarly to the supervised DEP-UResNet but it also did not need manual WMH labels on two MRI scans for training (i.e., baseline and follow-up scans). The PM needed as input for this model can be efficiently produced by any supervised (deep) machine learning model. Moreover, the development of automatic WMH segmentation for producing better PM could be done separately and independently from the development of the DEP model. If a better PM model is available in the future, then the DEP-GAN model can be retrained using the newly produced PM for better performance. Also, DEP-GAN using PM could be used for other (neuro-degenerative) pathologies, as long as a set of PM from these other pathologies could be produced and used to (re-)train the DEP-GAN.
Based on the ablation study, Gaussian noise successfully improved all DEP models in almost all evaluation metrics when used as auxiliary input. This suggests that there are indeed some unknown factors that influence the evolution of WMH. These unknown factors make the problem of predicting/delineating WMH evolution non-deterministic, and Gaussian noise was proposed to simulate this scenario. The intuition behind this approach is that Gaussian noise fills in for the missing (unavailable) risk factors or their combination, which could influence the evolution of WMH. Note that it is very challenging to collect and compile all risk factors of WMH evolution in a longitudinal study.
There are several shortcomings anticipated from the results of this study. Firstly, manual WMH labels of two MRI scans (i.e., baseline and follow-up scans) are necessary for training the DEP-UResNet. In many scenarios, this is neither applicable nor efficient in terms of time and resources. Secondly, the unsupervised DEP-GAN using IM is computationally very demanding as it involves regressing IM values across the whole brain tissue. This resulted in the low performance of DEP-GAN using IM in almost all evaluation metrics. Thirdly, the schemes’ performances depend on the quality of the input. For example, the PM generated in this study are slightly biased towards overestimating WMH in the optic radiation and underestimating WMH in the frontal lobe. This could be caused by not correcting the FLAIR images for B1 magnetic field inhomogeneities. However, a previous study on small vessel disease images demonstrated that this correction might affect the results by underestimating the subtle white matter abnormalities characteristic of this disease, and recommended that it be applied to T1- and T2-weighted structural images but not to FLAIR images for WMH segmentation tasks (Hernández et al., 2016). Hence, the biggest challenge of using DEP-GAN with PM is its high dependency on the quality of the initial PM. Fourthly, volumetric agreement analyses suggest that there are still large differences in absolute volume and in change estimates produced by the proposed DEP models. While this study is intended as a “proof-of-principle” study to advance the field of white matter, and ultimately brain-health, prediction, it is worth mentioning that better reliability in WMH assessment is necessary before DEP models can be used in clinical practice. Furthermore, a better understanding of what DEP models extract to estimate WMH evolution would be very useful in clinical practice.
Lastly, the limitation of using (Gaussian) random noise in DEP models is the fact that we do not really know which set of Gaussian random noise should be used to generate the best result for each subject. Note that, in this study, all DEP models that used Gaussian noise as auxiliary input were tested 10 times to find the best set of Gaussian noise that produced the best automatic delineation of WMH evolution overall. In conclusion, DEP models suffer similar problems and limitations to any machine learning based medical image analysis methods.
However, the DEP models proposed in this study open up several possible future avenues to further improve their performance. Firstly, multi-channel input (e.g., PM and T2-FLAIR) could be used instead of single-channel input. In this study, we only used a single channel to draw a fair comparison between DEP-UResNet, which uses T2-FLAIR, and DEP-GAN, which uses either IM or PM. Secondly, a 3D architecture of DEP-GAN could be employed when more subjects are accessible in the future. 3D deep neural networks have been reported to perform better than 2D ones, but they are more difficult to train (Çiçek et al., 2016; Baumgartner et al., 2017). Thirdly, Gaussian noise and known risk factors (e.g., WMH and SL loads) could be modulated together instead of being modulated separately in different models. By modulating them together, the DEP model would be influenced by both the known (available) risk factors and the unknown (missing) factors represented by Gaussian noise. Lastly, a different random noise distribution could be used instead of the Gaussian distribution. Note that each risk factor of WMH evolution (e.g., WMH load, age, and blood pressure) could have a different data distribution, not necessarily Gaussian. If a specific data distribution (i.e., the same as or similar to the real risk factor’s data distribution) could be used for a specific risk factor, then the real data could replace the random noise if available at testing time.
Acknowledgements
The first author (MFR) would like to thank the Indonesia Endowment Fund for Education (LPDP) of the Ministry of Finance, Republic of Indonesia, for funding his study at the School of Informatics, the University of Edinburgh. Funds from the Row Fogo Charitable Trust (Grant No. BRO-D.FID3668413) (MCVH) are also gratefully acknowledged. The primary study that provided data for this study was funded by the Wellcome Trust (Ref No. 088134/Z/0). The authors also acknowledge support from the European Union Horizon 2020 (PHC-03-15, project No 666881, ‘SVDs@Target’), Fondation Leducq (CVD 16/05), and the UK Dementia Research Institute at the University of Edinburgh.