Performance of five automated white matter hyperintensity segmentation methods in a multicenter dataset

Heinen, Rutger; Steenwijk, Martijn D.; Barkhof, Frederik; Biesbroek, J. Matthijs; van der Flier, Wiesje M.; Kuijf, Hugo J.; Prins, Niels D.; Vrenken, Hugo; Biessels, Geert Jan; de Bresser, Jeroen

doi:10.1038/s41598-019-52966-0

Published: 14 November 2019

Scientific Reports volume 9, Article number: 16742 (2019)

Subjects

Stroke

Abstract

White matter hyperintensities (WMHs) are a common manifestation of cerebral small vessel disease that is increasingly studied with large, pooled multicenter datasets. This data pooling increases statistical power, but poses challenges for automated WMH segmentation. Although there is extensive literature on the evaluation of automated WMH segmentation methods, such evaluations in a multicenter setting are lacking. We performed WMH segmentations in sixty patients scanned on six different magnetic resonance imaging (MRI) scanners (10 patients per scanner) using five freely available and fully-automated WMH segmentation methods (Cascade, kNN-TTP, Lesion-TOADS, LST-LGA and LST-LPA). Different MRI scanner vendors and field strengths were included. We compared these automated WMH segmentations with manual WMH segmentations as a reference. Performance of each method, both within and across scanners, was assessed using spatial and volumetric correspondence with the reference segmentations, measured by Dice’s similarity coefficient (DSC) and the intra-class correlation coefficient (ICC), respectively. We found the best performance, both within and across scanners, for kNN-TTP, followed by LST-LPA and LST-LGA, with worse performance for Lesion-TOADS and Cascade. Our findings can serve as a guide for choosing a method and also highlight the importance of further improving and evaluating the consistency of methods in a multicenter setting.

Introduction

Pooling of multicenter brain magnetic resonance imaging (MRI) data is a trend in various research fields, including studies on ageing-related brain diseases^1,2,3. Pooling of multicenter data increases sample size (and thus statistical power) and can support faster patient inclusion. Moreover, findings of multicenter studies may have a larger external validity and are more readily translatable to a clinical setting. However, pooling of brain MRI data poses challenges in automated segmentation due to variations in image acquisition.

White matter hyperintensities of presumed vascular origin (WMHs) are frequently encountered in studies on ageing related brain diseases. Achieving accurate and precise WMH segmentations can be challenging across MRI scanners of different vendors, field strengths and scan protocols. Variability in MRI acquisition can lead to differences in the contrast and borders of WMHs and thereby quantification bias^4,5,6.

Several automated and semi-automated methods to segment WMHs currently exist, using various algorithms that rely on intensity, spatial information, or both⁵. These methods can be broadly classified as supervised (i.e. trained using manual segmentations as a reference^7,8), unsupervised (without training^9,10,11) and semi-supervised (with only a small portion of the available data used for training¹²). A recent study provided an extensive overview of existing supervised, unsupervised and semi-supervised methods¹³. Challenges for these methods include false positive (e.g. artefacts, infarcts) and false negative (often for punctate lesions) results. Other challenges include dealing with varying WMH lesion loads (usually lower in MS than in patients with WMHs of presumed vascular origin) and with co-occurring pathologies (e.g. extensive atrophy). There is extensive literature on the evaluation of WMH segmentation methods in different settings, also addressing these challenges⁴. However, the performance of such methods is typically evaluated on single center, single scanner datasets. For WMHs of presumed vascular origin, there is a lack of studies comparing performance of these methods in multicenter, multiscanner datasets and this is an important knowledge gap^4,14.

Therefore, the present study aimed to assess performance, in terms of spatial and volumetric correspondence with reference segmentations, of five automated WMH segmentation methods in a multicenter, multiscanner dataset of patients with WMHs of presumed vascular origin. In particular, we also addressed which methods showed variation in performance across scanners. In addition, we assessed if performance was dependent on WMH lesion load. To this end, we selected five methods that were fully automatic and freely available for academic research: Cascade^15,16, k-nearest neighbor classification with tissue type priors (kNN-TTP)¹⁷, Lesion-TOpology-preserving Anatomical Segmentation (Lesion-TOADS)¹¹, the Lesion Segmentation Tool Lesion Prediction Algorithm (LST-LPA) and the Lesion Segmentation Tool Lesion Growth Algorithm (LST-LGA)¹⁰.

Results

Reference segmentations

The reference segmentations showed a very good inter-rater agreement regarding spatial (Dice’s similarity coefficient (DSC) ± standard deviation (SD): 0.80 ± 0.09) and volumetric agreement (Intra-class correlation coefficient (ICC): 0.97). The intra-rater agreement (DSC ± SD: 0.80 ± 0.08; ICC: 0.99) was also very good. In the test set, seventeen subjects had a Fazekas rating of 1, eighteen subjects had a 2, and seven subjects had a 3. The mean WMH volume (±SD) was 21 ± 10 mL with a median of 10 mL and volumes per patient ranging from 0.9 to 199 mL (see Table 1).

Table 1 Mean WMH volume of the reference segmentations and the segmentations of the methods for each scanner (n = 42; n = 7 per scanner).

Quality assessment

Examples of the automated WMH segmentation results are shown in Fig. 1. Several differences between methods can be visually appreciated. For example, methods seemed to differ on how they segment (over or under) different types of WMHs (i.e. periventricular, confluent and punctate WMHs). Also, the nature of segmentation errors varied between methods (i.e. false-positive (FP) versus false-negative (FN) WMH voxels: see Fig. 1). In a quantitative analysis, kNN-TTP showed the lowest mean FP and FN volumes (mean FP volume ± SD/mean FN volume ± SD: 2 ± 2/5 ± 11 mL), followed by LST-LPA (4 ± 4/6 ± 10 mL) and LST-LGA (5 ± 5/8 ± 19 mL). Cascade showed a lower mean FP volume (8 ± 7 mL) but higher mean FN volume (12 ± 29 mL) than Lesion-TOADS (10 ± 16/7 ± 12 mL).

Performance of WMH segmentation methods

Performance of each method, both within and averaged across all scanners, is shown in Table 2. The highest mean performance across scanners was seen for kNN-TTP, both in terms of spatial correspondence with the reference segmentations (mean DSC ± SD: 0.73 ± 0.03) and in terms of volumetric correspondence with the reference segmentations (mean ICC ± SD: 0.97 ± 0.02) (see Table 2). LST-LPA showed a slightly lower performance in terms of volumetric correspondence (mean ICC ± SD: 0.92 ± 0.03) and performed worse than kNN-TTP in terms of spatial correspondence (mean DSC ± SD: 0.60 ± 0.06). The mean absolute WMH volume differences between the methods and the reference segmentations were also lowest for kNN-TTP (5 ± 3 mL; percentage of the mean WMH volume of the reference segmentations: 24%) and LST-LPA (5 ± 2 mL; 24%) (see Table 2). Both methods did show a tendency for slight underestimation of the WMH volume compared to the reference segmentations. LST-LGA showed a performance comparable to LST-LPA (mean DSC ± SD: 0.57 ± 0.03; mean ICC ± SD: 0.65 ± 0.29) but with a larger mean absolute WMH volume difference (8 ± 5 mL; 38%). Performance was lower for Lesion-TOADS (0.53 ± 0.08/0.65 ± 0.29) and Cascade (0.40 ± 0.05/0.44 ± 0.01), with markedly higher mean absolute WMH volume differences for both methods (Lesion-TOADS: 12 ± 8 mL; 57%; Cascade: 16 ± 7 mL; 76%) (see Table 2).
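The percentages quoted above express each mean absolute volume difference relative to the mean reference WMH volume. As a small consistency check, the sketch below recomputes them from the rounded values given in the text, taking 21 mL as the reference mean reported for the test set; the variable names are illustrative only.

```python
# Mean absolute WMH volume differences (mL) as quoted in the text, per method.
mean_abs_diff_ml = {
    "kNN-TTP": 5,
    "LST-LPA": 5,
    "LST-LGA": 8,
    "Lesion-TOADS": 12,
    "Cascade": 16,
}
reference_mean_ml = 21  # mean WMH volume of the reference segmentations (test set)

# Express each difference as a percentage of the reference mean.
percent_of_reference = {
    method: round(100 * diff / reference_mean_ml)
    for method, diff in mean_abs_diff_ml.items()
}
print(percent_of_reference)
```

Because the inputs are already rounded to whole millilitres, this only confirms that the quoted percentages are mutually consistent, not the unrounded values in Table 2.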

Table 2 Performance of the WMH segmentation methods compared to the reference segmentations (n = 42; n = 7 per scanner).

Because some methods (Cascade, Lesion-TOADS, LST-LGA, and LST-LPA) do not necessarily have to be trained, analyses were repeated on all subjects (n = 60) without training of the methods. This did not change the ranking of methods (data not shown). The average run time was shortest for Cascade (2 minutes), followed by kNN-TTP (10 minutes), LST-LPA (12 minutes), LST-LGA (25 minutes) and Lesion-TOADS (30 minutes).

Variations in performance across scanners

For each method, we determined if the DSC (i.e. spatial correspondence with the reference standard) for each scanner differed relative to the other five scanners (Table 3). In this analysis, consistency of a method across scanners is reflected in small effect sizes. kNN-TTP showed the smallest variation in performance with the smallest effect sizes (range unstandardized beta coefficient: −0.06 to 0.01), followed by LST-LGA (−0.04 to 0.07), Cascade (−0.08 to 0.09), LST-LPA (−0.10 to 0.11) and Lesion-TOADS (−0.12 to 0.12). None of the effect sizes were significant after family wise error rate correction for multiple testing. Along the same lines, consistency of volumetric correspondence across scanners was assessed, by determining for each method the interaction between scanner and the relation between the assessed volume and the reference volume. Here we found a significant interaction for Lesion-TOADS on the Philips Ingenuity 3T scanner (family wise error rate corrected p < 0.05), indicating that performance was biased by scanner type. All other interactions were not significant (data not shown).

Table 3 Variation in performance across scanners by means of multiple linear regression analyses (n = 42; n = 7 per scanner).

Performance of WMH segmentation methods for different WMH lesion loads

For all methods the DSC increased when Fazekas scores increased (see Table 4), as the DSC is particularly dependent on the absolute lesion load and the size of the individual lesions¹⁸. kNN-TTP and LST-LPA showed a good volumetric correspondence compared to the reference segmentations across all WMH lesion loads (see Table 4 and Supplementary Fig. 1). Also, variation in WMH volume measurements of these methods was small (i.e. narrow limits of agreement in the Bland Altman plots; see Fig. 2). Cascade, Lesion-TOADS and LST-LGA showed greater variation for different WMH lesion loads (i.e. wider limits of agreement in the Bland Altman plots, see Fig. 2). LST-LGA underestimated WMH volume at higher WMH lesion loads (see Fig. 2 and Supplementary Fig. 1). Cascade and Lesion-TOADS overestimated WMH volumes at lower WMH lesion loads, while Cascade underestimated WMH volumes at higher WMH lesion loads (see Fig. 2 and Supplementary Fig. 1).

Table 4 Performance of WMH segmentation methods for different WMH lesion loads.

Discussion

The current study is the first to investigate the performance of five freely available and fully automated segmentation methods in a multicenter dataset of patients with WMHs of presumed vascular origin. Overall, performance of methods in terms of spatial and volumetric correspondence varied markedly both within and across scanners, with kNN-TTP and LST-LPA being the most consistent and best performing methods. Our findings can serve as a guide for choosing a method. In Table 5, we have provided a qualitative recommendation for each method regarding several aspects when automatically segmenting WMHs based on the results described earlier.

Table 5 Considerations when choosing a method.

Many different automated methods currently exist to segment WMHs. Evaluation of these methods has mainly been performed in a single-center, single scanner setting, with variable performance across methods^{6,7,8,10,11,17,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41}. Some of these methods have also been assessed for scan-rescan reproducibility^6,8,18, which is of particular importance when performing longitudinal research. However, since pooling of data across multiple centers is an important trend in small vessel disease research⁴², there also is a need for automated WMH segmentation methods that perform well across different scanners. Clearly, a multicenter setting with different scan vendors poses challenges, as the method cannot be tuned to one single scan protocol. The question is thus which methods perform robustly enough in such a setting, but this has been explored by few studies. A recent study, coordinated by our group, compared the performance of twenty methods, but in contrast to the present study, many of the tested methods are not freely available yet⁴³. Two previous studies compared different linear and nonlinear classification techniques to segment WMHs of presumed vascular origin^44,45. The important difference between these studies and the current study is that they primarily focused on the optimal choice of classifiers for WMH segmentation, using a general preprocessing pipeline. By contrast, we evaluated some of the same classifiers as an integral part of a fully automated WMH segmentation method, where the classifier only partially determines the performance of the entire method.

We observed that for segmentation of WMHs of presumed vascular origin, performance of the five tested methods varied markedly, both within and across scanners. kNN-TTP and LST-LPA were the most consistent methods across scanners. kNN-TTP was also the best performing method within scanners, with a DSC comparable to that of a manual segmentation performed by a trained rater and an excellent ICC, whereas LST-LPA performed less well with regard to spatial correspondence with the reference segmentations. This could be relevant when choosing a method to segment WMHs for further analysis where spatial information of WMHs is of particular importance (e.g. lesion symptom mapping⁴⁶). By contrast, when analyzing WMH volumes as a primary outcome, both methods could be suitable.

All methods tended to slightly underestimate WMH volumes at higher lesion loads, but this was most prominent for LST-LGA and Cascade. Lesion-TOADS and Cascade showed the lowest spatial and volumetric correspondence compared to the reference segmentation, and the performance of Lesion-TOADS in particular also varied across scanners. A possible explanation for the differences in performance between methods, both within and across scanners, could be that some methods are more robust to sources of variation in MRI acquisition than others. In our study it is impossible to determine which MRI related factors contribute most to this variation. Future studies are therefore encouraged to determine these sources of variation and the relation to various methods. Another explanation within our study might be the variation in WMH volumes between scanners, which might have introduced variation caused by selection bias. Above all, our study highlights the need to further improve WMH segmentation methods. An important initiative was recently taken in the form of a WMH segmentation challenge⁴³. In this challenge, new WMH segmentation methods were developed and evaluated on a multicenter dataset. The best performing method showed a similar DSC compared to kNN-TTP in the present study.

The number of subjects in our training set is relatively low: only eighteen subjects were used. The ability to train or optimize the included methods with only a limited number of training subjects can be considered a strength of the included approaches. It is often infeasible to acquire large amounts of training data (e.g. 100+ subjects). Our training set was composed in such a way that it included data from the six different scanners—located in two institutes—that were used in this study. This ensured a large amount of possible variation in the MRI data to be used for training (kNN-TTP) or post-hoc optimization (Cascade, Lesion-TOADS, LST-LGA, and LST-LPA) of the methods. Future studies could look into the optimal size and composition of the training set, possibly even further reducing the number of required training subjects. This would increase the applicability of these methods in other centers.

White matter lesions can also have a non-vascular etiology, like in multiple sclerosis (MS). White matter lesions in MS show a different load, morphology and distribution compared to WMHs of presumed vascular origin⁵. Nevertheless, evaluation of methods for segmentation of MS lesions can still be informative for WMH of vascular origin. In the field of MS, a previous study assessed the performance across scanners of Cascade, kNN-TTP, Lesion-TOADS, LST-LGA and LST-LPA⁴⁷. This study showed the highest performance across scanners for kNN-TTP (DSC mean ± SD: 0.44 ± 0.14), followed by LST-LPA (0.37 ± 0.23), Lesion-TOADS (0.35 ± 0.18), LST-LGA (0.31 ± 0.23) and Cascade (0.26 ± 0.17). Although the etiology of MS lesions is different, the overall ranking of methods is comparable to the ranking in our study, with Cascade being the method with the worst performance. The overall performance for MS lesion segmentation of each method is however lower than in our study. This discrepancy can possibly be explained by the difference in white matter lesion load between the previous study in MS (WMH volume mean ± SD: 5 ± 7 mL) and our study (20 ± 9 mL). Particularly for the segmentation of multiple small lesions, the DSC can become relatively low.

The main strength of our study is that it allows a direct comparison in performance of these methods for multicenter use. To achieve this goal, we have constructed a high quality MRI dataset consisting of reference segmentations. A possible limitation could be the downsampling of the 3D FLAIR images, since performance of automated methods tends to be better at higher resolution. However, downsampling was necessary for a fair comparison across all scanners. Furthermore, manual segmentation of 3D FLAIR scans is more time consuming than 2D FLAIR scans. Another limitation could be the comparison of binary reference segmentations with binary automated segmentations (i.e. thresholding the initial probabilistic output of the automated methods). However, the alternative approach of creating probabilistic manual segmentations (e.g. by combining binary manual segmentations of the same subject performed by multiple raters into a single probabilistic segmentation) is very labor intensive. Moreover, it has limited added value over manual segmentation of a larger number of subjects. We have therefore invested in manual segmentations of more subjects in combination with determining optimal thresholds of the automated segmentations by using the training set. Another possible limitation of our study could be that we did not scan the same subject(s) on all six scanners. However, the aim of our study was not to assess (and quantify) the source of variation that could be introduced by using different MRI-scanners, but to determine the performance across scanners of widely used automated WMH segmentation methods in a dataset with different MRI-scanners that reflects general practice. A final limitation could be the selection of subjects for the present study. We chose to exclude subjects with severe motion artifacts and/or presence of large (sub)cortical brain infarcts. 
However, these brain abnormalities can often be observed in patients with WMH of presumed vascular origin and this could potentially lead to a different ranking in performance of the methods, as some methods might be more robust for these brain abnormalities. With regard to the design of the study and selection of methods, it could be argued that kNN-TTP is a supervised approach that uses fully annotated example data for training, whereas the other methods were only post hoc fine-tuned, which could have “favored” kNN-TTP as compared to the other methods. Yet, the counterargument would be that the training and test sets were kept fully separated in our study. Hence, the observation that a trained method, like kNN-TTP, outperformed the other methods would only strengthen the case for supervised methods in this application. In practice, such training takes only limited effort, as in our case the kNN-TTP was only offered a relatively low amount of training data (eighteen subjects).

In conclusion, performance of different methods for WMH segmentation varied markedly both within and across scanners. Our findings can serve as a guide for choosing a method and also highlight the importance of further improving and evaluating the consistency of methods in a multicenter setting. Studies planning to segment WMHs from multicenter datasets should assess performance of their method of choice using a pilot sample of their data with manual segmentations.

Materials and Methods

Study population

Subjects with WMHs of presumed vascular origin (as defined by the STRIVE criteria)⁴⁸ were selected from the TRACE-VCI study. This is a multicenter study on subjects with vascular cognitive impairment (VCI; n = 860) in the Netherlands and was described earlier⁴⁹. In short, all patients that presented with cognitive complaints and vascular brain injury on MRI (i.e. possible VCI) were eligible to participate. Subjects scanned on six different MRI scanners were included. Four scanners were located at the Amsterdam University Medical Center (Amsterdam UMC), Amsterdam, the Netherlands (General Electric (GE) Signa HDxt 1.5T; GE Signa HDxt 3T; GE Discovery MR750 3T [General Electric Healthcare, Milwaukee, Wisconsin, USA] and Philips Ingenuity 3T [Philips Medical Systems, Best, the Netherlands]). Two scanners were located at the University Medical Center Utrecht (UMCU), Utrecht, the Netherlands (Philips Achieva 3T and Philips Ingenia 3T [Philips Medical Systems, Best, the Netherlands]). For the present study, ten subjects with varying WMH lesion load (Fazekas scale 1 to 3)⁵⁰ were randomly selected per MRI scanner to represent the variation in WMH lesion load across the entire cohort. This led to inclusion of a total of 60 subjects (38 females, 22 males; age 68 ± 8 years). Compared to the entire cohort, there was no significant difference in age in the current study population (Student’s t-test; p > 0.05). There was a significant difference in gender (chi-square test; p < 0.05) with a relatively higher percentage of females in the current study population compared to the entire cohort⁴⁹. Subjects with severe motion artifacts and/or presence of large (sub)cortical brain infarcts (less than 10% of the total cohort) were not considered for the present study. From the 60 subjects, we selected a training set of 18 subjects (i.e. three subjects per scanner; one randomly selected subject per Fazekas scale for each scanner) and a test set of 42 subjects (i.e. seven subjects per scanner). The training set and test set showed no significant difference in age (Student’s t-test; p > 0.05), gender (chi-square test; p > 0.05) or WMH volume (Mann-Whitney U test; p > 0.05). The study was approved by the institutional review boards of the Amsterdam UMC and the UMCU (approval number 14-083/C). All procedures were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2013. All participating subjects provided written informed consent.
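The training and test split described above (one randomly selected subject per Fazekas grade per scanner for training, the remainder for testing) can be sketched as follows. This is a minimal illustration, not the authors' selection script: the subject records, the field names `id`, `scanner` and `fazekas`, and the function `split_per_scanner` are all hypothetical.

```python
import random
from collections import defaultdict

def split_per_scanner(subjects, seed=0):
    """Pick one random subject per (scanner, Fazekas grade) for training;
    all remaining subjects form the test set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject in subjects:
        strata[(subject["scanner"], subject["fazekas"])].append(subject)
    # One training subject from every scanner/grade combination.
    train = [rng.choice(strata[key]) for key in sorted(strata)]
    train_ids = {s["id"] for s in train}
    test = [s for s in subjects if s["id"] not in train_ids]
    return train, test

# Hypothetical cohort: 6 scanners x 10 subjects, Fazekas grades cycling 1, 2, 3.
subjects = [
    {"id": f"scanner{sc}-sub{i}", "scanner": sc, "fazekas": 1 + i % 3}
    for sc in range(6)
    for i in range(10)
]
train, test = split_per_scanner(subjects)
print(len(train), len(test))  # 18 42
```

With six scanners and three Fazekas grades, this yields 18 training subjects and leaves 42 test subjects, matching the set sizes reported in the text.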

MR imaging

All subjects were scanned using an MRI protocol that included a 3D T1-weighted and fluid-attenuated inversion recovery (FLAIR) sequence⁴⁹. The MRI sequence parameters are shown in Table 6. To make a fair comparison across all MRI scanners, all 3D FLAIR scans from subjects who were scanned at the Amsterdam UMC were resampled in the axial plane to better match the 2D FLAIR acquisitions from the UMCU. This was done using a linear interpolation tool in MeVisLab (MeVis Medical Solutions AG, Bremen, Germany), resulting in 3 mm slices with an in-plane resolution of 0.95–1.21 mm⁵¹.
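To illustrate what resampling to 3 mm slices by linear interpolation involves, the sketch below interpolates a volume along its slice axis with NumPy. It is not the MeVisLab tool used in the study; the function `resample_slices`, the (slices, rows, columns) axis order and the 2 mm input spacing are assumptions made for the example.

```python
import numpy as np

def resample_slices(volume, old_spacing_mm, new_spacing_mm):
    """Linearly interpolate a (slices, rows, cols) volume to a new slice spacing."""
    volume = np.asarray(volume, dtype=float)
    n_old = volume.shape[0]
    extent_mm = (n_old - 1) * old_spacing_mm
    # Positions of the new slices, expressed in units of old slice indices.
    new_pos = np.arange(0.0, extent_mm + 1e-9, new_spacing_mm) / old_spacing_mm
    lower = np.floor(new_pos).astype(int)
    upper = np.minimum(lower + 1, n_old - 1)
    weight = (new_pos - lower)[:, None, None]
    # Weighted average of the two neighbouring original slices.
    return (1.0 - weight) * volume[lower] + weight * volume[upper]

# Hypothetical volume whose intensity equals the slice position in mm (2 mm spacing).
positions_mm = np.array([0.0, 2.0, 4.0, 6.0])
volume = positions_mm[:, None, None] * np.ones((1, 2, 2))
resampled = resample_slices(volume, old_spacing_mm=2.0, new_spacing_mm=3.0)
print(resampled.shape, resampled[:, 0, 0])
```

Because the test volume is a linear ramp, the interpolated slices at 0, 3 and 6 mm reproduce their own positions, which is a quick way to check the interpolation weights.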

Table 6 Overview of MRI sequence parameters for each scanner.

Reference segmentations

WMH reference segmentations were constructed as reference data for training and testing the automated WMH segmentation methods. The reference segmentations were obtained for all subjects, prior to and without knowledge of the results of the automated segmentation methods, using the following procedure. An in-house developed MeVisLab (MeVis Medical Solutions AG, Bremen, Germany) tool was used to semi-automatically delineate the contour of WMHs on all axial slices^46,51. In short, WMHs were segmented using an iso-contouring technique. Contours were converted into binary segmentation masks by including all voxels having a (sub)voxel volume of at least 20% within the contour. This threshold value was chosen by visual comparison of images thresholded with values between 0 and 100% (intervals of 5%). All reference segmentations were constructed by a single rater (RH). To assess inter-rater reliability of the reference segmentations, JMB constructed reference segmentations on a subset of twenty subjects by using the same semi-automatic procedure. To assess intra-rater reliability of the reference segmentations, RH constructed a second segmentation on a subset of twenty subjects.

Automated WMH segmentation methods

For the present study, we included methods that were fully-automated and freely available for academic research: Cascade, kNN-TTP, Lesion-TOADS, LST-LGA, and LST-LPA. All methods were run on FLAIR and 3D T1-weighted MR-images of all subjects to obtain WMH segmentations. Default settings were used as much as possible. The training set of subjects (n = 18) was used to train and tune each of the methods (i.e. to determine optimal thresholds). For Cascade, we ran the segmentation algorithm on the training set while changing the two main parameters (lower threshold and upper threshold: {0.05, 0.075, 0.100, …, 1.00})^15,16. We then chose the parameter combination that generated the highest DSC in the training set (in the current study: lower threshold = 0.95; upper threshold = 0.975). A similar approach was used to derive the optimal parameter settings for LST-LGA (parameters kappa {0.05, 0.10, …, 1.00} and lesion probability threshold {0.05, 0.10, …, 1.00}; optimal settings for kappa: 0.25 and lesion probability threshold of 0.2)¹⁰. For LST-LPA and kNN-TTP only the lesion probability threshold was tuned {0.05, 0.10, …, 1.00}, resulting in optimal values of 0.3 for LST-LPA and 0.35 for kNN-TTP¹⁷. Because in kNN-TTP, the reference data are actively used in every run of the algorithm, a leave-one-out cross-validation was used to optimize kNN-TTP parameters to ensure independence of the evaluation¹⁷. We did not exclude specific brain regions (such as the brain stem or basal ganglia where often higher false positive rates can be observed) from the analyses, since the aim of our study was to evaluate methods using their own processing. For a detailed overview of the workflow used for each method, see the Supplementary Information.
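The threshold tuning described above amounts to a grid search that keeps the setting with the highest DSC on the training set. The sketch below shows the idea for a single lesion probability threshold, assuming each method yields a probability map per subject. The helper names `dice` and `tune_threshold` and the toy data are hypothetical, and averaging the DSC over training subjects is an assumption, since the text does not state how the DSC was aggregated.

```python
import numpy as np

def dice(a, b):
    """Dice's similarity coefficient between two boolean masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total else 1.0

def tune_threshold(prob_maps, references, thresholds):
    """Return the threshold with the highest mean DSC over the training subjects."""
    mean_dsc = [
        np.mean([dice(p >= t, r) for p, r in zip(prob_maps, references)])
        for t in thresholds
    ]
    best = int(np.argmax(mean_dsc))  # first maximum wins on ties
    return float(thresholds[best]), float(mean_dsc[best])

# Candidate thresholds 0.05, 0.10, ..., 1.00, as in the text.
thresholds = np.round(np.arange(0.05, 1.0001, 0.05), 2)

# Toy training subject: lesion voxels have probability 0.62, background 0.22.
reference = np.zeros((8, 8), dtype=bool)
reference[2:5, 2:5] = True
prob_map = np.where(reference, 0.62, 0.22)

best_t, best_dsc = tune_threshold([prob_map], [reference], thresholds)
print(best_t, best_dsc)
```

For methods with two tuned parameters (such as the lower and upper thresholds of Cascade, or kappa and the probability threshold of LST-LGA), the same search would run over all parameter pairs, with the segmentation rerun for each pair.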

Statistical analysis

All automated WMH segmentation methods were evaluated on the test set (n = 42; i.e. 7 subjects per scanner). Several evaluation metrics currently exist to evaluate performance of WMH segmentation methods, each with their own advantages and disadvantages (for an overview see⁵²). For the present study, we chose frequently used evaluation metrics that have been used in recent comparative studies on WMH segmentation^8,47.

Quality assessment

We evaluated all methods qualitatively by visually comparing the output of each method with the reference segmentations. Next, we evaluated all methods quantitatively by calculating false positive (FP) volumes (in mL) and false negative (FN) volumes (in mL) of the WMH segmentations of each method using the reference segmentations.
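As an illustration of the quantitative step, false-positive and false-negative volumes can be derived from two binary masks and the voxel size. The sketch below assumes both masks are on the same voxel grid; the function name `fp_fn_volumes_ml`, the toy masks and the 3 × 1 × 1 mm voxel size are illustrative choices, not values taken from the study.

```python
import numpy as np

def fp_fn_volumes_ml(auto_mask, ref_mask, voxel_size_mm):
    """Return (false-positive, false-negative) volumes in mL for two binary masks."""
    auto_mask = np.asarray(auto_mask, dtype=bool)
    ref_mask = np.asarray(ref_mask, dtype=bool)
    voxel_ml = float(np.prod(voxel_size_mm)) / 1000.0  # 1 mL = 1000 mm^3
    fp = np.count_nonzero(auto_mask & ~ref_mask) * voxel_ml  # segmented, not in reference
    fn = np.count_nonzero(~auto_mask & ref_mask) * voxel_ml  # in reference, missed
    return fp, fn

# Toy example: the automated mask is the reference shifted by one voxel.
ref = np.zeros((4, 10, 10), dtype=bool)
ref[1:3, 2:6, 2:6] = True       # 32 reference voxels
auto = np.zeros_like(ref)
auto[1:3, 3:7, 2:6] = True      # 24 voxels overlap with the reference

fp_ml, fn_ml = fp_fn_volumes_ml(auto, ref, voxel_size_mm=(3.0, 1.0, 1.0))
print(fp_ml, fn_ml)
```

In this toy case eight voxels are false positive and eight are false negative, each voxel being 0.003 mL.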

Performance within scanners

The performance of each method was assessed per scanner by measuring: (a) the spatial (i.e. voxel-wise) correspondence with the reference segmentations by using the DSC; (b) the volumetric correspondence with the reference WMH volumes by using the ICC (two-way mixed model with absolute agreement after log-transforming WMH volumes because of non-normal distribution); (c) the mean differences and mean absolute differences between WMH volumes of each method and the reference WMH volumes. Because specific methods (Cascade, Lesion-TOADS, LST-LGA, and LST-LPA) do not necessarily have to be trained, performance was also determined in secondary analyses on all subjects (n = 60) without training of the methods.
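For the volumetric correspondence in (b), a minimal sketch of an absolute-agreement ICC on log-transformed volumes is given below. It implements the single-measures form often written ICC(A,1), computed from two-way ANOVA mean squares. Whether the study used single or average measures is not stated, so that choice is an assumption, as are the function name `icc_absolute_agreement` and the hypothetical volumes.

```python
import numpy as np

def icc_absolute_agreement(ratings):
    """ICC(A,1): two-way model, absolute agreement, single measures.

    `ratings` has shape (n_subjects, n_raters); here the two "raters" are
    the reference segmentation and one automated method.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per subject
    col_means = x.mean(axis=0)   # per rater
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical WMH volumes in mL (not data from the study).
reference_ml = np.array([2.0, 5.0, 9.0, 14.0, 30.0, 60.0, 110.0])
method_ml = 0.8 * reference_ml   # a method that underestimates every volume by 20%

log_ref = np.log(reference_ml)
icc_perfect = icc_absolute_agreement(np.column_stack([log_ref, log_ref]))
icc_biased = icc_absolute_agreement(np.column_stack([log_ref, np.log(method_ml)]))
print(icc_perfect, icc_biased)
```

Under absolute agreement, a systematic offset between method and reference lowers the ICC even when the two are perfectly correlated, which is why the biased example scores below the identical one.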

Mean performance across scanners

The mean performance of each method across scanners was determined by averaging the mean DSC, ICC and absolute volume differences of each scanner.

Variations in performance across scanners

To investigate the variation in performance across scanners of each method, we performed the following two analyses:

(a) For each method, we assessed whether the DSC (as an outcome) depended on scanner (as a categorical variable with each scanner being compared to all other scanners as the reference) using linear regression analysis. This resulted in an unstandardized beta coefficient with 95% confidence intervals for each scanner. A significant relation between a certain scanner and the DSC (family wise error rate corrected p-value of <0.05 using a Bonferroni correction) indicates that the performance (in terms of spatial correspondence with the reference segmentation) was biased by the use of that scanner (compared to the other scanners).

(b) For each method, we assessed whether the relation between the reference WMH volumes (as an outcome) and WMH volumes of the automated WMH segmentation method (as a determinant) depended on scanner (as a categorical variable with each scanner being compared to all other scanners as the reference) by using linear regression analyses. Because of non-normal distribution, WMH volumes of each method and the reference WMH volumes were log-transformed. A significant interaction between the log transformed WMH volume of a method and a certain scanner (family wise error rate corrected p-value of <0.05) indicates that performance of that method (in terms of volumetric correspondence with the reference segmentation) was biased by the use of that scanner (compared to the other scanners).
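Analysis (a) can be sketched as a regression of the DSC on a single indicator variable that marks one scanner, with all other scanners as the reference. In that setup the unstandardized beta equals the difference between the mean DSC on that scanner and the mean DSC on the others. The sketch below reflects one reading of the description, not the authors' statistical script: the function names, the DSC values and the p-values are hypothetical, and the number of tests entering the Bonferroni correction is an assumption.

```python
import numpy as np

def scanner_effect(dsc, scanner, target):
    """Unstandardized beta of DSC regressed on an indicator for one scanner
    (all other scanners pooled as the reference category)."""
    dsc = np.asarray(dsc, dtype=float)
    indicator = (np.asarray(scanner) == target).astype(float)
    design = np.column_stack([np.ones_like(dsc), indicator])  # intercept + indicator
    (intercept, beta), *_ = np.linalg.lstsq(design, dsc, rcond=None)
    return float(beta)

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical per-subject DSC values for one method on three scanners.
dsc = [0.70, 0.74, 0.72, 0.60, 0.64, 0.62, 0.71, 0.73, 0.75]
scanner = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

beta_b = scanner_effect(dsc, scanner, target="B")
adjusted = bonferroni([0.004, 0.03, 0.5])  # hypothetical raw p-values
print(beta_b, adjusted)
```

Analysis (b) extends the same design with the log-transformed method volume and its product with the scanner indicator, where the coefficient of the product term is the interaction being tested.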

Performance for different WMH lesion loads

In addition, the MRI scans of all subjects were stratified based on the Fazekas scale (Fazekas scale 1/2/3: n = 17/n = 18/n = 7). We then assessed whether the performance of each method was dependent on the WMH lesion load (i.e. Fazekas scale) using DSC, ICC and mean (absolute) volume differences. In addition, Bland-Altman plots were made to compare WMH volume of each method with the reference WMH volumes⁵³. Bland Altman plots provide a graphical representation of the amount of variation from the mean when comparing WMH volumes of the WMH segmentation methods and the reference segmentations. In these plots, a narrow width of the limits of agreement reflects a small amount of variation between WMH volumes of the WMH segmentation methods and the reference segmentations. The difference between WMH volumes of the WMH segmentation methods and the reference segmentation reflects over- or underestimation of the WMH segmentation methods. Both a change in the direction of WMH volume differences (i.e. positive or negative differences) as well as the distribution of WMH volume differences (narrow or wide) for different WMH lesion loads, can reflect performance of a WMH segmentation method to be dependent on the WMH lesion load.
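The quantities read from a Bland-Altman plot, the mean difference (bias) and the limits of agreement, can be computed as sketched below. The limits are taken as the bias ± 1.96 standard deviations of the differences, which is the conventional choice and is assumed here since the text does not state it; the function name `bland_altman` and the volumes are hypothetical.

```python
import statistics

def bland_altman(method_ml, reference_ml):
    """Return the bias (mean difference) and the 95% limits of agreement."""
    diffs = [m - r for m, r in zip(method_ml, reference_ml)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical WMH volumes in mL for a method that underestimates large lesion loads.
reference_ml = [5.0, 10.0, 20.0, 40.0, 90.0]
method_ml = [4.0, 9.0, 18.0, 33.0, 70.0]

bias, (lower, upper) = bland_altman(method_ml, reference_ml)
print(bias, lower, upper)
```

A negative bias indicates underestimation relative to the reference, and differences that grow with lesion load, as in this toy example, widen the limits of agreement.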

Data availability

The data that support the findings of this study are available from the final author, upon reasonable request.

References

Carrillo, M. C., Bain, L. J., Frisoni, G. B. & Weiner, M. W. Worldwide Alzheimer’s disease neuroimaging initiative. Alzheimers. Dement. 8, 337–42 (2012).
Williamson, J. D. et al. The Action to Control Cardiovascular Risk in Diabetes Memory in Diabetes Study (ACCORD-MIND): Rationale, Design, and Methods. Am. J. Cardiol. 99 (2007).
Mueller, S. G. et al. Ways toward an early diagnosis in Alzheimer’s disease: The Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s Dement. 1, 55–66 (2005).
De Guio, F. et al. Reproducibility and variability of quantitative magnetic resonance imaging markers in cerebral small vessel disease. J. Cereb. Blood Flow Metab. 36, 1319–1337 (2016).
Caligiuri, M. E. et al. Automatic Detection of White Matter Hyperintensities in Healthy Aging and Pathology Using Magnetic Resonance Imaging: A Review. Neuroinformatics 13, 261–276 (2015).
Jain, S. et al. Automatic segmentation and volumetry of multiple sclerosis brain lesions from MR images. NeuroImage Clin. 8, 367–375 (2015).
Ghafoorian, M. et al. Automated detection of white matter hyperintensities of all sizes in cerebral small vessel disease. Med. Phys. 43, 6246–6258 (2016).
Griffanti, L. et al. BIANCA (Brain Intensity AbNormality Classification Algorithm): A new tool for automated segmentation of white matter hyperintensities. Neuroimage 141, 191–205 (2016).
Bowles, C. et al. Pseudo-healthy image synthesis for white matter lesion segmentation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9968 LNCS, 87–96 (2016).
Schmidt, P. et al. An automated tool for detection of FLAIR-hyperintense white-matter lesions in Multiple Sclerosis. Neuroimage 59, 3774–3783 (2012).
Shiee, N. et al. A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. Neuroimage 49, 1524–1535 (2010).
Qin, C. et al. A large margin algorithm for automated segmentation of white matter hyperintensity. Pattern Recognit. 77, 150–159 (2018).
Guerrero, R. et al. White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks. NeuroImage Clin. 17, 918–934 (2018).
Ling, Y., Jouvent, E., Cousyn, L., Chabriat, H. & De Guio, F. Validation and Optimization of BIANCA for the Segmentation of Extensive White Matter Hyperintensities. Neuroinformatics 1–13, https://doi.org/10.1007/s12021-018-9372-2 (2018).
Damangir, S. et al. Multispectral MRI segmentation of age related white matter changes using a cascade of support vector machines. J. Neurol. Sci. 322, 211–216 (2012).
Damangir, S. et al. Reproducible segmentation of white matter hyperintensities using a new statistical definition. Magn. Reson. Mater. Physics, Biol. Med. 30, 227–237 (2017).
Steenwijk, M. D. et al. Accurate white matter lesion segmentation by k nearest neighbor classification with tissue type priors (kNN-TTPs). NeuroImage Clin. 3, 462–9 (2013).
Admiraal-Behloul, F. et al. Fully automatic segmentation of white matter hyperintensities in MR images of the elderly. Neuroimage 28, 607–617 (2005).
Admiraal-Behloul, F. et al. Fully automatic segmentation of white matter hyperintensities in MR images of the elderly. Neuroimage 28, 607–617 (2005).
Anbeek, P., Vincken, K. L., Van Osch, M. J. P., Bisschops, R. H. C. & Van Der Grond, J. Probabilistic segmentation of white matter lesions in MR imaging. Neuroimage 21, 1037–1044 (2004).
Beare, R. et al. Development and validation of morphological segmentation of age-related cerebral white matter hyperintensities. Neuroimage 47, 199–203 (2009).
Brickman, A. M. et al. Quantitative approaches for assessment of white matter hyperintensities in elderly populations. Psychiatry Res. - Neuroimaging 193, 101–106 (2011).
de Boer, R. et al. White matter lesion extension to automatic brain tissue segmentation on MRI. Neuroimage 45, 1151–1161 (2009).
Erus, G., Zacharaki, E. I. & Davatzikos, C. Individualized statistical learning from medical image databases: Application to identification of brain lesions. Med. Image Anal. 18, 542–554 (2014).
Gibson, E., Gao, F., Black, S. E. & Lobaugh, N. J. Automatic segmentation of white matter hyperintensities in the elderly using FLAIR images at 3T. J. Magn. Reson. Imaging 31, 1311–1322 (2010).
Herskovits, E. H., Bryan, R. N. & Yang, F. Automated Bayesian segmentation of microvascular white-matter lesions in the ACCORD-MIND study. Adv. Med. Sci. 53, 182–90 (2008).
Iorio, M. et al. White matter hyperintensities segmentation: A new semi-automated method. Front. Aging Neurosci. 5 (2013).
Ithapu, V. et al. Extracting and summarizing white matter hyperintensities using supervised segmentation methods in Alzheimer’s disease risk and aging studies. Hum. Brain Mapp. 35, 4219–4235 (2014).
Khayati, R., Vafadust, M., Towhidkhah, F. & Nabavi, M. Fully automatic segmentation of multiple sclerosis lesions in brain MR FLAIR images using adaptive mixtures method and markov random field model. Comput. Biol. Med. 38, 379–390 (2008).
Lao, Z. et al. Computer-Assisted Segmentation of White Matter Lesions in 3D MR Images Using Support Vector Machine. Acad. Radiol. 15, 300–313 (2008).
Moeskops, P. et al. Evaluation of a deep learning approach for the segmentation of brain tissues and white matter hyperintensities of presumed vascular origin in MRI. NeuroImage Clin. 17, 251–262 (2017).
Ramirez, J. et al. Lesion Explorer: A comprehensive segmentation and parcellation package to obtain regional volumetrics for subcortical hyperintensities and intracranial tissue. Neuroimage 54, 963–973 (2011).
Rincón, M. et al. Improved Automatic Segmentation of White Matter Hyperintensities in MRI Based on Multilevel Lesion Features. Neuroinformatics 15, 231–245 (2017).
Sajja, B. R. et al. Unified approach for multiple sclerosis lesion segmentation on brain MRI. Ann. Biomed. Eng. 34, 142–151 (2006).
Simões, R. et al. Automatic segmentation of cerebral white matter hyperintensities using only 3D FLAIR images. Magn. Reson. Imaging 31, 1182–1189 (2013).
Smart, S. D., Firbank, M. J. & O’Brien, J. T. Validation of automated white matter hyperintensity segmentation. J. Aging Res. 2011, 391783 (2011).
Tsai, J. Z. et al. Automated segmentation and quantification of white matter hyperintensities in acute ischemic stroke patients with cerebral infarction. PLoS One 9, e104011 (2014).
Wang, R. et al. Automatic segmentation and volumetric quantification of white matter hyperintensities on fluid-attenuated inversion recovery images using the extreme value distribution. Neuroradiology 57, 307–320 (2015).
Wang, R. et al. Automatic segmentation and quantitative analysis of white matter hyperintensities on FLAIR images using trimmed-likelihood estimator. Acad. Radiol. 21, 1512–1523 (2014).
Wu, Y. et al. Automated segmentation of multiple sclerosis lesion subtypes with multichannel MRI. Neuroimage 32, 1205–1215 (2006).
Zhong, Y., Utriainen, D., Wang, Y., Kang, Y. & Haacke, E. M. Automated White Matter Hyperintensity Detection in Multiple Sclerosis Using 3D T2 FLAIR. Int. J. Biomed. Imaging 2014 (2014).
Dichgans, M. et al. METACOHORTS for the study of vascular disease and its contribution to cognitive decline and neurodegeneration: An initiative of the Joint Programme for Neurodegenerative Disease Research. Alzheimer’s and Dementia 12, 1235–1249 (2016).
Kuijf, H. J. et al. Standardized Assessment of Automatic Segmentation of White Matter Hyperintensities; Results of the WMH Segmentation Challenge. IEEE Trans. Med. Imaging 1–36, https://doi.org/10.1109/TMI.2019.2905770 (2019).
Dadar, M. et al. Performance comparison of 10 different classification techniques in segmenting white matter hyperintensities in aging. Neuroimage 157, 233–249 (2017).
Samaille, T. et al. Contrast-Based Fully Automatic Segmentation of White Matter Hyperintensities: Method and Validation. PLoS One 7 (2012).
Biesbroek, J. M. et al. Impact of Strategically Located White Matter Hyperintensities on Cognition in Memory Clinic Patients with Small Vessel Disease. PLoS One 11, e0166261 (2016).
de Sitter, A. et al. Performance of five research-domain automated WM lesion segmentation methods in a multi-center MS study. Neuroimage 163, 106–114 (2017).
Wardlaw, J. M. et al. Neuroimaging standards for research into small vessel disease and its contribution to ageing and neurodegeneration. The Lancet Neurology 12, 822–838 (2013).
Boomsma, J. M. F. et al. Vascular Cognitive Impairment in a Memory Clinic Population: Rationale and Design of the ‘Utrecht-Amsterdam Clinical Features and Prognosis in Vascular Cognitive Impairment’ (TRACE-VCI) Study. JMIR Res. Protoc. 6, e60 (2017).
Fazekas, F., Chawluk, J. B. & Alavi, A. MR signal abnormalities at 1.5 T in Alzheimer’s dementia and normal aging. American Journal of Neuroradiology 8, 421–426 (1987).
Ritter, F. et al. Medical image analysis. IEEE Pulse 2, 60–70 (2011).
Taha, A. A. & Hanbury, A. Metrics for evaluating 3D medical image segmentation: Analysis, selection, and tool. BMC Med. Imaging 15 (2015).
Martin Bland, J. & Altman, D. Statistical Methods for Assessing Agreement Between Two Methods of Clinical Measurement. Lancet 327, 307–310 (1986).


Acknowledgements

We thank N.P.A. Zuithoff, assistant professor in Biostatistic Research, for his help with the statistical analyses. The TRACE-VCI study is supported by Vidi grant 91711384 and Vici grant 91816616 from ZonMw, The Netherlands Organisation for Health Research and Development, and by grant 2010T073 from the Dutch Heart Association to Geert Jan Biessels. Research of the VUMC Alzheimer Center is part of the neurodegeneration research program of the Neuroscience Campus Amsterdam. The VUMC Alzheimer Center is supported by Stichting Alzheimer Nederland and Stichting VUMC fonds. F.B. is supported by the NIHR UCLH biomedical research center.

Author information

Authors and Affiliations

Department of Neurology and Neurosurgery, UMC Utrecht Brain Center, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
Rutger Heinen, J. Matthijs Biesbroek & Geert Jan Biessels
Department of Anatomy and Neurosciences, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, The Netherlands
Martijn D. Steenwijk & Hugo Vrenken
Department of Radiology and Nuclear Medicine, Amsterdam Neuroscience, Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, The Netherlands
Martijn D. Steenwijk, Frederik Barkhof & Hugo Vrenken
Institutes of Neurology & Healthcare Engineering, University College London (UCL), London, United Kingdom
Frederik Barkhof
Alzheimer Center & Department of Neurology, Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, The Netherlands
Wiesje M. van der Flier & Niels D. Prins
Department of Epidemiology and Biostatistics, Vrije Universiteit Amsterdam, Amsterdam UMC, Amsterdam, The Netherlands
Wiesje M. van der Flier
Image Sciences Institute, University Medical Center Utrecht, Utrecht, The Netherlands
Hugo J. Kuijf
Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
Jeroen de Bresser
Department of Neurology, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
E. van den Berg, G. J. Biessels, J. M. F. Boomsma, L. G. Exalto, D. A. Ferro, C. J. M. Frijns, O. N. Groeneveld, R. Heinen, N. M. van Kalsbeek & J. H. Verwer
Department of Radiology, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
J. de Bresser
Image Sciences Institute, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
H. J. Kuijf
Department of Geriatrics, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
M. E. Emmelot-Vonk & H. L. Koek
Alzheimer Center and Department of Neurology, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
M. R. Benedictus, J. Bremer, W. M. van der Flier, A. E. Leeuwis, J. Leijenaar, N. D. Prins, P. Scheltens & B. M. Tijms
Department of Radiology and Nuclear Medicine, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
F. Barkhof & M. P. Wattjes
Department of Clinical Chemistry, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
C. E. Teunissen
Department of Medical Psychology, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
T. Koene
Department of Neurology, Onze Lieve Vrouwe Gasthuis West, Amsterdam, The Netherlands
J. M. F. Boomsma & H. C. Weinstein
Hospital Diakonessenhuis, Zeist, The Netherlands
M. Hamaker, R. Faaij, M. Pleizier, M. Prins & E. Vriens

Author notes

A comprehensive list of consortium members appears at the end of the paper

Authors

Rutger Heinen
Martijn D. Steenwijk
Frederik Barkhof
J. Matthijs Biesbroek
Wiesje M. van der Flier
Hugo J. Kuijf
Niels D. Prins
Hugo Vrenken
Geert Jan Biessels
Jeroen de Bresser

Consortia

TRACE-VCI study group

E. van den Berg, G. J. Biessels, J. M. F. Boomsma, L. G. Exalto, D. A. Ferro, C. J. M. Frijns, O. N. Groeneveld, R. Heinen, N. M. van Kalsbeek, J. H. Verwer, J. de Bresser, H. J. Kuijf, M. E. Emmelot-Vonk, H. L. Koek, M. R. Benedictus, J. Bremer, W. M. van der Flier, A. E. Leeuwis, J. Leijenaar, N. D. Prins, P. Scheltens, B. M. Tijms, F. Barkhof, M. P. Wattjes, C. E. Teunissen, T. Koene, J. M. F. Boomsma, H. C. Weinstein, M. Hamaker, R. Faaij, M. Pleizier, M. Prins & E. Vriens

Contributions

R.H., M.S., H.V., G.J.B. and J.B. designed the study. R.H., M.S., M.B. and H.K. collected and analyzed the data. F.B., W.F. and N.P. collected data. R.H. and J.B. wrote the initial draft of the manuscript. G.J.B., F.B., W.F., N.P. and H.V. critically revised the manuscript. All authors of the present manuscript agreed to contribute and carefully revised the manuscript.

Corresponding author

Correspondence to Rutger Heinen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Heinen, R., Steenwijk, M.D., Barkhof, F. et al. Performance of five automated white matter hyperintensity segmentation methods in a multicenter dataset. Sci Rep 9, 16742 (2019). https://doi.org/10.1038/s41598-019-52966-0


Received: 31 May 2019
Accepted: 22 October 2019
Published: 14 November 2019
DOI: https://doi.org/10.1038/s41598-019-52966-0

