Vasculature segmentation in 3D hierarchical phase-contrast tomography images of human kidneys

Efficient algorithms are needed to segment vasculature in new three-dimensional (3D) medical imaging datasets at scale for a wide range of research and clinical applications. Manual segmentation of vessels in images is time-consuming and expensive. Computational approaches are more scalable but have limitations in accuracy. We organized a global machine learning competition, engaging 1,401 participants, to help develop new deep learning methods for 3D blood vessel segmentation. This paper presents a detailed analysis of the top-performing solutions using manually curated 3D Hierarchical Phase-Contrast Tomography datasets of the human kidney, focusing on segmentation accuracy and morphological analysis, thereby establishing a benchmark for future studies in blood vessel segmentation within phase-contrast tomography imaging.


Introduction
Because blood vasculature extends to all organs of the body, identifying vessels in medical images ("vessel segmentation") is an important part of image processing in applications ranging from basic science research to clinical care.
As the vascular network is an inherently 3D, multi-scale structure, imaging techniques that do not require physical subsampling to achieve high-resolution images across multiple scales are the ideal tool for vascular analysis 1; Hierarchical Phase-Contrast Tomography (HiP-CT) is such a technique 2. Leveraging the European Synchrotron Radiation Facility's (ESRF) Extremely Brilliant Source (EBS), the world's first high-energy fourth-generation synchrotron source, HiP-CT enables imaging of intact human organs at unprecedented scale and resolution. HiP-CT can be used to map and quantify the arterial vascular network of an intact human kidney down to the arteriolar level 3. However, the creation of 3D models from HiP-CT data using traditional manual segmentation techniques is very time-consuming and labor-intensive. This labor-cost barrier impedes the broader utilization of 3D data to answer questions posed by the biomedical community, including understanding exosome transport.
Considering such aims and challenges, a global citizen science competition was organized on Google's machine learning (ML) platform, Kaggle, to develop open-source ML solutions capable of automatically segmenting large datasets to generate 3D models of the blood vasculature of the human kidney. The Kaggle platform has previously been leveraged to collaboratively develop ML solutions for complex biomedical tasks, thereby advancing the field [4][5][6][7]. The competition presented in this paper focuses on 3D data, challenging teams to develop solutions that accurately segment blood vessels in HiP-CT images to create high-quality 3D models of the vessels from organs such as the kidney. Such segmented 3D models can be used to obtain additional information such as length, diameter, branching angles, tortuosity, and inter-vessel distance.
The two sponsors of this competition are the NIH Common Fund's Human BioMolecular Atlas Program (HuBMAP) and Cellular Senescence Network (SenNet) Program. Both provide examples of how researchers can benefit from improved methods of vessel segmentation and served as motivation for the competition: 1. HuBMAP is constructing a map of all the cells in the healthy adult human body. It envisions blood vasculature pathways serving as landmarks to describe the location of cells in organs and tissue [8][9][10]. In this "Vasculature Common Coordinate Framework" (VCCF) 11, accurate vessel segmentation is essential to define these landmarks. As part of this work, HuBMAP has published an initial database of more than 900 blood vessels based on literature review and domain expert curation 12. However, this database needs to be linked to experimental data for validation and extended to the microvasculature, where there are many gaps in existing knowledge. 2. SenNet studies senescent cells of the human body, in health and in disease, across the lifespan 13,14. A key challenge is understanding the secretome, i.e., the set of proteins these cells secrete into the extracellular space and/or the bloodstream. Exosomes, a subcategory of the secretome, are membrane-bound extracellular vesicles used to transport diverse cargo to neighboring cells and to other organs. However, understanding the generation, transportation, and ingestion of exosomes requires data on the human blood vasculature, the main exosome transport system [17]. This paper presents the competition setup, the curated HiP-CT 3D datasets, the competition metric, and the top-5 winning solutions. Additionally, to enable a holistic evaluation of ML model performance, three further analyses are provided: quantitative (based on additional metrics), qualitative (based on visual analysis of predictions), and morphological (based on vasculature morphology features). Due to the novelty of HiP-CT data and the lack of gold standard segmentations for vasculature in such data, the paper presents a benchmark dataset, performance metrics, and benchmark scores for future research in the domain, which can accelerate the growth of our knowledge of the blood vasculature system in support of answering key questions posed by the research community.

Competition design
The "SenNet + HOA: Hacking the Human Vasculature in 3D" competition, running from November 7, 2023 through January 30, 2024, aimed to develop machine learning solutions for segmentation of blood vasculature in 3D HiP-CT scans of the human kidney. The imaging datasets were collected as part of the Human Organ Atlas effort (https://human-organ-atlas.esrf.eu). Vessel segmentation is a challenging task for various reasons: it is an imbalanced class problem (prediction targets are disproportionately small in volume compared to the non-target background); there is very low tolerance for error, since small losses in segmentation connectivity can lead to large variations in simulated function 18; vessels can collapse or become infilled during preparation, which makes them challenging to segment 19; there is wide image variability due to natural human anatomical variation and the nature of imaging post-mortem organs; and training datasets are relatively small due to the novelty of the HiP-CT imaging technique. These issues were documented in previous work by the authors 19, which also showcased the performance of a baseline nnU-Net 20 model on HiP-CT data. A key challenge for participants was to develop solutions that can handle large 3D datasets and that generalize across datasets with varying resolutions, image intensities, and donors.
The HiP-CT datasets used in the competition were sourced from the kidneys of five adult donors. This choice of test data was made to reflect the real challenges faced by our research teams, e.g., to create models from limited training datasets such that the models generalize effectively to new donors and are robust to some changes in the imaging setup. In total, approximately 600 human hours were spent segmenting and verifying the gold standard labels for the competition dataset, underscoring the time- and resource-consuming nature of the segmentation problem and the need for automated methods for the task.
The challenges associated with the creation of high-quality gold standard labels for training ML models are well known 21. With new modalities such as HiP-CT, however, this poses a particular challenge, as there is no large pool of potential expert curators familiar with the data type and the particular artifacts that can be present. To acknowledge this challenge, we created a labeling protocol involving three independent curators (see Methods for details). Using this protocol, we were able to provide both dense labeling (i.e., where the third curator found 100% of vessels were labeled) and sparse labeling with an estimate of the sparsity. This allowed some teams, e.g., Team 3, to use the sparse labels effectively as training data within a pseudo-labeling context. This was critical in the competition design, as it is relatively fast for an experienced curator to sparsely label an image volume but highly time-consuming to label it densely.
In addition, a high-resolution volume of interest (VOI) was provided; this subset of the lower-resolution whole-kidney training dataset can be rigidly registered to the whole-kidney dataset to provide more accurate labels for smaller vessels. Previously, we have shown how these higher-resolution VOIs can be used to validate manual segmentation in the lower-resolution dataset 3. Interestingly, none of the teams attempted to utilize or extend this multi-resolution approach to improve the segmentation of small vessels, although teams that used this dataset as additional training data, such as Team 1, found a greater proportion of the smaller vessels.
In addition to the competition data, participants were allowed to use publicly available external datasets. Some teams used other HiP-CT datasets (without gold standard labels) available via the Human Organ Atlas data portal (https://human-organ-atlas.esrf.eu) to help improve the performance of their solutions. All inference code was submitted via the Kaggle submission portal and was run on the public and private test sets within the Kaggle infrastructure, leading to team rankings on the public and private leaderboards (LB), respectively (see Methods). Throughout the competition, the private LB scores and rankings were not visible to the participants, forcing them to use the public LB scores and rankings as a validation dataset for their experiments. The test datasets were not visible to the participants, but three example z-slices from each test set were provided, along with the voxel resolution of both test sets. Algorithm performance was judged based on the Normalized Surface Dice (NSD) metric as previously proposed by the Google DeepMind team 22,23, with the tolerance threshold set to 0 (see Methods). At the end of the competition, the top-5 teams on the final private leaderboard won the performance prizes and the associated prize money (see Methods). The competition observed participation by 1,401 individual members divided across 1,149 teams from 78 countries. There were 204 members who participated for the first time in a Kaggle competition, five of whom were in the top-100 on the final private leaderboard. There were 32,391 unique code submissions, 500 public notebooks created, 756 discussion comments, and 200 discussion forum topics. There was also a separate Discord server created for the competition, where teams engaged in more informal discussions throughout the competition, although this was not monitored by the hosts/authors. Figure 2a graphs the changes in the leaderboard scores, forum activity, and number of participants during the three-month period of the competition. Figure 2b shows the number of submissions versus the highest private score for all teams.
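The NSD at tolerance 0 can be illustrated with a minimal sketch: it compares only the surface voxels of the prediction and the gold standard, rewarding boundary agreement rather than volumetric overlap. The implementation below is a simplified illustration using 6-connectivity surfaces, not the official evaluation code used in the competition.

```python
import numpy as np

def surface_voxels(mask):
    """Boolean array marking foreground voxels with at least one
    background face-neighbor (a simple 6-connectivity surface)."""
    mask = mask.astype(bool)
    pad = np.pad(mask, 1, constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    interior = mask.copy()
    for ax in range(mask.ndim):
        for s in (-1, 1):
            # voxel stays "interior" only if its neighbor along this
            # axis/direction is also foreground
            interior &= np.roll(pad, s, axis=ax)[core]
    return mask & ~interior

def nsd_tol0(pred, gt):
    """Normalized Surface Dice at tolerance 0: the fraction of surface
    voxels shared by prediction and gold standard surfaces."""
    sp, sg = surface_voxels(pred), surface_voxels(gt)
    denom = sp.sum() + sg.sum()
    return (2 * (sp & sg).sum()) / denom if denom else 1.0
```

A perfect prediction scores 1.0; any boundary mismatch reduces the score, which is why the metric is so sensitive to thin-vessel errors.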

Overview of winning solutions
In this competition, teams employed a variety of techniques to tackle the challenge, including customized U-Net architectures, different data augmentation techniques, and complex loss functions. Notably, methods integrating data augmentation and self-training paradigms emerged as effective solutions to the challenges posed by the small dataset. Teams utilized pseudo-labeling techniques, either refining pseudo labels from sparse to dense iteratively or creating labels for external datasets. Additionally, post-processing steps, such as eliminating disconnected vessels, also helped improve solutions.
The competition highlighted that the U-Net architecture 24 remains a strong base model for medical image segmentation. All of the top five teams used variants of U-Net (2D, 2.5D, or 3D) with different backbones, customizing the models to enhance performance while keeping the U-shaped structure unchanged. Among the advanced backbones, MaxViT-Large-512 25 stood out for its ability to handle large input sizes (such as 512×512 pixels), making it suitable for high-resolution HiP-CT images. By leveraging the dual capabilities of Convolutional Neural Networks 26 (CNNs) and Transformers 27, MaxViT-Large-512 enhanced U-Net's ability to accurately segment complex anatomical structures. Additionally, ensembling different U-Net architectures was popular in the competition, with three of the top-5 winning teams utilizing ensemble models. Individual models might overfit to specific patterns in the training data; ensembling reduces the risk of overfitting by averaging out biases from individual models, resulting in better generalization on unseen data. By combining the strengths of different models, an ensemble also provides more robust predictions, effectively handling anomalies and noise in the data. Larger sliding-window sizes generally performed better, as they removed more of the artifacts introduced at window boundaries by sliding-window inference.
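Sliding-window inference, referenced above, tiles the volume with overlapping patches and averages the overlapping predictions, which suppresses window-boundary artifacts. The sketch below is a simplified illustration; the `model` callable and the window/stride values are placeholders, not any team's actual configuration.

```python
import numpy as np

def sliding_window_predict(volume, model, window=64, stride=32):
    """Run `model` on overlapping cubic patches of a 3D volume and
    average the overlapping predictions (assumes each dimension of
    `volume` is at least `window` voxels)."""
    out = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)

    def starts(n):
        # clamp the final window so it ends exactly at the volume edge
        return sorted({min(s, n - window) for s in range(0, n, stride)})

    for z in starts(volume.shape[0]):
        for y in starts(volume.shape[1]):
            for x in starts(volume.shape[2]):
                sl = (slice(z, z + window),
                      slice(y, y + window),
                      slice(x, x + window))
                out[sl] += model(volume[sl])
                count[sl] += 1.0
    return out / count  # average where windows overlap
```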
Team 1 developed a tailored 2.5D U-Net 24 with ConvNeXt-tiny 28 as the encoder, using a combination of loss functions to directly optimize for NSD. Team 2 utilized a 3D U-Net 29, focusing on improving generalization through extensive data augmentation and morphological post-processing. Team 3 combined multiple U-Net-based architectures with CNN and Transformer models, refining sparse labels into dense ones for better segmentation performance. Team 4 combined 2D and 3D models, leveraging test-time augmentation (TTA) and pseudo-labeling to enhance accuracy. Finally, Team 5 ensembled three 2D U-Net models, integrating pseudo-labeling and boundary loss to sharpen boundary predictions. Through innovative data preprocessing, careful model selection and hyperparameter choice, and effective post-processing, these teams achieved high accuracy, securing the top five positions in the competition.
Furthermore, some key implementation details in the solutions proposed by the top-5 teams highlighted differences in their approaches. Team 1: defining a custom Marching Cube loss function, adding an extra stem block to the model to leverage high-resolution information, using 3D rotations, and generating more data by slicing along different axes. Team 2: generating more data using 3D affine transformations on training sets, and removing distantly placed disconnected chunks to reduce false positives. Team 3: refining pseudo labels from sparse to dense iteratively, emulating the test set magnification, and intensity augmentation. Team 4: percentile normalization of training data, pseudo labels on external data, Boundary Difference over Union loss for both 2D and 3D models, and sliding-window inference. Team 5: creating soft labels with pseudo labels on external data, combined with a composite loss that weights mask boundaries more heavily, and 3D interpolation to account for the different resolutions.
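Percentile normalization, as used by Team 4, clips outlier intensities and rescales the volume to a fixed range; the sketch below uses illustrative percentile cut-offs, not the team's reported values.

```python
import numpy as np

def percentile_normalize(vol, lo=1.0, hi=99.0):
    """Clip intensities to the [lo, hi] percentile range and rescale
    to [0, 1], making volumes with different intensity distributions
    comparable."""
    p_lo, p_hi = np.percentile(vol, [lo, hi])
    vol = np.clip(vol.astype(np.float32), p_lo, p_hi)
    return (vol - p_lo) / max(p_hi - p_lo, 1e-8)
```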
As discussed earlier, vessel segmentation is generally a difficult image processing challenge due to the multi-scale nature and small structure of vessels. With HiP-CT data of human organs, several additional challenges are present, all of which are compounded by a low tolerance for connectivity error in the final segmentations 19. Most of the teams attempted to overcome the issue of a small training dataset by employing different augmentation techniques and pseudo-labeling approaches. For example, Teams 1 and 2 significantly enhanced their scores with the use of 3D rotational augmentation. After implementing this technique, Team 1's private score increased from 0.682 to 0.835. Team 3's strategy of iterative pseudo-labeling enabled training of models on an expanded dataset, securing it the third rank in the competition.
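The 3D rotational augmentation used by Teams 1 and 2 can be illustrated with a lightweight stand-in that applies the same random right-angle rotation to an image and its label volume; the teams' actual pipelines used arbitrary-angle 3D rotations, so this is only a sketch of the principle.

```python
import numpy as np

def random_rot90_3d(image, label, rng):
    """Apply the same random right-angle 3D rotation to an image and
    its label mask, so the voxel-to-voxel correspondence is preserved."""
    # choose a random rotation plane and a random multiple of 90 degrees
    axes = [(0, 1), (0, 2), (1, 2)][int(rng.integers(3))]
    k = int(rng.integers(4))
    return np.rot90(image, k, axes).copy(), np.rot90(label, k, axes).copy()
```

Because the identical transform is applied to both arrays, any voxel-wise relation between image and label survives the augmentation.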
To tackle the issue of unbalanced classes, some teams introduced various boundary losses, which take the form of a distance metric based on contours rather than regions. This approach helped mitigate the challenges associated with highly unbalanced datasets by focusing on integrals over the interfaces between regions, rather than unbalanced integrals over the regions themselves. For instance, Team 1 experienced a notable improvement after integrating a combination of focal, dice, and boundary losses with a custom loss function (see Supplementary Note 1), elevating its private score from 0.534 to 0.682. Meanwhile, Team 4's use of BoundaryDOULoss 30 (see Supplementary Note 4) boosted its public LB score by 1.5% and its private LB score by 5%. To address the connectivity issue, Team 2 effectively used a post-processing step to enhance connectivity and eliminate false positives, removing small unconnected segments with a depth-first search. To mitigate the effect of variability in the imaging data, Team 3 emulated the test set magnification to obtain matching resolutions. Detailed technical information for each winning team's solution, including information for some other teams, is provided in Supplementary Notes 1-6.
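Team 2's connectivity post-processing can be sketched as a depth-first flood fill that discards small 6-connected foreground components; the `min_size` threshold below is a hypothetical parameter for illustration, not the team's reported value.

```python
import numpy as np

def remove_small_components(mask, min_size):
    """Keep only 6-connected foreground components with at least
    `min_size` voxels, removing small disconnected false positives."""
    mask = mask.astype(bool)
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in map(tuple, np.argwhere(mask)):
        if seen[start]:
            continue
        # iterative depth-first search from this unvisited voxel
        stack, comp = [start], []
        seen[start] = True
        while stack:
            v = stack.pop()
            comp.append(v)
            for d in offsets:
                n = (v[0] + d[0], v[1] + d[1], v[2] + d[2])
                if all(0 <= n[i] < mask.shape[i] for i in range(3)) \
                        and mask[n] and not seen[n]:
                    seen[n] = True
                    stack.append(n)
        if len(comp) >= min_size:
            for v in comp:
                out[v] = True
    return out
```

In practice a vectorized labeling routine (e.g. `scipy.ndimage.label`) would be used for whole-organ volumes; the explicit stack here just makes the depth-first traversal visible.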

Quantitative analysis of solutions
The top-winning team, Team 1, reached an NSD score of 0.774 on the final private LB and also ranked first on the public LB with a score of 0.898, followed by 0.755 for Team 2, 0.727 for Team 3, 0.712 for Team 4, and 0.691 for Team 5. The top-5 teams made a total of 784 out of 32,391 submissions (2.42% of total submissions). Supplementary Table 2 lists public and private LB scores for all five winning teams, including the baseline nnU-Net model's performance, and a brief summary of their solutions.
Comparing team rankings across the public and private LBs, Team 1's solution proved the most stable, ranking at the top on both. Team 4 and Team 5 were fairly stable, rising by one and five places, respectively, on the private LB compared to the public LB. Team 2's solution was the most unstable, jumping 1,022 places on the private LB, while Team 3 jumped 568 places. Team 2's solution scored only 0.04 NSD on the public LB and was highly overfitted to the private test set.
Additionally, the authors previously identified certain key metrics for the vessel segmentation problem after reviewing the existing literature 19. These additional metrics, namely the Dice Similarity Coefficient (Dice), Centerline Dice (clDice), NSD (tolerance=1), and Average Symmetric Surface Distance (ASSD), were computed for the predictions of the top-5 winning teams (see Methods for details on the metrics). All computed metrics for both test sets are provided in Supplementary Table 3 and visualized in Figure 2c. The difference between the NSD scores at tolerance 0 and tolerance 1 is larger for the private test set than for the public test set across all teams. For the private test set, Team 1 scores higher than the other four teams across all additional metrics, except for clDice, where Team 2 performs slightly better. While the performance of Team 3 is comparable to the other teams on the private test set, it performs worse on the public test set across multiple metrics, thereby highlighting the low generalizability of the solutions by Team 2 and Team 3.
A fundamental challenge in organizing competitions that must be scored on a single metric is that changing the metric can change both the leaderboard outcome and the capabilities of the resulting solutions 31,32. In light of this, it is valuable to examine the correlation between public and private LB rankings, as well as the effect of ranking by different metrics. Subsequently, Kendall's Tau 33 was computed to quantify the correlation between these rankings.
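Kendall's Tau measures agreement between two orderings by comparing concordant and discordant pairs of items; a minimal tie-free sketch (real leaderboard data would require a tie-aware variant such as tau-b):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's rank correlation between two equal-length rankings,
    assuming no ties: (concordant - discordant) / total pairs."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum((a[i] - a[j]) * (b[i] - b[j]) > 0 for i, j in pairs)
    # with no ties, discordant = total - concordant
    return (2 * concordant - len(pairs)) / len(pairs)
```

A value of 1.0 means the two leaderboards order the teams identically; -1.0 means they are exactly reversed.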

Qualitative analysis of winning solutions
Due to the low generalizability of the solutions by Team 2 and Team 3, the qualitative and morphological analysis focuses on Teams 1, 4, and 5. Figure 3 shows a Maximum Intensity Projection (MIP) of the gold standard labels for the private test set and the predicted labels from Teams 1, 4, and 5. Similar figures for Team 2 and Team 3 are presented in Supplementary Figure 1. Visual analysis across the three teams (Teams 1, 4, and 5) shows a high degree of missing vessel connectivity, particularly for the small vessels, as seen in the insets in Figure 3 (green arrows). This unconnected vessel morphology is particularly evident at the outer edge of the kidney, where the majority of vessels are thinner and the variations between the teams are more visually pronounced. Interestingly, Team 4's solution appears to have the fewest small unconnected components, favoring larger vessel detection, whereas Team 5's solution detects parts of the small vessels but they are highly disconnected. Team 5 also finds a large number of small unconnected components outside the boundary of the kidney itself (red inset in Figure 3). This is a feature that Team 4 was able to remove in post-processing through a simple but effective masking of the kidney outer edge, a strategy which dramatically reduces the number of false positive pixels. Team 1's solution visually shows better connection of the small vessels in comparison to Team 5's, potentially due to the extra stem in their model designed to capture these smaller vessels. Additionally, Team 1's solution appears to detect tubular structures that are not in the gold standard (orange arrowhead in Figure 3). Several of these structures, when compared back to the raw data, appear to be correctly identified vessels that were not found by the manual segmentation process. This highlights that manual segmentation of data, even when performed by multiple curators, can only ever be considered a gold standard (with some degree of acceptable data noise) rather than a complete ground truth. Finally, all teams appear to predict vessels that are thinner than the gold standard labels.
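A Maximum Intensity Projection, as used for the visual comparisons here, simply collapses the 3D volume along one axis by taking the per-ray maximum:

```python
import numpy as np

def mip(volume, axis=0):
    """Maximum Intensity Projection: collapse a 3D image or label
    volume to a 2D view by taking the maximum along one axis, so any
    bright (or labeled) voxel along the ray appears in the projection."""
    return np.asarray(volume).max(axis=axis)
```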
Figure 3. Maximum Intensity Projections (MIP) of the labels (private test set) for the gold standard, Team 1 predictions, Team 4 predictions, and Team 5 predictions, with two insets per team in the yellow and red squares, respectively. Green arrows show the same location in each prediction, highlighting the missing and unconnected vessels in Teams 1, 4, and 5. The orange arrow in the gold standard and Team 1 panels shows an instance of a vessel predicted by Team 1's model that is not in the gold standard but is, in fact, in the raw data.
As the test datasets, both public and private, were only portions of whole intact kidneys, we computed inference for each team's model on the remaining parts of the private test set (kidney 6) (Figure 4a) and the public test set (kidney 5) (Supplementary Figure 4). The outputs in Figure 4a serve to further demonstrate the differences between these models, particularly in the ability to capture larger vessels, and their generalizability. Team 1's solution, which appeared to find many of the smaller vessels in the test data, evidently misses large portions of the larger vessels in the whole private test data (kidney 6), and to a lesser extent in the public test data (kidney 5). The larger vessels are evidently more fully captured by Teams 3-5 in both the public and private test datasets.
The lack of generalizability of Team 2's solution is highly apparent when comparing its outputs for kidney 5 (Supplementary Figure 4) and kidney 6 (Figure 4a).

Morphological analysis of segmentations
While the quantitative segmentation metrics and qualitative inspection of the voxelized output provide insight into the performance of the different solutions, a comparative morphological approach aligned with the eventual intended use of the segmentations is highly relevant to assessing model biases and outputs. Given that our ultimate aim is to use the vascular segmentations to (1) model exosome transport and (2) extract vessel morphology metrics to create a VCCF, the segmented vascular networks will be skeletonized. Skeletonization reduces the 3D voxel representation to a skeleton representation described by segments and nodes, with each segment defined by start and end nodes and having a length, radius, and connectivity to other segments (see Methods for further details). Here, we skeletonize each of the predicted segmentation outputs to investigate how differences in segmentation from the various models lead to variation in the extracted vascular network geometry.
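The terminal- and branch-node counts used in the morphological analysis can be derived from a skeleton by counting each skeleton voxel's 26-connected skeleton neighbors: one neighbor marks a terminal node, more than two marks a branch node. The sketch below assumes a pre-computed binary skeleton and is an illustration of the classification idea, not the analysis pipeline used in the paper.

```python
import numpy as np

def classify_skeleton_nodes(skel):
    """Classify voxels of a binary 3D skeleton by their number of
    26-connected skeleton neighbors: terminal nodes have exactly one
    neighbor, branch nodes have more than two."""
    skel = skel.astype(bool)
    pad = np.pad(skel, 1, constant_values=False)
    nbrs = np.zeros(skel.shape, dtype=int)
    for dz in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dz == dy == dx == 0:
                    continue
                # shifted view of the padded skeleton: neighbor values
                nbrs += pad[1 + dz:pad.shape[0] - 1 + dz,
                            1 + dy:pad.shape[1] - 1 + dy,
                            1 + dx:pad.shape[2] - 1 + dx]
    terminals = skel & (nbrs == 1)
    branches = skel & (nbrs > 2)
    return terminals, branches
```

A fragmented segmentation produces many short skeleton pieces, each contributing two terminal nodes, which is why a high terminal-to-branched ratio signals broken connectivity.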
Figure 4b shows the spatial variations in the mean radius and subgraphs (unconnected portions of the network) between model outputs. While the number of subgraphs varies between teams, Teams 1, 4, and 5 find the same larger vessel trees. Variations are mostly restricted to the smaller disconnected vessels at the kidney periphery, which was also noted in the qualitative analysis. Similarly, while the spatial distribution of radii is similar, the absolute values vary between each solution and the gold standard. Supplementary Figure 2 shows the same visualizations for Team 2 and Team 3. Supplementary Figure 3 shows the skeleton forms of the vessel networks for all teams on the public test data.
Figure 5 highlights these morphological differences between the skeletonized network geometries for the private test data (similar plots for the public test data are provided in Supplementary Figure 4). For all teams, the number of subgraphs is higher than in the gold standard, indicating many short unconnected segments that form partial paths of the vessels (Figure 5a). The shorter mean segment length (Figure 5c) supports this conclusion, as does the proportion of terminal nodes to branched nodes, which is higher than the gold standard for all teams other than Team 2. The mean radius (Figure 5b) has large variability, highlighting the challenge of accurately determining vessel thickness. In all cases, the mean radius is lower than in the gold standard, which also appeared to be the case in the qualitative analysis. The thinner vessels are also reflected by the much higher NSD score when a 1-voxel tolerance is allowed (see Figure 2). For tortuosity and branching angle, values closer to the gold standard and baseline models indicate better modeling of natural vessel shapes. As tortuosity is associated with length, it is interesting to note that while Team 4 has a lower vessel length than Teams 3 and 5, it has a higher tortuosity, indicating that many short but highly curved vessels are present in this case.
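Arc-chord tortuosity, the measure discussed above, is the centerline path length of a segment divided by the straight-line distance between its end nodes; a minimal sketch, assuming an open, ordered list of centerline points:

```python
import numpy as np

def tortuosity(path):
    """Arc-chord tortuosity of a vessel segment: centerline path
    length divided by the straight-line (chord) distance between its
    end nodes. A perfectly straight segment scores 1.0; assumes an
    open path whose endpoints do not coincide."""
    path = np.asarray(path, dtype=float)
    arc = np.linalg.norm(np.diff(path, axis=0), axis=1).sum()
    chord = np.linalg.norm(path[-1] - path[0])
    return arc / chord
```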
Team 1 has the highest number of segments, proportion of terminal to branched nodes, and number of subgraphs, indicating the identification of many vessels but with a highly fragmented segmentation. Team 1 also has the second lowest mean radius, indicating either under-segmentation of the vessel lumen or an increase in small vessels that may not be present in the gold standard.
Team 2 has the smallest number of subgraphs (247), comparable to that of the gold standard (199). It also has the most similar number of segments and proportion of terminal-to-branched nodes compared to the gold standard. This indicates better continuity in vessel segmentation, which can be attributed to the solution's post-processing steps and is also reflected in the team's relatively higher clDice score. Team 2's lower radius and length might also suggest under-segmentation or more conservative thresholding.
Team 3 serves as the midpoint of the teams, having the third highest number of subgraphs behind Team 1 and Team 5, with a similar proportion of terminal-to-branched nodes as Team 5. Compared to Team 5, Team 3 has a lower mean radius and length, suggesting a more conservative thresholding strategy leading to thinner radii and slightly better predictions for smaller unconnected vessels.

Team 4's lower vessel length, coupled with its relatively higher mean radius, lower number of subgraphs, and lower proportion of terminal-to-branched nodes, indicates a model targeted at the larger well-connected structures in the network rather than the smaller peripheral vessels, and reflects the masking of small false positive structures in the background outside the kidney.
Team 5 has the largest average radius and length, the third highest number of subgraphs, and the second highest proportion of terminal-to-branched nodes. This implies that while vessel thickness may have been more accurately predicted (from radius and length), many small unconnected structures were still identified, leading to the high subgraph count and high proportion of terminal nodes. The team's lower position on the LB suggests these small unconnected structures were not smaller unconnected vessel fragments but likely came from noise predicted outside the kidney, as seen in Figure 3 and Figure 4.

Discussion
Vasculature segmentation is a highly complex and challenging task in medical image segmentation due to the low proportion of vessel voxels, complex structure, and variable nature of human vasculature across different individuals. Citizen science enables experts and hobbyists alike, from industry, academia, and government globally, to engage in open and collaborative experimentation and development of solutions. Furthermore, all data and code are openly shared, serving as a benchmark for future algorithm performance evaluations and comparisons on HiP-CT data.
In this competition, participants employed various techniques to tackle the kidney vessel segmentation task. Notably, methods integrating data augmentation and self-training paradigms emerged as effective solutions to the challenges posed by the small dataset. In addition to utilizing data augmentation, normalization, and pseudo-labeling techniques, implementing post-processing steps to eliminate disconnected vessels proved essential for developing generalized solutions. Furthermore, some teams mitigated the challenges associated with highly unbalanced datasets by using custom loss functions, such as the marching cube loss. The competition further highlights that the U-Net architecture 24 remains a strong base model for medical image segmentation, since all of the top five teams used variants of U-Net, customizing the ML models to enhance performance.
While this competition has generated interesting solutions for a fairly new imaging modality, several known limitations persist: (1) the models were trained on a relatively small dataset, which increases the risk of overfitting; (2) accurately segmenting both the very fine microvascular network and the large vessels remains a challenge; (3) the competition relied on specific evaluation metrics that may not fully capture the clinical relevance or quality of the segmentation, potentially overlooking important aspects of model performance in real-world applications; and (4) some models are computationally expensive and might be impractical or inefficient to deploy at scale.
Linking qualitative and quantitative interpretation of the teams' outputs enables a more comprehensive understanding of ML model outputs and biases. It also allows assessment of the solutions in ways that are highly relevant to the potential downstream uses of such data. Based on the morphological analysis, a key challenge most teams faced was the high number of unconnected components and the lower proportion of branched-to-terminal nodes compared to the gold standard. This has important implications, particularly for modeling blood flow or exosome transport, as every terminal node requires a prescribed boundary condition, making model output highly dependent on the assigned boundary conditions rather than on the connectivity of the network, as previously shown 18. Simple post-processing strategies that remove unconnected components, such as that applied by Team 2, or more sophisticated approaches that seek to join fragmented vessel sections, would be an important next step in utilizing the output of these models.
In the future, the authors would like to further generalize the winning solutions and train on other 3D HiP-CT datasets to explore the vasculature trees, exosome flow patterns, and morphological differences in organs other than the kidney.

Kaggle platform
The private leaderboard for identifying the five prize winners was finalized on February 8, 2024. The prize winners were awarded cash prizes (US$25,000 for first place; US$20,000 for second place; US$15,000 for third place; US$5,000 for fourth place; US$5,000 for fifth place). The teams submitted their inference code, after training their ML models, on the Submission portal. The submitted code was then run over the public test set to rank the teams on the public leaderboard.
The teams typically use this score to validate their models. With each submission, the code is also run on the private test set for preliminary rankings on the private leaderboard (not visible to teams). Teams can make an unlimited number of submissions before the competition deadline but are limited to five submissions per day. At the end of the competition, teams can choose up to two solutions as their final submissions, which are then scored on the private test set to rank the teams on the final private leaderboard. All scoring is done using the normalized surface dice as the evaluation metric, and the top-5 teams on the final private leaderboard are selected as winners.

Hierarchical Phase-Contrast Tomography data acquisition
HiP-CT is a propagation phase-contrast local tomography technique, as described by Walsh et al. 2 Data acquisition follows the sample preparation and scanning protocols outlined in prior work 2,34 . Briefly, organs are fixed, partially dehydrated, and stabilized for tomographic scanning using an agar-ethanol mixture. HiP-CT scans are performed on one of two beamlines, BM05 or BM18, at the European Synchrotron Radiation Facility (ESRF). The scans are conducted hierarchically: the entire organ is first scanned at ca. 25 µm per voxel, followed by local tomography at specified locations within the intact organ at ca. 6.5 µm per voxel and ca. 1.3-2.5 µm per voxel.
Tomography scans are reconstructed into image volumes using a filtered back projection algorithm as detailed in prior work 34 . The final volumes consist of isotropic 3D image datasets, where higher resolution volumes of interest are registered within the larger organ volume using a rigid transformation. The contrast within these images arises from interference patterns caused by refraction of X-rays as they pass through the samples. Refraction is caused by physical density differences within the sample, and thus the modality particularly highlights edges between tissues with different densities.
Gold Standard Label Acquisition. Five human kidneys were used to create the competition dataset. Three kidneys (Kidney 1, Kidney 2, and Kidney 3) were used for the training set. Two kidneys (Kidney 5 and Kidney 6) were used for the test set: Kidney 5 as the public test set and Kidney 6 as the private test set. Except for Kidney 2, all were collected from donors who had consented to body donation to the Laboratoire d'Anatomie des Alpes Françaises prior to death. Kidney 2 was obtained after a clinical autopsy at the Hannover Institute of Pathology at Medizinische Hochschule, Hannover (Ethics vote no. 9621 BO K 2021). The transportation and imaging protocols received approval from the French Health Ministry. The basic scan parameters and demographic information of subjects can be found in Supplementary Table 1.
The segmentation of the five kidneys was conducted using Amira Version 2021.1. Initially, the raw image data underwent average binning, either x2 or x4. In x2 binning, the resolution was reduced from approximately 25 µm to 50 µm voxel size. In x4 binning, the resolution was reduced from 15.8 µm to 63.1 µm voxel size.
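The average binning step can be sketched in NumPy. This is a minimal illustration of block-mean downsampling, not the Amira implementation used by the authors:

```python
import numpy as np

def average_bin(volume: np.ndarray, factor: int) -> np.ndarray:
    """Average-bin a 3D volume by an integer factor along each axis."""
    # Crop to a multiple of the binning factor, then average factor-sized blocks.
    z, y, x = (d - d % factor for d in volume.shape)
    v = volume[:z, :y, :x]
    return v.reshape(z // factor, factor,
                     y // factor, factor,
                     x // factor, factor).mean(axis=(1, 3, 5))

# x2 binning halves each dimension, e.g. taking a ~25 µm grid to ~50 µm voxels.
vol = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
binned = average_bin(vol, 2)
print(binned.shape)  # (2, 2, 2)
```

Each output voxel is the mean of a factor³ block of input voxels, which reduces noise as well as resolution.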
A 3D median filter was then applied using Amira-Avizo v2021.1 with 3 iterations and a 26-voxel neighborhood. To enhance the appearance of vessels, background detection correction was performed (Amira v2021.1; default settings). The segmentation process involved semi-manual techniques using Amira v2021.1's magic wand tool, an interactive 3D region-growing tool. Curators selected a seed voxel within a vessel slice and refined parameters such as intensity threshold, contrast threshold, and hand-drawn limits to determine the stopping criteria for the 3D region growing. In cases where vessels were infilled with blood or largely collapsed, a manual voxel paintbrush tool was used to correct or fill the vessel in every slice. Further details and a supplementary video of the semi-automated segmentation protocol are outlined in prior work 3 .
To ensure segmentation quality and provide error estimates on the gold standard, an expert segmentation validation process was implemented. An initial curator applied the above procedure to a dataset using three orthogonal views. An independent curator re-labeled the data, adding to the labels of the initial curator, again using the three orthogonal planes. A third curator (referred to as the proof-reader) was presented with 5-7 randomized 2D sections of the data in any one of the three orthogonal planes. They counted the number of vessel cross-sections in the slice and recorded the number of true positive and false negative vessel cross-sections that were segmented. Note that a true positive means that curators 1 and 2 have visually correctly labeled the vessel but does not assess the quality of the segmentation on a pixel-by-pixel basis. The ratio of true positives to true positives plus false negatives is used as the estimate of the proportion of vessels correctly segmented. The proof-reader then returns the data to the two curators, with the estimated segmentation proportion for each 2D area. Curators 1 and 2 then focus on 3D segmentation in areas where the proof-reading has highlighted the lowest vessel segmentation proportion. The process repeats iteratively until the proof-reader does not find any false negatives 3 . Note that false positives are not considered in the labeling/proofreading process as these are easily detected owing to their lack of connectivity to the main vessel tree. All data is made publicly available; see Data Availability.

Evaluation metrics
The submitted solutions were ranked in the competition using Normalized Surface Dice (NSD) with tolerance (t) set to 0. NSD determines what fraction of the segmentation boundary is predicted correctly. A boundary element is considered correctly predicted if its closest distance to the reference boundary is smaller than or equal to the specified threshold (tolerance). The tolerance determines the acceptable amount of deviation in pixels. The value of NSD lies between 0 and 1. The specific implementation 22,23 proposed by Google DeepMind is used in the competition, with an optimized version available as a notebook on the Kaggle platform at https://www.kaggle.com/code/metric/surface-dice-metric. To further evaluate the winning solutions and their performance post competition, four auxiliary metrics were chosen based on prior work by the authors 19 : Dice Similarity Coefficient (Dice) 35,36 , Centerline Dice (clDice) 37,38 , Average Symmetric Surface Distance (ASSD) 22,39 , and NSD (with tolerance=1). ASSD was computed using the MONAI library 40,41 v1.3.0.
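The core idea of NSD, that boundary elements count as correct when they lie within the tolerance of the other mask's boundary, can be sketched at the voxel level with SciPy distance transforms. The official DeepMind implementation differs in detail (it works on surface elements weighted by area), so this is illustrative only:

```python
import numpy as np
from scipy import ndimage

def surface_dice(pred: np.ndarray, gt: np.ndarray, tol: float = 0.0) -> float:
    """Simplified surface dice: fraction of boundary voxels of each mask
    lying within `tol` voxels of the other mask's boundary."""
    def boundary(mask):
        # Boundary = mask voxels removed by a one-voxel erosion.
        return mask & ~ndimage.binary_erosion(mask)

    bp, bg = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    if bp.sum() + bg.sum() == 0:
        return 1.0
    # Distance from every voxel to the nearest boundary voxel of the other mask.
    dist_to_g = ndimage.distance_transform_edt(~bg)
    dist_to_p = ndimage.distance_transform_edt(~bp)
    ok = (dist_to_g[bp] <= tol).sum() + (dist_to_p[bg] <= tol).sum()
    return float(ok) / float(bp.sum() + bg.sum())

# Identical masks score 1.0 at any tolerance, including t = 0.
m = np.zeros((8, 8, 8), bool)
m[2:6, 2:6, 2:6] = True
print(surface_dice(m, m, tol=0))  # 1.0
```

With tolerance 0, as used for the competition ranking, only exactly coincident boundary voxels count; a larger tolerance forgives small localization errors.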

Morphological analysis code
Vasculature in segmentation masks is reduced to a spatial graph representation of the network through skeletonization. The resulting spatial graph describes the vessel network in terms of four entities: nodes, points, segments, and sub-segments. A segment is defined as lying between a start node i and an end node j, which correspond to either a branching point leading into another segment branch or a terminal end where no further branches were detectable. Nodes have an ID and a 3D spatial position (x, y, z). Between the start node i and terminal node j of each segment lie sub-segments, with points marking the start and end of each sub-segment, able to capture the curvature of the segment. Each sub-segment has an associated radius and length.
To create a spatial graph from a segmented image, we utilize an implementation of the parallelized version of the distance-ordered medial thinning algorithm 18,42 in Amira v2021.2, termed the "Autoskeleton" plugin. The radius of each sub-segment is estimated using 1/5th of the maximum Chamfer distance, and the parameters applied are smooth = 0.5, attach_to_data = 0.25, and iterations = 10. We extract morphological parameters from the spatial graph following the definitions from prior work 18 . See Code Availability.
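Given the ordered points of one segment, morphological parameters such as length and tortuosity (path length over straight-line chord length) follow directly from the point coordinates. A minimal NumPy sketch, independent of the Amira Autoskeleton plugin:

```python
import numpy as np

def segment_metrics(points: np.ndarray):
    """Length and tortuosity of one vessel segment, given its ordered
    3D points running from the start node to the end node."""
    steps = np.diff(points, axis=0)                  # sub-segment vectors
    path_len = np.linalg.norm(steps, axis=1).sum()   # sum of sub-segment lengths
    chord = np.linalg.norm(points[-1] - points[0])   # straight-line distance
    tortuosity = path_len / chord if chord > 0 else np.inf
    return path_len, tortuosity

# A perfectly straight segment has tortuosity 1.
pts = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], float)
print(segment_metrics(pts))  # (2.0, 1.0)
```

Curved segments give tortuosity above 1, which is one of the per-segment distributions compared across teams in the morphological analysis.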

Participation analysis
At the conclusion of the competition, participation metadata becomes publicly available on Meta Kaggle 43 - Kaggle's public data on competitions, users, submission scores, and kernels. This data is used to understand how the competition unfolded over its 3-month period. Analysis is implemented using standard Python packages for data science such as Pandas, NumPy, Matplotlib, and Seaborn, creating all visualizations in Jupyter Notebooks; see Code Availability.

Statistical analysis
Kendall's Tau [31][32][33] (also called Kendall's Rank Correlation) is used to quantify the agreement between two rankings and is independent of the number of entities ranked. Tau values closer to 1 mean a strong positive correlation between the two rankings (a value of 1 means perfect alignment), whereas values closer to -1 mean a strong disagreement. A p-value associated with the tau value indicates the statistical significance of the correlation; lower p-values (closer to 0) indicate higher significance of the relationship between the two rankings, such that it is unlikely to have occurred by chance. Kendall's tau is computed using the implementation in the Python SciPy 44 library.
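A minimal example of the SciPy computation, with hypothetical rankings for five teams (e.g. a competition ranking against a re-ranking by an auxiliary metric):

```python
from scipy.stats import kendalltau

competition_rank = [1, 2, 3, 4, 5]
auxiliary_rank = [1, 3, 2, 4, 5]   # teams in places 2 and 3 swap

tau, p_value = kendalltau(competition_rank, auxiliary_rank)
# One discordant pair out of 10 possible pairs: tau = (9 - 1) / 10 = 0.8
print(round(tau, 2))  # 0.8
```

Tau near 1 here would indicate that re-ranking the teams by an auxiliary metric largely preserves the competition ordering.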
Training and Inference Details: This team trained the proposed model using a batch size of 32 and a gradient accumulation process 7 . AdamW 8 with additional warm-up cosine annealing 9 was used to optimize the model. The model was trained for 30 epochs.
In the testing process, all the slices were converted and resized to 3072 × 3072. The torch compile algorithm 7 was utilized to accelerate the inference process. Moreover, the data were processed across three axes with large inference sizes and test-time augmentation.

Supplementary Note 2: Team 2 solution details

Summary: Team 2 used a customized 3D U-Net 10 model for segmenting the kidney vessels. To overcome the generalizability problem, data augmentation techniques, including random positions and rotations, were used. Eventually, morphological post-processing techniques were used to remove small false positives and improve the model performance.

Model:
The general architecture of the proposed neural network is based on 3D U-Net 10 . However, to overcome the problem of gradient vanishing, simple convolution layers were replaced by ResNet blocks 3 in the encoder part of the model. To decrease the computational cost, all traditional up-convolutional layers were replaced by up-sampling operations. Online data augmentation techniques, specifically affine transformations such as rotation, scaling, and shearing, were employed to prevent overfitting. This process yielded a substantial increase in the number of patches, reaching over 7,000 for every epoch.
For the post-processing, the depth-first search algorithm 11 was used to identify the unconnected, distantly placed chunks and remove them to increase the model's overall performance.
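Team 2 implemented this clean-up with a depth-first search; an equivalent removal of disconnected chunks can be sketched with SciPy's connected-component labelling, which achieves the same effect:

```python
import numpy as np
from scipy import ndimage

def keep_largest_component(mask: np.ndarray) -> np.ndarray:
    """Remove all but the largest connected component of a binary mask.
    Same outcome as a depth-first-search clean-up of unconnected chunks."""
    # 26-connectivity in 3D, so diagonally touching vessel voxels stay joined.
    structure = np.ones((3, 3, 3), int)
    labels, n = ndimage.label(mask, structure=structure)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    largest = 1 + int(np.argmax(sizes))
    return labels == largest

mask = np.zeros((10, 10, 10), bool)
mask[1:6, 1:3, 1:3] = True   # main "vessel" (20 voxels)
mask[8, 8, 8] = True         # isolated false positive
cleaned = keep_largest_component(mask)
print(int(cleaned.sum()))  # 20: the isolated voxel is gone
```

Keeping only the largest component suits a single connected vessel tree; if several true components exist, a size threshold can replace the argmax.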
Training and Inference Details: This team trained the proposed model with a batch size of 4 and a patch size of 128 × 128 × 32. For optimization, they used Focal loss 4 and cosine decay scheduling 9 . The training lasted two weeks, and the maximum number of epochs was 700. A threshold of 0.25 was used for the inference to extract binary masks for vessels.

Pseudo-labelling:
Considering that data is important to train an effective deep neural network, the team incorporated two additional datasets, the LADAF-2020-31 kidney 20 and the LADAF-2020-27 spleen 21 from the Human Organ Atlas (https://human-organ-atlas.esrf.eu/), into the training. A 4-step training scheme was applied to gradually generate pseudo-labels for the new datasets and incorporate them into training: 1) the team trained a base model on kidney_1 and pseudo-labeled kidney_2; 2) kidney_2 was incorporated into the training data together with kidney_1 to re-train the model, which was then used to pseudo-label the LADAF-2020-31 kidney; 3) similarly, they pseudo-labeled the LADAF-2020-27 spleen dataset with the model re-trained on kidney_1, kidney_2, and the LADAF-2020-31 kidney; 4) finally, the model was trained on all four datasets. The 4-step training scheme enabled generating pseudo-labels for the new datasets. However, pseudo-labels can introduce training noise with false positives, which are difficult to completely remove in this scheme without post-processing. To overcome this problem, the team used soft labels, i.e., the probabilities after the sigmoid function, as pseudo-labels instead of hard labels.
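The difference between hard and soft pseudo-labels can be illustrated in NumPy (the logit values here are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Model logits on voxels of an unlabeled volume.
logits = np.array([-4.0, -0.5, 0.5, 4.0])

hard_labels = (sigmoid(logits) > 0.5).astype(float)  # 0/1 after thresholding
soft_labels = sigmoid(logits)                        # probabilities kept as-is

print(hard_labels)               # [0. 0. 1. 1.]
print(np.round(soft_labels, 2))  # [0.02 0.38 0.62 0.98]
```

The soft labels preserve the model's uncertainty near the decision boundary (0.38 vs. 0.62), so a confidently wrong hard 0/1 pseudo-label contributes less noise when re-training on the pseudo-labeled data.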

Loss:
As surface dice is one of the metrics in this competition, evaluating the predicted boundaries of the vasculature, the team applied a boundary-weighted binary cross-entropy loss:

L = -(1/n) Σ_{i=1}^{n} (1 + α b_i) [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]

where n is the size of a mini-batch, α is a boundary weight, and b_i is a binary value indicating whether pixel i is on the boundary; the bracketed term is the typical binary cross-entropy. The team set α to 0.9 to roughly double the loss on the boundary pixels. Apart from the boundary-weighted binary cross-entropy loss, the overall loss function added dice loss 5 and focal loss 4 .
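One plausible NumPy reading of such a boundary-weighted binary cross-entropy, with hypothetical probabilities and boundary indicators:

```python
import numpy as np

def boundary_weighted_bce(p, y, b, alpha=0.9):
    """Binary cross-entropy with each pixel's term scaled by (1 + alpha * b),
    where b is 1 on boundary pixels; alpha = 0.9 makes boundary pixels
    contribute roughly double (factor 1.9)."""
    eps = 1e-7  # numerical guard for log(0)
    bce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return np.mean((1 + alpha * b) * bce)

p = np.array([0.9, 0.8, 0.2, 0.1])  # predicted probabilities
y = np.array([1.0, 1.0, 0.0, 0.0])  # binary targets
b = np.array([0.0, 1.0, 1.0, 0.0])  # middle two pixels lie on the boundary

print(boundary_weighted_bce(p, y, b) > boundary_weighted_bce(p, y, 0 * b))
# True: boundary pixels are up-weighted relative to plain BCE
```

This is a sketch of the weighting scheme described above, not the team's exact implementation.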

Models:
Four different 2D backbone networks, effnet_v2_s, effnet_v2_m, maxvit_base, and dpn68, were ensembled for the final submission. They are implemented by a Python package, Segmentation Models PyTorch (SMP) 22 . The input was cropped from the xy, xz, and yz axes at a size of 512 × 512, followed by pre-processing methods such as 2D rotation, intensity contrast changes, and horizontal and vertical flips. After training, the inference was done with a sliding window method with an overlap size of half the input size.
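The sliding-window tiling with half-window overlap can be sketched as follows; this is a generic illustration, not the team's code, and window positions are clamped at the image border:

```python
import numpy as np

def sliding_windows(shape, win, overlap):
    """Top-left corners of 2D sliding windows with a fixed overlap,
    with the last window shifted so it ends exactly at the image border."""
    step = win - overlap

    def starts(size):
        return sorted({min(s, size - win) for s in range(0, size, step)})

    return [(y, x) for y in starts(shape[0]) for x in starts(shape[1])]

# 512-pixel windows with half-window (256 px) overlap on a 1024 x 1024 slice.
corners = sliding_windows((1024, 1024), 512, 256)
print(len(corners))  # 9 windows (3 positions per axis)
```

At inference, predictions from overlapping windows are typically averaged in the overlap regions to suppress tile-border artifacts.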
Inference: Team 5's solution highlighted that 3D interpolation to make the voxel sizes of the training data and test data the same is important when performing vasculature segmentation inference on HiP-CT data. In this competition, the training data of kidney 1 to kidney 3 are ~50 µm/voxel; however, the test dataset of kidney 6 is ~63 µm/voxel. Team 5 applied a rescaling operation through 3 axes even though training was performed on 2D slices. The rescaling was implemented by trilinear interpolation, and the rescaled data resulted in better performance.
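Such a rescaling can be sketched with SciPy, where order=1 gives trilinear interpolation; the volume and exact factor here are illustrative:

```python
import numpy as np
from scipy.ndimage import zoom

# Rescale a test volume from ~63 µm/voxel to the ~50 µm/voxel grid the
# models were trained on: the volume is upsampled along all three axes.
train_voxel_um, test_voxel_um = 50.0, 63.0
factor = test_voxel_um / train_voxel_um   # 1.26 per axis

volume = np.random.rand(40, 40, 40).astype(np.float32)
rescaled = zoom(volume, zoom=factor, order=1)  # order=1 -> trilinear
print(rescaled.shape)  # (50, 50, 50)
```

Predictions made on the rescaled grid would then be mapped back to the native test-volume resolution with the inverse factor.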

Figure 1 .
Figure 1. An overview of the competition setup and process, including 3D renderings of the gold standard labels and representative 2D slices for all data in the competition.

Figure 2 .
Figure 2. a. Number of teams and messages, and leaderboard high scores per day over the competition period. b. Number of submissions vs. highest private leaderboard score for each of the 1,149 teams as a heatscatter. c. Scores for the competition metric and additional metrics for the top-5 teams on both test sets. ASSD is normalized between 0-1 for visualization (lower is better).

Figure 4 .
Figure 4. a. Visualization of the 3D output for inference of each team on the whole intact kidney datasets for the private test data (Kidney 6). Gold standard (red) shows the part of the whole kidney that was fully labeled and was part of the original competition dataset. b. Visualization of the skeleton forms of the vessel network for the gold standard and teams 1, 4 and 5 for the private test data (Kidney 6). In each case, each vessel segment is rendered with the mean radius. In the first

Figure 5 .
Figure 5. a. Bar chart showing the number of subgraphs, nodes, segments, terminal nodes, and branched nodes for all team solutions as well as the baseline model predictions and gold standard (GS) labels for private test data. b-e. Plots showing the radius, length, tortuosity of segments, and the branching angle between segments; mean and 95% Confidence Interval (CI) are shown for all metrics.

Supplementary Figure 5
Supplementary Figure 5. a. Bar chart showing the number of subgraphs, nodes, segments, terminal nodes, and branched nodes for all team solutions as well as the baseline model predictions and gold standard (GS) labels for public test data. b-e. Plots showing the radius, length, tortuosity of segments, and the branching angle between segments; mean and 95% Confidence Interval (CI) are shown for all metrics.

Table 1
provides an overview of the donor demographics, image and scanning metadata, and gold standard label metadata associated with each dataset (full scanning and donor medical metadata can be found at https://human-organ-atlas.esrf.eu; DOI and link for each dataset are provided in Supplementary

Table 1 .
Competition dataset details, including metadata and donor demographics.

Table 2 .
Scores for competition metric for top-5 teams and baseline model, including a brief summary.