Robust and generalizable segmentation of human functional tissue units

The Human BioMolecular Atlas Program aims to compile a reference atlas for the healthy human adult body at the cellular level. Functional tissue units (FTU, e.g., renal glomeruli and colonic crypts) are of pathobiological significance and relevant for modeling and understanding disease progression. Yet, annotation of FTUs is time consuming and expensive when done manually and existing algorithms achieve low accuracy and do not generalize well. This paper compares the five winning algorithms from the “Hacking the Kidney” Kaggle competition to which more than a thousand teams from sixty countries contributed. We compare the accuracy and performance of the algorithms on a large-scale renal glomerulus Periodic acid-Schiff stain dataset and their generalizability to a colonic crypts hematoxylin and eosin stain dataset. Results help to characterize how the number of FTUs per unit area differs in relationship to their position in kidney and colon with respect to age, sex, body mass index (BMI), and other clinical data and are relevant for advancing pathology, anatomy, and surgery.


Introduction
The Human BioMolecular Atlas Program (HuBMAP) aims to create an open, computable human reference atlas (HRA) at the cellular level 1 . The envisioned HRA will make it possible to register and explore human tissue data across scales, from the whole-body macro-anatomy level to the single-cell level. Medically and pathologically relevant functional tissue units (FTUs) are seen as important for bridging the meter-level scale of the whole body to the micrometer scale of single cells. Functional tissue units are defined as a three-dimensional block of cells centered around a capillary where each cell is within diffusion distance from any other cell within the same block.
Figure 1. FTU Datasets. a. The 30 kidney and 7 colon tissue datasets were registered into the corresponding male/female, left/right HuBMAP 3D reference organs for kidney and colon to capture the size, position, and rotation of tissue blocks. b. Sample kidney WSI (scale bar: 2mm) with zoom into one glomerulus annotation (scale bar: 50µm). Right below is a sample colon WSI (scale bar: 500µm) with zoom into a single crypt annotation (scale bar: 20µm). c. Metadata for 37 WSI sorted top-down by vertical location within the reference organs; test datasets are given in bold.
The Kaggle dataset was split into a 15 WSI training and 5 WSI validation dataset; both were available to competition participants. The 10 WSI private test dataset was used for scoring algorithm performance, see competition design in Methods. Analogously, the colon dataset was split into five WSI used in training and two WSI used for testing. All test datasets are rendered in bold in Fig. 1c.

Algorithm Comparison
The top-5 winning algorithms from the "Hacking the Kidney" Kaggle competition are from teams named Tom, Gleb, Whats goin on, DeepLive.exe, and Deepflash2. All five use the UNet architecture (see algorithm descriptions in the Methods section). Performance results are shown in Fig. 2 using violin plots for three metrics: DICE, recall, and precision (see details in the Methods section; data values are in Supplementary Table 5 and an interactive data visualization is at https://cns-iu.github.io/ccf-research-kaggle-2021). For each of the five algorithms, we report the DICE coefficient in Fig. 2a, recall in Fig. 2b, and precision in Fig. 2c. For each metric, we show the distribution for the ten kidney WSI with 2,038 glomeruli on the left and the distribution of transfer learning predictions for the two colon WSI with 160 crypts on the right, so that performance on kidney vs. colon data can be easily compared. As expected, all five algorithms have a higher DICE coefficient for kidney data than for transfer learning on colon data. Tom, the Kaggle competition performance winner, has the highest mean DICE score of 0.88 for transfer learning on colon data. As for recall, Tom again has the highest value with 0.92 (9 false negatives and 17 false positives out of 160 crypts). In terms of precision, DeepLive.exe wins with 0.86 (8 false negatives and 19 false positives). The data in Supplementary Table 5 also shows that all five algorithms have the lowest DICE scores on WSI 7 and 28. WSI 7 has a low number of glomeruli, only 51, so any false positive/negative prediction has a major impact on the DICE coefficient. WSI 28 has several artifacts and overall lower quality (higher saturation and darker) than other kidney WSIs. The crypt segmentation solution for Tom in comparison with ground truth for colon data can be explored at https://cns-iu.github.io/ccf-research-kaggle-2021.
Figure 2. Violin plots show performance for kidney data on the left (one dot for each of the 2,038 glomeruli) and transfer learning performance for colon data on the right (one dot for each of the 160 crypts). a. DICE coefficient. b. Recall performance. c. Precision performance. Interactive versions of these graphs are at https://cns-iu.github.io/ccf-research-kaggle-2021.
Run time performance was recorded for the training phase on kidney data, colon data exclusively (no transfer), and on kidney data and colon data, see Table 1. We also report run time for the two prediction tasks: from scratch without transfer learning (i.e., trained on five colon, tested on two colon datasets) and transfer learning (i.e., trained on 15 kidney datasets initially and then trained on five colon datasets, then tested on two colon datasets), see Methods section for details. As can be seen, training takes time (three hours to 48 hours) while prediction runs are fast (3-30h for kidney and 2mins for colon). Total algorithm run time (training on kidney, then colon plus inference on colon) is lowest for Deepflash2 (6h), followed by Gleb (12h) and Tom (16h). Whats goin on and Deepflash2 are fast in kidney prediction (30mins). Note that one of the winning teams (Deepflash2) had access to a biomedical expert as a teammate and two of the teams (Tom and DeepLive.exe) used additional data to improve generalizability. The teams did employ clever approaches to sampling (e.g., Deepflash2 using probabilistic sampling to make training time faster) and classification (e.g., DeepLive.exe using a classifier to distinguish between healthy and diseased glomeruli).

Characterizing Human Diversity
Information on the spatial location of FTUs in human tissue makes it possible to characterize human diversity. Specifically, we use data on 7,102 glomeruli and 395 crypt annotations to study the impact of sex, age, and BMI, but also of the location of tissue in the human body, on the number of FTUs per square millimeter. Fig. 3a shows the impact of age on the number of detected glomeruli. Blocks that have the same age are from the very same donor. As can be seen, out of the 8 females, one has 4 tissue blocks (in y-sequence, top-down: 5, 6, 8, 4), one has 3 tissue blocks (3, 2, 7), and two have 2 tissue blocks (9, 11; 10, 14). For the 8 males, two have 3 tissue blocks (in y-sequence: 21, 23, 24; 29, 30, 26) and three have 2 tissue blocks (25, 22; 28, 27; 18, 17). In general, the number of glomeruli per mm² seems to decrease for females and increase for males (except for HBM 322:KQBK.747) going from top to bottom of the kidney. (Slides are numbered by y-position of the 3D reference organ registration.) Understanding the spatial location and density of FTUs across organs is critically important for advancing the construction of a Human Reference Atlas (HRA) 14 . A robust and highly performant FTU detector would make it possible to compute the size, shape, variability in number, and location of FTUs within tissue samples. This information can then be used to characterize human diversity, to decide on what tissue data should be collected next to improve the coverage and quality of an HRA, and for quality control (e.g., FTU size and density that is vastly different from normal might indicate disease, problems with data preprocessing, or problems with segmentation algorithms).

Discussion
There is a need for efficient and accurate segmentation of FTUs both within HuBMAP and the broader biomedical community. Despite many breakthroughs in the field, the currently available methods for glomerulus and crypt image segmentation do not meet this need. This paper compared winning algorithms from the recently completed "HuBMAP -Hacking the Kidney" Kaggle competition and identified the Tom algorithm as the most accurate and generalizable algorithm, and the one with the best run time performance. To our knowledge, this is the first time that scientific evidence is provided of the value of Kaggle competitions to develop algorithms that are superior to existing code. The 1,200 Kaggle teams performed many iterations of experimentation that our team would not have had the time/resources for or thought to try; they built on solutions taken from many different domains to arrive at the winning entries. Given the success of this first competition, we are planning three new Kaggle competitions that aim to advance tissue segmentation and annotation.
Code has been documented and made available freely for anyone to use. We are in the process of preparing this winning algorithm for production usage in the HuBMAP Data Portal 15 and making it available as part of the HRA ecosystem-for free usage by anyone interested to register and analyze tissue. Going forward, kidney and colon datasets that were spatially registered using the HuBMAP registration user interface 16 and that have anatomical structures in which FTUs are known to exist will automatically be segmented. In addition, we are in the process of creating additional datasets with FTU annotations for other organs (nephron tubule in kidney; alveoli in lung; hair follicle in skin; white pulp in spleen; lobule in liver; lobule in lymph node; lobule in thymus; sarcomere in heart). The datasets will be used to run transfer learning for FTUs in other organs and to develop robust pipelines for the automatic segmentation and analysis of FTUs across major organs of the human body.
Since each of the five winning models uses some specific methodology (in data preprocessing, sampling, or training) that gives it an edge over the others, we are exploring taking the best parts of each and constructing a sixth model. For example, Deepflash2 uses a probabilistic sampling strategy that makes its training faster; DeepLive.exe uses additional data and a classifier in its model to improve its results. In addition, training time can be reduced by using distributed training, and training can be monitored in support of optimization and explainability.
Going forward, 3D data of FTUs will be used to identify the number, size, and shape of FTUs in support of machine learning and single-cell simulation of the structure and function of FTUs. Resulting data will be used to increase our collective understanding of (and variability in) the size, number, and location of FTUs in relation to donor sex, age, ethnicity, and BMI. This data and work is also critical for integrating top-down (segmenting out larger known structures) and bottom-up (single-cell data) in multiplexed imaging techniques and relating composition within these structures. Top-down and bottom-up data integration and analysis are needed for constructing an accurate and comprehensive Human Reference Atlas.

Datasets
Renal glomeruli data
Renal glomeruli are groups of capillaries that facilitate filtration of blood in the outer layer of kidney tissue known as the cortex 17 . The size of normal glomeruli in humans ranges from 100-350 μm in diameter and they have a roughly spherical shape 4 . Glomeruli contain four cell types: parietal epithelial cells (CL:1000452), podocytes (CL:0000653), fenestrated endothelial cells (a.k.a. glomerular capillary endothelial cell CL:1001005), and mesangial cells (CL:1000742) 18 . Parietal epithelial cells form the Bowman's capsule. Podocytes cover the outer layer of the filtration barrier. Fenestrated endothelial cells are in direct contact with blood and coated with a glycolipid and glycoprotein matrix called glycocalyx. Mesangial cells occupy the space between the capillary blood vessel loops and are stained by the colorimetric histological stain called Periodic acid-Schiff (PAS) stain 18 . PAS stains polysaccharides (complex sugars like glycogen) such as those found in and around the glomeruli making it a favored stain for delineating them in tissue sections 19 .
The kidney data used in the "HuBMAP -Hacking the Kidney" Kaggle competition comprises 30 whole slide images (WSIs) provided by the BIOmolecular Multimodal Imaging Center (BIOMIC) team at Vanderbilt University (VU) who are also members of HuBMAP's Tissue Mapping Center at VU (TMC-VU). The tissue blocks were collected through the Cooperative Human Tissue Network 20 and either fresh frozen (FF) or formalin fixed, paraffin embedded (FFPE) 21 for preservation. FF tissue is frozen in liquid nitrogen (-190°C) within 30-60 minutes after surgical excision; this type of preservation has been the method of choice for transcriptomics and immunohistochemistry; tissue samples are often embedded in Optimal Cutting Temperature (OCT) media for thin sectioning 22 or carboxymethylcellulose (CMC) for imaging mass spectrometry 23 . FFPE tissue is the preferred method for clinical pathology samples for histology assessment since the formalin aldehyde cross links proteins to maintain structural integrity of the sample 24 . After preservation, the tissue blocks were sectioned 25 and imaged using Periodic acid-Schiff (PAS) staining 26 . The slides were scanned with a brightfield scanner, and the resulting images were converted from vendor formats to Tagged Image File Format (TIFF). The images have a spatial resolution of 0.5 µm, and the average annotation area was calculated in pixels and µm². On average, the 7,102 glomerulus annotations cover 81,813.5 pixels, or 20,453.4 µm².
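The pixel-to-area conversion behind the last two figures can be checked with a few lines of Python. This is a minimal sketch that assumes the stated 0.5 µm resolution is the edge length of one square pixel; the function and constant names are illustrative and not taken from the authors' code.

```python
# Assumption: 0.5 µm is the edge length of one square pixel,
# so each pixel covers 0.5 * 0.5 = 0.25 µm².
MICRONS_PER_PIXEL = 0.5


def pixels_to_square_microns(area_px, microns_per_pixel=MICRONS_PER_PIXEL):
    """Convert an area measured in pixels to square micrometers."""
    return area_px * microns_per_pixel ** 2


# Mean glomerulus annotation area reported in the text, in pixels.
mean_area_um2 = pixels_to_square_microns(81_813.5)
print(f"{mean_area_um2:.1f} µm²")
```

Under this assumption, 81,813.5 pixels correspond to 20,453.375 µm², which agrees with the rounded value given above.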
Each of the 30 kidney datasets used in the Kaggle competition included a PAS stain whole slide image, anatomical region (AR) masks, and glomeruli segmentation masks. The masks were modified GeoJSON files that captured the polygonal outline of annotations by their pixel coordinates (see samples in Fig. 1b), and they were generated from a mix of manually and deep learning (DL) generated annotations. The initial annotations were generated automatically by a segmentation pipeline 27 , then they were inspected and edited by subject matter experts (SMEs) 28 using QuPath 29 . In addition, information on sample size, location, and rotation within the kidney and pertinent clinical metadata (age, sex, ethnicity, BMI, laterality) was provided (see Supplementary Table 3).
For the Kaggle competition, this data was split into three datasets: public train (n=15, for training models), public test (n=5, for model validation), and private test (n=10, for scoring and ranking models). The public datasets were openly available for the competitors to use when designing their models and creating submissions, and the private test set was only available to the Kaggle team and hosts for evaluation of the submissions. After the competition concluded, all data was made available publicly at the HuBMAP Data Portal 15 as the "HuBMAP 'Hacking the Kidney' 2021 Kaggle Competition Dataset -Glomerulus Segmentation on Periodic acid-Schiff Whole Slide Images" collection 30 .

Colonic crypts data
Colonic crypts are epithelial invaginations into the connective tissue (stroma) surrounding the colon, or large intestine 31 . Also known as the crypts of Lieberkühn, they contain stem/progenitor cells in their base and are thought to protect these cells from metabolites 32 . They are also the site of absorption and secretion activities within the colon 33 . Normal human colonic crypts have a diameter of 73.5±3.4µm and length of 433±25µm 34 . In addition to stem cells, there are many epithelial subtypes; major subsets include Paneth (CL:0009009), goblet (CL:1000321), enteroendocrine (CL:0000164), and enterocytes (CL:0002071) 31 . The total number of goblet cells increases from the proximal to distal ends of the colon 35 . Enterocytes are absorptive cells which decrease in numbers from the proximal to distal end of the colon and are responsible for absorption of nutrients 35 . Enteroendocrine cells make up a small proportion of the colonic epithelium (<1%) and secrete hormones that control gut physiology 35 .

Spatial location in human body
The HuBMAP Registration User Interface (RUI) 16,38 was used to capture the three-dimensional size, position, and rotation of all tissue blocks used in this study in close collaboration with subject matter domain experts. The resulting data was used to compute the vertical position of the mass points of all kidney tissue blocks as a proxy of the sequence of tissue sections used here. For the colon, we report the sequence of tissue sections according to the serial extraction sites (ascending colon, transverse colon, descending colon, sigmoid colon).

Computation of FTU density
The approximate number of glomerulus annotations in a square millimeter of cortex annotation, henceforth referred to as "FTU density", was calculated to compare it across cohorts of donors who varied in sex, age, race, and BMI. The 30 glomerulus annotation masks were read into a Jupyter notebook from .json format and saved as Shapely polygons 39 . The average area per glomerulus annotation per sample was calculated in pixels, then converted to square microns. The anatomical region masks, which are rough estimates based on quickly drawn annotations by SMEs, were read into the same Jupyter notebook from .json files as Shapely polygons. The total cortex annotation area per sample was then calculated by summing the area of all cortex annotations and converting from pixels to square microns. The approximate FTU density was calculated from these two values and converted to the number of glomerulus annotations per square millimeter.
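The density computation can be sketched as follows. This is an illustrative version that makes three assumptions not spelled out above: polygon areas are computed with the shoelace formula in plain Python instead of Shapely, one pixel is 0.5 µm on a side, and density is the annotation count divided by the total cortex area. All function names are hypothetical.

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula.

    vertices: list of (x, y) pixel coordinates, in drawing order.
    """
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def ftu_density_per_mm2(glomeruli, cortex_regions, microns_per_pixel=0.5):
    """Approximate number of glomerulus annotations per mm² of cortex.

    Assumption: density = annotation count / summed cortex annotation area.
    """
    cortex_area_px = sum(polygon_area(region) for region in cortex_regions)
    cortex_area_um2 = cortex_area_px * microns_per_pixel ** 2
    cortex_area_mm2 = cortex_area_um2 / 1e6  # 1 mm² = 1,000,000 µm²
    return len(glomeruli) / cortex_area_mm2


# Toy example: a 4000 x 4000 pixel cortex annotation (4 mm² at 0.5 µm/pixel)
# containing eight small square "glomeruli".
cortex = [[(0, 0), (4000, 0), (4000, 4000), (0, 4000)]]
glomeruli = [[(x, 0), (x + 200, 0), (x + 200, 200), (x, 200)]
             for x in range(0, 4000, 500)]
density = ftu_density_per_mm2(glomeruli, cortex)
```

Because the cortex masks are described as rough estimates, any density obtained this way is only as accurate as those annotations.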
A further method employed a CNN with an overlapping sliding window operator to segment glomeruli in trichrome-stained images, but used training data of human origin and watershed segmentation for postprocessing of prediction masks 4 . Methods employing CNNs for glomerulus segmentation have become increasingly popular in recent years, with highly promising performance 6-8 .

Colon crypt segmentation prior work
In 2010, Gunduz-Demir et al. approached the task of automatic segmentation of colon glands using an object-graph in conjunction with a decision tree classifier, which obtained a Dice coefficient of 88.91±4.63, an improvement over the pixel-based counterparts at the time 41 44 . Kainz, Pfeiffer, and Urschler submitted the "vision4GlaS" method, a CNN for pixel-wise segmentation and classification paired with a contour-based approach to separate pixels into objects, to the GlaS Challenge Contest. Their method ranked 10th among the challenge's entries 45 . They also paired two distinct CNNs (Object-Net for predicting labels and Separator-Net for separating glands) for pixel-wise classification of the same H&E stained images 9 . For this second method, they also preprocessed the RGB images, only inputting the red channel into the model. Banwari et al. took a very computationally efficient approach to colonic crypt segmentation by also isolating the red channel from the GlaS Challenge dataset images and applying intensity-based thresholding 11 .

Competition design
The "HuBMAP -Hacking the Kidney" Kaggle competition teams were tasked with the challenge of detecting glomeruli FTUs in kidney data across different tissue preparation pipelines (FF and FFPE). The goal was the implementation of a highly accurate and robust FTU segmentation algorithm.
Two separate types of prizes were offered: Accuracy Prizes and Judges Prizes. The Accuracy Prize awarded $32,000 to the three teams with the highest scores on the Kaggle leaderboard at the conclusion of the competition (1st: $18,000, 2nd: $10,000, 3rd: $4,000). The Judges Prize awarded $28,000 to the teams that advanced science and/or technology (Scientific Prize: $15,000), were the most innovative (Innovation Prize: $10,000), or were the most diverse (Diversity Prize: $3,000) as identified by the panel of judges through a presentation of the teams' findings and subsequent scoring based on a predetermined rubric 51 . Teams were allowed to enter in multiple categories and had the option of either receiving cash prizes or choosing to have their winnings donated to a charity foundation. Additionally, the use of supplemental publicly available training data was allowed, but organizers were not permitted to participate.
The competition launched on November 16th, 2020 and ran through the final submission date of May 10th, 2021. The data was updated and timeline extended on March 9th, 2021, and the Awards ceremony was held on May 21st, 2021. Submissions were made in the form of Kaggle notebooks with a run-length encoding of the predictions saved in a "submission.csv" file. The notebooks had to run in less than or equal to 9 hours without internet access. See Competition Rules 52 and Judging Rubric 51 for more details.
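The run-length encoding used for submissions can be illustrated with a short sketch. The exact conventions are an assumption here, since they are not spelled out above: the sketch follows the common Kaggle style of space-separated "start length" pairs, with 1-indexed starts and pixels numbered column by column (top to bottom, then left to right). The function names are illustrative.

```python
def rle_encode(mask):
    """Encode a binary mask (list of rows of 0/1) as 'start length' pairs.

    Assumed conventions: 1-indexed starts, column-major pixel order.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    flat = [mask[r][c] for c in range(width) for r in range(height)]
    runs = []
    run_start = None
    for index, value in enumerate(flat):
        if value and run_start is None:
            run_start = index  # a new run of foreground pixels begins
        elif not value and run_start is not None:
            runs.append((run_start + 1, index - run_start))
            run_start = None
    if run_start is not None:  # run reaches the last pixel
        runs.append((run_start + 1, len(flat) - run_start))
    return " ".join(f"{start} {length}" for start, length in runs)


def rle_decode(encoded, height, width):
    """Inverse of rle_encode under the same assumed conventions."""
    flat = [0] * (height * width)
    numbers = [int(token) for token in encoded.split()]
    for start, length in zip(numbers[0::2], numbers[1::2]):
        for i in range(start - 1, start - 1 + length):
            flat[i] = 1
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


example_mask = [[0, 1, 1],
                [0, 1, 0]]
encoded = rle_encode(example_mask)
```

Encoding one mask per WSI keeps the submission file small, but it also means that touching FTUs end up in the same mask and cannot be told apart without further processing.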
Algorithm performance was evaluated using the mean Dice coefficient (see Metrics). The leaderboard scores were the mean of the Dice coefficients for all ten WSI in the private test set. Any test WSI with predictions missing completely was factored into the mean score as a zero. This metric has been successfully used for previous segmentation task challenges. For example, 922 teams competed in the "Ultrasound Nerve Segmentation" Kaggle competition 53 . The top scoring teams achieved a mean Dice coefficient of 0.73226 and 0.73132 for the private and public leaderboards, respectively. Another competition, entitled "SIIM-ACR Pneumothorax", engaged 1,475 teams to classify and segment pneumothorax from chest radiographic images, with leaderboard scores topping at 0.8679 and 0.9304 mean Dice coefficients for private and public datasets, respectively 54 . A third competition, "Severstal: Steel Defect Detection", focused on localizing and classifying surface defects on a sheet of steel 55 ; it had 2,427 teams competing and achieved mean Dice coefficients of 0.90883 (private leaderboard) and 0.92472 (public leaderboard).
In the "HuBMAP -Hacking the Kidney" Kaggle competition, a total of 1,200 teams competed and the top-5 scoring teams had a mean Dice coefficient of 0.9515 and 0.9512 for the private and public leaderboards, respectively. These are the highest scores for this type of challenge ever achieved.
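The leaderboard scoring rule described in this section (mean Dice over all test WSI, with a missing prediction counted as zero) can be sketched as below. The function name and the WSI identifiers are hypothetical.

```python
def leaderboard_score(dice_by_wsi, expected_wsi_ids):
    """Mean Dice over all expected test WSI.

    dice_by_wsi: mapping from WSI id to its Dice coefficient.
    A WSI without any prediction contributes 0.0 to the mean.
    """
    scores = [dice_by_wsi.get(wsi_id, 0.0) for wsi_id in expected_wsi_ids]
    return sum(scores) / len(scores)


# Hypothetical example: three test WSI, one of them without predictions.
score = leaderboard_score({"wsi_a": 0.9, "wsi_b": 0.8},
                          ["wsi_a", "wsi_b", "wsi_c"])
```

Scoring a missing WSI as zero means that a notebook failing on even one slide lowers the mean noticeably, which rewards robust pipelines.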

Transfer learning
While the Kaggle competition involved developing models for segmenting glomeruli in kidney tissue samples, it is crucial to test the generalization capability of such segmentation models across other organs. To accomplish this goal, we implemented several strategies to train and test the models:
1) The models are trained only on the kidney data and tested on kidney data.
2) The models are trained on kidney data and tested on colon data (without training on any colon data).
3) The models are trained only on colon data and tested on colon data.
4) The models are trained on colon data (using the pretrained models on kidney data for transfer learning) and tested on colon data.
The fourth strategy is known as transfer learning in machine learning. It is widely used to improve performance on a dataset by pretraining the model on a different but similar dataset. This allows the model to learn more features from the previous dataset and helps improve the generalizability of the overall model. Transfer learning may involve training the entire model or freezing some layers of the model and training the remaining unfrozen layers.

Algorithms
Teams "Tom," "Gleb," and "Whats goin on" won first, second and third place for the accuracy prize respectively. DeepLive.exe and Deepflash2 won the first and second judges prizes respectively. The setup, optimization, and prediction run of all five algorithms are discussed here.

Tom
The model uses a single U-Net SeResNext101 architecture with Convolutional Block Attention Module (CBAM) 56 , hypercolumns, and deep supervision. It reads the WSIs as 1024x1024 pixel tiles, which are then resized to 320x320 pixels and sampled using a balanced sampling strategy. The model is trained using a combination of binary cross-entropy loss 57 and Lovász hinge loss 58 , and the optimizer used is SGD (stochastic gradient descent) 59 . Training runs for 20 epochs, with a learning rate of $10^{-4}$ to $10^{-6}$ and a batch size of 8.
For the model trained on colon data from scratch or using transfer learning, the training is done for 50-100 epochs and the validation set is increased from 1 slide to 2 slides.

Gleb
The model is trained using an ensemble of four 4-fold models, namely Unet-regnety16, Unet-regnetx32, UnetPlusPlus-regnety16 60 , and Unet-regnety16 with scse attention decoder. The model reads tiles of size 1024x1024 sampled from the kidney/colon data. During model training, general data augmentation techniques such as adding gaussian blur and sharpening, adding gaussian noise, and applying random brightness or gamma value are used. The models are trained for 50-80 epochs each, with a learning rate of $10^{-4}$ to $10^{-6}$ and a batch size of 8. The loss function is Dice coefficient loss 61 and the optimizer used is AdamW 62 .
For the model trained on colon data from scratch or using transfer learning, the model is trained for 50-100 epochs and the sampling downscale factor is changed from 3 to 2.

Whats goin on
Model training uses an ensemble of 2 sets of 5-fold models using the U-Net 63 architecture (pretrained on ImageNet) with resnet50_32x4d and resnet101_32x4d 64 as backbones, respectively. Additionally, a Feature Pyramid Network (FPN) 65 is added to provide skip connections between upscaling blocks of the decoder, atrous spatial pyramid pooling (ASPP) 66
For the model trained on colon data from scratch or using transfer learning, the batch size is increased to 64 and the expansion tile size is increased to 64.

DeepLive.exe
The model architecture used is a simple U-net 68 with an efficientnet-b 69 encoder. In addition to the provided training data, the model is trained on additional data from Mendeley 70 (31 WSIs), Zenodo 71 (20 WSIs), and the HuBMAP Data Portal 15 (2 WSIs). The additional data is annotated into two classes: healthy and unhealthy glomeruli. The model employs a dynamic sampling approach whereby it samples tiles of size 512x512 pixels (at a resolution downscale factor of 2) and 768x768 pixels (at a resolution downscale factor of 3). The tiles are sampled from regions having visible glomeruli in them based on annotations, instead of sampling randomly. Model training uses the cross-entropy loss, the Adam optimizer, an adaptive learning rate (linearly increased up to 0.001 during the first 500 iterations and then linearly decreased to 0), and a batch size of 32. During training, general data augmentation techniques such as brightness and contrast changes, RGB shifting, HSV shifting, color jittering, artificial blurring, CutMix 72 , and MixUp 73 are used. The model is trained using 5-fold cross validation for at least 10,000 iterations.
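The adaptive learning rate described for DeepLive.exe (linear warm-up followed by linear decay) can be sketched as a simple function of the iteration number. The total of 10,000 iterations and the starting value of 0 are assumptions made for illustration; the text only states "at least 10,000 iterations" and does not give the initial rate. The function name is hypothetical.

```python
def warmup_decay_learning_rate(iteration, max_lr=1e-3,
                               warmup_iters=500, total_iters=10_000):
    """Linear warm-up to max_lr, then linear decay to 0.

    Assumptions: the rate starts at 0 and training ends at total_iters.
    """
    if iteration < warmup_iters:
        return max_lr * iteration / warmup_iters
    remaining = max(0, total_iters - iteration)
    return max_lr * remaining / (total_iters - warmup_iters)


# Learning rate at a few points of the assumed 10,000-iteration schedule.
schedule = [warmup_decay_learning_rate(i) for i in (0, 250, 500, 5250, 10_000)]
```

A short warm-up of this kind is commonly used to avoid large, unstable updates while the optimizer statistics are still being estimated.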
The key to the model is to reframe the problem as a healthy/unhealthy glomerulus classification problem along with a segmentation problem. This setup enables the model to learn to classify the unhealthy glomeruli as glomeruli and then decide whether the particular instance is healthy enough.
For the model trained on colon data from scratch, on_spot_sampling of 1 and an overlap factor of 2 is used. For the model trained on colon data using transfer learning, on_spot_sampling is set to 1 and an overlap factor of 1 is used. In both cases, no external datasets are used for training.

Deepflash2
The model architecture used is a simple U-Net architecture with an efficientnet-b2 encoder (pretrained on ImageNet 74 ). Input data is converted and stored in .zarr file format for efficient loading at runtime. The model combines two sampling approaches: 1) Sampling tiles that contain all glomeruli (to ensure that each glomerulus is seen at least once during each epoch of training). 2) Sampling random tiles based on region (cortex, medulla, background) probabilities (to give more weight to the cortex region during training since glomeruli are mainly found in the cortex). The region sampling probabilities were chosen based on expert knowledge and experiments: 0.7 for cortex, 0.2 for medulla, and 0.1 for background. At runtime, the model samples tiles of size 512x512 and uses a resolution downscale factor of 2, 3, and 4 in subsequent runs. During training, general data augmentation techniques such as flipping, blurring, and deformation are applied. Model training uses a weighted sum of Dice 75 and cross-entropy loss 76 (where both losses have equal weight), the Ranger 77 optimizer (a combination of RAdam 78 and LookAhead optimizer 79 ), a maximum learning rate of 1e-3, and a batch size of 16. The model training is done using a learning rate scheduler whereby the learning rate is scheduled with a cosine annealing 80 from max_learning_rate / div to max_learning_rate (where div=25). The models are trained and tested using 5-fold cross validation in which each fold is trained on 12 WSIs and validated on 3 WSIs. The best model ensemble for the final score consists of three models trained on different zoom scales (i.e., 2x, 3x, 4x).
For the model trained on the colon data (both with and without transfer learning), the background probability is set to 0.1 and the colon probability is set to 0.9 for sampling, since the colon data lacks the masks for anatomical structures. A weight decay of $10^{-5}$ was added (for the model trained without transfer learning). For the transfer learning model, saved weights are loaded from the model trained on kidney data at 3x downsampling and the first 13 parameter groups are frozen during training.
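Two ingredients of the Deepflash2 setup, region-weighted tile sampling and the cosine learning rate ramp, can be sketched in plain Python. This is an illustration under stated assumptions and not the Deepflash2 implementation: the function names are hypothetical, `progress` is assumed to run from 0 to 1 over the ramp, and only the ramp from max_learning_rate / div to max_learning_rate described above is modeled.

```python
import math
import random

# Region sampling probabilities for kidney data, as given in the text.
REGION_PROBABILITIES = {"cortex": 0.7, "medulla": 0.2, "background": 0.1}


def sample_region(rng, probabilities=REGION_PROBABILITIES):
    """Pick the region from which the next random tile is drawn."""
    regions = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(regions, weights=weights, k=1)[0]


def cosine_ramp(progress, max_learning_rate=1e-3, div=25.0):
    """Cosine-shaped ramp from max_learning_rate / div to max_learning_rate.

    progress: fraction of the ramp completed, assumed to lie in [0, 1].
    """
    start = max_learning_rate / div
    weight = (1 - math.cos(math.pi * progress)) / 2  # goes from 0 to 1
    return start + (max_learning_rate - start) * weight


rng = random.Random(0)
regions = [sample_region(rng) for _ in range(1000)]
cortex_fraction = regions.count("cortex") / len(regions)
```

For the colon data, the same sampler could be called with the two-entry mapping described above (0.9 for colon tissue, 0.1 for background).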

Performance Metrics Terminology
Ground Truth. The set of all FTU segmentations in the human annotated dataset using the SOP at (cite) is called ground truth (GT, blue in Fig. 4). The sets can be represented via vector-based polylines or pixel masks and different algorithms are used to compare these. Note that the metrics in Fig. 4 can be applied to pixels that represent an object of interest (e.g., an FTU) or to FTU counts.

Performance Metrics
Dice coefficient, or Sørensen-Dice index 81 , is widely used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The formula is given by $\frac{2\,|GT \cap PS|}{|GT| + |PS|}$, where GT is the set of ground truth pixels and PS is the set of predicted pixels, see Fig. 4. The Dice coefficient is defined to be 1 when both sets are empty.
Mean Dice coefficient is the sum of all Dice coefficients (e.g., one for each image in the test set) divided by the count of all numbers in the collection (e.g., the number of images in the test set).
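A minimal sketch of the Dice coefficient on pixel sets is given below, including the convention that two empty sets score 1. Representing masks as Python sets of (row, column) tuples is an assumption made for brevity; the function name is illustrative.

```python
def dice_coefficient(ground_truth, prediction):
    """Dice coefficient between two collections of pixel coordinates."""
    gt, ps = set(ground_truth), set(prediction)
    if not gt and not ps:
        return 1.0  # defined as 1 when both sets are empty
    return 2 * len(gt & ps) / (len(gt) + len(ps))


# Two 2 x 2 pixel squares that overlap in one column of two pixels.
gt_pixels = {(0, 0), (0, 1), (1, 0), (1, 1)}
ps_pixels = {(0, 1), (1, 1), (0, 2), (1, 2)}
example_dice = dice_coefficient(gt_pixels, ps_pixels)  # 2 * 2 / (4 + 4)
```

The mean Dice coefficient is then the arithmetic mean of such values over all images in the test set.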
Recall, also referred to as sensitivity, measures the proportion of instances that were correctly predicted compared to the sum of false negatives and true positives. It is defined as $\frac{TP}{TP + FN}$, see Fig. 4.
Precision denotes the proportion of predictions that were correct and it is defined as $\frac{TP}{TP + FP}$, see Fig. 4.
Other performance metrics used by related work, see Supplementary Table 1 and 2:
F-measure/F-score/F1-score: The F-measure, also called the F-score or F1-score, is the harmonic mean of precision and recall, defined as $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.
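Recall, precision, and the F-measure can be computed from counts of true positives (TP), false positives (FP), and false negatives (FN) as sketched below. Returning 0.0 when a denominator is zero is an assumption of this sketch, and the counts in the example are made up for illustration.

```python
def recall(tp, fn):
    """Share of ground truth instances that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predictions that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
example = (recall(90, 30), precision(90, 10), f_measure(90, 10, 30))
```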

Hausdorff Distance:
The Hausdorff distance is a measure used to calculate how similar two objects or images are to one another by calculating the distance between two sets of edge points 82 . It is given by $d_H(A, B) = \max\{\sup_{a \in A} \inf_{b \in B} d(a, b),\ \sup_{b \in B} \inf_{a \in A} d(a, b)\}$, where A and B are the two objects being compared, e.g., GT and PS in Fig. 4.
Intersection over union: $\frac{|A \cap B|}{|A \cup B|}$ represents the proportion of area of overlap out of the area of union for the two objects.
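Both measures can be sketched in plain Python for small inputs. The Hausdorff distance below uses a brute-force comparison of two finite, non-empty sets of edge points with Euclidean distance, and the overlap-over-union ratio works on pixel sets. The function names are illustrative, and returning 1.0 for two empty sets is an assumption of this sketch.

```python
import math


def directed_hausdorff(points_a, points_b):
    """Largest distance from a point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)


def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))


def intersection_over_union(pixels_a, pixels_b):
    """Area of overlap divided by area of union of two pixel sets."""
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


edge_a = [(0, 0), (1, 0)]
edge_b = [(0, 0), (4, 0)]
example_distance = hausdorff_distance(edge_a, edge_b)
example_iou = intersection_over_union({1, 2, 3}, {2, 3, 4})
```

The brute-force loop is quadratic in the number of edge points, so it is only meant to make the definition concrete and is not suited to full-resolution WSI contours.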

Segmentation Mask Analysis
Ground truth segmentation masks were provided as vector files (one polyline per FTU; many FTUs per WSI). However, algorithm predictions are generated as run-length encodings-one mask for all FTUs in each WSI. Some FTUs are adjacent, effectively merging multiple FTUs into one; this makes it hard to count FTUs or to compute the Dice coefficient but also recall and precision per FTU.
Manually, we added 647 lines to the 70 predicted WSI segmentation masks (232 lines for 50 kidney slides and 415 lines for 20 colon slides) to separate glued-together FTUs. We then converted pixel masks for each FTU into one polyline per FTU. Next, we calculated the Dice coefficient for each segmented FTU (glomerulus or crypt) separately, assuming that a Dice coefficient greater than 0.5 indicates that the FTU was correctly predicted; these FTUs form the set of true positives (TP). All predicted FTUs with a Dice coefficient less than 0.5 are false positives (FP), while all ground truth masks with no matching algorithm predictions are false negatives (FN). All results of Dice coefficient, recall, and precision computations are provided in Supplementary Table 5.
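The per-FTU counting rule can be sketched as follows. This illustration makes several assumptions that the description leaves open: each FTU is represented as a set of pixel coordinates, every prediction is greedily paired with the not-yet-matched ground truth FTU of highest Dice, and a Dice of exactly 0.5 counts as a false positive. The function names are hypothetical.

```python
def dice(a, b):
    """Dice coefficient between two pixel sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def count_ftu_matches(ground_truth_ftus, predicted_ftus, threshold=0.5):
    """Return (TP, FP, FN) counts for per-FTU pixel sets."""
    matched_gt = set()
    tp = fp = 0
    for pred in predicted_ftus:
        best_index, best_score = None, 0.0
        for index, gt in enumerate(ground_truth_ftus):
            if index in matched_gt:
                continue  # each ground truth FTU is matched at most once
            score = dice(gt, pred)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score > threshold:
            matched_gt.add(best_index)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truth_ftus) - len(matched_gt)
    return tp, fp, fn


# Toy example with pixels given as integers: one good match, one stray
# prediction, and one ground truth FTU that was missed.
gt_ftus = [{1, 2, 3, 4}, {10, 11, 12}]
pred_ftus = [{1, 2, 3}, {20, 21}]
counts = count_ftu_matches(gt_ftus, pred_ftus)
```

Recall and precision per WSI then follow directly from these three counts.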