CellSeg3D: self-supervised 3D cell segmentation for light-sheet microscopy

Understanding the complex three-dimensional structure of cells is crucial across many disciplines in biology and especially in neuroscience. Here, we introduce a novel 3D self-supervised learning method designed to address the inherent complexity of quantifying cells in 3D volumes, often in cleared neural tissue. We offer a new 3D mesoSPIM dataset and show that CellSeg3D can match state-of-the-art supervised methods. Our contributions are made accessible through a Python package with full GUI integration in napari.

The analysis of such large-scale 3D datasets presents a significant challenge due to the size, complexity and heterogeneity of the samples. Yet, accurate and efficient segmentation of cells is a crucial step towards density estimates as well as quantification of morphological features. To begin to address this challenge, several studies have explored the use of supervised deep learning techniques using convolutional neural networks (CNNs) or transformers for improving cell segmentation accuracy (2-5). Various methods now exist for performing instance segmentation on the models' outputs in order to separate segmentation masks into individual cells.
Typically, these methods use a multi-step approach, first segmenting cells in 2D images, optionally performing instance segmentation, and then reconstructing them in 3D using the volume information (3). While this can be successful in many contexts, this approach can suffer from low recall or have trouble retaining finer, non-convex labeling. Nevertheless, by training on (ideally large) human-annotated datasets, these supervised learning methods can learn to accurately segment cells in 2D, and ample 2D datasets now exist thanks to community efforts (6).
However, directly segmenting in 3D ("direct-3D") volumes could limit errors and streamline processing by retaining important morphological information (2). Yet, 3D annotated data is lacking (6), likely due to the fact that it is highly time-consuming to generate. For example, to our knowledge, 3D segmentation datasets of cells in whole-brain LSM volumes are not available, despite the existence of open-source microscopy database repositories (7).
Moreover, unsupervised learning, such as self-supervised learning, has emerged as a powerful approach for training deep neural networks without the need for explicit labeling of data. In the context of segmentation of cells, several studies have explored the use of unsupervised techniques to learn representations of cellular structures and improve segmentation accuracy (8, 9). However, these methods rely on adversarial learning, which can be difficult to train, and they have not been shown to provide accurate 3D results on cleared tissue for LSM data, which can suffer from clearing and artifacts.
Here, we present a new 3D dataset that is hand-annotated in 3D from mesoSPIM-acquired volumes (Figure 1a), and a custom toolbox for direct-3D supervised and self-supervised cell segmentation built on state-of-the-art transformers and 3D CNN architectures (10, 11) paired with classical image processing techniques (12). First, we benchmark our methods against Cellpose and StarDist, two leading supervised cell segmentation packages with user-friendly workflows, and show our methods match or outperform them (in the low data regime) in 3D semantic segmentation on mesoSPIM-acquired volumes (Figure 1b, c). Then, we show that our self-supervised model, WNet3D, without any human-labeled data, can be as good as, or better than, supervised models (Figure 1d-h).
First, we developed a 3D human-annotated dataset based on data acquired with a mesoSPIM system (1) (Figure 1a, see Methods). Using whole-brain data from mice, we cropped small regions and annotated by hand, in 3D, 2,632 neurons that were endogenously labeled by TPH2-tdTomato (Figure 1a).
We then trained two models for supervised direct-3D segmentation. Specifically, we used a SwinUNetR transformer (11), and a SegResNet CNN (13) from the MONAI project (14). We benchmarked these models against Cellpose (3, 15) and StarDist (2) and found that our supervised models have comparable instance segmentation performance on a held-out (unseen) test dataset, as measured by the F1 vs. IoU threshold (see Methods, Figure 1b, c). Note, for a fair comparison, we performed a hyperparameter sweep of all models tested (Supplemental Figure S1a-d), and in Figure 1b and c we show the quantitative and qualitative best models.
Next, we built a new self-supervised model for direct-3D segmentation that requires no ground truth training data, only raw volumes. Our new model, called WNet3D, is built on WNet (10) (see Methods, Figure 1d). Our changes include a conversion to a fully 3D architecture and adding the Soft-NCuts loss, replacing the proposed two-step model update with the weighted sum of the encoder and decoder losses, and trimming the number of weights for fast inference (see Methods).
We found that WNet3D could be as good as or better than the fully supervised models, especially in the low data regime, on this dataset at semantic segmentation (Figure 1e, f). Notably, our pre-trained WNet3D, which is trained on 100% of raw data without any labels, achieves 0.81 ± 0.004 F1-Score with simple filtering of artifacts (removing the slices containing the problematic regions; Suppl. Figure S1g) and 0.74 ± 0.12 without any filtering. To compare, we trained supervised models with 10, 20, 60 or 80% of the training data and tested on the held-out data subsets. Considering models with 80% of the training data, the F1-Score was 0.83 ± 0.01 for SwinUNetR, 0.76 ± 0.03 for Cellpose (tuned), 0.74 ± 0.06 for SegResNet, 0.72 ± 0.07 for StarDist (tuned), 0.61 ± 0.07 for StarDist (default), and 0.43 ± 0.09 for Cellpose (default). For WNet3D trained with 80% of the raw data, the F1-Score was 0.71 ± 0.03 (unfiltered) (Figure 1f), which is still on par with the top supervised models.
Note that WNet3D uses brightness to detect objects, and therefore cannot discriminate cells vs artifacts. Filtering could be used when sufficient (e.g., using rules based on label volume to remove aberrantly small or large particles), or one could use WNet3D to generate 3D labels in order to train a suitable supervised model (such as Cellpose or SwinUNetR), which would be able to distinguish artifacts from cells.
To show the feasibility of this approach, we trained a SwinUNetR using WNet3D self-supervised generated labels (Figure 1g) and show it can be nearly as good as a fully supervised model that required human 3D labels (no significant difference across F1 vs. IoU thresholds; Kruskal-Wallis H test, H = 4.91, p = 0.085, n = 9).
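The volume-based filtering rule mentioned above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `filter_labels_by_volume` and the voxel bounds are hypothetical and are not taken from the plugin.

```python
import numpy as np

def filter_labels_by_volume(labels, min_voxels, max_voxels):
    """Zero out instance labels whose voxel count is outside [min_voxels, max_voxels].

    Hypothetical helper for illustration; label 0 is treated as background.
    """
    labels = np.asarray(labels)
    ids, counts = np.unique(labels, return_counts=True)
    in_range = (counts >= min_voxels) & (counts <= max_voxels) & (ids != 0)
    keep = ids[in_range]
    return np.where(np.isin(labels, keep), labels, 0)

# Toy volume: label 1 has 1 voxel, label 2 has 8 voxels, label 3 has 64 voxels.
vol = np.zeros((8, 8, 8), dtype=np.int32)
vol[0, 0, 0] = 1
vol[2:4, 2:4, 2:4] = 2
vol[4:8, 4:8, 4:8] = 3

# Bounds chosen arbitrarily for the toy example.
filtered = filter_labels_by_volume(vol, min_voxels=4, max_voxels=32)
```

In practice the bounds would have to be chosen from the expected cell size at the imaging resolution; such a rule removes particles by size only and cannot separate artifacts that happen to be cell-sized.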
Lastly, we highlight that the models we present are available in a new napari plugin we developed, with full support for labeling, training (self-supervised or supervised), model inference, and evaluation plus many other utilities (Figure 2a). Moreover, our pretrained WNet3D can be used "zero-shot" on diverse data, such as Platynereis nuclei and mouse skull bone nuclei (both collected with confocal microscopy; Figure 2b, c), even though qualitatively these datasets look quite distinct from the dataset used for pretraining. We also tested the pretrained WNet3D on c-FOS stained tissue, which had a more difficult signal-to-noise ratio due to clearing and antibody staining, from whole brains of mice acquired with a mesoSPIM (Figure 2d).

Limitations & Conclusions
This work focused on a budding new area of self-supervised 3D cell segmentation for light-sheet microscopy. We provide the first-in-kind open source ground truth dataset of mesoSPIM data that we hope sparks more methods to be developed. One major limitation for the field has been the lack of 3D data (6). Thus, while we put considerable effort here into providing a new 2500-cell 3D dataset, more datasets will be needed in the future to understand the limitations of self-supervised learning for this type of data.
Another limitation is that self-supervised methods are going to excel in samples that have enough separation in the signal-to-noise (i.e., clear nuclei). As discussed above, our method works by detection, and as with any semantic segmentation method, this then requires fine-tuning of threshold parameters. With ground truth data this is straightforward, but if one lacks any ground truth, this can be subjective. We aim to limit this problem by showing how active learning can help. In Figure 1g, we show how self-supervised learning can act as a step to pseudo-label. We provide the software to then inspect and correct these pseudo-labels. Then one can use these labels for training and perform on par with top supervised methods, here the transformer approach, SwinUNetR.
While we focused our efforts on rather uncluttered nuclei (although see the mouse skull in Figure 2b, c), we believe that our self-supervised semantic segmentation model could be applied to more challenging data as long as the above limitations are taken into account. For instance segmentation, if the cells are more overlapping, etc., more complex methods, such as the star-convex polygons used by StarDist to approximate the shapes of cell nuclei, could be adapted to recover higher-quality instance labels (since it is agnostic to the classification backbone used (2)).
We also report the effects of artifacts in the outputs before and after thresholding (Figure 1f). We believe that the benefit of fully self-supervised learning is worth the cost of post-hoc processing of these types of easy-to-spot and easy-to-fix mistakes. Generating a large ground truth 3D dataset is on the order of hundreds of human-hours of labeling efforts. Along these lines, in Figure 2c we demonstrate the performance on more densely labeled samples, which is still possible with this approach. We also qualitatively hint that this approach can perform on noisy samples, such as in c-FOS labeled brains (Figure 2d). While no ground truth is available for formal benchmarking, this is the type of data this tool was designed for: cleared brain samples collected with mesoSPIM systems.
In summary, CellSeg3D supports high-performance supervised and self-supervised direct-3D segmentation, primarily in light-sheet data. Our napari plugin supports both the pretrained WNet3D for testing and the ability to train WNet3D and the other models presented here (SegResNet, SwinUNetR). We also provide various tools for pre- and post-processing as well as utilities for labeling in 3D. We additionally provide our new 2500-cell 3D dataset intended for benchmarking 3D cell segmentation algorithms on mesoSPIM-acquired cleared tissue (see Data Card), and all code is fully open-sourced at https://github.com/AdaptiveMotorControlLab/CellSeg3D.

Datasets
CellSeg3D mesoSPIM dataset: acquisition and labeling. The whole-brain data by Voigt et al. (1) was obtained from the IDR platform (7); the volume consists of CLARITY cleared tissue from a TPH2-tdTomato mouse. Data was acquired with the mesoSPIM system at a zoom of 0.63X with 561 nm excitation.
The data was cropped to several regions of the somatosensory (5 volumes, without artifacts) and visual cortex (1 volume, with artifacts) and annotated by an expert. All volumes were annotated by hand (see Dataset Card below for more details). The ground-truth cell count for the dataset is as follows:
Additional datasets. Additional datasets, used in Figure 2c, were taken from the GitHub page of EmbedSeg (16). We used our pretrained WNet3D, without re-training (the model was only trained on our new dataset described above), to generate semantic segmentation. Images and labels were first cropped to contents, discarding empty regions on the edges. We then downscaled the images and labels by a factor of two to reduce runtime. We obtained the raw WNet3D prediction simply by adding the images to napari and using the Inference tool of the plugin with WNet3D, without changing any parameters from default. Next, the channel containing the foreground was thresholded and the Voronoi-Otsu method used to generate instance labels (for Platynereis data), with hyperparameters based on the F1-Score metric with the ground truth from data separate to the one on which we evaluate performance. However, these parameters can also be estimated directly. This is documented at https://c-achard.github.io/cellseg3d-figures/fig2-b-c-extra-datasets/self-supervised-extra.html#threshold-predictions.
For the Mouse Skull Nuclei instance segmentation, we performed additional post-processing using clEsperanto (12) to perform a morphological closing operation with radius 8 on semantic labels in order to remove small holes. The image was then remapped to values ∈ [0; 100] for convenience, before merging labels with a touching border within intensity range between 35 and 100 using the merge_labels_with_border_intensity_within_range function. This is documented in our linked
For Figure 2d, we used a wild type C57BL/6J adult mouse (17 weeks old, female) that was given appetitive food 90 min before deep anesthesia and intra-cardial perfusion with 4% PFA. We followed established guidelines for iDISCO (17). In brief, the brain was dehydrated, bleached, permeabilized and stained for c-FOS using anti-c-FOS Rat monoclonal purified IgG (Synaptic Systems, Cat. No. 226 017) followed by a Donkey anti-Rat IgG Alexa Fluor™ 555 (Invitrogen A78945) secondary antibody. Then, the whole brain was imaged on a mesoSPIM (1). Imaging was performed with a laser at a wavelength of 561 nm, with a pixel size of 5.26 × 5.26 µm in x, y, and a step size of 5 µm in z. All experimental protocols adhered to the stringent ethical standards set forth by the Veterinary Department of the Canton Geneva, Switzerland, with all procedures receiving approval and conducted under license number 33020 (GE10A).
Segmentation models and algorithms: Self-supervised semantic segmentation
WNet3D model architecture. To perform self-supervised cell segmentation, we adapted the WNet architecture proposed by Xia and Kulis (10), an autoencoder architecture based on joining two U-Net models end-to-end. We provide a modified version of the WNet, named WNet3D, with the following changes:
• A conversion of the architecture for fully-3D segmentation, including a 3D SoftNCuts loss
• Replacing the proposed two-step model update with the weighted sum of the encoder and decoder losses, updated in a single backward pass
• Reducing the overall depth of the encoder and decoder, using three up/downsampling steps instead of four
• Replacing batch normalization with group normalization, tuning the number of groups based on performance
Reducing the number of layers improved overall performance by reducing overfitting and sped up training and inference. This trimming was meant to reduce the large number of parameters resulting from a naive conversion of the original WNet architecture to 3D, which were found to be unnecessary for the present cell segmentation task. Finally, we introduced group normalization (18) to replace batch normalization, which improved performance in the present low batch size setting, as well as training and inference speed.
To summarize, the model consists of an encoder $U_{enc}$ and decoder $U_{dec}$, as originally proposed; however, each UNet comprises 7 blocks, for a total of 14 blocks, down from 9 blocks per UNet originally. $U_{enc}$ and $U_{dec}$ start and end with two 3 × 3 × 3 3D convolutional layers; in between are 5 blocks, each block being defined by two 3 × 3 × 3 3D convolutional layers, followed by a ReLU and group normalization (18) (instead of batch normalization). Skip connections are used to propagate information by concatenating the output of descending blocks to that of their corresponding ascending blocks. Blocks are followed by 2 × 2 × 2 max pooling layers in the descending half of $U_{enc}$ and $U_{dec}$; the ascending half uses 2 × 2 × 2 transpose convolution layers with stride 2. $U_{enc}$ is then followed by a 1 × 1 × 1 3D convolutional layer to obtain class logits, followed by a softmax, the output of which is provided to $U_{dec}$ to perform the reconstruction. $U_{dec}$ is similarly followed by a 1 × 1 × 1 3D convolutional layer and outputs the reconstructed volume. Refer to Figure 1d for a complete overview of the WNet3D architecture.
Losses. Segmentation is performed in $U_{enc}$ by using an adapted 3D SoftNCuts loss (Shi and Malik (19)) as an objective, with the voxel brightness differences defining the edge weight in the calculation, as proposed in the initial NCuts algorithm.
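The SoftNCuts is defined as
$$\mathrm{NCuts}_K(V) = \sum_{k=1}^{K} \frac{\mathrm{cut}(A_k, V \setminus A_k)}{\mathrm{cut}(A_k, V)},$$
where $\mathrm{cut}(A, B) = \sum_{u \in A, v \in B} w(u, v)$, $V$ is the set of all pixels, $A_k$ the set of all pixels labeled as class $k$ ($K$ being the number of classes, which is set to 2 here) and $w(u, v)$ is the weight of the edge $uv$ in a graph representation of the image. In order to group the voxels according to brightness, $w(u, v)$ is defined here as
$$w(u, v) = \exp\left(-\frac{\lVert F(u) - F(v) \rVert_2^2}{\sigma_I^2}\right) \cdot \begin{cases} \exp\left(-\dfrac{\lVert X(u) - X(v) \rVert_2^2}{\sigma_X^2}\right) & \text{if } \lVert X(u) - X(v) \rVert_2 < r, \\ 0 & \text{otherwise,} \end{cases}$$
with $F(i) = I(i)$ the intensity value, $X$ the spatial position of the voxel, $\sigma_I$ the standard deviation of the feature similarity term, termed "intensity sigma", $\sigma_X$ the standard deviation of the spatial proximity term, termed "spatial sigma", and $r$ the radius for the calculation of the loss, to avoid computing every pairwise value.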
In our experiments, lowering the radius greatly sped up training without impacting performance, even with a radius as low as 2 voxels. For the spatial sigma, the original value of 4 was used, whereas for the intensity sigma we use a value of 1 (originally 4), after remapping voxel values in each image to the [0; 100] range.
$U_{dec}$ then uses a suitable reconstruction loss to reconstruct the original image; we used either Mean Squared Error (MSE) or Binary Cross Entropy (BCE) as defined in PyTorch.
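To make the role of the intensity sigma, spatial sigma and radius concrete, the following NumPy sketch evaluates a soft normalized-cut value on a tiny volume. It is not the plugin's implementation: it builds a dense N × N weight matrix (only feasible for toy sizes), it uses the standard Gaussian form of the edge weight from the normalized-cuts literature, and the function names `edge_weights` and `soft_ncuts` are hypothetical.

```python
import numpy as np

def edge_weights(image, sigma_i=1.0, sigma_x=4.0, radius=2.0):
    """Dense pairwise weights w(u, v) for a small 3D volume (O(N^2) memory)."""
    grids = np.meshgrid(*[np.arange(s) for s in image.shape], indexing="ij")
    coords = np.stack(grids, axis=-1).reshape(-1, image.ndim).astype(float)
    values = image.reshape(-1).astype(float)
    d_int = (values[:, None] - values[None, :]) ** 2           # squared intensity difference
    d_pos = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)  # squared distance
    w = np.exp(-d_int / sigma_i**2) * np.exp(-d_pos / sigma_x**2)
    w[d_pos >= radius**2] = 0.0                                 # keep only pairs closer than the radius
    return w

def soft_ncuts(probs, w):
    """Soft normalized cut for class memberships probs of shape (K, N)."""
    k = probs.shape[0]
    assoc_within = np.einsum("ku,uv,kv->k", probs, w, probs)    # weight staying inside each class
    assoc_total = np.einsum("ku,uv->k", probs, w)               # weight from each class to all voxels
    return k - (assoc_within / assoc_total).sum()

# Toy volume already remapped to [0, 100]: dark half and bright half.
image = np.zeros((2, 2, 4))
image[..., 2:] = 100.0
w = edge_weights(image, sigma_i=1.0, sigma_x=4.0, radius=2.0)

bright = (image > 50).reshape(-1).astype(float)
good = np.stack([1.0 - bright, bright])        # partition that follows brightness
split = np.zeros(image.shape)
split[1] = 1.0
split = split.reshape(-1)
bad = np.stack([1.0 - split, split])           # partition that ignores brightness

loss_good = soft_ncuts(good, w)
loss_bad = soft_ncuts(bad, w)
```

The partition that follows brightness yields a value close to zero, whereas the partition that cuts through regions of equal brightness is penalized, which is the behavior the loss relies on to group voxels by intensity.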

Hyperparameters.
To achieve proper cell segmentation, it was crucial to prevent the SoftNCuts from simply separating the data in broad regions with differing overall brightness; this was achieved by adjusting the weighting of the reconstruction loss accordingly. In our experiments, we empirically adapted the weights to equalize the contribution of each loss term, making sure we have uniform gradients in the backward pass. This proved effective for training on our provided dataset; however, for different samples, adjusting the reconstruction weight and learning rate using the ranges specified below was necessary for good performance; other parameters were kept constant.
The default number of classes is two, to segment background and cells, but this number may be raised to add more brightness-grouped classes; this could be useful to mitigate the over-segmentation of cells due to brightness "halos" surrounding the nucleus, or to help produce labels for object boundary segmentation.
We found that summing the losses, instead of iteratively updating the encoder first followed by the whole network as suggested, improved stability and consistency of loss convergence during training; in our version the trade-off between accuracy of reconstruction and quality of segmentation is controlled by adjusting the parameters of the weighted sum instead of individual learning rates.
This modified model was usually trained for 50 epochs, unless stated otherwise. We use a batch size of 2, 2 classes, a radius of 2 for the NCuts loss and the MSE reconstruction loss, and use a learning rate between $2 \cdot 10^{-3}$ and $2 \cdot 10^{-5}$ and a reconstruction loss weight between $5 \cdot 10^{-3}$ and $5 \cdot 10^{-1}$, depending on the data.
See Supplemental Figure S2a for an overview of the training process, including loss curves and model outputs.
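As a rough illustration of the single weighted objective described above, the sketch below combines a segmentation term and an MSE reconstruction term. The function names and the SoftNCuts weight of 1.0 are assumptions made for the example; only the reconstruction weight range comes from the text.

```python
import numpy as np

def mse(reconstruction, target):
    """Mean squared error between the reconstructed and the original volume."""
    return float(np.mean((np.asarray(reconstruction) - np.asarray(target)) ** 2))

def total_loss(ncuts_value, reconstruction, target, ncuts_weight=1.0, rec_weight=5e-3):
    """Weighted sum of the two loss terms.

    ncuts_weight=1.0 is an assumption for illustration; rec_weight is taken
    from the lower end of the range quoted in the text (5e-3 to 5e-1).
    """
    return ncuts_weight * ncuts_value + rec_weight * mse(reconstruction, target)

# Toy values: voxels are in [0, 100], so the raw MSE can dwarf the SoftNCuts term.
target = np.zeros((4, 4, 4))
reconstruction = np.full((4, 4, 4), 10.0)      # MSE = 100
loss = total_loss(0.5, reconstruction, target)  # 0.5 + 5e-3 * 100
```

With intensities in the [0; 100] range, an unweighted reconstruction error would dominate the gradient, which is one way to read the need for a small reconstruction weight.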

Segmentation models and algorithms: Supervised semantic segmentation
Model architectures. In order to perform supervised fully-3D cell segmentation, we leveraged computer vision models and losses implemented by the MONAI project, which offers several state-of-the-art architectures. The MONAI API was used as the basis for our napari plugin, and we retained two of the provided models based on their performance on the provided dataset:
• SegResNet (13)
• SwinUNetR (11)
SegResNet is based on the Convolutional Neural Network (CNN) architecture, whereas SwinUNetR uses a transformer-based encoder.
Several relevant segmentation losses are made available for training:
• Dice loss (20)
• Dice-Cross Entropy loss
• Generalized Dice loss (21)
• Tversky loss (22)
The SegResNet and SwinUNetR models shown here were trained using the Generalized Dice loss for 50 epochs, with a learning rate of $1 \cdot 10^{-3}$, batch size of 5 (SwinUNetR) or 10 (SegResNet), and data augmentation enabled. Unless stated otherwise, a train/test split of 80/20% was used.
The outputs were then passed through a threshold to discard low-confidence predictions; this was estimated using the training set to find the threshold that maximized the Dice metric between predictions and ground truth. Using the training set for this process ensures that we do not overfit the evaluation set on which we calculate the metrics. See the following notebook for the corresponding code: https://github.com/C-Achard/cellseg3d-figures/blob/main/thresholds_opti/find_best_thresholds.ipynb. The same process was repeated for Cellpose (cell probability threshold) and StarDist (non-maximum suppression (NMS) and cell probability thresholds) to ensure fair comparisons; see "Model comparison" below and Supplemental Figure S1a-d for tuning results. Inference outputs are processed a posteriori to obtain instance labels, as detailed below.
Instance segmentation. Several methods for instance segmentation are available in the plugin: the connected components and watershed algorithms (scikit-image), and the Voronoi-Otsu labeling method (clEsperanto). The latter combines an Otsu threshold and a Voronoi tessellation to perform instance segmentation, and more readily avoids fusing clumped cells than the former two, provided that the objects are convex, which is the case in the present task.
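The threshold search described above can be sketched as a simple sweep over candidate values. This is a minimal illustration, not the code from the linked notebook: the function names and the candidate grid are assumptions.

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient between two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

def best_threshold(probabilities, truth, candidates=None):
    """Return the candidate threshold maximizing Dice, and the Dice it reaches."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)  # hypothetical grid
    scores = [dice(probabilities >= t, truth) for t in candidates]
    best = int(np.argmax(scores))                 # first maximum if several tie
    return float(candidates[best]), float(scores[best])

# Toy "training" example: two background voxels, two foreground voxels.
probabilities = np.array([0.12, 0.37, 0.63, 0.88])
truth = np.array([0, 0, 1, 1])
threshold, score = best_threshold(probabilities, truth)
```

Running the sweep on training predictions only, as the text describes, keeps the evaluation data untouched when the threshold is later applied to it.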
The Voronoi-Otsu method was therefore used to perform instance segmentation in the benchmarks, with its two parameters, spatial sigma and outline sigma, tuned to fit the training data when relevant, and manually selected otherwise.

Model Comparisons
StarDist was retrained using the provided example notebook for 3D, using default parameters. For the model we refer to as "Default", we used a patch size of 8x64x64, a grid of (2,1,1), a batch size of 2 and 96 rays, as computed automatically in the provided example code for StarDist. For the "Tuned" version (referred to simply as "StarDist"), we changed the patch size to 64x64x64 and the grid to (1,1,1).
Cellpose was retrained without pretrained weights using default parameters, except for the mean diameter, which was set to 3.3 according to the provided object size estimation utility in the GUI (and visually confirmed). We investigated pretrained models provided by Cellpose, as well as attempting transfer learning, but no pretrained model was found to be suitable for our data. Despite Cellpose automatically resizing the data to match its training data, neither the automated estimate of object size nor fixing the object size value manually helped in improving performance; therefore we retrained those models with our data. "Default" refers to automatically estimated parameters for StarDist (NMS and probability threshold, estimated on the training data), and a cell probability threshold of 0 with resampling enabled for Cellpose. For both models, inference hyperparameters (NMS and cell probability threshold for StarDist, and cell probability threshold and resampling for Cellpose) were tuned on the training set to maximize the Dice metric with GT labels, exactly like our models. After tuning, we found that Cellpose achieved best performance with a cell probability threshold of −9 and resampling enabled (see Supplemental Figure S1a and https://github.com/C-Achard/cellseg3d-figures/blob/main/thresholds_opti/cellpose_find_thresh.ipynb) across all data subsets. For StarDist, best parameters varied across subsets (see Supplemental Figure S1d and https://github.com/C-Achard/cellseg3d-figures/blob/main/thresholds_opti/stardist_find_thresh.ipynb); however, as this did not affect performance significantly, we used the parameters estimated automatically as part of the training.
Models provided in the plugin (SwinUNetR, SegResNet and WNet3D), which we refer to as "pretrained", are trained on the entire dataset to obtain best possible performance, using all images (and labels only for the supervised models). The WNet3D model was used in Figure 1f (WNet3D - pretrained), g (WNet3D pretrained and SwinUNetR), and Figure 2b (WNet3D). Hyperparameters used are as mentioned above, except for the number of epochs, which was selected based on validation curves.
For Figure 1b, we trained each model on a subset of the dataset (somatosensory volumes), chunked into 64-pixel cubes using an 80/20% training/validation split, and estimated the best threshold on the same training data. Next, we used the remaining held-out data (visual volume) to evaluate performance. Code for threshold optimisation may be found at https://github.com/C-Achard/cellseg3d-figures/blob/main/thresholds_opti/find_best_thresholds.ipynb, and code to create Figure 1b can be found at https://c-achard.github.io/cellseg3d-figures/fig1-b-supervised/supervised_benchmark.html#.

Label efficiency comparison
To assess how many labeled cells are required to reach a certain performance, we trained StarDist, Cellpose, SegResNet, SwinUNetR and WNet3D on three distinct subsets of the data, each time holding out one full volume of the full dataset for evaluation, fragmenting the remaining volumes and labels into 64-pixel cubes, and training on distinct train/validation splits on the remaining data. We used 10%, 20%, 60% and 80% splits in order to assess how much labeled data is necessary for the supervised models, and whether they show variability based on the data used for training. Of note, the evaluation data remained the same for all percentages in a given data subset, ensuring a consistent performance comparison. We used 50 epochs for all runs, and no early stopping or hyperparameter tuning was performed based on the validation performance during training. Instead, we reused the best hyperparameters found for Figure 1b.
For example, the first subset consists of all five somatosensory cortex volumes as training/validation data, and the visual cortex volume is held out for evaluation. For Cellpose two conditions are shown, default (cell probability threshold of 0) and fine-tuned (threshold of -9), which improved performance.
To avoid training on data with artifacts present in the visual cortex volume, WNet3D was only trained on the first of the subsets. Instead, the model was trained on a percentage of the first subset using three different seeds. We also avoid evaluating on artifacts in the visual volume, unless mentioned otherwise, as the model is not meant to handle these regions.
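The chunking and percentage-based splitting can be illustrated with the sketch below. It is written under assumptions that the text does not specify: incomplete border cubes are dropped, the training fraction is drawn by a seeded random permutation, and both function names are hypothetical.

```python
import numpy as np

def chunk_into_cubes(volume, size=64):
    """Split a 3D volume into non-overlapping cubes of edge `size`.

    Incomplete cubes at the borders are dropped (an assumption for this sketch).
    """
    nz, ny, nx = (s // size for s in volume.shape)
    return [
        volume[i * size:(i + 1) * size, j * size:(j + 1) * size, k * size:(k + 1) * size]
        for i in range(nz) for j in range(ny) for k in range(nx)
    ]

def training_subset(n_items, fraction, seed=0):
    """Indices of a seeded random subset holding `fraction` of the items."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = max(1, int(round(fraction * n_items)))
    return np.sort(order[:n_train])

# Toy volume of 64 x 128 x 130 voxels gives 1 * 2 * 2 = 4 complete cubes.
volume = np.zeros((64, 128, 130), dtype=np.uint8)
cubes = chunk_into_cubes(volume, size=64)

# Sizes of the 10%, 20%, 60% and 80% subsets for 20 hypothetical cubes.
subset_sizes = [len(training_subset(20, f)) for f in (0.1, 0.2, 0.6, 0.8)]
```

Changing `seed` while keeping the fraction fixed gives different training subsets of the same size, which is one way to probe the variability mentioned in the text.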
Instance labels were generated as stated above, and then converted back to semantic labels to compute the F1-Score, see Performance Evaluation section below.

WNet3D-based retraining of supervised models
To assess whether WNet3D can generalize to unseen data when trained on a specific brain volume, we trained a WNet3D from scratch using volumes cropped from a different mesoSPIM-acquired whole brain sample, labeled with c-FOS, imaged at 561 nm with a pixel size of 5.26 × 5.26 µm in x and y, and a step size in z of 5 µm (see Additional Datasets).
This model was then used to generate labels for our provided dataset. A SwinUNetR model was then trained using these WNet3D generated labels, and compared to the performance of the pretrained model we provide in our napari plugin.

Performance evaluation
The models were evaluated using standard segmentation metrics (23), namely F1-Score and intersection over union (IoU). The equations for these evaluation metrics are shown below, with TP, FP, and FN representing true positives (TP), false positives (FP), and false negatives (FN) respectively. The higher the F1 (precision and recall), the better the model performance.
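$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
We used the evaluation utilities provided by StarDist (2), specifically the code from here.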
To assess performance for semantic segmentation in Figure 1f we report the F1-Score without any IoU threshold, which is then equivalent to the Dice score computed on the semantic labels, given the Boolean nature of the data.
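The equivalence between the F1-Score and the Dice score on Boolean labels can be checked numerically. The sketch below is only an illustration; the function name is hypothetical and it assumes both masks contain at least one foreground voxel.

```python
import numpy as np

def voxel_scores(pred, truth):
    """Voxel-wise precision, recall, F1 and Dice for two boolean masks.

    Assumes neither mask is empty, so no division by zero is handled here.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    dice = 2 * tp / (pred.sum() + truth.sum())
    return float(precision), float(recall), float(f1), float(dice)

# Toy masks: TP = 2, FP = 1, FN = 1.
pred = np.array([1, 1, 1, 0, 0, 0])
truth = np.array([0, 1, 1, 1, 0, 0])
precision, recall, f1, dice = voxel_scores(pred, truth)
```

Both expressions reduce to 2TP / (2TP + FP + FN), which is why the two scores coincide for binary masks.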
The metric to assess instance segmentation accuracy can be computed as a function of several overlap thresholds; true positives are pairings of model predictions and ground-truth labels having an intersection over union (IoU) value greater than the specified threshold, with automated matching to prevent additional instances from being assigned to the same ground truth or model-predicted instance of a label. We report the F1-Score over a range of IoU thresholds between 0.1 and 0.9 (step size of 0.1) in Figures 1b, g and 2b.
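The thresholded matching can be sketched as follows. This is not the StarDist evaluation code: it uses a greedy one-to-one matching (highest IoU first) as a stand-in for the automated matching, and the function names are hypothetical. For thresholds of 0.5 or above a ground-truth object can overlap at most one prediction that strongly, so greedy matching is unambiguous; below that it is only an approximation.

```python
import numpy as np

def iou_matrix(truth, pred):
    """IoU between every ground-truth instance (rows) and predicted instance (columns)."""
    t_ids = [i for i in np.unique(truth) if i != 0]
    p_ids = [i for i in np.unique(pred) if i != 0]
    iou = np.zeros((len(t_ids), len(p_ids)))
    for a, t in enumerate(t_ids):
        t_mask = truth == t
        for b, p in enumerate(p_ids):
            p_mask = pred == p
            inter = np.logical_and(t_mask, p_mask).sum()
            if inter:
                iou[a, b] = inter / np.logical_or(t_mask, p_mask).sum()
    return iou

def f1_at_threshold(iou, threshold):
    """F1 with greedy one-to-one matching of pairs whose IoU exceeds the threshold."""
    n_true, n_pred = iou.shape
    pairs = sorted(
        ((iou[a, b], a, b) for a in range(n_true) for b in range(n_pred) if iou[a, b] > threshold),
        reverse=True,
    )
    used_true, used_pred, tp = set(), set(), 0
    for _, a, b in pairs:
        if a not in used_true and b not in used_pred:
            used_true.add(a)
            used_pred.add(b)
            tp += 1
    fp, fn = n_pred - tp, n_true - tp
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 1.0

# Toy 2D labels: one exact match, one half-overlapping match, one spurious prediction.
truth = np.zeros((4, 8), dtype=int)
truth[0:2, 0:4] = 1
truth[2:4, 4:8] = 2
pred = np.zeros((4, 8), dtype=int)
pred[0:2, 0:4] = 1      # IoU 1.0 with ground-truth object 1
pred[2:4, 6:8] = 2      # IoU 0.5 with ground-truth object 2
pred[0:2, 6:8] = 3      # no ground-truth counterpart

iou = iou_matrix(truth, pred)
f1_curve = {t: f1_at_threshold(iou, t) for t in (0.1, 0.5, 0.9)}
```

In the toy case the half-overlapping prediction counts as a true positive at a threshold of 0.1 but not at 0.5, which is how the F1 vs. IoU threshold curve separates coarse detection from accurate delineation.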
For instance segmentation, we take the model's probability outputs and apply an intensity threshold to get semantic predictions; this threshold ultimately affects the reported metrics, therefore we discuss the procedure here. We set these thresholds based on the training set. Specifically, to determine the optimal threshold for evaluating instance segmentation on a training fold, pairs of predictions and corresponding labels from the training set were taken. For each pair, the threshold that maximized the F1-Score at IoU >= 0, which is equivalent to the Dice coefficient, was computed. This process was repeated for all images within the training fold. The resulting optimal thresholds provided the threshold used when evaluating that particular fold. The code for each use case can be found at https://github.com/C-Achard/cellseg3d-figures/blob/main/thresholds_opti/find_best_thresholds.ipynb. For the mesoSPIM data this threshold was empirically found to be 0.4 (SwinUNetR), 0.3 (SegResNet) and 0.6 (WNet3D), in Figure 1f, g. For Figure 2b, the thresholds for WNet3D were: 0.45 for Mouse Skull, 0.55 for both the Platynereis datasets. Then we convert these to instance labels using the Voronoi-Otsu algorithm, whose parameters were chosen based on the F1-Score metric between ground truth labels and model-generated labels on the training set, as described in the Model Section above describing instance segmentation. If a model is not trained, as for example in Figure 2b, we set these parameters manually, by eye. To reproduce the F1-Scores as shown, we used the following values:

[Table: Voronoi-Otsu parameters per dataset (columns include Dataset and Outline sigma), used in Figure 2b, c for instance segmentation with Voronoi-Otsu.]

CellSeg3D napari plugin workflow
To facilitate the use of our models, we provide a napari plugin where users can easily annotate data, train models, run inference, and perform various post-processing steps. Starting from raw data, users can quickly crop volumes into regions of interest, and create training data from those. Users may manually annotate the data in napari using our labeling interface, which provides additional features such as orthogonal projections to better view the ongoing labeling process, as well as keeping track of time spent labeling each slice, or alternatively train a self-supervised model to automatically perform a first iteration of the segmentation and labeling, without annotation. Users can also try pretrained models, including the self-supervised one, to generate labels which can then be corrected using the same labeling interface. Supervised or self-supervised models can then be trained using the generated data. Full documentation for the plugin can be found on our GitHub website.
In the case of supervised learning, the volumes (random patches or whole images) are split into training and validation sets according to a user-set proportion, using 80%/20% by default. Input images are normalized by setting all values below the 1st percentile and above the 99th percentile to the corresponding percentile value. Data augmentation can be used; by default a random shift of the intensity, elastic and affine deformations, flipping and rotation are used.
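The percentile normalization described above amounts to clipping intensities to the 1st and 99th percentiles. A minimal NumPy sketch, with a hypothetical function name, could look like this:

```python
import numpy as np

def clip_to_percentiles(image, low=1.0, high=99.0):
    """Clip intensities to the [low, high] percentile range of the image."""
    lower, upper = np.percentile(image, [low, high])
    return np.clip(image, lower, upper)

# Toy image with intensities 0..999; the extremes are pulled to the percentile values.
image = np.arange(1000, dtype=float).reshape(10, 10, 10)
clipped = clip_to_percentiles(image)
```

Clipping in this way limits the influence of a few very bright or very dark voxels on the intensity range seen by the network.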
For the self-supervised model, images are remapped to values in the [0, 100] range to accommodate the intensity sigma of the SoftNCuts loss. No percentile normalization is used, and data augmentation is restricted to flipping and rotating in this case.
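The remapping to the [0, 100] range can be sketched as a linear min-max rescaling. This is an assumption about the form of the remapping; the function name and the handling of constant volumes are hypothetical.

```python
import numpy as np


def remap_to_range(volume, new_min=0.0, new_max=100.0):
    """Linearly rescale intensities so they span [new_min, new_max]."""
    volume = np.asarray(volume, dtype=np.float32)
    v_min, v_max = float(volume.min()), float(volume.max())
    if v_max == v_min:
        # A constant volume carries no contrast; map it to the lower bound
        return np.full_like(volume, new_min)
    scaled = (volume - v_min) / (v_max - v_min)
    return scaled * (new_max - new_min) + new_min
```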
Deterministic training may also be enabled for all models, and the random generation seed can be set. Unless specified otherwise, models were trained on cropped cubes with 64-pixel edges, with both data augmentation and deterministic training enabled.
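To illustrate how a fixed seed makes patch sampling reproducible, the sketch below draws a random cube with 64-pixel edges from a larger volume. The helper `random_cube_crop` is hypothetical and only mirrors the idea of seeded cropping; it is not the plugin's training code.

```python
import numpy as np


def random_cube_crop(volume, edge=64, rng=None):
    """Return a random cubic crop and its start indices along each axis."""
    rng = np.random.default_rng() if rng is None else rng
    if any(s < edge for s in volume.shape):
        raise ValueError("volume is smaller than the requested cube")
    # integers() excludes the upper bound, so the crop always fits
    starts = [int(rng.integers(0, s - edge + 1)) for s in volume.shape]
    slices = tuple(slice(st, st + edge) for st in starts)
    return volume[slices], tuple(starts)
```

Passing generators created with the same seed, e.g. `np.random.default_rng(42)`, yields identical crops across runs.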
We additionally provide a Colab notebook to train our self-supervised model using the same procedure described above. The pretrained weights for all our models are also made available through the HuggingFace platform (and automatically downloaded by the plugin or in Colab), so that users without the recommended hardware can still easily train or try our models. All code is open source and available on GitHub.

Dataset Card
A. Motivation. 1. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
The contributions of our dataset to the vision and cell biology communities are twofold: 1) We release a 3D cell segmentation dataset of 2632 TPH2-positive cells, based on data from Voigt et al. (1). 2) It is entirely human-annotated. The dataset is one of the first cell segmentation datasets to date created in 3D.
2. Who created the dataset (which team, research group) and on behalf of which entity (company, institution, organization)?
The human-annotated dataset was created by the Mathis Lab of Adaptive Intelligence of EPFL, who are co-authors of this work. The raw brain data is publicly available at https://idr.openmicroscopy.org/webclient/?show=project-854.
3. Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
This project was funded, in part, by the Wyss Center via a grant to PI Mathis.

Any other comments?
No.
Composition. 1. What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
The instances in our dataset represent 3D volumetric segments, extracted from mesoSPIM scans of mouse brains. Each instance is essentially a three-dimensional image that has been carefully hand-cropped, mainly from the somatosensory and visual cortex of the scanned data. In each of these 3D volumes, TPH2 cells are identified and labeled.

2. How many instances are there in total (of each type, if appropriate)?
There are six 3D volumetric segments that contain a total of 2632 TPH2-positive cells identified and labeled in 3D.
3. Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
The dataset provided is a subset of the available whole-brain sample, selected from larger raw volumetric data obtained from mesoSPIM scans of mouse brains. This selection consists of 3D volumes cropped mainly from the somatosensory and visual cortex regions, where the TPH2 cells are labeled meticulously. The broader dataset from which these instances were extracted represents scans of whole mouse brains. However, due to the immense volume of the entire scanned data, creating a manageable and focused dataset was key for addressing specific research questions and keeping computation tractable.
4. What data does each instance consist of? "Raw" data (e.g., unprocessed text or images) or features? In either case, please provide a description.
Each instance in the dataset consists of "raw" 3D volumetric data derived from mesoSPIM scans of mouse brains, specifically focusing on the somatosensory cortex and visual cortex regions. The instances are essentially unprocessed and maintain the integrity of the original scanned data.
5. Is there a label or target associated with each instance? If so, please provide a description.
Yes, each instance in the dataset is human-annotated with masks. There are no categories or text associated with the masks.
6. Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.
In our dataset, there is no information missing from individual instances.
7. Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? If so, please describe how these relationships are made explicit.
Not applicable.
8. Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
While we have taken extensive measures to ensure the accuracy and quality of the dataset, it is challenging to rule out the presence of minor errors or noise, especially considering the complex nature of the 3D cell segmentation task. Nonetheless, we believe that any such inconsistencies do not compromise the overall reliability and utility of the dataset.
9. Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
The dataset is self-contained.
10. Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.
No.
11. Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
No. The dataset is composed solely of scientific, non-human biological data.
12. Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.
Not applicable.
13. Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.
Not applicable.
14. Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.
No.

Any other comments?
No.
Collection Process. 1. How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
The data associated with each instance was acquired through mesoSPIM scans of mouse brains, providing raw, directly observable 3D volumetric data. The data was not reported by subjects or indirectly inferred or derived from other data; it was directly observed and recorded from the scientific imaging process. All collected volumes were annotated by expert human annotators. The quality of the annotations was validated by an external expert not involved in the annotation process.
2. What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?
The raw data is open source and provided by the Image Data Resource (IDR).
3. If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
Our sampling strategy was designed to select volumes where TPH2 cells are clearly discernible. We aimed to include a varied range of volumes, from densely packed with TPH2 cells to more sparsely populated ones, ensuring a good representation of various brain areas. Another important factor was the manageability of the volumes from an annotation perspective, to facilitate accurate and efficient labeling.
4. Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
The released masks were created by research personnel of the Mathis Lab of Adaptive Intelligence, EPFL.

5. Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
The raw data was downloaded from the Image Data Resource (IDR) website. The labels were created between June and October 2021.
If the dataset does not relate to people, you may skip the remaining questions in this section.
Preprocessing / Cleaning / Labeling. 1. Was any preprocessing / cleaning / labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section.
Yes, extensive preprocessing and labeling were conducted to ensure the usability and reliability of the dataset. The initial step involved examination of the raw 3D volumetric data, where we ruled out the presence of anomalies or artefacts. During this phase, we ensured the visibility of TPH2-positive cells within the volumetric segments. We proceeded to label the TPH2-positive cells through a well-defined annotation process, where each cell within the selected volumes was identified and marked by our experts. At the end of the annotation process, the quality of the work was verified by a human expert not involved in the annotation work.
2. Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the "raw" data.
The raw data is open source and available on the Image Data Resource (IDR) website.
3. Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.
Yes. We used the napari interactive viewer for multidimensional images in Python together with our plugin.
Uses. 1. Has the dataset been used for any tasks already? If so, please provide a description.
The dataset was used to train segmentation models.
2. Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
Yes: the repository hosting the model weights trained on our data, as well as the repository for our napari plugin for 3D cell segmentation.

3. What (other) tasks could the dataset be used for?
We intend the dataset to be used to train 3D cell segmentation models. However, we invite the research community to gather additional annotations for mesoSPIM-acquired datasets via the tools we contribute in the present publication.
4. Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?
Not applicable.
5. Are there tasks for which the dataset should not be used? If so, please provide a description.
Full terms of use for the dataset can be found at https://github.com/AdaptiveMotorControlLab/CellSeg3D, but the project is made open source under an MIT license.
Distribution. 1. Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
2. How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

3. When will the dataset be distributed?
The dataset is released on Zenodo at https://zenodo.org/records/11095111 alongside the publication of this paper.
4. Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

Figure 1. Performance of 3D Semantic and Instance Segmentation Models. a: Raw mesoSPIM whole-brain sample, volumes and corresponding ground truth labels from somatosensory (S1) and visual (V1) cortical regions. b: Evaluation of instance segmentation performance for several supervised models over three data subsets. F1-Score is computed from the Intersection over Union (IoU) with ground truth labels, then averaged. Error bars represent 50% Confidence Intervals (CIs). c: View of 3D instance labels from supervised models, as noted, for the visual cortex volume evaluated in b. d: Illustration of our WNet3D architecture showcasing the dual 3D U-Net structure with modifications (see Methods). e: Example 3D instance labels from WNet3D; top row is S1, bottom is V1, with artifacts removed. f: Semantic segmentation performance: comparison of model efficiency, indicating the volume of training data required to achieve a given performance level. Each supervised model was trained with an increasing percentage of training data (10, 20, 60 or 80%, left to right within each model grouping); F1-Score with an IoU >= 0 was computed on unseen test data, over three data subsets for each training/evaluation split. Our self-supervised model (WNet3D) is also trained on a subset of the training set of images, but always without human labels. Far right: performance of the pretrained WNet3D available in the plugin, with and without removing artifacts in the image. See Methods for details. The central box represents the interquartile range (IQR) of values with the median as a horizontal line; the upper and lower limits are the upper and lower quartiles. Whiskers extend to data points within 1.5 IQR of the quartiles. g: Instance segmentation performance comparison of Swin-UNetR and WNet3D (pretrained, see Methods), evaluated on unseen data across 3 data subsets, compared with a Swin-UNetR model trained using labels from the WNet3D self-supervised model. Here, WNet3D was trained on separate data, producing semantic labels that were then used to train a supervised Swin-UNetR model, still on held-out data. This supervised model was evaluated as the other models, on 3 held-out images from our dataset, unseen during training. Error bars indicate 50% CIs.

Figure 2. CellSeg3D napari plugin pipeline, training, and example outputs. a: Workflow diagram depicting the segmentation pipeline: either raw data can be used directly (self-supervised) or labeled and used for training, and then other data can be used for model inference. Each stream concludes with post hoc inspection and refinement, if needed (post-processing analysis and/or refining the model). b: Instance segmentation performance (zero-shot) of the pretrained WNet3D on select datasets featured in c, shown as F1-Score vs IoU with ground truth labels. c: Qualitative examples with WNet3D for semantic and instance segmentation. d: Qualitative example of WNet3D-generated prediction (thresholded) and labels on a crop from a whole-brain sample, with c-FOS-labeled neurons, acquired with a mesoSPIM.

Figure S2. Training WNet3D. a: Overview of the training process of WNet3D. The loss for the encoder U_enc is the SoftNCuts, whereas the reconstruction loss for the decoder U_dec is MSE. The weighted sum of losses is calculated as indicated in Methods. For select epochs, input volumes are shown, with outputs from encoder U_enc above, and outputs from decoder U_dec below.

Table 1. Dataset ground-truth cell count per volume.

Table 2. Parameters used in Figure 2b, c for instance segmentation with Voronoi-Otsu.