Abstract
The primary step in tissue cytometry is the automated distinction of individual cells (segmentation). Since cell borders are seldom labeled, researchers generally segment cells by their nuclei. While effective tools have been developed for segmenting nuclei in two dimensions, segmentation of nuclei in three-dimensional images remains a challenging task for which few tools have been developed. The lack of effective methods for three-dimensional segmentation represents a bottleneck in the realization of the potential of tissue cytometry, particularly as methods of tissue clearing present researchers with the opportunity to characterize entire organs. Methods based upon deep learning have shown enormous promise, but their implementation is hampered by the need for large amounts of manually annotated training data. Here we describe 3D Nuclei Instance Segmentation Network (NISNet3D), a deep learning-based approach in which training is accomplished using synthetic data, profoundly reducing the effort required for network training. We compare results obtained from NISNet3D with results obtained from eight existing techniques.
1 Introduction
Over the past ten years, various technological developments have provided biologists with the ability to collect microscopy images of enormous scale and complexity. Methods of tissue clearing combined with automated confocal or lightsheet microscopes have enabled three-dimensional imaging of entire organs or even entire organisms at subcellular resolution. Novel methods of multiplexing have been developed so that researchers can now simultaneously characterize 50 or more targets in the same tissue. However, as biologists turn to the task of analyzing these extraordinary volumes (tissue cytometry), they quickly discover that the methods of automated image analysis necessary for extracting quantitative data from images of this scale are frequently inadequate to the task. In particular, while effective methods for distinguishing (segmenting) individual cells are available for analyses of two-dimensional images, corresponding methods for segmenting cells in three-dimensional volumes are generally lacking. The problem of three-dimensional image segmentation thus represents a bottleneck in the full realization of tissue cytometry as a tool in biological microscopy.
There are two main categories of segmentation approaches: traditional image processing and computer vision techniques, and techniques based on machine learning, in particular deep learning [1, 2]. Traditional techniques (e.g. watershed, thresholding, edge detection, and morphological operations) can be effective, but generally require careful optimization of processing parameters, so that settings are seldom robust, even across images to be pooled or compared. Segmentation techniques based on deep learning have shown great promise, in some cases providing accurate and robust results across a range of image types [3–7]. However, their utility is limited by the large amounts of manually annotated (ground truth) data needed for training, validation, and testing. Annotation is a labor-intensive and time-consuming process, especially for 3D volumes. While tools have been developed to facilitate the laborious process of manual annotation [8–11], the generation of training data remains a major obstacle to implementing segmentation approaches based upon deep learning.
To some degree, the problem of generating sufficient training data can be alleviated using data augmentation, a process in which existing manually-annotated training data is supplemented with synthetic data generated from modifications of the manually annotated data [12–15]. An alternative approach is to use synthetic data for training [14, 16, 17]. In [18], 2D distributions of fluorescent markers are generated using GANs, and 3D microscopy volumes are generated by stacking the 2D synthetic image slices. Similarly, in [19], a 3D GAN was used to generate fully 3D volumetric cell masks with variability matching real volumes. We have shown that Generative Adversarial Networks (GANs) can be used to generate synthetic microscopy volumes that can be used for training [7, 20–22], and incorporated this approach into the DeepSynth segmentation system [6].
Here we describe the 3D Nuclei Instance Segmentation Network (NISNet3D), a deep learning-based segmentation technique that uses synthetic volumes, manually annotated volumes, or a combination of the two. NISNet3D is a true 3D segmentation method based on a 3D Convolutional Neural Network (CNN). CNNs have had great success in solving problems such as object classification, detection, and segmentation [23, 24], and encoder-decoder architectures have been widely used for biomedical image analysis, including volumetric segmentation [25–27] and medical image registration [28]. CNNs have also been developed for nuclei segmentation [3, 6, 15, 29–33], but many are designed for segmentation of two-dimensional images and either cannot be used for segmentation of 3D volumes [29, 32, 34] or involve a process in which objects segmented in two-dimensional images are fused together to form 3D objects [3, 14, 30], a process that fails to represent the 3D anisotropy of microscope images. In contrast, NISNet3D is a true three-dimensional segmentation system that operates directly on 3D volumes, using 3D CNNs to exploit the 3D information in a microscopy volume and thereby generating more accurate segmentations of nuclei in 3D image volumes.
We demonstrate that NISNet3D can accurately segment individual nuclei using five different types of microscopy volumes. The qualitative and quantitative evaluation results show that NISNet3D achieves promising results on nuclei instance segmentation with and without the use of manual ground truth annotations when compared to other approaches.
We summarize the contributions of this paper as follows:
- We designed a fully convolutional neural network with residual concatenation and a self-attention mechanism for nuclei instance segmentation in 3D volumes, and proposed new 3D markers for nuclei instance segmentation using 3D marker-controlled watershed.
- NISNet3D can use annotated volumes and/or synthetic microscopy volumes for training and can analyze large 3D microscopy volumes using a divide-and-conquer inference strategy.
- We present three error/difference visualization methods for visualizing segmentation errors in large 3D microscopy volumes without the need for ground truth annotations.
- We conducted experiments on a variety of microscopy volumes using multiple evaluation metrics, and compared NISNet3D with other deep learning image segmentation methods to demonstrate the effectiveness of our approach.
2 PROPOSED METHOD
The block diagram of our proposed nuclei instance segmentation system is shown in Figure 1, which includes: (1) 3D microscopy image synthesis and annotated data, (2) NISNet3D training and inference, and (3) 3D nuclei instance segmentation.
2.1 Notation and Overview
In this paper, we denote I as a 3D image volume of size X × Y × Z voxels, and I(x,y,z) as a voxel having coordinate (x, y, z) in volume I. We use superscripts to indicate the types of image volumes. For example, Iorig, Ibi and Isyn denote the original microscopy volumes, binary segmentation masks, and synthetic microscopy volumes, respectively. Ilabel is the gray-scale label volume for Ibi in which different nuclei are marked with unique voxel intensities; this is done to distinguish each nucleus instance. In addition, Itarget is a four-channel volume of size 4 × X × Y × Z, where the first three channels form the 3D vector field volume Ivec that contains the nuclei shape and size information, and the last channel is the 3D binary mask Ibi. To represent the 3D nuclei centroid and boundary information in Ivec, each nucleus voxel stores a 3D vector that points to its nearest nucleus centroid. The details of 3D vector field generation will be discussed in Section 2.3.1. Imask and Îvec are the outputs of NISNet3D, where Imask is the 3D binary segmentation mask and Îvec is the estimated 3D vector field volume. Note that Îvec(x), Îvec(y), and Îvec(z) represent the x, y, and z channels of Îvec, respectively. Îvec is then decoded to generate the gradient volume IG in which nuclei boundaries are highlighted. Imark denotes the high-quality markers generated from IG. To separate touching nuclei, we use 3D marker-controlled watershed segmentation with the pair Imark and Imask, which will be discussed in Section 2.3.2. The post-processing includes small object removal and nuclei color coding for visualization. Iseg is the final color-coded segmentation volume. The overview of our proposed approach is shown in Figure 1.
2.2 3D Microscopy Image Synthesis
Deep learning methods generally require large numbers of training samples to achieve accurate results. However, manually annotating ground truth is a tedious task that is impractical in many situations, especially for 3D microscopy volumes. To address this issue, NISNet3D can use synthetic 3D microscopy volumes for training the segmentation network. It must be emphasized that NISNet3D can use synthetic volumes, annotated real volumes, or a combination of the two. We demonstrate this in the experiments.
To generate synthetic volumes, we first generate synthetic segmentation masks, which serve as the ground truth masks, and then translate the synthetic segmentation masks into synthetic microscopy volumes using an unsupervised image-to-image translation model known as SpCycleGAN [7].
2.2.1 Synthetic Segmentation Mask Generation
We first generate a synthetic binary nuclei segmentation mask Ibi by iteratively adding candidate binary nuclei Inuc to an empty 3D volume of size 128 × 128 × 128. To synthesize ellipsoidal nuclei, the candidate nuclei are modeled as 3D binary ellipsoids with random size and orientation parameterized by a and θ. We iteratively generate N candidate nuclei of different sizes and orientations, where the ranges of these parameters are randomly selected based on observation of nuclei characteristics (e.g. nuclei size, shape, and orientation) in actual microscopy volumes.
The nuclei size is parameterized by the semi-axes lengths a = [ax, ay, az]T of an ellipsoid, the orientation is defined by rotation angles θ = [θx, θy, θz]T, and the location is represented by a translation vector t = [tx, ty, tz]T from the origin. Equation 1 defines the kth candidate nucleus with voxel intensity k ∈ {1, …, N}; the voxels are assigned intensity values to differentiate nuclei from each other:

$$I^{nuc}_k(x,y,z)=\begin{cases}k, & \text{if } \dfrac{x'^2}{a_x^2}+\dfrac{y'^2}{a_y^2}+\dfrac{z'^2}{a_z^2}\le 1\\[4pt] 0, & \text{otherwise,}\end{cases} \tag{1}$$

where Equation 2 defines the translated and rotated nucleus coordinates:

$$[x',\,y',\,z']^T = R_x(\theta_x)\,R_y(\theta_y)\,R_z(\theta_z)\,[x-t_x,\;y-t_y,\;z-t_z]^T, \tag{2}$$

and Rx(θx), Ry(θy) and Rz(θz) are rotation matrices around the x, y, and z axes with angles θx, θy, and θz, respectively. Note that the maximum overlap between two candidate nuclei cannot be more than $t_o$ voxels.
Many of the nuclei are not strictly ellipsoidal. Instead, they look more like deformed ellipsoids (see Figure 2 (left)). To model these non-ellipsoidal nuclei, we use elastic transformation [35] to deform the 3D binary mask of the nuclei. Suppose the 3D binary volume to be deformed is denoted as Ibi and is of size X × Y × Z. We define the amount of deformation using what we will call a displacement vector field. We define the "smooth displacement vector field" Ismooth as a matrix of size 3 × X × Y × Z representing three 3D volumes Ismooth(x), Ismooth(y), Ismooth(z), each of size X × Y × Z. In Ismooth(x), each voxel indicates the distance the voxel needs to be shifted along the x-axis; similarly, each voxel in Ismooth(y) and Ismooth(z) indicates the distance the voxel needs to be shifted along the y- and z-axis, respectively. Next we describe how to construct Ismooth.
Next we define the "coarse displacement vector field" Icoarse as a matrix of size 3 × d × d × d, where d controls the amount of deformation of the nuclei. Each entry in Icoarse is an independent, normally distributed random variable 𝒩(0, σ²). The "smooth displacement vector field" Ismooth is obtained from Icoarse using spline interpolation [36] or bilinear interpolation [35]. The deformation control d is used to define the size of Icoarse: a larger d results in more deformation in Ismooth, whereas a smaller d results in less deformation. In our experiments, we used spline interpolation and d ∈ {4, 5, 10}. Examples of deformed ellipsoids are shown in Figure 2 (right).
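To make the deformation step concrete, the sketch below shows one way the coarse-to-smooth displacement field could be implemented with SciPy. The function name, the default σ, and the use of `zoom` for the spline upsampling are assumptions of this illustration, not the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import zoom, map_coordinates

def elastic_deform(binary_vol, d=8, sigma=3.0, seed=None):
    """Deform a 3D binary mask with a smooth random displacement field.

    A coarse 3 x d x d x d field of N(0, sigma^2) offsets is upsampled to
    the full volume size with spline interpolation, then used to warp the
    input volume (parameter values here are illustrative).
    """
    rng = np.random.default_rng(seed)
    X, Y, Z = binary_vol.shape
    smooth = np.empty((3, X, Y, Z), dtype=np.float32)
    for axis in range(3):
        coarse = rng.normal(0.0, sigma, size=(d, d, d))
        # Spline-interpolate the coarse field up to the volume resolution.
        smooth[axis] = zoom(coarse, (X / d, Y / d, Z / d), order=3)
    # Shift every voxel by its displacement vector and resample the mask.
    grid = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
    coords = [g + s for g, s in zip(grid, smooth)]
    deformed = map_coordinates(binary_vol.astype(np.float32), coords, order=1)
    return (deformed > 0.5).astype(binary_vol.dtype)
```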
2.2.2 Synthetic Microscopy Volume Generation
We use the unpaired image-to-image translation model known as SpCycleGAN [7, 37] for generating synthetic microscopy volumes. By unpaired we mean that the binary segmentation masks we created above are not the ground truth of actual microscopy images. The inputs to the SpCycleGAN are the binary segmentation masks we created and actual microscopy images (i.e. unpaired). As shown in Figure 1, we use the binary segmentation masks Ibi we created and actual microscopy volumes Iorig for training the SpCycleGAN. After training, we generate synthetic microscopy volumes (Isyn) by using different synthetic segmentation masks (Ibi) we created as input to the SpCycleGAN. Note that since the SpCycleGAN generates 2D slices, we then use the slices to construct a 3D synthetic volume. The SpCycleGAN [7], an extension of CycleGAN [37], is shown in Figure 3. SpCycleGAN consists of two generators, G and F, and two discriminators, D1 and D2. G learns the mapping from Iorig to Ibi, whereas F performs the reverse mapping. In addition, SpCycleGAN introduces a segmentor S for maintaining the spatial location between Ibi and F(G(Ibi)). The entire loss function of SpCycleGAN is shown in Equation 3:

$$\mathcal{L} = \mathcal{L}_{GAN}(G, D_1) + \mathcal{L}_{GAN}(F, D_2) + \lambda_1 \mathcal{L}_{cycle}(G, F) + \lambda_2 \mathcal{L}_{spatial}(F, S), \tag{3}$$

where λ1 and λ2 are weight coefficients controlling the loss balance between ℒcycle and ℒspatial, ∥·∥1 and ∥·∥2 denote the L1 and L2 norms used in ℒcycle and ℒspatial, and 𝔼I is the expected value over all input volumes of a batch to the network.
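The following is a minimal PyTorch sketch of how the generator objective in Equation 3 could be wired. The least-squares adversarial form, the module names, and the exact argument of the segmentor term are assumptions of this sketch, not the authors' released code.

```python
import torch
import torch.nn as nn

def spcyclegan_generator_loss(G, F, S, D1, D2, real_orig, real_bi,
                              lam1=10.0, lam2=10.0):
    """Sketch of an SpCycleGAN-style generator objective (cf. Equation 3).

    G maps microscopy -> mask, F maps mask -> microscopy, and the segmentor
    S enforces spatial alignment between the input mask and the mask
    recovered from the synthetic image (wiring assumed for illustration).
    """
    mse, l1 = nn.MSELoss(), nn.L1Loss()
    fake_bi, fake_orig = G(real_orig), F(real_bi)

    def adv_term(D, fake):
        # Least-squares GAN term: generator tries to make D output 1.
        pred = D(fake)
        return mse(pred, torch.ones_like(pred))

    adv = adv_term(D1, fake_bi) + adv_term(D2, fake_orig)
    # Cycle-consistency (L1): translating there and back is the identity.
    cycle = l1(F(fake_bi), real_orig) + l1(G(fake_orig), real_bi)
    # Spatial consistency (L2): segmenting the synthetic image should
    # reproduce the mask the image was generated from.
    spatial = mse(S(fake_orig), real_bi)
    return adv + lam1 * cycle + lam2 * spatial
```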
2.3 NISNet3D
The overview of NISNet3D is shown in Figure 4. In this section, we describe the architecture of the modified 3D U-Net in NISNet3D, how the modified 3D U-Net is trained and used for inference, and the nuclei instance segmentation step of NISNet3D.
2.3.1 Modified 3D U-Net
In this section we describe the modified 3D U-Net of NISNet3D as shown in Figure 4. NISNet3D uses an encoder-decoder network (See Figure 5) which outputs the same size volume as the input. The encoder consists of multiple Conv3D Blocks (Figure 6(a)) and Residual Blocks (Figure 6(b)) [38]. Instead of using max pooling layers, we use Conv3D Blocks with stride 2 for feature down-sampling, which introduces more learnable parameters. Each convolution block consists of a 3D convolution layer with filter size 3 × 3 × 3, a 3D batch normalization layer and a leaky ReLU layer. The decoder consists of multiple TransConv3D blocks (Figure 6(d)) and attention gates (Figure 6(c)). Each TransConv3D block includes a 3D transpose convolution with filter size 3 × 3 × 3 followed by 3D batch normalization and leaky ReLU. We use a self-attention mechanism described in [39] to refine the feature concatenation while reconstructing the spatial information.
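As a concrete illustration, the building blocks described above could be expressed in PyTorch roughly as follows; the padding, stride, and output-padding choices are assumptions of this sketch rather than the exact NISNet3D configuration.

```python
import torch.nn as nn

def conv3d_block(in_ch, out_ch, stride=1):
    """Conv3D Block: 3x3x3 convolution, 3D batch norm, leaky ReLU.

    A stride of 2 replaces max pooling for learnable down-sampling, as
    described for the NISNet3D encoder (padding assumed to preserve
    spatial size when stride=1).
    """
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
    )

def transconv3d_block(in_ch, out_ch):
    """TransConv3D block: 3x3x3 transpose convolution for up-sampling,
    followed by 3D batch norm and leaky ReLU (stride/padding assumed)."""
    return nn.Sequential(
        nn.ConvTranspose3d(in_ch, out_ch, kernel_size=3, stride=2,
                           padding=1, output_padding=1),
        nn.BatchNorm3d(out_ch),
        nn.LeakyReLU(inplace=True),
    )
```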
Training The Modified 3D U-Net
The modified 3D U-Net can be trained on synthetic microscopy volumes Isyn, on actual microscopy volumes Iorig when manual ground truth annotations are available, or on both.
We define the "3D vector field volume," Ivec, as a volume in which each nucleus voxel stores a 3D vector pointing to the centroid of the nucleus it belongs to in Ilabel. Ivec is generated from Ilabel and used as part of the ground truth for training the modified 3D U-Net. We next describe the steps for 3D vector field volume generation (VFG).
As shown in Figure 4, during training, the modified 3D U-Net in NISNet3D takes Isyn or Iorig, Ilabel and Ivec as input, and outputs the estimated 3D vector field volume Îvec (Îvec(x), Îvec(y), Îvec(z)) and the 3D binary segmentation mask Imask. Ilabel is the gray-scale label volume for Ibi of size X × Y × Z in which different nuclei are marked with unique voxel intensities. As shown in Figure 7, the first step of 3D vector field volume generation (VFG) is to obtain the centroid of each nucleus in Ilabel. We denote the kth nucleus as the voxels with intensity k in Ilabel, and the centroid of the kth nucleus as (xk, yk, zk).
Ivec is a matrix of size 3 × X × Y × Z that represents the three volumes Ivec(x), Ivec(y), Ivec(z), each of size X × Y × Z. Each nucleus voxel stores a 3D vector from its own location (x, y, z) to its nearest nucleus centroid (xk, yk, zk); if a voxel is a background voxel (i.e. Ilabel(x, y, z) = 0), the corresponding entries in the vector field volume are set to 0. Equation 5 shows the definition of Ivec:

$$I^{vec}(x,y,z)=\begin{cases}[\,x_k-x,\;y_k-y,\;z_k-z\,]^T, & \text{if } I^{label}(x,y,z)=k>0\\[2pt] [\,0,\,0,\,0\,]^T, & \text{otherwise.}\end{cases} \tag{5}$$

By using the 3D vector field volume, boundary information is more easily learned, since the vectors in boundary regions point in very different directions, resulting in very large gradients. Ivec is used as part of the training ground truth for the network to learn nuclei centroid and boundary information, whereas Ilabel is used for learning the segmentation masks. Finally, Ivec and Ilabel are used as the ground truth for training the modified 3D U-Net.
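A direct NumPy/SciPy rendering of the VFG step might look like the following; the function name and per-label looping strategy are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy import ndimage

def vector_field(label_vol):
    """Generate the 3D vector field volume I^vec from a label volume.

    Each nucleus voxel stores the offset from its own coordinates to the
    centroid of the nucleus it belongs to; background voxels stay zero.
    """
    vec = np.zeros((3,) + label_vol.shape, dtype=np.float32)
    labels = [k for k in np.unique(label_vol) if k != 0]
    # Centroid of every labeled nucleus.
    centroids = ndimage.center_of_mass(label_vol > 0, label_vol, labels)
    for k, (cx, cy, cz) in zip(labels, centroids):
        xs, ys, zs = np.nonzero(label_vol == k)
        vec[0, xs, ys, zs] = cx - xs
        vec[1, xs, ys, zs] = cy - ys
        vec[2, xs, ys, zs] = cz - zs
    return vec
```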
Loss Functions
The modified 3D U-Net simultaneously learns the nuclei segmentation masks Imask and the 3D vector field volume Îvec. Two output branches are used, and no sigmoid function is applied to obtain Îvec because the vector at a voxel can point anywhere in the volume; in other words, a voxel in Îvec can be negative or arbitrarily large. Unlike previous methods [31, 40, 41] that directly learn a distance transform map, the 3D vector field volume contains both nuclei centroid and boundary information, which helps avoid over-detection for irregular nuclei. The output 3D vector field volume Îvec is compared with the ground truth vector field volume Ivec and optimized using the Mean Square Error (MSE) loss ℒMSE, whereas the segmentation result Imask is compared with the ground truth binary volume Ibi and optimized using a combination of Focal Loss [42] ℒFL and Tversky Loss [43] ℒTL. The entire loss function is shown in Equation 6:

$$\mathcal{L} = \lambda_3 \mathcal{L}_{MSE} + \lambda_4 \mathcal{L}_{FL} + \lambda_5 \mathcal{L}_{TL}, \tag{6}$$

where α1 + α2 = 1 are two hyper-parameters in the Tversky loss [43] that control the balance between false positive and false negative detections, and β and γ are two hyper-parameters in the Focal loss [42]: β balances the importance of positive/negative voxels, and γ adjusts the weights of easily classified voxels. In addition, S is the ground truth binary volume, Ŝ is the segmentation volume, V is the ground truth vector field volume, and V̂ is the estimated vector field volume. vi ∈ V is the ith voxel in V, and v̂i ∈ V̂ is the ith voxel in V̂. Similarly, si ∈ S is the ith voxel in S, and ŝi ∈ Ŝ is the ith voxel in Ŝ. We define ŝi0 as the probability that the ith voxel in Ŝ belongs to the nuclei class, and ŝi1 as the probability that it belongs to the background class. Similarly, si1 = 1 if si is a nuclei voxel and 0 if si is a background voxel, and vice versa for si0. Lastly, P is the total number of voxels in a volume.
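A hedged PyTorch rendering of Equation 6, using the standard forms of the focal and Tversky losses and the hyper-parameter values reported in Section 3.2, might look like this; the exact reduction and class-indexing conventions are assumptions of the sketch.

```python
import torch

def nisnet3d_loss(v_hat, v, s_hat, s, lam3=1.0, lam4=10.0, lam5=10.0,
                  beta=0.8, gamma=2.0, alpha1=0.3, alpha2=0.7, eps=1e-6):
    """Combined loss sketch: MSE on the vector field plus focal and
    Tversky losses on the segmentation mask (cf. Equation 6).

    v_hat/v: estimated and ground truth vector fields; s_hat: predicted
    foreground probabilities in [0, 1]; s: binary ground truth mask.
    """
    mse = torch.mean((v_hat - v) ** 2)
    # Focal loss: gamma down-weights easily classified voxels,
    # beta balances positive vs. negative voxels.
    p = s_hat.clamp(eps, 1 - eps)
    focal = -torch.mean(beta * s * (1 - p) ** gamma * torch.log(p)
                        + (1 - beta) * (1 - s) * p ** gamma * torch.log(1 - p))
    # Tversky loss: alpha1/alpha2 trade off false positives vs. negatives.
    tp = torch.sum(p * s)
    fp = torch.sum(p * (1 - s))
    fn = torch.sum((1 - p) * s)
    tversky = 1 - tp / (tp + alpha1 * fp + alpha2 * fn + eps)
    return lam3 * mse + lam4 * focal + lam5 * tversky
```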
Modified 3D U-Net Inference
To segment a large microscopy volume, we propose a divide-and-conquer inference scheme, shown in Figure 8. We use an inference window of size K × K × K that slides along the original microscopy volume Iorig of size X × Y × Z and crops a subvolume. Since partially included nuclei on the border of the inference window may cause inaccurate segmentation results, we construct a padded window by symmetrically padding each cropped subvolume on each border. The stride of the moving window is also set smaller than K, so that each window overlaps the previous one. For the inference result of each K × K × K window, only the interior subvolume in the center, denoted as the interior window, is used as the segmentation result. In this paper, we use K = 128 for all testing data. Once the inference window has slid over the entire volume, a segmentation volume Imask and a 3D vector field volume Îvec of size X × Y × Z have been generated. In this way, we can run inference on an input volume of any size, including very large volumes. Imask contains the binary segmentation results, whereas Îvec is the estimated 3D vector field volume that is decoded to locate the nuclei centroids during 3D nuclei instance segmentation.
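The following sketch illustrates the divide-and-conquer idea for the binary mask output. The pad margin of 16 voxels and the non-overlapping stride are simplifying assumptions; the paper's exact pad and stride (fractions of K) are not reproduced here.

```python
import numpy as np
import torch

def sliding_window_inference(model, vol, K=128, pad=16):
    """Tile a large volume into K^3 windows, pad each crop symmetrically,
    and keep only the interior of each prediction (values assumed)."""
    X, Y, Z = vol.shape
    mask = np.zeros((X, Y, Z), dtype=np.float32)
    for x0 in range(0, X, K):
        for y0 in range(0, Y, K):
            for z0 in range(0, Z, K):
                x1, y1, z1 = min(x0 + K, X), min(y0 + K, Y), min(z0 + K, Z)
                crop = vol[x0:x1, y0:y1, z0:z1]
                # Symmetric padding so border nuclei have full context.
                padded = np.pad(crop, pad, mode="symmetric")
                t = torch.from_numpy(padded)[None, None].float()
                with torch.no_grad():
                    pred = model(t)[0, 0].numpy()
                # Discard the padded margin; keep the interior window.
                mask[x0:x1, y0:y1, z0:z1] = pred[pad:pad + (x1 - x0),
                                                 pad:pad + (y1 - y0),
                                                 pad:pad + (z1 - z0)]
    return mask
```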
2.3.2 3D Nuclei Instance Segmentation
Figure 4 shows an overview of the proposed method for 3D nuclei instance segmentation. Based on the output of the modified 3D U-Net, 3D gradient field generation, marker generation, and marker refinement steps are used for separating densely clustered nuclei. The estimated 3D vector field volume Îvec is a 3-channel volume with the same size as the ground truth 3D vector field volume Ivec, where each nucleus voxel represents a 3D vector pointing to its nearest nucleus centroid. Neighboring voxels on the boundary of two touching nuclei generally point in different directions and thus have a large gradient. Let ∇Îvec be the gradient of Îvec. We use Îvec to obtain the gradient field as described in Equation 8:

$$\nabla \hat{I}^{vec} = \left[\,S_x * \hat{I}^{vec(x)},\; S_y * \hat{I}^{vec(y)},\; S_z * \hat{I}^{vec(z)}\,\right], \tag{8}$$

where Îvec(x), Îvec(y), and Îvec(z) are the x, y, and z channels of the estimated 3D vector field volume Îvec, Sx, Sy, Sz are 3D Sobel filters, and ∗ is the convolution operator. In Equation 9, the gradient map IG is then determined by choosing the maximum gradient component of each vector along the x, y, and z directions:

$$I^{G}(x,y,z) = \max\left(\left|\nabla \hat{I}^{vec(x)}\right|,\; \left|\nabla \hat{I}^{vec(y)}\right|,\; \left|\nabla \hat{I}^{vec(z)}\right|\right). \tag{9}$$

The boundaries of touching nuclei in IG have larger values (larger gradients), which can be used to identify individual nuclei. In Equation 10,

$$I^{blob} = \sigma\!\left(I^{mask} - \tau(I^{G}, T_m)\right), \tag{10}$$

Imask is the binary segmentation mask from the modified 3D U-Net and τ(x, t) is a thresholding function such that τ(x, t) = 1 if x ≥ t and 0 otherwise. τ(IG, Tm) is used to highlight the boundaries of individual nuclei, and σ(·) is a rectifier that sets all negative values to 0. Iblob contains the interior regions of the nuclei, which are potential markers for watershed segmentation [44, 45].

To refine Iblob, we use a 3D conditional erosion with a coarse structuring element Bc and a fine structuring element Bf, shown in Figure 9. We first iteratively erode each object using Bc until the object size is smaller than tc, then erode each object using Bf until the size of each object is smaller than tf. The markers Imark for watershed segmentation are obtained using Equation 11:

$$I^{mark} = E_{B_f,\,t_f}\!\left(E_{B_c,\,t_c}\!\left(I^{blob}\right)\right), \tag{11}$$

where E_{Bc, tc}(·) denotes the iterative erosion of all objects in Iblob with coarse structuring element Bc until the size of each object is smaller than the coarse object threshold tc; its output is then similarly eroded with fine structuring element Bf and fine object threshold tf. Finally, marker-controlled watershed [45] is used to generate the instance segmentation masks Iseg. Small objects of fewer than 20 voxels are then removed, and each object is color coded for visualization.
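Putting the pieces together, a simplified version of the instance step could be written as follows; plain connected-component labeling stands in for the conditional erosion of Equation 11, and Tm is a placeholder threshold rather than the tuned value from Table 3.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def instance_segmentation(v_hat, mask, Tm=0.5, min_size=20):
    """Sketch: Sobel gradients of the vector field highlight boundaries
    between touching nuclei; interior blobs seed a marker-controlled
    watershed (conditional erosion replaced by connected components)."""
    # Max Sobel-gradient magnitude over the three vector-field channels.
    grads = [np.abs(ndimage.sobel(v_hat[c], axis=c)) for c in range(3)]
    IG = np.maximum.reduce(grads)
    IG = IG / (IG.max() + 1e-8)
    # Interior regions: mask minus thresholded boundaries, clipped at 0.
    blob = np.clip(mask.astype(np.int8) - (IG >= Tm).astype(np.int8), 0, 1)
    markers, _ = ndimage.label(blob)
    seg = watershed(IG, markers, mask=mask.astype(bool))
    # Remove small spurious objects (< min_size voxels).
    sizes = np.bincount(seg.ravel())
    seg[np.isin(seg, np.nonzero(sizes < min_size)[0])] = 0
    return seg
```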
3 EXPERIMENTAL RESULTS
3.1 Evaluation Datasets
Due to the limited availability of annotated volumes, we present two training and evaluation strategies, described in Section 3.2. The corresponding trained versions of NISNet3D obtained using the two strategies are denoted "NISNet3D-slim" and "NISNet3D-synth". Both versions are evaluated on four different microscopy volumes, denoted 𝒱1-𝒱4, which contain fluorescently labeled (Hoechst 33342 stain) nuclei collected from rat kidneys, rat livers, and mouse intestines using confocal microscopy. The manually annotated ground truth subvolumes for each type of evaluation volume were obtained using ITK-SNAP [8]. In addition, we also trained NISNet3D-slim on a publicly available electron microscopy volume of a zebrafish brain; this volume is known as NucMM [46] and is denoted 𝒱5 in our evaluation datasets. Detailed information for all five datasets used in our evaluation is shown in Table 1.
3.2 Experimental Settings
The parameters for generating Ibi are shown in Table 2, where (amin, amax) is the range of the ellipsoid semi-axes a, $t_o$ is the maximum allowed number of overlapping voxels between two nuclei, and N is the total number of nuclei in a synthetic volume. These parameters are based on visual inspection of nuclei characteristics in the actual microscopy volumes. The synthetic microscopy volumes were verified by a biologist (one of the co-authors). The SpCycleGAN was trained on unpaired Ibi and Iorig, and the trained model was used to generate synthetic microscopy volumes. The weight coefficients of the loss function are set to λ1 = λ2 = 10 based on the experiments described in [7].
Both the SpCycleGAN and NISNet3D are implemented using PyTorch. We used 9-block ResNet for generators G, F and the segmentor S (see Figure 3). The discriminators D1 and D2 (Figure 3) are implemented with the “Patch-GAN” classifier [47]. The SpCycleGAN was trained with Adam optimizer [48] for 200 epochs with initial learning rate 0.0002 that linearly decays to 0 after the first 100 epochs. Figure 10 shows the generated synthetic nuclei segmentation masks and corresponding synthetic microscopy images.
For NISNet3D-slim, we trained 5 versions, denoted ℳ1-ℳ5, corresponding to 𝒱1-𝒱5 (see Table 2). We used three training methods for NISNet3D-slim: (1) train only on the corresponding synthetic microscopy data (ℳ1, ℳ2, ℳ3); (2) transfer the weights from ℳ3 of NISNet3D-slim and continue training on a limited number of actual microscopy subvolumes (ℳ4); (3) train directly on actual subvolumes only (ℳ5).
After training, two different evaluation schemes are used for NISNet3D-slim: (1) directly test on all subvolumes of the original volume (ℳ1, ℳ2, ℳ4, ℳ5); (2) use 3-fold cross-validation: split the ground truth subvolumes randomly into 3 equal sets and iteratively update the model on one of the sets and test on the other two (ℳ3). We use cross-validation in our experiments to show the effectiveness of our method when evaluation data is limited. Note that for light retraining, we update all parameters of NISNet3D while continuing training on actual microscopy volumes. The training and evaluation scheme for each model is shown in Table 2.
In addition, we also trained NISNet3D-synth, which is trained on 800 synthetic microscopy volumes, including synthetic versions of 𝒱1-𝒱4, without any updating on actual microscopy volumes. NISNet3D-synth is designed for the scenario where no ground truth annotations are available, whereas NISNet3D-slim is used when a limited number of ground truth annotated volumes are available. In the latter case, synthetic volumes are used for training and the small amount of ground truth data is used for light retraining.
Both NISNet3D-slim and NISNet3D-synth were trained for 100 epochs using the Adam optimizer [48] with constant learning rate 0.001. The weight coefficients of the loss function are set to λ3 = 1, λ4 = 10, λ5 = 10. The hyper-parameters β, γ of ℒFL are set to 0.8 and 2, and the hyper-parameters α1, α2 in ℒTL are set to 0.3 and 0.7 [43, 49]. The nuclei instance segmentation parameters used in our experiments are shown in Table 3.
3.3 Comparison Methods
We compared NISNet3D with deep learning image segmentation methods, including VNet [27], 3D U-Net [26], Cellpose [3], DeepSynth [6], and StarDist3D [15]. In addition, we also compared NISNet3D with several commonly used biomedical image processing tools, including 3D Watershed [45], Squassh [50], CellProfiler [51], and VTEA [52].
VNet [27] and 3D U-Net [26] are two popular 3D encoder-decoder networks with shortcut concatenations designed for biomedical image segmentation. Cellpose [3] uses a modified 2D U-Net to estimate image segmentation and spatial flows, and uses a dynamical system to cluster pixels and separate touching nuclei. When segmenting 3D volumes, Cellpose works slice by slice along three different directions and reconstructs the 2D segmentation results into a 3D segmentation volume [3]. DeepSynth uses a modified 3D U-Net to segment a 3D microscopy volume and uses watershed to separate touching nuclei [6]. StarDist3D uses a modified 3D U-Net to estimate star-convex polyhedra that represent the nuclei [15]. 3D Watershed [45] uses the watershed transformation [44] and conditional erosion [45] for nuclei instance segmentation. Squassh is an ImageJ plugin for both 2D and 3D microscopy image segmentation based on the use of active contours [50]. CellProfiler is an image processing toolbox that provides customized image processing and analysis modules [51]. VTEA is an ImageJ plugin that combines Otsu's thresholding and watershed to segment nuclei in 2D slice by slice and reconstructs the results into a 3D volume [52].
We trained and evaluated the comparison methods using the same datasets used for NISNet3D, with the same training and evaluation strategies described in Section 3.2; this is discussed further in Section 3.6. Note that 3D Watershed, Squassh, CellProfiler, and VTEA do not need to be trained because they use more traditional image analysis techniques, whose settings we describe in Section 3.6.
3.4 Evaluation Metrics
We use object-based metrics to evaluate nuclei instance segmentation accuracy. We define TPt as the number of True Positive detections, where a detection is counted as a true positive if the Intersection-over-Union (IoU) between a detected nucleus and a ground truth nucleus is greater than a threshold t. Similarly, FPt is the number of False Positives, and FNt is the number of False Negatives [53, 54]. TPt measures how many nuclei in a volume are correctly detected; the higher TPt is, the more accurate the detection method. FPt represents detected nuclei that are not actually nuclei (false detections), and FNt represents the number of nuclei that were not detected. A precise detection method should have a high TPt but low FPt and FNt.
We then construct metrics based on TPt, FPt, and FNt. To reduce bias [55], we use the mean Precision, mean Recall, and mean F1 score computed over multiple IoU thresholds TIoUs, where Precision = TPt/(TPt + FPt), Recall = TPt/(TPt + FNt), and F1 is the harmonic mean of Precision and Recall. We set TIoUs = {0.25, 0.3, …, 0.45} for datasets 𝒱1-𝒱4, and TIoUs = {0.5, 0.55, …, 0.75} for dataset 𝒱5. The selection of TIoUs is described in more detail below.
We observed that the nuclei in datasets 𝒱1-𝒱4 are more challenging to segment than the nuclei in 𝒱5. If we use the same IoU thresholds for evaluating all the datasets, the evaluation accuracy for 𝒱1-𝒱4 will be much lower than the evaluation accuracy for 𝒱5. Thus, we chose two different sets of IoU thresholds for 𝒱1-𝒱4, and 𝒱5, respectively.
We also examined the commonly used object detection metric Average Precision (AP) [56, 57] by estimating the area under the Precision-Recall curve [58], using the same thresholds TIoUs described in the previous paragraph. For example, AP.25 is the average precision with IoU threshold 0.25. The mean Average Precision (mAP) is obtained by averaging the AP over all thresholds: $mAP = \frac{1}{|T_{IoUs}|}\sum_{t \in T_{IoUs}} AP_t$.
In addition, we use the Aggregated Jaccard Index (AJI) [59] to integrate object-level and voxel-level errors. The AJI is defined as

$$AJI = \frac{\sum_{i=1}^{N} \left|G_i \cap S_{m(i)}\right|}{\sum_{i=1}^{N} \left|G_i \cup S_{m(i)}\right| + \sum_{S_m \in U} |S_m|}, \tag{12}$$

where Gi denotes the ith nucleus in the ground truth volume containing a total of N nuclei, S_{m(i)} is the connected component in the segmentation mask that has the largest Jaccard Index with Gi, and U is the set of segmented nuclei without corresponding ground truth. Note that each segmented nucleus with index m cannot be used more than once.
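For reference, the AJI could be computed from two label volumes as in the sketch below; the greedy best-match loop only approximates the once-only matching constraint stated above.

```python
import numpy as np

def aggregated_jaccard_index(gt, seg):
    """AJI sketch: match each ground truth nucleus to the segmented
    nucleus with the largest Jaccard index; unmatched segmented nuclei
    are penalized in the denominator."""
    gt_ids = [k for k in np.unique(gt) if k != 0]
    seg_ids = [k for k in np.unique(seg) if k != 0]
    used, inter_sum, union_sum = set(), 0, 0
    for g in gt_ids:
        gmask = gt == g
        # Default: no match adds |G_i| to the denominator only.
        best_j, best_m, best_i, best_u = -1.0, None, 0, int(gmask.sum())
        for m in np.unique(seg[gmask]):
            if m == 0:
                continue
            smask = seg == m
            i = int(np.logical_and(gmask, smask).sum())
            u = int(np.logical_or(gmask, smask).sum())
            if i / u > best_j:
                best_j, best_m, best_i, best_u = i / u, m, i, u
        inter_sum += best_i
        union_sum += best_u
        if best_m is not None:
            used.add(best_m)
    # Unmatched segmented nuclei count fully against the score.
    for m in seg_ids:
        if m not in used:
            union_sum += int((seg == m).sum())
    return inter_sum / union_sum
```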
All methods were optimized to achieve the best visual results by parameter tuning; this is further discussed in Section 3.6. The quantitative evaluation metrics for each microscopy dataset are shown in Table 4 and Table 5. Figure 11 shows the AP scores at multiple IoU thresholds and a box plot of the AJI for each subvolume of dataset 𝒱2. Orthogonal views (XY and XZ focal planes) of the segmentation masks overlaid on the original microscopy subvolumes for each method on 𝒱1-𝒱5 are shown in Figure 12 and Figure 13. Note that the colors correspond to different nuclei.
3.5 Visualizing Errors and Differences
Entire microscopy volumes contain many regions with varying spatial characteristics. In order to see how the segmentation methods perform in various regions, we propose three methods for visualizing the errors and differences between a “test segmented volume” and a “reference segmented volume”. We use the NISNet3D segmented volume as the reference segmented volume and visualize the segmentation errors of VTEA and of DeepSynth relative to NISNet3D on entire volumes. We describe how to generate an Overlay Volume and three types of Error Volumes using three methods, which we will call Visualization Methods A, B, and C. Note that Visualization Method A does not need a “reference segmented volume”, whereas Visualization Methods B and C do.
Next we describe how to generate an Overlay Volume. Using the notation shown in Figure 4, we denote Iorig as the original microscopy volume and m as the maximum intensity of Iorig. We will use Iorig for overlaying the segmentation errors from the “test segmented volume” to construct the visualization. For a “reference segmented volume”, we denote Imask as the binary segmentation mask of Iorig, and Iseg as the color-coded segmentation of Iorig with RGB channels IsegR, IsegG, IsegB. Similarly, for a “test segmented volume”, we denote Cbi as the binary segmentation mask of Iorig, and C as the color-coded segmentation of Iorig with RGB channels CR, CG, CB.
We then denote the Overlay Volume as L with RGB channels LR, LG, LB. As shown in Equation 13, the Overlay Volume for a “test segmented volume” is generated by adding the original microscopy volume to each of the RGB channels of the color-coded segmented volume:

$$L_R = C_R + I^{orig}, \quad L_G = C_G + I^{orig}, \quad L_B = C_B + I^{orig}. \tag{13}$$

We then define the notation used for generating the three types of Error Volumes. To represent the segmented nuclei in a “test segmented volume”, let {C3Di} be the set of all 3D segmented nuclei in C, where C3Di is a volume with the same size as C that contains only the ith segmented 3D nucleus, and let {C2Di} be the set of all 2D objects in C across the XY focal planes, where C2Di is a volume with the same size as C that contains only the ith segmented 2D nucleus.
Similarly, to represent the segmented nuclei in a “reference segmented volume”, we denote Iseg as the 3D segmentation volume from NISNet3D, which will be used as the “reference segmented volume”. Let {Iseg,3Dj} be the set of all 3D objects in Iseg, and let {Iseg,2Dj} be the set of all 2D objects in Iseg from each slice. Next, we describe how to generate the three types of Error Volumes using Visualization Methods A, B, and C.
3.5.1 Visualization Method A
The Error Volume generated by Visualization Method A shows voxels in the original microscopy volume that are not segmented by either the test or the reference method. The input to Visualization Method A is the original microscopy volume and a segmented volume (“test segmented volume” or “reference segmented volume”). Using VTEA as an example, the VTEA segmented volume is subtracted from the original microscopy volume, as shown in Equation 14:

$$I^{A}(x,y,z) = \begin{cases} I^{orig}(x,y,z), & \text{if } C^{bi}(x,y,z) = 0\\ 0, & \text{otherwise.}\end{cases} \tag{14}$$

The Error Volume IA shows the voxels in the original microscopy image that are not segmented. We can replace Cbi in Equation 14 with Imask to obtain the Error Volume for the “reference segmented volume” NISNet3D. Figure 14 ((a), (b), (c)) shows the Error Volumes generated by Method A for VTEA, DeepSynth, and NISNet3D on dataset 𝒱3.
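A minimal sketch of the Overlay Volume and the Method A Error Volume follows; the normalization of the original volume by its maximum intensity m, and channel values in [0, 1], are assumptions of this illustration.

```python
import numpy as np

def error_volume_a(orig, seg_binary):
    """Method A sketch: keep original-image voxels that were not
    segmented, i.e. zero out everything inside the segmentation mask."""
    return np.where(seg_binary > 0, 0, orig)

def overlay_volume(orig, seg_rgb, m=None):
    """Overlay sketch: add the normalized original volume to each RGB
    channel of the color-coded segmentation (seg_rgb is assumed to be a
    (3, X, Y, Z) array with values in [0, 1])."""
    m = orig.max() if m is None else m
    return np.clip(seg_rgb + (orig / m)[None], 0.0, 1.0)
```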
3.5.2 Visualization Method B
The Error Volume generated by Visualization Method B shows under-segmentation regions where multiple nuclei in the “reference segmented volume” are detected as a single nucleus in the “test segmented volume”. Here we use NISNet3D as the “reference segmented volume” and VTEA or DeepSynth as the “test segmented volume”. The input to Visualization Method B is the VTEA (or DeepSynth) segmented volume and the NISNet3D segmented volume.
Using VTEA as an example: if two or more nuclei in the NISNet3D segmented volume intersect the same single nucleus in the VTEA segmented volume, then that single VTEA nucleus is shown in the Visualization Method B Error Volume. This is expressed in Equation 15:

$$I^{B} = \bigcup_i \left\{ C^{3D}_i \,:\, \left|\left\{ j : I^{seg,3D}_j \cap C^{3D}_i \neq \emptyset \right\}\right| \ge 2 \right\}. \tag{15}$$

The resulting volume IB is then overlaid on the original microscopy volume using Equation 13. Figure 14 ((d), (e)) shows the Error Volumes generated by Visualization Method B for VTEA and DeepSynth on dataset 𝒱3.
3.5.3 Visualization Method C
The Error Volume generated by Visualization Method C shows nuclei segmented in the “reference segmented volume” that are completely missed in the “test segmented volume”. The input to Visualization Method C is the VTEA (or DeepSynth) segmented volume and the NISNet3D segmented volume. Using VTEA as an example: if the voxels of a nucleus in the NISNet3D segmented volume do not intersect any voxel of any segmented nucleus in the VTEA segmented volume, then the Visualization Method C Error Volume shows this nucleus from NISNet3D. This is expressed in Equation 16:

$$I^{C} = \bigcup_j \left\{ I^{seg,3D}_j \,:\, I^{seg,3D}_j \cap C^{3D}_i = \emptyset \ \text{for all } i \right\}. \tag{16}$$

Note: since we use the NISNet3D segmented volume as the “reference segmented volume”, we do not provide Error Volumes for Visualization Methods B or C on the NISNet3D segmented volumes. The Error Volumes for Visualization Methods A, B, and C are shown in Figure 14.
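Given label volumes for the reference and test segmentations, Methods B and C reduce to counting label overlaps, as in the sketch below (a hypothetical helper derived from the prose description, not the authors' implementation):

```python
import numpy as np

def error_volumes_b_c(ref_labels, test_labels):
    """Sketch of Methods B and C via label-overlap counting.

    B: test nuclei covering two or more reference nuclei
       (under-segmentation).
    C: reference nuclei with no overlapping test nucleus (misses).
    """
    both = np.logical_and(ref_labels > 0, test_labels > 0)
    # Unique (reference id, test id) pairs sharing at least one voxel.
    pairs = np.unique(np.stack([ref_labels[both], test_labels[both]]), axis=1)
    ref_matched, test_of_pair = pairs[0], pairs[1]
    # B: test ids intersecting >= 2 distinct reference nuclei.
    ids, counts = np.unique(test_of_pair, return_counts=True)
    under_ids = ids[counts >= 2]
    IB = np.where(np.isin(test_labels, under_ids), test_labels, 0)
    # C: reference ids that never appear in any overlap pair.
    all_ref = np.setdiff1d(np.unique(ref_labels), [0])
    missed = np.setdiff1d(all_ref, ref_matched)
    IC = np.where(np.isin(ref_labels, missed), ref_labels, 0)
    return IB, IC
```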
3.6 Discussion
Due to the limited availability of annotated volumes, we use synthetic microscopy subvolumes for training NISNet3D. It should be emphasized that NISNet3D can be trained on annotated volumes, if available, or on a combination of synthetic volumes and actual annotated volumes. To examine the performance of NISNet3D, we tested it on four fluorescence microscopy datasets from multiple organs and tissue regions. In addition, we also tested it on electron microscopy data of a zebrafish brain from the NucMM Challenge [46].
We compared NISNet3D with VNet [27], 3D U-Net [26], Cellpose [3], Deep-Synth [6], StarDist3D [15], 3D Watershed [45], Squassh [50], CellProfiler [51], and VTEA [52].
For 3D Watershed, we used a 3D Gaussian filter to preprocess the image and Otsu's method to segment the objects from background structures. The 3D conditional erosion described in Section 2.3.2 was then used to obtain the markers, and the marker-controlled 3D watershed implemented in the Python scikit-image library was used to separate touching nuclei.
For CellProfiler [51], customized image processing modules including inhomogeneity correction, median filtering, and morphological erosion were used to preprocess the image, and the default “IdentifyPrimaryObject” module was used to obtain 2D segmentation masks on each slice. The 2D segmentations were then merged into a 3D segmentation using the blob-slice method described in [52, 60]. For Squassh [50], we used “Background subtraction” with a tuned “rolling ball window size” parameter; the remaining parameters were set to their defaults. We see that Squassh did fairly well on dataset 𝒱2 but failed completely on dataset 𝒱4 due to the densely clustered nuclei. For VTEA [52], a Gaussian filter and background subtraction were used to preprocess the image. The object building method was set to “Connect 3D”, and the segmentation threshold was determined automatically. We tuned the “Centroid offset”, “Min vol”, and “Max vol” parameters to obtain the best visual segmentation results. Finally, watershed was chosen for instance segmentation. Since VTEA and CellProfiler's “IdentifyPrimaryObject” module only work on 2D images, their segmentation results shown in Figure 12 suffer from over-segmentation errors on XZ planes.
For the VNet [27], 3D U-Net [26] and DeepSynth [6] methods, we improved the segmentation results by using our 3D conditional erosion described in Section 2.3.2 followed by 3D marker-controlled watershed to split touching nuclei. For Cellpose, we used the “nuclei” style, and since the training of Cellpose is limited to 2D images, we trained Cellpose on every XY focal plane of our subvolumes following the training schemes in Table 2. We observe that Cellpose has trouble capturing some very large or very small nuclei in an input subvolume and performs worse on “thinner” subvolumes containing more non-ellipsoidal nuclei. Figure 15 shows the 2D-to-3D reconstruction errors from Cellpose and VTEA compared with NISNet3D. For StarDist, we observed that it has difficulty segmenting non-star-convex objects in 𝒱3 and achieves better performance on the regular ellipsoidal nuclei in 𝒱4 (see Figure 12).
Figure 12 and Figure 13 show the color-coded instance segmentation volumes for the compared methods. NISNet3D can accurately identify each individual nucleus and segment the nuclei from the background structure. Note that NISNet3D does not need any prior information about nuclei size or shape, and does not resize or interpolate the input volume. Using our inference scheme shown in Figure 8, NISNet3D can run on a large volume of any given size without losing accuracy to interpolation. We used object-based evaluation metrics to quantitatively evaluate the performance of NISNet3D and the other methods. The summary of evaluation results in Table 4 and Table 5 indicates that NISNet3D achieved the highest mAP and mF1 on all of our test datasets. As shown in Figure 11, in order to quantify how well the segmented nuclei match the ground truth nuclei, we use the Average Precision (AP) under different IoU thresholds and the Aggregated Jaccard Index (AJI) to evaluate both segmentation and detection accuracy. Our method better separates touching nuclei while maintaining nuclei shape.
4 CONCLUSION
In this paper, we described a true 3D Nuclei Instance Segmentation Network, known as NISNet3D, for fluorescence microscopy image analysis. Our approach works directly on 3D volumes by making use of a modified 3D U-Net and a nuclei instance segmentation system that separates touching nuclei based on a 3D vector field volume and a 3D gradient volume. NISNet3D can be trained on actual microscopy volumes, on synthetic microscopy volumes generated using SpCycleGAN, or on a combination of both. We demonstrate that NISNet3D performs well, both visually and quantitatively, when compared to other methods on a variety of microscopy data. In addition, we also present three error/difference visualization methods for visualizing segmentation errors in large 3D microscopy volumes without the need for ground truth annotations.
5 ACKNOWLEDGMENT
This research was partially supported by a George M. O’Brien Award from the National Institutes of Health under grant NIH/NIDDK P30 DK079312 and the endowment of the Charles William Harrison Distinguished Professorship at Purdue University.
The authors have no conflicts of interest.
The original image volumes used in this work were provided by Malgorzata Kamocka, Sherry Clendenon, and Michael Ferkowicz at Indiana University. We gratefully acknowledge their cooperation.
The NISNet3D source code package is available upon request to imart{at}ecn.purdue.edu. The source code is released under the Creative Commons License Attribution-NonCommercial-ShareAlike (CC BY-NC-SA); the source code cannot be used for commercial purposes. The test volumes are also available from imart{at}ecn.purdue.edu. The test volumes are released under the Creative Commons License Attribution-NonCommercial-NoDerivs (CC BY-NC-ND).
Address all correspondence to Edward J. Delp, ace{at}ecn.purdue.edu
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].