A deep learning-based strategy for producing dense 3D segmentations from sparsely annotated 2D images

Producing dense 3D reconstructions from biological imaging data is a challenging instance segmentation task that requires significant ground-truth training data for effective and accurate deep learning-based models. Generating training data requires intense human effort to annotate each instance of an object across serial section images. Our focus is on the especially complicated brain neuropil, comprising an extensive interdigitation of dendritic, axonal, and glial processes visualized through serial section electron microscopy. We developed a novel deep learning-based method to generate dense 3D segmentations rapidly from sparse 2D annotations of a few objects on single sections. Models trained on the rapidly generated segmentations achieved accuracy similar to that of models trained on expert dense ground-truth annotations. Human time to generate annotations was reduced by three orders of magnitude, and annotations could be produced by non-expert annotators. This capability will democratize the generation of training data for the large image volumes needed to reconstruct brain circuits and measure circuit strengths.

For each of the above object-level sparsities, we included three additional disk sparsities, resulting in a total of 32 sparsities (Supp. Fig. 1). We included disk sparsities based on considerations from the local shape descriptors (LSDs): the regions thought to be most crucial for learning LSDs are concentrated around object boundaries and entirely within objects 1. Under this assumption, sparsity could theoretically be represented by deliberately positioned incomplete objects (such as paint strokes). For example, placing a stroke on one side of an object, another on the opposite side, and a third in the center could convey much of the information needed to teach a network that there is a smooth gradient inside an object and a sharp transition across its boundary. Unlabeled areas could then be treated as unknown using a weighted loss function. Hence, for a disk sparsity of N disks, we selected N points within the field of view (FOV) and drew circles of sufficient radius intersecting the labels, as sketched below.
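A minimal sketch of how such disk-sparse labels could be generated from a dense 2D label image follows; the function name, the use of NumPy, and the choice to center circles on labeled pixels are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def disk_sparsity_mask(labels, n_disks, radius, rng):
    """Keep labels only inside n_disks random circles centered on labeled
    pixels; everything outside the circles is treated as unknown."""
    fg = np.argwhere(labels > 0)               # candidate centers on labeled pixels
    yy, xx = np.indices(labels.shape)
    mask = np.zeros(labels.shape, dtype=bool)
    for _ in range(n_disks):
        cy, cx = fg[rng.integers(len(fg))]     # circle guaranteed to intersect a label
        mask |= (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    sparse_labels = np.where(mask, labels, 0)  # labels are cut off at disk borders
    return sparse_labels, mask                 # mask -> zero loss weight outside disks
```

The returned mask can then zero out the loss contribution of all pixels outside the disks, implementing the "unknown" treatment of unlabeled areas.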
The intersecting circle cuts off labels in disk sparsities, producing incorrect ground-truth LSDs toward the centers of objects, where a sharp gradient appears in place of a smooth or small one. We expected this discrepancy to adversely impact predictions and to produce circular artifacts or false boundaries. However, we were surprised to observe that the network still predicted these regions correctly. This could be attributed to several factors. First, we randomly sampled enough batch locations whose labels were correct; an ablation study could restrict the circles so that they always include incorrect boundaries. Second, the network might become confused in these regions due to conflicting signals from the "correct" regions. Consequently, the network regresses to predicting approximately 0.5 in these areas, which happens to map to gray in RGB space, akin to the intentional design of the original LSDs.

B. Experiment and grid search
The experimental procedure in Fig. 2 is described here in more detail. For every dataset, sparsity, and repetition, the total time to segmentation starting from sparse annotation was estimated (Fig. 4), and the best segmentation in terms of accuracy was designated as the representative bootstrapped segmentation for that dataset, sparsity, and repetition.

C. Evaluation metrics
We automatically generated ground-truth skeletons for evaluation from ground-truth labels with the following steps. We used a watershed algorithm on computed ground-truth affinities to generate an oversegmentation (resulting in supervoxels). Each supervoxel's center of mass was stored as a node with position coordinates in a region adjacency graph. For each ground-truth mask, we computed the minimum spanning tree of the nodes, using the physical distance between nodes as the edge weight.
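The sketch below illustrates these steps under stated assumptions: boundary evidence is precomputed from the ground-truth affinities (e.g., one minus the mean affinity), and scipy, skimage, and networkx stand in for whatever tooling was actually used; all names and the seeding heuristic are illustrative.

```python
import numpy as np
import networkx as nx
from scipy import ndimage
from skimage.segmentation import watershed

def gt_skeletons(gt_labels, boundary_prob, voxel_size=(1.0, 1.0, 1.0)):
    # 1) Oversegment into supervoxels: seeded watershed on boundary evidence.
    seeds, _ = ndimage.label(boundary_prob < 0.1)
    supervoxels = watershed(boundary_prob, markers=seeds, mask=gt_labels > 0)

    # 2) One graph node per supervoxel, placed at its center of mass.
    ids = np.unique(supervoxels)[1:]
    centers = dict(zip(ids, ndimage.center_of_mass(
        np.ones_like(supervoxels), supervoxels, ids)))

    skeletons = {}
    for gt_id in np.unique(gt_labels)[1:]:
        # 3) Collect the supervoxels overlapping this ground-truth object.
        members = np.unique(supervoxels[(gt_labels == gt_id) & (supervoxels > 0)])
        g = nx.complete_graph(members.tolist())
        for u, v in g.edges:
            d = np.linalg.norm(
                (np.array(centers[u]) - np.array(centers[v])) * voxel_size)
            g[u][v]["weight"] = d  # physical distance between node positions
        # 4) The minimum spanning tree of the nodes is the skeleton.
        skeletons[gt_id] = nx.minimum_spanning_tree(g)
    return skeletons
```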
We compare the accuracy of bootstrapped segmentations using the normalized variation of information (VOI) metric and the min-cut metric (MCM) 1-3. VOI is a voxel-based measure of similarity between a segmentation and ground-truth labels that reports the amount of split and merge error. We note that VOI can be sensitive to slight differences in boundaries, while small topological changes might go unnoticed, which is especially problematic in fine neuropil. Nevertheless, VOI can be a good proxy for segmentation quality 1.
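For concreteness, a compact sketch of VOI computed from the voxel-wise contingency table is given below. It follows the standard definition (VOI = H(S|T) + H(T|S), split entropy plus merge entropy) rather than the paper's exact code; one common normalization divides these terms by the joint entropy.

```python
import numpy as np

def voi_split_merge(seg, gt):
    """VOI split/merge from the joint label distribution; background ignored."""
    keep = gt.ravel() > 0
    s, t = seg.ravel()[keep], gt.ravel()[keep]
    pairs, counts = np.unique(np.stack([s, t]), axis=1, return_counts=True)
    p = counts / counts.sum()                           # joint distribution p(s, t)
    _, s_idx = np.unique(pairs[0], return_inverse=True)
    _, t_idx = np.unique(pairs[1], return_inverse=True)
    ps = np.bincount(s_idx, weights=p)                  # marginal p(s)
    pt = np.bincount(t_idx, weights=p)                  # marginal p(t)
    voi_split = -np.sum(p * np.log2(p / pt[t_idx]))     # H(S|T): gt objects split apart
    voi_merge = -np.sum(p * np.log2(p / ps[s_idx]))     # H(T|S): gt objects merged
    return voi_split, voi_merge
```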
Quantifying segmentation accuracy in terms of the proofreading effort required to correct it is more interpretable and more relevant to optimize. False splits require only one interaction to merge, and false merges can also be fixed with a few interactions 1,4,5. The MCM measures the total number of split and merge edit operations needed to make a segmentation agree with ground-truth skeletons. We report the total edits per object, total edits per path length, and total splits and total merges per object.
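The sketch below is a crude overlap-based proxy for such edit counts, not the skeleton-based MCM itself: each extra segment overlapping a ground-truth object counts as one split to fix, and each extra ground-truth object inside a segment as one merge to undo. All names and the overlap cutoff are illustrative.

```python
import numpy as np

def split_merge_counts(seg, gt, min_overlap=100):
    keep = gt.ravel() > 0
    pairs, counts = np.unique(
        np.stack([seg.ravel()[keep], gt.ravel()[keep]]),
        axis=1, return_counts=True)
    pairs = pairs[:, counts >= min_overlap]        # ignore tiny spurious overlaps
    _, per_gt = np.unique(pairs[1], return_counts=True)
    _, per_seg = np.unique(pairs[0], return_counts=True)
    splits = int(np.sum(per_gt - 1))               # extra segments per gt object
    merges = int(np.sum(per_seg - 1))              # extra gt objects per segment
    n_objects = per_gt.size
    return splits / n_objects, merges / n_objects  # edits per object
```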
D. Network architectures

The 3D MTLSD network used for bootstrapping was a 3D U-Net with two separate decoder heads for the two learning tasks: affinities and LSDs. The MTLSD U-Net consisted of three layers with downsampling factors of [1,2,2]. The bottleneck and the adjacent layers used 2D convolution kernels. Thirteen initial feature maps were used, with a multiplication factor of 6 between layers. The resulting data was further convolved and passed through a sigmoid activation to map the 13 output feature maps from each decoder head to 3 (3D affinities) and 10 (3D LSDs).
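This two-headed pattern can be illustrated with the following PyTorch sketch. For brevity, a single shared feature map feeding two 1x1x1 convolutional heads stands in for the two full decoder branches described above, and the backbone module is a placeholder; only the feature counts (13 -> 3 and 13 -> 10, each through a sigmoid) follow the text.

```python
import torch
import torch.nn as nn

class MTLSDHeads(nn.Module):
    def __init__(self, backbone: nn.Module, features: int = 13):
        super().__init__()
        self.backbone = backbone                                # U-Net trunk (placeholder)
        self.aff_head = nn.Conv3d(features, 3, kernel_size=1)   # -> 3D affinities
        self.lsd_head = nn.Conv3d(features, 10, kernel_size=1)  # -> 3D LSDs

    def forward(self, raw):
        f = self.backbone(raw)  # assumed shape (B, 13, D, H, W)
        return torch.sigmoid(self.aff_head(f)), torch.sigmoid(self.lsd_head(f))
```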
The 2D U-Nets used in the 2D->3D method consisted of three layers, downsampled by a factor of [2,2] in all layers. Twelve initial feature maps were used, and features were multiplied by a factor of 6 between layers. The resulting data was further convolved and passed through a sigmoid activation to map the 12 output feature maps to 2 (2D affinities), 6 (2D LSDs), or 8 (2D MTLSD).
The lightweight 3D U-Nets in the 2D->3D method consisted of two layers, downsampled by a factor of [1,2,2]. The convolution kernel sizes were [2,3,3] for the first downsampling and last upsampling layers and [1,3,3] for the other layers. Five initial feature maps were used, and features were multiplied by a factor of 5 between layers. The resulting data was further convolved and passed through a sigmoid activation to map the 5 output feature maps to 3 (3D affinities). All networks used a mean-squared error loss, minimized with an Adam optimizer.
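A minimal training-step sketch matching that stated setup (per-voxel MSE weighted by a scale array, minimized with Adam) might look as follows; the learning rate and all names are assumptions.

```python
import torch

def train_step(model, optimizer, raw, target, scale):
    """One update: per-voxel MSE, weighted by the scale array, then an Adam step."""
    optimizer.zero_grad()
    pred = model(raw)
    loss = (torch.nn.functional.mse_loss(pred, target, reduction="none")
            * scale).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g.: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr assumed
```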

E. Training pipelines
For all sparse 2D and 3D networks and the 3D MTLSD network used for bootstrapping, each training batch was randomly picked from the available sections or volume. For each batch, the raw data was first normalized and padded with zeros. Labels were padded with the maximum padding required to contain at least 0.01% (10% for 3D MTLSD, since more pseudo ground-truth labels are available) of labeled ground-truth data within the image, assuming a worst-case rotation of 45 degrees. Data was randomly sampled and augmented with elastic transformations, random mirrors and transposes, Gaussian blur, and intensity augmentations (see Supplementary Table 2 for the augmentation hyper-parameters used for all networks). For the networks with affinities as an output, a scale array was created to balance the loss between target affinity labels; a sketch of one way to compute such an array follows.
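One common way to build such a balancing array, shown below as a generic sketch rather than the paper's code, is to weight each voxel inversely to the frequency of its affinity class, so that the rarer boundary voxels are not drowned out; the clipping bounds are illustrative.

```python
import numpy as np

def balance_weights(target_affs, mask=None, clip=(0.05, 0.95)):
    """Per-voxel weights that give both affinity classes equal total influence."""
    if mask is None:
        mask = np.ones_like(target_affs)
    frac_pos = np.clip((target_affs * mask).sum() / max(mask.sum(), 1), *clip)
    w_pos = 1.0 / (2.0 * frac_pos)           # weight for affinity == 1
    w_neg = 1.0 / (2.0 * (1.0 - frac_pos))   # weight for affinity == 0
    scale = np.where(target_affs > 0.5, w_pos, w_neg) * mask
    return scale  # multiply into the per-voxel loss; zero outside the mask
```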
For the lightweight 3D networks in the 2D->3D method, each training batch begins as a 3D array of zeros. Synthetic 3D labels are randomly grown with the strategies listed below. Labels were then augmented with elastic transformations and random mirrors and transposes, after which they were used to simulate stacked 2D affinities or LSDs. The stacked 2D affinities or LSDs were then augmented with random noise, intensity shifts, and Gaussian blur to simulate realistic stacked 2D predictions (see the sketch below for the affinity case). Finally, target 3D affinities were computed, and a scale array was created to balance the loss between class labels.
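A sketch of the affinity case: in-plane 2D affinities are computed per section from the synthetic labels, then corrupted with noise and section-wise blur to mimic imperfect stacked 2D predictions. The noise and blur parameters are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_stacked_2d_affs(labels, rng, noise_sigma=0.1, blur_sigma=1.0):
    # In-plane affinity = 1 where a voxel and its y/x neighbor share a label.
    affs = np.zeros((2,) + labels.shape, dtype=np.float32)
    affs[0, :, 1:, :] = (labels[:, 1:, :] == labels[:, :-1, :]) & (labels[:, 1:, :] > 0)
    affs[1, :, :, 1:] = (labels[:, :, 1:] == labels[:, :, :-1]) & (labels[:, :, 1:] > 0)
    # Corrupt with noise, then blur each section independently (sigma 0 along z).
    affs += rng.normal(0, noise_sigma, affs.shape).astype(np.float32)
    affs = gaussian_filter(affs, sigma=(0, 0, blur_sigma, blur_sigma))
    return np.clip(affs, 0, 1)
```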

Synthetic 3D label generation
A. N randomly chosen voxels in the array of zeros are set to 1, where N is a random integer between 25 and 50. The speckled array is relabeled such that each labeled voxel has a unique integer ID. The labels are grown outward by D pixels without overlapping other labels, where D is another random integer between 25 and 40. The ID of the background label is bumped to a non-zero, unique integer ID.
B. Similar to A, but the speckled binary array is dilated section-wise with either a 2x2 square structuring element or a disk structuring element with a random radius between 1 and 5. The binary array is then labeled such that each foreground instance has a unique integer ID. The uniquely labeled objects are then expanded into unoccupied space using a distance transform of the background.
C. A Gaussian filter with sigma=5 is applied to an array of random float values to obtain peaks. The peaks are accentuated with a maximum filter with a size of 10 pixels, and the maxima are identified as seeds. Watershed is then applied to grow labels from the seeds (a sketch of this strategy follows the parameter listing below).
D. An equal mix of the above three strategies.

Lightweight 3D U-Net layer parameters:
Kernel sizes down: [[2,3,3], [2,3,3]], [[1,3,3], [1,3,3]], [[1,3,3], [1,3,3]]
Kernel sizes up: [[1,3,3], [1,3,3]], [[2,3,3], [2,3,3]]
Input shape: [10,148,148]
Output shape: [6,108,108]
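A sketch of strategy C, assuming the filter sizes stated above and generic scipy/skimage calls in place of the paper's implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, label
from skimage.segmentation import watershed

def synthetic_labels_c(shape, rng):
    field = gaussian_filter(rng.random(shape), sigma=5)  # smooth random peaks
    local_max = maximum_filter(field, size=10)           # accentuate maxima
    seeds, _ = label(field == local_max)                 # maxima become seeds
    return watershed(-field, markers=seeds)              # grow labels from seeds
```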

4. The best segmentation was designated as the pseudo ground-truth training data for the untrained 3D MTLSD model, without any proofreading.
5. The 3D MTLSD model was trained on the pseudo ground-truth training data of Volume 2.
6. Inference was done with different iterations of the trained 3D MTLSD model to generate predictions of 3D affinities for Volume 1.
7. Post-processing was done in a grid fashion to generate 3D segmentations of Volume 1 from the different predictions (Supp. Table 3b); a simplified sketch follows this list.
8. All the generated segmentations of Volume 1 were evaluated for accuracy (Fig. 3, Supp.
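A drastically simplified stand-in for such grid-style post-processing is sketched below: connected-component seeds plus watershed are swept over a grid of boundary thresholds instead of the full fragment-agglomeration pipeline, and each candidate segmentation is then evaluated so the best can be kept. Thresholds and names are illustrative.

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def segmentation_grid(affs, thresholds=(0.2, 0.4, 0.6, 0.8)):
    """One candidate segmentation per boundary threshold."""
    boundary = 1.0 - affs.mean(axis=0)          # boundary evidence from affinities
    segmentations = {}
    for t in thresholds:
        seeds, _ = ndimage.label(boundary < t)  # confident interiors as seeds
        segmentations[t] = watershed(boundary, markers=seeds,
                                     mask=boundary < 0.95)
    return segmentations
```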

Table 1:
Overview of datasets. Dataset rows show Volume 1 above Volume 2. A size filter of 500 pixels was applied to the labels before counting objects.