Gossamer: Scaling Image Processing and Reconstruction to Whole Brains

Neuronal reconstruction–a process that transforms image volumes into 3D geometries and skeletons of cells–bottlenecks the study of brain function, connectomics and pathology. Domain scientists need exact and complete segmentations to study subtle topological differences. Existing methods are diskbound, dense-access, coupled, single-threaded, algorithmically unscalable and require manual cropping of small windows and proofreading of skeletons due to low topological accuracy. Designing a data-intensive parallel solution suited to a neurons’ shape, topology and far-ranging connectivity is particularly challenging due to I/O and load-balance, yet by abstracting these vision tasks into strategically ordered specializations of search, we progressively lower memory by 4 orders of magnitude. This enables 1 mouse brain to be fully processed in-memory on a single server, at 67× the scale with 870× less memory while having 78% higher automated yield than APP2, the previous state of the art in performant reconstruction.


I. INTRODUCTION
Lightsheet microscopy [38] has reached a maturity that it can digitize whole mouse brains in 1 day and produce terabytes of data that require image processing and reconstruction (Figure 1).Unfortunately, these new datasets, at baseline, have marginal value since they lack quality and semantic labeling.How do we future-proof software design against the increasing scale of unstructured data to derive meaningful 3D assets?We offer an application that can efficiently transform large-scale volumes into relatively accurate models for analysis.
Intelligent systems (e.g., biological brains, computer microprocessors) have inherent long-range context that benefits from reconstructing as a complete whole.However, existing methodologies can not scale to modern data sizes; we test up to 30 TB raw inputs for this study.Unstructured data has a larger memory footprint and necessitates further computation to extract a useful representation, yet data bandwidth is a bottleneck across all parts of the modern computing infrastructure stack.This makes distributed or cloud strategies impractical, suggesting instead an edge computing approach.Therefore, our design proposes a single system composed of 1 microscope and 1 workstation processing a single brain at a time.This is feasible because a single modern server can have terabytes of memory and multiple CPUs and GPUs with hundreds of cores.Our system can efficiently process or reconstruct a whole brain * These authors contributed equally to this work in a similar time that it takes to image or transfer to the cloud (about 4 days each).
Lack of large labeled datasets of 3D geometries is also an issue in the broader graphics community but this need is more acute in science.For researchers to take advantage of recent data-driven and supervised methods they need highly specific, often novel, training labels tailored to new scientific queries.A further complication is that, unlike in other domains like the arts, accuracy is paramount.Neuroscientists will go to great lengths to ensure that acquired data reflects ground truth exactly before using it for scientific analysis since experimental differences can be incredibly subtle.Transforming automated but inaccurate labels into fully correct ones, which we refer to as proofreading, is a time-consuming interactive human-led process.Understandably, this creates a bottleneck where acquiring the numerous proofread samples required for statistical significance prolongs project iteration time.
Brain research has a constricting limitation: reconstruction.This involves (1) image processing and encoding (2) segmenting cell bodies (seeds) (3) segmenting long-range branches and (4) transforming into curve skeletons [16].Morphology, the 3D shape of semantic objects like neurons, and topology, the arrangement, coverage and connectivity, may be a contributing facet in the study of function, pathology and cell type [46].Reconstruction, a common method of studying neuronal morphology, is a process in which fluorescently labeled cells in 3D images are segmented and compacted into ball and stick models (skeletons) [13].Skeletons are uniquely useful in shape analysis as they distill both topological and morphological features.Recent reviews in neuroscience identified that a scalable [3], comprehensive application [27] for single-cell reconstruction is the greatest need of the community.
Gossamer is an end-to-end pipeline for image processing and reconstruction of lightsheet images with several unique advancements.
1) We introduce several novel 3D and adaptive methods and accomplish enhancement, artifact removal, segmentation and reconstruction of high-scale images.2) We achieve state-of-the-art (SOTA) performance in space and time even when compared to the equivalent 2D SOTA via linear complexity algorithms, which is critical for rhizomatic structures (neurons) that are power-law in size.
Lightsheet Image Acquisition Voxels = 0.4 3  3) We advance spatially sparse methodologies to enable the long-range contexts inherent in neuroscience to be computed entirely in-memory on a single workstation.4) Across many of our workloads, we have the first hardware-accelerated and parallel implementations for multi-GPU or multi-core environments.5) We enable efficient scaling to whole brain volumes (13000×13000×16000 voxels), a 67× end-to-end size increase over the most scalable alternate algorithm.

II. BACKGROUND AND RELATED WORK
Alternative methods fall short in these critical areas which leads to severe losses in automation, yield, performance and accuracy.

A. Lightsheet imaging data
Raw lightsheet data consists of 3D tiles that must be preprocessed to remove camera artifacts, stitched together to form a whole volume, and post-processed to normalize pixel intensity (brightness) and refine the resolution.There are three major sources of non-uniformity in image brightness [34]: (1) some CMOS cameras have midline artifacts that produce a vertical artifact in the center of each tile.On the horizontal axis, there are also horizontal non-uniformities in brightness; (2) some objects in the tissue can partially block the passage of laser light creating horizontal shadows, i.e. stripes; (3) repeated imaging of areas desaturates the brightness unevenly referred to as image bleaching.
It is better to address CMOS artifacts, shadows and inherent image noise before stitching the image tiles to form a whole volume since these issues are confined to each tile, the computational cost is higher after stitching, and correcting them can improve automated alignment of tiles.However, correcting image bleaching after stitching is preferred since it benefits from a wider view of the whole brain image.Lightsheet, like other light microscopy forms, suffers from a lower resolution and a blur of signal on the z-axis.Deconvolution not only mitigates this issue but also increases foreground signal brightness.By addressing global image normalization and enhancing the foreground signal, a simple thresholding algorithm can segment the image (Figure 5a).Although existing software can function at smaller scales [39][10][9], we are not aware of a complete software pipeline that can operate at the native scale of modern lightsheet microscopy.

B. Seed segmentation
Segmentation spatially assigns contiguous pixels of an image that belong to classes with a known meaning, for example, cell bodies.These filled regions can also be broken into instances, separated by occurrence in the volume.Neural network models that leverage convolution like U-net [35] are adept at these tasks at small scales.They also are linear time complexity but have poor runtime and high resource usage due to their space complexity which is on the order of the whole image-i.e., the working set W-which is prohibitive.

C. Cell segmentation
In addition to some unavoidable level of image artifacts, neurons branches are thin, intricate and filament-like.This poses a challenge in automated segmentation and topological accuracy.If the frequency or severity of errors is high, humans will simply discard a neuron or start reconstructing from scratch rather than invest the time to proofread an automated output.We refer to the rate of adequately accurate outputs as the automated yield.
Though the problems in neuron reconstruction are specific, the metrics used do not adequately differentiate the causes of errors.We categorize relevant issues into semantics, In this work, there are only 2 biological semantic categories: cell bodies (shown in green) and branches (blue) as illustrated in Figure 2. Cell bodies, also known as somas, are the root of a neuron's branching projections.Due to their importance in neuronal reconstruction, it is clear that somas are natural seeds (the starting set) in the search for all connected branches.Semantic classification and morphological segmentation can be evaluated with established metrics such as recall and precision which report rates of false negatives and false positives respectively.
Tree topological errors are more nebulous and usually have an upstream cause from the segmentation input to skeletonization.Path breaks are false discontinuities in the segmented branches of cells.In contrast, over-segmentation leads to path collisions-the incorrect joining of disconnected branches or neuronal paths, which are biologically infeasible.Collisions can either be intra (within) or inter (between) neurons; both manifest as topological errors.Collisions between neurons are particularly harmful because they can create connected segmentations, erroneously including dozens or hundreds of other nearby cells.To make up for poor software, scientists assume they must lower the chemical labeling further.While this does lower the frequency of such fused clusters, this approach is short-sighted since it lowers the neuron yield per animal.Lower reconstructed counts per brain mean vastly longer and more expensive experiments.Scalability is of principal importance to handle the power-law distribution of cells and their inevitable clusters.
Poor accuracy can also arise from problems at the image level such as artifacts, thresholding, normalization, pixel-level noise and anisotropic blurring.Neural network approaches tend to have a low-frequency bias and regularization techniques tend to increase dimensional blur instead of correcting for it as demonstrated in NeAT [36].Neural functions also tend to lack continuity on thin signals as mitigated by Lipschitz regularization [25].Additionally, CNNs can be prone to oversegmentation and path collisions since there is not an adequate loss for the topological effects of detection.For example, though less fine than neurons, the thin struts of chairs in ShapeNet [14] often create collisions and breaks.
Tracing from seed locations to their connected branches can be conducted by propagating a wavefront spatially across a grid, in a general pattern known as the Fast Marching Method (FMM) [37].Though FMM is a core part of the top neuronal reconstruction methods discussed in the skeletonization section, it is inefficient both in time and space.Brain volumes are large (many Terabytes), highly sparse (less than 1% foreground) and individual neurons can span the whole volume.When FMM computes these images it leads to extremely low arithmetic intensity and runtimes dominated by streaming dense image regions from disk repeatedly [29].
Very similar in process to Dijkstra's algorithm, FMM leverages a priority queue to choose its next visited location, which dictates its O(nlogn) time complexity.The FMM computation differs from Dijkstra's in that it stores each path distance across the whole domain grid, which makes it O(n) in space.The Fast Sweeping Method (FSM) [47] improves the time complexity to O(n) by sweeping updates in diagonal planes.Refer to [47] for an explanation of how the diagonal access pattern converges to the same solution as pulling updates from a queue.The parallel FSM algorithm [48] leverages the fact that dependencies are independent along these planes.FSM is an excellent basis for implementation of applications since it is both parallel and has the favorable algorithmic benefits of sweeping.
The OpenVDB library [31] provides methods and data structures for spatially sparse volumetric data (Figure 3).However, modeling the domain in a dense fashion is inefficient.Much of the advantages of voxel grids-an almost infinite domain size and lower computation and memory-stems from their innate concept of a sparse active foreground set.VDBs, like other similar sparse data structures such as B+ trees [8], tradeoff memory footprint and access speed, due to hierarchically compacted data at various resolutions.Due to an optimized hash map, they retain average case O(1) even for random access, updates and deletions.Higher tree depths yield better compaction of sparse signals at the cost of indirection steps (cache misses) at each level.Following from the observation Fig. 3.A visualization of the various spatial levels of granularity of VDB grids on a sparse neuronal slice of data.The outline of the cubes are highlighted to get a sense of the 4 scales.The sparse working set (Ws) of foreground voxels is shown in red.Active leaf node regions-those that contain a foreground voxel-are outlined in blue.Active internal nodes regions are outlined in yellow and active secondary internal nodes in green.Empty leaf nodes and internal nodes have no outlines and have tiny memory footprint overheads.Refer to [31] for a further explanation of this terminology and VDBs.Fig. 4. The ball (spheres) and stick (obscured connections) abstraction that neuroscientists work with.Note that this is a spatially embedded tree graph rooted at the seed which is the blue sphere at the bottom left corner.Red spheres are path nodes, yellow spheres are termini/leaf nodes and green spheres are branch nodes.Refer to [13] for a further explanation of tree graph (SWC) terminology.that accesses are generally spatially coherent, VDB reads and writes are mediated by an accessor structure which leverages coherence when finding new elements via a bottom-up search from a cached set of recent visits.
The drawback of OpenVDB is that operating even on the whole sparse working set (W s ) is excessive.For medical imaging volumes such as brains, these methods would consume an estimated 40 TB of peak RAM which is currently beyond what is feasible on a single workstation.We explain these working set (W ) definitions and how to circumvent these issues in Section 3.

D. Skeletonization
Abstracting segmentations into ball (with radii) and stick (connections) models (see Figure 4) is known as skeletonization.This is useful for (1) efficient proofreading and interaction with labels and (2) analysis of morphological and topological features and errors.Neuroscientists primarily model cells and their topological branching patterns with these skeletal graphs.More specifically, they use trees: each node in the graph has a single unidirectional edge which traces a path back to the tree's seed (the cell body) [13].This process is bottlenecked by scalability because of entrenched domainspecific methods.
There are hundreds of semi-automated (requiring human correction) implementations specialized for neuronal recon-struction [1,27].However, these methods lack scalabilitythe ability to handle large image regions.This limits their usage to (1) neuron types that are highly localized, (2) small hand-cropped volume sizes, and (3) neurons with little or no collisions.To our knowledge, there is no other method for reconstruction that does not either manually or automatically crop the neurons arbitrarily for performance reasons.If these alternative methods did not do so, neurons with collisions would be lost due to timeout or excessive reconstruction time.The majority of other algorithms also operate in 2D due to runtime concerns and maintain both a heavy objectoriented view of the data while also consulting the dense representations of the image volume.This coupling results in extremely high memory footprints and low arithmetic intensity as we demonstrate in Section V. Enabling full concurrency and exploiting the cache hierarchy through causal and spatial sparsity nullifies these issues, but only if every working set can fit entirely in memory (see Section V-A for the specifics of our system).
Based on benchmarks comparing numerous neuron reconstruction methods [45][28], APP2 [44] has the lowest runtime and resource usage and the best algorithmic complexity.APP2 also placed in the top 5 highest accuracies, with Neutube [18] placing first in [45].Neutube [18] as well as many highperforming reconstruction methods [4] employ a fastmarching algorithm similar to APP2.Such methods are all of O(nlogn) complexity.As mentioned because of excessive runtimes, a common practice is to manually crop around a neuron which can create cutoffs, artifacts, and intensive human intervention.For example, a recent prepublication [26] with similar pipeline goals only allows a maximum crop size of 512×512×256 and applies APP2 on these crops due to performance constraints.Such small sizes are only suitable for a subset of neuron types.APP2, like other available iterative reconstruction methods, is only single-threaded.This is why large-scale applications must start multiple APP2 programs for each seed to enable some concurrency [23].We compare to APP2 for performance and accuracy metrics.

E. Application Pipeline
Neuroscience data sizes tend to increase at the rate of 1000× every 4 years.This explosion in the acquisition of unstructured data creates problems which we explore below.
As shown in Table 1, Mouselight [43] is an interactive visualizer of large volumes.Its respective memory footprint reflects the dense volumes size of the images we tested.It relies on a human-guided tracing of neurons which takes several hours per neuron.APP2 [44] reflects the max sizes we could complete on a 4 TB system with our application (in single threaded mode) to auto-crop the region to its bounding box.The commercial software Imaris has an alternate cropping feature but the process is manual per neuron and also fails on images this size.Our method, APP2 and an APP2 based pipeline [26] all trace neuron branches automatically, though they need subsequent proofreading.
Creation.This involves tools for generating, editing or authoring assets.Much like the autonomous driving space, in microscopy, large-scale noisy 3D volumes are plentiful yet instance-level models (neurons), especially those that are corrected and labeled by type are limited.The diversity and uniqueness of scientific queries similar to artistic endeavors makes creating specific data even more valuable.We contribute an application that creates a variety of 3D representations.From raw images, we improve image quality, segment cells and output volumes, 3D models (meshes), skeletons and training labels (image window crops).We also provide the test suite TreeBench to establish the accuracy of these 3D assets.We rely on the existing tools Neutube [18], Fast Neurite Tracer (FNT) [19], and TeraFly [12].All of these are volume-rendering GUI tools for either (1) proofreading our automated outputs or (2) creating neuron skeletons or seed labels from scratch (in the case of inadequate skeletal accuracy).
Mastery.This relates to utilities that facilitate discovery, search, reuse and understanding of data.Solving issues in scale brings new problems in the management of numerous outputs.Due to operating on whole brains, we generate hundreds of automated reconstructions per brain.As a result of their multiplicity, these reconstructions have diminishing marginal value unless they are automatically sorted according to their quality which has become untenable to do manually.We filter and tag outputs automatically based on quality.
In the creation of neurons, we use low-bias unsupervised techniques.Our aims are minimal: we seek that paths are continuous, do not collide and do not have unfaithful mutations.Generic transformations are more adaptable because the scientific study of morphological differences is subtle and can be corrupted by the influence of out-of-distribution training data.This allows the outputs to be generic and therefore more likely to be candidates for reuse in future scientific queries.Finally, a semantic understanding of cell bodies and their branches allows automatic tagging and labeling of images.This automation can help bootstrap future methods and more specialized experiments on unforeseen data types.
Distribution.Contributions in this category aid the dissemination and sharing of assets between groups.Mouse brain volumes are on the scale of several terabytes, thus sharing and storage are problematic.Since our end-to-end application is essentially domain-specific compression, whole brains have extremely low memory footprint representations (less than 2 GBs) which can be output at various stages, resumed and shared across computers or teams easily.We apply strict final passes of the skeletons to adhere to the community standards established in StdSwc [5].This allows interoperability with outside or legacy tools and is a requirement before it can be added to the community database NeuroMorpho.org[2].

III. METHOD
Computer vision tasks involving 3D geometry or volumetric data should leverage not only spatial but also causal sparsity.Spatial sparsity reduces computational demands by exploiting local structures in data.Segmented regions and their values, for example, tend to be spatially coherent within various scales.Languages [22], libraries [31] or applications [30] that leverage patterns in spatial distribution are an increasing necessity on simulations at scale.
Causal sparsity further decreases resource usage by dynamically growing a smaller working set as needed from key salient structures or regions.We term the set of these key features or elements seeds and refer to them as {S}.This approach separates a large-scene into object-level tasks where instancelevel computation can exploit more temporal locality.Spatial and causal sparsity complement each other.Not only can they share data representations, but their locally restricted access and search patterns can offer best-in-class performance.
The core of our method can be broken down into 3 specializations of search: 1) W s scan: traverse a dense working set W such as an image volume and choose a semantic subset W s .2) {S} scan: find key elements termed seeds {S} within W s .3) W c select: iteratively apply a recurrence relation from {S} to select a causally sparse working set W c as needed.These patterns are particularly effective on data-intensive parallel applications.For example, this combination of spatial and causal sparsity dramatically increases the image scale that can be processed on a single workstation.In the context of this manuscript, the dense 3D cube (the whole image) is the original working set W . Refer to the grid diagrams (labeled 8-11) in Figure 5a.In these grids, the labeled regions are shown with white colored pixels-this is the spatial sparse working set W s .The blue colored regions are the cell segmentations that specify a causally sparse working set W c .Refer to Table IV for the element count reductions.W, W s and W c are three views of the same data.Their successively smaller element counts lead to empirical performance gains.Problems, especially those arising from natural data, often start with a dense working set W that can be compacted to a spatially sparse working set W s .W s represents a set of relevant elements narrowed spatially, whereas W c is the often much smaller dynamic set of values derived from causal selection.W s scan can also minimize the representation or place it in a specialized data structure to suit later computation.Selective search allows algorithms to terminate early, thereby preventing unnecessary accesses and  2).In this work, our example scan and select functions are of linear complexity which is essential for problem sizes with power-law distribution in scale.Scans are trivially parallelizable and are generally decomposed spatially, whereas selection is iterative but parallelizable across seeds.
Narrowing the working set is effective across varied graphics methods because data is often relevant mainly in contextual relation to seeds.This relevance can often be decomposed into a chain of computations leading back to a seed.Therefore, exploiting causality involves identifying seeds and allocating or weighting resources and computation according to their seed relation.

IV. IMPLEMENTATION
We abstract these design patterns and apply them individually, on 4 example computer vision workloads, and in combination with a challenging application.Each example workload described below can be broken into scans and selections.At a more abstract level, each workload itself is analogous to one of these patterns, especially when viewed in a feedforward pipeline.Encoding is analogous to a W scan, seed segmentation to a {S} scan, cell segmentation, a W c selection and skeletonization, a W s scan.These scan and select patterns offer a neuron reconstruction method that is fully parallel, hardware-aware and entirely in-memory.

A. Encoding
Image processing enhances the image.This allows continuing with a far more minimal view of the image while still retaining the structures of interest.
1) W scan: Destriping applies a 2D Gaussian and custom filters to remove camera artifacts and fine nonuniformities in the image tiles.2) W scan: Debleaching removes coarse non-uniformities in the stitched image.3) W scan: Deconvolving improves the axial resolution and improve the brightness of the foreground signal.4) W s scan: Conversion traverses the image volume, placing voxels above a value in a spatially sparse data structure (VDB).W s is the segmented mask of the image which holds all foreground voxels.
Destriping: Using 2D wavelet decomposition, the image is decomposed into horizontal, vertical and diagonal coefficients.The horizontal and optionally vertical coefficients are decomposed with 1D Fourier transform on the appropriate axes.The Fourier transforms are multiplied to a Gaussian notch filer with a desired sigma.Finally, using inverse Fourier transform and 2D wavelet recomposition the image is reconstructed.
Debleaching: First, the stitched image is destriped bidirectionally to correct regions that are directly bleached by the laser beam.Finally, a low-pass filter is applied to the image in order to capture the gradual change in the brightness of the background from the surface of the brain to its center that happens because of poor penetration of laser light.The image is divided by the filter to reverse the bleaching.
Deconvolution: In lightsheet microscopy, the image of a sphere appears convolved by an ellipsoid kernel.By knowing the kernel shape, the original shape of the object can be restored via inverse convolution operation.
Conversion: Due to the substantial increase in signal to noise ratio from the previous steps the image can be segmented by keeping pixels with sufficiently high intensity (brightness).To make conversion more robust across different brains, we use a default desired foreground percent and compute a hard threshold value that most closely matches the foreground percent.We do so with quicksort, the only place in the pipeline outside of the skeleton standardization passes that is not of linear algorithmic complexity.
1) Parallel Implementation and Software Pipeline: Scan steps are embarrassingly parallel in nature.All stages at the encoding level must operate on the extreme size of the dense W volume (see Tables II and IV).Therefore, we parallelized all W scan steps across multiple GPUs and CPU cores and in so doing developed several programs with significant advancements in scalability.
1) We implemented a multi-GPU accelerated version of Pystripe [39] that has a novel bidirectional destriping and Gaussian denoising algorithm as well as a faster parallel processing model.2) Our software offers a multi-GPU accelerated version of TeraStitcher [10] image stitching software.

3) We developed a multi-GPU and parallel version of
LsDeconvolve [9] -an open-source program for the deconvolution of lightsheet microscope images.4) process_images.py is a script that automates the stitching process.5) conver.py, is a high-scale parallel file conversion tool for the various formats shown in Figure 5b.6) The FNT cube processor program can apply the destriping algorithm on the z-axis of FNT files.Our conversion tool can read Imaris files, 2D TIF series, or XML output of Terastitcher to convert them to processed 2D TIF series, Imaris, TeraFly, FNT or MP4 video while applying the above-mentioned filters (destriping, bleach correction, baseline and background subtraction, 16-bit to 8-bit conversion), or resizing or downsampling images to a desired isotropic voxel size.Our stitching implementation can also optionally apply a destriping algorithm, a low-pass filter to the stitched image to remove coarse bleaching artifacts and can downsample the data.Finally, all the above-mentioned image processing tools are in-memory to reduce IO overhead and also have resume support and advanced error-handling mechanisms to reduce day-to-day operational challenges.See supplementary notes 1 to 8 for more information.

B. Seed segmentation
It is critical to first detect cell bodies (seeds, shown in green below) from brain images (black and white grid) to further segment each neuron's connected branches.We do so with the following (steps 8-10 in Figure 5a): 1) {S} scan: convert the segmented mask VDB from the previous thresholding step into a VDB which stores voxels only at the surface of the foreground W s .{S} is then the surface voxels of W s (step 8).2) W c select: seeds are segmented by successive region growing steps outwards and inwards from {S} (step 9-10).W c is the resulting segmented cell body surfaces.Cell bodies are large and spherical and can therefore be segmented by spatially expanding and contracting the fore-ground regions.The images are chemically labeled only on the cell surface due to the usage of a novel genetic technique [41] specially designed to improve reconstruction on thin, finegrained and long-range structures.Since the signal inherently describes shells [41], it lends itself naturally to the sparse level set methods provided in [31] which our application exploits effectively.To fill hollow cell bodies we perform morphological dilation.Since dilation expands the sizes of signals we apply equal and opposite erosion.Together this is known as morphological closing.To remove the cell branches which are thin and tubular we employ erosion followed by an equal amount of dilation, which is referred to as opening.This recovers seed regions whose geometries are relatively unchanged.Refer to the grid diagrams of Figure 5.The output of seed segmentation is the green regions.We use a singlethreaded implementation of connected component analysis provided by the OpenVDB library to separate all somas into a set of instances.

C. Cell segmentation
We can segment entire cells (step 11 Figure 5a) by computing reachability (blue regions below) from the known cell connected component analysis.This yields all instance segmentation surfaces separated into individual neurons.We employ the fastsweeping method in [32], but rather than use it globally we start it from the key semantic areas of interest: the cell bodies from the previous step, and allow it to naturally terminate once its computed distance on all reachable voxels.In the grid at right, notice how the reachable voxels have a distance computed whereas the unreachable voxels are empty indicating an invalid value that is discarded with a subsequent parallel scan.Fastsweeping is generally applied on dense 2D or 3D regions (i.e., W ).
The VDB library improves on this with a sophisticated sparse implementation.This is a key advantage to VDB's sparse methods, although one can have an almost infinite domain size, performance costs only reflect the sparse voxel count W s .However, when W s is still too high, as in our case, application on all of W s is a drawback.Instead, we start the selective search only from known cell bodies.This further improves performance and reduces peak memory usage since only the reachable voxels (e.g., W c ) are accessed during the W c select.Tables II and IV enumerate the differences between W s and W c .Table IV in particular indicates the reduction in computation and memory of our approach.In addition to an implied contiguity (reachability), fastsweeping also computes maximum distance from any cell body surface.This is employed to finally prune the voxels that have a default (maximum distance) value.Although this approach still entails a final prune over all of W s , pruning is a scan which, unlike fastsweeping's selective search, has perfect parallelizabilty, minimal computation and memory requirements and almost no overhead.
During segmentation, we also optionally employ morphological closing of the branches.Whereas the foreground percent parameter merely keeps or rejects different values in the image, closing is a rather biased mutation of the morphology.In theory, it mitigates path breaks at the cost of path collisions, a relationship we explore in the results.A higher foreground percentage may be preferable for faithful surface reconstructions however closing is tested as an alternative method to keep the global W s low.

D. Skeletonization
Skeletonization of the segmented cells creates a compact ball and stick representation-a spatially embedded graph with radii information (steps 12-15 in Figure 5a).Reference [16] gives a formal definition of curve skeletons.This allows neuroscientists to effectively study topological features.We do so via: 1) W s scan: traverse the set of surface voxels and create W s -the spatially sparse mesh that retains topology (step 12).2) {S} scan: using heuristics, choose a set of seeds from W s (step 13).3) W c select: from each seed, iteratively search for a separator by expanding a sphere and adding encompassed vertices (step 13).We follow the local separators (LS) skeletonization technique [7][6] but add an initial W s scan to prevent high runtime at high resolutions.Due to multi-level coarsening of the surface mesh and projection as explained in [6], skeletonization achieves O(n) complexity.LS is particularly scalable on tubular signals especially with mesh (surface) representation-a volumetric representation would have less favorable performance.
The LS has a rather subtle advantage in reconstruction fidelity over other methods: it is topology-native.It can recover essentially all, even tiny, highly fine-grained protuberances and bumps while still capturing topology at vastly larger scales.This becomes apparent when switching to a 3D surface representation, where we find that neurons have small spines and projections on the surface.These details are often not visible in existing volume rendering tools unless the volume is binarized and projected in 2D.This ability stems from LS employing the select pattern until the wavefront (neighbor vertices of those visited so far) splits into more than 1 connected components.Our approach decouples anisotropy, resolution and scale differences in structures irrelevant from the skeletonization process.Unmanaged, this ability would come at a cost to runtime.
Since neuroscientists historically are only interested in the broader topology of branches and not surface features we remove these before skeletonization via, an optional, surface smoothing of the mesh.Since this mesh pre-coarsening step has the added benefit of a net gain in speed (about 44×), it follows that we use the simple and fast light-edge matching approach which is also employed in the LS work.This method guarantees not to introduce new path breaks or path collisions (constant number of connected components), but can produce artifacts related to smoothing.It operates by a user-specified number (coarsen steps) of successive linear scans.
Mesh coarsening allows us to reconstruct only the desired topology of neurons while remaining accurate and performant regardless of the resolution of the image.Even higher resolutions would benefit further from this option, therefore coarsening steps is a runtime parameter determined by default by the image's voxel size.APP2 does not use a mesh intermediate representation which makes its performance and resource usage highly sensitive to resolution.

A. Data, Setup and Code Availability
We intentionally implement for a single workstation, as that could allow for the application to be deployed on the same microscope computer that acquires raw images.For massively compressive pipelines such as this, we advocate for computation on the edge as opposed to distributed due to the prohibitive network cost of the latter.Therefore for these workloads, we recommend a single workstation with 2× AMD EPYC 7763 (128 cores/ 256 threads) CPUs, 4 TB RAM and 8× A100 GPUs.The large memory capacity allows data sizes of 1 whole mouse brain at 1um 3 resolution to be processed entirely in-memory for all tasks.We also test an equivalently sized volume at the higher .4um 3 resolution to test accuracy effects.The 1um 3 brain has 374 gold-standard neurons of a variety of neuron types with which to compare.Note that only a highly sparse subset of neurons (500-5000) per brain are labeled and only on the cell surface due to leveraging a novel genetic technique [41].Some with far-reaching branches that span a large portion of the volume.In contrast, the 181 labeled neurons used for the .4um 3 dataset are local and bush-like which allows us to compare to APP2.Both image volume dimensions are approximately 13000×13000×16000 voxels, thus the higher resolution image only captures approximately half of a brain.
We manually proofread all our automated outputs twice using Neutube [18] or Fast Neurite Tracer (FNT) [19] to establish our ground truth skeletons for comparison.We embed an APP2 implementation into our codebase to lend it automated seed segmentation and windowing around the neuron for a fair comparison.We also supply APP2 with our foreground percent algorithm since its native segmentation method, a raw background threshold value, fails completely.
We tested various foreground percentages with TreeBench and found metrics tend to peak around .8%.Unfortunately APP2 can only complete a single neuron (the smallest one) at this density, therefore all APP2 runs are at .4%.These parameter choices illustrate how entangled performance and accuracy is for poor scalability algorithms.Code for the reconstruction and image preprocessing pipeline is available at https://github.com/UCLA-VAST/recutand https://github.com/ucla-brain/image-preprocessing-pipeline respectively.

B. Encoding
1) Empirical scalability and performance: For our images, destriping and debleaching take about a half a day to complete.Deconvolution's runtime is also about a half day.As illustrated in Figure 7, conversion of the dense image W to the spatially sparse VDB representation W s takes 15.5 hours on a single core and 21 minutes on 128 cores, which is about a 44× increase in speed.Before the VDB conversion the working set was over 3 orders of magnitude larger than the spatially sparse working set W s used for the remainder of seed segmentation (Table IV).
On a Windows workstation equipped with 2x Quadro RTX A6000 GPUs and 2x Xeon Platinium 8260 CPUs, GPU  Processed / Original Image Size × 100 Foreground Percentage Fig. 6.The effect of each image processing on image size both in memory and disk.Image processing steps and unstructured data (for example, TIF format) do not have any impact on memory usage but they can improve the data compressibility on disk.VDB format on the other hand can dramatically reduce both memory and disk usage.(a) Unprocessed TIF format occupies 100 percent of the disk and memory.The same data after thresholding and sparsification occupies less than 10 percent of memory and about 1 percent of the disk using a disk-assisted compression method.Interestingly, different stages of image processing increase the data compressibility not only for VDB files but also for compressed TIF files by removing noise and fluctuation in the signal.Using mask-VDB, the memory footprint will be a mere 1.3 percent for reconstruction in our pipeline.
acceleration improved the destriping speed from 10 image/s to 73 images/s in our testing.
2) Accuracy: We test the efficacy of preprocessing steps in 3 ways.We confirm qualitatively that known artifacts are removed after each stage (top row in Figure 5a).Compressibility is a proxy for signal-to-noise ratio.Figure 6 shows that each stage other than deconvolution improves the compaction ratio of the image.Deconvolution sharpens the image which can exacerbate high frequency noise or surface details hence the slight increase in size.Finally, we test each encoding stage with respect to reconstruction accuracy as discussed in Table VIII and the ablations section.
C. Seed segmentation 1) Empirical scalability and performance: When fully parallelized, seed segmentation lasts roughly 35 minutes per brain.This is longer than the other workloads VDB conversion, cell segmentation and skeletonization combined (32 minutes) due to the single-threaded connected component analysis at the end of seed segmentation.The other two portions, morphological closing and opening, only take 48 and 5 seconds respectively.
2) Accuracy: Automated seed detection has high recall as seen in Table III; it can recover .94 of true positive (TP) seeds.However it produces a high proportion of false positives (FP); only .06 of predicted seeds were true positives.We explain subsequent work to mitigate this behavior in the discussion.Though scaling is sub-linear, added cores still yield significant speedup.More importantly, additional cores allow each task a higher peak memory ceiling which is critical at these scales.1) Empirical scalability and performance: Fastsweeping completes in under 9 minutes and cellular connected component segregation requires 23 seconds.Computing reachability from known seeds narrows the working set by almost 6× (Table IV).

D. Cell segmentation
2) Accuracy: Fastsweeping is a simple heuristic that we use indirectly to find regions contiguous to a seed.Its utility depends on the continuity of the image after thresholding and conversion to the VDB representation (W → W s ).In other words, it still suffers from path breaks or collisions if they remain in the converted image.Below, we examine the downstream effects on topology to best assess its efficacy.

E. Skeletonization
1) Empirical scalability and performance: Observe the empirical results in Figure 8: our method uses 3 orders of magnitude less memory per neuron than APP2.The expla-   nation for this difference can be traced back to table IV.Per neuron, APP2's is forced to operate on W , which is at worst 4 orders of magnitude larger if a neuron spans the whole brain.In contrast, our method is on the order of W c both in time and space at the time of skeletonization.In addition to reading and retaining the dense 3D bounding box (crop window) of a neuron, APP2 also densely allocates several scratch pads, also on the order of W .On our system with 4 TB of memory, the largest neuron that APP2 could reconstruct (right most red cross) was of size 3788×5380×1995.This largest cluster took about 2609 GB of peak RAM.Our method can operate on a neuron or many neurons simultaneously that span the whole brain (67× larger than APP2) while also using 870× less memory.Skeletonization of 374 seeds takes under 2 minutes fully parallelized and under 43.5 minutes for 1 core (see Figure 7).Although APP2 is the fastest and most scalable alternative method for single cell reconstruction, our method can skeletonize even the largest clusters in seconds where APP2 takes over a 100 minutes.Surface coarsening is largely responsible for this speedup since it allows reconstruction at a userspecified resolutions.We use 2 or 4 steps of coarsening since its more suitable for recovering topology (as opposed to surface structure).Note that APP2 pruning must approximate only in 2D due to performance, whereas our methods are all natively 3D.We summarize the differences in Table V. 2) Accuracy: DIADEM [20] is the gold standard comparison tool for neuron reconstructions largely because it has substantial correlation with both 1) proofread times and 2) expert subjective opinion on automated quality.DIADEM is a strict tree topological similarity score which yields 1 for a perfect match.For each ground truth skeletal node DIADEM checks 3 criteria: 1) Is there a matching node in the test skeleton within a Euclidean distance?2) Do the potential node matches from step 1 have a corresponding matching ancestor node? 3) Is the path distance between node and ancestor and potential match node and its ancestor roughly similar?Step 1 reports the basic recall or completeness of the skeleton.
Step 2 prevents nodes on different or incorrectly directed branches from counting as a match.Step 3 further lowers the chance that nodes with a correct ancestor but a different path back to a seed get counted as a match.The distance threshold is dataset specific and is determined by the diameter of the largest branch point of all gold standard skeletons.For both datasets tested this value is set to 8um.
DIADEM is a good starting point for understanding the tree accuracy, however its largest drawback is a lack of transparency.To compensate, we isolate basic properties that correspond to known error types deduced during proofreading.Each of these topological qualities (see Table VI) helps explain DIADEM's aggregated score.We term this suite, along with DIADEM, TreeBench.Its implementation is available along with the rest of the example benchmarks.
In tables VI, VII and VIII, topology indicates the DIADEM score which is an aggregate tree similarity metric.Recall is the fraction of ground truth skeletal nodes that have a matching node in the test.Branch and leaf also report the proportion of branches and leaf node matches respectively.Direction indicates the proportion of only the matching branch nodes that are also oriented in the correct direction which is why the score is often higher than the branch score.Precision is the fraction of test nodes that correspond to a node in the ground truth.Yield gives the proportion of automated neurons that had a topological accuracy within 2 standard deviations of the mean.Count illustrates the number of input seeds (cell bodies) used.The topology score reflects only the successful reconstructions, whereas all other scores aggregate errors across both successful and failed reconstructions.
When we adjust our mesh resolutions such that our runtimes roughly match those of APP2 per neuron, our yield is 78% higher (Table VI).We lend APP2 our foreground percenting algorithm (at the same .8%)for this comparison since it most closely matches the thresholding used by [26] another recent pipeline.APP2 accuracy and yield are far lower due to lack of the image processing steps of destriping, debleaching and deconvolution.
Higher resolution can achieve better overall yield and less path breaks at the cost of more collisions and artifacts.Path breaks are the biggest loss in accuracy since a substantial fraction of ground truth skeleton nodes do not have a match in our automated outputs.Branch nodes tend to be more proximal to a seed and as expected have higher accuracy whereas leaf nodes which are the terminal points of branches have low recall.Of course, the global recall is in between both branch and leaf node recall, since it is an aggregate for all of a neuron's nodes.
More complete neurons is often a priority when reconstructing neuron types that contain longer projections (axons).Reconstructing these neuron types is a key advantage since other pipelines can not capture them automatically at all due to performance.However, a collision or path direction error is much more time consuming to correct since it can involve deletion of a substantial proportion of the tree.In contrast, current proofreading tools are optimized to interactively fix path breaks.A lower foreground percent reduces the overall completeness and termini (leaf) recall due to breaks.Closing boosts the automated yield by a similar amount to the combined image preprocessing stages, however it only has these benefits when combined with these enhancements.However, decreasing the morphological closing (1 step instead of 4) improves individual scores of TreeBench which follows our observation that it is the most biased transformation we apply.For example, closing can move skeletal nodes to such an extent as to lower the matches.Branch points are particularly affected by closing, presumably due to erroneously joining the morphology of branch points further distally.

F. Application Pipeline Ablations
The highest yield occurs when applying all developed image processing techniques and additionally downsampling the zdimension by a factor corresponding to the inherent point spread function of lightsheet microscopy.This smear in z is rather dramatic (about 2.5×) and deconvolution does not effectively remove it.It also limits the smallest resolution that skeletonization can place nodes and higher sampling leads to more accurate and natively smooth graphs.VI.DISCUSSION AND FUTURE WORK Seed segmentation is effective at localizing and segmenting potential seeds, but performs poorly at distinguishing them from the background.This is likely because cell bodies are generic simple spheres that are difficult to discriminate from background noise after morphological closing.Adding a lightweight second pass classifier such as a multi-layer perceptron to cull false positives seeds could enable competitive accuracy.As it stands, humans must filter the seeds manually.Though these false positives can usually be batch deleted quickly with existing tools because they are clustered in unintentionally high saturation areas of the image.
It takes a similar time to compute reachability on a whole brain scale as it does similar methods like APP2 to complete fastmarching on moderate sized neurons.Cell segmentation is particularly fast because 1) it applies sparse fastsweeping 2) it starts with a large broadly distributed set of seeds which enables high concurrency with little ramp up and 3) unreachable regions are not accessed until the final prune step.Since the sparse working set W s of cell segmentation is about 2 GB, it can stay in memory after conversion, and there are no disk read penalties of long-range connections.Other methods couple their algorithms with dense images, thus incurring reads from disk.
The local separators method [7][6] we used has linear time complexity.This is a major improvement over APP2 and Neutube's O(nlogn) complexity which has compounding performance penalties when combined with the power law distribution of components.Although the number of seeds is generally larger than the thread count, some components are orders of magnitude larger.This leads to poor load balance, with some neurons taking much longer to complete regardless of thread count.
Our results suggest that even higher resolution would improve accuracy but scales much larger than 13k , consume more than 4TB peak memory and are thus not possible to process with our method and reference system at this time.Still, we show that computational techniques can improve yield and accuracy with available resolutions.

A. Conclusion
It is rather surprising that generalizations of search, when ordered effectively, can alter the scale in which neuroscience is performed.Competing methods employ highly integrated domain-specific heuristics that are coupled with dense image representations making them infeasible to parallelize.Our modular approach is more efficient and can be specialized for a particular representation and on growing datasets which allows it to gain in light of future data driven and supervised techniques.

B. Contributions
We would like to thank the following authors and acknowledgments for their contributions to this work.We thank Tony Nowatski for his suggestions on an early paper draft.Muye Zhu provided the foreground percent algorithm and inspired the pipeline with the original MCP3D pipeline.Muye Zhu, Masood Akram, Chris Choi, Hongwei Dong, Chris Sin Park, X.William Yang, Zhe Chen, and Jason Cong advised on application and pipeline design and provided helpful discussions and feedback.Masood Akram helped evaluate reconstructions on gold standard datasets.Yuze Chi implemented an accelerated version of graph partitioning and fastmarching on FPGA.Chris Sin Park, Chris Choi and X. William Yang provided the mouse imaging data for the .4umlightsheet volume.Keivan Moradi and HongWei Dong provided the mouse imaging data for the 1um lightsheet volume.Chris Choi and Ming Yan led proofreading efforts related to the project and provided proofread soma locations for the .4umimage.Ming Yan analyzed soma detection accuracy, provided scripts for overcoming manual steps in the pipeline including neuron filtering methods and designed a soma proofreading protocol.Brian Zingg and Chris Sin Park performed the surgeries needed to generate genetically labeled brains.Mitchell Rudd and Chris Sin Park performed tissue clearing and lightsheet imaging.Adriana Gutierez, Hyun-Seung Mun, and Qing Xue humanvalidated all reconstructions used in the manuscript.Sumit Nanda and Keivan Moradi, programmatically post-processed SWC files to prepare them for analysis.Sumit Nanda carried out preliminary testing of reconstruction output comparison with manual reconstructions using DIADEM metric Keivan Moradi designed and implemented image processing stages (encoding), wrote the corresponding portions of the manuscript and also second-pass proofread all 555 neuron reconstructions used in the manuscript.Karl Marrett designed and implemented application portions related to VDB, reconstruction and TreeBench, and performed the analysis and wrote the manuscript.This research is funded under NIH Grant No.: (U01MH117079-01).

Fig. 1 .
Fig. 1.Whole-brain lightsheet data.(a) Data acquisition and processing pipeline.(b) A sample mouse brain before and after tissue clearing.(c) Since the field of view of the microscope is limited and cannot cover the entire brain in one scan lightsheet data is in the form of a 2D image series that form 3D tiles.(d) Image stitching is needed to make a whole-brain image.(e) A 3D rendering of the image (black and white) and some of the reconstructed neurons (colored part) that shows the scale of the problem.

Fig. 2 .
Fig. 2. A schematic of various error types and their relative importance.The diagram itself loosely resembles the topological structure of a cell.

Fig. 5 .
Fig. 5. Detailed data processing pipeline towards the reconstruction: (a) We suggest a specific order in which the data needs to be processed to improve the quality of reconstructions.(b) Gossamer pipeline produces a wide variety of outputs for different needs.(c) A sample reconstructed cell that is visualized in neuTube software and is ready for human validation.

Fig. 7 .
Fig. 7. Aggregate runtime (left) and throughput (right) of the example tasks.Though scaling is sub-linear, added cores still yield significant speedup.More importantly, additional cores allow each task a higher peak memory ceiling which is critical at these scales.

Fig. 8 .
Fig. 8. Crosses (APP2) and circles (ours) compare the memory allocated during of skeletonization of individual connected components (ideally 1 neuron).APP2 takes roughly 1000× more memory per neuron which is problematic with large clusters with intersections.

TABLE I WE
COMPARE THE MAXIMUM SCALE ALLOWED FOR A SINGLE NEURON AND THE MEMORY COST ASSOCIATED (RAM) TO 3 MAJOR RELATED WORKS.OUR METHOD IS 67× LARGER IN SCALE WITH 870× LESS MEMORY THAN APP2 WHICH IS THE MOST EFFICIENT ALTERNATIVE AUTOMATED APPROACH.

TABLE II SINCE
EACH STAGE PROGRESSIVELY SHRINKS THE WORKING SET, W, Ws AND Wc TAKE ON DIFFERENT MEANINGS AT EACH STAGE.THE DERIVED OUTPUTS OF ONE STAGE ARE OFTEN THE INPUTS TO A LATER STAGE.For example, if regions of an image are not reachable from a seed and thus not causally linked then they are ignored (see the connectivity of the diagram in Figure c ) as opposed to O(W s )).
W s scan: traverse the grid to prune voxels whose values indicate unreachability.The kept voxels with a valid reachable distance compose a new W s .3)W c select: from the new W s , separate disjoint cells in a

TABLE III SEED
SEGMENTATION COORDINATE ACCURACY WITH 1 RADIUS WIDTH TOLERANCE FOR 1 WHOLE MOUSE BRAIN IMAGED AT 6X 1UM 3 RESOLUTION.RECALL IS THE FRACTION OF TRUE POSITIVES (TP).PRECISION IS THE FRACTION OF FALSE POSITIVES (FP) TO TP.THE SUM OF TP AND FP REPRESENTS THE TOTAL AUTOMATED SEED AND CELL

TABLE IV THE
WORKING SET ELEMENT COUNTS FOR THE DENSE IMAGE VOLUME W, THE SPATIALLY SPARSE FOREGROUND WS AND THE REACHABLE CELL SEGMENTATIONS WC .THE FACTOR COLUMN SHOWS THE SIZE INCREASE RELATIVE TO WC .WS IS ROUGHLY AN ORDER OF MAGNITUDE LARGER AND W IS 4 ORDERS OF MAGNITUDE LARGER.

TABLE V A
COMPARISON OF THE TWO MAJOR STAGES OF SKELETONIZATION WHERE n INDICATES THE TOTAL WORKING SET COUNT AND r INDICATES THE RADIUS DISTANCE OF A PARTICULAR NODE.

TABLE VI SKELETONIZATION
ACCURACY FOR 1 WHOLE MOUSE BRAIN AND 1 HALF BRAIN IMAGED AT 1UM 3 AND .4UM 3 VOXEL RESOLUTION RESPECTIVELY.OUR METHOD AT THE HIGHER RESOLUTION HAS THE BEST YIELD INDICATING THAT EVEN LARGER SCALES WILL HAVE BETTER ACCURACY WITH LESS PROOFREAD EFFORT.AGGREGATE TOPOLOGICAL SCORE IS EQUAL BETWEEN OUR METHOD AT VARIOUS RESOLUTIONS BUT MUCH HIGHER THAN APP2.THE HIGHER RESOLUTION IS MORE EFFECTIVE AT CAPTURING FINE-GRAINED PATHS LEADING TO MORE COMPLETE (RECALL) OVERALL SKELETONS AND ESPECIALLY TERMINI (LEAVES).OUR METHOD AT THE LOWER RESOLUTION IS ABLE TO RESOLVE BRANCH POSITIONS AND THEIR DIRECTION AND THE OVERALL SKELETON MORE PRECISELY THOUGH THAT MAY BE BECAUSE IT HAS A SMALLER, BUT MORE CORRECT AUTOMATED YIELD SET OF NEURONS.

TABLE VII TREEBENCH
SCORES WITH SEVERAL KEY PARAMETERS CHANGED.WE COMPARE EACH ABLATION TO THE SECOND COLUMN OF

TABLE VI (
OURS AT .4UM 3 ) WHICH HAS THE DEFAULT SETTINGS OF 2 STEPS OF SURFACE COARSENING, FOREGROUND PERCENT OF .8AND 4 STEPS OF MORPHOLOGICAL CLOSING.EACH OF THESE COLUMNS CHANGES ONE OF THESE PARAMETERS.

TABLE VIII TREEBENCH
SCORES FOR VARIOUS ABLATIONS IN IMAGE PREPROCESSING.AGAIN, WE COMPARE EACH ABLATION TO THE SECOND COLUMN OF TABLE VI (OURS AT .4UM 3 ) WHICH APPLIES ALL PREPROCESSING STAGES.EACH COLUMN IS CUMULATIVE USING THE IMAGE PROCESSING STEP TO ITS LEFT.ANISOTROPIC INCLUDES DECONVOLUTION (OMITTED WITH A DOUBLE VERTICAL LINE DUE TO ALREADY BEING LISTED AS THE DEFAULT IN

TABLE VI )
BUT ADDS A FINAL STEP OF DOWNSAMPLING THE Z-DIMENSION BY A FACTOR OF 3 TO PROBE THE EXTENT TO WHICH ADDED INFORMATION CONTRIBUTES TO RECONSTRUCTION ACCURACY.asillustrated in TableVII.Although light-edge matching can not introduce new path breaks or collisions, the resulting lower resolution meshes do have the morphological artifact of shortened branches.The sock-like termini at the distal-end of cell branches have high curvature and tend to erode proximally (towards the seed) at 4 steps of coarsening.Very thin branches are more prone to this shortening, which causes losses in leaf and branch completeness, as well as overall recall, though to a smaller extent.