## Abstract

Three-dimensional (3D) microscopy data often is anisotropic with significantly lower resolution (up to 8×) along the *z* axis than along the *xy* axes. Computationally generating plausible isotropic resolution from anisotropic imaging data would benefit the visual analysis of large-scale volumes. This paper proposes `niiv`, a self-supervised method for isotropic reconstruction of 3D microscopy data that can quickly produce images at arbitrary (continuous) output resolutions. Within a neural field, the representation embeds a learned latent code that describes the implicit higher-resolution isotropic image region. Under isotropic volume assumptions, we self-supervise this representation on low-/high-resolution lateral image pairs to reconstruct an isotropic volume from low-resolution axial images. We evaluate our method on simulated and real anisotropic electron (EM) and light microscopy (LM) data. Compared to a state-of-the-art diffusion-based method, `niiv` shows improved reconstruction quality (+1 dB PSNR) and is over three orders of magnitude faster (2,000×) to infer. Specifically, `niiv` reconstructs a 128^{3} voxel volume in 1/10th of a second, renderable at varying (continuous) high resolutions for display.

## 1 Introduction

3D imaging data is ubiquitous in scientific domains such as biology or material sciences. However, many imaging modalities like 3D electron microscopy (EM) or light microscopy (LM) have limited axial (*z*) resolution due to physical sectioning of tissue slices or optical limitations. Thus, resolution is typically much higher in the lateral directions than in the axial direction (Fig. 1a).

Downstream tasks like interactive visual analysis would benefit from high-resolution isotropic volumes, but these can be extremely costly or impossible to obtain. Several computational methods attempt to generate isotropically resolved volumes [29,30,11,20]. Some approaches [29,30] require specific shape priors, such as exact point spread functions (PSFs), which are difficult to measure in practice. Machine learning approaches like *diffusion* can model complex data distributions but require copious training data and are slow to infer [11,20], requiring multiple minutes or even hours to reconstruct a small isotropic volume.

At the same time, anisotropic imaging volumes grow in size every year, containing terabytes [35,27] or even petabytes [23] of imaging data. Reconstructing isotropic volumes is typically an offline postprocess after image acquisition, but with increasing data sizes this becomes infeasible. For instance, Lee et al. [12] take up to three minutes to reconstruct a 128^{3} voxel volume. Instead, isotropic volumes should be reconstructed on-demand and locally at interactive rates from anisotropic data for visual inspection.

To this end, we propose a self-supervised method to reconstruct isotropic volumes from anisotropic data. Building on recent advances in neural field representations [31,10,18,25], our model uses a super-resolution encoder [13] to relate a low-resolution axial slice to a plausible high-resolution image via a latent space [2,3]. A multi-layer perceptron (MLP) decodes a set of latent codes into a high-resolution axial slice sample, with bilinear latent interpolation creating an output image at any pixel resolution (Fig. 2). Both encoder and decoder are trained end-to-end on simulated anisotropic slices obtained by downsampling isotropic lateral slices (Fig. 1b). Inference is fast, requiring only 1/10th of a second to generate a 128 × 128 × 128 volume. This allows interactive isotropic visual inspection.

For validation, we compare our approach against bilinear upsampling and two current self-supervised methods: a neural field reconstruction without the super-resolution encoder [25] and a diffusion-based model [12]. Our approach shows qualitative improvements over both, and quantitative improvements of +1 dB over the diffusion model and +3 dB over the neural implicit baseline. We demonstrate this through peak signal-to-noise ratio (PSNR) computations in a sweep of a frequency-clipped Fourier domain, which is more robust to noise in the training data than pixel-wise PSNR. In terms of computational efficiency, `niiv` is 2,000× faster than the diffusion model and 1,000× faster than the SIREN baseline when volume-specific pretraining times are taken into account.

## 2 Related Work

### Isotropic Volume Reconstruction

Recent self-supervised approaches [12,14,20] train 2D diffusion models to learn the distribution of high-resolution lateral images. During reconstruction, they use low-resolution axial slices as priors for the backward diffusion process to predict missing volume information. While diffusion models achieve high-quality results, their usability is limited by compute-intensive training and time-consuming inference. In contrast, our approach improves reconstruction quality while running inference three orders of magnitude faster than the diffusion baseline [12]. Recently, Zhang et al. [32] also used arbitrary-scale super-resolution neural representations for anisotropic volume reconstruction. However, they focus on MRI imaging, which requires different degradation models. Additionally, they do not discuss model inference speeds or evaluation strategies in the presence of noisy ground truth data. Deng et al. [4] learn a degradation model that generates realistic pairs of low-resolution and high-resolution training images. These pairs are used to train a reconstruction model like Iso-Net-2 [29]. Both models are trained independently, leading to complex training setups. On the other hand, supervised approaches [9] show high-quality results but are difficult to use in practice since isotropic volumes are required at training time. Other methods use video transformers [8], optical flow field interpolation [1], or standard Conv-Nets [29,34], which are often limited to a specific output pixel resolution, whereas our model can be decoded at any resolution.

### Neural Implicit Super-Resolution

Encoding spatial information through neural networks [19,31] has proven useful in areas such as inverse graphics [18,25], shape representation [26], video encoding [10], and super-resolution [2,3,17]. Our approach builds upon local implicit image functions (LIIF) [2], which allow the sampling of images at arbitrary resolution while retaining high-quality visual details. Chen et al. [3] extend LIIF to video super-resolution. However, their supervised approach requires high-resolution, high-frame-rate training videos and thus cannot be directly transferred to self-supervised isotropic reconstruction. Finally, Kim et al. [10] propose a neural video encoding with super-resolution capabilities. However, they optimize a single model per video, making their approach unsuitable for fast reconstruction of unseen volumes.

## 3 Methodology

### Problem Statement

Given a sampling of a volume *V* with isotropic *xy* axes and anisotropic *z* axis, we aim to learn a model *g* that reconstructs an isotropic *z* sampling of the volume purely by self-supervision. We assume that the volume sampling contains a physical medium whose distribution of material can be effectively modeled by observing local statistics of the *xy* samplings. Given a low-resolution anisotropic slice *I*_{i} ∈ ℝ^{x×z} as input, *g* must reconstruct a plausible high-resolution isotropic *xz* slice *Î*_{i} ∈ ℝ^{x×αz}, where *i* denotes a slice from the input volume sampling and *α* is the axial anisotropy factor (e.g., 8). Then, the isotropic volume *V̂* is constructed by stacking the predicted high-resolution slices *Î*_{i}. The same approach applies equally to both *xz* and *yz* slices.
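
The slice-wise reconstruction and stacking described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the array layout `(z, y, x)`, the function names, and the nearest-neighbor stand-in for the learned model *g* are all assumptions made for this example.

```python
import numpy as np

def reconstruct_volume(volume, g, alpha):
    """Apply a slice-wise model g to every xz slice and stack the results.

    volume: array of shape (z, y, x), anisotropic along z (assumed layout).
    g: callable mapping a (z, x) slice to an (alpha * z, x) slice.
    """
    z, y, x = volume.shape
    out = np.empty((alpha * z, y, x), dtype=np.float32)
    for i in range(y):
        # volume[:, i, :] is the i-th xz slice of the input sampling.
        out[:, i, :] = g(volume[:, i, :])
    return out

def nearest_neighbor_g(xz_slice, alpha=8):
    """Hypothetical placeholder for the learned model g: repeats rows along z."""
    return np.repeat(xz_slice, alpha, axis=0)

# A 16 x 128 x 128 anisotropic sampling becomes a 128^3 volume for alpha = 8.
vol = np.random.default_rng(0).random((16, 128, 128)).astype(np.float32)
iso = reconstruct_volume(vol, lambda s: nearest_neighbor_g(s, 8), alpha=8)
print(iso.shape)  # (128, 128, 128)
```

Because every slice is processed independently, the loop could in principle be replaced by a single batched call, which is what makes fast inference plausible for a feed-forward model.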

### Model Architecture

Our hybrid neural field [31] uses latent codes within a 2D space to encode the high-resolution slice. First, a convolutional super-resolution encoder *E*_{ϕ} with parameters *ϕ* embeds a low-resolution axial slice *I* ∈ ℝ^{x×z} into a 2D-referenced latent grid *M* ∈ ℝ^{x×z×d} of *d*-dimensional latent codes *l*. Then, an MLP decoder *f*_{θ} with parameters *θ* (Fig. 2) produces the reconstructed image intensity at an output pixel coordinate *c*:

$$\hat{I}(c) = f_{\theta}\big(l_{c},\, \gamma(c),\, v\big)$$

Here, *c* is encoded using a 2-band frequency basis *γ*(*c*), and *v* is the pixel value obtained by bilinearly interpolating *I* at *c*. Because *M* is interpolated bilinearly, we can query it at any continuous 2D coordinate *c* = (*h*_{c}, *w*_{c}) to retrieve a latent code *l*_{c} ∈ ℝ^{d}, making it possible to sample arbitrary pixel resolutions. Thus, `niiv` can adapt flexibly to the requirements of interactive display across devices, unlike other approaches [9,29]. Note that *E*_{ϕ} and *f*_{θ} are shared between all volumes in the respective training and test datasets.
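
To make the decoder query concrete, the sketch below assembles the per-coordinate decoder input from a latent grid, a frequency encoding, and an interpolated pixel value. It is illustrative only: the sine/cosine form of the 2-band basis, the coordinate normalization, the axis order (16 axial by 128 lateral samples), and all function names are assumptions, and the random arrays stand in for the encoder output and the input slice.

```python
import numpy as np

def frequency_encode(c, num_bands=2):
    """Encode coordinates c of shape (n, 2) with sin/cos at num_bands frequencies."""
    c = np.asarray(c, dtype=np.float64)
    feats = []
    for k in range(num_bands):
        feats.append(np.sin((2.0 ** k) * np.pi * c))
        feats.append(np.cos((2.0 ** k) * np.pi * c))
    return np.concatenate(feats, axis=-1)  # shape (n, 4 * num_bands)

def bilinear_sample(grid, c):
    """Bilinearly sample grid (H, W, d) at continuous (row, col) coords c of shape (n, 2)."""
    H, W = grid.shape[:2]
    r = np.clip(c[:, 0], 0, H - 1)
    q = np.clip(c[:, 1], 0, W - 1)
    r0, q0 = np.floor(r).astype(int), np.floor(q).astype(int)
    r1, q1 = np.minimum(r0 + 1, H - 1), np.minimum(q0 + 1, W - 1)
    wr, wq = (r - r0)[:, None], (q - q0)[:, None]
    top = (1 - wq) * grid[r0, q0] + wq * grid[r0, q1]
    bottom = (1 - wq) * grid[r1, q0] + wq * grid[r1, q1]
    return (1 - wr) * top + wr * bottom

def decoder_inputs(M, I, coords):
    """Concatenate [l_c, gamma(c), v] for each query coordinate."""
    H, W = I.shape
    l_c = bilinear_sample(M, coords)          # latent code at c
    v = bilinear_sample(I[..., None], coords)  # interpolated input pixel at c
    c_norm = coords / np.array([max(H - 1, 1), max(W - 1, 1)])
    return np.concatenate([l_c, frequency_encode(c_norm), v], axis=-1)

rng = np.random.default_rng(0)
I = rng.random((16, 128))      # stand-in low-resolution axial slice
M = rng.random((16, 128, 64))  # stand-in latent grid with d = 64
rows = np.linspace(0, 15, 128)  # 8x denser sampling along the axial axis
cols = np.arange(128, dtype=np.float64)
coords = np.stack(np.meshgrid(rows, cols, indexing="ij"), axis=-1).reshape(-1, 2)
features = decoder_inputs(M, I, coords)
print(features.shape)  # (16384, 73)
```

Changing only `rows` and `cols` changes the output resolution, which is the property that lets the same trained model render at arbitrary display sizes.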

### Simulating axial degradation

During training, we use artificially degraded *xy* images as model inputs and supervise outputs with the respective high-resolution *xy* slices *I*^{xy}. We apply a function *d* that simulates degradation along the *z* axis, such that the model input is *d*(*I*^{xy}). Here, we define *d* as an average pooling operator. In principle, other degradation models [4,12,7] can be applied based on the specific application and imaging domain.
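
As a rough sketch of the degradation step, average pooling along a single image axis can be written as below. The function name, the choice of axis, and the requirement that the axis length be divisible by the anisotropy factor are assumptions of this example rather than details given in the text.

```python
import numpy as np

def degrade_average_pool(xy_slice, alpha=8, axis=0):
    """Average-pool a 2D slice by a factor alpha along one axis.

    Simulates axial anisotropy on a lateral slice: every group of alpha
    neighboring rows (or columns) is replaced by its mean.
    """
    img = np.asarray(xy_slice, dtype=np.float32)
    n = img.shape[axis]
    if n % alpha != 0:
        raise ValueError("axis length must be divisible by alpha")
    img = np.moveaxis(img, axis, 0)
    pooled = img.reshape(n // alpha, alpha, *img.shape[1:]).mean(axis=1)
    return np.moveaxis(pooled, 0, axis)

# A 128 x 128 high-resolution slice and its 8x degraded training input.
hr = np.arange(128 * 128, dtype=np.float32).reshape(128, 128)
lr = degrade_average_pool(hr, alpha=8, axis=0)
print(hr.shape, lr.shape)  # (128, 128) (16, 128)
```

The pair `(lr, hr)` then plays the role of one low-/high-resolution training example.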

### Model Training

We minimize a loss function ℒ by end-to-end optimization of the encoder and decoder parameters *ϕ* and *θ*. Studies show that perceptual image quality metrics are superior to simple pixel-wise metrics for image super-resolution tasks [6]. Perceptual metrics [6,5,33] use features of deep pretrained models like VGG [24] as a basis for comparison. Yet, the *mean absolute error* (MAE) excels at image denoising tasks [6]. Thus, we combine the perceptual *Deep Image Structure and Texture Similarity* (DISTS) [5] metric and the MAE between the predicted slices *Î*^{xy} and the isotropic slices *I*^{xy}. Our loss function ℒ is therefore

$$\mathcal{L} = w_{1}\,\mathrm{DISTS}\big(\hat{I}^{xy}, I^{xy}\big) + w_{2}\,\mathrm{MAE}\big(\hat{I}^{xy}, I^{xy}\big)$$

where *w*_{i} are the respective weights. We find *w*_{1} = 1 and *w*_{2} = 30 to consistently achieve good results.
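
The weighted combination can be sketched as follows. Two points are assumptions of this example: the pairing of *w*_{1} with the perceptual term and *w*_{2} with the MAE simply follows the order in which the terms are introduced, and `gradient_l1` is a hypothetical stand-in for the perceptual term, since DISTS itself requires a pretrained network and is not reproduced here.

```python
import numpy as np

def mean_absolute_error(pred, target):
    """Pixel-wise mean absolute error."""
    return float(np.mean(np.abs(pred - target)))

def gradient_l1(pred, target):
    """Hypothetical placeholder for a perceptual term. This is NOT DISTS."""
    gy = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    gx = np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)).mean()
    return float(gy + gx)

def combined_loss(pred, target, perceptual_fn, w1=1.0, w2=30.0):
    """Weighted sum of a perceptual term and the MAE (weight pairing assumed)."""
    return w1 * perceptual_fn(pred, target) + w2 * mean_absolute_error(pred, target)

rng = np.random.default_rng(0)
target = rng.random((32, 32))
pred = target + 0.1  # constant offset: no gradient error, MAE of 0.1
print(combined_loss(pred, target, gradient_l1))
```

In an actual training loop, `perceptual_fn` would be replaced by a differentiable DISTS implementation and the arrays by framework tensors.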

## 4 Evaluating Reconstructions with Noisy Ground Truth

Evaluating `niiv` brings challenges, as only noisy ground truth exists, leading to uninformative PSNR values (Fig. 3a). Thus, a method that perfectly reconstructs lateral slices is overfitting to the noise; this is especially problematic in low-data regimes with powerful data-fitting models [12]. We wish to assess the reconstruction of biological structures despite the noise. Prior research has addressed this problem by downscaling the data to diminish noise [9], at the cost of sacrificing resolution. We propose calculating the PSNR in the Fourier domain, where it is easier to separate high-frequency signal components such as noise [7] (Fig. 3b). We vary a cutoff frequency *f*_{cutoff} (Fig. 3c) across a range of values to observe the quality across frequencies. For example, if a method only achieves a greater PSNR than another at high *f*_{cutoff} but not at low *f*_{cutoff}, then it is likely overfitting the noise. Given Parseval’s theorem [21] and the unitary nature of the Fourier transform ℱ, we can directly compute the PSNR in the frequency domain (see derivation in supplement), sidestepping the inverse transformation to the spatial domain.

We now incorporate the clipping operation in the Fourier domain, denoted by a low-pass operator *L* that discards frequencies above *f*_{cutoff}. Given the above relationship, we obtain

$$\mathrm{PSNR}_{f_{\mathrm{cutoff}}}\big(I, \hat{I}\big) = \mathrm{PSNR}\big(L(\mathcal{F}(I)),\, L(\mathcal{F}(\hat{I}))\big)$$

That is, we can transform both images into the Fourier domain, apply the clipping operation via *L*, and then compute the PSNR directly in Fourier space.
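
A minimal sketch of this frequency-clipped PSNR is given below. Several details are assumptions of the example and are not specified in the text: the radial (circular) shape of the low-pass mask, the cutoff expressed in cycles per pixel, the normalization of the error by the total number of pixels, and a data range of 1. The orthonormal FFT (`norm="ortho"`) is used so that, by Parseval's theorem, the unclipped result matches the spatial-domain PSNR.

```python
import numpy as np

def lowpass_mask(shape, f_cutoff):
    """Boolean mask keeping frequencies with radial magnitude <= f_cutoff."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    return np.sqrt(fy ** 2 + fx ** 2) <= f_cutoff

def fourier_psnr(gt, pred, f_cutoff, data_range=1.0):
    """PSNR computed in the Fourier domain after discarding high frequencies."""
    F_gt = np.fft.fft2(gt, norm="ortho")    # unitary transform
    F_pred = np.fft.fft2(pred, norm="ortho")
    mask = lowpass_mask(gt.shape, f_cutoff)
    mse = np.sum(np.abs((F_gt - F_pred) * mask) ** 2) / gt.size
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

# Synthetic example: smooth structure plus noise as "ground truth".
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:128, 0:128]
clean = 0.5 + 0.4 * np.sin(2 * np.pi * xx / 32) * np.cos(2 * np.pi * yy / 32)
gt = clean + 0.05 * rng.standard_normal(clean.shape)
pred = clean  # a prediction that recovers the structure but not the noise

for f_cutoff in (0.05, 0.1, 0.25, 1.0):
    print(f_cutoff, round(fourier_psnr(gt, pred, f_cutoff), 2))
```

In this toy setting the prediction differs from the ground truth only by noise, so lowering the cutoff removes error energy and the PSNR rises; sweeping the cutoff therefore shows how much of a score is driven by high-frequency content.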

## 5 Experiments

### 5.1 Data and Implementation Details

We demonstrate the effectiveness of our approach on the publicly available FlyEM Hemibrain [22] and FAFB [35] EM datasets, and additionally evaluate on the LICONN [28] LM dataset as an ablation (Fig. 5b). The Hemibrain contains the central brain region of Drosophila melanogaster imaged at isotropic 8×8×8 nm pixel resolution; we downsample the data to 8× anisotropy along the *z*-axis through average pooling. FAFB shows the entire brain of a female adult fruit fly at a naturally 5× anisotropic 8×8×40 nm pixel resolution. We randomly sample 400 subvolumes (128^{3} pixels in the Hemibrain and 130^{3} pixels in FAFB) and separate them into training (*N* = 350) and test datasets (*N* = 50). All metrics are reported on entire volumes rather than individual images. Our method is implemented in PyTorch, and all experiments were performed on a single NVIDIA RTX 3090 Ti GPU. All experiments use the EDSR [13] super-resolution encoder without upsampling modules, with 16 residual blocks and 64-dimensional output features. The MLP is 5 layers deep, with each layer 256 neurons wide. We train our model for 1500 epochs using the Adam optimizer and a learning rate of 5×10^{−5}. Note that we train separate models for the Hemibrain, FAFB, and LICONN data.
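
The subvolume sampling and the 350/50 split could look roughly like the sketch below. The source volume size, the uniform sampling of origins (which permits overlapping subvolumes), the fixed seed, and the function name are assumptions made for illustration; the text does not specify how the subvolumes were drawn.

```python
import numpy as np

def sample_subvolume_origins(volume_shape, size=128, n=400, seed=0):
    """Draw n random corner coordinates of size^3 subvolumes inside a volume."""
    rng = np.random.default_rng(seed)
    highs = [s - size + 1 for s in volume_shape]  # exclusive upper bounds
    if min(highs) < 1:
        raise ValueError("volume is smaller than the requested subvolume size")
    return np.stack([rng.integers(0, h, size=n) for h in highs], axis=1)

# Hypothetical 1024^3 source volume; 400 subvolumes split into 350 / 50.
origins = sample_subvolume_origins((1024, 1024, 1024), size=128, n=400)
train_origins, test_origins = origins[:350], origins[350:]
print(train_origins.shape, test_origins.shape)  # (350, 3) (50, 3)
```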

### 5.2 Qualitative and Quantitative Comparison

To showcase our method’s suitability for interactive isotropic reconstruction, we capped the GPU memory usage at 4 GB for all methods, reflecting a mid-tier laptop’s typical capacity. Within this constraint, our approach significantly outperforms the diffusion baseline, delivering inference speeds up to three orders of magnitude faster (0.11 vs. 264 seconds) for an isotropic reconstruction task with anisotropy factor *α* = 8 on 128^{3} volumes (Hemibrain). The advantage is due to Diffusion-EM’s slow iterative inference process and its need to enforce frame-by-frame consistency in the probabilistic reconstruction by conditioning each slice inference on a latent code retrieved from the previous slice, which prohibits batch processing. We also outperform the neural implicit SIREN [25] baseline, as it requires separate pretraining for each subvolume, leading to costly inference on unseen data (Table 1). In terms of reconstruction quality, and in contrast to the baselines, our model can reconstruct fine details (Fig. 4a) with slice-by-slice consistency (see supplementary video) and sharper edges (Fig. 4b). While the diffusion results look visually sharp, small details are often reconstructed incorrectly, explaining the lower metric scores. Diffusion-EM also fails for volume sizes that are not powers of two (Fig. 4b). Finally, SIREN reconstruction results look blurry, making it unsuitable for isotropic reconstruction.

### 5.3 Ablation Studies

#### Fourier PSNR Threshold

We tested the effect of the Fourier clipping threshold on the PSNR (Fig. 5a). If no clipping threshold is applied, the PSNR values of our method and the baselines are low and close together due to the random image noise in the ground truth data. However, the black box (Fig. 5a) indicates a clipping window in Fourier space where image quality differences are more accurately represented through the PSNR given noisy GT images.

#### Data Modality

We use a recent expansion LM dataset [28] of a mammalian hippocampus with near-isotropic voxel size (9.7×9.7×13 nm) and simulate 8× anisotropy using average pooling as a degradation model (Fig. 5b). Next, we train on 350 randomly sampled volumes and reconstruct 50 unseen 128^{3} voxel test volumes at isotropic voxel size. Fig. 5b shows input and GT images and also compares our results with nearest-neighbor and bilinear interpolation. We find that `niiv` also produces sharper images than bilinear interpolation for LM data.

## 6 Conclusions and Future Work

Interactive isotropic rendering of anisotropic data is useful for large-scale visual data inspection tasks. To this end, we demonstrate that neural fields and encoder-based super-resolution representations are promising for fast and flexible self-supervised volume reconstruction. We propose three avenues for future work. First, more accurate physics-based axial degradation models (Sec. 3) could improve the simulation of anisotropic *xy* slices during training; Deng et al. [4] take a first step in that direction. Second, integrating machine-learning elements like our approach into low-power Web-based image-rendering tools such as neuroglancer [15] or Viv [16] would rapidly deploy these advances. Third, future work should investigate whether latent image representations express higher-level, semantically interpretable morphological features, as these could be useful in downstream tasks like tissue classification (e.g., neuron typing).