Deep learning-enhanced light-field imaging with continuous validation

Visualizing dynamic processes over large, three-dimensional fields of view at high speed is essential for many applications in the life sciences. Light-field microscopy (LFM) has emerged as a tool for fast volumetric image acquisition, but its effective throughput and widespread use in biology have been hampered by a computationally demanding and artifact-prone image reconstruction process. Here, we present a framework for artificial intelligence–enhanced microscopy, integrating a hybrid light-field light-sheet microscope and deep learning–based volume reconstruction. In our approach, concomitantly acquired, high-resolution two-dimensional light-sheet images continuously serve as training and validation data for the convolutional neural network reconstructing the raw LFM data during extended volumetric time-lapse imaging experiments. Our network delivers high-quality three-dimensional reconstructions at video-rate throughput, which can be further refined based on the high-resolution light-sheet images. We demonstrate the capabilities of our approach by imaging medaka heart dynamics and zebrafish neural activity with volumetric imaging rates of up to 100 Hz.

Capturing neuronal activity distributed over whole brains, explaining long-range molecular signaling networks or analyzing the structure and function of beating hearts in small animals necessitates imaging methods that are capable of resolving these dynamic processes on the scale of milliseconds and hundreds of micrometers. To address these challenges, several imaging approaches have been developed or optimized 1, such as highly optimized point 2-4 and line scanning 5, selective-plane illumination 6,7 or reducing the dimensionality of the image acquisition 8,9. An attractive candidate for high-speed three-dimensional (3D) imaging in biology is light-field microscopy (LFM), due to its ability to instantaneously capture 3D spatial information in a single camera frame, thus permitting volumetric imaging limited only by the frame rate of the camera 10-12. The ability to image the 3D distribution of fluorescent emitters over large, hundreds-of-micrometers-scale fields of view (FOV) with millisecond temporal resolution has opened new avenues in developmental biology and neurobiology, such as the recording of whole-brain neuronal activity in several model organisms 12-15 or the visualization of cardiovascular dynamics 16,17. LFM has seen a steady rise in performance over the past years in terms of spatial resolution 16, signal-to-noise ratio 17,18 and the reduction of image reconstruction artifacts 16,19-23. While the large FOV and high volumetric imaging rate of LFM have been used to address questions in, for example, neurobiology 13,15, the widespread use of this technique in the life sciences has been hampered by a computationally demanding, iterative image reconstruction process. Realistic applications of LFM in biology therefore demand a large-scale computational infrastructure, restricting the effective experimental throughput, especially with respect to long-term recordings.
With the development of deep learning and convolutional neural networks (CNNs), multiple algorithms have recently been proposed that aim to replace iterative deconvolution procedures, such as Richardson-Lucy deconvolution, with a CNN 24. In the natural image domain, CNNs are now the primary method for removing motion blur and other artifacts traditionally addressed by iterative deconvolution 25. Similarly, in microscopy several deep learning-based methods have been introduced for deblurring, denoising or super-resolution applications 26. Although these methods demonstrate excellent image reconstruction performance and have empirically been shown to generalize to data similar to those used in training, no theoretical guarantees on generalization can be given. It is therefore important to extensively validate and, if needed, to retrain the CNN for each experimental setting 27. This requirement presents a problem for many bioimaging applications, and in particular for dynamic imaging with LFM, as raw light-field images are difficult to interpret. A trained network running 'in production' thus offers no way to evaluate the quality of its reconstructions, leaving potential users rightfully concerned about the trustworthiness of the 'black-box magic' 28. Furthermore, recent work 29,30 resorted to arresting the biological activity to acquire volumetric training data by confocal or other imaging modalities on a separate microscope, which is often not possible or desirable.
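For reference, the iterative baseline that such CNNs aim to replace can be summarized in a few lines. The following is a minimal 2D sketch of the Richardson-Lucy update (the function name and the flat initialization are illustrative; production light-field deconvolution operates with the full light-field PSF in 3D):

```python
import numpy as np
from scipy.signal import fftconvolve

def richardson_lucy(observed, psf, n_iter=30, eps=1e-12):
    """Minimal 2D Richardson-Lucy deconvolution sketch.

    Each iteration multiplies the current estimate by the back-projected
    ratio of the observed image to the current blurred estimate.
    """
    estimate = np.full_like(observed, observed.mean())  # flat initial guess
    psf_mirror = psf[::-1, ::-1]                        # adjoint of the blur
    for _ in range(n_iter):
        blurred = fftconvolve(estimate, psf, mode="same")
        ratio = observed / (blurred + eps)              # eps avoids div-by-zero
        estimate *= fftconvolve(ratio, psf_mirror, mode="same")
    return estimate
```

Even for small images, tens of such iterations are needed per frame, which is the computational burden that a single CNN forward pass removes.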
To overcome this limitation, we present a framework for fast and high-fidelity reconstructions of experimental LFM images, termed hybrid LFM (HyLFM). We demonstrate the capabilities of our HyLFM system by imaging the dynamics of the beating heart in medaka (Oryzias latipes) hatchlings (8 days postfertilization (dpf)) across a 350 × 300 × 150 µm 3 FOV at a volumetric speed of 40-100 Hz, as well as by calcium imaging in 5-dpf zebrafish (Danio rerio) larvae over 350 × 280 × 120 µm 3 FOV at 10 Hz. HyLFM-Net achieves high image reconstruction quality and spatial resolution at a video-rate (>20 Hz) inference throughput. In addition to reconstruction performance demonstrations for networks trained on static volumes or dynamically acquired individual planes, we show how to validate HyLFM-Net predictions on-the-fly and how to fine-tune the network on a part of the dynamically acquired image data.

Results
HyLFM approach. Our approach is based on reconstruction by a CNN that is enhanced by simultaneous acquisition of high-resolution image data for training and validation. Our neural network, which we term HyLFM-Net, is designed for light-field data processing and 3D image reconstruction (Extended Data Fig. 1 and Supplementary Table 1). To avoid potential bias toward previously seen training data and to enable direct validation of the reconstructions from uninterpretable LFM images, we included an additional, continuous validation mechanism in our LFM imaging setup, thereby achieving and ensuring high-fidelity reconstructions. Experimentally, this is realized by adding a simultaneous selective-plane illumination microscopy (SPIM) modality to the LFM setup, which continuously scans through the volume at high speed with an electrically tunable lens 31. This produces high-resolution ground truth images of single planes for validation, training or refinement of the CNN. Training can be performed both on static sample volumes and dynamically from a single SPIM plane that sweeps through the volume during 3D image acquisition. Besides direct training on nonstatic samples, the latter approach allows retraining the network if its reconstructions do not agree with the individual SPIM plane images taken during continuous validation.
The design of the HyLFM imaging system is based on an upright SPIM configuration (Fig. 1a and Extended Data Fig. 2). This setup allows simultaneous or sequential illumination of the entire sample volume for light-field recording and/or of a single plane for SPIM recording. On the detection side, the objective (20 × 0.5 numerical aperture (NA)) collects the excited fluorescence, which is split either via a 30/70 beamsplitter or based on wavelength into separate optical paths for SPIM and LFM imaging, respectively. A fast galvanometric mirror in the illumination path, together with an electrically tunable lens (ETL) in the SPIM detection path, enables arbitrary repositioning of the SPIM excitation and detection planes within the LFM imaging volume at high speed (15 ms) (Fig. 1b). An automated image processing pipeline 16 ensures that LFM and SPIM volumes are coregistered in a common reference volume, and therefore coordinate system, with high precision, which is a prerequisite for CNN training and validation. The simultaneous acquisition of both two-dimensional (2D) and 3D training data is paramount to ensure high-fidelity and reliable CNN light-field reconstructions of arbitrary samples, including data not seen in previous training. This includes dynamic samples for which the process of interest cannot be arrested to acquire a static training volume at high resolution, an advance of our system over previous work on artificial intelligence-enhanced LFM reconstructions 29,30,32.

CNN architecture. We followed a fully supervised 33,34 approach and trained HyLFM-Net directly on pairs of SPIM-LFM images, with the LFM image as input and individual SPIM planes or complete volumes as the reconstruction target. The LFM image is transformed into a tensor in which the individual pixels of each lenslet are rearranged as channels (Fig. 1c). This rearrangement puts different angular views 11 into different channels.
The convolution operations can then act on angular views, while the projection (1 × 1 convolution) layers learn to combine information from different angles. The multi-channel 2D images are passed through 2D residual blocks and a transposed convolution. The output goes through a final 2D convolution layer and is then transformed to 3D by reinterpreting network filters as the axial spatial dimension. The 3D images are further processed by 3D residual blocks and upsampled by transposed convolutions to finally yield the reconstructed 3D volume. For training on single planes, the registration transform between the two detection modalities is encoded into the last network layer to enable direct comparison with the acquired 2D light-sheet image (Extended Data Fig. 1).
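The lenslet-to-channel rearrangement described above can be sketched as a simple reshape (assuming a raw light-field image with n × n = 19 × 19 pixels per lenslet, as in the paper; the function name is hypothetical):

```python
import numpy as np

def lenslets_to_channels(lf_image, n=19):
    """Rearrange a raw light-field image of shape (H*n, W*n) so that the
    i-th pixel of every lenslet forms one channel: output shape (n*n, H, W).

    Channel i*n + j holds the same pixel position (i, j) from every lenslet,
    that is, one angular view of the scene.
    """
    H, W = lf_image.shape[0] // n, lf_image.shape[1] // n
    x = lf_image.reshape(H, n, W, n)   # split into per-lenslet blocks
    x = x.transpose(1, 3, 0, 2)        # (n, n, H, W): pixel index first
    return x.reshape(n * n, H, W)
```

Subsequent 2D convolutions then mix spatial positions within each angular view, while 1 × 1 projection layers combine information across views.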
Performance characterization of HyLFM. To evaluate the performance of our HyLFM system, we imaged subdiffraction-sized fluorescent beads suspended in agarose (Fig. 2). We quantified the improvement in both spatial resolution and overall image quality by comparing it with standard, iterative light-field deconvolution 11,12 (LFD) as well as with the same deconvolution improved by a deep learning-based image restoration (content-aware restoration) method 27 (LFD + CARE). For LFD + CARE, we trained the image restoration network to transform the LFD volume into the SPIM volume, starting from the LFD deconvolutions of the data used for training HyLFM-Net. We found that HyLFM-Net correctly inferred the 3D imaging volume from the raw light-field data, yielding high and uniform lateral (1.8 ± 0.2 µm, mean ± s.d.) as well as axial (7.1 ± 1.3 µm) resolution across the imaging volume (n = 4,966 beads, Fig. 2b), substantially better than what could be obtained by LFD (Fig. 2c) and on par with the performance of LFD + CARE (Fig. 2d). Furthermore, HyLFM-Net reconstructions do not suffer from the artifacts near the native focal plane that are common in LFD 11 (Fig. 2c and Extended Data Fig. 3). A similar trend can be observed when measuring image quality metrics across the volume, using the SPIM ground truth as reference (Fig. 2e-h). While HyLFM-Net and LFD + CARE yield qualitatively comparable results, it is important to note that LFD + CARE uses conventional, slow LFD reconstructions as input and is therefore intrinsically limited in overall reconstruction speed. Since the true positions of the fluorescent beads are known from SPIM, we further extended our comparison from image-level metrics to object-level precision-recall (Fig. 2i and Extended Data Fig. 4). Here, HyLFM-Net outperforms LFD + CARE at all threshold levels, while 'classic' LFD demonstrates low recall, probably caused by suboptimal performance in the outer planes of the volumes.
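Resolution estimates such as the FWHM values above are obtained by fitting Gaussians to bead intensity profiles (see 'Reconstruction quality analysis'). A minimal 1D sketch of the FWHM extraction, with illustrative function names:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, mu, sigma, offset):
    """1D Gaussian with amplitude, center, width and baseline offset."""
    return amp * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) + offset

def fwhm_from_profile(x, profile):
    """Fit a 1D Gaussian to an intensity profile through a bead and return
    the full-width at half-maximum, FWHM = 2*sqrt(2*ln 2)*sigma."""
    p0 = [profile.max() - profile.min(),  # amplitude guess
          x[np.argmax(profile)],          # center guess
          1.0,                            # width guess
          profile.min()]                  # baseline guess
    popt, _ = curve_fit(gaussian, x, profile, p0=p0)
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * abs(popt[2])
```

In practice the fit is performed in 3D at each detected bead position, yielding lateral and axial FWHM values per modality.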
For all CNN-based reconstruction methods, it has to be noted that the point-spread function (PSF) of the signal reconstructed with a CNN depends strongly on the resolution and shape of the signal in the training data. Training on small structures such as subdiffraction-sized beads will therefore lead to unnaturally precise signal localizations and, when such a network is applied to a dataset with larger structures (or beads), can lead to erroneous reconstructions. Conversely, training only on large structures can lead to a network that merges small neighboring objects (Extended Data Fig. 5). While empirically the bias toward training data can be alleviated by ensuring more diverse training datasets, at the current state of machine learning theory no formal guarantees on network generalization performance can be made. This observation motivated our hybrid microscope setup, in which the network can always be validated on concomitantly acquired high-resolution data and, if necessary, retrained directly on the imaged sample instead of relying on a sufficiently broad and general composition of static training data, which in practice is typically lacking or time- and/or resource-intensive to produce.
Demonstration of dynamic imaging and online validation. Next, we applied the HyLFM system to image a beating medaka fish heart in vivo to show its capability to capture dynamic cellular movements in 3D (Fig. 3 and Supplementary Videos 1-3). Here we used two-color labeling of cardiomyocytes (myl7::H2B-eGFP; myl7::H2A-mCherry; that is, nuclear eGFP and mCherry for the LFM and SPIM detection paths, respectively) to image the same features simultaneously in SPIM and LFM at 40-100 Hz volume rate. This allowed us to visualize the heart at single-cell resolution and free of reconstruction artifacts, both for pharmacologically arrested (static) hearts (Fig. 3a-e, FOV roughly 350 × 300 × 180 µm 3) and for beating (dynamic) hearts (Fig. 3f-o, FOV roughly 350 × 300 × 150 µm 3). As described above, HyLFM-Net can be trained both from full 3D SPIM volumes and from individual 2D SPIM images acquired dynamically (that is, in sweeping 3D mode). We evaluated both training regimes and compared them with the conventional LFD and LFD + CARE reconstruction approaches. For training, HyLFM-Net-stat received LFM images as input and the corresponding full SPIM volumes as target; LFD + CARE received LFD reconstructions as input and full SPIM volumes as target; HyLFM-Net-dyn received LFM images as input and individual SPIM slices as target. Whether trained statically or dynamically, HyLFM-Net yielded high image quality metrics with regard to SPIM: multi-scale structural similarity index measure (MS-SSIM) = 0.78 ± 0.05 and peak signal-to-noise ratio (PSNR) = 21.2 ± 1.9 for HyLFM-Net-stat, and MS-SSIM = 0.77 ± 0.05 and PSNR = 21.1 ± 1.9 for HyLFM-Net-dyn (Fig. 3e), with overall performance on unseen data superior to conventional LFD and on par with the LFD + CARE combination, both qualitatively and quantitatively. Further experiments without pharmacological arrest (Fig. 3f-o) demonstrate equally good performance of HyLFM-Net-stat and HyLFM-Net-dyn on samples that have never been arrested.
This underscores the feasibility of our hybrid 2D/3D imaging approach and opens the door to network fine-tuning on individual SPIM images. Note that we obtained these reconstructions and image quality metrics in inference mode from networks trained on separate fish hearts, illustrating the generalization ability of HyLFM-Net. HyLFM-Net allowed 3D volume inference at up to 26.7 Hz on a consumer-grade graphics processing unit (GPU) such as the NVIDIA GeForce RTX 2080Ti. This represents at least a 1,000-fold reconstruction speed improvement over common LFD 11,12 (Supplementary Table 2). Performing 3D volume inference at video-rate speed boosts overall experimental imaging throughput and further enables real-time quality control of uninterpretable light-field images during acquisition. To demonstrate this on-the-fly validation, we computed the MS-SSIM of the volume slice corresponding to the simultaneously acquired scanning SPIM plane (Fig. 3o). This allows us to continuously monitor, in real time, the image quality of our network reconstruction and to check whether the metrics fall short of a user-defined threshold that depends on the reconstruction accuracy requirements of the experiment (for example, MS-SSIM nearing 0.7 in Fig. 3j,o). While in principle any image metric can be chosen for validation, we found MS-SSIM to be most consistent with human perception of image quality.
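The on-the-fly validation step can be sketched as follows, using single-scale SSIM from scikit-image as a stand-in for the MS-SSIM metric used in the paper (the function name and threshold handling are illustrative):

```python
import numpy as np
from skimage.metrics import structural_similarity

def validate_slice(predicted_slice, spim_slice, threshold=0.7):
    """Compare the network-reconstructed slice with the simultaneously
    acquired SPIM plane and flag frames below a user-defined quality
    threshold. Single-scale SSIM stands in for MS-SSIM here."""
    score = structural_similarity(
        predicted_slice.astype(np.float32),
        spim_slice.astype(np.float32),
        data_range=float(spim_slice.max() - spim_slice.min()) or 1.0,
    )
    return score, score >= threshold
```

Running this check once per acquired SPIM plane gives a continuous, per-frame quality trace; a run of sub-threshold scores would signal that the network should be fine-tuned on the current data.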
Demonstration of network refinement. In our HyLFM approach, network fine-tuning can be performed based on the single-plane SPIM images as training data (Fig. 4 and Extended Data Fig. 6). To illustrate this fine-tuning process on dynamically beating medaka heart data, we started with a network that was trained on a static heart. We then used 8,883 individual SPIM planes from a 5-min time lapse as training data and a further 756 unseen consecutive planes from the same time lapse to evaluate the performance. After just 25 iterations (7 min) of retraining, image quality improved substantially (Fig. 4e,f). The network converged to an MS-SSIM of at least 0.9 after 2,500 iterations (7 h of GPU time, Fig. 4e,h), also achieving very homogeneous reconstruction performance across z (Supplementary Video 4). We note that in principle ongoing imaging experiments do not need to be stopped for fine-tuning, as this can be performed on the already acquired imaging data. With sufficient training time, fine-tuning can compensate for much larger domain gaps and successfully refine networks originally trained on beads or on LFD reconstructions of previous experiments (Extended Data Fig. 6).
Application to calcium imaging. LFM is a promising method for neural activity imaging in small model organisms. To demonstrate the potential of HyLFM to deliver quantitatively accurate reconstructions, we imaged the brains of 5-dpf transgenic zebrafish larvae expressing the nuclear-confined calcium indicator GCaMP6s (Tg(elavl3:H2B-GCaMP6s)) (Extended Data Fig. 7). By distributing the excited fluorescence into the LFM and SPIM detection arms, we could record whole-volume light-field and high-resolution SPIM imaging data at 10 Hz each, over a 350 × 280 × 120 µm 3 volume. The concurrent availability of ground truth data enabled our HyLFM system to learn and infer not only structural but also intensity-based information, as demonstrated by the high degree of correlation of Ca 2+ signal traces between HyLFM and ground truth data obtained by SPIM or conventional LFD (Extended Data Fig. 7b,c).

Discussion
In summary, we have demonstrated a framework for deep learning-based microscopy with continuous ground truth generation for enhanced reconstruction reliability. Our approach enables light-field imaging with high spatial resolution and minimal reconstruction artifacts and, compared with previous work based on multiview deconvolution 16, achieves this performance in the relaxed imaging geometry of a standard two-objective SPIM. We note that for simultaneous SPIM and LFM imaging, the sample requires two-color labeling so that the modalities can be split by wavelength. Alternatively, for single-color imaging the modalities need to be rapidly alternated and the fluorescence split stochastically, which could in principle be limiting for highly dynamic samples. Although the initial training time of HyLFM-Net from scratch can be long, our fine-tuning experiments suggest an approach to avoid it becoming a bottleneck: a HyLFM-Net trained on beads can be converted to reconstruct dynamic biological samples in 7 h of training on a single GPU. After retraining, the network can deliver optimal results for the target system and, in case of a change in the imaging conditions, can be retrained again on a new sample in mere minutes.
The ability to reconstruct light-field volumes at subsecond (video-)rate eliminates the main computational hurdle for light-field imaging in biology, and we thus expect this to further accelerate the uptake of LFM by the community. This is particularly relevant for applications that rely on tens to hundreds of imaging replicates. Here, once trained or refined, our HyLFM networks can lower overall processing (reconstruction) times considerably (up to around 1,000×), enabling the 3D visualization of more than a million raw light-field images per day on a single GPU, corresponding to, for example, roughly 250 min of LFM recording at 100 Hz volume rate. The integration of a high-resolution imaging modality into our LFM system mitigates the omnipresent problem of acquiring appropriate training data, as it can be generated simultaneously and on-the-fly. Furthermore, the online availability of single-plane ground truth (SPIM) images distributed over 3D space and time enables continuous CNN output validation and fine-tuning, as early time points of a time-lapse imaging experiment can be used for network training and/or refinement. This approach to supervised AI-enhanced microscopy solves the problem of transferability, as the network over time learns on the actual experimental data, and therefore does not require pre-acquisition of training images from particular specimen types with separate microscopes. This is a key general advantage of our approach, which can also be combined with existing 29,30 or future developments in neural network architecture. While the HyLFM setup has been developed specifically for light-field imaging, the general principle behind it is applicable to other imaging approaches that rely on iterative or trained computational methods for image reconstruction or restoration. 
Finally, we note that access to time- and resource-efficient light-field reconstructions further facilitates data-intensive, long-term four-dimensional (4D) imaging experiments at high throughput, as it allows volumetric image data to be stored in compressed form, that is, as raw 2D light-field images 29.
The fact that LFM only requires the addition of a suitable microlens array into the imaging path, and is in principle compatible with any custom or commercial SPIM realization, bears further potential for widespread use of this method in the life sciences, especially in the context of repetitive and high-throughput imaging studies.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41592-021-01136-0.

Methods

HyLFM-SPIM imaging setup.
The microscope consists of one illumination and one detection objective, orthogonal to each other (Extended Data Fig. 2). The illumination sources are continuous-wave lasers (wavelength λ = 488 nm, 20 mW, Omicron and λ = 561 nm, 50 mW, Cobolt). We use a 10 × 0.3 NA (Nikon CFI Plan Fluor ×10 W) water dipping objective for illumination and a 20 × 0.5 NA (Olympus UMPLFLN 20XW) water dipping objective for detection. For the latter, a tube lens with focal length of 200 mm (Nikon MXA20696) yields an effective magnification of ×22.5. Two illumination paths, combined by a dichroic mirror (D1), enable simultaneous dual-color light-sheet and light-field illumination. For single-color calcium imaging, two separate 488-nm excitation lasers were used and the dichroic mirror (D1) was replaced by a nonpolarizing beamsplitter (Thorlabs, 70/30). A digital light sheet was generated using one of the axes of a 2D galvo pair (Cambridge Technology) combined with a scan lens and tube lens (TL1, 200 mm). To achieve a selective volume illumination for light-field excitation, the laser beam was first expanded and the central plateau of the (Gaussian) illumination profile was used to illuminate the entire volumetric FOV at once. The lateral extensions of the volumetric FOV could be adapted by changing the size of an aperture (2D slit) that was used to crop out the central region of the laser beam. A tube lens (TL2, 300 mm) focused the light on the objective back aperture. On the detection side, the microscope has two arms separated by either a dichroic mirror (D2) for dual-color imaging or a nonpolarizing beamsplitter (Thorlabs, 2', 70/30 ratio) for single-color HyLFM image acquisition and registration. The training data for our HyLFM-Net network consist of SPIM data (ground truth) and the corresponding 2D light-field image (input). 
Two options can be pursued to acquire high-resolution training data in our HyLFM setup: (1) the SPIM imaging plane remains stationary to sample the dynamics (for example, multiple heart beats) over time and obtain sufficient variability in the training data (for example, heart shapes during the beat cycle) before it moves to the next axial position in the volume and repeats the data acquisition; or (2) the SPIM modality continuously loops in 3D to acquire enough variability at each respective z plane. The second option minimizes potential photobleaching and phototoxicity effects and was thus chosen for most of the experiments. We additionally displaced the heart axially with respect to the focal plane to acquire enough variability at the extremities of the reconstruction volume and to ensure generalization of the network to differently sized hearts. Due to experimental imperfections, the FOVs of the two detection paths might not overlap completely. To register the light-sheet data to the light-field volume, we acquired a light-field image of fluorescent beads and a light-sheet stack of the same volume by displacing the light sheet with respect to the focal plane with a galvanometric mirror and refocusing on the illuminated plane with the ETL. The light-field volume was then reconstructed from the recorded light-field image using Richardson-Lucy deconvolution (LFD) as in ref. 12 and available at http://www.lightfieldscope.org/. The two corresponding volumes were registered with the Multiview Reconstruction plugin in Fiji 35, yielding the affine transformation that maps the light-field volume to the light-sheet stack. In dynamic training, this affine transformation is then used in the final layer of the network, and only the slice for which a light-sheet equivalent has been acquired is sampled from the predicted volume during training. This routine also allows for an easy comparison between SPIM images and volumes reconstructed by HyLFM-Net and LFD.
We used the affine transformation that was computed from the fluorescent bead sample throughout all experiments.
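Sampling the registered slice from a predicted volume, as done in dynamic training, can be illustrated offline with scipy (in HyLFM-Net the transform is encoded in the last network layer; the matrix/offset parametrization below is a sketch):

```python
import numpy as np
from scipy.ndimage import affine_transform

def sample_registered_slice(volume, matrix, offset, z_index):
    """Resample a predicted (z, y, x) volume into SPIM coordinates with a
    bead-derived affine transform, then extract the single z plane for
    which a light-sheet image was acquired. Names are illustrative."""
    # matrix maps output coordinates to input coordinates (scipy convention);
    # order=1 performs trilinear interpolation.
    registered = affine_transform(volume, matrix, offset=offset, order=1)
    return registered[z_index]
```

During training, the loss is then computed only between this single resampled slice and the concurrently acquired SPIM plane.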
HyLFM-Net for light-field reconstruction. The input to the network is a 3D tensor. The original 2D light-field images, composed of up to 70 × 85 lenslets with 19 × 19 pixels each, are rearranged to contain 361 (19 2 ) channels, with each channel corresponding to an angular view, that is, the same pixel position within each lenslet.
The input is normalized by its 5th and 99.8th percentiles without clipping. For training and evaluation, the light-sheet target images are normalized by their 5th and 99.8th (99.99th for beads and brain) percentiles. During training, several data augmentations are applied. These include the addition of Gaussian or Poisson noise, joint random rescaling of the light-field input image and light-sheet target image, lateral axis flipping, as well as joint random 90° rotations (applied before rearranging the light field into 361 channels). The full network architecture 36 is shown in Extended Data Fig. 1 and Supplementary Table 1. Briefly, the rearranged 3D input tensor is passed through two or three residual blocks interleaved with transposed convolutional layers scaling up by a factor of two in the lateral dimensions. The output of the last residual block undergoes a final 2D convolution, after which its channel dimension is reinterpreted as an axial dimension and a smaller channel dimension. This 4D tensor is passed through 3D residual blocks and transposed convolutional layers, further upsampling in the two lateral dimensions. For predictions aligned with the SPIM data, the last layer of the network encodes the registration of the reconstructed light-field volume to the SPIM volume. For static volumes, the SPIM volume was transformed instead. In dynamic training, only the one slice for which a light-sheet equivalent has been acquired is sampled from the predicted volume. The network is trained either with an L2 loss or with a weighted, smooth L1 loss (down-weighting non-peak-signal pixels with a decaying weight). This choice only has a notable impact on network convergence for the sparse bead data, as with the L2 loss the network converges to predicting only background. The Adam optimizer is used with the learning rate set between 1.0 × 10 −5 and 3.0 × 10 −4.
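The percentile normalization described above amounts to a two-line operation; a sketch (the epsilon guard is an assumption for numerical safety):

```python
import numpy as np

def percentile_normalize(img, p_low=5.0, p_high=99.8):
    """Normalize an image by its low/high percentiles without clipping,
    mapping the p_low percentile to 0 and the p_high percentile to 1.
    Values outside that range are kept, not clipped."""
    lo, hi = np.percentile(img, [p_low, p_high])
    return (img - lo) / (hi - lo + 1e-8)
```

Because no clipping is applied, the brightest signal (above the 99.8th percentile) retains its relative intensity, which matters when the network is expected to reproduce intensity-based information such as calcium traces.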
The hyperparameters to train HyLFM-Net-stat were determined by a Bayesian search with a 'Weights and Biases Sweep' 36 (https://www.wandb.com/). Training times for the networks are: 26.5 h for HyLFM-Net-beads (beads), 19.7 h for the static heart (stat), 68 h for the dynamic heart (dyn) and 89.8 h for the neural activity task (brain), determined by observing the smooth L1 validation loss (beads) or the MS-SSIM validation score (stat, dyn, brain). Final MS-SSIM metrics on the respective validation datasets are 0.982 ± 0.002 (beads), 0.91 ± 0.02 (stat), 0.78 ± 0.04 (dyn) and 0.90 ± 0.02 (brain). As illustrated in Fig. 4, a refined network based on HyLFM-Net-stat was additionally trained with a mini-batch size of eight and smooth L1 loss for up to six epochs on 8,883 dynamic heart slices scanning the entire volume 47 times. The refined network was tested on 756 dynamic heart slices from the same imaging session. It improved the MS-SSIM on its training and test data (both unseen test data for the unrefined HyLFM-Net-stat) from 0.79 ± 0.06/0.78 ± 0.06 to 0.94 ± 0.02/0.95 ± 0.02. We note, however, that all image quality metrics depend on the normalization of the SPIM target, so a particular threshold for the necessity of refinement has to be chosen by the user, based either on visual inspection or on network performance metrics on validation data. Similarly, the network used for Extended Data Fig. 7 (HyLFM-Net-brain) was refined on 100 planes at a fixed axial position for 30 epochs with a mini-batch size of five (roughly 1.3 h training time), which improved the MS-SSIM on the test data (500 planes) from 0.90 ± 0.02 to 0.938 ± 0.003. The refined predictions on the test planes were then used to evaluate Ca 2+ -trace correlations (see 'Calcium imaging analysis').
Inference for the summary plots in Fig. 3e was performed on 197 volumes depicting two different fish, with axial displacement and rotation used to create different samples. Inference for the summary plots in Fig. 3j was performed on 10,659 slices from 51 sweeps through the volume depicting one fish not seen in training.

Reconstruction quality analysis.
To quantify the microscope's performance in terms of spatial resolution, we imaged a 3D distribution of 0.1-µm-sized fluorescent beads (TetraSpeck, Thermo Fisher Scientific) embedded in agarose. The Fiji plugin 'Multiview Reconstruction' 35 was used to detect 3D bead positions in the recorded light-sheet stacks. The same positions were then used to fit a 3D Gaussian and to compare the full-width at half-maximum (FWHM) in the SPIM, LFD and HyLFM-Net prediction volumes, respectively. The PSF close-ups in Fig. 2a-d were computed by averaging over the same ten beads for each modality at z = −30 µm, using the MOSAIC Fiji plugin 37. To investigate bias toward training data and shape priors, we imaged 4-µm-sized fluorescent beads (TetraSpeck, Thermo Fisher Scientific), cross-applied the trained deep neural networks and computed FWHM values for all possible combinations (Extended Data Fig. 5). We computed MS-SSIM and PSNR values per z plane for light-field and network predictions, using light-sheet planes as the reference, for the fluorescent beads and the medaka heart, respectively. The following values were used for the MS-SSIM computations: NumScales=5, ScaleWeights=fspecial('gaussian', [1,NumScales], 1), Sigma=1.5 (ref. 38).

Fish husbandry and transgenic lines. All animal experiments were performed according to local animal welfare standards (Tierschutzgesetz §11, Abs. 1, number 1) and European Union animal welfare guidelines. The fish facility is under the supervision of the local representative of the animal welfare agency. Medaka was raised and maintained as described previously 40. For in vivo imaging, embryos were kept in 165 mg l −1 1-phenyl-2-thiourea in embryo rearing medium (ERM) from 1 dpf until imaging to inhibit pigmentation. For heart imaging, the following transgenic medaka lines were crossed: myl7::H2B-eGFP (ref. 16) and myl7::H2A-mCherry.
For the generation of the myl7::H2A-mCherry transgenic medaka line, the myl7::eGFP cassette of the pDestTol2CG plasmid (http://tol2kit.genetics.utah.edu/index.php/PDestTol2CG) was replaced by a myl7::H2A-mCherry cassette, and the modified plasmid was coinjected with Tol2 transposase mRNA into wild-type stock Cab embryos as described earlier 41 . The calcium imaging experiments were performed using a zebrafish (D. rerio) line with a nuclear-localized calcium sensor (Tg(elavl3:H2B-GCaMP6s)).
To create a static medaka heart for HyLFM-Net training, myl7::H2B-eGFP, myl7::H2A-mCherry transgenic medaka hatchlings were sedated with 150 mg l −1 tricaine and the heart was pharmacologically arrested with 40 mM 2,3-butanedione 2-monoxime (BDM). Pre-treated hatchlings were mounted in 1% low-melting agarose (in ERM) containing 150 mg l −1 tricaine and 40 mM BDM. In the case of re-onset of cardiac contractions, BDM stock solution (100 mM) was titrated into the sample chamber until the heart stopped beating. Light-field images and light-sheet stacks were then acquired sequentially at the same position of the static heart. To obtain sufficient variability in the training data, we imaged the heart at multiple positions and at different angles. For this purpose, a linear piezo stage was used to displace the static heart diagonally with respect to the detection objective, thereby ensuring variations in two coordinates simultaneously, while the sample angle was modified manually with a rotation stage. For imaging the beating heart, we simultaneously acquired pairs of light-field and light-sheet images at all z planes in the volume. For the SPIM modality, we scanned through the entire volume with 241 individual planes (spaced 1 µm apart). At an effective SPIM frame rate of 40 Hz (15 ms refocusing + 10 ms exposure time), the entire heart was thus imaged every 6 s.
Zebrafish calcium imaging. Zebrafish calcium imaging was performed using the nuclear-localized calcium reporter line Tg(elavl3:H2B-GCaMP6s). Zebrafish embryos were mounted as described previously 16 in 1% low-melting agarose and imaged 5 d after fertilization with alternating SPIM and light-field illumination at 10 Hz for both modalities.
Training, validation and test data were recorded by alternately acquiring light-field and SPIM images at 10 Hz each (15 ms exposure time + 12 ms read-out time for SPIM, followed by 61 ms exposure time + 12 ms read-out time for LFD). To cover the whole volume, the SPIM plane was swept along the axial dimension for training and validation data. For test data, we recorded consecutive time points at a fixed axial position to capture Ca 2+ transients.
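As a quick sanity check, the per-modality rate implied by the exposure and read-out times above can be computed directly (illustrative arithmetic only, not acquisition code):

```python
# Timing budget of one interleaved SPIM + light-field acquisition cycle
SPIM_EXPOSURE_MS = 15
SPIM_READOUT_MS = 12
LFD_EXPOSURE_MS = 61
LFD_READOUT_MS = 12

cycle_ms = SPIM_EXPOSURE_MS + SPIM_READOUT_MS + LFD_EXPOSURE_MS + LFD_READOUT_MS
rate_hz = 1000 / cycle_ms
print(cycle_ms, rate_hz)  # 100 ms per interleaved pair -> 10 Hz for each modality
```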
Calcium imaging analysis. SPIM and light-field datasets were motion corrected using the piecewise-rigid motion correction package NoRMCorre 42 . The motion-corrected SPIM dataset was used to segment regions of interest (ROIs) using custom Fiji macros. To do this, a standard deviation projection was obtained every ten frames and the maximum projection of these was used to semi-automatically segment ROIs by thresholding. The signal threshold was set such that the signal obtained from an ROI was at least twice the standard deviation of the whole FOV. This ROI set was used to extract the Ca 2+ signal from both the SPIM dataset and the corresponding light-field datasets. We calculated the z score of the extracted raw Ca 2+ signals and smoothed them using a ten-point moving average. Traces from ROIs with no clear Ca 2+ transients in the SPIM dataset were excluded from further analysis. The Pearson correlation coefficient between SPIM traces and their corresponding light-field traces was calculated and compared using the Dunn-Sidak test. Normality was verified using the Kolmogorov-Smirnov test. P < 0.05 was considered significant.
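The trace post-processing described above (z scoring, ten-point moving-average smoothing and SPIM-to-light-field correlation) can be sketched as follows; the toy transient and all variable names are illustrative assumptions, not the authors' analysis code:

```python
import numpy as np

def process_trace(raw):
    """Z-score a raw Ca2+ trace, then smooth with a ten-point moving average."""
    z = (raw - raw.mean()) / raw.std()
    kernel = np.ones(10) / 10
    return np.convolve(z, kernel, mode="same")

def pearson(a, b):
    """Pearson correlation coefficient between two traces."""
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
t = np.linspace(0, 30, 300)                # 30 s at 10 Hz
transient = np.exp(-((t - 10) % 12) / 1.5)  # toy GCaMP-like decaying transients
spim = process_trace(transient + 0.1 * rng.standard_normal(t.size))
lfd = process_trace(transient + 0.2 * rng.standard_normal(t.size))
print(pearson(spim, lfd))  # high correlation: both traces share the same transient
```

In the actual analysis, the same ROI masks are applied to both modalities so that corresponding SPIM and light-field traces sample the same cells.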

Statistics and reproducibility.
No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The datasets generated and/or analyzed during the current study are available at https://doi.org/10.5281/zenodo.4020352, https://doi.org/10.5281/zenodo.4020404 and https://doi.org/10.5281/zenodo.4019246. Links to additional datasets are provided at https://github.com/kreshuklab/hylfm-net. Source data are provided with this paper.

Fig. 1 | Network architecture. This architecture was used for beads and neural activity volumes. For the medaka heart, a slightly different layer depth was used with the same overall structure (see Supplementary Table 1). Res2d/3d: residual blocks with 2D or 3D convolutions with kernel size (3×)3×3. Residual blocks contain an additional projection layer (1×1 or 1×1×1 convolution) if the number of input channels differs from the number of output channels. Up2d/3d: transposed convolution layers with kernel size (3×)2×2 and stride (1×)2×2. Proj2d/3d: projection layers (1×1 or 1×1×1 convolutions). The numbers always correspond to the number of channels. With 19 × 19 pixel lenslets (N_num = 19), the rearranged light-field input image has 19² = 361 channels. The affine transformation layer at the end is part of the network only when training on dynamic, single-plane targets; otherwise, in inference mode it may be used in post-processing to yield a SPIM-aligned prediction, or the inverse affine transformation is applied to the SPIM target for static samples to avoid unnecessary computations.

Fig. 2 | LFM-SPIM optical setup. Schematic 2D drawing of the LFM-SPIM setup showing the main opto-mechanical components. The sample is illuminated through a single illumination objective with two excitation beam paths (orange, light-sheet illumination; blue, light-field selective volume illumination) combined by a dichroic mirror (D1). The fluorescence is detected by an orthogonally oriented detection objective and optically separated onto two detection arms with a dichroic mirror (D2).
Bandpass filters (BP1 and BP2) are placed in front of the tube lens (TL3, TL4) of the respective detection path. For the light-field detection path (green), the tube lens (TL4) focuses on the microlens array (ML), and the image plane (shown in magenta), displaced by one microlens focal length, is relayed by a 1:1 relay lens system (RL6) to an image plane coinciding with the camera sensor (shown in magenta). For the light-sheet detection path, a combination of several relay lenses (RL1 to RL4) and a 1:1 macro lens (RL5), together with a lens pair consisting of an offset lens (OL) and an electrically tunable lens (ETL), is used to image two axially displaced objective focal planes (shown in magenta, dotted and solid) to a common image plane at the sensor. Refocusing is achieved by applying different currents to the ETL. The mirror M1 is placed at a Fourier plane, such that the FOV of the light-sheet path can be laterally aligned to fit the light-field detection FOV. For single-color imaging, the dichroic mirrors D1 and D2 are replaced by beamsplitters. See Methods for details.

Extended Data Fig. 5 | Cross-application of trained deep neural networks can reveal bias to training data. We created two kinds of samples, one with small (0.1 μm) and one with medium-sized (4 μm) beads suspended in agarose. a, HyLFM-Net trained on small beads and applied to small beads; the FWHM of the beads in the reconstructed volume is shown (6,025 beads measured). b, HyLFM-Net trained on large beads and applied to large beads (682 beads measured). c, HyLFM-Net trained on small beads and used to reconstruct a volume with large beads (525 beads measured). d, HyLFM-Net trained on large beads and used to reconstruct a volume with small beads (2,185 beads measured). e, SPIM image of 0.1-μm beads; f, reconstruction by the HyLFM-Net from (a), trained on small beads; g, reconstruction by the HyLFM-Net from (d), trained on large beads. h, SPIM image of 4-μm beads; i, reconstruction by the HyLFM-Net from (b), trained on large beads; j, reconstruction by the HyLFM-Net from (c), trained on small beads. A line profile highlights a reconstruction error (red arrows): the network reconstructs very small beads (as found in its training data) and produces an additional erroneous peak where none is present in the ground-truth SPIM volume. Shading in (a-d) denotes standard deviation. Scale bars: 2 μm in (e-g), 10 μm in (h-j).
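The lenslet-to-channel rearrangement mentioned in the Fig. 1 caption (19 × 19 pixel lenslets yielding 19² = 361 input channels) can be sketched in NumPy. The regular-grid lenslet layout and the function name are assumptions for illustration:

```python
import numpy as np

def rearrange_lightfield(img, n=19):
    """Rearrange a raw light-field image (lenslets tiled on a regular n x n
    pixel grid) into n*n channels, one per pixel position under each lenslet."""
    h, w = img.shape
    assert h % n == 0 and w % n == 0, "image must be an integer number of lenslets"
    # (H, W) -> (n*n, H/n, W/n): channel c holds pixel (c // n, c % n) of every lenslet
    return (img.reshape(h // n, n, w // n, n)
               .transpose(1, 3, 0, 2)
               .reshape(n * n, h // n, w // n))

lf = np.zeros((19 * 50, 19 * 60))          # toy light field: 50 x 60 lenslets
print(rearrange_lightfield(lf).shape)      # (361, 50, 60)
```

This pixel-unshuffle-style rearrangement turns the angular information encoded under each lenslet into the channel dimension expected by the 2D convolutional front end.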