Maximum a posteriori natural scene reconstruction from retinal ganglion cells with deep denoiser priors

Visual information arriving at the retina is transmitted to the brain by signals in the optic nerve, and the brain must rely solely on these signals to make inferences about the visual world. Previous work has probed the content of these signals by directly reconstructing images from retinal activity using linear regression or nonlinear regression with neural networks. Maximum a posteriori (MAP) reconstruction using retinal encoding models and separately-trained natural image priors offers a more general and principled approach. We develop a novel method for approximate MAP reconstruction that combines a generalized linear model for retinal responses to light, including their dependence on spike history and spikes of neighboring cells, with the image prior implicitly embedded in a deep convolutional neural network trained for image denoising. We use this method to reconstruct natural images from ex vivo simultaneously-recorded spikes of hundreds of retinal ganglion cells uniformly sampling a region of the retina. The method produces reconstructions that match or exceed the state-of-the-art in perceptual similarity and exhibit additional fine detail, while using substantially fewer model parameters than previous approaches. The use of more rudimentary encoding models (a linear-nonlinear-Poisson cascade) or image priors (a 1 /f spectral model) significantly reduces reconstruction performance, indicating the essential role of both components in achieving high-quality reconstructed images from the retinal signal.


Introduction
A torrent of visual information arrives at each of our eyes, but only a small portion of it is transmitted to the brain via the optic nerve, which is comprised of the axons of the retinal ganglion cells (RGCs). Elucidating the nature of this encoded information, and the inference process the brain uses to interpret it, is fundamental to understanding biological vision. Image reconstruction provides a method of visualizing the information encoded in RGC signals, evaluating it using standard image quality metrics, and reasoning about how the brain might interpret it [1,2,3,4]. The fidelity and quality of reconstructed images also provides a useful objective function for optimizing the design of electrical stimulation patterns delivered by devices implanted to restore vision [5,6].
The simplest and most well-studied image reconstruction method is linear regression [1,3,7]. Optimal reconstruction kernels are learned for each RGC using least-squares regression of recorded responses to many visual images, and the reconstruction of a new incident image is computed with the sum of the filters weighted by the response of each cell. The quality of linearly reconstructed images can be enhanced by applying an autoencoder neural network to leverage natural image priors [8], or by using deep neural networks to non-linearly recover additional high spatial frequency image components [4]. Neural networks can also be directly trained (supervised) for reconstruction, but this is data-intensive, and to date has limited their use to simulated data, or low-dimensional stimuli and small numbers of cells [9]. These regression approaches leave substantial room for improvement and interpretation. A Bayesian formulation, in which encoding model and prior probabilities are made explicit and are separately fitted, could provide a more flexible and interpretable solution, and could potentially improve the fidelity of reconstructed images.
Here we present a method for approximate maximum a posteriori (MAP) image reconstruction from RGC spikes, that combines a retinal encoding model that accurately captures retinal responses [10] with state-of-the-art image priors that are implicitly embedded in deep denoising networks [11,12,13,14,15,16]. By separating the effects of image prior and retinal spiking response likelihood, our method offers two primary advantages over existing methods for reconstruction: (1) any pre-trained or closed-form natural image prior can be used, and the effects of different priors can be compared; and (2) any model of RGC encoding that provides an explicit likelihood can be used, and the method can quantify the relative importance of different model components in representing the visual signal, including spike train history, cell-to-cell correlations and output nonlinearities. We apply our method to reconstruct static flashed natural images from responses of several hundred macaque RGCs of identified types recorded with large-scale electrode arrays. We compare our method directly to published state-of-the-art linear and neural network regression methods (Section 4.1). The new method matches or significantly outperforms previous methods, producing sharper, more naturalistic reconstructions (Figure 2), and similar or greater perceptual similarity to ground truth ( Figure 3, Tables 1 and 2). However, our method also produces some reconstructions with distinctive spurious image structure, as would be expected when RGC signals are noisy and image priors dominate the reconstruction process. Finally, comparisons to more conventional encoding models and image priors reveal that both aspects of the approach are important for the most accurate reconstructions.

Retinal data and stimuli
Extracellular recordings from RGCs in the peripheral macaque retina were performed ex vivo using a 512-electrode system [17] as described previously [3]. Retinas were obtained from terminally anesthetized macaque monkeys used by other laboratories, in accordance with Institutional Animal Care and Use Committee requirements. Spikes from individual RGCs were identified with the YASS [18] spike-sorter. A 30-minute spatiotemporal white noise stimulus [19] was used to compute spatio-temporal receptive fields, and to identify cells of distinct types. Analysis focused on the four major RGC types of the primate retina (ON parasol, OFF parasol, ON midget, and OFF midget) [20,21], totaling roughly 700 cells per recording. The receptive fields [22] of all four cell types formed regular mosaics, uniformly covering a region of visual space ( Figure 1). Natural images were presented to the retina as described previously [3]. Images from the ImageNet database [23] were converted to grayscale and cropped to 256x160. Each pixel measured approximately 11 ⇥ 11 µm at the retina. Each stimulus image was displayed for 100 ms, followed by a 400 ms uniform gray display, allowing each image presentation to be treated as an independent trial. This trial design does not fully mimic natural vision because it does not account for eye movements [24], and because the temporal component of the stimulus is known to the reconstruction algorithm. Two retinas from different animals were used, with 19,000 and 10,000 image/response pairs, respectively. Details are summarized in Tables 4 and 5 in the Appendix. Figure 1: Receptive field mosaics from one retina for the four major RGC types (ON parasol, OFF parasol, ON midget, OFF midget) used in the image reconstructions. Image quality metrics were computed over the shaded blue region, to exclude areas that were insufficiently covered by receptive fields of recorded RGCs.

MAP image reconstruction from RGC spikes
MAP reconstruction estimates the stimulus image x from observed RGC spike trains s by minimizing the negative log of the posterior, log p(x | s), which can be expressed using Bayes' Rule as: Both terms in equation (1) have intuitive interpretations in the context of reconstruction from RGC spikes. The first, log p(s | x), is the negative log likelihood (NLL) of an encoding model describing the probabilistic spiking of RGCs given a stimulus. The parameters of this model can be learned from experimental data. The second is the negative log prior of the stimulus image x and can be learned from natural images independently of retinal responses. Because encoding models with varying levels of fidelity and detail can be mixed and matched with priors of varying sophistication, the MAP approach allows us to probe the distinct roles of these two components in image reconstruction [25,26].

RGC encoding models
Encoding models for each RGC must be learned from the experimental data before performing MAP reconstruction. Two types of encoding models were fitted to the data: (1) a linear-nonlinear-Poisson (LNP) cascade model with an exponential nonlinearity, the most commonly used model of RGC responses to visual stimuli [7]; and (2) a generalized linear model (GLM) that augments the LNP model with a feedback loop and cross-connections between neighboring cells, and which can accurately capture fine spike timing structure and cell-cell correlations [10]. Encoding models were fitted on the entire training partition, and regularization hyperparameters were tuned by evaluating the test partition NLL for a small subset of RGCs of each type.
Linear-nonlinear-Poisson (LNP) encoding model The LNP model is the de facto standard model for describing the probabilistic spiking of RGCs in response to visual stimuli [7]. The model parameters for a single RGC consist of a linear spatial filter m, and a scalar bias b. In the model, a scalar generator signal m T x + b is passed through a nonlinearity to compute a spike rate . The RGC spike count in a 150 ms interval is modeled as a draw from a Poisson distribution with rate .
Assuming an exponential nonlinearity, the encoding NLL for a single RGC is which is convex in m and b. In practice, to ensure that the spatial filters were spatially compact and corresponded approximately with the receptive fields obtained using reverse correlation with white noise, the MAP objective was augmented with two regularization terms: an L1 sparsity-inducing penalty, and an L2 penalty enforcing similarity to the receptive field. The complete objective function with both regularization terms is described in A.4. The parameter optimization was solved separately for each cell using FISTA [27].
MAP reconstruction requires the joint encoding NLL of the observed spikes from every RGC given the image. Since the LNP model assumes that the RGC responses are statistically independent, this is simply the sum of the single-cell NLLs, which is convex in the image x: Generalized linear encoding models The generalized linear model (GLM) is an augmentation of the LNP model that incorporates the effects of spiking history and cell-cell correlations on neural response [10]. In the GLM, each RGC (indexed by i) is parameterized by a spatio-temporal stimulus , and a bias b i . The GLM was fitted to spike counts measured within 1 ms time bins, approximately matched to the refractory period of the cells [28,29]. To limit the number of parameters and improve computational efficiency, the stimulus filter was assumed to be space-time separable Using a sigmoidal nonlinearity and Bernoulli spiking, the encoding NLL used to fit a single cell (see A.5.2 for complete derivation) is To simplify the GLM, the filters h i , f i , and c (j) i were each represented as weighted sums over a set of cosine "bump" functions [10]. As with the LNP model, L1 and L2 regularization terms were added to constrain the spatial filters, and an additional L 1,2 group sparsity penalty for the coupling filters was added to eliminate spurious cell-cell correlations. The complete objective function is described in detail in A.5.3. Model parameters were found by alternating between spatial and temporal filter convex minimization steps for each RGC, using FISTA [27,30] for each step.
The joint encoding NLL over all of the cells used for image reconstruction (see A.5.4 for derivation) is again convex in the image x:

MAP with Gaussian 1/F priors
The Gaussian 1/F prior is the among the simplest and most commonly used image priors [31], and is the basis for many classical image processing algorithms. The 1/F prior assumes that pixels of the image are drawn from a stationary jointly Gaussian distribution (and thus that the spatial covariance matrix is diagonalized by the 2D Fourier basis) and that spectral power (variance of each spatial frequency component) falls off in inverse proportion to the square of the frequency F 2 . Discarding terms that do not depend on the image x, the negative log prior can be written is the amplitude of the k th Fourier coefficient at frequency f k . MAP image reconstruction using the 1/F prior can be performed using standard unconstrained convex minimization methods as both the negative log prior and the RGC encoding NLLs described in (3) and (6) are convex.

Approximate MAP with denoising convolutional neural network (dCNN) priors
Modern denoising convolutional neural networks can represent powerful image priors, but these priors are implicit [12,32]: they are not expressed in closed form, and their values cannot be computed explicitly, making exact MAP inference difficult. The "plug-and-play" methodology provides an approximate iterative procedure for using such denoisers in MAP estimation problems [11], by incorporating them into variable-splitting optimization methods such as half-quadratic splitting (HQS) [33] or alternating direction method of multipliers (ADMM) [34,35]. Here, we adopt a method based on the HQS method presented in [15] to perform MAP reconstruction from RGC spikes. As in [15], we introduce an auxiliary variable z, split the original problem in (1) into two sub-problems, incorporate a regularization parameter p to control the prior term, and solve by alternating between two complementary optimization problems: Since the problem in equation (8) has the same objective function as MAP Gaussian denoising with known noise variance p /⇢ (k) , we solve it approximately with a single forward pass of a pre-trained unblind DRUNet denoiser network [15], resulting in Algorithm 1.
Unlike most applications of HQS, the encoding term in equation (7) is non-quadratic in x and hence (7) was solved iteratively (gradient descent with momentum and backtracking line search) rather than in closed form. z (1) was initialized as the linear solution, though using random Gaussian initialization does not significantly affect the results (see Appendix A.6). Because convergence in the mathematical sense is not necessary for most imaging applications [36], K = 25 iterations were used in Algorithm 1. As in [15], ⇢ (k) was increased per-iteration on a log-spaced schedule. The hyperparameters p and [⇢ (1) , ⇢ (25) ] were determined by performing a grid search and evaluating reconstruction quality on an 80-image subset of the test partition. Variations in hyperparameters over a reasonable range (⇢ (1) 2 [10 2 , 10 1 ], ⇢ (25) 2 [30, 500], p ⇡ 0.1) produced similar reconstruction quality, and optimal hyperparameters were similar across the two retinas and across LNP and GLM encoding models. Approximate MAP reconstructions using this algorithm are termed MAP-GLM-dCNN and MAP-LNP-dCNN for the GLM and LNP encoding models, respectively.

Benchmark: nonlinear regression with artificial neural networks
Current state-of-the-art methods for reconstruction of natural images from RGC spikes rely on an initial linear reconstruction step [1,3], followed by ad hoc application of nonlinear neural networks. Specifically, Parthasarathy et al.

Approximate MAP with GLM/dCNN matches or exceeds state-of-the-art results
To test whether our MAP-GLM-dCNN method outperforms state-of-the-art approaches, image reconstructions were generated from the test partitions of the datasets, and were compared both qualitatively and quantitatively. Example reconstructions are shown in Figure 2 for the L-CAE [8], for Kim et al. [4], and for our method. MAP-GLM-dCNN reconstructions are seen to be sharper than those of L-CAE, and contain additional image details (especially extended contours, as in rows C, E, G, H, I, and L). When compared to Kim et al., MAP-GLM-dCNN tended to recover more content, particularly straight edges (rows E, G, H, I, and L), but sometimes exaggerated the contrast (rows A, H, and I). MAP-GLM-dCNN produced qualitatively different artifacts than the other methods. In particular, it sometimes hallucinated naturalistic structure not present in the stimulus images (rows J, K, and N), including striking irregularities in contours (rows D, K, L, and M).
Quantitative comparisons between MAP-GLM-dCNN and the two benchmark regression methods were also made. Scatter plots comparing MS-SSIM and PSNR on the test partition of one retina are shown in Figure 3A and 3B for Kim et al. and the L-CAE, respectively, and summary statistics over the test and heldout partitions for both retinas are presented in Tables 1 and 2. On an image-forimage basis, MAP-GLM-dCNN reconstructions have greater MS-SSIM than those of L-CAE (3B), demonstrating that the new method systematically achieves greater perceptual similarity to ground truth. The MAP-GLM-dCNN method resulted in comparable MS-SSIM perceptual similarity to the much more complicated Kim et al. method (3A). The PSNR of MAP-GLM-dCNN reconstructions was systematically worse than either benchmark. This is not surprising, as the MAP optimization procedure does not necessarily minimize MSE. These results held for both retinas (Tables 1 and 2).

Deep denoiser prior substantially improves image quality over 1/F prior
To test the importance of the image prior, MAP-GLM-dCNN results were compared against reconstruction using the GLM encoding model with the classical 1/F Gaussian prior (MAP-GLM-1F). Example reconstructed images using the denoiser prior and 1/F prior are shown in columns 5 and 7, respectively, of Figure 2. Images reconstructed with the denoiser prior are less "grainy", and tend to have better-defined edges and smoother surfaces. The artifacts seen in the 1/F examples are expected, since this simple prior does not constrain phase [31], whose alignment is essential for generating sharp spatially-localized features. Scatterplots of image quality on the test partition using MS-SSIM and PSNR are shown for one retina in Figure 3C, and mean values for both retinas are summarized in Tables 1 and 2. Consistent with the visual appearance, PSNR and MS-SSIM were systematically higher when using the denoiser prior, in both retinas. Thus, using the more sophisticated denoiser image prior substantially increased the perceptual similarity of the reconstructions to ground truth.

GLM encoding model recovers additional image structure over LNP encoding model
To test the importance of the encoding model, we compared images reconstructed using the GLM and LNP encoding models, both using the same denoiser prior. Example images are shown in columns 5 and 6 of Figure 2. Images reconstructed using both models exhibit natural image structure like smooth surfaces and well-defined edges, but the GLM-reconstructed images tended to have more realistic-looking textures, whereas the LNP-reconstructed images tended to be overly simplified. Moreover, the GLM method recovered more high spatial frequency details (e.g., the legs of the insect in row C, the horizontal stripes on the tape cassette in row D, and the details on the hammock in row E, and the structure on the file cabinets in row G). The quality of image reconstructions for each image/response pair in the test partition for one retina were compared using MS-SSIM and PSNR in Figure 3D, and their mean values over the test and held out partitions for both retinas are summarized in Tables 1 and 2. In both retinas, images reconstructed using the GLM encoding model had systematically greater MS-SSIM scores, indicating greater perceptual similarity to ground truth, than those reconstructed using the LNP encoding model. This demonstrates that the choice of encoding model significantly affects reconstruction quality, and that the inclusion of temporal spike dependencies and cell-to-cell correlations in the more sophisticated GLM encoding model provides important constraints on the information encoded by the RGC spikes. This finding is consistent with previous work showing that decoding using the GLM (without priors) can access more information than simplified models lacking the cell-cell correlations or spiking history [10].

Discussion
This paper presents a novel approximate MAP method for reconstructing natural images from the simultaneously recorded spikes of several hundred RGCs, using an accurate probabilistic model of retinal encoding and a natural image prior implicit in a pre-trained denoising neural network. The method matches or outperforms the current state-of-the-art in terms of recovering naturalistic image structure and/or the perceptual similarity of reconstructions to ground truth, while also being more principled and interpretable due to the explicit Bayesian separation of the encoding model and prior. The new approach uses substantially fewer parameters than previous state-of-the-art methods based on CNNs, and does not require training CNNs on retinal data (the prior is obtained from a network trained exclusively on image denoising). We showed that both encoding model and image prior contributed to the high-quality image reconstructions: removal of either substantially degraded performance. Thus, we expect that cell-cell correlations and temporal structure of spike trains, as well as image priors, will prove important in understanding how the retinal signal is used by the brain.
Several previous studies have used GLM encoding models for stimulus reconstruction from experimentally-recorded retinal signals, revealing the significance of cell-cell correlations for decoding temporal structure in white noise stimuli [10,38], and the significance of the temporal structure of spike trains in tracking moving features [2]. By including a complex natural image prior into a Bayesian reconstruction method, the present work more efficiently exploits both the GLM and experimental data to produce state-of-the-art natural image reconstructions.
The enhanced reconstructions and interpretability obtained with our method could lead to improved function of retinal implants for restoring vision. Previous work [5] has suggested that electrical stimulation with a retinal implant can be guided by minimizing the expected MSE of linearly reconstructed images. This method ignores potentially important cell-cell correlations and fine temporal structure in RGC spike trains, and assumes that image priors captured by linear regression are sufficient for high performance. The method presented here offers an alternative approach to choosing simulation patterns to produce higher-fidelity artificial vision, while potentially being more robust than ad hoc neural network methods. However, achieving this in real time with minimal latency presents a substantial technical challenge.
Though the present work is limited to reconstruction of flashed static natural images from RGC spikes, extensions of our approximate MAP reconstruction method could be used to probe how neurons encode visual information under more natural conditions. For example, a central problem is understanding how the visual system achieves high-acuity perception in the presence of "jitter" in eye position, even when fixated [24]. Previous computational efforts have probed this question, but have been largely limited to simulated data with simple encoding models and stimuli [39,40,41]. Combining the methods put forth here with modern algorithms for image deblurring and motioncorrection [42,15] could yield more powerful methods to decode images from jittered retinal inputs. A related problem is understanding how the retina encodes the information contained in complex naturalistic movies [2], including movement of objects within a scene and other non-rigid transformations over time. The dimensionality of such stimuli and the consequent data requirements are high, so the ability to capture stimulus priors using modern machine learning tools [43,44] separately from the retinal data, as was done here, will be important for understanding reconstruction in more naturalistic visual contexts.