Abstract
Neural system identification aims at learning the response function of neurons to arbitrary stimuli using experimentally recorded data, but typically does not leverage coding principles such as efficient coding of natural environments. Visual systems, however, have evolved to efficiently process input from the natural environment. Here, we present a normative network regularization for system identification models by incorporating, as a regularizer, the efficient coding hypothesis, which states that neural response properties of sensory representations are strongly shaped by the need to preserve most of the stimulus information with limited resources. Using this approach, we explored whether a system identification model can be improved by sharing its convolutional filters with those of an autoencoder which aims to efficiently encode natural stimuli. To this end, we built a hybrid model to predict the responses of retinal neurons to noise stimuli. This approach not only yielded a higher performance than the “stand-alone” system identification model, but also produced more biologically plausible filters. We found these results to be consistent for retinal responses to different stimuli and across model architectures. Moreover, our normatively regularized model performed particularly well in predicting responses of direction-of-motion sensitive retinal neurons. In summary, our results support the hypothesis that efficiently encoding environmental inputs can improve system identification models of early visual processing.
Significance Computational models use experimental data to learn stimulus-response functions of neurons, but they are rarely informed by normative coding principles, such as the idea that sensory neural systems have evolved to efficiently process natural stimuli. Here, we introduce a novel method to incorporate natural scene statistics to predict responses of retinal neurons to visual stimuli. We show that considering efficient representations of natural scenes improves the model’s predictive performance and produces biologically plausible receptive fields. Generally, our approach provides a promising new framework to test various (normative) coding principles using experimental data for understanding the computations of biological neural networks.
Introduction
In the past years, advances in experimental techniques enabled detailed, large-scale measurements of activity at many levels of sensory processing (1). As a consequence, neural system identification (SI) approaches have flourished (Fig. 1a top). They empirically fit the stimulus-response (transfer) function of neurons based on experimentally recorded data (2–4). A classic example is the generalized linear model (GLM, (2, 5)), which consists of a linear filter as a first order approximation of a neuron’s response function (i.e., its receptive field; (6)), followed by a point-wise nonlinear function for the neuron’s output. To account for additional non-linearities (e.g., (7, 8)), several extensions, such as linear-nonlinear cascades (9, 10), have been proposed. More recently, deep neural network-based SI approaches inspired by the hierarchical processing along the visual pathway (11, 12) have been developed (reviewed in (13–17)). While SI methods became particularly successful in predicting responses of visual neurons (18–22), they often require large amounts of training data and, more critically, rarely consider adaptations to the natural environment.
However, like other senses, vision has evolved to promote a species’ survival in its natural environment (23), driving visual circuits to efficiently represent information under a number of constraints, including metabolic limits and space restrictions (24, 25). As a consequence, the visual system has adapted to natural statistics, as shown, for example, by the fact that the distribution of orientation preferences of visual neurons mirrors the dominance of cardinal orientations in natural scenes (26–28).
Such adaptations are at the heart of efficient coding (EC) approaches (Fig. 1a bottom): They derive computational principles underlying neural systems from the statistical properties of natural stimuli and by incorporating biological constraints (15, 24, 25, 29–31). Here, one popular strategy starts from the assumption that early visual processing serves to decorrelate the redundant signals in natural environments (32, 33). This theory can reproduce feature selectivity, e.g., difference-of-Gaussian (DoG) kernels that have similar receptive field (RF) properties as retinal ganglion cells (RGCs; (34)). Recently, deep neural network-augmented EC approaches were proposed, such as convolutional autoencoders (35, 36), which are trained to optimally reconstruct inputs in the presence of an information “bottleneck” (i.e., from a constrained latent representation). Such convolutional autoencoders have been shown to yield center-surround spatial RFs with similar properties as those observed in RGCs when encoding either pink (1/f) noise or natural scenes (37, 38). Still, a downside of EC is that it is not always straightforward to experimentally measure coding efficiency and feature selectivity predicted by these approaches in neural systems (discussed in (39, 40)) and, hence, the interpretation of EC models with respect to the biological underpinnings remains challenging.
Here, we tested whether the EC hypothesis can serve as a useful inductive bias for learning the response functions of neurons. To investigate this, we built a hybrid model combining a SI branch with an EC branch, forced the two branches to share filters (Fig. 1b), and asked whether knowledge about natural scene statistics could help predict retinal responses. To this end, we experimentally recorded Ca2+ signals of neurons in the mouse retina while presenting it with visual stimuli and then used these responses to train the SI branch, which aims to predict retinal responses. We used natural movies that we recorded in mouse habitats outdoors to train the EC branch, which aims to represent natural scenes efficiently (38). We found a synergy between neural prediction and natural scene statistics: The hybrid approach not only had a better predictive performance than a pure SI approach, but also produced more biologically plausible filters. Our results demonstrate that predicting sensory responses benefits from considering adaptations to the natural environment.
Results
Hybrid system identification and efficient coding models
To test if learning an efficient representation of natural input could help predict neuronal responses in the early visual system, we employed normative regularization, i.e., statistical regularization that is informed by normative coding principles, such as the idea that sensory systems have evolved to efficiently process natural stimuli. Specifically, we used this strategy to incorporate EC as a regularizer and developed a hybrid model that combines SI-based neural prediction and EC in a single model. The two model branches are linked by shared convolutional filters (Fig. 1b).
The SI branch approximates the response functions of recorded neurons to a dense visual noise stimulus (see below) and was implemented as a convolutional neural network (CNN) (Fig. 2a). Here, we used an L2 regularization on the convolutional layers to encourage smooth filters (41) and an L1 regularization on the fully connected (FC) layer for sparse readouts ((19); for details, see Methods).
The EC branch was trained to efficiently reconstruct input stimuli (i.e., natural scenes) from a constrained latent representation. For this branch, we used a convolutional autoencoder network (CAE) that we published before (for details, see (38) and Methods). Also in the EC branch, we enforced smooth filters by using L2 regularization, and limited the bandwidth by adding Gaussian noise and imposing L1 regularization on the hidden activations. The latter regularization also encourages sparse representations.
In the hybrid model, we implemented interactions between the two branches by shared filters (symbolized by the red circle in Fig. 1b). Both branches were trained in parallel, with a weighted sum of their respective losses (LSI and LEC) used as optimization objective. By changing the weighting of the two losses, we were able to control the relative contribution of the two branches to shaping the shared filters, and test to which degree efficient representations of natural scenes improve neural predictions (Fig. 2a,b). Specifically, weight w was used to define the hybrid model’s loss function as LHybrid = w · LSI + (1 − w) · LEC (Methods). For w = 1, the EC branch had no influence on the shared filters and, hence, the hybrid model behaved like the pure SI model. Conversely, for w = 0, the SI branch had no influence on the shared filters and, hence, the hybrid model behaved like the pure EC model. Thus, the smaller the weight, the more the EC branch contributed to shaping the filters.
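The following is a minimal PyTorch sketch of this filter-sharing scheme; layer shapes and variable names are illustrative, and the temporal convolution, noise injection, and regularizers described in Methods are omitted:

```python
import torch
import torch.nn as nn

# Illustrative shapes: 2-channel 28x28 stimuli, 16 shared 9x9 filters, 96 neurons.
shared_conv = nn.Conv2d(2, 16, kernel_size=9)         # filters shared by both branches
readout = nn.Linear(16 * 20 * 20, 96)                 # SI readout (28 - 9 + 1 = 20)
decoder = nn.ConvTranspose2d(16, 2, kernel_size=9)    # EC decoder (20 -> 28)

def hybrid_loss(stim_si, rates, stim_ec, w):
    # SI branch: Poisson loss between predicted and recorded responses.
    pred = torch.exp(readout(torch.relu(shared_conv(stim_si)).flatten(1)))
    l_si = (pred - rates * torch.log(pred + 1e-8)).mean()
    # EC branch: reconstruct the natural input from the shared-filter code.
    recon = torch.tanh(decoder(torch.relu(shared_conv(stim_ec))))
    l_ec = ((recon - stim_ec) ** 2).mean()
    # Weighted objective: w = 1 -> pure SI, w = 0 -> pure EC.
    return w * l_si + (1.0 - w) * l_ec
```

Because `shared_conv` appears in both branches, gradients from both losses update the same filter weights, which is the mechanism by which the EC branch regularizes the SI branch.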
To evaluate the influence of stimulus statistics on neural response predictions, we fed not only natural stimuli to the EC branch, but also phase-scrambled natural stimuli as well as noise. We refer to these models as hybrid-natural, hybrid-pha-scr and hybrid-noise (Fig. 2c). Moreover, to examine whether the performance improvements could be attributed to simple low-pass filtering, we trained SI networks using spatial convolutional filters composed from different numbers of basis functions derived from principal component analysis (PCA) on natural images (Fig. 2d), or the discrete cosine transform (DCT). These models are referred to as SI-PCA and SI-DCT networks.
To train the SI branch of our hybrid framework, we recorded somatic Ca2+ responses from populations of cells in the ganglion cell layer (GCL) of the ex-vivo mouse retina to 9-min-long noise stimuli using two-photon imaging (Fig. 3a; Methods; (42, 43)). The GCL contains the RGCs, which are the retina’s output neurons and, in the mouse, form about 40 parallel feature channels to higher visual brain areas (reviewed in (23)). RGCs gain their specific response properties by integrating upstream input from distinct sets of bipolar cells and amacrine cells. Note that the GCL also contains some “displaced” amacrine cells (dACs; (42, 44)). If not indicated otherwise, we did not distinguish between these two GCL cell classes in our datasets. The noise stimulus contained two chromatic components (UV, green) matching the spectral sensitivities of mouse photoreceptors (45). We used the data of n=96 GCL cells that passed our quality criteria (Methods) to fit a pure SI model with factorized spatial and temporal convolutional filters, whose predictive performance served as our baseline (Fig. 3b left).
Neural system identification benefits from natural scene statistics
First, we measured the predictive performance of the hybrid-natural model on the validation data (for hyperparameter tuning) by systematically varying the relative impact of the two branches by changing the weight w. We found that the performance steadily increased with increasing EC influence (i.e., decreasing w) up to an optimum (peaking at w = 0.2; Fig. 3c, red), after which the SI branch had too little influence on the shared filters and the performance dropped. Note that the correlation values for the validation data are relatively low because these predictions were calculated on a single-trial basis (Methods).
Next, we replaced the natural input to the EC pathway by phase-scrambled scenes (hybrid-pha-scr) and white noise across space and chromatic channels (hybrid-noise). Like for the hybrid-natural model, the performance of the two control models also increased with increasing EC influence up to a certain point, peaking at w = 0.3 and w = 0.4 for hybrid-pha-scr and hybrid-noise, respectively (Fig. 3c). This indicates that when incorporating EC, all hybrid model versions showed some improvement up to certain w values, before performance sharply declined.
To test to what extent simple low-pass filtering contributes to the performance improvement observed for the hybrid-natural model, we quantified the performance of two additional SI models, one with PCA and the other one with DCT bases. By varying the number of bases used, we found a maximum in predictive performance at 16 and 4 bases for SI-PCA and SI-DCT (zig-zag ordering), respectively (Suppl. Fig. S1b).
Finally, to compare the performance on the test data across models, we picked, for each model, the w or number of bases with the best predictive performance on the validation data. We found that the hybrid model with natural inputs to the EC branch attained the best performance among all tested models (Fig. 3d,e). The hybrid-natural model’s superior performance compared to the hybrid-pha-scr model suggests that the benefit of learning natural scene statistics extends beyond second-order statistics such as the 1/f power spectrum of natural images. Nevertheless, the hybrid-pha-scr model performed better than the hybrid-noise version, pointing at a general benefit of learning second-order statistics in the EC branch. Moreover, the hybrid-natural model was consistently better than the low-pass filtering control models (SI-PCA and SI-DCT), suggesting that simple low-pass filtering does not fully explain the benefits of sharing kernels with the EC branch trained to efficiently represent natural stimuli.
Together, our results suggest that normative network regularization — in particular, based on natural statistics — can improve the performance of neural SI models.
Hybrid models with natural inputs learn the most biologically-plausible filters
To confirm that our hybrid models capture the properties of the recorded cells, we estimated their RFs (Fig. 3b; Suppl. Fig. S1f; Methods). Indeed, we found that the models learned antagonistic center-surround RFs with biphasic temporal kernels, reminiscent of RGC RFs found in other studies (2, 42). To gain insight into the degree to which our models resembled biological visual systems, we next investigated the internal representations by analyzing the filters of the models’ subunits (18, 46). To this end, we compared the shared spatial convolutional filters between our tested models. As neurons in the retina and further upstream in the early visual system often feature smooth, Gaussian or DoG shaped RFs (e.g., (42, 47, 48)), we considered models with such shared filters as more biologically plausible than those with other filter organizations.
Interestingly, while the learned neuronal RFs were quite consistent between models (cf. Fig. 3b), their shared spatial filters differed considerably (Fig. 3f,h). When using natural images in the EC branch (hybrid-natural), filters indeed became smoother and more Gaussian-shaped, which may be a result of the regularization by the EC branch on the SI branch and which may have contributed to the performance improvement of predicting responses. This effect persisted, though in weaker form, when phase-scrambled images were used (hybrid-pha-scr). Moreover, for smaller w values (i.e., stronger EC influence), Gaussian-shaped filters became more frequent in the hybrid-natural but not in the hybrid-noise model (Fig. 3f, upper vs. lower row). For the SI models with PCA or DCT bases, we found all filters to be smooth as they profited from the low-pass filtering of the respective transformation. However, compared to the hybrid-natural model, their filters were less frequently Gaussian-shaped (Fig. 3h).
To quantify these findings, we fitted 2D Gaussian functions to the filters and measured the goodness of the fit via the coefficient of determination (R-squared; Methods). Notably, for all three hybrid models, the w with the best Gaussian fit was the same w that also resulted in the best response predictive performance (w = 0.2, w = 0.3, and w = 0.4 for hybrid-natural, hybrid-pha-scr, and hybrid-noise, respectively; Fig. 3g). The filters of the hybrid-natural model resembled smooth 2D Gaussians more closely than those of any other model (Fig. 3i), including SI-PCA and SI-DCT. The difference in fit quality between hybrid-natural vs. hybrid-pha-scr and hybrid-pha-scr vs. hybrid-noise may be related to higher-order statistics and second-order statistics of natural scenes, respectively.
Taken together, our comparisons of the hidden spatial representations suggest that natural scene statistics promote latent feature representations akin to transformations in the early visual system.
Efficient coding increases the data efficiency of system identification
Next, we asked if the observed performance increase in the hybrid-natural vs. the baseline SI model was sensitive to the amount of training data, both with respect to their response predictions (Fig. 4a) and their learned spatial filters (Fig. 4b). To this end, we trained the SI and the hybrid-natural model (w = 0.2) with different amounts of data, ranging from 30% to 100%.
Not unexpectedly, when more training data was used, predictive performance increased for both models (Fig. 4a top). However, we also found that the performance of the hybrid-natural model was consistently higher than that of the SI model, with the difference becoming significant for ≥60% and peaking at around 90% training data (Fig. 4a bottom). Additionally, for both models the spatial filters became increasingly more Gaussian-like with more data (Fig. 4b). We also observed that the performance difference dropped for large dataset sizes; we expect this difference to approach zero asymptotically in the regime of infinite data.
Together, these results suggest that a hybrid-natural model, which has access to natural statistics, requires significantly less training data than the baseline SI model.
Hybrid models for testing temporal coding strategies
It has been suggested that early stages of visual processing, rather than encoding a past stimulus, aim at predicting future stimuli in their temporal stream of inputs (24, 49–51). Such a future prediction strategy is thought to require a smaller dynamic range to be encoded than that needed for representing past stimuli (past encoding), and thus allows for lower energy consumption (52, 53). Therefore, we next tested if the neural encoding task would profit even more from natural statistics when spatio-temporal (i.e., 3D) filters were shared between the hybrid model’s two branches. We implemented both strategies — past encoding and future prediction — in the EC branch, and compared their influence on the SI task (54).
We modified the 2D SI model to use spatio-temporal (instead of factorized spatial and temporal) convolutional filters to predict neural responses for 8-frame noise movies (3D SI model; Suppl. Fig. S2a). Likewise, we employed spatio-temporal convolutional filters for the EC branch. As before, the two branches of the resulting hybrid model were trained in parallel, but now sharing spatio-temporal filters. In the past encoding case, the EC branch was trained to reconstruct the 7th frame (at t – 1) of a continuous 8-frame natural movie clip based on frames at t – 7 to t (hybrid-natural-past; Suppl. Fig. S2b,c). In the future prediction case, the EC branch was trained to predict the 8th unseen frame based on the first 7 frames (t – 7 to t – 1) of the clip (hybrid-natural-future; Suppl. Fig. S2d left).
Like for the 2D models, we varied w or the number of bases and then selected the best model for each condition (3D SI, hybrid-natural-past, hybrid-natural-future, and 3D SI-PCA) based on validation performance. We next quantitatively compared the different models using the test data (Fig. 5a,b; Suppl. Fig. S3c). We found that the 3D SI-PCA model outperformed the 3D SI model, presumably because the former profited from the low-pass filtering of the PCA transformation. Importantly, both hybrid models displayed a better performance than the 3D SI-PCA model. While the hybrid-natural-past model performed slightly better than its hybrid-natural-future counterpart, this difference was not statistically significant. In summary, both the past encoding and future prediction strategy in the EC branch turned out to be equally beneficial for the neural encoding task and, as before, the benefit extended beyond low-pass filtering effects. However, no performance increase was achieved with respect to the 2D hybrid-natural model (Fig. 5b vs. Fig. 3d).
We also analyzed the shared spatio-temporal filters using the same metric as for the 2D case, which assesses the similarity between spatial filters (after performing a low-rank decomposition of 3D shared filters into spatial and temporal components; see Methods) and smooth 2D Gaussians (Fig. 5c,d). Again, we found higher R-squared values for the hybrid models and the 3D SI-PCA model compared to the baseline SI case. Note that here, the 3D SI-PCA model did not significantly differ from the two hybrid models, possibly due to a large number of bases (n = 128 vs. n = 16 in the 2D case).
Next, we asked if the fact that we did not see a significant advantage of 3D over 2D could be due to the relatively slow (5 Hz) noise stimulus, which may not sufficiently drive the temporal response properties of the GCL cells. Therefore, we recorded a new dataset (n = 64 cells) in which we presented a 30-Hz dense noise stimulus and used it with the 3D hybrid models. Like for 5-Hz noise, hybrid-natural-past and hybrid-natural-future models performed similarly on the validation data, with a peak in performance at around w = 0.7 (Suppl. Fig. S4a), as well as on the test data, where they were significantly better than the 3D SI model (Suppl. Fig. S4b). Moreover, both 3D hybrid models learned shared filters with similar R-squared values, which were significantly higher than that of the 3D SI model (Suppl. Fig. S4c). But again, the 3D models performed only on par with the 2D models.
In summary, the hybrid-natural models achieved a higher performance for different noise stimuli (5-Hz vs. 30-Hz) and different shared filter organizations (2D vs. 3D) than all other tested models. Therefore, it is likely that their superior predictive performance for neuronal responses and their more biologically plausible filters resulted from the EC branch having access to natural statistics.
Direction-selective neurons benefit more than others from hybrid models
The retina encodes the visual scene in a number of features that are represented by the more than 40 different types of RGC whose outputs are relayed in parallel to higher visual centers in the brain (42, 55–58). Thus, we next asked whether access to natural statistics allows our hybrid models to predict some cell types better than others (Fig. 6). Earlier, it has been shown that motion-relevant properties emerge in the efficient coding framework for both past encoding and future prediction approaches (54). Therefore, we employed our 3D hybrid models (cf. Fig. 5) and focused on direction-selective (DS) cells (42, 59).
For this analysis, we used a set of n=427 GCL neurons, whose responses were recorded not only to the 5-Hz noise stimulus (for training the models) but also to full-field chirp and moving bar stimuli. The latter two stimuli (Fig. 6a) enabled us to identify the functional type of each recorded GCL neuron (42) using a cell type classifier (see Methods; Suppl. Fig. S5).
To explore cell type-specific effects, we chose a dataset size (30% of total recording time) for which the synergy between neural SI and EC was particularly pronounced. As expected, we found that both hybrid networks (hybrid-natural-past and hybrid-natural-future) performed significantly better than the SI model, with no significant difference between the two hybrid models (cf. Fig. 5b, Suppl. Fig. S4b).
First, we evaluated if any of the broader functional groups of GCL cells profited more from natural statistics than others. For this, we sorted the cells into 6 groups based on their response polarity (ON vs. OFF) and transience, and based on whether they were RGCs or dACs (for group sizes, see Fig. 6 legend). For all 6 groups, the hybrid models showed a better predictive performance than the SI model (Fig. 6b). However, no significant differences were observed between any pair of groups (p>0.05 for all pairwise comparisons, two-sided permutation test, n=10,000 repeats; Fig. 6c) and the two hybrid models (p>0.05 for all pairwise comparisons; Suppl. Fig. S6a).
Next, we grouped the cells into DS (p<0.05, direction tuning using a permutation test; n=90) and non-DS cells (n=300) based on their moving bar responses (Fig. 6a right). Note that n=37 neurons were excluded as they did not pass the quality check for chirp and moving-bar responses (Methods). We found that the predictive performance for DS cells was significantly higher than that of the non-DS cells for both hybrid-natural-past (Fig. 6d,e; p=0.0027) and hybrid-natural-future (Suppl. Fig. S6b,c; p=0.0042). To test whether this performance difference was merely due to different signal-to-noise ratios in DS vs. non-DS cells, we compared their response quality indices (QI; Methods). While DS cells had significantly higher QI values for moving-bar responses (QIbar) than non-DS cells, we did not find any significant difference between the two groups with respect to their noise (QInoise) or chirp responses (QIchirp; Suppl. Fig. S6e-g). These results suggest that DS cells benefit more from the EC branch of the hybrid models than non-DS cells, partially consistent with earlier findings ((54); see also Discussion).
In summary, efficient coding of natural statistics served as a beneficial normative regularization for all types of mouse GCL cells and in particular DS cells, suggesting the potential role of motion statistics in the natural environment on shaping neuronal response properties.
Discussion
In this study, we asked if access to natural scene statistics can help predicting neural responses. To address this question, we combined system identification (SI, (3)) and efficient encoding (EC, (25)) methods into a normatively regularized (hybrid) modeling framework. Specifically, we used models that efficiently represent natural scenes recorded in the mouse’s habitat to regularize models that predict retinal responses to visual stimuli. We analyzed such hybrid models with shared spatial filters, and found that natural images as input to the EC branch indeed improved the performance in predicting retinal responses and allowed the model to generate filters that resembled RFs found in the early visual system. These improvements extend beyond those gained by simple low-pass filtering or using second-order statistics of the natural scenes. Our hybrid models with shared spatio-temporal filters performed similarly well as those with shared spatial filters, independently of whether they used a past encoding or a future prediction strategy. Notably, predictions for DS cells in the mouse retina improved the most in the hybrid models with natural input. In summary, our results suggest that sourcing information about an animal’s environment — e.g., through hybrid SI-EC models — helps build more predictive and biologically plausible models of neuronal networks. More generally, our findings lend support to the idea that knowledge of natural statistics is already encoded in sensory circuits.
Hybrid models improve data efficiency
The difference in predictive performance between the hybrid and the baseline SI model was significant and it depended on the amount of available data, indicating that our hybrid modeling approach increased data efficiency. We note that both the stimulus (dense noise) and the neural model system (retinal neurons) present much easier SI problems than, for instance, predicting more nonlinear neural responses to natural stimuli (18, 60). For those more challenging problems at downstream visual areas, where neural response functions and, hence, the neural prediction tasks, become more complex (61), the data efficiency of a hybrid approach and the improvement from natural scene statistics may be even higher.
Biological plausibility and temporal coding principles in hybrid models
The biological plausibility of most learned models was positively correlated with their predictive performance, except for some indeterminacy in the SI-DCT models, suggesting that more biologically plausible filters increased performance. Note that we used the filters’ similarity to smooth 2D Gaussian functions as a measure of biological plausibility, following the assumption that RFs in the retina (and at early downstream stages of the visual system) often feature a smooth, Gaussian-like structure (42, 47, 48). However, a deep, systematic understanding of artificial and neuronal networks and their hidden representations likely calls for other methods besides filter inspection (discussed in (62)).
As the natural environment is not static, we also created hybrid models that acknowledge the time domain by sharing spatio-temporal filters. Surprisingly, both variants — past encoding and future prediction — behaved quite similarly. However, in the stand-alone EC models (that is, only the respective EC branch), the temporal components of the filters learned by future prediction were much more diverse than those of past encoding (Suppl. Fig. S2c,d right). Interestingly, the differences between the temporal filters of these stand-alone EC models decreased with the incorporation of the neural prediction task in the hybrid models.
The filter diversity in our 3D hybrid models is reminiscent of earlier findings by Chalk and colleagues (2018), who reported the emergence of filters sensitive to motion direction and motion speed in their past encoding and future prediction EC models, respectively. However, in contrast to their results, we did not see a difference between our hybrid-past and hybrid-future models with respect to motion-sensitive filters: both of them performed better in predicting responses of DS vs. non-DS cells. Further work is needed to understand this partial (mis)match between our work and that by Chalk et al., and why specifically DS cells profited from both of our 3D hybrid models.
Hybrid models of retinal signal processing
It has been suggested that natural stimuli drive more diverse neural responses, and more complex feature transformations are required to determine the respective stimulus-response functions ((18, 63), but also see (64)). Therefore, one future direction may be to record retinal activity while presenting natural movies (e.g., from (38)) and use it as input for the SI branch of the hybrid model. Finding a more pronounced performance improvement compared to the baseline SI model would support the notion that the noise stimulus we used in this study may have indeed limited the benefits from the EC branch (see above). Neural responses to natural stimuli would also allow us to revisit our hybrid models with respect to the prediction of motion sensitive cells and the differences between our results and those from earlier work ((54); see above). Furthermore, such data may be useful for characterizing model generalization (domain transfer, see e.g., (60, 63)) by using responses to natural stimuli as unseen test data with a hybrid model trained with cell responses to noise stimuli.
For our current analysis, we used broad group assignments (e.g., FastON RGCs), which include several functional types of RGC (e.g., ON-step, ON-transient, ON-high-frequency etc; (42)) or dACs, but did not detect any differences in performance gain except for the DS neurons. Still, it is possible that distinct types of RGC profit more than others from the EC branch of our hybrid models. For example, the so-called W3 RGCs, for which the best stimulus found so far is a small dark moving spot (65), may not be “designed” to efficiently represent natural stimuli but rather to extract survival-relevant features (i.e., detecting aerial predators). Here, we could build models with different normative regularization or tasks (i.e., detecting predators in images of the sky) and would expect that this RGC type profits little from efficiently encoding natural statistics in the hybrid model. Studying coding strategies across RGC types could contribute an important biological perspective to the perennial debate between efficient coding (66) and feature detection (55) proponents.
Normative network regularization as a new framework for studying neural coding
In this study, we regularized the filters of a SI model with a normative EC model to predict visually-evoked responses of cells in the retina. Some forms of biology-inspired regularization have already been used in earlier work, for instance, in visual cortex, where sparse coding with lateral inhibition was employed to simulate neuronal activation (67), and higher-order statistics between neurons were incorporated for spike prediction (68). However, these approaches did not leverage natural scene statistics for neural prediction. We expect that our hybrid modeling strategy may also work for different processing stages along the (early) visual pathway (and potentially other modalities, e.g., sound).
This said, however, one needs to keep in mind that different stages along the visual pathway have different tasks and constraints, and, thus, likely incorporate different efficient coding principles: For instance, the retinal hardware is space-limited and has to encode visual features in view of a bottleneck with limited bandwidth (optic nerve), whereas the primary visual cortex has comparably abundant resources which might serve for accurate probability estimation for behavioral tasks, such as novelty detection (discussed in (24, 69)). It is also worth noting that different visual processing stages (such as primary visual cortex vs. higher visual areas, or adaptation of visual coding to different behavioral states) may benefit from the hybrid modeling to a different degree, as efficient coding approaches learn filters that may be more relevant to stimulus-related features, but not high-level behavioral goals (see discussion in (15)). Additionally, it would be interesting to compare our hybrid models with SI models regularized with other behavioral tasks such as object recognition (e.g., (11)) or predator detection (see above) for neural predictions along the ventral visual stream.
There is a long tradition of using SI models (reviewed in (3)) in predicting the responses of neurons to a great variety of stimuli (e.g., (2, 4, 18, 19, 70, 71)). Our results demonstrate how the EC hypothesis can be successfully leveraged as normative regularization for the identification of neural response properties. More generally, using EC as a flexible tool to impose regularization on modeling, the hybrid framework offers an opportunity to test different coding principles and unsupervised learning objectives with regards to experimental data for understanding neuronal processing.
Materials and Methods
Animal procedures and retinal activity recordings
Animal procedures
All animal procedures were performed in accordance with the law governing animal protection issued by the German Federal Government (Tierschutzgesetz) and approved by the governmental review board (Regierungspräsidium Tübingen, Baden-Württemberg, Konrad-Adenauer-Str. 20, 72072 Tübingen, Germany). We used n=5 female C57BL/6 mice (5–9 weeks old; wild-type; JAX 000664, Jackson Laboratory, USA). Due to the exploratory nature of our study, we did not use any statistical methods to predetermine sample size, nor did we perform blinding or randomization.
Animals were housed under a standard light-dark (12h:12h) cycle. All procedures were carried out under very dim red illumination (>650 nm). Prior to the start of the experiment, animals were dark-adapted for ≥1 h, then anesthetized with isoflurane (Baxter, Germany), and killed by cervical dislocation.
The eyes were enucleated and hemisected in carboxygenated (95% O2, 5% CO2) artificial cerebrospinal fluid (ACSF) solution containing (in mM): 125 NaCl, 2.5 KCl, 2 CaCl2, 1 MgCl2, 1.25 NaH2PO4, 26 NaHCO3, 20 glucose, and 0.5 l-glutamine (pH 7.4). Next, the retina was flat-mounted onto an Anodisc (#13, 0.2 μm pore size, GE Healthcare, Germany) with the ganglion cell layer (GCL) facing up. To uniformly label the GCL cells, bulk electroporation was performed with the fluorescent Ca2+ indicator Oregon-Green BAPTA-1 (OGB-1; Invitrogen, Germany), as described earlier (43, 72), using 4-mm plate electrodes (CUY700P4E/L, Xceltis, Germany) and 9 pulses (~9.2 V, 100 ms pulse width at 1 Hz). After electroporation, the tissue was immediately moved to the microscope’s recording chamber, where it was continuously perfused with carboxygenated ACSF at ~36°C and left to recover for ~30 min before recordings started. Additionally, Sulforhodamine-101 (SR101, Invitrogen, Germany) was added to the ACSF (~0.1 μM final concentration) to visualize blood vessels and identify damaged cells.
Two-photon Ca2+ recordings and light stimulation
We recorded light stimulus-evoked Ca2+ signals in GCL cells of the explanted mouse retina using a MOM-type two-photon (2P) microscope (73, 74) from Sutter Instruments (purchased from Science Products, Germany), as described earlier (42, 43). In brief, the microscope was powered by a mode-locked Ti:Sapphire laser (MaiTai-HP DeepSee, Newport Spectra-Physics, Germany) at 927 nm. Two detection pathways allowed simultaneous recording of OGB-1 and SR101 fluorescence (HQ 510/84 and HQ 630/60, respectively; both Chroma/AHF, Germany) through a 16x water immersion objective (CFI75 LWD×16/0.8W, DIC N2, Nikon, Germany). Custom-written software (ScanM, by M. Müller and T.E.) running under IGOR Pro 6.3 for Windows (Wavemetrics, USA) was used to acquire time-lapsed (64×64 pixels) image scans at a frame rate of 7.8125 Hz. Higher-resolution images were acquired using 512×512-pixel scans. Additionally, to register the scan field positions, the outline of the retina and the optic disc were traced.
The retinas were presented with a color noise stimulus using a visual stimulator tuned to the spectral sensitivities of mice (75). This stimulus consisted of independent binary dense noise (28×28 pixel frames, each pixel covering (0.83°)2 of visual angle) in the UV and green stimulator channels, updated at 5 or 30 Hz. The stimulus contained 5 different training sequences (96 s each) interspersed with 6 repeats of a 10-s test sequence (Suppl. Fig. S1a).
In total, we used three data sets for modeling: (i) responses of n=96 GCL neurons to 5-Hz noise recorded in dorsal retina (n=2 eyes); (ii) responses of n=427 GCL neurons to 5-Hz noise recorded ventrally (n=5 eyes); in this dataset, we also presented two other stimuli: a full-field chirp (700 μm in diameter) and a moving bar stimulus (300×1,000 μm bright bar moving in 8 directions at 1 mm/s). The responses to these latter stimuli were used to functionally classify the recorded GCL neurons (42). (iii) Responses of n=64 GCL neurons to 30-Hz noise recorded ventrally (n=2 eyes). Note that all cell numbers are after quality control (see below).
Data preprocessing and analysis
For each cell, we calculated a quality index (QI, with 0 ≤ QI ≤ 1) for its responses to each stimulus type as follows:

QI = Var[⟨C⟩r]t / ⟨Var[C]t⟩r

where C is a t-by-r response matrix (time samples, t, by repetitions, r), and ⟨·⟩x and Var[·]x denote the mean and variance across the indicated dimension, respectively. The higher the QI, the more reliable the response and the higher the signal-to-noise ratio. For the noise stimulus, QInoise was determined based on the test sequence responses. For the following analysis, we only used cells with QInoise > 0.25; in case chirp and moving bar responses were also recorded, neurons had to fulfill QIchirp > 0.35 or QIbar > 0.6 to be included.
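In code, this index can be computed directly from the response matrix (a minimal numpy sketch):

```python
import numpy as np

def quality_index(C: np.ndarray) -> float:
    """Response quality index for a (time x repetitions) matrix C:
    variance over time of the trial-averaged response, divided by the
    mean across repetitions of the per-trial variance over time."""
    var_of_mean = C.mean(axis=1).var()     # signal variance
    mean_of_vars = C.var(axis=0).mean()    # average single-trial variance
    return var_of_mean / mean_of_vars
```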
For the noise stimulus, we preprocessed each cell’s Ca2+ signal by Z-normalizing the raw traces and matching the sampling frequency of the recording (7.8125 Hz) to the stimulus frequency (5 or 30 Hz) via linear interpolation. Then, the traces were detrended using a high-pass filter (> 0.1 Hz) and their 1st order derivatives were calculated, with negative values set to zero. We used the average of a cell’s responses to the 6 test sequence repeats as ground truth. Excluding the test sequences, we had per cell a total of 480 s of data, of which we used 440 s (~91%) for training and the remaining 40 s (~9%) for validation (i.e., to pick the hyperparameters of the SI model, see below).
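A sketch of this preprocessing pipeline; the filter order and design are assumptions, since the text only specifies the 0.1-Hz cutoff:

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import butter, filtfilt

def preprocess_trace(trace, fs_rec=7.8125, fs_stim=5.0):
    """Z-normalize, resample to the stimulus rate, high-pass (>0.1 Hz) detrend,
    and rectify the first derivative (negative values set to zero)."""
    z = (trace - trace.mean()) / trace.std()
    t_rec = np.arange(len(z)) / fs_rec
    t_stim = np.arange(0.0, t_rec[-1], 1.0 / fs_stim)
    resampled = interp1d(t_rec, z)(t_stim)              # linear interpolation
    b, a = butter(2, 0.1, btype="high", fs=fs_stim)     # assumed 2nd-order filter
    detrended = filtfilt(b, a, resampled)
    deriv = np.diff(detrended, prepend=detrended[0])    # 1st-order derivative
    return np.clip(deriv, 0.0, None)
```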
For chirp and moving bar responses, we first detrended the traces and then normalized them to [0, 1] (43). Using these responses, the cells were assigned to different functional groups (42) using an RGC type classifier (see below).
To estimate the directional tuning from the moving bar responses, we first performed singular value decomposition (SVD) on the mean response matrix, resulting in a temporal and a directional component. We then summed the directional vectors in the 2D plane and used the resulting vector length as the direction-selectivity index. Next, by shuffling trial labels and computing the tuning curve 1,000 times (permutation test), we obtained the null distribution (no directional tuning). The percentile of the true vector length within this null distribution was used as the p-value of directional tuning (42). Here, we considered cells with p < 0.05 as direction-selective (DS) and the remaining ones as non-DS.
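A simplified sketch of this permutation test, here computed on scalar response amplitudes; the original analysis derives the tuning curve via SVD first:

```python
import numpy as np

def ds_p_value(responses, directions, n_perm=1000, seed=0):
    """Permutation test for direction tuning.
    responses: (n_trials,) response amplitudes
    directions: (n_trials,) stimulus directions in radians
    Returns the fraction of shuffles whose vector-sum length
    exceeds the true one."""
    rng = np.random.default_rng(seed)

    def vec_len(r, d):
        return np.abs(np.sum(r * np.exp(1j * d)))   # length of the vector sum

    true_len = vec_len(responses, directions)
    null = [vec_len(responses, rng.permutation(directions)) for _ in range(n_perm)]
    return float(np.mean(np.asarray(null) >= true_len))
```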
RGC type classifier
To predict the functional type of GCL cells, we used a Random Forest Classifier (RFC; (76)), which was trained on a published mouse dataset (42). In that study, features were extracted from the responses to different visual stimuli (e.g., chirp and moving bar) and used to cluster GCL cells into 32 RGC types and 14 additional dAC types. Here, we learned a mapping f from response features (20 features from the chirp responses, ϕchirp, and 8 features from the moving-bar responses, ϕmb) and two additional parameters Θ = {θsoma, θDS} to functional cell type labels L by training a RFC on the dataset from (42):

f : (ϕchirp, ϕmb, Θ) → L

where θsoma denotes soma size, used to distinguish between alpha and non-alpha RGC types, and θDS denotes the p-value of the permutation test for direction selectivity, used to distinguish between DS and non-DS RGC types.
We fit the RFC on a subset of the data from (42); on a held-out test dataset, the classifier reached a prediction accuracy of ~76% (Suppl. Fig. S5). To apply the trained classifier to our newly recorded dataset, we projected the RGC responses (normalized to [−1, 1]) into the feature space described in (42) by computing the dot product between the response and the feature matrices. We used the RFC implementation provided by the python package scikit-learn (77) to train the classifier.
2D models
Stand-alone SI model (2D)
As baseline model to predict the responses of neurons to the noise stimulus, we employed a stand-alone SI model (supervised learning), in which we used factorized spatial and temporal convolutional filters (Fig. 2a; (78, 79)). This SI model consisted of one spatial convolutional layer (16×2×1×9×9, output channels x input channels x depth x image width x image height), one temporal convolutional layer (16×16×8×1×1, with 8 stimulus frames preceding an event), and — after flattening the spatial dimension — one fully connected layer (FC; 96×6,400, output channels x input channels), followed by an exponential function. No padding was used. The loss function was defined as:

LSI = Σi (r̂i − ri · ln r̂i) + α1 ||Wspace||² + α2 ||Wtime||² + β ||WFC||1
Here, the first term is the Poisson loss between predicted responses and ground truth (with i denoting the neuron index), the second term is the L2 penalty on the weights of the spatial convolutional filters with hyperparameter α1, the third term is the L2 penalty on the weights of the temporal convolutional filters with hyperparameter α2, and the last term is the L1 penalty on the FC layer with hyperparameter β.
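A minimal PyTorch sketch of this objective; parameter names and the small constant inside the logarithm are illustrative assumptions:

```python
import torch

def si_loss(pred, target, w_space, w_time, w_fc, a1=10.0, a2=10.0, b=1/16):
    """SI objective: Poisson loss plus the three penalties described above.
    pred/target: (batch x neurons) predicted and recorded responses."""
    poisson = (pred - target * torch.log(pred + 1e-8)).sum(dim=1).mean()
    return (poisson
            + a1 * w_space.pow(2).sum()   # L2, spatial conv filters
            + a2 * w_time.pow(2).sum()    # L2, temporal conv filters
            + b * w_fc.abs().sum())       # L1, fully connected readout
```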
After performing a grid search for the three hyperparameters, we picked α1 = 10, α2 = 10, β = 1/16, which yielded the best performance on the validation data. After training, we estimated the neurons’ spatio-temporal RF filters by computing gradients for each neuron, starting with a blank image sequence as input. These gradients represent the first-order approximation of the input that maximizes the neuron’s activation (6). For visualization, we extracted the spatial and temporal RFs via SVD.
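A sketch of this gradient-based RF estimate, assuming `model` maps a stimulus tensor to a (batch x neurons) response tensor:

```python
import torch

def estimate_rf(model, neuron_idx, input_shape=(1, 2, 8, 28, 28)):
    """Gradient at a blank stimulus = first-order approximation of the RF."""
    x = torch.zeros(input_shape, requires_grad=True)   # blank image sequence
    model(x)[0, neuron_idx].backward()                 # d(response)/d(stimulus)
    return x.grad[0].detach()
```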
As a metric of biological plausibility, we calculated the coefficient of determination (R-squared; [0, 1]) of fitting 2D Gaussian distributions to the spatial component of the convolutional filters. We set the R-squared value to 0 if the sigma of the fitted Gaussian was larger than the size of the filter (i.e., 9 pixels). We calculated this fit quality for the filter of the chromatic channel with the dominant response. Because the mouse retina is divided into a more green-sensitive dorsal and a more UV-sensitive ventral retina (e.g., (43)), this meant that for dorsal neurons we only determined the R-squared for filters of the green stimulus channel, and for ventral neurons of the UV stimulus channel.
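A sketch of this fit-quality measure; the initialization values and failure handling are assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss2d(xy, amp, x0, y0, sx, sy, offset):
    x, y = xy
    return (amp * np.exp(-((x - x0) ** 2 / (2 * sx ** 2)
                           + (y - y0) ** 2 / (2 * sy ** 2))) + offset).ravel()

def gaussian_r_squared(filt):
    """R-squared of a 2D Gaussian fit to a (9 x 9) spatial filter;
    returns 0 if the fit fails or the fitted sigma exceeds the filter size."""
    n = filt.shape[0]
    xy = np.meshgrid(np.arange(n), np.arange(n))
    p0 = [filt.max(), n / 2, n / 2, 2.0, 2.0, 0.0]   # assumed initialization
    try:
        popt, _ = curve_fit(gauss2d, xy, filt.ravel(), p0=p0, maxfev=5000)
    except RuntimeError:
        return 0.0
    if max(abs(popt[3]), abs(popt[4])) > n:          # sigma larger than filter
        return 0.0
    resid = filt.ravel() - gauss2d(xy, *popt)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((filt.ravel() - filt.mean()) ** 2)
    return max(0.0, 1.0 - ss_res / ss_tot)
```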
SI-PCA model (2D)
The spatial convolutional filters of the SI-PCA model were composed from PCA basis functions (W). The model was trained to learn the weights of these basis functions. The bases were produced by performing a PCA transformation on natural images recorded in mouse habitats (38):

Cov(X) = U Λ U⊤, W = [u1, …, uk]

where X is the matrix of centered image patches, U contains the eigenvectors of the covariance matrix of the centered data in each column, and the first k eigenvectors u1, …, uk served as the basis functions W.
For example, when using 4 PCA bases, the shape of the learnable weight matrix was 16×4 (channel number x basis number), the shape of the PCA bases was 4×2×1×9×9 (basis number x chromatic channel x depth x image width x image height), and the resulting spatial filter had the shape 16×2×1×9×9. We varied the number of bases used (hyperparameter) and selected the one which achieved the best performance on the validation data (Suppl. Fig. S1b; Suppl. Fig. S3b).
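A sketch of how such filters can be composed from PCA bases; random data stand in for the natural-image patches:

```python
import numpy as np

def pca_bases(patches, k):
    """Top-k PCA bases from image patches (n_patches x n_pixels)."""
    centered = patches - patches.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # sort by descending eigenvalue
    return eigvecs[:, order[:k]].T             # (k x n_pixels)

# Hypothetical shapes as in the text: 16 channels composed from 4 bases.
rng = np.random.default_rng(0)
patches = rng.standard_normal((1000, 2 * 9 * 9))   # stand-in for image patches
W = pca_bases(patches, k=4)                         # 4 x 162 basis matrix
A = rng.standard_normal((16, 4))                    # learnable weights (16 x 4)
filters = (A @ W).reshape(16, 2, 9, 9)              # 16 x 2 x 9 x 9 spatial filters
```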
SI-DCT model (2D)
For the SI-DCT model, its spatial convolutional filters were composed from DCT basis functions, which were defined as:

B(u,v)(i, j) = cos[π/N · (i + 1/2) · u] · cos[π/N · (j + 1/2) · v]

where i and j denote the pixel indices of the input image (size (N, N)), and u and v denote the DCT coefficient indices of the DCT filter. Here, we employed DCT basis functions for one-channel gray images and thus used different bases for each chromatic channel. For example, when using 4 DCT bases, the shape of the learnable weight matrix was 16×4×2 (channel number x basis number x chromatic channel), the shape of the basis functions was 4×1×9×9 (basis number x depth x image width x image height), and the resulting spatial filter had the shape 16×2×1×9×9. Like for SI-PCA, we varied the number of bases used and picked the one which achieved the best performance on the validation data (Suppl. Fig. S1b).
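A sketch of generating these DCT bases with a simple zig-zag ordering:

```python
import numpy as np

def dct_bases(N=9, k=4):
    """First k 2D DCT-II basis functions (zig-zag order) for N x N images."""
    def basis(u, v):
        i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
        return (np.cos(np.pi / N * (i + 0.5) * u)
                * np.cos(np.pi / N * (j + 0.5) * v))
    # zig-zag ordering: sort (u, v) index pairs by their sum
    order = sorted(((u, v) for u in range(N) for v in range(N)),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    return np.stack([basis(u, v) for u, v in order[:k]])   # (k x N x N)
```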
Stand-alone EC model (2D)
We used a similar EC model architecture (convolutional autoencoder) and loss function as in (38). The model’s encoder contained a single convolutional layer (with weights denoted Wspace) followed by a rectified linear unit (ReLU) function, one fully-connected (FC) layer, and another ReLU function. The decoder contained one FC layer, one ReLU function, a single deconvolutional layer (with weights denoted Wdeconv), and a hyperbolic tangent (tanh) function to map back to the original data range ([−1, 1]).
As a measure of reconstruction quality, we used the mean squared error (MSE; (37, 38)). Gaussian noise was added to the encoder output for redundancy reduction (37, 80, 81), and an L1 penalty (hyperparameter β2) was imposed on its activation for sparse readouts (37, 80, 82). We also applied L2 regularization (hyperparameter α3) on the convolutional and deconvolutional layers to encourage the learning of smooth filters (41, 83, 84). We used 16 9×9 convolutional and deconvolutional filters. The activation tensor (16×28×28, output channel x image width x image height) following the first convolutional layer was flattened to a one-dimensional vector with 12,544 inputs before feeding into the FC layer. The loss function for the EC model was:

LEC = 1/N Σi ||x̂i − xi||² + α3 (||Wspace||² + ||Wdeconv||²) + β2 Σi ||hi||1

where the first term is the MSE between prediction x̂i and ground truth xi with image index i, and the next two terms denote the L2 and L1 penalties, with hi the encoder activations.
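A minimal PyTorch sketch of this architecture; the layer shapes follow the text, while the latent size and noise standard deviation are assumptions:

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """Convolutional autoencoder sketch for 2-channel 28x28 inputs."""
    def __init__(self, n_latent=256, noise_std=0.1):
        super().__init__()
        self.conv = nn.Conv2d(2, 16, kernel_size=9, padding=4)   # -> 16 x 28 x 28
        self.enc_fc = nn.Linear(16 * 28 * 28, n_latent)          # 12,544 inputs
        self.dec_fc = nn.Linear(n_latent, 16 * 28 * 28)
        self.deconv = nn.ConvTranspose2d(16, 2, kernel_size=9, padding=4)
        self.noise_std = noise_std

    def forward(self, x):
        h = torch.relu(self.enc_fc(torch.relu(self.conv(x)).flatten(1)))
        if self.training:                                # bandwidth-limiting noise
            h = h + self.noise_std * torch.randn_like(h)
        z = torch.relu(self.dec_fc(h)).view(-1, 16, 28, 28)
        return torch.tanh(self.deconv(z)), h             # reconstruction + code
```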
Hybrid model (2D)
The hybrid (semi-supervised) model consisted of a SI and an EC branch (for details on the two models’ architectures, see above). These branches were trained simultaneously, sharing the spatial convolutional filters (Wspace). The total loss function of the hybrid model was derived from the loss functions of the two branches as follows:

LHybrid = w · LSI + (1 − w) · LEC, with

LSI = 1/N1 Σi (r̂i − ri · ln r̂i) + α1 ||Wspace||² + α2/w ||Wtime||² + β1/w ||WFC||1
LEC = 1/N2 Σj ||x̂j − xj||² + α3/(1 − w) (||Wspace||² + ||Wdeconv||²) + β2/(1 − w) Σj ||hj||1
Here, i and j denote neuron and image index, respectively; N1 and N2 the number of neurons and images, respectively. The weight (w, with 0 ≤ w ≤ 1) controlled the impact of each branch’s loss function on the shared spatial filters. In practice, to avoid division by zero, we used w = 10⁻⁸ instead of w = 0 and w = 1 − 10⁻⁸ instead of w = 1. Note that we added w to the denominator of the last two terms of LSI to maintain the same regularization for Wtime and WFC as in a stand-alone SI model when varying w. For LEC, similar to LSI, we added (1 − w) to the denominator of the last two terms to keep the same regularization for the (de)convolutional weights and the latent activations as in a stand-alone EC model when varying w. We used different data to train the EC branch of the hybrid model: natural images, phase-scrambled natural images, and noise. All hybrid models were trained for a maximum of 100 epochs (Suppl. Fig. S1c,d); training was stopped early when the validation loss started increasing.
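A minimal sketch of this weighted objective, following the reconstruction above; tensor names, shapes, and the small constant inside the logarithm are illustrative assumptions:

```python
import torch

def hybrid_loss_2d(pred, rates, recon, images, code, params, w,
                   a1=10.0, a2=10.0, b1=1/16, a3=1e3, b2=1/16):
    """Weighted 2D hybrid objective; `params` holds the weight tensors,
    and w is clipped away from 0 and 1 as described in the text."""
    w = min(max(w, 1e-8), 1.0 - 1e-8)
    l_si = ((pred - rates * torch.log(pred + 1e-8)).mean()
            + a1 * params["w_space"].pow(2).sum()
            + (a2 / w) * params["w_time"].pow(2).sum()
            + (b1 / w) * params["w_fc"].abs().sum())
    l_ec = (((recon - images) ** 2).mean()
            + (a3 / (1 - w)) * (params["w_space"].pow(2).sum()
                                + params["w_deconv"].pow(2).sum())
            + (b2 / (1 - w)) * code.abs().mean())
    return w * l_si + (1 - w) * l_ec
```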
Tuning all hyperparameters jointly in a grid search was computationally prohibitive. Hence, for the SI branch, we varied the hyperparameters around those determined for the stand-alone configuration (α1 = 10, α2 = 10, β1 = 1/16; see above), while for the EC branch, we varied the hyperparameters systematically around the values (α3 = 10³, β2 = 1/16) used in (38). To tune w, we performed a linear search after normalizing the loss functions (using N1 and N2).
After training the hybrid model, we estimated the spatiotemporal RFs of all neurons using a gradient ascent algorithm (6). We visualized the spatial and temporal component of RFs using SVD (cf. Fig. 3b), and the magnitude of the RF was indicated in the spatial component.
We trained the 2D models using all training data (440 s) with a learning rate of μ = 10⁻⁴. When less data were used (i.e., to evaluate data efficiency), we kept all hyperparameters the same as for the full-data case but doubled the learning rate; otherwise, the stand-alone SI model and the hybrid model did not reach the minimum of the validation loss within 100 epochs.
3D models
Stand-alone SI model (3D)
The 3D SI model consisted of one spatio-temporal convolutional layer (16×2×8×9×9, output channels x input channels x depth x image width x image height; depth varied with the frequency of the noise stimuli, n=8 and n=30 for 5-Hz and 30-Hz noise, respectively), and — after flattening all dimensions — one FC layer (96×6,400, output channels x input channels; the number of output channels varied with the number of cells, n=96, 64 or 427 for the different data sets; see above), followed by an exponential function. No padding was used. The loss function was defined as:

LSI = Σi (r̂i − ri · ln r̂i) + α ||Wst||² + β ||WFC||1
This loss differs from the 2D SI loss above with respect to the L2 penalty, which is here on the weights of the spatio-temporal convolutional filters (Wst) with hyperparameter α in the second term. After performing a grid search for the two hyperparameters, we picked α = 100, β = 1/4, which yielded the best performance on the validation data. After training, we estimated the cells’ spatio-temporal RFs and extracted their spatial and temporal components via SVD for visualization.
SI-PCA model (3D)
For the 3D SI-PCA models, we applied the PCA decomposition described above to movie clips (2×8×9×9, chromatic channel x depth x image width x image height; depth varied with the frequency of the noise stimuli, n=8 and n=30 for 5-Hz and 30-Hz noise, respectively). Like for the 2D SI-PCA models, we varied the number of bases used and picked the number for which the model achieved the best performance on the validation data (Suppl. Fig. S3a).
Stand-alone EC model (3D)
The 3D EC models used a sequence of frames from a movie clip as input and featured 3D spatio-temporal convolutional layers (with weights denoted Wst) in the encoder. The decoder contained deconvolutional layers with weights Wdeconv. In the past-encoding case, we fed an 8-frame clip (frames at t − 7 to t) to the model and aimed at reconstructing the 7th frame (at t − 1). In the future-prediction case, the goal was to predict the 8th frame (at t) with the input being the first 7 frames (t − 7 to t − 1) of the clip. The loss function was similar to the 2D EC loss above, except that (i) Wst features a different shape (16×2×8×9×9, output channel x chromatic channel x filter depth x filter width x filter height), and (ii) xi denotes the 7th frame for the past encoding and the 8th frame for the future prediction model (Suppl. Fig. S2b,c,d).
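A sketch of the input/target frame selection for the two strategies, assuming a clip tensor laid out as batch x channels x time x height x width:

```python
import torch

def ec_frames(clip: torch.Tensor, mode: str = "past"):
    """Select EC input and target frames from an 8-frame clip."""
    if mode == "past":       # reconstruct the 7th frame (t - 1) from all 8 frames
        return clip, clip[:, :, 6]
    if mode == "future":     # predict the unseen 8th frame (t) from the first 7
        return clip[:, :, :7], clip[:, :, 7]
    raise ValueError(f"unknown mode: {mode}")
```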
Hybrid model (3D)
The 3D hybrid models consisted of a SI branch and an EC branch with shared spatio-temporal convolutional filters (Wst; see above). Like for the 2D hybrid models, the total loss function was a weighted sum of the losses of the two branches as follows:

LHybrid = w · LSI + (1 − w) · LEC, with

LSI = 1/N1 Σi (r̂i − ri · ln r̂i) + α1 ||Wst||² + β1/w ||WFC||1
LEC = 1/N2 Σj ||x̂j − xj||² + α2/(1 − w) (||Wst||² + ||Wdeconv||²) + β2/(1 − w) Σj ||hj||1
Here, i denotes the neuron index, j the movie clip index, N1 the number of neurons, and N2 the number of movie clips. Again, instead of tuning all hyperparameters jointly via a grid search, we varied the hyperparameters around the values determined for the stand-alone SI configuration (α1 = 100, β1 = 1/4) for the SI branch. For the EC branch, we varied the hyperparameters systematically around the values (α2 = 10⁴, β2 = 1/16) used in the stand-alone EC models. We then tuned w linearly after normalizing the loss functions (using N1 and N2). We also visualized the spatial and temporal RF components using SVD (Fig. 5a, bottom).
Author Contributions
Conceptualization: Y.Q.; Methodology: Y.Q., D.K., K.S., D.G., L.H., T.S., and T.E.; Data acquisition & curation: K.S.; Formal analysis: Y.Q. with input from D.K., M.B., L.B., and T.E.; Investigation: Y.Q. with input from D.K., K.S., L.B., M.B., and T.E.; Writing – original Draft: Y.Q., D.K., L.B., and T.E.; Writing – review & editing: all authors; Visualization: Y.Q.; D.G. (confusion matrix); Software: Y.Q.; L.H. and D.G. (classifier); Resources: T.S. and T.E.; Supervision: M.B., L.B., and T.E.; Funding acquisition: L.B., M.B., and T.E.
Declaration of Interests
The authors declare no competing interests.
Data and Code Availability
Data and code will be made available upon publication.
Acknowledgments
We thank Dylan Paiton and Katrin Franke for helpful discussions, and Merle Harrer for excellent technical assistance. This work was supported by the German Research Foundation (DFG) through the Collaborative Research Centre CRC 1233 (project number 276693517; sub projects 10 and 12) and under Germany’s Excellence Strategy EXC 2064/1 (project number 390727645); the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant (agreement No 674901); the Max Planck Society (M.FE.A.KYBE0004); and the German Ministry of Education and Research (BMBF; FKZ: 01GQ1002), and the Tübingen AI Center (FKZ: 01IS18039A). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Bibliography