Abstract
Despite their immense success as a model of macaque visual cortex, deep convolutional neural networks (CNNs) have struggled to predict activity in visual cortex of the mouse, which is thought to be strongly dependent on the animal’s behavioral state. Furthermore, most computational models focus on predicting neural responses to static images presented under head fixation, which are dramatically different from the dynamic, continuous visual stimuli that arise during movement in the real world. Consequently, it is still unknown how natural visual input and different behavioral variables may integrate over time to generate responses in primary visual cortex (V1). To address this, we introduce a multimodal recurrent neural network that integrates gaze-contingent visual input with behavioral and temporal dynamics to explain V1 activity in freely moving mice. We show that the model achieves state-of-the-art predictions of V1 activity during free exploration and demonstrate the importance of each component in an extensive ablation study. Analyzing our model using maximally activating stimuli and saliency maps, we reveal new insights into cortical function, including the prevalence of mixed selectivity for behavioral variables in mouse V1. In summary, our model offers a comprehensive deep-learning framework for exploring the computational principles underlying V1 neurons in freely-moving animals engaged in natural behavior.
1 Introduction
Computational models have been crucial in providing insight into the underlying mechanisms by which neurons in the visual cortex respond to external stimuli. Deep convolutional neural networks (CNNs) have had immense success as predictive models of the primate ventral stream, in cases where the animal was passively viewing stimuli or simply maintaining fixation [1–5]. Despite their success, these CNNs are poor predictors of neural responses in mouse visual cortex [6], which is thought to be shallower and more parallel than that of primates [7, 8]. According to the best models in the literature [9–14], the mouse visual system is more broadly tuned and operates on relatively low-resolution inputs to support a variety of behaviors [15]. However, these models were limited to predicting neural responses to controlled (and potentially ethologically irrelevant) stimuli that were passively viewed by head-fixed animals.
Movement is a critical element of natural behavior. In the visual system, eye and head movements during locomotion and orienting transform the visual scene in potentially both beneficial (e.g., by providing additional visual cues) and detrimental ways (e.g., by introducing confounds due to self-movement) [16, 17]. Movement-related activity is widespread in mouse cortex [18, 19] and prevalent in primary visual cortex (V1) [20, 21]. For instance, V1 neurons of freely moving mice show robust responses to head and eye position [22, 23], which may contribute a multiplicative gain to the visual response [24] that cannot be replicated under head fixation. V1 activity may be further modulated by variables that depend on the state of the animal and its behavioral goals [19, 21, 25, 26]. However, how these behavioral variables may integrate to modulate visual responses in V1 is unknown. Furthermore, a comprehensive predictive model of V1 activity in freely moving animals is still lacking.
To address these challenges, we make the following contributions:
We introduce a multimodal recurrent neural network that integrates gaze-contingent visual input with behavioral and temporal dynamics to explain V1 activity during natural vision in freely moving mice.
We show that the model achieves state-of-the-art predictions of V1 activity during free exploration based on visual input and behavior, demonstrating the ability to accurately model neural responses in the dynamic regime of movement through the visual scene.
We uncover new insights into cortical neural coding by analyzing our model with maximally activating stimuli and saliency maps, and demonstrate that mixed selectivity of visual and behavioral variables is prevalent in mouse V1.
2 Related Work
Despite their success in predicting neural activity in the macaque visual cortex, deep CNNs trained on ImageNet have had limited success in predicting mouse visual cortical activity [6]. This is perhaps not surprising, as most ImageNet stimuli belong to static images of human-relevant semantic categories and may thus be of low ethological relevance for rodents. More importantly, these deep CNNs may not be the ideal architecture to model mouse visual cortex, which is known to be shallower and more parallel than primate visual cortex [27, 28]. In addition, mice are known to have lower visual acuity than primates [7, 8], and much of their visual processing may be devoted to active, movement-based behavior rather than passive analysis of the visual scene [21, 29, 30]. Although the majority of V1 neurons are believed to encode low-level visual features [31], their activity is often strongly modulated by behavioral variables related to eye and head position [22–24], locomotion [17, 20, 21], arousal [26, 32], and the recent history of the animal [25]. Furthermore, mouse V1 is highly interconnected with both cortical and subcortical brain areas, which contrasts with feedforward, hierarchical models of visual processing [21].
A common architectural approach that has proved quite successful is to split the network into different components (first introduced by [11]):
a “core” network, which typically consists of a CNN used to extract convolutional features from the visual stimulus [11, 12, 24, 33], sometimes in combination with a recurrent network [11];
a “shifter” network, which mimics gaze shifts by learning a (typically affine) transformation from head- to eye-centered coordinates, applied either to the pixel input [11, 24] or to a CNN layer [12] (see the sketch after this list);
a “readout” network, which learns a mapping from artificial to biological neurons [11, 12, 33].
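For concreteness, the following is a minimal PyTorch sketch of a shifter module that learns an affine warp of the pixel input from eye position. The layer sizes and the use of `affine_grid`/`grid_sample` are illustrative assumptions, not the implementation of any specific model cited above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Shifter(nn.Module):
    """Illustrative shifter: predicts a 2-D affine warp of the pixel input
    from eye position (theta, phi), initialized to the identity transform."""
    def __init__(self):
        super().__init__()
        self.affine = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 6))
        self.affine[-1].weight.data.zero_()
        self.affine[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, frames, eye_pos):
        # frames: (B, 1, H, W); eye_pos: (B, 2)
        theta = self.affine(eye_pos).view(-1, 2, 3)            # per-sample affine matrix
        grid = F.affine_grid(theta, frames.size(), align_corners=False)
        return F.grid_sample(frames, grid, align_corners=False)
```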
Owing to the difficulty of developing a predictive model of mouse cortex, Willeke et al. [14] recently invited submissions to the Sensorium competition held at NeurIPS ’22. The competition introduced a benchmark dataset of V1 neural activity recorded from head-fixed mice on a treadmill viewing static images, with simultaneous measurements of running speed, pupil size, and eye position. A baseline model was provided as well, which consisted of a 4-layer CNN core in combination with a shifter and readout network [12]. Even though 26 teams submitted 194 different models, the overall improvement over the baseline performance was modest, raising the single-trial correlation from .287 to .325 in the Sensorium and from .384 to .453 in the Sensorium+ competition. Architectural innovations (e.g., Transformers, Normalizing Flows, YOLO, and knowledge distillation) were unable to make an impact, as most improvements were gained from ensemble methods. A promising direction was taken by the winning model, which attempted to learn a latent representation of the “brain state” from the various behavioral variables, inspired by [19]. However, that model utilized the timestamps of the test set to estimate recent neuronal activity, which the other competitors did not have access to.
Taken together, we identified three main limitations of previous work that this study aims to address:
Head-fixed preparations. Most previous models operated on data from animals in head-fixed conditions viewing static stimuli, which do not mirror natural behavior and thus provide limited insight into visual processing in real-world environments. In contrast, the present work is applied to state-of-the-art neurophysiological recordings of V1 activity in freely moving mice. This represents a dramatic shift in the “parameter space” of visual input, from static images to dynamic, real-world visual input. This shift could make the modeling problem harder, because the stimulus set is more complex, or easier, because the input is better matched to the computational challenge the brain evolved to solve.
Limited influence of behavioral state. Previous models often limited the influence of behavioral state to eye measurements and treadmill running speed, which were either concatenated with the visual features [14, 32], utilized in the shifter network to determine the gaze-contingent retinal input [11, 14], or used to predict a multiplicative gain factor [11].
Missing temporal dynamics. Most previous models ignored the temporal factors that might influence V1 activity and overlooked the dynamic nature of visual processing (but see [11]). We overcome this limitation by utilizing approximately 1-hour-long recordings of three mice freely exploring an arena, and our model is capable of handling continuous data streams of any length.
3 Methods
Head-mounted recording system
We had access to data from three adult mice that were freely exploring a 48 cm long × 37 cm wide × 30 cm high arena (Fig. 1A), collected with a state-of-the-art recording system [24] that combined high-density silicon probes with miniature head-mounted cameras (Fig. 1B). One camera was aimed outwards to capture the visual scene from the mouse’s perspective (“worldcam”) at 16 ms per frame (downsampled to 60 × 80 pixels). A second camera, aimed at the eye, was used to extract eye position (θ, ϕ) and pupil radius (σ) at 30 Hz using DeepLabCut [34], which allowed the worldcam video to be corrected for eye movements (see [24] for details). Pitch (ρ) and roll (ω) of the mouse’s head were extracted at 30 kHz from the inertial measurement unit (IMU). Locomotion speed (s) was estimated from the top-down camera feed using DeepLabCut [34]. Electrophysiology data were acquired at 30 kHz using an 11 μm × 15 μm multi-shank linear silicon probe (128 channels) implanted in the center of the left monocular V1, then bandpass-filtered between 0.01 Hz and 7.5 kHz, and spike-sorted with Kilosort 2.5 [35]. Single units were selected using Phy2 (see [36]), and inactive units (mean firing rate < 3 Hz) were removed. This yielded 68, 32, and 49 active units for Mice 1–3, respectively. To prepare the data for machine learning, all data streams were deinterlaced and resampled at 20.83 Hz (48 ms per frame; Fig. 1C).
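The resampling step can be illustrated with a short sketch: asynchronous behavioral streams are linearly interpolated onto a common 48 ms time base, and spikes are binned into the same frames. The function and variable names (`resample_streams`, `bin_spikes`, the stream dictionary) are hypothetical, and the interpolation scheme is an assumption; the original pipeline may differ.

```python
import numpy as np

FRAME_DT = 0.048  # 48 ms per model frame (20.83 Hz)

def resample_streams(t_common, streams):
    """Linearly interpolate each (timestamps, values) stream onto the common clock.
    `streams` is a dict like {'theta': (t, x), 'speed': (t, x), ...} (names hypothetical)."""
    return {name: np.interp(t_common, t, x) for name, (t, x) in streams.items()}

def bin_spikes(t_common, spike_times_per_unit):
    """Count each unit's spikes in 48 ms bins aligned to the common clock."""
    edges = np.append(t_common, t_common[-1] + FRAME_DT)
    return np.stack([np.histogram(st, bins=edges)[0] for st in spike_times_per_unit])
```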
Model architecture
We used a 3-layer CNN (kernel size 7, 128 × 64 × 32 channels) to encode the visual stimulus. Each convolutional layer was followed by a BatchNorm layer, a ReLU, and a Dropout layer (rate 0.5). A fully-connected layer transformed the learned visual features into a visual feature vector, v (Fig. 2, top-right). In a purely visual version of the model, v was fed into a fully-connected layer, followed by a softplus layer, to yield a neuronal response prediction.
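A minimal PyTorch sketch of this visual core is shown below; padding, stride, and the dimensionality of v are not specified in the text and are therefore assumed.

```python
import torch
import torch.nn as nn

class VisualCore(nn.Module):
    """3-layer CNN core: kernel size 7, 128-64-32 channels, each convolution
    followed by BatchNorm, ReLU, and Dropout(0.5); a fully-connected layer maps
    the flattened features to the visual feature vector v."""
    def __init__(self, in_hw=(60, 80), feat_dim=128, dropout=0.5):
        super().__init__()
        chans = (1, 128, 64, 32)
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=7, padding=3),
                       nn.BatchNorm2d(c_out), nn.ReLU(), nn.Dropout(dropout)]
        self.conv = nn.Sequential(*layers)
        self.fc = nn.Linear(32 * in_hw[0] * in_hw[1], feat_dim)  # feat_dim is assumed

    def forward(self, frames):                          # frames: (B, 1, 60, 80)
        return self.fc(self.conv(frames).flatten(1))    # visual feature vector v
```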
To encode behavioral state, we constructed an input vector from different sets of behavioral variables:
𝒮 : all behavioral variables used in the Sensorium+ competition [14], consisting of running speed (s), pupil size (σ), and its temporal derivative (σ̇);
ℬ : all behavioral variables used in [24], consisting of eye position (θ, ϕ), head position (ρ, ω), pupil size (σ), and running speed (s);
𝒟 : the first-order temporal derivatives of the variables in ℬ, namely θ̇, ϕ̇, ρ̇, ω̇, σ̇, and ṡ.
To test for interactions between behavioral variables, these sets could also include the pairwise multiplication of their elements; e.g., ℬ× = {bibj ∀ (bi, bj) ∈ ℬ}. The input vector was then passed through a batch normalization layer and a fully connected layer (subjected to a strong L1 norm for feature selection) to produce a behavioral vector, b.
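As an illustration, the interaction terms can be constructed as follows; whether self-products (bᵢ²) are included is not specified in the text, so this sketch includes them.

```python
import itertools
import torch

def with_pairwise_products(b):
    """Augment behavioral variables b (batch, K) with all pairwise products
    b_i * b_j, yielding an interaction set such as B^x in the text."""
    idx = itertools.combinations_with_replacement(range(b.shape[1]), 2)
    prods = torch.stack([b[:, i] * b[:, j] for i, j in idx], dim=1)
    return torch.cat([b, prods], dim=1)
```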
We then concatenated the vectors v, b, and their element-wise product v ⊙ b (all calculated for each individual input frame), fed them through a batch normalization layer, and input them to a 1-layer gated recurrent unit (GRU) (hidden size of 512). To incorporate temporal dynamics, we constructed different versions (GRUk) of the model that had access to k previous frames. A fully-connected layer and a softplus activation function were applied to yield the neuronal response prediction.
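Putting the pieces together, a hedged sketch of the full architecture might look as follows; the dimensionality of b (matched to v so that the element-wise product is defined) and the use of the last GRU output for the readout are assumptions.

```python
import torch
import torch.nn as nn

class V1Model(nn.Module):
    """Sketch of the full model: per-frame visual features v and behavioral
    vector b are concatenated with v ⊙ b, batch-normalized, passed through a
    1-layer GRU (hidden size 512), and read out through a softplus."""
    def __init__(self, core, n_behav, n_neurons, feat_dim=128, hidden=512):
        super().__init__()
        self.core = core                              # e.g., VisualCore from above
        self.behav_bn = nn.BatchNorm1d(n_behav)
        self.behav_fc = nn.Linear(n_behav, feat_dim)  # L1-penalized during training
        self.fuse_bn = nn.BatchNorm1d(3 * feat_dim)
        self.gru = nn.GRU(3 * feat_dim, hidden, batch_first=True)
        self.readout = nn.Sequential(nn.Linear(hidden, n_neurons), nn.Softplus())

    def forward(self, frames, behav):
        # frames: (B, T, 1, 60, 80); behav: (B, T, n_behav); T = k frames for GRU_k
        B, T = frames.shape[:2]
        v = self.core(frames.flatten(0, 1))                     # (B*T, feat_dim)
        b = self.behav_fc(self.behav_bn(behav.flatten(0, 1)))   # (B*T, feat_dim)
        x = self.fuse_bn(torch.cat([v, b, v * b], dim=1))       # per-frame fusion
        out, _ = self.gru(x.view(B, T, -1))
        return self.readout(out[:, -1])                         # predicted rates (B, n_neurons)
```

The v ⊙ b term gives the behavioral state a multiplicative handle on the visual features, consistent with the gain modulation of visual responses discussed above [24].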
Training and model evaluation
To deal with the continuous and dynamic nature of the data, we split the ∼ 1 h-long recording into 10 consecutive segments. The first 70 % of each segment were then reserved for training (including an 80-20 validation split) and the remaining 30 % for testing.
Models were separately trained on the data from each mouse. Model parameters were optimized with Adam (batch size: 256, CNN learning rate: .0001, full model: .0002) to minimize the Poisson loss, (1/N) Σₙ (r̂ₙ − rₙ log r̂ₙ), between the predicted neuronal response r̂ and the ground truth r, where N denotes the number of recorded neurons for each mouse. We used early stopping on the validation set (patience: 5 epochs), which led all models to converge in less than 50 epochs. Due to the large number of hyper-parameters, the specific network and training settings were determined using a combination of grid search and manual exploration on a validation set (see Appendix A).
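A sketch of the corresponding training step, using the module names from the architecture sketches above; the data loader, the L1 weight on the behavioral layer, and the exact parameter grouping are assumptions.

```python
import torch

model = V1Model(VisualCore(), n_behav=n_behav, n_neurons=n_neurons)   # classes from the sketches above
opt = torch.optim.Adam([
    {"params": model.core.parameters(), "lr": 1e-4},                  # CNN learning rate
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("core.")], "lr": 2e-4},           # rest of the model
])
poisson = torch.nn.PoissonNLLLoss(log_input=False)                    # expects rates, not log-rates

for frames, behav, spikes in train_loader:                            # hypothetical DataLoader (batch size 256)
    rates = model(frames, behav)
    loss = poisson(rates, spikes)
    loss = loss + 1e-3 * model.behav_fc.weight.abs().sum()            # L1 on behavioral weights (weight assumed)
    opt.zero_grad()
    loss.backward()
    opt.step()
```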
To evaluate model performance, we calculated the cross-correlation (cc) between a smoothed version (2 s boxcar filter) of the predicted and ground-truth response for each recorded neuron [24].
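Interpreting this metric as the Pearson correlation of boxcar-smoothed traces, a minimal implementation could look like this (the exact smoothing and normalization used in [24] may differ):

```python
import numpy as np

def smoothed_cc(pred, true, dt=0.048, win_s=2.0):
    """Per-neuron correlation between boxcar-smoothed (2 s) predicted and
    ground-truth traces. pred, true: arrays of shape (T, N)."""
    w = max(1, int(round(win_s / dt)))
    kernel = np.ones(w) / w
    cc = []
    for n in range(pred.shape[1]):
        p = np.convolve(pred[:, n], kernel, mode="same")
        t = np.convolve(true[:, n], kernel, mode="same")
        cc.append(np.corrcoef(p, t)[0, 1])
    return np.array(cc)
```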
All models were implemented in PyTorch and trained on an NVIDIA RTX 3090 with 24GB memory. All code will be made available on GitHub should this paper get accepted.
Maximally activating stimuli
We used gradient ascent [33] to discover the visual stimuli that most strongly activate a particular model neuron in our network. The visual input was initialized with noise sampled from 𝒩(0.5, 2). We then used the Adam optimizer to iteratively update the input along the gradient of the target neuron’s activity with respect to the input. We also applied L2 regularization (weight 0.02) and Laplacian regularization (weight 0.01) [37] to the image. This procedure was repeated for 6,400 iterations.
The resulting, maximally activating visual stimuli were smoothed with a Butterworth filter (low-pass, .05 cutoff frequency ratio) to reduce the impact of high-frequency noise.
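A hedged sketch of this gradient-ascent procedure is given below; the step size, the exact form of the regularizers, and the helper model_visual_response (which evaluates the model's response to a single frame) are assumptions. The final Butterworth smoothing step is omitted.

```python
import torch
import torch.nn.functional as F

def maximally_activating_stimulus(model, neuron_idx, shape=(1, 1, 60, 80),
                                  n_steps=6400, lr=0.01, l2=0.02, lap=0.01):
    """Gradient ascent on the pixel input to maximize one neuron's predicted activity."""
    x = torch.normal(0.5, 2.0, size=shape).requires_grad_(True)   # N(0.5, 2) init (std interpretation assumed)
    opt = torch.optim.Adam([x], lr=lr)                            # lr assumed
    lap_kernel = torch.tensor([[[[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]]])
    for _ in range(n_steps):
        act = model_visual_response(model, x)[neuron_idx]         # hypothetical helper: response to one frame
        smooth_pen = F.conv2d(x, lap_kernel, padding=1).pow(2).mean()
        loss = -act + l2 * x.pow(2).mean() + lap * smooth_pen     # maximize activity, penalize energy/roughness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```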
Saliency map
We computed a saliency map [38] of the behavioral vector for each neuron to discover which behavioral variables contributed most strongly to each model neuron’s activity. We iterated through the test dataset, recorded the gradient of each behavioral input with respect to each neuron’s prediction, and then averaged the gradients per neuron to obtain the saliency map.
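A minimal sketch of this procedure, assuming the input shapes from the model sketch above:

```python
import torch

def behavioral_saliency(model, test_loader, n_neurons, n_behav):
    """Average gradient of each neuron's predicted rate w.r.t. each behavioral
    input, accumulated over the test set (shapes follow the model sketch above)."""
    saliency = torch.zeros(n_neurons, n_behav)
    n_frames = 0
    for frames, behav, _ in test_loader:
        behav = behav.clone().requires_grad_(True)                 # (B, T, n_behav)
        rates = model(frames, behav)                               # (B, n_neurons)
        for n in range(n_neurons):
            grad, = torch.autograd.grad(rates[:, n].sum(), behav, retain_graph=True)
            saliency[n] += grad.sum(dim=(0, 1))                    # sum over batch and time
        n_frames += behav.shape[0] * behav.shape[1]
    return saliency / n_frames
```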
4 Results
Mouse V1 activity is best predicted with a 3-layer CNN
To determine the purely visual contribution to V1 responses, we experimented with a large number of vision architectures (see Appendix A). In the end, a vanilla 3-layer CNN (kernel size 7, 128 × 64 × 32 channels) yielded the best cross-correlation between predicted and ground-truth responses (Table 1), outperforming the best autoencoder architecture (kernel size: 7, encoder: 64 × 128 × 256 channels, decoder: 256 × 128 × 64 channels), ResNet-18 [39] (a 20-layer CNN with its first layer modified to accept a single input channel), EfficientNet-B0 [40] (a 65-layer CNN with its first layer modified to accept a single input channel), and the Sensorium baseline [12] (a 4-layer CNN with a readout network). The greatest improvement in cross-correlation was achieved for Mouse 2, whose activity overall proved much harder to predict than that of Mice 1 and 3.
Behavioral variables improve most neuronal predictions
Once we had identified the 3-layer CNN as the best visual encoder, we added the different sets of behavioral variables to the network. To allow for a fair comparison with the Sensorium+ baseline [14], we first limited ourselves to 𝒮, but then gradually added more behavioral variables (ℬ) [24] as well as the derivatives of these variables (𝒟) and multiplicative pairs (ℬ× and {ℬ ∪ 𝒟}×).
The results are shown in Table 2. All models were able to outperform the Sensorium+ baseline, and the addition of behavioral variables and their interactions further improved model performance. Note that although the full model used a GRU to combine visual and behavioral features, the input sequence length was always 1 (i.e., GRU1). That being said, it is possible that the GRU learned long-term correlations that the Sensorium+ baseline model did not have access to. Nevertheless, the biggest performance improvements were gained through the addition of behavioral variables related to head and eye position (which are present in ℬ but not in 𝒮), their derivatives (𝒟), and multiplicative interactions between these variables ({ℬ ∪ 𝒟}×).
We also wondered whether the prediction of only some V1 neurons would benefit from the addition of these behavioral variables. To our surprise, the cross-correlation between predicted and ground-truth responses improved for almost all recorded V1 neurons (Fig. 3).
Access to longer series of data in time further improves predictive performance
After we identified the full behavioral feature set ({ℬ ∪ 𝒟}×) as the one yielding the best model performance, we extended the GRU’s temporal dependence by allowing the input to vary from one frame (48 ms) to a total of eight frames (384 ms), and assessed the model’s performance.
The results are shown in Table 3. The number of frames needed to reach peak predictive performance varied across mice (6, 5, and 3, respectively). This indicates that temporal information is important for predicting dynamic neural activity. However, the benefit of temporal information has a limit, and different V1 neurons may integrate information over different timescales.
Well-defined visual receptive fields emerge
To assess whether the CNN+GRU1 model learned meaningful visual receptive fields, we used gradient ascent (see Methods) to find the maximally activating stimulus for each neuron. Receptive fields for the 32 best-predicted neurons are shown in Fig. 4. Interestingly, most of them had well-defined excitatory and inhibitory subregions, often resembling the receptive fields of orientation-selective neurons. Most excitatory and inhibitory subregions spanned approximately 30° of visual angle (the full width of the frame, 80 pixels, corresponds to roughly 120° of visual angle), which is comparable to receptive field sizes typically reported in mouse V1, ranging from 10° to 30° [8, 24, 41].
Receptive fields showed noticeable differences across mice. The model trained on Mouse 1 appears to have learned many robust visual receptive fields, whereas those learned by the models trained on Mice 2 and 3 appear weaker (same colorbar across panels). In addition, even some of the best-predicted neurons lacked a pronounced or spatially structured receptive field, implying that these neurons could be primarily driven by behavioral variables.
Analysis of behavioral saliency maps reveals different types of neurons
Intrigued by the fact that some neurons lacked pronounced visual receptive fields, we aimed to analyze the influence of behavioral state on the predicted neuronal response by performing a saliency map analysis on the behavioral inputs (see Methods). Since different behavioral variables operate on different input ranges, we first standardized the saliency map activities for each behavioral variable across the model neuron population. Saliency map activities further than 1 standard deviation from the mean were then interpreted as “driving” the neuron, allowing us to categorize each neuron as being driven by one or multiple behavioral variables (Fig. 5).
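This categorization step can be summarized in a few lines (a sketch; applying the threshold to the absolute z-score is an assumption):

```python
import numpy as np

def categorize_neurons(saliency, n_std=1.0):
    """Z-score each behavioral variable's saliency across the neuron population
    and flag a neuron as 'driven' by a variable if it lies further than n_std
    standard deviations from the mean. saliency: (n_neurons, n_behav)."""
    z = (saliency - saliency.mean(axis=0)) / saliency.std(axis=0)
    driven = np.abs(z) > n_std
    n_drivers = driven.sum(axis=1)   # 0 => purely visual; >1 => mixed selectivity
    return driven, n_drivers
```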
We first asked which neurons in our model were driven by which behavioral variables (Fig. 5, top). Consistent with [24], we found a large fraction of model neurons driven by eye and head position, and smaller fractions driven by locomotion speed and pupil size. 20-25% of neurons were not driven by any of these behavioral variables, rendering their responses purely visual.
However, a particular neuron could be driven by multiple behavioral variables. Repeating the above analysis, we found that most model neurons showed mixed selectivity (Fig. 5, middle), with only a minority of cells responding exclusively to a single behavioral variable. Adding the interaction terms between behavioral variables (Fig. 5, bottom) did not change the fact that most model V1 neurons encoded combinations of multiple behavioral variables, often relating information about the animal’s eye position to head position and locomotor speed.
5 Discussion
In this paper, we propose a deep recurrent neural network that achieves state-of-the-art predictions of V1 activity in freely moving mice. Our model outperforms previous models under these more naturalistic conditions, which may reflect a closer alignment between this data and the computations the mouse visual system evolved to perform, given its natural visual environment and behavioral repertoire. Similar to previous models, we found that a simple CNN architecture is sufficient to predict the visual response properties of cells in mouse V1.
In addition, mouse V1 is known to be strongly modulated by signals related to the movement of the animal’s eyes, head, and body [20, 21, 24], which are severely restricted in head-fixed preparations. Models trained on head-fixed preparations may thus be limited in their predictive power. In contrast, our model was able to predict V1 activity on a 1-hour continuous data stream, during which the animal freely explored a real-world arena. Our analyses demonstrate the impact of the animal’s behavioral state on V1 activity and reveal that most model V1 neurons exhibit mixed selectivity to multiple behavioral variables.
Accurate predictions of mouse V1 activity under natural conditions
Our brains did not evolve to view stationary stimuli on a computer screen. However, most research on neural coding in vision has been conducted under head-fixed conditions, which do not mirror natural behavior and thus provide limited insight into visual processing in real-world environments. Some visual functions mediated by the ventral stream, such as identifying faces and objects, resemble this condition, but in natural vision the visual scene is constantly shifting due to self-motion, which supports dynamic behaviors such as navigation and reaching that are typically mediated by the dorsal stream. To truly understand visual perception in natural environments, we need to capture the computational principles at work when the subject is in motion.
In this research, we take the initial steps towards this by modeling a novel data type encompassing neural activity coupled with a visual scene captured from a freely moving animal’s perspective. This represents a dramatic (but, in our opinion, crucial) shift in the “parameter space” of visual input, from static images projected on a screen to dynamic, real-world visual input.
Mixed selectivity of behavioral variables
Our experiments demonstrated that the models incorporating behavioral variables and their interactions performed substantially better than the models relying exclusively on visual inputs. Moreover, our saliency map analysis showed that only around 25% of model neurons could be considered purely visual, with the majority of model neurons driven by multiple behavioral variables.
This widespread mixed selectivity is consistent with previous literature suggesting that V1 neurons may be modulated by a high-dimensional latent representation of several behavioral variables related to the animal’s movement, recent experiences, and behavioral goals [19]. It is also consistent with the idea of a basis function representation [42, 43], which allows a population of neurons to conjunctively represent multiple behaviorally relevant variables. Such representations are often employed by higher-order visual areas in primate cortex to implement sensorimotor transformations [44]. It is intriguing to find computational evidence for such a representation as early as V1 in the mouse. Future computational studies should therefore aim to study the mechanisms by which V1 neurons might construct a nonlinear combination of behavioral signals.
Limitations and future work
While our study opens a new perspective on modeling neural activity during natural conditions, there are a few limitations that need to be acknowledged. First, our data was relatively limited (around 50 neurons per animal, for 3 animals). The development of a Sensorium-style standardized dataset [14] for freely-moving mice would significantly benefit future research in this area, enabling more robust comparisons between different modeling approaches. Second, it would be beneficial to integrate other modalities that are known to be encoded in mouse V1 into the model. One such example is reward signals [45], which could provide additional information about the animal’s decision-making processes and motivations during exploration.
Acknowledgments
This work was supported by the National Institute of Neurological Disorders and Stroke of the National Institutes of Health under Award Number R01-NS121919. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Appendix
A. Vision-Only Models
A.1 Hyperparameter tuning
We performed a grid search to find the optimal CNN kernel size (3, 5, 7, 9), number of channels (32, 64, 128, 256, 512; in various combinations), and dropout rate (0, 0.25, 0.5). While other models often rely on kernel size 3 for their CNN, we found that these small kernels led to worse performance, perhaps due to the mouse’s low-resolution vision, and that size 7 performed better. We repeated the grid search for CNNs with different numbers of convolutional layers. The resulting 3-layer CNN outperformed many differently sized networks, such as a 1-layer CNN with 1024 channels (i.e., a shallow but wide network), a 2-layer CNN, and a 4-layer CNN. The choice of learning rate and optimizer had no notable effect on the final performance of the networks.
A.2 Autoencoder
We hypothesized that an autoencoder could provide regularization benefits, because the reconstruction loss might encourage the model to learn visual features that are useful for decoding. Specifically, an encoder ϕ mapped the original frame ℱ to a latent vector 𝒱 at the network’s bottleneck, and a decoder ψ mapped 𝒱 back from the latent space to a reconstruction of the frame.
After the hyperparameter search, we settled on a latent vector of size 256, and the weight of the reconstruction loss relative to the Poisson loss was fixed at 0.5. Both the encoder and the decoder were 3-layer CNNs with symmetric numbers of channels. However, after testing a number of autoencoders with different configurations (Table 4), we found that a simple 3-layer CNN outperformed all of the tested autoencoders.
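For reference, a sketch of one such autoencoder configuration; the up-/down-sampling scheme, the reconstruction loss (assumed here to be mean-squared error), and how the latent vector feeds the readout are assumptions.

```python
import torch
import torch.nn as nn

class AutoencoderCore(nn.Module):
    """Sketch of a tested autoencoder core: symmetric 3-layer encoder/decoder
    (64-128-256 / 256-128-64 channels, kernel 7) around a 256-d latent bottleneck."""
    def __init__(self, in_hw=(60, 80), latent=256):
        super().__init__()
        def block(c_in, c_out):
            return [nn.Conv2d(c_in, c_out, 7, padding=3), nn.BatchNorm2d(c_out), nn.ReLU()]
        self.encoder = nn.Sequential(*block(1, 64), *block(64, 128), *block(128, 256))
        self.to_latent = nn.Linear(256 * in_hw[0] * in_hw[1], latent)
        self.from_latent = nn.Linear(latent, 256 * in_hw[0] * in_hw[1])
        self.decoder = nn.Sequential(*block(256, 128), *block(128, 64),
                                     nn.Conv2d(64, 1, 7, padding=3))
        self.in_hw = in_hw

    def forward(self, frames):                                  # frames: (B, 1, 60, 80)
        z = self.to_latent(self.encoder(frames).flatten(1))     # latent vector V (fed to the readout)
        recon = self.decoder(self.from_latent(z).view(-1, 256, *self.in_hw))
        return z, recon

# Combined objective (sketch): poisson(readout(z), spikes) + 0.5 * mse(recon, frames)
```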