Decoding Images from Multi-Region, High-Resolution Electrode Recordings in the Mouse Visual System

We hypothesize that deep networks are superior to linear decoders at recovering visual stimuli from neural activity. Using high-resolution, multielectrode Neuropixels recordings, we verify this is the case for a simple feed-forward deep neural network having just 7 layers. These results suggest that these feed-forward neural networks and perhaps more complex deep architectures will give superior performance in a visual brain-machine interface.


Introduction
A visual brain-machine interface needs to decode images from cortical activity [1], particularly natural images, which have distinct spatial frequency spectra [2]. While natural images cause sparse activation of neurons in primary visual cortex (V1) [3] [4] [5], it is unclear how this sparse set of neurons represents a given image and how the image can be inferred (decoded) from the sparse activity.
Linear decoding assumes that the represented image is a linear function of neuron activity. Such decoders underlie classic visual neuroscience theory [6] [7], but they derive from experiments featuring small numbers of electrodes, e.g. [8]. As advancements such as optogenetics, two-photon imaging, and electrode design increase the scale of available physiological data, linear decoders may struggle to keep pace in their performance, while deep learning continues to outperform linear decoders as the state of the art in a multitude of tasks and brain regions [9] [10] [11] [12] [13]. The caveat to this increased performance is that deep networks perform best with large training sets of input/output pairs. Until recently, direct neural data in the form of electrode recordings were limited to few (on the order of one hundred) measurements in a single brain region using hardware such as the Utah Array [14]. Neuropixels probes [15] are a recent hardware advancement that gives not only an order of magnitude increase in the number of electrodes, but also data from multiple brain regions simultaneously. A public dataset of Neuropixels recordings in mice was recently made available online [16]. With this rapid expansion of available data, deep networks are poised to offer state-of-the-art decoding performance from visual cortex activity.

Data Collection & Feature Extraction
The Neuropixels Dataset [17] covers 58 experiments in which visual stimuli ranging from Gabor functions to natural scenes are presented to mice with multiple Neuropixels probes inserted into visual cortex. The length of the probes also permits recordings from subcortical structures such as the Lateral Geniculate Nucleus (LGN) and the Lateral Posterior (LP) nucleus. High-pass-filtered electrode recording data is statistically whitened before being passed through the Kilosort2 algorithm to identify specific neurons, giving spike times for a set of neurons/units assigned to a given anatomical region [18].
The decoder maps measured spiking rates recorded by the probes to the images presented to the subject on a 1920 × 1080 monitor. That is, we infer the image on the monitor from the spiking activity.
Input Spike Rates: We compute the spike rate for each of the N neurons by counting the number of spikes in a window of length τ after a given image is presented, as in figure (1).
Spike rates are then z-scored so that each channel has zero mean and unit variance across the time of recording. Using the linear model described below, we empirically determine that the optimal window length is 167 ms. As shown in figure (2), this length offers the best trade-off between accuracy and latency, since it is the point at which the improvement in accuracy with integration time switches from exponential to linear.
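The windowed spike count and z-scoring steps can be sketched as follows. This is a minimal illustration rather than the pipeline used here: the function names, the half-open window [onset, onset + τ), and the choice to z-score across image presentations (instead of across continuous recording time) are all assumptions made for the example.

```python
import numpy as np

def spike_rates(spike_times, onsets, tau):
    """Return an (n_images, n_neurons) array of rates in spikes per second.

    spike_times: list of sorted 1-D arrays of spike times, one per neuron (s).
    onsets: 1-D array of image onset times (s).
    tau: window length in seconds (hypothetical parameter name).
    """
    onsets = np.asarray(onsets, dtype=float)
    rates = np.empty((onsets.size, len(spike_times)))
    for j, times in enumerate(spike_times):
        times = np.asarray(times, dtype=float)
        # Count spikes in [onset, onset + tau) with two binary searches.
        start = np.searchsorted(times, onsets, side="left")
        stop = np.searchsorted(times, onsets + tau, side="left")
        rates[:, j] = (stop - start) / tau
    return rates

def zscore(rates, eps=1e-12):
    """Z-score each neuron (column) so it has zero mean and unit variance."""
    mean = rates.mean(axis=0)
    std = rates.std(axis=0)
    # Guard against silent neurons whose rate never changes.
    return (rates - mean) / np.maximum(std, eps)
```

With a 167 ms window, a call would look like `zscore(spike_rates(spike_times, onsets, tau=0.167))`, giving one row of standardized rates per presented image.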
Output Images: The network's outputs are the 1920 × 1080 images presented at the beginning of the integration window of length τ = 30 ms. To reduce the dimensionality, we resample the images by averaging to size 120 × 64. The monitor spanned 120° × 95° of the mouse visual field, so the resampled resolution of roughly one pixel per degree limits the spatial frequency of the presented images to 0.5 cycles/°, the approximate limit of spatial frequency observable by mice [19]. Thus, we preserve all perceptible visual information while reducing the size of the image by a factor of approximately $16^2$.
Here we considered recording data from one session (ID: 840012044) over exclusively natural stimuli ('natural_movie_one_more_repeats'), giving a total of 54,000 frames presented over 80 minutes at two separate intervals.
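Resampling by averaging can be sketched as below. Because 1080 is not an integer multiple of 64, this example rounds the bin edges to whole pixels, so neighbouring bins may differ in height by one pixel. That binning rule and the function name are assumptions for illustration; the exact resampling scheme used here is not specified.

```python
import numpy as np

def bin_average(frame, out_h, out_w):
    """Average a 2-D frame onto an (out_h, out_w) grid of pixel bins."""
    frame = np.asarray(frame, dtype=float)
    # Bin edges rounded to whole pixels; strictly increasing when downsampling.
    row_edges = np.linspace(0, frame.shape[0], out_h + 1).round().astype(int)
    col_edges = np.linspace(0, frame.shape[1], out_w + 1).round().astype(int)
    # Sum the rows inside each bin, then the columns inside each bin.
    sums = np.add.reduceat(frame, row_edges[:-1], axis=0)
    sums = np.add.reduceat(sums, col_edges[:-1], axis=1)
    # Divide by the number of source pixels that fell into each bin.
    counts = np.outer(np.diff(row_edges), np.diff(col_edges))
    return sums / counts

# A 1080-row by 1920-column frame becomes 64 rows by 120 columns.
small = bin_average(np.ones((1080, 1920)), 64, 120)

# 120 output columns across 120 degrees is 1 pixel per degree,
# so the Nyquist limit is half of that, in cycles per degree.
nyquist_cycles_per_degree = 0.5 * (120 / 120.0)
```

The last line restates the arithmetic behind the 0.5 cycles/° limit: one sample per degree can represent at most half a cycle per degree.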

Decoding Model Selection, Training, & Evaluation
We evaluate a linear decoding model and compare it to a deep feed-forward neural network.
Model Objective: We use the mean square error (MSE) as the loss function, as is common in the image processing literature [20]. The MSE measures the squared norm of the difference in pixel intensities between the true image $y$ and the predicted image $\hat{y}$:

$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{N_{\text{pixels}}} \lVert y - \hat{y} \rVert_2^2$$

Model Training: To train the deep network, we use mini-batch gradient descent with the Adaptive Moment Estimation (Adam) algorithm [21], a batch size of 1000, and a learning rate of 0.0001. The deep network is a feed-forward sequential network that maps the $N_{\text{neurons}}$ input spike rates to $N_{\text{pixels}}$ output intensities through 7 hidden layers, each of width 77, as sketched in figure (3). These values were chosen by training a new network for each combination of logarithmically spaced parameter values and evaluating the performance after 1000 epochs.
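The shape of such a network and its loss can be sketched in NumPy as follows. Only the depth (7 hidden layers) and width (77) come from the text. The ReLU activation, the linear output layer, the He-style initialization, and all function names are assumptions for illustration, and the Adam training loop is omitted.

```python
import numpy as np

def init_mlp(n_neurons, n_pixels, depth=7, width=77, seed=0):
    """Create weights for n_neurons -> 7 hidden layers of 77 units -> n_pixels."""
    rng = np.random.default_rng(seed)
    sizes = [n_neurons] + [width] * depth + [n_pixels]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # He-style scaling (assumed; the initialization is not stated).
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((W, np.zeros(fan_out)))
    return params

def forward(params, x):
    """Map a batch of spike rates (n_samples, n_neurons) to flattened images."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # ReLU nonlinearity (assumed)
    W, b = params[-1]
    return h @ W + b  # linear output layer (assumed)

def mse(y, y_hat):
    """Mean squared pixel error, averaged over all samples and pixels."""
    return np.mean((y - y_hat) ** 2)
```

For the 120 × 64 targets, `init_mlp(n_neurons, 120 * 64)` yields eight weight matrices: seven feeding the hidden layers and one producing the 7680 output pixels.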
For the linear decoder, the MSE objective has an analytic solution:

$$F = (X^\top X)^{-1} X^\top Y$$

where $X \in \mathbb{R}^{N_{\text{samples}} \times N_{\text{neurons}}}$ gives the recorded spike rates, $Y \in \mathbb{R}^{N_{\text{samples}} \times N_{\text{pixels}}}$ gives the presented images, and $F \in \mathbb{R}^{N_{\text{neurons}} \times N_{\text{pixels}}}$ gives the linear mapping between spike rates and images. Each row of $F$ gives the pixel intensities decoded by a spike in that respective neuron.
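A least-squares fit of this kind can be sketched with NumPy. The example uses `np.linalg.lstsq`, which minimizes the same squared error as the normal equations but is numerically more stable than forming an explicit inverse. The function names and the absence of a bias term or regularization are assumptions for illustration.

```python
import numpy as np

def fit_linear_decoder(X, Y):
    """Return F of shape (n_neurons, n_pixels) minimizing ||X F - Y||^2.

    X: (n_samples, n_neurons) z-scored spike rates.
    Y: (n_samples, n_pixels) flattened target images.
    """
    F, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return F

def decode(X, F):
    """Predict flattened images from spike rates with the fitted mapping."""
    return X @ F
```

When X has full column rank, the result matches the closed-form normal-equation solution up to floating-point error.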

Results
We trained each model on the MSE loss function using the complete dataset, i.e. all channels from all anatomical regions. Accuracy is measured by the average percentage of the error between the true and predicted images.

Table 1: Model accuracy, computed using all N neurons for the linear, deep, and random (null) models. For the latter, pixel values of the predictions were randomly drawn from a Gaussian distribution chosen to have the same mean and unit variance as the target images.

Model             Linear     Deep Network     Random Output
Train Accuracy    40.0%      55.5%            < 1%
Test Accuracy     39.53%     53.5%            < 1%

Table (1) lists the accuracy for each model and error metric. We see that the deep network outperforms the optimal linear decoder by roughly 15 percentage points. This metric improvement is also visible in the decoded images. A sample image comparing the presented image with the two model predictions is given in figure (4).

Conclusions & Future Work
We have shown that deep networks are more effective at decoding visual information from neural activity than their linear counterparts. This suggests that a deep network will outperform a linear decoder when used in a visual brain-machine interface. Given the simplicity of the architecture and the expansive research in deep learning, future work could incorporate more sophisticated designs such as Convolutional Neural Networks (CNNs) and various forms of Recurrent Neural Networks (RNNs) to process the sequential nature of temporal spike recordings.