## Abstract

Neural decoding methods provide a powerful tool for quantifying the information content of neural population codes and the limits imposed by correlations in neural activity. However, standard decoding methods are prone to overfitting and scale poorly to high-dimensional settings. Here, we introduce a novel decoding method to overcome these limitations. Our approach, the Gaussian process multi-class decoder (GPMD), is well-suited to decoding a continuous low-dimensional variable from high-dimensional population activity, and provides a platform for assessing the importance of correlations in neural population codes. The GPMD is a multinomial logistic regression model with a Gaussian process prior over the decoding weights. The prior includes hyperparameters that govern the smoothness of each neuron’s decoding weights, allowing automatic pruning of uninformative neurons during inference. We provide a variational inference method for fitting the GPMD to data, which scales to hundreds or thousands of neurons and performs well even in datasets with more neurons than trials. We apply the GPMD to recordings from primary visual cortex in three different species: monkey, ferret, and mouse. Our decoder achieves state-of-the-art accuracy on all three datasets, and substantially outperforms independent Bayesian decoding, showing that knowledge of the correlation structure is essential for optimal decoding in all three species.

## 1 Introduction

Since Zohary, Shadlen, and Newsome’s landmark demonstration of correlated activity in a population of MT neurons (Zohary et al., 1994), computational neuroscience has been seeking to elucidate the role that correlations play in the population code (Averbeck et al., 2006; Bartolo et al., 2020; Ecker et al., 2011; Kanitscheider et al., 2015; Kohn et al., 2016; Moreno-Bote et al., 2014; Nirenberg & Latham, 2003; Schneidman et al., 2003). A common strategy for evaluating the role that correlations play in a particular population’s code is to compare the accuracy of two decoders trained on that population’s stimulus-response data: a correlation-blind decoder, and a correlation-aware decoder (Berens et al., 2012; Graf et al., 2011; Nirenberg & Latham, 2003; Stringer et al., 2021). If the correlation-aware decoder performs better, then one may conclude that downstream regions must take correlations into account to optimally read out information from the upstream population code.

This strategy is an effective way to investigate the scientific question, but existing work is plagued by a number of statistical issues, which we aim to address in this paper. First, as neural datasets have increased in dimensionality, regularization has become a prerequisite for good decoding performance, making it difficult to compare correlation-blind decoders—which are often unregularized—and correlation-aware decoders, which are almost always regularized. Second, conventional correlation-aware decoders struggle to scale computationally to modern datasets containing tens or hundreds of thousands of neurons.

To address these shortcomings, we develop a suite of three new decoders with a common regularization strategy based on Gaussian Processes (GPs). First, we introduce two correlation-blind decoders that apply Bayesian decoding to an independent encoding model: the GP Poisson Independent Decoder (GPPID), which assumes independent Poisson encoding noise; and the GP Gaussian Independent Decoder (GPGID), which assumes independent Gaussian encoding noise. Both of these decoders place a Gaussian process prior over the neural tuning curves. (For each neuron, its tuning curve is its mean response as a function of the stimulus variable.) The GPPID model can be used when the neural responses are encoded by non-negative integers (e.g., spike counts), whereas the GPGID model can be used when the neural responses are real numbers (e.g., calcium imaging). We emphasize that both of these decoders are insensitive to correlations in neural activity, because they rely on independence assumptions.

We then introduce a third, correlation-aware decoder: the Gaussian Process Multiclass Decoder (GPMD), a multinomial logistic regression model that uses a GP prior to regularize its weights. This decoder, which learns a direct linear mapping from high-dimensional neural activity patterns to the log-probability of the stimulus, is the only one of the three that can take into account neural correlations. However, the three decoders have a similar number of parameters (equal to the number of neurons times the number of stimulus categories) and rely on a common regularization method, making it straightforward to compare them.

We compared our decoders to a variety of previously proposed decoding methods: first, multinomial logistic regression regularized using an elastic-net penalty (GLM-NET, see Zou and Hastie [2005]); second, the empirical linear decoder (ELD), a decoder trained using support vector machines (Graf et al., 2011); and third, the “super-neuron” decoder (SND), a recently proposed decoder trained using least squares regression and a bank of nonlinear target functions (Stringer et al., 2021). All three of these decoders are linear, correlation-aware classifiers. For completeness, we also compared our decoders to unregularized, correlation-blind Poisson and Gaussian independent decoders (PID/GID).

We benchmarked all these decoders on three real-world datasets from primary visual cortex (V1), recorded from monkey (Graf et al., 2011), ferret, and mouse (Stringer et al., 2021). We found that our regularized correlation-blind decoders (GPPID and GPGID) could match and even exceed the performance of some of the correlation-aware decoders. However, none of these decoders did as well as our proposed correlation-aware decoder, the GPMD, which achieved state-of-the-art performance on all datasets. These results indicate that knowledge of the correlation structure is crucial for reading out stimulus information from V1 populations in all three species. For ease of use, our decoders conform to the scikit-learn interface and are released as a Python package at https://github.com/cdgreenidge/gdec.

## 2 The neural decoding problem

In this paper, we consider the problem of decoding a low-dimensional stimulus variable (e.g., the orientation of a sinusoidal grating) from a high-dimensional neural activity pattern (e.g., a vector of spike counts). We assume the stimulus belongs to one of *K* discrete bins or classes, formally making this a classification problem. However, the regression problem can be approximated by making *K* large, so that the grid of stimulus values becomes arbitrarily fine.

Figure 1 illustrates the problem setup for the V1 datasets we examined. The visual stimulus for each individual trial is a drifting sinusoidal grating with an orientation *θ*_{k} selected from a set of discrete orientations {*θ*_{1}, …, *θ*_{K}} that evenly divide the interval [0, 2*π*]. The stimulus variable to be decoded is thus a categorical variable *y* ∈ {1, …, *K*}.

We consider the neural population response to be a vector **x** ∈ ℝ^{D}, where *D* indicates the number of neurons in the dataset. We obtained this response vector by summing each neuron’s spikes (monkey) or two-photon calcium fluorescence (ferret and mouse) over some time window after stimulus presentation. Figure 1B shows orientation tuning curves from three example neurons from each dataset. The monkey datasets (left) contained between *D* = 68 and *D* = 147 neurons, with *K* = 72 discrete stimulus orientations (spaced every 5 degrees), and 50 trials per orientation for *T* = 3600 trials (Graf et al., 2011). The ferret dataset (middle) contained *D* = 784 neurons, with *K* = 180 discrete stimuli (spaced every 2 degrees) and 11 trials per orientation for a total of *T* = 1991 trials, with the 0^{°}/360^{°} orientation sampled twice. Finally, the mouse datasets (right) contained between *D* = 11311 and *D* = 20616 neurons. The stimuli for this experiment were sampled uniformly in [0, 2*π*], and we subsequently discretized them into *K* = 180 bins (Stringer et al., 2021). Each bin contained between 12 and 42 trials for a total of between *T* = 4282 and *T* = 4469 trials, depending on the dataset.

In each case, we collected the population response vectors **x** and the discretized stimuli *y* into a classification dataset 𝒟 = {(**x**_{t}, *y*_{t})}_{t=1}^{T}. Full details on these datasets and their preprocessing procedures can be found in Appendix C.

The decoders we consider are all linear classifiers, meaning that they are defined by a set of linear decoding weights and an intercept term. Their common form is:
$$\hat{y} = \arg\max_{k \in \{1, \dots, K\}} \; \mathbf{w}_k^\top \mathbf{x} + b_k \tag{1}$$
where **w**_{k} ∈ ℝ^{D} is a set of decoding weights, and *b*_{k} is an intercept term for stimulus class *k*. Note that an explicit intercept term is not strictly necessary, since it can be included in the weights if a 1-valued entry is appended to **x**. To obtain the stimulus estimate *ŷ*, we compute the dot product between the neural response vector **x** and the weights for each class, add the intercept, and select the class in {1, …, *K*} for which the result is maximal. The full set of parameters for a decoding model is thus the set of decoding weights for each class, which can be written as a *D*×*K* matrix *W* = [**w**_{1}, …, **w**_{K}], and, optionally, a *K*-dimensional intercept vector **b** = [*b*_{1}, …, *b*_{K}]^{⊤}. The only difference between the decoding methods we will consider is the procedure for training these weights from data.
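The decoding rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the released package; the function name `decode_linear`, the 0-based class indices, and the toy weights are assumptions made for the example.

```python
import numpy as np

def decode_linear(X, W, b=None):
    """Return the predicted class index (0-based) for each row of X.

    X: (T, D) responses, W: (D, K) decoding weights, b: optional (K,) intercepts.
    """
    scores = X @ W                      # (T, K) dot products w_k^T x
    if b is not None:
        scores = scores + b             # add the per-class intercept
    return np.argmax(scores, axis=1)    # class with the largest score

# Toy example: D = 2 neurons, K = 3 classes.
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, -1.0]])
b = np.zeros(3)
X = np.array([[2.0, 0.5],
              [0.1, 3.0],
              [-1.0, -2.0]])
y_hat = decode_linear(X, W, b)

# Absorbing the intercept: append a 1-valued entry to x and a row b to W.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
W_aug = np.vstack([W, b])
y_hat_aug = decode_linear(X_aug, W_aug)
```

The second call shows the remark about the intercept: appending a constant entry to each response vector yields the same predictions without an explicit intercept term.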

We note that decoding with linear classifiers is optimal for so-called “exponential-family” probabilistic population codes with linear sufficient statistics (Beck et al., 2007; Ma et al., 2006). Although we could certainly expand our study to consider decoding with nonlinear classifiers, previous analyses of two of the V1 datasets we used showed no benefit from adding nonlinear classification (Graf et al., 2011; Stringer et al., 2021).

## 3 Review of existing decoders

Here we describe previously proposed neural decoding methods, which we will compare to the Gaussian Process based decoding methods we introduce in Section 4.

### 3.1 Correlation-blind decoders

First, we introduce two independent or “correlation-blind” decoders, the first assuming Poisson noise, and the second assuming Gaussian noise. Both decoders make use of Bayes’ rule to obtain a posterior distribution over the stimulus under an independent encoding model, an approach commonly known as “naïve Bayes.” The encoding models underlying these decoders assume that neural responses are conditionally independent given the stimulus, making them unable to take correlations into account during decoding.

#### 3.1.1 The Poisson Independent Decoder (PID)

The Poisson independent decoder relies on an independent Poisson encoding model of neural responses, which assumes that each neuron’s spike count follows an independent Poisson distribution with its mean determined by the stimulus (Abbott, 1994; Földiák, 1993). The encoding model describes *x*_{d}, the response of neuron *d*, as:
$$x_d \mid (y = k) \sim \mathrm{Poiss}(\lambda_{dk}) \tag{2}$$
where *λ*_{dk} is the mean response of neuron *d* to stimulus *k*. Under the conditional independence assumption, the joint distribution of the population response is simply the product of the single-neuron encoding distributions:
$$P(\mathbf{x} \mid y = k) = \prod_{d=1}^{D} \frac{1}{x_d!}\, \lambda_{dk}^{x_d}\, e^{-\lambda_{dk}}$$
where *D* is the total number of neurons.

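Bayes’ theorem gives the posterior probability of each stimulus value given an observed response vector **x**:
$$P(y = k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y = k)\, P(y = k)}{\sum_{j=1}^{K} P(\mathbf{x} \mid y = j)\, P(y = j)}$$

If there are equal numbers of trials per class in the training dataset, the prior probabilities *P*(*y* = *k*) are equal for all *k*, and cancel, leaving the prediction rule
$$P(y = k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y = k)}{\sum_{j=1}^{K} P(\mathbf{x} \mid y = j)} \tag{5}$$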

To fit the model, one has to estimate the parameters {*λ*_{dk}} across all the stimuli for each neuron. Collected into a vector *λ*_{d} = (*λ*_{d1}, …, *λ*_{dK})^{T}, these parameters are known as the neuron’s tuning curve. The maximum likelihood estimate for *λ*_{dk} is given by the mean spike count for each neuron-stimulus combination:
$$\hat{\lambda}_{dk} = \frac{1}{|\mathcal{A}_k|} \sum_{(\mathbf{x}, y) \in \mathcal{A}_k} x_d$$
where 𝒜_{k} = {(**x**_{t}, *y*_{t}) ∈ 𝒟 ∣ *y*_{t} = *k*} is the set of all elements of the dataset associated with a particular stimulus *y* = *k*, and |𝒜_{k}| is the number of elements in 𝒜_{k}.

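Assuming the prior class probabilities *P*(*y* = *k*) are equal, the log of the class-conditional probability (eq. 5), also known as the log posterior over classes, can be written:
$$\log P(y = k \mid \mathbf{x}) = \sum_{d=1}^{D} \left( x_d \log \lambda_{dk} - \lambda_{dk} \right) + c$$
where *c* is a constant we ignore because it does not depend on the class. This shows that the PID decoder is a linear classifier (eq. 1), with weights *Ŵ* and intercepts given by
$$\hat{W}_{dk} = \log \hat{\lambda}_{dk}, \qquad \hat{b}_k = -\sum_{d=1}^{D} \hat{\lambda}_{dk}$$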

See Appendix A for a detailed derivation.
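As an illustration of how the PID reduces to a linear classifier, the sketch below estimates the tuning curves by class-conditional means and converts them to weights (log rates) and intercepts (negative summed rates). The function name `fit_pid`, the small floor `eps` that guards against taking the log of zero, and the synthetic data are assumptions introduced here, not part of the original method description.

```python
import numpy as np

def fit_pid(X, y, K, eps=1e-6):
    """Fit Poisson independent decoder weights from spike counts.

    X: (T, D) non-negative counts, y: (T,) labels in {0, ..., K-1}.
    eps is an assumption added here to avoid log(0) for silent neurons.
    """
    D = X.shape[1]
    lam = np.zeros((D, K))
    for k in range(K):
        lam[:, k] = X[y == k].mean(axis=0)   # mean count per neuron for class k
    lam = np.maximum(lam, eps)
    W = np.log(lam)                          # weights: log tuning curves
    b = -lam.sum(axis=0)                     # intercepts: minus summed rates
    return W, b

# Synthetic population with independent Poisson noise (illustrative only).
rng = np.random.default_rng(0)
K, D, n_per_class = 4, 30, 50
true_rates = rng.uniform(1.0, 20.0, size=(D, K))
y = np.repeat(np.arange(K), n_per_class)
X = rng.poisson(true_rates[:, y].T)          # (T, D) spike counts
W, b = fit_pid(X, y, K)
y_hat = np.argmax(X @ W + b, axis=1)
accuracy = np.mean(y_hat == y)
```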

#### 3.1.2 The Gaussian Independent Decoder (GID)

The Poisson independent decoder described above can only be applied to nonnegative integer data, such as spike counts. For real-valued data such as calcium fluorescence, intracellularly-recorded membrane potential, local field potential, or fMRI BOLD signals, it is common to use a Gaussian encoding model. This model describes *x*_{d}, the response of neuron *d*, as:
$$x_d \mid (y = k) \sim \mathcal{N}(\mu_{dk}, \sigma_d^2)$$
where *μ*_{dk} is the mean response of neuron *d* to stimulus *k*, and *σ*_{d}^{2} is the noise variance for neuron *d*. Unlike a typical Gaussian naïve Bayes decoder, we restrict the noise variance to be constant across stimulus classes, though, as usual, it can vary across neurons. With this restriction, the decoder becomes a linear classifier, like the other decoders we consider. If the noise variance were allowed to vary across stimulus classes, the decoder would be a quadratic classifier (see Appendix A).

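To fit the model, we compute maximum likelihood estimates of the encoding distribution parameters for each neuron, which are given by the class-conditional empirical means *μ̂*_{dk}, and the empirical variances *σ̂*_{d}^{2}, for each *d*-th neuron:
$$\hat{\mu}_{dk} = \frac{1}{|\mathcal{A}_k|} \sum_{\mathbf{x} \in \mathcal{A}_k} x_d, \qquad \hat{\sigma}_d^2 = \frac{1}{T} \sum_{t=1}^{T} \left( x_{td} - \hat{\mu}_{d y_t} \right)^2$$

As before, 𝒟 = {(**x**_{t}, *y*_{t})}_{t=1}^{T} is the dataset, 𝒜_{k} = {**x**_{t} ∣ (**x**_{t}, *y*_{t}) ∈ 𝒟, *y*_{t} = *k*} is the set of all neural response vectors for a particular stimulus *y* = *k*, and *x*_{td} is the response of neuron *d* on trial *t*.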

Decoding stimuli under this encoding model follows from Bayes’ rule in the same manner as in the Poisson independent decoder (eq. 5), but using the Gaussian encoding distribution instead of the Poisson. After some algebra, we can see that the Gaussian independent decoder (GID) is a linear classifier (eq. 1) with weights *W* and intercepts **b** given by
$$W_{dk} = \frac{\hat{\mu}_{dk}}{\hat{\sigma}_d^2}, \qquad b_k = -\sum_{d=1}^{D} \frac{\hat{\mu}_{dk}^2}{2 \hat{\sigma}_d^2}$$

See Appendix A for a detailed derivation.
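The following sketch illustrates the GID as a linear classifier under the stated restriction that each neuron's noise variance is shared across classes. It assumes equal class priors, uses the pooled maximum likelihood variance, and takes weights equal to the class means divided by the variance; the function name `fit_gid` and the synthetic data are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def fit_gid(X, y, K):
    """Fit Gaussian independent decoder weights.

    X: (T, D) real-valued responses, y: (T,) labels in {0, ..., K-1}.
    Each neuron has one noise variance shared across all classes.
    """
    mu = np.stack([X[y == k].mean(axis=0) for k in range(K)], axis=1)  # (D, K)
    resid = X - mu[:, y].T                  # subtract each trial's class mean
    var = (resid ** 2).mean(axis=0)         # (D,) pooled variance per neuron
    W = mu / var[:, None]                   # weights: mean / variance
    b = -0.5 * (mu ** 2 / var[:, None]).sum(axis=0)
    return W, b

# Synthetic Gaussian population (illustrative only).
rng = np.random.default_rng(0)
K, D, n_per_class = 4, 20, 40
means = rng.normal(0.0, 2.0, size=(D, K))
y = np.repeat(np.arange(K), n_per_class)
X = means[:, y].T + rng.standard_normal((len(y), D))
W, b = fit_gid(X, y, K)
y_hat = np.argmax(X @ W + b, axis=1)
accuracy = np.mean(y_hat == y)
```

Because the quadratic term in the Gaussian log density does not depend on the class when the variance is shared, the argmax of these linear scores coincides with the argmax of the naïve Bayes log posterior.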

### 3.2 Correlation-aware decoders

Here we review three previously-proposed decoders that take into account the structure of neural correlations when determining a classification boundary, and are therefore “correlation-aware.” Unlike the two naïve Bayes decoders described above, which resulted from applying Bayes’ rule to an encoding model, these directly model the posterior probability over stimuli given a vector of neural activity. All three decoders are multiclass linear classifiers, but they are trained with different loss functions and regularization methods.

#### 3.2.1 Multinomial logistic regression with an elastic-net penalty (GLM-NET)

The multinomial logistic regression model is a generalization of binary logistic regression to the multiple-class setting. It assumes that the log probability of the stimulus given the response is an affine function (that is, a linear transform plus a constant) of the neural response vector (Bishop, 2006). The conditional probability of the stimulus belonging to class *k* given the neural response vector **x** can be written:
$$P(y = k \mid \mathbf{x}) = \frac{1}{Z} \exp\left( \mathbf{w}_k^\top \mathbf{x} + b_k \right), \qquad Z = \sum_{j=1}^{K} \exp\left( \mathbf{w}_j^\top \mathbf{x} + b_j \right) \tag{15}$$
where **w**_{k} is a vector of the decoding weights for class *k*, *b*_{k} is the constant offset for class *k*, and *Z* is the normalizing constant.

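The model parameters consist of the weights *W* = (**w**_{1}, …, **w**_{K}) and offsets **b** = (*b*_{1}, …, *b*_{K})^{⊤}, and can be fit by maximum likelihood. The log-likelihood function given the dataset 𝒟 = {(**x**_{t}, *y*_{t})}_{t=1}^{T} can be written:
$$\log P(\mathcal{D} \mid W, \mathbf{b}) = \sum_{t=1}^{T} \left[ \mathbf{y}_t^\top \left( W^\top \mathbf{x}_t + \mathbf{b} \right) - \log\left( \mathbf{1}^\top \exp\left( W^\top \mathbf{x}_t + \mathbf{b} \right) \right) \right] \tag{16}$$
where **y**_{t} is the one-hot vector representation of the stimulus class *y*_{t} ∈ {1, …, *K*} on trial *t* (that is, a vector of all zeros except for a one in the entry corresponding to the stimulus class), the exponential is applied elementwise, and **1** is a length-*K* vector of ones.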

The maximum-likelihood estimator (MLE) tends to perform poorly in settings with limited amounts of data, and may not exist for small datasets. In fact, the MLE is not defined when the number of trials *T* is smaller than the number of identifiable parameters in the weight matrix *W*, i.e. when *T* < *D*(*K* − 1). Even in settings where the MLE does exist, it may overfit, yielding poor generalization performance.
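As a quick check of this condition against the dataset sizes quoted in Section 2, the sketch below compares *T* with *D*(*K* − 1). Pairing the smallest *D* with the largest *T* for the monkey and mouse recordings is an assumption made here to give the most favorable case for the MLE; the dictionary and helper names are illustrative.

```python
# (D, K, T) triples taken from the dataset descriptions in Section 2.
# For the monkey and mouse recordings we pair the smallest D with the
# largest T, an assumption that gives the most favorable case for the MLE.
datasets = {
    "monkey": (68, 72, 3600),
    "ferret": (784, 180, 1991),
    "mouse": (11311, 180, 4469),
}

def mle_is_undefined(D, K, T):
    """True when the trial count is below the identifiable parameter count."""
    return T < D * (K - 1)

results = {}
for name, (D, K, T) in datasets.items():
    results[name] = mle_is_undefined(D, K, T)
    print(name, "parameters:", D * (K - 1), "trials:", T, "undefined:", results[name])
```

Under these numbers, every dataset falls in the regime where the unregularized MLE is not defined, which motivates the regularization strategies discussed next.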

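A popular solution to this problem is to regularize the MLE with the elastic-net penalty, which combines ℓ_{1} (“lasso”) and ℓ_{2} (“ridge”) penalties to induce parameter sparsity and shrinkage (Friedman et al., 2010). The elastic-net estimator is obtained by maximizing the log-likelihood minus the regularization penalty:
$$\hat{W}, \hat{\mathbf{b}} = \arg\max_{W, \mathbf{b}} \; \log P(\mathcal{D} \mid W, \mathbf{b}) - \gamma_1 \left[ \gamma_2 \|W\|_1 + \tfrac{1}{2} (1 - \gamma_2) \|W\|_2^2 \right]$$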

Here, *γ*_{1} is a hyperparameter determining the strength of the regularization, and *γ*_{2} is a hyperparameter that controls the balance between the ℓ_{1} penalty, which encourages *W* to be sparse, and the ℓ_{2} penalty, which encourages *W* to have a small magnitude. For our decoding tasks, we found that including the ℓ_{1} penalty always diminished cross-validated performance, so we fixed *γ*_{2} = 0. We then set *γ*_{1} using a five-step logarithmic grid search from 10^{−4} to 10, evaluated with three-fold cross validation.

#### 3.2.2 The Empirical Linear Decoder (ELD)

The Empirical Linear Decoder (ELD), introduced by Graf et al. (2011), is similar to multinomial logistic regression in that it models the log probability of the stimulus class as an affine function of the neural response vector (eq. 15). However, instead of using standard likelihood-based methods to fit the model, the authors constructed an inference method based on support vector machines (SVMs).

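Their key observation was that the log-likelihood ratio for adjacent stimulus classes is an affine function of the response vector, with weights given by the difference of the two classes’ decoding weight vectors. For example, for stimulus classes one and two we have:
$$\log \frac{P(y = 2 \mid \mathbf{x})}{P(y = 1 \mid \mathbf{x})} = (\mathbf{w}_2 - \mathbf{w}_1)^\top \mathbf{x} + (b_2 - b_1) = \mathbf{v}_2^\top \mathbf{x} + c_2$$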

Here, we have defined **v**_{2} to be the difference vector (**w**_{2} − **w**_{1}) and *c*_{2} to be the difference scalar *b*_{2} − *b*_{1}.

We see that discriminating class two from class one under the multinomial logistic regression model is equivalent to solving a linear binary classification task with weights **v**_{2} and offset *c*_{2}. The authors proposed estimating **v**_{2} and *c*_{2} using an SVM trained on the data from classes one and two. They then used the same approach to estimate the weights for all subsequent pairs of adjacent classes. That is, they estimated the difference weights **v**_{k} and offsets *c*_{k} using an SVM trained on data from classes *k* − 1 and *k*, for *k* = 2, …, *K*.

To recover the weights of the multinomial logistic regression model from the SVM weights, Graf et al. (2011) used the recursions:
$$\mathbf{w}_k = \mathbf{w}_{k-1} + \alpha_k \mathbf{v}_k$$
and
$$b_k = b_{k-1} + \alpha_k c_k$$
for *k* = 2, …, *K*. Here the weights for class one can be set to zero without loss of generality. The constant *α*_{k}, which scales the contribution of the SVM weights **v**_{k} and *c*_{k}, is necessary because SVMs only recover **v**_{k} and *c*_{k} up to a multiplicative constant. We were unable to determine how the authors set these scaling constants, so we fit them by maximizing the log likelihood of the data under the multinomial logistic regression model (eq. 16).
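The recursion from pairwise difference weights to per-class weights amounts to a scaled cumulative sum. The sketch below assumes the difference vectors, offsets, and scaling constants are already available (the SVM training and the fitting of the scaling constants are omitted); the function name `eld_weights` and the array layout are choices made for this example.

```python
import numpy as np

def eld_weights(V, c, alpha):
    """Accumulate pairwise solutions into per-class weights.

    V: (K-1, D) difference vectors v_2..v_K, c: (K-1,) offsets c_2..c_K,
    alpha: (K-1,) scaling constants. Class one is fixed at zero.
    """
    D = V.shape[1]
    W = np.vstack([np.zeros((1, D)), np.cumsum(alpha[:, None] * V, axis=0)])
    b = np.concatenate([[0.0], np.cumsum(alpha * c)])
    return W.T, b                        # (D, K) weights, (K,) intercepts

# Round-trip check: differences of known weights are accumulated back.
rng = np.random.default_rng(0)
K, D = 5, 3
W_true = np.concatenate([np.zeros((D, 1)), rng.standard_normal((D, K - 1))], axis=1)
b_true = np.concatenate([[0.0], rng.standard_normal(K - 1)])
V = np.diff(W_true, axis=1).T            # rows are w_k - w_{k-1}
c = np.diff(b_true)
W_rec, b_rec = eld_weights(V, c, alpha=np.ones(K - 1))
```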

#### 3.2.3 The Super Neuron Decoder (SND)

The Super Neuron Decoder (SND), introduced by Stringer et al. (2021), is a third approach for training a linear classifier on multi-class data. It optimizes a set of decoding weights using penalized least-squares regression and a set of nonlinear super-neuron response functions. The super-neuron response functions encode the tuning curves of a population of narrowly-selective downstream “super-neurons”, containing one super-neuron for each stimulus orientation. Each super-neuron responds maximally to a single orientation, making the population response on each trial a narrow bump of activity centered on the correct stimulus.

Formally, the SND seeks a matrix of weights *W* that maps the response vector **x** to a target vector **h** ∈ ℝ^{K} on each trial. The target vector **h** contains the responses of the super-neuron population. The super-neurons have tuning curves parameterized by the von Mises probability density function, which is appropriate since the stimulus variable is periodic.

The *i*-th super-neuron has a preferred orientation of , so its tuning curve is given by:

The target vector for the *k*-th stimulus class is therefore
where is the stimulus angle associated with stimulus class *k* ∈{1, …, *K*}.

Stringer et al. (2021) trained the model weights *W* by linearly regressing the observed neural responses onto the target vectors. To penalize large weight values, they included an ℓ_{2} (“ridge”) regularization:

The term is the squared error between the correct target vector and the output of the linear decoding weights *W* ^{⊤}**x**_{t} on trial *t*. The term is the squared ℓ_{2} penalty on the decoding weights with regularization strength *γ*, which the authors fixed at *γ* = 1.0. Intuitively, this training procedure seeks weights *W* that make the linearly transformed population response *W* ^{⊤}**x** match the super-neuron population response **h** as closely as possible in a least-squares sense.

The decoding rule chooses the class label corresponding to the maximum of the linearly weighted responses (the same decoding rule as in the other decoders we have considered):
$$\hat{y} = \arg\max_{k \in \{1, \dots, K\}} \left( W^\top \mathbf{x} \right)_k$$

Here (*W* ^{⊤}**x**)_{k} is the *k*-th element of the transformed response vector *W* ^{⊤}**x**. In other words, the predicted stimulus value is the preferred orientation of the maximally responding super-neuron.
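A rough sketch of the SND training and decoding steps is given below. It assumes a sum-of-squares ridge objective with a closed-form solution and von Mises shaped targets normalized to a peak of one; the `concentration` value, the evenly spaced class orientations, and the synthetic population are placeholders chosen for illustration, not the settings used by Stringer et al. (2021).

```python
import numpy as np

def fit_snd(X, y, K, gamma=1.0, concentration=10.0):
    """Ridge regression from responses X (T, D) onto super-neuron targets.

    `concentration` sets the width of the von Mises bump; its value here is
    a placeholder, not the setting used by Stringer et al. (2021).
    """
    theta = 2.0 * np.pi * np.arange(K) / K               # class orientations
    # bumps[k, i]: response of super-neuron i to stimulus class k (peak 1).
    bumps = np.exp(concentration * (np.cos(theta[:, None] - theta[None, :]) - 1.0))
    H = bumps[y]                                          # (T, K) targets
    D = X.shape[1]
    # Closed-form ridge solution: (X^T X + gamma I)^{-1} X^T H.
    return np.linalg.solve(X.T @ X + gamma * np.eye(D), X.T @ H)

def predict_snd(X, W):
    """Pick the super-neuron with the largest linearly decoded response."""
    return np.argmax(X @ W, axis=1)

# Synthetic population with smooth orientation tuning (illustrative only).
rng = np.random.default_rng(0)
K, D, n_per_class = 8, 40, 30
theta = 2.0 * np.pi * np.arange(K) / K
pref = rng.uniform(0.0, 2.0 * np.pi, size=D)
tuning = 20.0 * np.exp(3.0 * (np.cos(theta[None, :] - pref[:, None]) - 1.0))  # (D, K)
y = np.repeat(np.arange(K), n_per_class)
X = tuning[:, y].T + rng.standard_normal((len(y), D))
W = fit_snd(X, y, K)
accuracy = np.mean(predict_snd(X, W) == y)
```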

## 4 Proposed methods: GP-regularized decoders

In this section, we first introduce two correlation-blind decoders regularized with Gaussian process (GP) priors: the GP-regularized Gaussian Independent Decoder (GPGID) and the GP-regularized Poisson Independent Decoder (GPPID). Like the GID and PID, these decoders use independent Gaussian and Poisson encoding models, but they also add GP priors to induce smoothness in the neural tuning curve estimates. Next, we introduce a correlation-aware decoder, the Gaussian Process Multiclass Decoder (GPMD), which adds a GP prior to multinomial logistic regression for the same purpose.

### 4.1 The GP-regularized Poisson Independent Decoder (GPPID)

When doing inference in the PID decoder (section 3.1.1), it is necessary to estimate each *λ*_{dk}, the mean spike count for the *d*-th neuron and the *k*-th stimulus. The mean spike counts for every stimulus form a tuning curve for each neuron *λ*_{d} = (*λ*_{d1}, …, *λ*_{dK})^{T}. The maximum likelihood estimator for each entry in the tuning curve is simply the empirical mean of the spike counts for the *d*-th neuron under the *k*-th stimulus. However, the empirical mean estimates are noisy, especially when the number of trials for each stimulus is small, which can limit the PID decoder’s performance.

In principle, we could compensate for the noise by recording more trials for each stimulus, but this is expensive, particularly if the stimulus grid has a fine resolution. Instead, we propose to reduce error in the tuning curve estimates by exploiting our prior knowledge that tuning curves tend to be smooth with respect to orientation. We incorporate this knowledge into the model by placing an independent Gaussian process prior over the log tuning curve of each neuron (Park et al., 2014; Rad & Paninski, 2010).

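The resulting GP-regularized PID model is given by
$$\log \boldsymbol{\lambda}_d \sim \mathcal{GP}\left( \mathbf{0}, \kappa \right), \qquad x_d \mid (y = k) \sim \mathrm{Poiss}\left( \boldsymbol{\lambda}_d[k] \right) \tag{25}$$
where *x*_{d} is the spike response of neuron *d*, and *λ*_{d}[*k*] is the *k*-th element of the tuning curve *λ*_{d}. Here, the log tuning curve has a Gaussian Process prior with zero mean, and a covariance function *κ*(⋅, ⋅) with hyperparameters *θ*_{d}.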

We choose *κ* to be the radial basis function (RBF) or “Gaussian” covariance function:
$$\kappa(j, k) = \rho_d \exp\left( -\frac{d(j, k)^2}{2 \ell_d^2} \right) \tag{27}$$
where *d*(*j, k*) denotes the distance between stimulus classes *j* and *k*, and the hyperparameters *θ*_{d} = {*ρ*_{d}, ℓ_{d}} are the marginal variance *ρ*_{d} and length scale ℓ_{d} of the log tuning curve for neuron *d*. Because our stimuli lie on a circle with circumference *K*, the number of classes, we choose *d* to be the circular distance function
$$d(j, k) = \min\left( |j - k|, \; K - |j - k| \right)$$

We will write the Gaussian process kernel matrix using the overloaded notation *κ*_{θ}(**u, v**), where **u** and **v** are arbitrary vectors in ℝ^{n} and ℝ^{m}, respectively. This notation denotes an *n* × *m* kernel matrix, with elements given by *κ*_{θ}(**u, v**)_{ij} = *κ*_{θ}(*u*_{i}, *v*_{j}). An important instance of the kernel matrix is the kernel matrix of the tuning curve. Because of the kernel function’s length scale parameter, the spacing between adjacent stimulus values can be rescaled arbitrarily, so we define the tuning curve kernel matrix to be *κ*_{θ}(**r, r**), where **r** = (1, …, *K*)^{⊤}.
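To make the tuning curve kernel matrix concrete, the sketch below builds it on the grid of class labels. It assumes the RBF form in which the marginal variance multiplies the exponential and the squared circular distance is divided by twice the squared length scale; the function name `circular_rbf_kernel` and the hyperparameter values are illustrative.

```python
import numpy as np

def circular_rbf_kernel(K, rho, length_scale):
    """Tuning curve kernel matrix on the class grid r = (1, ..., K).

    Assumed form: rho * exp(-d(j, k)^2 / (2 * length_scale^2)), where d is
    the circular distance between class indices.
    """
    r = np.arange(1, K + 1)
    diff = np.abs(r[:, None] - r[None, :])
    dist = np.minimum(diff, K - diff)        # wrap around the circle
    return rho * np.exp(-dist ** 2 / (2.0 * length_scale ** 2))

K = 12
Kmat = circular_rbf_kernel(K, rho=2.0, length_scale=1.5)
```

The wrap-around makes the first and last classes nearest neighbors, so every row of the matrix is a circular shift of the first row.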

Note that the neuron-specific hyperparameters *θ*_{d} permit tuning curves to differ in amplitude and smoothness, so different neurons can be regularized differently. For neurons with non-existent tuning or exceptionally noisy responses, the inference procedure will set the amplitude *ρ*_{d} to zero or the length scale ℓ_{d} to infinity, making the tuning curve flat (see fig. 5). Such neurons are effectively pruned from the dataset, since flat tuning curves make no contribution to decoding. This effect is known as automatic relevance determination (MacKay, 1992; Neal, 1996), and it eliminates the need to manually filter out noisy or untuned neurons. Automatic preprocessing in this manner is critical when working with large datasets.

To fit the GPPID model to spike count data, we employ a two-step procedure known as empirical Bayes (Bishop, 2006). For each neuron, we first compute a point estimate of the tuning curve hyperparameters by maximizing the model evidence, and then find the *maximum-a-posteriori* (MAP) estimate of the tuning curve using the previously estimated hyperparameters. The model evidence for neuron *d* is the marginal probability of **X**_{∗d}, the *d*-th column of the spike count matrix **X** = (**x**_{1}, …, **x**_{T})^{⊤}, given the hyperparameters:
$$P(\mathbf{X}_{*d} \mid \theta_d) = \int P(\mathbf{X}_{*d} \mid \boldsymbol{\lambda}_d)\, P(\boldsymbol{\lambda}_d \mid \theta_d)\, d\boldsymbol{\lambda}_d$$

This integral is intractable, so we approximate it using Laplace’s method (Bishop, 2006). We define *h*(*λ*_{d}) = log *P*(**X**_{∗d} ∣ *λ*_{d}) + log *P*(*λ*_{d} ∣ *θ*_{d}) to be the sum of log-likelihood and log-prior, and *H* to be the Hessian matrix of *h* evaluated at its maximizer, the MAP estimate *λ̂*_{d}. The integral’s approximate value is then given by:
$$P(\mathbf{X}_{*d} \mid \theta_d) \approx \exp\left( h(\hat{\boldsymbol{\lambda}}_d) \right) \frac{(2\pi)^{K/2}}{|-H|^{1/2}}$$

We compute point estimates of the model hyperparameters by optimizing this approximation with respect to *θ*_{d} using the Nelder-Mead algorithm (Nelder & Mead, 1965).

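Once we have estimated the hyperparameters for neuron *d*, we estimate the neuron’s tuning curve by computing the MAP estimate of *λ*_{d} under the model given by equation 25:
$$\hat{\boldsymbol{\lambda}}_d = \arg\max_{\boldsymbol{\lambda}_d} \; \sum_{t=1}^{T} \log \mathrm{Poiss}\left( x_t \mid \boldsymbol{\lambda}_d[y_t] \right) + \log P\left( \boldsymbol{\lambda}_d \mid \hat{\theta}_d \right)$$
where Poiss(*x*_{t} ∣ *λ*_{d}[*y*_{t}]) is the probability of spike count *x*_{t} given the firing rate *λ*_{d}[*y*_{t}], under a Poisson distribution (eq. 2), and *P*(*λ*_{d} ∣ *θ̂*_{d}) is the GP prior evaluated at the estimated hyperparameters.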

To accelerate the optimization procedure, we use a Fourier-domain representation of the covariance function based on the Karhunen-Loève expansion. (See Appendix B for details.) Since the procedure described above can be performed independently for each neuron, fitting the GPPID model is fully parallelizable across neurons.
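The idea behind a Fourier-domain representation can be illustrated with a small numerical check: a kernel matrix built from circular distances on an evenly spaced grid is circulant, so the discrete Fourier transform diagonalizes it and only a few low frequencies carry appreciable prior variance. The sketch below is an illustration under assumed hyperparameters and an assumed truncation threshold; it is not the procedure of Appendix B.

```python
import numpy as np

K, rho, ell = 180, 1.0, 10.0                      # illustrative values
r = np.arange(K)
diff = np.abs(r[:, None] - r[None, :])
dist = np.minimum(diff, K - diff)                 # circular distance
C = rho * np.exp(-dist ** 2 / (2.0 * ell ** 2))   # (K, K) circulant kernel matrix

# Unitary DFT matrix; for a circulant matrix, the DFT of its first row
# gives the eigenvalues.
Psi = np.fft.fft(np.eye(K), norm="ortho")
spectrum = np.fft.fft(C[0]).real
C_rebuilt = (Psi.conj().T * spectrum) @ Psi       # Psi^H diag(spectrum) Psi

# Keep only frequencies that carry non-negligible prior variance
# (the 1e-8 threshold is an arbitrary choice for this example).
keep = spectrum > 1e-8 * spectrum.max()
C_lowrank = (Psi.conj().T[:, keep] * spectrum[keep]) @ Psi[keep, :]
M = int(keep.sum())
```

Because the spectrum of a smooth kernel decays quickly, `M` is much smaller than `K`, and the tuning curve can be parameterized by `M` independent Fourier coefficients instead of `K` correlated values.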

### 4.2 The GP-regularized Gaussian Independent Decoder (GPGID)

To perform inference in the GID (section 3.1.2), it is necessary to estimate the class-conditional mean activity *μ*_{dk} and the variance *σ*_{d}^{2} for each neuron. For the *d*-th neuron, the mean activities form a tuning curve, the vector *μ*_{d} = (*μ*_{d1}, …, *μ*_{dK})^{T}. We estimate the tuning curve using the same approach as in the GPPID (section 4.1), but with a Gaussian likelihood instead of a Poisson likelihood. The model is as follows:
$$\boldsymbol{\mu}_d \sim \mathcal{GP}\left( \mathbf{0}, \kappa \right), \qquad x_d \mid (y = k) \sim \mathcal{N}\left( \boldsymbol{\mu}_d[k], \sigma_d^2 \right)$$

Note that each neuron has three hyperparameters: the GP prior’s marginal variance *ρ*_{d} and length scale ℓ_{d}, and the likelihood’s observation noise variance *σ*_{d}^{2}. As in the GID’s model (section 3.1.2), we assume that the observation noise variance is constant over classes. This restriction ensures that the classification boundary is linear (see Appendix A).

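To fit the model, we use empirical Bayes, as we did for inference in the GPPID. The first step is to compute a point estimate of each neuron’s hyperparameters by maximizing the model evidence. Before proceeding, we need to introduce some notation: the vector *μ*_{d}[**y**] is the vector whose *t*-th element is given by *μ*_{d}[*y*_{t}]. Intuitively, this vector contains the values from the *d*-th neuron’s tuning curve corresponding to each stimulus in the dataset. With this notation, the evidence maximization can be written as:
$$\hat{\theta}_d, \hat{\sigma}_d^2 = \arg\max_{\theta_d, \sigma_d^2} \; \log \int \mathcal{N}\left( \mathbf{X}_{*d} \mid \boldsymbol{\mu}_d[\mathbf{y}], \sigma_d^2 I \right) \mathcal{N}\left( \boldsymbol{\mu}_d \mid \mathbf{0}, \kappa_{\theta_d}(\mathbf{r}, \mathbf{r}) \right) d\boldsymbol{\mu}_d$$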

Since the objective function can be expressed analytically, we perform the maximization using a trust-region Newton method.

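Then, for each neuron, we compute the MAP estimate of the tuning curve:
$$\hat{\boldsymbol{\mu}}_d = \arg\max_{\boldsymbol{\mu}_d} \; \log \mathcal{N}\left( \mathbf{X}_{*d} \mid \boldsymbol{\mu}_d[\mathbf{y}], \hat{\sigma}_d^2 I \right) + \log \mathcal{N}\left( \boldsymbol{\mu}_d \mid \mathbf{0}, \kappa_{\hat{\theta}_d}(\mathbf{r}, \mathbf{r}) \right)$$

The solution to this problem can be expressed analytically (Rasmussen & Williams, 2006):
$$\hat{\boldsymbol{\mu}}_d = \kappa_{\hat{\theta}_d}(\mathbf{r}, \mathbf{y}) \left[ \kappa_{\hat{\theta}_d}(\mathbf{y}, \mathbf{y}) + \hat{\sigma}_d^2 I \right]^{-1} \mathbf{X}_{*d}$$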

However, for scalability, we use an equivalent procedure leveraging the same spectral weight representation as in the GPPID. (See Appendix B for details.)
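For small datasets, the analytic MAP estimate can be computed directly with the standard GP regression posterior mean. The sketch below uses that textbook formula with a circular RBF kernel and fixed, hand-picked hyperparameters (in the method described above they would be set by evidence maximization); the function names and the synthetic neuron are assumptions for illustration, and this is not the spectral procedure used for scalability.

```python
import numpy as np

def circular_rbf(a, b, K, rho, ell):
    """Kernel matrix between class-index vectors a and b on a circle of size K."""
    diff = np.abs(a[:, None] - b[None, :])
    dist = np.minimum(diff, K - diff)
    return rho * np.exp(-dist ** 2 / (2.0 * ell ** 2))

def gp_tuning_curve(x, y, K, rho, ell, noise_var):
    """Posterior mean tuning curve for one neuron.

    x: (T,) responses, y: (T,) class labels in {0, ..., K-1}.
    """
    r = np.arange(K)
    K_ry = circular_rbf(r, y, K, rho, ell)            # (K, T)
    K_yy = circular_rbf(y, y, K, rho, ell)            # (T, T)
    A = K_yy + noise_var * np.eye(len(y))
    return K_ry @ np.linalg.solve(A, x)

# Synthetic neuron with a smooth tuning curve and few trials per class.
rng = np.random.default_rng(1)
K, n_per_class, noise_var = 36, 3, 1.0
true_curve = 2.0 * np.cos(2.0 * np.pi * np.arange(K) / K)
y = np.repeat(np.arange(K), n_per_class)
x = true_curve[y] + np.sqrt(noise_var) * rng.standard_normal(len(y))
mu_gp = gp_tuning_curve(x, y, K, rho=2.0, ell=5.0, noise_var=noise_var)
mu_emp = np.array([x[y == k].mean() for k in range(K)])
mse_gp = np.mean((mu_gp - true_curve) ** 2)
mse_emp = np.mean((mu_emp - true_curve) ** 2)
```

In this toy setting, the smoothed estimate pools information across neighboring orientations and so tracks the true curve more closely than the per-class empirical means.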

### 4.3 The Gaussian Process Multiclass Decoder (GPMD)

In this section, we introduce the Gaussian Process Multiclass Decoder (GPMD), a multinomial logistic regression model with a GP prior placed over the weights for each neuron (see fig. 2). As in section 3.2.1, the multinomial logistic regression model can be written:
$$P(y = k \mid \mathbf{x}) = \frac{1}{Z} \exp\left( \mathbf{w}_k^\top \mathbf{x} + b_k \right), \qquad Z = \sum_{j=1}^{K} \exp\left( \mathbf{w}_j^\top \mathbf{x} + b_j \right) \tag{40}$$

where **w**_{k} is a vector containing the decoding weights for class *k*, *b*_{k} is the offset for class *k*, and *Z* is the normalizing constant.

We regularize the weight matrix by placing an independent zero-mean Gaussian Process prior on each of its rows:
$$W_{d*} \sim \mathcal{GP}\left( \mathbf{0}, \kappa \right), \qquad d = 1, \dots, D$$

Here *W*_{d∗} is the *d*-th row of *W*, which contains the decoding weights associated with neuron *d* across all stimuli, and *κ* is the RBF covariance function defined in equation 27, with neuron-specific hyperparameters *θ*_{d}.

In the GPPID and GPGID, we used a GP prior to formalize our prior knowledge that neurons’ tuning curves tend to be smooth. In the GPMD, we cannot apply that prior knowledge directly, since the GPMD’s weights have no direct interpretation in terms of tuning curves. Nonetheless, we can motivate our application of a GP prior with the following observation: since orientation is a continuous variable, the decoding weights ought to vary smoothly as a function of orientation.

Like previous decoders, the GPMD has neuron-specific hyperparameters, which allow different neurons to have decoding weights with different amplitudes and different amounts of smoothness. This flexibility has two benefits: first, it allows each neuron’s weights to adapt to the neuron’s response properties, and second, it automatically discards untuned or noisy neurons as described in section 4.1, eliminating the need for manual dataset preprocessing.

To fit the GPMD, we use variational inference to simultaneously learn both a posterior estimate for the weights {*W*, **b**} and point estimates for the prior hyperparameters. Specifically, given an approximate posterior family *q* (which we choose to be mean-field Gaussian) indexed by parameters *ϕ*, and a prior *p* with hyperparameters *θ*, we maximize the evidence lower bound (ELBO, see Blei et al. [2017] and Hoffman et al. [2013]) jointly with respect to *ϕ* and *θ*:
$$\hat{\phi}, \hat{\theta} = \arg\max_{\phi, \theta} \; \mathbb{E}_{q_\phi}\left[ \log P(\mathcal{D} \mid W, \mathbf{b}) \right] - \mathrm{KL}\left( q_\phi \,\|\, p_\theta \right)$$

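To calculate the likelihood term, we draw *M* = 3 samples from the variational posterior *q*_{ϕ}, and use these to compute a Monte Carlo approximation of the expectation:
$$\mathbb{E}_{q_\phi}\left[ \log P(\mathcal{D} \mid W, \mathbf{b}) \right] \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T} \log\left[ \frac{1}{Z} \exp\left( \mathbf{w}_{y_t}^{(m)\top} \mathbf{x}_t + b_{y_t}^{(m)} \right) \right]$$
where the superscript (*m*) denotes the *m*-th sample and *Z* is the normalizing constant defined in Eq. 40, evaluated at the same sample and trial. In principle, we could also calculate this approximation using a subset of the data, or “minibatch” (Hoffman et al., 2013), but our datasets are small enough that this is not necessary, and doing so increases the approximation variance.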
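The Monte Carlo estimate of the likelihood term can be sketched as follows. The example assumes a mean-field Gaussian posterior described by per-entry means and standard deviations, and shows only the forward computation; in practice the gradient with respect to the variational parameters would come from an automatic differentiation framework. All function and argument names are illustrative.

```python
import numpy as np

def log_softmax(scores):
    """Row-wise log of the softmax, computed stably."""
    scores = scores - scores.max(axis=1, keepdims=True)
    return scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

def mc_expected_loglik(X, y, W_mean, W_std, b_mean, b_std, rng, M=3):
    """Average the multinomial log-likelihood over M posterior samples.

    X: (T, D), y: (T,) labels in {0, ..., K-1},
    W_mean, W_std: (D, K), b_mean, b_std: (K,).
    """
    T = X.shape[0]
    total = 0.0
    for _ in range(M):
        # Reparameterized draw from the mean-field Gaussian posterior.
        W = W_mean + W_std * rng.standard_normal(W_mean.shape)
        b = b_mean + b_std * rng.standard_normal(b_mean.shape)
        logp = log_softmax(X @ W + b)               # (T, K)
        total += logp[np.arange(T), y].sum()        # log-prob of true classes
    return total / M

# Sanity check: zero weights with zero posterior spread give uniform
# predictions, so the estimate is exactly T * log(1 / K).
rng = np.random.default_rng(0)
T, D, K = 20, 5, 4
X = rng.standard_normal((T, D))
y = rng.integers(0, K, size=T)
zeros_W, zeros_b = np.zeros((D, K)), np.zeros(K)
uniform_value = mc_expected_loglik(X, y, zeros_W, zeros_W, zeros_b, zeros_b, rng)
noisy_value = mc_expected_loglik(X, y, zeros_W, np.ones((D, K)), zeros_b, zeros_b, rng)
```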

The KL-divergence term expands to a sum of KL divergences for each row of the weight matrix, since each row of the matrix is independent of the others. Assuming *q*_{1}, …, *q*_{K} are the mean-field variational distributions for each row of the weight matrix, and *p*_{1}, …, *p*_{K} are the associated GP priors, the KL-divergence term reduces to

Each of the summands is easy to calculate analytically, since *q*_{k} and *p*_{k} are both multivariate normal distributions. The approximate variational posterior *q*_{k} is a diagonal normal distribution, and the GP prior for each column *p*_{k} is a zero-mean multivariate normal distribution whose covariance is the GP kernel evaluated on the grid **r**, where, as in the GPPID model, **r** = (0, 1, …, *K*)^{⊤} defines the grid of class labels.
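For reference, the standard closed form for the KL divergence between a diagonal normal and a zero-mean multivariate normal is sketched below. The symbols $\boldsymbol{\mu}_k$ and $\mathbf{s}_k$ (posterior mean and standard deviations), $\Sigma_k$ (the GP prior covariance on the grid), and $n$ (the dimension of each distribution) are notation assumed here for illustration.

```latex
\mathrm{KL}\!\left( \mathcal{N}(\boldsymbol{\mu}_k, \operatorname{diag}(\mathbf{s}_k^2)) \,\|\, \mathcal{N}(\mathbf{0}, \Sigma_k) \right)
  = \frac{1}{2} \left[
      \operatorname{tr}\!\left( \Sigma_k^{-1} \operatorname{diag}(\mathbf{s}_k^2) \right)
      + \boldsymbol{\mu}_k^{\top} \Sigma_k^{-1} \boldsymbol{\mu}_k
      - n
      + \log \det \Sigma_k
      - \sum_{i=1}^{n} \log s_{ki}^2
    \right]
```

The inverse and log-determinant of $\Sigma_k$ are the expensive parts of this expression.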

To make predictions, we approximate MAP inference by using the mode of the posterior approximation *q* as a point estimate of the weights *W* and **b** in the multinomial logistic regression model (eq. 40). Since we are only interested in the effects of the prior, and not in using the posterior to assess prediction uncertainty, this approach suffices.

#### 4.3.1 Scaling GPMD inference

When maximizing the ELBO, there are two scaling concerns: the number of examples (*T*) and the number of neurons (*D*). It is well-known that the ELBO can be scaled to huge numbers of examples by estimating the likelihood term using a minibatch approximation (Hoffman et al., 2013). However, even when using a minibatch approximation, the KL-divergence term must be calculated at each gradient step, and it costs ∼ *DK*^{3} to evaluate (eq. 44). For large values of *D*, which are expected in high-dimensional neural datasets, the KL-divergence term evaluation makes stochastic gradient descent far too slow.

We solve this problem by representing *W* ^{⊤} using a basis Ψ ∈ ℂ^{K×M}, i.e. *W* ^{⊤} = Ψ*U*. Then, we place an independent normal prior on each entry of *U*. This allows the KL-divergence term to be evaluated with ∼ *DM* complexity, since it becomes a KL divergence between two diagonal normal distributions. The only difficulty is choosing Ψ such that *W* turns out to have the desired Gaussian process distribution when *U* is a standard normal.
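A minimal sketch of why the cost becomes linear: with a standard normal prior on each entry of *U* and a mean-field Gaussian posterior, the KL term is a sum of independent scalar terms. The function name and the real-valued treatment of *U* (which ignores the complex structure of the Fourier coefficients) are simplifying assumptions made for illustration.

```python
import math

def kl_diag_to_standard_normal(means, stds):
    """Sum of KL( N(mean, std^2) || N(0, 1) ) over every entry of U.

    `means` and `stds` are M x D nested lists. Each entry is visited once,
    so the cost is proportional to D * M.
    """
    total = 0.0
    for mean_row, std_row in zip(means, stds):
        for mu, sd in zip(mean_row, std_row):
            var = sd * sd
            total += 0.5 * (var + mu * mu - 1.0 - math.log(var))
    return total

# Posterior equal to the prior gives zero divergence.
kl_zero = kl_diag_to_standard_normal([[0.0, 0.0]], [[1.0, 1.0]])
# Shifting one mean by 1 adds 0.5 * 1^2 to the divergence.
kl_shift = kl_diag_to_standard_normal([[1.0, 0.0]], [[1.0, 1.0]])
```

No matrix inverse or determinant appears, in contrast to the dense GP prior on the weights themselves.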

It can be shown that the appropriate choice of Ψ is the unitary Fourier basis (see Appendix B). With this basis, the entry *U*_{id}, the element in the *d*-th column of *U* corresponding to Fourier frequency *i*, must satisfy two conditions. The first condition is conjugate symmetry (the coefficient at frequency −*i* must be the complex conjugate of the coefficient at frequency *i*), which ensures that *W* is real. The second condition is on the distribution, which must be a zero-mean complex normal whose variance is set by the GP kernel, as derived in Appendix B.

This spectral formulation assumes the stimuli lie on [0, 2*π*], but it can be trivially extended to any domain.
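The conjugate-symmetry requirement can be checked numerically with a short sketch. It assumes a square basis (*M* = *K*) on an evenly spaced grid over [0, 2*π*), an odd *K* so that every nonzero frequency has a partner, and an illustrative spectrum `spectral_variance` that is hypothetical rather than derived from the kernel used in this paper.

```python
import cmath
import math
import random

K = 9  # odd number of grid points, so every nonzero frequency has a partner

# Signed integer frequencies 0, 1, ..., 4, -4, ..., -1.
freqs = [i if i <= K // 2 else i - K for i in range(K)]

# Unitary Fourier basis evaluated on the grid 2*pi*k/K.
basis = [[cmath.exp(2j * math.pi * f * k / K) / math.sqrt(K) for f in freqs]
         for k in range(K)]

def spectral_variance(f):
    """Illustrative smooth spectrum (an assumption, not the paper's kernel)."""
    return math.exp(-0.5 * f * f)

def sample_coefficients(rng):
    """Zero-mean complex normal draws obeying u[-f] = conj(u[f])."""
    coeffs = {}
    for f in freqs:
        if f == 0:
            # The zero-frequency coefficient must be real.
            coeffs[f] = complex(rng.gauss(0.0, math.sqrt(spectral_variance(f))), 0.0)
        elif f > 0:
            sd = math.sqrt(spectral_variance(f) / 2.0)
            coeffs[f] = complex(rng.gauss(0.0, sd), rng.gauss(0.0, sd))
            coeffs[-f] = coeffs[f].conjugate()
    return [coeffs[f] for f in freqs]

u = sample_coefficients(random.Random(0))

# One column of decoding weights, obtained by applying the basis to u.
weights = [sum(basis[k][j] * u[j] for j in range(K)) for k in range(K)]
max_imag = max(abs(w.imag) for w in weights)
```

In this sketch the imaginary parts of `weights` vanish up to floating-point error, so the sampled weights can be treated as real.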

## 5 Results

### 5.1 Evaluation and performance

We benchmarked each decoder by calculating its mean absolute test error on the monkey (Graf et al., 2011), ferret (this paper), and mouse (Stringer et al., 2021) datasets, using five-fold cross-validation repeated ten times (fig. 3). We examined five monkey datasets, one ferret dataset, and three mouse datasets. Figure 3A reports the average scores for each animal; separate scores are reported in supplementary figure 9. Note that the GID decoder has two variants: the standard formulation, which has a quadratic decision boundary, and the formulation described in section 3.1.2, which has a linear decision boundary.

The rank ordering of the models remained largely consistent across datasets. In general, the correlation-blind decoders (the PID and GID) performed worse than the correlation-aware decoders, which is consistent with previous decoding studies (Graf et al., 2011; Stringer et al., 2021). Their regularized variants (the GPPID and GPGID) performed better, but still did not match the performance of the best correlation-aware decoders.

The GPMD set or matched state-of-the-art performance on all datasets. An important advantage of its Bayesian approach is that hyperparameters are learned automatically, which allowed the GPMD to adapt to the conditions present in different datasets. By contrast, models that set hyperparameters manually exhibited occasional poor performance—for example, the SND’s scores on the monkey datasets.

These results could have been obscured by the choice of error metric. For example, repeating the same benchmark using “proportion correct” instead of mean absolute error improved the performance of the ELD substantially (see supplementary figure 8), qualitatively replicating the results of Graf et al. (2011). To ensure that our results were not artifacts of the error metric, we used the empirical error cumulative distribution function to characterize each decoder’s errors in more detail (fig. 3B). Good decoders should classify higher fractions of the dataset at lower errors, producing curves that lie up and to the left. We found that the GPMD outperformed or matched all the other decoders on all the datasets (see supplementary figure 10).
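The relationship between the two error metrics and the empirical error CDF can be sketched as follows. The error values are made up for illustration, and the function name is hypothetical.

```python
def empirical_error_cdf(abs_errors, thresholds):
    """Fraction of test examples whose absolute error is <= each threshold."""
    n = len(abs_errors)
    return [sum(e <= t for e in abs_errors) / n for t in thresholds]

errors_deg = [0.0, 5.0, 5.0, 10.0, 90.0]  # made-up absolute errors in degrees
cdf = empirical_error_cdf(errors_deg, thresholds=[0.0, 5.0, 10.0, 180.0])
mean_abs_error = sum(errors_deg) / len(errors_deg)
proportion_correct = cdf[0]  # value of the curve at zero error
```

In this toy example a single large error dominates the mean absolute error while leaving the proportion correct unchanged, which is why the full curve is more informative than either summary alone.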

Our results show that both regularization and exploiting correlations improved decoding performance substantially. The regularized correlation-blind decoders, the GPPID and GPGID, outperformed their unregularized analogues, the GID and PID. The GLMNET decoder, which is correlation aware, outperformed the correlation-blind GPPID and GPGID. Finally, the SND and GPMD, which are both regularized and correlation-aware, outperformed all other decoders.

The relative impact of these strategies depended on dataset dimensionality. For small datasets, such as the monkey dataset with ∼150 neurons, both regularization and exploiting correlations had a substantial effect. For example, adding regularization to the GID (using the GPGID) decreased its mean absolute error by 16.3 degrees, and exploiting correlations (using the GPMD) decreased error by another 9.4 degrees. For high-dimensional datasets where it was easy to overfit, such as the mouse dataset, which was recorded from ∼20,000 neurons, regularization became the most important strategy. In fact, on the mouse dataset, the regularized correlation-blind GPGID did just as well as some of the correlation-aware decoders.

To characterize the GPMD’s performance and training times with respect to dataset size and dimensionality, we performed ablation studies on both the number of training examples and the number of neural features (fig. 4).

The GPMD performed well at all training set sizes (fig. 4A), implying that its inductive biases were well-calibrated: strong enough to permit good performance with few training examples, but flexible enough to allow continued learning with many training examples. We believe the good calibration is due to the flexibility of the GP prior, which learns the structure present in the neural dataset. Models with stronger inductive biases, such as the GPGID, which assumes independence, or the SND, which has many hard-coded parameters, had difficulty learning from increasing numbers of training examples.

The GPMD also performed well with any number of neural features (fig. 4C). Linear decoders with no or poor regularization, such as the GID and ELD, did not exhibit this property; in fact, their performance became worse as the number of neural features increased from the “classical” to the “interpolating” regime, producing a phenomenon known as the “double descent” error curve (Belkin et al., 2019). Properly regularized models such as the GPGID and GPMD did not display this phenomenon and gave accurate performance estimates for all numbers of neural features.

Thanks to the GPMD’s approximate inference, GPU acceleration, and spectral weight representation, it trained quickly, producing fast cross-validated error estimates that exhibited favorable scaling with respect to both observations and neurons (figures 4B and 4D). For the largest dataset with 20,000 neurons, it took 131 ± 0.82 seconds to train (roughly 20 minutes of wall-clock time) for a ten-fold cross-validation estimate. By comparison, a performance-tuned GLMNET model took 618 ± 6.40 seconds to train (roughly 1 hour and 45 minutes of wall-clock time) for the same estimate. Given the training time trends shown in the training-set size ablation (fig. 4B) and neural feature ablation (fig. 4D) studies, we expect the GPMD to handle even larger datasets without difficulty.

Scaling to large datasets was further enhanced by the GPMD’s automatic dataset preprocessing. Decoding studies, such as Graf et al. (2011), often select only strongly tuned neurons for decoding, since noisy neurons make it easier for models to overfit. Manual selection rules have two disadvantages: first, they may ignore neurons that look noisy but actually carry information, and second, they can require prohibitive amounts of time if human input is needed (e.g., for choosing initialization points for nonlinear curve fitting).

The GPMD’s Bayesian formulation automatically discarded noise neurons by setting their prior amplitudes to zero (fig. 5), a phenomenon known as automatic relevance determination (MacKay, 1992; Neal, 1996). Examples of tuning curves from automatically discarded and automatically retained neurons are shown in figure 5. Some of the automatically retained neurons displayed the bimodal “Gaussian bump” structure commonly sought by manual selection rules. Others displayed more complicated tuning patterns that would likely be ignored by a manual selection rule.

Our implementation of the empirical linear decoder (ELD, see Graf et al. [2011]) replicated the original paper’s results only qualitatively, not quantitatively. Our implementation of the ELD did outperform the Poisson Independent Decoder (PID) when using the “proportion correct” error criterion, as in the original paper (see supplementary figure 8). However, it did not achieve the performance reported in the original paper. Because our implementation of the PID, a very simple decoder, also did not match the performance of the PID in Graf et al. (2011), we believe the discrepancy was caused by data preprocessing. We were not able to replicate the data preprocessing steps described in Graf et al. (2011) precisely, since the original code has been lost.

### 5.2 Scientific implications

Previous studies, such as Graf et al. (2011) and Stringer et al. (2021), have used correlation-blind and correlation-aware decoders to investigate the effects of correlations on decoding performance. In all cases, they have found that correlation-aware decoders outperform correlation-blind decoders. However, the performance difference could be due to the lack of regularization in the correlation-blind decoders. In the sample-poor data regimes typically studied, the ideal weights of correlation-blind decoders are often corrupted by substantial amounts of noise.

Our results show that the performance difference between correlation-blind and correlation-aware decoders is fundamental, and not just a result of regularization. The correlation-aware decoders consistently perform better than even the regularized correlation-blind decoders (fig. 3), though regularization does narrow the gap significantly.

To characterize the effects of correlations on decoding performance, we visualized the decoding separatrices given by a correlation-blind model, the GPPID, and a correlation-aware model, the GPMD, on the third monkey dataset (fig. 6A). To reduce the 147 neural dimensions to two dimensions for visualization, we first selected a model to visualize, and two classes *i* and *j*. Then, we formed a two-dimensional basis by orthogonalizing the decoding weight vectors *W*_{i∗} and *W*_{j∗}. Using this basis, we were able to plot the data for classes *i* and *j*; the source model’s separatrix, which lay in the basis span; and approximate separatrices from other models, which had to be projected onto the basis. We performed this procedure for both the GPMD and GPPID models, since each two-dimensional basis could exactly represent only the separatrix from its source weight matrix.
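The basis construction can be sketched as below. Gram-Schmidt is used here as one common way to orthogonalize two vectors; whether it matches the exact procedure used for figure 6 is an assumption, and the toy vectors and function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(w_i, w_j):
    """Gram-Schmidt on two weight vectors; returns two orthonormal vectors."""
    norm_i = math.sqrt(dot(w_i, w_i))
    e1 = [a / norm_i for a in w_i]
    # Remove the component of w_j that lies along e1, then normalize.
    proj = dot(w_j, e1)
    resid = [b - proj * a for a, b in zip(e1, w_j)]
    norm_r = math.sqrt(dot(resid, resid))
    e2 = [r / norm_r for r in resid]
    return e1, e2

def project(x, basis):
    """Coordinates of a population response x in the two-dimensional basis."""
    return tuple(dot(x, e) for e in basis)

w_i = [1.0, 2.0, 0.0, -1.0]  # toy weight vector for class i (4 neurons)
w_j = [0.5, 1.0, 1.0, 0.0]   # toy weight vector for class j
e1, e2 = orthonormal_basis(w_i, w_j)
point_2d = project([3.0, 1.0, 2.0, 0.0], (e1, e2))
```

Both weight vectors lie in the span of `e1` and `e2`, so the separatrix of the source model is represented exactly in this plane, whereas the weight vectors of any other model generally are not.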

We first wished to determine whether the data deviated significantly from the independent Poisson model assumed by the GPPID. To do this, we generated an uncorrelated dataset using a Poisson distribution that matched the data’s empirical mean (referred to in the figure as the “Poisson surrogate” dataset) and plotted both it and the real dataset in the same basis. Compared to the real data, the Poisson surrogate data exhibit much less variance and tilt relative to the basis vectors, showing that the correlations in the real data can significantly affect decoding. However, in the GPMD basis, the differences between the real and surrogate datasets are much less pronounced, implying that the GPMD’s weight matrix incorporates a linear transform that “decorrelates” the data somewhat.

Next, we plotted the class separatrices along with the data. The separatrix given by the GPPID successfully separated the Poisson surrogate data, but failed to separate the real dataset because of correlation-induced distortions. However, as expected, the GPMD separatrix successfully took the data’s correlations into account.

To visualize how the entire set of 72 classes related to each other, we plotted each class’s correlation ellipse on the basis given by the zero- and five-degree basis vectors from each model (fig. 6B). The GPPID’s basis did a poor job of separating the classes, but the GPMD’s basis separated them fairly well. In the GPMD’s basis, the ellipses from classes 180 degrees apart appear in nearly identical locations, confirming that the GPMD identified grating angles more precisely than grating drift direction, a phenomenon previously observed in our performance benchmarks (fig. 3C).

Finally, we used each decoder to form a spatial map of each neuron’s preferred decoding angle, calculated for the neuron *i* as the argmax of the weight matrix column *W*_{∗i}. We found that the GPMD decoder clustered the ferret neurons into direction-selective columns, but not the mouse neurons (fig. 7). This result is consistent with previous studies on direction selectivity in mouse and ferret visual cortex, which have discovered column structure in ferret visual cortex, but not in mouse (Rochefort et al., 2011).
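The preferred decoding angle computation amounts to a column-wise argmax, as in the following sketch. The toy weight matrix, the function name, and the 2-degree class spacing used to convert a class index into an angle are assumptions for illustration.

```python
def preferred_class(W, neuron):
    """Index of the class with the largest weight in column `neuron` of W (K x D)."""
    column = [row[neuron] for row in W]
    return max(range(len(column)), key=column.__getitem__)

W = [[0.1, 0.9],
     [0.7, 0.2],
     [0.3, 0.4]]        # K = 3 classes, D = 2 neurons (toy values)
bin_width_deg = 2.0     # assumed class spacing in degrees
preferred_deg = [preferred_class(W, d) * bin_width_deg for d in range(2)]
```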

## 6 Conclusion

Linear decoders are a natural way to characterize the information in neural populations. While all linear decoders share the same basic classification rule, they make differing assumptions about the neural population that affect both accuracy and parameter inference. Decoders with restrictive assumptions, like correlation-blind decoders based on independent generative models, generally have the worst performance. More accurate linear decoders (e.g., GLMNET) model dependencies across neurons, and the most accurate (e.g., the SND and GPMD) take into account—at least implicitly—dependencies across both neurons and stimuli.

In this paper, we present a suite of new decoders which share a common regularization strategy. The correlation-aware decoder in the suite, the GPMD, explicitly models correlations across neurons and stimuli. We find that it matches or outperforms all other decoders on three real-world datasets, from monkey, ferret, and mouse. Furthermore, it scales to the very largest datasets using a combination of approximate Bayesian inference, spectral methods, and GPU acceleration.

We investigated the effect of neural correlations on decoding by comparing the performance of the regularized correlation-aware and correlation-blind decoders. We found that the performance gap between correlation-aware and correlation-blind decoders is fundamental—that is, it is not an artifact of the sophisticated regularization schemes commonly used by correlation-aware decoders. This confirmed the results of previous studies. Even with sophisticated regularization, the correlation-blind decoders still performed worse than the correlation-aware decoders. Thus, we may conclude that exploiting neural correlations can significantly improve decoding performance.

Visualizations of the decoding separatrices produced by each decoder indicate that the real datasets differ significantly from the assumptions made by correlation-blind decoders. The correlation-aware decoders discovered low-dimensional subspaces that “decorrelated” the data, making the transformed data match independence assumptions more closely.

Finally, we note that our decoder discovered the cortical column structure (or lack thereof) in ferret and mouse visual cortex.

## A Linearity of the PID and GID decoders

Both the GID and the PID decoders, under appropriate assumptions, have linear decision boundaries. To derive both decision boundaries at the same time, let us consider the more general case of a naïve Bayes decoder as described in section 3.1.1, but with an exponential family likelihood (Bishop, 2006). That is, the likelihood of the *d*th element of the feature vector **x**_{d} can be written as *P*(**x**_{d} ∣ *η*) = *h*(**x**_{d}) *g*(*η*) exp(*η* *u*(**x**_{d})), where *η* is the natural parameter, *g*(*η*) is the normalizing factor, and the sufficient statistic *u* : ℝ → ℝ is a function of **x**_{d}.

With this likelihood in mind, we can begin solving for the decision boundary. Our class prediction *ŷ* for a given example **x** is

If we assume that the prior probabilities are constant, i.e. *P*(*y* = *k*) = *P*(*y* = *k*′) for every *k, k*′ ∈ {1, …, *K*}, then this simplifies to

Introducing a log under the argmax, dropping terms that don’t depend on *k*, and substituting , we simplify further:

Writing out the exponential family form, we have

This will simplify to the form of a linear decoder as long as the sufficient statistic of the exponential family distribution is of the form *u*(**x**_{d}) = *α***x**_{d}, where *α* is a scalar. In that case, the entries of the weight matrix are given by the natural parameters
and the entries of the intercept vector **b** are given by

Using these definitions, we can write *ŷ* = argmax_{k} (*W***x** + **b**)_{k}, which is the form of a linear decoder.
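The chain of steps in this derivation can be sketched compactly as follows, assuming the exponential family form of Bishop (2006), $p(x \mid \eta) = h(x)\, g(\eta) \exp(\eta\, u(x))$. The symbol $g$ for the parameter-dependent normalizer, the class- and neuron-specific indexing $\eta_{kd}$, and the exact ordering of the intermediate steps are assumptions made here for illustration.

```latex
\begin{aligned}
\hat{y}
  &= \arg\max_k \; P(y = k) \prod_{d=1}^{D} P(\mathbf{x}_d \mid y = k) \\
  &= \arg\max_k \; \sum_{d=1}^{D} \log P(\mathbf{x}_d \mid y = k)
     && \text{(constant prior; log is monotone)} \\
  &= \arg\max_k \; \sum_{d=1}^{D} \left[ \log g(\eta_{kd}) + \eta_{kd}\, u(\mathbf{x}_d) \right]
     && \text{($\log h(\mathbf{x}_d)$ does not depend on $k$)} \\
  &= \arg\max_k \; \sum_{d=1}^{D} \alpha\, \eta_{kd}\, \mathbf{x}_d + \sum_{d=1}^{D} \log g(\eta_{kd})
     && \text{(using $u(\mathbf{x}_d) = \alpha \mathbf{x}_d$)} \\
  &= \arg\max_k \; \left( W \mathbf{x} + \mathbf{b} \right)_k,
  \qquad W_{kd} = \alpha\, \eta_{kd}, \quad \mathbf{b}_k = \sum_{d=1}^{D} \log g(\eta_{kd}).
\end{aligned}
```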

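In the case of the Poisson Independent Decoder, the sufficient statistic is the identity function, the natural parameter of the Poisson distribution is given by *η*_{kd} = log *λ*_{kd}, and the parameter-dependent normalizing factor is exp(−*λ*_{kd}). Thus, for the PID, *W*_{kd} = log *λ*_{kd} and **b**_{k} = −∑_{d} *λ*_{kd}.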

The case of the Gaussian Independent Decoder is slightly more complicated, since the Gaussian sufficient statistic only takes the proper form if the variance *σ*^{2} can be incorporated into *h*(**x**_{d}), a term we dropped from the argmax. For this dropping to be valid, we must constrain *σ*_{kd} = *σ*_{k′d} for all *k, k*′ ∈ {1, …, *K*}. If this is true, then the sufficient statistic is *u*(**x**_{d}) = **x**_{d}/*σ*_{d}, the natural parameter is given by *η*_{kd} = *μ*_{kd}/*σ*_{d}, and the parameter-dependent normalizing factor is exp(−*μ*_{kd}^{2}/(2*σ*_{d}^{2})). Thus, for the GID, *W*_{kd} = *μ*_{kd}/*σ*_{d}^{2} and **b**_{k} = −∑_{d} *μ*_{kd}^{2}/(2*σ*_{d}^{2}).
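As a numerical check of the Poisson case, the sketch below compares the class chosen by the exact independent-Poisson log-likelihood with the class chosen by a linear decoder whose weights are the log rates and whose intercepts are the negative summed rates. The rates, spike counts, and function names are made up for illustration.

```python
import math

def poisson_loglik(x, rates):
    """Exact independent-Poisson log-likelihood of a count vector x."""
    return sum(xd * math.log(lam) - lam - math.lgamma(xd + 1)
               for xd, lam in zip(x, rates))

def pid_linear_scores(x, rates_per_class):
    """Linear scores W x + b with W_kd = log(rate_kd), b_k = -sum_d rate_kd."""
    return [sum(xd * math.log(lam) for xd, lam in zip(x, rates)) - sum(rates)
            for rates in rates_per_class]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Toy tuning curves: 3 classes (rows) by 3 neurons (columns).
rates_per_class = [[1.0, 5.0, 2.0], [4.0, 1.0, 2.0], [2.0, 2.0, 6.0]]
x = [3, 1, 2]  # toy spike counts

naive_bayes_class = argmax([poisson_loglik(x, r) for r in rates_per_class])
linear_class = argmax(pid_linear_scores(x, rates_per_class))
```

The two score vectors differ only by the factorial term, which is the same for every class, so the two decoders select the same class.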

## B Spectral GP regression

### B.1 With Gaussian noise

In this section we demonstrate how to solve a 1-D GP regression problem in the spectral domain. Consider a regression dataset with *y* ∈ ℝ and, without loss of generality, *x* ∈ [−*π, π*]. Note that we do not require the *x* values to lie on a grid. We can concatenate the training examples and labels into vectors as follows: **x** = (*x*_{1}, …, *x*_{t})^{⊤} and **y** = (*y*_{1}, …, *y*_{t})^{⊤}.

We assume that the values of **y** are noisy observations of a zero-mean Gaussian process *z* where **z**_{i} = *z*(**x**_{i}), i.e. **y**_{i} = **z**_{i} + *ϵ* where *ϵ* ∼ 𝒩(0, *σ*^{2}). Our probability model can be written:
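In the notation of the surrounding text, a zero-mean GP prior on **z** with independent Gaussian observation noise corresponds to the following sketch of the model:

```latex
\begin{aligned}
\mathbf{z} \mid \theta &\sim \mathcal{N}\!\left( \mathbf{0},\; K_\theta(\mathbf{x}, \mathbf{x}) \right), \\
\mathbf{y} \mid \mathbf{z} &\sim \mathcal{N}\!\left( \mathbf{z},\; \sigma^2 I \right).
\end{aligned}
```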

Here *K*_{θ}(**x, x**) is the kernel matrix between **x** and **x**. If *κ*_{θ}(⋅) is the stationary kernel function with hyperparameters *θ*, then *K*_{θ}(**x, x**)_{ij} = *κ*_{θ}(**x**_{i} − **x**_{j}). To infer the GP hyperparameters using type-II maximum likelihood, we wish to maximize the log evidence given by
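For this Gaussian model, the log evidence has a standard closed form (see Rasmussen and Williams, 2006), sketched here with $n$ denoting the number of training examples:

```latex
\log p(\mathbf{y} \mid \mathbf{x}, \theta)
  = -\frac{1}{2}\, \mathbf{y}^{\top} \left( K_\theta(\mathbf{x}, \mathbf{x}) + \sigma^2 I \right)^{-1} \mathbf{y}
    - \frac{1}{2} \log \det\!\left( K_\theta(\mathbf{x}, \mathbf{x}) + \sigma^2 I \right)
    - \frac{n}{2} \log 2\pi
```

The matrix inverse and determinant in this expression are the source of the cubic cost.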

However, each evaluation of this expression has cost ∼ *n*^{3}, which is intractable for large *n*. Our goal is to decorrelate **z** so that the prior covariance becomes diagonal, dropping the cost to ∼ *n*.

To achieve this we will represent the GP *z*(*x*) using a Fourier series:
where Ω = 2*π*/*N* and the *F*_{k} are the Fourier series coefficients. Our goal is to find the distribution of *F*_{k} such that *z* is a real zero-mean Gaussian with Cov[*z*(*x*), *z*(*x*′)] = *κ*_{θ}(*x, x*′).

To ensure that *z*(*x*) is real, we require that the coefficients be conjugate symmetric, that is, that *F*_{−k} is the complex conjugate of *F*_{k}. This requirement can be verified by expanding equation 66 in terms of sines and cosines. To ensure that *z*(*x*) is Gaussian, we require that the *F*_{k} are independent Gaussian random variables, which implies that they are jointly Gaussian. Since *z*(*x*) is a linear combination of the *F*_{k}s, this implies it is also Gaussian. To ensure that *z*(*x*) is zero mean, we require that each *F*_{k} is zero mean. Because expectation is a linear operator, this ensures that *z*(*x*) is also zero mean.

The trickiest task is finding the variance of the *F*_{k}s that induces the proper GP distribution on *z*. We can construct an equation to solve for it as follows: given a lag *τ* = *s* − *t* between two *x* values, we have the covariance

Because the Fourier coefficients are independent and zero mean, the cross-covariance terms between *F*_{k} and *F*_{k′} vanish for all *k* ≠ *k*′.

This is just a Fourier series. Thus, we can use the Fourier coefficient formula to invert the equation and solve for the variance of *F*_{k}:

At this point, we are done. However, many implementations use the Fourier transform of *κ* rather than the Fourier coefficient expression given above. The equivalence can be derived by extending the bounds of integration to [−∞, ∞]. This is a reasonable approximation as long as *κ*(*τ*) is close to zero outside [−*π, π*], which is true for the RBF and Matérn kernels as long as the lengthscale is short. Extending the bounds of integration, we have
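The quality of this approximation can be checked numerically. The sketch below assumes an RBF kernel, a Fourier series with unit fundamental frequency on [−*π*, *π*], and a coefficient formula with a 1/(2*π*) normalization; these conventions, the function names, and the chosen lengthscale are assumptions made for illustration.

```python
import cmath
import math

def rbf_kernel(tau, amplitude=1.0, lengthscale=0.3):
    """Stationary RBF kernel as a function of the lag tau."""
    return amplitude * math.exp(-0.5 * (tau / lengthscale) ** 2)

def fourier_series_variance(k, n_steps=4000):
    """(1 / 2pi) * integral over [-pi, pi] of kappa(tau) exp(-i k tau), midpoint rule."""
    h = 2.0 * math.pi / n_steps
    total = 0.0 + 0.0j
    for j in range(n_steps):
        tau = -math.pi + (j + 0.5) * h
        total += rbf_kernel(tau) * cmath.exp(-1j * k * tau)
    return (total * h / (2.0 * math.pi)).real

def fourier_transform_variance(k, amplitude=1.0, lengthscale=0.3):
    """Same integral with the bounds extended to the whole real line (closed form)."""
    return (amplitude * lengthscale / math.sqrt(2.0 * math.pi)
            * math.exp(-0.5 * (lengthscale * k) ** 2))

series = [fourier_series_variance(k) for k in range(4)]
transform = [fourier_transform_variance(k) for k in range(4)]
max_gap = max(abs(a - b) for a, b in zip(series, transform))
```

With a lengthscale of 0.3 the kernel is negligible at the interval boundary, so the two sets of variances agree closely; with a lengthscale comparable to *π* the gap would no longer be small.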

For notational simplicity, let **w** be the vector of frequency-domain coefficients and **s** be the vector of associated covariances. Then the log evidence we wish to maximize is
where and . If Ψ is unitary, which is true if **x** lies on a grid, then *A* will be diagonal, which simplifies the gradient and Hessian calculations somewhat.

### B.2 With Poisson noise

Consider the same regression problem as in §B.1, but with Poisson observation noise. We wish to fit the hyperparameters by maximizing the log evidence

Since the likelihood terms are Poisson, Poiss(**y**_{t}; *θ*_{t}), the integral is not analytically tractable.

Define *h*(**z**) = *p*(**y** ∣ **z**, *θ*)*p*(**z** ∣ *θ*) and its argmax as **z**^{∗}. Using the Laplace Approximation (Azevedo-Filho & Shachter, 1994), we can approximate the integral in eq. 76 as
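The standard form of the Laplace approximation is sketched below, with $n$ denoting the dimension of $\mathbf{z}$ and $H$ the negative Hessian of $\log h$ at its maximum. Both symbols are notation assumed here for illustration.

```latex
\int h(\mathbf{z})\, d\mathbf{z}
  \;\approx\; h(\mathbf{z}^{*})\, (2\pi)^{n/2}\, \det(H)^{-1/2},
\qquad
H = -\nabla \nabla \log h(\mathbf{z}) \,\Big|_{\mathbf{z} = \mathbf{z}^{*}}
```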

Since evaluating this quantity requires finding **z**^{∗} via an optimization procedure, it is difficult to maximize it using a derivative-based optimization algorithm, an issue pointed out by Rasmussen and Williams (2006). We use a derivative-free technique, the Nelder-Mead algorithm (Nelder & Mead, 1965).

## C Dataset and preprocessing details

For each of the five monkey datasets provided by Graf et al. (2011), we chose the feature vectors ({**x**}) exactly as in Graf et al. (2011). The stimulus grating angles were selected from a five-degree grid, so to get *y*_{t} we simply mapped the angles {0, 5, 10, …, 360} to the integers {0, 1, 2, …, 72}.

Unlike Graf et al., we did not drop noisy neurons from the dataset, since we found it made little to no difference in decoding accuracy (see Figure TODO).

For the three mouse datasets, we chose the feature vectors ({**x**}) exactly as in Stringer et al. (2021). For the class values (*y*), we binned the stimulus angles using 2-degree bins and used the bin index as the class label.
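Both labeling schemes can be sketched in a few lines. The function names are hypothetical, and treating bins as half-open intervals that start at 0 degrees is an assumption.

```python
def grid_label(angle_deg, spacing_deg):
    """Map a stimulus angle on a regular grid to its integer class index."""
    return int(round(angle_deg / spacing_deg))

def bin_label(angle_deg, bin_width_deg):
    """Map a continuous stimulus angle to the index of the bin containing it."""
    return int(angle_deg // bin_width_deg)

# Five-degree grid, as used for the monkey datasets.
monkey_labels = [grid_label(a, 5.0) for a in (0.0, 5.0, 10.0, 355.0)]
# Two-degree bins, as used for the mouse datasets.
mouse_labels = [bin_label(a, 2.0) for a in (0.0, 1.9, 2.0, 359.5)]
```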

For the ferret dataset, all procedures were performed according to NIH guidelines and approved by the Institutional Animal Care and Use Committee at Max Planck Florida Institute for Neuroscience. Surgical procedures and acute preparations were performed as described in Scholl et al. (2017). To perform calcium imaging of cellular populations, AAV1.Syn.GCaMP6s (UPenn) was injected at multiple depths (total volume 500 nL). Visual stimuli were generated using Psychopy (Peirce, 2007). The monitor was placed 25 cm from the animal, centered in the receptive field locations for the cells of interest. Square-wave drifting gratings (0.10 cycles per degree spatial frequency, 4 Hz temporal frequency) were presented at 2-degree increments across the full range of directions (1 second duration, 1 second ISI, 11 trials). Two-photon imaging was performed on a Bergamo II microscope (Thorlabs) running Scanimage (Pologruto et al., 2003) (Vidrio Technologies) with 940 nm dispersion-compensated excitation provided by an Insight DS+ (Spectraphysics). Power after the objective was 40 mW. Images were collected at 30 Hz using bidirectional scanning with 512 × 512 pixel resolution. The full field of view was 1 × 1 mm. Raw images were corrected for in-plane motion via a non-rigid motion correction algorithm (Pnevmatikakis & Giovannucci, 2017). Regions of interest were drawn in ImageJ. Mean pixel values for ROIs were computed over the imaging time series and imported into MATLAB (Hiner et al., 2017). Δ*F*/*F*_{o} was calculated by estimating *F*_{o} with a time-averaged median or percentile filter. Δ*F*/*F*_{o} traces were synchronized to stimulus triggers sent from Psychopy and collected by Spike2. Response amplitudes for each stimulus on each trial were calculated as the sum of the Fourier mean and modulation (*F*_{0} + *F*_{1}). These values for each neuron were used to generate the feature vectors ({**x**}). Class values (*y*) were the stimulus angles presented (at 2-degree increments), using the bin index as the class label.

## D Supplementary figures

## Acknowledgements

This work was supported by grants from the Simons Collaboration on the Global Brain (SCGB AWD543027), the NIH BRAIN initiative (NS104899 and R01EB026946), and a U19 NIH-NINDS BRAIN Initiative Award (5U19NS104648). Jacob L. Yates is supported by the NIH (K99EY032179). Benjamin Scholl is supported by the NIH (K99EY031137) and thanks the Max Planck Society and Max Planck Florida Institute for their generous support. We thank A. B. A. Graf, Al Kohn, M. Jazayeri, and J. A. Movshon for providing the primate datasets; and C. Stringer, M. Michaelos, and M. Pachitariu for providing the publicly available mouse datasets.