Abstract
Learning from a limited number of experiences requires suitable inductive biases. While inductive biases are central components of intelligence, how they are reflected in and shaped by population codes is not well understood. To address this question, we consider biologically plausible readouts of arbitrary stimulus-response maps from arbitrary population codes, and develop an analytical theory that predicts the generalization error of the readout as a function of the number of examples. Our theory illustrates in a mathematically precise way how the structure of population codes allows sample-efficient learning of certain stimulus-response maps over others, and how a match between the code and the task is crucial for sample-efficient learning. We observe that many different codes can support the same inductive biases, and, by analyzing recordings from the mouse primary visual cortex, we demonstrate that biological codes are metabolically more efficient than other codes with identical biases. We apply our theory to experimental recordings of mouse primary visual cortex neural responses, elucidating a bias towards sample-efficient learning of low frequency orientation discrimination tasks. We demonstrate the emergence of this bias in a simple model of primary visual cortex, and further show how invariances in the code to stimulus variations affect learning performance. We extend our methods to time-dependent neural codes. Finally, we discuss the implications of our theory in the context of recent developments in neuroscience and artificial intelligence. Overall, our study suggests sample-efficient learning as a general normative coding principle.
Introduction
The ability to learn fast is crucial for survival in a complex and ever-changing world, and the brain is remarkably efficient at this. Often, only a few experiences are sufficient to learn a task, whether acquiring a new word [1] or recognizing a new face [2]. Despite the importance and ubiquity of sample-efficient learning, our understanding of the brain's information encoding strategies that support this faculty remains poor [3, 4, 5].
In particular, when learning and generalizing from past experiences, and especially from few experiences, the brain relies on implicit assumptions it carries about the world, or its inductive biases [6, 5]. Reliance on inductive bias is not a choice: inferring a general rule from finite observations is an ill-posed problem which requires prior assumptions since many hypotheses can explain the same observed experiences [7]. Consider learning a rule that maps photoreceptor responses to a prediction of whether an observed object is a threat or is neutral. Given a limited number of visual experiences of objects and their threat status, many threat-detection rules are consistent with these experiences. By choosing one of these threat-detection rules, the nervous system reveals an inductive bias. Without the right biases that suit the task at hand, successful generalization is impossible [6, 5]. Therefore, in order to understand why we learn certain tasks accurately and rapidly over others, we must understand the brain’s inductive biases [3, 4, 5].
We study sample-efficient learning in a general neural circuit model which comprises a population of sensory neurons and a readout neuron that learns a stimulus-response map with a biologically-plausible learning rule (Fig 1A). In this circuit, inductive bias arises from the nature of the neural code for sensory stimuli. While different population codes can encode the same stimulus variables and allow learning of the same output with perfect performance given infinitely many samples, learning performance can depend dramatically on the code when restricted to a small number of samples, where the reliance on and the effect of inductive bias are strong (Fig 1B,C,D). Given the same sensory examples and their associated response values, the readout neuron may make drastically different predictions depending on the inductive bias set by the nature of the code, leading to successful or failing generalizations (Fig 1C,D). We say that a code and a learning rule, together, have a good inductive bias for a task if the task can be learned from a small number of examples.
Learning tasks through linear readouts exploits representations of the population code to approximate a target response. A The readout weights from the population to a downstream neuron, shown in blue, are updated to fit target values y, using the local, biologically plausible delta rule. B Examples of tuning curves for two different population codes: smooth tuning curves (Code 1) and rapidly varying tuning curves (Code 2). C (Left) A target function with low frequency content is approximated through the learning rule shown in A using these two codes. The readout from Code 1 (turquoise) fits the target function (black) almost perfectly with only P = 12 training examples, while the readout from Code 2 (purple) does not accurately approximate the target function. (Right) However, when the number of training examples is sufficiently large (P = 120), the target function is estimated perfectly by both codes, indicating that both codes are equally expressive. D The same experiment is performed on a task with higher frequency content. (Left) Code 1 fails to perform well with P = 12 samples, indicating that a mismatch between the inductive bias and the task can prevent sample-efficient learning, while Code 2 accurately fits the target. (Right) Again, provided enough data (P = 120), both codes accurately estimate the target function. Details of these simulations are given in Methods.
In order to understand how population codes shape inductive bias and allow fast learning of certain tasks over others with a biologically plausible learning rule, we develop an analytical theory of the readout neuron's learning performance as a function of the number of sampled examples, or sample size. We find that the readout's performance is completely determined by the code's kernel, a function which takes in pairs of population response vectors and outputs a representational similarity defined by the inner product of these vectors. We demonstrate that the spectral properties of the kernel introduce an inductive bias toward explaining sampled examples with simple stimulus-response maps, and determine the compatibility of the population code with the learning task, and hence the sample-efficiency of learning. We observe that many codes could support the same kernel function; however, by analyzing data from mouse primary visual cortex (V1) [8, 9, 10, 11], we find that the biological code is metabolically more efficient than others. Further, mouse V1 responses support sample-efficient learning of low frequency orientation discrimination tasks over high frequency ones. We demonstrate this bias in a simple model of V1 and show how response nonlinearity, sparsity, and the relative proportion of simple and complex cells influence the code's bias and performance on learning tasks, including ones that involve invariances. Finally, we extend our theory to temporal population codes, using codes generated by recurrent neural networks learning a delayed response task as an example. Overall, our results suggest sample-efficient learning as a novel functional role for population codes.
Results
We consider a population of N neurons whose responses, {r1(θ), r2(θ), …, rN(θ)}, vary with the input stimulus, which is parameterized by a vector variable θ, such as the orientation and the phase of a grating (Figure 1A). These responses define the population code. A readout neuron learns its weights w to approximate a stimulus-response map, or a target function y(θ), such as one that classifies stimuli as appetitive (y = 1) or aversive (y = −1), or a smoother one that assigns intermediate values of valence. Our theory is general in its assumptions about the structure of the population code and the stimulus-response map considered (Methods), and can apply to many scenarios.
The readout neuron learns from P stimulus-response examples with the goal of generalizing to previously unseen ones. Example stimuli θµ, (µ = 1, …, P) are sampled from a probability distribution describing stimulus statistics, p(θ). This distribution can be natural or artificially created, for example, for a laboratory experiment (Supplementary Information, SI). From the set of learning examples, 𝒟 = {(θµ, y(θµ))}, the readout weights are learned with the local, biologically-plausible delta-rule,

Δw = η Σµ (y(θµ) − w · r(θµ)) r(θµ),

where η is a learning rate (Methods, Figure 1A). This learning process converges to a unique set of weights w*(𝒟) (Methods). The generalization error with these weights is given by

Eg(𝒟) = ⟨(w*(𝒟) · r(θ) − y(θ))²⟩θ,

which quantifies the expected error of the trained readout over the entire stimulus distribution p(θ). This quantity will depend on the population code r(θ), the target function y(θ), and the set of training examples 𝒟. Our theoretical analysis of this model provides insights into how populations of neurons encode information and allow sample-efficient learning.
Kernel structure of population codes controls learning
First, we note that the generalization performance of the learned readout on a given task depends entirely on the inner product kernel, defined by

K(θ, θ′) = r(θ) · r(θ′)/N,

which quantifies the similarity of population responses to two different stimuli θ and θ′. This is because the learning procedure converges to a unique solution w*(𝒟) for the training set 𝒟 [12, 13], and the readout neuron's learned output has the form

f(θ) = Σµ αµ K(θ, θµ),

where the coefficient vector α = K⁺y for the vector of training target values y, with ⁺ denoting the Moore-Penrose inverse (Methods), and the matrix K has entries Kµν = K(θµ, θν). Our main observation is that in these expressions the population code only appears through the kernel K. Therefore, the kernel controls the learned response pattern.
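To make this concrete, the following sketch (Python/NumPy, using a hypothetical population of von Mises tuned neurons and an arbitrary binary target, not the codes analyzed in this paper) computes the kernel from sampled population responses and forms the learned readout as f(θ) = Σµ αµ K(θ, θµ) with α = K⁺y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: N neurons with von Mises tuning over a circular stimulus.
N, P = 200, 12
centers = rng.uniform(0, 2 * np.pi, N)                    # preferred angles
def pop(th):
    return np.exp(np.cos(np.subtract.outer(th, centers)) / 0.3 ** 2)  # (len(th), N)

theta_train = rng.uniform(0, 2 * np.pi, P)                 # sampled example stimuli
y_train = np.sign(np.cos(theta_train))                     # an illustrative target map

R_train = pop(theta_train)                                 # P x N response matrix
K = R_train @ R_train.T / N                                # kernel Gram matrix K_{mu nu}
alpha = np.linalg.pinv(K) @ y_train                        # alpha = K^+ y

theta_test = np.linspace(0, 2 * np.pi, 500)
k_test = pop(theta_test) @ R_train.T / N                   # K(theta, theta_mu) at test points
f_test = k_test @ alpha                                    # learned readout prediction

print("training MSE:", np.mean((K @ alpha - y_train) ** 2))
print("test prediction range:", f_test.min(), f_test.max())
```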
Biological codes are metabolically more efficient than other codes with identical kernels
The fact that learning performance depends only on the kernel introduces a large degeneracy in the set of codes which achieve identical desired performance on learning tasks. This is because the kernel is invariant with respect to left-rotations of the population code. A population code r(θ) can be rotated to generate a new code with identical kernel,

r̃(θ) = Q r(θ),

where Q is an orthogonal matrix. The codes r(θ) and r̃(θ) will have identical readout performance on all possible learning tasks. We illustrate this degeneracy in Figure 2 using a publicly available dataset which consists of activity recorded from ∼ 20,000 neurons from the primary visual cortex of a mouse while shown static gratings [8, 9]. An original code r(θ) is rotated to generate r̃(θ) (Figure 2A); the two codes have the same kernels (Figure 2B) and the same performance on a learning task (Figure 2C).
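A minimal numerical check of this degeneracy, with a synthetic tuning-curve code standing in for the recorded data (a sketch, not the analysis pipeline used here):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 60                                    # neurons, stimuli on a grid
theta = np.linspace(0, 2 * np.pi, M, endpoint=False)
centers = rng.uniform(0, 2 * np.pi, N)
R = np.exp(np.cos(theta[None, :] - centers[:, None]) / 0.5)   # N x M code

Q, _ = np.linalg.qr(rng.standard_normal((N, N)))  # random orthogonal matrix
R_rot = Q @ R                                     # rotated code

K = R.T @ R / N
K_rot = R_rot.T @ R_rot / N
print("max kernel difference:", np.abs(K - K_rot).max())   # ~1e-13
```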
The inner product kernel controls the generalization performance of readouts. A Tuning curves r(θ) for three example recorded Mouse V1 neurons to varying static grating stimuli oriented at angle θ [8, 9] (Left) are compared with a randomly rotated version (Middle) of the same population code. (Right) These two codes, original (Ori.) and rotated (Rot.) can be visualized as parametric trajectories in neural space. B The inner product kernel matrix has elements K(θ1, θ2). The original V1 code and its rotated counterpart have identical kernels. C In a learning task involving uniformly sampled angles, readouts from the two codes perform identically, resulting in identical approximations of the target function (shown on the left as blue and red curves) and consequently identical generalization performance as a function of training set size P (shown on right with blue and red points). The theory curve will be described in the main text.
Although the performance of linear readouts may be invariant to such rotations, metabolic efficiency may favor certain codes over others [14, 15, 16, 17, 18], reducing the degeneracy in the space of codes with identical kernels. To formalize this idea, we define δ to be the vector of spontaneous firing rates of a population of neurons, and sµ = r(θµ) + δ to be the spiking rate vector in response to a stimulus θµ. The modulation with respect to the spontaneous activity, r(θµ), gives the population code and defines the kernel. To avoid confusion with r(θµ), we will refer to sµ as the total spiking activity. We propose that population codes prefer smaller spiking activity subject to a fixed kernel. In other words, because the kernel is invariant to any change of the spontaneous firing rates and to left rotations of r(θ), the orientation and shift of the population code r(θ) should be chosen such that the resulting total spiking activity is small.
We tested whether biological codes exhibit lower total spiking activity than other codes with the same kernel using mouse V1 recordings, with deconvolved calcium activity as a proxy for spiking events [8, 9, 19] (Methods; Figure 3). To compare the experimental total spiking activity to that of other codes with identical kernels, we computed random rotations of the neural responses around the spontaneous activity, Q r(θµ), and added the offset that minimizes total spiking activity while maintaining its nonnegativity (Methods). In other words, we compare the true code to the most metabolically efficient realizations of its random rotations. This procedure may result in an increased or decreased total spike count in the code, and is illustrated on a synthetic dataset in Figure 3A. We conducted this procedure on subsets of various sizes of the mouse V1 neuron population, as our proposal should hold for any subset of neurons (Methods), and found that the true V1 code is much more metabolically efficient than randomly rotated versions of the code (Figure 3B and C). This finding holds both for responses to static gratings and for responses to natural images, as we show in Figure 3B and C respectively.
The biological code is more metabolically efficient than random codes with the same inductive biases. A We illustrate our procedure in a synthetic example. A non-negative population code (left) can be randomly rotated about its spontaneous firing rate (middle), illustrated as a purple dot, and optimally shifted to a new non-negative population code (right). If the kernel is measured about the spontaneous firing rate, these transformations leave the inductive bias of the code invariant but can change the total spiking activity of the neural responses. We refer to such an operation as a random rotation + optimal shift (RROS). We also perform gradient descent over rotations and shifts, generating an optimized code (opt). B Performing RROS on N-neuron subsamples of experimental mouse V1 recordings [8, 9] shows that the true code has a much lower average cost compared to random rotations of the code. The set of possible RROS transformations (Methods) generates a distribution over average cost, which has a higher mean than the cost of the true code. We also optimize the metabolic cost over the space of RROS transformations; the results are shown as red dashed lines. We plot the distance (in units of standard deviations) between the cost of the true and optimal codes and the costs of randomly rotated codes for different neuron subsample sizes N. C The same experiment performed on mouse V1 responses to ImageNet images from 10 relevant classes [11, 10].
To further explore metabolic efficiency, we posed an optimization problem which identifies the most efficient code with the same kernel as the biological V1 code. This problem searches over rotation matrices Q and finds the rotation Q and offset vector δ which give the lowest cost (Methods; Figure 3). Though the local optimum identified by the algorithm has lower cost than the biological code, both the optimal and the biological codes are significantly displaced from the distribution of random codes with the same kernel. Our findings do not change when the data is preprocessed with an alternative strategy, when an upper bound on neural responses is imposed on rotated codes, or when subsets of stimuli are considered (SI and Figure SI.1). Overall, the large disparity in total spiking activity between the true code and randomly generated codes with identical kernels suggests that metabolic constraints may favor the biological code over others that realize the same kernel.
Code-task alignment governs generalization
We next examine how the population code affects the generalization performance of the readout. We calculated analytical expressions for the average generalization error in a task defined by the target response y(θ) after observing P stimuli, using methods from statistical physics (Methods). Because the relevant quantity for learning performance is the kernel, we leveraged results from our previous work studying generalization in kernel regression [20, 21], and calculated the generalization error averaged over all possible realizations of a training dataset composed of P stimuli, Eg = ⟨Eg(𝒟)⟩𝒟. As P increases, the variance in Eg due to the composition of the dataset falls, and our expressions also become descriptive of the typical case. Our final analytical result is given in Equation (29) in Methods. We provide details of our calculations in Methods and SI, and focus on their implications here.
One of our main observations is that given a population code r(θ), the singular value decomposition of the code gives the appropriate basis to analyze the inductive biases of the readouts (Figure 4A). The tuning curves of individual neurons ri(θ) form an N-by-M matrix R, where M, possibly infinite, is the number of all possible stimuli. The left singular vectors (or principal axes) and singular values of this matrix have been used in neuroscience for describing lower dimensional structure in neural activity and estimating its dimensionality, see e.g. [22, 23, 24, 25, 26, 27, 11, 8, 28, 29, 30]. We found that the function approximation properties of the code are controlled by the singular values, or rather their squares {λk}, which give the variances along the principal axes, indexed in decreasing order, and by the corresponding right singular vectors {ψk(θ)}, which are also the kernel eigenfunctions (Methods and SI). This follows from the fact that the learned response (Eq. (3)) is only a function of the kernel K, and the eigenvalues λk and orthonormal eigenfunctions ψk(θ) collectively define the code's inner-product kernel K(θ, θ′) through an eigendecomposition [31] (Methods and SI).
The singular value decomposition of the population code reveals the structure and inductive bias of the code. A Singular value decomposition of the response matrix R gives left singular vectors uk (principal axes), kernel eigenvalues λk, and kernel eigenfunctions ψk(θ). The ordering of the eigenvalues provides an ordering of which modes ψk can be learned by the code from few training examples. The eigenfunctions were offset by 0.5 for visibility. B (Left) Two different learning tasks y(θ), a low frequency (blue) and a high frequency (red) function, are shown. (Middle) The cumulative power distribution rises more rapidly for the low frequency task than for the high frequency task, indicating better alignment with the top kernel eigenfunctions and consequently more sample-efficient learning, as shown in the learning curves (right). Dashed lines show theoretical generalization error, while dots and solid vertical lines are the experimental average and standard deviation over 30 repeats. C The feature space representations of the low (left) and high (middle and right) frequency tasks. Each point represents the embedding of a stimulus response vector along the k-th principal axis. The binary target value {±1} is indicated by the color of the point. The easy, low frequency task (left) is well separated along the top two dimensions, while the hard, high frequency task is not linearly separable in two (middle) or even four (right) feature dimensions. D On an image discrimination task (recognizing birds vs mice, left), V1 has an entangled representation which does not allow good performance of linear readouts. This is evidenced by the projection of the responses along the top principal axes (middle) and the slowly rising C(k) curve (right).
Our analysis shows the existence of a bias in the readout towards learning certain target responses faster than others. The kernel eigenfunctions form a complete basis for square integrable functions, allowing the expansion of the target response, y(θ) = Σk vk ψk(θ), and of the learned readout response in this basis. We found that the readout's generalization is better if the target function y(θ) is aligned with the top eigenfunctions ψk, equivalent to the power vk² decaying rapidly with k (Methods). We formalize this notion with the following metric. Mathematically, the generalization error ⟨Eg⟩ can be decomposed into normalized estimation errors Ek for the coefficients of these eigenfunctions ψk,

Eg = Σk vk² Ek,

where Ek quantifies the error in the estimate of the coefficient vk, normalized by vk² (Methods). We found that the ordering of the eigenvalues λk controls the rates at which these mode errors Ek decrease as P increases (Methods): if λk > λl, then Ek ≤ El for all P. Hence, larger eigenvalues mean lower normalized mode errors Ek, and thus lower generalization error for the corresponding modes, indicating a spectral bias of the readout.
Based on this observation, we propose code-task alignment as a principle for good generalization. To quantify code-task alignment, we use the cumulative power distribution C(k), which measures the power of the target function in the top k eigenmodes, normalized by the total power [21]:

C(k) = Σl≤k vl² / Σl vl².

Stimulus-response maps that have high alignment with the population code's kernel will have quickly rising cumulative power distributions C(k), since a large proportion of power is placed in the top modes. Target responses with high C(k) can be learned with fewer training samples than target responses with low C(k), since the mode errors Ek are ordered for all P (Methods).
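In practice, C(k) can be estimated by eigendecomposing the kernel on a stimulus grid and projecting the target onto the eigenfunctions. The sketch below uses a synthetic smooth code and two square-wave discrimination targets (all choices are illustrative, not the V1 data):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 300, 256
theta = np.linspace(0, 2 * np.pi, M, endpoint=False)
centers = rng.uniform(0, 2 * np.pi, N)
R = np.exp(np.cos(theta[None, :] - centers[:, None]) / 0.3 ** 2)   # N x M code

# Kernel on the grid, with uniform stimulus measure 1/M.
K = R.T @ R / N
evals, evecs = np.linalg.eigh(K / M)             # eigenfunctions psi_k ~ sqrt(M) * evecs
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

def cumulative_power(y):
    v = evecs.T @ y / np.sqrt(M)                 # coefficients v_k of the target
    return np.cumsum(v ** 2) / np.sum(v ** 2)

y_low = np.sign(np.cos(2 * theta))               # low frequency discrimination task
y_high = np.sign(np.cos(10 * theta))             # high frequency task
print("C(k=10):", cumulative_power(y_low)[9], cumulative_power(y_high)[9])
```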
This theory can be used to probe the learning biases of neural populations. Using publicly available calcium imaging recordings from mouse primary visual cortex (V1), we analyzed population responses to static grating stimuli oriented at an angle θ [8, 9]. We found that the kernel eigenfunctions have sinusoidal shape with differing frequency. The ordering of the eigenvalues and eigenfunctions in Figure 4A indicates a frequency bias: lower frequency functions of θ are easier to estimate at small sample sizes.
We tested this idea by constructing two different orientation discrimination tasks shown in Figures 4B,C, where we assign static grating orientations to positive or negative valence with different frequency square wave functions of θ. We trained the readout using a subset of the experimentally measured neural responses, and measured the readout’s generalization performance.
We found that the cumulative power distribution C(k) for the low frequency task rises more rapidly (Figure 4B). Using our theory of generalization, we predicted learning curves for these two tasks, which express the generalization error as a function of the number of sampled stimuli P. The error for the low frequency task is lower than that for the high frequency task at all sample sizes. The theoretical predictions and numerical experiments show perfect agreement (Figure 4B). More intuition can be gained by projecting the neural responses onto the top principal axes (Figure 4C). For the low frequency task, the two target values are well separated along the top two axes. However, the high frequency task is not well separated along even the top four axes (Figure 4C).
Using the same ideas, our theory can provide insight into tasks which the V1 population code is ill-suited to learn. For the task of discriminating mice from birds [11, 10], the linear rise in cumulative power indicates that there is roughly equal power along all kernel eigenfunctions, indicative of a representation poorly aligned to this task (Figure 4D).
Low frequency bias and code-task alignment in a simple model of V1
Next, we study a simple model of V1 to elucidate factors that lead to the low frequency bias. We model responses of V1 neurons as photoreceptor inputs passed through Gabor filters and a subsequent nonlinearity, g(z), modeling a population of orientation selective simple cells (Figure 5A) (Methods and SI). In this model, the kernel for static gratings with orientation θ ∈ [0, π] is of the form K(θ, θ′) = κ(|θ − θ′|), and, as a consequence, the eigenfunctions of the kernel in this setting are Fourier modes (Methods). The eigenvalues, and hence the strength of the spectral bias, are determined by the nonlinearity.
A model of V1 as a bank of Gabor filters recapitulates experimental inductive bias. A Gabor filtered inputs are mapped through a nonlinearity. A grating stimulus (left) with orientation θ and phase ϕ is mapped through a circuit of simple and complex cells (middle). Some examples of randomly sampled Gabor filters (right) generate the preferred orientation tuning of neurons in the population. B A threshold-powerlaw nonlinearity gq,a(z) = max{0, z − a}q is fit to the mouse V1 kernel (black dots). Kernels and spectra for alternative choices of q, a are shown (color code defined in the left panel). C We plot eigenfunctions ψk (modes) for mixtures of sN simple cells and (1 − s)N complex cells. For a pure complex cell population, all eigenfunctions are independent of phase ϕ. A pure simple cell population (s = 1) or mixture codes (0 < s < 1) depend on both orientation and phase in a nontrivial way. D Three tasks are visualized, where color indicates the binary target value ±1. The left task depends only on the orientation variable θ, the middle only on the phase ϕ, and the hybrid task (right) depends on both. E (top) Generalization error and cumulative power distributions for the three tasks as a function of the simple-complex cell mixture parameter s. In Figure SI.2 we provide more comparisons of our theory and numerical experiments.
Motivated by findings in the primary visual cortex [32, 33, 34, 35], we studied the spectral bias induced by rectified power-law nonlinearities of the form g(z) = max{0, z − a}q. We fit q and a to the mouse V1 kernel and compared to other parameter sets in Figure 5B. Computation of the kernel and its eigenvalues (Methods) indicates a low frequency bias: the eigenvalues for low frequency modes are higher than those for high frequency modes, indicating a strong inductive bias to learn functions of low frequency in the orientation. Decreasing sparsity (lower a) leads to a faster decrease in the spectrum (but similar asymptotic scaling at the tail, see Methods) and a stronger bias towards lower frequency functions (Figure 5B; more comparisons in Figure SI.2). The effect of the power of the nonlinearity q is more nuanced: increasing the power may increase the spectrum at lower frequencies, but may also lead to a faster decay at the tail (Figure 5B; more comparisons in Figure SI.2). In general, an exponent q implies a power-law asymptotic spectral decay λk ∼ k^{−2q−2} as k → ∞ (Methods). The behavior at low frequencies may have a significant impact on learning with few samples. We discuss this in more detail in the next section. Overall, our findings show that the spectral bias of a population code can be determined in non-trivial ways by its biophysical parameters, including neural thresholds and nonlinearities.
To further illustrate the importance of code-task alignment, we next study how invariances in the code to stimulus variations may affect the learning performance. We introduce complex cells in addition to simple cells in our model with proportion s ∈ [0, 1] of simple cells (Methods; Figure 5A), and allow phase, ϕ, variations in static gratings. We use the energy model [36, 37] to capture the phase invariant complex cell responses (Methods). We reason that in tasks that do not depend on phase information, complex cells should improve sample efficiency.
In this model, the kernel for the V1 population is a convex combination of the kernels for the simple and complex cell populations,

K(θ, ϕ; θ′, ϕ′) = s Ks(θ, ϕ; θ′, ϕ′) + (1 − s) Kc(θ, θ′),

where Ks is the kernel for a pure simple cell population, which depends on both orientation and phase, and Kc is the kernel of a pure complex cell population, which is invariant to phase (Methods, Eqs. (39) and (54)). Figure 5C shows the top kernel eigenfunctions for various values of s, elucidating the inductive bias of the readout.
Figures 5D and 5E show generalization performance on tasks with varying levels of dependence on phase and orientation. On pure orientation discrimination tasks, increasing the proportion of complex cells by decreasing s improves generalization. Increasing the sensitivity to the nuisance phase variable, ϕ, only degrades performance. The cumulative power distribution is also maximized at s = 0. However, on a task which depends only on the phase, a pure complex cell population cannot generalize, since variation in the target function due to changes in phase cannot be explained by the code's responses. In this setting, a pure simple cell population attains optimal performance, and the cumulative power distribution is maximized at s = 1. Lastly, in a nontrivial hybrid task which requires utilization of both variables θ, ϕ, an optimal mixture s exists for each sample budget P which minimizes the generalization error. The cumulative power distribution is maximized at different s values depending on k, the component of the target function. This is consistent with an optimal heterogeneous mix, because components of the target are learned successively with increasing sample size. In reality, V1 must code for a variety of possible tasks, and we can expect a nontrivial optimal simple cell fraction s. We conclude that the degree of invariance required for the set of natural tasks, together with the number of samples, determines the optimal mix of simple and complex cells.
Small and large sample size behaviors of generalization
Our results imply that generalization at low sample sizes crucially depends on the top eigenvalues and eigenfunctions of the code's kernel. This is to be contrasted with a recent proposal about the effect of the asymptotic decay rate of the kernel eigenvalues on generalization. Stringer et al. [11] argued that input-output differentiability of the code may be necessary for better generalization, which is in turn governed by the asymptotic rate of spectral decay. Here, we provide an example to illustrate that asymptotic conditions on the kernel spectrum are insufficient to provide generalization guarantees when the sample size is small.
Our first example demonstrates how a code allowing good generalization for large sample sizes can be disadvantageous for small sample sizes. In Figure 6A, we plot three different populations of neurons with smooth (infinitely differentiable) tuning curves that tile a periodic stimulus variable, such as the direction of a moving grating. The tuning width σ of the tuning curves strongly influences the structure of these codes: narrower widths have more high frequency content, as we illustrate in a random 3D projection of the population code for θ ∈ [0, 2π] (Figure 6A). Visualizations of the corresponding (von Mises) kernels and their spectra are provided in Figure 6B. The width of the tuning curves controls the bandwidth of the kernel spectrum (Figure 6B), with narrower tuning curves having a later decay in the spectrum and larger high frequency eigenvalues. These codes can have dramatically different generalization performance, which we illustrate with a simple "bump" target response (Figure 6C). In this example, for illustration purposes, we let the network learn with a delta-rule with weight decay, leading to a regularized kernel regression solution (Methods). For a sample size of P = 10, we observe that codes with too wide or too narrow tuning curves (and kernels) do not perform well, and there is a well-performing code with an optimal tuning curve width σ, which is compatible with the width of the target bump, σT. We found that the optimal σ is different for each P (Figure 6C). In the large-P regime, the ordering of the performance of the three codes is reversed (Figure 6C). In this regime, the generalization error scales as a power law (Methods), and the narrow code, which performed worst for P ∼ 10, performs the best. This example demonstrates that asymptotic conditions on the tail of the spectrum are insufficient to understand generalization in the small sample size limit. The bulk of the kernel's spectrum needs to match the spectral structure of the task to generalize efficiently in the low sample size regime. However, for large sample sizes, the tail of the eigenvalue spectrum becomes important. We repeat the same exercise and draw the same conclusions for Laplace kernels (SI and Figure SI.3), showing that these results are not an artifact of the infinite differentiability of von Mises kernels.
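The following sketch illustrates this bandwidth effect with regularized kernel regression (a standard von Mises kernel, exp((cos Δθ − 1)/σ²), and a bump-like target; the bandwidths, target width, and ridge parameter are illustrative choices rather than the exact values used for Figure 6):

```python
import numpy as np

rng = np.random.default_rng(3)

def vm_kernel(t1, t2, sigma):
    """Von Mises kernel on the circle with bandwidth sigma (normalized to max 1)."""
    d = np.subtract.outer(t1, t2)
    return np.exp((np.cos(d) - 1.0) / sigma ** 2)

sigma_T = 0.5
target = lambda th: np.exp((np.cos(th) - 1.0) / sigma_T ** 2)   # "bump" target

P, lam = 10, 1e-3
theta_tr = rng.uniform(0, 2 * np.pi, P)
theta_te = np.linspace(0, 2 * np.pi, 400)
y_tr = target(theta_tr)

for sigma in [0.1, 0.5, 2.0]:                    # narrow, matched, wide bandwidths
    K = vm_kernel(theta_tr, theta_tr, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(P), y_tr)   # ridge-regularized readout
    f_te = vm_kernel(theta_te, theta_tr, sigma) @ alpha
    err = np.mean((f_te - target(theta_te)) ** 2)
    print(f"sigma={sigma:4.1f}  generalization error ~ {err:.4f}")
```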
The top eigensystem of a code determines its low-P generalization error. A A periodic variable is coded by a population of neurons with tuning curves of different widths (top). Narrow, wide and optimal refer to the example in C. These codes are all smooth (infinitely differentiable) but have very different feature space representations of the stimulus variable θ, as random projections reveal (below). B (left) The population codes in the above figure induce von Mises kernels with different bandwidths σ. (right) Eigenvalues of the three kernels. C (left) As an example learning task, we consider estimating a "bump" target function. The optimal kernel (red, chosen as the optimal bandwidth for P = 10) achieves a better generalization error than either the wide (green) or narrow (blue) kernels. (middle) A contour plot shows generalization error for varying bandwidth σ and sample size P. (right) The large-P generalization error scales as a power law. Solid lines are theory, dots are simulations averaged over 15 repeats, dashed lines are the asymptotic power law scalings described in the main text. Same color code as B and C-left.
Time-Dependent Neural Codes
Our framework can directly be extended to learning of arbitrary time-varying functions of time-varying inputs from an arbitrary spatiotemporal population code (Methods). In this setting, the population code r({θ(t)}, t) is a function of an input stimulus sequence θ(t) and possibly its entire history, and time t. A downstream linear readout f ({θ}, t) = w · r({θ}, t) learns a target sequence y({θ}, t) from a total of 𝒫 examples that can come at any time during any sequence. Learning is again achieved through the delta-rule and the learned function can be expressed as a linear combination of the kernel evaluated at the 𝒫 examples. The kernel in this case is a more complicated object that computes inner products of neural population vectors at different times t, t′ for different input sequences [38, 39, 40]. Our theory carries over from the static case with appropriate modifications (Methods). Kernels whose top eigenfunctions have high alignment with the target time-varying response y({θ}, t) will achieve the best average case generalization performance.
As a concrete example, we focus on readout from a temporal population code generated by a recurrent neural network in a task motivated by a delayed reach task [41] (Figure 7A,B). We consider a randomly connected recurrent network of neurons whose input currents z(t) follow leaky dynamics driven by the recurrent activity and by an encoding of the input sequence (Methods).
The performance of time-dependent codes when learning dynamical systems can be understood through spectral bias. A We study the performance of time-dependent codes on a delayed response task which requires memory retrieval. A cue (black dot) is presented at an angle γµ. After a delay time d, the cursor position (blue triangle) must be moved to the remembered cue position and then subsequently moved back to the origin after a short time. B The readout weights (blue) of a time-dependent code can be learned through a modified delta rule. C Input is presented to the network as a time series which terminates at t = 1. The sequences are generated by drawing an angle γµ ∼ Uniform[0, 2π] and using two step functions as input time series that code for the cosine and the sine of the angle (Methods). We show an example of one of the variables in an input sequence. D The target functions for the memory retrieval task are step functions delayed by a time d. E The kernel Kµ,µ′,t,t′ compares the code for two sequences at two distinct time points. We show the time-dependent kernel for identical sequences (left) and the stimulus-dependent kernel for equal time points (middle left), as well as for non-equal stimuli (middle right) and non-equal times (right). F The kernel can be diagonalized, and the eigenvalues λk determine the spectral bias of the reservoir computer (left). We see that higher gain g networks have higher dimensional representations. The eigensystems ψk(θµ, t) are functions of time and cue angle. We plot only the µ = 0 components of the top eigensystems k = 1, 2, 3, 4 (right). G The readout is trained to approximate a target function yµ(t), which requires memory of the presented cue angle. (left) The theoretical (solid) and experimental (vertical error bars, 100 trials) generalization error Eg are plotted for the three delays d against training sample size 𝒫. (right) The ordering of Eg matches the ordering of the C(k) curves, as expected.
The rates are related to the input currents through a tanh nonlinearity, r(t) = tanh(z(t)). The recurrent weights and the input encoding weights are drawn from zero-mean normal distributions (Methods). The gain parameter g of the recurrent weights was set to 1.5 to generate rich dynamics [42]. In this task, the network is presented for a short time with an input cue sequence coding an angular variable, which is drawn randomly from a distribution (Figure 7C). The recurrent neural network must remember this angle and reproduce an output sequence: a simple step function whose height depends on the angle, which begins after a time delay following the cessation of the input stimulus and lasts for a short time (Figure 7D).
The kernel induced by the spatiotemporal code is shown in Figure 7E. The high dimensional nature of the activity in the recurrent network introduces a complex and rich spatiotemporal similarity structure. Figure 7F shows the kernel's eigensystem, which consists of stimulus-dependent time series ψk({θ}; t) for each eigenvalue λk. An interesting link can be made between this eigensystem and the linear low-dimensional manifold dynamics observed in several cortical areas [22, 23, 25, 43, 27, 44, 30, 26, 45, 24]. The kernel eigenfunctions also define the latent variables obtained through a singular value decomposition of the neural activity [25].
With enough samples, the readout neuron can learn to output the desired angle with high fidelity (Figure 7G). Unsurprisingly, tasks involving long time delays are more difficult and exhibit lower cumulative power curves. Consequently, the generalization error for small delay tasks drops much more quickly with increasing 𝒫.
Discussion
Elucidating the inductive biases of the brain is fundamentally important for understanding natural intelligence; however, how to do this using neural data is unknown. In this work, we attempted to fill this gap by examining how the structure of neural population codes shapes inductive biases for learning.
We showed that under the biologically-plausible delta rule, the generalization performance is entirely dependent on the code’s inner product kernel, and proposed the kernel as a determinant of inductive bias. In its finite dimensional form, the kernel is an example of a representational similarity matrix and is a commonly used tool to study neural representations [46, 47, 48, 49, 50, 51]. Our work elucidates a concrete link between this experimentally measurable mathematical object, and sample-efficient learning.
We derived an analytical expression for the generalization error as a function of sample size under very general conditions: an arbitrary stimulus distribution, an arbitrary population code, and an arbitrary target stimulus-response map. We applied our findings to theoretical and experimental analyses of the primary visual cortex, and to temporal codes in a delayed reach task. This generality of our theory is a particular strength.
Our analysis elucidated two principles that define the inductive bias. The first is spectral bias: kernel eigenfunctions with large eigenvalues can be estimated using a smaller number of samples. The second is code-task alignment: target functions with most of their power in the top kernel eigenfunctions can be estimated efficiently and are compatible with a code. The cumulative power distribution, C(k) [21], provides a measure of this alignment. These findings define a notion of "simplicity" bias in learning from examples, and provide an answer to the question of which stimulus-response maps are easier to learn.
A recent proposal considered the possibility that the brain acts as an overparameterized interpolator [52]. Suitable inductive biases are crucial to escape overfitting and generalize well in such a regime [53]. Our theory could explain these inductive biases since, when the kernel is full-rank, the delta rule converges to an interpolator of the learning examples. Modern deep learning architectures also operate in an overparameterized regime, but generalize well [54, 53], and an inductive bias towards simple functions has been proposed as an explanation [20, 21, 55, 56].
Our work suggests sample efficiency as a general coding principle for neural populations, relating neural representations to the kinds of problems they are well suited to solve. These codes may be shaped through evolution or themselves be learned through experience [57]. Prior related work demonstrated the dependence of sample-efficient learning of a two-angle estimation task on the width of the individual neural tuning curves [58] and additive function approximation properties of sparsely connected random networks [59].
A sample efficiency approach to population coding differs from classical efficient coding theories [16, 14, 15, 60, 61, 62, 17, 63], which postulate that populations of neurons optimize the information content of their code subject to metabolic constraints or noise. While these theories emphasize different aspects of the code's information content (such as reduced redundancy, predictive power, or sparsity), they do not address the sample efficiency demands of learning. Further, recent studies demonstrated hallmarks of redundancy and correlation in population responses [45, 24, 64, 30, 65, 66, 11], violating a generic prediction of efficient coding theories that responses of different neurons should be uncorrelated across input stimuli in high signal-to-noise regimes, in order to reduce redundancy in the code and maximize information content [14, 15, 60, 61, 67, 68]. In our theory, the structured correlations of neural responses correspond to the decay in the spectrum of the kernel, and play a key role in biasing learned readouts towards simple functions.
In recent related studies, the asymptotic decay rate of the kernel’s eigenspectrum was argued to be important for generalization [11] and robustness [69]. Decay rate in the mouse visual cortex was found to be consistent with a high dimensional (power law) but smooth (differentiable) code, and smoothness was argued to be an enabler of generalization [11]. We show that sample-efficient learning requires more than smoothness conditions in the form of asymptotic decay rates on the kernel’s spectrum. The interplay between the stimulus distribution, target response and the code gives rise to sample efficient learning. Because of spectral bias, the top eigenvalues govern the small sample size behavior. The tail of the spectrum becomes important for large sample sizes.
Though the kernel is degenerate with respect to rotations of the code in the neural activity space, we demonstrated that the true V1 code has a much lower metabolic cost than random codes with the same kernel, suggesting that evolution and learning may be selecting neural codes with low average spike rates that satisfy the sample-efficiency demands of downstream learning tasks. We predict that metabolic efficiency may be a determinant of the orientation and placement of the ubiquitously observed low-dimensional coding manifolds [44, 66] in the neural activity space of other brain areas. The demand for metabolic efficiency is consistent with prior sparse coding theories [70, 17, 18, 71]; however, our theory emphasizes sample-efficient learning as a normative objective for the code.
Our work focused on the effect of signal correlations on coding and inductive bias [72, 73]. Future analyses could study how signal and noise correlations interact to shape inductive bias and determine generalization.
Methods
Generating example codes (Figure 1)
The two codes in Figure 1 were constructed to produce two different kernels for θ ∈ S¹: a smooth kernel with power concentrated in low frequency modes (Code 1) and a kernel with substantial power in higher frequency modes (Code 2), corresponding to the tuning curves shown in Figure 1B. An infinite number of codes could generate either of these kernels. After diagonalizing each kernel on a grid of 120 points into its eigenvalues λk and eigenfunctions ψk(θ), we used a random rotation matrix Q ∈ O(120) to generate a valid code,

r(θ) = Q Λ^{1/2} ψ(θ),

where Λ is the diagonal matrix of eigenvalues and ψ(θ) is the vector of eigenfunctions evaluated at θ. This construction guarantees that r(θ) · r(θ′) = K(θ, θ′) at the grid points. We plot the tuning curves of the first three neurons. The target function in the first experiment is y = cos(θ) − 0.6 cos(4θ), while the second experiment used y = cos(6θ) − cos(8θ).
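A sketch of this construction (NumPy; the kernel below is an illustrative smooth kernel rather than the exact Figure 1 kernels):

```python
import numpy as np

rng = np.random.default_rng(4)
M = 120
theta = np.linspace(0, 2 * np.pi, M, endpoint=False)

# An illustrative smooth target kernel on the grid (not the exact Figure 1 kernels).
K_target = np.exp((np.cos(np.subtract.outer(theta, theta)) - 1.0) / 0.5)

evals, evecs = np.linalg.eigh(K_target)
evals = np.clip(evals, 0.0, None)                 # guard against tiny negative eigenvalues

Q, _ = np.linalg.qr(rng.standard_normal((M, M)))  # random rotation Q in O(120)
R = Q @ np.diag(np.sqrt(evals)) @ evecs.T         # code: rows are neurons, columns stimuli

print("kernel reconstruction error:", np.abs(R.T @ R - K_target).max())
```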
Learning task and convergence of the delta-rule
Gradient descent training of readout weights w on a finite sample of size P converges to the kernel regression solution [74, 75, 76]. Let 𝒟 = {(θµ, yµ)}, µ = 1, …, P, be the dataset with sample stimuli θµ and target values yµ. We introduce the shorthand rµ = r(θµ) for convenience. The empirical loss we aim to minimize is a sum of the squared losses over the training set,

H(w) = (1/2) Σµ (yµ − w · rµ)².

Performing gradient descent on this loss generates the weight update

Δw = η Σµ (yµ − w · rµ) rµ,

which is merely the delta rule that we discussed in the main text [77, 78]. The dynamics of this rule can be analyzed efficiently through the singular value decomposition of the P-sample response matrix R = [r¹, r², …, r^P] ∈ ℝ^{N×P}. The singular value decomposition

R = Σk ŝk ûk ψ̂k^⊤

allows us to simplify the dynamics and identify the unique fixed point of the delta rule. The singular value decomposition of this random sub-sample matrix R is different from the population singular value decomposition, which is the solution to an integral eigenvalue problem (discussed in the next section). To clarify this, we use hats to denote the singular components of the empirical matrix R. We can expand w and y in the bases defined by {ûk} and {ψ̂k}, respectively, so that w = Σk ŵk ûk and y = Σk ŷk ψ̂k. In this basis, the delta rule dynamics decouple,

ŵk(t + 1) = ŵk(t) + η ŝk (ŷk − ŝk ŵk(t)).

If we initialize the weights at the origin, w = 0, then we can solve these dynamics in closed form,

ŵk(t) = η ŝk ŷk Σ_{t′<t} (1 − η ŝk²)^{t′},

which has the limit

ŵk(t) → ŷk / ŝk as t → ∞,

where we used the convergence of the geometric series Σ_{t≥0} z^t = 1/(1 − z), provided that |z| < 1. The equivalent condition for convergence in this case is |1 − η ŝk²| < 1, which implies 0 < η < 2/ŝk² for all k. These dynamics converge to a unique fixed point

w∗ = Σk (ŷk / ŝk) ûk,

where the sum runs over the modes k with nonzero singular values ŝk. This solution is the minimum norm solution to the linear system R R^⊤ w = R y, which can be written as w∗ = R K⁺ y, where K⁺ is the Moore-Penrose pseudo-inverse of the kernel Gram matrix K = R^⊤ R ∈ ℝ^{P×P}, with entries Kµν = rµ · rν. Using these weights w∗, the learned function at a test point θ is

f(θ) = w∗ · r(θ) = Σµ αµ kµ(θ), with α = K⁺ y,

where kµ(θ) = K(θ, θµ). This solution is known as the kernel regression solution for dataset 𝒟 and kernel K(θ, θ′) = r(θ) · r(θ′)/N [79]. The fact that the optimal solution can always be written as a linear combination of the kernel evaluated at the training points is known as the representer theorem [80, 79].
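A minimal numerical check of this equivalence (synthetic responses and targets; the learning rate is chosen to satisfy the convergence condition above):

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 80, 20
R = rng.standard_normal((N, P))                  # response matrix, columns r^mu
y = rng.standard_normal(P)                       # target values

# Delta rule: w <- w + eta * R (y - R^T w)
eta = 1.0 / np.linalg.norm(R, 2) ** 2            # eta < 2 / s_max^2 guarantees convergence
w = np.zeros(N)
for _ in range(50000):
    w += eta * R @ (y - R.T @ w)

# Kernel regression solution: w* = R K^+ y with K = R^T R
K = R.T @ R
w_star = R @ np.linalg.pinv(K) @ y
print("distance between delta-rule and kernel solution:", np.linalg.norm(w - w_star))
```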
Weight Decay and Ridge Regression
We can introduce a regularization term in our learning problem which penalizes the size of the readout weights. This leads to a modified learning objective of the form

H(w) = (1/2) Σµ (yµ − w · rµ)² + (λ/2) |w|².

Inclusion of this regularization alters the learning rule through weight decay,

w ← (1 − ηλ) w + η Σµ (yµ − w · rµ) rµ,

which multiplies the existing weight value by a factor of 1 − ηλ before adding the data-dependent update. This learning problem and its gradient descent dynamics have the closed form solution

w∗ = R (K + λI)^{−1} y.
The generalization benefits of explicit regularization through weight decay are known to be related to the noise statistics of the learning problem [21]. We simulate weight decay only in Figure 6C, where we use λ = 0.01 Σk λk to improve stability of the solution at large P.
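The closed form can be verified directly against the weight-decay dynamics (sketch with synthetic data and an illustrative λ):

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, lam = 80, 20, 0.1
R = rng.standard_normal((N, P))
y = rng.standard_normal(P)

# Weight decay dynamics: w <- (1 - eta*lam) w + eta * R (y - R^T w)
eta = 1.0 / (np.linalg.norm(R, 2) ** 2 + lam)
w = np.zeros(N)
for _ in range(100000):
    w = (1 - eta * lam) * w + eta * R @ (y - R.T @ w)

w_ridge = R @ np.linalg.solve(R.T @ R + lam * np.eye(P), y)   # w* = R (K + lam I)^{-1} y
print("difference:", np.linalg.norm(w - w_ridge))
```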
Theory of Generalization
Recent work has established analytic results that predict the average case generalization error for kernel regression,

Eg = ⟨Eg(𝒟)⟩𝒟,

where Eg(𝒟) = ⟨(f(θ, 𝒟) − y(θ))²⟩θ is the generalization error for a certain sample 𝒟 of size P and f(θ, 𝒟) is the kernel regression solution for 𝒟, given in (18) [20, 21]. The typical or average case error Eg is obtained by averaging over all possible datasets of size P. This average case generalization error is determined solely by the decomposition of the target function y(θ) along the eigenbasis of the kernel and the eigenspectrum of the kernel. This continuous diagonalization again takes the form [79]

∫ dθ′ p(θ′) K(θ, θ′) ψk(θ′) = λk ψk(θ),   K(θ, θ′) = Σk λk ψk(θ) ψk(θ′).

Our theory is also applicable to discrete stimuli if p(θ) is a Dirac measure (Methods). Since the eigenfunctions form a complete set of square integrable functions [79], we expand both the target function y(θ) and the learned function f(θ) in this basis,

y(θ) = Σk vk ψk(θ),   f(θ) = Σk v̂k ψk(θ).

Due to the orthonormality of the kernel eigenfunctions {ψk}, the generalization error for any set of learned coefficients {v̂k} is

Eg = Σk (v̂k − vk)².

We now introduce the training error, or empirical loss, which depends on the disorder in the dataset 𝒟,

H({v̂k}, 𝒟) = Σµ (y(θµ) − f(θµ))² + λ Σk v̂k²/λk.

It is straightforward to verify that the optimal {v̂k∗} which minimizes H is the kernel regression solution for the kernel with eigenvalues {λk} when λ → 0. Nonzero λ is equivalent to the weight decay discussed in the previous section. The optimal coefficients can be identified through the first order condition ∇H = 0, which gives

v̂∗ = (Ψ Ψ^⊤ + λ Λ^{−1})^{−1} Ψ y,

where Ψk,µ = ψk(θµ) are the eigenfunctions evaluated on the training data and Λk,l = δk,l λk is a diagonal matrix containing the kernel eigenvalues. The generalization error of this optimal solution takes the form of a quadratic form in the target coefficients, Eg(𝒟) = Σk,l vk vl Gk,l(𝒟), for a matrix G(𝒟) that depends on the training inputs only through Ψ (SI). We note that the dependence on the randomly sampled dataset 𝒟 only appears through the matrix G(𝒟). Thus, to compute the typical generalization error, we need to average over this matrix, ⟨G(𝒟)⟩𝒟. There are multiple strategies to perform such an average and we will study one here based on a partial differential equation which was introduced in [81, 82] and studied further in [20].

We describe in detail how such an average can be performed in the SI. After this computation, we find that the generalization error can be written as

Eg = (1/(1 − γ)) Σk vk² κ²/(κ + Pλk)²,

where κ and γ satisfy

κ = λ + Σk λk κ/(κ + Pλk),   γ = Σk Pλk²/(κ + Pλk)²,

giving the desired result. Taking λ → 0 gives the generalization error of the minimum norm interpolant, which describes the generalization error of the solution in (18).
This result was recently reproduced using the replica method from statistical mechanics [20, 21].
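The resulting expression can be evaluated numerically for any spectrum {λk} and target powers {vk²} by solving the implicit equation for κ, for example by fixed-point iteration. The sketch below follows the conventions reconstructed above (it is not the authors' code), using an illustrative power-law spectrum and target:

```python
import numpy as np

def learning_curve(lambdas, v2, Ps, ridge=1e-6, iters=2000):
    """Theoretical E_g(P) for kernel eigenvalues `lambdas` and target powers v2 = v_k^2."""
    errors = []
    for P in Ps:
        kappa = ridge + lambdas.sum()            # initialization of the implicit equation
        for _ in range(iters):                   # fixed-point iteration for kappa
            kappa = ridge + np.sum(lambdas * kappa / (kappa + P * lambdas))
        gamma = np.sum(P * lambdas ** 2 / (kappa + P * lambdas) ** 2)
        Ek = kappa ** 2 / ((kappa + P * lambdas) ** 2 * (1 - gamma))   # normalized mode errors
        errors.append(np.sum(v2 * Ek))
    return np.array(errors)

# Example: power-law spectrum and target, as in the asymptotic analysis below.
k = np.arange(1, 2000)
lambdas = k ** (-2.0)                            # b = 2
v2 = k ** (-2.0)                                 # a = 2
Ps = np.array([10, 30, 100, 300, 1000])
print(learning_curve(lambdas, v2, Ps))
```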
Spectral bias
Through implicit differentiation, it is straightforward to verify that the ordering of the mode errors matches the ordering of the eigenvalues [21]. Let λk > λl. Since the factor 1/(1 − γ) is common to all modes, the ratio of mode errors is Ek/El = (κ + Pλl)²/(κ + Pλk)², and

d/dP log(Ek/El) = 2 (λl − λk)(κ − P dκ/dP) / [(κ + Pλl)(κ + Pλk)].

Since λl < λk, the first bracket must be negative, and the second bracket must be positive because κ > 0 and, as is straightforward to compute, dκ/dP ≤ 0. Therefore, d/dP log(Ek/El) ≤ 0 for all P. Since Ek = El at P = 0, we therefore have log(Ek/El) ≤ 0 for all P and consequently Ek ≤ El. Modes with larger eigenvalues λk have lower normalized mode errors Ek.
Asymptotic power law scaling of learning curves
Exponential Spectral Decays
First, we will study the setting relevant to the von Mises kernel, where the kernel eigenvalues decay exponentially, λk ∼ β^k, and the target power decays as vk² ∼ α^k, with α, β < 1. This exponential behavior accounts for differences in bandwidth between kernels, which modulate the base β of the exponential scaling of λk with k.

We will approximate the sum over all mode errors with an integral,

Eg ≈ (1/(1 − γ)) ∫ dk v(k)² κ²/(κ + Pλ(k))².

If we include a regularization parameter λ, then κ ∼ λ as P → ∞. With this fact, we can approximate the integral at large P by splitting it into the modes k < k∗ = ln(P/λ)/ln(1/β), which are estimated accurately, and the modes k > k∗, which are effectively not learned. The first set of modes contributes a term that scales as P^{−2}, while the second contributes its unlearned power, Σ_{k>k∗} α^k ∼ α^{k∗} ∼ P^{−ln(1/α)/ln(1/β)}, up to P-independent constants. Thus, we obtain a power law scaling of the learning curve Eg, which is dominated at large P by the term with the smaller exponent,

Eg ∼ P^{−min(2, ln(1/α)/ln(1/β))}.

For the von Mises kernel, we can approximate the spectrum with λk ∼ σ^{−2k}, giving rise to a power law generalization scaling whose exponent depends on the kernel bandwidth σ and on the target's decay rate (dashed lines in Figure 6C).
Power Law Spectral Decays
The same arguments can be applied for power law kernels, λk ∼ k^{−b}, and power law targets, vk² ∼ k^{−a}, which are of interest due to their connection to nonlinear rectified neural populations. In this setting, the generalization error scales as

Eg ∼ A P^{−(a−1)/b} + B P^{−2}

for P-independent constants A and B. We see that there are two possible power law scalings for Eg, with exponents (a − 1)/b and 2. At large P, this formula is dominated by the term with the minimum exponent, so Eg ∼ P^{−min(a−1, 2b)/b}.
V1 Model
A Simple Feedforward Model of V1
We consider a simplified but instructive model of the V1 population code as a linear-nonlinear map from photoreceptor responses through Gabor filters and then a nonlinearity [36, 83, 84]. Let x ∈ ℝ² represent the two-dimensional retinotopic position of a photoreceptor. The firing rate of the photoreceptor at position x in response to a static grating stimulus oriented at angle θ is denoted h(x, θ). We model each V1 neuron's receptive field as a Gabor filter of the receptor responses h(x, θ): the i-th V1 neuron has preferred wavevector ki, which defines its set of weights w(θi) between the photoreceptors and the neuron. The V1 population code is obtained by filtering the photoreceptor responses. Approximating the resulting sum over all retinal photoreceptors with an integral, the input current to neuron i in response to a grating stimulus with wavevector k depends only on the stimulus wavevector k and the preferred wavevector ki. The response of neuron i is computed through nonlinear rectification of this input current, ri(θ) = g(w(θi) · h(θ)). For a linear neuron, g(z) = z, the kernel depends only on the difference of the grating orientations, θ − θ′, and we normalize it to have a maximum value of 1. Note that this normalization of the kernel is completely legitimate since it merely rescales each eigenvalue by a constant and does not change the learning curves.
Since the kernel only depends on the difference between angles θ − θ′, it is said to possess translation invariance. Such translation invariant kernels admit a Mercer decomposition in terms of Fourier modes, K(θ) = Σn λn cos(nθ), since the Fourier modes diagonalize shift-invariant integral operators on 𝕊¹. For the linear neuron, the kernel eigenvalues λn decay faster than any power of n, indicating infinite differentiability of the tuning curves. Since λn decays rapidly with n, this Gabor code has an inductive bias that favors low frequency functions of the orientation θ.
Nonlinear Simple Cells
Introducing a nonlinear function g(z) that maps the input currents z of the V1 neurons into firing rates, we obtain a nonlinear kernel Kg(θ) defined as

Kg(θ − θ′) = ⟨g(w(θi) · h(θ)) g(w(θi) · h(θ′))⟩θi.

In this setting, it is convenient to restrict ki, k, k′ ∈ 𝕊¹ and assume that the preferred wavevectors ki are uniformly distributed over the circle. In this case, it suffices to identify a decomposition of the composed function g(wi · h(θ)) in the basis of Chebyshev polynomials Tn(z), which satisfy Tn(cos(θ)) = cos(nθ),

g(wi · h(θ)) = Σn an Tn(cos(θ − θi)) = Σn an cos(n(θ − θi)),

which can be computed efficiently with an appropriate quadrature scheme. Once the coefficients an are determined, we can compute the kernel by first letting θi be the angle between k and ki and letting θ be the angle between k and k′. Averaging over the uniformly distributed preferred orientations θi gives a translation invariant kernel of the form Kg(θ) ∝ Σn an² cos(nθ). Thus the kernel eigenvalues are λn ∝ an².
Asymptotic scaling of spectra
Activation functions that encourage sparsity have slower eigenvalue decays. If the nonlinear f-I activation function has the form gq,a(z) = max{0, z − a}^q, then the spectrum decays like λn ∼ n^{−2q−2}. A simple argument justifies this scaling: if the function is only q − 1 times continuously differentiable, then its coefficients cannot decay faster than an ∼ n^{−(q+1)}, since Σn n^{2(q+1)} an², which would be finite if the (q + 1)-th derivative were square integrable, must diverge. Therefore λn = an² ∼ n^{−2q−2}. Note that this scaling is independent of the threshold a.
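A sketch of this computation: Gauss-Chebyshev quadrature for the coefficients an of the composed tuning curve (assuming, as in the unit-wavevector case above, that the input current is cos(θ − θi), so the composed function is g itself), followed by an approximate check of the predicted decay λn ∼ n^{−2q−2}. Parameter values are illustrative:

```python
import numpy as np

def chebyshev_coeffs(g, n_max, n_quad=4000):
    """Coefficients a_n of g(z) = sum_n a_n T_n(z) via Gauss-Chebyshev quadrature."""
    j = np.arange(n_quad)
    z = np.cos(np.pi * (j + 0.5) / n_quad)       # Chebyshev nodes
    gz = g(z)
    n = np.arange(n_max + 1)
    a = 2.0 / n_quad * np.cos(np.pi * np.outer(n, (j + 0.5)) / n_quad) @ gz
    a[0] /= 2.0                                  # a_0 carries half the usual weight
    return a

q, thresh = 2, 0.5
g = lambda z: np.maximum(0.0, z - thresh) ** q   # rectified power-law nonlinearity
a = chebyshev_coeffs(g, 200)
lam = a ** 2                                     # kernel eigenvalues, up to normalization

# Approximate check of the decay exponent by comparing averages over two index ranges.
n_lo, n_hi = np.arange(25, 50), np.arange(100, 200)
ratio = lam[n_lo].mean() / lam[n_hi].mean()
scale = n_hi.mean() / n_lo.mean()
print("measured decay exponent ~", np.log(ratio) / np.log(scale), "(predicted:", 2 * q + 2, ")")
```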
Phase Variation, Complex Cells and Invariance
We can consider a slightly more complicated model in which the Gabor filters and the grating stimuli carry phase shifts: the stimulus has a phase φ in addition to its orientation θ, and each Gabor filter has a preferred phase φi in addition to its preferred wavevector ki. The simple cells are generated by passing the filtered input through the nonlinearity, ri = g(wi · h). The input currents wi · h into the simple V1 cells can be computed exactly. When |k| = |ki| = 1, the simple cell tuning curves ri = g(wi · h) only depend on cos(θ − θi) and the phase φ, allowing a Fourier decomposition over the orientation. The simple cell kernel Ks therefore decomposes into Fourier modes over θ,

Ks(θ, φ; θ′, φ′) = Σn bn(φ, φ′) cos(n(θ − θ′)),

where the coefficients bn(φ, φ′) depend on the two stimulus phases. It therefore suffices to solve the infinite sequence of integral eigenvalue problems over φ,

∫ dφ′ p(φ′) bn(φ, φ′) vn,k(φ′) = λn,k vn,k(φ).

With this choice, it is straightforward to verify that the kernel eigenfunctions are e^{inθ} vn,k(φ) with corresponding eigenvalues λn,k. Since bn is not translation invariant in φ − φ′, the eigenfunctions vn,k are not necessarily Fourier modes. These eigenvalue problems for bn must be solved numerically for an arbitrary nonlinearity g. The top eigenfunctions of the simple cell kernel depend heavily on the phases of the two grating stimuli. Thus, a pure orientation discrimination task, which is independent of phase, requires a large number of samples to learn with the simple cell population.
Complex Cell Populations are Phase Invariant
V1 also contains complex cells which possess invariance to the phase φ of the stimulus. Again using Gabor filters
we model the complex cell responses with a quadratic nonlinearity and sum over two squared filters which are phase shifted by π/2
which we see is independent of the phase φ of the grating stimulus. Integrating over the set of possible Gabor filters (ki, φi) again gives the following kernel for the complex cells
Remarkably, this kernel is independent of the phase φ of the grating stimulus. Thus, complex cell populations possess a good inductive bias for vision tasks where the target function depends only on the orientation of the stimulus rather than its phase. In reality, V1 contains a mixture of simple and complex cells. Let s ∈ [0, 1] represent the relative proportion of neurons that are simple cells and (1 − s) the relative proportion of complex cells. The kernel for the mixed V1 population is a simple convex combination of the simple and complex cell kernels
where n denotes the neuron identity (simple vs. complex, tuning, etc.) and PV1(n), ps(n), pc(n) are probability distributions over the V1 neuron identities, the simple cell identities, and the complex cell identities, respectively. Increasing s increases the phase dependence of the code by giving greater weight to the simple cell population; decreasing s gives greater weight to the complex cell population, encouraging phase invariance of the readout.
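A minimal sketch of this convex mixture is given below; the two kernels are illustrative positive semi-definite placeholders rather than the exact Gabor-derived simple and complex cell kernels.

import numpy as np

# Minimal sketch of the mixed-population kernel K_V1 = s*K_s + (1 - s)*K_c.
# Both kernels are placeholders: the "simple cell" kernel depends on orientation
# and phase differences, the "complex cell" kernel on orientation only.
def k_simple(dtheta, dphi):
    return np.exp(np.cos(dtheta) - 1.0) * np.exp(np.cos(dphi) - 1.0)

def k_complex(dtheta, dphi):
    return np.exp(np.cos(2.0 * dtheta) - 1.0)        # independent of phase

def k_v1(dtheta, dphi, s):
    return s * k_simple(dtheta, dphi) + (1.0 - s) * k_complex(dtheta, dphi)

# s -> 0: the kernel ignores phase, biasing readouts toward phase-invariant
# functions; s -> 1: strong phase dependence.
print(k_v1(0.0, np.pi, s=0.0), k_v1(0.0, np.pi, s=1.0))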
Time-Dependent Neural Codes
In this setting, the population code r({θ(t)}, t) is a function of an input stimulus sequence θ(t) and time t. In general, the neural code r at time t can depend on the entire history of the stimulus input θ(t′) for t′ ≤ t, as is the case for recurrent neural networks. We denote dependence of a function f on θ(t) in this causal manner with the notation f({θ}, t). In a learning task, a set of readout weights w is chosen so that a downstream linear readout f({θ}, t) = w · r({θ}, t) approximates a target sequence y({θ}, t), which maps input stimulus sequences to output scalar sequences. The quantity of interest is the generalization error Eg, which in this case is an average over both input sequences and time, Eg = ⟨ (y({θ}, t) − f ({θ}, t))2⟩θ(t),t. The average is computed over a distribution of input stimulus sequences p(θ(t)). To train the readout w, the network is given a sample of P stimulus sequences θµ(t), µ = 1,…, P. For the µ-th training input sequence, the target system y is evaluated at a set of discrete time points, giving a collection of target values
and a total dataset of size
. The average-case generalization error is obtained by further averaging Eg over randomly sampled datasets of size 𝒫.
Learning is again achieved through iterated weight updates of delta-rule form, but the updates now have contributions from both the sequence index and time. As before, optimization of the readout weights is equivalent to kernel regression with a kernel that computes inner products of neural population vectors at different times t, t′ for different input sequences
. This kernel depends on details of the time varying population code including its recurrent intrinsic dynamics as well as its encoding of the time-varying input stimuli. The optimization problem and delta rule described above converge to the kernel regression solution for kernel gram matrix
[38, 39, 40]. The learned function has the form
, where α = K+y for the kernel Gram matrix K ∈ ℝ 𝒫 × 𝒫 computed over the entire set of training sequences, and y ∈ ℝ 𝒫 is the vector of desired target outputs. Assuming a probability distribution over sequences θ(t), the kernel can be diagonalized with orthonormal eigenfunctions ψk({θ}, t). Our theory carries over from the static case: kernels whose top eigenfunctions have high alignment with the target dynamical system y({θ}, t) will achieve the best average-case generalization performance.
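The following sketch illustrates this readout on a placeholder (randomly generated) time-varying code: the Gram matrix is built from population vectors at sampled (sequence, time) points, α = K+y is computed with a pseudoinverse, and predictions are made at held-out points.

import numpy as np

rng = np.random.default_rng(0)

# Sketch of the time-dependent readout. R_train holds population vectors
# r({theta}, t) at each sampled (sequence, time) pair (placeholders here);
# y_train holds the corresponding target values.
N, P_train, P_test = 200, 50, 20
R_train = rng.normal(size=(P_train, N))
R_test = rng.normal(size=(P_test, N))
y_train = rng.normal(size=P_train)

K = R_train @ R_train.T / N                    # Gram matrix of the code
alpha = np.linalg.pinv(K) @ y_train            # alpha = K^+ y
k_test = R_test @ R_train.T / N                # kernel between test and train points
f_test = k_test @ alpha                        # learned readout at held-out points
w = R_train.T @ alpha / N                      # equivalent readout weights: f = w . r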
RNN Experiment
For the simulations in Figure 7 we integrated a rate-based recurrent network model with N = 6000 neurons, time constant τ = 0.05 and gain g = 1.5. Each of the P = 80 randomly chosen angles γµ generates a trajectory over T = 100 equally spaced points in t ∈ [0, 3]. The two dimensional input sequence is simply θ(t) = H(t)H(1 − t) [cos(γµ), sin(γµ)]T ∈ ℝ2. The target function for a delay d is y(θµ, t) = H(1.5 + d − t)H(t − d − 1)[cos(γµ), sin(γµ)]T, which is nonzero for times t ∈ [1 + d, 1.5 + d]. In each simulation, the activity in the network is initialized to u(0) = 0. The kernel Gram matrix K ∈ ℝPT ×PT is computed by taking inner products of the time-varying code for different inputs γµ and at different times. Learning curves represent the generalization error obtained by randomly sampling 𝒫 time points from the PT total time points generated in the simulation and training readout weights w to convergence with gradient descent.
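A reduced sketch of such a simulation is given below. It assumes standard rate dynamics τ du/dt = −u + J tanh(u) + Win θ(t) with recurrent weights drawn at scale g/√N, which is a common convention but not necessarily the exact equations used for Figure 7, and uses a smaller network for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Sketch of the rate-RNN simulation (assumed dynamics, reduced network size).
N, tau, gain = 600, 0.05, 1.5
T, dt = 3.0, 0.003
steps = int(T / dt)
J = rng.normal(scale=gain / np.sqrt(N), size=(N, N))    # recurrent weights
W_in = rng.normal(scale=1.0, size=(N, 2))               # input weights

def simulate(gamma):
    u = np.zeros(N)
    rates = []
    for i in range(steps):
        t = i * dt
        # input pulse theta(t) = H(t) H(1 - t) [cos(gamma), sin(gamma)]
        inp = np.array([np.cos(gamma), np.sin(gamma)]) * (0.0 <= t <= 1.0)
        u = u + dt / tau * (-u + J @ np.tanh(u) + W_in @ inp)
        rates.append(np.tanh(u))
    return np.array(rates)                              # (steps, N) trajectory

r_traj = simulate(gamma=0.7)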
Data Analysis
Data source and processing
Mouse V1 neuron responses to orientation gratings were obtained from a publicly available dataset [8, 9]. Two-photon calcium microscopy fluorescence traces were deconvolved into spike trains and spikes were counted for each stimulus, as described in [8]. The presented grating angles were distributed uniformly over [0, 2π] radians. Data pre-processing, which included z-scoring against the mean and standard deviation of null stimulus responses, utilized the provided code for this experiment, which is also publicly available at https://github.com/MouseLand/stringer-et-al-2019. This preprocessing was used in all figures in the paper. To reduce corruption of the estimated kernel by neural noise (trial-to-trial variability), we first trial-averaged responses, binning the grating stimuli oriented at different angles θ into 100 bins over the interval [0, 2π] and averaging all of the available responses within each bin. Since grating angles were sampled uniformly, each bin contains roughly 45 responses. After trial averaging, SVD was performed on the response matrix R, generating the eigenspectrum and kernel eigenfunctions illustrated in Figure 4. Figures 2, 3, and 4 all use this dataset wherever responses to grating stimuli are mentioned.
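The binning and SVD step can be summarized by the following sketch, in which the response and angle arrays are placeholders standing in for the deconvolved, z-scored data.

import numpy as np

# Sketch of trial averaging and SVD. `responses` is (n_trials, n_neurons) and
# `angles` gives the grating orientation of each trial (placeholders below).
def trial_average(responses, angles, n_bins=100):
    edges = np.linspace(0, 2 * np.pi, n_bins + 1)
    idx = np.digitize(angles, edges) - 1
    R = np.stack([responses[idx == b].mean(axis=0) for b in range(n_bins)])
    return R                                    # (n_bins, n_neurons) trial-averaged code

rng = np.random.default_rng(0)
responses = rng.normal(size=(4500, 500))        # placeholder data
angles = rng.uniform(0, 2 * np.pi, size=4500)

R = trial_average(responses, angles)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
eigenspectrum = s ** 2 / R.shape[0]             # kernel eigenvalues (up to normalization)
eigenfunctions = U                              # eigenfunctions sampled on the 100 bins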
In Figures 3C and 4D, the responses of mouse V1 neurons to ImageNet images were obtained from a different publicly available dataset [10, 11]. Again, spike counts were obtained from deconvolved and z-scored calcium fluorescence traces. Each of the presented images belongs to one of 15 relevant ImageNet categories, including the mice and bird categories displayed in Figure 4D. The preprocessing code and image category information were obtained from the publicly available code base at https://github.com/MouseLand/stringer-pachitariu-et-al-2018b.
Generating alternative codes
In Figure 3, the randomly rotated codes are generated by sampling a matrix Q from the Haar measure on the set of N-by-N orthogonal matrices and choosing δ by solving the following optimization problem:
which minimizes the total spike count subject to the kernel and nonnegativity of firing rates. The solution to this problem is given by
.
Comparing Sparsity of Population Codes
To explore the metabolic cost among the set of codes with the same inductive biases, we estimate the distribution of average spike counts of codes with the same inner product kernel as the biological code. These codes are generated in the form sµ = Qrµ +δ where δ solves the optimization problem
To quantify the distribution of such codes, we randomly sample Q from the Haar measure on O(N) and compute the optimal δ as described above. This generates the aqua colored distribution in Figures 3B and 3C.
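A minimal sketch of this procedure is given below. It assumes that the optimal offset has the closed form δi = −minµ (Qrµ)i, the smallest per-neuron shift that keeps all rotated responses non-negative, which is one natural reading of the optimization problem above.

import numpy as np
from scipy.stats import ortho_group

# Sketch of generating rotated codes s^mu = Q r^mu + delta. Q preserves the
# pairwise inner products across neurons; delta (assumed closed form) enforces
# non-negativity at minimal total rate.
def rotated_code(R, seed=0):
    """R: (n_neurons, n_stimuli) non-negative response matrix (placeholder)."""
    N = R.shape[0]
    Q = ortho_group.rvs(N, random_state=seed)       # Haar-distributed rotation
    QR = Q @ R
    delta = -QR.min(axis=1, keepdims=True)          # assumed optimal offset
    return QR + delta

rng = np.random.default_rng(0)
R = rng.random(size=(100, 200))                     # placeholder "biological" code
S = rotated_code(R)
print(R.mean(), S.mean())                           # compare average spike counts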
We also attempt to characterize the most efficient code with the same inner product kernel
Since this optimization problem is non-convex in Q, there is no theoretical guarantee that minima are unique. Nonetheless, we attempt to optimize the code by starting Q at the identity matrix and conducting gradient descent in the tangent space so(N). Such updates take the form
where exp(·) is the matrix exponential. To make the loss function differentiable, we incorporate the non-negativity constraint with a soft-minimum:
where Z = Σν exp(−β aν) is a normalizing constant and Q = [q1, …, qN]. In the β → ∞ limit, this cost function converges to the exact optimization problem with the non-negativity constraint; finite β, however, allows learning with gradient descent. Gradients are computed with automatic differentiation in JAX [85]. This optimization routine is run until convergence, and the optimal value is plotted as the dashed red lines labeled “optimal” in Figure 3.
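A minimal JAX sketch of this optimization is given below. It assumes the soft-minimum enters through a per-neuron offset δ and uses the update Q ← exp(−ηA)Q with A = GQT − QGT, a standard retraction for gradient descent on the orthogonal group; the paper's exact update and cost are not reproduced, and N is kept small for illustration.

import jax
import jax.numpy as jnp
from jax.scipy.linalg import expm

def soft_min(x, beta, axis):
    # softmin(x) = sum_nu x_nu exp(-beta x_nu) / Z with Z = sum_nu exp(-beta x_nu);
    # converges to min(x) as beta -> infinity
    w = jax.nn.softmax(-beta * x, axis=axis)
    return jnp.sum(w * x, axis=axis, keepdims=True)

def loss(Q, R, beta=20.0):
    S = Q @ R
    delta = -soft_min(S, beta, axis=1)          # assumed differentiable surrogate for -min_mu
    return jnp.sum(S + delta)                   # total spike count of the shifted code

@jax.jit
def step(Q, R, lr=1e-2):
    G = jax.grad(loss)(Q, R)                    # Euclidean gradient dL/dQ
    A = G @ Q.T - Q @ G.T                       # project onto the Lie algebra so(N)
    return expm(-lr * A) @ Q                    # stay on the orthogonal group

key = jax.random.PRNGKey(0)
R = jax.random.uniform(key, (50, 200))          # placeholder code with N = 50 neurons
Q = jnp.eye(50)
for _ in range(200):
    Q = step(Q, R)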
We show that our result is robust to different pre-processing techniques and to imposing bounds on neural firing rates in Figure SI.1. To demonstrate that our result is not an artifact of z-scoring the deconvolved signals against the spontaneous baseline activity level, we also conducted the random rotation experiment on the raw deconvolved signals. In addition, we show that imposing realistic constraints on the upper bound of each neuron's responses does not change our findings. We used a subset of N = 100 neurons and computed random rotations, but only accepted a code as valid if its maximum value was less than an upper bound ub. Subsets of N = 100 neurons in the biological code achieve maxima in the range 3.2 to 4.7. We performed this experiment for ub ∈ {3, 4, 5} so that the artificial codes would have maxima in the same range as the biological code.
Fitting a Gabor model to mouse V1 kernel
Under the assumption of translation symmetry in the kernel K(θ, θ′), we averaged the elements of the empirical mouse V1 kernel [9] along its rows
where angular addition is taken mod π. This generates the black dots in Figure 5B. We aimed to fit a threshold power-law nonlinearity of the form gq,a(z) = max{0, z − a}q to the kernel. Based on the Gabor model discussed above, we parameterized the tuning curves as
where θi is the preferred angle of the i-th neuron’s tuning curve. Rather than attempting to perform a fit of
of this form to the responses of each of the ∼20,000 neurons, we instead simply fit the population kernel by optimizing over (s, a, q). However, two of these variables, s and a, are constrained by the sparsity level of the code. If each neuron, on average, fires for only a fraction f of the uniformly sampled angles θ, then the following relationship holds between s and a
Calculation of the coding level f for the recorded responses allowed us to infer a from s during optimization. This reduced the free parameter set to (s, q). We then solve the following optimization problem
where the integration over θi is performed numerically. Using the SciPy trust-region constrained optimization routine, we found (q, s, a) = (1.7, 5.0, 0.2), which we use as the fit parameters in Figure 5.
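A sketch of this fitting procedure is given below. The tuning-curve form max{0, s cos(θ − θi) − a}q, the relation a = s cos(πf) between threshold and coding level, and the target kernel are all assumptions for illustration; they follow the structure described above but are not guaranteed to match the paper's exact parameterization.

import numpy as np
from scipy.optimize import minimize

# Sketch of the kernel fit, under assumed tuning curves
# r_i(theta) = max(0, s*cos(theta - theta_i) - a)^q and a inferred from the
# coding level f via a = s*cos(pi*f). K_target is a placeholder for the
# row-averaged empirical V1 kernel.
dtheta = np.linspace(0, 2 * np.pi, 100, endpoint=False)    # kernel argument theta - theta'
theta_i = np.linspace(0, 2 * np.pi, 400, endpoint=False)   # preferred angles (numerical integral)
f_coding = 0.3                                             # placeholder coding level
K_target = np.exp(2.0 * (np.cos(dtheta) - 1.0))            # placeholder empirical kernel

def model_kernel(s, q):
    a = s * np.cos(np.pi * f_coding)
    r0 = np.maximum(0.0, s * np.cos(theta_i) - a) ** q                 # responses at theta = 0
    rd = np.maximum(0.0, s * np.cos(dtheta[:, None] - theta_i) - a) ** q
    K = (rd * r0).mean(axis=1)                                         # average over theta_i
    return K / K.max()                                                 # normalize to max 1

def objective(params):
    s, q = params
    return np.sum((model_kernel(s, q) - K_target) ** 2)

res = minimize(objective, x0=[2.0, 2.0], method="trust-constr",
               bounds=[(0.1, 10.0), (0.5, 5.0)])
print(res.x)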
Supplementary Information
Singular Value Decomposition of Continuous Population Responses
SVD of population responses is usually evaluated with respect to a discrete and finite set of stimuli. In the main paper, we implicitly assumed a generalization of the SVD to a continuum of stimuli. In this section we provide an explicit construction of this generalized SVD using techniques from functional analysis. Our construction is an example of the quasimatrix SVD defined in [86] and justifies our use of the SVD in Figure 4.
For our construction, we note that Mercer's theorem guarantees the existence of an eigendecomposition of any inner product kernel K(θ, θ′) in terms of a complete orthonormal set of functions [79]. In particular, there exists a summable set of non-negative (possibly zero) eigenvalues
and a corresponding set of orthonormal eigenfunctions such that
For a stimulus distribution p(θ), the set of functions
are orthonormal and form a complete basis for square integrable functions L2 which means
Next, we use this basis to construct the SVD. Each of the tuning curves ri can be expressed in this basis with the top N of the functions in the set
where we introduced a matrix A ∈ ℝN ×N of expansion coefficients. Note that rank(A) ≤ N. We compute the singular value decomposition of the finite matrix A
We note that the signal correlation matrix for this population code can be computed in closed form
due to the orthonormality of {ψk}. Thus the principal axes uk of the neural correlations are the left singular vectors of A.
We may similarly express the inner product kernel in terms of the eigenfunctions
The kernel eigenvalue problem demands [79]
The vk vectors must be identical to ±ek, the Cartesian unit vectors, if the eigenvalues are non-degenerate. From this exercise, we find that the SVD for A has the form
. With this choice, the population code admits a singular value decomposition
This singular value decomposition demonstrates the connection between neural manifold structure (the principal axes uk) and function approximation (the kernel eigenfunctions ψk). It can be verified by computing the inner product kernel and the correlation matrix, utilizing the orthonormality of {uk} and {ψk}.
This exercise has important consequences for the space of learnable functions, which is at most rank(A)-dimensional since linear readouts lie in the span of the tuning curves {ri}.
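This construction can be checked numerically. The sketch below builds tuning curves from a random coefficient matrix A and an orthonormal Fourier basis on a grid, and verifies that the eigenvalues of the signal correlation matrix are the squared singular values of A.

import numpy as np

# Numerical illustration: tuning curves built from an orthonormal function basis
# and a coefficient matrix A. The principal axes of the signal correlation matrix
# coincide with the left singular vectors of A, as claimed.
M, N = 512, 20
theta = np.linspace(0, 2 * np.pi, M, endpoint=False)
ks = np.arange(1, N + 1)
Phi = np.sqrt(2) * np.cos(np.outer(ks, theta))        # rows orthonormal under the uniform measure
assert np.allclose(Phi @ Phi.T / M, np.eye(N))

rng = np.random.default_rng(0)
A = rng.normal(size=(N, N))
R = A @ Phi                                           # tuning curves r_i(theta) on the grid

Sigma_s = R @ R.T / M                                 # signal correlation matrix = A A^T
U, svals, Vt = np.linalg.svd(A)
evals = np.linalg.eigvalsh(Sigma_s)
print(np.allclose(np.sort(evals), np.sort(svals ** 2)))   # True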
Discrete Stimulus Spaces: Finding Eigenfunctions with Matrix Eigendecomposition
In our discussion so far, our notation suggested that θ takes a continuum of values. Here we point out that our theory still applies if θ takes a discrete set of values. In this case, we can think of a Dirac measure , where i indexes all the
values θ can take. With this choice
Demanding this equality for
generates a matrix eigenvalue problem
where Bij = δijpi. The eigenfunctions over the stimuli are identified as the columns of Ψ, while the eigenvalues are the diagonal elements Λkl = λkδkl.
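A small numerical sketch of this discrete eigenvalue problem is given below, using a placeholder kernel and a uniform discrete measure; the symmetrized form B1/2KB1/2 is used so that a standard symmetric eigensolver applies.

import numpy as np

# Sketch of the discrete-stimulus eigenvalue problem K B Psi = Psi Lambda
# with B = diag(p_i). Kernel and probabilities are placeholders.
M = 50
theta = np.linspace(0, 2 * np.pi, M, endpoint=False)
p = np.ones(M) / M                                     # discrete stimulus probabilities
K = np.exp(np.cos(theta[:, None] - theta[None, :]) - 1)

B = np.diag(p)
sqB = np.diag(np.sqrt(p))
# K B is similar to the symmetric matrix B^(1/2) K B^(1/2), so eigh applies
lam, V = np.linalg.eigh(sqB @ K @ sqB)
Psi = np.diag(1.0 / np.sqrt(p)) @ V                    # eigenfunctions on the stimuli

assert np.allclose(K @ B @ Psi, Psi * lam, atol=1e-8)  # K B Psi = Psi Lambda
assert np.allclose(Psi.T @ B @ Psi, np.eye(M), atol=1e-8)  # orthonormal under p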
Experimental considerations
In an experimental setting, a finite number of stimuli are presented and the SVD is calculated over this finite set regardless of the support of p(θ). This raises the question of the interpretation of this SVD and its relation to the inductive bias theory we presented. Here we provide two interpretations.
In the first interpretation, we think of the empirical SVD as providing an estimate of the SVD over the full distribution p(θ). To formalize this notion, we can introduce a Monte-Carlo estimate of the integral eigenvalue problem
For this interpretation to work, the experimenter must sample the stimuli from p(θ), which could be the natural stimulus distribution. Measuring responses to a larger number of stimuli gives a more accurate approximation of the integral above, which will provide a better estimate of generalization performance on the true distribution p(θ).
In the second interpretation, we construct an empirical measure on experimental stimulus values
, and consider learning and generalization over this distribution.
This allows the application of our theory to an experimental setting where the stimulus distribution is designed by an experimenter. For example, the experimenter could procure a complicated set of
videos, for which an associated function y(θ) must be learned.
videos. Our theory would predict generalization over this distribution after providing supervisory feedback for only a strict subset of
videos. Under this interpretation, the relationship between the integral eigenvalue problem and matrix eigenvalue problem is exact rather than approximate
Demanding that either of the equalities (SI.11) or (SI.12) hold for θ′ = θν, ν = 1, …, P generates a matrix eigenvalue problem
The eigenfunctions restricted to {θµ} are identified as the columns of Ψ, while the eigenvalues are the diagonal elements Λkl = λkδkl. For finite N and P, the nonzero spectrum obtained through eigendecomposition of the kernel Gram matrix K coincides with that of the finite-N signal correlation matrix Σs, since the two are the inner and outer products of the trial-averaged population response matrix R.
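This equality of nonzero spectra is easy to verify numerically, as in the following sketch with a placeholder response matrix.

import numpy as np

# Check that K = R R^T and Sigma_s = R^T R share their nonzero spectrum
# (R is a placeholder trial-averaged response matrix, stimuli x neurons).
rng = np.random.default_rng(1)
P, N = 40, 100
R = rng.random(size=(P, N))

eig_K = np.linalg.eigvalsh(R @ R.T)          # P eigenvalues
eig_S = np.linalg.eigvalsh(R.T @ R)          # N eigenvalues, N - P of them ~ 0
print(np.allclose(np.sort(eig_K)[::-1], np.sort(eig_S)[::-1][:P]))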
Generalization in Kernel Regression
Recent work has established analytic results that predict the average case generalization error for kernel regression
where
is the generalization error for a particular dataset 𝒟 of size P and f(θ, 𝒟) is the kernel regression solution for 𝒟 [20, 21]. The typical, or average-case, error Eg is obtained by averaging over all possible datasets of size P. This average-case generalization error is determined solely by the decomposition of the target function y(θ) along the eigenbasis of the kernel and by the eigenspectrum of the kernel. This diagonalization takes the form
Since the eigenfunctions form a complete set of square integrable functions, we expand both the target function y(θ) and the learned function f (θ) in this basis
Due to the orthonormality of the kernel eigenfunctions {ψk}, the generalization error for any set of coefficients w is
We now introduce training error, or empirical loss, which depends on the disorder in the dataset
It is straightforward to verify that the optimal w* which minimizes H(w, 𝒟) is the kernel regression solution for the kernel with eigenvalues {λk} in the limit λ → 0. The optimal weights can be identified through the first-order condition ∇H(w, 𝒟) = 0, which gives
where Ψk,µ = ψk(θµ) are the eigenfunctions evaluated on the training data and Λkl = δklλk is a diagonal matrix containing the kernel eigenvalues. The generalization error for this optimal solution is
We note that the dependence on the randomly sampled dataset 𝒟 appears only through the matrix G(𝒟). Thus, to compute the typical generalization error, we need to average this matrix, ⟨G(𝒟)⟩ 𝒟. There are multiple strategies to perform such an average; here we study one based on a partial differential equation, introduced in [81, 82] and studied further in [20, 21]. In this setting, we denote the average matrix G(P) = ⟨G(𝒟)⟩| 𝒟|=P for a dataset of size P. We will first derive a recursion relation using the Sherman-Morrison formula for a rank-1 update to an inverse matrix. We imagine adding a new sampled feature vector ψ to a dataset of size P. The average matrix G(P + 1) at P + 1 samples can be related to G(P) through the Sherman-Morrison rule
where in the last step we approximated the average of the ratio by the ratio of the averages. This step is not rigorously justified, but it has been shown to produce accurate learning curves [20, 82]. Since the chosen basis of kernel eigenfunctions is orthonormal, the average over the new sample is trivial, ⟨ψψT⟩ = I. We thus arrive at the following recursion relation for G
By introducing an additional source J so that
, we can relate G(𝒟, J)’s first and second moments through differentiation
Thus the recursion relation simplifies to
where we approximated the finite difference in P as a derivative, treating P as a continuous variable. Taking the trace of both sides and defining κ(P, J) = λ + TrG(P, J) we arrive at the following quasilinear PDE
with the initial condition κ(0, J) = λ + Tr(Λ−1 + J I)−1. Using the method of characteristics, we arrive at the solution
. Using this solution to κ, we can identify the solution to G
The generalization error can therefore be written as
where
, giving the desired result. Note that κ depends on J implicitly, which is the source of the
factor. This result was recently reproduced using techniques from statistical mechanics [20, 21].
Translation Invariant Kernels
For the special case where the data distribution is uniform over volume V and the kernel is translation invariant K(θ, θ′) = κ(θ − θ′), the kernel can be diagonalized in the basis of plane waves
The eigenvalues are the Fourier components of the kernel
while the eigenfunctions are plane waves ψk(θ) = eik·θ. The set of admissible momenta Sk = {k0, ±k1, ±k2,…} is determined by the boundary conditions. The diagonalized representation of the kernel is therefore
For example, if the space is the torus 𝕋n = S1 × S1 × … × S1, then the set of admissible momenta is the integer lattice Sk = ℤn = {k ∈ ℝn | ki ∈ ℤ ∀ i = 1, …, n}. Reality and symmetry of the kernel demand that Im(λk) = 0 and λ−k = λk ≥ 0. Most of the models in this paper consider θ ∼ Unif(S1), where the kernel has the following Fourier/Mercer decomposition
where we invoked the simple trigonometric identity cos(a − b) = cos(a) cos(b) + sin(a) sin(b). By recognizing that
form a complete orthonormal set of functions with respect to Unif (S1), we have identified this as the collection of kernel eigenfunctions.
Visualization of Feedforward Gabor V1 Model and Induced Kernels
Examples of the induced kernels for the Gabor-bank V1 model are provided in Figure SI.2. We show how the choice of rectifying nonlinearity g(z) and sparsifying threshold a influences the kernels and their spectra. Learning curves for simple orientation tasks are also provided.
Laplace Kernel Generalization
We repeat the same exercise as in Figure 6 with Laplace kernels to show that our results are not an artifact of the infinite differentiability of the Von Mises kernel. Each of these Laplace kernels has the same asymptotic power-law spectrum, λk ∼ k−2, reflecting their discontinuous first derivative (Figure SI.3A). Despite having the same spectral scaling at large k, these kernels can give dramatically different performance on learning tasks, again indicating the influence of the top eigenvalues on generalization at small P (Figure SI.3). Again, the trend for which kernels perform best at low P can be reversed at large P. In this case, all generalization errors scale as Eg ∼ P−2 (Figure SI.3B). More generally, our theory shows that if the task power spectrum and the kernel eigenspectrum both fall as power laws with exponents a and b respectively, then the generalization error asymptotically falls as a power law, Eg ∼ P−min(a−1,2b)/b (Methods) [20]. This decay is fastest when a − 1 ≥ 2b, for which Eg ∼ P−2. Therefore, the tail of the kernel's eigenvalue spectrum determines the large-sample-size behavior of the generalization error for power-law kernels, while the small-sample-size limit is still governed by the bulk of the spectrum.
Supplementary Figures
Figure SI.1: Our metabolic efficiency finding is robust to different pre-processing techniques and to upper bounds on neural firing. A We show the same result as in Figure 3, except using raw (non-z-scored) estimates of the responses to each stimulus. B Our result is robust to the imposition of firing-rate upper bounds ub on each neuron. This panel uses the z-scored responses to be consistent with the rest of the paper. The biological code achieves maximum z-scored values in the range [3.2, 4.7], which motivated the tested upper-bound values {3, 4, 5}. C Our finding is robust to the number of sampled stimuli P, as shown in an experiment where rotations are performed in an N = 500 dimensional subspace.
Figure SI.2: Nonlinear rectification and the proportion of simple and complex cells influence the inductive bias of the population code. A The choice of nonlinearity influences the kernel and its spectrum. If the nonlinearity is g(z) = max(0, z)q, then λk ∼ k−2q−2. B Sparsity can be increased by shifting the nonlinearity, g(z) → g(z − a). Sparser codes have higher dimensionality. Note that a = 0 is a special case where the neurons behave in the linear regime for all inputs θ, since the currents w · h are positive; for a = 0 the spectrum decays like a Bessel function, λk = Ik(β). C-D Easy and hard orientation discrimination tasks with varying polynomial order q of the nonlinearity. At low sample sizes, large q performs better, whereas at large P the step-function nonlinearity q = 0 achieves the best performance.
Figure SI.3: A, B Kernel regression experiments with Laplace kernels of varying bandwidth on a non-differentiable target function. The top eigenvalues are modified by changing the bandwidth, but the asymptotic power-law scaling is preserved. Generalization at low P is shown in the contour plot, while the large-P scaling is shown in the right panels. In A-right and B-right, the color code is the same as in Figure 6C.
Acknowledgements
We thank C. Stringer, M. Pachitariu, M. Michaelos, N. Steinmetz, M. Carandini, and K. D. Harris for publicly sharing their datasets. We thank B. Ölveczky, C. Stringer, M. Pachitariu, K. Blum and J. Zavatone-Veth for comments on the manuscript.
References