Abstract
We apply efficient coding principles to derive the optimal population of neurons for encoding rewards drawn from a given distribution. As in this optimal population, dopaminergic reward prediction error neurons have a broad distribution of optimistically placed thresholds, neurons with higher thresholds have higher gain, and the curvature of their responses depends on the threshold. Thus, these neurons may broadcast an efficient reward signal, not necessarily a reward prediction error.
Main
Reward processing is central to a wide range of cognitive tasks, including decision making, planning and learning. One important reward representation in the brain is maintained by reward prediction error neurons (RPENs) [1].
These dopaminergic neurons in the midbrain react to received rewards relative to an expectation based on past experience. The existence of these neurons is the most prominent evidence in favor of reinforcement learning (RL) in the brain [2]. RPENs have also been implicated in a broad range of tasks that require value-based cognition [3]. Thus, RPENs’ encoding of reward is a cornerstone of our understanding of reward signals in the brain.
One guiding principle to explain neural codes is the hypothesis that neuronal codes might be optimized for efficient coding, i.e. to convey as much information as possible with a given budget for the number of neurons and the number of spikes [4, 5]. This efficient coding hypothesis is quite successful at predicting the codes employed by sensory neurons [6–8], but this framework is not limited to sensory processing and can be applied to reward encoding, too. Notably, efficient coding has been applied successfully to rewards to explain preference-based decisions [9–11], which corroborates the idea that rewards are represented using an efficient code somewhere in the brain.
Here, we propose that RPENs implement an efficient code for reward value. To test our hypothesis, we first derive the most efficient population of sigmoidally tuned neurons to encode rewards sampled from an arbitrary given distribution. We then compare this optimized population to a population of RPENs measured by Eshel et al. [13, 14]. We find three aspects of the optimized population confirmed: First, RPEN thresholds cover the range of the distribution with an emphasis on higher thresholds. Second, RPENs with higher thresholds have higher gains. And third, the asymmetry of RPENs’ responses around their thresholds depends on the threshold, switching from concave to convex as the threshold increases.
We derive the optimal population of neurons to encode rewards analytically by extending an existing framework by Ganguli and Simoncelli [15]. Assuming that RPENs have a sigmoid tuning curve for reward [16, 17], we start with a broad class of populations, among which we search for the most efficient one. Here, we construct the class from a base population of neurons on the unit interval. We then allow any smooth, monotone function to map from rewards to the unit interval and define the responses of the neurons in terms of the responses of the base population at the mapped location. Additionally, we allow an arbitrary scaling of the neurons’ response gain and of the density of neurons depending on their placement. In mathematical terms, we use a prototypical sigmoidal response function h0 on the unit interval to define the response of a neuron with midpoint at µ as hµ(R) = g(µ) h0(F(R) − F(µ)), where F is the mapping from reward space to the unit interval with derivative f ≥ 0, which gives the local conversion rate between reward space and the unit interval, and g is the gain of neurons placed at µ. Finally, d is the density of neurons, i.e. how likely it is that a neuron is placed at a given reward.
We then optimize the three functions to find the most efficient populations that can be generated this way to encode rewards from a distribution with density p(R) and cumulative density P(R) (see Online Methods). This yields a family of equally good populations, parameterized by α ∈ [0, 1], where α trades off placing more neurons at higher thresholds against increasing their gain. This solution extends the earlier result [15], which assumed a uniform density of neurons on the unit interval, effectively setting d(R) = f(R) in our formulation. Their solution is a special case of our family with α = 1 and thus d(R) = p(R) and g(R) ∝ (1 − P(R))⁻¹.
We compare the predictions of the efficient coding model to data from the ‘variable-reward task’ of Eshel et al. [13] (Fig. 1A). In this task, mice were given one of seven reward magnitudes in each trial, each with a fixed probability, while RPEN responses were recorded. We optimize the efficient code for a continuous approximation of the discrete probability distribution, in the form of a log-normal distribution matching the mean and variance of the discrete distribution (µ = 1.29 and σ² = 0.71). To compare our predictions to the data, we set the free parameters of the derived population to match the population observed in the neural recordings. We used n = 39 neurons with a summed expected firing rate of rmax = 245.41, and set the spontaneous firing rate r*, the parameter α of our family of solutions, and k, a parameter controlling the overall slope, to optimize the match to the empirically observed midpoints and thresholds, resulting in r* = 7.67, α = 0.77, and k = 5.07 (see Online Methods for details).
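For concreteness, the moment matching can be sketched as follows. This is a minimal sketch assuming the magnitudes and probabilities listed in the Online Methods; renormalizing the published probabilities (which sum to about 0.99) is our choice, and the variable names are ours:

```python
import numpy as np

# Reward magnitudes (µl) and delivery probabilities of the variable-reward task
magnitudes = np.array([0.1, 0.3, 1.2, 2.5, 5.0, 10.0, 20.0])
probs = np.array([0.066, 0.091, 0.15, 0.15, 0.31, 0.15, 0.077])
probs = probs / probs.sum()          # renormalize; the published values sum to ~0.99

# First two moments of the discrete reward distribution
m = np.sum(probs * magnitudes)
v = np.sum(probs * magnitudes**2) - m**2

# Log-normal parameters (of the underlying normal) with the same mean and variance:
# mean = exp(mu + s2/2), var = (exp(s2) - 1) * exp(2*mu + s2)
s2 = np.log(1.0 + v / m**2)
mu = np.log(m) - s2 / 2.0
print(mu, s2)                        # approximately 1.29 and 0.71
```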
The midpoints of neurons in the optimized population cover the range of the reward distribution, with some emphasis on higher quantiles (Fig. 1C). This pattern reflects a trade-off between two pulls on the efficient code: Neurons generally need to be distributed relatively uniformly across the encountered reward distribution, because investing more spikes in encoding a range of rewards yields less additional information the more accurately the reward value is already encoded. However, neurons with higher thresholds are activated by fewer reward values and are thus cheaper. This pattern matches the measured population (Fig. 1C).
In the efficient population of sigmoidal neurons, the gain is higher for neurons with higher reward thresholds. This effect does not occur for unimodal tuning functions [15, 18]. The increase in gain is optimal because neurons with high thresholds respond to fewer rewards and can thus afford a higher gain at the same expected number of spikes. In the data, we observe the predicted increase in gain, albeit slightly weaker than expected (Fig. 2C).
The higher gain for neurons with higher thresholds leads to concave tuning curves for neurons with low thresholds and convex tuning curves for neurons with high thresholds (Fig. 2A). This effect is due to the definition of the threshold as the point where the response surpasses the spontaneous firing rate. As long as the gain increases faster than the spontaneous firing rate, the threshold moves down along the sigmoid, from its upper, concave part into its lower, convex part.
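This can be made concrete with a minimal sketch, using a generic sigmoid s with range (0, 1) rather than the beta-CDF population used in the figures: for a tuning curve h(R) = g · s(R), the threshold θ is defined by h(θ) = r*, i.e.

```latex
s(\theta) \;=\; \frac{r^{*}}{g} .
```

For low gain, r*/g is large and the threshold sits in the upper, concave part of s; as the gain increases at fixed r*, r*/g shrinks and the threshold slides into the lower, convex part of s.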
To quantify the degree of asymmetry of neural responses, we followed previous work and approximated the neural tuning curves with two linear response functions, one above and one below the spontaneous firing rate r*. This allows for an asymmetric response of the RPENs, with the slope above threshold β+ differing from the slope below threshold β− [19, 20]. The ratio of slopes can then be used as an index of asymmetry, which is close to 1 for convex tuning curves and close to 0 for concave ones.
The relationship between asymmetry and threshold was similar in the optimal population and in the measured population (Fig. 2D). The optimal population captured this relationship even better than distributional reinforcement learning [14] (dashed line in Fig. 2D), a recent mechanistic explanation that links the asymmetry around the threshold to the asymmetry of the update equations needed to converge to different expectiles of the reward distribution.
We thus conclude that RPENs indeed implement an efficient code for reward. This proposal provides a normative reason for the tuning of RPEN responses to rewards, in particular for the following observations: First, neurons encode reward relative to a broad range of quantiles of the reward distribution with some bias towards higher thresholds. Second, neurons with higher reward thresholds have higher gain than neurons with lower thresholds. Third, neurons with high (low) thresholds have convex (concave) tuning curves.
Our qualitative observations are all robust to changes in the measure of information used for optimizing the population, the assumed distribution of neural responses, or the sigmoid shape. Different objective functions or different response distributions lead to different relationships between the density of the reward distribution and the optimal density and gain of neurons [15]. However, the optimal neural density is always a monotonically increasing function of the reward density, as it is more efficient to focus neural resources on probable rewards. Analogously, all objectives lead to higher gains at higher thresholds, as gain increases are “cheaper” for neurons with higher thresholds. Furthermore, different shapes of the response functions do not change the density and gain predictions, which were derived independently of the shape, and only lead to slightly different predictions for the asymmetry of the responses around the threshold. For any shape that is convex in its lower response range and concave in its upper response range, the qualitative argument for the dependence between threshold and asymmetry holds.
Typically, RPEN responses are described in terms of reward prediction error instead of rewards as we do here. However, efficient coding always assumes that the reward distribution is known by both sender and receiver, such that encoding the expected reward does not communicate additional information. As a consequence, an efficient code does not encode the expected value and the optimal populations for reward and for reward prediction error are exactly equal. From this one code, a receiver can then decode both value and reward prediction error.
To simplify our analysis, we restricted ourselves to a task with a single reward delivered in each independent episode, without any action required of the animal. In this simple case, the reward received and the change in value of the current state (potentially with temporal discounting) are confounded, so all our analyses are agnostic to this distinction. Earlier literature suggests that RPENs take future rewards into account with some temporal discounting in multi-step settings [21] and delayed-reward tasks [22]. These observations speak in favor of a temporally discounted value interpretation rather than a pure reward interpretation. Distinguishing these concepts and explaining the ramping and phasic activities of RPENs [e.g. 23] will require future research.
There are three alternative explanations of the data by Eshel et al. [13] that we are aware of: the distributional RL expectile explanation [14], the Laplace code [24], and normalized reinforcement learning [25]. These reinforcement learning-based accounts primarily give a mechanistic explanation of how RPENs learn their code. Distributional RL [14] proposes that RPENs update their thresholds with the same asymmetry shown by their response profile, which leads to convergence to different expectiles of the distribution; this induces the relationship between threshold and asymmetry. The Laplace code [24] proposes that TD-learning neurons with different parameters can encode temporal aspects and the whole distribution of rewards by applying an analogue of the Laplace transform to the neural responses. The authors of the Laplace code already noted that the dependence between asymmetry and threshold can be created by cutting a sigmoid response curve at different response levels, although they shifted the sigmoid up and down instead of changing the gain, as we predict and observe. Finally, normalized reinforcement learning [25] proposes that reward-encoding neurons perform divisive normalization with different half-saturation constants. This yields sigmoid (Naka-Rushton nonlinearity [26]) neurons with different thresholds, and the asymmetry around those thresholds is again explained by cutting the sigmoids at different heights, similar to the explanation for the Laplace code. All of these earlier explanations are mechanistic rather than normative; they do not note or explain the increase in gain, and none of them predicts the bias towards higher thresholds.
Efficient coding as a normative account is compatible with any mechanistic theory of how the code is implemented and learned. Classically, the efficient coding literature has ignored learning and studied the converged steady state for distributions that are learned over a lifetime. Some recent research has proposed mechanisms by which neurons may learn efficient codes [e.g. 27–30]. However, the mechanistic explanations for the asymmetry-threshold relationship discussed above [24, 25] could most likely be modified to converge to an efficient code for rewards if neural gain were also adjusted and the right distribution of neuron parameters were chosen. Making these learning rules concrete and comparing them to the efficient code learning proposals is an interesting direction for future research.
Online Methods
Analysis of neural data
We used the data from the variable-reward task of [13], as preprocessed by [14]. For details of data acquisition, such as data inclusion criteria and the identification of neurons, please refer to [13]. The recording techniques are described in detail in [17]. Here we give only a brief summary of the information we took from their articles:
In this task, five mice were given one of seven reward magnitudes (0.1, 0.3, 1.2, 2.5, 5, 10 and 20 µl), randomly chosen in each trial with probabilities 0.066, 0.091, 0.15, 0.15, 0.31, 0.15, and 0.077, respectively. Electrophysiological data were recorded from n = 40 ventral tegmental area dopaminergic RPENs; from the five mice, 3, 6, 9, 16, and 6 neurons were recorded, respectively. One neuron was excluded because it never showed a single response larger than its baseline for any reward magnitude.
Following previous analyses of this dataset [14], we used average spike rates from 200 ms to 600 ms as the response. Data were originally binned into 1 ms bins. The resulting peri-stimulus time histograms were then smoothed by convolving with the function (1 − e^(−t/ms)) e^(−t/(20 ms)).
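A minimal sketch of this smoothing step, assuming 1 ms bins; the kernel truncation length and the normalization are our choices and are not specified in the original description:

```python
import numpy as np

def smooth_psth(spike_counts_1ms):
    """Convolve a 1 ms-binned PSTH with the causal kernel (1 - exp(-t)) * exp(-t/20), t in ms."""
    t = np.arange(0.0, 200.0)                      # kernel support in ms (truncation is our choice)
    kernel = (1.0 - np.exp(-t)) * np.exp(-t / 20.0)
    kernel = kernel / kernel.sum()                 # normalize so firing rates are preserved
    # 'full' convolution, cut back to the original length to keep the filter causal
    return np.convolve(spike_counts_1ms, kernel)[: len(spike_counts_1ms)]

# Example: smooth a Poisson spike train binned at 1 ms
rates = smooth_psth(np.random.poisson(0.01, size=2000))
```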
Logistic fits
To extract the parameters predicted by the efficient coding framework from the data, we fitted logistic sigmoid functions to the data of each neuron. This allows us to extract a gain a1, a slope a2, and a midpoint a3 for each neuron. We estimated the parameters of the sigmoid functions by least-squares minimization of the difference between the fitted sigmoid and the measured firing rates Rj of a neuron responding to reward magnitude xj on the j-th of T trials.
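A minimal sketch of such a fit, assuming a standard three-parameter logistic a1/(1 + exp(−a2(x − a3))); the exact parameterization and optimizer used in the original fits may differ:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a1, a2, a3):
    """Sigmoid with gain a1, slope a2 and midpoint a3."""
    return a1 / (1.0 + np.exp(-a2 * (x - a3)))

def fit_neuron(x_trials, r_trials):
    """Least-squares fit of the logistic to single-trial responses; returns (a1, a2, a3)."""
    p0 = [np.max(r_trials), 1.0, np.median(x_trials)]   # rough starting values (our choice)
    params, _ = curve_fit(logistic, x_trials, r_trials, p0=p0, maxfev=10000)
    return params
```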
The one parameter that influences the efficient code is the overall expected firing rate rmax. For this parameter, we computed a 90% confidence interval of [209.60, 264.21] by bootstrap resampling of neurons and of their measured trials, repeated 5,000 times. As all firing rates in the efficient code scale proportionally with the expected firing rate, this confidence interval translates directly into a confidence interval for the gain at any threshold (Fig. 2C).
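A sketch of the resampling scheme, assuming the data are stored as one array of single-trial firing rates per neuron and that the trial mean stands in for the expected rate; both assumptions are ours:

```python
import numpy as np

def bootstrap_rmax(responses_per_neuron, n_boot=5000, seed=0):
    """responses_per_neuron: one 1-D array of single-trial firing rates per neuron.
    Returns a 90% bootstrap confidence interval for the summed expected firing rate."""
    rng = np.random.default_rng(seed)
    n = len(responses_per_neuron)
    totals = np.empty(n_boot)
    for b in range(n_boot):
        neuron_idx = rng.integers(0, n, size=n)                   # resample neurons
        total = 0.0
        for i in neuron_idx:
            trials = responses_per_neuron[i]
            total += rng.choice(trials, size=len(trials)).mean()  # resample that neuron's trials
        totals[b] = total
    return np.percentile(totals, [5, 95])
```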
Fitting of asymmetric slopes
To enable comparisons to [14], we repeated their analysis on their data and applied the same analysis technique to the simulated neurons in our population optimized for efficient coding.
First, we subtracted a baseline from the smoothed PSTHs for each neuron and each trial. The baseline was the mean firing rate over the 1000 ms before stimulus onset. We then estimated the asymmetric slopes η+ and η− by separately fitting linear functions to the responses above and below threshold for each neuron. Following Dabney et al. [14], we used the estimated utility space as the x-axis for this step, but the results are not sensitive to the choice of utility function.
In mathematical terms, the slope above threshold η+ and the slope below threshold η− are defined as the least-squares slopes of the responses yj to rewards Rj above and below the threshold, measured relative to the spontaneous firing rate r*, with the sums over j running over the T trials.
To achieve more efficient fitting than Dabney et al. [14], we derived analytic solutions for η+ and η− for a given θ. Thresholds – called reversal points in [14] – were then chosen to minimize the overall sum of squares, by sampling 20,000 thresholds from a uniform distribution over the range of rewards and choosing the best-fitting one. Dabney et al. used the same random sampling procedure for η+ and η−.
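A sketch of this fitting procedure under our reading of the piecewise-linear model ŷ = r* + η±(u − θ), with the split at u = θ and closed-form per-side slopes; details such as the utility mapping and the handling of ties are our assumptions:

```python
import numpy as np

def fit_asymmetric_slopes(u, y, r_star, n_thresholds=20000, seed=0):
    """u: utilities of delivered rewards, y: measured responses, r_star: spontaneous rate.
    Returns (theta, eta_plus, eta_minus) minimizing the summed squared error."""
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(u.min(), u.max(), size=n_thresholds)
    best, best_sse = (np.nan, np.nan, np.nan), np.inf
    for theta in thetas:
        du = u - theta
        above = du > 0
        # closed-form least-squares slope of a line through (theta, r_star), per side
        def side_slope(mask):
            d, r = du[mask], y[mask] - r_star
            return (d @ r) / (d @ d) if d.size > 0 else 0.0
        eta_p, eta_m = side_slope(above), side_slope(~above)
        resid = y - r_star - np.where(above, eta_p, eta_m) * du
        sse = resid @ resid
        if sse < best_sse:
            best, best_sse = (theta, eta_p, eta_m), sse
    return best
```

The asymmetry index used for comparison would then be eta_p / (eta_p + eta_m).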
For fitting our simulated neurons, we set the threshold and yj = hi(vj) and fit the ηs based on the results for 10⁶ trials sampled from the approximated log-normal reward distribution.
Derivation of the efficient code
To derive the efficient code for reward, we extend the framework for analytic solutions proposed by Ganguli and Simoncelli [15]. In this framework, a uniform standard population is defined on the unit interval [0, 1]. By using an adaptable mapping between the unit interval and the stimulus space, together with functions that transform the population, we can define a broad class of populations among which we can find the most efficient ones analytically. We extend this framework by allowing the distribution of neurons to be non-uniform, which decouples the local slope from the density of neurons; the two are coupled in the original framework.
Definitions
We start by defining a “standard” population of RPENs. This standard population is a collection of sigmoidal tuning curves sω : [0, 1] → [0, 1] indexed by their ’center’ ω ∈ [0, 1]. Individual neurons are assumed to be monotonically increasing functions whose derivatives are unimodal functions with their peak near ω. Additionally, we assume that the overall Fisher information If(x) of this population about the position x ∈ [0, 1] and the overall increase in firing rate are independent of x, i.e. ∫₀¹ If(x, ω) dω ≈ const and ∫₀¹ s′ω(x) dω ≈ const, where If(x, ω) is the Fisher information provided by the tuning function with center ω and s′ω is the derivative of that tuning function.
Finally, we assume that the total amount of Fisher information provided by each neuron is approximately constant, i.e. ∫₀¹ If(x, ω) dx ≈ const, independent of ω. One simple way to construct a population that fulfills these criteria is to start with a prototypical sigmoid function s0 centered on 0 and to define the other tuning curves as sω(x) = s0(x − ω). This construction often leads to strong edge effects at 0 and 1, because a substantial part of the sigmoid response curves near the borders falls outside the [0, 1] interval. We ameliorate these effects by using a more sophisticated population based on cumulative beta distributions below. Nonetheless, this simple construction illustrates the intuition behind a uniform population of sigmoidal neurons well and was actually used by Ganguli and Simoncelli in their implementation.
To map between the unit interval and stimulus space, we use a strictly monotonic function F from the stimulus space into the unit interval, parameterized by its derivative f(s) > 0, which we will optimize below. As F is strictly monotonic, it is invertible, such that this also defines a function F⁻¹ mapping the unit interval into stimulus space. Additionally, we allow a gain function g > 0 over stimulus space, which scales the tuning curves of neurons placed at different positions in stimulus space.
With these definitions, we can now define the tuning function hµ of a neuron placed at stimulus level µ as hµ(R) = g(µ) sω(F(R)) with ω = F(µ), i.e. we map both the positions of the neurons and the stimuli into the unit interval, evaluate the sigmoids there, and scale the result with the gain.
Up to this point, our definitions are equivalent to those of Ganguli and Simoncelli [15], who additionally assume that the neurons are equally spaced in the interval. This additional assumption makes it impossible to place more neurons in some stimulus range without changing their shape, removing one of the possible adaptations of a neural population mentioned by other authors, for example Wei and Stocker [31]. To allow different distributions of the neurons, we instead assume that the µi are drawn from a distribution over reward space with density d, which we will optimize below. The equal placement assumed by Ganguli and Simoncelli is essentially equivalent to the assumption d(R) = f(R), which yields a uniform distribution over the placements ω in the unit interval.
Optimization Objective
We further assume that the firing rates of the RPENs are subject to independent Poisson noise, which is the simplest noise assumption. We can then calculate the Fisher information If(R; µ) provided by a neuron centered on µ, which for Poisson noise is If(R; µ) = h′µ(R)²/hµ(R). We then optimize the expected value (over both R and the placements µ) of the logarithm of the total Fisher information provided by the N neurons. This objective was originally proposed as a lower bound on the mutual information between the neural responses and the stimulus [15, 32, 33], which becomes tight for small Gaussian-shaped errors on the encoded value. However, some later publications cast doubt on this interpretation: in particular,
[34] interpret this quantity as an upper bound on mutual information instead. Additionally, small firing rates and/or steep changes in firing rate can lead to this approximation being highly inaccurate, and this regime is actually reached in cases where neurons are measured for only short times [35]. Nonetheless, this objective remains the best Fisher information-based approximation to mutual information, and we follow Ganguli and Simoncelli [15] in optimizing it. They also tried a few other functionals of Fisher information and obtained qualitatively similar results. Mutual information itself is unfortunately intractable for these derivations.
Assuming further that there is a large population of RPENs, we can consider this objective in the limit of infinitely many neurons. According to the strong law of large numbers, the logarithm of the mean in (10) almost surely converges to the logarithm of the expected Fisher information, because the Fisher information provided by each single neuron has finite variance. Because this expected Fisher information is a constant with respect to the random placements µ, we can drop the outer expected value over the placements, leaving an expectation over R only. To make this expression amenable to optimization, we simplify it further by eliminating the inner integral over µ in two steps, which are only approximately correct: First, we observe for sigmoidal functions that the single-neuron Fisher information If(R; µ), viewed as a function of µ, is a unimodal function centered around R, which effectively acts as a smoothing kernel that is convolved with the rest of the integral. If d(µ), f(µ) and g(µ) change little over the range of this smoothing kernel, we can replace them by their values at R and pull them out of the integral. Second, the integral over µ that we are left with is (approximately) shift-invariant, i.e. the same for all R and invariant to the monotonic transformation F, if the original population is (approximately) uniform as defined in equations (6) & (7).
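The following is our hedged reconstruction of this approximation chain from the definitions above; it is a sketch rather than the paper's numbered equations, and constants that do not depend on R are absorbed into C:

```latex
% Single-neuron Fisher information under Poisson noise:
I_f(R;\mu) \;=\; \frac{h_\mu'(R)^2}{h_\mu(R)}
           \;=\; g(\mu)\, f(R)^2\,
                 \frac{s'_{F(\mu)}(F(R))^2}{s_{F(\mu)}(F(R))}
% Averaging over placements mu ~ d, substituting omega = F(mu) and pulling d, g, f
% out of the integral at R (the smoothing-kernel step) gives
\mathbb{E}_{\mu}\big[I_f(R;\mu)\big] \;\approx\; C\, d(R)\, g(R)\, f(R),
\qquad C = \int_0^1 I_f(x,\omega)\, d\omega
% so that, up to additive constants, the objective becomes
\mathbb{E}_R\big[\log\big(d(R)\, g(R)\, f(R)\big)\big].
```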
The final optimization problem then is to maximize this approximation of the objective with respect to the three functions d, g, and f. Additionally, we enforce that the expected number of spikes 𝔼[r] under the prior distribution is less than or equal to a bound rmax. The expected number of spikes is approximately 𝔼[r] ≈ N ∫ d(µ) g(µ) (1 − P(µ)) dµ, where P is the cumulative density function of the reward distribution. In (20), we approximated the sigmoid with a step function at µ. This approximation is close if p(R) is nearly constant over the range where the sigmoid hµ changes from 0 to 1, which is the case if the sigmoids are reasonably steep.
There are two further constraints: ∫ d(R) dR = 1, which makes d a valid probability density, and Fmax = ∫ f(R) dR ≤ 1, which limits the range of F to the unit interval.
Solving the Optimization
Now we are ready to solve this optimization problem using Lagrange multipliers. To do so, we first compute the functional derivatives of our objective function with respect to d, g, and f. Similarly, we compute the functional derivatives of the constraints: first, for the expected spike rate; second, for the integral of the density D; and finally, for the upper end of the interval reached by F, Fmax. With these derivatives in place, we can solve the optimization using Lagrange multipliers λR, λD, λF, which results in three stationarity equations (29)–(31). From (29), we immediately get f(R) ∝ p(R). From (30), we obtain d(R) g(R) ∝ p(R)/(1 − P(R)). By inserting this result into (31), we see that (31) can always be satisfied with λD = 0.
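As a sketch of what these stationarity conditions look like, the following is our reconstruction under the approximations above; the equation numbering and constant factors may differ from (29)–(31) in the original derivation:

```latex
% Stationarity of E_R[log(d g f)] under the spike, density and range constraints
% (multipliers \lambda_R, \lambda_D, \lambda_F, respectively):
\frac{p(R)}{f(R)} \;=\; \lambda_F ,
\qquad
\frac{p(R)}{g(R)} \;=\; \lambda_R\, N\, d(R)\,\big(1 - P(R)\big) ,
\qquad
\frac{p(R)}{d(R)} \;=\; \lambda_R\, N\, g(R)\,\big(1 - P(R)\big) + \lambda_D .
```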
We conclude that, under a constraint on the expected number of spikes, the neural population that optimizes the Fisher information bound on mutual information has the following properties: The slope of the sigmoids is proportional to the probability density. The product of the probability density for neurons with a given midpoint and the gain for neurons with this midpoint is equal to the ratio of the reward probability density and the probability that rewards larger than the midpoint appear.
The formulation by Ganguli & Simoncelli [15], with equally spaced neurons in the abstract space, essentially enforces the additional constraint that d(R) ∝ f(R) and thus arrives at the special case of d(R) ∝ p(R) and g(R) ∝ (1 − P(R))⁻¹.
Choosing a Solution
In our solution, d and g entirely compensate for each other. Because their product needs to be proportional to the product of the two factors p(R) and (1 − P(R))⁻¹, a natural choice for parameterizing a family of solutions is a parameter α ∈ [0, 1] that trades off how the two factors are distributed between density and gain. Effectively, α trades off the gain increase at higher thresholds against placing more neurons at higher thresholds. We illustrate this for several different reward distributions and values of α in Fig. 3.
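One concrete way to write such a family is the following parameterization; this is an illustrative choice on our part, and any split of the two factors between d and g that keeps their product fixed is equally efficient:

```latex
d(R) \;\propto\; p(R)^{\alpha}\,\big(1 - P(R)\big)^{\alpha - 1} ,
\qquad
g(R) \;\propto\; p(R)^{1-\alpha}\,\big(1 - P(R)\big)^{-\alpha} ,
```

so that d(R) g(R) ∝ p(R)/(1 − P(R)) for every α, and α = 1 recovers d(R) ∝ p(R) and g(R) ∝ (1 − P(R))⁻¹.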
We fitted α using only the midpoints of the measured neurons. To do so, we extracted the midpoints of the measured neurons from logistic function fits as described above. We then estimated α by maximizing the probability of the measured neurons’ midpoints under the density d. The best fit was produced by α = 0.77.
The requirements we put forth in (5)-(7) are approximately fulfilled by many populations of sigmoid-shaped tuning curves. Thus, the exact shape, and in particular the slope, of the sigmoid are not fixed by these efficient coding considerations. For all optimal populations in this paper, we use the cumulative density functions of beta distributions, because these sigmoid functions are defined on the unit interval [0, 1] and become exactly 0 and 1 at the ends of the interval, which yields good boundary behavior. These sigmoid functions are not exactly shifted versions of each other, as we assumed in the derivations, but functions that are shifted versions of each other lead to relatively strong boundary effects that produce worse deviations from the theory. To place a neuron at a specific position p ∈ [0, 1], we set the parameters such that p is the mean of the beta distribution (a = kp, b = k(1 − p)), which leaves us with one more parameter k > 0 that controls the slope of the sigmoid functions. As the efficient coding scheme does not determine what the overall slope should be, we set k = 5.07 to fit the threshold distribution of measured neurons as described below. We do not expect other shapes to behave fundamentally differently.
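A minimal sketch of this construction, assuming that F is taken to be the CDF of the approximating log-normal (consistent with f(R) ∝ p(R)); the function names and the example gain are ours:

```python
import numpy as np
from scipy.stats import beta, lognorm

k = 5.07                    # overall slope parameter fitted to the measured thresholds
mu, s2 = 1.29, 0.71         # parameters of the approximating log-normal reward distribution

def F(R):
    """Mapping from reward space to the unit interval; here the reward CDF, so f is proportional to p(R)."""
    return lognorm.cdf(R, s=np.sqrt(s2), scale=np.exp(mu))

def tuning_curve(R, p_mid, gain):
    """Neuron placed at position p_mid in the unit interval: beta-CDF sigmoid scaled by its gain."""
    return gain * beta.cdf(F(R), a=k * p_mid, b=k * (1.0 - p_mid))

# Example: a neuron placed at the 0.7 quantile of the reward distribution with gain 10 spikes/s
R = np.linspace(0.1, 20.0, 200)
rates = tuning_curve(R, p_mid=0.7, gain=10.0)
```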
We used a fixed spontaneous firing rate r* for all RPENs. We fitted this r* to minimize the discrepancy between the measured thresholds and the ones predicted by the efficient code, using a truncated Newton algorithm [36]. The objective is the mean squared error between the thresholds of the efficient code and the thresholds θi of the measured neurons, minimized with respect to r* and the aforementioned slope parameter k of the sigmoid function h(·). As a result, we found r* = 7.67 and k = 5.07 for the efficient code fitted to the neurons measured by [13] and used by [14].
Checking the solution
We performed several checks to confirm that the derived populations are indeed efficient given the reward distribution. To do so, we compared the amount of information communicated with that of several other populations with the same number of neurons and the same expected number of spikes (Fig. 4). The optimized populations retain more information than the other populations. The population with the optimized α = 0.77 is as efficient as the solution of Ganguli and Simoncelli [15], which corresponds to α = 1.
Acknowledgments
We thank Hsin-Hung Li for valuable discussions.