Abstract
Learning aims at causing better performance, and typical gradient descent learning is an approximate causal estimator. However, real neurons spike, making their gradients undefined. Interestingly, a popular technique in economics, regression discontinuity design, estimates causal effects using such discontinuities. Here we show how the spiking threshold can reveal the influence of a neuron's activity on performance, indicating a deep link between simple learning rules and economics-style causal inference.
Learning is typically conceptualized as changing a neuron's properties to cause better performance or improve the reward R. This is a problem of causality: to learn, a neuron needs to estimate its causal influence on reward, βi. The typical solution linearizes the problem and leads to popular gradient descent-based (GD) approaches of the form
$$\Delta w_i \propto \frac{\partial R}{\partial w_i}.$$
However, gradient descent is just one possible approximation to the estimation of causal influences. Focusing on the underlying causality problem promises new ways of understanding learning.
Gradient descent is problematic as a model for biological learning for two reasons. First, real neurons spike, unlike units in artificial neural networks (ANNs), and their firing rates are usually quite low, so that the discreteness of their output matters (e.g. Shafi et al 2007 [30]). Second, given physiological constraints on information transmission, it remains unclear how neurons might mechanistically implement gradient descent. Given these challenges we may ask if the brain uses different approaches to causal inference.
The most obvious approach to causality is intervention: if some spiking is random, then the correlation of performance with those random perturbations reveals causality. Perturbation-based methods have been demonstrated in various settings [3, 7, 8, 20, 24]. Other approaches rely on intrinsically generated noise to infer causality [35, 29]; however, these methods fail when the noise utilized for causal inference is correlated among neurons. Yet noise correlations are a common phenomenon in neuroscience (e.g. Zylberberg et al 2016 [36]), which limits these methods' applicability. One can think of injecting noise as equivalent to randomized controlled trials in medicine [22] and A/B tests in computer science [16]. However, requiring the injection of extra noise decreases performance, prompting us to ask if it is necessary.
Econometricians have thought deeply about causality [1]. One of the most popular techniques is regression discontinuity design (RDD) [15]. In RDD a binary treatment of interest, G, is based on thresholding an input variable, called a forcing or running variable. We are interested in the treatment's effect on an output variable I. An example from education: if a student fails an exam, they are enrolled in mandatory additional classes (Fig. 1a) [21]. In that case, how could we estimate the causal effect of the additional classes on future academic performance?
a, (left) In education, the effect of mandatory classes given to students who fail an exam can be inferred by focusing on students at the threshold. The discontinuity at the threshold is then a meaningful estimate of the local average treatment effect (LATE), or causal effect. (right) In neuroscience, the effect of a spike on a reward function can be determined by considering cases when the neuron is driven to be just above or just below threshold. b, The maximum drive versus the reward shows a discontinuity at the spiking threshold, which represents the causal effect. c, This is judged by looking at the neural drive to the neuron over a short time period. Marginally sub- and super-threshold cases can be distinguished by considering the maximum drive throughout this period. d, Schematic showing how RDD operates in a network of neurons. Each neuron contributes to the output and observes a resulting reward signal. Learning takes place at the end of windows of length T. Only neurons whose input drive brought them close to, or just above, threshold (gray bar in voltage traces) update their estimate of β. e, Model notation.
A naive estimate is simply to compare the students who take the extra classes to those who do not, which we will term the observed dependence (OD):
$$\beta_{OD} = \mathbb{E}\left[I \mid G = 1\right] - \mathbb{E}\left[I \mid G = 0\right].$$
However there will be differences between the two groups, e.g. stronger students will tend to not have been enrolled in the extra classes in the first place. Effects based on student skills and the extra classes will be superimposed, confounding the estimate.
A more meaningful estimate comes from focusing on marginal cases. If we compare the students that are right below the threshold with those that are right above it, then they will effectively have the same exam performance. And, since exam performance is noisy, the statistical difference between marginally sub- and super-threshold students will be negligible. Therefore the difference in outcome between these two populations of students is attributable only to the additional classes attended, providing a measure of the causal effect (Fig. 1a). If χ is the threshold exam score, then RDD computes
$$\beta_{RDD} = \lim_{x \to \chi^-} \mathbb{E}\left[I \mid \text{score} = x\right] - \lim_{x \to \chi^+} \mathbb{E}\left[I \mid \text{score} = x\right].$$
This estimates the causal effect of treatments without requiring the injection of noise. RDD uses local regression near the threshold to obtain statistical power while avoiding confounding.
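To make the RDD logic concrete, the sketch below applies it to simulated exam data of the kind described above. The scores, threshold, and effect sizes are hypothetical; the point is only that a local linear fit on either side of the threshold recovers the effect of the classes, while the naive comparison is confounded by student ability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: exam scores, a pass threshold chi, mandatory classes for
# students who fail, and a later outcome that the classes improve by 2 points.
n = 10000
score = rng.normal(70, 10, n)           # exam score (forcing variable)
chi = 65.0                              # pass threshold
treated = score < chi                   # failing students get extra classes
ability = (score - 70) / 10             # confounder: drives both score and outcome
outcome = 50 + 5 * ability + 2.0 * treated + rng.normal(0, 1, n)

# Naive observed dependence: confounded by ability.
beta_od = outcome[treated].mean() - outcome[~treated].mean()

# RDD: local linear fits on each side of the threshold, within a bandwidth p.
p = 2.0
left = treated & (score > chi - p)      # marginally failing
right = ~treated & (score < chi + p)    # marginally passing
fit_l = np.polyfit(score[left] - chi, outcome[left], 1)
fit_r = np.polyfit(score[right] - chi, outcome[right], 1)
beta_rdd = np.polyval(fit_l, 0.0) - np.polyval(fit_r, 0.0)  # jump at the threshold

print(f"observed dependence: {beta_od:.2f}, RDD estimate: {beta_rdd:.2f} (true effect 2.0)")
```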
Neurons that are not subject to external noise injection have to solve exactly the same causal inference problem (Fig. 1a). Through neuromodulatory signals, a neuron may receive feedback about a reward signal [27, 5]. Comparing the reward in time periods when a neuron barely misses its firing threshold with periods when it barely reaches it is analogous to the above scenario; thus a neuron can use the RDD approach to estimate its own causal effect (Fig. 1b, c). Rather than using randomized perturbations from an additional noise source, a neuron can take advantage of the interaction of its threshold with presynaptic drive.
To implement RDD, a neuron can estimate a piece-wise linear model of the reward function in time periods when its inputs place it close to threshold:
$$R \approx \gamma_i + \alpha_i^l\,(Z_i - \theta)(1 - H_i) + \alpha_i^r\,(Z_i - \theta)\,H_i + \beta_i H_i.$$
Here $H_i$ is neuron i's spiking indicator function, $\gamma_i$, $\alpha_i^l$ and $\alpha_i^r$ are nuisance parameters that correct the bias that would otherwise occur from having a finite bandwidth, $Z_i$ is the maximum neural drive to the neuron over a short time period, and $\beta_i$ represents the causal effect of neuron i's spiking. The neural drive used here is the leaky, integrated input to the neuron, which obeys the same dynamics as the membrane potential except without a reset mechanism. By tracking the maximum drive attained over a short time period, marginally super-threshold inputs can be distinguished from well-above-threshold inputs, as required to apply RDD.
How could a neuron use RDD to estimate causal effects? We analyze a simple two neuron network, obeying leaky integrate-and-fire (LIF) dynamics. The neurons receive an input signal x with added noise, correlated with coefficient c. Each neuron weighs the noisy input by wi. The correlation in input noise induces a correlation in the output spike trains of the two neurons [31], thereby introducing confounding. The neural output determines a non-convex reward signal R. This setting allows us to test if neurons can conceivably implement RDD.
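The following is a minimal sketch of such a two-neuron setting, assuming illustrative parameter values (weights, noise scale, threshold, time step) rather than those used in the paper. It shows how a shared noise component with coefficient c makes the two neurons' spiking correlated, which is what confounds the naive estimate.

```python
import numpy as np

def simulate(w=(10.0, 10.0), c=0.5, x=1.0, T=50e-3, dt=1e-3,
             theta=1.0, sigma=0.3, seed=0):
    """Minimal sketch of the two-LIF-neuron setting with noise correlation c.

    Parameter values here are illustrative assumptions, not the paper's.
    Returns spike indicators (H1, H2) and maximum drives (Z1, Z2) for one window.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    n_steps = int(T / dt)
    v = np.zeros(2)                # membrane potentials (reset at threshold)
    drive = np.zeros(2)            # input drives: same dynamics, no reset
    spiked = np.zeros(2, dtype=bool)
    z = np.full(2, -np.inf)
    for _ in range(n_steps):
        # common + private white noise, scaled by 1/sqrt(dt), mixed so that
        # corr(eta1, eta2) = c
        xi_c, xi_1, xi_2 = rng.normal(0.0, sigma / np.sqrt(dt), 3)
        eta = x + np.sqrt(c) * xi_c + np.sqrt(1 - c) * np.array([xi_1, xi_2])
        v += (-v + w * eta) * dt
        drive += (-drive + w * eta) * dt
        z = np.maximum(z, drive)
        fired = v >= theta
        spiked |= fired
        v[fired] = 0.0             # reset
    return spiked, z

# Spiking of the two neurons becomes correlated when c > 0, which is the source
# of confounding for the naive observed-dependence estimate.
H = np.array([simulate(c=0.5, seed=s)[0] for s in range(2000)]).astype(float)
print("spike-train correlation:", np.corrcoef(H[:, 0], H[:, 1])[0, 1])
```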
The difficulty in estimating a causal effect is that other neurons' activity confounds activation with reward. A simplified RDD estimator that considers only the average difference in reward above and below threshold within a window p, rather than a linear model, reveals this confounding (Fig. 2a). The locally linear RDD model, on the other hand, is more robust to this confounding (Fig. 2b). Thus the linear correction that is the basis of many RDD implementations [15] allows neurons to readily estimate their causal effect.
a, Estimates of causal effect (black line) using a constant RDD model (difference in mean reward when the neuron is within a window p of threshold) reveal confounding for high p values and highly correlated activity. p = 1 represents the observed dependence, revealing the extent of confounding (dashed lines). b, The linear RDD model is unbiased over larger window sizes and more highly correlated activity (high c). c, Relative error in estimates of causal effect over a range of weights (1 ≤ wi ≤ 20) shows lower error with a higher coefficient of variation (CV; top panel), and lower error with a lower firing rate (bottom panel). d, Over this range of weights, RDD estimates are less biased than the naive observed dependence. e,f, Approximation to the reward gradient overlaid on the expected reward landscape. The white vector field corresponds to the true gradient field; the black fields correspond to the RDD (e) and OD (f) estimates. The observed dependence is biased by correlations between neurons 1 and 2: changes in reward caused by neuron 1 are also attributed to neuron 2.
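To illustrate the point of Fig. 2a,b, the sketch below compares the two estimators on hypothetical (Z, R) samples with a known jump at threshold: the constant (window-mean) estimator is biased once the window p is large, while the locally linear fit stays close to the true effect. All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical samples of (maximum drive Z, end-of-window reward R): a smooth trend
# in Z (standing in for confounded, drive-dependent reward) plus a true jump of
# beta = 0.5 at the threshold.
theta, beta_true = 1.0, 0.5
z = rng.uniform(0.5, 1.5, 20000)
spiked = z >= theta
r = 2.0 * (z - theta) + beta_true * spiked + rng.normal(0.0, 0.1, z.size)

def rdd_constant(z, r, p):
    """Difference in mean reward within a window p of threshold (no slope correction)."""
    below = (z < theta) & (z > theta - p)
    above = (z >= theta) & (z < theta + p)
    return r[above].mean() - r[below].mean()

def rdd_linear(z, r, p):
    """Locally linear RDD: separate fits on each side, jump evaluated at the threshold."""
    below = (z < theta) & (z > theta - p)
    above = (z >= theta) & (z < theta + p)
    fit_below = np.polyfit(z[below] - theta, r[below], 1)
    fit_above = np.polyfit(z[above] - theta, r[above], 1)
    return np.polyval(fit_above, 0.0) - np.polyval(fit_below, 0.0)

for p in (0.05, 0.5):
    print(f"p = {p}: constant {rdd_constant(z, r, p):.2f}, "
          f"linear {rdd_linear(z, r, p):.2f}  (true {beta_true})")
```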
To investigate the robustness of the RDD estimator, we systematically vary the weights, wi, of the network. RDD works better when activity is fluctuation-driven and at a lower firing rate (Fig. 2c). RDD is less biased than the observed dependence (Fig. 2d). Thus RDD is most applicable in irregular but synchronous activity regimes [4]. The causal effect can be used to estimate the reward gradient (Fig. 2e,f), and thus the RDD estimator may be used for learning weights that maximize the expected reward (see Methods).
To demonstrate how a neuron can learn β through RDD, we derive an online learning rule from the linear model. The rule takes the form
$$u_i \leftarrow u_i + \eta\,\big(R - u_i^\top a_i\big)\,a_i,$$
where $u_i$ are the parameters of the linear model required to estimate $\beta_i$, η is a learning rate, and $a_i$ are drive-dependent terms (see Methods). This plasticity rule, in which both a reward signal and activation can switch the sign of plasticity, is compatible with the interaction of the modulatory influences of neuromodulators and neuronal firing [28, 2].
When applied to the toy network, the online learning rule (Fig. 3a) estimates β over the course of seconds (Fig. 3b). When the estimated β is then used to maximize expected reward in an unconfounded network (uncorrelated, c = 0.01), RDD-based learning exhibits higher variance than learning using the observed dependence. RDD-based learning exhibits trajectories that initially meander while the estimate of β settles down (Fig. 3c). When a confounded network (correlated, c = 0.5) is used, RDD exhibits similar performance, while learning based on the observed dependence sometimes fails to converge due to the bias in its gradient estimate. In this case RDD also converges faster than learning based on the observed dependence (Fig. 3d,e).
a, The sign of the RDD learning rule update is based on whether the neuron is driven marginally below or above threshold. b, Applying the rule to estimate β for two sample neurons shows convergence within 10 s (red curves). Error bars represent standard error of the mean. c, Convergence of the observed dependence (left) and RDD (right) learning rules in an unconfounded network (c = 0.01). Observed dependence converges more directly to the bottom of the valley, while RDD trajectories have higher variance. d,e, Convergence of the observed dependence (d) and RDD (e) learning rules in a confounded network (c = 0.5). Right panels: error as a function of time for individual traces (blue curves) and their mean (black curve). With confounding, learning based on the observed dependence converges slowly or not at all, whereas RDD succeeds.
This paper is a first step toward introducing RDD to neuronal learning. It serves to illustrate the difference in behavior between RDD and observed-dependence learning in the presence of confounding, but is by no means optimized for performance. Further, it can and should be extended in many ways: our model does not solve temporal credit assignment; it does not deal with large, interesting systems; and it does not specify where presynaptic variance comes from. Nonetheless, RDD is one of the few known ways of statistically dealing with confounders, and an example of a larger class of methods called pseudo-experiments [23]. Demonstrations that segregated neuronal models [11, 17] and synthetic gradient methods [6] can solve deep learning problems at scale inspire future work.
Within reinforcement learning, there exist two popular approaches to estimating causality, each based on utilizing a different kind of noise. In perturbation-based methods, a separate noise process is purposefully injected into the system, together with a mechanism for the system to distinguish responses that are 'natural' from those that are 'perturbation-caused' [3, 7, 8, 20]. In REINFORCE-type schemes [34], the noise instead comes from the biophysical properties of neurons, e.g. their Poisson spiking [35, 29]. In RDD approaches, on the other hand, it is sufficient that something, in fact anything that is presynaptic, produces variability. As such, RDD approaches do not require the noise source to be directly measured.
Further, in previous work, spiking is typically seen as a disadvantage and systems aim to remove spiking discontinuities through smoothing responses [14, 13, 19]. The RDD rule, on the other hand, exploits the spiking discontinuity. Moreover, finite difference approaches like the method derived here also have the benefit that they can operate in environments with non-differentiable or discontinuous reward functions. In many real-world cases, gradient descent would be useless: even if the brain could implement it, the outside world does not supply us with gradients (unlike its simulators [33]). Spiking may, in this sense, allow a natural way of understanding a neuron’s causal influence in a complex world.
The most important aspect of RDD is the explicit focus on causality. A causal model is one that can describe the effects of an agent’s actions on an environment. Thus learning through the reinforcement of an agent’s actions relies, even if implicitly, on a causal understanding of the environment [9, 18]. Here, by explicitly casting learning as a problem of causal inference we have developed a novel learning rule for spiking neural networks. Causality is what really matters in life and, as such, we believe that focusing on causality is essential when thinking about the brain.
Methods
The causal effect
A causal model is a Bayesian network along with a mechanism to determine how the network will respond to intervention. This means a causal model is a directed acyclic graph (DAG) $\mathcal{G}$ over a set of random variables $\mathcal{X} = \{X_1, X_2, \ldots\}$ and a probability distribution P that factorizes over $\mathcal{G}$ [25]. An intervention on a single variable is denoted do(Xi = y). Intervening on a variable removes the edges to that variable from its parents, $\mathrm{Pa}_{X_i}$, and forces the variable to take on the specific value y:
$$P\left(\mathcal{X}\setminus\{X_i\} \,\middle|\, \mathrm{do}(X_i = y)\right) = \prod_{j \ne i} P\left(X_j \mid \mathrm{Pa}_{X_j}\right)\Big|_{X_i = y}.$$
Given the ability to intervene, the local average treatment effect (LATE), or just the causal effect, between an outcome variable $X_j$ and a binary variable $X_i$ can be defined as:
$$\beta_{ij} = \mathbb{E}\left[X_j \mid \mathrm{do}(X_i = 1)\right] - \mathbb{E}\left[X_j \mid \mathrm{do}(X_i = 0)\right].$$
We will make use of the following result. If $S_{ij} \subset \mathcal{X}$ is a set of variables that satisfies the back-door criterion with respect to $X_i \to X_j$, i.e. (i) $S_{ij}$ blocks all paths from $X_i$ to $X_j$ that go into $X_i$, and (ii) no variable in $S_{ij}$ is a descendant of $X_i$, then the interventional expectation can be inferred from observational data:
$$\mathbb{E}\left[X_j \mid \mathrm{do}(X_i = x)\right] = \mathbb{E}_{S_{ij}}\!\left[\,\mathbb{E}\left[X_j \mid X_i = x,\, S_{ij}\right]\right].$$
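As a concrete illustration of the back-door adjustment (not taken from the paper), the sketch below uses a toy three-variable system in which a binary confounder C drives both a treatment X and an outcome Y; the adjusted estimate recovers the true effect where the naive conditional estimate does not. Variable names and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200000

# Toy DAG: C -> X, C -> Y, X -> Y. True causal effect of X on Y is 1.0.
C = rng.integers(0, 2, n)                         # binary confounder
X = (rng.random(n) < 0.2 + 0.6 * C).astype(int)   # treatment, more likely when C = 1
Y = 1.0 * X + 2.0 * C + rng.normal(0, 1, n)       # outcome

# Naive (confounded) estimate: E[Y | X=1] - E[Y | X=0]
naive = Y[X == 1].mean() - Y[X == 0].mean()

# Back-door adjustment over S = {C}: E_C[ E[Y | X=1, C] - E[Y | X=0, C] ]
adjusted = sum(
    (Y[(X == 1) & (C == c)].mean() - Y[(X == 0) & (C == c)].mean()) * np.mean(C == c)
    for c in (0, 1)
)
print(f"naive: {naive:.2f}, back-door adjusted: {adjusted:.2f} (true effect 1.0)")
```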
Given this framework, here we will define the causal effect of a neuron as the average causal effect of a neuron Hi spiking or not spiking on a reward signal, R:
$$\beta_i = \mathbb{E}\left[R \mid \mathrm{do}(H_i = 1)\right] - \mathbb{E}\left[R \mid \mathrm{do}(H_i = 0)\right],$$
where $H_i$ and $R$ are evaluated over a short time window of length T.
Neuron, noise and reward model
We consider the activity of a network of n neurons whose activity is described by their spike times
$$h_i(t) = \sum_k \delta\left(t - t_i^k\right).$$
Here n = 2. Synaptic dynamics $s_i(t)$ are given by
$$\tau_s \frac{ds_i}{dt} = -s_i + h_i(t), \qquad (2)$$
for synaptic time scale $\tau_s$. An instantaneous reward is given by $R(\mathbf{s})$. In order to have a smoother reward signal, R is a function of s rather than h. The reward function used here has the form of a Rosenbrock function, a standard non-convex function whose optimum lies in a curved valley.
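Since the exact constants of the reward function are not reproduced here, the following is a minimal sketch of a Rosenbrock-form reward over the filtered outputs, with assumed parameter values.

```python
def rosenbrock_reward(s1, s2, a=1.0, b=100.0, scale=1.0):
    """Sketch of a Rosenbrock-form reward over the filtered outputs s1, s2.

    The Rosenbrock function f(x, y) = (a - x)^2 + b * (y - x^2)^2 has a curved,
    non-convex valley with its minimum at (a, a^2); taking R = -scale * f turns it
    into a reward to be maximized. The constants here are illustrative assumptions,
    not the values used in the paper.
    """
    return -scale * ((a - s1) ** 2 + b * (s2 - s1 ** 2) ** 2)

# The reward is maximal (zero) at s1 = a, s2 = a^2 and falls off along a curved valley.
print(rosenbrock_reward(1.0, 1.0), rosenbrock_reward(0.5, 0.0))
```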
The neurons obey leaky integrate-and-fire (LIF) dynamics
$$\frac{dv_i}{dt} = -v_i + w_i\,\eta_i(t), \qquad (3)$$
(with time measured in units of the membrane time constant), where integrate-and-fire means simply that when the membrane potential reaches threshold, $v_i(t) \ge \theta$, neuron i emits a spike and the potential is reset to $v_r$.
Noisy input $\eta_i$ is comprised of a common DC current, x, and noise term, $\xi(t)$, plus an individual noise term, $\xi_i(t)$:
$$\eta_i(t) = x + \sqrt{c}\,\xi(t) + \sqrt{1 - c}\,\xi_i(t).$$
The noise processes are independent white noise: $\mathbb{E}\left[\xi(t)\xi(t')\right] = \mathbb{E}\left[\xi_i(t)\xi_i(t')\right] = \sigma^2\,\delta(t - t')$, with $\mathbb{E}\left[\xi(t)\xi_i(t')\right] = 0$. This parameterization is chosen so that the inputs η1,2 have correlation coefficient c. Simulations are performed with a step size of Δt = 1 ms. Here the reset potential was set to vr = 0. Borrowing notation from Xie and Seung 2004 [35], the firing rate of a noisy integrate-and-fire neuron can be written as a function of its mean input and the input noise standard deviation, $\sigma_i$.
We define the input drive to the neuron, $\tilde v_i(t)$, as the leaky integrated input without a reset mechanism; it obeys the same dynamics as (3) but is not reset at threshold. RDD operates when a neuron receives inputs that place it close to its spiking threshold – either nearly spiking or barely spiking – over a given time window. In order to identify these time periods, the method uses the maximum input drive to the neuron over each simulated window of length T:
$$Z_i = \max_{0 \le t \le T} \tilde v_i(t).$$
The input drive is used here instead of the membrane potential directly because it can distinguish between marginally super-threshold inputs and easily super-threshold inputs, whereas this information is lost in the voltage dynamics once a reset occurs. Here a time period of T = 50 ms was used. Reward is administered at the end of this period: $R = R(\mathbf{s}_T)$.
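The sketch below runs one such window for a single neuron with illustrative parameters (assumed, not the paper's): the membrane potential is reset at threshold, while the drive is not, so its maximum Z distinguishes a marginally super-threshold window from a strongly super-threshold one; the synaptic trace s is what the reward sees at the end of the window.

```python
import numpy as np

def run_window(I, dt=1e-3, tau_s=0.02, theta=1.0, v_reset=0.0):
    """One window of length T = len(I)*dt for a single neuron (illustrative parameters).

    Integrates the membrane potential v (with reset), the input drive (same dynamics,
    no reset) whose maximum is Z, and the synaptic trace s used by the reward at the
    end of the window.
    """
    v = drive = s = 0.0
    z, spiked = -np.inf, False
    for I_t in I:
        v += (-v + I_t) * dt
        drive += (-drive + I_t) * dt      # no reset: retains how far past threshold we got
        z = max(z, drive)
        s += (-s / tau_s) * dt            # decay of the trace: tau_s * ds/dt = -s + h(t)
        if v >= theta:
            spiked = True
            v = v_reset
            s += 1.0 / tau_s              # a spike increments the synaptic trace
    return spiked, z, s

# A marginally super-threshold window and a strongly super-threshold window can give
# the same spike count, but Z separates them; this is what RDD conditions on.
T, dt = 0.05, 1e-3
t = np.arange(0, T, dt)
weak, strong = 21.0 * np.ones_like(t), 40.0 * np.ones_like(t)
for name, I in [("marginal", weak), ("strong", strong)]:
    spiked, z, s_T = run_window(I)
    print(f"{name}: spiked={spiked}, Z={z:.2f}, s(T)={s_T:.3f}")
```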
Policy gradient methods in neural networks
The dynamics given by (3) generate an ergodic Markov process with a stationary distribution denoted ρ. We consider the problem of finding network parameters that maximize the expected reward with respect to ρ. In reinforcement learning, performing optimization directly on the expected reward leads to policy gradient methods [32]. These typically rely on either finite difference approximations or a likelihood-ratio decomposition. Both approaches ultimately can be seen as performing stochastic gradient descent, updating parameters
$$w \leftarrow w + \eta\, \widehat{\nabla_w \mathbb{E}_\rho[R]}, \qquad (4)$$
by approximating the expected reward gradient
$$\nabla_w\, \mathbb{E}_\rho\left[R(\mathbf S)\right], \qquad (5)$$
for neural network parameters w. Here capital letters denote random variables drawn from the stationary distribution, corresponding to their dynamic lower-case equivalents above.
Manipulating the expectation using a likelihood-ratio trick leads to REINFORCE-based methods [34]. In neural networks, likelihood-ratio based algorithms are known to have higher variance than methods that more directly approximate back-propagation (e.g. Rezende et al 2014 [26]). This motivates considering methods that more directly estimate the gradient terms [26, 12], breaking down (5) as we would with the deterministic expression. Here we focus on cases dominated by the mean reward gradient, meaning we assume the following:
$$\nabla_w\, \mathbb{E}_\rho\left[R(\mathbf S)\right] \approx \mathbb{E}\left[\frac{\partial R}{\partial \mathbf S}\right]\nabla_w\, \boldsymbol\mu, \qquad (6)$$
where $\boldsymbol\mu$ is the mean activity vector of the neurons.
Fig. 2e suggests that the assumption (6) is reasonable for the case presented here. Of course, in general this assumption does not hold; however, the method presented here can likely be extended to broader cases. For instance, if we were to assume that the stationary distribution ρ can be approximated as Gaussian, then we have:
$$\frac{\partial}{\partial w_k}\,\mathbb{E}\left[R(\mathbf S)\right] = \mathbb{E}\left[\nabla_{\mathbf s} R\right]^\top \frac{\partial \boldsymbol\mu}{\partial w_k} + \frac{1}{2}\,\mathrm{Tr}\!\left(\mathbb{E}\left[\nabla^2_{\mathbf s} R\right]\frac{\partial \boldsymbol\Sigma}{\partial w_k}\right),$$
for $\boldsymbol\mu$ and $\boldsymbol\Sigma$ the mean and covariance of the Gaussian random vector $\mathbf S$ [26]. In this case quasi-Newton methods may be able to make use of the methods derived here. Alternatively, in some episodic learning cases, the gradient $\nabla_w$ may be computed by unrolling the network so that the parameters separate from the stochastic variables; this is sometimes known as the re-parameterization trick [10, 26, 12].
Thus we derive methods to estimate $\mathbb{E}\left[\frac{\partial R}{\partial S_i}\right]$, and use it with (6) and (4) for stochastic gradient descent-based learning. We deal with spiking networks, meaning S is discontinuous at spike times. Therefore it makes sense to consider finite difference approximations to this gradient term.
Causal effect in neural networks
How can a neuron estimate $\mathbb{E}\left[\frac{\partial R}{\partial S_i}\right]$? We show that the reward gradient term can be related to the causal effect of a neuron on the reward signal. To show this we replace the derivative with a type of finite difference operator:
$$\mathbb{E}\left[\frac{\Delta R}{\Delta S_i}\right] := \mathbb{E}_{\mathbf S_{j\ne i}}\!\left[\frac{\mathbb{E}\left[R \mid H_i = 1, \mathbf S_{j\ne i}\right] - \mathbb{E}\left[R \mid H_i = 0, \mathbf S_{j\ne i}\right]}{\Delta s}\right].$$
Here $\mathbf S_{j\ne i} \subset \mathcal{X}$ is a set of nodes that satisfy the back-door criterion with respect to $H_i \to R$. When R is a deterministic, differentiable function of $\mathbf S$ and Δs → 0, this recovers the reward gradient $\mathbb{E}\left[\frac{\partial R}{\partial S_i}\right]$ and we recover gradient descent-based learning. However this formulation has the advantage that it is defined when R is not differentiable, it does not require that R be a deterministic function of $\mathbf S$, and it does not require that Δs be small.
To consider the effect of a single spike, note that unit i spiking will cause a jump in $S_i$ (according to (2)) compared to not spiking. If we let Δs equal this jump, then it can be shown that the finite difference operator is related to the causal effect:
$$\mathbb{E}\left[\frac{\Delta R}{\Delta S_i}\right] = \frac{\beta_i}{\Delta s} \approx \mathbb{E}\left[\frac{\partial R}{\partial S_i}\right]. \qquad (8)$$
A derivation is presented in the supplementary material (Section A).
Using regression discontinuity design
For comparison we define the observed dependence $\beta_i^{OD}$ as:
$$\beta_i^{OD} = \mathbb{E}\left[R \mid H_i = 1\right] - \mathbb{E}\left[R \mid H_i = 0\right].$$
This of course provides an estimate of βi only when Hi is independent of the other neurons in the network. In general the causal effect is confounded through correlation with other units.
As described in the main text, to remove confounding, RDD considers only the marginally super- and sub-threshold periods of time. This works because the discontinuity in the neuron's response induces a detectable difference in outcome for only a negligible difference between the sampled populations (sub- and super-threshold periods). The RDD method estimates [15]:
$$\beta_i^{RDD} = \lim_{z \to \theta^+} \mathbb{E}\left[R \mid Z_i = z\right] - \lim_{z \to \theta^-} \mathbb{E}\left[R \mid Z_i = z\right],$$
for maximum input drive obtained over a short time window, Zi, and spiking threshold, θ; thus, Zi < θ means neuron i does not spike and Zi ≥ θ means it does.
To estimate $\beta_i^{RDD}$, a neuron can estimate a piece-wise linear model of the reward function:
$$R \approx \gamma_i + \alpha_i^l\,(Z_i - \theta)(1 - H_i) + \alpha_i^r\,(Z_i - \theta)\,H_i + \beta_i H_i,$$
locally, when Zi is within a small window p of threshold. Here $\gamma_i$, $\alpha_i^l$ and $\alpha_i^r$ are nuisance parameters, and $\beta_i$ is the causal effect of interest, appearing as the discontinuity of the model at the threshold. This means we can estimate $\beta_i^{RDD}$ directly from the fitted model.
A neuron can learn an estimate of $\beta_i^{RDD}$ through a least squares minimization on the model parameters $\gamma_i$, $\alpha_i^l$, $\alpha_i^r$ and $\beta_i$. That is, if we let $u_i = [\gamma_i, \alpha_i^l, \alpha_i^r, \beta_i]^\top$ and $a_{i,t} = \left[1,\; (z_{i,t} - \theta)(1 - h_{i,t}),\; (z_{i,t} - \theta)\,h_{i,t},\; h_{i,t}\right]^\top$, then the neuron solves:
$$\min_{u_i}\; \mathbb{E}\left[\left(R - u_i^\top a_i\right)^2 \,\middle|\, \theta - p < Z_i < \theta + p\right].$$
Performing stochastic gradient descent on this minimization problem gives the learning rule:
$$u_i \leftarrow u_i + \eta\left(R_t - u_i^\top a_{i,t}\right) a_{i,t},$$
for all time periods at which $z_{i,t}$ is within p of threshold θ.
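A minimal sketch of this update is given below. The layout of u_i and the regressor vector a_i follows the reconstruction above (in particular, including β_i among the fitted parameters is an assumption of the sketch), and the synthetic data and constants are illustrative only.

```python
import numpy as np

def rdd_update(u, z, h, reward, theta=1.0, p=0.1, eta=0.05):
    """One online update of the piecewise-linear model parameters.

    u = [gamma, alpha_l, alpha_r, beta] (this layout is an assumption of the sketch);
    z is the window's maximum drive, h the spike indicator, reward the end-of-window R.
    Only windows with z within p of the threshold are used.
    """
    if abs(z - theta) > p:
        return u
    a = np.array([1.0, (z - theta) * (1 - h), (z - theta) * h, float(h)])  # regressors
    err = reward - u @ a            # prediction error of the local linear model
    return u + eta * err * a        # stochastic gradient step on the squared error

# Toy usage: recover a jump of beta = 0.5 at threshold from synthetic windows.
rng = np.random.default_rng(2)
u = np.zeros(4)
for _ in range(200000):
    z = rng.uniform(0.8, 1.2)
    h = int(z >= 1.0)
    r = 2.0 * (z - 1.0) + 0.5 * h + rng.normal(0, 0.1)
    u = rdd_update(u, z, h, r)
print("estimated beta:", u[3])
```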
Implementation
Python code used to run simulations and generate figures is available at: https://github.com/benlansdell/rdd.
A The relation between causal effect and the finite difference operator
Here we present a more detailed derivation of (8), which relates the causal effect to a finite difference approximation of the reward gradient. First, assuming the conditional independence of R from Hi given $S_i$ and $\mathbf S_{j\ne i}$, the back-door adjustment gives:
$$\beta_i = \mathbb{E}_{\mathbf S_{j\ne i}}\!\Big[\mathbb{E}_{S_i}\!\big[\mathbb{E}[R \mid S_i, \mathbf S_{j\ne i}] \,\big|\, H_i = 1\big] - \mathbb{E}_{S_i}\!\big[\mathbb{E}[R \mid S_i, \mathbf S_{j\ne i}] \,\big|\, H_i = 0\big]\Big]. \qquad (9)$$
Now if we assume that on average Hi spiking induces a change of Δs in $S_i$ within the same time period, compared with not spiking, then:
$$P\left(S_i = s \mid H_i = 1\right) = P\left(S_i = s - \Delta s \mid H_i = 0\right). \qquad (10)$$
This is reasonable because the linearity of the synaptic dynamics, (2), means that the difference in $S_i$ between a spiking and a non-spiking window is simply $\frac{1}{\tau_s}\exp\left(-\frac{T - t_s^i}{\tau_s}\right)$, for spike time $t_s^i$. We approximate this term with its mean:
$$\Delta s = \frac{1}{T}\int_0^T \frac{1}{\tau_s}\,e^{-(T - t)/\tau_s}\,dt = \frac{1}{T}\left(1 - e^{-T/\tau_s}\right), \qquad (11)$$
under the assumption that spike times occur uniformly throughout the length T window. These assumptions are supported numerically (Suppl. Fig. 1).
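A minimal numerical check of this approximation (in the spirit of Suppl. Fig. 1c), under the synaptic model written above and with illustrative values of T and τs:

```python
import numpy as np

# Check: averaging the end-of-window effect of a single spike over uniformly
# distributed spike times matches Delta_s = (1 - exp(-T/tau_s)) / T, as in (11).
# (The tau_s and T values are illustrative.)
rng = np.random.default_rng(3)
for tau_s in (0.01, 0.05, 0.1):
    for T in (0.01, 0.05, 0.1):
        t_spike = rng.uniform(0, T, 100000)
        empirical = np.mean(np.exp(-(T - t_spike) / tau_s) / tau_s)
        analytic = (1 - np.exp(-T / tau_s)) / T
        print(f"tau_s={tau_s}, T={T}: empirical {empirical:.3f}, analytic {analytic:.3f}")
```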
Writing out the inner two expectations of (9) gives:
$$\beta_i = \mathbb{E}_{\mathbf S_{j\ne i}}\!\left[\int \mathbb{E}\left[R \mid s_i, \mathbf S_{j\ne i}\right] p\left(s_i \mid H_i = 1\right) ds_i - \int \mathbb{E}\left[R \mid s_i, \mathbf S_{j\ne i}\right] p\left(s_i \mid H_i = 0\right) ds_i\right],$$
which, from (10), equals
$$\beta_i = \mathbb{E}_{\mathbf S_{j\ne i}}\!\left[\int \mathbb{E}\left[R \mid s_i + \Delta s, \mathbf S_{j\ne i}\right] p\left(s_i \mid H_i = 0\right) ds_i - \int \mathbb{E}\left[R \mid s_i, \mathbf S_{j\ne i}\right] p\left(s_i \mid H_i = 0\right) ds_i\right],$$
after making the substitution si → si + Δs in the first term. Writing this back in terms of expectations gives the result:
$$\beta_i = \mathbb{E}\left[\,\mathbb{E}\left[R \mid S_i + \Delta s, \mathbf S_{j\ne i}\right] - \mathbb{E}\left[R \mid S_i, \mathbf S_{j\ne i}\right] \,\middle|\, H_i = 0\right] \approx \Delta s\; \mathbb{E}\left[\frac{\partial R}{\partial S_i}\right],$$
which is the relation stated in (8).
a, Simulated spike trains are used to generate Si|Hi = 0 and Si|Hi = 1. A QQ-plot shows that Si following a spike is distributed as a translation of Si in windows with no spike, as assumed in (10). b, This offset, Δs, is independent of firing rate and is unaffected by correlated spike trains. c, Over a range of values (0.01 < T < 0.1, 0.01 < τs < 0.1) the derived estimate of Δs (11) is compared to the simulated Δs. Proximity to the diagonal (black curve) shows that these match. d, Δs as a function of window size T and synaptic time constant τs. Larger time windows and longer time constants lower the change in Si due to a single spike.
Footnotes
+ lansdell{at}seas.upenn.edu