## Abstract

Learning aims at causing better performance, and typical gradient descent learning is in effect an approximate causal estimator. However, real neurons spike, making their gradients undefined. Interestingly, a popular technique in economics, regression discontinuity design, estimates causal effects using exactly such discontinuities. Here we show how the spiking threshold can reveal the influence of a neuron's activity on performance, indicating a deep link between simple learning rules and economics-style causal inference.

Learning is typically conceptualized as changing a neuron's properties to cause better performance or improve the reward *R*. This is a problem of causality: to learn, a neuron needs to estimate its causal influence on reward, *β_i*. The typical solution linearizes the problem and leads to popular gradient descent-based (GD) approaches of the form

$$\Delta w_i \propto \frac{\partial R}{\partial w_i}.$$

However, gradient descent is just one possible approximation to the estimation of causal influences. Focusing on the underlying causality problem promises new ways of understanding learning.

Gradient descent is problematic as a model for biological learning for two reasons. First, real neurons spike, as opposed to units in artificial neural networks (ANNs), and their firing rates are usually low enough that the discreteness of their output matters (e.g. Shafi et al 2007 [30]). Second, given physiological constraints on information transmission, it remains unclear how neurons might mechanistically implement gradient descent. Given these challenges we may ask if the brain uses different approaches for causal inference.

The most obvious approach to causality is intervention: if some spiking is random, then the correlation of performance with those random perturbations reveals causality. Perturbation-based methods have been demonstrated in various settings [3, 7, 8, 20, 24]. Other approaches rely on intrinsically generated noise to infer causality [35, 29]; however, these methods fail when the noise utilized for causal inference is correlated among neurons. Yet noise correlations are a common phenomenon in neuroscience (e.g. Zylberberg et al 2016 [36]), which limits these methods' applicability. One can think of injecting noise as equivalent to randomized controlled trials in medicine [22] and A/B tests in computer science [16]. However, requiring the injection of extra noise decreases performance, prompting us to ask if it is necessary.

Econometricians have thought deeply about causality [1]. One of the most popular techniques is regression discontinuity design (RDD) [15]. In RDD a binary treatment of interest, *G*, is based on thresholding an input variable, called a forcing or running variable. We are interested in the treatment's effect on an output variable *I*. An example from education: if a student fails an exam, they are enrolled in mandatory additional classes (Fig. 1a) [21]. In that case, how could we estimate the causal effect of the additional classes on future academic performance?

A naive estimate is just to compare the students who take the extra classes to those who do not, which we will term the observed dependence (OD):

$$\text{OD} = \mathbb{E}[I \mid G = 1] - \mathbb{E}[I \mid G = 0].$$

However there will be differences between the two groups, e.g. stronger students will tend not to have been enrolled in the extra classes in the first place. Effects based on student skills and the extra classes will be superimposed, confounding the estimate.

A more meaningful estimate comes from focusing on marginal cases. If we compare the students right below the threshold with those right above it, they will have effectively the same exam performance. And, since exam performance is noisy, the statistical difference between marginally sub- and super-threshold students will be negligible. Therefore the difference in outcome between these two populations of students will be attributable *only* to the additional classes attended, providing a measure of causal effect (Fig. 1a). If *χ* is the threshold exam score, then RDD computes

$$\beta = \lim_{x \uparrow \chi} \mathbb{E}[I \mid x] - \lim_{x \downarrow \chi} \mathbb{E}[I \mid x],$$

the treated (failing) side's limit minus the untreated side's. This estimates the causal effect of treatments without requiring the injection of noise. RDD uses local regression near the threshold to obtain statistical power while avoiding confounding.
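The education example above can be sketched numerically. The following is a minimal illustration with simulated exam data — the sample size, threshold, effect size, and noise levels are all invented for the example — comparing the confounded observed dependence with a local-linear RDD estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical exam data: students failing (score < chi) get extra classes.
n, chi = 10_000, 50.0
ability = rng.normal(50, 10, n)            # latent ability (confounder)
score = ability + rng.normal(0, 5, n)      # noisy exam score (forcing variable)
g = (score < chi).astype(float)            # treatment: extra classes if failed
beta_true = 4.0                            # assumed true effect of the classes
outcome = ability + beta_true * g + rng.normal(0, 2, n)

# Naive observed dependence: confounded, since weaker students get classes.
od = outcome[g == 1].mean() - outcome[g == 0].mean()

# RDD: local linear fit on each side of chi; compare intercepts at the cutoff.
p = 2.0                                     # bandwidth around the threshold
below = (score > chi - p) & (score < chi)   # treated side
above = (score >= chi) & (score < chi + p)  # untreated side
fit_b = np.polyfit(score[below] - chi, outcome[below], 1)  # [slope, intercept]
fit_a = np.polyfit(score[above] - chi, outcome[above], 1)
rdd = fit_b[1] - fit_a[1]                   # treated limit minus untreated limit

print(f"observed dependence: {od:.2f}, RDD estimate: {rdd:.2f}")
```

Because stronger students avoid the extra classes, the observed dependence lands far from the true effect (it even has the wrong sign here), while the local-linear comparison at the threshold recovers it.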

Neurons that are not subject to external noise injection have to solve exactly the same causal inference problem (Fig. 1a). Through neuromodulator signals, a neuron may receive feedback on a reward signal [27, 5]. Comparing the reward in time periods when a neuron almost reaches its firing threshold with the reward in moments when it just reaches its threshold is analogous to the above scenario; thus a neuron can use the RDD approach to estimate its own causal effect (Fig. 1b, c). Rather than using randomized perturbations from an additional noise source, a neuron can take advantage of the interaction of its threshold with presynaptic drive.

To implement RDD a neuron can estimate a piece-wise linear model of the reward function at time periods when its inputs place it close to threshold:

$$\mathbb{E}[R \mid Z_i = z] = \gamma_i + \beta_i H_i + \alpha_{li}\,(z - \theta)\,\mathbb{1}(z < \theta) + \alpha_{ri}\,(z - \theta)\,\mathbb{1}(z \ge \theta).$$

Here *H_i* is neuron *i*'s spiking indicator function, *θ* is the spiking threshold, *γ_i* is an intercept, *α_{li}* and *α_{ri}* are slopes that correct biases that would otherwise occur from having a finite bandwidth, *Z_i* is the maximum neural drive to the neuron over a short time period, and *β_i* represents the causal effect of neuron *i*'s spiking. The neural drive we will use here is the leaky, integrated input to the neuron, which obeys the same dynamics as the membrane potential except without a reset mechanism. By tracking the maximum drive attained over a short time period, marginally super-threshold inputs can be distinguished from well-above-threshold inputs, as required to apply RDD.

How could a neuron use RDD to estimate causal effects? We analyze a simple two-neuron network obeying leaky integrate-and-fire (LIF) dynamics. The neurons receive an input signal *x* with added noise, correlated with coefficient *c*. Each neuron weights the noisy input by *w_i*. The correlation in input noise induces a correlation in the output spike trains of the two neurons [31], thereby introducing confounding. The neural output determines a non-convex reward signal *R*. This setting allows us to test if neurons can conceivably implement RDD.

The difficulty in estimating a causal effect is that other neurons' activity confounds a neuron's activation with the reward. A simplified RDD estimator that considers only the average difference in reward above and below threshold within a window *p*, rather than a linear model, reveals this confounding (Fig. 2a). The locally linear RDD model, on the other hand, is more robust to this confounding (Fig. 2b). Thus the linear correction that is the basis of many RDD implementations [15] allows neurons to readily estimate their causal effect.

To investigate the robustness of the RDD estimator, we systematically vary the weights, *w_i*, of the network. RDD works better when activity is fluctuation-driven and at a lower firing rate (Fig. 2c). RDD is less biased than the observed dependence (Fig. 2d). Thus RDD is most applicable in irregular but synchronous activity regimes [4]. The causal effect can be used to estimate the reward gradient (Fig. 2e,f), and thus the RDD estimator may be used for learning weights that maximize the expected reward (see Methods).

To demonstrate how a neuron can learn *β_i* through RDD, we derive an online learning rule from the linear model. The rule takes the form:

$$\mathbf{u}_i \leftarrow \mathbf{u}_i + \eta\,\left(R - \mathbf{u}_i^\mathsf{T}\mathbf{a}_i\right)\mathbf{a}_i,$$

where **u**_i are the parameters of the linear model required to estimate *β_i*, *η* is a learning rate, and **a**_i are drive-dependent terms (see Methods). This plasticity rule, where both a reward signal and activation can switch the sign of plasticity, is compatible with the interaction of modulatory influences of neuromodulators and neuronal firing [28, 2].

When applied to the toy network, the online learning rule (Fig. 3a) estimates *β* over the course of seconds (Fig. 3b). When the estimated *β* is then used to maximize expected reward in an unconfounded network (uncorrelated, *c* = 0.01), RDD-based learning exhibits higher variance than learning using the observed dependence, with trajectories that initially meander while the estimate of *β* settles down (Fig. 3c). When a confounded network (correlated, *c* = 0.5) is used, RDD exhibits similar performance, while learning based on the observed dependence sometimes fails to converge due to the bias in its gradient estimate. In this case RDD also converges faster than learning based on the observed dependence (Fig. 3d,e).

This paper is a first step toward introducing RDD to neuronal learning. It serves to illustrate the difference in behavior between RDD and observed-dependence learning in the presence of confounding, but is by no means optimized for performance. Further, it can and should be extended in many ways: our model does not solve temporal credit assignment; it does not deal with large, interesting systems; and it does not specify where presynaptic variance comes from. Nonetheless, RDD is one of the few known ways of statistically dealing with confounders, and an example of a larger class of methods called pseudo-experiments [23]. Demonstrations that segregated neuronal models [11, 17] and synthetic gradient methods [6] can solve deep learning problems at scale inspire future work.

Within reinforcement learning, there exist two popular approaches for estimating causality, each utilizing a different kind of noise. In perturbation-based methods, a separate noise process is purposefully injected into the system, and a mechanism for the system to understand responses as being either 'natural' or 'perturbation-caused' is used [3, 7, 8, 20]. In REINFORCE-type schemes [34], the noise instead comes from the biophysical properties of neurons, e.g. their Poisson spiking [35, 29]. In RDD approaches, on the other hand, it is sufficient that something, in fact anything that is presynaptic, produces variability. As such, RDD approaches do not require the noise source to be directly measured.

Further, in previous work, spiking is typically seen as a disadvantage and systems aim to remove spiking discontinuities through smoothing responses [14, 13, 19]. The RDD rule, on the other hand, exploits the spiking discontinuity. Moreover, finite difference approaches like the method derived here also have the benefit that they can operate in environments with non-differentiable or discontinuous reward functions. In many real-world cases, gradient descent would be useless: even if the brain could implement it, the outside world does not supply us with gradients (unlike its simulators [33]). Spiking may, in this sense, allow a natural way of understanding a neuron’s causal influence in a complex world.

The most important aspect of RDD is the explicit focus on causality. A causal model is one that can describe the effects of an agent’s actions on an environment. Thus learning through the reinforcement of an agent’s actions relies, even if implicitly, on a causal understanding of the environment [9, 18]. Here, by explicitly casting learning as a problem of causal inference we have developed a novel learning rule for spiking neural networks. Causality is what really matters in life and, as such, we believe that focusing on causality is essential when thinking about the brain.

## Methods

### The causal effect

A causal model is a Bayesian network along with a mechanism to determine how the network will respond to intervention. This means a causal model is a directed acyclic graph (DAG) over a set of random variables χ = {*X_1*, …, *X_n*} and a probability distribution *P* that factorizes over the DAG [25]. An intervention on a single variable is denoted do(*X_i* = *y*). Intervening on a variable removes the edges to that variable from its parents, Pa(*X_i*), and forces the variable to take on a specific value:

$$P(x_1, \dots, x_n \mid \text{do}(X_i = y)) = \prod_{j \neq i} P(x_j \mid \text{Pa}(X_j))\,\mathbb{1}(x_i = y).$$

Given the ability to intervene, the local average treatment effect (LATE), or just causal effect, between an outcome variable *X_j* and a binary variable *X_i* can be defined as:

$$\beta = \mathbb{E}[X_j \mid \text{do}(X_i = 1)] - \mathbb{E}[X_j \mid \text{do}(X_i = 0)].$$

We will make use of the following result. If *S_{ij}* ⊂ χ is a set of variables that satisfies the *back-door criterion* with respect to *X_i* → *X_j* — that is: (i) *S_{ij}* blocks all paths from *X_i* to *X_j* that go into *X_i*, and (ii) no variable in *S_{ij}* is a descendant of *X_i* — then the interventional expectation can be inferred from observational quantities:

$$\mathbb{E}[X_j \mid \text{do}(X_i = x)] = \mathbb{E}_{S_{ij}}\big[\mathbb{E}[X_j \mid X_i = x, S_{ij}]\big].$$

Given this framework, here we will define the causal effect of a neuron as the average causal effect of a neuron *H_i* spiking or not spiking on a reward signal, *R*:

$$\beta_i = \mathbb{E}[R \mid \text{do}(H_i = 1)] - \mathbb{E}[R \mid \text{do}(H_i = 0)],$$

where *H_i* and *R* are evaluated over a short time window of length *T*.
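The back-door adjustment above can be made concrete with a small simulation. The three-variable model below is invented for illustration (a binary confounder *S* driving both a binary cause *X* and outcome *Y*; all probabilities are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Invented model: S is a back-door variable for X -> Y (S -> X and S -> Y).
s = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(s == 1, 0.8, 0.2))  # S drives X
y = rng.binomial(1, 0.2 + 0.3 * x + 0.4 * s)     # true effect of X on Y: 0.3

# Observed dependence E[Y|X=1] - E[Y|X=0] is confounded by S.
od = y[x == 1].mean() - y[x == 0].mean()

# Back-door adjustment: E[Y | do(X=x)] = sum_s E[Y | X=x, S=s] P(S=s).
def do_x(xv):
    return sum(y[(x == xv) & (s == sv)].mean() * (s == sv).mean()
               for sv in (0, 1))

late = do_x(1) - do_x(0)
print(f"observed dependence: {od:.3f}, back-door estimate: {late:.3f}")
```

The raw conditional difference overstates the effect because *S* raises both *X* and *Y*; averaging the stratified conditional expectations over *P*(*S*) recovers the interventional contrast.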

### Neuron, noise and reward model

We consider the activity of a network of *n* neurons whose activity is described by their spike times

$$h_i(t) = \sum_k \delta\left(t - t_i^k\right). \tag{1}$$

Here *n* = 2. Synaptic dynamics **s** are given by

$$\frac{ds_i}{dt} = -\frac{s_i}{\tau_s} + h_i(t), \tag{2}$$

for synaptic time scale *τ_s*. An instantaneous reward is given by *R*(**s**). In order to have a smoother reward signal, *R* is a function of **s** rather than **h**. The reward function used here has the form of a Rosenbrock function (negated so that reward is maximized at the Rosenbrock minimum):

$$R(s_1, s_2) = -\left[(a - s_1)^2 + b\left(s_2 - s_1^2\right)^2\right],$$

for constants *a* and *b*.

The neurons obey leaky integrate-and-fire (LIF) dynamics

$$\tau \frac{dv_i}{dt} = -v_i + w_i\,\eta_i(t), \tag{3}$$

where integrate-and-fire means simply: when $v_i \ge \theta$, neuron *i* emits a spike and *v_i* is reset to *v_r*. Noisy input *η_i* is comprised of a common DC current, *x*, and a common noise term, *ξ*(*t*), plus an individual noise term, *ξ_i*(*t*):

$$\eta_i(t) = x + \sigma\left(\sqrt{c}\,\xi(t) + \sqrt{1 - c}\,\xi_i(t)\right).$$

The noise processes are independent white noise: $\langle \xi(t)\,\xi(t') \rangle = \langle \xi_i(t)\,\xi_i(t') \rangle = \delta(t - t')$. This parameterization is chosen so that the inputs *η_{1,2}* have correlation coefficient *c*. Simulations are performed with a step size of Δ*t* = 1 ms. Here the reset potential was set to *v_r* = 0. Borrowing notation from Xie and Seung 2004 [35], the firing rate of a noisy integrate-and-fire neuron is

$$\rho(\mu) = \left( \tau \sqrt{\pi} \int_{(v_r - \mu)/\sigma}^{(\theta - \mu)/\sigma} e^{u^2}\left(1 + \operatorname{erf}(u)\right) du \right)^{-1},$$

where *μ* is the mean input and *σ* is the input noise standard deviation.

We define the input drive to the neuron as the leaky integrated input without a reset mechanism. That is, over each simulated window of length *T*:

$$\tau \frac{du_i}{dt} = -u_i + w_i\,\eta_i(t),$$

with no reset at threshold. RDD operates when a neuron receives inputs that place it close to its spiking threshold – either nearly spiking or barely spiking – over a given time window. In order to identify these time periods, the method uses the maximum input drive to the neuron:

$$Z_i = \max_{0 \le t \le T} u_i(t).$$

The input drive is used here instead of the membrane potential directly because it can distinguish between marginally super-threshold inputs and easily super-threshold inputs, whereas this information is lost in the voltage dynamics once a reset occurs. Here a time period of *T* = 50 ms was used. Reward is administered at the end of this period: *R* = *R*(**s**_T).
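A minimal simulation sketch of this setup follows. All parameter values (time constants, weights, noise scale, DC input) are illustrative assumptions, not the paper's settings; the point is the bookkeeping: a membrane potential with reset, a parallel drive without reset, and the windowed maximum *Z_i*:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters: two LIF neurons receiving correlated noisy input.
dt, T = 1e-3, 0.05                 # 1 ms steps, 50 ms window
tau, theta, v_r = 0.02, 1.0, 0.0   # membrane time constant, threshold, reset
c, sigma, x_dc = 0.5, 0.02, 0.9    # noise correlation, noise scale, DC input
w = np.array([1.0, 1.0])           # input weights

v = np.zeros(2)                    # membrane potential (reset at threshold)
u = np.zeros(2)                    # input drive (same dynamics, no reset)
z = np.full(2, -np.inf)            # running maximum of the drive
spikes = np.zeros(2)

for _ in range(int(T / dt)):
    xi_common = rng.normal() / np.sqrt(dt)
    xi_ind = rng.normal(size=2) / np.sqrt(dt)
    # inputs with correlation coefficient c between the two neurons
    eta = x_dc + sigma * (np.sqrt(c) * xi_common + np.sqrt(1 - c) * xi_ind)
    v += dt / tau * (-v + w * eta)
    u += dt / tau * (-u + w * eta)     # never reset: tracks the raw drive
    z = np.maximum(z, u)               # maximum drive Z_i over the window
    fired = v >= theta
    spikes += fired
    v[fired] = v_r                     # reset after a spike

print("spike counts:", spikes, "max drives Z:", z)
```

Note that *v* and *u* see identical input and diverge only after the first spike, which is exactly why *Z_i* retains the near- versus well-above-threshold distinction that the voltage loses.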

### Policy gradient methods in neural networks

The dynamics given by (3) generate an ergodic Markov process with a stationary distribution denoted *ρ*. We consider the problem of finding network parameters that maximize the expected reward with respect to *ρ*. In reinforcement learning, performing optimization directly on the expected reward leads to policy gradient methods [32]. These typically rely on either finite difference approximations or a likelihood-ratio decomposition. Both approaches ultimately can be seen as performing stochastic gradient descent, updating parameters by approximating the expected reward gradient:

$$\Delta w = \eta\,\nabla_w\,\mathbb{E}\left[R(\mathbf{S})\right], \tag{4}$$

with

$$\nabla_w\,\mathbb{E}\left[R(\mathbf{S})\right] = \nabla_w \int R(\mathbf{s})\,\rho(\mathbf{s}; w)\,d\mathbf{s}, \tag{5}$$

for neural network parameters *w*. Here capital letters are used to denote the random variables drawn from the stationary distribution, corresponding to their dynamic lower-case equivalents above.

Manipulating the expectation using a likelihood-ratio trick leads to REINFORCE-based methods [34]. In neural networks, likelihood-ratio based algorithms are known to be higher variance than methods that more directly approximate back-propagation (e.g. Rezende et al 2014 [26]). This motivates considering methods that more directly estimate the gradient terms [26, 12], breaking down (5) as we would with the deterministic expression. Here we focus on cases dominated by the mean reward gradient, meaning we assume the following:

$$\nabla_w\,\mathbb{E}\left[R(\mathbf{S})\right] \approx \mathbb{E}\left[\frac{\partial R}{\partial \mathbf{S}}\right]^\mathsf{T} \nabla_w\,\mu, \tag{6}$$

where *μ* is the mean activity vector of the neurons.

Fig. 2e suggests that the assumption (6) is reasonable for the case presented here. Of course, in general this assumption does not hold; however, the method presented here can likely be extended to broader cases. For instance, if we were to assume that the stationary distribution *ρ* can be approximated as Gaussian then we have:

$$\nabla_w\,\mathbb{E}\left[R(\mathbf{S})\right] = \mathbb{E}\left[\nabla_\mathbf{s} R\right]^\mathsf{T} \nabla_w\,\mu + \frac{1}{2}\,\mathrm{Tr}\left(\mathbb{E}\left[\nabla^2_\mathbf{s} R\right] \nabla_w\,\Sigma\right),$$

for *μ* and Σ, the mean and covariance of the Gaussian random vector **S** [26]. In this case quasi-Newton methods may be able to make use of the methods derived here. Alternatively, in some episodic learning cases, the gradient ∇_w may be computed by unrolling the network so that the parameters separate from the stochastic variables; this is sometimes known as the re-parameterization trick [10, 26, 12].

Thus we derive methods to estimate the reward gradient term $\mathbb{E}\left[\partial R/\partial S_i\right]$, and use it with (6) and (4) for stochastic gradient descent-based learning. We deal with spiking networks, meaning **S** is discontinuous at spike times. Therefore it makes sense to consider finite difference approximations to this gradient term.

### Causal effect in neural networks

How can a neuron estimate $\mathbb{E}\left[\partial R/\partial S_i\right]$? We show that the reward gradient term can be related to the causal effect of a neuron on the reward signal. To show this we replace $\partial R/\partial S_i$ with a type of finite difference operator:

$$\mathcal{D}_i R := \mathbb{E}_{\mathbf{S}_{j\neq i}}\left[ \frac{\mathbb{E}\left[R \mid H_i = 1, \mathbf{S}_{j\neq i}\right] - \mathbb{E}\left[R \mid H_i = 0, \mathbf{S}_{j\neq i}\right]}{\Delta_s} \right]. \tag{7}$$

Here **S**_{j≠i} ⊂ χ is a set of nodes that satisfies the back-door criterion with respect to *H_i* → *R*. When *R* is a deterministic, differentiable function of **S** and Δ_s → 0, this recovers the reward gradient and we recover gradient descent-based learning. However this formulation has the advantage that it is defined when *R* is not differentiable, it does not require that *R* be a deterministic function of **S**, and it does not require that Δ_s be small.

To consider the effect of a single spike, note that unit *i* spiking will cause a jump in *S_i* (according to (2)) compared with not spiking. If we let Δ_s equal this jump then it can be shown that the finite difference operator is related to the causal effect:

$$\mathcal{D}_i R = \frac{\beta_i}{\Delta_s}. \tag{8}$$

A derivation is presented in the supplementary material (Section A).

### Using regression discontinuity design

For comparison we define the observed dependence as:

$$\beta_i^{\text{OD}} = \mathbb{E}[R \mid H_i = 1] - \mathbb{E}[R \mid H_i = 0].$$

This of course provides an estimate of *β_i* only when *H_i* is independent of the other neurons in the network. In general the causal effect is confounded through correlation with other units.

As described in the main text, to remove confounding, RDD considers only the marginal super- and sub-threshold periods of time. This works because the discontinuity in the neuron's response induces a detectable difference in outcome for only a negligible difference between the sampled populations (sub- and super-threshold periods). The RDD method estimates [15]:

$$\beta_i^{\text{RDD}} = \lim_{z \downarrow \theta} \mathbb{E}[R \mid Z_i = z] - \lim_{z \uparrow \theta} \mathbb{E}[R \mid Z_i = z],$$

for maximum input drive obtained over a short time window, *Z_i*, and spiking threshold, *θ*; thus *Z_i* < *θ* means neuron *i* does not spike and *Z_i* ≥ *θ* means it does.

To estimate $\beta_i^{\text{RDD}}$, a neuron can estimate a piece-wise linear model of the reward function:

$$\mathbb{E}[R \mid Z_i = z] = \gamma_i + \beta_i H_i + \alpha_{li}\,(z - \theta)\,\mathbb{1}(z < \theta) + \alpha_{ri}\,(z - \theta)\,\mathbb{1}(z \ge \theta),$$

locally, when *Z_i* is within a small window *p* of threshold. Here *γ_i*, *α_{li}* and *α_{ri}* are nuisance parameters, and *β_i* is the causal effect of interest; the fitted *β_i* directly provides the estimate $\beta_i^{\text{RDD}}$.

A neuron can learn an estimate of *β_i* through a least squares minimization on the model parameters. That is, if we let $\mathbf{u}_i = [\gamma_i, \beta_i, \alpha_{li}, \alpha_{ri}]^\mathsf{T}$ and $\mathbf{a}_{i,t} = \left[1,\, h_{i,t},\, (z_{i,t} - \theta)\,\mathbb{1}(z_{i,t} < \theta),\, (z_{i,t} - \theta)\,\mathbb{1}(z_{i,t} \ge \theta)\right]^\mathsf{T}$, then the neuron solves:

$$\min_{\mathbf{u}_i}\; \mathbb{E}\left[\left(R - \mathbf{u}_i^\mathsf{T}\mathbf{a}_i\right)^2\right].$$

Performing stochastic gradient descent on this minimization problem gives the learning rule:

$$\mathbf{u}_i \leftarrow \mathbf{u}_i + \eta\,\left(R_t - \mathbf{u}_i^\mathsf{T}\mathbf{a}_{i,t}\right)\mathbf{a}_{i,t},$$

for all time periods at which *z_{i,t}* is within *p* of threshold *θ*.
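The online rule above can be sketched as follows. The parameter layout of **u**, the synthetic reward model, and all constants are illustrative assumptions made so the true *β* is known and recoverable:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative constants; u = [gamma, beta, alpha_l, alpha_r] is one possible
# layout of the piece-wise linear model's parameters.
theta, p, eta = 1.0, 0.2, 0.01
u = np.zeros(4)

def features(z):
    """Regressors a(z) of the piece-wise linear reward model."""
    h = 1.0 if z >= theta else 0.0
    return np.array([1.0,                    # intercept gamma
                     h,                      # spike indicator -> beta
                     (z - theta) * (1 - h),  # slope below threshold
                     (z - theta) * h])       # slope above threshold

beta_true = 0.5
for _ in range(200_000):
    z = rng.normal(1.0, 0.3)                 # maximal drive this window
    if abs(z - theta) > p:
        continue                             # use only near-threshold windows
    # synthetic reward consistent with the model, plus noise
    r = beta_true * (z >= theta) + 0.3 * (z - theta) + 0.05 * rng.normal()
    a = features(z)
    u += eta * (r - u @ a) * a               # SGD on the squared error

print(f"estimated beta: {u[1]:.3f}")
```

Each near-threshold window contributes one delta-rule update; windows far from threshold are simply skipped, mirroring the locality of the RDD fit.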

## Implementation

Python code used to run simulations and generate figures is available at: https://github.com/benlansdell/rdd.

### A The relation between causal effect and the finite difference operator

Here we present a more detailed derivation of (8), which relates the causal effect to a finite difference approximation of the reward gradient. First, by the back-door criterion and assuming the conditional independence of *R* from *H_i* given *S_i* and **S**_{j≠i}:

$$\beta_i = \mathbb{E}_{\mathbf{S}_{j\neq i}}\Big[ \mathbb{E}\big[R \mid H_i = 1, \mathbf{S}_{j\neq i}\big] - \mathbb{E}\big[R \mid H_i = 0, \mathbf{S}_{j\neq i}\big] \Big]. \tag{9}$$

Now if we assume that on average *H_i* spiking induces a change of Δ_s in *S_i* within the same time period, compared with not spiking, then:

$$p\left(s_i \mid H_i = 1, \mathbf{s}_{j\neq i}\right) = p\left(s_i - \Delta_s \mid H_i = 0, \mathbf{s}_{j\neq i}\right). \tag{10}$$

This is reasonable because the linearity of the synaptic dynamics, (2), means that the difference in *S_i* between spiking and non-spiking windows is simply $\exp\left(-(T - t_{si})/\tau_s\right)$, for spike time *t_{si}*. We approximate this term with its mean:

$$\Delta_s = \frac{1}{T}\int_0^T e^{-(T - t)/\tau_s}\,dt = \frac{\tau_s}{T}\left(1 - e^{-T/\tau_s}\right),$$

under the assumption that spike times occur uniformly throughout the length *T* window. These assumptions are supported numerically (Suppl. Fig. 1).

Writing out the inner two expectations of (9) gives:

$$\mathbb{E}\big[R \mid H_i = 1, \mathbf{s}_{j\neq i}\big] - \mathbb{E}\big[R \mid H_i = 0, \mathbf{s}_{j\neq i}\big] = \int R(s_i, \mathbf{s}_{j\neq i})\,p\left(s_i - \Delta_s \mid H_i = 0, \mathbf{s}_{j\neq i}\right)ds_i - \int R(s_i, \mathbf{s}_{j\neq i})\,p\left(s_i \mid H_i = 0, \mathbf{s}_{j\neq i}\right)ds_i$$

from (10),

$$= \int \big[R(s_i + \Delta_s, \mathbf{s}_{j\neq i}) - R(s_i, \mathbf{s}_{j\neq i})\big]\,p\left(s_i \mid H_i = 0, \mathbf{s}_{j\neq i}\right)ds_i,$$

after making the substitution *s_i* → *s_i* + Δ_s in the first term. Writing this back in terms of expectations gives the result:

$$\beta_i = \mathbb{E}_{\mathbf{S}_{j\neq i}}\Big[ \mathbb{E}\big[R(S_i + \Delta_s, \mathbf{S}_{j\neq i}) - R(S_i, \mathbf{S}_{j\neq i}) \mid H_i = 0, \mathbf{S}_{j\neq i}\big] \Big] = \Delta_s\,\mathcal{D}_i R.$$
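The uniform-spike-time approximation for Δ_s can be checked with a quick Monte Carlo average; the window length and synaptic time constant below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical window length and synaptic time constant.
T, tau_s = 0.05, 0.01
t = rng.uniform(0, T, 1_000_000)                 # uniform spike times in [0, T]
mc = np.exp(-(T - t) / tau_s).mean()             # Monte Carlo mean jump size
closed = tau_s / T * (1 - np.exp(-T / tau_s))    # closed-form mean

print(mc, closed)
```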

## Footnotes

+ lansdell{at}seas.upenn.edu