## Abstract

Recurrently connected networks of spiking neurons underlie the astounding information processing capabilities of the brain. But in spite of extensive research, it has remained open how they can learn through synaptic plasticity to carry out complex network computations. We argue that two pieces of this puzzle were provided by experimental data from neuroscience. A new mathematical insight tells us how these pieces need to be combined to enable biologically plausible online network learning through gradient descent, in particular deep reinforcement learning. This new learning method – called *e-prop* – approaches the performance of *BPTT* (backpropagation through time), the best known method for training recurrent neural networks in machine learning. In addition, it suggests a method for powerful on-chip learning in novel energy-efficient spike-based hardware for AI.

## Introduction

Networks of neurons in the brain differ in at least two essential aspects from deep neural networks in machine learning: They are recurrently connected, forming a giant number of loops, and they communicate via asynchronously emitted stereotypical electrical pulses, called spikes, rather than bits or numbers that are produced in a synchronized manner by each layer of a feedforward deep network. Models that capture primary information processing capabilities of spiking neurons in the brain are well known, and we consider the arguably most prominent one: leaky integrate-and-fire (LIF) neurons, where spikes that arrive from other neurons through synaptic connections are multiplied with the corresponding synaptic weight, and are linearly integrated by a leaky membrane potential. The neuron fires – i.e., emits a spike – when the membrane potential reaches a firing threshold.
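The LIF dynamics described here can be sketched in a few lines. This is a minimal illustration, not the paper's code; all names and parameter values (`alpha`, `v_th`, the input spike rate) are assumptions chosen for the sketch:

```python
import numpy as np

def lif_step(v, z, x_in, W_in, W_rec, alpha=0.95, v_th=1.0):
    """One 1 ms update of a layer of leaky integrate-and-fire neurons.

    v: membrane potentials, z: previous binary spike outputs,
    x_in: binary input spikes, W_in/W_rec: synaptic weight matrices.
    """
    # Leaky integration of weighted input and recurrent spikes;
    # the last term lowers the potential by v_th after a spike.
    v = alpha * v + W_in @ x_in + W_rec @ z - z * v_th
    z = (v > v_th).astype(float)  # fire when the threshold is crossed
    return v, z

rng = np.random.default_rng(0)
n, n_in = 5, 3
v, z = np.zeros(n), np.zeros(n)
W_in = rng.normal(0.0, 1.0, (n, n_in))
W_rec = rng.normal(0.0, 0.3, (n, n))
np.fill_diagonal(W_rec, 0.0)  # no self-connections
for t in range(100):
    x = (rng.random(n_in) < 0.1).astype(float)  # Poisson-like input spikes
    v, z = lif_step(v, z, x, W_in, W_rec)
```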

But it is an open problem how recurrent networks of spiking neurons (RSNNs) can learn, i.e., how their synaptic weights can be modified by local rules for synaptic plasticity so that the computational performance of the network improves. In deep learning this problem is solved for feedforward networks through gradient descent for a loss function *E* that measures imperfections of current network performance [1]. Gradients of *E* are propagated backwards through all layers of the feedforward network to each synapse through a process called backpropagation. Recurrently connected networks can compute more efficiently because each neuron can participate several times in a network computation, and they are able to solve tasks that require integration of information over time or a non-trivial timing of network outputs according to task demands. But since each synaptic weight can affect the network computation at several time points during a recurrent network computation, its impact on the loss function (see Fig. 1a) is more indirect, and learning through gradient descent becomes substantially more difficult. This learning problem is aggravated if there are slowly changing hidden variables in the neuron model, such as an adapting firing threshold of a spiking neuron, or the content of the memory cell of an LSTM (Long Short-Term Memory) unit in artificial neural networks. In machine learning, this problem has been solved by unrolling the recurrent network into a virtual feedforward network [1], see Fig. 1b, and applying the backpropagation algorithm to that (Fig. 1c). This learning method for recurrent neural networks is called backpropagation through time (*BPTT*) since it requires propagation of gradients backwards in time with regard to the network computation.

With a careful choice of the pseudo-derivative for handling the discontinuous dynamics of spiking neurons one can apply *BPTT* also to RSNNs, and in this way RSNNs were able for the first time to learn to solve really demanding computational tasks (see [2], [3] for preceding results). But the dilemma is that *BPTT* requires storing the intermediate states of all neurons during a network computation, and merging these in a subsequent offline process with gradients that are computed backwards in time (see Fig. 1c, Movie S1 and Movie S2). This makes it very unlikely that *BPTT* is used by the brain [4].

We present a solution to this dilemma in the form of a biologically plausible method for online network learning through gradient descent: *e-prop* (Fig. 1d, see Movie S3). *E-prop* is motivated by two streams of experimental data from neuroscience:

Neurons in the brain maintain traces of preceding activity on the molecular level, for example in the form of calcium ions or activated CaMKII enzymes [5]. In particular, they maintain a fading memory of events where the presynaptic neuron fired before the postsynaptic neuron, which is known to induce synaptic plasticity if followed by a top-down learning signal [6, 7, 8]. Such traces are often referred to as eligibility traces.

In the brain there exists an abundance of top-down signals such as dopamine, acetylcholine, and the event-related negativity (ERN), that inform local populations of neurons about behavioral results. Furthermore, both dopamine signals [9, 10] and ERN-related neural firing [11] have been found to be specific for different target populations of neurons, rather than being global. We refer in our learning model to such top-down signals as learning signals.

A re-analysis of the mathematical basis of gradient descent in recurrent neural networks tells us how local eligibility traces and top-down learning signals should be optimally combined to enable network learning through gradient descent – without requiring backpropagation of signals through time. The resulting new learning method, *e-prop*, turns out to approximate the performance of *BPTT*, thereby providing a solution to the learning dilemma for RSNNs. Crucially, *e-prop* also applies to RSNNs with more complex neuron models, such as LSNNs [2] that contain neurons with an adaptive firing threshold as a further hidden variable. It had been shown in [2] that LSNNs trained with *BPTT* achieve a performance similar to LSTM networks, the state of the art in machine learning. We show here that this performance level can also be achieved with *e-prop*, i.e., a method that is online and local, apart from top-down learning signals. This elucidates how the brain could learn to recognize speech (Fig. 2), solve temporal credit assignment problems (Fig. 3), and learn new behaviors – without any help from a teacher – just from rewards (Fig. 4, 5).

The previously described learning dilemma for RSNNs also affects the development and use of neuromorphic computing hardware, which aims at a drastic reduction in the energy consumption of AI implementations. A substantial fraction of this neuromorphic hardware, such as SpiNNaker [12] or Intel’s Loihi chip [13], implements RSNNs and aims at on-chip training of these RSNNs. Although it does not matter here whether the learning algorithm is biologically plausible, the excessive storage and offline processing demands of *BPTT* make this option unappealing for neuromorphic hardware. Hence there also exists a learning dilemma for RSNNs in neuromorphic hardware, which can be solved with *e-prop*.

## Results

### Mathematical basis for *e-prop*

Spikes are modeled as binary variables $z_j^t$ that assume value 1 if neuron *j* fires at time *t*, and value 0 otherwise. It is common in models to let *t* vary over small discrete time steps, e.g. of 1 ms length. The goal of network learning is to find synaptic weights *W* that minimize a given loss function *E*. *E* may depend on all or a subset of the spikes in the network. In the case of regression or classification learning, *E* measures the deviation of the actual output $y_k^t$ of each output neuron *k* at time *t* from its given target value $y_k^{*,t}$ (Fig. 1a). In reinforcement learning (RL), the goal is to optimize the behavior of an agent in order to maximize obtained rewards. In this case, *E* measures deficiencies of the current agent policy to collect rewards.
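As a concrete instance of such a loss function (an illustrative choice consistent with the description above, not an equation quoted from the Methods), regression learning could use the summed squared error of the readouts:

```latex
E \;=\; \frac{1}{2} \sum_{t,k} \left( y_k^t - y_k^{*,t} \right)^2
```

Minimizing this *E* by gradient descent drives each output $y_k^t$ toward its target $y_k^{*,t}$.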

The gradient $\frac{dE}{dW_{ji}}$ for the weight *W _{ji}* of the synapse from neuron *i* to neuron *j* tells us how this weight should be changed in order to reduce *E*. The key innovation is that this gradient can be represented as a sum over the time steps *t* of the RSNN computation, where the second factor is just a local gradient that does not depend on *E*:

$$\frac{dE}{dW_{ji}} \;=\; \sum_t \frac{dE}{dz_j^t} \cdot \left[ \frac{dz_j^t}{dW_{ji}} \right]_{\mathrm{local}} \tag{1}$$

This local gradient can be represented as a sum of products of partial derivatives concerning the hidden state of neuron *j* up to time *t* (equation (13)), which can be updated during the forward computation of the RNN by a simple recursion (equation (14)). This term is not an approximation. Rather, it collects the maximal amount of information about the network gradient that can be computed locally in a forward manner. Therefore it is the key factor of *e-prop*. Since it reduces for simple neuron models – whose internal state is fully captured by the membrane potential – to a variation of terms that are commonly referred to as eligibility traces for synaptic plasticity [8], we also refer to

$$e_{ji}^t \;\stackrel{\mathrm{def}}{=}\; \left[ \frac{dz_j^t}{dW_{ji}} \right]_{\mathrm{local}} \tag{2}$$

as the eligibility trace. But most biological neurons have additional hidden variables that change on a slower time scale, such as for example the firing threshold of a neuron with firing threshold adaptation. Furthermore, it was shown in [2] that these slower processes in neurons are essential for attaining with spiking neurons similarly powerful computing capabilities as with LSTM networks. Hence the form that this eligibility trace takes for adapting neurons (see equation (25)) is essential for understanding *e-prop*, and it is the main driver behind the resulting qualitative jump in the computing capabilities of RSNNs which are attainable through biologically plausible learning. Equations (1) and (2) yield the representation

$$\frac{dE}{dW_{ji}} \;=\; \sum_t L_j^t \, e_{ji}^t \tag{3}$$

of the loss gradient, where we refer to $L_j^t \stackrel{\mathrm{def}}{=} \frac{dE}{dz_j^t}$ as the learning signal for neuron *j*. This equation defines a clear program for approximating the network loss gradient through local rules for synaptic plasticity: Change each weight *W _{ji}* at step *t* proportionally to $-\eta L_j^t e_{ji}^t$, or accumulate these "tags" in a hidden variable that is translated occasionally into an actual weight change. Hence *e-prop* is an online learning method in a strict sense (see Fig. 1d and Movie S3). In particular, there is no need to unroll the network as for *BPTT*.

Since the ideal value $\frac{dE}{dz_j^t}$ of the learning signal also captures influences which the current spike output of neuron *j* may have on *E* via future spikes of other neurons, its precise value is in general not available at time *t*. We replace it by an approximation which ignores these indirect influences. This approximation takes only currently arising losses at the output neurons *k* of the RSNN into account, and routes them with neuron-specific weights *B _{jk}* to the network neurons *j* (see Fig. 2a):

$$L_j^t \;=\; \sum_k B_{jk} \left( y_k^t - y_k^{*,t} \right) \tag{4}$$

Although this approximate learning signal only captures errors that arise at the current time step *t*, it is combined in equation (3) with an eligibility trace that may reach far back into the past of neuron *j* (see Fig. 3b), thereby alleviating the need to solve the temporal credit assignment problem by propagating signals backwards in time (like in *BPTT*).
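Assembling these pieces, one forward pass of *e-prop* for a network of LIF neurons (input synapses only, fixed random broadcast weights as in *random e-prop*) might look as follows. This is a minimal sketch under assumed names and constants, not the authors' code; the dampened piecewise-linear pseudo-derivative `psi` and the leaky readout are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_in, n_out, T = 10, 4, 2, 200
alpha, v_th, lr, gamma_pd = 0.95, 0.6, 1e-2, 0.3

W_in = rng.normal(0.0, 0.5, (n, n_in))
W_out = rng.normal(0.0, 0.5, (n_out, n))
B = rng.normal(0.0, 0.5, (n, n_out))   # fixed random broadcast weights
kappa = 0.9                            # readout leak

v, z, y = np.zeros(n), np.zeros(n), np.zeros(n_out)
eps = np.zeros((n, n_in))              # forward-propagated eligibility vectors
dW_in = np.zeros_like(W_in)

for t in range(T):
    x = (rng.random(n_in) < 0.2).astype(float)
    y_target = np.array([np.sin(2 * np.pi * t / T), np.cos(2 * np.pi * t / T)])
    v = alpha * v + W_in @ x - z * v_th
    z = (v > v_th).astype(float)
    y = kappa * y + W_out @ z          # leaky readout of the network spikes
    # pseudo-derivative of the spike nonlinearity (surrogate gradient)
    psi = gamma_pd * np.maximum(0.0, 1.0 - np.abs((v - v_th) / v_th))
    eps = alpha * eps + x[None, :]     # low-pass trace of presynaptic spikes
    e_trace = psi[:, None] * eps       # eligibility trace, cf. equation (2)
    L = B @ (y - y_target)             # broadcast online learning signal, cf. (4)
    dW_in -= lr * L[:, None] * e_trace # accumulate the e-prop updates, cf. (3)
W_in += dW_in
```

Every quantity used in the update is available locally at time *t*, which is the point of the method.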

There are several strategies for choosing the weights *B _{jk}* for this online learning signal. In *symmetric e-prop* we set it equal to the corresponding weight $W_{kj}^{\mathrm{out}}$ of the synaptic connection from neuron *j* to output neuron *k*, as demanded by the formula for $\frac{dE}{dz_j^t}$. Note that this learning signal would actually implement $\frac{dE}{dz_j^t}$ exactly in the absence of recurrent connections in the network. Biologically more plausible are two variants of *e-prop* that avoid weight sharing: If all network neurons *j* are connected to output neurons *k*, we let *B _{jk}* evolve in *adaptive e-prop* through a simple local plasticity rule that mirrors the plasticity rule applied to $W_{kj}^{\mathrm{out}}$. In *random e-prop* the values of the weights *B _{jk}* are randomly chosen and remain fixed, similar to broadcast alignment for feedforward networks [14, 15].

Resulting synaptic plasticity rules (see Methods) look similar to previously proposed plasticity rules [8] for the special case of LIF neurons without slowly changing hidden variables. In particular, they involve postsynaptic depolarization as one of the factors, similar to the data-based Clopath rule in [16]; see section S6.4 in the supplement for an analysis.

### Comparing the performance of *e-prop* and *BPTT* for learning speech recognition

The speech recognition task TIMIT [17] is one of the most commonly used benchmarks for temporal processing capabilities of different types of recurrent neural networks and different learning approaches [18]. It comes in two versions. Both use, as input, acoustic speech signals from sentences that are spoken by 630 speakers from 8 dialect regions of the USA (see the top of Fig. 2b for a sample segment). In the simpler version, used for example in [18], the goal is to recognize which of 61 phonemes is spoken in each 10 ms time frame ("frame-wise classification"). In the more sophisticated version from [19], which achieved an essential step toward human-level performance in speech-to-text transcription, the goal is to recognize the sequence of phonemes in the entire spoken sentence independently of their timing ("sequence transcription"). RSNNs consisting of LIF neurons do not reach good performance on TIMIT even with *BPTT* [2]. Hence we are considering here LSNNs, where a random subset of the neurons is a variation of the LIF model with firing rate adaptation (ALIF neurons), see Methods. The name LSNN is motivated by the fact that this special case of the RSNN model can achieve through training with *BPTT* similar performance as an LSTM network [2].

*E-prop* approximates the performance of *BPTT* on LSNNs for both versions of TIMIT very well, as shown in Fig. 2c. Furthermore, LSNNs could solve the frame-wise classification task without any neuron firing more frequently than 12 Hz (spike count taken over 32 spoken sentences), demonstrating that they operate in an energy-efficient spike-coding – rather than a rate-coding – regime. For the more difficult version of TIMIT we trained, as in [19], a complex LSNN consisting of a feedforward sequence of three recurrent networks. Our results show that *e-prop* can also handle learning for such more complex network structures very well. For comparison, Fig. S3 also shows the performance of *e-prop* and *BPTT* for LSTM networks on the same tasks. These data show that for both versions of TIMIT the performance of *e-prop* for LSNNs comes rather close to that of *BPTT* for LSTM networks. In addition, they show that *e-prop* also provides a functionally powerful online learning method for LSTM networks.

The FORCE method of [20] is a frequently considered learning method for RSNNs. However it was not argued to be biologically realistic, since the plasticity rule for each synaptic weight requires knowledge of the current values of all other synaptic weights in the RSNN. It was applied in [20] to supervised learning of several pattern generation tasks. We show in Figs. S1, S5, and Movie S4 that RSNNs can learn such tasks also with a biologically plausible learning method: *e-prop*.

### *E-prop* performance for learning a task where temporal credit assignment is difficult

A hallmark of cognitive computations in the brain is the capability to go beyond a purely reactive mode: to integrate diverse sensory cues over time, and to wait until the right moment arrives for an action. A large number of experiments in neuroscience analyze neural coding after learning such tasks (see e.g. [21, 9]). But it had remained unknown how one can model the learning of such cognitive computations in RSNNs of the brain. In order to test whether *e-prop* can solve this problem, we considered the same task that was studied in the experiments of [21] and [9]. There a rodent moved along a linear track in a virtual environment, where it encountered several visual cues on the left and right, see Fig. 3a and Movie S1. Later, when it arrived at a T-junction, it had to decide whether to turn left or right. It was rewarded when it turned to that side from which it had previously received the majority of visual cues. This task is not easy to learn since the subject needs to find out that it does not matter on which side the last cue was, or in which order the cues were presented. Instead, the subject has to learn to count cues separately for each side and to compare the two resulting numbers. Furthermore, the cues need to be processed properly long before a reward is given. We show in Fig. S4 that LSNNs can learn this task via *e-prop* in exactly the same way, just from rewards. But since the way in which *e-prop* solves the underlying temporal credit assignment problem is easier to explain for the supervised learning version of this task, we discuss here the case where a teacher tells the subject at the end of each trial what would have been the right decision. This still yields a challenging scenario for any online learning method since non-zero learning signals arise only during the last 150 ms of a trial (Fig. 3b). Hence all synaptic plasticity has to take place during these last 150 ms, long after the input cues have been processed.
Nevertheless, *e-prop* is able to solve this learning problem, see Fig. 3c and Movie S3. It just needs a bit more time to reach the same performance level as offline learning via *BPTT* (see Movie S2). Whereas this task cannot even be solved by *BPTT* with a regular RSNN that has no adapting neurons (red curve in Fig. 3c), all three previously discussed variations of *e-prop* can solve it if the RSNN contains adapting neurons. We explain in section S2.4 how this task can also be solved by sparsely connected LSNNs: by integrating stochastic rewiring [22] into *e-prop*.

But how can the neurons in the LSNN learn to record and count the input cues if all the learning signals are identically 0 until the last 150 ms of a 2000 ms long trial (5th row of Fig. 3b)? The solution is indicated in the bottom row of Fig. 3b: The slow component of the eligibility traces *e _{ji}* of adapting neurons *j* decays with the typical long time constant of firing rate adaptation (see equation (24) and Movie S3), which typically lies in the range of seconds. Since these traces stretch from the beginning of the trial into its last phase, they enable learning of differential responses to "left" and "right" input cues that arrived over 1000 ms earlier, see Fig. 3d. Fig. 3d also provides insight into the functional role of the broadcast weights of *random e-prop*: The difference between the broadcast weights for the two competing output neurons determines for each neuron *j* whether it learns to respond in the first phase of a trial more to cues from the left or right. This observation suggests that neuron-specific learning signals for RSNNs have the advantage that they can create a diversity of feature detectors for task-relevant network inputs. Hence a suitable weighted sum of these feature detectors is later able to cancel remaining errors at the network output, similarly as in the case of feedforward networks [14].
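The persistence of the slow eligibility component can be illustrated with a two-component recursion of the kind the Methods derive for adapting neurons. All symbol names and constants below are assumptions for the sketch (with decay factors of the form exp(−δt/τ) and an adaptation time constant of 1.2 s), not values from the paper:

```python
import numpy as np

dt = 1.0                          # time step in ms
alpha = np.exp(-dt / 20.0)        # fast decay: membrane time constant 20 ms
rho = np.exp(-dt / 1200.0)        # slow decay: adaptation time constant 1.2 s
beta = 0.07                       # coupling of adaptation to the threshold

eps_v, eps_a = 0.0, 0.0           # fast / slow components of the eligibility vector
for t in range(2000):             # one 2 s trial
    pre_spike = 1.0 if t < 50 else 0.0   # a presynaptic cue early in the trial
    psi = 0.3 if t < 60 else 0.0         # pseudo-derivative: nonzero only while
                                         # the postsynaptic neuron is near threshold
    eps_v = alpha * eps_v + pre_spike    # fast component: decays within ~100 ms
    # slow component: fed by the fast one, decays with the adaptation constant
    eps_a = psi * eps_v + (rho - psi * beta) * eps_a
# at the end of the trial eps_v has vanished, but eps_a still carries
# a trace of the early cue, so a late learning signal can use it
```

This is exactly the mechanism the paragraph above describes: the slow component bridges the gap between the early cues and the learning signals that arrive only in the last 150 ms.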

### Reward-based e-prop

Deep RL has significantly advanced the state of the art in machine learning and AI through clever applications of *BPTT* to RL [23]. We found that one of the arguably most powerful RL methods within the range of deep RL approaches that are not directly biologically implausible, policy gradient in combination with actor-critic, can be implemented with *e-prop*. This yields the biologically plausible and hardware friendly deep RL algorithm *reward-based e-prop*. The LSNN learns here both an approximation to the value function (the “critic”) and a stochastic policy (the “actor”). Neuron-specific learning signals are combined in *reward-based e-prop* with a global signal that transmits reward prediction errors (Fig. 4b). In contrast to the supervised case, where the learning signals depend on the deviation from an external target signal, the learning signals communicate here how a stochastically chosen action deviates from the action mean that is currently proposed by the network.

The resulting online synaptic plasticity rule (5) for deep RL is similar to equation (3), except that a fading memory filter $\mathcal{F}_\gamma$ is applied here to the term $L_j^t \bar{e}_{ji}^t$, where *γ* is the given discount factor for future rewards and $\bar{e}_{ji}^t$ denotes a low-pass filtered copy of the eligibility trace $e_{ji}^t$ (see Methods). This term is multiplied in the synaptic plasticity rule with the reward prediction error $\delta^t = r^t + \gamma V^{t+1} - V^t$, where $r^t$ is the reward received at time *t*. This yields an instantaneous weight change of the form:

$$\Delta W_{ji}^t \;=\; -\eta\, \delta^t\, \mathcal{F}_\gamma\!\left( L_j^t\, \bar{e}_{ji}^t \right) \tag{5}$$

Previous 3-factor learning rules for RL were usually of the form $\Delta W_{ji}^t \propto \delta^t\, \bar{e}_{ji}^t$ [24, 8]. Hence they estimated gradients of the policy just by correlating the output of network neurons with the reward prediction error. The learning power of this approach is known to be quite limited due to high noise in the resulting gradient estimates. In contrast, in the plasticity rule (5) for *reward-based e-prop* the eligibility traces are first combined with a neuron-specific feedback $L_j^t$, before they are multiplied with the reward prediction error $\delta^t$. We show in Methods analytically that this yields estimates of policy and value gradients similarly as in deep RL with *BPTT*. Furthermore, in contrast to previously proposed 3-factor learning rules, this rule (5) is also applicable to LSNNs.
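The three-factor structure of such a rule can be sketched as follows. This is a schematic with made-up dimensions and stand-in values, not the paper's implementation; `L_e` stands for the per-synapse product of learning signal and low-pass filtered eligibility trace, which a full implementation would compute as in the supervised case:

```python
import numpy as np

def reward_based_eprop_step(W, filt, L_e, delta, gamma=0.99, lr=1e-3):
    """One step of a reward-based e-prop style update (schematic).

    L_e:   learning signal times filtered eligibility trace, per synapse
    delta: scalar reward prediction error r_t + gamma * V_next - V
    filt:  running fading-memory filter applied to L_e
    """
    filt = gamma * filt + L_e   # fading memory filter over the local terms
    W = W - lr * delta * filt   # global TD error gates the local updates
    return W, filt

# toy usage with a hypothetical 4x3 weight matrix
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 3))
filt = np.zeros_like(W)
for t in range(50):
    L_e = rng.normal(size=(4, 3)) * 0.01      # stand-in local terms
    r = 1.0 if t == 40 else 0.0               # sparse reward
    V, V_next = 0.1, 0.1                      # stand-in critic values
    delta = r + 0.99 * V_next - V             # reward prediction error
    W, filt = reward_based_eprop_step(W, filt, L_e, delta)
```

Note that `delta` is the only global quantity; everything else is local to the synapse, which is what makes the rule hardware friendly.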

We tested *reward-based e-prop* on a classical benchmark task [23] for learning intelligent behavior from rewards: winning Atari video games provided by the Arcade Learning Environment [25]. To win such a game, the agent needs to learn to extract salient information from the pixels of the game screen, and to infer the value of specific actions, even if rewards are obtained in a distant future. In fact, learning to win Atari games is a serious challenge for reinforcement learning even in machine learning [23]. Besides artificial neural networks and *BPTT*, previous solutions also required experience replay (with a perfect memory of many frames and action sequences that occurred much earlier) or an asynchronous training of numerous parallel agents sharing synaptic weight updates. We show here that also an LSNN can learn via *e-prop* to win Atari games, through online learning of a single agent. This becomes possible with a single agent and without episode replay if the agent uses a schedule of increasing episode lengths – with a learning rate that is inversely related to that length. Using this scheme, an agent can experience diverse and uncorrelated short episodes in the first phase of learning, producing useful skills. Subsequently, the agent can fine-tune its policy using longer episodes.
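Such a schedule of increasing episode lengths with an inversely related learning rate could look like the following sketch. The numbers and the doubling rule are illustrative assumptions, not the schedule used in the paper's experiments:

```python
def schedule(step, base_len=200, base_lr=5e-4, growth=2, every=10000):
    """Illustrative training schedule: the episode length grows periodically,
    and the learning rate shrinks in proportion, keeping the expected
    update magnitude per episode roughly constant."""
    k = step // every
    length = base_len * growth**k       # episodes get longer over training
    lr = base_lr / growth**k            # learning rate inversely related to length
    return length, lr
```

Early in training this yields many short, weakly correlated episodes; later it yields long episodes with smaller, more careful updates.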

First, we considered the well-known Atari game Pong (Fig. 4a). Here, the agent has to learn to hit a ball in a clever way using up and down movements of its paddle. A reward is obtained if the opponent cannot catch the ball. We trained an agent using *reward-based e-prop* for this task, and show a sample trial in Fig. 4c and Movie S5. In contrast to common deep RL solutions, the agent learns here in a strict online manner, receiving at any time just the current frame of the game screen. In panel d of Fig. 4 we demonstrate that this biologically realistic learning approach also leads to a competitive score.

If one does not insist on an online setting where the agent receives just the current frame of the video screen but allows the last 4 frames, winning strategies for about half of the Atari games can already be learnt by feedforward neural networks (see table S3 of [23]). However, it was shown in [23] that LSTM networks achieve with the A3C method a higher score than feedforward networks (with batch processing of 4 successive frames) for the game Pong. For other Atari games, such as Fishing Derby (Fig. 5a), it was even shown in [23] that deep RL applied to LSTM networks achieves a substantially higher score than any deep RL method for feedforward networks which was considered there. Hence, in order to test the power of online *reward-based e-prop* also for those Atari games that require enhanced temporal processing, we tested it on the Fishing Derby game. In this game, the agent has to catch as many fish as possible while preventing the shark from touching the fish with any part of its body, and the opponent from catching the fish first. We show in Fig. 5c that online *reward-based e-prop* applied to an LSNN does in fact reach the same performance as reference offline algorithms applied to LSTM networks. We show a random trial after learning in Fig. 5d, where we can identify two different learnt behaviors: 1) evading the shark, 2) collecting fish. The agent has learnt to switch between these two behaviors as required by the situation.

In general, we conjecture that variants of *reward-based e-prop* will be able to solve most deep RL tasks that can be solved by online actor-critic methods in machine learning.

## Discussion

We propose that in order to understand the computational function and neural coding of neural networks in the brain, one needs to understand the organization of the plasticity mechanisms that install and maintain these networks. So far, *BPTT* has been the only candidate for that, since no other learning method provided sufficiently powerful computational function to RSNN models. But since *BPTT* is not viewed as biologically realistic [4], it does not help us to solve this problem. *E-prop* offers a solution to this dilemma, since it does not require biologically unrealistic mechanisms, but still enables RSNNs to learn difficult computational tasks, in fact almost as well as *BPTT*. Furthermore, it enables RSNNs to solve these tasks in an energy-efficient sparse firing regime, rather than resorting to rate coding.

*E-prop* relies on two types of signals that are abundantly available in the brain, but whose precise role for learning has not yet been understood: eligibility traces and learning signals. Since *e-prop* is based on a transparent mathematical principle (see equation (3)), it provides a normative model for both types of signals, as well as for synaptic plasticity rules. Interestingly, the resulting learning model suggests that a characteristic aspect of many biological neurons – the presence of slowly changing hidden variables – provides a possible solution to the problem of how an RSNN can learn without error signals that propagate backwards in time: Slowly changing hidden variables of neurons cause eligibility traces that propagate forward over longer time spans, and are therefore able to coincide with later arising instantaneous error signals (see Fig. 3b).

The theory of *e-prop* makes a concrete experimentally testable prediction: that the time constant of the eligibility trace for a synapse is correlated with the time constant for the history-dependence of the firing activity of the postsynaptic neuron. It also suggests that the experimentally found diverse time constants of the firing activity of populations of neurons in different brain areas [26] are correlated with their capability to handle corresponding ranges of delays in temporal credit assignment for learning.

Finally, *e-prop* theory provides a hypothesis for the functional role of the experimentally found diversity of dopamine signals to different populations of neurons [9]. Whereas previous theories of reward-based learning required that the same learning signal is sent to all neurons, the basic equation (1) for *e-prop* postulates that ideal top-down learning signals to a population of neurons depend on its impact on the network performance (loss function), and should therefore be target specific (see Fig. 2c and section S6.2). In fact, the learning-to-learn result for *e-prop* in [27] suggests that prior knowledge about the possible range of learning tasks for a brain area could optimize top-down learning signals even further on an evolutionary time scale, thereby enabling, for example, learning from very few or even a single example.

Several methods for approximating stochastic gradient descent in feedforward networks of spiking neurons have been proposed, see e.g. [28, 29, 30, 31, 32]. These employ – like *e-prop* – a pseudo-gradient to overcome the non-differentiability of a spiking neuron, as proposed previously in [33, 34]. [28, 30, 31] arrive at a synaptic plasticity rule for feedforward networks that consists – like *e-prop* – of the product of a learning signal and a derivative (eligibility trace) that describes the dependence of a spike of a neuron *j* on the weight of an afferent synapse *W _{ji}*. But in a recurrent network the spike output of *j* depends on *W _{ji}* also indirectly, via loops in the network that allow a spike of neuron *j* to contribute to the firing of other neurons, which in turn affect the firing of the presynaptic neuron *i*. Hence the corresponding eligibility trace can no longer be computed locally if one transfers these methods for feedforward networks to recurrently connected networks. Therefore [28] suggests the need to investigate extensions of their approach to RSNNs. Previous work on the design of online gradient descent learning algorithms for RNNs was based on real-time recurrent learning (RTRL) [35]. RTRL has rarely been used since its computational complexity per time step is $O(n^4)$, if *n* is the number of neurons. But interesting approximations to RTRL have subsequently been proposed: stochastic approximations [36] that reduce this complexity, approximations that are only applicable to small networks [37], and also recently two deterministic approximations [38, 39]. The latter were in fact written at the same time as the first publication of *e-prop* [27]. A structural difference of [38] is that it requires learning signals to be transmitted between the neurons in the RNN, with separately learnt weights. [39] derived for rate-based neurons a learning rule similar to *random e-prop*. But this work did not address other forms of learning than supervised regression, such as RL, nor learning in networks of spiking neurons, or in more powerful types of RNNs such as LSTM networks or LSNNs.

*E-prop* also has complexity $O(n^2)$ per time step, in fact $O(S)$ if *S* is the number of synaptic connections. This bound is optimal, since this is already the complexity of just simulating the RNN. The key point of *e-prop* is that the general form (13) of its eligibility trace collects all contributions to the loss gradient that can be computed locally in a feedforward manner. This general form enables applications to spiking neurons with slowly varying hidden variables, such as neurons with firing rate adaptation, which are essential ingredients of RSNNs for reaching the computational power of LSTM networks [2]. It also enables the combination of these rigorously derived eligibility traces with – semantically identical but algorithmically very different – eligibility traces from RL for *reward-based e-prop* (equation (5)), thereby bringing the power of deep RL to RSNNs. As a result, we were able to show in Figs. 2-5 that RSNNs can learn, with the biologically plausible rules for synaptic plasticity that arise from the *e-prop* theory, to solve tasks such as speech recognition, integrating evidence over time and waiting for the right moment to act, and winning Atari games. These are tasks that are fundamental for modern learning-based AI, but had so far not been solved with RSNNs. Hence *e-prop* provides a new perspective on the major open question of how intelligent behavior can be learnt and controlled by neural networks of the brain.

Apart from obvious consequences of *e-prop* for research in neuroscience and cognitive science, *e-prop* also provides an interesting new tool for approaches in machine learning where *BPTT* is replaced by approximations in order to improve computational efficiency. We have already shown in Fig. S3 that *e-prop* provides a powerful online learning method for LSTM networks. Furthermore, the combination of eligibility traces from *e-prop* with synthetic gradients from [40] even improves performance of LSTM networks for difficult machine learning problems such as the copy-repeat task and the Penn Treebank word prediction task [27].

Finally, *e-prop* suggests a promising new approach for realizing powerful on-chip learning of RSNNs on neuromorphic chips. Whereas *BPTT* is not within the reach of current neuromorphic hardware, an implementation of *e-prop* appears to pose no serious hurdles. Our results show that such an implementation will provide a qualitative jump in the on-chip learning capabilities of neuromorphic hardware.

## Methods

Table of Contents:

- Network models
- Conventions
- Mathematical basis for *e-prop*
- Eligibility traces
- Derivation of eligibility traces for concrete neuron models
- Synaptic plasticity rules resulting from *e-prop*
- *Reward-based e-prop*: application of *e-prop* to deep RL

### Network models

To exhibit the generality of the *e-prop* approach, we define the dynamics of recurrent neural networks using a general formalism that is applicable to many recurrent neural network models, not only to RSNNs and LSNNs. Also non-spiking models such as LSTM networks fit under this formalism (see Section S4.3 in the Supplement). The network dynamics is summarized by the computational graph in Fig. 6. It uses the function *M* to define the update of the hidden state of a neuron *j*: *h*_{j}^{t} = *M*(*h*_{j}^{t−1}, **z**^{t−1}, **x**^{t}, **W**_{j}), where **W**_{j} gathers the weights of synapses arriving at neuron *j*, and the function *f* to define the update of the observable state of neuron *j*: *z*_{j}^{t} = *f*(*h*_{j}^{t}, **z**^{t−1}, **x**^{t}, **W**_{j}) (*f* simplifies to *z*_{j}^{t} = *f*(*h*_{j}^{t}) for LIF and ALIF neurons).

#### LIF neurons

Each LIF neuron has a one-dimensional internal state – or hidden variable – that consists only of the membrane potential *v*_{j}^{t}. The observable state *z*_{j}^{t} ∈ {0, 1} is binary, indicating a spike (*z*_{j}^{t} = 1) or no spike (*z*_{j}^{t} = 0) at time *t*. The dynamics of the LIF model is defined by the equations:

*v*_{j}^{t+1} = *α* *v*_{j}^{t} + Σ_{i≠j} *W*_{ji}^{rec} *z*_{i}^{t} + Σ_{i} *W*_{ji}^{in} *x*_{i}^{t+1} − *z*_{j}^{t} *v*_{th} (6)

*z*_{j}^{t} = *H*(*v*_{j}^{t} − *v*_{th}) (7)

*W*_{ji}^{rec} (*W*_{ji}^{in}) is the synaptic weight from network (input) neuron *i* to neuron *j*. The decay factor *α* in (6) is given by *e*^{−δt/τ_{m}}, where *δt* is the discrete time step size (1 ms in our simulations) and *τ*_{m} (typically 20 ms) is the membrane time constant. *H* denotes the Heaviside step function.
Due to the reset term −*z*_{j}^{t} *v*_{th} in equation (6), the neuron's membrane potential is reduced by a constant value after an output spike, which relates our model to the spike response model [41]. To introduce a simple model of neuronal refractoriness, we further assume that *z*_{j}^{t} is fixed to 0 after each spike of neuron *j* for a short refractory period of 2 to 5 ms, depending on the simulation.
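As an illustration, the discrete-time LIF update of equations (6) and (7) can be sketched in a few lines of Python; all parameter values and names here are illustrative assumptions, not the authors' reference implementation, and the refractory mechanism is omitted for brevity:

```python
import numpy as np

# Minimal sketch of the LIF dynamics of equations (6)-(7); parameter values
# and names are illustrative assumptions (refractoriness omitted).
DT, TAU_M, V_TH = 1.0, 20.0, 1.0        # step size (ms), tau_m (ms), threshold
ALPHA = np.exp(-DT / TAU_M)             # membrane decay factor

def lif_step(v, z, x, w_rec, w_in):
    """One time step: leaky integration, spike reset, threshold crossing."""
    v_new = ALPHA * v + w_rec @ z + w_in @ x - z * V_TH   # eq. (6)
    z_new = (v_new >= V_TH).astype(float)                 # eq. (7), H(v - v_th)
    return v_new, z_new

# Drive two unconnected neurons with a constant input current; only the
# neuron receiving input should ever spike.
v, z = np.zeros(2), np.zeros(2)
w_rec, w_in = np.zeros((2, 2)), np.array([[0.2], [0.0]])
spike_counts = np.zeros(2)
for _ in range(50):
    v, z = lif_step(v, z, np.ones(1), w_rec, w_in)
    spike_counts += z
```

With these assumed values the driven neuron settles into regular spiking: each spike subtracts *v*_{th} from the membrane potential via the reset term, after which the potential climbs back toward threshold.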

#### LSNNs

LSNNs are recurrently connected networks that consist of LIF neurons and of adaptive LIF (ALIF) neurons. An ALIF neuron *j* has a second hidden variable *a*_{j}^{t}, which denotes the variable component of its firing threshold. As a result, its internal state is a 2-dimensional vector *h*_{j}^{t} = [*v*_{j}^{t}, *a*_{j}^{t}]. The threshold *A*_{j}^{t} = *v*_{th} + *β* *a*_{j}^{t} increases with every output spike and decreases exponentially back to the baseline threshold *v*_{th}. This can be described by

*z*_{j}^{t} = *H*(*v*_{j}^{t} − *A*_{j}^{t}) (9)

with a threshold adaptation according to

*a*_{j}^{t+1} = *ρ* *a*_{j}^{t} + *z*_{j}^{t} (10)

where the decay factor *ρ* is given by *e*^{−δt/τ_{a}}, and *τ*_{a} is the adaptation time constant, which is typically chosen to be in the range of the time span of the working memory that is relevant for a given task. This is a very simple model for a neuron with spike frequency adaptation [2]. We refer to [41, 42, 43] for experimental data and other neuron models. We refer to a recurrent network of spiking neurons (RSNN) as an LSNN if some of its neurons are adaptive.

In relation to the more general formalism represented in the computational graph in Fig. 6, equations (6) and (10) define *M*, and equations (7) and (9) define *f*.
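The effect of the threshold adaptation of equation (10) can be illustrated in isolation; the parameter values below are assumptions chosen only to make the rise and decay visible:

```python
import numpy as np

# Sketch of the ALIF threshold adaptation of equations (9)-(10): the adaptive
# component a_j grows with each spike and decays with factor rho, so the
# effective threshold A = v_th + beta * a rises and relaxes back to v_th.
DT, TAU_A, BETA, V_TH = 1.0, 200.0, 0.07, 1.0
RHO = np.exp(-DT / TAU_A)               # threshold decay factor of eq. (10)

def threshold_step(a, z):
    """a^{t+1} = rho * a^t + z^t (equation (10))."""
    return RHO * a + z

a, thresholds = 0.0, []
for t in range(300):
    z = 1.0 if t < 5 else 0.0           # five spikes, then silence
    a = threshold_step(a, z)
    thresholds.append(V_TH + BETA * a)  # effective threshold A^t
```

After the burst the effective threshold is elevated, and over a few adaptation time constants it relaxes back toward the baseline *v*_{th}.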

#### Gradient descent for RSNNs

Gradient descent is problematic for spiking neurons because of the step function *H* in equation (7). We overcome this issue as in [44, 2]: the non-existing derivative ∂*z*_{j}^{t}/∂*v*_{j}^{t} is replaced in simulations by a simple nonlinear function of the membrane potential that is called the pseudo-derivative. Outside of the refractory period, we choose a pseudo-derivative of the form *ψ*_{j}^{t} = *γ*_{pd} max(0, 1 − |(*v*_{j}^{t} − *v*_{th})/*v*_{th}|), where *γ*_{pd} = 0.3. During the refractory period the pseudo-derivative is set to 0.
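A direct transcription of this surrogate gradient, with the triangular shape written out explicitly (the normalization by *v*_{th} is an assumption of this sketch):

```python
# Sketch of the pseudo-derivative replacing dH/dv in simulations; the
# triangular shape peaks at the firing threshold, and the normalization
# by V_TH is an assumption of this illustration.
GAMMA_PD, V_TH = 0.3, 1.0

def pseudo_derivative(v, refractory=False):
    """Surrogate for the non-existing derivative of the step function H."""
    if refractory:
        return 0.0                       # zero during the refractory period
    return GAMMA_PD * max(0.0, 1.0 - abs((v - V_TH) / V_TH))
```

The function equals *γ*_{pd} exactly at threshold and vanishes for membrane potentials far from it, which keeps the surrogate gradients bounded and sparse.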

#### Network output and loss functions

We assume that network outputs *y*_{k}^{t} are real-valued and produced by leaky output neurons (readouts) *k*, which are not recurrently connected:

*y*_{k}^{t} = *κ* *y*_{k}^{t−1} + Σ_{j} *W*_{kj}^{out} *z*_{j}^{t} + *b*_{k}^{out} (11)

where *κ* ∈ [0, 1] defines the leak and *b*_{k}^{out} denotes the output bias. The leak factor *κ* is given for spiking neurons by *e*^{−δt/τ_{out}}, where *τ*_{out} is the membrane time constant of the output neurons. Note that for non-spiking neural networks (such as LSTM networks), temporal smoothing of the network observable state is not necessary. In this case, one can use *κ* = 0.

The loss function *E*(**z**^{1},…, **z**^{T}) quantifies the network performance. We assume that it depends only on the observable states **z**^{1},…, **z**^{T} of the network neurons. For instance, for a regression problem we define *E* as the mean square error *E* = ½ Σ_{t,k} (*y*_{k}^{t} − *y*_{k}^{*,t})² between the network outputs *y*_{k}^{t} and target values *y*_{k}^{*,t}. For classification or RL tasks the loss function *E* has to be re-defined accordingly.

### Conventions

#### Notation for derivatives

We distinguish the total derivative *dE*/*d***z**^{t}, which takes into account how *E* depends on **z**^{t} also indirectly through the influence of **z**^{t} on the other variables **z**^{t+1}, …, **z**^{T}, and the partial derivative ∂*E*/∂**z**^{t}, which quantifies only the direct dependence of *E* on **z**^{t}.

Analogously, for the hidden state *h*_{j}^{t}, the partial derivative ∂*h*_{j}^{t}/∂*h*_{j}^{t−1} denotes the partial derivative of *M* with respect to *h*_{j}^{t−1}. It only quantifies the direct influence of *h*_{j}^{t−1} on *h*_{j}^{t}, and it does not take into account how *h*_{j}^{t−1} indirectly influences *h*_{j}^{t} via the observable states **z**^{t−1}. To improve readability we also use corresponding abbreviations for these partial derivatives.

#### Notation for temporal filters

For ease of notation we use the operator *F*_{α} to denote the low-pass filter such that, for any time series *x*^{t}:

*F*_{α}(*x*)^{t} = *α* *F*_{α}(*x*)^{t−1} + *x*^{t}

and *F*_{α}(*x*)^{0} = *x*^{0}. In the specific case of the time series *z*_{j}^{t} and *e*_{ji}^{t}, we simplify notation further and write *z̄*_{j}^{t} and *ē*_{ji}^{t} for *F*_{α}(*z*_{j})^{t} and *F*_{κ}(*e*_{ji})^{t}.
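The filter operator amounts to a one-line recursion; a minimal sketch:

```python
# Sketch of the low-pass filter operator F_alpha:
# F_alpha(x)^t = alpha * F_alpha(x)^{t-1} + x^t, with F_alpha(x)^0 = x^0.
def low_pass(xs, alpha):
    out, acc = [], 0.0
    for x in xs:
        acc = alpha * acc + x   # exponential trace of the series x
        out.append(acc)
    return out
```

Applied to a spike train, this yields the exponentially decaying traces *z̄*_{j}^{t} used throughout the eligibility trace computations below.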

### Mathematical basis for *e-prop*

We provide here the proof of the fundamental equation (1) for *e-prop*

*dE*/*dW*_{ji} = Σ_{t} (*dE*/*dz*_{j}^{t}) · *e*_{ji}^{t} (1)

with the new eligibility trace

*e*_{ji}^{t} = (∂*z*_{j}^{t}/∂*h*_{j}^{t}) · *ε*_{ji}^{t} (13)

For spiking neurons *j* we replace the first factor ∂*z*_{j}^{t}/∂*h*_{j}^{t} of *e*_{ji}^{t} by the pseudo-derivative, see [2, 3, 44]. The second factor *ε*_{ji}^{t}, which we call the eligibility vector, obviously satisfies the recursive equation

*ε*_{ji}^{t} = (∂*h*_{j}^{t}/∂*h*_{j}^{t−1}) · *ε*_{ji}^{t−1} + ∂*h*_{j}^{t}/∂*W*_{ji} (14)

where · denotes the dot product. This provides the rule for the online computation of *ε*_{ji}^{t}, and hence of *e*_{ji}^{t}.

We start from a classical factorization of the loss gradients in recurrent neural networks that arises for instance in equation (12) of [45] to describe *BPTT*. This classical factorization can be justified by unrolling an RNN into a large feedforward network where each layer (*l*) represents one time step. In a feedforward network the loss gradients with respect to the weights of layer *l* are given by (*dE*/*dh*_{j}^{(l)}) · (∂*h*_{j}^{(l)}/∂*W*_{ji}^{(l)}). But as the weights are shared across the layers when representing a recurrent network, the summation of these gradients over the layers *l* of the unrolled RNN yields this classical factorization of the loss gradients:

*dE*/*dW*_{ji} = Σ_{t′} (*dE*/*dh*_{j}^{t′}) · (∂*h*_{j}^{t′}/∂*W*_{ji}) (15)

Note that the first factor *dE*/*dh*_{j}^{t′} in these products also needs to take into account how the internal state **h**_{j} of neuron *j* evolves during subsequent time steps, and whether it influences the firing of *j* at later time steps. This is especially relevant for ALIF neurons and other biologically realistic neuron models with slowly changing internal states. Note that this first factor of (15) is replaced in the *e-prop* equation (13) by the derivative of *E* with regard to the observable variable *z*_{j}^{t}. There the evolution of the internal state of neuron *j* is pushed into the second factor, the eligibility trace *e*_{ji}^{t}, which collects in *e-prop* all online computable factors of the loss gradient that just involve neurons *j* and *i*.

Now we show that one can re-factorize the expression (15) and prove that the loss gradients can also be computed using the new factorization (13) that underlies *e-prop*. In the steps of the subsequent proof until equation (19), we decompose the term *dE*/*dh*_{j}^{t′} into a series of learning signals *L*_{j}^{t} and local factors ∂*h*_{j}^{t+1}/∂*h*_{j}^{t} for *t* ≥ *t*′. Those local factors will later be used to transform the partial derivative ∂*h*_{j}^{t′}/∂*W*_{ji} from equation (15) into the eligibility vector *ε*_{ji}^{t} that integrates the whole history of the synapse up to time *t*, not just a single time step. To do so, we express *dE*/*dh*_{j}^{t′} recursively as a function of the same derivative at the next time step by applying the chain rule at the node *h*_{j}^{t′} for *t* = *t′* of the computational graph shown in Fig. 6c:

*dE*/*dh*_{j}^{t′} = *L*_{j}^{t′} · (∂*z*_{j}^{t′}/∂*h*_{j}^{t′}) + (*dE*/*dh*_{j}^{t′+1}) · (∂*h*_{j}^{t′+1}/∂*h*_{j}^{t′}) (17)

where we defined the learning signal *L*_{j}^{t} as *dE*/*dz*_{j}^{t}. The resulting recursive expansion ends at the last time step *T* of the computation of the RNN, i.e., *dE*/*dh*_{j}^{T+1} = 0. If one repeatedly substitutes the recursive formula (17) into the classical factorization (15) of the loss gradients, one gets:

*dE*/*dW*_{ji} = Σ_{t′} [Σ_{t≥t′} *L*_{j}^{t} (∂*z*_{j}^{t}/∂*h*_{j}^{t}) · (∂*h*_{j}^{t}/∂*h*_{j}^{t−1}) ⋯ (∂*h*_{j}^{t′+1}/∂*h*_{j}^{t′})] · (∂*h*_{j}^{t′}/∂*W*_{ji}) (19)

The following equation is the main equation for understanding the transformation from *BPTT* into *e-prop*. The key idea is to collect all terms that are multiplied with the learning signal *L*_{j}^{t} at a given time *t*. These are only terms that concern events in the computation of neuron *j* up to time *t*, and they do not depend on other future losses or variable values. To this end, we write the term in parentheses in equation (19) as a second sum indexed by *t* and exchange the summation indices to pull out the learning signal *L*_{j}^{t}. This expresses the loss gradient of *E* as a sum of learning signals *L*_{j}^{t} multiplied by some factor indexed by *ji*, which we define as the eligibility trace *e*_{ji}^{t}. The main factor of it is the eligibility vector *ε*_{ji}^{t}, which has the same dimension as the hidden state *h*_{j}^{t}:

*dE*/*dW*_{ji} = Σ_{t} *L*_{j}^{t} (∂*z*_{j}^{t}/∂*h*_{j}^{t}) · *ε*_{ji}^{t}  with  *ε*_{ji}^{t} = Σ_{t′≤t} (∂*h*_{j}^{t}/∂*h*_{j}^{t−1}) ⋯ (∂*h*_{j}^{t′+1}/∂*h*_{j}^{t′}) · (∂*h*_{j}^{t′}/∂*W*_{ji})

This completes the proof of equations (1), (3), (13).

### Derivation of eligibility traces for concrete neuron models

The eligibility traces for LSTMs are derived in the supplementary materials. Below we provide the derivation of eligibility traces for spiking neurons.

#### Eligibility traces for LIF neurons

We compute the eligibility trace of a synapse of a LIF neuron without adaptive threshold (equation (6)). Here the hidden state of a neuron consists just of the membrane potential *v*_{j}^{t}, and we have ∂*h*_{j}^{t+1}/∂*h*_{j}^{t} = *α* and ∂*h*_{j}^{t}/∂*W*_{ji} = *z*_{i}^{t−1} (for a derivation of the eligibility traces taking the reset into account we refer to Section S1.2). Using these derivatives and equation (14), one obtains that the eligibility vector is the low-pass filtered presynaptic spike train, *ε*_{ji}^{t} = *z̄*_{i}^{t−1}, and following equation (13), the eligibility trace is:

*e*_{ji}^{t} = *ψ*_{j}^{t} *z̄*_{i}^{t−1} (23)

For all neurons *j* the derivations in the next sections also hold for synaptic connections from input neurons *i*, but one needs to replace the network spikes *z*_{i}^{t−1} by the input spikes *x*_{i}^{t} (the time index switches from *t* − 1 to *t* because the hidden state is defined as a function of the input at time *t* but of the preceding recurrent activity). For simplicity we have focused on the case where transmission delays between neurons in the RSNN are just 1 ms. If one uses a more realistic delay length *d*, then − *d* appears in equations (23)–(25) instead of −1 as the most relevant time point for presynaptic firing (see Section S1.3). This moves the resulting synaptic plasticity rules closer to experimentally observed forms of STDP.
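For a single LIF synapse, the online computation of equations (14) and (23) can be sketched as follows, with the pseudo-derivative inlined; all names and parameter values are illustrative assumptions:

```python
# Sketch of the online LIF eligibility computation: the eligibility vector
# (eq. 14) is the low-pass filtered presynaptic spike train, and the trace
# (eq. 23) multiplies it with the postsynaptic pseudo-derivative.
# Parameter values are illustrative assumptions.
ALPHA, GAMMA_PD, V_TH = 0.95, 0.3, 1.0

def eligibility_step(eps, z_pre, v_post):
    """Returns the updated eligibility vector eps and the trace e_ji."""
    eps = ALPHA * eps + z_pre                                  # eq. (14)
    psi = GAMMA_PD * max(0.0, 1.0 - abs((v_post - V_TH) / V_TH))
    return eps, psi * eps                                      # eq. (23)

eps, traces = 0.0, []
for t in range(10):
    z_pre = 1.0 if t % 3 == 0 else 0.0     # presynaptic spike train
    eps, e = eligibility_step(eps, z_pre, v_post=0.9)
    traces.append(e)
```

Each synapse thus maintains only one scalar that is updated locally in the forward direction, with no information flowing backwards in time.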

#### Eligibility traces for ALIF neurons

The hidden state of an ALIF neuron is a two-dimensional vector *h*_{j}^{t} = [*v*_{j}^{t}, *a*_{j}^{t}]. Hence a two-dimensional eligibility vector *ε*_{ji}^{t} = [*ε*_{ji,v}^{t}, *ε*_{ji,a}^{t}] is associated with the synapse from neuron *i* to neuron *j*, and ∂*h*_{j}^{t+1}/∂*h*_{j}^{t} is a 2 × 2 matrix. The derivatives ∂*a*_{j}^{t+1}/∂*a*_{j}^{t} and ∂*a*_{j}^{t+1}/∂*v*_{j}^{t} capture the dynamics of the adaptive threshold. Hence, to derive the computation of eligibility traces, we substitute the spike *z*_{j}^{t} in equation (10) by its definition given in equation (9). With this convention one finds that the diagonal of the matrix ∂*h*_{j}^{t+1}/∂*h*_{j}^{t} is formed by the terms ∂*v*_{j}^{t+1}/∂*v*_{j}^{t} = *α* and ∂*a*_{j}^{t+1}/∂*a*_{j}^{t} = *ρ* − *ψ*_{j}^{t} *β*. Above and below the diagonal, one finds respectively ∂*v*_{j}^{t+1}/∂*a*_{j}^{t} = 0 (when the reset is ignored) and ∂*a*_{j}^{t+1}/∂*v*_{j}^{t} = *ψ*_{j}^{t}. Seeing that ∂*z*_{j}^{t}/∂*h*_{j}^{t} = [*ψ*_{j}^{t}, −*β* *ψ*_{j}^{t}], one can finally compute the eligibility traces using equation (13).

The component of the eligibility vector associated with the membrane potential remains the same as in the LIF case and only depends on the presynaptic neuron: *ε*_{ji,v}^{t} = *z̄*_{i}^{t−1}.

For the component associated with the adaptive threshold we find the following recursive update:

*ε*_{ji,a}^{t+1} = *ψ*_{j}^{t} *z̄*_{i}^{t−1} + (*ρ* − *ψ*_{j}^{t} *β*) *ε*_{ji,a}^{t}

and this results in an eligibility trace of the form:

*e*_{ji}^{t} = *ψ*_{j}^{t} (*z̄*_{i}^{t−1} − *β* *ε*_{ji,a}^{t}) (24)

Recall that the constant *ρ* arises from the adaptation time constant *τ*_{a}, which typically lies in the range of hundreds of milliseconds to a few seconds in our experiments, yielding values of *ρ* between 0.995 and 0.9995. The constant *β* is typically of the order of 0.07 in our experiments.

To provide a more interpretable form of eligibility trace that fits into the standard form of local terms considered in 3-factor learning rules [8], one may drop the term −*ψ*_{j}^{t} *β* in the factor (*ρ* − *ψ*_{j}^{t} *β*) of equation (24). This approximation of equation (24) becomes an exponential trace of the post-pre pairings accumulated within a time window as large as the adaptation time constant:

*e*_{ji}^{t} ≈ *ψ*_{j}^{t} (*z̄*_{i}^{t−1} − *β* *F*_{ρ}(*ψ*_{j} *z̄*_{i})^{t−1}) (25)

The eligibility traces are computed with equation (24) in most experiments, but the performances obtained with *symmetric e-prop* and this simplification were indistinguishable in the task of Fig. 3, where temporal credit assignment is difficult.
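Putting equation (24) together with the membrane component, the two-component online update can be sketched as follows; variable names, parameter values, and the update ordering are assumptions of this illustration:

```python
# Sketch of the two-component ALIF eligibility computation of equation (24):
# eps_v is the filtered presynaptic spike train (as for LIF) and eps_a follows
# its own recursion with decay (rho - psi * beta). Values are assumptions.
ALPHA, RHO, BETA = 0.95, 0.995, 0.07

def alif_eligibility_step(eps_v, eps_a, z_pre, psi):
    """One online step; eps_v, eps_a are the two eligibility vector parts."""
    e = psi * (eps_v - BETA * eps_a)                  # eligibility trace (24)
    eps_a = psi * eps_v + (RHO - psi * BETA) * eps_a  # threshold component
    eps_v = ALPHA * eps_v + z_pre                     # membrane component
    return e, eps_v, eps_a

# Constantly spiking presynaptic neuron, constant pseudo-derivative.
eps_v, eps_a, e = 0.0, 0.0, 0.0
for _ in range(200):
    e, eps_v, eps_a = alif_eligibility_step(eps_v, eps_a, 1.0, 0.27)
```

Because *ρ* is close to 1, the threshold component *ε*_{ji,a} integrates post-pre pairings over a much longer window than the membrane component, which is what allows the trace to bridge long delays.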

### Synaptic plasticity rules resulting from *e-prop*

An exact computation of the ideal learning signal *dE*/*dz*_{j}^{t} in equation (1) requires back-propagating gradients through time (see Fig. 6c). For online *e-prop* we replace it with the partial derivative ∂*E*/∂*z*_{j}^{t}, which can be computed online. Implementing the weight updates with gradient descent and learning rate *η*, all the following plasticity rules are derived from

Δ*W*_{ji} = −*η* Σ_{t} (∂*E*/∂*z*_{j}^{t}) *e*_{ji}^{t} (27)

Note that in the absence of the superscript *t*, Δ*W*_{ji} denotes the cumulated weight change over one trial or batch of consecutive trials, not the instantaneous weight update. This can be implemented online by accumulating weight updates in a hidden synaptic variable. Note also that the weight updates derived in the following for the recurrent weights *W*_{ji}^{rec} also apply to the input weights *W*_{ji}^{in}. For the output weights and biases the derivation does not require the theory of *e-prop*, and the weight updates can be found in Section S3.1.

#### Case of regression tasks

In the case of a regression problem with targets *y*_{k}^{*,t} and outputs *y*_{k}^{t} defined in equation (11), we define the loss function *E* = ½ Σ_{t,k} (*y*_{k}^{t} − *y*_{k}^{*,t})². This results in a partial derivative of the form ∂*E*/∂*z*_{j}^{t} = Σ_{k} *W*_{kj}^{out} Σ_{t′≥t} (*y*_{k}^{t′} − *y*_{k}^{*,t′}) *κ*^{t′−t}. This seemingly provides an obstacle for online learning, because the partial derivative is a weighted sum over future errors. But this problem can be resolved, since one can interchange the two summation indices in the expression for the weight updates (see Section S3.1). In this way the sum over future events transforms into a low-pass filtering of the eligibility traces, *ē*_{ji}^{t} = *F*_{κ}(*e*_{ji})^{t}, and the resulting weight update can be written as

Δ*W*_{ji}^{rec} = −*η* Σ_{t} Σ_{k} *W*_{kj}^{out} (*y*_{k}^{t} − *y*_{k}^{*,t}) *ē*_{ji}^{t} (28)
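The accumulation of this update over a trial can be sketched as follows; array shapes and names are assumptions, and the feedback uses the output weights as in *symmetric e-prop*:

```python
import numpy as np

# Sketch of the regression weight update of equation (28): the output error
# is broadcast through the output weights and multiplied with the
# kappa-filtered eligibility traces. Names and shapes are assumptions.
ETA, KAPPA = 1e-3, 0.9

def eprop_regression_update(errors, w_out, e_traces):
    """errors: (T, K) outputs minus targets; w_out: (K, n) output weights;
    e_traces: (T, n, m) eligibility traces. Returns the accumulated update."""
    T = e_traces.shape[0]
    dw = np.zeros(e_traces.shape[1:])
    filt = np.zeros_like(dw)
    for t in range(T):
        filt = KAPPA * filt + e_traces[t]           # low-pass filter F_kappa
        learning_signal = w_out.T @ errors[t]       # broadcast error, shape (n,)
        dw -= ETA * learning_signal[:, None] * filt
    return dw

# Zero output error yields a zero update.
dw = eprop_regression_update(np.zeros((5, 2)), np.ones((2, 3)), np.ones((5, 3, 4)))
```

Note that every quantity in the loop is available at time *t*, so the update can be accumulated online in a hidden synaptic variable as described above.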

#### Case of classification tasks

We assume that *K* target categories are provided in the form of a one-hot encoded vector **π**^{*,t} with *K* dimensions. We define the probability for class *k* predicted by the network as *π*_{k}^{t} = softmax_{k}(*y*_{1}^{t}, …, *y*_{K}^{t}) = exp(*y*_{k}^{t}) / Σ_{k′} exp(*y*_{k′}^{t}), and the loss function for classification tasks as the cross-entropy error *E* = −Σ_{t} Σ_{k} *π*_{k}^{*,t} log *π*_{k}^{t}. The plasticity rule resulting from *e-prop* reads (see derivation in Section S3.1):

Δ*W*_{ji}^{rec} = −*η* Σ_{t} Σ_{k} *W*_{kj}^{out} (*π*_{k}^{t} − *π*_{k}^{*,t}) *ē*_{ji}^{t}
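The error term entering this rule is simply the difference between predicted and target probabilities, since the softmax and cross-entropy derivatives combine; a small sketch (function names are assumptions):

```python
import numpy as np

# Sketch of the classification error signal: with softmax outputs, the
# derivative of the cross-entropy w.r.t. the readouts y^t reduces to
# pi^t - pi^{*,t}. Function names are assumptions.
def softmax(y):
    e = np.exp(y - np.max(y))    # subtract max for numerical stability
    return e / e.sum()

def error_signal(y, target_onehot):
    """Per-time-step factor (pi^t - pi^{*,t}) of the plasticity rule."""
    return softmax(y) - target_onehot

s = error_signal(np.array([1.0, 2.0, 3.0]), np.array([0.0, 0.0, 1.0]))
```

The components of this signal always sum to zero, and the entry of the correct class is negative, which pushes the corresponding readout upwards.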

### *Reward-based e-prop*: application of *e-prop* to deep RL

For reinforcement learning, the network interacts with an external environment. At any time *t* the environment can provide a positive or negative reward *r*^{t}. Based on the observations **x**^{t} that are perceived, the network has to commit to actions *a*^{t_0}, ⋯, *a*^{t_n}, ⋯ at certain decision times *t*_{0}, ⋯, *t*_{n}, ⋯. Each action *a*^{t} is sampled from a probability distribution *π*(· | **y**^{t}), which is also referred to as the policy of the RL agent. The policy is defined as a function of the network outputs **y**^{t}, and is chosen here to be a categorical distribution over *K* discrete action choices. We assume that the agent chooses action *k* with probability *π*_{k}^{t} = softmax_{k}(*y*_{1}^{t}, …, *y*_{K}^{t}).

The goal of reinforcement learning is to maximize the expected sum of discounted rewards. That is, we want to maximize the expected return 𝔼[*R*^{0}] at time *t* = 0, where the return at time *t* is defined as *R*^{t} = Σ_{t′≥t} *γ*^{t′−t} *r*^{t′} with a discount factor *γ* ≤ 1. The expectation is taken over the agent actions *a*^{t}, the rewards *r*^{t} and the observations from the environment **x**^{t}. We approach this optimization problem by using the actor-critic variant of the policy gradient algorithm, which applies gradient ascent to maximize 𝔼[*R*^{0}]. The basis of the estimated gradient relies on an estimation of the policy gradient, as shown in section 13.3 in [46]. There, the resulting weight update is given in equation (13.8), where *G*_{t} refers to the return *R*^{t}. Hence, the gradient is proportional to 𝔼[Σ_{n} *R*^{t_n} *d* log *π*(*a*^{t_n} | **y**^{t_n})/*dW*_{ji}], which is easier to compute because the expectation can be estimated by an average over one or many trials. Following this strategy, we define the per-trial loss function *E*_{π} as a function of the sequence of actions *a*^{t_0}, ⋯, *a*^{t_n}, ⋯ and rewards *r*^{0}, ⋯, *r*^{T} sampled during this trial:

*E*_{π} = −Σ_{n} *R*^{t_n} log *π*(*a*^{t_n} | **y**^{t_n})

And thus:

*dE*_{π}/*dW*_{ji} = −Σ_{n} *R*^{t_n} *d* log *π*(*a*^{t_n} | **y**^{t_n})/*dW*_{ji}

Intuitively, given a trial with high rewards, the policy gradient changes the network output **y** to increase the probability of the actions *a*^{t_n} that occurred during this trial. In practice, the gradient is known to have high variance, and the efficiency of the learning algorithm can be improved using the actor-critic variant of the policy gradient algorithm. It involves the policy *π* (the actor) and an additional output neuron *V*^{t} which predicts the value function 𝔼[*R*^{t}] (the critic). The actor and the critic are learnt simultaneously by defining the loss function as *E* = *E*_{π} + *c*_{V} *E*_{V}, where *E*_{π} measures the performance of the stochastic policy *π*, and *E*_{V} = Σ_{t} (*R*^{t} − *V*^{t})² measures the accuracy of the value estimate *V*^{t}.

Since *V*^{t} is independent of the action *a*^{t}, one can show that 𝔼[*V*^{t_n} *d* log *π*(*a*^{t_n} | **y**^{t_n})/*dW*_{ji}] = 0.

^{t}We can use that to define an estimator of the loss gradient with reduced variance:
similarly as in equation (13.11) of section 13.4 in [46]. A difference in notation is that *b*(*S _{t}*) refers to our value estimation

*V*. In addition, equation (34) already includes the gradient that is responsible for learning the value prediction. Until now this derivation follows the classical definition of the actor-critic variant of policy gradient, and the gradient can be computed with

^{t}*BPTT*. To derive

*reward-based e-prop*we follow instead the generic online approximation of

*e-prop*as in equation (27) and approximate by a sum of terms of the form with

We choose this estimator of the loss derivative because it is unbiased and has a low variance; more details are given in Section S5.1. We derive below the resulting synaptic plasticity rule as needed to solve the tasks of Fig. 4, 5. For the case of a single action as used in Fig. S4 we refer to Section S5.1.

When there is a delay between the action and the reward or, even harder, when a sequence of many actions leads together to a delayed reward, the loss function *E* cannot be computed online because the evaluation of *R*^{t_n} requires knowledge of future rewards. To overcome this, we introduce temporal difference errors *δ*^{t} = *r*^{t} + *γV*^{t+1} − *V*^{t} (see Fig. 4), and use the equivalence between the forward and backward view in reinforcement learning [46]. Using the one-hot encoding 𝟙_{a^t = k} of the action at time *t*, which assumes the value 1 if and only if *a*^{t} = *k* (else it has value 0), we arrive at synaptic plasticity rules for a general actor-critic algorithm with *e-prop* (see Section S5.1 for their explicit form). In these rules, the term *π*_{k}^{t} − 𝟙_{a^t = k} is defined to have value zero when no action is taken at time *t*; the learning signal of neuron *j* involves the weight from the output neuron for the value function to neuron *j*, as well as the weights from the outputs for the policy.

A combination of reward prediction error and neuron-specific learning signal was previously used in a plasticity rule for feedforward networks inspired by neuroscience [47, 48]. Here it arises from the approximation of *BPTT* by *e-prop* in RSNNs solving RL problems. Note that the filtering with the discount factor *γ* requires an additional eligibility trace per synapse. This arises from the temporal difference learning in RL [46]. It depends on the learning signal and does not have the same function as the eligibility trace *e*_{ji}^{t}.
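The temporal difference error that drives these rules can be sketched in isolation; names are assumptions, and the bootstrap value for the final step is passed in explicitly:

```python
import numpy as np

# Sketch of the temporal difference errors delta^t = r^t + gamma*V^{t+1} - V^t
# used in reward-based e-prop; names are illustrative assumptions.
GAMMA = 0.99

def td_errors(rewards, values):
    """rewards: (T,); values: (T+1,), with values[T] the bootstrap estimate."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + GAMMA * values[1:] - values[:-1]

# With a value function that exactly predicts the discounted return,
# all TD errors vanish.
deltas = td_errors([0.0, 0.0, 1.0], [GAMMA**2, GAMMA, 1.0, 0.0])
```

A nonzero *δ*^{t} thus signals a mismatch between predicted and experienced reward, and it is this scalar that modulates the filtered eligibility traces online, without access to future rewards.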

## Authors contributions

GB, FS, AS and WM conceived the work, GB, FS, AS, EH and DS carried out experiments and all authors contributed to the writing of the paper.

## Acknowledgments

This research/project was supported by the Human Brain Project (Grant Agreement number 785907) and the SYNCH project (Grant Agreement number 824162) of the European Union. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research. Computations were carried out on the Human Brain Project PCP Pilot Systems at the Juelich Supercomputing Centre, which received co-funding from the European Union (Grant Agreement number 604102), and on the Vienna Scientific Cluster (VSC).

We thank Thomas Bohnstingl, Wulfram Gerstner, Christopher Harvey, Martin Vinck, Jason MacLean, Adam Santoro, Christopher Summerfield, and Yuqing Zhu for helpful comments on an earlier version of the manuscript.