## Abstract

Synaptic plasticity is believed to be a key physiological mechanism for learning. It is well-established that it depends on pre- and postsynaptic activity. However, models that rely solely on pre- and postsynaptic activity for synaptic changes have, to date, not been able to account for learning complex tasks that demand hierarchical networks. Here, we show that if synaptic plasticity is regulated by high-frequency bursts of spikes, then neurons higher in the hierarchy can coordinate the plasticity of lower-level connections. Using simulations and mathematical analyses, we demonstrate that, when paired with short-term synaptic dynamics, regenerative activity in the apical dendrites, and synaptic plasticity in feedback pathways, a burst-dependent learning rule can solve challenging tasks that require deep network architectures. Our results demonstrate that well-known properties of dendrites, synapses, and synaptic plasticity are sufficient to enable sophisticated learning in hierarchical circuits.

## Introduction

The current canonical model of synaptic plasticity in the cortex is based on the co-occurrence of activity on the two sides of the synapse, pre- and postsynaptic [44, 79]. The occurrence of either long-term depression (LTD) or long-term potentiation (LTP) is controlled by specific features of pre- and postsynaptic activity [5, 21, 28, 34, 49, 58, 61, 88, 89] and a more global state of neuromodulation [3, 31, 32, 36, 48, 57, 65, 79, 86, 90]. However, local learning rules by themselves do not provide a guarantee that behavioral metrics will improve. With neuromodulation driven by an external reward/punishment mechanism, this guarantee is achievable. But such learning is very slow in tasks that require large or deep networks, because a global signal provides very limited information to neurons deep in the hierarchy [9, 59, 94]. Thus, an outstanding question is (Fig. 1): how can neurons higher in the hierarchy signal to other neurons, sometimes multiple synapses lower, whether to engage in LTP or LTD in order to improve behavior [79]? This question is sometimes referred to as the “credit assignment problem”: essentially, how can we assign credit for any errors or successes to neurons that are multiple synapses away from the output [77]?

As established in non-hierarchical systems, such as the electrosensory lateral line lobe of the electric fish [8, 14, 66] or the cerebellum [15], feedback connections on dendrites are well-poised to orchestrate learning [78]. But for credit assignment in hierarchical networks, these connections should obey four constraints: 1) Feedback must *steer* the sign and magnitude of plasticity. 2) Feedback signals from higher-order areas should be *multiplexed* with feedforward signals from lower-order areas so that credit information can percolate down the hierarchy with minimal disruption to sensory information. 3) There should be some degree of *alignment* between feedback connections and feedforward connections. 4) Integration of credit-carrying feedback signals should be close to *linear* and avoid saturation (i.e., feedback signals should be linear with respect to any credit information). Experimental and theoretical work has addressed steering [16, 34], multiplexing [1, 45, 51, 69], alignment [2, 17, 18, 59] or linearity [67] in isolation, often by learning in an offline fashion [2, 4, 40, 47, 59, 83, 84, 101], without learning rules based on spikes [4, 53, 80, 82–84], or without learning to solve tasks that necessitate hierarchical processing. Thus, it remains unclear whether a single set of cellular and subcellular mechanisms addresses all four requirements for orchestrating learning in cortical hierarchies efficiently.

Here, we address the credit assignment problem with a spike-based learning rule that models how high-frequency bursts determine the sign of synaptic plasticity [33, 49, 58, 70]. Extending the work of Körding and König (2001) [53], we show that combining properties of apical dendrites [55] with this burst-dependent learning rule allows feedback to steer plasticity. We further show that feedback information can be multiplexed across multiple levels of a hierarchy when feedforward and feedback connections have distinct short-term plasticity (STP) [38, 87]. Using spiking simulations, we demonstrate that these mechanisms can be used to coordinate learning across a hierarchical circuit. A coarse-grained equivalent of these dynamical properties will, on average, lead to learning that approximates the loss-function gradients used in machine learning for training deep neural networks. We further show that this biological approximation to loss-function gradients is improved by a burst-dependent learning rule performing the alignment of feedback weights with feedforward weights, as well as recurrent connections that linearize credit signals. Finally, we show that networks trained with these mechanisms can learn to classify complex image patterns with high accuracy. Altogether, our work highlights that well-known properties of dendritic excitability, synaptic transmission, and synaptic plasticity are sufficient to solve the credit assignment problem in hierarchical networks.

## Results

### A burst-dependent rule enables top-down steering of plasticity

Experimental work has demonstrated that the sign of plasticity can be determined by patterns of pre- and postsynaptic activity. The most common formulation of this is spike-timing-dependent plasticity (STDP), wherein the timing of pre- and postsynaptic spikes is what determines whether LTP or LTD occurs [10, 35, 61]. However, there is also evidence suggesting that in many circuits, particularly mature ones [64], the principal determinant of plasticity is the level of postsynaptic depolarization, with large depolarization leading to LTP and small depolarization leading to LTD [5, 21, 28, 89]. Importantly, one of the easiest ways to induce large magnitude depolarization in dendrites is via backpropagation of high-frequency bursts of action potentials [50]. Thus, evidence suggests that the sign of synaptic plasticity is largely dependent on postsynaptic bursting [33, 49, 58, 89].

To explore how high-frequency bursting could control learning in biological neural networks, we formulated a burst-dependent plasticity rule as an abstraction of the experimental data. We consider a burst to be any occurrence of at least two spikes with a short (e.g. under 20 ms) interspike interval [7]. Following Ref. [69], we further define an event as either an isolated single spike or a burst. Thus, for a given neuron’s output, there is an event train (similar to a spike train, except that events can be either bursts or single spikes) and a burst train, which comprises a subset of the events (see Methods). The learning rule states that the change over time of a synaptic weight between postsynaptic neuron *i* and presynaptic neuron *j*, d*w_{ij}*/d*t*, results from a combination of an eligibility trace of presynaptic activity, *Ẽ_j*, and the potentiating (or depressing) effect of the burst train *B_i* (or event train *E_i*) of the postsynaptic cell (Fig. 2a):

$$\frac{dw_{ij}}{dt} = \eta \left[ B_i(t) - \bar{P}_i(t)\, E_i(t) \right] \tilde{E}_j(t) \qquad (1)$$

The variable *P̄_i* controls the relative strength of burst-triggered potentiation and event-triggered depression. To ensure a finite growth of synaptic weights, we set *P̄_i* to a moving average of the proportion of events that are bursts in postsynaptic neuron *i*, with a slow (~ 1 – 10 s) time scale (see Methods).

The plasticity rule stipulates that when a presynaptic input is paired with a postsynaptic burst, LTP is induced; otherwise, LTD results (Fig. 2a) [33, 49, 58, 64, 70]. Using this rule, we simulated a series of synaptic plasticity experiments from the experimental and computational literature. First, we examined a frequency-dependent STDP protocol [89]. We found that when the spike pairing frequency is low, LTD is produced, and when the pairing frequency is high, LTP is produced (Fig. 2b). This matches previous reports on frequency-dependent STDP and shows that a burst-dependent synaptic plasticity rule can explain this data. Then, we explored the behavior of our rule when the pre- and postsynaptic neurons fire independently according to Poisson statistics [74] (Fig. 2c). Experimental results have established that in such a situation the postsynaptic firing rate should determine the sign of plasticity [89]. As in similar learning rules [74], we found that a burst-dependent plasticity rule produces exactly this behavior (Fig. 2c). Notably, the Bienenstock-Cooper-Munro (BCM) model [11] posits that the switching point between LTD and LTP depends on a nonlinear moving average of past activity. In the present case, the adaptive threshold is *P̄_i*, a nonlinear function of the past postsynaptic activity. These results demonstrate that a burst-dependent plasticity rule is capable of uniting a series of known experimental and theoretical results.

The burst-dependent rule suggests that feedback-mediated steering of plasticity could be achieved if there were a mechanism for top-down control of the likelihood of a postsynaptic burst. To illustrate this, in Fig. 2d we simulated another protocol wherein events were generated with Poisson statistics, and each event could become a burst with probability *p* (x axis in Fig. 2d). Manipulating this burst probability against the initial burst probability estimate controlled the occurrence of LTP and LTD, while changing the pre and postsynaptic event rates simply modified the rate of change of the weight (but not the transition point between LTP and LTD). This shows that one way for neurons to control the sign of plasticity to ensure effective learning may be to regulate the probability of high-frequency bursts.
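
The logic of this protocol can be sketched in a minimal discrete-time simulation of the burst-dependent rule (a sketch with illustrative parameter values and simplified dynamics, not the spiking model used in the figures): pre- and postsynaptic events are drawn as independent Poisson processes, each postsynaptic event becomes a burst with probability *p*, and the weight is updated as in Eq. 1.

```python
import numpy as np

def run_protocol(p_burst, p_bar_init=0.2, event_rate=10.0,
                 eta=1e-2, dt=0.001, T=50.0, seed=1):
    """Discrete-time sketch of Eq. 1: dw/dt = eta*(B - P_bar*E)*E_tilde.
    Pre- and postsynaptic events are independent Poisson processes; each
    postsynaptic event becomes a burst with probability p_burst. All
    parameter values are illustrative assumptions, not fits to data."""
    rng = np.random.default_rng(seed)
    w = 0.0
    p_bar = p_bar_init            # slow estimate of the burst proportion
    e_trace, tau_e = 0.0, 0.05    # presynaptic eligibility trace (s)
    alpha = 0.002                 # per-event update rate of p_bar (slow)
    for _ in range(int(T / dt)):
        pre = rng.random() < event_rate * dt    # presynaptic event
        post = rng.random() < event_rate * dt   # postsynaptic event
        burst = post and (rng.random() < p_burst)
        e_trace *= 1.0 - dt / tau_e
        if pre:
            e_trace += 1.0
        # bursts potentiate, plain events depress, gated by the trace
        w += eta * (float(burst) - p_bar * float(post)) * e_trace
        if post:                  # slowly track the proportion of bursts
            p_bar += alpha * (float(burst) - p_bar)
    return w

ltp = run_protocol(p_burst=0.5)    # burst probability above the estimate
ltd = run_protocol(p_burst=0.05)   # burst probability below the estimate
```

Because the moving average only tracks the burst proportion slowly, the weight change is dominated by where *p* sits relative to the initial estimate; raising or lowering the event rates rescales the magnitude of the change without moving the LTP/LTD transition point.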

Interestingly, evidence suggests that in cortical pyramidal neurons of sensory cortices the probability of generating high-frequency bursts is controlled by inputs to the distal apical dendrites and their activation of voltage-gated calcium channels (VGCCs) [26, 30, 55, 56, 98]. Anatomical and functional data has shown that these inputs often come from higher-order cortical or thalamic regions [19, 29].

We wondered whether combining a burst-dependent plasticity rule with regenerative activity in apical dendrites could permit top-down signals to act as a “teaching signal”, instructing the sign of plasticity in a neuron. To explore this, we ran simulations of pyramidal neuron models with simplified VGCC kinetics in the apical dendrites (see Methods). We found that by manipulating the distal inputs to the apical dendrites we could control the number of events and bursts in the neurons independently (Figs. 2e,g). Importantly, the inputs to the apical dendrites in the postsynaptic neurons were what regulated the number of bursts, and this also controlled changes in the synaptic weights, through the burst-dependent learning rule. When the relative proportion of bursts increased, the synaptic weights potentiated on average, and when the relative proportion of bursts decreased, the synaptic weights depressed (Fig. 2f). Modifying the proportion of bursts in the presynaptic neurons had little effect on the weights (see the rightmost gray shaded area in Fig. 2e-g). The sign of plasticity was independent of the number of events, though the magnitude was not. Therefore, while the number of events contributed to the determination of the magnitude of changes, the top-down inputs to the apical dendrites controlled the sign of plasticity. In this way, the top-down inputs acted as a “teaching signal” that determined whether LTP or LTD would occur. These results show that a burst-dependent learning rule paired with the control of bursting provided by apical dendrites enables a form of top-down steering of synaptic plasticity in an online, local, and spike-based manner.

### Dendrite-dependent bursting combined with short-term plasticity supports multiplexing of feedforward and feedback signals

The question that naturally arises from our finding that top-down inputs can steer synaptic plasticity via a burst-dependent rule is whether feedback can steer plasticity without affecting the communication of bottom-up signals. Using numerical simulations, we have previously demonstrated that in an ensemble of pyramidal neurons the inputs to the perisomatic and distal apical dendritic regions can be distinctly encoded using the event rate computed across the ensemble of cells and the percentage of events in the ensemble that are bursts (the “burst probability”), respectively [69]. When communicated by synapses with either short-term facilitation (STF) or short-term depression (STD), this form of “ensemble multiplexing” may allow top-down and bottom-up signals to be simultaneously transmitted through a hierarchy of pyramidal neurons.

To explore this possibility, we conducted simulations of two reciprocally connected ensembles of pyramidal neurons along with interneurons providing feedforward inhibition. One ensemble received currents in the perisomatic region and projected to the perisomatic region of the other ensemble (Fig. 3a, green ensemble). The other ensemble (Fig. 3a, light blue) received currents in the distal apical compartment and projected to the distal apical compartment of the first ensemble. As such, we considered the first ensemble to be “lower” (receiving and communicating bottom-up signals), and the other to be “higher” (receiving and communicating top-down signals) in the hierarchy. Furthermore, we made one key assumption in these simulations. We assumed that the synapses in the perisomatic regions were short-term depressing, whereas those in distal apical dendrites were short-term facilitating. Additionally, we assumed that the inhibitory interneurons targeting the perisomatic region possessed STD synapses, and the inhibitory interneurons targeting the distal apical dendrites possessed STF synapses. These properties are congruent with what is known about parvalbumin-positive and somatostatin-positive interneurons [63, 76, 93], which target the perisomatic and apical dendritic regions, respectively.

In these simulations, we observed that currents injected into the lower ensemble’s perisomatic compartments were reflected in the event rate of those neurons (Fig. 3c3), whereas the currents injected into the distal apical dendrites of the higher ensemble were reflected in the burst probability of those neurons (Fig. 3b2). Importantly, though, we also observed that these signals were simultaneously propagated up and down. Specifically, the input to the lower ensemble’s perisomatic compartments was also encoded by the higher ensemble’s event rate (Fig. 3b3). As well, the burst rate of the higher ensemble was encoded by the lower ensemble’s burst probability (Fig. 3c2). In this way, the lower ensemble had access to information about the signal transmitted to the higher ensemble’s distal apical dendrites, as well as the higher ensemble’s event rate.

These simulations demonstrate that if bottom-up connections to perisomatic regions and perisomatic inhibition rely on STD synapses, while top-down connections to apical dendrites and distal dendritic inhibition utilize STF synapses, then ensembles of pyramidal neurons are capable of simultaneously processing both a top-down signal and a bottom-up signal. We conclude that a top-down signal to apical dendrites can 1) control the sign of plasticity locally (steering; Fig. 2) and 2) be communicated to lower ensembles without affecting the flow of bottom-up information (multiplexing; Fig. 3).
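
The effect of the two forms of STP on burst-event multiplexing can be illustrated with a standard Tsodyks-Markram synapse model (a sketch; the parameter values below are assumptions chosen to caricature depressing and facilitating synapses, not fits to data):

```python
import numpy as np

def tm_release(spike_times, U, tau_rec, tau_fac):
    """Total transmitter release of a Tsodyks-Markram synapse driven by
    spike_times (in seconds). Utilization u facilitates by U*(1-u) at each
    spike and decays back to U with tau_fac; resources x are depleted by
    u*x at each spike and recover with tau_rec."""
    u, x, last, total = U, 1.0, None, 0.0
    for t in spike_times:
        if last is not None:
            dt = t - last
            x = 1.0 - (1.0 - x) * np.exp(-dt / tau_rec)
            u = U + (u - U) * np.exp(-dt / tau_fac)
        u = u + U * (1.0 - u)      # facilitation acts on the current spike
        total += u * x             # release
        x -= u * x                 # depletion
        last = t
    return total

def train(event_rate=10.0, T=10.0, spikes_per_event=1, isi=0.005):
    """Regular event train; each event is a single spike or a short burst."""
    times = []
    for k in range(int(T * event_rate)):
        t0 = k / event_rate
        times.extend(t0 + i * isi for i in range(spikes_per_event))
    return times

singles, bursts = train(spikes_per_event=1), train(spikes_per_event=3)

# Illustrative STP parameters (assumed): depressing synapses have high
# release probability and slow recovery; facilitating synapses have low
# initial release and slowly decaying facilitation.
std = dict(U=0.5, tau_rec=0.3, tau_fac=1e-3)
stf = dict(U=0.05, tau_rec=0.05, tau_fac=0.3)

std_ratio = tm_release(bursts, **std) / tm_release(singles, **std)
stf_ratio = tm_release(bursts, **stf) / tm_release(singles, **stf)
```

With these assumed parameters, replacing single spikes by three-spike bursts barely changes the output of the depressing synapse (it tracks the event rate) but more than doubles the output of the facilitating synapse (it tracks the burst rate), which is the basis of the ensemble multiplexing described above.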

### Combining a burst-dependent plasticity rule with short-term plasticity and apical dendrites can solve the credit assignment problem

To test whether STP, dendrite-dependent bursting and a burst-dependent learning rule can act simultaneously in a hierarchy, we built a simulation of ensembles of pyramidal neurons arranged in three layers, with two ensembles of cells at the input, one ensemble of cells at the output, and two ensembles of cells in the middle (the “hidden” layer; Fig. 4a). The distal dendrites of the top ensemble received “teaching” signals indicating desired or undesired outputs. No other teaching signal was provided to the network. As such, the hidden layer ensembles were informed of the suitability of the output only via the signals they received from the output ensemble’s bursts. Currents injected into the somatic compartments of the input layer populations controlled their activity levels in accordance with the learning task to be discussed below. Compared to Figs. 2–3, for this simulation we made a few modifications to synaptic transmission and pyramidal neuron dynamics to streamline the burst-event multiplexing and decoding (see Methods). The most important addition, however, is that we modified the learning rule in Eq. 1 by multiplying the right-hand side by an additional global term, *M*(*t*), that gates plasticity. This term abstracts a number of possible sources of control of plasticity, like local inhibition [67], burst sizes [99] or transient neuromodulation [73, 86, 100]. Importantly, *M*(*t*) in our model gates plasticity without changing its sign, contrary to some models on the role of neuromodulation in plasticity [36]. Its role was to make sure that plasticity elicited by the abrupt onset and offset of each training example did not overcome the plasticity elicited by the teaching signal, i.e. it was used to ensure a supervised training regime. We accomplished this by setting *M* = 0 when no teaching signal was present at the output layer and *M* = 1 under supervision. In this way, we ensured that the teaching signal was the primary driver of plasticity.

We trained our 3-layer network on the exclusive or (XOR) task, wherein the network must respond with a high output if only one input pool is active, and low output if neither or both input pools are active (Fig. 4). We chose XOR as a canonical example of a task that requires a nonlinear hierarchy with appropriate credit assignment for successful learning. Before learning, the network was initialized such that the output pool treated any input combination as roughly equivalent (Fig. 4c, dashed line). To compute XOR, the output pool would have to learn to reduce its response to simultaneously active inputs and increase its response to a single active input.

We set up the network configuration (Fig. 4a) to address a twofold question:

(1) Would an error signal applied to the top-layer neurons’ dendrites be propagated downward adequately? (2) Would the burst-dependent learning rule combine top-down signals with bottom-up information to make the hidden-layer neurons better feature detectors for solving XOR?

Importantly, if the answers to these two questions were yes, we would expect the two hidden ensembles to learn different features if they received different feedback from the output. To test this, we provided hidden pool 1 with positive feedback from the output, and hidden pool 2 with negative feedback (Fig. 4a, light blue symbols). With this configuration, adequate error propagation to the two hidden pools would make their responses diverge with learning, and the output pool would learn to take advantage of this change. Indeed, the XOR task was solved after training (Fig. 4c, solid line).

To understand how this solution was aided by appropriate credit assignment, we examined the information about the top-down teaching signals in each layer. According to the learning rule, plasticity can be steered by controlling the instantaneous propensity to burst with respect to a moving average of the burst probability (see the term *B_i − P̄_i E_i* in Eq. 1 and Fig. 2e-f). In the output pool, the error signal applied to the apical dendrites induced a temporary decrease in the burst probability when the input pools were both active or both inactive, and a temporary increase when only one input pool was active (Fig. 4d). These changes in the output burst probability modified the output burst rate, which was propagated to the hidden pools. As mentioned above, the hidden pools received top-down signals with different signs (Fig. 4e1-2, orange lines), and indeed their respective burst probabilities were altered in opposite directions (Fig. 4e1-2, red lines). Due to these distinct top-down signals and the adaptive threshold *P̄_i*, the hidden pools’ responses diverged during learning (Fig. 4f1-2). For instance, hidden pool 1 reduced its responses to both inputs being active, while hidden pool 2 increased its response. These changes were due to the top-down control of the plasticity of synapses between the input and hidden pools. We verified that solving this task depends on the plasticity of connections from input to hidden units, but only weakly on the size of the ensembles (Fig. S3). These results demonstrate that the propagation of errors using bursts and the burst-dependent learning rule can combine to achieve hierarchical credit assignment in ensembles of pyramidal neurons.

### Burst-dependent plasticity promotes linearity and alignment of feedback

Having demonstrated that a burst-dependent learning rule in pyramidal neurons enables online, local, spike-based solutions to the credit assignment problem, we were interested in understanding the potential relationship between this algorithm and the gradient-descent-based algorithms used for credit assignment in machine learning. To do this, we wanted to derive the average behavior of the burst-dependent learning rule at the coarse-grained ensemble level, and determine whether it provided an estimate of a loss-function gradient. More precisely, in the spirit of mean-field theory and linear-nonlinear rate models [27, 75, 85], we developed a model where each unit represents an ensemble of pyramidal neurons, with event rates, burst probabilities, and burst rates as described above (Fig. S8). Specifically, for an ensemble of pyramidal neurons, we defined *e*(*t*) and *b*(*t*) as ensemble averages of the event and burst trains, respectively. Correspondingly, *p*(*t*) = *b*(*t*)/*e*(*t*) refers to the ensemble-level burst probability. We then defined the connection weight between an ensemble of presynaptic neurons and an ensemble of postsynaptic neurons, W_{post,pre}, as the effective impact of the presynaptic ensemble on the postsynaptic ensemble, taking into consideration potential polysynaptic interactions. Note that this means that the ensemble-level weight, W_{post,pre}, can be either positive or negative, as it reflects the cumulative impact of both excitatory and inhibitory synapses (see Supplemental Materials).

Our goal was then to derive the ensemble-level weight updates from the burst-dependent plasticity rule (Eq. 1). We assumed that any given pair of neurons were only weakly correlated on average, a reasonable assumption if the synaptic weights in the circuit are small [25]. Moreover, decorrelation between neurons is observed when animals are attending to a task [25], which suggests that this is a reasonable assumption for active processing states. We further assumed that the neuron-specific moving average burst probability is independent of the instantaneous occurrence of events. Using these assumptions, it can be shown (see Supplemental Materials) that the effective weight averaged across both pre- and postsynaptic ensembles obeys:
$$\frac{dW_{\text{post,pre}}}{dt} = \eta \left[ b_{\text{post}}(t) - \bar{P}_{\text{post}}(t)\, e_{\text{post}}(t) \right] \bar{e}_{\text{pre}}(t)$$

where the learning rate *η* is different from that appearing in Eq. 1, *ē_pre* is a filtered presynaptic event rate, and *P̄_post* is a ratio of moving averages for the postsynaptic burst rate and event rate. This learning rule can be shown to implement an approximation of gradient descent for hierarchical circuits, like the backpropagation-of-error algorithm [81]. Specifically, if we assume that the burst probabilities remain in a linear regime (linearity), that the feedback synapses are symmetric to the feedforward synapses (alignment), and that error signals are received in the dendrites of the top-level ensembles, then the deviation of the burst probability from its moving average, *p − P̄*, is equivalent to the error signal sent backwards in backpropagation (see Supplemental Materials). For the sake of computational efficiency, when simulating this ensemble-level learning, we utilized simplifications to the temporal dynamics (i.e. we implemented a discrete-time version of the rule), though the fundamental computations being implemented were identical to the continuous-time equation above (see Methods and Supplemental Materials).
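
The claimed correspondence with backpropagation can be checked in a small discrete-time example (a sketch; the sigmoid rate function, the mapping from error to burst-rate deviation, and all numerical values are illustrative assumptions). If the teaching input sets the output burst-rate deviation *b − P̄e* proportional to the error, and aligned feedback relays these deviations to the hidden ensembles, then the ensemble-level updates equal gradient descent on a squared error, verified here by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda a: 1.0 / (1.0 + np.exp(-a))   # ensemble event-rate nonlinearity

x = rng.uniform(0.1, 0.9, size=3)        # input event rates
W1 = rng.normal(0.0, 0.5, size=(4, 3))   # feedforward: input -> hidden
W2 = rng.normal(0.0, 0.5, size=(2, 4))   # feedforward: hidden -> output
target = np.array([0.9, 0.1])

h = f(W1 @ x)                            # hidden event rates
y = f(W2 @ h)                            # output event rates

# Teaching input shifts output burst rates away from the baseline P_bar*e
# by an error-proportional amount (assumed mapping, for illustration):
delta2 = (target - y) * y * (1.0 - y)    # b - P_bar*e at the output
# Aligned feedback relays burst-rate deviations to hidden apical dendrites:
delta1 = (W2.T @ delta2) * h * (1.0 - h)

eta = 0.1                                # ensemble-level learning rate
dW2 = eta * np.outer(delta2, h)          # eta * (b - P_bar*e) * pre-rate
dW1 = eta * np.outer(delta1, x)

# Compare with the gradient of L = 0.5*||target - y||^2, estimated by
# central finite differences.
def loss(W1_, W2_):
    return 0.5 * np.sum((target - f(W2_ @ f(W1_ @ x))) ** 2)

def num_grad(fn, W, eps=1e-6):
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (fn(Wp) - fn(Wm)) / (2 * eps)
    return g

assert np.allclose(dW2, -eta * num_grad(lambda W: loss(W1, W), W2), atol=1e-8)
assert np.allclose(dW1, -eta * num_grad(lambda W: loss(W, W2), W1), atol=1e-8)
```

The check passes because, under the assumed error-to-burst mapping and aligned feedback, the burst-rate deviations play exactly the role of the backpropagated deltas.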

The assumptions of feedback linearity and alignment can be supported by the presence of additional learning mechanisms. First, we examined learning mechanisms to keep the burst probabilities in a linear regime. Inspired by a previous study that used distal apical inhibition to control error signals [82], and consistent with our previous results, we used recurrent excitatory and inhibitory inputs to control the apical compartments’ potential (Fig. 5a). These dendrite-targeting inputs propagated bursts from neural ensembles at the same processing stage in the hierarchy. We found that a simple homeostatic learning rule (see Methods) could learn to keep burst probabilities in a linear regime, thus improving gradient estimates (Fig. 5b). Second, we explored potential mechanisms for learning weight symmetry. Symmetry between feedforward and feedback weights is an implicit assumption of many learning algorithms that approximate loss-function gradients. However, it has been shown that it is possible to learn weight symmetry [17]. In one classic algorithm [52], weight symmetry is obtained if feedforward and feedback weights are updated with the same error signals, plus some weight decay [2]. In our model, this form of feedback weight update was implemented locally because the error signal used to update the feedforward weights in discrete time is the deviation of the burst rates from the moving average baseline, and this, we propose, also determines the updates to the feedback weights (see Methods). When we implemented this form of learning on the ensemble-level feedback weights we observed rapid weight alignment (Fig. 5c) and convergence to a loss-function gradient (Fig. 5d).
Altogether, these results demonstrate that the burst-dependent learning rule, averaged across ensembles of pyramidal neurons, and paired with biologically plausible learning rules for recurrent inputs and feedback connections, can provide a good estimate of loss-function gradients in hierarchical networks.
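
The alignment mechanism can be caricatured in a few lines (a sketch: a random matrix stands in for the shared error-driven term, and the rates are assumed). Because the feedforward and feedback weights receive the same update plus weight decay, as in the Kolen-Pollack algorithm [52], their mismatch contracts geometrically regardless of the error sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 10))    # feedforward ensemble weights
Y = rng.normal(size=(10, 20))    # feedback ensemble weights

eta, lam = 0.05, 0.5             # learning rate and weight decay (assumed)
for _ in range(200):
    G = rng.normal(size=W.shape)         # shared error-driven update term
    W -= eta * (G + lam * W)             # feedforward update
    Y -= eta * (G.T + lam * Y)           # feedback receives the same term

# The mismatch W - Y.T contracts by (1 - eta*lam) per step, so the
# feedback weights align with the feedforward weights.
mismatch = np.linalg.norm(W - Y.T) / np.linalg.norm(W)
cosine = np.sum(W * Y.T) / (np.linalg.norm(W) * np.linalg.norm(Y))
```

Here the difference between *W* and the transposed feedback weights shrinks by a factor (1 − *ηλ*) at every step, so after a few hundred updates the feedback weights are nearly symmetric to the feedforward weights even though they started random.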

### Ensemble-level burst-dependent plasticity in deep networks can support good performance on standard machine learning benchmarks

We wanted to determine whether the ensemble-level learning rule could perform well on difficult tasks from machine learning that previous biologically plausible learning algorithms have been unable to solve. Specifically, we built a deep neural network composed of pyramidal ensemble units that formed a series of convolutional layers followed by fully-connected layers (Fig. 6a). We then trained these networks on two challenging image categorization datasets that previous biologically plausible algorithms have struggled with: CIFAR-10 and ImageNet [6].

The training throughout the network used our burst-dependent plasticity rule and recurrent inputs for linearization. For the CIFAR-10 dataset, we observed a classification test error rate of 20.1% after 400 epochs (where an epoch is a pass through all training images), similar to the error rate achieved with full gradient descent in a standard artificial neural network (Fig. 6b). Training the feedback weights was critical for enabling this level of performance on CIFAR-10, as fixed feedback weights led to much worse performance, in line with previous results [6]. Furthermore, rich unit-specific feedback signals were critical. A network trained using a global reward signal, while theoretically guaranteed to follow gradient descent on average [94, 95], was unable to achieve good performance on CIFAR-10 in a reasonable amount of time (Fig. 6b, node perturbation). For the ImageNet dataset we observed a classification error rate of 56.1% on the top 5 predicted image classes with our algorithm, which is much better than the error rate achieved when keeping the feedback weights fixed, and much closer to that of full gradient descent (Fig. 6c). We also trained a network on the MNIST dataset, and achieved a similar performance of 1.1% error on the test set with all three algorithms (Fig. S7). Therefore, these results demonstrate that the ensemble-level burst-dependent learning rule, coupled with additional mechanisms to promote multiplexing, linearity and alignment, can solve difficult tasks.

## Discussion

In this paper, we explored the possibility that high-frequency bursts in pyramidal neurons can provide an instructive signal for synaptic plasticity that can coordinate learning across hierarchical circuits. We have shown that a combination of a burst-dependent plasticity rule with STP and regenerative dendritic activity turns feedback connections into a teacher (Fig. 2), which by multiplexing (Fig. 3) can coordinate plasticity across multiple synaptic jumps (Fig. 4). We then showed that, with some additional burst-dependent learning at recurrent and feedback synapses, these mechanisms provide an approximation of a loss-function gradient for supervised learning (Fig. 5) and perform well on challenging image classification tasks (Fig. 6). Together, these results show that a local, spike-based learning rule that utilizes high-frequency bursts as an instructive signal can enable sophisticated credit assignment in hierarchical circuits.

It is important to note that there are a number of limitations to our model. First, our ensemble-level models utilized many “ensemble units” that incorporated the activity of many pyramidal neurons, which could potentially require disproportionately sized networks. However, the functional impact of using many neurons in an ensemble is to provide a means for averaging the burst probabilities. Theoretically, this averaging could be done over time, rather than over neurons. If so, there is no reason that the algorithm could not work with single-neuron ensembles, though it would require a much longer time to achieve good estimates of the gradients. To some extent, this is simply the typical issue faced by any model of rate-based coding: if rates are used to communicate information then spatial or temporal averaging is required for high-fidelity communication. Furthermore, we suspect that allowing population coding could reduce the number of neurons required for a reliable representation [13].

Next, by focusing on learning, we ignored other ongoing cognitive processes. For instance, the close link between attention and credit assignment implies that the same mechanisms may serve both attention and learning purposes [60, 80]. Although some experimental data points to a role of bursting in attention [91, 96], further work is required to establish if burst coding can give rise to attention-like capabilities in neural networks.

The presence of the gating term, *M*(*t*), may be seen as an additional limitation in the model, since it is left in an abstract form and not directly motivated by biology. This term was introduced in order to ensure that learning was driven by the teaching signal and not by changes in the stimuli. Of course, if the goal is not supervised learning, but unsupervised learning, then this term may be unnecessary. Indeed, one may view this as a prediction, i.e. that learning to match a target should involve an additional gating mechanism that is not required for unsupervised learning. We believe that this may be a fruitful avenue for future research.

Another set of limitations derives from how we moved from detailed cellular-level simulations to abstract neural network models that were capable of solving complex tasks. For example, in moving to the abstract models, we gradually made a number of simplifying assumptions, including clear separation between bursts and single spikes, simplified STP, simplified bursting mechanisms, and ensemble-level units that represented spiking activity across multiple neurons with a single value. We highlight these limitations because it is important to keep them in mind when considering the potential for the cellular-level plasticity rule to implement sophisticated credit assignment. Ideally, we would have the computational resources to fully simulate many thousands of ensembles of pyramidal neurons and interneurons with complex synaptic dynamics and bursting in order to see if the cellular-level burst-dependent rule could also solve complicated tasks. However, these questions will have to be resolved by large-scale projects that can simulate millions of biophysically realistic neurons with complicated internal dynamics [42, 62]. Finally, our model did not include any sophisticated neuromodulatory systems. Yet, we know neuromodulation can regulate synaptic plasticity [86]. Future work could investigate how burst-dependent plasticity and neuromodulation could interact to improve learning further.

Our model makes a number of falsifiable predictions that could be examined experimentally. First, the model predicts that there should be a polarization of STP along the sensory hierarchy, with bottom-up synaptic projections being largely short-term depressing (STD) and top-down synaptic projections being largely short-term facilitating (STF). There are reports of such differences in thalamocortical projections [38, 87], which suggests that an important missing component of our model is the inclusion of thalamic circuitry. There are also reports of polarization of STP along the basal dendrosomatic axis [39], and our model would predict that this polarization should extend to apical dendrites. Second, because our model proposes that burst firing carries information about errors, there should be a relationship between burst firing and progress in learning. Specifically, our model predicts that the *variance* in burst probabilities across a population should be correlated with the errors made during learning (Fig. S6). Experimental evidence in other systems supports this view [26, 45]. Finally, our model predicts that inhibition in the distal apical dendrites serves, in part, to homeostatically regulate burst probabilities to promote learning. Thus, a fairly simple prediction from the model is that manipulations of distal-dendrite-targeting interneurons, such as somatostatin-positive interneurons, should lead to unusual levels of bursting in cortical circuits and disrupt learning. Some recent experimental evidence supports this prediction [20, 26].

Linking low-level and high-level computational models of learning is one of the major challenges in computational neuroscience. Our focus on supervised learning of static inputs was motivated by recent progress in this area. However, machine learning researchers have also been making rapid progress in unsupervised learning on temporal sequences in recent years [24, 41]. We suspect that many of the same mechanisms we explored here, e.g. burst-dependent plasticity, could also be adapted for unsupervised learning of temporal sequences in hierarchical circuits. Ultimately, by showing that a top-down orchestration of learning is a natural result of a small set of experimentally observed phenomena, our work opens the door to future approaches that utilize the unique physiology of cortical microcircuits to implement powerful learning algorithms using time-varying signals.

## Methods

### Spiking model

Spiking simulations were performed using the Auryn simulator [102], except for the pairing protocols of Fig. 2b-d, which used Python. The code is available at https://github.com/apayeur/spikingburstprop.

### Event and burst detection

An *event* was said to occur either at the time of an isolated spike or at the time of the first spike in a burst. A *burst* was defined as any occurrence of at least two spikes with an interspike interval (ISI) less than the threshold *b*_{th} = 16 ms [7, 69]. Any additional spike with ISI < *b*_{th} belonged to the same burst. A neuron *i* kept track of its time-averaged burst probability by using exponential moving averages of its event train *E*_{i} and burst train *B*_{i}:

*τ*_{avg} d*Ē*_{i}/d*t* = −*Ē*_{i} + *E*_{i}(*t*), *τ*_{avg} d*B̄*_{i}/d*t* = −*B̄*_{i} + *B*_{i}(*t*),

where *τ*_{avg} is a slow time constant (~ 1-10 s). Also, *E*_{i}(*t*) = Σ_{event} *δ*(*t* − *t*_{i,event}) and *B*_{i}(*t*) = Σ_{burst} *δ*(*t* − *t*_{i,burst}), where *t*_{i,event} and *t*_{i,burst} indicate the timing of an event and of the second spike in a burst, respectively.
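As an illustration (function and variable names are ours, not taken from the released code), the event/burst classification and the exponential moving averages could be sketched as:

```python
import numpy as np

def events_and_bursts(spike_times, b_th=0.016):
    """Classify a spike train (times in seconds) into events and bursts.

    An event is an isolated spike or the first spike of a burst; a burst
    is registered at the time of its second spike, as in the text.
    """
    events, bursts = [], []
    in_burst = False
    for i, t in enumerate(spike_times):
        isi = t - spike_times[i - 1] if i > 0 else np.inf
        if isi < b_th:
            if not in_burst:
                bursts.append(t)   # second spike opens a new burst
                in_burst = True
            # further spikes with ISI < b_th extend the same burst
        else:
            events.append(t)       # isolated spike or first spike of a burst
            in_burst = False
    return events, bursts

def ema_update(x_bar, x, dt, tau_avg=2.0):
    """Euler step of the moving average tau_avg * dx_bar/dt = -x_bar + x."""
    return x_bar + dt * (x - x_bar) / tau_avg
```

For example, the train {0, 5, 10, 100, 200, 205} ms yields events at 0, 100, and 200 ms and bursts registered at 5 and 205 ms.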

### Plasticity rule

Weights were updated upon the detection of a postsynaptic event or burst according to

d*w*_{ij}/d*t* = *η* *M*(*t*) [*B*_{i}(*t*) − *P̄*_{i}(*t*)*E*_{i}(*t*)] *Ẽ*_{j}(*t*), (1)

where *P̄*_{i}(*t*) = *B̄*_{i}(*t*)/*Ē*_{i}(*t*) is the running estimate of the burst probability and *Ẽ*_{j} is a presynaptic trace of *E*_{j} with time constant *τ*_{pre}. Here, *τ*_{pre} is typically much smaller than *τ*_{avg}, with *τ*_{pre} ~ 10 ms, but it could possibly be made larger to accommodate plasticity rules with slower dynamics [12]. The prefactor *M* gates plasticity during training: in the XOR task (Fig. 4), *M* = 1 when the teaching signal is present and 0 otherwise. In Fig. 2, *M* = 1 throughout.
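In discrete time, one application of this rule at a detected postsynaptic event or burst could be sketched as follows (a minimal translation of Eq. 1; the names are ours):

```python
def burst_rule_update(w, eta, M, B, E, p_bar, e_trace):
    """Burst-dependent update: potentiate on a postsynaptic burst (B = 1),
    depress on a postsynaptic event (E = 1) in proportion to the running
    burst-probability estimate p_bar, gated by M and weighted by the
    presynaptic eligibility trace e_trace."""
    return w + eta * M * (B - p_bar * E) * e_trace
```

With *p̄* = 0.5, a burst potentiates (Δ*w* = +*η* *ẽ*) while an event alone depresses (Δ*w* = −0.5 *η* *ẽ*), so on average plasticity is balanced when the burst probability matches its running estimate.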

Homeostatic terms help to restrict the activity of neurons to an appropriate range. The homeostatic functions *H*_{i} and *G*_{i} were defined in terms of a minimum event rate *e*_{min}, a maximum event rate *e*_{max}, and the Heaviside step function Θ(·). When the neuron-specific running average of the event rate, *Ē*_{i}(*t*), lies within these limits, *H*_{i} = *G*_{i} = 0, and we recover the learning rule of Eq. 1. In most simulations, network parameters were chosen in such a way that the homeostatic plasticity had little to no effect. Typically, we used *e*_{min} = 2 Hz and *e*_{max} = 10 Hz.
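The exact functional forms of *H*_{i} and *G*_{i} are not reproduced here; a minimal Heaviside-gated sketch consistent with the stated property (both terms vanish for *Ē*_{i} ∈ [*e*_{min}, *e*_{max}]) would be:

```python
def homeostatic_terms(e_bar, e_min=2.0, e_max=10.0):
    """Hypothetical Heaviside-gated homeostatic drives (rates in Hz):
    H pushes activity up below e_min, G pushes it down above e_max,
    and both vanish inside [e_min, e_max]."""
    H = 1.0 if e_bar < e_min else 0.0
    G = -1.0 if e_bar > e_max else 0.0
    return H, G
```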

### Pairing protocols

For all pairing protocols of Fig. 2b-d, we had *τ*_{pre} = 50 ms, *τ*_{avg} = 15 s, *η* = 0.1, and we set the homeostatic terms to zero.

*Periodic protocol*. Five consecutive pairings were separated by a quiescent period of 10 s, repeated 15 times. We used pairings with Δ*t* = 0. For each pairing frequency, the estimated burst probability was initialized to a fixed value and *Ē*_{i}(*t* = 0) = 5 Hz.

*Poisson protocol*. Both the pre and postsynaptic neurons fired spikes at a Poisson rate *r* with no refractory period. For each *r*, the induction lasted 60 s and we averaged over 10 independent realizations. The initial estimate *Ē*_{i}(*t* = 0) was set accordingly.

*Burst-Poisson protocol*. Both the pre and postsynaptic neurons produced events at a Poisson rate *r*, including a refractory period. For each event, a burst was generated with probability *p*, and an intraburst ISI was sampled from a uniform distribution with maximum *t*_{max} = 10 ms for the simulations in Fig. 2d. The event rates of the pre and postsynaptic neurons were set to *r* = 5 Hz and *r* = 10 Hz, with corresponding values for the initial postsynaptic event rate estimates. For each *r*, the induction lasted 100 s and we averaged over 20 independent realizations.
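The burst-Poisson protocol can be sketched as follows; the refractory period and the lower bound of the intraburst-ISI distribution were not recoverable from the text, so `t_ref` and `t_min` below are placeholder values:

```python
import numpy as np

def burst_poisson_train(rate, p_burst, duration, t_ref=0.01,
                        t_min=0.002, t_max=0.01, seed=0):
    """Events at Poisson rate `rate` (Hz) with refractory period t_ref (s);
    each event spawns a second, burst spike with probability p_burst, with
    the intraburst ISI drawn from Unif[t_min, t_max]."""
    rng = np.random.default_rng(seed)
    spikes, t = [], 0.0
    while True:
        t += t_ref + rng.exponential(1.0 / rate)   # next event time
        if t >= duration:
            break
        spikes.append(t)
        if rng.random() < p_burst:                 # add the burst's 2nd spike
            spikes.append(t + rng.uniform(t_min, t_max))
    return spikes
```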

### Neuron models

*Pyramidal neurons*. The somatic compartment obeyed

*C*_{s} d*V*_{s}/d*t* = −*C*_{s}(*V*_{s} − *E*_{L})/*τ*_{s} + *g*_{s} *f*(*V*_{d}) + *I*_{s} − *w*_{s}, *τ*_{Ws} d*w*_{s}/d*t* = −*w*_{s} + *b* *τ*_{Ws} *S*(*t*), (9)

where *V*_{s} is the somatic membrane potential, *w*_{s} is an adaptation variable, *I*_{s} is the total current applied to the soma (includes noise and other synaptic inputs) and *S*(*t*) is the spike train of the neuron. The function *f*(*V*_{d}) in the equation for *V*_{s} takes into account the coupling with the dendritic compartment, with *f*(*V*_{d}) = 1/{1 + exp[−(*V*_{d} − *E*_{d})/*D*_{d}]} and parameters *E*_{d} = −38 mV and *D*_{d} = 6 mV. A spike occurred whenever *V*_{s} crossed a moving threshold from below. The latter jumped up by 2 mV right after a spike and relaxed towards −50 mV with a time constant of 27 ms. Other somatic parameters were: *τ*_{s} = 16 ms, *C*_{s} = 370 pF, *E*_{L} = −70 mV, *τ*_{Ws} = 100 ms, *b* = 200 pA, and *g*_{s} = 1300 pA. The reset voltage after a spike was *V*_{r} = −70 mV.

The dendritic compartment obeyed

*C*_{d} d*V*_{d}/d*t* = −*C*_{d}(*V*_{d} − *E*_{L})/*τ*_{d} + *g*_{d} *f*(*V*_{d}) + *c*_{d}(*K* ∗ *S*)(*t*) + *I*_{d} − *w*_{d}, *τ*_{Wd} d*w*_{d}/d*t* = −*w*_{d} + *a*(*V*_{d} − *E*_{L}). (10)

The function *f*(*V*_{d}) is the same as above and is responsible for the regenerative dendritic activity. The term *c*_{d}(*K* ∗ *S*)(*t*) represents the backpropagating action potential, with the kernel *K* modeled as a box filter of amplitude one and duration 2 ms, delayed by 0.5 ms with respect to the somatic spike. Other dendritic parameters were: *τ*_{d} = 7 ms, *C*_{d} = 170 pF, *E*_{L} = −70 mV, *τ*_{Wd} = 30 ms, *a* = 13 nS, and *g*_{d} = 1200 pA. This model and its parameters are described in more detail and compared with experimental data in Ref. [69].

*Dendrite-targeting inhibition*. We modeled somatostatin-positive interneurons [46, 71, 92] using the adaptive exponential integrate-and-fire (AdEx) model [68]:

*C* d*V*/d*t* = −*g*_{L}(*V* − *E*_{L}) + *g*_{L}Δ_{T} exp[(*V* − *V*_{T})/Δ_{T}] + *I* − *w*, *τ*_{w} d*w*/d*t* = *a*(*V* − *E*_{L}) − *w*,

where *I* is the total current applied to the neuron. A spike occurred whenever *V* crossed *V*_{cut} = 24 mV and was followed by a refractory period *τ*_{ref}; upon a spike, *V* was reset to *V*_{r} and *w* was incremented by *b*. Parameter values were *C* = 100 pF, *g*_{L} = 5 nS, *E*_{L} = −70 mV, *V*_{T} = −62 mV, Δ_{T} = 4 mV, *τ*_{w} = 500 ms, *a* = 0.5 nS, *b* = 10 pA, *V*_{r} = −65 mV and *τ*_{ref} = 2 ms. In Fig. 3, these model neurons (grey squares in Fig. 3a) received top-down excitation from higher-level pyramidal cells.

*Perisomatic inhibition*. We modeled parvalbumin-positive neurons [72] using the AdEx model with parameters chosen to reproduce qualitatively their typical fast-spiking phenotype. Parameter values were *C* = 100 pF, *g*_{L} = 10 nS, *E*_{L} = −70 mV, *V*_{T} = −48 mV, Δ_{T} = 2 mV, *V*_{r} = −55 mV, *τ*_{ref} = 1 ms and *a* = *b* = 0. In Fig. 3, these model neurons (grey discs in Fig. 3a) received bottom-up excitation from the lower-level pyramidal cells.
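As an illustration, a forward-Euler step of the AdEx dynamics with the dendrite-targeting interneuron parameters (SI units; this sketch is ours, not the Auryn implementation, and omits the refractory period for brevity):

```python
from math import exp

def adex_step(V, w, I, dt=1e-4, C=100e-12, gL=5e-9, EL=-0.070,
              VT=-0.062, DT=0.004, tau_w=0.5, a=0.5e-9, b=10e-12,
              Vr=-0.065, Vcut=0.024):
    """One Euler step of the AdEx model; returns (V, w, spiked)."""
    dV = (-gL * (V - EL) + gL * DT * exp((V - VT) / DT) - w + I) / C
    dw = (a * (V - EL) - w) / tau_w
    V, w = V + dt * dV, w + dt * dw
    if V >= Vcut:                  # spike: reset and adaptation jump
        return Vr, w + b, True
    return V, w, False
```

With no input current the model settles just below threshold near its resting potential; a sustained depolarizing current of a few hundred pA produces repetitive firing.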

### Connectivity

In general, connections between distinct neural ensembles were sparse (~ 5% connection probability). Pyramidal neurons within an ensemble had no recurrent connections between their somatic compartments. Within a pyramidal ensemble, burst-probability linearization was enacted by sparse STF inhibitory synapses onto the dendritic compartments (Fig. S1). These STF connections were not illustrated in Fig. 3a for clarity.

### Synapses

All synapses were conductance-based. The excitatory (resp. inhibitory) reversal potential was E = 0 mV (resp. E = −80 mV) and the exponential decay time constant was 5 ms (resp. 10 ms). There were no NMDA components to excitatory synapses. For a given connection between two ensembles, existing synapses had their strengths all initialized to the same value.

### Noise

Each neuron (for single-compartment neurons) and each compartment (for two-compartment neurons) received its own (private) noise in the form of a high-frequency excitatory Poisson input combined with an inhibitory Poisson input. The only exception was the noise applied to the neural populations in Fig. 2e-g, where we used sparse connections from a pool of excitatory and inhibitory Poisson neurons. Noise served to decorrelate neurons within a population and to imitate *in vivo* conditions.

### Short-term plasticity

STP was modeled following the extended Markram-Tsodyks model [63]. Using the notation of Ref. [22], the parameters for STF were *D* = 100 ms, *F* = 100 ms, *U* = 0.02 and *f* = 0.1. For STD, the parameters were *D* = 20 ms, *F* = 1 s, *U* = 0.9 and *f* = 0.1. These sets of parameters were chosen following [69] to help decode bursts (using STF) and events (using STD).
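A discrete-time sketch of these dynamics, in the notation of Ref. [22] (the extended model's details are abbreviated here to a fixed facilitation increment *f*; variable names are ours):

```python
def stp_step(u, x, dt, spike, U=0.02, f=0.1, D=0.1, F=0.1):
    """One Euler step of a Tsodyks-Markram synapse. `u` is the release
    probability (facilitation, recovers to U with time constant F) and
    `x` the fraction of available resources (depression, recovers to 1
    with time constant D). Defaults are the STF parameters from the text
    (D = 100 ms, F = 100 ms, U = 0.02, f = 0.1). Returns the updated
    state and the amount released on this step."""
    u += dt * (U - u) / F          # facilitation decays back to U
    x += dt * (1.0 - x) / D        # resources recover towards 1
    released = 0.0
    if spike:
        u += f * (1.0 - u)         # spike-triggered facilitation
        released = u * x
        x -= released              # resource depletion
    return u, x, released
```

With these STF parameters, the second spike of a burst releases more than the first, which is what lets facilitating feedback synapses preferentially decode bursts; with the STD parameters (*D* = 20 ms, *F* = 1 s, *U* = 0.9), the opposite holds and events dominate transmission.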

### Spiking XOR gate

An XOR gate maps binary inputs (0, 0) and (1, 1) onto 0 and inputs (1, 0) and (0, 1) onto 1. In the context of our spiking network, input 0 corresponded to a low event rate (~ 2 Hz) and input 1 to a higher event rate (~ 10 Hz). These were obtained by applying a hyperpolarizing (resp. depolarizing) current for 0 (resp. 1) to the corresponding input-layer population. Importantly, compared to the spiking simulations described above, our implementation of the spiking XOR gate used three simplifications to reduce the dimension of the parameter search space. First, events and bursts were propagated directly instead of relying on STP (see Fig. S2). Second, disynaptic inhibition was replaced by direct inhibition coming from the pyramidal cells. Third, we used a simplified pyramidal neuron model. Below, we describe this model, as well as the initialization of the network, the error generation and the learning protocol for the XOR gate.

*Simplified pyramidal neuron model*. The effect of dendritic regenerative activity on the somatic compartment (controlled by *g*_{s} in Eqs. 9–10) was replaced by a conditional burst probability: whenever a somatic event occurred, a burst was produced with probability *f*(*V*_{d}). This function is the same as that appearing in Eqs. 9–10, but with *E*_{d} = −57 mV. This model permitted a cleaner burst-detection process and burst-ensemble multiplexing.

*Initialization of the network*. The feedforward synaptic strengths were initialized so that the event rates of all pyramidal ensembles in the network belonged to [*e*_{min}, *e*_{max}] for all inputs. Excitatory synaptic strengths from the input layer to the hidden layer were all equal, and likewise for the inhibitory synapses. For the hidden-to-output feedforward connections, the ratio of the excitatory synaptic strengths was 1.4:1.05 in favor of hidden 1. This ratio for inhibition was 5:0.3 in favor of hidden 2. The feedback strengths from the output population to the hidden populations (the only existing feedback connections) were initialized so that one of the hidden populations (hidden pool 2 in Fig. 4) would receive inhibition on average, and the other excitation. A constant depolarizing current was applied to hidden pool 2's dendritic compartments to compensate for this stronger inhibition.

*Error generation*. At the output layer, we specified a maximum and a minimum event rate, *e*_{max} and *e*_{min} (the same as in the learning rule of Eq. 6). A linearly transformed *Ē* was then used in conjunction with a cross-entropy loss function to compute the error for each neuron of the output population. As a result, a dendritic current was injected into every neuron so that its burst probability would increase or decrease according to the running average of its event rate and the desired output, with a proportionality constant *c* ~ 1 nA/Hz. For instance, if the desired output was 0 and *Ē*_{i} was large, then the injected current was strongly hyperpolarizing. The injected current was set to zero when *Ē*_{i} was within 1 Hz of the desired value.

*Learning protocol*. A simulation proceeded as follows. With the plasticity off, there was first a relaxation interval of duration 3*τ*_{avg}, with no input applied to the network (for Fig. 4, *τ*_{avg} = 5 s). Then, the four different input pairs were applied consecutively to give the "before learning" response in Fig. 4d. Afterward, the four input/output pairs were applied consecutively, for 20 s each (typically in the same order); this constituted one epoch. For each input/output pair, the input alone was first applied to the input populations with the plasticity off. We let the network reach its steady state for that input during the first 90% of the duration of an example. During this prediction interval, the moving average of the burst probability converged towards the actual burst probability of the population for that given input. During the last 10% of the example duration, the plasticity was activated for all feedforward excitatory synapses and the teacher was applied. For computational efficiency, the error was computed once, at the very end of the prediction interval. The total number of epochs required to reach decent performance depended on the initialization of the network and the learning rate; for Fig. 4, we used 500 epochs. At the end of learning, the plasticity was switched off for good and the "after learning" response was computed.

### Deep network model for categorical learning

We now describe the deep network model that was used to learn the classification tasks reported in Figs. 5–6. The code is available at https://github.com/jordan-g/Burstprop. The model can be seen as a limiting case of a time-dependent rate model, which itself can be heuristically derived from the spiking network model under simplifying assumptions (see Supplemental Materials).

For the fully-connected layers in the network, we defined the "somatic potentials" of units in layer *l* as

**v**_{l} = **W**_{l} **e**_{l−1},

where **W**_{l} is the weight matrix connecting layer *l* − 1 to layer *l*. Note that in this formulation we include a bias term as a column of **W**_{l}. The event rate of layer *l* was given by

**e**_{l} = *f*_{l}(**v**_{l}),

where *f*_{l} is the activation function for layer *l*. In models trained on MNIST and CIFAR-10, the activation function was a sigmoid. In the model trained on ImageNet, a ReLU activation was used for hidden layers and a softmax activation was used at the output layer.

During the feedforward pass, the burst probability at the output layer (*l* = *L*) was set to a constant, **p̄**_{L} (in these experiments, this was set to 0.2). Our previous research [69] has shown that the dendritic transfer function is a sigmoidal function of its input (see also Fig. S4). Therefore, the hidden-layer burst probabilities, **p**_{l}, for *l* < *L*, were computed using a sigmoidal function of a local "dendritic potential" **u**_{l} as

**p**_{l} = *σ*(*β* **u**_{l} + *α*),

where *σ* denotes the logistic sigmoid and *α* and *β* are constants controlling the dendritic transfer function. In our experiments, we set *β* = 1 and *α* = 0. Figure S4 illustrates various mechanisms affecting these parameters. The dendritic potentials were given by

**u**_{l} = (**Y**_{l} **b**_{l+1}) ⊙ *h*(**e**_{l}),

where ⊙ is the elementwise product. The vector-valued function *h* depends on the chosen activation function; of course, some caution is required when ReLU and softmax activations are used (see Supplemental Materials). The burst rate is given by

**b**_{l} = **p**_{l} ⊙ **e**_{l}.

Finally, **Y**_{l} is the feedback weight matrix. For the feedback alignment algorithm, **Y**_{l} is a random matrix that is fixed throughout learning [59]. In the standard backpropagation algorithm, the feedforward and feedback weight matrices are symmetric, so that **Y**_{l} = **W**_{l+1}^{T}, where ^{T} denotes the transpose. Below, we also describe how to learn the feedback weights to make them symmetric with the feedforward weights using the Kolen-Pollack algorithm [2].

With the teacher present, the output-layer burst probabilities were set to a squashed version of a teaching signal derived from the loss function 𝓛 (a mean squared error loss for Figs. 5–6). The squashing function was to make sure that *p*_{L,i} ∈ [0, 1]. The Supplemental Materials provide a few examples of squashing functions. The burst probabilities of the hidden layers were then computed as above. Finally, the weights were updated according to

Δ**W**_{l} = *η*_{l} [(**p**_{l} − **p̄**_{l}) ⊙ **ē**_{l}] **ē**_{l−1}^{T} − *λ* **W**_{l},

where **p**_{l} and **p̄**_{l} denote the burst probabilities with and without the teacher, respectively, *η*_{l} is the learning rate hyperparameter for units in layer *l*, and *λ* is a weight decay hyperparameter. Note that, for this model, **ē**_{l} lags **e**_{l} by a single computational step (see Supplemental Materials). Therefore, when the teacher appears, **ē**_{l} = **e**_{l} and we can write

Δ**W**_{l} = *η*_{l} (**b**_{l} − **b̄**_{l}) **ē**_{l−1}^{T} − *λ* **W**_{l}.

This means that, in this model, the error is directly represented by the deviation of the burst rate with respect to a reference.
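Under our reconstruction of these updates (sigmoid activations, *β* = 1, *α* = 0, and an assumed derivative-like factor *h*(*e*) = *e*(1 − *e*), not copied from the released code), one fully-connected training step could be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def burstprop_step(W, Y, e_prev, b_next_bar, b_next, eta=0.1, lam=1e-4):
    """One burstprop update for a hidden layer. b_next_bar / b_next are
    the burst rates of the layer above without / with the teacher."""
    e = sigmoid(W @ e_prev)                  # event rates (forward pass)
    h = e * (1.0 - e)                        # assumed dendritic scaling factor
    p_bar = sigmoid((Y @ b_next_bar) * h)    # burst probability, no teacher
    p = sigmoid((Y @ b_next) * h)            # burst probability, with teacher
    b_bar, b = p_bar * e, p * e              # burst rates
    W = W + eta * np.outer(b - b_bar, e_prev) - lam * W
    return W, b_bar, b
```

When the teacher leaves the upper layer's burst rate unchanged (no error), the burst deviation vanishes and only weight decay acts.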

In the case of convolutional layers, the event rates of ensembles in layer *l* were given by

**e**_{l} = *f*_{l}(**W**_{l} ∗ **e**_{l−1}),

where ∗ represents convolution. Similarly, the dendritic potentials in layer *l* were given by **u**_{l} = **Y**_{l} ∗ **b**_{l+1} while burst probabilities were calculated as in the fully-connected layers. Finally, the weights of convolutional layers were updated as

Δ**W**_{l} = *η*_{l} *ψ*(**b**_{l} − **b̄**_{l}, **ē**_{l−1}) − *λ* **W**_{l},

where *ψ* combines the burst deviations and event rates to compute an approximation of the gradient with respect to the convolutional weights **W**_{l}.

### Learning the recurrent weights

In certain experiments, we introduced recurrent inputs into the hidden layers that served to keep burst probabilities in the linear regime of the sigmoid function. At layer *l*, we set the reference dendritic potentials to

**ū**_{l} = (**Y**_{l} **b̄**_{l+1} + **Z**_{l} **b**_{l}^{(0)}) ⊙ *h*(**e**_{l}),

where **Z**_{l} is the recurrent weight matrix and the burst rates used here, denoted **b**_{l}^{(0)}, are calculated as the burst rate *without* any recurrent inputs and *without* the teaching signal:

**b**_{l}^{(0)} = *σ*(*β* (**Y**_{l} **b̄**_{l+1}) ⊙ *h*(**e**_{l}) + *α*) ⊙ **e**_{l}.

Otherwise, the dendritic potentials and burst rates must be solved self-consistently, slowing down computations. Recurrent weights are then updated in order to minimize **ū**_{l}:

Δ**Z**_{l} = −*η*_{r} **ū**_{l} (**b**_{l}^{(0)})^{T},

where *η*_{r} is the learning rate. Note that, with these recurrent inputs, the updates of matrix **W**_{l} are the same as before, but now with the dendritic potentials (and hence the burst probabilities) including the recurrent contribution **Z**_{l} **b**_{l}^{(0)}.

### Learning the feedback weights

Kolen and Pollack [52] found that if the feedforward and feedback weights are updated such that

Δ**W**_{l}(*t*) = **A**(*t*) − *λ* **W**_{l}(*t*), Δ**Y**_{l}(*t*) = **A**(*t*) − *λ* **Y**_{l}(*t*),

where **A** is any matrix with the same shape as **W**_{l} and **Y**_{l}, then **Y**_{l} and **W**_{l} will converge. This means that if the feedback weights are updated in the same direction as the feedforward weights and weight decay is applied to both sets of weights, they will eventually become symmetric. Thus, we implemented the following learning rule for the feedback weights between layer *l* + 1 and layer *l*:

Δ**Y**_{l} = *η*_{l+1} **ē**_{l} (**b**_{l+1} − **b̄**_{l+1})^{T} − *λ* **Y**_{l},

where *λ* is a weight decay hyperparameter. In convolutional layers, we used the analogous update based on *ψ*:

Δ**Y**_{l} = *η*_{l+1} *ψ*(**b**_{l+1} − **b̄**_{l+1}, **ē**_{l}) − *λ* **Y**_{l}.
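The convergence claim is easy to verify numerically: applying the same update **A** with weight decay to two different matrices makes their difference shrink by a factor (1 − *λ*) per step.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 3))
lam = 0.1

for _ in range(200):
    A = rng.normal(size=(4, 3))   # any shared update direction
    W = W + A - lam * W
    Y = Y + A - lam * Y

gap = np.max(np.abs(W - Y))       # decays as (1 - lam)**t
```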

### Training the model with CIFAR-10 and ImageNet

The network architectures described in Table S2 of the Supplemental Materials were trained on standard image classification datasets, CIFAR-10 [54] and ImageNet [23]. The CIFAR-10 dataset consists of 60,000 32 × 32 px images belonging to 10 classes (50,000 for training and 10,000 for testing), while the ImageNet dataset consists of 1.2 million images (resized to 224 × 224 px) split among 1000 classes.

Each unit in these networks represents an ensemble of pyramidal neurons and has an event rate, burst probability, and burst rate. For each training example, the input image is presented and a forward pass is done, where event rates **e**_{l} throughout the network are computed sequentially, followed by a feedback pass where burst probabilities and burst rates are computed. Then, the teaching signal is shown at the output layer, and new burst probabilities **p**_{l} and burst rates **b**_{l} are computed backward through the network. Weights are then updated using our weight update rules. Networks were trained using stochastic gradient descent (SGD) with mini-batches, momentum and weight decay. ReLU layers were initialized from a normal distribution using Kaiming initialization [43], whereas Xavier initialization was used in sigmoid layers [37]. Hyperparameter optimization was done on all networks using validation data (see Supplemental Materials for details).
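The two initialization schemes can be sketched as follows (a generic sketch, not the project's code):

```python
import numpy as np

def init_weights(fan_in, fan_out, activation, seed=0):
    """Kaiming (He) normal initialization for ReLU layers [43];
    Xavier (Glorot) normal initialization for sigmoid layers [37]."""
    rng = np.random.default_rng(seed)
    if activation == "relu":
        std = np.sqrt(2.0 / fan_in)
    else:
        std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))
```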

### Training the model using node perturbation

Node perturbation is a technique that approximates gradient descent by randomly perturbing the activations of units in the network, and updating weights according to the change in the loss function [94, 95]. In the model trained using node perturbation, at each step, first the input is propagated through the network as usual, after which the global loss, *L*, is recorded. Then, the same input is propagated again through the network, but the activations of units in a single layer *l* are randomly perturbed:

**e**_{l} = *f*_{l}(**W**_{l} **e**_{l−1} + *ξ*_{l}),

where the elements of *ξ*_{l} are chosen from a normal distribution with mean 0 and standard deviation *σ*. The new loss, *L*_{NP}, is recorded. The weights in layer *l* are then updated using the following weight update rule:

Δ**W**_{l} = −*η*_{l} (*L*_{NP} − *L*)/*σ*^{2} *ξ*_{l} **e**_{l−1}^{T}.

The layer to be perturbed, *l*, is changed with each mini-batch by iterating bottom-up through all of the layers in the network.
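A self-contained toy run of this scheme (a single tanh layer and a quadratic loss; all names and values here are illustrative, not from the reported experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.5])            # fixed input
target = np.array([0.5, 0.5])             # desired output

def loss(e):
    return np.sum((e - target) ** 2)

W = 0.1 * rng.normal(size=(2, 3))
sigma, eta = 0.01, 0.2

for _ in range(500):
    L = loss(np.tanh(W @ x))              # unperturbed loss
    xi = rng.normal(0.0, sigma, size=2)   # perturb the layer's input drive
    L_np = loss(np.tanh(W @ x + xi))      # perturbed loss
    # follow the perturbation if it decreased the loss, oppose it otherwise
    W -= eta * (L_np - L) / sigma**2 * np.outer(xi, x)

final_loss = loss(np.tanh(W @ x))
```

The factor (*L*_{NP} − *L*)/*σ*² times the perturbation is, in expectation, an estimate of the gradient of the loss with respect to the layer's drive, which is what makes the scheme a stochastic approximation of gradient descent.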

## Author contributions

All authors contributed to the burst-dependent learning rule. AP performed the spiking simulations. JG designed the recurrent plasticity rule and performed the numerical experiments on CIFAR-10 and ImageNet. BAR and RN wrote the manuscript, with contributions from JG and AP. BAR and RN co-supervised the project.

## Acknowledgments

We thank Adam Santoro and Leonard Maler for comments on this manuscript. We also thank Markus Hilscher and Maximiliano José Nigro for sharing data about SOM neurons. This work was supported by two NSERC Discovery Grants, 06872 (RN) and 04947 (BAR), a CIHR Project Grant (RN383647 - 418955), and a Fellowship from the CIFAR Learning in Machines and Brains Program (BAR).

## References
