## Abstract

The brain uses a learning algorithm that has yet to be discovered. Here we demonstrate that the ability of a neuron to predict its expected future activity may be an important missing component for understanding learning in the brain. We show that comparing predicted activity with actual activity can provide an error signal for modifying synaptic weights. Importantly, this learning rule can be derived from minimizing neuronal metabolic cost. This reveals an unexpected connection: learning in neural networks could result from each neuron simply maximizing its energy balance. We validated this predictive learning rule in neural network simulations and in data recorded from awake animals. We found that neurons in the sensory cortex can indeed predict their activity ~10-20 ms into the future. Moreover, in response to stimuli, cortical neurons changed their firing rate to minimize surprise, i.e. the difference between actual and expected activity, as predicted by our model. Our results also suggest that spontaneous brain activity provides “training data” for neurons to learn to predict cortical dynamics. Thus, this work demonstrates that the ability of a neuron to predict its future inputs could be an important missing element for understanding computation in the brain.

## Introduction

Artificial neural networks have shown remarkable performance in a multitude of difficult tasks, such as cancer detection, speech recognition, or self-driving cars ^{1–3}. However, the expertise of such neural networks is restricted to particular domains. This suggests that the human brain, which can solve a wide variety of tasks, e.g. driving a car while discussing the development of new cancer tests, may be using more powerful computational algorithms than those used in artificial neural networks. Currently, the best performing neural networks are trained using the backpropagation algorithm ^{4}. However, networks that use backpropagation have multiple properties which are difficult to reconcile with biological networks. For example: (1) they require separate phases for sending a bottom-up ‘sensory’ signal and for receiving a top-down error signal, whereas in the brain top-down feedback is combined with incoming sensory information; (2) backpropagation requires symmetric weights, meaning that a synaptic connection from neuron A to B has to have exactly the same strength as the connection from B to A; (3) the neuron models are biologically unrealistic, as they do not include, for example, spiking activity, internal neuron dynamics, or Dale’s law (neurons being either excitatory or inhibitory), among many other simplifications. The question of how to bridge the gaps between biological and artificial learning algorithms is a subject of rapidly growing research at the intersection of neuroscience and computer science ^{5–7}.

### Synaptic learning rule to minimize prediction error

One of the most promising ideas of how backpropagation-like algorithms could be implemented in the brain is based on using temporal differences in neuronal activity to approximate top-down error signals ^{8–14}. A typical example of such algorithms is Contrastive Hebbian Learning ^{15–17}, which was proved to be equivalent to backpropagation under certain assumptions ^{18}. Contrastive Hebbian Learning requires networks to have recurrent connections between hidden and output layers, which allows activity to propagate in both directions (Fig. 1A). The learning consists of two separate phases. First, in the ‘free phase’, a sample stimulus is continuously presented to the input layer and the activity propagates through the network until the dynamics converge to an equilibrium (the activity of each neuron achieves a steady-state level). In the second ‘clamped phase’, in addition to presenting the stimulus to the input, the output neurons are also held clamped at values representing the stimulus category (e.g.: 0 or 1), and the network is again allowed to converge to an equilibrium. For each neuron, the difference between activity in the clamped and free phase is used to modify synaptic weights (*w*) according to the equation

$$\Delta w_{ij} = \alpha \left( \hat{x}_i \hat{x}_j - \check{x}_i \check{x}_j \right) \qquad \text{(Eq. 1)}$$

where *i* and *j* are indices of pre- and post-synaptic neurons respectively, *x̂* and *x̌* denote steady-state activity in the clamped and free phase respectively, and *α* is a small number representing the learning rate. Intuitively, this can be seen as adjusting weights to push neuron activity in the free phase closer to the desired activity represented by the clamped phase. The obvious problem with the biological plausibility of this algorithm is that it requires the neuron to experience exactly the same stimulus twice, in two separate phases, and that the neuron needs to ‘remember’ its activity from the previous phase.
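To make the two-phase procedure concrete, here is a minimal sketch of the Eq. 1 update applied to toy steady-state activity vectors (the function name, activity values, and learning rate are illustrative, not taken from the original implementation):

```python
import numpy as np

def chl_update(x_free, x_clamped, alpha=0.1):
    """Contrastive Hebbian Learning update (Eq. 1):
    dw[i, j] = alpha * (x_clamped[i] * x_clamped[j] - x_free[i] * x_free[j])."""
    return alpha * (np.outer(x_clamped, x_clamped) - np.outer(x_free, x_free))

# toy steady-state activities of 3 neurons in the two phases
x_free = np.array([0.2, 0.7, 0.1])     # free phase (input only)
x_clamped = np.array([0.2, 0.9, 0.0])  # clamped phase (input + target)
dw = chl_update(x_free, x_clamped)     # weight change for every pair i, j
```

Note that the same stimulus must be run through both phases before any weight can change, which is exactly the biological-plausibility problem raised above.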

Here, we solve this problem by combining both activity phases into one, inspired by sensory processing in the cortex. For example, when presented with a new picture, visual areas initially show bottom-up driven activity containing mostly visual attributes of the stimulus (e.g. contours), which is then followed by top-down modulation carrying more abstract information, e.g. that this object is novel (Suppl. Fig. 1). Accordingly, our algorithm first runs only the initial part of the free phase, which represents bottom-up stimulus-driven activity, and then, after a few steps, the network output is clamped, corresponding to top-down modulation (Fig. 1B).

The main novel insight here is that the initial bottom-up activity is enough to allow neurons to estimate the expected top-down information, and the mismatch between estimated and actual activity can then be used as a teaching signal. To implement this in our model, neuron activity at the initial time steps of the free phase is used to predict the neuron’s expected steady-state activity. This is then compared with the actual activity at the end of the clamped phase, and the difference is used to update weights (Fig. 1B, Methods). Thus, to modify synaptic weights, we replaced the free-phase activity *x̌* in Eq. (1) with the predicted activity *x̃*:

$$\Delta w_{ij} = \alpha \left( \hat{x}_i \hat{x}_j - \tilde{x}_i \tilde{x}_j \right) \qquad \text{(Eq. 2)}$$

However, the problem is that this equation implies that a neuron also needs to know the predicted activity of all its presynaptic neurons, *x̃*_{i}, which could be problematic. To solve this problem, we found that *x̃*_{i} can be replaced by the actual presynaptic activity in the clamped phase, *x̂*_{i}, which leads to the following simplified synaptic plasticity rule

$$\Delta w_{ij} = \alpha\, \hat{x}_i \left( \hat{x}_j - \tilde{x}_j \right) \qquad \text{(Eq. 3)}$$

Thus, to modify synaptic weights, a neuron only compares its actual activity *x̂*_{j} with its predicted activity *x̃*_{j}, and applies this difference in proportion to each input’s contribution *x̂*_{i}.
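As a minimal sketch (with hypothetical toy values), the simplified rule of Eq. 3 reduces to a single outer product between presynaptic clamped activity and the postsynaptic ‘surprise’:

```python
import numpy as np

def predictive_update(x_pre, x_post_actual, x_post_predicted, alpha=0.1):
    """Predictive learning rule (Eq. 3):
    dw[i, j] = alpha * x_pre[i] * (x_post_actual[j] - x_post_predicted[j]).
    Each postsynaptic neuron only needs its own prediction."""
    surprise = x_post_actual - x_post_predicted   # actual minus expected
    return alpha * np.outer(x_pre, surprise)

x_pre = np.array([0.5, 0.1])     # presynaptic activity in the clamped phase
x_actual = np.array([1.0, 0.0])  # actual (clamped) postsynaptic activity
x_pred = np.array([0.8, 0.3])    # predicted steady-state activity
dw = predictive_update(x_pre, x_actual, x_pred)
```

Unlike Eq. 2, no predicted presynaptic activity appears anywhere in the update, which is what makes the rule local to the synapse.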

## Results

### Synaptic learning rule derivation by maximizing neuron energy balance

Importantly, Eq. 3 is not an ad hoc algorithm to solve a computational problem; this form of learning rule arises naturally as a consequence of minimizing the metabolic cost of a neuron. Most of the energy consumed by a neuron is for electrical activity, with synaptic potentials accounting for ~50% and action potentials for ~20% of ATP used ^{19}. Using a simplified linear model of neuronal activity, this energy consumption can be expressed as −*b*_{1}(∑_{i} *w*_{ij}*x*_{i})^{β1}, where *x*_{i} represents the activity of pre-synaptic neuron *i*, *w*_{ij} represents synaptic weights, *b*_{1} is a constant to match energy units, and *β*_{1} describes a non-linear relation between neuron activity and energy usage, which is estimated to be between 1.7 and 4.8 ^{20}. The remaining ~30% of neuron energy is consumed on housekeeping functions, which can be represented by a constant −*ε*. On the other hand, an increase in neuronal population activity also increases local blood flow, leading to more glucose and oxygen entering a neuron (see review on neurovascular coupling: ^{21}). This activity-dependent energy supply can be expressed as +*b*_{2}(∑_{k} *x*_{k})^{β2}, where *x*_{k} represents the spiking activity of neuron *k* from a local population of *K* neurons (*k* ∈ {1,…,*j*,…,*K*}), with the activity of neuron *j* given by *x*_{j} = ∑_{i} *w*_{ij}*x*_{i}; *b*_{2} is a constant, and *β*_{2} reflects the exponential relation between activity and blood volume increase, which is estimated to be in the range 1.7–2.7 ^{20}. Putting all those terms together, the energy balance of a neuron *j* can be expressed as

$$E_j = -\varepsilon - b_1 \left( \sum_i w_{ij} x_i \right)^{\beta_1} + b_2 \left( \sum_k x_k \right)^{\beta_2} \qquad \text{(Eq. 5)}$$

Using the gradient ascent method, we can calculate the changes in synaptic weights Δ*w* that will maximize the energy *E*_{j}. For that, we need to calculate the derivative of *E*_{j} with respect to *w*_{ij}. If we denote the population activity as *x*_{pop} = ∑_{k} *x*_{k}, and considering that ∑_{i} *w*_{ij}*x*_{i} = *x*_{j}, then we obtain

$$\Delta w_{ij} \propto \frac{\partial E_j}{\partial w_{ij}} = x_i \left( \beta_2 b_2\, x_{pop}^{\beta_2 - 1} - \beta_1 b_1\, x_j^{\beta_1 - 1} \right) \qquad \text{(Eq. 6)}$$

If we also denote *α*_{1} = *β*_{1}*b*_{1} and *α*_{2} = *β*_{2}*b*_{2}/(*β*_{1}*b*_{1}), then we can take *α*_{1} in front of the brackets; and considering that *β*_{1} and *β*_{2} > 1.7 (see ^{20}), we may assume that *β*_{1} ≈ 2 and *β*_{2} ≈ 2, which simplifies Eq. 6 to

$$\Delta w_{ij} = \alpha_1 x_i \left( \alpha_2\, x_{pop} - x_j \right) \qquad \text{(Eq. 7)}$$

Note that Eq. 7 has the same form as the learning rule from Eq. 3: Δ*w*_{ij} = *α* *x̂*_{i}(*x̂*_{j} − *x̃*_{j}). Even if *β*_{1} and *β*_{2} are not exactly 2, Eq. 3 can still provide a good linear approximation of the gradients prescribed by Eq. 6. Note also that in Eq. 3, *x̂*_{j} represents top-down modulation; thus, Eq. 3 can be interpreted as changing weights to reduce the mismatch between the activity of other neurons and a neuron’s own expected activity. Similarly, Eq. 7 can be interpreted such that weights should be changed to reduce the discrepancy between the neuronal population response (*α*_{2}*x*_{pop}) and a neuron’s own activity (*x*_{j}).
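The agreement between the energy gradient and the learning-rule form can be checked numerically. The sketch below compares the analytic expression α_{1}·x_{i}·(α_{2}·x_{pop} − x_{j}) against a finite-difference gradient of the energy balance for β_{1} = β_{2} = 2; all constants and activities are arbitrary toy values chosen for illustration:

```python
import numpy as np

# assumed toy values: b1, b2 scale costs/supply, eps is housekeeping cost
b1, b2, eps = 1.0, 0.4, 0.3
x_pre = np.array([0.2, 0.5, 0.3])   # presynaptic activities x_i
w = np.array([0.1, 0.4, 0.2])       # weights w_ij onto neuron j
x_others = 1.5                      # summed activity of the other neurons

def energy(w):
    x_j = np.dot(w, x_pre)          # activity of neuron j
    x_pop = x_j + x_others          # population activity
    return -eps - b1 * x_j**2 + b2 * x_pop**2   # energy balance, beta = 2

# analytic gradient in the form of the learning rule:
# alpha1 * x_i * (alpha2 * x_pop - x_j)
alpha1, alpha2 = 2 * b1, b2 / b1
x_j = np.dot(w, x_pre)
x_pop = x_j + x_others
analytic = alpha1 * x_pre * (alpha2 * x_pop - x_j)

# central finite-difference gradient of the energy for comparison
h = 1e-6
numeric = np.array([(energy(w + h * e) - energy(w - h * e)) / (2 * h)
                    for e in np.eye(3)])
```

The two gradients agree, confirming that following the learning rule is equivalent to ascending the neuron's energy balance in this quadratic case.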

Moreover, if we assume that a neuron adapts to maximize its energy balance in the future, then Eq. 7 changes to (Eq. S7): Δ*w*_{ij} = *α*_{3}*x*_{i,t}(*α*_{4}*x*_{rec,t+n} − *x̃*_{j,t+n}), where *x*_{rec,t+n} represents population recurrent activity, which can be thought of as a type of top-down modulation, similar to *x̂*_{j} in Eq. 3. Also note that the neuron *j* activity *x*_{j} becomes the future predicted activity *x̃*_{j,t+n} (see Suppl. Materials for details of the derivation). Thus, this shows that the best strategy for a neuron to maximize future energy resources requires predicting its future activity. Altogether, this reveals an unexpected connection: learning in neural networks could result from each neuron simply maximizing its energy balance.

### Learning rule validation in neural network simulations

To test whether our new learning rule can be used to solve standard machine learning tasks, we created the following simulation. The neural network had 784 input units, 1000 hidden units, and 10 output units, and it was trained on a hand-written digit recognition task (MNIST ^{22}; Suppl. Fig. 2; Methods). Our network achieved a 1.9% error rate, which is at human-level accuracy on this task (human error rate ~1.5–2.5%; ^{23}). This accuracy is also similar to that of neural networks with comparable architecture trained with the backpropagation algorithm ^{22}. This demonstrates that a network with our learning rule can solve challenging non-linear classification tasks.

To verify that neurons could correctly predict future free-phase activity, we took a closer look at sample neurons. Figure 2A illustrates the activity of all 10 output neurons in response to an image of a sample digit after the first epoch of training. During steps 1-12, only the input signal was presented and the network was running in free phase. At step 13, the output neurons were clamped, with the activity of 9 neurons set to 0 and the activity of the one neuron representing the correct image class set to 1. For comparison, this figure also shows the activity of the same neurons without clamped outputs (free phase). It illustrates that after about 50 steps in free phase, the network achieves steady state, with the predicted activity closely matching the actual steady-state activity. When the network is fully trained, it still takes about 50 steps for the network dynamics in free phase to converge to steady state (Fig. 2B). Note that although all units initially increase their activity at the beginning of the free phase, they later converge close to 0, except the one unit representing the correct category. Again, predictions made from the first 12 steps of the free phase closely matched the actual steady-state activity. The hidden units also converged to steady state after about 50 steps. Figure 2C illustrates the response of one representative hidden neuron to 5 sample stimuli. Because hidden units experience the clamped signal only indirectly, through synapses from output neurons, their steady-state activity is not bound to converge only to 0 or 1, as in the case of output neurons. Actual and predicted steady-state activity for hidden neurons is presented in Figure 2D. The average correlation coefficient between predicted and actual free-phase activity was R = 1 ± 0.0001 SD (averaged across 1000 hidden neurons in response to 200 randomly selected test images).
Note that for all predictions we used a cross-validation approach, where we trained a predictive model for each neuron on a subset of the data and applied it to new examples, which were then used for updating weights (Methods). Thus, neurons were able to successfully generalize their predictions to new unseen stimuli. The network error rate for the training and test dataset is shown in Fig. 2E. This demonstrates that our algorithm worked well, and each neuron accurately predicted its future activity.
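The prediction step itself is a plain least-squares regression from early free-phase activity to steady-state activity, fit on some trials and applied to held-out ones. A self-contained toy version is sketched below, with synthetic exponentially converging traces standing in for the network dynamics (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic free-phase traces: each trial converges exponentially to its own
# steady-state value, loosely mimicking the dynamics in Fig. 2
n_trials, n_steps = 500, 120
targets = rng.uniform(0, 1, n_trials)
t = np.arange(n_steps)
traces = targets[:, None] * (1 - np.exp(-t / 20.0)) \
    + 0.01 * rng.standard_normal((n_trials, n_steps))

# least-squares model: predict activity at the last step from the early steps,
# trained on 490 trials and applied to 10 held-out trials (cross-validation)
X_train = np.c_[traces[:490, :12], np.ones(490)]   # early steps + offset term
coef, *_ = np.linalg.lstsq(X_train, traces[:490, -1], rcond=None)

X_test = np.c_[traces[490:, :12], np.ones(10)]
y_pred = X_test @ coef
y_test = traces[490:, -1]
r = np.corrcoef(y_pred, y_test)[0, 1]   # high: early steps determine the target
```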

We also tested our learning rule in multiple other network architectures, which were designed to reflect additional aspects of biological neuronal networks. First, we introduced a constraint that 80% of the hidden neurons were excitatory, and the remaining 20% had only inhibitory outputs. This follows observations that biological neurons are releasing either excitatory or inhibitory neurotransmitters (Dale’s law ^{24}), and that about 80% of cortical neurons are excitatory. The network with this architecture achieved an error rate of 2.66% (Suppl. Fig. 3A), which again is comparable with human error rate on this task ^{23}. We also tested our algorithm in a network with two hidden layers, in a network without symmetric weights, and in a network with all-to-all connectivity within the hidden layer. We found that all of those modified networks similarly converged towards the solution (Suppl. Fig. 3B & 4). We also implemented our learning rule in a network with spiking neurons, which again achieved a similar error rate of 2.46% (Suppl. Fig. 5). Altogether, this shows that our predictive learning rule performs well in a variety of biologically motivated network architectures.

### Learning rule validation in awake animals

To test whether real neurons could also predict their future activity, we analyzed neuronal recordings from the auditory cortex of awake rats (Methods). As stimuli, we presented 6 tones, each 1s long and interspersed with 1s of silence, repeated continuously for over 20 minutes (Suppl. Materials). For each of the 6 tones we calculated the average onset and offset response, giving us 12 different activity profiles for each neuron (Fig. 3A). For each stimulus, the activity in the 15-25 ms time window was used to predict the average future activity within the 30-40 ms window. We used 12-fold cross-validation, where responses from 11 stimuli were used to train a least-squares model, which was then applied to predict the neuron’s activity for the 1 remaining stimulus. This procedure was repeated 12 times for each neuron. The average correlation coefficient between actual and predicted activity was R = 0.36 ± 0.05 SEM (averaged across 55 cells from 4 animals, Fig. 3B). The distribution of correlation coefficients for individual neurons was significantly different from 0 (t-test p<0.0001; inset in Fig. 3B). This shows that neurons have predictable dynamics, and that from an initial neuronal response its future activity can be estimated.
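The 12-fold leave-one-stimulus-out procedure can be sketched as follows, using synthetic per-stimulus firing rates in place of the recorded data (the linear relation between windows and the noise level are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy per-neuron data: 12 activity profiles (6 tone onsets + 6 offsets),
# with the early window (15-25 ms) linearly related to the later window
early = rng.uniform(0, 10, 12)                             # mean rate, 15-25 ms
late = 0.7 * early + 1.0 + 0.5 * rng.standard_normal(12)   # mean rate, 30-40 ms

# 12-fold leave-one-stimulus-out cross-validation
preds = np.empty(12)
for held_out in range(12):
    train = np.delete(np.arange(12), held_out)
    A = np.c_[early[train], np.ones(11)]    # slope + offset
    coef, *_ = np.linalg.lstsq(A, late[train], rcond=None)
    preds[held_out] = coef[0] * early[held_out] + coef[1]

r = np.corrcoef(preds, late)[0, 1]   # actual vs. predicted, as in Fig. 3B
```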

Repeated presentation of stimuli over tens of minutes also induced long-term changes in neuronal firing rates ^{25}, similar to perceptual learning. Importantly, based on our model it was possible to infer which individual neurons would increase or decrease their firing rate. To explain this, let us first look at the neural network simulations in Fig. 3C. They show that a neuron’s average change in activity from one learning epoch to the next depends on the difference between clamped (actual) activity and predicted (expected) activity in the previous learning epoch (Fig. 3C; correlation coefficient R = 0.35, p<0.0001; Suppl. Materials). Similarly, for cortical neurons, we found that the change in firing rate from the 1st to the 2nd half of the experiment was positively correlated with the difference between evoked and predicted activity during the 1st half of the experiment (R = 0.58, p<0.0001; Fig. 3D, Suppl. Materials). This can be understood in terms of Eq. 3: if actual activity is higher than predicted, then synaptic weights are increased, leading to higher activity of that neuron in the next epoch. Therefore, the similar behavior of artificial and cortical neurons, whose firing rates change to minimize ‘surprise’ (the difference between actual and predicted activity), provides strong evidence in support of the learning rule presented here.

### Deriving predictive model parameters from spontaneous activity

Next, we tested whether spontaneous brain activity could also be used to predict neuronal dynamics during stimulus presentation. Spontaneous activity, such as during sleep, is defined as activity not directly caused by any external stimuli. However, there are many similarities between spontaneous and stimulus-evoked activity ^{26–29}. For example, spontaneous activity is composed of ~50-300 ms long population bursts called packets, which resemble stimulus-evoked patterns ^{30}. This is illustrated in Figure 4A, where spontaneous activity packets in the auditory cortex are visible before sound presentation ^{31,32}. In our experiments, each 1s long tone presentation was interspersed with 1s of silence, and the activity during 200-1000 ms after each tone was considered spontaneous (animals were in a soundproof chamber; Suppl. Materials). The individual spontaneous packets were extracted to estimate neuronal dynamics (Methods). The spontaneous packets were then divided into 10 groups based on similarity in PCA space (Suppl. Materials), and for each neuron we calculated its average activity in each group (Fig. 4B). As in the previous analyses in Fig. 3A, the initial activity in the 5-25 ms time window was used to derive a least-squares model to predict future spontaneous activity in the 30-40 ms time window (Suppl. Materials). This least-squares model was then applied to predict future evoked responses from the initial evoked activity for all 12 stimuli. Figure 4C shows actual vs. predicted evoked activity for all neurons and stimuli (correlation coefficient R = 0.2 ± 0.05 SEM, averaged over 40 cells from 4 animals; the inset shows the distribution of correlation coefficients of individual neurons; p=0.0008, t-test). Spontaneous brain activity is estimated to account for over 90% of brain energy consumption ^{33}, yet the function of this activity is still a mystery. The above results offer a new insight: because neuronal dynamics during spontaneous activity are similar to evoked dynamics, spontaneous dynamics can be used by neurons as a training set for predicting responses to new stimuli.
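The packet-grouping step can be sketched as follows. PCA is performed via SVD and, as a simple stand-in for the grouping by similarity described in the Suppl. Materials, packets are binned into 10 groups by their first principal component (toy Poisson data; the grouping criterion here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# toy packet matrix: 200 spontaneous packets x 40 neurons (firing rates)
packets = rng.poisson(3.0, (200, 40)).astype(float)

# PCA via SVD on the mean-centered packet matrix
centered = packets - packets.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[:2].T          # project packets onto the top 2 PCs

# divide packets into 10 groups along the first PC (decile bins)
edges = np.quantile(scores[:, 0], np.linspace(0, 1, 11))
groups = np.clip(np.searchsorted(edges, scores[:, 0], side="right") - 1, 0, 9)

# per-neuron average activity within each group (cf. Fig. 4B)
group_means = np.array([packets[groups == g].mean(axis=0) for g in range(10)])
```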

## Discussion

Here we present computational and biological evidence that the basic principle underlying single-neuron learning may rely on minimizing future surprise: the difference between actual and predicted activity. Thus, a single neuron not only performs a summation of its inputs, it also predicts its expected future activity, which we propose is a crucial component of the learning mechanism. Note that a single neuron has a similar complexity to single-cell organisms, which have been shown to display ‘intelligent’ adaptive behaviors, including predicting the consequences of their actions in order to navigate toward food and away from danger ^{34–36}. This suggests that the typical neuronal models used in machine learning may be too simplistic to account for the essential computational properties of biological neurons. Our work suggests a new computational element within neurons, which could be crucial for reproducing the brain’s learning mechanism.

There are multiple lines of evidence that the brain operates as a predictive system ^{37–42}. However, it remains controversial how exactly predictive coding could be implemented in the brain ^{8}. Most proposed mechanisms involve specially designed neuronal circuits that allow for comparing expected and actual activity ^{43–46}. Although those models are to some degree biologically motivated, they require a precise network configuration, which could be difficult to achieve considering that brain connectivity is highly variable within and between brain areas. Here we propose that this problem can be solved by implementing predictions within a single neuron. Biological neurons have a variety of cellular mechanisms which operate on time scales of 1~100 ms, suitable for implementing predictions ^{47–51}. For instance, it was proposed that depolarization of the soma by basal dendrites in pyramidal cells could serve as an expected signal for comparison with top-down modulation arriving from the apical dendrite ^{52,53}. Another interesting biological aspect of our model is that it belongs to the category of energy-based models, for which it was shown that synaptic update rules are consistent with spike-timing-dependent plasticity ^{54}. All of this demonstrates the compatibility of our model with neurophysiology.

Our work also suggests that packets could be basic units of information processing in the brain. It is well established that sensory stimuli evoke coordinated bursts (packets) of neuronal activity lasting from tens to hundreds of milliseconds. We call such population bursts packets because they have a stereotypical structure, with neurons active at the beginning conveying bottom-up sensory information (e.g. this is a face), while neurons later in the packet represent additional higher-order information (e.g. this is a happy face of that particular friend) ^{55}. The later part of the packet can also encode whether there is a discrepancy with expectation (e.g. this is a novel stimulus ^{56,57}; Suppl. Fig. 1). This is likely because only the later part of the packet can receive top-down modulation, after information about that stimulus has been exchanged between other brain areas ^{58,59}. Thus, our work suggests that the initial part of the packet can be used to infer what the rest of the brain may ‘think’ about a stimulus, and the difference from this expectation can be used as a learning mechanism to modify synaptic connections. This could be the reason why, for example, we cannot process visual information faster than ~24 frames/s: only after evaluating whether a given image is consistent with expectation can the next image be processed by the next packet. Our learning rule thus implies that sensory information is processed in discrete units, and that each packet represents an elementary unit of perception.

## Methods

### Neural Network

The code to reproduce our network with all the implementation details is available at https://github.com/ykubo82/bioCHL. Briefly, the base network has the architecture: 784-1000-10 with sigmoidal units, and with symmetric connections (see Suppl. Fig. 3-5 for more biologically plausible network architectures which we also tested). To accelerate training, we used AdaGrad ^{60}, and we applied a learning rate of 0.03 to the hidden layer and 0.02 for the output layer. In the standard implementation of Contrastive Hebbian Learning, the learning rate α in Eq. 1 is also multiplied by a small number (~0.1) for all top-down connections (e.g. from output to hidden layer) ^{61}. This different treatment of feed-forward and feedback connections could be biologically questionable as cortical circuits are highly recurrent. Therefore, to make our learning rule more biologically plausible we discarded this feedback gain factor as we wanted our network to learn by itself what should be the contribution of each input, as described in Eq. 3.

### Future activity prediction

For all the predictions we used a cross-validation approach. Specifically, in each training cycle we ran the free phase on 490 examples, which were used to derive a least-squares model for each neuron to predict its future activity at time step 120, *x*_{j}[120], from its initial activity at steps 1-12. This can be expressed as

$$\tilde{x}_j[120] = \sum_{t=1}^{12} \lambda_{j,t}\, x_j[t] + b_j \qquad \text{(Eq. 4)}$$

where the terms in brackets correspond to time steps, and *λ* and *b* correspond to the coefficients and offset terms found by the least-squares method. Next, 10 new examples were taken, for which the free phase was run for only 12 steps; the least-squares model derived above was then applied to predict the free-phase steady-state activity for each of the 10 examples. From step 13 the network output was clamped. The weights were updated based on the difference between predicted and clamped activity, calculated only from those 10 new examples. This process was repeated 120 times in each training epoch. Moreover, the MNIST dataset has 60,000 examples, which we used for the above-described training, and 10,000 additional examples which were used only for testing. For all plots in Figure 2 we used only test examples which the network never saw during training. This demonstrates that each neuron can accurately predict its future activity even for novel stimuli which were never presented before.
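One training cycle of this procedure can be sketched end-to-end. The free-phase dynamics are replaced here by a hypothetical exponential relaxation (the real network dynamics live in the linked repository); the fit, hold-out prediction, and Eq. 3-style update follow the text:

```python
import numpy as np

rng = np.random.default_rng(3)

def run_free_phase(inputs, n_steps):
    """Stand-in for the recurrent free-phase dynamics of one neuron:
    a hypothetical exponential relaxation toward an input-dependent target."""
    target = inputs.mean(axis=1, keepdims=True)   # (n_examples, 1)
    t = np.arange(n_steps)
    return target * (1 - np.exp(-t / 20.0))       # (n_examples, n_steps)

inputs = rng.uniform(0, 1, (500, 784))
traces = run_free_phase(inputs, 120)

# fit the per-neuron least-squares model (Eq. 4) on 490 examples:
# early steps -> activity at step 120
A = np.c_[traces[:490, :12], np.ones(490)]
coef, *_ = np.linalg.lstsq(A, traces[:490, -1], rcond=None)

# 10 new examples: run free phase only 12 steps, predict steady state,
# then clamp the output and update weights from the mismatch (Eq. 3)
predicted = np.c_[traces[490:, :12], np.ones(10)] @ coef
clamped = rng.integers(0, 2, 10).astype(float)   # clamped target activity
x_pre = rng.uniform(0, 1, 10)                    # presynaptic clamped activity
dw = 0.03 * np.mean(x_pre * (clamped - predicted))  # averaged over 10 examples
```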

### Surgery, recording and neuronal data

The experimental procedures for the awake, head-fixed experiment have been previously described^{31,32} and were approved by the Rutgers University Animal Care and Use Committee, and conformed to NIH Guidelines on the Care and Use of Laboratory Animals. Briefly, a headpost was implanted on the skull of four Sprague-Dawley male rats (300-500g) under ketamine-xylazine anesthesia, and a craniotomy was performed above the auditory cortex and covered with wax and dental acrylic. After recovery, the animal was trained for 6-8 days to remain motionless in the restraining apparatus. On the day of the surgery, the animal was briefly anesthetized with isoflurane, the dura was resected, and after a recovery period, recording began. For recording we used silicon microelectrodes (Neuronexus technologies, Ann Arbor MI) consisting of 8 or 4 shanks spaced by 200μm, with a tetrode recording configuration on each shank. Electrodes were inserted in layer V in the primary auditory cortex. Units were isolated by a semiautomatic algorithm (klustakwik.sourceforge.net) followed by manual clustering (klusters.sourceforge.net)^{62}. Only neurons with average stimulus-evoked firing rates higher than 3 SD above the pre-stimulus baseline were used in the analysis, resulting in 9, 12, 12, and 22 neurons from each rat. For predicting evoked activity from spontaneous activity, we also required that neurons have a mean firing rate during spontaneous packets above the same threshold, which reduced the number of neurons to 40. The spontaneous packet onsets were identified from the spiking activity of all recorded cells as the time of the first spike marking a transition from a period of global silence (30 ms with at most one spike from any cell) to a period of activity (60 ms with at least 15 spikes from any cells), as described before in ^{31,63}. The data presented here are available from the corresponding author upon reasonable request.
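The packet-onset criterion described above (a first spike ending ≥30 ms of near-silence and starting ≥15 spikes within 60 ms) can be sketched as follows; the helper name and the toy pooled spike train are illustrative, not the original analysis code:

```python
import numpy as np

def find_packet_onsets(spike_times, silence_win=0.030, max_silence_spikes=1,
                       active_win=0.060, min_active_spikes=15):
    """Flag spikes that mark a transition from global silence (<= 1 spike in
    the preceding 30 ms) to activity (>= 15 spikes in the following 60 ms)."""
    spike_times = np.sort(spike_times)
    onsets = []
    for t in spike_times:
        n_before = np.sum((spike_times >= t - silence_win) & (spike_times < t))
        n_after = np.sum((spike_times >= t) & (spike_times < t + active_win))
        if n_before <= max_silence_spikes and n_after >= min_active_spikes:
            # keep only the first qualifying spike of each packet
            if not onsets or t - onsets[-1] > active_win:
                onsets.append(t)
    return np.array(onsets)

# toy pooled spike train: sparse background, then a dense burst at t = 1.0 s
rng = np.random.default_rng(4)
burst = 1.0 + np.sort(rng.uniform(0, 0.05, 30))
spikes = np.concatenate([[0.2, 0.5, 0.8], burst])
onsets = find_packet_onsets(spikes)   # detects a single onset near t = 1.0 s
```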

## Author contributions

AL conceived the project, analyzed data, performed computer simulations and wrote the manuscript; YK performed computer simulations and contributed to writing manuscript.

## Competing interests

The authors declare no competing interests.

## Suppl. Materials

### Maximizing future energy balance

Intuitively, it makes sense that planning, i.e. making predictions, can improve an organism’s success in accessing energy resources. In this section we show that this holds true even for a single neuron, where maximizing the future energy balance is best achieved by predicting its future activity. For that, we first write the equation for the energy balance of a neuron at time *t*+*n*, where *t* represents the current time and *n* is a small time increment. Using the same logic and notation as in Eq. 5, the energy balance of a neuron *j* at time *t*+*n* can be expressed as a function of housekeeping processes (−*ε*), the cost of electrical activity (which for simplified linear neurons can be written as a sum of its synaptic inputs: *x*_{j,t+n} = ∑_{i} *w*_{ij}*x*_{i,t+n}), and the energy supply from local blood vessels controlled by the combined local activity of neurons (∑_{k} *x*_{k,t+n}):

$$E_{j,t+n} = -\varepsilon - b_1 \left( \sum_i w_{ij} x_{i,t+n} \right)^{\beta_1} + b_2 \left( \sum_k x_{k,t+n} \right)^{\beta_2} \qquad \text{(Eq. S1)}$$

In the main text we show in simulations (Fig. 2) and in experimental data (Fig. 3) that for small *n*, the activity of neuron *j* at time *t*+*n* can be approximated by a linear function of its activity at the earlier time step *t*: *x*_{j,t+n} = *λ*_{j}*x*_{j,t}, where *λ*_{j} is a regression coefficient. Thus Eq. S1 can be rewritten as:

$$E_{j,t+n} = -\varepsilon - b_1 \left( \lambda_j \sum_i w_{ij} x_{i,t} \right)^{\beta_1} + b_2 \left( \lambda_j x_{j,t} + \sum_{k \neq j} x_{k,t+n} \right)^{\beta_2} \qquad \text{(Eq. S2)}$$

Using the gradient ascent method, we can calculate the change in weights that maximizes the future energy balance:

$$\Delta w_{ij} \propto \frac{\partial E_{j,t+n}}{\partial w_{ij}} = -\beta_1 b_1 \lambda_j x_{i,t} \left( \lambda_j x_{j,t} \right)^{\beta_1 - 1} + \beta_2 b_2 \lambda_j x_{i,t} \left( \lambda_j x_{j,t} + \sum_{k \neq j} x_{k,t+n} \right)^{\beta_2 - 1} \qquad \text{(Eq. S3)}$$

Note that in Eq. S3, *λ*_{j}*x*_{j,t} corresponds to the predicted future activity: *x̃*_{j,t+n} = *λ*_{j}*x*_{j,t}. We will also denote the population activity as *x*_{pop,t+n} = *λ*_{j}*x*_{j,t} + ∑_{k≠j} *x*_{k,t+n}, which simplifies Eq. S3 to:

$$\Delta w_{ij} = -\beta_1 b_1 \lambda_j x_{i,t}\, \tilde{x}_{j,t+n}^{\,\beta_1 - 1} + \beta_2 b_2 \lambda_j x_{i,t}\, x_{pop,t+n}^{\,\beta_2 - 1} \qquad \text{(Eq. S4)}$$

After factoring out *x*_{i,t}*λ*_{j} and switching the order of terms, we get:

$$\Delta w_{ij} = \lambda_j x_{i,t} \left( \beta_2 b_2\, x_{pop,t+n}^{\,\beta_2 - 1} - \beta_1 b_1\, \tilde{x}_{j,t+n}^{\,\beta_1 - 1} \right) \qquad \text{(Eq. S5)}$$

Considering that *β*_{1} and *β*_{2} > 1.7 (Devor et al. 2003), we may assume that *β*_{1} ≈ 2 and *β*_{2} ≈ 2, which allows us to simplify and reorganize the terms as follows:

$$\Delta w_{ij} = 2\lambda_j x_{i,t} \left( b_2\, x_{pop,t+n} - b_1\, \tilde{x}_{j,t+n} \right) = 2\lambda_j x_{i,t} \left( b_2\, x_{rec,t+n} - (b_1 - b_2)\, \tilde{x}_{j,t+n} \right) \qquad \text{(Eq. S6)}$$

where *x*_{rec,t+n} = ∑_{k≠j} *x*_{k,t+n} denotes the recurrent activity of the other neurons in the population (so that *x*_{pop,t+n} = *x̃*_{j,t+n} + *x*_{rec,t+n}). After denoting the constant terms as *α*_{3} = 2*λ*_{j}(*b*_{1} − *b*_{2}) and *α*_{4} = *b*_{2}/(*b*_{1} − *b*_{2}), we obtain:

$$\Delta w_{ij} = \alpha_3\, x_{i,t} \left( \alpha_4\, x_{rec,t+n} - \tilde{x}_{j,t+n} \right) \qquad \text{(Eq. S7)}$$

This shows that maximizing the future energy balance requires a neuron to predict its future activity *x̃*_{j,t+n}. Also note that in the brain, networks are highly recurrent; thus the population activity *x*_{rec,t+n} can be seen as playing a similar role to the top-down modulation *x̂*_{j}. Altogether, this suggests that a learning rule of the type in Eq. 3, Δ*w*_{ij} = *α* *x̂*_{i}(*x̂*_{j} − *x̃*_{j}), may be a necessity to allow a neuron to sustain its highly energy-intensive operation.
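As with Eq. 7, the Eq. S7 form can be verified numerically against a finite-difference gradient of the future energy balance (Eq. S2) with β_{1} = β_{2} = 2. All constants, activities, and *λ* below are arbitrary toy values, and this factorization assumes *b*_{1} > *b*_{2}:

```python
import numpy as np

# assumed toy values (hypothetical): b1 > b2, beta1 = beta2 = 2
b1, b2, eps, lam = 1.0, 0.4, 0.3, 0.9
x_pre = np.array([0.2, 0.5, 0.3])    # presynaptic activity x_{i,t}
w = np.array([0.1, 0.4, 0.2])        # weights onto neuron j
x_rec = 1.5                          # recurrent activity of the other neurons

def future_energy(w):
    x_j_pred = lam * np.dot(w, x_pre)            # predicted activity at t+n
    return -eps - b1 * x_j_pred**2 + b2 * (x_j_pred + x_rec)**2   # Eq. S2

# Eq. S7 form: dw_ij = alpha3 * x_i * (alpha4 * x_rec - predicted activity)
alpha3 = 2 * lam * (b1 - b2)
alpha4 = b2 / (b1 - b2)
x_j_pred = lam * np.dot(w, x_pre)
analytic = alpha3 * x_pre * (alpha4 * x_rec - x_j_pred)

# central finite-difference gradient of the future energy for comparison
h = 1e-6
numeric = np.array([(future_energy(w + h * e) - future_energy(w - h * e))
                    / (2 * h) for e in np.eye(3)])
```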

#### Suppl. Fig. 1 legend

B) Average event-related potential recorded with an EEG electrode located above the frontal cortex in healthy human adults, in response to 1000 Hz standard tones and to 1032 Hz deviant tones (20% probability; adapted from Sams et al. 1985). This is a typical example from a rich body of scientific literature on the mismatch negativity phenomenon (a brain response to violations of an expected stimulus rule). The similarity of neuronal responses in the early phase, and the divergence from the expected response for unexpected stimuli only in the later phase, provide another biological justification for combining the free and clamped phases in our model.

#### Note on symmetric and recurrent connections

Typical artificial neural networks trained with the backpropagation algorithm require that a synaptic weight from neuron *i* to *j* (*w*_{ij}) is exactly the same as the synaptic weight from neuron *j* to *i* (*w*_{ij} = *w*_{ji} is called a symmetric weight). However, it was recently shown that networks without symmetric connections can still converge on a good solution (Lillicrap et al. 2016). Similar results were also found by Detorakis et al. (2019), who used asymmetric weights with Contrastive Hebbian Learning. Thus, our results are consistent with that previous work.

More interesting is our result that a fully recurrent network with lateral connections also converged similarly. This type of connectivity more closely reflects biological conditions, as cortical neurons form extremely recurrent networks with many lateral connections. The main interest here is that the backpropagation algorithm is not well suited for networks with highly recurrent connectivity. Thus, the algorithm presented here with predictive neurons may provide a promising solution for training recurrent networks, but more future work is needed to address this question.

#### Spiking network model

The code for this network with all implementation details is provided at: https://github.com/ykubo82/bioCHL/tree/master/bioCHLspk. Briefly, our network was based on work by O’Connor et al. (2019), who developed spiking networks using the Equilibrium Propagation algorithm, which is an extension of Contrastive Hebbian Learning (Scellier et al., 2017). The steady state is calculated using the Forward Euler method, with each time step updating the activations as *A _{j}* ← *A _{j}* + ε[*p*’(*A _{j}*)(Σ _{i} *w _{ij}* *p*(*A _{i}*) + *b _{j}*) − *A _{j}*], where *A* is the activation state, *p*’(*A _{j}*) is the derivative of the activation function *p*, *b* is the bias, and *i* and *j* are neuron indices. ε can be seen as the learning rate of the activations. These states are clipped to the range [0, 1]. Each neuron communicates only using binary signals (0 or 1) to model spiking activity. For generating spikes, an encoder converts the neuron activation state to 1 or 0, which is sent as an output signal (panel A). The binary signals received by a neuron are converted into the activation state by a decoder. This can be seen as converting discrete spikes into a continuous post-synaptic membrane potential in actual neurons. We modified this spiking network by implementing our learning rule and by adding a least-squares model for predicting the steady-state activation based on the initial activation in time steps 1-17. We also used AdaGrad (Duchi et al., 2011) to search for hyperparameters, which resulted in setting the learning rate to 0.01. The network architecture was the same as in our main model, with 784-1000-10 neurons.
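A minimal sketch of the Forward Euler relaxation and the spike encoder described above is given below. The layer size, the hard-sigmoid activation, and the Bernoulli spike encoder are assumptions for illustration; the full implementation is in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small illustrative layer; sizes and activation choice are assumptions.
n = 5
W = rng.normal(0, 0.2, (n, n))
np.fill_diagonal(W, 0.0)                 # no self-connections
b = np.zeros(n)
eps = 0.1                                # learning rate of the activations

def p(a):                                # hard-sigmoid activation
    return np.clip(a, 0.0, 1.0)

def p_prime(a):                          # derivative of the hard sigmoid
    return ((a > 0.0) & (a < 1.0)).astype(float)

A = rng.random(n)                        # continuous activation states

for _ in range(100):
    # Forward Euler step of the relaxation dynamics:
    # A_j <- A_j + eps * [p'(A_j) * (sum_i w_ij * p(A_i) + b_j) - A_j]
    dA = p_prime(A) * (p(A) @ W + b) - A
    A = np.clip(A + eps * dA, 0.0, 1.0)  # states clipped to [0, 1]

# Encoder: convert each continuous state into a binary spike (0 or 1),
# so neurons communicate only with binary signals.
spikes = (rng.random(n) < p(A)).astype(int)
```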

#### Acoustic stimuli

In animal experiments, as stimuli we used 1 s long pure tones (2, 3.3, 5.4, 8.9, 15 and 24 kHz at 60 dB), interleaved with 1 s periods of silence, as described in (Luczak et al. 2009 & 2013). Activity occurring >200 ms after stimulus offset and before the next stimulus onset was regarded as spontaneous. The stimuli were continuously presented for 20-108 min, depending on how long the animal appeared to be comfortable during the experiment. All stimuli were tapered at the beginning and end with a 5 ms cosine window. Experiments took place in a single-walled sound isolation chamber (IAC, Bronx, NY) with sounds presented free field (RP2/ES1, Tucker-Davis, Alachua, FL). To compensate for the transfer function of the acoustic chamber, tone amplitudes were calibrated prior to the experiment using a condenser microphone placed next to the animal’s head (7017, ACO Pacific, Belmont CA) and a MA3 microphone amplifier (Tucker-Davis).
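The 5 ms cosine tapering can be sketched as follows; the sampling rate and the half-Hann (raised-cosine) ramp shape are assumptions for illustration, as the exact window implementation is not specified here.

```python
import numpy as np

fs = 44100                             # sampling rate in Hz (assumed)
dur, f0, ramp = 1.0, 2000.0, 0.005     # 1 s tone at 2 kHz with 5 ms tapers

t = np.arange(int(fs * dur)) / fs
tone = np.sin(2 * np.pi * f0 * t)

# 5 ms raised-cosine ramps applied at stimulus onset and offset,
# avoiding audible clicks from abrupt transitions
n_ramp = int(fs * ramp)
ramp_up = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
tone[:n_ramp] *= ramp_up
tone[-n_ramp:] *= ramp_up[::-1]
```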

#### Changes in neuronal activity across periods

For each animal we divided the duration of the recording into two halves. For each neuron we calculated the mean stimulus-evoked firing rate in a 30-40 ms time window, averaged over all stimulus presentations in each half of the experiment. Similarly, for each neuron we calculated the average predicted activity for the 30-40 ms time window, averaged across all stimuli in the 1st half of the experiment. We then calculated the difference between activities in the 2nd half minus the 1st half, and we correlated it with the difference between the activity in the 1st half minus the predicted activity in the 1st half (Fig. 3D). As described in the Methods section, only neurons with average stimulus-evoked firing rates higher than 3 SD above the pre-stimulus baseline were used in our analyses. Dividing the data in different proportions, e.g. 60%-40%, gave the same conclusions.
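The correlation analysis above can be sketched with synthetic per-neuron quantities; the numbers below are illustrative stand-ins for the recorded firing rates, not the actual data. The construction assumes (as the model predicts) that neurons shift their 2nd-half rates to reduce the 1st-half prediction error, which yields a negative correlation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic per-neuron rates in the 30-40 ms window (illustrative only)
n_neurons = 200
pred_1st   = rng.random(n_neurons)                       # predicted, 1st half
actual_1st = pred_1st + rng.normal(0, 0.2, n_neurons)    # actual, 1st half
# If neurons adjust toward their prediction, the 2nd-half rate moves
# opposite to the 1st-half prediction error (plus noise):
actual_2nd = actual_1st - 0.5 * (actual_1st - pred_1st) \
             + rng.normal(0, 0.05, n_neurons)

delta_rate = actual_2nd - actual_1st      # change in rate across halves
surprise   = actual_1st - pred_1st        # actual minus predicted, 1st half

r = np.corrcoef(delta_rate, surprise)[0, 1]   # negative under this assumption
```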

For comparison, analogous analyses were done on artificial neurons (Fig. 3C). For each neuron in the hidden layer, the steady-state clamped activity was averaged over all 1200 stimulus presentations in a single learning epoch. Similarly, for each neuron in the hidden layer, the predicted activity was averaged across all stimuli presented in an epoch. We then calculated the difference between clamped activities in two epochs: Δ*C _{i}* = <*C _{i}*> _{M} − <*C _{i}*> _{N}, where <> denotes the average, *i* is the neuron index, and *M* and *N* are the indices of epochs with *M* > *N*. This was then correlated with the difference between the average clamped and predicted activity in epoch *N*: Δ*P _{i}* = <*C _{i}*> _{N} − <*P _{i}*> _{N}. We found that the correlation between Δ*C* and Δ*P* was strongest in the earliest epochs, where the learning curve was changing most rapidly (Fig. 2E). For this simulation we used the neural network described in Methods with 784-1000-10 neurons, and we used the learning rule described in Eq. 3. However, in order to reproduce the repeated presentation of the same stimuli as in the animal experiments, we used the same 1200 stimuli in each learning epoch. Presenting different stimuli in each epoch, as in our original simulation, gave qualitatively similar results.
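The per-epoch averaging can be sketched as below. The array shapes and the random placeholder activities are assumptions for illustration; in the actual simulation, clamped and predicted activities come from the trained network.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative shapes: 5 epochs x 1200 stimuli x 50 hidden neurons
C = rng.random((5, 1200, 50))   # steady-state clamped activity (placeholder)
P = rng.random((5, 1200, 50))   # predicted activity (placeholder)

M, N = 2, 0                     # epoch indices with M > N

# Change in mean clamped activity between epochs: <C_i>_M - <C_i>_N
delta_C = C[M].mean(axis=0) - C[N].mean(axis=0)
# Mean clamped minus mean predicted activity in the earlier epoch N
delta_P = C[N].mean(axis=0) - P[N].mean(axis=0)

# Per-neuron correlation between the two differences
r = np.corrcoef(delta_C, delta_P)[0, 1]
```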

#### Predicting evoked dynamics from spontaneous activity

For predicting evoked dynamics from spontaneous activity, we first estimated the range of spontaneous activity patterns. For this, we divided spontaneous packets into 10 groups based on principal component analysis (PCA). Specifically, each packet was represented as an NxT matrix, where N is the number of neurons and T=30 is the number of 3 ms long time bins. This matrix was then converted into a single vector of length NxT. This was repeated for each packet, giving an (NxT)xP matrix, where P is the number of spontaneous packets (P for each animal was: 293, 855, 1215 and 2472). We then ran PCA on this matrix and, based on the values of the 1st PC, divided the packets into 10 groups. Note that the distribution of packets in PCA space was continuous; thus, this analysis should not be interpreted as implying that there are distinct types of packets. Dividing packets into 5-20 groups, and including the 2nd and 3rd PCs, gave results consistent with those presented in the main text.
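The packet-grouping step can be sketched as follows. The sizes and the random placeholder packets are assumptions for illustration; PCA is done via SVD of the mean-centered matrix, and packets are split into deciles of their 1st-PC score.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative sizes: N neurons, T = 30 bins of 3 ms, P packets
N, T, P = 40, 30, 300
packets = rng.random((P, N * T))        # each row: one packet flattened to a vector

# PCA via SVD on the mean-centered packet matrix
X = packets - packets.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ Vt[0]                         # projection of each packet onto the 1st PC

# Divide packets into 10 groups by deciles of their 1st-PC score
edges = np.quantile(pc1, np.linspace(0, 1, 11))
groups = np.clip(np.digitize(pc1, edges[1:-1]), 0, 9)
```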

For predicting evoked dynamics from spontaneous activity, we faced the problem that defining the onset of spontaneous packets is less accurate than defining the onset of evoked responses. Thus, to ensure a similar precision of alignment, during each stimulus presentation we detected the onset of the evoked response using the same criteria as for spontaneous packets. The evoked responses were then realigned according to the detected onset. This allowed estimating the spontaneous and evoked activity with the same precision. Using the original onset times instead of the realigned ones gave qualitatively similar results.

## Acknowledgments

This work was supported by Compute Canada, NSERC and CIHR grants to AL. We thank Karim Ali, Lukas Grasse, Mikhail Klassen and Reza Torabi for help, and we thank Peter Bartho for sharing data.