## Abstract

For both humans and machines, the essence of learning is to pinpoint which components of the information-processing pipeline are responsible for an error in the output — a challenge known as *credit assignment*^{1}. How the brain solves credit assignment is a key question in neuroscience, and one of significant importance for artificial intelligence. Many recent studies^{1–12} presuppose that it is solved by backpropagation^{13–16}, which is also the foundation of modern machine learning^{17–22}. However, it has been questioned whether it is possible for the brain to implement backpropagation^{23, 24}, and learning in the brain may actually be more efficient and effective than backpropagation^{25}. Here, we set out a fundamentally different principle of credit assignment, called *prospective configuration*. In prospective configuration, the network first infers the pattern of neural activity that should result from learning, and then the synaptic weights are modified to consolidate the change in neural activity. We demonstrate that this distinct mechanism, in contrast to backpropagation, (1) underlies learning in a well-established family of models of cortical circuits, (2) enables learning that is more efficient and effective in many contexts faced by biological organisms, and (3) reproduces surprising patterns of neural activity and behaviour observed in diverse human and animal learning experiments. Our findings establish a new foundation for learning beyond backpropagation, both for understanding biological learning and for building artificial intelligence.

The credit assignment problem^{1} lies at the very heart of learning. *Backpropagation*^{13–16}, as a simple yet effective credit assignment theory, has powered notable advances in artificial intelligence since its inception^{17–22}. It has also gained a predominant place in understanding learning in the brain^{1, 2, 4–9, 11, 12, 26}. Due to this success, much recent work has focused on understanding how biological neural networks could learn in a way similar to backpropagation^{27–36}: although many proposed models do not implement backpropagation exactly, they nevertheless try to approximate backpropagation, and much emphasis is placed on how close this approximation is^{3, 27–33, 37, 38}. However, learning in the brain is superior to backpropagation in many critical aspects — for example, compared to the brain, backpropagation requires many more exposures to a stimulus to learn^{25} and suffers from catastrophic interference of newly and previously stored information^{39, 40}. This raises the question of whether using backpropagation to understand learning in the brain should be the main focus of the field.

Here, we propose that the brain instead solves credit assignment with a fundamentally different principle, which we call *prospective configuration*. In prospective configuration, before synaptic weights are modified, neural activity changes across the network so that output neurons better predict the target output; only then are the synaptic weights (weights, for short) modified to consolidate this change in neural activity. By contrast, in backpropagation the order is reversed — weight modification takes the lead and the change in neural activity is the result that follows.

We identify prospective configuration as a principle that is implicitly followed by a well-established family of neural models with a solid biological grounding, namely, energy-based networks. These include Hopfield networks^{41} and predictive coding networks^{42}, which have been successfully used to describe information processing in the cortex^{43–49}. To support the theory of prospective configuration, we show that it can both yield efficient learning, of which humans and animals are capable, and reproduce data from experiments on human and animal learning. Thus, on the one hand, we demonstrate that prospective configuration performs more efficient and effective learning than backpropagation in various situations faced by biological systems, such as learning with deep structures, online learning, learning with a limited number of training examples, learning in changing environments, continual learning with multiple tasks, and reinforcement learning. On the other hand, we demonstrate that patterns of neural activity and behaviour in diverse human and animal learning experiments, including sensorimotor learning, fear conditioning and reinforcement learning, can be naturally explained by prospective configuration, but not by backpropagation.

Guided by the belief that backpropagation is the foundation of biological learning, previous work showed that energy-based networks can closely approximate backpropagation. However, to achieve this, the networks were set up in an unnatural way, such that the neural activity was prevented from substantially changing before weight modification, by constraining the supervision signal to be infinitely small or to last an infinitely short time^{29, 30, 37, 50, 51}. In contrast, we reveal that energy-based networks without these unrealistic constraints follow the distinct principle of prospective configuration rather than backpropagation, and are superior both in learning efficiency and in accounting for data on biological learning.

Below, we first introduce prospective configuration with an intuitive example, show how it originates from energy-based networks, and describe its advantages and quantify them in a rich set of biologically relevant learning tasks. Finally, we show that it naturally explains patterns of neural activity and behaviour in diverse learning experiments.

## Results

### Prospective configuration: an intuitive example

To optimally plan behaviour, it is critical for the brain to predict future stimuli — for example, to predict sensations in some modalities on the basis of other modalities^{52}. If the observed outcome differs from the prediction, the weights in the whole network need to be updated so that the predictions in the “output” neurons are corrected. Backpropagation computes how the weights should be modified to minimize the error on the output, and this weight update results in the change of neural activity when the network next makes the prediction. In contrast, we propose that the activity of neurons is first adjusted to a new configuration, so that the output neurons better predict the observed outcome (target pattern); the weights are then modified to reinforce this configuration of neural activity. We call this configuration of neural activity “prospective”, since it is the neural activity that the network *should produce* to correctly predict the observed outcome. In agreement with the proposed mechanism of prospective configuration, it has indeed been widely observed in biological neurons that presenting the outcome of a prediction triggers changes in neural activity — for example, in tasks requiring animals to predict a fruit juice delivery, the reward triggers rapid changes in activity not only in the gustatory cortex, but also in multiple cortical regions^{53, 54}.

To highlight the difference between backpropagation and prospective configuration, consider a simple example in Fig. 1a. Imagine a bear seeing a river. In the bear’s mind, the sight generates predictions of hearing water and smelling salmon. On that day, the bear indeed smelled the salmon but did not hear the water, perhaps due to an ear injury, and thus the bear needs to change its expectation related to the sound. Backpropagation (Fig. 1b) would proceed by backpropagating the negative error, so as to reduce the weights on the path between the visual and auditory neurons. However, this modification would also reduce the expectation of smelling the salmon the next time the river is visited, even though the smell of salmon was present and correctly predicted. These undesired and unrealistic side effects of learning with backpropagation are closely related to the phenomenon of catastrophic interference, where learning a new association destroys previously learned memories^{39, 40}. This example shows that, with backpropagation, even learning one new aspect of an association may interfere with the memory of other aspects of the same association.

In contrast, prospective configuration assumes that learning starts with the neurons being configured to a new state — which corresponds to a pattern enabling the network to correctly predict the observed outcome. The weights are then modified to consolidate this state. This behaviour can “foresee” side effects of potential weight modifications and compensate for them dynamically — Fig. 1c: to correct the negative error on the incorrect output, the hidden neurons settle to their prospective state of lower activity, and as a result, a positive error is revealed and allocated to the correct output. Consequently, prospective configuration increases the weights connecting to the correct output, while backpropagation does not (cf. middle plots of Fig. 1b and c). Hence, prospective configuration is able to correct the side effects of learning an association effectively, efficiently, and with little interference.

### Origin of prospective configuration: energy-based networks

To show how prospective configuration naturally arises in energy-based networks, we introduce a physical machine analogue that provides an intuitive understanding of energy-based networks and of how they produce the mechanism of prospective configuration.

Energy-based networks have been widely and successfully used in describing biological neural systems^{41, 42, 56–58}. In these models, a neural circuit is described by a dynamical system driven by reducing an abstract “energy”, e.g., reflecting errors made by the neurons; see Methods. Neural activity and synaptic weights change to reduce this energy, hence they can be considered as “movable parts” of the dynamical system. We show below that energy-based networks are mathematically equivalent to a physical machine (we call it *energy machine*), where the energy function has an intuitive interpretation and its dynamics are straightforward — the energy machine simply adjusts its movable parts to reduce energy.

As shown in Fig. 2a–b, the energy machine includes nodes sliding on vertical posts, connected with each other via rods and springs. Translating from energy-based networks to the energy machine, the neural activity maps to the vertical position of a solid node, a connection maps to a rod (blue arrow) pointing from one node to another (where the weight determines how the end position of the rod relates to the initial position), and the energy function maps to the elastic potential energy of springs with nodes attached at both ends (the natural length of the springs is zero). Different energy functions and network structures result in different energy-based networks, corresponding to energy machines with different configurations and combinations of nodes, rods, and springs. In Fig. 2, we present the energy machine of predictive coding networks^{30, 43, 55}, because they are the most accessible and are established to be closely related to backpropagation^{30, 37}.

The dynamics of energy-based networks, which are driven by minimizing the energy function, map to the relaxation of the energy machine, which is driven by reducing the total elastic potential energy of the springs. A prediction with energy-based networks involves clamping the input neurons to the provided stimulus and updating the activity of the other neurons, which corresponds to fixing one side of the energy machine and letting it relax by moving nodes (Fig. 2a). Learning with energy-based networks involves clamping the input and output neurons to the corresponding stimuli, first letting the activity of the remaining neurons converge and then updating the weights, which corresponds to fixing both sides of the energy machine and letting it relax first by moving nodes and then by tuning rods (Fig. 2b).
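The clamp-relax-consolidate procedure just described can be sketched in a few lines of NumPy for a linear two-layer predictive coding network. This is a minimal illustration assuming a quadratic energy and linear neurons; the layer sizes, step sizes and relaxation schedule are our own choices, not those used in the simulations reported here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes and initial weights are illustrative.
n_in, n_hid, n_out = 2, 3, 2
W1 = rng.normal(scale=0.5, size=(n_hid, n_in))   # input -> hidden
W2 = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output

def errors(x_in, x_hid, x_out):
    """Prediction errors on each layer (the 'springs' of the energy machine)."""
    e_hid = x_hid - W1 @ x_in
    e_out = x_out - W2 @ x_hid
    return e_hid, e_out

def learn(x_in, target, n_relax=100, dt=0.1, lr=0.05):
    """Clamp input and output, relax the hidden activity to minimize the
    energy E = 0.5*(|e_hid|^2 + |e_out|^2), then consolidate with local
    weight updates directed toward the relaxed (prospective) activity."""
    global W1, W2
    x_hid = W1 @ x_in                       # start at the feedforward prediction
    for _ in range(n_relax):                # relaxation: move downhill in energy
        e_hid, e_out = errors(x_in, x_hid, target)
        x_hid += dt * (-e_hid + W2.T @ e_out)   # -dE/dx_hid
    e_hid, e_out = errors(x_in, x_hid, target)  # errors at the prospective state
    W1 += lr * np.outer(e_hid, x_in)            # local, Hebbian-like updates
    W2 += lr * np.outer(e_out, x_hid)
```

Repeated calls to `learn` on the same input-target pair drive the feedforward prediction `W2 @ (W1 @ x_in)` toward the target, with each weight update directed toward the relaxed hidden state rather than the feedforward one.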

The energy machine reveals the essence of energy-based networks: the relaxation before weight modification lets the network settle to a new configuration of neural activity, corresponding to the activity that would have occurred after the error had been corrected by the modification of weights, i.e., prospective activity (thus, we call this mechanism prospective configuration). For example, the second-layer “neuron” in Fig. 2b increases its activity, and this increase in activity would also be caused by the subsequent weight modification (of the connection between the first and the second neurons). In simple terms, the relaxation in energy-based networks infers the prospective neural activity after learning, towards which the weights are then modified. This distinguishes it from backpropagation, where weight modification takes the lead, and the change in neural activity is the result that follows.

The bottom part of Fig. 2c shows the connectivity of a predictive coding network^{30, 43, 55}, whose dynamics are mathematically equivalent to those of the energy machine shown above it. Predictive coding networks include neurons (blue) corresponding to nodes on the posts, and separate neurons encoding prediction errors (red) corresponding to springs (see Methods for details).

Using the energy machine, Fig. 2d simulates the learning problem from Fig. 1. Here, we can see that prospective configuration indeed foresees the result of learning and its side effects, through relaxation. Hence, it learns to avoid interference within one iteration, which would otherwise take multiple iterations for backpropagation.

### Advantages of prospective configuration: reduced interference and faster learning

Here we quantify interference in the above scenario and demonstrate how the reduced interference translates into an advantage in performance. In all simulations in the main text, prospective configuration is implemented in predictive coding networks (see Methods). Fig. 3a compares the activity of output neurons in the example in Fig. 1 between backpropagation and prospective configuration. Initially both output neurons are active (top right corner), and the output should change towards a target in which one of the neurons is inactive (red vector). Learning with prospective configuration results in changes on the output (purple solid vector) that are better aligned with the target than those for backpropagation (purple dotted vector). Although the output from backpropagation can reach the target after multiple iterations, the output for the “correct neuron” diverges from the target during learning and only then comes back. This is a particularly undesirable effect in biological learning, where networks can be “tested” at any point during the learning process, so such transient errors may lead to incorrect decisions affecting chances of survival. By contrast, prospective configuration substantially reduces this effect.

The interference can be quantified by the angle between the directions of the target and of learning, and we define “target alignment” as the cosine of this angle (Fig. 3b); hence, high interference corresponds to low target alignment (Fig. 3c). The difference in target alignment demonstrated in Fig. 3a is also present for deeper and larger (randomly generated) networks, as shown in Fig. 3d. When a network has no hidden layers, the target alignment is equal to 1 (proved in section 2.2 of Supplementary Information). The target alignment drops for backpropagation as the network gets deeper, because changes in weights in one layer may interfere with changes in other layers (as explained in Fig. 1), whereas prospective configuration maintains a much higher value throughout. This metric directly translates to the efficiency of learning: Fig. 3e shows that the test error during training in a visual classification task with a deep neural network of 15 layers decreases faster for prospective configuration than for backpropagation. Fig. 3f repeats the experiment on networks of other depths, and shows the mean test error during training (reflecting how fast the test error drops) as a function of network depth. The mean error is higher for low depths, as these networks are unable to learn the task, and for greater depths, because it takes longer to train deeper networks. Importantly, the gap between backpropagation and prospective configuration widens for deeper networks, paralleling the difference in target alignment. Efficient training of deeper networks is important for biological neural systems, which are known to be deep (e.g., the primate visual cortex^{59}).
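The target alignment metric can be stated concisely in code (a minimal sketch; the function and variable names are ours):

```python
import numpy as np

def target_alignment(output_before, output_after, target):
    """Cosine of the angle between the direction toward the target and the
    direction in which the output actually moved after one weight update."""
    desired = target - output_before          # where learning should go
    actual = output_after - output_before     # where learning actually went
    denom = np.linalg.norm(desired) * np.linalg.norm(actual)
    return float(desired @ actual / denom)
```

A value of 1 means the update moved the output straight toward the target; values near 0 (or negative) indicate interference, because the update moved the output sideways or away from the target.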

In the Supplementary Information we develop a formal theory of prospective configuration and provide further illustrations and analyses of its advantages. Extended Data Fig. 1 formally defines prospective configuration and demonstrates that it is indeed commonly observed in different energy-based networks. Extended Data Fig. 2 shows that prospective configuration shares a close relationship with another influential model of credit assignment called target propagation^{60}, providing another way of understanding the advantages of prospective configuration. Extended Data Figs. 3 and 4 empirically verify and generalize the advantages expected from the theory: they show that prospective configuration yields more accurate error allocation and less erratic weight modification, respectively.

### Advantages of prospective configuration: effective learning in biologically relevant scenarios

Inspired by these advantages, we show empirically that prospective configuration indeed handles various learning problems faced by biological systems better than backpropagation does. Since the field of machine learning has developed effective benchmarks for testing learning performance, we use variants of classic machine learning problems that share key features with learning in natural environments. Such problems include online learning, where the weights must be updated after each experience (rather than after a batch of training examples)^{61}, learning with a limited number of training examples, learning in changing environments^{62}, continual learning with multiple tasks^{63, 64}, and reinforcement learning^{21}. In all the aforementioned learning problems, prospective configuration demonstrates a notable superiority over backpropagation.

Firstly, based on the example in Fig. 1, we expect prospective configuration to require fewer episodes for learning than backpropagation. Before presenting the comparison, we describe how backpropagation is used to train artificial neural networks. Typically, the weights are only modified after a batch of training examples, based on the average of the updates derived from the individual examples (Fig. 4a). In fact, backpropagation relies heavily on averaging over multiple experiences to reach human-level performance^{67–69}, as this averaging is needed to stabilise training^{70}. By contrast, biological systems must update the weights after each experience, and we compare the learning performance in such a setting. The sampling efficiency can be quantified by the mean test error during training, which is shown in Fig. 4b as a function of batch size (the number of experiences that the updates are averaged over). The efficiency strongly depends on batch size for backpropagation, because it requires batch training to average out erratic weight updates, while this dependence is weaker for prospective configuration, where the weight changes are intrinsically less erratic and batch averaging is less necessary (see Extended Data Fig. 4). Importantly, prospective configuration learns faster with smaller batch sizes, as in biological settings. Additionally, the final performance can be quantified by the minimum of the test error, which is shown in Fig. 4c for both the FashionMNIST and the CIFAR-10^{66} natural image datasets, when trained with a batch size of one. Here, prospective configuration also demonstrates a notable advantage over backpropagation.
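The batch versus online regimes contrasted above can be illustrated with a toy linear model trained by gradient descent (a generic sketch; the data, sizes and learning rate are illustrative and unrelated to the experiments reported here):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))          # a "batch" of 8 experiences
Y = X @ np.array([1.0, -2.0, 0.5])   # targets from a ground-truth mapping

def grad(w, x, y):
    """Gradient of the per-example squared error 0.5*(w@x - y)**2."""
    return (w @ x - y) * x

lr = 0.05

# Batch learning: one update from the gradient averaged over all experiences.
w_batch = np.zeros(3)
w_batch -= lr * np.mean([grad(w_batch, x, y) for x, y in zip(X, Y)], axis=0)

# Online learning: an update after every single experience, as a biological
# system must do; each step uses only the most recent experience.
w_online = np.zeros(3)
for x, y in zip(X, Y):
    w_online -= lr * grad(w_online, x, y)
```

Both regimes reduce the error, but online updates act on single noisy experiences, which is where the erratic weight changes of backpropagation discussed above become costly.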

Secondly, biological learning is also characterized by limited data availability. Fig. 4d shows that prospective configuration outperforms backpropagation when the model is trained with fewer examples.

Thirdly, biological systems often need to rapidly adapt to changing environments. A common way to simulate this is “concept drifting”^{62}, where part of the mapping between the output neurons and their semantic meaning is shuffled at regular intervals (Fig. 4e, left). Fig. 4e, right, shows the test error during training with concept drifting. The advantage of prospective configuration demonstrated here is related to its ability to optimally detect which weights to modify (see Extended Data Fig. 3). Fig. 4f summarizes the results and repeats the experiment on CIFAR-10. Here, we can see that prospective configuration learns better under concept drifting, indicating a better adaptation to changing environments.

Furthermore, biological organisms need to learn multiple tasks sequentially, whereas artificial neural networks show catastrophic forgetting: when trained on a new task, performance on previously learnt tasks is largely destroyed^{39, 71–73}. Fig. 4g shows the performance when trained on two tasks alternately (task 1 is classifying five randomly selected classes in the FashionMNIST dataset, and task 2 is classifying the remaining five classes). It shows that prospective configuration outperforms backpropagation both in avoiding forgetting of previous tasks and in relearning the current task. Fig. 4h summarizes the results and repeats the experiment on CIFAR-10.

Another key challenge for biological systems is to decide which actions to take. Reinforcement learning theories (e.g., Q-learning) propose that this is solved by learning the expected reward resulting from different actions in different situations^{74}. Such predictions of reward can be made by neural networks^{21}, which can be trained with prospective configuration or backpropagation. The sum of rewards per episode during training on three classic reinforcement learning tasks is reported in Fig. 4i, where prospective configuration demonstrates a notable advantage over backpropagation. This large advantage may arise because reinforcement learning is particularly sensitive to erratic changes in the network’s weights (as the target output depends on the reward predicted by the network itself for a new state; see Methods).
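The self-referential nature of the training target mentioned here can be seen in the standard Q-learning target, toward which either learning rule would train the network (a generic sketch; the discount factor and names are ours, not the paper's specific setup):

```python
import numpy as np

gamma = 0.99  # discount factor (illustrative)

def td_target(reward, q_next, done):
    """Q-learning target: unless the episode has ended, the training label
    depends on the network's own value prediction for the next state, so
    erratic weight changes feed back into future targets."""
    return float(reward) if done else float(reward) + gamma * float(np.max(q_next))
```

Because `q_next` is produced by the very network being trained, any erratic weight update distorts subsequent targets, compounding the instability.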

Based on the superior learning performance of prospective configuration, we may expect that this learning mechanism has been favoured by evolution; thus, in the next sections we investigate whether it can account for neural activity and behaviour during learning better than backpropagation.

### Evidence for prospective configuration: inference of latent states during learning

Prospective configuration is related to theories proposing that, before learning, the brain first infers a latent state of the environment from feedback^{75–77}. Here, we propose that this inference can be achieved in neural circuits through prospective configuration, where, following feedback, neurons in “hidden layers” converge to a prospective pattern of activity that encodes this latent state. We demonstrate that data from various previous studies, which involved the inference of a latent state, can be explained by prospective configuration. These data were previously explained by complex and abstract mechanisms, such as Bayesian models^{75, 76}, whereas here we mechanistically show with prospective configuration how such inference can be performed by minimal networks encoding only the essential elements of the tasks.

The dynamical inference of a latent state from feedback has recently been proposed to take place during sensorimotor learning^{76}. In this experiment, participants received different motor perturbations in different contexts, and learned to compensate for these perturbations. Behavioural data suggest that, after receiving the feedback, the participants first used it to infer the context and then adapted the force for the inferred context. We demonstrate that prospective configuration is able to reproduce these behavioural data, while backpropagation cannot.

Specifically, in the task (Fig. 5a), participants were asked to move a stick from a starting point to a target point, while experiencing perturbations. The participants experienced a sequence of blocks of trials (Fig. 5c-e) including training, washout, and testing. During the training session, different directions of perturbation, positive (+) or negative (-), were applied in different contexts, blue (B) or red (R) backgrounds, respectively. We denote these trials as B+ and R-. During the washout session, no perturbation was applied. In the testing session, the participants experienced one of the four possible test trials: B+, R+, B-, and R-. To evaluate learning on the test trials, motor adaptation (i.e., the difference between the final and target stick positions) was measured before and after the test trial, on two trials with a blue background. The change of adaptation between these two trials reflects the learning about the blue context that occurred on the test trial. If participants simply associated feedback with the colour of the background (B), then the change of adaptation would only occur with test trials B+ and B-. However, experimental data (Fig. 5f, right) show that there was a substantial adaptation change also with R+ trials (which was even larger than with B- trials).

To model learning in this task, we consider a neural network (Fig. 5b) where input nodes encode the colour of background, and outputs encode movement compensations in the two directions. Importantly, this network also includes hidden neurons encoding belief of being in the contexts associated with the two backgrounds ([B] and [R]). Trained with the exact procedure of the experiment^{76} from randomly initialized weights, prospective configuration with this minimal network can reproduce the behavioural data, while backpropagation cannot (cf., Fig. 5f left and middle).

Prospective configuration can produce a change in adaptation on the R+ test trial because, after + feedback, it is able to also activate context [B], which was associated with this feedback during training, and then learn compensation for this latent state. To shed light on how this inference takes place in the model, the bottom parts of Fig. 5c-d show the evolution of the weights of the network over sessions (thickness represents the strength of connections). Fig. 5e, bottom, shows the difference between the two learning rules at the exposure to R+: although B is not perceived, prospective configuration infers a moderate excitation of the belief in the blue context [B], because the positive connection from [B] to + was built during the training session. The activation of [B] enables the learning of weights from [B] to + and −, while backpropagation does not modify any weights originating from [B].

Studies of animal conditioning have also observed that feedback in learning tasks involving multiple stimuli may trigger learning about non-presented stimuli^{78–82}. For example, in one study^{78} rats were trained to associate fear (electric shock) with noise and light; then, in one group, the fear related to light was eliminated in an extinction session (Fig. 6a). Remarkably, the data suggested that eliminating the fear of light increased the fear of noise (Fig. 6b). Such learning is not predicted by the standard Rescorla-Wagner model^{83}. We consider a neural network (Fig. 6c) that includes two input neurons encoding the two stimuli, two hidden neurons, and one output neuron encoding the fear. Trained with the exact procedure of the animal experiment^{78} from randomly initialized weights, prospective configuration with this simple network can reproduce the data, while backpropagation cannot (cf. Fig. 6b, blue and orange). In the network employing prospective configuration, the feedback changes the activity of a hidden neuron previously associated with this feedback and with the non-presented stimulus (noise), and hence enables modification of the connections of this neuron (a learning mechanism analogous to that in sensorimotor learning, Fig. 5; see Extended Data Fig. 5 for details).

### Evidence for prospective configuration: discovering task structure during learning

Prospective configuration is also able to discover the underlying task structure in reinforcement learning. Particularly, we consider a task where reward probabilities of different options were not independent^{75}. In this study humans were choosing between two options, whose reward probabilities were constrained such that one option had higher reward probability than the other (Fig. 7a). Occasionally the reward probabilities were swapped, so if one probability was increased, the other was decreased by the same amount. Remarkably, the recorded fMRI data suggested that participants learned that the values of two options were negatively correlated, and on each trial updated the value estimates of both options in opposite ways. This conclusion was drawn from the analysis of the signal from medial prefrontal cortex which encoded the expected value of reward. Fig. 7c, right compares this signal after making a choice on two consecutive trials: a trial on which reward was not received (“Punish trial”) and the next trial. If the participant selected the same option on both trials (“Stay”), the signal decreased, indicating the reward expected by the participant was reduced. Remarkably, if the participant selected the other option on the next trial (“Switch”), the signal increased, suggesting that negative feedback for one option increased the value estimate for the other. Such learning is not predicted by standard reinforcement learning models^{75}.

This task can be conceptualized as having a latent state encoding which option is superior, and this latent state determines the reward probabilities for both options. Consequently, we consider a neural network reflecting this structure (Fig. 7b) that includes an input neuron encoding being in this task (equal to 1 in simulations), a hidden neuron encoding the latent state, and two output neurons encoding the reward probabilities for the two options. Trained with the exact procedure of the experiment^{75} from randomly initialized weights, prospective configuration with this minimal network can reproduce the data, while backpropagation cannot (cf., Fig. 7c left and middle).

To shed light on the difference between the models, we simulate an “idealized” version of the task in Fig. 7d-e: the network shown in the inset starts from (*W*_{0} = 1, *W*_{1} = 1, *W*_{2} = −1) and is trained for 64 trials in total. The rewards and punishments are delivered deterministically, and the reversal occurs only once, at the beginning of training. Fig. 7d inspects prospective configuration at the first few training iterations: during relaxation, the hidden neuron is able to infer its prospective configuration, i.e., negative hidden activity encoding that the rewarded choice has reversed. In Fig. 7e, such inference by prospective configuration results in an increase of *W*_{1}: since it has inferred from the punishment that the rewarded choice has reversed to a non-rewarded one, the punishment strengthens the connection from the latent state representing the non-rewarded choice to punishment. By contrast, in backpropagation *W*_{1} is decreased: since it receives a punishment without updating the latent state (still encoding that the rewarded choice has not changed), it weakens the connection from the latent state to reward. Fig. 7f shows *W*_{1} and *W*_{2} in the simulation of the full task with stochastic rewards. The weights follow a similar pattern as in the simplified task, i.e., their magnitude increases in prospective configuration. This signifies that the network learns that the rewards from the two options are jointly determined by a hidden state. This increase of the magnitude of *W*_{1} and *W*_{2} enables the network to infer the hidden state from the feedback, and to learn the task structure (as described for panel b). Fig. 7g shows the evolution of *W*_{0} in the full task. In prospective configuration, this weight remains closer to 0 than *W*_{1} and *W*_{2}.
Inset shows *W*_{0} on one of the simulation in the main plot, where it is demonstrated that prospective configuration easily flips *W*_{0} as the rewarded choice changes, while backpropagation has difficulty in accomplishing this. The reason of such behavior is as follows: thanks to large magnitude of *W*_{1} and *W*_{2} in prospective configuration, an error on the output unit results in a large error on the hidden unit, so the network is able to quickly flip the sign of *W*_{0} whenever the observation mismatches the expectation. This results in an increased expectation on the Switch trials (panel c).

Taken together, the three presented simulations illustrate that prospective configuration is a common principle that can explain a range of surprising learning effects across diverse tasks.

## Discussion

Our paper identifies the principle of prospective configuration, according to which learning relies on neurons first optimizing their pattern of activity to match the correct output, and then reinforcing these prospective activities through synaptic plasticity. Although it was known that in energy-based networks the activity of neurons shifts before the weight update, it has been previously thought that this shift is a necessary cost of error propagation in biological networks, and several methods have been proposed to suppress it^{29, 30, 37, 50, 51} to approximate backpropagation more closely. By contrast, we demonstrate that this reconfiguration of neural activity is the key to achieving learning performance superior to backpropagation, and to explaining experimental data from diverse learning tasks. Prospective configuration further offers a range of experimental predictions distinct from those of backpropagation (Extended Data Figs. 6–7). In sum, we have demonstrated that our novel credit assignment principle of prospective configuration requires only local computation and plasticity, enables learning that is more efficient than backpropagation thanks to reduced interference, achieves superior performance in situations faced by biological organisms, and matches experimental data across a wide range of tasks.

Our theory addresses a long-standing question of how the brain solves the plasticity-stability dilemma, e.g. how it is possible that despite learning and adjustment of representation in primary visual cortex^{84}, we can still perceive the world and understand the meaning of visual stimuli we learned over our lifetime. According to prospective configuration, when some weights are modified during learning, compensatory changes are made to other weights, to ensure the stability of previously acquired knowledge. Previous computational models have also proposed mechanisms reducing interference between different pieces of learned information^{73, 85}, and it is highly likely that these mechanisms operate in the brain in addition to prospective configuration and jointly reduce the interference most effectively.

The advantages of prospective configuration suggest that it may be profitably applied in machine learning to improve the efficiency and performance of deep neural networks. An obstacle is that the relaxation phase is computationally expensive. However, it has been demonstrated that the speed of energy-based networks can be greatly increased by implementing the relaxation on analog hardware^{86, 87}, potentially making energy-based networks faster than backpropagation. Therefore, we anticipate that our discoveries may change the blueprint of next-generation machine learning hardware — switching from the current digital tensor-based hardware to analog hardware that is closer to the brain and potentially far more efficient.

## Methods

This section provides the details necessary to replicate the results in the main text.

### Models

Throughout this work, we compare the established theory of *backpropagation* to the proposed new principle of *prospective configuration*. As explained in the main text, backpropagation is used to train *artificial neural networks* (ANNs), where the activity of a neuron is *fixed* to a value based on its input, while prospective configuration occurs in *energy-based networks* (EBNs), where the activity of a neuron is *not* fixed.

Since in ANNs the activity of neurons **x** is determined by their input, the output of the network can be obtained by propagating the inputs “forward” through the computational graph. The output can then be compared against a target pattern to get a measure of difference known as a loss *L*. Since the value of a node (activity of a neuron) in the computational graph is explicitly computed as a function of its input, the computational graph is usually differentiable. Thus, training ANNs with backpropagation modifies the weights **w** to take a step towards the negative gradient of the loss, −∂*L*/∂**w**, during which the activity of neurons **x** is fixed. The weights **w** requiring modification might be many steps away from the output of the computational graph, where the loss *L* is computed; thus, ∂*L*/∂**w** is often obtained by applying the chain rule of computing a derivative through intermediate variables (the activity of output and hidden neurons). For example, consider a network with 4 layers and let **x**^{l} denote the activity of neurons in layer *l*, while **w**^{l} denotes the weights of connections between layers *l* and *l* + 1. Then the change in the weights originating from the first layer is computed as:

Δ*w*^{1} ∝ −∂*L*/∂*w*^{1} = −(∂*L*/∂*x*^{4}) (∂*x*^{4}/∂*x*^{3}) (∂*x*^{3}/∂*x*^{2}) (∂*x*^{2}/∂*w*^{1}).

This enables the loss to be backpropagated through the graph to provide a direction of update for all weights.
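As a concrete illustration of this chain rule, here is a minimal numerical sketch (a hypothetical linear 4-layer network with a squared-error loss; all values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear 4-layer network: x2 = w1 x1, x3 = w2 x2, x4 = w3 x3.
w1, w2, w3 = [rng.normal(size=(3, 3)) for _ in range(3)]
x1 = rng.normal(size=3)
target = rng.normal(size=3)

x2 = w1 @ x1
x3 = w2 @ x2
x4 = w3 @ x3

# Loss L = 1/2 ||target - x4||^2. Chain rule through intermediate activities:
# dL/dw1 = (dL/dx4)(dx4/dx3)(dx3/dx2)(dx2/dw1).
dL_dx4 = x4 - target                  # dL/dx4
dL_dx2 = w2.T @ (w3.T @ dL_dx4)       # backpropagated through w3 and w2
dL_dw1 = np.outer(dL_dx2, x1)         # dx2/dw1 brings in the input x1

# Sanity check against a numerical derivative of one weight entry.
eps = 1e-6
w1p = w1.copy()
w1p[0, 1] += eps
L0 = 0.5 * np.sum((target - w3 @ (w2 @ (w1 @ x1))) ** 2)
L1 = 0.5 * np.sum((target - w3 @ (w2 @ (w1p @ x1))) ** 2)
assert abs((L1 - L0) / eps - dL_dw1[0, 1]) < 1e-4
```

Note that every factor except the last is a Jacobian with respect to activities, which is what forces the error signal to travel backward through layers whose activities are held fixed.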

In contrast to ANNs, in EBNs the activity of neurons **x** is not fixed to the input from a previous layer. Instead, an energy function *E* is defined as a function of the neural activity **x** and weights **w**. For networks organized in layers (considered in this paper), the energy can be decomposed into a sum of local energy terms *E*^{l}:

*E* = Σ_{l} *E*^{l}.     (1)

Here, *E*^{l} is called local energy, because it is a function of **x**^{l}, **x**^{l–1}, and **w**^{l–1}, which are neighbours and connected to each other. This ensures that the optimization of the energy *E* can be implemented by local circuits, because the derivative of *E* with respect to any neural activity (or weight) results in an equation containing only the local activity (or weights) and the activity of adjacent neurons. Predictions with EBNs are computed by clamping the input neurons to an input pattern, and then modifying the activity of all other neurons to decrease the energy:

Δ**x** ∝ −∂*E*/∂**x**.     (2)

Since the terms in *E* can be divided into local energy terms, this results in an equation that can be implemented with local circuits. This process of modifying the neural activity to decrease the energy is called *relaxation*, and we refer to the equation describing relaxation as *neural dynamics* — because it describes the dynamics of the neural activity in EBNs. After convergence of relaxation, the activities of the output neurons are taken as the prediction made by the EBN. Different EBNs are trained in slightly different ways. In the case of the *predictive coding network*^{30, 43, 55} (PCN), training involves clamping the input and output neurons to the input and target patterns, respectively. Then, relaxation is run until convergence, after which the weights are updated using the activity at convergence to further decrease the energy:

Δ**w** ∝ −∂*E*/∂**w**.     (3)

This also results in an equation that can be implemented with local plasticity, since it is just gradient descent on the local energy. We refer to such an equation as *weight dynamics*, because it describes the dynamics of the synaptic weights in EBNs.
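To make relaxation and the subsequent weight update concrete, here is a minimal sketch on a chain with one input, one hidden, and one output neuron (identity activation and unit initial weights are assumed purely for illustration):

```python
# Energy of a linear chain input -> hidden -> output (identity activation):
# E = 1/2 (h - w0*x_in)^2 + 1/2 (y - w1*h)^2, with x_in and y clamped.
x_in, y = 1.0, 0.0            # clamped input and target
w0, w1 = 1.0, 1.0             # initial weights
h = w0 * x_in                 # hidden activity starts at its feedforward value

# Relaxation (neural dynamics): gradient descent on E with respect to h.
for _ in range(200):
    dE_dh = (h - w0 * x_in) - w1 * (y - w1 * h)
    h -= 0.1 * dE_dh
# For these values the energy minimum lies at h = 0.5, between the
# feedforward prediction (1) and what the target requires (0).
assert abs(h - 0.5) < 1e-6

# Weight dynamics: after relaxation, gradient descent on E w.r.t. the weights.
lr = 0.2
dE_dw0 = -(h - w0 * x_in) * x_in
dE_dw1 = -(y - w1 * h) * h
w0 -= lr * dE_dw0             # 1.0 -> 0.9
w1 -= lr * dE_dw1             # 1.0 -> 0.95
```

Both weight updates move the network's feedforward behaviour towards the relaxed activity and the target, illustrating how plasticity consolidates the configuration inferred during relaxation.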

Backpropagation and prospective configuration are not restricted to specific models. Depending on the structure of the network and the choice of the energy function, one can define different models that implement the principle of backpropagation or prospective configuration. In the main text and most of the Extended Data, we investigate the most standard layered network. In this case, both ANNs and EBNs include *L* layers of weights *w*^{1}, *w*^{2}, …, *w*^{L}, and *L* + 1 layers of neurons *x*^{1}, *x*^{2}, …, *x*^{L+1}, where *x*^{1} and *x*^{L+1} are the input and output neurons, respectively. We consider the relationship between activities in adjacent layers for ANNs given by

*x*^{l+1} = *w*^{l} *f*(*x*^{l}),     (5)

and the energy function for EBNs described by

*E*^{l} = ½ (*x*^{l} − *w*^{l−1} *f*(*x*^{l−1}))^{2}.     (6)

This defines the ANNs to be the standard *multilayer perceptrons* (MLPs) and the EBNs to be the PCNs. In Eq. (6) and below, (**v**)^{2} denotes the inner product of vector **v** with itself. The comparison between backpropagation and prospective configuration in the main text is thus between the above MLPs and PCNs. This choice is justified by the facts that (1) they are the most standard models^{88} and (2) it is established that the two are closely related^{30, 37} (i.e., they make the same prediction with the same weights and input pattern), thus enabling a fair comparison. Nevertheless, we show that the theory (Extended Data Fig. 1) and the empirical comparison (Extended Data Figs. 3 and 4) between backpropagation and prospective configuration generalize to other choices of network structures and energy functions, i.e., other EBNs and ANNs, such as *GeneRec*^{89} and *Almeida-Pineda*^{90–92}.
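As a quick consistency check of the claim that the two model families predict identically with the same weights: at the feedforward configuration, every local energy term of Eq. (6) vanishes, so relaxation with only the input clamped leaves the MLP output unchanged as the PCN's prediction. A sketch (random weights, sigmoid *f*, layer relation *x*^{l+1} = *w*^{l} *f*(*x*^{l}) as assumed above):

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))   # Sigmoid activation

rng = np.random.default_rng(1)
ws = [rng.normal(size=(4, 4)) for _ in range(3)]   # w^1, w^2, w^3

# MLP forward pass, x^{l+1} = w^l f(x^l), starting from a random input x^1.
xs = [rng.normal(size=4)]
for w in ws:
    xs.append(w @ f(xs[-1]))

# PCN energy E = sum_l 1/2 ||x^{l+1} - w^l f(x^l)||^2 at this configuration:
E = sum(0.5 * np.sum((xs[l + 1] - ws[l] @ f(xs[l])) ** 2) for l in range(3))
assert E == 0.0   # every local term vanishes: the PCN predicts the MLP output
```

Since the energy is bounded below by zero, the feedforward configuration is a global minimum of the relaxation with the output unclamped.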

Putting Eqs. (5) and (6) into the general framework, we can obtain the equations that describe MLPs and PCNs, respectively. Assume the input and target patterns are *s*^{in} and *s*^{target}, respectively. Prediction with MLPs is:

*x*^{1} = *s*^{in};  *x*^{l+1} = *w*^{l} *f*(*x*^{l}),     (7)

where **x**^{L+1} is the prediction. Training MLPs with backpropagation is described by:

δ^{L+1} = *s*^{target} − *x*^{L+1};  δ^{l} = *f*′(*x*^{l}) ○ ((*w*^{l})^{T} δ^{l+1});  Δ*w*^{l} ∝ δ^{l+1} *f*(*x*^{l})^{T},     (8)

which backpropagates the error layer by layer from the output neurons.

The neural dynamics of PCNs can be obtained using Eq. (2):

Δ*x*^{l} ∝ −∂*E*/∂*x*^{l},  with *x*^{1} = *s*^{in} and *x*^{L+1} = *s*^{target} clamped during training.     (9)

Similarly, the weight dynamics of PCNs can be found:

Δ*w*^{l} ∝ −∂*E*/∂*w*^{l}.     (10)

To reveal the neural implementation of PCNs, we define the prediction errors to be

*ε*^{l} = *x*^{l} − *w*^{l−1} *f*(*x*^{l−1}).     (11)

The neural and weight dynamics of PCNs can then be expressed (by evaluating the derivatives in Eqs. (9) and (10)):

Δ*x*^{l} ∝ −*ε*^{l} + *f*′(*x*^{l}) ○ ((*w*^{l})^{T} *ε*^{l+1}),     (12)

Δ*w*^{l} ∝ *ε*^{l+1} *f*(*x*^{l})^{T},     (13)

where the symbol ○ denotes element-wise multiplication. Assuming that *ε*^{l} and *x*^{l} are encoded in the activity of error and value neurons, respectively, Eqs. (11) and (12) can be realized with the neural implementation in Fig. 2c bottom. In particular, error *ε* and value *x* neurons are represented by red and blue nodes, respectively; excitatory (+) and inhibitory (−) connections are represented by connections with solid and hollow nodes, respectively. Thus, Eqs. (11) and (12) are implemented with red and blue connections, respectively. It should also be noticed that the weight dynamics are realized locally: the weight change described by Eq. (13) corresponds to simple Hebbian plasticity^{93} in the neural implementation of Fig. 2c bottom, i.e., the change in a weight is proportional to the product of the activities of the pre-synaptic and post-synaptic neurons. Thus, a PCN, as an EBN, can be implemented with local circuits only, due to the local nature of the energy terms (as argued earlier in this section).
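The PCN dynamics above can be sketched end to end (a hypothetical 3-layer network with sigmoid *f*; sizes, step size, and iteration counts are illustrative only):

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))      # Sigmoid activation
def df(x):
    return f(x) * (1.0 - f(x))           # its derivative

rng = np.random.default_rng(2)
n = 4
ws = [0.5 * rng.normal(size=(n, n)) for _ in range(2)]   # w^1, w^2
s_in, s_target = rng.normal(size=n), rng.normal(size=n)

def energy(xs, ws):   # E = sum_l 1/2 ||x^{l+1} - w^l f(x^l)||^2
    return sum(0.5 * np.sum((xs[l + 1] - ws[l] @ f(xs[l])) ** 2)
               for l in range(2))

# x^1 and x^3 are clamped to input and target; the hidden layer starts
# at its feedforward value.
xs = [s_in, ws[0] @ f(s_in), s_target]
E_start = energy(xs, ws)

# Relaxation (neural dynamics) on the free hidden layer.
for _ in range(128):
    eps = [xs[l + 1] - ws[l] @ f(xs[l]) for l in range(2)]  # prediction errors
    xs[1] = xs[1] + 0.1 * (-eps[0] + df(xs[1]) * (ws[1].T @ eps[1]))
E_relaxed = energy(xs, ws)

# Weight dynamics at convergence: local, Hebbian-like updates.
eps = [xs[l + 1] - ws[l] @ f(xs[l]) for l in range(2)]
ws = [ws[l] + 0.01 * np.outer(eps[l], f(xs[l])) for l in range(2)]
E_final = energy(xs, ws)

assert E_relaxed < E_start   # relaxation decreased the energy
assert E_final < E_relaxed   # plasticity consolidates the relaxed activity
```

Each weight update is an outer product of the post-synaptic error and the pre-synaptic activity, so no quantity from a non-adjacent layer is ever needed.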

In all simulations in this paper (unless stated otherwise), the integration step of the neural dynamics in Eq. (12) (i.e., relaxation) is 0.1, and the relaxation is performed for 128 steps. During the relaxation, if the overall energy is not decreased from the last step, the integration step is reduced by 50%; if the integration step is reduced two times (i.e., reaching 0.025), the relaxation is terminated early. By monitoring the number of relaxation steps performed, we notice that in most of the tasks we performed, the relaxation is terminated early at around 60 iterations.
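The step-size schedule just described (halve the integration step whenever the energy fails to decrease, and terminate once it has been halved twice) can be sketched on a toy one-dimensional energy; the energy function and starting point here are purely illustrative:

```python
# Toy energy with a kink at the minimum, so that a fixed step eventually
# overshoots and triggers the step-halving rule described above.
def energy(x):
    return abs(x - 2.0)

def grad(x):
    return 0.0 if x == 2.0 else (1.0 if x > 2.0 else -1.0)

x, step = 2.3, 0.1
prev_E = energy(x)
steps = 0
for _ in range(128):
    x -= step * grad(x)
    steps += 1
    E = energy(x)
    if E >= prev_E:              # energy failed to decrease: halve the step
        step *= 0.5
        if step <= 0.025:        # halved twice (0.1 -> 0.05 -> 0.025): stop
            break
    prev_E = E

assert steps < 128               # relaxation terminated early
assert abs(x - 2.0) < 0.11       # settled near the minimum
```

On a well-behaved energy the step is never reduced and all 128 iterations run; the halvings only trigger once progress stalls, which is what allows the early termination observed at around 60 iterations in practice.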

In the Extended Data, we also investigate other choices of network structures and energy functions, resulting in other ANNs and EBNs. Overall, the EBNs investigated include PCNs^{30, 43, 55}, target-PCNs, and GeneRec^{89}, and the ANNs investigated include backpropagation and Almeida-Pineda^{90–92}. Details of all the models can be found in corresponding previous work, and are also given in the Supplementary Materials (SI) 2.1.

### Interference and measuring interference (i.e., target alignment) (Fig. 3)

In Fig. 3a, which simulates the example in Fig. 1, the structure of the network is 1-1-2; the weights are all initialized to 1; the input pattern is [1] and the target pattern is [0, 1]. The learning rates of both learning rules are 0.2, and the weights are updated for 24 iterations.

In Fig. 3d, there are 64 neurons in each layer (including the input and output layers) of each network; the weights are initialized via Xavier uniform initialization^{94}. No activation function is used, i.e., linear networks are investigated. Network depths (*L*) are searched over {1, 2, 4, 6, 8, 10, 12, 14, 15}, as reported on the x-axis. The input and target patterns are a pair of randomly generated patterns with mean 0 and standard deviation 1. The learning rates of both learning rules are 0.001. The weights are updated for one iteration and the target alignment is measured for this iteration for each of the 64 datapoints, then averaged over the 64 datapoints to produce the reported target alignment value. The whole experiment is repeated 3 times and the error bars report the standard error.

Simulations in Fig. 3e-f follow the setup of the experiments in Fig. 4 and are thus described at the end of the next section.

### Biologically relevant tasks (Fig. 4)

In all supervised learning simulations in Fig. 4a-h, models are trained and tested on two datasets, FashionMNIST^{65} and CIFAR-10^{95}. With FashionMNIST, models are trained to classify grayscale images of fashion items into 10 categories, such as trousers, pullovers and dresses. FashionMNIST is chosen because it is of moderate and appropriate difficulty for multi-layer non-linear deep neural networks, so that the comparisons with EBNs are informative. Classification of CIFAR-10 is more difficult, as it contains coloured natural images belonging to categories such as cars, birds and cats. Both datasets consist of 60000 training examples (the training set) and 10000 test examples (the test set).

The experiments in Fig. 4a-h follow the configurations below, except for the parameters investigated in specific panels (such as batch size, size of the dataset, and size of the architecture), which are adjusted as stated in the description of specific experiments. The neural network is composed of 4 layers with 32 hidden neurons in each hidden layer. The size of the input layer is 28 × 28 for FashionMNIST^{65} and 32 × 32 for CIFAR-10^{66} (both datasets are gray-scaled). The size of the output layer is 10, the number of classes in both datasets. The weights are initialized from a normal distribution with mean 0 and standard deviation √(2 / (*n*^{l} + *n*^{l+1})), where *n*^{l} and *n*^{l+1} are the numbers of neurons in the layers before and after the weight, respectively. This initialization is known as Xavier normal initialization^{94}. The activation function *f*() is *Sigmoid*. We define one *iteration* as updating the weights for one step based on a mini-batch. The number of examples in a mini-batch, called the batch size, is by default 32. One *epoch* comprises presenting the entire training set, split over multiple mini-batches. At the end of each epoch, the model is tested on the test set and the classification error is recorded as the “test error” of this epoch. The neural network is trained for 64 epochs, thus ending up with 64 test errors. The mean of the test error over epochs, i.e., during training progress, is an indicator of how fast the model learns. The minimum of the test errors over epochs is an indicator of how well the model can learn, ignoring the possibility of over-fitting due to training for too long. Learning rates are searched independently for each configuration and each model. Each experiment is repeated 10 times (unless stated otherwise), and the error bars represent the standard error.
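The weight initialization described above can be sketched as follows (a minimal Xavier-normal sketch; the layer sizes are the FashionMNIST defaults quoted above):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    # Standard deviation sqrt(2 / (n^l + n^{l+1})), as described above.
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
w = xavier_normal(28 * 28, 32, rng)   # first layer for 28x28 inputs
assert w.shape == (32, 28 * 28)
# The empirical std should be close to the target value (~0.0495 here).
assert abs(w.std() - np.sqrt(2.0 / (28 * 28 + 32))) < 2e-3
```

Scaling the variance by the sum of fan-in and fan-out keeps the magnitude of activities and gradients roughly constant across layers at initialization.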

We now describe settings specific to individual experiments. In Fig. 4b, different batch sizes are tested (as shown on the x-axis). In Fig. 4c, the batch size is set to 1. In Fig. 4d, different numbers of data points per class are included in the training set (subsets are randomly selected according to different seeds). In Figs. 4e-f, investigating concept drift^{62, 96, 97}, changes to class labels are made every 512 epochs, and the models are trained for 4096 epochs in total. Specifically, every 512 epochs, 5 out of 10 output neurons are selected, and the mapping from these 5 output neurons to the semantic meaning is pseudo-randomly shuffled. In the continual learning experiment of Fig. 4g, training alternates between two tasks. Task 1 is classifying five randomly selected classes in a dataset, and task 2 is classifying the remaining five classes. The whole network is shared by the two tasks; thus, differently from the network used in other panels, the network has only 5 output neurons. This better corresponds to continual learning with multiple tasks in nature because, for example, when humans learn to perform two different tasks, they use the same brain and the same pair of hands (i.e., the whole network, including the output layer, is shared across tasks). Task 1 is trained for 4 iterations, then task 2 is trained for 4 iterations, and the training alternates until a total of 84 iterations is reached. After each iteration, the error on the test set of each task is measured as the “test error”. In Fig. 4h, the mean test error of both tasks during the training of Fig. 4g is reported for different learning rates.

In Fig. 3e, networks of 15 layers are trained and tested on the FashionMNIST^{65} dataset. Fig. 3f investigates other network depths ({1, 2, 4, 6, 8, 10, 12, 14, 15}) in the same setup. Hidden layers are always of size 64. In this experiment, only part of the training set was used (60 datapoints per class) so that the test error is evaluated more frequently, to better reflect the difference in efficiency of the investigated learning rules. The activation function *f*() used is *LeakyReLU* instead of the standard Sigmoid, because Sigmoid makes deep neural networks difficult to train. Other unmentioned details follow the defaults described above.

In the reinforcement learning experiments (Fig. 4i), we evaluate performance on three classic reinforcement learning problems: Acrobot^{98, 99}, MountainCar^{100}, and CartPole^{101}. We interact with these environments via a unified interface provided by OpenAI Gym^{102}. The observations *s*_{t} of these environments are vectors describing the status of the system, such as velocities and positions of different moving parts (for details refer to the original articles or the documentation of OpenAI Gym). Each entry of the observation *s*_{t} is normalized to mean 0 and standard deviation 1 via Welford’s online algorithm^{103, 104}. The action space of these environments is discrete. Thus, we can have a network taking in the observation *s*_{t} and predicting the value (*Q*) of each action *a*_{t} with different output neurons. Such a network is known as an action-value network, in short a *Q* network. In our experiment, the *Q* network contains two hidden layers, each of which contains 64 neurons, initialized in the same way as the network used for supervised learning, described before. One can acquire the value of an action *a*_{t} at a given observation *s*_{t} by feeding *s*_{t} into the *Q* network and reading out the prediction of the output neuron corresponding to the action *a*_{t}; this value is denoted by *Q*(*s*_{t}, *a*_{t}). The training of *Q* is a simple regression problem to a target value *Q*^{target}, obtained via *Q*-learning with experience replay (summarized in Algorithm 1). Considering *s*_{t} to be *s*^{in} and *Q*^{target} to be *s*^{target}, the *Q* network can be trained with prospective configuration or backpropagation. Note that *Q*^{target} is the target of the selected action *a*_{t} only (i.e., the target of the one output neuron corresponding to the selected action *a*_{t}); thus, it is in practice considered to be *s*^{target}[*a*_{t}]. For prospective configuration, this means that the output neurons other than the one corresponding to *a*_{t} are freed; for backpropagation, it means that the errors on these neurons are masked out.

A PCN with settings slightly different from the defaults is used for prospective configuration: the integration step is fixed at half of the default (= 0.05), and relaxation is performed for a fixed and smaller number of steps (= 32). This change is introduced because *Q*-learning is more unstable (hence the smaller integration step) and more expensive (hence the smaller number of relaxation steps) than the supervised learning tasks. To produce a smoother curve of “Sum of rewards per episode” in Fig. 4i from the *SumRewardPerEpisode* in Algorithm 1, the *SumRewardPerEpisode* curve along *TrainingEpisode* is averaged with a sliding window of length 200. Each experiment is repeated with 3 random seeds and the shading represents the standard error across them. Learning rates are searched independently for each environment and each model from the range {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}; the results reported in Fig. 4i are for the learning rates yielding the highest mean of “Sum of rewards per episode” over training episodes.
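The construction of *s*^{target} with the non-selected actions masked can be sketched as follows; the standard *Q*-learning target *r*_{t} + *γ* max_{a} *Q*(*s*_{t+1}, *a*) is assumed, and the tiny linear “Q network” and all numbers are hypothetical:

```python
import numpy as np

def q_target_pattern(q_net, s_t, a_t, r_t, s_next, done, gamma=0.99):
    """Training target for the Q network: only the output neuron of the
    selected action a_t receives the Q-learning target; the other outputs
    keep the network's own predictions, which is equivalent to masking
    their errors (backpropagation) or leaving those neurons free
    (prospective configuration)."""
    target = np.array(q_net(s_t), dtype=float)     # current predictions
    bootstrap = 0.0 if done else gamma * float(np.max(q_net(s_next)))
    target[a_t] = r_t + bootstrap                  # standard Q-learning target
    return target

# Hypothetical linear "Q network" over 2 actions, for illustration only.
W = np.array([[0.1, 0.2], [0.3, -0.1]])
q_net = lambda s: W @ s

t = q_target_pattern(q_net, np.array([1.0, 0.0]), a_t=1, r_t=1.0,
                     s_next=np.array([0.0, 1.0]), done=False)
# t[0] stays at the network's own prediction; t[1] = 1.0 + 0.99 * 0.2
```

Copying the network's own predictions into the unselected entries makes their error exactly zero, which implements the masking without any special-casing in the learning rule.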

### Simulation of motor learning (Fig. 5)

As shown in Fig. 5, we train a network that includes 2 input, 2 hidden, and 2 output neurons. The two input neurons are one-to-one connected to the two hidden neurons, and the two hidden neurons are fully connected to the two output neurons. The two input neurons are considered to encode the presentation of the blue and red backgrounds, respectively. The two output neurons are considered to encode the prediction of perturbations in the positive and negative directions, respectively. Presenting or not presenting a background colour is encoded as 1 and 0, respectively; presenting or not presenting a perturbation in a particular direction is encoded as 1 and 0, respectively. The weights are initialized from a normal distribution with mean 0 and a standard deviation fitted to the behavioural data (see below), simulating that the participants had not built any associations before the experiment. Learning rates are independent for the two layers, as we expect the connections from perception to belief and the connections from belief to predictions to have different degrees of plasticity. The two learning rates are also fitted to the data (see below).

The number of participants, training and testing trials follow exactly the human experiment^{75}. In particular, for each of 24 simulated participants, the weights are initialized with a different seed of the random number generator. They each experience two stages: training and testing. Note that the pre-training stage performed in the human experiment is not simulated here as its goal was to make human participants familiar with the setup and devices.

In the training stage, the model experiences 24 blocks of trials. In each block, the model is presented with the following sequence of trials, matching the original experiment^{75}.

1. The model is trained with two trials without perturbation: B0 and R0, with the order counterbalanced across consecutive blocks. Note that in the human experiment there were two trial types without perturbations (channel and washout trials), but they are simulated in the same way here, as B0 or R0 trials, because neither included any perturbations.
2. The model is trained with 32 trials with perturbations, with equal numbers of B+ and R− within each 8 trials, in a pseudorandom order.
3. The model experiences two trials: B0 and R0, with the order counterbalanced across consecutive blocks.
4. The model experiences *n* ← {14, 16, 18} washout trials (equal numbers of B0 and R0 trials in a pseudorandom order), where *n* ← {*a*, *b*, *c*} denotes sampling without replacement from a set of values *a*, *b* and *c*, replenishing the set whenever it becomes empty.
5. The model experiences one triplet, where the exposure trial is either B+ or R−, counterbalanced across consecutive blocks. Here, a triplet consists of three sequential trials: B0, the specified exposure trial, and again B0.
6. The model experiences again *n* ← {6, 8, 10} washout trials (equal numbers of B0 and R0 trials in a pseudorandom order).
7. The model experiences again one triplet, where the exposure trial is either B+ or R−, whichever was not used in the previous triplet.
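The *n* ← {*a*, *b*, *c*} rule (sampling without replacement, replenishing the pool when it empties) can be sketched as:

```python
import random

def make_sampler(values, rng):
    """Sample without replacement from `values`; replenish the pool
    whenever it becomes empty (the n <- {a, b, c} rule above)."""
    pool = []
    def draw():
        nonlocal pool
        if not pool:
            pool = list(values)
            rng.shuffle(pool)
        return pool.pop()
    return draw

rng = random.Random(0)
draw = make_sampler([14, 16, 18], rng)
first_cycle = {draw(), draw(), draw()}
assert first_cycle == {14, 16, 18}   # each value appears once per cycle
```

Unlike independent sampling, this guarantees that every washout-length occurs equally often over each cycle of three blocks while keeping the order unpredictable.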

Then, in the testing stage, the model experiences 8 repetitions of four blocks of trials. In each block, one of the combinations B+, R+, B− and R− is tested. The order of the four blocks is shuffled in each of the 8 repetitions. In each block, the model first experiences *n* ← {2, 4, 6} washout trials (equal numbers of B0 and R0 trials in a pseudorandom order). Then the model experiences a triplet of trials, where the exposure trial is the combination (B+, R+, B− or R−) tested in the given block, to assess single-trial learning of this combination. The change in adaptation in the model is computed as the absolute value of the difference in the predictions of perturbations on the two B0 trials in the above triplet, where the prediction of perturbation is computed as the difference between the activities of the two output neurons. The predictions are averaged over participants and the above repetitions.

The parameters of each learning rule are chosen such that the model best reproduces the change in adaptation shown in Fig. 5f. In particular, we minimize the sum over the set *C* of the 4 exposure trial types of the squared difference between the average change in adaptation in the experiment (*d*_{c}) and in the model (*x*_{c}):

Σ_{c∈C} (*d*_{c} − *a* *x*_{c})^{2}.     (14)

The model predictions are additionally scaled by a coefficient *a* fitted to the data, because the behavioural data and the model outputs have different scales. An exhaustive search was performed over the model parameters: the standard deviation of the initial weights and the two learning rates for the two layers could each take values from {0.01, 0.05, 0.1}. Then, for each learning rule and each combination of the above model parameters, the coefficient *a* is resolved analytically (restricted to be positive) to minimize the sum of the squared errors of Eq. (14).
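For a scale-only fit of this kind, the minimizer of Eq. (14) has the closed form *a* = Σ_{c} *d*_{c} *x*_{c} / Σ_{c} *x*_{c}^{2}, clipped at zero for the positivity constraint; a sketch with hypothetical numbers:

```python
import numpy as np

def fit_scale(d, x):
    """Positive least-squares scale: argmin_a sum_c (d_c - a*x_c)^2,
    clipped at zero to respect the positivity constraint."""
    a = float(np.dot(d, x) / np.dot(x, x))
    return max(a, 0.0)

# Hypothetical change-in-adaptation data d and model outputs x (4 trial types).
d = np.array([0.8, 0.6, 0.7, 0.5])
x = np.array([0.4, 0.3, 0.35, 0.25])
a = fit_scale(d, x)
assert np.isclose(a, 2.0)   # here d = 2x exactly, so the fit recovers a = 2
```

The closed form follows from setting the derivative of Eq. (14) with respect to *a* to zero, which is why no iterative search over *a* is needed.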

### Simulation of fear conditioning (Fig. 6)

As shown in Fig. 6c, the simulated network includes 2 input, 2 hidden, and 1 output neurons. The weights are initialized from a normal distribution with mean 0 and standard deviation 0.01, reflecting that the animals had not built an association between stimulus and electric shock before the experiments. Presenting or not presenting a stimulus (noise, light, or shock) is encoded as 1 and 0, respectively. The two input neurons are considered to be the visual and auditory neurons; thus, their activity corresponds to perceiving light and noise, respectively. The output neuron is considered to encode the prediction of the electric shock. The training and extinction sessions are both simulated for 32 iterations with a learning rate of 0.01. In the test session, the model makes a prediction from the presented stimulus (noise only). As in the previous section, we denote by *x*_{c} the prediction for each group *c* from a set *C* = {*N*+, *LN*+, *LN*+*L*−}. To map the prediction to the percentage of freezing, it is scaled by a coefficient *a* (as the neural activity and the measure of freezing have different units) and shifted by a bias *b* (as the rats may have some tendency to freeze after salient stimuli even if they had not been associated with a shock). The numbers reported in Fig. 6b are these scaled predictions. The coefficient *a* (constrained to be positive) and the bias *b* are optimized for prospective configuration and backpropagation independently, analogously to the previous section, i.e., their values that minimize the summed squared error Σ_{c∈C} (*d*_{c} − (*a* *x*_{c} + *b*))^{2} are found analytically.

### Simulation of human reinforcement learning (Fig. 7)

As shown in Fig. 7b, we train a network that includes 1 input, 1 hidden, and 2 output neurons. The input neuron is considered to encode being in the task, so it is set to 1 throughout the simulation. The two output neurons encode the predicted values of the two choices. Reward and punishment are encoded as 1 and –1, respectively, because the participants were either winning or losing money. The model selects actions stochastically based on the predicted values of the two choices (encoded in the activity of the two output neurons) according to the softmax rule (with temperature of 1). The weights are initialized from a normal distribution with mean 0 and a standard deviation fitted to the experimental data (see below), simulating that the human participants had not built any associations before the experiment. The number of simulated participants (i.e., the number of repetitions with different seeds) was set to 16, as in the human experiment^{75}. The number of trials is not mentioned in the original paper, so we simulate 128 trials for both learning rules.
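The action-selection rule can be sketched as follows (a minimal softmax with temperature 1; the value vector is hypothetical):

```python
import numpy as np

def softmax_choice(values, rng, temperature=1.0):
    """Stochastic action selection from predicted values (softmax rule)."""
    z = np.asarray(values, dtype=float) / temperature
    z -= z.max()                          # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p)), p

rng = np.random.default_rng(0)
action, p = softmax_choice([1.0, -1.0], rng)   # hypothetical predicted values
assert np.isclose(p.sum(), 1.0)
assert p[0] > p[1]   # the higher-valued option is chosen more often
```

With temperature 1 the choice probabilities depend only on the differences between the predicted values, so the rule adds no extra free parameter to fit.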

To compare the ability of the two learning rules to account for the pattern of the signal from the mPFC, for each of the rules we optimized the parameters describing how the model is set up and learns (the standard deviation of the initial weights and the learning rate). Namely, we searched for the values of these parameters for which the model produces the output activity pattern most similar to that in the experiment. In particular, we minimized the sum over the set *C* of the four trial types in Fig. 7c of the squared difference between the model predictions *x*_{c} and the data *d*_{c} on the mean mPFC signal:

Σ_{c∈C} (*d*_{c} − (*a* *x*_{c} + *b*))^{2}.     (15)

The model predictions are additionally scaled by a coefficient *a* and offset by a bias *b*, because the fMRI signal has different units and baseline than the model. To compute the model prediction for a given trial type, the activity of the output neuron corresponding to the chosen option is averaged across all trials of this type in the entire simulation. The scaled average activity from the model is plotted in Fig. 7c, where the error bars show the standard error of the scaled activity. To fit the model to the experimental data, the values of the model parameters and the coefficients were found analogously to the previous section. In particular, we employ an exhaustive grid search over the parameters: the model is simulated for all possible combinations of the standard deviation of the initial weights and the learning rate, each taken from the set {0.01, 0.05, 0.1}. Then, for each learning rule and each combination of the above model parameters, the coefficient *a* (restricted to be positive) and the bias *b* are resolved analytically to minimize the sum of the squared errors of Eq. (15).
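The analytic fit of *a* and *b* is ordinary least squares with a positivity constraint on *a*; a sketch with hypothetical numbers:

```python
import numpy as np

def fit_scale_and_bias(d, x):
    """Least-squares a, b minimizing sum_c (d_c - (a*x_c + b))^2,
    with a restricted to be positive (fall back to a = 0 otherwise)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    if a < 0:
        a, b = 0.0, float(np.mean(d))   # best fit with a fixed at 0
    return float(a), float(b)

# Hypothetical mean mPFC signals d and model outputs x for four trial types.
x = np.array([0.2, 0.5, 0.8, 0.1])
d = 3.0 * x + 1.0                       # constructed so that a = 3, b = 1
a, b = fit_scale_and_bias(d, x)
assert np.isclose(a, 3.0) and np.isclose(b, 1.0)
```

Because *a* and *b* have closed-form solutions given the model outputs, the grid search only needs to cover the initialization and learning-rate parameters.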

## Data availability

Learning tasks analysed in Fig. 4a-h were built using the publicly available FashionMNIST^{65} and CIFAR-10^{66} datasets. They are incorporated in most machine learning libraries, and their original releases are available at https://github.com/zalandoresearch/fashion-mnist and https://www.cs.toronto.edu/~kriz/cifar.html, respectively. Reinforcement learning tasks analysed in Fig. 4i were built using the publicly available simulators by OpenAI Gym^{102}.

## Code availability

Complete code and full documentation reproducing all simulation results will be made publicly available at https://github.com/YuhangSong/A-New-Perspective upon publication of this work. It will be released under the GNU General Public License v3.0 without any additional restrictions (for the license details see https://opensource.org/licenses/GPL-3.0 by the Open Source Initiative).

## Author contributions

Y.S. and R.B. conceived the project. Y.S., R.B., B.M. and T.S. contributed ideas for experiments and analysis. Y.S. and B.M. performed simulation experiments and analysis. Y.S., T.L. and R.B. managed the project. T.L. and Z.X. advised on the project. Y.S., R.B. and B.M. wrote the paper. T.S., T.L. and Z.X. provided revisions to the paper.

## Competing interests

The authors declare no competing interests.

## Additional information

**Extended Data Figures/Tables** are available for this paper in the same file (Section 1).

**Supplementary Information** is available for this paper in the same file (Section 2).

**Correspondence and requests** for materials should be addressed to Y.S. and R.B.

## 1 Extended Data

## 2 Supplementary Information

In this supplement, we present additional description and analysis of the simulated models. The first part provides details of all models simulated in the paper; the second part analyses target alignment; the third part analyses the prospective index of PCNs; and the fourth part discusses the relationship between prospective configuration and target propagation.

### 2.1 Details of simulated models

This section gives more details of all simulated models. The general ideas of *energy-based networks* (EBNs) and *artificial neural networks* (ANNs), and of one particular EBN, the *predictive coding network*^{30, 43, 55} (PCN), have been described in the Main text and Methods. The PCN is again included here along with the other simulated models to provide descriptions in a unified form, facilitating the reproduction of our reported results. Complete code and full documentation reproducing all simulation results will be made publicly available at https://github.com/YuhangSong/A-New-Perspective upon publication of this work.

Algorithms 2 to 7 describe how the four models simulated in this paper predict and learn. These four models are: PCN, backpropagation, *GeneRec*^{89}, and *Almeida-Pineda*^{90–92}. Among the four, PCN and GeneRec are the two EBNs we investigate; backpropagation and Almeida-Pineda are the two ANNs we investigate. Specifically, PCN is compared against backpropagation because it has been established that PCNs are closely related to backpropagation^{30, 37} and make the same prediction given the same weights and input pattern^{30}. Therefore, we simulated prediction in these two algorithms in the same way (Algorithm 2); however, they learn differently (cf. Algorithms 3 and 4). The other EBN, GeneRec, describes learning in recurrent networks, and an ANN with this architecture is not trained by standard backpropagation, but by a modified version proposed by Almeida and Pineda^{90–92} (hence called the *Almeida-Pineda* algorithm). Thus, GeneRec is compared against Almeida-Pineda because they make the same prediction given the same weights and input pattern^{89}. Therefore, we simulated prediction in these two algorithms in the same way (Algorithm 5); however, they learn differently (cf. Algorithms 6 and 7). In summary, PCN and backpropagation are the EBN and ANN working in a feed-forward architecture, respectively; GeneRec and Almeida-Pineda are the EBN and ANN working in a recurrent architecture, respectively.

### Predict with backpropagation or *predictive coding network*^{30, 43, 55} (PCN)

In particular, PCN and backpropagation work in a network where the prediction is made from the input through a series of forward weights {*w*^{1}, *w*^{2}, ⋯, *w*^{L}}; GeneRec and Almeida-Pineda work in a network where the prediction is made from the input through a mixture of forward weights {*w*^{1}, *w*^{2}, ⋯, *w*^{L}} and backward weights {*m*^{1}, *m*^{2}, ⋯, *m*^{L}}. The forward weights {*w*^{1}, *w*^{2}, ⋯, *w*^{L}} and backward weights {*m*^{1}, *m*^{2}, ⋯, *m*^{L}} are not necessarily related. This architecture is also similar to the continuous Hopfield model^{112, 113}. Unlike in some previous studies^{29}, here we focus on layered networks, where the sets of neurons at adjacent layers *x*^{l} and *x*^{l+1} are connected by synaptic weights. Thus, we define two sets of weights for GeneRec and Almeida-Pineda, which work in the recurrent network: *w*^{l} are the forward weights connecting from *x*^{l} to *x*^{l+1}; *m*^{l} are the backward weights connecting from *x*^{l+1} to *x*^{l}. Also note that GeneRec has been explored and re-discovered in recent works^{50, 114}, showing how a closely related algorithm resembles backpropagation when the backward weights are the transposes of the forward weights, *m*^{l} = (*w*^{l})^{T} (or, for a fully-connected network in their context, *w*_{i,j} = *w*_{j,i}), and how the extreme version of the algorithm approximates backpropagation^{29}.

Some common notation in the algorithms: *α* is the learning rate for the weight update; *γ* is the integration step of relaxation, whose length is also specified (for the two EBNs, PCN and GeneRec); *s*^{in} and *s*^{target} are the input and target patterns, respectively. For Almeida-Pineda, which requires an additional iterative process to propagate the error, *β* is the integration step of this iterative process, whose length is likewise specified. In our simulations, we use *β* = 0.01.

All simulated models work in mini-batch mode; that is, one iteration updates the weights for one step on a mini-batch of data randomly sampled from the training set for classification tasks. The sampling is without replacement, i.e., the same examples are not sampled again before the completion of an epoch, which is when the entire training set has been sampled once. For example, for a dataset of 1000 examples with a batch size (number of examples in a mini-batch) of 10, each iteration updates the weights for one step on 10 examples, and it takes 100 such iterations to complete one epoch. To implement Algorithms 2 to 7 described below in mini-batch mode, one can simply add an extra dimension, of size batch size, to all the neuron-specific vectors in the algorithms, such as *x*^{l}, *ε*^{l}, etc., and then reduce this dimension by summing over it when computing the weight update Δ*w*^{l} (and Δ*m*^{l} if the model is GeneRec or Almeida-Pineda).

Note that learning with Almeida-Pineda involves relaxation of the model, i.e., updating neural activity, in lines 5-12 of Algorithm 6. However, its function is to make a prediction with the current weights and input pattern so that the error on the output neurons can be computed (in the following line 13), similar to the function of the "forward pass" of backpropagation in lines 2-4 of Algorithm 3. The neural activity in the Almeida-Pineda model is fixed during the spreading of error, as in backpropagation. Thus, Almeida-Pineda is classified as an ANN rather than an EBN (which updates neural activity during the spreading of error).
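The epoch bookkeeping of the mini-batch scheme can be sketched as follows. This is a minimal illustration, not the released code: the function name `minibatch_indices` and the use of NumPy are our own.

```python
import numpy as np

def minibatch_indices(n_examples, batch_size, rng):
    """Yield index mini-batches sampled without replacement; one full pass is one epoch."""
    order = rng.permutation(n_examples)  # shuffle once per epoch
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(1000, 10, rng))

# 1000 examples with a batch size of 10 give 100 iterations per epoch,
# and every example is visited exactly once before the epoch completes.
assert len(batches) == 100
assert sorted(np.concatenate(batches)) == list(range(1000))
```

In a full implementation, each yielded batch would index the neuron-specific vectors along the extra batch dimension, and the weight update would sum over that dimension.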

### 2.2 Target alignment for networks without hidden layers (Fig. 3d)

Fig. 3d shows that the target alignment for models without hidden layers, trained either with PC or BP, is exactly one, and here we prove this property analytically. Without hidden layers, PC and BP are identical algorithms. In a linear network, the change of the weights *w*^{1} is:

We denote the output after learning by *x*′^{2}. The change of the output, *x*′^{2} – *x*^{2}, is:

Here (*x*^{1})^{T} *x*^{1} is a positive scalar (if at least one entry in *x*^{1} is non-zero). Thus,

According to the definition of target alignment, namely the cosine similarity of the direction of the target (i.e., *ε*^{2}) and the direction of learning (i.e., *x*′^{2} – *x*^{2}), the target alignment of this network is exactly one. This conclusion also applies to networks with a nonlinear activation function.
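The derivation above can be checked numerically. The sketch below builds a linear network without hidden layers, applies one gradient-descent weight update, and confirms that the change in the output is parallel to the error *ε*^{2}; the sizes and learning rate are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, alpha = 5, 3, 0.1

w1 = rng.standard_normal((n_out, n_in))  # single weight matrix, no hidden layers
x1 = rng.standard_normal(n_in)           # input pattern
s_target = rng.standard_normal(n_out)    # target pattern

x2 = w1 @ x1            # linear prediction (identical for PC and BP)
eps2 = s_target - x2    # error on output neurons: the direction of the target

# One gradient-descent weight update of the linear network
w1_new = w1 + alpha * np.outer(eps2, x1)
x2_new = w1_new @ x1

change = x2_new - x2    # the direction of learning: alpha * (x1^T x1) * eps2
cos = change @ eps2 / (np.linalg.norm(change) * np.linalg.norm(eps2))
assert np.isclose(cos, 1.0)  # target alignment is exactly one
```

Because the output change equals the error scaled by the positive scalar *α*(*x*^{1})^{T} *x*^{1}, the cosine is one regardless of the random draw.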

### 2.3 Formal proofs of prospective index of PCN (Extended Data Fig. 1)

This section formally proves two properties of the prospective index *ϕ*^{l} of the *predictive coding network*^{30, 43, 55} (PCN) that can be observed in Extended Data Fig. 1d. To briefly recap, the prospective index *ϕ*^{l} quantifies to what extent the hidden neural activity of the network following clamping of the output neurons to a target pattern shifts toward the hidden neural activity following the subsequent weight modification. Below we show two properties visible in Extended Data Fig. 1d. First, the prospective index of the first hidden layer (*ϕ*^{2}) in a PCN is always one. Second, the prospective index in the other layers is close to one because the weights *W* in a PCN are updated towards a configuration *W** whose prospective index is one.

#### 2.3.1 Prospective index of the first hidden layer of PCN is always one

We assume that the model does not make a perfect prediction with the current weights, so that the error in the prediction drives the learning. As defined in Extended Data Fig. 1a, the vectors *v*^{⊕,l} and *v*^{′,l} describe the changes in hidden neuron activity due to the target pattern being provided and due to learning, respectively. Specifically, for layer *l* = 2, these vectors are:

We will now show that for PCN the above vectors *v*^{⊕,2} and *v*^{′,2} point in the same direction. The change in activity due to learning, *v*^{′,2}, is equal to

Since the value nodes of the first (input) layer *x*^{1} are always fixed to the input signal *s*^{in}, the above Eq. (23) can further be written as,

Using Eqs. (13) and (11), we write

In Eq. (25), denotes the inputs to neurons in layer *l*, i.e., . Note that , because both of these quantities are equal to *w*^{1} *f*(*s*^{in}) (the input of the first hidden layer (*l* = 2) does not change in response to the output neurons being clamped). Using , the above Eq. (25) can further be written as,

Note that *α*(*f*(*s*^{in}))^{T} *f*(*s*^{in}) is a positive scalar (if at least one entry in the input pattern is non-zero). Comparing Eqs. (21) and (26), we can see that the vectors *v*^{′,2} and *v*^{⊕,2} are scaled versions of each other, hence the cosine of the angle between them is equal to 1, and thus the prospective index is also equal to 1 (in the limit of *κ* → 0).

#### 2.3.2 Weights in PCN are updated towards a configuration with prospective index of one

As seen in Extended Data Fig. 1d, the prospective index for layers *l* > 2 is very close to one. To provide an intuition for why this is the case, in this section we demonstrate how PCNs would need to be modified to have a prospective index equal to 1. We refer to this modified model as target-PCN, and calculate its prospective index.

As in the previous section, we assume that the model does not make a perfect prediction with the current weights, so that the error in the prediction drives the learning. We start by recapping what happens, in sequence, in one iteration of the standard PCN:

1. Starting from relaxation with only the input neurons clamped to the input pattern (⊖) and with the current weights *W*, the hidden neuron activity settles to:
2. Both input and output neurons are clamped to the input and target pattern, respectively (⊕), and then the hidden neuron activity is relaxed to:
3. The weights *W* are updated for one step to *W*′ to decrease the energy, while the hidden neuron activity stays still from the last step:
4. The output neurons are freed but the input neurons are still clamped to the input pattern, and then the hidden neuron activity is relaxed to:

In the above step 3, the weights are updated for one step from *W* to *W*′. However, one can instead investigate the case of updating the weights for many steps until convergence to *W** in step 3. This results in weights *W** that represent "the target towards which the weights *W* are updated". Thus, we call this variant "target-PCN"; it is summarized in Algorithm 8. Specifically, target-PCN replaces the above steps 3 and 4 of the standard PCN with:

3. The weights are updated for many steps from *W* to *W** to decrease the energy until convergence, while the hidden neuron activity stays still from the last step: ;
4. The output neurons are freed but the input neurons are still clamped to the input pattern, and then the hidden neuron activity is relaxed to: ;

In the following, we demonstrate that the prospective index of target-PCN is one for all layers. First, note that the minimum of the energy *E* of PCN is zero, since the energy function is a sum of quadratic terms, i.e., Eq. (6). Second, note that this energy *E* can be optimized to its minimum of zero by optimizing only *W*. In particular, the local energy term of layer *l* is:

In the above equation, *x*^{l} – *w*^{l–1} *f*(*x*^{l–1}) can be optimized to produce a zero vector by optimizing only *w*^{l–1}, as long as *f*(*x*^{l–1}) is not a zero vector. Specifically, let us denote all the non-zero entries in *f*(*x*^{l–1}) by , where *I* is the set of indices *i* such that is non-zero. Since *f*(*x*^{l–1}) is not a zero vector, . To demonstrate that there exists a solution for such that , we construct an example of such a solution: pick one index *g* from *I*, then have and . Thus, as long as *f*(*x*^{l–1}) is not a zero vector, there exists a solution of *w*^{l–1} that makes *x*^{l} – *w*^{l–1} *f*(*x*^{l–1}) a zero vector.

Thus, in step 3 of target-PCN, the energy of the network is at its minimum of zero. This further implies that in step 4 of target-PCN, the neural activity does not move, i.e.,

According to the definition of the prospective index in Extended Data Fig. 1a-b, the prospective index of this target-PCN (*ϕ*^{*,l}) is:

This theoretical result is further confirmed by the empirical observation in Extended Data Fig. 1d. Since the standard PCN modifies the weights in a similar direction to target-PCN, it is likely to have a similar prospective index.

### 2.4 Relationships of predictive coding networks to target propagation (Extended Data Fig. 2)

In Extended Data Fig. 2, we illustrate that prospective configuration, in particular the *predictive coding network*^{30, 43, 55} (PCN), has a close relationship to a separate and influential model of credit assignment, target propagation^{60}. Here we formally prove these observations. In particular, we show that:

1. In an output-constrained PCN, the neural activity after relaxation converges to the local target;
2. In an input-output-constrained PCN, the neural activity after relaxation converges to a weighted sum of the predicting activity and the local target.

In the above, the predicting activity refers to the neural activity when the model is making a prediction; it is the same for both backpropagation and PCN, as they compute the same neural activity when making a prediction. The local target is described in the following subsection.

#### 2.4.1 Target-propagation

We first briefly review target propagation^{60}. The key insight behind target propagation is that, rather than updating the weights based on the gradient of a loss function, one can instead explicitly compute the optimal activity of the neurons that would produce the desired target pattern, and then update the weights so as to nudge the current neural activity towards this optimal activity directly. We call this optimal activity the *local target*, since if the neurons took this activity, the network would produce the desired target pattern. Importantly, we can directly compute the local target in terms of the *inverses* of the weights and activation functions. Namely, suppose that we have a three-layer network with activation function *f*(⋅), weight matrices *w*^{1}, *w*^{2}, *w*^{3}, and an input pattern *s*^{in}. The output of this network is *x*^{4} = *w*^{3} *f*(*w*^{2} *f*(*w*^{1} *f*(*s*^{in}))). Suppose instead that we do not want the network to output *x*^{4} for a given *s*^{in}, but rather a given target pattern *s*^{target}. Then, the activity at the first layer that would produce this desired output can be exactly computed by inverting^{1} the network. From this, we can define a recursion of one local target in terms of another at the layer above,

Like prospective configuration, these targets are also *prospective* in some sense, since they represent, counterfactually, what the neural activity *should have been* in order to produce the desired target pattern.
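For a linear network (identity activation, so inverting *f* is trivial), the recursion of local targets can be sketched with pseudoinverses. Square random weight matrices are used here purely so that the pseudoinverse coincides with the exact inverse and the check is exact; this is our illustrative choice, not a constraint of target propagation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4  # square layers: random square matrices are invertible almost surely
w1, w2, w3 = (rng.standard_normal((n, n)) for _ in range(3))
s_target = rng.standard_normal(n)

# Recursively compute each local target from the one at the layer above,
# using the (pseudo)inverse of the weights in place of gradient information.
t3 = np.linalg.pinv(w3) @ s_target   # activity layer 3 "should have" taken
t2 = np.linalg.pinv(w2) @ t3
t1 = np.linalg.pinv(w1) @ t2

# Feeding a local target forward reproduces the desired target pattern exactly.
assert np.allclose(w3 @ w2 @ w1 @ t1, s_target)
```

With non-square weights, `np.linalg.pinv` gives the least-squares inverse discussed in the footnote, and the reconstruction holds only approximately.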

#### 2.4.2 Output-constrained PCN

In this subsection, we investigate the “output-constrained PCN”: in this PCN, the input neurons are not clamped to the input pattern, but the output neurons are clamped to the target pattern. We show that in this PCN, the activity after relaxation is precisely equal to the local target. Since *x*^{1} is not constrained to the input pattern, we can examine its dynamics by setting *l* = 1 in Eq. (12). Since there is no error term or error nodes at the input layer, only the latter term remains when setting *l* = 1 in Eq. (12) (note that here we write in matrix and vector form):

Assuming the above dynamics have converged, we can set Δ*x*^{1} = **0** in the above equation and solve for *x*^{1}, obtaining the converged value of *x*^{1}:

Now we examine the dynamics of *x*^{2} by setting *l* = 2 in Eq. (12):

Substituting the solved *x*^{1}, i.e., Eq. (33), into the above equation, we have:

Assuming the above dynamics have converged, we can set Δ*x*^{2} = **0** in the above equation and solve for *x*^{2}, obtaining the converged value of *x*^{2}:

One can now see that the proof proceeds recursively until *l* = *L*, where *x*^{L+1} is fixed to the target pattern *s*^{target}:
which is exactly the recursive formula of the local target in target propagation, i.e., Eq. (30). Thus, the neural activity of the output-constrained PCN after relaxation equals the local target.
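This convergence can be verified numerically for a small linear PCN. The sketch below clamps the output neurons, leaves the input neurons free, relaxes the activity by gradient descent on the quadratic energy, and checks that the settled activity matches the pseudoinverse (local-target) chain. The weights are kept close to the identity only so that the relaxation converges quickly; the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
# Well-conditioned random weights so gradient relaxation converges quickly.
w1 = np.eye(n) + 0.2 * rng.standard_normal((n, n))
w2 = np.eye(n) + 0.2 * rng.standard_normal((n, n))
s_target = rng.standard_normal(n)

x3 = s_target          # output neurons clamped to the target pattern
x1 = np.zeros(n)       # input neurons left free (not clamped)
x2 = np.zeros(n)

gamma = 0.05           # integration step of the relaxation
for _ in range(8000):
    eps2 = x2 - w1 @ x1            # prediction errors of the linear PCN
    eps3 = x3 - w2 @ x2
    x1 += gamma * (w1.T @ eps2)    # no error node at the input layer
    x2 += gamma * (-eps2 + w2.T @ eps3)

# After relaxation the activity equals the local target of target propagation.
assert np.allclose(x2, np.linalg.pinv(w2) @ s_target, atol=1e-5)
assert np.allclose(x1, np.linalg.pinv(w1) @ x2, atol=1e-5)
```

At the fixed point both prediction errors vanish, so the energy reaches its minimum of zero and each layer's activity is the (pseudo)inverse of the layer above, exactly the target-propagation recursion.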

#### 2.4.3 Input-output-constrained PCN

In this subsection, we investigate the “input-output-constrained PCN”: in this PCN, both input and output neurons are clamped to the input and target patterns, respectively. We show that in this PCN, the activity after relaxation is a weighted sum of the predicting activity and the local target. In particular, since for an input-output-constrained PCN we can only solve for the equilibrium after relaxation analytically in the linear case, we prove this for a linear PCN. Nevertheless, the analysis still provides useful insights. Looking at the network dynamics at a given layer *l*, i.e., Eq. (12), we can write the dynamics in the linear case as,

If we then set Δ*x*^{l} = **0** and solve for *x*^{l}, we obtain,

If we assume that the norm of the weights is large compared to the identity matrix *I*, i.e., we consider , the above equilibrium solution can further be approximated by:
where the equilibrium solution is simply a weighted sum of the predicting activity and the local target.

These theoretical results let us view prospective configuration from another angle: prospective configuration tends to move the activity from the predicting activity towards the local target that would be computed by target propagation. These local targets are intrinsically prospective in the sense that they are the neural activity which, if produced, would minimize the loss function (i.e., produce the desired target pattern). In other words, they are in some sense the activity that the weight updates of the network are trying to achieve. By moving towards these local targets directly in activity space, as opposed to only in weight space as backpropagation does, prospective configuration can often propose faster and more efficient updates than can be achieved by backpropagation.
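The weighted-sum equilibrium can likewise be checked numerically for a single hidden layer of a linear input-output-constrained PCN. The closed form below is obtained by setting the gradient of the quadratic energy with respect to the hidden activity to zero, consistent with the derivation above; the step size and iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
w1 = rng.standard_normal((n, n))
w2 = rng.standard_normal((n, n))
s_in = rng.standard_normal(n)
s_target = rng.standard_normal(n)

x1 = s_in            # input neurons clamped to the input pattern
x3 = s_target        # output neurons clamped to the target pattern
x2 = w1 @ x1         # start relaxation from the predicting activity

gamma = 0.02         # integration step of the relaxation
for _ in range(20000):
    eps2 = x2 - w1 @ x1
    eps3 = x3 - w2 @ x2
    x2 += gamma * (-eps2 + w2.T @ eps3)  # only the hidden layer is free

# Closed-form equilibrium: (I + w2^T w2)^{-1} (w1 s_in + w2^T s_target),
# a weighted combination of the predicting activity and the local target.
x2_star = np.linalg.solve(np.eye(n) + w2.T @ w2, w1 @ s_in + w2.T @ s_target)
assert np.allclose(x2, x2_star, atol=1e-6)
```

When the weights dominate the identity term, the solve is dominated by (*w*^{2})^{†} *s*^{target}, i.e., the local target, matching the approximation in the text.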

## Acknowledgements

We thank Timothy Behrens for comments on the manuscript, and Andrew Saxe for discussions. Yuhang Song was supported by the China Scholarship Council under the State Scholarship Fund and J.P. Morgan AI Research Awards. Beren Millidge and Rafal Bogacz were supported by the Biotechnology and Biological Sciences Research Council grant BB/S006338/1 and Medical Research Council grant MC_UU_00003/1. Thomas Lukasiewicz and Tommaso Salvatori were supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1 and by the AXA Research Fund. Zhenghua Xu was supported by the National Natural Science Foundation of China under the grant 61906063, by the Natural Science Foundation of Hebei Province, China, under the grant F2021202064, by the Natural Science Foundation of Tianjin City, China, under the grant 19JCQNJC00400, by the “100 Talents Plan” of Hebei Province, China, under the grant E2019050017, and by the Yuanguang Scholar Fund of Hebei University of Technology, China.

## Footnotes

↵1 Note that in realistic networks the weight matrices are not all square, so an exact inverse (*w*^{l})^{−1} does not exist. Instead, we can compute approximations of the inverse using the Moore-Penrose pseudoinverse^{115} (*w*^{l})^{†}, which is the least-squares solution to the optimization problem argmin_{*w*} ||*w*^{l} *w* – *I*||.

## References
