## Abstract

Discriminating distinct objects and concepts from sensory stimuli is essential for survival. Our brains accomplish this feat by forming disentangled internal representations in deep sensory networks shaped through experience-dependent synaptic plasticity. To elucidate the principles that underlie sensory representation learning, we derive a local plasticity model that shapes latent representations to predict future activity. This Latent Predictive Learning (LPL) rule conceptually extends Bienenstock-Cooper-Munro (BCM) theory by unifying Hebbian plasticity with predictive learning. We show that deep neural networks equipped with LPL develop disentangled object representations without supervision. The same rule accurately captures neuronal selectivity changes observed in the primate inferotemporal cortex in response to altered visual experience. Finally, our model generalizes to spiking neural networks and naturally accounts for several experimentally observed properties of synaptic plasticity, including metaplasticity and spike-timing-dependent plasticity (STDP). We thus provide a plausible normative theory of representation learning in the brain while making concrete testable predictions.

## Introduction

Recognizing objects and concepts from sensory inputs is crucial for perception. To that end, brains must effectively distinguish between highly entangled stimuli. For instance, the activity patterns that the retinal ganglion cells in our eyes send to the brain in response to viewing a cat or a dog may, in some cases, be more similar than two different views of the same dog (Fig. 1a). Yet, we have no problem distinguishing cats from dogs because our brains are able to internally represent them as separate objects or categories that remain invariant to different views and their context. What processing enables such invariant object recognition?

Visual stimuli evoke distinct activity patterns in individual sensory neurons that correspond to points in a high-dimensional space spanned by the neuronal activity levels. All possible stimuli of one category, for instance images of dogs, lie within a subregion of this space, a manifold, whereas other categories lie on different manifolds. However, for many stimuli, the corresponding manifolds are *entangled* like crumpled-up sheets of paper, which makes it impossible for downstream neurons to decode the underlying stimulus category through simple linear combinations of their inputs. *Disentangling* behaviorally relevant categories therefore requires deep sensory networks that can extract invariant linearly separable category representations, such as those found in the visual system (Fig. 1b; [3]). Crucially, successful disentangling requires specific network connectivity, which is thought to be shaped through experience-dependent plasticity [4]. However, current data-driven plasticity models are unable to account for the emergence of disentangled representations in deep biological neural networks.

In contrast, artificial deep neural networks (DNNs) used in machine learning *do* yield highly disentangled representations by dint of supervised training algorithms. These algorithms gradually transform the output of a neural network model to match a set of targets or “labels” associated with given input data. Over the last decade, the combination of supervised learning with rich datasets and larger network models has created powerful machine learning systems that achieve human-level performance on a diversity of tasks [5, 6]. However, what is most striking from a neuroscience perspective is that DNNs also reproduce essential aspects of the representational geometry of biological neural networks, even when they are not explicitly optimized to do so [7–9]. This similarity suggests that DNNs are valuable tools to elucidate neural information processing in the brain [10, 11]. Yet, two central issues remain concerning their interpretation as models of biological learning. First, the training algorithms used in deep learning are typically end-to-end, i.e., the algorithms optimize network connections across the hierarchy with the goal of minimizing errors at the network’s output. Doing so requires solving the “credit assignment problem,” whereby targeted information must be sent to neurons in the network’s hidden layers about their respective contributions to the output errors. It remains unclear how neurobiology solves the credit assignment problem [12].

Second, supervised learning requires input data with specific semantic “labels” that are not conceivably available to animals and humans. Here, self-supervised learning (SSL), a family of unsupervised machine learning algorithms, may offer a remedy as they do not need labeled data but instead require that internal network representations belonging to related inputs be predictive of one another [13, 14]. Hence, network representations themselves act as targets for similar inputs. For example, a network has to predict visual features across different parts of an image or predict future representations of sensory inputs from those in the past. It has been proposed that biological neural networks may similarly rely on prediction as a learning principle [15–18], for instance, by extracting slowly varying features from their sensory inputs as done in slow feature analysis (SFA) [19, 20]. In other words, predictive learning posits that biological networks change their connections with the goal of “pulling” together related internal representations, and therefore form similar representations for stimuli that frequently occur close in time (Fig. 1c).

However, a major issue with this strategy is that without any forces opposing this representational pull, such learning inevitably leads to “representational collapse,” whereby all inputs are mapped to the same internal activity pattern which precludes linear separability (Fig. 1c). One typical solution to this issue is to add forces that “push” representations corresponding to *different* unrelated stimuli away from one another (Fig. 1d). This is usually done by invoking so-called “negative samples,” which are inputs that do not frequently occur together in time. This approach has been linked to biologically plausible three-factor learning rules [21–23], but it requires constant switching of the sign of the plasticity rule depending on whether two successive inputs are related to each other or not. Yet, it is unknown whether and how such a rapid sign switch is implemented in the brain.

Another possible solution for avoiding representational collapse without negative samples is to prevent neuronal activity from becoming constant over time, for instance, by maximizing the variance of the activity [24]. Interestingly, variance maximization is a known signature of Hebbian plasticity [25, 26], which has been found ubiquitously in the brain [27–29]. While Hebbian learning is usually thought of as the primary plasticity mechanism rather than playing a supporting role, Hebbian plasticity alone has had limited success at disentangling representations in deep hierarchical neural network models [10, 30, 31].

In this article, we introduce Latent Predictive Learning (LPL), a conceptual learning framework that overcomes this limitation and reconciles SSL with Hebbian plasticity. Specifically, the learning rules derived within our framework combine a BCM-like plasticity threshold [1, 2] as observed in experiments (Fig. 1e; [27, 32–34]), with a predictive component inspired by SFA [19, 20] that renders single neurons selective to temporally contiguous features in their inputs. When applied to the layers of deep hierarchical networks, LPL yields disentangled representations of objects present in natural images while neither requiring labels nor negative samples. Crucially, LPL effectively disentangles representations despite being a purely local learning rule, i.e., without requiring explicit spatial credit assignment mechanisms. We demonstrate that LPL captures central findings of unsupervised visual learning experiments in monkeys. Finally, the corresponding spiking LPL rule learns predictive representations in spiking neural networks and naturally yields classic spike-timing-dependent plasticity (STDP) windows and a characteristic firing-rate dependence observed in neurobiology. In light of these findings, we argue that LPL constitutes a plausible plasticity mechanism that may underlie representation learning in biological sensory networks.

## Results

To study the interplay of Hebbian and predictive plasticity in sensory representation learning, we derived a plasticity model from an SSL objective function that is reminiscent of and extends the classic BCM learning rule [1, 2] (Methods; Supplementary Note S1). According to our learning rule, the temporal dynamics of a synaptic weight $W_j$ are given by

$$\frac{\mathrm{d}W_j}{\mathrm{d}t} = \eta\, f'(a(t))\, x_j(t) \left( -\big(z(t) - z(t-\Delta t)\big) + \lambda\,\frac{z(t) - \bar z(t)}{\bar\sigma_z^2(t)} \right), \tag{1}$$

where $\eta$ is a small positive learning rate, $x_j(t)$ denotes the activity of the presynaptic neuron $j$, $z(t) = f(a(t))$ is the neuronal activity with the activation function $f$ (Fig. 2a), $a(t) = \sum_k W_k x_k(t)$ is the net input current, and $\lambda$ is a constant that sets the relative strength of the two terms in parentheses. We call the first term in parentheses the predictive term because it promotes learning of slow features [19, 20] by effectively “pulling together” postsynaptic responses to temporally consecutive input stimuli. Importantly, it cancels when the neural activity does not change and, therefore, accurately *predicts* future activity. In the absence of any additional constraints, the predictive term leads to collapsing neuronal activity levels [19]. In our model, collapse is prevented by the Hebbian term, in which the running average $\bar z(t)$ of the neuronal activity appears, which is reminiscent of BCM theory [1, 2]. Its strength further depends on an online estimate $\bar\sigma_z^2(t)$ of the variance of the postsynaptic activity. This modification posits an additional metaplasticity mechanism controlling the balance between predictive and Hebbian plasticity depending on the postsynaptic neuron’s past activity.
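For concreteness, the update in Eq. (1) can be sketched in a few lines of code. The following is our illustrative discrete-time rendering for a single rate neuron, not the authors’ reference implementation; the ReLU activation, the balancing constant `lmbda`, and treating the running mean `z_bar` and variance `sigma2` as externally maintained estimates are our simplifying assumptions.

```python
import numpy as np

def lpl_update(W, x_prev, x, z_bar, sigma2, eta=1e-3, lmbda=1.0, eps=1e-8):
    """One discrete-time LPL weight update for a single rate neuron.

    The predictive term pulls z(t) toward z(t - dt); the Hebbian term pushes
    z(t) away from its running mean z_bar, scaled inversely by the running
    variance estimate sigma2.
    """
    f = lambda a: np.maximum(a, 0.0)            # activation (ReLU, our choice)
    dfda = lambda a: np.where(a > 0, 1.0, 0.0)  # derivative f'(a)
    a, a_prev = W @ x, W @ x_prev
    z, z_prev = f(a), f(a_prev)
    predictive = -(z - z_prev)                  # vanishes when activity is predicted
    hebbian = lmbda * (z - z_bar) / (sigma2 + eps)
    W = W + eta * dfda(a) * (predictive + hebbian) * x
    return W, z
```

Note that when the activity is constant across time steps and equal to its running mean, both terms vanish and the weight is stationary.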

To make the link to BCM explicit, we rearrange the terms in Eq. (1) to give

$$\frac{\mathrm{d}W_j}{\mathrm{d}t} = \eta\,\frac{\lambda}{\bar\sigma_z^2(t)}\, f'(a(t))\, x_j(t)\, \big(z(t) - \Theta(t)\big) \quad\text{with}\quad \Theta(t) = \bar z(t) + \frac{\bar\sigma_z^2(t)}{\lambda}\big(z(t) - z(t-\Delta t)\big), \tag{2}$$

where Θ(*t*) corresponds to a time-dependent sliding plasticity threshold (Fig. 2b). While the precise shape of the learning rule depends on the choice of neuronal activation function, its qualitative behavior remains unchanged as long as the function is monotonic (Supplementary Fig. S1). Despite the commonalities, however, there are three essential differences from the BCM model. First, in our model, the threshold depends only linearly on $\bar z(t)$ (Fig. 2b), whereas in BCM, the threshold is typically a supralinear function of the moving average $\bar z(t)$. Second, the added dependence on the predictive term constitutes a separate mechanism that modulates the plasticity threshold depending on the rate of change of the postsynaptic activity (Fig. 2c,d). Third, our model adds a variance dependence that has diverse effects on the sliding threshold when the neuronal output does not accurately predict future activity and, thus, changes rapidly (Fig. 2c,d). We will see that these modifications are crucial for representation learning from the temporal structure of sensory inputs. Because the predictive term encourages neurons to predict future activity at their output, and thus in latent space rather than in the input space, we refer to Eq. (1) as the Latent Predictive Learning (LPL) rule.
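The rearrangement can be checked numerically: the two-term form −(z − z′) + λ(z − z̄)/σ̄² equals a variance-scaled gain times the distance of z to a sliding threshold Θ = z̄ + (σ̄²/λ)(z − z′). A quick sketch with arbitrary illustrative values:

```python
import numpy as np

# Arbitrary illustrative values for the quantities in the LPL rule.
z, z_prev, z_bar, var, lam = 1.3, 0.9, 0.6, 0.25, 1.0

# Two-term form: predictive term plus variance-scaled Hebbian term.
two_term = -(z - z_prev) + lam * (z - z_bar) / var

# BCM-like form: gain times distance to a sliding threshold Theta.
theta = z_bar + (var / lam) * (z - z_prev)
bcm_form = (lam / var) * (z - theta)

assert np.isclose(two_term, bcm_form)  # the two forms are algebraically equal
```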

### LPL finds contiguous features in temporal data

To investigate the functional advantages of LPL over BCM and other classic Hebbian learning rules (Supplementary Note S2) in a well-controlled setting, we first designed a synthetic two-dimensional learning task in which we parametrically controlled the proportion of predictable changes between subsequent observations (Fig. 2e; Methods). Specifically, the data sequence consisted of noisy inputs distributed over two clusters separated along the x-axis. Consecutive inputs had a high probability of staying within the same cluster, thus making cluster identity a temporally contiguous feature. By varying the noise amplitude $\sigma_y$ in the y-direction, we effectively controlled the amount of unpredictable changes in the input sequences. We exposed a single rate neuron model to different datasets with varying values of $\sigma_y$, while the two input connections were plastic and evolved according to the LPL rule (Eq. (1)) until convergence. We then measured the selectivity of the neuron to cluster identity, defined as the normalized difference between the neuron’s average response to inputs from the two clusters (Methods).

We found that LPL rendered the neuron selective to cluster identity for a large range of $\sigma_y$ values (Fig. 2f). However, without the predictive term, the neuron’s selectivity to cluster identity was lost for large $\sigma_y$ values. This behaviour was expected because omitting the predictive term renders the learning rule purely Hebbian, which biases selectivity toward directions of highest variance. To illustrate this point, we repeated the same simulated learning experiments with Oja’s rule, a classic Hebbian rule that finds the principal component of its input, and found similar qualitative behaviour, except that the change was more abrupt at $\sigma_y \approx 1$. Thus, LPL’s predictive term changes the learning rule’s behavior substantially by selecting predictable features in the input instead of directions of high variance.

Next, we sought to confirm that the Hebbian term is essential for LPL to prevent representational collapse. To that end, we simulated learning with $\sigma_y = 1$ under LPL without the Hebbian term (cf. Eq. (1)) and measured the mean output activity of the neuron in response to random inputs in the sequence. We observed that without the Hebbian term, the neuron’s activity collapsed to zero, as expected (Fig. 2g). Conversely, learning with the Hebbian term but without the predictive term did not result in collapse. Therefore, LPL’s Hebbian component is essential to prevent activity collapse.

Moreover, Hebbian plasticity needs to be dynamically regulated to prevent runaway activity. In LPL, this regulation is achieved by inversely scaling the Hebbian term by a moving estimate $\bar\sigma_z^2(t)$ of the variance of the postsynaptic activity. Without this variance modulation, neural activity either collapsed or succumbed to runaway activity, depending on whether the predictive term or the Hebbian term was dominant (Supplementary Note S3). Either case precluded the neuron from acquiring cluster selectivity. We verified that these findings generalized to higher-dimensional tasks with more complex covariance structure (Supplementary Note S4). Hence, the combination of predictive with variance-modulated Hebbian plasticity in LPL is needed to learn invariant predictive features independently of the covariance structure of the data.
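To make the Hebbian baseline in this comparison concrete, the sketch below (our construction, with arbitrary parameter choices) trains a single linear neuron with Oja’s rule on a two-cluster sequence of the kind described above and shows that its selectivity simply tracks the direction of highest variance: the cluster axis when the y-noise is small, the unpredictable y-direction when it is large.

```python
import numpy as np

def oja_pc(sigma_y, steps=30000, eta=0.002, seed=0):
    """Train a linear neuron with Oja's rule on the two-cluster sequence
    and return its normalized weight vector (the learned principal axis)."""
    rng = np.random.default_rng(seed)
    w = 0.1 * rng.normal(size=2)
    cluster = 1.0
    for _ in range(steps):
        if rng.random() < 0.05:            # rare switches: cluster identity is slow
            cluster = -cluster
        # x-coordinate carries cluster identity; y-coordinate is pure noise.
        x = np.array([cluster + 0.1 * rng.normal(), sigma_y * rng.normal()])
        z = w @ x
        w += eta * z * (x - z * w)         # Oja's rule: Hebbian term + decay
    return w / np.linalg.norm(w)

w_low = oja_pc(sigma_y=0.2)   # little unpredictable variance
w_high = oja_pc(sigma_y=2.0)  # unpredictable y-variance dominates
```

With low y-noise the weight aligns with the cluster axis; with high y-noise the purely Hebbian rule instead follows the unpredictable direction, losing cluster selectivity — precisely the failure mode that LPL’s predictive term corrects.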

### LPL disentangles representations in deep hierarchical networks

As we view a scene, we often move through it, which causes us to see the objects in the scene from different angles. Similarly, the objects, animals, or people in the scene may themselves move. Finally, our gaze constantly shifts due to saccadic eye movements (Fig. 3a; [35]). All of these influences result in different visual projections of the same objects within a scene onto our retina. Therefore, the objects themselves constitute temporally contiguous features in normal vision.

We thus wondered whether training an artificial neural network with LPL on sequences of images in which object identity is preserved results in disentangled object representations at the network’s output. To that end, we built a convolutional DNN model in which we “stacked” layers whose synaptic connections evolved according to the LPL rule. Additionally, we included a decorrelation term to prevent neurons within a given layer from becoming correlated. In biological neural networks with separate excitatory and inhibitory neuron types, this role could readily be played by plasticity at inhibitory synapses [36–39]. The whole learning rule was implemented in a “layer-local” manner, meaning that no error signals were backpropagated through layers (Methods).
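The three ingredients of the layer-local objective can be sketched as penalties on a batch of representation pairs: a predictive (invariance) term, a Hebbian (variance) term, and a decorrelation term. The NumPy rendering below is our loose illustration, not the paper’s exact loss; in particular, the log-variance form of the Hebbian term and the unit coefficients are assumptions.

```python
import numpy as np

def lpl_layer_loss(z1, z2, lam_var=1.0, lam_dec=1.0, eps=1e-8):
    """Layer-local loss on two batches of representations of related inputs.

    z1, z2: arrays of shape (batch, features) for temporally adjacent views.
    """
    # Predictive term: representations of related inputs should match.
    pred = np.mean((z1 - z2) ** 2)
    # Hebbian/variance term: penalize vanishing per-feature variance
    # (the log-variance form is our assumption).
    var = np.var(z1, axis=0)
    hebb = -np.mean(np.log(var + eps))
    # Decorrelation term: off-diagonal covariance entries should vanish.
    zc = z1 - z1.mean(axis=0)
    cov = (zc.T @ zc) / (len(z1) - 1)
    off = cov - np.diag(np.diag(cov))
    dec = np.sum(off ** 2) / z1.shape[1]
    return pred + lam_var * hebb + lam_dec * dec
```

In the rate-based setting, gradients of a per-layer objective of this kind give rise to local, LPL-like updates; here the decorrelation penalty plays the role attributed above to inhibitory plasticity.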

To simulate temporal sequences of related visual inputs, we generated pairs of images by applying different randomized transformations to images sampled from STL-10, a large object-recognition dataset from computer vision (Supplementary Fig. S2; Methods). We exposed our network model to these visual data until learning converged and evaluated the linear decodability of object categories from the learned representations using a separately trained linear classifier.

We found that in networks trained with LPL, object categories could be linearly decoded at the output with an accuracy of (63.2 ± 0.3)% (Fig. 3b; Table 1), suggesting that the network has formed partially disentangled representations (Supplementary Fig. S3). To elucidate the contributions of the different learning rule components to disentangling, we conducted several ablation experiments. First, we repeated the same simulation but now excluding the predictive term. This modification resulted in an accuracy of (27.0 ± 0.2)%, which is *lower* than the linear readout accuracy of a classifier trained directly on the pixels of the input images (see Table 1), indicating that the network did not learn disentangled representations of object identity. This finding is consistent with previous studies that suggested that purely Hebbian plasticity without a predictive component fails to learn disentangled representations in deep networks [10, 30, 31]. We measured a similar drop in accuracy when we disabled either the Hebbian or the decorrelation component during learning (Fig. 3b).

Convolutional DNNs trained through supervised learning use depth to progressively separate representations as activity propagates through their layers [5, 6]. We sought to understand whether networks trained with LPL similarly leverage depth. To answer this question, we measured the linear readout accuracy of the internal representations at every intermediate layer in the network. Crucially, we found that in the LPL-trained networks, the readout accuracy increased with the number of layers until it gradually saturated (Fig. 3c), whereas this was not the case when any of the components of LPL was disabled. Together, these results suggest that each of the three terms of LPL is crucial for learning disentangled representations in hierarchical DNNs.

We made an additional observation worth noting. When learning occurs without the predictive term, linear readout accuracy in the early layers up to Layer 3 was improved before decreasing below the pixel-level baseline accuracy with increasing depth (Fig. 3c). We observed the same effect for the full LPL rule when we exposed the network to temporally inconsistent image sequences by showing random consecutive pairs of images. These observations suggest that learning in early layers may not critically depend on predictive plasticity, but that it plays an increasingly important role in deeper layers. This finding is reminiscent of experimental work in the rat visual cortex showing that temporally inconsistent visual experience in early life leads to impaired complex cell development while simple cells remain largely unaffected [40].

In DNNs exposed to normal visual experience in which object identity changes slowly, the two most common causes of failure to form disentangled representations are representational collapse and dimensional collapse (Supplementary Fig. S4), the latter resulting from excessively high correlations between neurons [41, 42]. To disambiguate between these two possibilities in our model, we computed the dimensionality of the representations and the mean neuronal activity at every layer (Methods). We found that disabling either the Hebbian or the decorrelation component led to a dimensionality of approximately one, whereas the LPL rule and the learning rule without the predictive term resulted in high-dimensional representations of dimensionality ≈ 15 and ≈ 50, respectively (Fig. 3d). When we disabled the Hebbian term, this resulted in zero activity across all layers (Fig. 3e), which suggests that representational collapse underlies the network’s inability to disentangle its input. In contrast, disabling the decorrelation term did not lead to zero activity levels, indicating that the reason for poor linear readout accuracy is dimensional collapse (Fig. 3e). Finally, we verified that excluding the predictive component of LPL did not cause collapse either in activity levels or in dimensionality. This suggests that the decreasing linear readout accuracy with depth is due to the network’s inability to learn good internal representations. Taken together, these results show that the predictive term is crucial for disentangling object representations in DNNs (Fig. 3) whereas the other terms are essential to prevent different forms of collapse.
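Representational dimensionality of the kind reported here is commonly quantified by the participation ratio of the covariance eigenvalues, $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$; we assume a measure of this form (the paper’s exact estimator is specified in its Methods). A minimal sketch:

```python
import numpy as np

def participation_ratio(z):
    """Effective dimensionality of representations z (samples x features):
    (sum of covariance eigenvalues)^2 / (sum of squared eigenvalues)."""
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (len(z) - 1)
    eig = np.linalg.eigvalsh(cov)
    return eig.sum() ** 2 / (eig ** 2).sum()
```

The measure equals the number of features for isotropic activity and approaches one under dimensional collapse, matching the two regimes contrasted above.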

In all of the above we assumed layer-local learning, meaning that no learning signals were backpropagated along the hierarchy. Such learning could conceivably be implemented by local learning rules, but it is markedly different from typical DNN training, which is end-to-end, namely, training optimizes an objective at the network’s output layer. End-to-end training is typically achieved with backpropagation, an algorithm that assigns credit or blame for output errors to neurons in intermediate layers, thereby allowing them to update their weights in a direction that is conducive to reducing the overall error at the network level. It is an ongoing debate whether and how neurobiology achieves credit assignment [10, 12, 43, 44]. In any case, we wanted to know how coordinating learning across layers might influence representation learning with LPL. To answer this question, we repeated our simulations with end-to-end learning using the LPL objective at the network’s output (Methods) and found that it reproduced all our key findings of layer-local learning, albeit with an improved overall linear readout accuracy (Fig. 3b; Table 1). These findings show that LPL disentangles representations irrespective of whether a layer-local objective or end-to-end optimization is used, but its overall performance improves when the underlying learning rules comprise elements of end-to-end optimization. End-to-end optimization in the brain presumably requires dedicated neural circuitry for solving the credit assignment problem, which may be implemented through neuromodulators and dedicated neuronal circuit elements [44–46].

### LPL captures invariance learning in the primate inferotemporal cortex

Changing the temporal continuity structure of visual stimuli has been shown to induce neuronal selectivity changes in primate inferotemporal cortex (IT). This effect has been interpreted as a consequence of unsupervised temporal slowness learning (UTL) [16], a principle directly captured by the predictive term in the LPL rule. In Li et al. [16], a macaque freely viewed a blank screen, with objects appearing in the peripheral visual field at one of two alternative locations relative to the (tracked) center of its gaze, prompting the macaque to perform a saccade to this location (Fig. 4a). The experimenters differentiated between normal exposures in which the object does not change during the saccade and “swap exposures” in which the initially presented object was consistently swapped out for a different one as the monkey saccaded to a specific target location X_{swap}. Hence, such swap exposures created an “incorrect” temporal association between one object at position X_{swap} and a different one at the animal’s center of gaze X_{c}. For any particular pair of swap objects, either the location above or below the center of gaze was chosen as X_{swap}, and transitions from the opposite peripheral position X_{nonswap} to the center X_{c} were kept consistent as a control. The authors found that the position tolerance of object selectivity of individual neurons in the monkey’s IT was systematically altered by swap exposures, an effect they attributed to unsupervised learning. Specifically, a neuron initially selective to an object *P* over another object *N*, reduced or even reversed its selectivity at the swap position X_{swap} while preserving its selectivity at the non-swap position X_{nonswap} (Fig. 4b).

We wanted to know whether LPL can account for these observations. To that end, we developed a computational model of the experiment conducted by Li et al. [16] based on the DNN introduced in the previous section. To simulate the prior visual experience that animals had before entering the experiment, we pretrained our network model with LPL on a large natural image dataset (Methods). After pretraining, the learned representations were invariant to where an object was presented on a canvas (Supplementary Fig. S5), a known property of neural representations in the primate IT [4]. Next, we simulated targeted perturbations in the model inputs analogous to the original experimental design. Following Li et al. [16], for a given pair of images from different classes, we switched object identities during transitions from a specific peripheral position, say X_{1}, to the central position X_{c}, while keeping transitions from the other peripheral position X_{2} to the center unmodified. We used X_{1} as the swap position for half of the image pairs, and X_{2} for the other half. During exposure to these swapped inputs, we recorded neuronal responses in the network’s output layer while the weights in the network model evolved according to the LPL rule.

We observed that the neuronal selectivity between preferred inputs *P*, as defined by their initial preference (Methods), in comparison to non-preferred stimuli *N* in the model qualitatively reproduced the evolution of object selectivity reported in the experiments (Fig. 4b). Effectively, LPL trained the network’s output neurons to reduce their selectivity to their preferred inputs *P* at the swap position while preserving their selectivity at the non-swap position. Furthermore, we observed that object selectivity between pairs of control objects not used during the swap training protocol showed no changes, consistent with the experiment (Fig. 4b). Further analysis revealed that the origin of the selectivity changes between *P* and *N* stimuli at the swap position was due to both increases in responses to N and decreases in responses to *P*, an effect also observed in the experiments (Fig. 4c). Thus, LPL can account for neuronal selectivity changes observed in monkey IT during in-vivo unsupervised visual learning experiments.

### Spiking neural networks with LPL selectively encode predictive inputs

So far, we considered LPL in discrete-time rate-based neuron models without an explicit separation of excitatory and inhibitory neurons. In contrast, cortical circuits employ spiking neurons that obey Dale’s law and learn in continuous time through STDP. Hence, we wanted to test whether our theory would extend to such biologically plausible plastic spiking neural networks (SNNs). To that end, we simulated a recurrent network model consisting of 100 excitatory and 25 inhibitory neurons with sparse connectivity (Fig. 5a; Methods). The circuit received input from five Poisson populations consisting of 100 neurons each, whose firing rates encoded temporally varying signals with different temporal properties (Fig. 5b; Methods). Specifically, input population *P*0 had a constant firing rate, whereas *P*1’s and *P*2’s firing rates followed two independently varying signals. We also defined two control populations *P*1_{ctl} and *P*2_{ctl} whose firing rates were temporally shuffled versions of *P*1 and *P*2 using bins of 10 ms duration. To avoid confounding effects due to firing rate differences, we ensured that all populations had the same mean firing rate of 5 Hz. The inputs were connected to the excitatory network neurons through plastic connections that evolved according to a spiking generalization of the local LPL rule (cf. Eq. (1)) without the decorrelation term. Decorrelation was achieved through plasticity of inhibitory connections onto excitatory neurons. Specifically, we relied on an inhibitory STDP rule that actively decorrelates neuronal activity by promoting excitatory-inhibitory balance ([37]; Methods).
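The input construction can be sketched as follows; the smoothed-noise rate process and all names are our choices, with the 10 ms shuffling bins and the 5 Hz mean rate taken from the text.

```python
import numpy as np

def make_inputs(T=100.0, dt=0.01, rate_mean=5.0, n_neurons=100, seed=1):
    """Binned firing rates for a slow signal and its bin-shuffled control.

    dt = 10 ms bins; shuffling the bins preserves the rate histogram (and
    hence the mean rate) but destroys temporal continuity.
    """
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    # Slow signal: a random walk, shifted and rescaled to the target mean rate.
    raw = np.cumsum(rng.normal(size=n))
    raw = raw - raw.min()
    rate = rate_mean * raw / raw.mean()
    rate_ctl = rng.permutation(rate)             # temporally shuffled control
    # Poisson spike counts per neuron and bin for the signal population.
    spikes = rng.poisson(rate * dt, size=(n_neurons, n))
    return rate, rate_ctl, spikes

rate, rate_ctl, spikes = make_inputs()
```

By construction, the signal and its control have identical mean rates, so any difference in how strongly the network represents them must come from their temporal structure.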

We ran the SNN model for approximately 28 h of simulated time, at which point the network’s firing dynamics had settled into an asynchronous irregular activity regime from which the slowly varying input signals could be decoded linearly with high fidelity (Fig. 5b). In contrast, the rate fluctuations of the shuffled control signals (*P*1_{ctl} and *P*2_{ctl}) and the constant firing-rate input (*P*0) could not be reconstructed linearly with high accuracy, consistent with the idea that the network preferentially represents the slowly varying inputs in its activity. To directly check whether the neurons had developed preferential selectivity for the slow input signals, we analyzed the afferent connectivity matrix and computed the relative difference between the average afferent weight from each signal in comparison to its associated control pathway. As expected, we found that neuronal weights were preferentially tuned to the slow input channels instead of the associated shuffle control inputs (Fig. 5c,d). However, this selectivity was lost when we turned either the predictive or the Hebbian term off. The absence of Hebbian plasticity was further accompanied by activity collapse (Fig. 5e), consistent with our findings in the rate-based setting.

To investigate the role of inhibition in successful representation learning in the spiking setting, we repeated the above simulation without the inhibitory population. This manipulation resulted in excessively high firing rates (Fig. 5e; Supplementary Fig. S6), a notable reduction of the representational dimensionality (Fig. 5f; Methods), and lower selectivity to the slow signals (Fig. 5d). The reasons for this reduction can be seen in the distribution of weight vectors. In the network with plastic inhibition, weights were more decorrelated and exclusively selective to either *P*1 or *P*2 (Fig. 5g). In contrast, removing inhibition resulted in more correlated weights with few neurons preferentially tuned to one signal or the other (Fig. 5h). Finally, we repeated the same simulation in a network in which an inhibitory population was present but without inhibitory plasticity. This manipulation led to comparable representational dimensionality as for LPL (Fig. 5f), but caused a loss of selectivity relative to the shuffled controls (Fig. 5d). These results indicate that inhibition is needed to prevent correlated neuronal activity and the ensuing reduction in representational dimensionality. Further, inhibitory plasticity is required to ensure that the slow signals are preferentially represented (Supplementary Fig. S6). Together, these findings illustrate that LPL extends to realistic spiking circuits with separate excitatory and inhibitory neuronal populations.
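The inhibitory STDP rule of ref. [37] has a compact trace-based form: pre/post pairings in either order potentiate the inhibitory weight, while lone presynaptic spikes depress it by an amount set by a target rate ρ₀, steering the postsynaptic neuron toward that rate. A minimal sketch with our parameter choices:

```python
import numpy as np

def inhib_stdp(pre_spikes, post_spikes, dt=1e-4, eta=1e-3, tau=0.02, rho0=5.0):
    """Trace-based inhibitory STDP (after Vogels et al. [37]) on binary
    spike trains sampled at resolution dt.

    Pairings potentiate the inhibitory weight via the traces; presynaptic
    spikes alone depress it by alpha = 2 * rho0 * tau.
    """
    alpha = 2.0 * rho0 * tau
    x_pre = x_post = 0.0
    w = 0.5
    for s_pre, s_post in zip(pre_spikes, post_spikes):
        x_pre += -dt * x_pre / tau + s_pre    # exponential pre trace
        x_post += -dt * x_post / tau + s_post  # exponential post trace
        if s_pre:
            w += eta * (x_post - alpha)        # potentiate if post was active
        if s_post:
            w += eta * x_pre                   # potentiate on post after pre
    return w
```

When the postsynaptic neuron fires together with its inhibitory input, inhibition strengthens; when it is silent, inhibition weakens, which is the balancing, decorrelating behavior invoked in the simulations above.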

### LPL qualitatively reproduces experimentally observed rate and spike-timing dependence of synaptic plasticity

Next, we wanted to examine whether the spike-based LPL rule is consistent with experimental observations of plasticity induction. Experiments commonly report intertwined rate and spike-timing dependence, presumably mediated through nonlinear voltage- and calcium-dependent cellular mechanisms [28, 29, 47]. Theoretical work has further established conceptual links between phenomenological STDP models, SFA, and BCM theory [20, 48–52].

To compare LPL to experiments, we simulated a standard STDP induction protocol. Specifically, we paired 100 pre- and postsynaptic action potentials with varying relative timing Δ*t* for a range of different repetition frequencies *ρ*. During the entire plasticity induction protocol, the postsynaptic cell was kept depolarized close to its firing threshold and the weights evolved according to spike-based LPL. We repeated the simulated induction protocol for different initial values of the slowly moving averages of the postsynaptic firing rate $\bar z$ and variance $\bar\sigma_z^2$ (Methods). This was done because these variables change little over the course of a single induction protocol due to their slow dynamics. Their presence, however, makes LPL a form of metaplasticity, i.e., the strength of plasticity induction depends on past neuronal activity.

We found that for small initial values of the variance estimate, the induced weight changes followed an antisymmetric temporal profile consistent with STDP experiments (Fig. 6a). For larger initial values of the variance estimate, the STDP window changed to a more symmetric and then ultimately an inverted, anti-Hebbian profile, while the plasticity amplitude was suppressed, as expected from the variance-dependent suppression of the Hebbian term in the learning rule (Fig. 6b,c). Next we investigated the effect of different initial values of the moving average of the postsynaptic firing rate, which acts as a moving threshold reminiscent of BCM. Specifically, we recorded plastic changes at two fixed spike timing intervals Δ*t* = ±10 ms for a range of repetition frequencies. For intermediate threshold values, causal spike timings induced long-term potentiation (LTP) with a nonlinear frequency dependence (Fig. 6d), whereas acausal pre-after-post timings showed a characteristic crossover from LTD to LTP similar to that observed in experiments [27]. In contrast, a low initial threshold, which would occur in circuits that have been quiescent for extended periods of time, resulted in LTP induction for both positive and negative spike timings, whereas a high initial value, corresponding to circuits with excessively high activity levels, led to LTD (Supplementary Fig. S7). Importantly, such slow shifts in activity-dependent plasticity behavior are consistent with the metaplasticity observed in monocular deprivation experiments [2, 33, 52]. Thus, LPL qualitatively captures key phenomena observed in experiments, such as STDP, the rate-dependence of plasticity, and metaplasticity, despite not being optimized to reproduce these phenomena. Rather, our model offers a simple normative explanation for the necessity of the different plasticity patterns that are also observed experimentally [47].

## Discussion

In this article, we have introduced LPL, a local plasticity rule that extends BCM theory by adding a predictive component to Hebbian learning. We demonstrated that LPL disentangles latent object representations in DNNs through mere exposure to temporal data in which object identity is preserved across successive inputs, provided neuronal activity is decorrelated. Crucially, we showed that both predictive and Hebbian learning have to work in concert to accomplish this. Moreover, we demonstrated that LPL qualitatively captures the representational changes observed in unsupervised learning experiments in monkey IT [16]. Finally, we extended LPL to spiking neural networks and found that the resulting learning rule naturally reproduces STDP and its rate-dependence as observed in experiments, while further predicting a new form of metaplasticity with distinct variance-dependent changes to the STDP window.

The idea that sensory networks use temporal prediction as a learning objective to form disentangled internal representations has been studied extensively in both machine learning and neuroscience. The model introduced in this article combines and extends aspects of biologically plausible plasticity models closely related to BCM theory with central ideas from SFA and more recent SSL approaches in machine learning. While SSL has shown great promise in representation learning without labelled data, it is typically formulated as a contrastive learning problem requiring negative samples [13, 14] to prevent representational collapse. However, negative samples explicitly break temporal contiguity during learning and are thus not biologically plausible. LPL does not require negative samples. Instead, it relies on variance regularization as proposed in Bardes et al. [24] to prevent collapse. Our model uses virtually the same mechanism while building a conceptual bridge from variance regularization to BCM theory. However, we used a logarithmic variance dependence instead of a piece-wise linear dependence (Supplementary Note S3). LPL’s logarithmic dependence yields a smooth derivative which more clearly exposes the relationship to Hebbian metaplasticity.

Like most SSL approaches, Bardes et al. [24] used an end-to-end learning approach whereby the objective function is formulated on the embeddings at the network output. In contrast, we studied the case of greedy layer-wise learning in which the objective is applied to each layer individually. Doing so alleviates the need for backpropagation of errors into hidden layers and, as a result, permitted us to formulate the weight updates as a local learning rule. Such a local learning rule formulation is similar to works combining a contrastive objective with greedy layer-wise training [30, 53]. Furthermore, recent work by Illing et al. [21] showed that greedy layer-wise learning is directly linked to local learning rules that rapidly switch between Hebbian and anti-Hebbian learning through a global third factor. However, all of these models required implausible negative samples, whereas LPL requires neither end-to-end training nor negative samples.

LPL shares the shape of the BCM rule, which has been qualitatively confirmed in numerous experimental studies both in-vitro [27, 32, 33] and in-vivo [34]. Furthermore, BCM has been linked to STDP [29] and informed numerous phenomenological plasticity models [48–51, 54]. However, unequivocal evidence for the predicted supralinear behavior of the firing-rate dependence of the BCM sliding threshold remains scarce [33], and the fast sliding threshold required for network stability seems at odds with experiments [52, 55, 56]. In contrast, LPL resembles the shape of the BCM learning rule, but does not require a fast nonlinear sliding threshold for stability. Instead, it posits a fast-acting variance-dependence that ensures stability by suppressing Hebbian plasticity when the variance of the output activity is too high. This suppressive effect allows the sliding threshold, which could be implemented by either neuronal or circuit mechanisms [33, 57, 58], to catch up slowly, more consistent with experiments. LPL, therefore, offers a simple explanation that could help close the current gap between theory and experiment. Validating this theory will require future studies investigating whether and how neuronal circuits regulate plasticity.

The notion of slowness learning has been studied extensively in the context of the Trace Rule [59] and SFA [19, 39], which have conceptual ties to STDP [20]. However, the Trace Rule enforces a hard constraint on the norm of the weight vector to prevent collapse, while SFA explicitly restricts neuronal variance to one. In contrast, LPL merely enforces a soft variance constraint [24] through the variance dependence of Hebbian plasticity, to the same effect. A similar soft constraint on the variance can be derived from statistical independence arguments [60] within a mutual information view of SSL [13]. However, these studies used negative samples, assumed rapid global sign switching of the learning rule, and did not connect their work to biological plasticity mechanisms.

Our study has several limitations which we aim to address in future work. First, our study is limited to visual tasks of core object recognition, whereas other sensory modalities may use LPL as a mechanism to form disentangled representations of the external world. Moreover, we restricted ourselves to artificial data augmentation techniques borrowed from SSL to generate temporally related images, for instance, by randomly cropping two images from an underlying larger image. We did so for computational efficiency and due to the lack of suitable alternative datasets. It seems clear that this methodology can only serve as a crude proxy for the type of successive stimuli that our visual system experiences as we sample an image through a complex sequence of saccades. Additionally, the image itself may be non-stationary as the state of the world evolves around us. Finally, a gap in classification performance remains compared to less plausible fully supervised and contrastive approaches (Supplementary Table S3), showing that there is room for improvement, possibly by incorporating biological circuit mechanisms into the model. It is left as future work to show how LPL can be extended to the circuit level and to more ethologically realistic sensory modalities [61] and video input, while further combining them with plausible models of saccadic eye movement [62].

We have evaluated our model’s ability to disentangle object representations in DNNs and to reproduce neuronal activity signatures of unsupervised visual learning in experiments performed in monkey IT [16]. A logical next question is how well our model matches the representational geometry of other experimental recordings, as evaluated previously for contrastive models [7, 63–65]. An appealing metric for such an evaluation is provided by Brain-Score [66]. However, it is typically used for much larger network models trained on ImageNet, a rich 1000-class dataset [7]. In contrast, our networks were smaller and trained on CIFAR-10 and STL-10 with only ten classes. When we evaluated our model on Brain-Score (Supplementary Fig. S8), we found that the models trained end-to-end with either LPL or a supervised objective showed the highest Brain-Score, whereas networks trained with layer-local LPL exhibited a low score. One possible explanation for this finding is that the brain may indeed rely on elements of end-to-end optimization implemented through biologically plausible credit assignment mechanisms at the circuit level. Still, our architecture’s Brain-Scores were lower overall than those of larger network models trained on larger datasets. Due to these architectural differences, interpreting these results is not straightforward. While a more detailed analysis of the origins of these differences goes beyond the scope of the present article, we intend to study this question in more detail in the future.

Despite these limitations, our model makes several concrete predictions about synaptic plasticity. As we have shown, modulating the strength of Hebbian plasticity as a function of the variance of the postsynaptic activity is essential to LPL. A direct prediction of our model is, therefore, that the predictive contribution to plasticity should be best observable when postsynaptic activity is highly variable, while it should be barely observable at low variance levels. While our model does not make quantitative predictions about the time scale on which each neuron would have to estimate its output variance, one would expect that a neuron that has been inactive for a long time, as may be the case in slice experiments, would show stronger Hebbian learning than neurons participating in in-vivo activity. Moreover, LPL should manifest in metaplasticity experiments as a transition from an asymmetric Hebbian STDP window, via a symmetric window, to ultimately an anti-Hebbian window (cf. Fig. 6) when priming the postsynaptic neuron with increasing output variance. Specifically, we expect a neuron which has remained quiescent for a long period of time to display a classic STDP window, whereas a neuron whose activity has undergone substantial fluctuations in the recent past should show an inverted STDP window. Such metaplasticity may account for the diversity of different shapes of STDP windows observed in experiments [47, 67].

Understanding how established data-driven plasticity models relate to theoretically motivated learning paradigms such as SFA and SSL is essential for understanding the brain. A central open question in neuroscience remains: How do the different components of such learning rules interact with the rich local microcircuitry to yield useful representations at the network level? In this article, we have only scratched the surface by proposing a local plasticity rule and illustrating its aptitude for disentangling internal representations. However, a performance gap remains compared to learning algorithms that can leverage top-down feedback. We expect that extending predictive learning to the circuit and network level will narrow this gap and generate deep mechanistic insights into the underlying principles of neural plasticity.

## Methods

### Plasticity model

The LPL rule is derived from an objective function approach. It consists of three distinct parts, each stemming from a different additive term in the following combined objective function:

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda_1\,\mathcal{L}_{\text{Hebb}} + \lambda_2\,\mathcal{L}_{\text{decorr}}$$

First, the predictive component $\mathcal{L}_{\text{pred}}$ minimizes neuronal output fluctuations for inputs that occur close in time. Second, the Hebbian component $\mathcal{L}_{\text{Hebb}}$ maximizes variance and thereby prevents representational collapse. Finally, $\mathcal{L}_{\text{decorr}}$ is a decorrelation term that we use in all non-spiking network simulations to prevent excessive correlations between neurons within the same layer of a network. In SNNs, decorrelation is achieved without this term through lateral inhibition and inhibitory plasticity.

In the following, we consider a network layer with *N* input units and *M* output units trained on batches of *B* pairs of consecutive stimuli. In all simulations we approximate the temporal derivative d*z*/d*t*, which appears in Eqn. (1), by the finite difference *z*(*t*) − *z*(*t* − Δ*t*), assuming a discrete timestep Δ*t* and absorbing all constants into the learning rate. In this formulation, the LPL rule has a time horizon of two time steps in the sense that only one temporal transition enters the learning rule directly. We used this insight to efficiently train our models using minibatches of paired consecutive input stimuli, which approximates learning on extended temporal sequences consisting of many time steps. Let **x**(*t*) be the input to the network at time *t*, *W* the weight matrix to be learned, **a**(*t*) the pre-activations, and *z_{i}*(*t*) the activity of the *i*th output neuron at time *t*. Finally, *b* indexes the training example within a minibatch of size *B*.

#### Predictive component

We define the predictive objective as the mean squared difference between neuronal activity in consecutive time steps:

$$\mathcal{L}_{\text{pred}} = \frac{1}{B}\sum_{b=1}^{B}\frac{1}{M}\sum_{i=1}^{M}\left(z_i^{b}(t) - \mathrm{SG}\!\left[z_i^{b}(t-\Delta t)\right]\right)^2$$

Here SG denotes the Stopgrad function, which signifies that the gradient is not evaluated with respect to quantities in the past, thereby removing the need for backpropagation through time.
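As a concrete illustration, the predictive term and its stop-gradient convention can be sketched in a few lines of NumPy (the function names are ours, and the actual simulations used PyTorch autograd rather than hand-derived gradients):

```python
import numpy as np

def predictive_loss(z_t, z_prev):
    """Mean squared difference between consecutive activities.

    z_t, z_prev: arrays of shape (B, M) holding the activities of M output
    neurons for B paired consecutive stimuli. The stop-gradient convention
    means z_prev acts as a constant target.
    """
    return np.mean((z_t - z_prev) ** 2)

def predictive_grad(z_t, z_prev):
    # dL_pred/dz_t under the stop-gradient: past activity is a fixed target.
    return 2.0 * (z_t - z_prev) / z_t.size
```

Minimizing this term pulls the current activity toward the (detached) past activity, which is the "slowness" pressure of LPL.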

#### Hebbian component

To avoid representational collapse we rely on the Hebbian plasticity rule that results from minimizing the negative logarithm of the variance of neuronal activity:

$$\mathcal{L}_{\text{Hebb}} = -\frac{1}{M}\sum_{i=1}^{M}\log\left(\sigma_i^2(t)\right),$$

where $\bar{z}_i(t)$ and $\sigma_i^2(t)$ are the current estimates of the mean and variance of the activity of the *i*th output neuron. Note that we do not compute gradients with respect to the mean estimate, which would require backpropagation through time. Assuming that the mean is fixed allows formulating LPL as a temporally local learning rule (cf. Eq. (3)). To minimize the computational burden in DNN simulations, we performed all necessary computations on minibatches, including the estimation of the mean and variance. However, these quantities could also be estimated using stale estimates from previous inputs, a requirement for implementing LPL as an online learning rule. Using stale mean and variance estimates from previous minibatches in our DNN simulations did cause a drop in readout performance (Supplementary Table S1). Still, such a drop could possibly be avoided by using larger minibatch sizes, by further reducing the learning rate, or by computing the estimates as running averages over past inputs. All of the above manipulations result in essentially the same learning rule (see Supplementary Note S1).
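A minimal NumPy sketch of this component, treating the mean estimate as fixed as described above (the `eps` guard mirrors the small constant used in the single-neuron simulations):

```python
import numpy as np

def hebbian_loss(z, eps=1e-6):
    """Negative log-variance per output neuron, averaged over M neurons.

    z: (B, M) activities. Minimizing this loss maximizes output variance
    and thereby counteracts representational collapse; eps guards log(0).
    """
    var = z.var(axis=0)  # per-neuron variance estimate over the batch
    return -np.log(var + eps).mean()

def hebbian_grad(z, eps=1e-6):
    # With the mean treated as a constant (no backprop through time),
    # dL/dz is a variance-gated Hebbian term: -(z - mean) / variance.
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True) + eps
    return -2.0 * (z - mean) / (var * z.shape[0] * z.shape[1])
```

Note the 1/variance gating: when output variance is already large, the Hebbian drive is automatically suppressed, which is the mechanism linked to metaplasticity in the main text.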

#### Decorrelation component

Finally, we use a decorrelation objective to prevent excessive correlation between different neurons in the same layer, as suggested previously [24, 36, 68]. The decorrelation loss function is the sum of the squared off-diagonal terms of the covariance matrix between units within the same layer, with a scaling factor that keeps the objective invariant to the number of units in the population.
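A sketch of this term in NumPy; the exact normalization used in the paper is not reproduced in the text, so the `1/(M*(M-1))` scaling below is our assumption for making the loss invariant to the number of units:

```python
import numpy as np

def decorrelation_loss(z):
    """Sum of squared off-diagonal covariance entries of a (B, M) batch,
    scaled by 1/(M*(M-1)) (scaling is our assumption)."""
    B, M = z.shape
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc.T @ zc / B                # (M, M) covariance matrix
    off = cov - np.diag(np.diag(cov))  # zero out the diagonal
    return (off ** 2).sum() / (M * (M - 1))
```

The loss vanishes exactly when all pairs of units are uncorrelated, regardless of their individual variances.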

#### The full learning rule

We obtain the LPL rule as the negative gradient of the total objective plus an added weight decay. For a single network layer, this yields the layer-local LPL rule

$$\Delta W \propto -\nabla_W\left(\mathcal{L}_{\text{pred}} + \lambda_1\,\mathcal{L}_{\text{Hebb}} + \lambda_2\,\mathcal{L}_{\text{decorr}}\right) - \eta_w W,$$

where we omitted the time argument *t* from all present quantities for brevity. Here *λ*_{1} and *λ*_{2} are parameters which control the relative strengths of each objective, *α* and *β* are the appropriate normalizing constants for batch size and number of units, and *η _{w}* is a parameter controlling the strength of the weight decay.
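Putting the three gradient terms together, a single update for a linear layer z = x·Wᵀ might look as follows. This is an illustrative sketch under our own normalization conventions, not the paper's implementation (which used PyTorch autograd); the hyperparameter values are placeholders:

```python
import numpy as np

def lpl_update(W, x_t, x_prev, lr=1e-3, lam1=1.0, lam2=1.0,
               eta_w=1e-4, eps=1e-6):
    """One LPL weight update for a linear layer with W of shape (M, N)."""
    z_t, z_prev = x_t @ W.T, x_prev @ W.T
    B, M = z_t.shape
    # Predictive term: past activity acts as a fixed target (stop-grad).
    g = 2.0 * (z_t - z_prev) / z_t.size
    # Hebbian term: variance-gated, mean treated as a constant.
    mean = z_t.mean(axis=0, keepdims=True)
    var = z_t.var(axis=0, keepdims=True) + eps
    g += lam1 * (-2.0) * (z_t - mean) / (var * B * M)
    # Decorrelation term: push off-diagonal covariances toward zero.
    zc = z_t - mean
    cov = zc.T @ zc / B
    off = cov - np.diag(np.diag(cov))
    g += lam2 * (4.0 / (M * (M - 1) * B)) * zc @ off
    # Gradient descent pulled back through the linear layer, plus decay.
    return W - lr * (g.T @ x_t) - lr * eta_w * W
```

With identical consecutive inputs and the Hebbian and decorrelation terms switched off, the update reduces to pure weight decay, which is a convenient sanity check.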

#### Numerical optimization methods

We implemented all network models learning with LPL using gradient descent on the equivalent objective function in PyTorch [69] with the Lightning framework [70]. DNN simulations were run on five Linux workstations equipped with Nvidia Quadro RTX 5000 graphics processing units (GPUs) and a compute cluster with Nvidia V100 and A100 GPUs. In the case of the DNNs, we used the Adam optimizer [71] to accelerate learning. Parameter values used in all simulations are summarized in Supplementary Table S2.

### Learning in the single neuron setup

We considered a simple linear rate-based neuron model whose output firing rate *z* is given by the weighted sum of the firing rates *x_{j}* of the input neurons, i.e., *z* = ∑_{j} *W_{j}* *x_{j}*, where *W_{j}* corresponds to the synaptic weight of input *j*. We trained the neuron using stochastic gradient descent (SGD) on the corresponding objective function.

Here, and in all following simulations, we fixed the Hebbian coefficient *λ*_{1} = 1. We also added a small constant *ϵ* = 10^{-6} to the estimate of the variance *σ_{z}* for numerical stability. In the case of a single rate neuron, the LPL rule (Eq. (7)) simplifies to Eq. (1) without the decorrelation term.

#### Synthetic two-dimensional dataset generation

The two-dimensional synthetic data sequence (Fig. 2e) consists of two clusters of inputs, one centered at *x* = −1 and the other at *x* = +1. Pairs of consecutive data points were drawn independently from normal distributions centered at their corresponding cluster. To generate a family of different datasets, we kept the standard deviation in the x-direction fixed at *σ_{x}* = 0.1 and varied *σ_{y}*. Additionally, to account for occasional transitions between clusters with probability *p*, we included a corresponding fraction of such “crossover pairs” in the training batch. For each value of *σ_{y}*, we simulated the evolution of the input connections of a single linear model neuron that received *x* and *y* as its two inputs, and updated its input weights according to LPL. In the simulations in Fig. 2 we assumed *p* → 0; however, the qualitative behavior remained unchanged for noise levels below *p* = 0.5, i.e., as long as the “noisy” pairs of points from different clusters were rare in each training batch (Supplementary Fig. S9).
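The dataset generation described above can be sketched as follows (the helper name and the fixed seed are ours, chosen for reproducibility):

```python
import numpy as np

def make_pairs(batch_size, sigma_y, p=0.0, sigma_x=0.1, rng=None):
    """Pairs of consecutive 2D points from two clusters at x = -1 and x = +1.

    With probability p a pair is a "crossover": its two points come from
    different clusters, modeling occasional cluster transitions.
    """
    rng = rng or np.random.default_rng(0)
    cluster = rng.choice([-1.0, 1.0], size=batch_size)
    # Second point of each pair: same cluster unless a crossover occurs.
    partner = np.where(rng.random(batch_size) < p, -cluster, cluster)
    noise = rng.normal(size=(2, batch_size, 2)) * np.array([sigma_x, sigma_y])
    a = np.stack([cluster, np.zeros(batch_size)], axis=1) + noise[0]
    b = np.stack([partner, np.zeros(batch_size)], axis=1) + noise[1]
    return a, b
```

Feeding such pairs to the single-neuron LPL rule reproduces the selectivity sweep over *σ_{y}* described in the text.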

#### Neuronal selectivity measure

After training the weights to convergence, we measured the neuron’s selectivity to the x-input as the normalized difference between mean responses to stimuli coming from the two respective input clusters. Concretely, let 〈*z*_{1}〉 be the average output caused by inputs from the *x* = 1 cluster, and 〈*z*_{2}〉 from the *x* = −1 cluster; then the selectivity *χ* is defined as:

$$\chi = \frac{\langle z_1\rangle - \langle z_2\rangle}{z_{\max} - z_{\min}},$$

with *z*_{max} the maximum and *z*_{min} the minimum response across all inputs.
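In code, this selectivity measure is a one-liner (function name ours):

```python
import numpy as np

def selectivity(z1, z2):
    """Normalized difference between mean responses to the two clusters:
    chi = (<z1> - <z2>) / (z_max - z_min), computed over all responses."""
    z = np.concatenate([z1, z2])
    return (z1.mean() - z2.mean()) / (z.max() - z.min())
```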

### Learning in deep convolutional neural networks

For all network simulations, we used a convolutional DNN based on the VGG-11 architecture [72] (see Supplementary Note S5 for details). We trained this network on STL-10 [73] and CIFAR-10 [74] (Supplementary Fig. S10), two natural image datasets (see Supplementary Table S2 for hyperparameters). To simulate related consecutive inputs, we used two differently augmented versions of the same underlying image, a typical approach in vision-based SSL methods. Specifically, we first standardized the pixel values to zero mean and unit standard deviation within each dataset before using the set of augmentations originally suggested in [14], which includes random crops, blurring, color jitter and random horizontal flips (see Supplementary Fig. S2 for examples).

#### Network training

We trained our network models on natural image data by minimizing the equivalent LPL objective function. For both datasets, we trained the DNN using the Adam optimizer with default parameters [71] and a cosine learning rate schedule that drove the learning rate to zero after 800 epochs. We distinguish between two cases: layer-local and end-to-end learning. End-to-end learning corresponds to training the network by optimizing the objective at the network’s output while using backpropagation to train the hidden layer weights. This is the standard approach used in deep learning. In contrast, in layer-local learning, we minimized the LPL objective at each layer of the network independently without backpropagating loss gradients between layers, similar to previous work [21, 30]. In this case, every layer greedily learns predictive features of its own inputs, i.e., its previous layer’s representations. To achieve this behavior, we prevented PyTorch from backpropagating gradients between layers by detaching the output of every layer in the forward pass and optimizing the sum of the per-layer losses.

Unless mentioned otherwise, we used global average pooling (GAP) to reduce feature maps to a single vector before applying the learning objective at the output of every convolutional layer for layer-local training, or just at the final output in the case of end-to-end training. Although pooling was not strictly necessary and LPL could be directly applied on the feature maps, it substantially sped up learning and led to an overall improved linear readout accuracy on CIFAR-10 (Supplementary Table S1). However, we observed that GAP was essential on the STL-10 dataset for achieving readout accuracy levels above the pixel-level baseline (cf. Table 1). This discrepancy was presumably due to the larger pixel dimensions of this dataset and the resulting smaller relative receptive field size in early convolutional layers. Concretely, feature pixels in the first convolutional layer of VGG-11 have a receptive field of 3 × 3 pixels covering a larger portion of the 32 × 32 CIFAR-10 images as compared to the 96 × 96 STL-10 inputs. This hypothesis was corroborated by the fact that when we sub-sampled STL-10 images to a 32 × 32 resolution, the dependence on GAP was removed and LPL was effective directly on the feature maps (Supplementary Table S1).

#### Baseline models

As baseline models for comparison (Supplementary Table S3), we trained the same convolutional neural network (CNN) architecture either with a standard cross-entropy supervised objective, which requires labels, or with a contrastive objective, which relies on negative samples. To implement contrastive learning, the network outputs *z*(*t*) were passed through two additional dense projection layers, *v*(*t*) = *f*_{proj}(*z*(*t*)), a step considered crucial in contrastive learning to avoid dimensional collapse [41]. Finally, a contrastive loss function was applied to these projected outputs, in which sim(*v*_{1}, *v*_{2}) denotes the cosine similarity between two representations *v*_{1} and *v*_{2}. The second term in the loss function is a sum over all pairwise similarities between inputs in a given minibatch. These pairs correspond to different underlying base images and therefore constitute negative samples. During training, the network is therefore optimized to reduce the representational similarity between them.

For training the layer-local versions of the supervised and contrastive models, we followed the same procedure as with LPL and optimized the respective loss function at the output of every convolutional layer *l* of the DNN without backpropagation between the layers. Because projection networks are necessary for avoiding dimensional collapse in the case of contrastive learning, we included two additional dense layers to obtain the projected representations at every level of the DNN before calculating the layer-wise contrastive loss. This meant that gradients were backpropagated through each of these dense layers for training the corresponding convolutional layers of the DNN, but consecutive convolutional layers were still trained independently of each other.

### Analysis of population activity and representational dimension

We adopted two different metrics in order to analyze the representations learned by the DNN after unsupervised training with LPL on the natural image datasets.

#### Linear readout accuracy

To evaluate how well the LPL rule trained the DNN to disentangle and identify underlying latent factors in a given image, we measured linear decodability by training a linear classifier on the network outputs in response to a set of training images. Crucially, during this step we only trained the readout weights while keeping the weights of the LPL-pretrained DNN frozen. We then evaluated the linear readout accuracy (Fig. 3b) on a held-out test set of images. We used the same procedure to evaluate the representations at intermediate layers (Fig. 3c), and for the baseline models.

#### Dimensionality and activity measures

To characterize mean activity levels in the network models, we averaged neuronal responses over all inputs in the validation set. To quantify the dimensionality of the learned representations, we computed the participation ratio [75]. Concretely, if *Z* is the matrix of *N*-dimensional representations of *B* input images, and *λ_{i}*, 1 ≤ *i* ≤ *N*, are the eigenvalues of *Z*^{T}*Z*, then the participation ratio is given by:

$$\mathrm{PR} = \frac{\left(\sum_{i=1}^{N} \lambda_i\right)^2}{\sum_{i=1}^{N} \lambda_i^2}$$
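The participation ratio is straightforward to compute in NumPy (function name ours):

```python
import numpy as np

def participation_ratio(Z):
    """PR = (sum_i lambda_i)^2 / sum_i lambda_i^2, with lambda_i the
    eigenvalues of Z^T Z for a (B, N) matrix of representations Z."""
    lam = np.linalg.eigvalsh(Z.T @ Z)  # symmetric eigendecomposition
    return lam.sum() ** 2 / (lam ** 2).sum()
```

An isotropic spectrum gives PR equal to the full dimensionality *N*, whereas a rank-one (fully collapsed) representation gives PR = 1.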

#### Brain-Score

We computed Brain-Scores [66] for instances of our model by submitting them to http://www.brain-score.org. Specifically, we obtained the values for the untrained network and after training the model on STL-10 using both LPL and supervised training. We computed these scores for both end-to-end and layer-wise training objectives. Because our networks were trained on 96 × 96 images, we downsampled the Brain-Score inputs to our network’s native resolution using the preprocessing function provided in the submission pipeline.

### Model of unsupervised learning in inferotemporal cortex

#### Network model and pretraining dataset

To simulate the experimental setup of [16], we modeled the animal’s ventral visual pathway with a convolutional DNN. To that end, we used the same network architecture as before except that we removed all biases in the convolutional layers in order to prevent boundary effects. This modification resulted in a drop in linear readout accuracy (Supplementary Table S1). Pre-training of the network model proceeded in two steps as follows. First, we performed unsupervised pre-training for 800 epochs on STL-10 using augmented image views exactly as before. Next, we added a fully-connected dense layer at the network’s output, and trained it for 10 epochs with the LPL objective while keeping the weights of the convolutional layers frozen. During this second pre-training phase, we used augmented STL-10 inputs which were spatially extended in order to account for the added spatial dimension of different canvas positions in the experiment by Li et al. [16]. The expanded inputs consisted of images placed on a large black canvas at either the center position X_{c} or one of two peripheral positions X_{1/2} at the upper or lower end of the canvas. Concretely, these images had dimensions (13 × 96) × 96 which resulted in an expanded feature map at the output of the convolutional DNN with spatial dimensions 13 × 1 (see Supplementary Note S5 for details). Note that we only expanded the canvas in the vertical dimension instead of using a setup with a 13 × 13 feature map because it resulted in a substantial reduction of computational and memory complexity. During this second stage of pre-training, the network was only exposed to “true” temporal transitions wherein the image was not altered between time steps apart from changing position on the canvas.

#### Data generation for simulated swap exposures

To simulate the experiment by [16], we exposed the network to normal and swap temporal transitions. In the latter case, the image was consistently switched to one belonging to a different object category at the specific swap position. The swap position for a given pair of images was randomly pre-selected to be either X_{1} or X_{2}, while the other non-swap position was used as a control. Specifically, we switched object identities during transitions from one peripheral swap position, say X_{1}, to the central position X_{c}, while keeping transitions from the other peripheral position X_{2} to the center unmodified. As in the experiment, we chose several pairs of images as swap pairs, and fixed X_{1} as the swap position for half the pairs of images and X_{2} as the swap position for the other half. To simulate ongoing learning during exposure to these swap and non-swap input sequences, we continued fine-tuning the convolutional layers. To that end, we reused the Adam optimizer from pre-training with its internal state restored to the state at the end of pre-training. Moreover, we used a learning rate of 10^{-7} during fine-tuning, which was approximately 25 × larger than the learning rate reached by the cosine learning rate schedule during pre-training (4 × 10^{-9}, after 800 epochs). Finally, we trained the newly added dense layers with vanilla SGD with a learning rate of 0.02.

#### Neuronal selectivity analysis

Before training on the swap exposures, for each output neuron in the dense layer, we identified the preferred and non-preferred member of each swap image pair, based on which image drove higher activity in that neuron. This allowed us to quantify object selectivity on a per-neuron basis as *P* – *N*, where *P* is the neuron’s response to its initially preferred image, and *N* to its non-preferred image at the same position on the canvas. Note that, by definition, the initial object selectivity for every neuron is positive. Finally, we measured the changes in object selectivity *P* – *N* during the swap training regimen at the swap and non-swap positions, averaging over all output neurons for all image pairs. As a control, we included measurements of the selectivity between pairs of control images that were not part of the swap set.

### Spiking neural network simulations

We tested a spiking version of LPL in networks of conductance-based leaky integrate-and-fire (LIF) neurons. Specifically, we simulated a recurrent network of 125 spiking neurons (100 excitatory and 25 inhibitory neurons) receiving afferent connections from 500 input neurons. In all simulations the input connections evolved according to the spike-based LPL rule described below. In our model, neurons actively decorrelated each other through locally connected inhibitory interneurons whose connectivity was shaped by inhibitory plasticity.

#### Neuron model

The neuron model was based on previous work [26, 55] in which the membrane potential *U_{i}* of neuron *i* evolves according to the ordinary differential equation

$$\tau^{\mathrm{mem}} \frac{\mathrm{d}U_i}{\mathrm{d}t} = \left(U^{\mathrm{rest}} - U_i\right) + g_i^{\mathrm{exc}}(t)\left(U^{\mathrm{exc}} - U_i\right) + g_i^{\mathrm{gaba}}(t)\left(U^{\mathrm{gaba}} - U_i\right)$$

where *τ*^{mem} denotes the membrane time constant, the *U*^{x} are the synaptic reversal potentials (Supplementary Table S4), and the *g_{i}*^{x}(*t*) are the corresponding synaptic conductances expressed in units of the neuronal leak conductance. The excitatory conductance is the sum of NMDA and AMPA conductances, *g_{i}*^{exc} = *g_{i}*^{ampa} + *g_{i}*^{nmda}. Their dynamics are described by the differential equations

$$\frac{\mathrm{d}g_i^{\mathrm{ampa}}}{\mathrm{d}t} = -\frac{g_i^{\mathrm{ampa}}}{\tau^{\mathrm{ampa}}} + \sum_j w_{ij} S_j(t), \qquad \tau^{\mathrm{nmda}} \frac{\mathrm{d}g_i^{\mathrm{nmda}}}{\mathrm{d}t} = g_i^{\mathrm{ampa}} - g_i^{\mathrm{nmda}},$$

whereas the inhibitory GABA conductance evolves as

$$\frac{\mathrm{d}g_i^{\mathrm{gaba}}}{\mathrm{d}t} = -\frac{g_i^{\mathrm{gaba}}}{\tau^{\mathrm{gaba}}} + \sum_j w_{ij} S_j(t).$$

In the above expressions, $S_j(t) = \sum_k \delta(t - t_j^k)$ refers to the afferent spike train emitted by neuron *j*, in which the $t_j^k$ are the corresponding firing times, and the *τ*^{x} denote the individual neuronal and synaptic time constants (Supplementary Table S4). Neuron *i* fires an output spike whenever its membrane potential reaches the dynamic firing threshold *ϑ_{i}*(*t*), which evolves according to

$$\tau^{\mathrm{thr}} \frac{\mathrm{d}\vartheta_i}{\mathrm{d}t} = \vartheta^{\mathrm{rest}} - \vartheta_i$$

to implement an absolute and relative refractory period. Specifically, *ϑ_{i}* jumps by Δ_{ϑ} = 100 mV every time an output spike is triggered, after which it exponentially decays back to its rest value of *ϑ*^{rest} = −50 mV. All neuronal spikes are delayed by 0.8 ms to simulate axonal delay and to allow efficient parallel simulation before they trigger postsynaptic potentials in other neurons.

#### Time varying spiking input model

Inputs were generated from 500 input neurons divided into five populations of 100 Poisson neurons each. All inputs were implemented as independent Poisson processes with the same average firing rate of 5 Hz, and neurons within the same group shared the same instantaneous firing rate. Concretely, neurons in *P*0 had a fixed firing rate of 5 Hz, whereas the firing rates in groups *P*1 and *P*2 changed slowly over time. Specifically, we generated periodic template signals *X*(*t*) from a Fourier basis

$$X(t) = \sum_{k} k^{-\alpha}\, \theta_k \sin\!\left(2\pi\left(\frac{k t}{T} + \phi_k\right)\right)$$

with random uniformly drawn coefficients 0 ≤ *θ_{k}*, *ϕ_{k}* < 1. The spectral decay constant *α* = 1.1 biased the signals toward slow frequencies and thus slowly varying temporal structure. We chose the period *T* = 3 s for *P*1 and (3 + 1/13) s for *P*2 respectively. The different periods were chosen to avoid phase-locking between the two signals. Both signals were then sampled at 10 ms intervals, centered on 5 Hz, variance-normalized, and clipped below at 0.1 Hz before using them as periodic time-varying firing rates for *P*1 and *P*2. Additionally, we simulated control inputs *P*1/2_{ctl} of the two input signals by destroying their slowly varying temporal structure. To that end, we repeated the original firing rate profile for 13 periods before shuffling it on a time grid with 10 ms temporal resolution.
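The slow-rate templates can be sketched as below. The specific Fourier-basis form is an assumed reconstruction consistent with the symbols in the text (coefficients *θ_k*, phases *ϕ_k* in [0, 1), spectral decay *α* = 1.1); the truncation at 20 harmonics is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K, T = 1.1, 20, 3.0                 # decay, harmonics (assumed), period
theta, phi = rng.random(K), rng.random(K)  # uniform coefficients and phases
t = np.arange(0.0, T, 0.01)                # 10 ms sampling grid
k = np.arange(1, K + 1)[:, None]

# Power-law-weighted sum of sinusoids biased toward slow frequencies
X = (k ** -alpha * theta[:, None]
     * np.sin(2 * np.pi * (k * t / T + phi[:, None]))).sum(0)

# Center on 5 Hz, normalize variance, clip below at 0.1 Hz
rate = 5.0 + (X - X.mean()) / X.std()
rate = np.clip(rate, 0.1, None)
```

Shuffling `rate` on its 10 ms grid (as for the control populations) preserves the rate distribution while destroying the slow temporal structure.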

#### Spike-based LPL

To extend LPL to the spiking domain, we built on SuperSpike [77], a previously published online learning rule, which had thus far only been used for supervised learning in SNNs. In this article, we replaced the supervised loss with the LPL loss (Eq. (3)) without the decorrelation term. The resulting spiking LPL online rule for the weight *w_{ij}* is given by

$$\frac{\mathrm{d}w_{ij}}{\mathrm{d}t} = \eta\, f'(U_i(t))\, (\epsilon \ast S_j)(t) \left[ -\left( (\alpha \ast S_i)(t) - (\alpha \ast S_i)(t - \Delta t) \right) + \lambda\, \frac{(\alpha \ast S_i)(t) - \bar{S}_i(t)}{\bar{\sigma}_i^2(t) + \xi} \right] + \delta\, S_j(t) \tag{18}$$

with the learning rate *η* = 10^{−2} and a small positive constant *ξ* = 10^{−3} to avoid division by zero. Further, *α* is a double exponential causal filter kernel applied to the neuronal spike train *S_{i}*(*t*). Similarly, *ϵ* is a causal filter kernel that captures the temporal shape of how a presynaptic spike influences the postsynaptic membrane potential. For simplicity, we assumed a fixed kernel and ignored any conductance-based effects and NMDA dependence. Further, we added the transmitter-triggered plasticity term *δ S_{j}*(*t*) with *δ* = 10^{−5} to ensure that the weights of quiescent neurons slowly potentiate in the absence of activity to ultimately render them active [78]. Finally, *λ* = 1 is a constant that modulates the strength of the Hebbian term. We set it to zero to switch off the Hebbian term where this is mentioned explicitly.

Further,

$$f'(U_i) = \beta \left(1 + \beta \left| U_i - \vartheta^{\mathrm{rest}} \right| \right)^{-2}$$

is the surrogate derivative with *β* = 1 mV^{−1}, which renders the learning rule voltage-dependent. Finally, $\bar{S}_i(t)$ and $\bar{\sigma}_i^2(t)$ are slowly varying quantities obtained online as exponential moving averages with the dynamics

$$\tau^{\mathrm{mean}} \frac{\mathrm{d}\bar{S}_i}{\mathrm{d}t} = -\bar{S}_i + (\alpha \ast S_i)(t), \qquad \tau^{\mathrm{var}} \frac{\mathrm{d}\bar{\sigma}_i^2}{\mathrm{d}t} = -\bar{\sigma}_i^2 + \left( (\alpha \ast S_i)(t) - \bar{S}_i(t) \right)^2$$

with *τ*^{mean} = 600 s and *τ*^{var} = 20 s. These quantities confer the spiking LPL rule with elements of metaplasticity [33].

In our simulations, we computed the convolutions with *α* and *ϵ* by double exponential filtering of all quantities. Generally, for a time-varying quantity *c*(*t*) we computed

$$\tau^{\mathrm{rise}} \frac{\mathrm{d}\tilde{c}}{\mathrm{d}t} = -\tilde{c} + c(t), \qquad \tau^{\mathrm{fall}} \frac{\mathrm{d}\bar{c}}{\mathrm{d}t} = -\bar{c} + \tilde{c}(t),$$

which yields the convolved quantity $\bar{c}(t)$, with the rise and fall time constants of the respective kernels.

Overall, one can appreciate the resemblance of Eq. (18) to its non-spiking equivalent (cf. Eq. (1)). As in the non-spiking case, the learning rule is local in that it only depends on pre- and postsynaptic quantities. The predictive term in the learning rule can be seen as an instantaneous error signal which is minimized when the present output spike train *S_{i}*(*t*) is identical to a delayed version of the same spike train *S_{i}*(*t* − Δ*t*) with Δ*t* = 20 ms. In other words, the past output serves as a target spike train (cf. [77]).
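A discrete-time sketch of the spiking LPL update described above follows. Single-exponential traces stand in for the convolutions with *α* and *ϵ*, and the membrane voltage is a toy stand-in; the constants follow the text, everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, n_steps, n_in = 1e-3, 2000, 50          # 1 ms steps, 2 s, 50 inputs
eta, xi, lam, delta = 1e-2, 1e-3, 1.0, 1e-5
beta, theta_rest = 1.0, -50.0               # beta in 1/mV, voltages in mV
delay = 20                                  # Delta t = 20 ms in steps
tau_syn, tau_mean, tau_var = 20e-3, 600.0, 20.0

w = 0.1 * rng.random(n_in)
pre_trace = np.zeros(n_in)                  # stands in for (epsilon * S_j)(t)
post_hist = np.zeros(n_steps)               # history of (alpha * S_i)(t)
S_bar, var_bar, z = 0.0, 1.0, 0.0

for t in range(1, n_steps):
    pre_spikes = (rng.random(n_in) < 5.0 * dt).astype(float)  # 5 Hz Poisson
    pre_trace += -dt / tau_syn * pre_trace + pre_spikes
    u = -60.0 + 100.0 * (w @ pre_trace)     # toy membrane voltage (mV)
    post_spike = float(u > theta_rest)
    z += -dt / tau_syn * z + post_spike     # filtered output (alpha * S_i)(t)
    post_hist[t] = z

    # slow online moving averages (metaplasticity variables)
    S_bar += dt / tau_mean * (z - S_bar)
    var_bar += dt / tau_var * ((z - S_bar) ** 2 - var_bar)

    f_prime = beta / (1.0 + beta * abs(u - theta_rest)) ** 2  # surrogate
    z_past = post_hist[t - delay] if t >= delay else 0.0
    err = -(z - z_past) + lam * (z - S_bar) / (var_bar + xi)  # pred. + Hebbian
    w += eta * f_prime * err * pre_trace + delta * pre_spikes
```

Note how the predictive error compares the filtered output against its own 20 ms-delayed copy, while the variance estimate rescales the Hebbian term.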

#### Microcircuit connectivity

Connections from the input population to the network neurons and recurrent connections were initialized with unstructured random sparse connectivity with different initial weight values (Supplementary Table S5). One exception to this rule was the excitatory-to-inhibitory connectivity, which was set up with a Gaussian connection probability profile

$$P(i \to j) \propto \exp\!\left( -\frac{\left(j - c(i)\right)^2}{2\sigma^2} \right)$$

where *c*(*i*) = 0.25 *i* and *σ*^{2} = 20, to mimic the dense local connectivity onto inhibitory neurons due to which inhibitory neurons inherit some of the tuning of their surrounding excitatory cells.

#### Inhibitory plasticity

Inhibitory-to-excitatory synapses were plastic unless mentioned otherwise. We modeled inhibitory plasticity according to a previously published inhibitory STDP rule

$$\frac{\mathrm{d}w_{ij}}{\mathrm{d}t} = \zeta \left[ \left( x_i(t) - 2 \kappa \tau^{\mathrm{STDP}} \right) S_j(t) + x_j(t)\, S_i(t) \right]$$

using pre- and postsynaptic traces

$$\frac{\mathrm{d}x_i}{\mathrm{d}t} = -\frac{x_i}{\tau^{\mathrm{STDP}}} + S_i(t)$$

with time constant *τ*^{STDP} = 20 ms, learning rate *ζ* = 1 × 10^{−3}, and target firing rate *κ* = 10 Hz.
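This symmetric, rate-homeostatic form can be sketched as below. The exact rule form is an assumption of the Vogels type, consistent with the stated parameters; `istdp_step` and its arguments are illustrative names.

```python
import numpy as np

tau_stdp, zeta, kappa = 20e-3, 1e-3, 10.0
alpha_dep = 2.0 * kappa * tau_stdp        # depression bias 2*kappa*tau

def istdp_step(w, x_pre, x_post, s_pre, s_post, dt):
    """One Euler step; s_pre/s_post are 0/1 spike indicators."""
    x_pre += -dt / tau_stdp * x_pre + s_pre
    x_post += -dt / tau_stdp * x_post + s_post
    # near-coincident pre/post spikes potentiate inhibition; the constant
    # depression on presynaptic spikes pulls the postsynaptic rate to kappa
    w += zeta * ((x_post - alpha_dep) * s_pre + x_pre * s_post)
    return max(w, 0.0), x_pre, x_post

# lone presynaptic spike with a silent postsynaptic neuron -> depression
w_dep, _, _ = istdp_step(0.5, 0.0, 0.0, 1.0, 0.0, 1e-4)
# presynaptic spike shortly after a postsynaptic spike -> potentiation
w_pot, _, _ = istdp_step(0.5, 0.0, 1.0, 1.0, 0.0, 1e-4)
```

The sign structure implements the homeostatic balance: inhibition strengthens where the postsynaptic neuron fires above the target rate *κ* and weakens where it is quiescent.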

#### Reconstruction of input signals from network activity

To reconstruct the input signals, we first computed input firing rates of the five input populations by binning their spikes emitted during the last 100 s of the simulation in 25 ms bins. We further averaged the binned spikes over input neurons to provide the regression targets. Similarly, we computed the binned firing rates of the network neurons, but without averaging over neurons. We then performed Lasso regression using scikit-learn with default parameters to predict each target input signal from the network firing rates. Specifically, we trained on the first 95 s of the activity data and computed *R*^{2} scores on the Lasso predictions over the last 5 s of held-out data (Fig. 5b).

#### Signal selectivity measures

We measured the signal selectivity of each neuron to the two slow signals relative to their associated shuffled controls (Fig. 5d) using the following relative measure defined on the weights:

$$\mathrm{Selectivity}_i = \frac{\bar{w}_i^{\mathrm{sig}} - \bar{w}_i^{\mathrm{ctl}}}{\bar{w}_i^{\mathrm{sig}} + \bar{w}_i^{\mathrm{ctl}}}$$

where $\bar{w}_i^{\mathrm{sig}}$ is the average synaptic connection strength from the signal pathways *P*1/2 onto excitatory neuron *i*, and $\bar{w}_i^{\mathrm{ctl}}$ is the same but from the control pathways *P*1/2_{ctl}.

#### Representational dimension

To quantify the dimensionality of the learned neuronal representations (Fig. 5f), we binned network spikes in 25 ms bins and computed the participation ratio (Eq. (11)) of the binned data.
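Assuming the standard participation-ratio definition PR = (Σ_{i} λ_{i})² / Σ_{i} λ_{i}², where the λ_{i} are eigenvalues of the covariance of the binned activity, the computation can be sketched as:

```python
import numpy as np

def participation_ratio(X):
    """PR of data X with shape (time bins, neurons)."""
    lam = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(4)
iso = rng.normal(size=(5000, 10))                          # ~isotropic: PR near 10
one_d = np.outer(rng.normal(size=5000), np.ones(10)) + 0.01 * iso  # ~1D activity

pr_iso, pr_one = participation_ratio(iso), participation_ratio(one_d)
```

The two synthetic cases bracket the measure: isotropic activity in 10 dimensions yields a PR near 10, while activity confined to one shared dimension yields a PR near 1.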

#### Neuronal tuning analysis of the learned weight profiles

To characterize the receptive fields of each neuron (Fig. 5g,h), we plotted *w*_{P1} against *w*_{P2} for every neuron in the excitatory population (Fig. 5g,h; left), and colored the resulting weight vectors by mapping the cosine of the angle between each vector and the x-axis (*w*_{P2}) to a diverging color map. Furthermore, we calculated the relative tuning index as

$$\mathrm{TI}_i = \frac{w_i^{P1} - w_i^{P2}}{w_i^{P1} + w_i^{P2}}.$$

#### STDP induction protocols

To measure STDP curves, we simulated a single neuron using the spiking LPL rule (Eq. (18)) with a learning rate of *η* = 5 × 10^{-3}. In all cases, we measured plasticity outcomes from 100 pairings of pre- and postsynaptic spikes at varying repetition frequencies *ρ*. The postsynaptic neuron’s membrane voltage was held fixed at −51 mV between spikes for the entire duration of the protocol. For the STDP curves, we set the initial synaptic weight to 0.5 and simulated 100 different pre-post time delays Δ*t* chosen uniformly from the interval [−50, 50] ms with *ρ* = 10 Hz. To measure the rate dependence of plasticity, we repeated the simulations for fixed Δ*t* = ±10 ms while varying the repetition frequency *ρ*.

#### Numerical simulations

All SNN simulations were implemented in C++ using the Auryn SNN simulator^{1} [79]. Throughout we used a 0.1 ms simulation time step. Simulations were run on seven Dell Precision workstations with eight-core Intel Xeon CPUs.

## Data availability

The deep learning tasks used the STL-10 [73] and CIFAR-10 [74] datasets, typically available through all major machine learning libraries. The original releases for these datasets can be found at http://ai.stanford.edu/%7Eacoates/stl10/, and https://www.cs.toronto.edu/~kriz/cifar.html respectively.

## Code availability

The simulation code to reproduce the key results is publicly available at https://github.com/fmi-basel/latent-predictive-learning.

## Author contributions

F.Z. conceived the study. M.S.H. and F.Z. developed the theory. M.S.H. wrote DNN code, performed simulations, and analysis. F.Z. developed SNN code. M.S.H. and F.Z. wrote the manuscript.

## Competing interests

The authors declare no competing interests.

## Supplementary Figures

## Supplementary Tables

## Supplementary Notes

### S1 Equivalence of the objective function and learning rule formulations

Here, we show that the objective functions defined in Eqs. (4), (5), and (6) indeed result in the LPL rule (Eq. (7)).

#### Predictive component

We recall that the predictive objective is the mean squared difference between neuronal activity in consecutive time steps,

$$\mathcal{L}_{\mathrm{pred}} = \frac{1}{M} \sum_{i=1}^{M} \left( z_i(t) - \mathrm{Stopgrad}\!\left[ z_i(t - \Delta t) \right] \right)^2.$$

Taking the derivative with respect to the network weights results in the learning rule

$$\Delta w_{ij} \propto -\left( z_i(t) - z_i(t - \Delta t) \right) f'(a_i(t))\, x_j(t),$$

which does not require backpropagation through time because of the Stopgrad function.

#### Hebbian component

The Hebbian component minimizes the negative logarithm of the variance of neuronal activity:

$$\mathcal{L}_{\mathrm{Hebb}} = -\frac{1}{M} \sum_{i=1}^{M} \ln \sigma_i^2,$$

where $\bar{z}_i$ and $\sigma_i^2$ are the mean and variance of the activity of the *i*th output neuron over the minibatch. The corresponding learning rule is obtained as the negative gradient of this loss function with respect to the weights *W*. The gradient itself is given by:

$$\frac{\partial \mathcal{L}_{\mathrm{Hebb}}}{\partial w_{ij}} \propto -\frac{z_i - \bar{z}_i}{\sigma_i^2}\, f'(a_i)\, x_j.$$

Note that the objective and the resulting gradient are essentially unchanged, up to a scaling factor, when we use running estimates of the variance with a time constant *τ* instead of batch estimates.

#### Decorrelation component

Finally, the decorrelation objective is the sum of the squared off-diagonal terms of the covariance matrix between units,

$$\mathcal{L}_{\mathrm{decorr}} = \sum_{i} \sum_{k \neq i} \mathrm{Cov}(z_i, z_k)^2,$$

which gives the gradient:

$$\frac{\partial \mathcal{L}_{\mathrm{decorr}}}{\partial w_{ij}} \propto \sum_{k \neq i} \mathrm{Cov}(z_i, z_k)\, (z_k - \bar{z}_k)\, f'(a_i)\, x_j.$$

#### The full learning rule

The combined weight update for a descent along the sum of the three gradients in Eqs. (28), (29), and (30) in a single-layer network finally yields the LPL rule including the decorrelation component:

$$\Delta w_{ij} = \eta\, f'(a_i)\, x_j \left[ -\left( z_i(t) - z_i(t - \Delta t) \right) + \lambda_1 \frac{z_i - \bar{z}_i}{\sigma_i^2} - \lambda_2 \sum_{k \neq i} \mathrm{Cov}(z_i, z_k)\, (z_k - \bar{z}_k) \right]$$

up to the appropriate normalizing constants, where *λ*_{1} and *λ*_{2} are the loss coefficients. Including weight decay in the weight update finally yields the LPL rule for a network given in Eq. (7).

### S2 Relating the Hebbian component of LPL to Oja’s rule

To see the relation of the Hebbian component of LPL to the classic Oja’s rule [1], we consider the case of a single output neuron (*M* = 1) with no nonlinearity (*f*’(*a*) = 1), along with the assumption that the input is zero-centered, 〈**x**〉 = 0. Consequently, 〈*z*〉 = 0 and *σ*^{2} = 〈*z*^{2}〉, which yields a very simple Hebbian learning rule for descending the gradient in Eq. (29):

$$\Delta \mathbf{w} = \eta\, \frac{z}{\langle z^2 \rangle}\, \mathbf{x}.$$

This update rule along with a weight decay (with coefficient *η_{w}*) yields a learning rule that, on average, is equivalent to Oja’s rule up to a scaling factor, and in fact has exactly the same non-zero fixed point when *η_{w}* = 1, but with different convergence dynamics because of the multiplication by 1/〈*z*^{2}〉.

Oja’s rule is presented below for reference:

$$\Delta \mathbf{w} = \eta\, z \left( \mathbf{x} - z\, \mathbf{w} \right).$$
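The claimed behavior of Oja’s rule, Δ**w** = *η z*(**x** − *z* **w**), can be checked numerically: on data with a known leading axis it converges to the unit-norm first principal component. The data dimensions and learning rate below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
stds = np.array([np.sqrt(3.0), 1.0, 0.5])   # leading principal axis is x[0]
X = rng.normal(size=(20000, 3)) * stds

w = rng.normal(size=3)
w /= np.linalg.norm(w)
eta = 1e-3
for x in X:
    z = w @ x
    w += eta * z * (x - z * w)              # Oja's rule

alignment = abs(w[0]) / np.linalg.norm(w)   # cosine with the first PC
```

After training, the weight vector is approximately unit norm and aligned with the direction of largest variance, matching the fixed point discussed above.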

### S3 Importance of the variance-dependent modulation of Hebbian learning

To analytically understand the importance of the variance-dependent scaling of the Hebbian term in the learning rule, we first looked at the synthetic two-dimensional learning task from Fig. 2e, and modeled the behaviour of the LPL loss functions under a particular distribution of the representations. Specifically, we considered the case where the two input clusters map to a mixture of two normal distributions in *representation space* with each Gaussian component corresponding to the representations of one input cluster. Each of the two Gaussians are assumed to have a standard deviation of *r*, with their means symmetrically located on either side of zero with a distance of *D* = 4*r* between their centers. We used this setting to investigate how the predictive and Hebbian loss terms of LPL behave under different values of the representational variance by co-varying *r* and *D*.

The predictive loss in this case is proportional to the expected squared difference 〈(*z*_{1} – *z*_{2})^{2}〉 between two independently drawn samples *z*_{1/2} from the same Gaussian, i.e., $\mathcal{L}_{\mathrm{pred}} \propto 2r^2$. The overall variance of the representations is $\sigma^2 = r^2 + D^2/4$ (the variance of the Gaussian mixture). Under this representational distribution, we studied the overall loss function obtained by combining the predictive loss with different variance-maximizing losses, each a different decreasing function of the variance. Specifically, we considered the naive variance maximization objective $-\sigma^2$ alongside saturating alternatives such as $-\ln \sigma^2$. These loss functions are plotted in Fig. S11b–d, along with the predictive loss and the resulting full LPL objective in each case. With the naive variance maximization objective $-\sigma^2$, the full objective is dominated by the variance term at large values of *D* (Fig. S11b), and therefore inherently drives unstable learning. It is not possible to remedy this situation by simply using a smaller weight for the variance loss, for instance by weighting $-\sigma^2$ with a small weight of 0.1, because downweighting the variance objective simply moves the loss minimum from *D* = ∞ to *D* = 0, the exact situation of collapse the variance objective is meant to prevent (Fig. S11b). In contrast, combining the saturating variance objectives (Fig. S11c,d) with the predictive loss yields loss landscapes with minima at finite non-zero values of *D*. This is because these variance objectives only dominate at low values of *D* but have diminishing influence with growing *D*, allowing the predictive term to dictate learning and preventing runaway activity.
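The scaling argument can be evaluated numerically; the closed forms follow from the setting itself (the expected squared difference of two draws from a Gaussian with standard deviation *r* is 2*r*², and a symmetric two-component mixture has variance *r*² + *D*²/4, with *D* = 4*r* co-varied).

```python
import numpy as np

# With D = 4r: L_pred = 2 r^2 = D^2/8 and sigma^2 = r^2 + D^2/4 = 5 D^2/16
D = np.linspace(0.1, 10.0, 1000)
L_pred = D ** 2 / 8.0
var = 5.0 * D ** 2 / 16.0

naive = L_pred - var            # with -sigma^2: decreases without bound
logvar = L_pred - np.log(var)   # with -log sigma^2: finite minimum

D_star = D[np.argmin(logvar)]   # analytic minimum at D = 2*sqrt(2)
```

The naive combination is monotonically decreasing in *D* (runaway solution), whereas the logarithmic variance objective produces an interior minimum at *D* = 2√2 on this grid.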

We validated these scaling arguments with learning simulations using each of the proposed learning objectives on the synthetic two-dimensional sequence learning task from Fig. 2e. We found that the naive variance maximization objective results in poor learning of cluster selectivity, whereas the saturating variance objectives prove effective (Fig. S11e). Furthermore, the naive variance objective indeed suffers from runaway instability (Fig. S11f).

### S4 Predictive feature selected by LPL strictly depends on temporal contiguity properties of the input sequence

The slow or “predictive” feature picked up by a single neuron learning with LPL purely depends on the temporal order of stimuli that it is exposed to. One would expect, then, that it is possible to manipulate the learned feature by altering the temporal sequence of stimuli.

To illustrate that this is indeed the case, we designed a predictive learning task similar to that in Fig. 2e using a subset of images from the MNIST handwritten digit dataset [3]. Specifically, we sampled 2000 images from this dataset corresponding to the digits “five” and “six”, equally distributed between the two classes. We generated sample inputs by embedding these 28 × 28 grayscale images in a 56 × 56 blank canvas at either the top-left or bottom-right location. Because we sought to demonstrate that changing the temporal transition structure qualitatively changes neuronal selectivity, we considered two types of sequences with distinct temporal contiguity properties (Fig. S12). In the Digit Sequence we preserved digit identity (either five or six) across subsequent input images, while changing their position on the canvas, whereas in the Location Sequence, we presented different digits at the same location in successive inputs. Therefore, the predictive feature is location in the Location Sequence and digit identity in the Digit Sequence. Furthermore, digit identity and digit location were approximately aligned with the first two principal components of the data which account for 30 % and 5 % of the explained variance respectively (Fig. S12).

We again exposed a single rate neuron model to these two sequence types, while allowing the plastic input connections to evolve according to the LPL rule. After convergence, we measured neuronal selectivity to digit identity and location. We measured selectivity to location and digit identity with the same measure defined in Eq. (9), only changing what inputs fall into clusters 1 and 2 in each case. Concretely, we measured selectivity to digit identity by setting 〈*z*_{1}〉 to be the mean response to the digit five (at any location) and 〈*z*_{2}〉 the mean response to the digit six. Finally, we set 〈*z*_{1}〉 and 〈*z*_{2}〉 to the mean responses to digits at the two locations regardless of digit identity in order to measure location selectivity.

At initialization with random weights, the neuron was partially selective to both location and digit identity (Fig. S12f). However, subsequent training with LPL rendered the neuron purely selective to either location or digit identity, depending on which sequence it was exposed to during training. Yet, when the predictive term was turned off, the specific sequence did not matter and the neuron always became selective to location, which coincides with the direction of highest variance in the data (PC1; Supplementary Fig. S12). Finally, we confirmed that Oja’s rule showed the same behavior (Fig. S12f). Thus, a neuron learning with LPL finds temporally contiguous features in high-dimensional sequential data rather than the direction of largest variance, and the temporally contiguous feature that is learned is strictly determined by the temporal sequence of stimuli the neuron is exposed to.

### S5 Details of the deep neural network architecture

For all DNN simulations, we used the convolutional layers of the VGG-11 architecture, consisting of eight blocks, each containing 3 × 3 convolutions and the ReLU activation function, followed by a 2 × 2 max-pool operation in some blocks (detailed architecture description provided below).

Furthermore, for the simulations modeling unsupervised learning in the IT, we used an adaptive average pooling layer with spatial output dimensions of 13 × 1. This ensured that the final pooling layer preserved spatial separation along the canvas itself so that the final feature map consisted of 13 × 1 512-dimensional vectors. We added a fully connected layer on top of these feature maps to finally get a single 512-dimensional feature vector per image.

## Acknowledgments

We thank all members of the Zenke Group for comments and discussions that shaped this project, and Atul Kumar Sinha for many helpful suggestions. We are particularly grateful to Julian Rossbroich for providing invaluable insights throughout the course of this work. This project was supported by the Swiss National Science Foundation [grant number PCEFP3_202981] and the Novartis Research Foundation.

## Footnotes

This manuscript version now contains additional results on Latent Predictive Learning in spiking neural networks in which lateral inhibition and inhibitory plasticity implement neuronal decorrelation.

## References
