## Abstract

In addition to long-time-scale rewiring, synapses in the brain are subject to significant modulation that occurs at much shorter time scales and allows them to process short-term information. Despite this, models of the brain like recurrent neural networks (RNNs) often have their weights frozen after training, relying on an internal state stored in neuron activity to process temporal information. Although networks with dynamical synapses have been explored previously, such dynamics are often added to networks that also have recurrent connections, and thus the short-time-scale computational capabilities of synapse modulation alone remain unclear. In this work, we analyze the dynamics of a network that relies solely on synaptic modulations to process short-time-scale information: the multi-plasticity network (MPN). We thoroughly examine the neural population dynamics of the MPN trained on integration-based tasks and compare it to known RNN dynamics, finding the two to have fundamentally different behavior and attractor structure. We find these differences in dynamics allow the MPN to outperform its RNN counterparts on several neuroscience-relevant tasks. Of note, the MPN has a significantly simpler attractor structure that allows it to be more flexible in training and sequential-learning settings. Lastly, we investigate how the dynamics change for MPNs trained on contextual and continuous integration tasks.

## 1 Introduction

The brain’s synapses constantly change in response to information under several distinct biological mechanisms [1–7]. These changes can serve significantly different purposes and occur at drastically different time scales.

Such mechanisms include synaptic rewiring, which modifies the topology of connections between neurons in our brain at a time scale of minutes to hours and is assumed to be the basis of long-term memory [3]. To understand the role of rewiring for computation and information storage, artificial neural networks (ANNs) whose synapses/weights are modified during a training period have received widespread study in neuroscience. The need for ANNs that could also incorporate shorter-term, temporally correlated information naturally arose, leading to recurrent neural networks (RNNs) that are capable of holding transient internal states in their neural activity using recurrent connections. Said RNNs have found widespread success in modeling parts of our brain and as such significant work has been spent trying to understand how they operate [8–16]. In particular, considerable progress has been made in analyzing brain-like RNNs as population-level dynamical systems, a framework known as *neural population dynamics* [17]. Such studies have revealed a striking universality of the underlying computational scaffold across different types of RNNs and tasks [18].

At time scales shorter than that of synaptic rewiring, individual synapses can have their strength modified [4–7]. These changes can occur over a spectrum of time scales and can be intrinsically transient [6, 7]. Though such mechanisms may not lead to structural changes in our connectome, they are thought to be vital to the brain’s function. For example, spike-time dependent plasticity (STDP) adjusts the strengths of connections based on the relative timing of pre- and post-synaptic spikes and is widely believed to play an integral part in learning and the storage of information in our brains [4, 5]. Although RNNs model the brain’s significant recurrent connections, the weights in these networks still remain fixed after training and hence neglect the role shorter-term synaptic dynamics can have in processing information.

In this work, we investigate a new type of ANN that uses a biologically-motivated synaptic modulation to process short-term sequential information. The *multi-plasticity network* (MPN) learns using two distinct plasticity mechanisms: (1) long-term synaptic rewiring via standard supervised ANN training and (2) short-term, Hebbian-like synaptic modulations. Unlike many other network models with synaptic dynamics [19–23], *the MPN has no recurrence*, and thus can only rely on modulations of synaptic strengths to pass short-term information across time. Although both recurrent connections and synaptic modulation are present in the brain, it can be difficult to isolate how each of these affects temporal computation. The MPN thus allows for an in-depth study of the computational power of synaptic modulation alone and how the dynamics behind said computations may differ from networks like RNNs. In particular, we thoroughly characterize the MPN’s low-dimensional behavior in the neural population dynamics framework [17], allowing for a comparison to known RNN dynamical systems. We find the MPN computes using completely different dynamics than its RNN counterparts, and further explore the potential benefits behind its distinct dynamics on several neuroscience-relevant tasks.

We specifically investigate the dynamics of the MPN on tasks that require an integration of information over time. These tasks consist of a time-sequence of inputs passed to the network, followed by a cue to respond based on the accumulated evidence. To respond correctly, the network must both store and update its internal state and compare several distinct items in its memory. The dynamical systems behind RNNs that learn such tasks have been well studied in the neuroscience and machine learning literature [24, 25].

### Contributions

The primary contributions and findings of this work are as follows:

We elucidate the neural population dynamics of the MPN trained on integration-based tasks and show it operates with qualitatively different dynamics and attractor structure than RNNs. We support this with analytical approximations of said dynamics.

We show how the MPN’s synaptic modulations allow it to store and update information using a task-independent, single point-like attractor in the network’s state space.

Despite its simple attractor structure, we show the MPN performs at a level comparable to or exceeding that of RNNs on several neuroscience-relevant measures for integration-based tasks.

The MPN is shown to have dynamics that make it a more effective reservoir, less susceptible to catastrophic forgetting, and more flexible to taking in new information than RNN counterparts.

We show the MPN is capable of learning more complex tasks, including contextual and continuous integration, and elucidate the changes in dynamics that allow the network to solve them.

### Related work

Networks with synaptic dynamics have been investigated previously [19–23, 26, 27]. The MPN examined in this work is most similar to the ‘HebbFF’ network examined in Ref. [27], where it was shown to have promising generalization performance on memory-recall tasks. The MPN uses a multiplicative modulation mechanism rather than the additive modulation of the ‘HebbFF’. The synaptic modulation updates were originally inspired by the ‘fast weights’ used in machine learning for flexible learning [28]. However, in the MPN, both plasticity rules apply to the same weights rather than different ones.

This work largely focuses on understanding computation through a *neural population dynamics*-like analysis [17]. In particular, we focus on the dynamics of networks trained on integration-based tasks, which have previously been studied in RNNs [18, 29–31]. These studies have demonstrated a degree of universality of the underlying computational structure across different types of tasks and RNNs [18]. Due to the MPN’s dynamic weights, its operation is fundamentally different from that of said recurrent networks.

## 2 Setup

All integration-based tasks in this work consist of a discrete sequence of vector inputs, **x**_{t} for *t* = 1, 2, *…*, *T*. At time *T*, the network is queried by a ‘go signal’ for an output, for which the correct response can depend on information from the entire input sequence. Throughout this paper, we denote vectors using lowercase bold letters, matrices by uppercase bold letters, and scalars using standard (not-bold) letters. The input, hidden, and output layers of the networks we study have *d*, *n*, and *N* neurons, respectively.

### Multi-plasticity network

The *multi-plasticity network (MPN)* is an artificial neural network consisting of input, hidden, and output layers of neurons. It is identical to a fully-connected, two-layer, feedforward network (Fig. 1, middle), with one major exception: the weights connecting the input and hidden layer are modified by the time-dependent *synapse modulation (SM) matrix*, **M** (Fig. 1, left). The expression for the hidden layer activity at time step *t* is

**h**_{t} = tanh[(**W**_{inp} ⊙ (**1** + **M**_{t−1})) **x**_{t}],    (1)

where **W**_{inp} is an *n*-by-*d* weight matrix representing the network’s synaptic strengths that is fixed after training, **1** is the all-ones matrix, ‘⊙’ denotes element-wise multiplication of the two matrices (the Hadamard product), and tanh(·) is applied element-wise. For each synaptic weight in **W**_{inp}, a corresponding element of **M**_{t−1} multiplicatively modulates its strength. Note if **M**_{t−1} = **0** the modulation term vanishes, so **W**_{inp} is unmodified and the network simply functions as a fully-connected feedforward network.

What allows the MPN to store and manipulate information as the input sequence is passed to the network is how the SM matrix, **M**_{t}, changes over time. We define **M**_{0} = **0** at the start of each input sequence and at each time step we update the SM matrix with a Hebbian-like learning rule [2, 27, 28],

**M**_{t} = *λ***M**_{t−1} + *η***h**_{t}**x**_{t}^{T},    (2)

where *λ* and *η* are parameters learned during training. That is, each element of the SM matrix is updated by the product of the pre- and post-synaptic firing rates times some learned parameter *η*. We allow for −∞ *< η <* ∞, so if *η* is negative the updates are anti-Hebbian. Additionally, 0 *< λ <* 1, so the SM matrix exponentially decays at each time step, asymptotically returning to its **M** = **0** baseline. Since the SM matrix is updated and passed forward at each time step, we will often refer to **M**_{t} as the *state* of the MPN.
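To make the two update rules concrete, here is a minimal NumPy sketch of a single MPN time step. The layer sizes, weight scale, and parameter values are our own illustrative choices (not the paper’s), and we assume the multiplicative form **W**_{inp} ⊙ (**1** + **M**) described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameter values (our choices, not the paper's):
d, n = 10, 50                                          # input and hidden layer sizes
W_inp = rng.normal(scale=1 / np.sqrt(d), size=(n, d))  # fixed after training
eta, lam = 1.0, 0.9                                    # Hebbian gain and decay rate

def mpn_step(x, M):
    """One MPN time step: modulated feedforward pass, then SM-matrix update."""
    # Assumed multiplicative form: each weight is scaled by (1 + its modulation)
    h = np.tanh((W_inp * (1.0 + M)) @ x)
    # Hebbian-like update: decay toward 0 plus outer product of post- and pre-synaptic rates
    M_new = lam * M + eta * np.outer(h, x)
    return h, M_new

M = np.zeros((n, d))                                   # M_0 = 0 at sequence start
for t in range(5):
    x = rng.normal(size=d)
    h, M = mpn_step(x, M)
```

With **M** = **0** this reduces to a plain feedforward pass, and with zero input the SM matrix simply decays by the factor *λ*.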

Biologically, the SM matrix represents a general synapse-specific modulation of strength on shorter time scales than the structural changes. This is primarily envisioned as coming from associative forms of plasticity such as spike-time dependent plasticity, which depend on both pre- and post-synaptic neuron activity [4, 5]. While these mechanisms are very complicated and can span many orders of magnitude in time scales, for the purpose of this study we use the simplified dynamics in Eq. (2) for theoretical tractability. The decay induced by *λ* represents the return to baseline of the aforementioned processes, which occurs slowly relative to their onset [32, 33]. To ensure the eventual decay of such modulations, unless otherwise stated, throughout this work we further limit *λ < λ*^{max} with *λ*^{max} = 0.95.

Lastly, the output of the MPN at time *T* is determined by a fully-connected readout matrix, **y**_{T} = **W**_{RO}**h**_{T}, where **W**_{RO} is an *N* -by-*n* weight matrix adjusted during training. Throughout this work, we will view said readout matrix as *N* distinct *n*-dimensional readout vectors, i.e. one for each output neuron.

### Recurrent neural networks

As discussed in the introduction, throughout this work we will compare the learned dynamics and performance of the MPN to artificial RNNs. The hidden layer activity for the simplest recurrent neural network, the *Vanilla RNN*, is

**h**_{t} = tanh(**W**_{rec}**h**_{t−1} + **W**_{inp}**x**_{t}),    (3)

with **W**_{rec} the recurrent weights, an *n*-by-*n* matrix that updates the hidden neurons from one time step to the next (Fig. 1, right). We also consider a more sophisticated RNN structure, the *gated recurrent unit (GRU)*, that has additional gates to more precisely control the recurrent update of its hidden neurons (see Methods 5.2). In both these RNNs, information is stored and updated via the hidden neuron activity, so we will often refer to **h**_{t} as the RNNs’ *hidden state* or just its *state*. The output of the RNNs is determined through a trained readout matrix in the same manner as the MPN above, i.e. **y**_{T} = **W**_{RO}**h**_{T}.
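For contrast with the MPN, the Vanilla RNN’s update can be sketched the same way; we assume the standard form tanh(**W**_{rec}**h** + **W**_{inp}**x**), with our own illustrative sizes and weight scales.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 50                                          # illustrative layer sizes
W_inp = rng.normal(scale=1 / np.sqrt(d), size=(n, d))
W_rec = rng.normal(scale=1 / np.sqrt(n), size=(n, n))  # recurrent weights

def rnn_step(x, h):
    """One Vanilla RNN time step: the state is the hidden activity itself."""
    return np.tanh(W_rec @ h + W_inp @ x)

h = np.zeros(n)                                        # initial hidden state
for t in range(5):
    h = rnn_step(rng.normal(size=d), h)
```

Note the structural difference: here past information is carried by **h**_{t} itself, whereas in the MPN it is carried by the SM matrix.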

### Training

The weights of the MPN and RNNs will be trained using gradient descent/backpropagation through time, specifically ADAM [34]. All network weights are subject to L1 regularization to encourage sparse solutions (Methods 5.2). Cross-entropy loss is used as a measure of performance during training. Gaussian noise is added to all inputs of the networks we investigate.

## 3 Results

### 3.1 Network dynamics on a simple integration task

#### Simple integration task

We begin our investigation of the MPN’s dynamics by training it on a simple *N* -class^{1} integration task, inspired by previous works on RNN integration-dynamics [29, 31]. In this task, the network will need to determine for which of the *N* classes the input sequence contains the most evidence (Fig. 2a). Each stimulus input, **x**_{t}, can correspond to a discrete unit of evidence for one of the *N* classes. We also allow inputs that are evidence for none of the classes. The final input, **x**_{T}, will always be a special ‘go signal’ input that tells the network an output is expected. The network’s output should be an integration of evidence over the entire input sequence, with an output activity that is largest from the neuron that corresponds to the class with the maximal accumulated evidence.^{2} Prior to adding noise, each possible input, including the go signal, is mapped to a random binary vector (Fig. 2b). We will also investigate the effect of inserting a *delay period* between the stimulus period and the go signal, during which no input is passed to the network, other than noise (Fig. 2c).
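A trial of this task can be generated in a few lines. The helper below is our own sketch of the setup described above (the function name and default sizes are ours), with ties between classes resolved arbitrarily.

```python
import numpy as np

def make_integration_trial(rng, N=2, T=20, d=30, noise=0.1, input_vecs=None):
    """One trial of the N-class integration task (our sketch of the setup).

    Inputs 0..N-1 are a unit of evidence for that class, input N is a
    no-evidence input, and input N+1 is the 'go signal', always shown at time T.
    Ties between classes are resolved arbitrarily here.
    """
    if input_vecs is None:
        # Each possible input, including the go signal, maps to a random binary vector
        input_vecs = rng.integers(0, 2, size=(N + 2, d)).astype(float)
    ids = rng.integers(0, N + 1, size=T - 1)   # stimulus period: evidence or null
    ids = np.append(ids, N + 1)                # final input is the go signal
    xs = input_vecs[ids] + noise * rng.normal(size=(T, d))  # Gaussian input noise
    counts = np.bincount(ids[ids < N], minlength=N)
    return xs, int(np.argmax(counts))          # input sequence and max-evidence class

rng = np.random.default_rng(0)
xs, label = make_integration_trial(rng)
```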

We find MPNs are capable of learning the above integration task to near-perfect accuracy across a wide range of class counts, sequence lengths, and delay lengths. It is the goal of this section to illuminate the dynamics that allow the trained MPN to solve such a task and to compare them to more familiar RNN dynamics. Here in the main text, we explicitly explore the dynamics of a 2-class integration task; generalizations to *N >* 2 classes are straightforward and are discussed in the Methods 5.4. We will start by considering the simplest case of integration without a delay period, revisiting the effects of delay afterwards.

Before we dive into the dynamics of the MPN, we give a quick recap of the known RNN dynamics on integration-based tasks.

#### Review of RNN integration: attractor dynamics encodes accumulated evidence

Several studies, both on natural and artificial neural networks, have discovered that networks with recurrent connections develop attractor dynamics to solve integration-based tasks [29–31]. Here we specifically review the behavior of artificial RNNs on the aforementioned *N*-class integration tasks, which share many qualitative features with experimental observations of natural neural networks. Note also that the structure/dimensionality of the dynamics can depend on correlations between the various classes [31]; in this work we only investigate the case where the various classes are uncorrelated.

RNNs are capable of learning to solve the simple integration task at near-perfect accuracy and their dynamics are qualitatively the same across several architectures [18, 31]. Discerning the network’s behavior by looking at individual hidden neuron activity can be difficult (Fig. 3a), and so it is useful to turn to a population-level analysis of the dynamics. When the number of hidden neurons is much larger than the number of integration classes (*n* ≫ *N*), the population activity of the trained RNN primarily exists in a low-dimensional subspace of approximate dimension *N* − 1 [31]. This is due to recurrent dynamics that create a task-dependent attractor manifold of approximate dimension *N* − 1, and the hidden activity often operates close to said attractor.^{3} In the 2-class case, the RNN will operate close to a finite-length line attractor. The low dimensionality of the hidden activity allows for an intuitive visualization of the dynamics using a two-dimensional PCA projection (Fig. 3b). From the sample trajectories, we see the network’s hidden activity starts slightly offset from the line attractor before quickly falling towards its center. As evidence for one class over the other builds, *the hidden activity encodes accumulated evidence by moving along the one-dimensional attractor* (Fig. 3b). The two readout vectors are roughly aligned with the two ends of the line, so the further the final hidden activity, **h**_{T}, is toward one side of the attractor, the higher that class’s corresponding output, and thus the RNN correctly identifies the class with the most evidence. For later reference, we note that the hidden activity of the trained RNN is not highly dependent upon the input of the present time step (Fig. 3c); instead, it is the *change* in the hidden activity from one time step to the next, **h**_{t} − **h**_{t−1}, that is highly input-dependent (Fig. 3c, inset).
For the Vanilla RNN (GRU), we find 0.53 ± 0.01 (0.88 ± 0.01) of the hidden activity variance to be explained by the accumulated evidence and only 0.19 ± 0.01 (0.29 ± 0.01) to be explained by the present input to the network (mean ± s.e., Methods 5.4).
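The low-dimensional visualizations used throughout this section come from PCA on the population states (rows of a state-by-time matrix). A minimal helper of the kind used for such projections, written by us rather than taken from the paper’s code:

```python
import numpy as np

def pca_project(H, k=2):
    """Project state vectors (rows of H) onto their top-k principal components.

    Returns the k-dimensional projections and each PC's variance ratio.
    """
    Hc = H - H.mean(axis=0)                    # center the population activity
    U, S, Vt = np.linalg.svd(Hc, full_matrices=False)
    var_ratio = S**2 / np.sum(S**2)            # fraction of variance per component
    return Hc @ Vt[:k].T, var_ratio
```

For states that truly live near a low-dimensional subspace, the first few variance ratios dominate, which is what licenses the two-dimensional projections in Fig. 3.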

#### MPN hidden activity encodes inputs, not so much accumulated evidence

We now turn to analyzing the hidden activity of the trained MPNs in the same manner that was done for the RNNs. The MPN trained on a 2-class integration task appears to have significantly more sporadic activity in the individual components of **h**_{t} (Fig. 3d). We again find the hidden neuron activity to be low-dimensional, with approximate dimension 2.07 ± 0.12 (mean ± s.e.), lending it to informative visualization using a PCA projection (Methods 5.4). Unlike the RNN, we observe the hidden neuron activity to be separated into several distinct clusters (Fig. 3e). Exemplar input sequences cause **h**_{t} to rapidly transition between said clusters. Coloring the **h**_{t} by the sequence input at the present time step, we see the different inputs are what divide the hidden activity into distinct clusters, which we hence call *input-clusters* (Fig. 3f). That is, the hidden neuron activity is largely dependent upon the most recent input to the network, rather than the accumulated evidence as we saw for the RNN. However, within each input-cluster, we also see a variation in **h**_{t} from accumulated evidence (Fig. 3e). We now find only 0.21 ± 0.01 of the hidden activity variance to be explained by accumulated evidence and 0.87 ± 0.01 to be explained by the present input to the network (mean ± s.e., Methods 5.4).

With the hidden neuron activity primarily dependent upon the current input to the network, one may wonder how the MPN ultimately outputs information dependent upon the entire sequence to solve the task. Like the other possible inputs to the network, the go signal has its own distinct input-cluster within which the hidden activities vary by accumulated evidence. Amongst all input-clusters, we find the readout vectors are most aligned with the evidence variation within the go cluster (Fig. 3f, inset). The readouts are then primed to distinguish accumulated evidence immediately following a go signal, as required by the task.^{4}

#### MPNs encode accumulated evidence in the synapse modulations (M_{t})

Although the hidden neuron behavior is useful for comparison to the RNN and to understand what we might observe from neural recordings, information in the MPN is passed from one step to the next solely through the SM matrix (Eq. (2)) and so it is also insightful to understand its dynamics. Flattening each **M**_{t} matrix, we can investigate the population dynamics in a manner identical to that used for the hidden activity.

Once again, we find the variation of the SM matrix to be low-dimensional, meaning we can visualize the evolution of its elements in its PCA space (Figs. 3g,h). From exemplar sequences, we see that **M**_{t} appears to evolve in time along a particular direction as the input sequence is passed (Fig. 3g). Perpendicular to this direction, we see a distinct separation of **M**_{t} values by accumulated evidence, very similar to what was seen for the RNN hidden activity (Fig. 3h). Also like the RNN, the distinct evidence inputs tend to cause a change in the state in opposite directions (Fig. 3h, inset). We also note the input that provides no evidence for either class and the go signal both cause sizable changes in the SM matrix.

Thus, since the state of the MPN is stored in the SM matrix, we see its behavior is much more similar to the dynamics of the hidden neuron activity of the RNN: each of the states tracks the accumulated evidence of the input sequence. Information about the relative evidence for each class is stored in the position of the state in its state space, and this information is continuously updated as new inputs are passed to the network by moving around said state space. Even so, the MPN’s state stands apart from the RNN’s in that its size seems to grow with time (a fact we confirm two paragraphs below) and it has large deflections even for inputs that provide no evidence for either class.

#### The MPN and RNNs have distinct long-time behaviors

Given that **M**_{t} appears to get progressively larger as the sequence is read in (Fig. 3e), one might wonder if there is a limit to its growth. More broadly, this brings up the question of what sort of long-time behavior the MPN has, including any attractor structure. A full understanding of attractor dynamics is useful for characterizing dynamical systems. Attractors are often defined as the set of states toward which the network eventually flows asymptotically in time. As mentioned earlier, it is known that RNNs form low-dimensional, task-dependent attractor manifolds in their hidden activity space for integration-based tasks [29, 31]. However, where the activity of a network flows to is dependent upon what input is being passed to the network at that time. For example, the network may flow to a different location under (1) additional stimulus input versus (2) no input. We will investigate these two specific flows for the MPN and compare them to RNNs.

We will be specifically interested in the flow of **M**_{t}, the MPN’s state, since from the previous section it is clear its dynamics are the closest analog of an RNN’s state, **h**_{t}. We train both the MPN and RNNs on a *T* = 20 integration task and then monitor the behavior of their states for stimulus lengths ranging from 10 to 200 steps. As might be expected from the dynamics in the previous section, we do indeed observe that **M**_{t}’s components grow increasingly large with longer stimuli (Fig. 4a). Meanwhile, the RNN’s state appears to remain constrained to its line attractor even for the longest of input sequences (Fig. 4b). To quantitatively confirm these observations, we look at the magnitude of the network’s final state (normalized by number of components) as a function of the stimulus length (Fig. 4c, Methods 5.5). As expected, since the RNNs operate close to an attractor, the final state magnitude does not increase considerably despite passing stimuli 10 times longer than what was observed in training. In contrast, the magnitude of the MPN state can change by several orders of magnitude, but its growth is highly dependent on the value of its *λ* parameter that controls the rate of exponential decay (Fig. 4c). As **M**_{t} grows in magnitude so does the size of its decay, and eventually this decay will be large enough to cancel out the growth of **M**_{t} from additional stimulus input. For smaller *λ* this decay is larger and thus occurs at shorter sequence lengths. Despite this saturation of state size, the accuracy of the MPN does not decrease significantly with longer sequence lengths (Fig. 4d). These results also demonstrate the MPN (and RNNs) are capable of generalizing to both shorter and longer sequence lengths, despite being trained at a fixed sequence length.
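The saturation of the state magnitude can be illustrated directly from the update rule: repeated Hebbian updates with decay *λ* form a geometric-like sum, so stronger decay (smaller *λ*) saturates at a smaller magnitude. A small numerical sketch with untrained weights and our own parameter choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 50
W = rng.normal(scale=1 / np.sqrt(d), size=(n, d))  # untrained input weights
eta = 1.0                                          # illustrative Hebbian gain

def final_state_norm(lam, T):
    """Mean |M_T| per component after T random binary stimuli."""
    M = np.zeros((n, d))
    for _ in range(T):
        x = rng.integers(0, 2, size=d).astype(float)
        h = np.tanh((W * (1 + M)) @ x)             # assumed multiplicative form
        M = lam * M + eta * np.outer(h, x)
    return np.abs(M).mean()

# Smaller lambda -> stronger decay -> saturation at a smaller state magnitude.
```

Since |**h**| is bounded by the tanh and each past update is weighted by a power of *λ*, the state magnitude is bounded by a geometric series and cannot grow without limit for *λ* < 1.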

#### The MPN has a single, point-like attractor that its state uniformly decays toward

Another important behavior that is relevant to the operation of these networks is how they behave under no stimulus input. Such inputs occur if we add a delay period to the simple integration task, so we now turn to analyzing the MPN trained on an integration-delay task (Fig. 2c). We again train MPNs with varying *λ* and RNNs, this time on a task with a delay length of 20 time steps.^{5} During the delay period, the states of the MPNs decay over time (Fig. 4e). Once again, since the RNNs operate close to their line attractor manifolds, their states change little over the delay period, other than flowing more towards the ends of the line (Fig. 4f). Again, we quantify this behavior by monitoring the normalized final state magnitude as a function of the delay length (Fig. 4g). We see the decay in the MPN’s state is fastest for networks with smaller *λ* and that the network with *λ* = 1.0 has no such decay.

Perhaps obvious in hindsight, the MPN’s state will simply decay toward **M** = **0** under no input. As such, for *λ <* 1, *the MPN has an attractor at* **M** = **0**, due to the exponential decay of its state built into its update expression. Since the evidence is stored in the magnitude of certain components of the SM matrix, a uniform decay across all elements maintains their relative size and does not decrease the MPN’s accuracy for shorter delay lengths (Fig. 4h). However, eventually the decay decreases the information stored in **M**_{t} enough that early inputs to the network will be indistinguishable from input noise, causing the accuracy of the network to plummet as well. The RNN, since it operates close to an attractor, has no appreciable decay in its final state over longer delays (Fig. 4g). Still, even RNN attractors are subject to drift along the attractor manifold and we do eventually see a dip in accuracy as well (Fig. 4h).
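The key point, that a uniform decay preserves the relative component sizes the readout compares, can be seen in a couple of lines. Ignoring input noise, **x** = **0** during the delay implies **h** = **0**, so each silent step is a pure rescaling of **M** (the matrix size and parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(4, 6))          # some SM state at the start of the delay
lam, k = 0.9, 15                     # decay rate and delay length (illustrative)

# With no input the Hebbian term vanishes, so k delay steps give lam**k * M:
M_delay = (lam ** k) * M

# A uniform rescaling preserves the relative sizes of all components.
ratios_before = M / np.abs(M).max()
ratios_after = M_delay / np.abs(M_delay).max()
```

The overall magnitude shrinks, but the pattern across components, which is what encodes the accumulated evidence, is unchanged until it falls below the noise floor.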

#### MPNs are ‘activity silent’ during a delay period

We have seen that the MPN’s state decays during a delay period; here we investigate what we would observe in its hidden neurons during said period. Since the MPN’s hidden activity primarily depends on the present input, during the delay period when no input is passed to the network (other than noise), we expect the activity to be significantly smaller. Indeed, at the start of the delay period, we see a few of the MPN’s hidden neuron activities quickly drop in magnitude, before spiking back up after receiving the go signal (Fig. 5a). Meanwhile, at the onset of the delay, the RNN’s hidden layer neurons quickly approach finite asymptotic values that remain fairly persistent throughout the delay period. The aforementioned behaviors are also seen by taking the average activity magnitude across the entire population of hidden neurons in each of these networks (Fig. S2a). Reduced activity during delay periods has been observed in working memory experiments and models and is sometimes referred to as an ‘activity-silent’ storage of information [35–37]. It contrasts with the ‘persistent activity’ exhibited by RNNs that has been argued to be more metabolically expensive [38].

To further quantify the degree of variation in the output information stored in the hidden neuron activity of each of these networks as a function of *t*, we train a decoder on the **h**_{t} (Methods 5.5) [23]. Confirming that the MPN has significant variability during its stimulus and delay periods, we see the MPN’s decoding accuracy drops to almost chance levels at the onset of the delay period before jumping back to near-perfect accuracy after the go signal (Fig. 5b). Since the part of the hidden activity that tracks accumulated evidence is small, during this time period said activity is washed out by noise, leading to a decoding accuracy at chance levels. Meanwhile, the RNN’s hidden neuron activity leads to a steady decoding accuracy throughout the entire delay period, since the RNN’s state just snaps to the nearby attractor, the position along which encodes the accumulated evidence. Additionally, the RNN’s trained decoders maintain high accuracy when used at different sequence times, whereas the cross-time accuracy of the MPN fluctuates significantly more (Fig. S2e,f,g). The increased time-specificity of activity in the MPN has been observed in working memory experiments [39].^{6}

#### Analytical confirmation of MPN dynamics

It is possible to analytically approximate the behavior of the MPN’s **h**_{t} and **M**_{t} at a given time step. Details of the derivation of these approximations are given in Methods 5.3. Briefly, the approximation relies on neglecting quantities that are made small with an increasing number of neurons in either the input or hidden layer. The net effect of this is that *synaptic modulations are small* and thus can be neglected at leading-order approximations. Explicitly, the approximations are given by

**h**_{t} ≈ tanh[**W**_{inp}**x**_{t} + (**W**_{inp} ⊙ **M**_{t−1})**x**_{t}],  **M**_{t} ≈ *η* Σ_{s=1}^{t} *λ*^{t−s} tanh(**W**_{inp}**x**_{s})**x**_{s}^{T},

where, in the first expression, the first and second terms are the leading and sub-leading contributions, respectively. These approximations do quite well in predicting the element-wise evolution of **h**_{t} and **M**_{t}, though they are notably worse at predicting the hidden activity during a delay period, where it is driven by only noise (Fig. 5c). We quantify how well the approximations do across the entire test set and see that they improve with increasing input and hidden layer size (Fig. 5d).

These simplified analytical expressions allow us to understand features we’ve qualitatively observed in the dynamics of the MPN. Starting with the expression for **h**_{t}, we see the leading-order contribution comes from the term **W**_{inp}**x**_{t}, which is solely dependent upon the current input, i.e. *not* the sequence history. Comparing to the exact expression, Eq. (1), the leading-order approximation is equivalent to taking **M**_{t−1} = **0**, so *at leading-order the MPN just behaves like a feedforward network*. This explains why we see the input-dependent clustering in the **h**_{t} dynamics: a feedforward network’s activity is only dependent on its current input. Meanwhile, the sub-leading term depends on all previous inputs (**x**_{t−1}, **x**_{t−2}, …, **x**_{1}), which is why the individual input-clusters vary slightly by accumulated evidence.

From the approximation for **M**_{t}, we see its update from one time step is solely dependent upon the current input as well. Without the *λ* term, **M**_{t} simply acts as an accumulator that counts the number of times a given input has been passed to the network – exactly what is needed in an integration-based task. In practice, with *λ <* 1, the contribution of the earlier inputs of the network to **M**_{t} will slowly decay.
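The quality of the **M**_{t} approximation can be checked numerically: run the exact recursion, then rebuild **M**_{t} from the leading-order sum in which each **h**_{s} is replaced by the unmodulated response tanh(**W**_{inp}**x**_{s}). This is our own check, with a deliberately small Hebbian gain so the modulations stay weak, as the approximation assumes.

```python
import numpy as np

rng = np.random.default_rng(4)
d = n = 100                         # the approximation improves with layer sizes
W = rng.normal(scale=1 / np.sqrt(d), size=(n, d))
eta, lam, T = 0.01, 0.9, 10         # small Hebbian gain keeps modulations weak
xs = rng.integers(0, 2, size=(T, d)).astype(float)   # random binary inputs

# Exact recursion for the SM matrix:
M = np.zeros((n, d))
for x in xs:
    h = np.tanh((W * (1 + M)) @ x)
    M = lam * M + eta * np.outer(h, x)

# Leading-order sum: replace each h_s by the unmodulated response tanh(W x_s)
M_approx = sum(eta * lam**(T - 1 - s) * np.outer(np.tanh(W @ xs[s]), xs[s])
               for s in range(T))
rel_err = np.linalg.norm(M - M_approx) / np.linalg.norm(M)
```

For weak modulations the relative error stays at the few-percent level, consistent with the accumulator picture: each term of the sum counts one presentation of an input, weighted by the *λ* decay.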

### 3.2 Capacity, robustness, and flexibility

Having established an understanding of how the dynamics of the MPN compares to RNNs when trained on a simple integration-delay task, we now investigate how their different operating mechanisms affect their performance in various settings relevant to neuroscience.

#### MPNs have comparable integration capacity to GRUs and outperform Vanilla RNNs

Given their distinct state storage systems, it is unclear how the capacity of the MPN compares to RNNs on integration-based tasks. In RNNs, the capacity to store information has been linked to the number of synapses/parameters and the size of their state space [40]. For example, since we know that RNNs tend to use an approximate (*N* −1)-dimensional attractor manifold in their hidden activity space to solve an *N* -class task [31], one might expect to see a drop in accuracy for *N*− 1 *> n*, with *n* the number of hidden neurons. To investigate state storage capacity in the MPN and RNNs, we limit the number of adjustable parameters/synaptic connections by making the number of neurons in each layer small, specifically taking the number of input and hidden neurons to be *d* = 10 and *n* = 3, respectively.

We observe the MPN and GRU are capable of training to accuracies well above chance, even for *N* ≥ 3 classes, while the Vanilla RNN’s accuracy quickly plummets beyond *N* = 2 (Fig. 6a). The size of the MPN state scales as *nd*, and indeed we see the accuracies receive a small bump, becoming more comparable to those of the GRU, when the input dimension is increased from *d* = 10 to 40.^{7}

A second way we can test for integration information capacity is to increase the length of the input sequences in the task. Across the board, both the MPN and RNNs are capable of learning input sequences up to length *T* = 200 at high accuracy (Fig. 6b). Although the differences are only a few percent, the MPN learns to integrate relatively long sequences at a higher accuracy than Vanilla RNNs, though not quite as high as GRUs.

#### MPNs can operate at high input noise and minimal training

To test the robustness of the MPN’s dynamics we make the integration task harder in two different ways. First, we add increasingly more noise to the inputs of the network. Even for networks trained with a relatively small amount of noise, we find both the MPN and RNNs are capable of achieving near-perfect accuracy on the 2-class task up until the noise is of comparable magnitude to the signal (Fig. 6c). Continuing to increase the size of the noise, we see all networks eventually fall to chance accuracy, as expected. Notably, the RNNs maintain high accuracy down to a slightly smaller signal-to-noise ratio. This might be expected given the RNNs' operation close to attractor manifolds, which are known to be robust to perturbations such as noisy inputs [17].

Second, we aim to understand if the MPN’s intrinsic dynamics at initialization allow it to perform integration with a minimal adjustment of weights. We test this by freezing all internal parameters at their initialization values and only training the MPN’s readout layer. It is well known that RNNs with a large number of hidden layer neurons have varied enough dynamics at initialization that simply adjusting the readout layer allows them to accomplish a wide variety of tasks [41–43]. Such settings are of interest to the neuroscience community since the varied underlying dynamics in random networks allow for a wide variety of responses, matching the observed versatility of certain areas of the brain, e.g. the neocortex [41]. Since the *η* and *λ* parameters play especially important roles in the MPN, we fix them at modest values, namely *η* = 1.0 and *λ* = 0.95.^{8} For two different layer sizes, we find all networks are capable of training in this setup, but the MPN consistently outperforms both RNNs (Fig. 6d). Notably, even the MPN with significantly fewer neurons across its input/hidden layers outperforms the RNNs. These results suggest that the intrinsic computational structure built into the MPN by its update expressions allows it to perform particularly well on integration tasks, even with randomized synaptic connections.

#### MPNs are flexible to taking in new information

The flexibility of a network to learn several tasks at once is an important feature in both natural and artificial neural networks [44–48]. It is well-known that artificial neural networks can suffer from large drops in accuracy when learning tasks sequentially. This effect has been termed *catastrophic forgetting*. For example, although it is known artificial RNNs are capable of learning many neuroscience-related tasks at once, this is not possible without interleaving the training of said tasks or modifying the training and/or network with continual-learning techniques [49, 50]. Given the minimal training needed for an MPN to learn to integrate, as well as its task-independent attractor structure, here we test if said flexibility also extends to a sequential learning setting.

To test this, we train the MPN and GRU on a 2-class integration-delay task until a certain accuracy threshold is met and then train on a *different* 2-class integration-delay task until the accuracy threshold is met on the novel data. Afterwards, we see how much the accuracy of each network on the original 2-class task falls.^{9} We find that the MPN loses significantly less accuracy than the GRU when trained on the new task (Fig. 6e). Intuitively, an RNN might be able to use the same integration manifold for both integration tasks, since they each require the same capacity for the storage of information. Indeed, the state space dimensionality of the MPN and GRU does not change significantly pre- and post-training on the novel data (Methods 5.5). However, we find the GRU's line attractors reorient in state space after training on the second task, on average shifting by 23 ± 3 degrees (mean ± s.e.). Since the MPN has a task-agnostic attractor structure, it does not change in the presence of new data.

To understand the difference in how these two networks adapt to new information in more detail, we investigate how the MPN and RNN dynamics treat a novel input, e.g. how networks trained on a 2-class task behave when suddenly introduced to a novel class. For the RNN, the novel inputs to the network do not cause the state to deviate far from the attractor (Fig. 6f). The attractors that make the RNNs so robust to noise are their shortcoming when it comes to processing new information, since anything the network hasn’t seen before is simply snapped back onto the attractor. Meanwhile, for the MPN, the minimal attractor structure means new inputs have no problem deflecting the SM matrix in a direction distinct from previous inputs (Fig. 6g). To quantify the observed separability of the novel class, we train a decoder to determine the information about the output contained in the final state of each network (still with no training on the novel class). The MPN’s states have near-perfect separability for the novel class even before training, 0.998 ± 0.001 accuracy, while the GRU has more trouble separating the new information, 0.812 ± 0.009 accuracy (mean ± s.e., Fig. 6h). Hence, out of the box, the MPN is primed to take in new information.

### 3.3 Additional integration-based tasks

Given the simplicity of the MPN, one may question whether such a setup is capable of learning anything beyond the simple integration-delay tasks we have presented thus far. Additionally, if it is capable of learning other tasks, how its dynamics change in such settings is also of interest. To address these questions, in this section we train and analyze MPNs on additional integration tasks studied in neuroscience, some of which require the network to learn more nuanced behavior such as context. Additionally, the dynamics of a network trained on a true-anti contextual task are shown in Fig. S5 (Methods 5.1).

#### Retrospective and prospective contextual integration

Integration with context is a well-known task in the neuroscience literature [24, 53]. In this setup, two independent 2-class integration-delay *subtasks* are passed to the network at once (inputs are concatenated) and the network must provide the correct label to only one of the subtasks based on the contextual input it receives (i.e. report subtask 1 or 2). We begin by discussing the retrospective case, where the context comes *after* the stimuli of the subtasks, in the middle of the delay period (Fig. 7a).

We find the MPN is easily capable of learning this task and again achieves near-perfect accuracy, on average 98.4%. To understand how the MPN is capable of processing context, we investigate how its dynamics change from those of the simple integration task analyzed previously. Once again turning to the state of the MPN to see what information it encodes, during the time period when the subtask stimuli are being input we see **M**_{t} holds information about both integration subtasks simultaneously (Fig. 7b). That is, we see the same continuum that encodes the relative evidence between the two classes that we saw in the single-task case earlier (Fig. 3f), but for both of the integration subtasks. Furthermore, these two one-dimensional continua lie in distinct subspaces, allowing a single location in the two-dimensional space to encode the relative evidences of both subtasks at once. This makes sense from the perspective that the network does not yet know which information is relevant to the output, so it must encode information from each subtask prior to seeing the context.

When the context is finally passed to the network, we see its state space becomes increasingly separated into two clusters that correspond to whether the context is asking to report the label from subtask 1 or 2 (Fig. 7c). Separating the states into distinct regions of state space allows them to be processed differently when converted to hidden and output activity, and we now show this is how the MPN solves the task. We can quantify the separation induced by the contextual input by looking at how each state gets converted into an output and how this changes with context. We define a subtask’s *readout difference* such that more positive values mean the two classes belonging to the subtask are easier to distinguish from one another via the readout vectors, i.e. what is needed to solve the subtask (Methods 5.5). Prior to the context being passed, we see the readout difference increases for both subtasks as evidence is accumulated (Fig. 7d). As soon as the contextual input is passed, the readout difference for the subtask that is now irrelevant (the one that doesn’t match the context, dotted line) immediately plummets, while that of the relevant subtask (the one that matches the context, solid line) increases (Fig. 7d).^{10} After a final delay period, the go signal is passed to the network and these two separate clusters of state space are read out distinctly, allowing the network to output the label of the appropriate subtask.

The prospective task is identical to the retrospective one, except that the context comes *before* the evidence sequence (Fig. 7e). Again looking to the MPN’s **M**_{t} dynamics, we see the context quickly separates the evidence-less states into two separate clusters (Fig. 7f). Soon after, once evidence starts being passed to the network, the clusters remain distinct and evidence for the individual integration subtasks is accumulated (Fig. 7g). Once again quantifying the information contained in **M**_{t} via the readout difference, we see the subtasks that match the context have their readout difference grow quickly (Fig. 7h). Meanwhile, since the context has already been processed by the network, the evidence needed to solve the irrelevant subtask is stored in a way that is unimportant for the readouts.

#### Continuous integration

Thus far we have investigated integration tasks where evidence comes in discrete chunks, but often evidence from stimuli can take on continuous values [24]. In this task, the network receives continuous inputs and must determine if the input was drawn from a distribution with positive or negative mean (Fig. 7i). Evidence is again passed to the network through a random binary vector, but the vector is multiplied by the continuous signal. The MPN is once again able to achieve near perfect accuracy on this task.

Since all evidence consists of scalar multiples of a random input vector, the hidden neuron activity prior to the go signal now exists in a single cluster, as opposed to the distinct input-clusters we saw for the discrete case (Fig. 7j, S5h). The go signal again has its own separate cluster, within which the hidden state varies with the total evidence of the sequence and is well-aligned with the readout vectors (Fig. 7j). Although the dynamics of this task may look more similar to line-attractor dynamics, we again note that the hidden neuron activity largely tracks input activity rather than accumulated evidence, unlike in an RNN (Fig. 7k). The accumulated evidence is still stored in the SM matrix, **M**_{t}. In the low-dimensional space, the state moves along a line to track the relative evidence between the two classes, before jumping to a separate cluster when the go signal is passed (Fig. 7l). Again, note this is distinct from the line-attractor dynamics of an RNN, since in the absence of stimulus during a delay period, **M**_{t} will still exponentially decay back towards its baseline value at **M** = **0**.

## 4 Discussion

In this work, we have thoroughly explored the trained integration dynamics of the MPN, a network with multiple forms of plasticity. It has connections between neurons that are effectively a product between two terms: (1) the **W**_{inp} matrix, trained in a supervised manner with backpropagation and assumed to be constant during input sequences and (2) the synaptic modulation matrix, **M**, which has faster dynamics and evolves in an unsupervised manner. We analyzed MPNs without recurrent connections so that they have to rely solely on synaptic dynamics for the storage and updating of short-time-scale information. Unlike an RNN, we have found the dynamics of the hidden neurons in the MPN primarily track the present input to the network, and only at subleading-order do we see them encode accumulated evidence. This makes sense from the point of view that the hidden neurons have two roles in the MPN: (1) they connect directly to the readouts and must hold information about the entire sequence for the eventual output, but also (2) they play a role in encoding the input information to update the **M**_{t} matrix.

The synaptic modulations, contained in the SM matrix, **M**_{t}, encode the accumulated evidence of the input sequence through time. Hence, the synaptic modulations play the role of the state of the MPN, similar to the role the hidden activity, **h**_{t}, plays in the RNN. Additionally, we find the MPN’s state space has a fundamentally different attractor structure than that of the RNN: the uniform exponential decay in **M**_{t} imbues the state space with a single point-like attractor at **M** = **0**. Said attractor structure is task-independent, which significantly contrasts the manifold-like, task-dependent attractors of RNNs. Despite its simplistic attractor structure, the MPN can still hold accumulated evidence over time since its state decays slowly and uniformly, maintaining the relative encoding of information (Fig. 4).

Despite its relatively simple dynamics, across many neuroscience-relevant tests, we found the MPN was capable of outperforming the Vanilla RNN and often performing at comparable levels to the GRU. The exception to this was noise robustness, where the RNNs’ attractor structure allows them to outperform the MPN in a relatively small window of signal-to-noise ratios. However, the MPN’s integration dynamics that rely on its minimal attractor structure allowed it to outperform RNNs in both minimal-training and sequential-learning settings. Altogether, we find such performance surprising given the simplicity of the MPN. Unlike the highly designed architecture of the GRU, the MPN operates using relatively simple biological mechanisms, with no more architectural design than the simplest feedforward neural networks.

The simplicity of the MPN leaves plenty of room for architectural modifications to either better match onto biology or improve performance. Foremost among such modifications is to combine recurrence and dynamic synapses into a single network. The MPN and Vanilla RNN are subnetworks of this architecture, and thus we already know this network could exhibit either of their dynamics or some hybrid of the two. In particular, if training is incentivized to find a solution with sparse connections or minimal activity, the MPN, which computes without recurrent connections and with minimal persistent activity, could be the preferred solution. The generality of the MPN’s synaptic dynamics easily allows for the addition of such dynamic weights to any ANN layer, including recurrent layers [54]. Finally, adding synaptic dynamics that vary by neuron or by individual synapse would also be straightforward: the scalar *λ* and *η* parameters that are uniform across all neurons can be replaced by vector or matrix equivalents that can be unique for each pre- or post-synaptic neuron or synapse. A non-uniform decay from a vector/matrix-like *λ* would allow the MPN to have a more nuanced attractor structure in its state space.

## 5 Methods

As in the main text, throughout this section we take the number of neurons in the input, hidden, and output layers to be *d*, *n*, and *N*, respectively. We continue to use uppercase bold letters for matrices and lowercase bold letters for vectors. We use *i, j* = 1, *…, n* to index the hidden neurons and *I, J* = 1, *…, d* to index the input neurons. For components of matrices and vectors, we use the same non-bolded letter, e.g. *M_{iJ,t}* for **M**_{t} or *x_{I,t}* for **x**_{t}.

### Supporting code

Code for this work can be found at: https://github.com/kaitken17/mpn. We include a demonstration of how to implement a *multi-plasticity layer*, which allows one to incorporate the synaptic modulations used in this work into any fully-connected ANN layer. This allows one to easily generalize the MPN to, say, deeper networks or networks with multi-plastic recurrent connections.

### 5.1 Tasks

#### Simple integration task

The simple integration task is used throughout this work to establish a baseline for how MPNs and RNNs learn to perform integration. It is inspired by previous work on how RNNs learn to perform integration in natural language processing tasks, where it has been shown they generate attractor manifolds of a particular shape and dimensionality [29–31].

The *N*-class integration task requires the network to integrate evidence for multiple classes over time and determine the class with the most evidence (Fig. 2a). Each example from the task consists of a sequence of *T* input vectors, **x**_{1}, *…,* **x**_{T}, passed to the network one after another. We draw possible inputs at a given time step from a bank of *stimulus inputs*, {evid_{1}, evid_{2}, *…,* evid_{N}, null}. Here, ‘evid_{m}’ corresponds to one unit of evidence for the *m*th class. The ‘null’ input provides evidence for none of the classes. Each sequence ends in a ‘go signal’ input, letting the network know an output is expected at that time step. All the stimulus inputs have a one-to-one mapping to a distinct random binary vector that has an expected magnitude of 1 (Fig. 2b, see below for details). Finally, each example has an integer label from the set {1, *…, N*}, corresponding to the class with the most evidence. The network has correctly learned the task if its largest output component at time *T* is the one that matches each example’s label.

The input sequence examples are randomly generated as follows. For a given sequence of an *N*-class task, the amount of evidence for each class in the entire sequence can be represented as an *N*-dimensional *evidence vector*. The *m*th element of this vector is the number of evid_{m} inputs in the given sequence. For example, the 3-class sequence of length *T* = 6, “evid_{2}, evid_{1}, null, evid_{1}, evid_{3}, go” has an evidence vector (2, 1, 1). The sequences are randomly generated by drawing them from a uniform distribution *over possible evidence vectors*. That is, for a given *N* and *T*, we enumerate all possible evidence vectors and draw uniformly over said set. Sequences that have two or more classes tied for the most evidence are eliminated. Then, for a given evidence vector, we draw uniformly over sequences that could have generated said vector.^{11} Unless otherwise stated, we generally consider the case of *T* = 20 throughout this work.
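A minimal sketch of this two-stage sampling procedure (a hypothetical helper, not the released task code) could look like:

```python
import itertools
import random

def sample_sequence(N, T, rng=random):
    """Sample one N-class integration sequence of length T (last step = go).

    Two-stage procedure as described: draw uniformly over tie-free evidence
    vectors, then uniformly over orderings consistent with that vector.
    """
    T_stim = T - 1  # final step is reserved for the go signal
    # Enumerate all evidence vectors; remaining slots are filled with nulls.
    vecs = [v for v in itertools.product(range(T_stim + 1), repeat=N)
            if sum(v) <= T_stim]
    # Eliminate sequences tied for the most evidence.
    vecs = [v for v in vecs if sorted(v)[-1] > sorted(v)[-2]]
    ev = rng.choice(vecs)
    # Build the token multiset and draw a uniform ordering of it.
    tokens = [f"evid_{m + 1}" for m in range(N) for _ in range(ev[m])]
    tokens += ["null"] * (T_stim - sum(ev))
    rng.shuffle(tokens)
    label = max(range(N), key=lambda m: ev[m]) + 1  # class with most evidence
    return tokens + ["go"], label
```

Shuffling the token multiset uniformly at random induces the uniform draw over sequences consistent with the chosen evidence vector.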

To generate the random binary vectors that map to each possible stimulus input, each of their elements is independently drawn uniformly from the set {0, 1}. The binary vectors are divided by √(*d/*2) so they have an expected magnitude of 1,

𝔼 [ ∥**x**∥₂² ] = 1,

where ∥·∥₂ denotes the L2-norm. Note the expected dot product between two such vectors is

𝔼 [ **x** · **x**^{′} ] = 1*/*2.

Often, we will take the element-wise (Hadamard) product between two input vectors; the expected squared magnitude of the resulting vector is

𝔼 [ ∥**x** ⊙ **x**^{′}∥₂² ] = 1*/d*,

where we have used the fact that the only nonzero elements of the element-wise product occur when the corresponding elements of both vectors are nonzero. The fact that this product scales as 1*/d* will be useful for analytical approximations later on.
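Assuming elements drawn uniformly from {0, 1} with normalization by √(*d/*2) (our reading of the construction above), these scalings can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 50, 20000

# Random binary vectors: elements uniform over {0, 1}, scaled by sqrt(2/d)
# so the expected squared magnitude is 1 (assumed construction).
X = rng.integers(0, 2, size=(trials, d)) * np.sqrt(2.0 / d)
Y = rng.integers(0, 2, size=(trials, d)) * np.sqrt(2.0 / d)

sq_mag = (X ** 2).sum(axis=1).mean()        # ~ 1: unit expected squared magnitude
dot = (X * Y).sum(axis=1).mean()            # ~ 1/2: expected overlap of two vectors
had_sq = ((X * Y) ** 2).sum(axis=1).mean()  # ~ 1/d: Hadamard products are small
```

The 1*/d* scaling of the Hadamard product is the quantity the analytical approximations rely on.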

Notably, since this task only requires a network to determine the class with the *most* evidence (rather than the absolute amount of evidence for each class), we claim this task can be solved by keeping track of *N* − 1 relative evidence values. For example, in a 2-class integration task, at a minimum the network needs to keep track of a single number representing the relative evidence between the two classes. For a 3-class task, the network could keep track of the relative evidence between the first and second classes, as well as between the second and third (from which the relative evidence between the first and third could be determined). This generalizes to *N* − 1 values for an *N*-class integration task.^{12}
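As a toy illustration of this counting argument, the winning class can be recovered from *N* − 1 relative evidence values alone (a hypothetical helper, with class 1 as the reference):

```python
# rel[m] = evidence(class m+2) - evidence(class 1); class 1 is the reference.
def winner_from_relative(rel):
    # Reconstruct all evidences up to a common offset (reference fixed at 0).
    full = [0] + list(rel)
    # The common offset cancels under comparison, so the argmax is unchanged.
    return 1 + max(range(len(full)), key=lambda m: full[m])
```

For instance, the evidence vector (2, 1, 1) gives relative values (−1, −1) and class 1 is correctly recovered as the winner.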

#### Simple integration task with delay

A modified version of the simple integration task outlined above involves adding a delay period between the last of the stimulus inputs and the go signal (Fig. 2c). We denote the length of the delay period by *T*_{delay}, with *T*_{delay} *< T* − 1. During the delay period, the sequence inputs (without noise) are simply the zero vector, **x**_{t} = **0** for *t* = *T* − 1 − *T*_{delay}, *…, T* − 1. Unless otherwise stated, we consider the case of *T* = 40 and *T*_{delay} = 20 for this task. We briefly explore the effects of training networks on this task when the delay input has a small nonzero magnitude; see Fig. S2.

#### Contextual integration task (retrospective and prospective)

For the retrospective context task, we test the network’s ability to hold onto multiple pieces of information and then distinguish between said information using a contextual cue (Fig. 7a). The prospective integration task is the same but has the contextual cue precede the stimulus sequence (Fig. 7e). Specifically, we simultaneously pass the network *two N* = 2 simple integration tasks with delay by concatenating their inputs together. The label of the entire sequence is the label of one of the two integration subtasks, determined by the context, which is chosen uniformly at random over the two possibilities (e.g. subtask 1 or subtask 2) for each sequence. Note for each of the subtasks, labels take on the values *ℓ* = 1, 2, so there are still only two possible output labels.

As above, each subtask has its various inputs mapped to random binary vectors, and the full concatenated input has an expected magnitude of 1. We specifically consider the case where the input size is *d* = 50, so each subtask has 25-dimensional input vectors. The full sequence length is *T* = 40, with *T*_{stimulus} = 19 and *T*_{context} = 5. For both the retrospective and prospective setups, the context was passed with 10 delay time steps before it and 5 delay time steps after it.

#### Continuous integration task

The continuous integration task tests the network's ability to integrate over continuous values, rather than the discrete values used in the pure integration task above (Fig. 7i).

The continuous input is randomly generated by first determining the mean value, *μ*, for a given example by drawing uniformly over the range [−0.1875, 0.1875]. For a sequence length of *T*, at each time step the continuous signal is drawn from the distribution *𝒩* (*μ, T/*750). Each example has a binary label corresponding to whether *μ* is positive or negative. Numerical values were chosen such that the continuous integration task has a similar difficulty to the integration task investigated in Ref. [24]. The continuous signal at each time step then multiplies a fixed random binary vector. Unlike the previous tasks, since the continuous signal can be negative, the input values to the network can be negative as well. The go signal is still a distinct random binary vector. We specifically consider the case of *T* = 20.
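Generating one example of this task can be sketched as follows (a hypothetical helper; we assume 𝒩(*μ*, *T/*750) specifies the variance and reuse the binary-vector construction described above):

```python
import numpy as np

def continuous_example(T=20, d=50, rng=None):
    """One continuous-integration example: (T, d) input array and binary label."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = rng.uniform(-0.1875, 0.1875)
    # Scalar evidence at each of the T-1 stimulus steps (variance T/750 assumed).
    signal = rng.normal(mu, np.sqrt(T / 750.0), size=T - 1)
    # Fixed random binary vectors for the evidence channel and the go signal.
    evid_vec = rng.integers(0, 2, size=d) * np.sqrt(2.0 / d)
    go_vec = rng.integers(0, 2, size=d) * np.sqrt(2.0 / d)
    inputs = np.concatenate([signal[:, None] * evid_vec[None, :],
                             go_vec[None, :]])  # go signal at the final step
    label = int(mu > 0)  # positive vs. negative mean
    return inputs, label
```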

#### True-anti contextual integration task

The true-anti contextual integration task is the same as the simple *N* = 2 integration task, except the correct label to a given example may be the class with the *least* evidence, determined by a contextual clue. The contextual clue is uniformly drawn from the set {true, anti}. Each possible clue again one-to-one maps to a random binary vector that is *added* to the random binary vectors of the normal stimulus input at all time steps (Fig. S5a). We specifically consider the case of *T* = 20.

Note this task is not discussed in detail in the main text, instead the details are shown in Fig. S5. We find both the MPN and GRU are easily able to learn to solve this task.

### 5.2 Networks

In this section, we give a more thorough description of the various networks considered in this work. The number of adjustable parameters in the networks, as well as the scaling of the number of state update operations they require, are given in Table 1. Unless otherwise stated, throughout this work we take networks to have input and hidden neuron counts of *d* = 50 and *n* = 100, respectively. In this setting, the MPN has roughly 1*/*3 the number of trainable parameters of the Vanilla RNN and 1*/*9 that of the GRU.

#### Multi-plasticity network

The multi-plasticity networks used in this work can be thought of as a generalization of a two-layer, fully-connected, feedforward network, given by

**h**_{t} = *ϕ* (**W**_{inp}**x**_{t} + **b**) , (9a)
**y**_{t} = **W**_{RO}**h**_{t} + **b**_{RO}, (9b)

where **W**_{inp} ∈ ℝ^{n×d} are the weights connecting the input and hidden layers, **W**_{RO} ∈ ℝ^{N×n} are the weights connecting the hidden and output layers, and **h**_{t} ∈ ℝ^{n} are the hidden activities. The function *ϕ*(·) is some activation function, usually non-linear, which is applied element-wise. The weights **W**_{inp} and **W**_{RO} and biases **b** and **b**_{RO} are adjusted during training (see below for details).

The primary difference between the MPN and the above network is that the weights between the input and hidden layers are modified by a time-dependent synaptic modulation (SM) matrix, **M**_{t}. That is, Eq. (9a) is replaced by

**h**_{t} = *ϕ* ( [**W**_{inp} ⊙ (**1** + **M**_{t−1})] **x**_{t} + **b** ) , (10)

where **M**_{t−1} ∈ ℝ^{n×d} has the same dimensions as **W**_{inp}, and **1** ∈ ℝ^{n×d} has 1 for all its elements. Here, ‘⊙’ represents an element-wise multiplication of the two matrices (the Hadamard product). In the main text, we set **b** = **0** and *ϕ*(·) = tanh (·). We take the former condition to represent the lack of a background signal; it does not change the qualitative results of the main text. In practice, no significant difference in dynamics was observed with the sigmoid activation function, and tanh was chosen to better match the computational structure of the RNNs and the existing literature on their dynamics. In Figs. S1e-S1h, we show the dynamics are qualitatively the same for *ϕ*(·) = max (·, 0), i.e. the ReLU activation function.

The SM matrix, **M**_{t}, serves as an internal state of the network, initialized at **M**_{0} = **0** ∈ ℝ^{n×d}. The time-dependence of the SM matrix comes from its Hebbian-like updates,

**M**_{t} = *λ***M**_{t−1} + *η***h**_{t}**x**_{t}^{T}, (11)

where *η* and *λ* are parameters that can be adjusted during training. Notably, unlike the weight modification from backpropagation, the SM matrix weight modification is *local*. That is, the information to update a given synapse/weight comes only from the nodes to which that weight is directly connected. Note the convention we have chosen for the time labels in these update expressions means that, in order for **h**_{1} to be well defined, we must specify an initial state for **M**_{0}, which throughout this work we take to be **M**_{0} = **0**. It is possible to train **M**_{0} as an additional set of parameters for the network. In practice, we don’t observe significant qualitative differences in the dynamics when this is done.
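The forward pass and local update can be sketched in a few lines of numpy (a minimal illustration of the expressions above; the released multi-plasticity-layer code may differ in its details):

```python
import numpy as np

def mpn_step(x, M, W_inp, b, eta=1.0, lam=0.95):
    # Modulated feedforward pass: W_inp is element-wise scaled by (1 + M).
    h = np.tanh((W_inp * (1.0 + M)) @ x + b)
    # Local, Hebbian-like SM update: decay plus outer product of activities.
    M = lam * M + eta * np.outer(h, x)
    return h, M

d, n = 50, 100
rng = np.random.default_rng(0)
W_inp = rng.uniform(-1, 1, size=(n, d)) * np.sqrt(6.0 / (n + d))
M = np.zeros((n, d))  # M_0 = 0
for t in range(20):
    x = rng.integers(0, 2, size=d) * np.sqrt(2.0 / d)  # random binary input
    h, M = mpn_step(x, M, W_inp, np.zeros(n))
```

Note the state that persists across time steps is the matrix **M**, not the hidden activity **h**, in contrast to an RNN.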

We let *η* ∈ ℝ, while *λ* must obey 0 ≤ *λ* ≤ *λ*^{max} where *λ*^{max} is a hyperparameter of the MPN. Throughout this work we choose *λ*^{max} = 0.95 unless otherwise noted. This is to enforce a level of biological realism, where the SM matrix exponentially decays. In practice, during training we find that *λ* comes close to saturating *λ*^{max}. Note for *λ* = 0.95, in the usual settings where we train for 20 or 40 sequence steps, this means by the time the go signal is passed the first input perturbations decay to roughly 0.35 and 0.13 of their initial values, respectively.
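The quoted decay factors follow directly from *λ*^{T}:

```python
lam = 0.95
decay_20 = lam ** 20  # fraction of an initial perturbation remaining after 20 steps
decay_40 = lam ** 40  # remaining fraction after 40 steps
```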

The output of the network is given by the expression Eq. (9b). For ease of interpretation of the mapping between hidden activity and the network output, throughout the main text we take **b**_{RO} = **0**. Once more, dynamics and quantitative results do not change significantly with this choice.

To emphasize its functional dependence on only the current input and the synaptic modulations, the MPN's state update can be written in the generic form **M**_{t} = *G* (**M**_{t−1}, **x**_{t}). Similarly, the hidden layer activity is given by the generic expression **h**_{t} = *G^{′}* (**M**_{t−1}, **x**_{t}), which makes it fundamentally different from the RNNs' generic expression, **h**_{t} = *F* (**h**_{t−1}, **x**_{t}); see below for additional details.

To understand the dynamics of the MPN, it is occasionally useful to see what the network’s hidden activity would look like *if* the go signal were passed to it at the current time step. This is done by translating any given state **M** into its corresponding post-go-signal hidden activity by performing what we call a *go signal projection*, defined by

**h**^{go} = *ϕ* ( [**W**_{inp} ⊙ (**1** + **M**)] **x**^{go} + **b** ) , (12)

where **x**^{go} is the go signal input. That is, this defines a mapping from any MPN state to a hidden activity. An example of this projection is shown in Figs. S1c,d, corresponding to the MPN trained on the *N* = 2 simple integration task discussed in Fig. 3.

#### Additive multi-plasticity network

An alternative model for the MPN, where the SM matrix additively modifies the input weights, is given by [27]

**h**_{t} = *ϕ* ( (**W**_{inp} + **M**_{t−1}) **x**_{t} + **b** ) , (13)

where once more **M**_{t−1} ∈ ℝ^{n×d}. Other than the above expression for hidden neuron activity, this network is identical to the MPN with element-wise multiplication above. Tests on the performance of the additive MPN compared to the multiplicative MPN used throughout this work found the additive model to generally perform worse than its multiplicative counterpart (Fig. S4).^{13} The dynamics of this network are also found to be low-dimensional and are shown in Figs. S1i-S1l. An in-depth analysis of its dynamics is outside the scope of this work.

#### Recurrent neural networks

Throughout this work, we compare the MPN to RNNs, which are common artificial neural networks used in both the neuroscience and machine learning communities. The simplest version of the RNN is the Vanilla RNN, which can again be thought of as a simple modification to the two-layer feedforward network, Eq. (9). Now the hidden layer neurons serve as an additional input to the hidden layer at the next time step,

**h**_{t} = *ϕ* (**W**_{inp}**x**_{t} + **W**_{rec}**h**_{t−1} + **b**) , (14a)
**y**_{t} = **W**_{RO}**h**_{t} + **b**_{RO}, (14b)

where **W**_{rec} ∈ ℝ^{n×n} are weights trained in the same manner as those in **W**_{inp}. Note that, with this convention, for **h**_{1} to be well defined we must specify the initial state **h**_{0}, which we always take to be **h**_{0} = **0** ∈ ℝ^{n}. Similar to the MPNs above, **h**_{0} could be trained as an additional parameter, but we simply fix it in this work. Additionally, like the MPN, throughout this work we take *ϕ*(·) = tanh (·) and **b**_{RO} = **0**.

Vanilla RNNs suffer from the vanishing/exploding gradient problem and more sophisticated units have been introduced to remedy these problems. Perhaps one of the two most famous generalizations of the Vanilla RNN is the Gated Recurrent Unit (GRU) [55]. The GRU introduces several gates to the Vanilla RNN’s hidden activity update expressions. These gates control how the hidden activity of the network is changed from one time step to the next via *additive* updates, allowing information to persist for longer time scales [28, 56]. The explicit expressions for the GRU are given by
where **u**_{t} and **r**_{t} are the update and reset gate vectors, respectively. The hidden activity is translated to an output through a readout layer, identical to Eq. (14b).
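Since gate conventions vary slightly across references, here is a sketch of a single GRU step under one common parameterization of [55] (weight names and toy dimensions are illustrative; biases are omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, params):
    """One GRU update: gates u_t (update) and r_t (reset) control an
    additive interpolation of the hidden state."""
    Wu, Uu, Wr, Ur, Wh, Uh = params
    u = sigmoid(Wu @ x + Uu @ h_prev)             # update gate u_t
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate r_t
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate activity
    return (1.0 - u) * h_prev + u * h_cand        # additive update

n, d = 4, 3
rng = np.random.default_rng(1)
params = [rng.normal(size=s) for s in
          [(n, d), (n, n), (n, d), (n, n), (n, d), (n, n)]]
h = np.zeros(n)
for t in range(5):
    h = gru_step(h, rng.normal(size=d), params)
```

The additive interpolation in the final line is what allows information to persist over longer time scales than in the Vanilla RNN.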

Both of the above RNNs can be written in the generic form **h**_{t} = *F* (**h**_{t−1}, **x**_{t}). That is, despite its complicated update expressions, the GRU's hidden state at a given time step is still only a function of its current input and previous hidden state. We again note the similarities and differences between this expression and the update expression for the MPN, **h**_{t} = *G ^{′}* (**M**_{t−1}, **x**_{t}).

##### 5.2.1 Training

All networks are trained using standard backpropagation (through time), as implemented through the PyTorch package. Networks are trained using ADAM with default parameters [34] and a constant learning rate, set to 1 × 10^{−3}. All trained parameters of the network are subject to L1 regularization, and unless otherwise stated the coefficient of said regularization is 10^{−4}. In all training settings, Gaussian noise of expected magnitude 0.1 is added to the inputs of the network. Throughout this work, training was conducted for a minimum number of time steps and then stopped under three possible conditions: (1) a (rolling average) validation accuracy was met, (2) the (rolling average) validation loss became saturated, or (3) the maximum training time was reached. See below for details on when each of these thresholds was used. Gradients were clipped to 10 to avoid gradient explosion. All weight matrices are initialized using Xavier initialization, e.g. for **W**_{inp} ∈ ℝ^{n×d} each element is drawn from a uniform distribution over [−*γ*, *γ*] with *γ* = √(6*/*(*n* + *d*)).

For the MPN, *η* is similarly drawn from a uniform distribution between . *λ* is initialized to its maximum value *λ*^{max}, as generally we find it to approach its maximal value during training anyway.
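A minimal PyTorch sketch of these training settings; the model here is a stand-in `Linear` layer rather than one of the actual networks, and the noise magnitude and clipping mode shown are illustrative:

```python
import torch

# Stand-in model; the actual MPN/RNN architectures are defined in Sec. 5.2.
model = torch.nn.Linear(50, 2)
torch.nn.init.xavier_uniform_(model.weight)  # Xavier initialization

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM, default params
l1_coeff = 1e-4  # L1 regularization coefficient

x = torch.randn(32, 50)
x = x + 0.1 * torch.randn_like(x)  # Gaussian noise added to the inputs
target = torch.randint(0, 2, (32,))

loss = torch.nn.functional.cross_entropy(model(x), target)
loss = loss + l1_coeff * sum(p.abs().sum() for p in model.parameters())

optimizer.zero_grad()
loss.backward()
# Gradient clipping to 10 (norm clipping shown; value clipping is analogous).
torch.nn.utils.clip_grad_norm_(model.parameters(), 10.0)
optimizer.step()
```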

### 5.3 Theoretical Analysis

#### MPN analysis

Here we give additional details of the theoretical approximations for MPN dynamics. First note that we can write a given **M**_{t} in terms of previous inputs, hidden activity, and its initial state
where, in reaching the final line, we have assumed that **M**_{0} = **0**.

In the settings we train the MPNs in this work, *we find the effect of the modulation matrix on the hidden activity to be small compared to the unmodulated weights*. That is, in general, the contributions to the hidden activity in Eq. (10a) obey

Thus, the leading-order contributions to the dynamics of both the hidden activity and the SM matrix come from expanding in terms of the number of modulations and neglecting terms with compounded modulation contributions. In Appendix A, we show explicitly why this occurs in the setup we consider in this work.^{14} Given the relative size of terms discussed above, the leading-order hidden activity can then be approximated by
which is just the expression of the hidden activity in an unmodulated feedforward network. Plugging this approximation into the update expression for **M**_{t} above, we arrive at an approximation for the modulation matrix

Plugging in *ϕ* = tanh and **b** = **0**, this matches the approximation used in Eq. (5) of the main text (again see Appendix A for additional details).

Notably, the leading-order expression for **h**_{t} above does not capture the variation due to accumulated evidence we see in the hidden activity. Although subleading, such variation is important for the MPN to solve the integration task. To get a more accurate approximation, we keep terms of subleading order, i.e. terms that arise from a single application of the SM matrix. In this case we arrive at the approximation for hidden activity used in the main text,

Once more, details of how one arrives at this expression are given in Appendix A. One can, of course, continue expanding in terms of the modulation matrix to get increasingly accurate, and increasingly complicated, expressions. Notably, from the above expression, one can see that for the SM matrix to make a nonzero contribution, the inputs of the current and previous time steps must have a nonzero Hadamard product. This is especially important for the final input, which in this work we take to be the go signal; otherwise, all subleading contributions to the hidden activity are zero and the network cannot train. This would occur, for example, if one used one-hot vectors for all distinct inputs to the network.
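This failure mode is easy to see numerically; below is a small NumPy illustration (with hypothetical three-dimensional one-hot encodings) of how distinct one-hot inputs yield a vanishing Hadamard product:

```python
import numpy as np

# Hypothetical one-hot encodings for a 'go' signal and an evidence input.
go = np.array([0.0, 0.0, 1.0])
evid = np.array([1.0, 0.0, 0.0])

# Distinct one-hot vectors never overlap, so their Hadamard product vanishes,
# zeroing all subleading contributions to the hidden activity.
prod = go * evid
```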

#### Review of RNN analysis

Since we have observed that RNNs tend to operate in the close vicinity of attractors, which in our case are slow/fixed points, we are motivated to approximate their behavior using a linear expansion [29]. If we expand the generic RNN expression, *F* (**h**, **x**), to linear order about the arbitrary location (**h**, **x**) = (**h**^{e}, **x**^{e}), we have
where we have defined the Jacobian matrices

If we take **h**^{e} to be a slow/fixed point under the input **x**^{e} = **0**, we have that **h**^{e} ≈ *F* (**h**^{e}, **x**^{e} = **0**). Inserting this into the above expression, we can approximate the effect of a given input **x**_{t} on the hidden activity **h**_{s+t} by (**J**^{rec})^{s}**J**^{inp}**x**_{t}. Thus an approximation of the state at time *t*, assuming **h**_{0} = **0**, is given by [29]

Using an eigendecomposition, we can always write **J**^{rec} = **RΛL** with **R** = **L**^{−1} and **Λ** = diag (*λ*_{1}, *…, λ_{n}*). Then this reduces to
where in the second line **r**_{i} and **l**_{i} are the columns of **R** and **L**, respectively. From this expression it is straightforward to understand the impact each input will have on a given hidden state in terms of its projection onto the eigenmodes. We see that an eigenmode's contribution will disappear/explode over time if *λ* < 1 or *λ* > 1, respectively. In practice, trained networks tend to have one eigenvalue *λ* ≈ 1 for each integration dimension (e.g. a single eigenvalue at *λ* ≈ 1 for a line attractor) [31]. So long as there are integrator modes, we see all inputs have the ability to contribute to a given **h**_{t} on roughly equal footing. This is fundamentally different from the operation of the MPN discussed above.
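The eigenmode picture above can be checked numerically; the Jacobian below is a hypothetical 2 × 2 example with one integrator mode (λ = 1) and one decaying mode (λ = 0.5):

```python
import numpy as np

# Hypothetical recurrent Jacobian: one integrator mode, one decaying mode.
J_rec = np.diag([1.0, 0.5])
x_contrib = np.array([1.0, 1.0])  # stands in for J_inp @ x_t

# Effect of an input after s further time steps: (J_rec)^s J_inp x_t.
s = 20
persisted = np.linalg.matrix_power(J_rec, s) @ x_contrib

# The lambda = 1 integrator mode survives intact; the lambda = 0.5 mode
# has decayed by a factor of 0.5^20 and is effectively gone.
```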

### 5.4 Network dynamics and dimensional reduction

#### 5.4.1 Determining dimensionality, PCA projection, and explained variance

Throughout this work, we use PCA to visualize the dynamics. This is always done by passing a test set of size *m* to the network, and collecting all activity vectors/matrices (e.g. hidden activity or SM matrices) over the entire test set and some time period, *S*. Unless otherwise stated, this time period is the entire input sequence, i.e. *S* = *T*, and thus this yields on the order of *m* × *T* vectors/matrices. Let *D* denote the size of the space said vectors or matrices live in (e.g. *D* = *n* for hidden activity, *D* = *n* × *d* for SM matrices). PCA is then performed over this set, which yields some set of PCA vectors, **w**_{α}, and their associated ratios of variance explained, *v_{α}*, for *α* = 1, *…*, min (*D*, *m* × *T*). The dimensionality of activity is determined by calculating the *participation ratio* of the *v_{α}* [31],

PR = (∑_{α} *v_{α}*)^{2} */* ∑_{α} *v_{α}*^{2}.

Except for the smallest networks considered in this work, we always find PR ≪ *D*, which we take to mean the activity operates in a low-dimensional space.
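The participation ratio computation can be sketched as follows, applying PR = (∑_{α} v_{α})² / ∑_{α} v_{α}² to PCA explained-variance ratios (the activity here is synthetic, confined to a 2-dimensional subspace of D = 20 for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def participation_ratio(v):
    """PR = (sum_a v_a)^2 / sum_a v_a^2 for explained-variance ratios v_a."""
    v = np.asarray(v)
    return v.sum() ** 2 / (v ** 2).sum()

# Synthetic activity confined to a ~2-dimensional subspace of D = 20.
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
basis = np.zeros((2, 20))
basis[0, 0] = basis[1, 1] = 1.0
activity = latent @ basis + 1e-3 * rng.normal(size=(1000, 20))

v = PCA().fit(activity).explained_variance_ratio_
pr = participation_ratio(v)  # close to 2, far below D = 20
```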

To determine the hidden activity's proportion of variance explained by either the accumulated evidence or the present input, we associate each hidden activity vector, **h**_{t}, with both the relative score difference between the two classes and the present input, **x**_{t}, at the given time step. We then use `scikit-learn`'s `LinearRegression` class with default parameters to fit the hidden activity over an entire test set to the accumulated evidence and the present input individually. Note for the latter, which consists of four possible input categories, we use three indicator functions with the reference case chosen to be the 'null' input. A linear fit was chosen for simplicity, and the authors felt it was appropriate given that the outputs of the network are also a linear function of the hidden activity. The reported proportion of variance explained corresponds to the *r*^{2} of the individual fits, averaged over 10 separate initializations.
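A sketch of this fit on synthetic data (the evidence signal, direction vector, and hidden activity below are illustrative stand-ins, not the paper's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "hidden activity" that varies linearly with a scalar
# "accumulated evidence" signal plus noise.
rng = np.random.default_rng(0)
evidence = rng.integers(-5, 6, size=500).astype(float)
direction = rng.normal(size=20)  # hypothetical evidence-coding direction
hidden = np.outer(evidence, direction) + 0.1 * rng.normal(size=(500, 20))

# Fit hidden activity to the evidence regressor; score() returns the r^2,
# i.e. the proportion of variance explained by the fit.
reg = LinearRegression().fit(evidence[:, None], hidden)
r2 = reg.score(evidence[:, None], hidden)
```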

#### 5.4.2 Generalization of MPN dynamics for *N >* 2

In the main text, the dynamics of the MPN was discussed for the case of an *N* = 2 integration task. Here we expand on how said dynamics generalize for *N >* 2. Throughout this discussion, we always assume the number of classes is significantly smaller than the hidden activity dimension, i.e. *N* ≪ *n*.

Like the *N* = 2 case, for *N* > 2 we observe the hidden activity to continue to be low-dimensional and separated into distinct input-clusters (dependent upon the most recent input to the network). Fig. S3 shows a visualization of the dynamics for the MPN trained on an *N* = 3 task. Within each input-cluster, the hidden activity continues to vary with accumulated evidence. However, since keeping track of the relative scores between classes now requires the network to hold more than a single number, the variation with accumulated evidence within each input-cluster takes on a more complex structure. For example, in the *N* = 3 case, we see each input-cluster is roughly triangle-like, with the hidden activity belonging to the 3 possible output labels clustering toward each of the corners (Fig. S3e). Similar to the RNN dynamics (see below), we expect accumulated evidence for an *N*-class task to take on the approximate shape of an (*N* − 1)-simplex within each input-cluster [31].

Once again, the SM matrix continues to monotonically evolve along a particular direction with sequence index (Fig. S3g). We also find the SM matrix again encodes accumulated evidence of the distinct classes (Fig. S3h). That is, each distinct input deflects the SM matrix in a distinct direction, allowing said matrix to encode all previous inputs to the MPN up to that point.

As mentioned in the main text, we limit our analysis of dynamics to the case where the separate classes of the task are uncorrelated. Similar to RNNs, we expect the dynamics to qualitatively change in cases where classes are non-uniformly correlated with one another [31].

#### 5.4.3 Review of RNN Dynamics

Recall that to solve the *N*-class simple integration task, at a minimum a network needs to keep track of *N* − 1 relative evidence values (see Sec. 5.1 above). In general, so long as *N* ≪ *n*, after training the RNNs' hidden activity lies in a low-dimensional subspace that has the approximate shape of an *N* − 1 regular simplex (which has dimension *N* − 1) [31]. By moving around in said subspace, the RNN encodes the *N* − 1 relative evidence values needed to solve the task. If instead the network needed to keep track of the absolute evidence for each class, we would expect the network's hidden activity to lie in an *N*-dimensional subspace. For example, in the *N* = 2 case, the relative evidence is a single number, so we only see a one-dimensional line attractor (a 1-simplex). Beyond two classes, the attractor becomes higher-dimensional, e.g. *N* = 3 yields a two-dimensional triangular attractor (a 2-simplex) because the network needs to keep track of the relative evidence of two distinct pairs of classes, from which it also encodes the relative evidence of the final pair of classes (Fig. S3).

### 5.5 Figure details

**Figure 1** details.

Explicit expressions for the MPN, fully connected network, and (Vanilla) RNN are given above in Sec. 5.2. The details of the backpropagation modified weights are given in Sec. 5.2.1. The details of the modulated synapses/weights are given in Eq. (2).

**Figure 2** details.

Details of the simple integration task shown in Fig. 2 can be found in Sec. 5.1 above.

**Figure 3** details.

All subplots in this figure are for networks trained on a 2-class integration task with *T* = 20 and no delay period. For this task, there are only four possible inputs to the networks: ‘evid_{1}’, ‘evid_{2}’, ‘null’, or ‘go’. Networks are trained until an accuracy threshold of 98% is reached, with a minimum of 2000 training batches.

Figs. 3a,d show the activity over the input sequence for a few randomly chosen hidden neurons of a single test example. The single test example is randomly chosen from the set of examples that have at least half of the maximum possible accumulated evidence difference. A large collection of plots in this work are projected into PCA space as a means of visualizing the low-dimensional dynamics. To generate the plot in Fig. 3b, all hidden neuron activity, **h**_{t} for *t* = 1, *…, T*, was collected over a test batch of size 1000. Across this entire set of hidden activity, PCA was performed (with a centering of per-feature means). The projection of the hidden neuron activities onto their first two components is plotted in Fig. 3b. The hidden activity is colored by the relative evidence of the two classes, i.e. total evidence for the red class minus evidence for the blue class, at the associated time step. In addition to the hidden activity, two example trajectories and the two readout vectors are projected onto the hidden neuron PCA space as well. Fig. 3c contains the same hidden activity projection, but now the hidden states are colored by the most recent input passed to the network, i.e. whether the input corresponded to ‘evid_{1}’ (red), ‘evid_{2}’ (blue), ‘null’ (grey), or ‘go’ (purple). The inset was computed by calculating the change in the hidden activity, **h**_{t} − **h**_{t−1}, induced by the input **x**_{t}, again projected into the hidden activity PCA space. Once again, these changes in hidden activity are colored by the corresponding input. Additionally, for each possible input, the set of all hidden activity changes was averaged together to form a mean change in activity, and this was also plotted in the inset as a darker line. For the MPN hidden activity, Figs. 3e,f were generated in an identical manner to Figs. 3b,c (with the exception of the inset). Note there is no initial hidden activity for the MPN, just an initial SM matrix.
The inset of Fig. 3f was computed by separating all hidden activities into distinct groups by their most recent input. Then, for each group of hidden activities, PCA was performed on the subset. The ‘RO alignment’ plotted in the inset is the average across all readout vectors of the cosine angle magnitude between the readout vector, **r**, and the top PC direction, **w**, i.e. |**r** · **w**| */* (∥**r**∥_{2}∥**w**∥_{2}).
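The ‘RO alignment’ measure can be written compactly; the toy vectors below are purely for illustration:

```python
import numpy as np

def readout_alignment(readout, top_pc):
    """|r . w| / (||r||_2 ||w||_2): cosine-angle magnitude between a readout
    vector r and a top PC direction w ('RO alignment')."""
    return abs(readout @ top_pc) / (
        np.linalg.norm(readout) * np.linalg.norm(top_pc))

# Toy example: readout at 45 degrees to the top PC direction.
r = np.array([1.0, 1.0, 0.0])
w = np.array([1.0, 0.0, 0.0])
align = readout_alignment(r, w)  # cos(45 deg)
```

In the figure, this quantity is averaged over all readout vectors.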

The MPN state (**M**_{t}) activity shown in Figs. 3g,h is plotted in an analogous manner to the hidden activity for the previous subplots in this figure. That is, we collect all **M**_{t} for *t* = 0, *…, T* over the test set, and perform PCA on the *flattened* matrices. We then project the flattened states onto said PCA space and color them in the same manner as the hidden activity in Figs. 3b,e (relative evidence of the two classes). Figs. 3g,h differ only in that they show two different PCA directions along the *y*-axis. The final SM matrix for each input sequence was colored in a slightly darker shade (but still colored by relative evidence) for clarity. We note that the time-evolution direction is not always aligned with the second PC direction as it is in this plot – in our tests it was also often aligned with the first PC direction or diagonally oriented in the PC1-PC2 plane. The inset of Fig. 3h is completely analogous to that in Fig. 3b, but for the MPN state rather than the RNN state. That is, it shows the change in SM matrix, **M**_{t} − **M**_{t−1}, colored by the present input, **x**_{t}.

**Figure 4** details.

The networks that generated the data in the top row of this figure were all trained on an *N* = 2, *T* = 20 integration task until they reached an average validation accuracy of 98%, with a minimum training size of 2000 batches. MPNs were trained at a variety of *λ*^{max} values, but in practice we found *λ* always saturated at its maximum, so *λ* = *λ*^{max}. After training, the networks were then passed longer sequences (up to *T* = 200) and their states (i.e. **h**_{t} for the RNNs and **M**_{t} for the MPNs) were tracked as a function of *t*. Note the same MPN was used to generate Fig. 4a as that used to generate the dynamics plots in Figs. 3[d-f]. Additionally, the same Vanilla RNN was used to generate Fig. 4b as that used to generate the dynamics plots in Figs. 3[a-c]. All data in Figs. 4c,d were averaged over 10 different network initializations. The normalized final states for the RNNs and MPNs are given by

Note the final state of the MPN is chosen to be **M**_{T−1} because **M**_{T} is not used in the calculation of the final output.

For the delay task shown in the bottom row, we train the networks on an *N* = 2 integration-delay task with *T*_{delay} = 20 and a total sequence length of *T* = 40. Again, training is stopped when the networks reach an accuracy threshold of 98% with a minimum training size of 2000 batches. Figs. 4e,f are generated analogously to Figs. 4a,b, with *T*_{delay} varied from 10 to 100. The red and blue points shown in Figs. 4e,f correspond to the state projections immediately preceding the onset of the delay. Note Fig. 4f shows the state space of a GRU while Fig. 4b shows that of a Vanilla RNN, since the latter was not able to train on the integration-delay task. Figs. 4g,h are generated analogously to Figs. 4c,d, using the same definition of normalized final states.

**Figure 5** details.

The networks shown in Figs. 5a,b were trained on an *N* = 3 integration task, with *T* = 40 and *T*_{delay} = 20. Sample neurons in Fig. 5a are randomly chosen from a single randomly chosen test example. To calculate the decoding accuracy in Fig. 5b, a linear SVC was trained on the hidden activity of each of the networks, with each hidden activity labeled by its corresponding input sequence's label. A linear SVC was chosen because the readout vectors also implement flat decision boundaries (although they are piece-wise in the case of the readout vectors). The linear SVC was implemented using `scikit-learn`'s `LinearSVC` class with default settings, except the number of iterations was increased to 100,000 and class weights were balanced. Since the number of classes in this example was *N* = 3, the multi-class strategy was all-vs.-one. Additionally, 10-fold cross-validation was used and the results shown are averaged over folds.
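A sketch of this decoding-accuracy pipeline on synthetic hidden activity (data, cluster centers, and dimensions are illustrative stand-ins; the SVC settings follow the text):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic 3-class "hidden activity": one noisy cluster per label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
centers = rng.normal(size=(3, 10))
hidden = centers[labels] + 0.3 * rng.normal(size=(300, 10))

# Linear SVC with increased iterations and balanced class weights,
# evaluated with 10-fold cross-validation.
svc = LinearSVC(max_iter=100_000, class_weight="balanced")
scores = cross_val_score(svc, hidden, labels, cv=10)
mean_acc = scores.mean()  # decoding accuracy, averaged over folds
```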

Fig. 5c contains the activity as a function of time of a few randomly chosen components of the hidden activity and SM matrix of an MPN from a randomly chosen test example. The theoretical predictions come from Eqs. (4) and (5). In Fig. 5d, to quantify the error in the theoretical approximation we use the following expressions
where ∥ · ∥_{2} is the L2 norm of the corresponding vector or matrix (the Frobenius norm in the latter case) and and are the approximations in Eqs. (4) and (5), respectively. See the derivation leading up to Eqs. (20) and (19) for details of these approximations.

**Figure 6** details.

Network capacity in Fig. 6a was measured by training networks on tasks with an increasingly large number of classes, *N*. A sequence length of *T* = 20 was used for all *N*. Training was conducted until the rolling average validation loss plateaued. The rolling average validation loss is the average of the last 10 measured validation losses. Since validation loss is only computed every 10 batches, this represents an average over the previous 100 batches of training. In particular, if the current rolling average validation loss was larger than the rolling average validation loss measured 100 batches prior, the network's training was determined to have plateaued. Occasionally this metric may have caused a network to end training early, hence for the aggregate measure we plotted the median accuracy over 10 separate initializations. In Fig. 6b, network capacity as a function of sequence length, *T*, was computed in an identical manner. A class size of *N* = 2 was used for all *T*.

For Fig. 6c, we trained networks at a fixed ratio of (expected) input noise and signal magnitude. Specifically, if the input without noise is **x** and the noise is **n**, then

*α* = 10 was chosen for the initial training. We then systematically scanned over *α* values and measured accuracy (without retraining on different *α*). It was observed that training the networks at lower *α* did not change the results significantly.

For the echo-state setup in Fig. 6d, all weights except for the readout matrix, **W**_{RO}, were frozen at their initialization values during training. Since in the MPNs, *λ* and *η* play especially important roles in their operation, we initialized *λ* = *λ*^{max} = 0.95 and *η* = 1.0. Training was performed until the validation loss plateaued, see above.

Fig. 6e was generated by first training the networks on an *N* = 2 integration-delay task until an accuracy threshold, measured by the rolling average validation accuracy (over the last 10 measured validation runs), reaches 97% or higher. Then, the network was trained on a new *N* = 2 integration-delay task, after which we test how much the accuracy on the original task fell. We take *T* = 40 and *T*_{delay} = 20 for both tasks. The new task uses the same random vectors for the ‘null’ and go signal inputs, but generates new random vectors for the two types of evidence. The label set of the new task is the same as the old task, and the readouts must be adjusted to compensate. That is, we use the same readouts for both tasks, so the readouts may get adjusted during the second task. This is chosen to simulate the bottleneck of sparse strong connections between areas of the brain. After the network achieves the same accuracy threshold on the new task, its accuracy on the original task is reported. In this setup, we reduce the networks’ batch size to 1 so the training can be monitored down to individual examples, and we also reduce the L1 regularization coefficient to 10^{−6}. Additionally, the raw data in Fig. 6e show an average over 20 training examples, and some accuracies (for the GRU in particular) are too low to be seen on the plot.

For the MPN, we find the change in dimensionality of the state space (post-training minus pre-training) to be 0.08 ± 0.04 (mean ± s.e.). Similarly, for the GRU, we find the change to be 0.19 ± 0.08 (mean ± s.e.). To find the change in GRU line attractor angle, we take the first PC direction of the hidden states to be a good measure of the direction of the line attractor. For both pre- and post-training, we continue to find the dimensionality of the GRU state space to be small, 1.52 ± 0.10 and 1.71 ± 0.13, respectively (mean ± s.e.).

The PC plots in Figs. 6f,g are generated as per usual for the states of the respective networks (including the novel states). Looking along higher PC directions, we did not observe a large degree of separation between the new class and previous classes in the RNN.

The decoding accuracy in Fig. 6h was calculated by training a linear SVC on only the final state activity of the MPN and GRU. The sequence labels were used to label the states. Otherwise, the setup for the linear SVC was identical to that used in Fig. 5b, see above. Since the MPN and GRU have the same *n* = 100 and *d* = 50, this means the state of the MPN has 50 times more components than that of the GRU. Thus, to ensure the SVCs are being trained on equal footing for the two networks, we dimensionally reduce the (flattened) MPN states using PCA so that said states have the same number of components as the GRU state (100 in this case). Note that in practice, both networks operate their states in significantly smaller subspaces than the number of components they have, and said subspaces are of comparable dimension for the two types of networks (but not the same, and this may contribute to the MPN's higher accuracy).
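The dimensionality-matching step can be sketched as follows (toy dimensions stand in for the text's n = 100, d = 50, and the states are random placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-ins for n = 100, d = 50 in the text.
n, d, n_samples = 10, 5, 200
rng = np.random.default_rng(0)
mpn_states = rng.normal(size=(n_samples, n, d))  # one n x d SM matrix each

# Flatten the SM matrices (n*d components each), then PCA-reduce to n
# components so the decoder sees the same state size as for the GRU.
flat = mpn_states.reshape(n_samples, n * d)
reduced = PCA(n_components=n).fit_transform(flat)
```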

**Figure 7** details.

Details on the tasks shown in Figs. 7a,e,i can be found in Sec. 5.1 above. Accuracy on the retrospective task was found by training 10 different initializations on said task with *T* = 40 and *T*_{delay} = 20 until the average validation loss saturated (with a minimum training time of 2000 batches). In figures for the retrospective/prospective integration, we color the example sequences by both their final label as well as their context, yielding 2 × 2 = 4 possible colorings.

For Figs. 7[b-d] and Figs. 7[f-h], data is shown for a single network, trained to an accuracy threshold of 98%. In Figs. 7d,h, we define the output difference as follows. To determine the potential output at any given time-step, we perform the go signal projection of the states **M**_{t}, see Eq. (12), and then further pass this through the readouts,

For an entire test set, we then group each sequence of outputs into the *N* × 2 × *N* = 2*N* ^{2} = 8 distinct possible combinations of label, subtask, and context. We then compute the average output sequence for each of these eight groupings, to arrive at eight sequences that represent the average output in each situation. At risk of abusing the number of subscripts a variable can take, let **ȳ**_{t,ℓ,τ,c} be this average output sequence for label *ℓ* = 1, 2, subtask *τ* = 1, 2, and context *c* = 1, 2 at a given time *t*. For *N* = 2, each **ȳ**_{t,ℓ,τ,c} is a two-dimensional vector. The *readout difference* at a given time *t*, Δ*y _{t,τ,c}*, is then defined to be

Intuitively, this is the average over the output vector-component difference of the two possible labels. For each of these quantities, the network should have learned that this difference should be positive if the sequence is to be labeled correctly (e.g. when the task has label *ℓ* = 2, the second component of **y** should be larger than the first, so *y*_{2} − *y*_{1} *>* 0). The more positive a given Δ*y _{t,τ,c}*, the better the network is distinguishing sequences that belong to the individual labels at the given time step. The four lines in Figs. 7d,h are generated by plotting all four combinations of *τ* = 1, 2 and *c* = 1, 2, where the solid lines correspond to *τ* = *c* and the colors correspond to the *τ* label.

Fig. 7l was generated by randomly choosing an example sequence and the values of a single input neuron and a single hidden neuron. Each sequence was then normalized to have zero mean and a maximum magnitude of 1 for better comparison.

**Figure S1** details.

All subplots in this figure were generated in an analogous manner to those in Fig. 3. Figs. S1a,b are the same as Figs. 3b,c, except the network is a GRU instead of a Vanilla RNN. Figs. S1c,d are the go signal projection, see Eq. (12), of the **M**_{t} states shown in Figs. 3g,h. Fig. S1c has the states colored by accumulated evidence, Fig. S1d has the states colored by present input. Figs. S1[e-h] are the same as those in Figs. 3[e-h], except the MPN has *ϕ* (·) = max (0, ·), i.e. the ReLU activation function. Finally, Figs. S1[i-l] are also the same as those in Figs. 3[e-h], but for the additive MPN of Eq. (13).

**Figure S2** details.

In Figs. S2[a-c], we plot several measures of the MPN and GRU trained on the same integration-delay task explored in Fig. 5. For the MPN, we vary the magnitude of the delay input, which throughout the main text was set to 0. The mean hidden activity of Fig. S2a is given by . In Fig. S2b, we again plot the decoding accuracy trained on the hidden activity, see the description of Fig. 5b above. Notably, decoding accuracy during the delay period increases slightly with larger delay input. In Fig. S2c, we plot the time-variation over a rolling time window, normalized relative to the size of the activity. The normalized time variability is computed over a *τ*-time-step rolling window. Specifically, for each neuron it is given by

The plot is then created by averaging this quantity over both batches and neurons for all time steps *t* ≥ *τ*, with *τ* = 5. We see the MPN not only has significantly higher time-variation than the GRU during the delay period, but also during the stimulus period that precedes it.

Fig. S2d shows the accuracy of analytical approximations, see description of Fig. 5d above, this time as a function of sequence time. The accuracy of the **h**_{t} approximation gets worse as a function of time due to the compounding effects of the approximation over many time steps.

Figs. S2[e-g] show the decoding accuracy as a function of decoder train and test time (all use the same color scale shown on the far right). The only difference from the calculation of decoding accuracy from Fig. 5b is that the train and test times can be different in this setup. Fig. S2e shows the data for a GRU, Fig. S2f for the MPN with zero delay input magnitude, and Fig. S2g for the MPN with a delay magnitude of 0.05.

**Figure S3** details.

All subplots in this figure were generated in an analogous manner to those in Fig. 3, but for the case of *N* = 3 instead of *N* = 2. Hidden activity and SM matrix activity that were previously colored by accumulated evidence are now colored by example label instead.

**Figure S4** details.

For plots S4[a-e], the additive MPN was trained in an identical setting to the MPN and RNNs of Fig. 6[a-e]. For the noise plot, the additive MPN was not able to achieve the same accuracy thresholds as the other networks, so training was capped at 16,000 batches. A similar iteration threshold was used for catastrophic forgetting, as the networks also occasionally struggled to learn the delay task for certain initializations.

**Figure S5** details.

Details of the true-anti context task can be found in Sec. 5.1 above. In Figs. S5b,c, we color the SM matrix/hidden activity by both label and context, with the darker colors corresponding to the ‘anti-’ context. Fig. S5b is the PC projection of the SM matrix over all sequence time steps. Fig. S5c is the PC projection of only the final hidden activity.

Figs. S5d,e quantify how the accumulated evidence direction aligns with the readout vectors for both the retrospective and prospective contextual integration tasks. The accumulated evidence direction is computed for each subtask and context combination, so four lines are shown in each subplot. At each time step, for each possible subtask/context combination, the SM matrix states are sorted by the amount of accumulated evidence they encode (which takes on a finite number of integer values). All states that have the same accumulated evidence are averaged together, yielding a set of average SM matrix states for all possible accumulated evidence scores at that time step and subtask/context combination. For a given subtask/context combination, the average SM states are mapped to hidden activity using the go signal projection, Eq. (12); said hidden activities are then fit using PCA, with the top PC direction taken to be the direction of accumulated evidence variation in the hidden activity space. The readout alignment is then the average cosine angle of the top PC direction with the two readout vectors (taking the maximum over the two possible directions PC1 could point, to eliminate the sign ambiguity). Thus, a higher readout alignment means the direction of accumulated evidence variation for the subtask is more aligned with the readout vectors, indicating that the readout vectors are better at distinguishing evidence for the particular subtask. Notably, once context is passed, the irrelevant subtask's accumulated evidence variation becomes close to perpendicular to the readouts, since its accumulated evidence is no longer relevant for the task. Finally, Fig. S5e shows the readout alignment only for time steps after stimulus inputs are passed to the network, since the accumulated evidence is always zero prior to said inputs and thus the accumulated evidence variation direction is ill-defined.

Fig. S5f simply shows a closeup of the continuous integration hidden activity shown in Fig. 7j.

## A Analytical Approximations

### Hadamard identity

To start, note the following Hadamard identity for matrices composed of outer products. Below, we often have multiplications of the form ((**h** **x**^{T}) ⊙ **W**) **x**′. Writing out the elements of this product explicitly, we have

$$\left[\left(\left(\mathbf{h}\mathbf{x}^{T}\right) \odot \mathbf{W}\right)\mathbf{x}'\right]_{a} = \sum_{b} h_{a} x_{b} W_{ab} x'_{b} = h_{a} \left[\mathbf{W}\left(\mathbf{x} \odot \mathbf{x}'\right)\right]_{a},$$

and thus, in matrix notation, we have

$$\left(\left(\mathbf{h}\mathbf{x}^{T}\right) \odot \mathbf{W}\right)\mathbf{x}' = \mathbf{h} \odot \left(\mathbf{W}\left(\mathbf{x} \odot \mathbf{x}'\right)\right). \tag{33}$$

In words: we can effectively ‘move the **x** through the **W**’.
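The identity is straightforward to verify numerically; a quick NumPy check with arbitrary dimensions (chosen here only for illustration) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 7  # hidden and input dimensions, arbitrary for this check

h = rng.standard_normal(n)
x = rng.standard_normal(d)
x_prime = rng.standard_normal(d)
W = rng.standard_normal((n, d))

# Left-hand side: ((h x^T) ⊙ W) x'
lhs = (np.outer(h, x) * W) @ x_prime
# Right-hand side: h ⊙ (W (x ⊙ x'))
rhs = h * (W @ (x * x_prime))

assert np.allclose(lhs, rhs)
```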

### Approximation for M_{t}

We begin by deriving an expression for **M**_{t} in terms of all previous inputs, **x**_{s} for *s* = 1, *…, t*, and the initial state **M**_{0}. Explicitly expanding the expression from our update equations, we have

$$\begin{aligned} \mathbf{M}_{t} &= \lambda \mathbf{M}_{t-1} + \eta\, \mathbf{h}_{t} \mathbf{x}_{t}^{T} \\ &= \lambda \mathbf{M}_{t-1} + \eta \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t} + \left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{t-1}\right)\mathbf{x}_{t}\right)\mathbf{x}_{t}^{T} \\ &= \lambda \mathbf{M}_{t-1} + \eta \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t} + \lambda\left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{t-2}\right)\mathbf{x}_{t} + \eta\, \mathbf{h}_{t-1} \odot \left(\mathbf{W}_{\text{inp}}\left(\mathbf{x}_{t-1} \odot \mathbf{x}_{t}\right)\right)\right)\mathbf{x}_{t}^{T}, \end{aligned} \tag{34}$$

where, in reaching the last line, we have used the above Hadamard identity, Eq. (33). So far, this expression is exact, but it is clearly becoming unwieldy quite quickly. We can see additional steps backward in time will make this expression even more complicated, so we look for an approximation.

After training the MPN, we find *η* = *𝒪* (1) and *λ* = *λ*^{max} = *𝒪* (1). Additionally, ∥**W**_{inp}**x**∥_{2} ≈ ∥**x**∥_{2}, so passing the input vectors through the input layer does not significantly change their magnitude. In the last line of Eq. (34), note that we have a term of the form (**x** ⊙ **x**′). As noted in the main text, for *d* ≫ 1, these terms are small relative to terms with just **x**, see Eq. (8). Additionally, this term also has a Hadamard product in the hidden activity space. If we drop the (**x**_{t−1} ⊙ **x**_{t}) term, since we know it will be small, and continue on this way, we arrive at

$$\mathbf{M}_{t} \approx \lambda^{t}\mathbf{M}_{0} + \eta \sum_{s=0}^{t-1} \lambda^{s} \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t-s} + \lambda^{t-s-1}\left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{0}\right)\mathbf{x}_{t-s}\right)\mathbf{x}_{t-s}^{T}, \tag{35}$$

where, in each successive step backward in time, we have dropped any **x** ⊙ **x**′ terms. Finally, taking **M**_{0} = **0**, we arrive at the expression in the main text.
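As a numerical sanity check of this approximation, one can compare the exact **M**_{t} recursion against the truncated sum. The sketch below assumes the multiplicative MPN update with a tanh nonlinearity and random inputs with components of size 𝒪(1/√d); it is illustrative only, not the training setup from the main text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 32, 64, 10        # hidden size, input size, sequence length
lam, eta = 0.95, 1.0        # decay and learning rate, both O(1) as in the text

W = rng.standard_normal((n, d)) / np.sqrt(d)   # input layer W_inp
# Inputs with components O(1/sqrt(d)), so x ⊙ x' terms are O(1/d)
xs = rng.standard_normal((T, d)) / np.sqrt(d)

# Exact recursion: h_t = tanh[(W ⊙ (1 + M_{t-1})) x_t], M_t = λ M_{t-1} + η h_t x_t^T
M = np.zeros((n, d))
for x in xs:
    h = np.tanh((W * (1.0 + M)) @ x)
    M = lam * M + eta * np.outer(h, x)

# Leading-order approximation with M_0 = 0: M_t ≈ η Σ_s λ^s tanh(W x_{t-s}) x_{t-s}^T
M_approx = sum(eta * lam**(T - 1 - j) * np.outer(np.tanh(W @ xs[j]), xs[j])
               for j in range(T))

rel_err = np.linalg.norm(M - M_approx) / np.linalg.norm(M)  # small for d >> 1
```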

### Approximation for h_{t}

Again our goal is to write **h**_{t} in terms of all previous inputs. Expanding, we have

$$\begin{aligned} \mathbf{h}_{t} &= \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t} + \left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{t-1}\right)\mathbf{x}_{t}\right) \\ &\approx \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t} + \lambda^{t-1}\left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{0}\right)\mathbf{x}_{t} + \eta \sum_{s=1}^{t-1} \lambda^{s-1}\left(\mathbf{W}_{\text{inp}} \odot \left(\mathbf{h}_{t-s}\mathbf{x}_{t-s}^{T}\right)\right)\mathbf{x}_{t}\right) \\ &= \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t} + \lambda^{t-1}\left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{0}\right)\mathbf{x}_{t} + \eta \sum_{s=1}^{t-1} \lambda^{s-1}\, \mathbf{h}_{t-s} \odot \left(\mathbf{W}_{\text{inp}}\left(\mathbf{x}_{t-s} \odot \mathbf{x}_{t}\right)\right)\right), \end{aligned} \tag{36}$$

where in the second line we have used the expansion for **M**_{t}, Eq. (16), and in the final line we have used Eq. (33). Again, note that we have many terms that contain a **x** ⊙ **x**′ factor and thus are small relative to any terms with just **x**. All **h**_{t−s} dependence comes with a **x** ⊙ **x**′ factor, and thus is already a sub-leading dependence. The leading-order approximation for a generic **h**_{t−s} is then

$$\begin{aligned} \mathbf{h}_{t-s} &\approx \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t-s} + \lambda^{t-s-1}\left(\mathbf{W}_{\text{inp}} \odot \mathbf{M}_{0}\right)\mathbf{x}_{t-s}\right) \\ &= \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t-s}\right), \end{aligned} \tag{37}$$

where in the second line we have assumed **M**_{0} = **0**. Inserting this approximation into Eq. (36) and taking **M**_{0} = **0**, we have

$$\mathbf{h}_{t} \approx \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t} + \eta \sum_{s=1}^{t-1} \lambda^{s-1} \tanh\left(\mathbf{W}_{\text{inp}}\mathbf{x}_{t-s}\right) \odot \left(\mathbf{W}_{\text{inp}}\left(\mathbf{x}_{t-s} \odot \mathbf{x}_{t}\right)\right)\right). \tag{38}$$

Note the leading-order contribution to a given **h**_{t} is only dependent upon the current input, **x**_{t}. Any dependence on previous inputs only comes in at sub-leading order.
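This leading-order behavior is easy to check numerically. Under the same assumed update equations as in the previous sketch (illustrative only, not the paper's code), the exact **h**_{t} stays close to tanh(**W**_{inp}**x**_{t}), and the residual is largely captured by the sub-leading corrections of Eq. (38):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T = 32, 64, 10
lam, eta = 0.95, 1.0

W = rng.standard_normal((n, d)) / np.sqrt(d)   # input layer W_inp
xs = rng.standard_normal((T, d)) / np.sqrt(d)  # inputs with small components

# Exact hidden activity from the full recursion
M = np.zeros((n, d))
for x in xs:
    h = np.tanh((W * (1.0 + M)) @ x)
    M = lam * M + eta * np.outer(h, x)

# Leading order: depends only on the current (last) input
h_lead = np.tanh(W @ xs[-1])

# Sub-leading corrections carry an (x_{t-s} ⊙ x_t) factor, cf. Eq. (38)
corr = sum(eta * lam**(s - 1)
           * np.tanh(W @ xs[-1 - s]) * (W @ (xs[-1 - s] * xs[-1]))
           for s in range(1, T))
h_sub = np.tanh(W @ xs[-1] + corr)

err_lead = np.linalg.norm(h - h_lead) / np.linalg.norm(h)
err_sub = np.linalg.norm(h - h_sub) / np.linalg.norm(h)
```

Here `err_lead` is already small, and including the sub-leading terms shrinks the residual further, consistent with the claim that previous inputs enter only at sub-leading order.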

Above, our arguments relied on the fact that the weight modulations are small as a result of the Hadamard product between vectors with components *<* 1. One could ask if the same dynamics are observed if the vector components were no longer *<* 1. To test this, we trained the network with input vectors whose components are of size *𝒪* (1). In this setting, we again found that the network finds a solution such that terms from weight modulations are small relative to unmodulated contributions. This again resulted in hidden activity dynamics that are primarily driven by the most recent input to the network. We leave further analysis of MPNs trained in settings that might increase the relative size of weight modulations for future work.

## B Additional Figures

## Acknowledgements

We thank Kayvon Daie, Niru Maheswaranathan, and Danil Tyulmankov for feedback on this paper. We also wish to thank the Allen Institute for Brain Science founder, Paul G. Allen, for his vision, encouragement, and support.

## Footnotes

^{1} Through most of this work, the number of neurons in the output layer of our networks will always be equal to the number of classes in the task, so we use *N* to denote both unless otherwise stated.

^{2} We omit sequences with two or more classes tied for the most evidence. See Methods 5.1 for additional details.

^{3} See Methods 5.4 for a more in-depth review of these results, including how approximate dimensionality is determined.

^{4} This idea leads to another intuitive visualization of the MPN behavior by asking what **h**_{t} would look like at any given time step if the most recent input is the go signal (Figs. S1e,f).

^{5} Vanilla RNNs trained on this task perform poorly due to vanishing gradients, so we omit them from this analysis. It is possible to train them by bootstrapping their training, gradually increasing the delay period of the task.

^{6} Additional measures of neuron variability and activity silence are shown in the supplement (Fig. S2).

^{7} For *n* = 3 and fixed *d*, the number of trainable parameters in the MPN is smaller than that of the RNNs, see Table 1. With *d* = 40, the MPN has a number of trainable parameters more comparable to that of the *d* = 10 GRU (125 and 126, respectively).

^{8} We do not see a significant difference in accuracy for *η* = −1.0, i.e. an anti-Hebbian update rule.

^{9} Significant work has been done to protect networks from such pitfalls by, for example, modifying the training order [51] or the weight updates [52]. Here we do not implement any such methods; we are simply interested in how the different operating mechanisms cause the networks to behave “out of the box”.

^{10} An alternative way to quantify this difference is to compare each subtask’s direction of evidence variation (Fig. 7b) in hidden activity to the readout directions. We find that, after context is passed to the network, the now-irrelevant subtask’s evidence variation becomes close to perpendicular to the readouts, meaning their difference is almost irrelevant to the output neurons (Figs. S5d,e).

^{11} Note that simply drawing uniformly over the bank of possible inputs significantly biases the inputs away from sequences with more extreme evidence differences. This method is still biased toward evidence vectors with small relative differences, but much less so than the aforementioned method.

^{12} The range of numbers the network needs to keep track of also scales with the length of the sequence, *T*. For instance, in a 2-class integration task, the relative differences of evidence for the two classes, i.e. evidence for class one minus evidence for class two, are all integers in the range [−*T*, *T*].

^{13} We note that the additive MPN setup used here is slightly different from the ‘HebbFF’ used in Ref. [27]. Like the MPN used throughout this work, we add no bias to the hidden or output layers, use a tanh activation function, initialize parameters (including *η*) using Xavier initialization, limit *λ* to *λ*^{max} = 0.95, and train with L1 regularization. We also note the SM matrix is denoted by **A** instead of **M** in Ref. [27].

^{14} In brief, writing out the exact expression for **M**_{t}, several terms of the form (**x** ⊙ **x**′) appear. By definition of our input vectors, the non-zero components of the vector (**x** ⊙ **x**′) are small, see Eq. (8). In particular, they are small compared to the non-zero components of a stand-alone input **x** (i.e. without the Hadamard product). Thus, any term with (**x** ⊙ **x**′) is small compared to a term with just **x**.

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].