Context-modular memory networks support high-capacity, flexible, and robust associative memories

Context, such as behavioral state, is known to modulate memory formation and retrieval, but is usually ignored in associative memory models. Here, we propose several types of contextual modulation for associative memory networks that greatly increase their performance. In these networks, context inactivates specific neurons and connections, which modulates the effective connectivity of the network. Memories are stored only by the active components, thereby reducing interference from memories acquired in other contexts. Such networks exhibit several beneficial characteristics, including enhanced memory capacity, high robustness to noise, increased robustness to memory overloading, and better memory retention during continual learning. Furthermore, memories can be biased to have different relative strengths, or even gated on or off, according to contextual cues, providing a candidate model for cognitive control of memory and efficient memory search. An external context-encoding network can dynamically switch the memory network to a desired state, which we liken to experimentally observed contextual signals in prefrontal cortex and hippocampus. Overall, our work illustrates the benefits of organizing memory around context, and provides an important link between behavioral studies of memory and mechanistic details of neural circuits.

current networks 27, in which each memory is represented by the co-activation of a set of neurons, forming a cell assembly. In these models, memory patterns are stored as attractors of the network dynamics via, e.g., a Hebbian-like learning rule 28,29, in which the connections between neurons with correlated activity are strengthened. Though the substrate and learning mechanisms behind memory formation have yet to be fully uncovered, substantial experimental evidence supports both the emergence of cell assemblies during learning and correlation-based Hebbian plasticity 30.
For many associative network models, the number of stable memories that can be stored scales with the network size 27,31. A standard Hopfield model holds approximately $0.138N$ memories 32 (where $N$ is the number of neurons in the network), and several extensions and variants have been proposed to account for higher memory capacity or biological realism 27. Among these variants, introduction of more general learning rules can lead to an increase in the number of stable memories up to a limit of $2N$ for memory patterns that activate half of the neurons 33. The introduction of sparsity through low-activity patterns can further increase this number to more than $10N$ for a sparsity of 1% or less 34. However, even these improved models come with caveats, such as unrealistic assumptions (e.g., non-local learning rules) or other undesired properties (e.g., blackout catastrophic interference 31,35, in which all memories lose stability if the maximum capacity is surpassed), suggesting that our understanding of associative memory is still incomplete. Considering sequential memory storage (continual learning), blackout interference can be made more gradual by imposing weight bounds, causing memories to be slowly forgotten 27,31 (so-called palimpsest memories). However, memory capacity is limited in this case, as old memories are quickly overwritten with new ones 36. Various remedies have been proposed to alleviate forgetting in artificial neural networks during continual learning, including context-dependence and architectural modularity 37,38, but it is unclear how these methods might operate in a more biologically realistic setting.
Finally, how accessible each memory is (during recall) may limit the theoretically achievable storage capacity 39,40, which is often ignored in mechanistic memory models. For example, it has been posited that some cases of memory forgetting, such as amnesia, may be partially due to deficits in memory accessibility rather than decay of the memories themselves 41-43. Here too, context dependence may control memory expression and search 8,10, for example, by directing retrieval towards particular memories, through gating or biasing of memory strength 44.
In this work, we propose a new class of context-dependent associative memory models, which we call context-modular memory networks, inspired by previous theoretical studies 37,45 and experimental findings 10,12,13,21,39,46. In our model, memories are assigned to different contexts, defined by a set of active neurons and connections. We show that this modular architecture results in enhanced memory capacity, as well as robustness to overloading and continual learning, thereby providing a direct benefit for organizing memories in contextual categories. Furthermore, we propose that a separate "context-encoding" network interacts with the associative memory network, leading to a model which dynamically gates memory expression. Our model provides strong evidence for the benefits of context-dependent memory and draws links between mechanistic circuit details and memory function which can be tested experimentally.

Results
Inspired by classic models of associative memory 27,29,34, we introduce a context-dependent, i.e., context-modular, memory network (Fig. 1). It consists of a recurrent network of $N$ neurons which exhibit elevated or suppressed activity levels (taking values of 1 or 0, respectively; see Methods for details). We define a set of $s$ contextual states that control the network in the following two ways. First, each context may define a corresponding subset of available neurons ($N_{\mathrm{cxt}}$ of the total $N$ neurons, with activity level $a = N_{\mathrm{cxt}}/N$). All other neurons are kept inactive. We will refer to this type of contextual control as neuron-specific gating. Second, each context can also define a subset of available synaptic inputs per neuron ($K$ of the total $N_{\mathrm{cxt}}$ inputs on average, with connectivity sparseness $b = K/N_{\mathrm{cxt}}$). All other inputs are transiently gated off. We call this type of contextual control synapse-specific gating. A contextual subnetwork may be defined by neuron-specific gating, synapse-specific gating, or both. In a given network realization, each context can host a set of $p$ distinct memories, i.e., patterns in which half of the neurons in the corresponding subnetwork exhibit elevated activity, chosen randomly. The total number of patterns in the network is thus $P = sp$. The synaptic connectivity between neurons is defined using a Hebbian learning rule 29 (Methods). For the majority of this study, we consider networks in which only one context is active at any given time. The contextual state is imposed on the memory network by a separate context-encoding network that dynamically interacts with the memory network (Fig. 1B).
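The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the Methods implementation: it covers neuron-specific gating only, the covariance-style Hebbian rule (mapping 0/1 activity to -1/+1 and normalizing by the subnetwork size) is an assumed form, and all function names are hypothetical.

```python
import random


def make_context(n_neurons, a, rng):
    """Neuron-specific gating: choose N_cxt = round(a * N) available neurons."""
    n_cxt = round(a * n_neurons)
    return sorted(rng.sample(range(n_neurons), n_cxt))


def make_pattern(n_neurons, available, rng):
    """0/1 pattern in which half of the available neurons are elevated."""
    elevated = set(rng.sample(available, len(available) // 2))
    return [1 if i in elevated else 0 for i in range(n_neurons)]


def hebbian_weights(contexts, patterns_by_context, n_neurons):
    """Sum of outer products over all patterns (assumed covariance-style rule).

    Each pattern only contributes to synapses between neurons of its own context.
    """
    w = [[0.0] * n_neurons for _ in range(n_neurons)]
    for available, patterns in zip(contexts, patterns_by_context):
        for xi in patterns:
            for i in available:
                for j in available:
                    if i != j:
                        # map 0/1 activity to -1/+1 so that the mean is removed
                        w[i][j] += (2 * xi[i] - 1) * (2 * xi[j] - 1) / len(available)
    return w


def effective_weights(w, available):
    """Connectivity seen in one context: rows and columns of gated neurons are zeroed."""
    keep = set(available)
    n = len(w)
    return [[w[i][j] if i in keep and j in keep else 0.0 for j in range(n)]
            for i in range(n)]


rng = random.Random(0)
N, s, p, a = 40, 3, 2, 0.5          # toy sizes, chosen only for illustration
contexts = [make_context(N, a, rng) for _ in range(s)]
patterns = [[make_pattern(N, ctx, rng) for _ in range(p)] for ctx in contexts]
W = hebbian_weights(contexts, patterns, N)
W_ctx0 = effective_weights(W, contexts[0])   # effective connectivity in context 0
```

The full matrix `W` holds all $P = sp$ memories, while `W_ctx0` is the smaller effective matrix that remains once the neurons outside the first context are gated off.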
The gating schemes defined above can be interpreted as temporarily modifying the network such that there is a different effective connectivity matrix (and energy landscape) for each contextual state. Neuron-specific gating effectively removes specific rows and columns from the connectivity (Fig. 1C, left), whereas synapse-specific gating removes individual entries in the connectivity matrix, thereby making it more diluted, or sparse (Fig. 1C, middle). The combination of the two produces a smaller and sparser connectivity matrix (Fig. 1C, right), with a potentially large number of synaptic connections remaining hidden, to be used in other contexts. In the following, we study the properties of these context-modular memory networks, and make comparisons with classic associative memory models 29,34.

FIG. 1. Schematic of the context-modular memory network. A, Associative memory is defined hierarchically through a set of contexts ($c_1$ to $c_5$) and memory patterns ($m_1$ to $m_{15}$) assigned to each one. B, Network implementation: neurons are arranged into contextual configurations (subnetworks) in two ways: neuron-specific gating, where context is defined as a proportion of available neurons (colored rings; defined randomly, spatial localization is illustrative), and synapse-specific gating, where context is defined as a proportion of gated synapses (red cross, bottom right inset). Context is controlled by an external context-encoding network, such that one context is active at a time (black ring), and memories outside of the active context remain dormant. C, Contextual configurations change the effective connectivity matrix of the associative memory network: neuron-specific gating removes particular columns and rows (left), synapse-specific gating removes individual elements (center), and together, they will implement both effects (right).

Neuron-specific gating vastly improves memory capacity.
Starting with the model of neuron-specific gating (Fig. 1C, left; $a \le 1$, $b = 1$), we studied memory capacity using established signal-to-noise and heuristic mean-field methods 27,32. Memory capacity denotes the maximum number of stable patterns stored in the network, divided by the network size (Methods). Here, we have two notions of capacity. The subnetwork capacity is defined as

$$\alpha_{\mathrm{cxt}} = \frac{p}{N_{\mathrm{cxt}}}. \tag{1}$$

Podlaski, Agnes, Vogels (2020)

Thus, $\alpha_{\mathrm{cxt}}$ is the number of patterns stored in each subnetwork (context), $p$, divided by the number of active neurons in the subnetwork, $N_{\mathrm{cxt}}$. For simplicity, we impose that all subnetworks have the same size and store the same number of patterns. The total network capacity is thus

$$\alpha = \frac{P}{N} = \frac{sp}{N} = \alpha_{\mathrm{cxt}}\, s\, a, \tag{2}$$

which is the total number of patterns, $P = sp$, divided by the full network size, $N$, or equivalently, the subnetwork capacity, $\alpha_{\mathrm{cxt}}$, times the number of contexts, $s$, times the subnet ratio, $a = N_{\mathrm{cxt}}/N$. The stability of a memory pattern $\nu$ (of context $k$) can be evaluated by estimating the total input that a particular neuron $i$ receives when this memory is being recalled, denoted as $h_i^{k\nu}$. For large networks storing a random set of patterns, the input to a neuron in an elevated state is well approximated by Eq. 3, where $\mathcal{N}(\mu, \sigma^2)$ is a normally-distributed random variable with mean $\mu$ and variance $\sigma^2$. The two normally-distributed terms of Eq. 3 represent two different sources of noise (crosstalk) in the network connectivity that interfere with the stability of pattern $\nu$ of subnetwork $k$. The first term is the noise from the other $p - 1$ patterns of context $k$, which scales with the number of patterns per context, $p$. The second term represents the noise coming from all of the $(s - 1)p$ other patterns in the other contexts. This term scales with the total number of other patterns, but also with the relative amount of shared active neurons, i.e., the subnet ratio, $a$. If the memory is stable, then the probability that this neuron will inactivate should be low (i.e., $h_i^{k\nu} > 0$ with high probability). From this assumption, we arrive at an estimate of the maximum subnet capacity of

$$\alpha_{\mathrm{cxt}} = \frac{\alpha_H}{1 + (s - 1)a^2}, \tag{4}$$

with $\alpha_H \approx 0.138$ being the standard Hopfield network capacity 32 (Hopfield limit).
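The step from the two crosstalk terms to the bound on $\alpha_{\mathrm{cxt}}$ can be sketched up to proportionality. The sketch below is a schematic argument, not the exact expression used in the paper: the mean $\mu$, the variances $\sigma_1^2$ and $\sigma_2^2$, and the prefactor $c$ are placeholder symbols introduced here, and only the ratio of the two variances is taken from the text (each synapse is shared with $(s-1)a^2$ other contexts on average).

```latex
% Schematic signal-to-noise argument; \mu, \sigma_1, \sigma_2 and c are placeholders.
\begin{align*}
  h_i^{k\nu} &\approx \mu + \mathcal{N}(0, \sigma_1^2) + \mathcal{N}(0, \sigma_2^2),
  \qquad
  \sigma_1^2 = c\,\frac{p}{N_{\mathrm{cxt}}} = c\,\alpha_{\mathrm{cxt}},
  \qquad
  \frac{\sigma_2^2}{\sigma_1^2} = (s-1)\,a^2, \\
  \sigma_1^2 + \sigma_2^2 &= c\,\alpha_{\mathrm{cxt}}\bigl[1 + (s-1)a^2\bigr].
\end{align*}
% For a single context (s = 1) stability is lost when \alpha_cxt reaches \alpha_H,
% i.e. when the total variance reaches c\,\alpha_H. Requiring the same total
% variance with s contexts gives
\begin{equation*}
  \alpha_{\mathrm{cxt}}\bigl[1 + (s-1)a^2\bigr] \le \alpha_H
  \quad\Longrightarrow\quad
  \alpha_{\mathrm{cxt}} \le \frac{\alpha_H}{1 + (s-1)a^2}.
\end{equation*}
```

Under these assumptions the other contexts act purely as additional weight noise, which is why the subnetwork capacity can only decrease relative to the Hopfield limit.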
The intuition gained from the signal-to-noise analysis was confirmed by a more accurate mean-field approach 27,32 and simulations (Methods), shown in Fig. 2A-C. The denominator in Eq. 4 is always greater than or equal to 1, and so $\alpha_{\mathrm{cxt}} \le \alpha_H$. Thus, the subnetwork capacity is upper bounded by the Hopfield limit (Fig. 2A-C, green line & arrow), and decreases as the number of subnetworks or the subnet ratio increases. This is because each subnetwork intuitively acts as a standalone Hopfield network of size $N_{\mathrm{cxt}}$ with $p$ patterns, but with more noise in the weights due to the influence of the other $(s - 1)p$ memories. The second term in the denominator of Eq. 4, $(s - 1)a^2$, functions as a measure of the amount of overlap between subnetworks (explicitly, it is the expected number of additional contexts that each synapse will participate in).
From Eq. 2 and Eq. 4, we arrive at an expression for the maximum full network capacity,

$$\alpha = \frac{\alpha_H\, s\, a}{1 + (s - 1)a^2}. \tag{5}$$

Taking the limit of large $s$ (such that $s - 1 \approx s$), the full network capacity approaches $\alpha = \alpha_H \sqrt{s} \approx \alpha_H / a$. It follows that the high capacity emerges due to the sparsity in the subnetwork representation, growing sublinearly as a function of the number of contexts. Furthermore, the optimal subnetwork size for a fixed number of contexts is $a^* = 1/\sqrt{s - 1}$. Mean-field results support this analysis: although subnetwork capacity remains below the Hopfield limit, the total network capacity grows well above $\alpha_H$ for many parameter values (Fig. 2D-F). For example, the capacity of a network with 200 distinct contexts is $\alpha \approx 1.2$, almost an order of magnitude higher than the Hopfield limit. We thus see that the network as a whole has substantially increased memory capacity, due to reduced interference between memories found in different contexts. Importantly, this reduced interference depends upon the fact that the majority of memories are not retrievable in each context.
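The signal-to-noise estimates are simple enough to evaluate directly. The following sketch assumes the forms $\alpha_{\mathrm{cxt}} = \alpha_H / (1 + (s-1)a^2)$ and $\alpha = \alpha_{\mathrm{cxt}}\, s\, a$ that follow from the definitions in the text; the function names are hypothetical, and the numbers it produces are signal-to-noise estimates, not the mean-field values reported in the figures.

```python
import math

ALPHA_H = 0.138  # standard Hopfield capacity quoted in the text


def subnetwork_capacity(s, a):
    """Signal-to-noise estimate of the capacity per context."""
    return ALPHA_H / (1.0 + (s - 1) * a * a)


def total_capacity(s, a):
    """Total capacity: subnetwork capacity times number of contexts times subnet ratio."""
    return subnetwork_capacity(s, a) * s * a


def optimal_ratio(s):
    """Subnet ratio maximizing the total capacity for s > 1 contexts."""
    return 1.0 / math.sqrt(s - 1)


s = 200
a_star = optimal_ratio(s)
alpha_star = total_capacity(s, a_star)   # equals ALPHA_H * s / (2 * sqrt(s - 1))
```

At $a = a^*$ the denominator equals 2, so the estimate reduces to $\alpha_H s / (2\sqrt{s-1})$, which is about 0.98 for $s = 200$. The mean-field value quoted in the text for the same number of contexts is somewhat higher.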
Neuron-specific gating takes advantage of the reduced interference of low-activity memory patterns, which are known to increase memory capacity dramatically 34. The capacity of the low-activity Hopfield network is comparable to our network (Fig. 2E, dashed green line), but only for very low activity levels with low information content per pattern 47. In contrast, neuron-specific gating enables both high memory capacity and high information content (Fig. S1).
Synapse-specific gating can further improve capacity.
We next studied networks with synapse-specific gating (Fig. 1C, middle; $a = 1$, $b \le 1$), in which each subnetwork contains the full set of $N$ neurons, but shares only a proportion of weights with the other contexts. Initially, we chose a random subset of synaptic weights to be removed for each context. Thus, each subnetwork can be seen as a diluted Hopfield network 27. Repeating the same analysis as before, the total input coming into a particular neuron $i$ (of pattern $\nu$ in context $k$) is given by Eq. 6, which again contains two crosstalk (noise) terms. The fact that the parameter $b$ appears in the first noise term and not in the second reflects the fact that each individual pattern becomes less stable with increasing sparsity 48, with no benefit across contexts either. From Eq. 6, we arrive at a capacity per context that carries a factor $b$ in the numerator, reflecting the fact that, even without multiple contexts, memory capacity degrades roughly linearly with network dilution 27,48. The total network capacity follows from this capacity per context and the number of contexts. In this case, the optimal ratio of inputs per neuron is $b^* = 1$, which is independent of the number of contexts $s$, and thus the overall network capacity is bounded from above by $\alpha_H$. In other words, while synapse-specific gating does enable contextual grouping, the number of stable memories is restricted by the Hopfield limit. Mean-field results confirmed this intuition (Fig. S2).
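One functional form that reproduces every property stated above can be written down by repeating the neuron-specific argument with random dilution. It is offered as an illustrative assumption and not as the paper's own expression: it supposes that each synapse is kept independently with probability $b$ in every context, so that a synapse active in one context is shared with $(s-1)b$ other contexts on average.

```latex
% Assumed form for random synapse-specific gating (a = 1, b <= 1).
\begin{align*}
  \alpha_{\mathrm{cxt}} &= \frac{\alpha_H\, b}{1 + (s-1)\,b},
  &
  \alpha &= s\,\alpha_{\mathrm{cxt}} = \frac{\alpha_H\, s\, b}{1 + (s-1)\,b}. \\[4pt]
  \frac{\partial \alpha}{\partial b} &= \frac{\alpha_H\, s}{\bigl[1 + (s-1)\,b\bigr]^2} > 0
  &
  &\Longrightarrow\quad b^* = 1, \qquad \alpha(b^*) = \alpha_H .
\end{align*}
```

With this form the capacity per context has the factor $b$ in the numerator, the total capacity increases monotonically with $b$ for any $s$, and its maximum at $b^* = 1$ is exactly the Hopfield limit, in line with the conclusions drawn in the text.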
Because random synapse-specific gating proves ineffective, we devised a more selective way of imposing synaptic control. We note here that the Hebbian learning rule (Methods) sets the synaptic weight between each pair of neurons according to the correlation in their activity across all patterns, which acts to stabilize the majority of patterns 27. However, the set of memories in the active context may produce substantially different correlations, rendering some synaptic weights ineffective at stabilizing the majority of memories in this active set. Such synaptic weights have a net harmful effect on memory recall, and performance would improve for some contexts if these weights were set to zero. We thus propose and test the following gating scheme: if there is a mismatch between the sign of the synaptic weight serving to stabilize all memories and the sign of the weight that would serve to stabilize only the memories belonging to a specific context, then this synapse is turned off for this context (Fig. 3A, middle). We refer to this scheme as targeted synapse-specific (TaSS) gating. Interestingly, TaSS gating bears a resemblance to networks with binary synaptic weights 49, in which the standard Hebbian learning rule is passed through a sign function, and weights are set to $+1$ or $-1$. Previous work has shown that such binary synapse networks maintain a memory capacity of $\alpha_B \approx 0.1$ 49, close to the standard Hopfield model ($\alpha_H \approx 0.138$). We devised a rough estimate of the memory capacity of networks with TaSS gating (combined with neuron-specific gating) using this binary synapse capacity combined with an estimate of the proportion of gated connections (Fig. 3A, right; Methods).
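The sign-mismatch rule can be stated compactly. In the notation below, which is introduced here for illustration and is not taken from the Methods, $W_{ij}$ is the Hebbian weight computed from all patterns, $W_{ij}^{k}$ is the weight that the same rule would produce from the patterns of context $k$ alone, and $\Theta$ is the Heaviside step function (taken to be zero at zero).

```latex
% TaSS gate for synapse (i, j) in context k; notation is illustrative.
\begin{equation*}
  g_{ij}^{k} = \Theta\!\left(W_{ij}\, W_{ij}^{k}\right),
  \qquad
  W_{ij}^{\mathrm{eff},k} = g_{ij}^{k}\, W_{ij}.
\end{equation*}
% Every synapse that stays open satisfies
% sign(W^{eff,k}_{ij}) = sign(W^{k}_{ij}),
% so the effective matrix agrees in sign with a network that stores
% only the patterns of context k.
```

Written this way, the connection to binary-synapse networks is visible: the open synapses carry exactly the sign pattern that a sign-clipped Hebbian rule would assign for context $k$, while their magnitudes are inherited from the full weight matrix.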
For parameter ranges with high overlap between contexts (large $a$ and $s$), we observe that the network connectivity decreases to 50% (Fig. 3B-D). This is because, when enough "noise" has been added to the weight matrix, each element will have the desired sign approximately half of the time. In these parameter ranges, the subnetwork capacity approaches $\alpha_{\mathrm{cxt}} \approx \tfrac{1}{2}\alpha_B \approx 0.05$ (Fig. 3E-G), with the factor 1/2 reflecting the fact that sparse connectivity degrades capacity roughly linearly 48. The full network memory capacity therefore scales linearly with the number of contexts ($\alpha \approx \tfrac{1}{2}\alpha_B s a$), growing to a very large value ($\alpha \approx 8$, up to 60 times that of the standard Hopfield network for $s = 200$ contexts, and much higher than the low-activity variant over similar activity levels; Fig. 3H-J). Furthermore, we also see that the optimal subnetwork size is $a^* = 1$, and so the addition of neuron-specific gating to TaSS gating does not further increase memory capacity. This does not, however, render neuron-specific gating useless: for example, the two schemes have substantial differences in implementational complexity, as discussed next.
Implementational complexity of neuron-specific vs. synapse-specific gating.
Up to this point, we have considered the capacity of context-modular memory networks assuming that the desired active context has already been imposed externally. However, setting a particular contextual state requires additional control neurons which synapse onto the memory network in order to gate neurons and synapses. We now investigate how many additional neurons and connections are needed to implement each contextual gating scheme to determine if they are feasible. We consider the capacity of a memory network of $N = 10{,}000$ neurons with additional "control" neurons, denoted $M$, each with on the order of $N$ postsynaptic connections (Fig. 4).
Neuron-specific control could in principle be implemented with a strategically-placed inhibitory synapse onto each neuron of the memory network (Fig. 4A, left; see Discussion for biological implications). Typically, to realistically code for $s$ random configurations, the number of control neurons scales with $s$ 27: e.g., a winner-take-all architecture with a group of $m$ control neurons per context would require a total of $M = ms$ control neurons (Fig. 4, light blue), and in the perceptron case $M = s/2$ (not shown). However, given the complex and nonlinear nature of real, biological neurons (e.g., single neurons may behave more like multi-layer neural networks 50), there may be more efficient algorithms for single neuron control. If we assume the most compressed code for context (requiring that each context has a unique representation, and $2^M$ activity patterns can be generated with $M$ neurons), a total of $s$ different contextual subnetworks would require input from a network of $M = \log_2(s)$ neurons at minimum (lower bound, Fig. 4, dark blue; see Methods). For finite-sized networks, neuron-specific gating requires a non-negligible number of extra control neurons: a network with $N = 10{,}000$ neurons and $s = 200$ contexts requires 8, 100, or 4000 control neurons, considering the lower bound, perceptron-case and winner-take-all ($m = 20$) representations, respectively (Fig. 4B-G). Therefore, even assuming the worst case of our assumptions (4000 extra control neurons), the overall memory capacity (now at $\alpha \approx 0.86$ instead of $\alpha \approx 1.2$) is still over six times larger than the Hopfield limit, and can feasibly be implemented given the known structure of cortical circuits.

Synapse-specific gating operates on the level of individual synapses (Fig. 4A, middle), and requires that each of the approximately $N^2$ synapses in the memory network has a corresponding contextual gating synapse for each of the $s$ contexts, adding $N^2 s$ synapses to the network. Assuming that each neuron can synapse onto at most $N$ other neurons, this would require $M = Ns$ additional control neurons if implemented naively (Fig. 4, dark orange). As a slightly more efficient solution modelling each synaptic gate as a perceptron, we arrive at a limit of $Ns/2$ (Fig. 4, light orange); for a network of size $N = 10{,}000$ with 200 contexts, this amounts to 1,000,000 control neurons, compared with 2,000,000 for the naive scheme. These two scenarios thus require more context-encoding neurons than memory neurons (Fig. 4B-G), and diminish the capacity to levels below the Hopfield limit. Therefore, implementing synapse-specific gating with full and independent control of each synapse is likely both ineffective and infeasible. For completeness, we also posit that each synapse is capable of implementing an arbitrary nonlinear gating procedure, thereby reducing the number of control neurons to a theoretical limit of $\log_2(Ns)$ (Fig. 4, lower bound, red). Such a scheme would retain high memory capacity with relatively few control neurons (e.g., a network of 10,000 memory neurons and 200 contexts requires approximately 20 control neurons), but requires millions of post-synaptic connections per control neuron.

FIG. 4. Complexity of contextual gating schemes. A, Schematic of contextual control schemes for neuron-specific (NS) gating (left) and for targeted synapse-specific (TaSS) gating, either with local control (middle) or subunit control (right). B-G, Histograms of the number of required context neurons (top), the corrected subnet capacity (middle), and the corrected full net capacity (bottom) for six example parameter sets (assuming $N = 10{,}000$, black dashed line). Legend is given in panel A. Note that panels F,G are for $a = 1$ and do not include NS gating. The numbers of context neurons are calculated as follows: NS gating baseline (0 context neurons; dark grey); lower bound for NS gating ($\log_2(s)$; dark blue); winner-take-all for NS gating ($20s$; light blue); TaSS gating baseline (0; light grey); lower bound for TaSS gating ($\log_2(N s a^2)$; red); winner-take-all for TaSS gating ($N s a^2$; orange); perceptron limit for TaSS gating ($\tfrac{1}{2} N s a^2$; light orange); dendritic subunit control ($30s$; purple). Comparisons made with standard Hopfield network (green solid lines) and low-activity network with same activity level ($a_{\mathrm{LA}} = a/2$; turquoise dashed lines).
The complexity of synapse-specific gating can be reduced by considering the extended morphology of the typical pyramidal cell, which enables control at the level of dendritic subunits (Fig. 4A, right). Recent theoretical work has estimated that cortical pyramidal cells may have up to 30 dendritic subunits which can function as quasi-independent electrical compartments 51. Therefore, if synapses are distributed across these different subunits, it is plausible that synapse-specific gating can be controlled with on the order of 30 contextual synapses, with one on each branch. Considering $s$ contextual states, the network would then require approximately $30s$ control neurons. Such a gating implementation keeps the required neurons consistently below the network size, and so the memory capacity, though reduced, still peaks far above the Hopfield limit (now at $\alpha \approx 6.25$ instead of $\alpha \approx 10.0$; Fig. 4B-G, purple). However, in order for branch-specific gating to scale to many contexts, synapses may need to be clustered (see Discussion). Nevertheless, even if the complexity of such context-modular architectures reduces the capacity gains, they have additional benefits for memory organization in stability, retrieval and continual learning, as discussed below.
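The control-neuron counts quoted in this section follow from a handful of formulas, collected in the sketch below. The dictionary keys and function names are hypothetical labels. The correction for control neurons, $\alpha N / (N + M)$, is an assumption of this sketch (it counts control neurons as part of the network size); it is used here because it reproduces the corrected capacities quoted in the text.

```python
import math


def control_neurons(n, s, m=20, subunits=30):
    """Number of control neurons M for each scheme discussed in the text."""
    return {
        "ns_lower_bound": math.ceil(math.log2(s)),      # most compressed context code
        "ns_perceptron": s // 2,                        # perceptron case
        "ns_winner_take_all": m * s,                    # m control neurons per context
        "ss_naive": n * s,                              # one gate per synapse and context
        "ss_perceptron": n * s // 2,                    # each synaptic gate as a perceptron
        "ss_lower_bound": math.ceil(math.log2(n * s)),  # arbitrary nonlinear gating
        "dendritic_subunits": subunits * s,             # one gate per dendritic branch
    }


def corrected_capacity(alpha, n, m_control):
    """Capacity per neuron when control neurons are counted (assumed correction)."""
    return alpha * n / (n + m_control)


counts = control_neurons(n=10_000, s=200)
alpha_ns = corrected_capacity(1.2, 10_000, counts["ns_winner_take_all"])
alpha_subunit = corrected_capacity(10.0, 10_000, counts["dendritic_subunits"])
```

For $N = 10{,}000$ and $s = 200$ this gives 8, 100 and 4000 control neurons for the three neuron-specific schemes, 6000 for dendritic subunit control, and corrected capacities of about 0.86 and 6.25, respectively.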
Memory stability is modulated by context.
For the validity of the results presented thus far, it is crucial that as recall occurs in the active context, memories belonging to inactive contexts remain gated. We used an established measure of stability in order to obtain an estimate of the size of each pattern's basin of attraction, i.e., the area of neural activity space around a memory pattern for which the activity will converge to the memory 52 [...] stability as the activity level decreases (Fig. 5B, turquoise). The average stability of "irrelevant" patterns (memories belonging to inactive contexts) increases with the ratio of subnetwork size, but is always lower than that of active patterns (Fig. 5C). For large subnetwork ratios, TaSS gating improves performance by making irrelevant patterns less stable (Fig. 5C, dashed lines). Interestingly, for small subnetwork ratios ($a \approx 0.4$ or less), neuron-specific gating without TaSS gating suppresses irrelevant memories more effectively (Fig. 5C), and even goes to negative values, suggesting that the network is actively repelling these memory states. Therefore, the relative stability of neuron-specific gating alone versus TaSS gating depends upon the subnetwork ratio. Notably, each gating scheme is the more stable of the two in its own high-capacity parameter range (cf. Fig. 2E and Fig. 3I). This modulation of stability biases the network dynamics depending on the background context, such that even for the same initial condition the activity state can be pushed towards or away from stored patterns (Fig. 5D).

Controlling memory expression via dynamic context switching.
Though it may be beneficial that some memories be inaccessible at any given time, we hypothesized that a strong input corresponding to a memory in an inactive context could cause a switch to that context. We thus consider the complete memory architecture by including an additional context-encoding network which dynamically interacts with the memory network (Fig. 5E). Substantial evidence exists for a representation of context in the brain 1,46, providing experimental support for such an architecture.
For simplicity, we model the context-encoding network with winner-take-all dynamics. Importantly, this network has reciprocal connections with the associative memory network: the active context unit provides inhibition to neurons outside of its corresponding contextual subnetwork (Methods). In turn, this unit receives excitation from neurons inside the active subnetwork. This loop keeps the current context and memory active, and prevents other areas of the network from being activated (Fig. 5E).
We tested the functionality of this network in an example simulation using neuron-specific gating (Fig. 5F). Here, two memories from one context were stimulated sequentially, and read out from the network. When a memory from a different context was probed, the context-encoding network suppressed recall. Only when a stronger input was given did the context switch, and the corresponding memory was then successfully activated. Thus, this network architecture allows for dynamic memory gating, in which memory expression depends on context. We also explored the effects of context deficits in retrieval by inhibiting the representation of one context, and repeating the same experiment (Fig. 5G). In this case, the memories of the inhibited context became virtually inaccessible, despite the fact that the recurrent weights storing the memories were intact. This provides a potential model of cognitive control of memory access, as well as deficiencies in memory accessibility, such as has been hypothesized in some forms of amnesia 41,43 (see Discussion).
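The switching behavior can be illustrated with a deliberately stripped-down toy model. It is not the simulation of Fig. 5F: recurrent memory dynamics are omitted, contexts are non-overlapping, and the inhibition strength, threshold and tie-breaking rule are assumptions chosen for this sketch. It shows only the loop described above, in which the active context unit inhibits neurons outside its subnetwork and each context unit is driven by the active neurons inside its own subnetwork.

```python
def step(active_ctx, stimulus, contexts, inhibition=1.0, threshold=0.5):
    """One update of the toy context loop; returns (new context, activity)."""
    members = set(contexts[active_ctx])
    # Neurons outside the active subnetwork are inhibited by the active context unit.
    activity = [
        1 if stim - (0.0 if i in members else inhibition) > threshold else 0
        for i, stim in enumerate(stimulus)
    ]
    # Each context unit is excited by the active neurons of its own subnetwork.
    drive = [sum(activity[i] for i in ctx) for ctx in contexts]
    # Winner-take-all; ties keep the current context.
    best = max(drive)
    new_ctx = active_ctx if drive[active_ctx] == best else drive.index(best)
    return new_ctx, activity


def cue(neurons, strength, n):
    """External input of a given strength onto the listed neurons."""
    return [strength if i in neurons else 0.0 for i in range(n)]


contexts = [[0, 1, 2, 3], [4, 5, 6, 7]]      # two toy subnetworks
memory_in_ctx1 = {4, 5}                      # elevated neurons of one memory

ctx_weak, act_weak = step(0, cue(memory_in_ctx1, 1.0, 8), contexts)
ctx_strong, act_strong = step(0, cue(memory_in_ctx1, 2.0, 8), contexts)
```

With context 0 active, the weak cue is cancelled by inhibition and the context is unchanged, whereas the stronger cue overcomes inhibition, activates the cued neurons, and hands the winner-take-all competition to context 1.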
Memory overloading and continual learning.
Associative memory networks are normally catastrophically affected by overloading the network with patterns beyond its capacity (Fig. 6A, left), so-called blackout interference 35 (Fig. 6B, green). To probe this in context-modular networks, we first loaded networks with memories up to the maximum storage capacity. Then we chose a subset of contexts (10%, 20%, 50%, or 100%; Fig. 6B) and added additional memories to these configurations, leaving the others at their maximum capacity. When simulating recall, we observe that the architecture of the context-modular memory network makes it resistant to overloading when applied to a subset of contexts: synaptic weights are protected from interference when inactive. The extent of this effect depends upon the number of contexts and subnetwork ratio.
Despite the improved robustness to memory overloading, recall performance still declines as more memories are stored. During continual learning, when memories are added sequentially over time (Fig. 6A, right), catastrophic forgetting can be attenuated by preferentially remembering the most recently acquired memories through weight bounds (strong connections are clipped after each new memory is learned), which is also more biologically plausible 36. In this scheme, more recent memories will have a large basin of attraction, with a gentle decay of recall performance for older memories 27,31. We implemented sequential memory acquisition in context-modular memory networks with neuron-specific gating alone (Fig. 6C, top) and TaSS gating alone (Fig. 6C, bottom), over different bound sizes (Methods). Memory recall over time is substantially improved for both schemes, suggesting that contextual gating can also be used to enhance memory retention in continual learning. Our results thus illustrate the potential benefits of organizing memories around context for storage and retrieval over long timescales.
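Sequential learning with weight bounds can be sketched as follows for a plain (single-context) network. The sketch is illustrative only: it uses -1/+1 pattern coding and a simple outer-product update, both assumptions that differ from the 0/1 coding of the model, and the bound of $3/N$ is an arbitrary choice rather than a value from the Methods.

```python
import random


def clip(x, bound):
    """Keep a weight inside [-bound, bound]."""
    return max(-bound, min(bound, x))


def learn_sequentially(patterns, bound):
    """Hebbian updates applied one pattern at a time, clipping after each update."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for xi in patterns:  # -1/+1 coding is an assumption of this sketch
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] = clip(w[i][j] + xi[i] * xi[j] / n, bound)
    return w


def aligned_fraction(w, xi):
    """Fraction of neurons whose recurrent input has the same sign as the pattern."""
    n = len(xi)
    aligned = 0
    for i in range(n):
        h = sum(w[i][j] * xi[j] for j in range(n))
        aligned += 1 if h * xi[i] > 0 else 0
    return aligned / n


rng = random.Random(1)
N, P = 60, 40                      # toy sizes, chosen only for illustration
patterns = [[rng.choice((-1, 1)) for _ in range(N)] for _ in range(P)]
W = learn_sequentially(patterns, bound=3 / N)
# Alignment of each pattern, ordered from most recent to oldest.
recall_by_age = [aligned_fraction(W, xi) for xi in reversed(patterns)]
```

Plotting `recall_by_age` against pattern age is one way to visualize the palimpsest behavior described above; the same update could be restricted to the neurons or synapses of a single context to mimic the gated variants.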

Arbitrary and shared context allocation.
In reality, it is desirable that memory patterns be accessible in more than one context (Fig. 7A, top), but in our model thus far, each context contained a separate, non-overlapping set of memories (Fig. 1A). While neuron-specific gating creates fundamentally distinct representations for different contexts, TaSS gating defines contexts at the level of active synapses rather than active neurons, and therefore can stabilize identical neuronal activity patterns in different contexts. To test this, we first trained a network with a large number of memory patterns (up to $16N$) in a single context in which all neurons and connections are utilized. Next, we assigned arbitrary (and overlapping) subsets of these memories to be part of new contexts, and determined the [...]

We observed that arbitrary sets of overlapping memory patterns can be reliably retrieved (Fig. 7B), provided that the number of patterns stored in each context is sufficiently small ($0.03N$ or less); performance begins to degrade when the number of patterns per subnetwork approaches the subnetwork capacity (see Fig. 3E,F). As before, connectivity decreases as the total memory load is increased (Fig. 7C). Remarkably, even a network with a random weight matrix (see Methods) can successfully be used to represent stable context-dependent memories (Fig. 7B,C, dotted gray line), provided that the correct TaSS gating structure is imposed. This is because a random weight matrix will have, on average, half of its elements with the correct sign according to the correlation structure of the desired subset of patterns. This means that TaSS gating can be used to impose any activity pattern as a stable attractor independent of the weight matrix, i.e., groups of memories are stored in the TaSS gating structure. This also suggests that synaptic weights can be corrupted substantially without affecting performance, as long as the synapse-specific gating structure remains intact. Thus, TaSS gating not only allows the network to impose arbitrary and overlapping contextual states, but also produces high robustness to noisy synaptic weights.
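The claim that the gating structure alone can hold a memory is easy to check in the simplest case of one pattern per context. The sketch below is a toy demonstration under stated assumptions: patterns use -1/+1 coding rather than the 0/1 coding of the model, the random weights are Gaussian, and the helper names are hypothetical.

```python
import random


def sign(x):
    return (x > 0) - (x < 0)


def context_weights(patterns):
    """Hebbian weights computed from the patterns of one context only (-1/+1 coding)."""
    n = len(patterns[0])
    return [[sum(xi[i] * xi[j] for xi in patterns) if i != j else 0
             for j in range(n)] for i in range(n)]


def tass_mask(w, w_ctx):
    """Gate is open (1) only where the stored and the context weight agree in sign."""
    n = len(w)
    return [[1 if sign(w[i][j]) * sign(w_ctx[i][j]) > 0 else 0 for j in range(n)]
            for i in range(n)]


def stable_fraction(w, mask, xi):
    """Fraction of neurons whose gated input has the same sign as the pattern."""
    n = len(xi)
    stable = 0
    for i in range(n):
        h = sum(w[i][j] * mask[i][j] * xi[j] for j in range(n))
        stable += 1 if h * xi[i] > 0 else 0
    return stable / n


rng = random.Random(2)
N = 100
# Random weight matrix that carries no information about the pattern.
W_random = [[rng.gauss(0.0, 1.0) if i != j else 0.0 for j in range(N)]
            for i in range(N)]
pattern = [rng.choice((-1, 1)) for _ in range(N)]
mask = tass_mask(W_random, context_weights([pattern]))
open_fraction = sum(map(sum, mask)) / (N * (N - 1))   # close to one half
fraction_stable = stable_fraction(W_random, mask, pattern)
```

With a single pattern every open synapse contributes $|W_{ij}|$ to the aligned input, so the pattern is a fixed point by construction while roughly half of the synapses remain open. With several patterns per context this guarantee no longer holds, and stability depends on the load, as described in the text.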
Distributions of strengths over memories.
Lastly, we extended the context-modular memory network to allow for memories to have different strengths in each context, defined by the memory stability, i.e., basin of attraction size. In other words, we changed the relative ease of recalling each memory without gating it completely on or off. Imposing distributions of memory strengths enables more flexible control. An interpretation of this scheme is to consider that the stability of a memory in a given context corresponds to how often this memory is recalled in that context. From this perspective, memory strength reflects the statistics of the external world, thereby enabling the brain to optimize memory access in order to affect behavior, e.g., making decisions more efficiently.
We use TaSS gating in a modified form, in which each particular pattern's stability (i.e., its basin of attraction, Fig. 5A) is manipulated to have a distinct size for each context (Fig. 8A; Methods). As before, we start with a standard Hopfield network whose weights are defined with the standard Hebbian learning rule (Methods), which generates memories with similar stability (Fig. 8B, black lines). Next, in contrast to normal TaSS gating, in which a binary choice of patterns (on or off) is used to determine the targets of synaptic gating, patterns are multiplied by an analogue value representing their relative strengths. Following this gating, we can impose arbitrary distributions of strengths over the set of memory patterns (Fig. 8B; colored lines, for four examples in top and middle). Note that no explicit learning of synaptic weights is needed to do so.
Such distributions of memory strengths can be used to create new contexts as combinations of previously defined contextual states (Fig. 8A,B, bottom). This not only increases the flexibility of defining contextual states, but also opens the door towards multi-level hierarchical contextual control with contexts at different levels of specificity. For example, suppose you are deciding which food to eat on a trip to New York: you can bias your memory recall by first selecting memories related to New York, and then selecting memories related to food, thus leading to a new context with food-related memories in New York. As a simple test of network functionality with distributions of strengths, we probed the pattern completion ability of the network in response to noisy patterns (Fig. 8C, bottom left). Noise tolerance closely matches the level of stability of the memories (Fig. 8C, cf. Fig. 8B, bottom), provided the strength is above a certain baseline. For weak memory strengths, noise tolerance decreases to zero, as the memories themselves are in fact no longer stable (Fig. 8C, bottom right). Thus, context-modular associative memory may serve as a powerful architecture to combine and distinguish groups of memories from one another.

Discussion
Memory and context are deeply intertwined 1,5,6. To understand the properties and potential benefits of having context modularity in the brain, we have proposed a novel model of how context and associative memory interact, called the context-modular memory network (Fig. 1). The model provides a mechanistic hypothesis for the basis of context-dependence in recurrent neural circuits through neuron-specific and synapse-specific gating.
Relation to other models of associative memory and context dependence. Context-modular memory networks exhibit enhanced memory capacity (Fig. 2, Fig. 3 and Fig. 4) through the optimized use of sparsity, modularity and hierarchical organization, principles which have been used in previous associative memory models 27,34. Modularity has been utilized to mimic the architecture of the cortex, with dense local connectivity and sparse long-ranged connectivity 53-55, and hierarchical organization has been used to represent memory hierarchies, in which patterns with common features are stored in a common representation 27,45. Our model bears a resemblance to these previous models, but is unique in considering context as the main determinant of memory groupings (and therefore sparsity), as well as modelling context as an external signal imposed on the memory network (in contrast to previous hierarchical models). In addition to the evidence for independent context representations in the brain 1, we chose to separate context because it facilitated the analytical estimation of memory capacity, and enabled memories to be shared by different contexts (as in Fig. 7). Furthermore, while we only considered a single hierarchical level between context and memory patterns, we speculate that additional levels could be added in a feedforward or looped architecture. Other models have included context-dependence into associative memory (e.g., Doboli et al. 56), but with different architectures and motivations. While context-modular networks display high memory capacity, more recent models also achieve similar capacity 27,31, albeit with differences in learning rules, activity, or architecture. Thus, the context-modular architecture should not be interpreted solely as a capacity booster: several other properties of these networks make them more interesting, e.g., robustness to noise, robustness to memory overloading, and flexible memory access. Additionally, extensions to associative memory networks that improve memory capacity, such as low-activity patterns 34, alternative learning rules 33, or complex synapses 57, may be incorporated into the context-modular memory network architecture (e.g., each subnetwork stores low-activity patterns), thus potentially combining their benefits.
We applied our model of context dependence to associative memory, but this architecture may be extended or adapted to other tasks such as context-dependent sensory processing (or multisensory integration 26), decision making, or motor control, with non-attractor-like dynamics. For example, by applying a similar neuron-specific gating architecture, recent work has shown that problems with continual learning in deep artificial neural networks can be alleviated 37. Our model also bears a resemblance to several behavioral models of temporal context 7,9, supported by experimental evidence for context as a slowly-drifting process over time 58,59. Intriguingly, such models include a consistent means of introducing new contexts over time, which could be implemented at the mechanistic level in our model.

Circuit motifs and cell types involved in gating.
There is substantial evidence for an input gating motif in the brain 60, implemented through detailed inhibitory control of network state 21,61, which has been linked to contextual processing and learning 18,60. The diversity of inhibitory cell types and their post-synaptic targets provides a rich basis for such gating, with, e.g., parvalbumin-positive interneurons preferentially targeting perisomatic regions and somatostatin-positive interneurons targeting dendritic regions 62. Alternatively, neuron-specific gating could be implemented through excitatory control (e.g., neurons may be in a suppressed state by default, and only participate in recall if they receive extra contextual excitation), supported by recent work showing that baseline shifts modulate free recall 17. Finally, recent evidence suggests that gain or excitability changes in individual neurons may play a role in memory allocation 63-66, and computational work has applied this idea to motor learning 22 and sequence learning 67. Experimental evidence suggests that around 10-30% of neurons are allocated for a given engram in the amygdala and hippocampus 65, which would correspond to an area of high capacity in our model of neuron-specific gating.
Neuron-specific versus synapse-specific gating. Neuron-specific and synapse-specific gating each have advantages and limitations in terms of memory capacity, complexity, and flexibility. Synapse-specific gating is inherently more flexible, with many more degrees of freedom (neuronal gating can be seen as a special case of synaptic control, in which all synapses of a given neuron are gated), which is reflected in the larger gains in memory capacity (Fig. 3). However, such a scheme suffers from an expansion in complexity of control, as reflected in the fact that "harmful" synapses need to be precisely targeted. Targeted synapse-specific gating thus requires many more (external) neurons to implement contextual control (Fig. 4), and it is infeasible if each synapse is required to be fully and independently controlled (though some experimental evidence exists for individual inhibitory control of dendritic spines 68).
Subunit (dendritic branch) specific control represents a more realistic option, lying somewhere on the continuum between neuron-specific and synapse-specific control. Recent theoretical work suggests that cortical pyramidal neurons may have around 30 independent subunits 51, which could be targeted according to contextual states. However, we note that having only 30 subunits compared to single-synapse control may limit the maximum number of achievable contexts. We speculate that the most efficient scheme may require taking advantage of multiple connections per pair of neurons 69 as well as clustering of synaptic inputs 70 based upon context. An additional benefit of synapse-specific gating is that it enables arbitrary context allocation (Fig. 7) and control of memory strength (Fig. 8). This flexibility may enable learning of a particular statistical distribution in the memory patterns, thus reflecting relationships between memory items in the world, or their relative values 44,71. While we did not explore more flexible versions of subnetwork gating here, a more clever choice of subnetwork assignments could result in larger capacity or additional flexibility (e.g., shared memories by including more overlap between particular subnetworks).
The learning problem. Due to the complexity differences in subnetwork gating and targeted synapse-specific gating, these schemes will also likely require very different learning procedures. Learning the memory patterns themselves should be feasible considering the local, correlation-based Hebbian learning rule. Furthermore, random subnetwork gating could be imposed before learning the memory patterns, or indeed after memory patterns have been learned by removing neurons from the representation. Targeted synapse-specific gating relies upon a comparison between the overall synaptic weight of a pair of neurons and the hypothetical weight considering only a single context (Fig. 3), thus making it more complex. However, context may be a slow-changing process in time 72 with discrete shifts 73. Therefore, the pair of neurons may be able to sample enough patterns to estimate their correlation within the currently active context and compare it with the anatomical weight value.
We found that synapse-specific gating endows a network with a tremendous amount of noise tolerance (Fig. 6), to the point where a random connectivity matrix can be used to retrieve memories provided the contextual configurations are maintained (Fig. 7). These results predict that local recurrent connections may vary quite substantially over time as learning occurs with little detriment to memory performance, consistent with recent data on the volatility of synaptic spines 74. Given that contextual modulation is likely transmitted through inhibitory neurons, this idea resembles recent theoretical work proposing that inhibitory connectivity is responsible for maintaining information over time in the cortex 75.
In addition to short timescale learning dynamics, context-dependence may also vary over development. A previous experimental study suggested that infantile amnesia arises due to a retrieval failure 43. In relation to our model, this may imply that a contextualization of memories may be learned and imposed only later on in life, resulting in early memories becoming inaccessible. However, other work suggests that contextual binding of memories decays over time 76, suggesting that some types of memories may also become more general over time.
Capacity vs. accessibility. Our model displays an inverse relationship between capacity and accessibility, as large increases in capacity are achievable provided that the vast majority of memories are not accessible at the same time. This trade-off may be viewed as a limitation of the model. However, restricting memory access may also be seen as a benefit, considering that the main purpose of storing memories can be understood as influencing decision making 8,77-79. We hypothesize that dynamic control of memory availability may act as an efficient means of tree searching through memories, enabling the brain to select which memories are currently relevant in order to make faster decisions. In this light, the incorporation of additional layers to such a contextual memory hierarchy may add further benefits. Overall, the integration of associative memory models with retrieval processes and decision making is a promising area of future research.
Furthermore, our model provides a direct mechanistic basis for memory failure due to loss of accessibility rather than forgetting 39. Such a hypothesis has been put forth in the context of different types of amnesia 39,42,43. Therefore, our model may have specific implications for the understanding of memory access in healthy and disease states.

Hippocampus, prefrontal cortex, and context representations.
While we present the context-modular memory network as a generic architecture without explicit mention of brain areas and circuits, evidence suggests that contextual signals interact with cortical memories through the hippocampus, prefrontal cortex, and amygdala 1,13,80-84. For example, a recent study finds evidence of inhibitory control of cortical memories through the hippocampus 13, providing direct support for a neuron-specific (Fig. 2) or synapse-specific (Fig. 3) gating motif in the cortex. Some work indicates that hippocampus and prefrontal cortex may play similar and complementary roles in memory retrieval: either that the hippocampus controls recent memories and the prefrontal cortex more remote memories 85, or that prefrontal cortex handles active retrieval through executive control, and hippocampus handles more automatic retrieval 86. Both hypotheses suggest that there may be multiple context-encoder-like networks in the brain. Alternatively, the modularization introduced here is a natural candidate mechanism for pattern separation, which is commonly attributed to the dentate gyrus in the hippocampus 87, and could act to control contextual memories in CA3 88 similar to previous models 56. For each of these cases, our work functions as a useful conceptual model for how to begin studying the underlying circuits of each of these systems.

Methods of
Context-modular memory networks support high-capacity, flexible, and robust associative memories William F. Podlaski, Everton J. Agnes, and Tim P. Vogels

Materials and Methods
Model formulation. The context-modular memory network model is a fully connected recurrent network of $N$ binary neurons, denoted $V_i$ for neuron $i$ and taking values in $\{0, 1\}$. The modular architecture defines $s$ contextual states, each with a corresponding set of $N_{\text{cxt}} \le N$ active neurons, chosen uniformly at random. We use $S_k$ to denote the set of active neurons in contextual state $k$.
We also define $a = N_{\text{cxt}}/N$ as the ratio of subnetwork to full network size, which is also the probability that each unit takes part in any given context. Furthermore, each contextual state also determines a particular set of active inputs per neuron. We explore two variants of this type of contextual control, one random and one algorithmically targeted (see section on Targeted synapse-specific gating). In the first case, active inputs are chosen randomly with probability $b = K/N_{\text{cxt}}$, such that each neuron receives $K \le N_{\text{cxt}}$ inputs on average. We define a symmetric matrix $C^k$ with elements $c^k_{ij} = c^k_{ji} = 1$ if connections $ij$ and $ji$ are present in contextual state $k$, i.e., $\Pr(c^k_{ij} = 1) = b$ and $\Pr(c^k_{ij} = 0) = 1 - b$. Given the symmetry of the standard Hopfield model, we only consider input configurations which are also symmetric. Importantly, at any given time, only one contextual state is "active". Considering contextual state $k$ is active, the dynamics of each unit in the network are defined as

$$V_i(t+1) = H\big(h_i(t)\big), \qquad h_i(t) = \sum_{j} c^k_{ij}\, w_{ij}\, V_j(t) - \theta^k_i, \tag{9}$$

where $H(\cdot)$ is the Heaviside step function, $h_i$ is the total input to unit $i$, $w_{ij}$ is the synaptic weight between neurons $i$ and $j$ (symmetric), and $\theta^k_i$ is the threshold for neuron $i$ when contextual state $k$ is active (defined below). The network stores $p$ memory patterns per contextual state, making $P = sp$ total memories. Memory patterns are denoted $\eta^{\sigma\mu}_i$, the configuration of neuron $i$ for memory $\mu$ of state $\sigma$, taking values $\{0, 1\}$ with equal probability provided unit $i$ is in $S_\sigma$, and 0 otherwise. The connectivity matrix is defined using a variant of a "Hebbian" learning rule 29,34,

$$w_{ij} = \frac{8}{b N_{\text{cxt}}} \sum_{\sigma=1}^{s} c^{\sigma}_{ij} \sum_{\mu=1}^{p} \tilde{\eta}^{\sigma\mu}_i\, \tilde{\eta}^{\sigma\mu}_j, \tag{10}$$

where

$$\tilde{\eta}^{\sigma\mu}_i = \begin{cases} \eta^{\sigma\mu}_i - \tfrac{1}{2} & \text{if } i \in S_\sigma, \\ 0 & \text{otherwise.} \end{cases} \tag{11}$$

Note that the factor $1/(bN_{\text{cxt}})$ in Eq. 10 ensures that the average total synaptic input to each neuron is around unity regardless of the choice of $b$ 48 (see Signal-to-noise analysis below). The threshold is defined as

$$\theta^k_i = \frac{1}{2} \sum_{j \in S_k} c^k_{ij}\, w_{ij}. \tag{12}$$

This choice of threshold, along with the factor 8 outside of the sum in Eq. 10, ensures that this model is equivalent to the standard Hopfield model with $\{\pm 1\}$ units (see Supplementary methods for further elaboration). Note that this threshold is different for each neuron and also changes with the contextual state of the network, whereas the connectivity matrix remains constant, with some weights being effectively set to zero through the $c^k_{ij}$ terms in Eq. 9. The activity level, or coding level, of the patterns denotes the fraction of active neurons for any given pattern. We define two measures of activity level: per individual context, and for the network as a whole. The activity level per individual context is set to 1/2, since each pattern unit is chosen to be 0 or 1 with uniform probability. The activity level as defined from the perspective of the entire network is $a_{\text{LA}} = a/2$ ("LA" for low-activity, see the following section), where $a$ is the relative subnetwork size. We note that for the sake of analysis, we impose that all subnetworks contain the same number of active neurons and store the same number of memory patterns, though this is not necessary in practice.
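The model formulation above can be illustrated with a small, standard-library Python sketch. It covers neuron-specific gating only ($b = 1$) and assumes a Hebbian rule of the form $w_{ij} = (8/N_{\text{cxt}}) \sum \tilde{\eta}_i \tilde{\eta}_j$ with $\tilde{\eta} = \eta - 1/2$ inside a subnetwork and 0 outside, a threshold equal to half the summed input weights within the active subnetwork, no self-connections, and gated-off neurons clamped to 0. These choices and all function names (`build_network`, `step`, `is_fixed_point`) are illustrative assumptions, not the authors' C++ implementation.

```python
import random

def build_network(N, s, n_cxt, p, seed=0):
    """Random subnetworks, random {0,1} patterns, and Hebbian weights (b = 1)."""
    rng = random.Random(seed)
    subnets = [sorted(rng.sample(range(N), n_cxt)) for _ in range(s)]
    patterns = []                      # patterns[k][mu][i] in {0, 1}
    w = [[0.0] * N for _ in range(N)]  # symmetric weights, zero diagonal
    for S in subnets:
        ctx = []
        for _ in range(p):
            eta = [0] * N              # neurons outside S stay at 0
            for i in S:
                eta[i] = rng.randint(0, 1)
            ctx.append(eta)
            for i in S:                # Hebbian increment, eta_tilde = eta - 1/2
                for j in S:
                    if i != j:
                        w[i][j] += 8.0 / n_cxt * (eta[i] - 0.5) * (eta[j] - 0.5)
        patterns.append(ctx)
    return subnets, patterns, w

def step(V, w, S):
    """One synchronous update with subnetwork S active; gated neurons output 0."""
    new = [0] * len(V)
    for i in S:
        # input minus threshold: sum_j w_ij V_j - (1/2) sum_j w_ij, with j in S
        h = sum(w[i][j] * (V[j] - 0.5) for j in S)
        new[i] = 1 if h > 0 else 0
    return new

def is_fixed_point(pattern, w, S):
    """True if the stored pattern is unchanged by one update in its own context."""
    return step(pattern, w, S) == pattern
```

Calling `build_network(N=100, s=2, n_cxt=30, p=1)` and then `is_fixed_point(patterns[k][0], w, subnets[k])` for each context checks that each stored pattern is stable when its own subnetwork is active; larger `s` and `p` can be used to explore how stability degrades with load.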
Comparison with standard Hopfield network and low-activity network. We compare the results obtained in this study with those of the standard Hopfield model 29,32, as well as a standard variant with low-activity patterns 34. For these models, we consider a fully connected network of $N$ neurons which stores a set of $P$ memory patterns. Again, units in these networks are binary and take on values $\{0, 1\}$ (see Supplementary Methods for a comparison of $\{0, 1\}$ and $\{\pm 1\}$ formulations). Patterns are denoted $\eta^{\mu}_i \in \{0, 1\}$ with $\Pr(\eta^{\mu}_i = 1) = a_{\text{LA}}$ and $\Pr(\eta^{\mu}_i = 0) = 1 - a_{\text{LA}}$, where $a_{\text{LA}}$ is the activity level. Both models share the same dynamics, which depend on the connectivity matrix $w_{ij}$, defined using the "Hebbian" rule 34, a neuron-specific threshold $\theta_i$, and a constant threshold $\theta_0$. The standard Hopfield model is obtained by setting $a_{\text{LA}} = 1/2$, which then makes $\theta_0 = 0$.
Several previous works have studied the theoretical memory capacity limits of these networks 32,34. For the standard Hopfield model with a Hebbian learning rule, this is approximately $\alpha_H = P/N \approx 0.138$ (in the zero temperature limit). For the low-activity model, capacity scales with the activity level as

$$\alpha_{\text{TF}} \approx \frac{1}{2\, a_{\text{LA}}\, |\ln a_{\text{LA}}|}$$

for $a_{\text{LA}} \ll 1$ ("TF" stands for Tsodyks-Feigelman 34). The comparison with the context-modular memory network is only relevant at intermediate activity levels, where this estimate does not hold. We thus use numerical simulations to obtain a more accurate estimate of memory capacity (Fig. S3).

Analytical capacity estimation.
In the following, we adapt a signal-to-noise analysis and a heuristic mean-field theory of memory capacity for the standard Hopfield network 27,35,89 to the case of the context-modular memory network.
Wald's equation. We will make use of a result in statistics known as Wald's equation, which we summarize here. Consider a Binomial random variable $K \sim B(N_K, p_K)$. Let $Y$ be the sum of a sequence of $K$ independent, identically distributed random variables $X_i$, i.e., $Y = \sum_{i=1}^{K} X_i$. Assuming that each $X_i$ is independent of $K$, the mean and variance of $Y$ can be written as

$$E[Y] = E[K]\,E[X] = N_K\, p_K\, E[X]$$

and

$$\mathrm{Var}[Y] = E[K]\,\mathrm{Var}[X] + E[X]^2\,\mathrm{Var}[K] = N_K\, p_K\, \mathrm{Var}[X] + N_K\, p_K (1 - p_K)\, E[X]^2,$$

where we have dropped the index on $X$ because each random variable $X_i$ comes from an identical distribution.
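As a quick sanity check of the random-sum moments, the sketch below compares the closed-form expressions $E[Y] = E[K]E[X]$ and $\mathrm{Var}[Y] = E[K]\mathrm{Var}[X] + E[X]^2\mathrm{Var}[K]$ for Binomial $K$ against a brute-force calculation that conditions on every value of $K$. The function names are illustrative only.

```python
from math import comb

def wald_moments(n_k, p_k, mean_x, var_x):
    """Mean and variance of Y = X_1 + ... + X_K for K ~ Binomial(n_k, p_k)."""
    mean_k = n_k * p_k
    var_k = n_k * p_k * (1 - p_k)
    return mean_k * mean_x, mean_k * var_x + mean_x ** 2 * var_k

def brute_force_moments(n_k, p_k, mean_x, var_x):
    """Same moments, obtained by conditioning on each possible value of K."""
    ey = ey2 = 0.0
    for k in range(n_k + 1):
        prob_k = comb(n_k, k) * p_k ** k * (1 - p_k) ** (n_k - k)
        ey += prob_k * k * mean_x
        # E[Y^2 | K = k] = k Var[X] + (k E[X])^2 for i.i.d. X_i
        ey2 += prob_k * (k * var_x + (k * mean_x) ** 2)
    return ey, ey2 - ey ** 2
```

For symmetric $\pm 1$ summands (zero mean, unit variance), the mean vanishes and the variance reduces to $N_K p_K$, which is the case used in the signal-to-noise analysis.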
Signal-to-noise analysis. We aim to estimate the stability of an arbitrary neuron $i$ with respect to a particular pattern $\nu$ of context $k$, $\eta^{k\nu}_i$, which amounts to ensuring that $H(h^{k\nu}_i) = \eta^{k\nu}_i$, where $h^{k\nu}_i$ is the total input to neuron $i$ when the network's state is exactly at pattern $\nu$ of context $k$ (i.e., this neuron will not change activity, given the input it receives at the stored pattern state). Plugging Eqs. 10 and 12 into Eq. 9, we obtain

$$h^{k\nu}_i = 2\tilde{\eta}^{k\nu}_i + \frac{8}{bN_{\text{cxt}}} \sum_{j} c^k_{ij} \sum_{\mu \neq \nu} \tilde{\eta}^{k\mu}_i \tilde{\eta}^{k\mu}_j \tilde{\eta}^{k\nu}_j + \frac{8}{bN_{\text{cxt}}} \sum_{j} c^k_{ij} \sum_{\sigma \neq k} c^{\sigma}_{ij} \sum_{\mu} \tilde{\eta}^{\sigma\mu}_i \tilde{\eta}^{\sigma\mu}_j \tilde{\eta}^{k\nu}_j, \tag{22}$$

with $\tilde{\eta}$ as defined in Eq. 11. The right two terms in Eq. 22 are two different "crosstalk" terms that may disrupt the stability of memories. The first one is analogous to the standard Hopfield crosstalk term 27, and accounts for other patterns stored in the same context. The second term accounts for patterns stored in different contexts. We expect patterns stored in the same versus different subnetworks to affect stability differently, which will become apparent in the next section. As shown in Eq. 3 and Eq. 6, if the desired state of neuron $i$ is 1 ($2\tilde{\eta}^{k\nu}_i = 1$), then we see that this state will be stable provided that the sum of the two crosstalk terms is greater than $-1$.
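The input described in Eq. 22 depends upon the actual value of $\tilde{\eta}^{k\nu}_i$, which complicates the analysis. In order to generalize to the case in which the desired state of the neuron can be either 0 or 1, we follow Hertz et al. 27 and define

$$C^{k\nu}_i = -\frac{16}{bN_{\text{cxt}}} \sum_{j} c^k_{ij} \sum_{\mu \neq \nu} \tilde{\eta}^{k\nu}_i \tilde{\eta}^{k\mu}_i \tilde{\eta}^{k\mu}_j \tilde{\eta}^{k\nu}_j - \frac{16}{bN_{\text{cxt}}} \sum_{j} c^k_{ij} \sum_{\sigma \neq k} c^{\sigma}_{ij} \sum_{\mu} \tilde{\eta}^{k\nu}_i \tilde{\eta}^{\sigma\mu}_i \tilde{\eta}^{\sigma\mu}_j \tilde{\eta}^{k\nu}_j, \tag{23}$$

which is the two crosstalk terms from Eq. 22 multiplied by $-2\tilde{\eta}^{k\nu}_i$. We can now define the probability that any given bit will flip as

$$P_{\text{err}} = \Pr\big(C^{k\nu}_i > 1\big).$$

The aim is now to approximate the two terms in Eq. 23 with Gaussian random variables. Let's start with the first term. We first notice that $c^k_{ij} \in \{0, 1\}$ will set some terms in the sum to zero. The four pattern variables, $\tilde{\eta}^{k\nu}_i \tilde{\eta}^{k\mu}_i \tilde{\eta}^{k\mu}_j \tilde{\eta}^{k\nu}_j$, combined with the total factor of 16 outside of the sum, act as a shifted and scaled Bernoulli random variable that takes on values $\pm 1$ with equal probability. This sum can be interpreted as a random sum of random variables of the form

$$Y = \frac{1}{bN_{\text{cxt}}} \sum_{n=1}^{K} X_n, \qquad K \sim B\big(N_{\text{cxt}}(p-1),\, b\big), \qquad X_n \sim 2\,\mathrm{Bernoulli}(\tfrac{1}{2}) - 1.$$

Since this is a sum of Bernoulli random variables, we know that it will take the form of a Binomial distribution. We can then use Wald's equation (see above section) to determine the statistics of this distribution. Given the symmetry of the Bernoulli random variables ($\pm 1$), it is easy to see that $E[Y] = 0$ and $\mathrm{Var}[Y] = (p-1)/(bN_{\text{cxt}}) \approx \alpha_{\text{cxt}}/b$, where $\alpha_{\text{cxt}} = p/N_{\text{cxt}}$. The second term can be treated in the same way, as

$$Z = \frac{1}{bN_{\text{cxt}}} \sum_{n=1}^{K} X_n, \qquad K \sim B\big(N_{\text{cxt}}(s-1)p,\, a^2 b^2\big), \qquad X_n \sim 2\,\mathrm{Bernoulli}(\tfrac{1}{2}) - 1.$$

The reason for the form of $K$ in this term is that now we have both $c^k_{ij}$ and $c^{\sigma}_{ij}$, independent and each with non-zero probability $b$, as well as $\tilde{\eta}^{\sigma\mu}_i$ and $\tilde{\eta}^{\sigma\mu}_j$, each with non-zero probability $a$ (see Eq. 11). The mean of the resulting distribution is again zero, but now the variance is $\mathrm{Var}[Z] = \frac{(s-1)p}{N}\, a \approx \alpha a$. Since the two Binomial distributions $Y$ and $Z$ are symmetric and feature a large number of trials, they are well approximated by Gaussian distributions. Plugging these two approximations into Eq. 23 gives us

$$C^{k\nu}_i \sim \mathcal{N}\!\left(0, \frac{\alpha_{\text{cxt}}}{b}\right) + \mathcal{N}(0, \alpha a) = \mathcal{N}\!\left(0, \frac{\alpha_{\text{cxt}}}{b} + \alpha a\right). \tag{26}$$

We can now obtain a rough estimate of the memory capacity by ensuring that $P_{\text{err}}$ remains low 27. Alternatively, we can also formulate the memory capacity in terms of the standard Hopfield model, in which the crosstalk takes the form of a single Gaussian centered around zero with variance $\alpha_H \approx 0.138$ (see Appendix). We thus set the variance of Eq. 26 equal to that of the standard Hopfield model,

$$\frac{\alpha_{\text{cxt}}}{b} + \alpha a = \alpha_H.$$

Rearranging terms, and using the relationship $\alpha = \alpha_{\text{cxt}}\, s a$, we arrive at the relationship

$$\alpha_{\text{cxt}} = \frac{\alpha_H\, b}{1 + s a^2 b}. \tag{28}$$

This equation applies generally for cases of arbitrary numbers of contexts $s$, relative subnetwork size $a$ and relative input size $b$. To obtain the expression for context modulation by neuron-specific gating only (Eq. 4), we set $b = 1$, and for context modulation by synapse-specific gating only (Eq. 7), we set $a = 1$. Finally, to obtain an expression for the full network capacity $\alpha$, we need to multiply Eq. 28 by $sa$ (Eq. 5 and Eq. 8).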
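The variance-matching argument can be turned into a small calculator. The sketch below assumes that the same-context crosstalk variance is $\alpha_{\text{cxt}}/b$ and the cross-context variance is $\alpha a$ with $\alpha = \alpha_{\text{cxt}} s a$, and solves $\alpha_{\text{cxt}}/b + \alpha a = \alpha_H$ for $\alpha_{\text{cxt}}$. The function names and the default value of `alpha_h` are illustrative assumptions.

```python
ALPHA_H = 0.138  # standard Hopfield capacity quoted in the text

def subnetwork_capacity(s, a, b=1.0, alpha_h=ALPHA_H):
    """alpha_cxt solving alpha_cxt / b + (alpha_cxt * s * a) * a = alpha_h."""
    return alpha_h / (1.0 / b + s * a * a)

def total_capacity(s, a, b=1.0, alpha_h=ALPHA_H):
    """Full-network load alpha = alpha_cxt * s * a."""
    return s * a * subnetwork_capacity(s, a, b, alpha_h)
```

Setting `b=1.0` corresponds to neuron-specific gating only and `a=1.0` to random synapse-specific gating only. Under these assumptions, the per-subnetwork capacity falls as contexts are added while the total capacity can still grow, which is the trade-off explored in the main text.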
Information content. The information content of the standard Hopfield model can be calculated as the total entropy (average Shannon information) across all patterns in the following way 33. Considering that each neuron of each pattern is randomly chosen to be 0 or 1 with probability $\tfrac{1}{2}$, the entropy of each bit of each pattern is equal to the binary entropy function evaluated at $\tfrac{1}{2}$, which we will denote $H_B(\tfrac{1}{2})$. Multiplying this by the number of neurons in each pattern, and the number of patterns, we arrive at the total entropy across all patterns,

$$I_H = \alpha_H N^2 H_B(\tfrac{1}{2}).$$

This can be extended to the low-activity variant of the Hopfield network simply by replacing $\alpha_H$ in the equation above with the corresponding low-activity capacity for a particular activity level $a_{\text{LA}}$, $\alpha_{\text{TF}}(a_{\text{LA}})$ (see section Comparison with standard Hopfield network and low-activity network), and evaluating the binary entropy function at the activity level $a_{\text{LA}}$. Together, this means that the information content of the low-activity Hopfield network for activity level $a_{\text{LA}}$ is

$$I_{\text{LA}} = \alpha_{\text{TF}}(a_{\text{LA}})\, N^2 H_B(a_{\text{LA}}).$$

For the context-modular network with neuron-specific gating, we consider the following. For each subnetwork with $N_{\text{cxt}}$ neurons, there are $2^{N_{\text{cxt}}}$ possible patterns, each equally likely, leading to an information content of

$$\log_2 2^{N_{\text{cxt}}} = N_{\text{cxt}} = aN$$

for each pattern. Multiplying this by the number of patterns, we obtain

$$I_{\text{cxt}} = P\, aN = \alpha a N^2.$$

We can write this as a ratio with the information content of the standard Hopfield network to obtain

$$\frac{I_{\text{cxt}}}{I_H} = \frac{\alpha a}{\alpha_H}.$$

Note that this expression for the information content also holds for targeted synapse-specific (TaSS) gating, with the only difference being that the memory capacity $\alpha$ is different with and without TaSS gating.
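The entropy bookkeeping above can be sketched in a few lines. The example assumes that total information is (number of patterns) × (bits per pattern), with $N\,H_B(a_{\text{LA}})$ bits per pattern for Hopfield-type networks and $aN$ bits per pattern for a context-modular subnetwork; the function names are illustrative.

```python
from math import log2

def binary_entropy(q):
    """Binary entropy H_B(q) in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def info_hopfield(N, alpha_h=0.138, activity=0.5):
    """alpha * N patterns, each carrying N * H_B(activity) bits."""
    return alpha_h * N * N * binary_entropy(activity)

def info_context_modular(N, alpha, a):
    """alpha * N patterns, each carrying a * N bits (one bit per active neuron)."""
    return alpha * a * N * N

def info_ratio(alpha, a, alpha_h=0.138):
    """Information content relative to the standard Hopfield network."""
    return alpha * a / alpha_h
```

Passing a low-activity capacity and activity level to `info_hopfield` gives the corresponding low-activity estimate under the same assumptions.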
Mean-field theory. The memory capacity of the context-modular memory network was calculated using established mean-field methods 27,32,89. See supplementary methods for details.
Targeted synapse-specific gating. Targeted synapse-specific (TaSS) gating was implemented algorithmically in the following way. Given a context-modular memory network with parameters $N$, $s$, $a$, and $p$, we define the full weight matrix $W$ from Eq. 10. We also define the hypothetical isolated weight matrix for each individual context $k$, considering only patterns assigned to that subnetwork, as $\hat{W}^k$ (with elements $\hat{w}^k_{ij}$). The resulting weight matrix for each context after applying TaSS gating is

$$W^k = D^k \circ W,$$

where $D^k$ is a binary matrix of zeros and ones (with elements $d^k_{ij}$), defined as

$$d^k_{ij} = \begin{cases} 1 & \text{if } \operatorname{sgn}(w_{ij}) = \operatorname{sgn}(\hat{w}^k_{ij}), \\ 0 & \text{otherwise,} \end{cases}$$

and $\circ$ is the Hadamard product (element-wise product). This algorithm ensures that for each context $k$, the sign of each weight $w^k_{ij}$ will reflect the correlation between neurons $i$ and $j$ over patterns in context $k$, but not over all other patterns. This selective input gating substantially complicates the analytical methods for memory capacity estimation. Considering Eq. 23, it is now the case that the two crosstalk terms are no longer independent. We instead propose a means of obtaining a rough estimate of the memory capacity with TaSS gating by comparing the network to results from Hopfield networks with binary weights 27,49 (see main text). Essentially, when overlap is low ($a < 0.1$ and $s < 10$), no connections are gated (full connectivity), and each subnetwork has a capacity close to the standard Hopfield network ($\alpha_H \approx 0.138$). However, when there is a large amount of overlap between contexts (e.g., $a > 0.5$ and $s > 100$), connectivity drops to 50%, and each subnetwork is well approximated by a Hopfield network with binary synapses ($\alpha_B \approx 0.1$ 27), scaled linearly by the sparsity (Eq. ??). Therefore, to a first approximation, we can estimate the capacity in between these extremes by linearly interpolating between $\alpha_H$ and $\alpha_B$ as a function of network sparsity, $f(s, a)$, which denotes the estimated network connectivity following synapse-specific gating. This capacity estimate can be written as

$$\alpha_{\text{cxt}} \approx c_H\, \alpha_H + (1 - c_H)\, \alpha_B,$$

where $c_H = 2f(s, a) - 1$ indicates how much the network behaves like a standard Hopfield network, and $(1 - c_H) = 2(1 - f(s, a))$ denotes how much the network behaves like the network with binary weights. This interpolation can then be linearly scaled by the sparsity, $f(s, a)$, to obtain

$$\alpha_{\text{cxt}} \approx f(s, a)\,\big[c_H\, \alpha_H + (1 - c_H)\, \alpha_B\big].$$

The expected amount of sparsity, $f(s, a)$, for each contextual configuration can be estimated assuming that the weight distributions for a single context and across all contexts are Gaussian with zero mean and variances proportional to the amount of crosstalk that they contribute. Given this, the probability that a particular weight will be removed, $\Pr(w_{ij} \leftarrow 0)$, can be approximated in terms of $\sigma_1^2$, the variance of the weight distribution for a single context, and $\sigma_A^2$, the variance of the weight distribution across all contexts. Based on the mean field results described below (see Eq. 105), we can approximate $\sigma_1^2$ and $\sigma_A^2$ as $\alpha_{\text{cxt}} r$ and $\tfrac{1}{2}\alpha r_n (a + \alpha + a^2)$, respectively. We compare this estimate with numerical simulations in Fig. 3B,C, where $f(s, a) = 1 - \Pr(w_{ij} \leftarrow 0)$.
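The gating step itself is simple to express in code. The following sketch assumes that a connection is kept in context $k$ exactly when the full weight and the isolated single-context weight have the same sign, and that zero-valued entries (including the diagonal) are gated off; `tass_mask`, `hadamard`, and `connectivity` are hypothetical helper names, and weight matrices are plain nested lists.

```python
def tass_mask(w, w_hat):
    """d_ij = 1 where the full weight and the single-context weight share a sign."""
    n = len(w)
    return [[1 if w[i][j] * w_hat[i][j] > 0 else 0 for j in range(n)]
            for i in range(n)]

def hadamard(d, w):
    """Element-wise product D * W, giving the gated weight matrix for one context."""
    n = len(w)
    return [[d[i][j] * w[i][j] for j in range(n)] for i in range(n)]

def connectivity(d):
    """Fraction of off-diagonal connections left intact by the mask."""
    n = len(d)
    kept = sum(d[i][j] for i in range(n) for j in range(n) if i != j)
    return kept / (n * (n - 1))
```

Because the mask depends only on signs, the same functions apply when `w` is a random matrix: every surviving entry then agrees in sign with the single-context weights, which is the property the random-weight result in the main text relies on.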
Inclusion of complexity factors in capacity estimation. We consider the addition of context-encoding neurons in the estimation of memory capacity in order to make a fairer comparison with other networks. To do so, we simply multiply the original capacity equations (Eq. 1 and Eq. 2) by a factor $N/(N + M)$, where $M$ is the number of context neurons. This converts the subnetwork capacity to $\alpha_{\text{cxt}}\, N/(N + M)$ and the full network capacity to $\alpha\, N/(N + M)$. Importantly, we assume that each neuron should have on the order of $N$ pre- and post-synaptic connections, with an absolute maximum of $N$ for both cases. We refer to the Results in the main text for details on the value of $M$ in each individual case.
Numerical simulations. We briefly describe the details of all numerical simulations here. More information can be found in the supplementary methods. Code was written in C++ with the help of Armadillo, a linear algebra library 91.
Memory capacity was estimated numerically by building finite-sized networks ($N = 10000$) initialized with a set of random patterns, connectivity and thresholds as defined above in section Model formulation. Dynamics were run synchronously according to Eq. 9, i.e., all units were updated simultaneously. Synchronous dynamics were chosen for efficiency reasons despite potential convergence issues. To test stability of the patterns, we initialized the network in each memory state and simulated the dynamics until they either reached a steady state or reached the maximum number of allowed time steps (100). We then calculated the overlap of the network state with the original pattern state as

$$m^{\sigma\mu} = \frac{1}{N_{\text{cxt}}} \sum_{i \in S_\sigma} \big(2\eta^{\sigma\mu}_i - 1\big)\big(2V_i - 1\big),$$

where $m^{\sigma\mu} \in [-1, 1]$ ($m^{\sigma\mu} = 1$ when the state is sitting exactly at the pattern, and $m^{\sigma\mu} = 0$ when it is uncorrelated with the pattern). This was done for all patterns in a particular subnetwork to obtain an average overlap $\bar{m}^{\sigma} = N_{\text{cxt}}^{-1} \sum_{\mu} m^{\sigma\mu}$. A small amount of noise was allowed in the overlap, such that memory retrieval was deemed successful as long as $\bar{m}^{\sigma} \ge 0.97$ 92.
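A minimal sketch of the retrieval check follows. It assumes the overlap is the mean product of the $\pm 1$ versions of the state and pattern over the neurons of the active subnetwork, which gives 1 at the pattern, 0 for an uncorrelated state, and -1 for the inverted pattern; `overlap` and `retrieved` are illustrative names, and the cutoff is applied here to a single pattern rather than to the subnetwork average.

```python
def overlap(state, pattern, subnet):
    """Overlap in [-1, 1] between a {0,1} state and a stored {0,1} pattern."""
    total = sum((2 * state[i] - 1) * (2 * pattern[i] - 1) for i in subnet)
    return total / len(subnet)

def retrieved(state, pattern, subnet, cutoff=0.97):
    """Retrieval counts as successful when the overlap reaches the cutoff."""
    return overlap(state, pattern, subnet) >= cutoff
```

In a full simulation, `state` would be the network state after the dynamics have converged or after the maximum number of update steps.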
Binomial test of proportions. Due to the finite network size, random choices of patterns may have non-zero correlations, which affects recall performance (weight correlations scale as $1/\sqrt{N}$). This may lead to networks which are able to stably store a larger or smaller number of patterns (relative to network size) compared to a network storing patterns with absolutely zero correlations. We thus ran several trials for each configuration and treated the average overlap from each trial as an estimate of the parameter $p_m$ from a Binomial distribution. The average overlap was corrected to fall between 0 and 1, such that the estimate of $p_m$ was formulated as $\hat{p}_m = N_{\text{tr}}^{-1} \sum_{n} \tfrac{1}{2}(\bar{m}_n + 1)$, where $N_{\text{tr}}$ is the total number of trials. To determine the maximum number of patterns, we used a Binomial test of proportions with a test statistic $T$ that compares $\hat{p}_m$ against the cutoff value $p_0 = 0.97$. The significance level for this test was set to 10%, such that the overlap was considered too low if $T > 1.281$. This allowed us to obtain a more accurate estimate of the average overlap with fewer trials. For all simulations shown here, $N_{\text{tr}} = 10$.
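The decision rule can be sketched as a one-sided, one-sample test of proportions. The exact form of the test statistic is an assumption here: the sketch uses the textbook normal approximation $T = (p_0 - \hat{p}_m)/\sqrt{p_0(1 - p_0)/N_{\text{tr}}}$, signed so that large positive values indicate an overlap below the cutoff. The function name is hypothetical.

```python
from math import sqrt

def proportion_test(mean_overlaps, p0=0.97, z_crit=1.281):
    """Return (p_hat, T, too_low) for a list of per-trial average overlaps."""
    n_tr = len(mean_overlaps)
    # map overlaps from [-1, 1] to [0, 1] and average across trials
    p_hat = sum(0.5 * (m + 1.0) for m in mean_overlaps) / n_tr
    # assumed normal-approximation statistic; positive when p_hat < p0
    t = (p0 - p_hat) / sqrt(p0 * (1.0 - p0) / n_tr)
    return p_hat, t, t > z_crit
```

The critical value 1.281 is the 90th percentile of the standard normal distribution, matching the 10% one-sided level quoted in the text.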
Estimating capacity from a single subnetwork. To further speed up numerical simulations, we also took advantage of the fact that all contextual configurations in the context-modular Hopfield network should be identical in memory storage properties. We therefore constructed a single contextual configuration and then mimicked the effect of the other contexts by adding noise to the weight matrix, drawn from a binomial distribution $B(n_B, p_B)$ with parameters $n_B$ and $p_B$.
We compared this method with the simulation of all contextual configurations and found that the two correspond well for most parameter ranges (not shown).
Robustness to noise. Robustness to noise was measured using so-called stability parameters 52. We adapt this notion here and define the stability parameter of a particular pattern $\nu$ of a context $k$ for each neuron $i$. We average this across all neurons and patterns to obtain the average stability parameter $\kappa$.
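For orientation, the sketch below computes the textbook stability parameter from the perceptron literature, $\kappa_i^{\nu} = \xi_i^{\nu} \sum_j w_{ij} \xi_j^{\nu} / \lVert w_i \rVert$, for a plain Hopfield network with $\pm 1$ patterns. This is the standard definition, not the context-adapted form used here, and the network sizes are hypothetical.

```python
import numpy as np

def stability_parameters(w, patterns):
    """kappa[nu, i] = xi_i^nu * sum_j w_ij xi_j^nu / ||w_i|| for +/-1 patterns."""
    fields = patterns @ w.T                 # input to unit i when pattern nu is the state
    row_norms = np.linalg.norm(w, axis=1)   # Euclidean norm of each unit's incoming weights
    return patterns * fields / row_norms

rng = np.random.default_rng(0)
n_units, n_patterns = 500, 10               # hypothetical sizes
patterns = rng.choice([-1, 1], size=(n_patterns, n_units))
w = patterns.T @ patterns / n_units         # Hebbian weights
np.fill_diagonal(w, 0.0)

kappa = stability_parameters(w, patterns)
mean_kappa = float(kappa.mean())            # average over all neurons and patterns
```

A positive $\kappa_i^{\nu}$ means that unit $i$ is stable in pattern $\nu$; larger values indicate a larger margin against noise.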

Dynamic gating model.
The dynamic memory gating model (Fig. 5E,F) featured an associative memory network with $N = 1000$ units (again denoted $V_i$). The network was composed of $s = 10$ contexts, each defined by a subnetwork of $N_{\mathrm{cxt}} = 200$ neurons and each storing $p = 10$ random patterns. Note that we do not consider sparse connectivity (so $b = 1$), and targeted synapse-specific gating was not applied to the network. The memory network was connected to a second network of $s$ context-encoding units, denoted $c_k$ for unit $k$, corresponding to context $k$. Units in both networks are binary, taking on values of 0 and 1. Finally, we found it necessary to add a third component, a global inhibitory neuron, denoted by $y$, which provides inhibition proportional to the population activity in the memory network (similar to Brunel 93). We also defined a linear readout for each memory of each context, denoted $z^{k\nu}$ for the readout of pattern $\nu$ of context $k$.
The connectivity matrix for the memory network, denoted $w^{MM}_{ij}$ (for units $i$ and $j$; $M$ for memory), was defined according to the Hebbian rule in Eq. 10.
Connectivity from the memory network to the context-encoding network was defined such that memory units that belong to context $k$ excite context unit $k$. The connectivity from the context-encoding network back to the memory network was defined such that context units inhibit memory neurons outside of their corresponding subnetwork, in proportion to the maximum activation that these memory units receive across all patterns in that context (a factor of 4 was included in these weights for stability purposes). This ensures that these neurons remain in the suppressed state.
The recurrent connectivity of the context-encoding network was implemented such that a particular context unit $l$ inhibits other context units in proportion to the maximum input they receive across all patterns in context $l$. This ensures that the dynamics in the context-encoding network are approximately winner-take-all. The global inhibitory neuron, $y$, has a synaptic weight of $w^{MG} = 0.05$ onto all memory units. Finally, output connectivity was defined for each linear readout $z^{k\nu}$.

Dynamics for the memory network were defined using the Heaviside step function $H(\cdot)$ and a time-dependent input $I_i(t)$ to unit $i$, where $t$ denotes the time step. For the context-encoding network, dynamics were implemented with a threshold $\theta_c$ set to 0.5. Finally, the global inhibitory neuron, $y$, was implemented as a linear neuron.

Memory overloading and continual learning.
Memory overloading experiments were carried out numerically for networks of $N = 10000$ units. Networks were trained at maximum capacity, as determined numerically. Then, a subset of contexts, denoted as "overloaded" contexts and making up 10%, 20%, 50% or 100% of the total number of contexts, were allocated additional memories, divided equally among the overloaded contexts. Performance was assessed numerically by measuring the average memory recall, as described above. In Fig. 6B, performance is plotted as a function of "overload amount", which is the number of additional overloaded memories divided by the network size, such that an overload amount of 1 indicates that $N$ additional memories past the maximum capacity were stored in the network. Comparisons are made with the standard Hopfield network and the low-activity Hopfield network, in which additional memories were stored in the standard way.
Continual learning experiments were also done numerically, for networks of size $N = 1000$. In this setting, memories were trained sequentially (following Eq. 10, but with the sum over patterns including only a single pattern), with an additional clipping step (threshold parameter $A$) following the storage of each new pattern, which bounds each weight to the interval $[-A, A]$. Additionally, contexts were defined sequentially and changed more slowly than the memory patterns themselves. Importantly, it was assumed that contexts were noiseless and did not deteriorate as new memories were added. Memory performance was again measured as above and averaged over 20 independent trials.
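The sequential storage with clipping can be sketched as follows. This is a simplified illustration that assumes a plain Hopfield network with $\pm 1$ patterns and no contextual gating; the bound `5 / N` and the sizes are hypothetical choices, and `store_sequentially` is a name introduced only for this example.

```python
import numpy as np

def store_sequentially(patterns, bound):
    """Add one Hebbian outer product per pattern, clipping weights to [-bound, bound]."""
    n_units = patterns.shape[1]
    w = np.zeros((n_units, n_units))
    for xi in patterns:
        w += np.outer(xi, xi) / n_units      # single-pattern Hebbian update
        np.clip(w, -bound, bound, out=w)     # clipping step after each new pattern
    np.fill_diagonal(w, 0.0)                 # no self-connections
    return w

rng = np.random.default_rng(0)
n_units, n_patterns = 200, 60                # hypothetical sizes
bound = 5.0 / n_units                        # hypothetical value of the threshold A
patterns = rng.choice([-1, 1], size=(n_patterns, n_units))
w = store_sequentially(patterns, bound)
max_abs_weight = float(np.abs(w).max())
```

Because the weights saturate at the bound, older patterns are gradually overwritten by newer ones, which is the forgetting mechanism referred to in the continual learning setting.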

Arbitrary context allocation.
A context-modular memory network with TaSS gating and no neuron-specific gating ($a = 1$) was first trained on the full set of memory patterns, i.e., as a single context (Fig. 7). Following this, contexts were assigned by choosing arbitrary groups of memory patterns (with replacement), and the appropriate targeted gating configuration was determined as defined above (as in Eqs. 34 and 35). This was done for various numbers of memory patterns per context, and memory recall performance was measured numerically. Importantly, this means that a single memory pattern could be found in multiple contexts.
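A rough sketch of this procedure is given below. It follows the verbal description of TaSS gating in the caption of Fig. 3 (a connection is gated when the sign of the overall weight differs from the sign of the hypothetical weight built from the context's patterns alone) rather than the exact form of Eqs. 34 and 35. It assumes $\pm 1$ patterns, and the sizes and the name `tass_gate` are hypothetical.

```python
import numpy as np

def tass_gate(patterns, context_idx):
    """Gate connections whose overall sign disagrees with the context-only weight."""
    n_units = patterns.shape[1]
    w_all = patterns.T @ patterns / n_units          # trained on all patterns
    ctx = patterns[context_idx]
    w_ctx = ctx.T @ ctx / n_units                    # hypothetical context-only weights
    # mask = 1 keeps a connection, mask = 0 gates it (zero-valued weights are gated too)
    mask = (np.sign(w_all) == np.sign(w_ctx)).astype(float)
    np.fill_diagonal(mask, 0.0)
    return w_all * mask, mask, w_ctx

rng = np.random.default_rng(0)
n_units, n_patterns, per_context = 300, 40, 5        # hypothetical sizes
patterns = rng.choice([-1, 1], size=(n_patterns, n_units))

# An arbitrary group of stored patterns defines one context.
context_idx = rng.choice(n_patterns, size=per_context, replace=False)
w_gated, mask, w_ctx = tass_gate(patterns, context_idx)

# Fraction of off-diagonal connections that remain active in this context.
connectivity = float(mask.sum() / (mask.size - n_units))
```

Drawing `context_idx` independently for each context allows the same pattern to appear in several contexts, as noted above.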

Distributions of memory strengths and noise tolerance.
A modified TaSS gating scheme was devised in order to assign memory strengths for each pattern (Fig. 8). In a standard Hopfield network, all stored patterns have roughly equal stabilities (Fig. 8B, black lines). We then define a set of strong memories, whose stabilities are larger than in the standard Hopfield network, and a set of weak memories, whose stabilities are smaller than in the standard Hopfield network. In the examples shown, one quarter of the memory patterns is strong, one quarter is weak, and one half remains close to the standard Hopfield network.
As a proof of concept, we generated two classes of distributions of strengths for the strong and weak memory patterns: one class with sequential strengths (contexts "A" and "B" in Fig. 8B,C, left) and another with discrete values (contexts "C" and "D" in Fig. 8B,C, right). The sets of strong and weak memories in context $k$ are denoted $T^S_k$ and $T^W_k$, respectively. Memories in the strong and weak groups were then assigned particular stability values through the auxiliary variables $x^S_\mu$ and $x^W_\mu$, respectively. These were then combined with the pattern values to obtain weight matrices $S^k$ and $\tilde{W}^k$, with elements $s_{ij}$ and $\tilde{w}_{ij}$, respectively. Finally, to obtain the mask for each context (as in Eq. 35), we compared these weight matrices with an overall weight matrix, $W$, obtained with the standard Hebbian learning rule applied to all memory patterns (Eq. 14). In other words, a given synapse is gated ($d^k_{ij} = 0$) if $\mathrm{sign}(s_{ij}) \neq \mathrm{sign}(w_{ij})$, $\mathrm{sign}(\tilde{w}_{ij}) = \mathrm{sign}(w_{ij})$, and $\mathrm{sign}(s_{ij}) \neq \mathrm{sign}(\tilde{w}_{ij})$.
To obtain the distributions found in Fig. 8B, the following auxiliary values, $x^S$ and $x^W$, were used. In context "A", memories 1 to 20 were assigned to be weak, and memories 21 to 40 were assigned to be strong, with the remaining 40 memories being neutral. The stability values were set to … assigned to be strong, and memories 61 to 80 were assigned to be weak: … Next, for context "C", memories 21 to 40 were assigned to be strong ($x^S_\mu = 1$), and memories 11 to 20 and 41 to 50 were assigned to be weak ($x^W_\mu = 1$). Finally, for context "D", memories 11 to 15, 21 to 25, 31 to 35 and 41 to 45 were assigned to be strong ($x^S_\mu = 1$), and memories 15 to 20, 25 to 30, 35 to 40 and 45 to 50 were assigned to be weak ($x^W_\mu = 1$).

Memory strength was measured using a stability parameter (Eq. 45). Noise tolerance was measured numerically by assessing memory recall performance for each pattern when the network is initialized in a noisy version of the pattern. This noisy version was generated by flipping each bit (neuron's activity) of the pattern with a probability $f$. The noise tolerance (as plotted in Fig. 8C) was defined as the maximum value of $f$ for which the recall (averaged over 20 trials with random noisy pattern initializations) for a particular pattern became significantly less than 0.97 (as measured by a binomial test of proportions; see Binomial test of proportions).
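The noisy initialization used for the noise tolerance measurement can be illustrated as follows. The sketch assumes a plain Hopfield network with $\pm 1$ units (so that flipping a bit is a sign change) and hypothetical sizes. It shows a single trial at one noise level $f$, not the full sweep over $f$ with the binomial test.

```python
import numpy as np

def flip_bits(pattern, f, rng):
    """Flip each +/-1 entry independently with probability f."""
    flips = rng.random(pattern.size) < f
    return np.where(flips, -pattern, pattern)

def recall(w, state, max_steps=100):
    """Synchronous updates until a fixed point or max_steps is reached."""
    for _ in range(max_steps):
        new_state = np.where(w @ state >= 0, 1, -1)
        if np.array_equal(new_state, state):
            break
        state = new_state
    return state

rng = np.random.default_rng(0)
n_units, n_patterns, f = 500, 10, 0.1        # hypothetical sizes and noise level
patterns = rng.choice([-1, 1], size=(n_patterns, n_units))
w = patterns.T @ patterns / n_units          # Hebbian weights
np.fill_diagonal(w, 0.0)

target = patterns[0]
noisy = flip_bits(target, f, rng)
initial_overlap = float(noisy @ target) / n_units            # about 1 - 2f on average
final_overlap = float(recall(w, noisy) @ target) / n_units   # overlap after the dynamics
recalled = final_overlap >= 0.97
```

Repeating this over several noisy initializations and increasing `f` until recall falls significantly below 0.97 gives the noise tolerance of a pattern in this simplified setting.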
Supplementary methods of Context-modular memory networks support high-capacity, flexible, and robust associative memories

William F. Podlaski, Everton J. Agnes, and Tim P. Vogels

Equivalence of {−1,+1} and {0,1} formulations. The {0, 1} formulation for the networks used in this study was designed such that it is equivalent to the standard {±1} formulation of the original Hopfield network 29. To see this, it is sufficient to show that the dynamics in the two cases are equivalent. We consider the context-modular memory network as defined in section Model formulation, but with units $S_i \in \{\pm 1\}$ and patterns $\xi^{\sigma\mu}_i \in \{\pm 1\}$ within the active context ($S_i = 0$ and $\xi^{\sigma\mu}_i = 0$ for any unit $i$ not contained within the subnetwork). The synaptic weights are defined as in Eq. 62, and the dynamics of this network (assuming context $k$ is active) are described by Eq. 63. Now, starting from the dynamics of the {0, 1} network, it is simple to show the equivalence to Eq. 63 using the relationships between the two formulations: $2V_i - 1 = S_i$, $\eta^{\mu}_i = \tfrac{1}{2}\xi^{\mu}_i$, and $2H(x) - 1 = \mathrm{sgn}(x)$ for all $x$. Assuming that context $k$ is active, applying these substitutions to the dynamics of the {0, 1} formulation for a particular unit $i$ within subnetwork $k$ recovers Eq. 63. Thus, the deterministic dynamics of the two formulations are exactly the same. We note that this also holds for the stochastic version of the dynamics, in which $V_i = \sigma(h_i)$ and $S_i = \tanh(h_i)$, due to the identity $\tanh(x) = 2\sigma(x) - 1$. This is used in the following section.

Heuristic mean field analysis. In this section, we follow the heuristic mean-field analysis as described previously 27,89 and apply it to the context-modular memory network. We derive this here for the case of ±1 units, as defined above (Equivalence of {−1, +1} and {0, 1} formulations). Due to the equivalence with the {0, 1} formulation, this result holds for both cases.

Incorporating context-dependence into the mean-field theory. In order to consider context-dependence through neuron-specific gating and synapse-specific gating, we must appropriately incorporate them into the mean-field theory. For neuron-specific gating, as described in the main text, we can take the perspective of a single subnetwork and assume that the effect of the other contexts is just to add noise to the connectivity matrix, determined by the relative subnetwork size $a = N_{\mathrm{cxt}}/N$ and the number of contexts $s$. Thus, we do not need to do anything beyond using the full weight matrix formulation described in Eq. 62, which we separate into two terms, one for the patterns of context $k$ and one for the patterns of all other contexts, since we are taking the perspective of context $k$.

Dendritic input gating is a bit trickier. Essentially, it creates a so-called diluted Hopfield network, with connection sparsity controlled by the parameter $b = K/N$, where $K$ is the average number of inputs per neuron. It turns out that for a standard Hopfield network, symmetric dilution of synapses can be approximated by a fully-connected weight matrix with independent Gaussian noise added to each element with variance $N^{-1}\alpha_{\mathrm{cxt}}(1 - c)/c$, where $N$ is the network size, $\alpha_{\mathrm{cxt}} = p/N$ is the memory capacity, and $c$ is the connection probability 48. In this case, we can approximate the dilution in the first term in this way by adding a noise term $\delta_{ij}$, which is independent symmetric Gaussian noise with zero mean and variance $N_{\mathrm{cxt}}^{-1}\alpha_{\mathrm{cxt}}(1 - b)/b$. The dilution in the second term should be treated differently. Since the second term only concerns patterns that we are not interested in recalling, the dilution does not add extra noise to the weights, but instead reduces the crosstalk effects. Thus we can treat this term as undiluted, but scaled by the amount of dilution $b$, since on average $bN$ connections will be present. This leaves us with the weight matrix expression of Eq. 66.

Setup. We consider a stochastic version of the Hopfield network, with updates to the units $S_i$ governed by a function $\sigma_\beta(h_i)$, where $h_i$ is the input to neuron $i$, $\beta$ is the inverse temperature parameter, which effectively controls the amount of noise in the updates, and $\sigma_\beta$ is a variant of a sigmoid function, parameterized by $\beta$.
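The identities invoked in the equivalence argument above can be checked numerically. In the sketch below, `logistic` is the standard logistic function $1/(1 + e^{-x})$, for which the relation to the hyperbolic tangent reads $\tanh(x) = 2\,\mathrm{logistic}(2x) - 1$. The form $\tanh(x) = 2\sigma(x) - 1$ quoted in the text therefore presumably relies on a sigmoid with the factor of 2 absorbed into its definition, which is an assumption made here.

```python
import numpy as np

def heaviside(x):
    """Heaviside step function H(x), evaluated away from x = 0."""
    return (x > 0).astype(float)

def logistic(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# An even number of symmetric grid points avoids x = 0, where conventions differ.
x = np.linspace(-5.0, 5.0, 1000)

# Deterministic units: 2 H(x) - 1 equals sgn(x).
sign_gap = float(np.max(np.abs(2.0 * heaviside(x) - 1.0 - np.sign(x))))

# Stochastic units: tanh(x) equals 2 * logistic(2x) - 1.
tanh_gap = float(np.max(np.abs(np.tanh(x) - (2.0 * logistic(2.0 * x) - 1.0))))

# Unit conversion between the two formulations: S = 2V - 1.
v = np.array([0.0, 1.0])
s = 2.0 * v - 1.0
```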
In mean field theory, we replace the true fluctuating input $h_i$ by its average value $\langle h_i \rangle = \sum_j w_{ij} \langle S_j \rangle$, which allows us to compute the average activation of each neuron as $\langle S_i \rangle = \tanh(\beta \langle h_i \rangle)$. We can then combine these two equations to get an expression for the average activation of each neuron as a function of the average activity of the rest of the network, $\langle S_i \rangle = \tanh\big(\beta \sum_j w_{ij} \langle S_j \rangle\big)$ (Eq. 69). This allows us to describe the dynamics of the system using so-called order parameters. First of all, we use the two measures of capacity as defined in the main text, which are repeated here for convenience: $\alpha_{\mathrm{cxt}} = p/N_{\mathrm{cxt}}$ and $\alpha = sp/N = \alpha_{\mathrm{cxt}}\, s\, a$, where $p$ is the number of patterns stored per context, $N$ is the total network size, $N_{\mathrm{cxt}}$ is the number of active neurons per context, $s$ is the number of contexts, and $a = N_{\mathrm{cxt}}/N$ is the relative subnetwork size. Throughout the analysis, it is important to determine the nearness of the current network state to the different memory patterns, both in the current context and across contexts.
Consider that the currently active context is $k$. We use $m^k_\nu$ and $n^\sigma_\mu$ to denote the overlap between the average network state $\langle S_i \rangle$ and a particular pattern $\nu$ of context $k$ ($\xi^{k\nu}_i$), or a pattern $\mu$ of context $\sigma$ ($\xi^{\sigma\mu}_i$), respectively, when $k$ is active:

$$m^k_\nu = \frac{1}{N_{\mathrm{cxt}}} \sum_{i \in S_k} \xi^{k\nu}_i \langle S_i \rangle \quad (71), \qquad n^\sigma_\mu = \frac{1}{N_{\mathrm{cxt}}} \sum_{i \in S_k} \xi^{\sigma\mu}_i \langle S_i \rangle \quad (72)$$

where $S_k$ is the set of active neurons in context $k$ (as defined in the section Model formulation). Note that both $m^k_\nu$ and $n^\sigma_\mu$ are sums of $N_{\mathrm{cxt}}$ terms and are normalized by $1/N_{\mathrm{cxt}}$. However, the $\xi^{\sigma\mu}_i$ terms of Eq. 72 may be zero depending upon the subnetwork size, so we expect Eq. 71 and Eq. 72 to have different variances. Importantly, we suppose that the network is close to one of the patterns, say pattern 1 of context $k$. Thus $m^k_1$ will be of order unity, and the remaining $m^k_\nu$ for $\nu \neq 1$ are small, of order $1/\sqrt{N_{\mathrm{cxt}}}$. To see this, consider the fact that $m^k_\nu$ is a sum of $N_{\mathrm{cxt}}$ terms, each of which will be $\pm 1$ with equal probability. This sum can be approximated by a zero-mean Gaussian with variance $N_{\mathrm{cxt}}$. The normalizer $1/N_{\mathrm{cxt}}$ makes the variance $1/N_{\mathrm{cxt}}$, and thus the standard deviation $1/\sqrt{N_{\mathrm{cxt}}}$. As for $n^\sigma_\mu$, it is a sum of $N_{\mathrm{cxt}}$ terms, each of which will be non-zero with probability $a$, due to $\xi^{\sigma\mu}_i$. The non-zero terms will then be $\pm 1$ with equal probability. This sum can be approximated by a Gaussian with variance $a N_{\mathrm{cxt}} = N_{\mathrm{cxt}}^2/N$, which, after normalization by $1/N_{\mathrm{cxt}}$, yields a variance of $1/N$. Thus the standard deviation is $1/\sqrt{N}$.
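The two scalings derived above can be checked with a small Monte Carlo experiment. The sketch below assumes a fixed random $\pm 1$ state on the active subnetwork that is unrelated to the sampled patterns, and it models membership of each unit in another context as an independent event with probability $a$. All sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_total, a, n_samples = 2000, 0.25, 4000     # hypothetical sizes
n_cxt = int(a * n_total)                     # active subnetwork size (500)

# Fixed +/-1 state on the active subnetwork, unrelated to the sampled patterns.
state = rng.choice([-1, 1], size=n_cxt)

# Overlap m with random patterns of the same context (every entry is +/-1).
xi_same = rng.choice([-1, 1], size=(n_samples, n_cxt))
m = xi_same @ state / n_cxt

# Overlap n with patterns of another context: an entry is non-zero with probability a.
member = rng.random((n_samples, n_cxt)) < a
xi_other = rng.choice([-1, 1], size=(n_samples, n_cxt)) * member
n = xi_other @ state / n_cxt

std_m, std_n = float(m.std()), float(n.std())
predicted_m = 1.0 / np.sqrt(n_cxt)           # 1 / sqrt(N_cxt)
predicted_n = 1.0 / np.sqrt(n_total)         # 1 / sqrt(N)
```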
We next introduce $r$ and $r_n$ to denote the mean square overlap of the system configuration with the nonretrieved patterns in subnetwork $k$ and in all other subnetworks, respectively:

$$r = \frac{1}{\alpha_{\mathrm{cxt}}} \sum_{\nu \neq 1} \left(m^k_\nu\right)^2 \quad (73), \qquad r_n = \frac{1}{\alpha} \sum_{\sigma \neq k} \sum_{\mu} \left(n^\sigma_\mu\right)^2 \quad (74)$$

Both of these quantities should be approximately of order unity. For the case of $r$, we have a sum of $p$ terms, each with variance $1/N_{\mathrm{cxt}}$, normalized by $1/\alpha_{\mathrm{cxt}} = N_{\mathrm{cxt}}/p$. For $r_n$, we have a sum of $(s - 1)p$ terms, each with variance of approximately $1/N$, normalized by $1/\alpha = N/sp$. Our task is now to get a self-consistent calculation of $r$, $r_n$, and $m^k_1$.
Derivation Part A: finding an expression for $r$. We begin by plugging the mean field equations for $\langle S_i \rangle$ from Eq. 69 into Eq. 71, and expanding the weight matrix using the expression in Eq. 66. We then rearrange terms and substitute the other $m^k_\mu$ and $n^\sigma_\mu$ terms into the equation. Next, we pull out the terms with $m^k_1$ and $m^k_\nu$, and then multiply the equation by $\xi^{k1}_i \xi^{k1}_i = 1$. We can put one of these factors inside the tanh because $\tanh(-x) = -\tanh(x)$, which gives Eq. 78. We then replace the right-hand side of Eq. 78 with the first two terms of its Taylor expansion with respect to $m^k_\nu$, i.e., $f(m^k_\nu) \approx f(a) + f'(a)(m^k_\nu - a)$, where $f(m^k_\nu)$ is the right-hand side of Eq. 78. We assume that $m^k_\nu$ is small, of order $1/\sqrt{N}$, and so we take $a = 0$. This expansion yields Eq. 79, in which $c_i$ (Eq. 80) denotes the crosstalk terms and $d$ is defined in Eq. 81. Note that we write the crosstalk terms $c_i$ to include all patterns except for pattern 1 of context $k$ ($k1$). Based on the expression in Eq. 79, $c_i$ should also exclude the term $\xi^{k\nu}_i \xi^{k1}_i m^k_\nu$, but this will be a negligible addition assuming large $N$ and $p$. This term is left in for generality, as the expression $c_i$ will reappear later in the derivation. We now approximate $d$ as an average of a function $f(z)$ of a Gaussian random variable $z$. Since $N$ is large (and therefore $N_{\mathrm{cxt}}$ too), we can replace the average $\frac{1}{N_{\mathrm{cxt}}} \sum_i \tanh^2$ with the integral over the distribution of $z$, which we will call $q$ (Eq. 84). …

… $\phi\nu$. We again rearrange terms so as to plug $m^k_\mu$ and $n^\sigma_\mu$ into the equation. We then follow the same steps as in the previous section: pulling out the terms with $m^k_1$ and $n^\phi_\nu$, multiplying the equation by $\xi^{k1}_i \xi^{k1}_i = 1$, and finally taking the Taylor expansion with respect to $n^\phi_\nu$ around zero, to get Eq. 93, where $c_i$ and $d$ are as defined above in Eq. 80 and Eq. 81, respectively. The equality in Eq. 93 is only approximate because $c_i$ in this case does not include pattern $\phi\nu$. Again, because we are considering the case of large $N$ and $p$, this change is negligible. We then approximate $d$ using the very same method as before in Eq. 82 and Eq. 83. We can thus approximate Eq. 93 by Eq. 95, where $q$ is defined as in Eq. 84. Finally, following Eq. 86 and Eq. 87, we can obtain an expression for $r_n$ by squaring and averaging Eq. 95. Again, only the terms with $i = j$ will be non-zero (see description following Eq. 87). However, in this case the terms $\xi^{\phi\nu}_i \xi^{\phi\nu}_j$ for $i = j$ do not equal 1, but average to $a = N_{\mathrm{cxt}}/N$. Lastly, we can replace the sum of $\tanh^2$ functions with the same $q$ variable defined in Eq. 86, to get an expression for $r_n$ (Eq. 98).

Derivation Part C: finding an expression for $m^k_1$. We assume the network configuration is close to pattern 1 of subnetwork $k$, and so $m^k_1$ will be much larger than the other overlaps. We follow the steps from Part A up to Eq. 78, but now using $m^k_1$ instead of an arbitrary $m^k_\nu$. Finally, we use the same trick as in Eq. 82 and treat $m^k_1$ as an average of a function $f(z)$ over the Gaussian random variable $z$ (Eq. 101), again remaining agnostic as to the exact form of the mean and variance of the distribution, which will be addressed in the next section.
Derivation Part D: approximating the distributions of $q$ and $m^k_1$. We now focus on determining an appropriate approximating distribution for $q$ and $m^k_1$. To do this, we take a more in-depth look at the expression for $c_i$ in Eq. 80, which is repeated as Eq. 102. The first term of Eq. 102 is a sum over the product of three independent random variables, $\xi^{k\mu}_i$, $\xi^{k1}_i$ and $m^k_\mu$. Based on Eq. 73, we know that $m^k_\mu$ should have zero mean and variance $\alpha_{\mathrm{cxt}} r/p$. Due to the symmetry of $m^k_\mu$ around zero, the other two variables, $\xi^{k\mu}_i \xi^{k1}_i = \pm 1$, will not have any effect. We thus have the sum of $p - 1$ random variables, each with variance $\alpha_{\mathrm{cxt}} r/p$, which leads to a single random variable with zero mean and variance approximately $\alpha_{\mathrm{cxt}} r$. Now we move on to the second term of Eq. 102, which is again a sum over the product of three random variables. From Eq. 74, we know that $n^\sigma_\mu$ should be zero mean with variance $\alpha r_n/((s - 1)p)$. The term $\xi^{k1}_i$ again does not have an effect due to the symmetry of $n^\sigma_\mu$. However, the term $\xi^{\sigma\mu}_i$ is different: it is $\pm 1$ with probability $a$, and zero otherwise. If we assume that $\xi^{\sigma\mu}_i$ and $n^\sigma_\mu$ are independent, then we get a random sum of random variables of the form $b \sum_{i=1}^{K} X_i$, with $K \sim B((s - 1)p, a)$ and $X_i \sim \mathcal{N}(0, \alpha r_n/((s - 1)p))$. Using Wald's equation, we arrive at a Gaussian distribution with zero mean and variance $\alpha r_n a b^2$.
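Under the independence assumption, the variance of this random sum is $b^2\, \mathrm{E}[K]\, \mathrm{Var}[X]$, which can be verified by direct sampling. In the sketch below, `n_terms` plays the role of $(s - 1)p$ and `var_x` the role of $\alpha r_n/((s - 1)p)$, so that the prediction equals $\alpha r_n a b^2$. All numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, a, b, var_x = 200, 0.2, 0.5, 0.01   # hypothetical parameters
n_samples = 10000

# Each of the n_terms summands is present with probability a (so K ~ Binomial(n_terms, a)).
present = rng.random((n_samples, n_terms)) < a
x = rng.normal(0.0, np.sqrt(var_x), size=(n_samples, n_terms))

# b times the sum of the K zero-mean Gaussian terms that are present in each sample.
sums = b * (present * x).sum(axis=1)

empirical_var = float(sums.var())
predicted_var = b**2 * n_terms * a * var_x   # b^2 E[K] Var[X]
```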
However, it turns out that the assumption that $\xi^{\sigma\mu}_i$ and $n^\sigma_\mu$ are independent is not true. They are slightly correlated, which shifts the mean to $\alpha$ (shown in Fig. S4A). Thus, as the total capacity increases, the mean-field approximation gets worse. In fact, we even face this issue with the first term, as $m^k_\mu$ and $\xi^{k\mu}_i$ are also correlated, which causes the mean to scale with $\alpha_{\mathrm{cxt}}$. However, since $\alpha_{\mathrm{cxt}}$ never goes above 0.138, this fact can be safely ignored. For the second term, we can solve this by setting $\mu_* = \alpha$.
Adding a term to the mean turns out to complicate the solution quite a bit, so we propose an alternative method as well. This is based on two tricks. First, we can calculate the variance assuming zero mean. Second, since each neuron has equal probability of being 1 and −1, we can split up the contribution to the variance from the positive and negative components of the distribution, each having an effect half of the time. The resulting expression for the variance is the second term of Eq. 113. This was confirmed empirically by simulating several networks over different realizations of $a$, $b$ and $s$ (Fig. S4B).
Finally, for the last term of Eq. 102, it can be shown that this will inject Gaussian noise with variance $\Delta^2 q$, where $\Delta^2 = \alpha_{\mathrm{cxt}}(1 - b)/b$ 48. To see this, we first note that all three terms in this sum are independent random variables with zero mean, which means that the sum should also have zero mean. The term $\xi^{k1}_i$, being $\pm 1$, will also have no effect due to the symmetry of the sum. Now, we can estimate the variance of this term, which we will call $Z$. Note that for $j \neq l$, the expected value of the cross terms is zero because each item is independent and has zero mean. We are thus only left with terms for $j = l$, which gives $\mathrm{Var}[Z] = \sum_j \mathrm{E}[\delta_{ij}^2] \langle S_j \rangle^2 = \Delta^2 q$. As a whole we thus have the following:

$$\sigma_*^2 = \alpha_{\mathrm{cxt}}\, r + \tfrac{1}{2} \alpha b^2 r_n \left(a + \alpha + a^2\right) + \Delta^2 q \quad (113)$$

Derivation Part E: solving for $\alpha_{\mathrm{cxt}}$. We can now solve simultaneously for $q$, $r$, $r_n$, and $m^k_1$ (from here on, we will refer to $m^k_1$ as $m$). The four equations of interest are Eq. 84, Eq. 89, Eq. 98, and Eq. 101, with $\sigma_*^2 = \alpha_{\mathrm{cxt}}\, r + \tfrac{1}{2} \alpha b^2 r_n (a + \alpha + a^2) + \Delta^2 q$. These equations resemble the resulting equations of the standard Hopfield network very closely. The only difference is that the variance of $z$ has changed in Eq. 114 and Eq. 115 and there is an extra equation for $r_n$ in Eq. 116. We will take the same approach as for the standard Hopfield network, following Hertz et al. 27, and solve the equations in the limit of $\beta \to \infty$. Given this limit, we can make use of two integral identities for the Gaussian averages of $1 - \tanh^2 \beta(az + b)$ and $\tanh \beta(az + b)$. We see that $q \to 1$ as $\beta \to \infty$, and the expression $C \equiv \beta(1 - q)$ is undetermined for $\beta \to \infty$. We can, however, express its limit with respect to $m$ and $r$. Given that $q \to 1$, we can rewrite Eq. 115 and Eq. 116, which can be rearranged to get $C = 1 - 1/\sqrt{r}$, and therefore $r_n = rb$. We can thus write the solution as two equations which can be simultaneously solved for $m$ and $r$, given a particular capacity $\alpha_{\mathrm{cxt}}$ as well as the other parameters of $\sigma_*^2$.
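To illustrate how such zero-temperature equations yield a capacity, the sketch below treats only the standard Hopfield limit, in which the variance reduces to $\alpha r$ (the additional terms of $\sigma_*^2$ are not included, which is a simplification made here). In that limit, the textbook equations $m = \mathrm{erf}(m/\sqrt{2\alpha r})$, $C = \sqrt{2/(\pi\alpha r)}\, e^{-m^2/(2\alpha r)}$, and $r = (1 - C)^{-2}$ combine, with $y = m/\sqrt{2\alpha r}$, into $y\,(\sqrt{2\alpha} + (2/\sqrt{\pi})\, e^{-y^2}) = \mathrm{erf}(y)$. The largest $\alpha$ admitting a solution with $y > 0$ is the capacity.

```python
import math

def alpha_for_solution(y):
    """Load alpha at which y = m / sqrt(2 alpha r) solves the zero-temperature equations."""
    gap = math.erf(y) / y - (2.0 / math.sqrt(math.pi)) * math.exp(-y * y)
    return 0.5 * gap * gap if gap > 0 else 0.0

# Scan y on a grid; the capacity is the largest load that still has a retrieval solution.
ys = [0.01 * k for k in range(1, 501)]
alpha_c, y_c = max((alpha_for_solution(y), y) for y in ys)
m_c = math.erf(y_c)          # overlap with the retrieved pattern at capacity
```

The scan recovers the familiar capacity of approximately 0.138 for the standard Hopfield network, with an overlap of roughly 0.97 at the critical load.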

FIG. 2. Memory capacity of the context-modular memory network with neuron-specific gating. A-C, Mean-field capacity estimation (solid lines) and numerical simulations (dots) for subnet capacity ($\alpha_{\mathrm{cxt}} = p/N_{\mathrm{cxt}}$), as a function of the number of subnetworks, $s$ (A) for fixed subnet ratio, $a = N_{\mathrm{cxt}}/N$, and vice-versa (B), and over the full range of parameters (C; mean field only). Lines plotted in A,B are slices through C indicated by the colored triangles. Comparison is made with the standard Hopfield network (green). D-F, Same as A-C but for overall network capacity (full net capacity; $\alpha = sp/N = \alpha_{\mathrm{cxt}}\, s\, a$). Memory capacity of the low-activity Hopfield network 34 is plotted in E with the same activity level $a_{LA} = a/2$ (turquoise, dashed). $N = 10000$ for all simulations.

FIG. 3. Targeted synapse-specific (TaSS) gating further enhances memory capacity. A, Schematic of TaSS gating (left). The sign of the synaptic weight for a given pair of neurons $(i, j)$ is compared with the sign of the hypothetical weight considering pattern configurations of each individual context (middle). Connections are gated if the sign changes. The proportion of gated weights (network connectivity) is estimated analytically (right, Methods). B-D, Resulting network connectivity following TaSS gating, as a function of number of contexts, $s$, over fixed subnetwork size, $a$ (B), and vice versa (C), and over all parameters (D). Lines plotted in B,C are slices through D indicated by the colored triangles. E-J, Numerical and theoretical capacity estimation for combined subnetwork and targeted synapse-specific gating for a single contextual configuration ($\alpha_{\mathrm{cxt}} = p/N_{\mathrm{cxt}}$; E-G) and for the full network ($\alpha = sp/N = \alpha_{\mathrm{cxt}}\, s\, a$; H-J), plotted as in B-D. Comparison is made with the standard Hopfield network (green) and the low-activity variant (dashed turquoise, I).

FIG. 5. Memory stability, dynamic switching, and "loss" of memory. A, Schematic of memory stability: network state (black ball) evolves according to an energy landscape. Memory stability numerically measures relative size of the basin of attraction for each memory (Methods). B, Average memory stability for "active" memories with neuron-specific (NS) gating, for different numbers of subnetworks, $s$, as a function of relative subnetwork size, $a$, and compared with standard (green line) and low-activity Hopfield networks (turquoise dashed line). Networks with targeted synapse-specific (TaSS) gating show similar stability levels (not shown). C, Average memory stability of "irrelevant" memories (patterns from non-active contexts) for networks with NS gating only (full lines), and including TaSS gating (dashed lines). D, Dynamic gating in networks with TaSS gating. Dynamics of memory overlap are shown for two memories belonging to different contexts (cxt 1 and cxt 2) when the network is initialized approximately equidistant from the two memory states for context 1 (left), context 2 (middle) and a third context to which the two patterns do not belong (right). Dotted lines (right panel) show memory overlap when the network begins in each of the two memory states. E, Schematic of dynamic gating in networks with NS gating, recurrently connected to a winner-take-all context-encoding network with one unit per context. Linear readouts measure memory activation, one per memory in each context. F, Example simulation of dynamic memory control (as illustrated in panel E). Top: four inputs are given to the network: two memories from context 1 (solid and dashed blue lines), followed by one memory from context 2, at two different amplitudes (magenta). Middle, bottom: inputs from memories of context 1 (blue) cause the corresponding context and memory readout to activate. The input for the memory from context 2 requires stronger amplitude to switch the context and memory readout (magenta). Memory readouts for other contexts are shown in gray. Simulation parameters: $N = 1000$, $N_{\mathrm{cxt}} = 200$, $s = 10$, $p = 10$. G, Example simulation showing "loss" of memory (same network and parameters as in panel F). Memories from context 2 remain inaccessible when the corresponding context unit is blocked.

FIG. 6. Memory overloading and continual learning. A, In memory overloading (left), extra memory patterns are stored in a subset of contexts. In continual learning (right), contexts are learned sequentially, and synaptic weights are bounded to induce forgetting (Methods). B, Robustness to memory overloading. Average recall performance, as measured numerically by the average overlap across patterns and contexts (Methods), as a function of overload amount ($p_{\mathrm{ovld}}/N$, where $p_{\mathrm{ovld}}$ is the number of additional patterns stored after maximum capacity has been reached). Only a fraction of contexts are overloaded: 10% (top, left), 20% (top, right), 50% (bottom, left), and 100% (bottom, right). Average recall is plotted separately for overloaded contexts (gray), unaffected contexts (black), and averaged over all contexts (dashed gray), along with comparisons to the standard Hopfield network (green) and low-activity variant (dotted turquoise) with activity level $a/2$. Parameters: $N = 10{,}000$, $a = 0.1$, $s = 100$ (left). C, Continual learning with context-modular memory networks. Contexts are created sequentially after every 40 patterns ($p = 40$, $s = 10$ total, $N = 1000$; $\alpha_{\mathrm{cxt}} = 0.04$; vertical gray dashed lines). Weights are clipped every time a new memory is added to the network ($w_{ij} \in [-A, A]$; $A$ is indicated in the legends under Weight bounds; Methods). Memory recall is shown for each pattern in sequential order for neuron-specific (NS) gating (top) and targeted synapse-specific (TaSS) gating (bottom) versus the standard Hopfield network.

-and several other models have explored a hierarchical organization of memor-

… and multiply Eq. 22 by $-2\eta^{k\nu}_i$, turning the first term into $-1$.

We now study the effect of the two crosstalk terms by considering the quantity … $= 0$ and thus $\mathrm{E}[Y] = 0$. We use Eq. 19 to estimate the variance of $Y$, first ignoring the factor $\frac{1}{bN_{\mathrm{cxt}}}$, to obtain $\mathrm{Var}[Y] = \mathrm{E}[K]\,\mathrm{Var}[2X - 1] = N_{\mathrm{cxt}}(p - 1)b$. Restoring this factor, which divides the variance by $(bN_{\mathrm{cxt}})^2$, we obtain $\mathrm{Var}[Y] \approx \frac{p}{bN_{\mathrm{cxt}}} = \frac{\alpha_{\mathrm{cxt}}}{b}$. The second term can now be approximated analogously to the first. It again takes the form $Z = \frac{1}{bN_{\mathrm{cxt}}} \sum_{i=1}^{K} (2X_i - 1)$, but now with $K \sim B(N_{\mathrm{cxt}}(s - 1)p,\ b^2a^2)$ and $X_i \sim \mathrm{Bern}($ …

$$\int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2} \left(1 - \tanh^2 \beta(az + b)\right) \approx \sqrt{\frac{2}{\pi}}\, \frac{1}{a\beta}\, e^{-b^2/2a^2}$$

$$\int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2} \tanh \beta(az + b) \approx \mathrm{erf}\left(\frac{b}{\sqrt{2}\,a}\right) \quad (119)$$

This enables us to write the four equations as:

FIG. S3. Memory capacity of low-activity Hopfield network. A, Comparison of theoretical (dotted black) and numerical (red) estimations of memory capacity for the low-activity Hopfield network with {0, 1} units as in ref. 34, as a function of activity level $a_{LA}$. Standard Hopfield capacity (0.138) is plotted as a solid black line for reference. B, Numerical estimation of memory capacity as a function of network size, for different activity levels $a_{LA}$. Plot in A used $N = 20000$. C, Information content (relative to the standard Hopfield network) as determined from numerical simulations (red curve in A). See methods section Analytical capacity estimation for details.
Schematic of hierarchical memory assignment to context with variable size and overlap. B, Network performance as measured numerically by the average overlap across all patterns for the active context, plotted for various total loads, $P/N$, as a function of the single context capacity, $p/N$. Simulation with a random weight matrix is shown as the grey dashed line. $N = 1000$ for all simulations. C, Network connectivity (non-gated connections) of the simulations in B.
FIG. 8. Imposing distributions of memory stability. A, Stability levels of memories within the active context are set with a modified form of targeted synapse-specific (TaSS) gating (Methods). Contexts can be combined to achieve stability proportional to the sum of the stability levels in individual contexts (right). Diamond shaped points represent memory patterns. B, Distributions of memory strength and context combination for two example networks, with memory stability measured numerically ($N = 1000$; $p = 80$; Methods). Left: each context set half of the memories to have a monotonically increasing strength, and the other half to baseline. Right: half of the memories were assigned a strong or weak stability value, with the other half set to baseline. Comparison is made with a standard Hopfield network in which each memory has roughly the same stability (black lines). C, Noise tolerance of combined contexts in panel B measured numerically (top) by initializing the network in a noisy memory state and running the dynamics (schematic, bottom left). Noise tolerance is also plotted as a function of memory stability (bottom right).