## Abstract

A long-standing challenge in biological and artificial intelligence is to understand how new knowledge can be constructed from known building blocks in a way that is amenable to computation by neuronal circuits. Here we focus on the task of storage and recall of structured knowledge in long-term memory. Specifically, we ask how recurrent neuronal networks can store and retrieve *multiple* knowledge structures. We model *each* structure as a set of binary relations between events and attributes (attributes may represent, e.g., temporal order, spatial location, or role in a semantic structure), and map each structure to a distributed neuronal activity pattern using a vector symbolic architecture (VSA) scheme.

We then use associative memory plasticity rules to store the binarized patterns as fixed points in a recurrent network. By a combination of signal-to-noise analysis and numerical simulations, we demonstrate that our model allows for efficient storage of these knowledge structures, such that the memorized structures, as well as their individual building blocks (e.g., events and attributes), can subsequently be retrieved from partial cues. We show that long-term memory of structured knowledge relies on a new principle of computation beyond the memory basins of attraction. Finally, we show that our model can be extended to store sequences of memories as single attractors.

Human memory is remarkable in its ability to robustly store and retrieve information with complex and hierarchical structure, guiding cognitive processes on many different timescales. In many instances, this “structured knowledge” can be described as sets of associations between discrete events and their contextual attributes. Concrete examples include temporal sequences, which represent events associated with particular times; episodic memories, which represent events associated with particular contexts [1, 2]; cognitive maps, which represent spatial environments through landmarks associated with locations [3–5]; and semantic structures in language, in which meaning is conveyed through sets of words associated with their respective roles within a sentence [6–8].

To effectively use structured knowledge that has been stored in long-term memory, it must be represented in a way that allows for its retrieval from partial information, with tolerance for noisy and degraded cues. This is likely facilitated by the distributed nature of the underlying neural representations, which provide an inherent notion of similarity between representations and a mechanism for learning representations by the tuning of synaptic weights in a neural network [7, 9–11]. However, while distributed representations are clearly beneficial from this perspective, it is still not well understood how to represent associations and relations in neural networks in an efficient and flexible way that is amenable to the variety of computational demands involved in higher cognition [12–14].

Several recent studies have addressed the contextual modulation of neuronal representations, e.g., by forming “mixed representations” [15] or by gating parts of the network [16]. Other proposals have tried to implement more general relational structures in neural networks. An early attempt used the tensor product to create a distributed representation of pairwise relations between discrete items [6]. Subsequently, several Vector-Symbolic Architectures (VSA) were proposed as compressions of the tensor product to avoid the increase in dimensionality of the representation, allowing for the creation of hierarchies of relations in a compact way [17–22]. More recently, several architectures for deep or recurrent neural networks have been proposed to promote flexible relational reasoning [23–29]. However, all of these works have primarily focused on working memory, i.e., online tasks of processing incoming structured data. By contrast, the challenge of storing and retrieving relational structures in long-term memory has received little attention.

Storing knowledge structures in long-term memory poses several additional challenges. While working memory tasks typically process one structure at a time, long-term memory networks must cope with storing multiple structures, such as multiple complex cognitive maps, multiple temporal sequences, or multiple stories. Two generic measures of the efficiency of information storage in recurrent neural networks are extensive capacity, i.e., a number of stored items that scales with the number of neurons in the network [30], and the ability to recall memories from partial cues which have small but significant overlap with the desired memory [31]. Both of these measures can be adversely affected by correlations across memorized patterns. For relational structures, additional correlations may occur due to the presence of the same objects or contextual attributes in multiple memories, putting additional constraints on the encoding of relational information. In addition, models of distributed representation of knowledge structures typically compress the relational structure into a fixed-length distributed vector representation; to compensate for this loss of information, “clean-up” mechanisms are invoked. Thus, it is crucial that such mechanisms can be adapted to the task of retrieving such structures from long-term memory, so that large numbers of relational structures, each containing multiple associations, can be stored efficiently.

In this work, we propose a model for associative memory of *multiple* relational structures by using a quadratic binding scheme to form vector representations of memories consisting of *multiple* binary relations between items (which we will henceforth denote as pairs of objects and their attributes). While our model is quite general, in most of this work we will use the holographic reduced representation (HRR) [7] VSA scheme for convenience. We show that the binarized versions of these structures can be stored as *fixed-point* attractors in a recurrent neural network, and that each structure can be retrieved from the memory network using a cue which is itself a structure, encoding a subset of the relations in the memorized structure. We highlight the holistic nature of this model by comparing the storage of temporal sequences in the present model, where the entire sequence is stored as a single fixed point, to previous models, where a sequence is stored as a sequence of transitions between multiple fixed points and cannot be fully recalled at once [32]. Our model posits that in addition to the network that stores the structures, a Dictionary network stores all individual items (e.g., individual words, familiar objects). We show that the identities of the objects contained in the structure can be decoded faithfully from the retrieved memory by querying the retrieved structure with the appropriate cue, as long as a “clean-up” operation is performed to map the noisy estimate of the object to the correct item in the Dictionary. Furthermore, this decoding works well even when the retrieved structure is significantly degraded.

## RELATIONAL STRUCTURES

We begin by modeling a binary relational structure *S* as a set of *L* object/attribute pairs,

*S* = {(*a*_{ℓ}, *b*_{ℓ})}_{ℓ=1,…,L},  (1)

where both objects *a* and attributes *b* have embeddings as real vectors representing distributed patterns of activation in a neuronal population. For simplicity, both populations will be assumed to be of the same size *N*. We represent the relation between the items in a pair (*a*, *b*) by a transformation through a pairwise quadratic nonlinearity to a binding vector *g*(*a*, *b*) ∈ ℝ^{N}, representing a distributed pattern of activity in a population of *N* “binding” neurons. Each component of the binding vector takes the form

*g*_{k}(*a*, *b*) = *a*^{T}*G*^{k}*b*,  (2)

where each *G*^{k} is an *N*×*N* fixed binding matrix. The binding operation in Eqn. 2 is a generalized version of a VSA scheme [18] and can be interpreted as a lossy compression of the tensor product binding operation first proposed in [6].
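As a concrete illustration, the circular-convolution binding used by HRR is one instance of this quadratic form. The following minimal numpy sketch (dimensions and seed are arbitrary choices, not taken from the simulations reported here) verifies the correspondence between the FFT implementation and the matrices *G*^{k}:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024

# Random item embeddings with entries ~ N(0, 1/N), so vectors have roughly unit norm.
a = rng.normal(0.0, 1.0 / np.sqrt(N), N)
b = rng.normal(0.0, 1.0 / np.sqrt(N), N)

def bind(x, y):
    """HRR binding: circular convolution, computed via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

g = bind(a, b)

# Circular convolution is one instance of the quadratic form g_k(a,b) = a^T G^k b:
# G^k is the fixed 0/1 matrix with G^k[i, j] = 1 iff i + j = k (mod N).
k = 3
Gk = np.zeros((N, N))
for i in range(N):
    Gk[i, (k - i) % N] = 1.0
assert np.isclose(g[k], a @ Gk @ b)

# The bound vector is nearly orthogonal to its constituents (dissimilar from both).
cos_a = (g @ a) / (np.linalg.norm(g) * np.linalg.norm(a))
```

The near-orthogonality of *g*(*a*, *b*) to both *a* and *b* is what allows many bound pairs to be superposed with limited interference.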

We obtain the representation of the full relational structure *S* by the vector summation of the individual object/attribute pairs,

*s* = Σ_{ℓ=1}^{L} *g*(*a*_{ℓ}, *b*_{ℓ}),  (3)

where the vector summation induces a second source of information loss. The representation is permutation invariant with respect to the index *ℓ*, so the relations within the structure have no particular order.

The compressed representations of structures can be used for a variety of computations, such as structure classification. Here we focus on unbinding tasks, in which, given *s* and one of its attributes *b*_{ℓ}, we need to estimate its paired object *a*_{ℓ}. Similar to binding, we assume that the unbinding operation is performed through a quadratic transformation involving the pair (*s*, *b*_{ℓ}), so that the *k*-th component of the estimator â of *a*_{ℓ} is given by

â_{k} = *s*^{T}*F*^{k}*b*_{ℓ},  (4)

where each *F*^{k} is an *N*×*N* matrix chosen so that the decoding operation is the approximate inverse of the binding operation.

In general, the binding and unbinding matrices can be learned and the optimal choice should depend on the nature of the items contained in the dictionary. Here we use a generic set of matrices, a popular choice known as Holographic Reduced Representations (HRR) described in Methods.
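For the HRR choice, binding is circular convolution and unbinding is circular correlation, its approximate inverse. The sketch below encodes a structure of *L* pairs and decodes one object from its attribute; all sizes and the seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 2048, 5

def bind(x, y):
    # HRR binding: circular convolution.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def unbind(s, y):
    # HRR unbinding: circular correlation, the approximate inverse of binding.
    return np.real(np.fft.ifft(np.fft.fft(s) * np.conj(np.fft.fft(y))))

objects = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))
attrs = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))

# Structure vector: superposition of the L bound pairs. Information is lost twice,
# in the compression of each pair and in the summation.
s = sum(bind(objects[l], attrs[l]) for l in range(L))

# Query with attribute b_0: the result is a noisy estimate of object a_0.
a_hat = unbind(s, attrs[0])
cos = (a_hat @ objects[0]) / (np.linalg.norm(a_hat) * np.linalg.norm(objects[0]))
cos_wrong = (a_hat @ objects[1]) / (np.linalg.norm(a_hat) * np.linalg.norm(objects[1]))
```

The estimate is substantially more similar to the correct object than to any other, but well below perfect similarity, which is why a subsequent clean-up stage is needed to recover the item exactly.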

The final estimate of *a*, ã, is computed by comparing the noisy estimate against a Dictionary, i.e., a neural long-term memory system that stores all familiar objects *a*_{d}, using

ã = argmax_{*a*_{d} ∈ Dictionary} â^{T}*a*_{d}.  (5)

A schematic of the encoding and decoding networks is shown in Fig. 1a.
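A minimal sketch of the clean-up step as a maximum-overlap lookup over a Dictionary of familiar items; the dictionary size, noise level, and target index below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 1024, 200

# Dictionary of D familiar objects (one per row).
dictionary = rng.normal(0.0, 1.0 / np.sqrt(N), (D, N))

def clean_up(a_hat, dictionary):
    """Return the index of the dictionary item with the largest overlap with the
    noisy estimate (maximum-likelihood decoding for isotropic Gaussian noise)."""
    return int(np.argmax(dictionary @ a_hat))

# A noisy estimate of item 17: the true pattern plus Gaussian noise of comparable norm.
target = 17
a_hat = dictionary[target] + 1.5 * rng.normal(0.0, 1.0 / np.sqrt(N), N)

decoded = clean_up(a_hat, dictionary)
```

Because the overlap of the noise with any incorrect item is of order 1/√*N*, decoding succeeds even when the noise norm exceeds that of the signal, provided the dictionary is not too large.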

The maximum likelihood (ML) decoding error is given by the probability *P*_{ϵ} that the estimator â has the largest overlap with an incorrect item in the dictionary. It depends on the size of the dictionary *D* and on the signal-to-noise ratio (SNR) of the overlaps of the estimator â with the dictionary items, defined in Eqn. 20 in Methods. ML decoding is an idealization of a more biologically realistic retrieval of the stored pattern *a* from a retrieval cue â in a long-term associative memory network storing all individual Dictionary items. Two possible implementations of this memory system are a winner-take-all network with lateral inhibition [33] or a sparse Hopfield network [34].

## STORING STRUCTURES IN LONG-TERM ASSOCIATIVE MEMORY

We now consider the long-term memorization of multiple knowledge structures by storing their vector representations in a neural network for long-term Structured Knowledge Associative Memory (SKAM), so that they can be retrieved at a later time from partial cues and subsequently queried to reconstruct individual events. We consider a set of *P* structures {*S*^{1}, …, *S*^{P}}, which, for simplicity, all consist of *L* pairs. We label the set of *L* objects and attributes comprising the *μ*-th structure as {(*a*_{ℓ}^{μ}, *b*_{ℓ}^{μ})}_{ℓ=1,…,L}.

For each structure, the HRR encoding scheme is used to create a vector representation *s*^{μ} from *S*^{μ}.

To store multiple structures as fixed points in a neural network, the neuronal input-output transformation must be highly non-linear, implying that the stored patterns themselves are limited to the dynamic range of the neurons. As in a standard Hopfield network [30, 31, 35], we assume neurons are binary ±1 variables, and the memory patterns, the candidate fixed points of the attractor dynamics, are *σ*^{μ} = sign(*s*^{μ}). For a network with *N* binary neurons, the memory load is defined as *α* = *P*/*N*, where *P* is the number of stored structures.

In general, the associative nature of Hopfield memory networks is expressed in the ability to recall a memorized pattern starting from any initial state which has a sufficiently large overlap with the memorized pattern. If the initial state is within the basin of attraction of the pattern, it will converge to the pattern without errors. In our case, we consider partial cues of a structure *σ*^{μ}, in the form of a recalling structure obtained by the binding and subsequent summation of a subset *S*_{0} of the *L* pairs of binary relations contained in *S*^{μ}, i.e.,

*σ*_{0}^{μ} = sign( Σ_{(a,b)∈S_{0}} *g*(*a*, *b*) ),  (6)

where *L*_{0} is the number of recalling elements in *S*_{0} and is assumed to be much less than *L*. The network is then initialized in the state *σ*_{0}^{μ} and evolved to a fixed point which, if successful, corresponds to the stored binarized structure *σ*^{μ}. A schematic of our model is shown in Fig. 2 and more details on the memory network are given in Methods.

There are several learning rules which can be used to store patterns as discrete fixed points in recurrent neural network models of associative memory. Here, for simplicity, we use the Pseudo-inverse learning rule proposed in [36] to train the network. In a Pseudo-inverse network, all structures are perfect fixed points for *α* < 1, which is assumed throughout. This allows us to focus on the retrieval cues, since failure to perfectly recall a structure occurs only when the retrieval cue is outside of the basin of attraction of the memorized structure. We observe qualitatively similar behavior for the Hebb learning rule and for the Storkey rule introduced in [37], for *α* well below the memory capacity, as described in Section III D of the Supplementary Material.
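The storage-and-cueing procedure can be sketched as follows: random ±1 patterns stand in for the binarized structure vectors *σ*^{μ}, the Pseudo-inverse rule builds the weight matrix, and one pattern is recovered from a degraded cue. The network size, load, zero-diagonal convention, and update schedule are illustrative assumptions, not the parameters of the reported simulations:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 400, 20                      # memory load alpha = P/N = 0.05

# Random +/-1 patterns standing in for the binarized structure vectors sigma^mu.
patterns = rng.choice([-1.0, 1.0], size=(P, N))

# Pseudo-inverse rule: W = (1/N) X^T C^{-1} X with C = (1/N) X X^T, which makes
# every pattern an exact fixed point for P < N. Zeroing the diagonal (a common
# convention) removes the trivial self-coupling.
X = patterns
C = X @ X.T / N
W = X.T @ np.linalg.inv(C) @ X / N
np.fill_diagonal(W, 0.0)

def retrieve(sigma0, W, sweeps=20, rng=rng):
    """Asynchronous sign-dynamics until convergence (or a sweep limit)."""
    sigma = sigma0.copy()
    order = np.arange(len(sigma))
    for _ in range(sweeps):
        rng.shuffle(order)
        changed = False
        for i in order:
            s = 1.0 if W[i] @ sigma >= 0 else -1.0
            if s != sigma[i]:
                sigma[i], changed = s, True
        if not changed:
            break
    return sigma

# Partial cue: the stored pattern with 25% of its bits flipped (initial overlap 0.5).
cue = patterns[0].copy()
cue[rng.choice(N, size=N // 4, replace=False)] *= -1.0

m_final = retrieve(cue, W) @ patterns[0] / N
```

At this low load the cue lies well inside the basin of attraction, so the final overlap with the stored pattern is essentially 1.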

## RESULTS

We evaluate the performance of the scheme introduced in the previous section by the ability to accurately perform the unbinding operation after retrieval of a structure from the SKAM. For example, after retrieving the structure *μ* = 1, we should be able to extract the item *a*_{ℓ}^{1}, with low error, using a query in the form of its paired attribute *b*_{ℓ}^{1}. We quantify performance by the average unbinding error *P*_{ϵ} obtained in simulations where structured memories are created from random patterns, stored in memory, retrieved with partial cues, and subsequently decoded using the ML “clean-up” operation. We assume all items appearing in memorized structures are stored in dictionaries for objects and for attributes, which are then used to decode from the retrieved memory. A schematic of this process is shown in Fig. 2 and full details of the simulations are provided in Methods.

The parameters involved in the performance measure are: network size *N*, memory load *α* = *P/N*, size of the relational structures and the retrieval cue, denoted as *L* and *L*_{0}, respectively. We consider the regime where both *N* and *P* are very large and the memory load *α* ~ *O*(1) [30, 38], mainly considering values *α* ~ 0.1 − 0.2, where the network acts as a good associative memory.

### Retrieval of Structured Memories

We begin by showing numerical results which measure the quality of the retrieved structures in terms of the unbinding error *P*_{ϵ} and the SNR of overlaps defined in Eqn. 20. In all reported results, the extracted item (and the associated query) comes from pairs that are not part of the cueing structure *S*_{0}. Thus any performance better than chance necessarily involves information extracted by the retrieval from long-term memory. In Fig. 3a we show the dependence of the unbinding error *P*_{ϵ} on *L*, *L*_{0}, and *α*. For comparison, we show *P*_{ϵ} for the original structure prior to storage in the memory network, demonstrating that except for small *L*, the dominant contribution to the error comes from retrieving the structure from long-term memory. We also observe that for fixed *L*, *L*_{0}, and *α*, the error is suppressed as *N* increases, in contrast to standard large attractor memory networks where performance depends only on *P/N*. To elucidate this behavior, we replot the results in terms of SNR^{−1}, i.e., the inverse of the SNR as defined above, in Fig. 3b, showing that for each *L*_{0} there is a critical *L* above which the SNR of the memorized structures decreases relative to the original SNR, SNR_{0}, before storage in long-term memory. Note that due to binarization, SNR_{0} is smaller by a constant factor relative to the value given in Eqn. 22. In Fig. 3c, we replot the same results in terms of the inverse of the normalized SNR, SNR/SNR_{0}, vs. *L*/*L*_{0}. Since SNR_{0} is proportional to *N/L*, this normalization factors out the “trivial” dependence on *L/N* from the post-retrieval SNR. Figure 3c shows that for fixed *α* the normalized inverse SNR depends only on *L*/*L*_{0} and only weakly on *N*, suggesting that the main *N* dependence comes from the linearity of SNR_{0} in *N*.

### Length of Cueing Structure and Memory Basins

As seen in Fig. 3, the performance worsens (and the SNR decreases) as *L* increases, while the converse holds for *L*_{0}. We find that there is a critical ratio of cue length to structure length, *l*_{c}, defined as the minimum relative cue length *L*_{0}/*L* that leads to very small error, essentially equivalent to the error for the original structure.

To understand the origin of *l*_{c}, we note that the ratio *L*_{0}/*L* sets the average initial overlap of the retrieval cue with the corresponding memorized structure, which we denote *m*_{0}. For small values of *L*_{0}, *m*_{0} is well approximated by Eqn. 30 in Methods. The size of *m*_{0} determines whether, on average, the initial state is within the basin of the desired memory, so that the recurrent dynamics will succeed (or fail) in converging to the desired attractor. As the cue length *L*_{0} grows, the initial state becomes increasingly likely to be within the basin of attraction of the desired structure, retrieving it with essentially no error. In these conditions, the unbinding operation has the same probability of success as for the original structure. Conversely, for small enough *L*_{0} the initial state is likely outside the attraction basin of the memory, leading to errors in the retrieved structure.

To determine the minimum value of *L*_{0} required for perfect retrieval, we use known estimates [36] of the radius of attraction in attractor memory networks, *R*(*α*) = 1 − *m*_{min}(*α*), where *m*_{min}(*α*) is the minimal overlap between the initial state and the desired memory required for convergence to the correct fixed point on average. Through Eqn. 30, *m*_{min}(*α*) determines the minimal relative length of the cueing structure, *l*_{c}.

We conclude that when *L*_{0}/*L* < *l*_{c}, the main source of the decoding error in our model comes from the limitation on good retrieval of the structure from memory, due to small values of *L*_{0}/*L*, and not from noise in the original encoding (corresponding to *L*_{0}/*L* > *l*_{c}).

### Retrieval Outside Memory Basins

Naively, one would expect that for *L*_{0} < *l*_{c}*L*, *P*_{ϵ} will be very large due to the accumulation of errors in the retrieved structure, which is outside the memory basin. However, as shown in Fig. 3, this is not the case. Surprisingly, the decoding performance is well below chance level for a substantial range of values of *L*, even when *L* ≫ *L*_{0}/*l*_{c}.

_{c}This observation can be explained by two scenarios: (1) the actual basins fluctuate in their shape so that for some structured memories, initial states may converge to the memory fixed point even if they are outside the *mean* basin radius; (2) initial states outside the true memory basin converge to fixed points with significant overlap with the structured memory.

To test these scenarios, we measured the empirical distributions *p*(*m*), where *m* is the overlap between the fixed point and the desired structure, obtained from histograms of overlaps for several values of the initial overlap *m*_{0}, shown in Supplementary Fig. 3.

We find that as *N* is increased, *p*(*m*) becomes sharply peaked around a single value *m**. Inside the basin of attraction, i.e., for *m*_{0} > *m*_{min}(*α*), *m** = 1. However, outside of the basin, when *m*_{0} < *m*_{min}(*α*), *m** < 1; nevertheless it is substantially larger than 0. The value of *m** depends on both *m*_{0} and the load *α* roughly as

1 − *m** = *f*(*α*)(1 − *m*_{0}),  (7)

described in further detail in Methods and in Section III of the Supplementary Material. A schematic of the energy landscape is shown in Fig. 4a. Furthermore, for *α* ≲ 0.3, *f*(*α*) < 1 (Fig. 4b), implying that the final overlap with the retrieved structure is significantly larger than the initial overlap *m*_{0} even far outside the basin of the structure.
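The behavior outside the basin can be probed directly in a toy Pseudo-inverse network by initializing at a prescribed overlap *m*_{0} and measuring the overlap of the resulting fixed point. All parameter values below are illustrative, and the final overlap depends on the load and learning rule; the expectation for moderate loads is that it settles well above *m*_{0}:

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 400, 80                      # load alpha = 0.2
patterns = rng.choice([-1.0, 1.0], size=(P, N))

# Pseudo-inverse rule with zeroed diagonal (an illustrative convention).
X = patterns
W = X.T @ np.linalg.inv(X @ X.T / N) @ X / N
np.fill_diagonal(W, 0.0)

def retrieve(sigma0, sweeps=30):
    """Asynchronous sign-dynamics until convergence (or a sweep limit)."""
    sigma = sigma0.copy()
    order = np.arange(N)
    for _ in range(sweeps):
        rng.shuffle(order)
        changed = False
        for i in order:
            s = 1.0 if W[i] @ sigma >= 0 else -1.0
            if s != sigma[i]:
                sigma[i], changed = s, True
        if not changed:
            break
    return sigma

# Initialize at overlap m0 = 0.3 with pattern 0 (flip (1 - m0)/2 of the bits),
# likely outside the basin of attraction at this load, and measure the overlap
# of the resulting fixed point over a few trials.
m0, trials = 0.3, 5
finals = []
for _ in range(trials):
    state = patterns[0].copy()
    state[rng.choice(N, size=int(N * (1 - m0) / 2), replace=False)] *= -1.0
    finals.append(retrieve(state) @ patterns[0] / N)
m_mean = float(np.mean(finals))
```

Histogramming `finals` over many trials and increasing *N* is the analogue of the measurement of *p*(*m*) described above.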

### SNR of Retrieved Structures Outside the Basin

We use the preceding results to estimate the SNR for *L*_{0}/*L* ≪ *l*_{c}, i.e., when the initial state is well outside the memory basin. First, we argue that the SNR of unbinding from a noisy state with overlap *m* < 1 with the true structure should be roughly

SNR(*m*) ≈ (*c m*)^{2} SNR_{0},  (8)

where, as before, the binarization factor is contained in SNR_{0}, and *c* ≈ 0.65 accounts for the fact that part of the overlap *m* is contributed by the initial cueing structure *S*_{0} and is more concentrated around the relations contained in the retrieval cue. For very large networks, we can replace *m* in Eqn. 8 with *m** from Eqn. 7. Using Eqn. 30 from Methods, we express *m*_{0} in terms of *L*_{0}/*L* and arrive at a prediction for the SNR as a function of *L*_{0}/*L* (Eqn. 9),

which is verified in Fig. 5 for two values of *α*. These results summarize the rich behavior of associative memory of structured knowledge. In contrast to standard memory function, here the performance depends not only on the memory load *α* but also on the network size *N*, the structure length *L*, and the cueing length *L*_{0} through the SNR. The key difference is that for structured memories, the criterion for success is not limited to convergence to the target memory; even if the target memory is only partially retrieved, the underlying memorized relations can still be retrieved faithfully using the semantic memory. The well-known property of pattern completion is realized here by cueing with a sub-structure of length *L*_{0} < *L*, in addition to the standard random initial condition.

## STORAGE AND RETRIEVAL OF SEQUENCES

### Storing Sequences as Binary Structures

We now extend the results of the previous sections to representations of temporal sequences. Temporal sequences can be modeled as structures in several ways. One possibility is to bind each event in the sequence with its temporal order in the sequence. This can be implemented via a contextual drift process with a context representation that evolves as items in the sequence are retrieved, as in the Temporal Context Model of free recall of lists [2] and similarly in the Context Retrieval and Updating model [39, 40]. Here we use temporal proximity as the contextual cue by interpreting a sequence as a set of binary associations between temporally proximal events. Thus, a temporal sequence of length *L*, (*a*_{1}, *a*_{2}, …, *a*_{L}), can be represented as a structure of the form

*S* = {(*a*_{1}, *a*_{2}), (*a*_{2}, *a*_{3}), …, (*a*_{L−1}, *a*_{L})},  (10)

and the entire sequence *S* is represented by a vector of size *N* given by

*s* = Σ_{ℓ=2}^{L} *g*(*a*_{ℓ−1}, *a*_{ℓ}).  (11)

Decoding an episode at a particular time, i.e., *a*_{ℓ}, is performed through an unbinding operation with a query by the preceding event, *a*_{ℓ−1}. Starting from a query by *a*_{1}, the entire sequence can be unfolded through a sequence of queries. Because each event appears in two binary relations, we need to use an asymmetric binding operation, so that *g*(*a*, *b*) ≠ *g*(*b*, *a*). Within HRR, this can be accomplished by switching the binding and unbinding operations [41].
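A minimal sketch of this encode-and-unfold loop, assuming the HRR realization in which the asymmetric binding is circular correlation and unbinding is circular convolution; here the event patterns double as the Dictionary, and the sizes and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
N, L = 2048, 8

def bind_seq(x, y):
    # Asymmetric binding (circular correlation): g(a, b) != g(b, a), so each
    # event can appear both as a predecessor and as a successor without ambiguity.
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(y)))

def unbind_seq(s, x):
    # Querying the structure with an event returns a noisy estimate of its successor.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(s)))

# The L events; here the event patterns double as the Dictionary.
events = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))

# Encode the sequence as the sum of bindings of consecutive events (Eqn. 11).
s = sum(bind_seq(events[l - 1], events[l]) for l in range(1, L))

# Unfold: query with the current event, clean up against the Dictionary, repeat.
recalled = [0]                       # the first event is assumed given as the cue
for _ in range(L - 1):
    est = unbind_seq(s, events[recalled[-1]])
    recalled.append(int(np.argmax(events @ est)))
```

The clean-up at every step is what prevents the unbinding noise from compounding: each query uses the exact Dictionary pattern of the previously recalled event rather than its noisy estimate.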

As before, we consider the case in which all items being decoded are contained in a Dictionary, so each decoding step involves a clean-up of the decoded item before proceeding to decode the next item from the sequence. A schematic of this process is shown in Fig. 6a.

The binarized versions of the structures representing each sequence are stored for long-term memory in a recurrent neural network with a synaptic weight matrix determined via the Pseudo-inverse rule. The cueing structure consists of the first relation (*a*_{1}, *a*_{2}), so the overlap of the cue with the stored sequence is closely approximated by Eqn. 30 with *L*_{0} = 2. Alternatively, the first item *a*_{1} can be used as a retrieval cue if it is added to the representation in Eqn. 11. We are primarily interested in the ability to reconstruct the entire sequence after it is retrieved from memory.

### Retrieval of Sequences from Long-term Memory

Due to the sequential nature of decoding sequences in our model, the decoding error accumulates as each subsequent element is retrieved. Thus, the unbinding error for an event depends on its position in the sequence (relative to the cued events). In Fig. 6b, we show the decoding error *P*_{ϵ} at each position along sequences encoded by Eqn. 11, for sequences of different length *L*. Since the SNR of the overlap with the correct item at each position in the sequence depends on *L*, the length of the sequence limits the accuracy of decoding at all positions along the sequence. Nevertheless, for low memory loads and moderately long sequences, the accumulated error is small.

### Decoding Error

An interesting outcome of this mode of recall is the accumulation of errors as the recall sequence advances. This would give rise to correlations between probabilities of recall that decay as a function of the temporal lag between the events, consistent with observations [2]. This behavior was previously explained by positing that proximal temporal context vectors are correlated. In our model, these correlations are a natural consequence of the fact that the proximal events serve as temporal context cues.

The present scheme of long-term storage of sequences as single fixed points overcomes a key disadvantage of previous attempts at storing multiple temporal associations in attractor neural networks. In previous models of sequential memory [32], all of the patterns contained in the sequences are stored as separate attractors in one network, and the sequences themselves are encoded in a time-delayed synaptic plasticity rule that associates each pattern with the next pattern in the sequence, illustrated in Fig. 6c and reviewed in Section IV of the Supplementary Material. Because of the Markovian nature of the synaptic plasticity, retrieval will fail if multiple sequences share the same event, as shown in Fig. 6d. By contrast, in the present model, entire sequences are stored and retrieved as separate “holistic” attractors. Thus, as long as the retrieval cue is unique to a single sequence, it will retrieve it unambiguously, and the subsequent unfolding of the sequence by the unbinding network will be immune from interference with other sequences. To demonstrate this, we consider *P* sequences *S*^{μ}, *μ* = 1, …, *P*, of length *L*, where neighboring sequences *S*^{μ} and *S*^{μ+1} share *L*_{s} events in common.

In Fig. 6e, we show the decoding error for a sequence stored in memory with another sequence containing an overlapping event, demonstrating the successful retrieval of the entire sequence despite the presence of an overlapping state (compare with Fig. 6d).

Figure 6f shows that sequences can be faithfully retrieved even with multiple common states, up to the point where the basins of attraction of individual sequences shrink due to the large overlap between them.

Finally, it is interesting to compare the memory capacity of the two models. In the sequence attractor model, the maximal number of stored sequences of length *L* is *P* < *α*_{c}*N*/*L*, since the network stores *PL* states. In contrast, in the present model, since only *P* attractors are stored, the storage capacity is *P* < *α*_{c}*N*. Nevertheless, for a successful unfolding of the sequence we need *PL*/*N* to be bounded (for a fixed *L*_{0}/*L*) due to the noise in the unbinding operation. A potential disadvantage of the current model is the need to devote additional memory resources to storing the individual events in a Dictionary. On the other hand, the Dictionary can be used for multiple other cognitive tasks aside from the retrieval of these sequences. A further flexibility afforded by separating the retrieval of the neural representation of the sequence from the subsequent reconstruction of individual events is that, for some tasks, the agent may not need to access the sequence in full detail, for instance in tasks that require distinguishing one episode from another. Such tasks may not need to rely on the full unbinding of the sequences.

## NEURAL IMPLEMENTATION OF MULTIPLICATIVE BINDING

We now briefly consider possible implementations of the binding computation in Eqn. 2 through multiplicative interactions in biological neurons [42]. Previously, several mechanisms have been proposed to facilitate multiplicative interactions among neurons including dendritic gating [43], quadratic firing rate response [44], and short-term synaptic plasticity. Short-term plasticity comprises a variety of synaptic processes that modulate synaptic efficacy in a transient, activity-dependent manner [45]. These processes occur on timescales ranging from milliseconds to minutes [46] and are thought to mediate higher cognitive functions like attention and working memory. More recently it has been suggested that “fast weights” in artificial neural networks may serve as an analogy to short-term plasticity in the brain [47] with connections to linear transformers [48].

We start by noting that *g*_{k}(*a*, *b*) = *a*^{T}*b*′, where *b*′ = *G*^{k}*b* is a representation of the activity pattern *b* obtained by propagating it through a synaptic matrix *G*^{k}. The dot product between *a* and *b*′ can then be implemented by decomposing *G*^{k} into an outer product of two fixed synaptic weight vectors, i.e., *G*^{k} = *u*^{k}(*v*^{k})^{T}, so that components of the binding vector take the form

*g*_{k}(*a*, *b*) = (*u*^{k}·*a*)(*v*^{k}·*b*).  (12)

We now use the above form to consider how firing-rate nonlinearity and short-term synaptic plasticity can serve as mechanisms for generating quadratic binding.

### Nonlinearity of the Firing Rate

Biological neurons can potentially implement the computation of the binding vector *g*(*a*, *b*) via the nonlinearity of the firing rate response to a synaptic current, *r* = *f*(*I*), where the synaptic current is given by the sum *I* = *u*·*a* + *v*·*b*. While for many neurons a rectified nonlinearity *f*(*I*) = [*I* − *I*_{0}]_{+} is a good approximation, other neurons are well approximated by a quadratic nonlinearity. In that case, the sum of the separate responses to each input *a* and *b* can be subtracted from the response to the combined input from *a* and *b*, isolating the multiplicative term of Eqn. 12. The subtraction can be implemented by inhibitory neurons or by a temporal derivative in a working memory system.
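This subtraction scheme rests on the algebraic identity (*u*·*a* + *v*·*b*)² − (*u*·*a*)² − (*v*·*b*)² = 2(*u*·*a*)(*v*·*b*), which the following sketch verifies with arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 512
u, v = rng.normal(size=(2, N))      # fixed synaptic weight vectors
a, b = rng.normal(size=(2, N))      # the two input patterns

# Quadratic response to the combined input, minus the quadratic responses to each
# input alone, isolates the multiplicative binding term (u.a)(v.b).
product = 0.5 * ((u @ a + v @ b) ** 2 - (u @ a) ** 2 - (v @ b) ** 2)
```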

### Short-term Synaptic Plasticity

Another potential mechanism for generating quadratic binding is short-term synaptic plasticity. This can be accomplished by a short-term increase in residual presynaptic calcium levels in working memory, enabling *b* to modulate the synapses *G*^{k} so that a subsequent input *a* will generate a postsynaptic potential containing a multiplicative component of the form in Eqn. 12, together with a linear term weighted by *ω* (Eqn. 14). This linear term may not need to be completely canceled, as it provides the trace with some similarity to both *a* and *b*, potentially allowing objects or context to be used independently as retrieval cues. However, if *ω* is relatively small, the trace will remain most similar to the bound conjunction *g*(*a*, *b*).

## DISCUSSION

In summary, we have proposed and analyzed a model demonstrating how multiple knowledge structures, i.e., sets of relations between pairs of items, can be represented, stored, and retrieved in Hopfield-type recurrent neural network models of long-term memory. Our model hypothesizes that the entire set of relations is encoded, through binding operations, summation, and binarization, in a single pattern of activity in a neuronal population, which is then stored as a fixed point in the recurrent network. Retrieval of relational information from long-term memory consists, in our model, of two stages: first, retrieval of the desired fixed point, and subsequently, unbinding to uncover individual relations with the aid of a separate memory system, the Dictionary.

We also show that this scheme can be used to model long-term memory of temporal sequences by creating structure vectors for sequences of temporally associated items and storing compressed versions of the sequences as fixed points in a recurrent network. Sequence recall consists of retrieval of the “sequence” fixed point, followed by unfolding of the stored events through a sequence of unbinding operations. In this application we have also demonstrated that our model for storing structure vectors in long-term memory is not very sensitive to partial overlap between different structures.

Our analysis suggests that the success of this long-term memory system depends not only on the memory capacity of the attractor network but also, crucially, on the properties of the memory basins of attraction and the landscape in the surrounding "terrain", such as the degree of overlap between "spurious" states outside the basins and the target fixed point inside them. For this reason, a learning rule that decorrelates memories and yields smoother basins is clearly superior, as shown in Supplementary Fig. 8. Due to the dense distributed nature of the binding scheme employed here (HRR), we have not studied the effect of pattern sparsity on the long-term memory system [34]. It would be interesting to explore the effect of sparsity in sparse binding schemes [49–52] and, more generally, how the binding matrices can be learned in a biologically plausible way.

We close by briefly discussing two important aspects of this work which have the most immediate phenomenological implications. A key aspect of our model is the existence of neuronal populations representing entire relational structures in long-term memory as persistent patterns of activity displaying the "holistic context" of each structure. This system interacts with a working memory system which executes the dynamics of retrieving details of the stored relations. We have not addressed the interesting question of the mechanism by which a stream of experiences is segmented into a sequence of discrete events [53], or more generally, the mechanism that segments complex environments into discrete sets of bound items. In particular, our model of long-term memory of sequences predicts that the retrieval of a temporal sequence is associated with a persistent pattern of activity (representing the context of the entire sequence) in addition to sequential dynamics involving the dynamic interaction between working and long-term memory. This can be tested in recordings of neuronal activity during recall of sequences in the hippocampus and in songbirds.

Finally, as mentioned above, our framework of storage and retrieval of relational knowledge structures in long-term memory relies on the existence of a complementary long-term memory system, the "Dictionary", which stores the individual building blocks comprising the relational knowledge. It is tempting to identify these two complementary memory systems as representing episodic memory (the relational system) and semantic memory (the Dictionary) [54], although we emphasize that in the present context, semantic memory does not necessarily require language and presumably exists in other species as well. The synergy of these two "complementary" memory systems results in an associative memory system with both the capacity and the flexibility to store and faithfully represent complex knowledge structures in long-term memory. Adapting this framework to further explain empirically observed phenomena in memory will require adherence to known biological properties of hippocampal representations, as well as more explicit models of both the Dictionary and the working memory system in which binding and unbinding occur.

## METHODS

### Holographic Reduced Representation

HRR [8, 55] is a commonly used VSA scheme with fixed forms for the binding and unbinding matrices in Eqns. 2 and 4. The binding operation *g* is given by the circular convolution of the vectors *a* and *b*,

$$g_i(a, b) = \sum_{j=0}^{N-1} a_j\, b_{i-j},$$

where all subscripts are defined modulo *N*. The circular convolution operation is both associative and commutative. The corresponding decoding operation *φ* is realized through the circular correlation of the two vectors,

$$\varphi_i(a, b) = \sum_{j=0}^{N-1} a_j\, b_{i+j},$$

again with subscripts modulo *N*.

We see from comparing Eqns. 2 and 4 with Eqns. 15 and 16 that HRR corresponds to a particular fixed choice of the encoding and decoding matrices, with the binding matrices implementing circular shifts.

The commutativity of circular convolution implies that HRR encoding is commutative. To represent noncommutative asymmetric relations, we can simply exchange the binding and unbinding operations, i.e., bind with circular correlation and unbind with circular convolution [41]. The full details of the statistics of decoding for HRR are given in Section I B of the Supplementary Material.
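These operations can be implemented in O(*N* log *N*) time via the FFT, since circular convolution is elementwise multiplication, and circular correlation conjugate multiplication, in Fourier space. A minimal numpy sketch (our illustration, not the authors' code; the dimension and seed are arbitrary):

```python
import numpy as np

def cconv(a, b):
    """HRR binding: circular convolution, g_i = sum_j a_j b_{(i-j) mod N}."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    """HRR unbinding: circular correlation, f_i = sum_j a_j b_{(i+j) mod N}."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

rng = np.random.default_rng(0)
N = 1024
# components iid N(0, 1/N), so vectors have norm ~ 1 as in the text
a = rng.normal(0.0, 1.0 / np.sqrt(N), N)
b = rng.normal(0.0, 1.0 / np.sqrt(N), N)

g = cconv(a, b)        # bound pair g(a, b)
a_hat = ccorr(b, g)    # unbinding with b recovers a up to noise

similarity = a @ a_hat / (np.linalg.norm(a) * np.linalg.norm(a_hat))
```

Binding then unbinding with the same attribute recovers a noisy version of *a*; the Dictionary clean-up step described below resolves the residual noise.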

### Unbinding Accuracy

We assess the typical decoding performance by considering the case in which the *a*'s and *b*'s are random vectors with components drawn iid from a Gaussian distribution and the Dictionary of *a*'s contains *D* elements. The ML decoding error is then well approximated by Eqn. 18. Here the SNR is a signal-to-noise ratio defined in terms of the mean overlap of the estimator â_{d} with the correct Dictionary item *a _{d}* and the variance of its overlap with an incorrect Dictionary item *a _{d′}*, *d*′ ≠ *d*, where the averages are taken over the Gaussian distributions of the components of *a _{d}* and *a _{d′}*. For full details see Section II of the Supplementary Material. For SNR ≫ 1, the decoding error can be approximated by Eqn. 21, which decays exponentially with the SNR. To leading order, the SNR for many VSA binding schemes (including HRR) is SNR ≈ *N*/*L*. Eqn. 21 then implies that *P _{ϵ}* ≪ 1 as long as *N* ≳ *O*(*L* log *D*). Hence, for *L* ≪ *N*, accurate decoding requires that the size of the Dictionary *D* be at most polynomial in *N*. In this regime, assumed throughout, the size of the Dictionary has little effect on performance, which is dominated by the SNR. *P _{ϵ}* and the inverse SNR are shown as functions of *L* in Fig. 1b and c, respectively.
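The scaling of the decoding error can be checked with a small Monte Carlo experiment. The sketch below (our own, with arbitrary parameters chosen so that SNR ≈ *N*/*L* ≈ 50) unbinds one attribute from a structure of *L* relations and cleans up against a *D*-item dictionary:

```python
import numpy as np

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

rng = np.random.default_rng(1)
N, L, D, trials = 512, 10, 200, 50   # SNR ~ N/L ~ 51, so errors should be rare

errors = 0
for _ in range(trials):
    A = rng.normal(0.0, 1.0 / np.sqrt(N), (D, N))  # object Dictionary
    B = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))  # attributes
    # structure vector: superposition of L bound pairs, using objects A[0..L-1]
    s = np.sum([cconv(A[l], B[l]) for l in range(L)], axis=0)
    a_hat = ccorr(B[0], s)              # unbind the first attribute
    d_star = int(np.argmax(A @ a_hat))  # ML clean-up: nearest Dictionary item
    errors += (d_star != 0)

p_err = errors / trials
```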

### Memory Network

Throughout this work, we consider Hopfield type recurrent neural networks with binary neurons. The state of the network at time *t*, *σ*(*t*), evolves according to the update rule

$$\sigma_i(t+1) = \mathrm{sign}\Big(\sum_{j} J_{ij}\,\sigma_j(t)\Big),$$

where updates are done either in series or in parallel. For simplicity, parallel updates are used for the figures in the main text, but we show in Section III C of the Supplementary Material that the results are qualitatively similar for serial updates.

Given a set of memories *σ ^{μ}*, the synaptic weight matrix *J _{ij}* must be chosen so that each of the memories is a fixed point of the dynamics in Eqn. 23. Several different learning rules can accomplish this. Mainly, we consider the Pseudo-inverse rule [36], with synaptic weight matrix given by

$$J_{ij} = \frac{1}{N}\sum_{\mu,\nu=1}^{P} \sigma_i^{\mu}\,(C^{-1})_{\mu\nu}\,\sigma_j^{\nu},$$

where the pattern overlap matrix *C _{μν}* is defined as

$$C_{\mu\nu} = \frac{1}{N}\sum_{i=1}^{N} \sigma_i^{\mu}\,\sigma_i^{\nu}.$$

We also consider the Hebb rule, given by

$$J_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \sigma_i^{\mu}\,\sigma_j^{\mu},$$

and the Storkey rule [37, 56, 57], in which *J _{ij}* is built incrementally from the local fields of previously stored memories (Eqns. 27 and 28). These learning rules differ in their capacity and in the average size of the basins of attraction for memories at a given memory load *α*, as discussed further in Section III D of the Supplementary Material.
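A compact sketch of the Hebb and Pseudo-inverse rules (our illustration; sizes are arbitrary), storing random binary patterns and checking that the Pseudo-inverse memories are fixed points of the parallel dynamics:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 200, 40                      # memory load alpha = P/N = 0.2
sigma = rng.choice([-1.0, 1.0], size=(P, N))   # random binary memories

# Hebb rule: J = (1/N) sum_mu outer(sigma^mu, sigma^mu)
J_hebb = sigma.T @ sigma / N
# Pseudo-inverse rule: J = (1/N) sigma^T C^{-1} sigma, with C the overlap matrix
C = sigma @ sigma.T / N
J_pi = sigma.T @ np.linalg.inv(C) @ sigma / N
np.fill_diagonal(J_hebb, 0.0)       # remove self-coupling
np.fill_diagonal(J_pi, 0.0)

def retrieve(J, s, steps=10):
    """Parallel sign-updates of the Hopfield dynamics."""
    for _ in range(steps):
        s = np.sign(J @ s)
    return s

# overlap of the retrieved state with each memory, cued by the memory itself
m_pi = np.mean([retrieve(J_pi, sigma[mu]) @ sigma[mu] / N for mu in range(P)])
m_hebb = np.mean([retrieve(J_hebb, sigma[mu]) @ sigma[mu] / N for mu in range(P)])
```

At this load (*α* = 0.2, above the Hebb capacity *α _{c}* ≈ 0.14), the Pseudo-inverse memories remain exact fixed points while Hebb retrieval degrades, illustrating the capacity differences discussed above.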

### Simulations

We simulate the memory storage, retrieval, and decoding processes by creating dictionaries of objects and attributes, where the *a*'s and *b*'s are random vectors with components drawn iid from a Gaussian distribution. The size of these dictionaries is fixed to *D* = *L*_{max}*N*, where *L*_{max} is the size of the largest structure being considered. In Fig. 1b we show the decoding error for several values of *L*_{max}, and for Figs. 1c and 3 we set *L*_{max} = 30. We then use a subset of the dictionaries to create *P* knowledge structures with vector representations given by HRR encoding. These structures are then point-wise binarized and used to compute the synaptic weight matrix, using the Pseudo-inverse rule unless otherwise stated. Since the encoding of the structures induces a similarity with individual relations *g*(*a, b*) rather than with *a* or *b* individually, the same set of attributes {*b*_{1}, *b*_{2}, …, *b _{L}*} can be used across several or all of the different knowledge structures while retaining the ability to decode the corresponding object from a retrieved structure. Hence, we consider the case in which the same attributes are used in each structure.

We test the performance of the memory network by initializing the network, for each memory *μ* = 1, …, *P*, in a state constructed from the subset of *L*_{0} relations used as a retrieval cue. We then evolve the network for *T* parallel updates, denoting the attractor reached by the network, i.e., the retrieved state starting from the partial cue of the *μ*-th structure, by σ̂^{μ}. We define *m ^{μ}* as the overlap between σ̂^{μ} and *σ ^{μ}* (Eqn. 29).

For each retrieved structure, we use an attribute corresponding to a relation *not* contained in the initializing cue to obtain an estimate of the associated object, which is then identified with the Dictionary element with which it has the highest overlap.

The Pseudo-inverse rule ensures that the basins of attraction for different structures are essentially identical, regardless of potential differences in the overlaps between structures. In simulations, this allows us to consider each structure as an independent trial. The fraction of trials in which the object is incorrectly decoded from the retrieved structure provides an empirical estimate of the decoding error *P _{ϵ}*. We also construct an empirical SNR from Eqn. 20. Finally, we measure *m ^{μ}* for each structure (Eqn. 29) to obtain an empirical distribution *p*(*m*), where the overlaps *m* are calculated for each memory in a trial and accumulated over many trials. The distribution *p*(*m*) does not appear to change if measured over multiple trials with different patterns or for multiple patterns within the same trial, which further supports treating each structure as a separate trial. The distribution *p*(*m*) is a statistical measure of the retrieval quality for structures of fixed size *L*, memory load *α*, and retrieval cue length *L*_{0}.
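The simulation pipeline just described can be sketched end to end. The toy below (our own code, with sizes far smaller than in the paper) encodes *P* structures with shared attributes, binarizes them, stores them with the Pseudo-inverse rule, retrieves one structure from a partial cue of *L*_{0} relations, and decodes a relation absent from the cue:

```python
import numpy as np

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

rng = np.random.default_rng(3)
N, L, L0, P = 400, 10, 6, 5         # network size, relations, cue size, structures

B = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))     # attributes shared by all structures
A = rng.normal(0.0, 1.0 / np.sqrt(N), (P, L, N))  # objects, one set per structure

# binarized structure vectors sigma^mu = sign(sum_l g(a_l, b_l))
sigma = np.array([np.sign(np.sum([cconv(A[p, l], B[l]) for l in range(L)], axis=0))
                  for p in range(P)])

C = sigma @ sigma.T / N                            # Pseudo-inverse storage
J = sigma.T @ np.linalg.inv(C) @ sigma / N

mu = 0
cue = np.sign(np.sum([cconv(A[mu, l], B[l]) for l in range(L0)], axis=0))  # partial cue
s = cue
for _ in range(20):                                # parallel retrieval dynamics
    s = np.sign(J @ s)

m = s @ sigma[mu] / N                              # retrieval overlap
a_hat = ccorr(B[L - 1], s)                         # decode a relation NOT in the cue
d_star = int(np.argmax(A[mu] @ a_hat))             # clean-up over this structure's objects
```

For simplicity the clean-up here searches only this structure's own objects rather than the full Dictionary; at these loads the cue lies well inside the basin, so the network converges to the stored structure and the uncued object is decoded correctly.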

### Determination of *l*_{c}

To determine *l _{c}* as a function of the various network parameters, we calculate the relation between *L*_{0}/*L* and the average initial overlap *m*_{0} with the desired structure in the limit of large *N*, yielding Eqn. 30. Further details of the derivation are provided in Section I D of the Supplementary Material. Using Eqn. 30, we write *l _{c}*(*α*) in terms of *m _{min}*(*α*), defined in the main text, as Eqn. 31. To determine *m _{min}*(*α*) we resort to the Pseudo-inverse model with random binary patterns as memories [36], which is simpler to simulate. Results relating *m _{min}*(*α*) and *l _{c}* are shown in Supplementary Fig. 1.

_{c}### Empirical Distribution of Overlaps

We find that the empirical distribution *p*(*m*) is bimodal and takes the general form

where *p*_{1} is the probability that a structure is perfectly retrieved from memory and *p*_{m<1}(*m*) corresponds to the distribution of *m* for imperfectly retrieved memories.

The peak at *m* = 1 corresponds to trajectories converging to the target memories. This probability can be nonzero even when the initial overlap *m*_{0} is outside the mean basin radius, indicating a non-spherical basin shape. The second mode, peaked at 0 < *m* < 1, results from trajectories that converged to a fixed point outside the basin with a significant residual overlap with the memory. We characterize the shape of the distribution by the probability of *m* = 1, *p*_{1}, the width of the lower *m* mode, *σ _{m}*, and the mean of that mode, *m**. Results are shown in Supplementary Fig. 4a for several values of *N* and two values of *α*.

As noted in [36, 38], the shape of the distribution *p*(*m*) is sensitive to finite size effects. To analyze these effects, we calculate *p*(*m*) for different sizes in a standard Pseudo-inverse model where the initial overlap *m*_{0} can essentially be varied continuously. For *m*_{0} > *m _{min}*(*α*), almost all trials converge to the memorized pattern. For a range of values *m*_{0} < *m _{min}*(*α*), *p*(*m*) is bimodal. We find that *p*(*m*) obtained from networks storing random patterns is very similar to the distribution obtained from networks storing structured memories when *m*_{0} and *L*_{0}/*L* are related as in Eqn. 30. We find that for large *N*, *p*_{1} approaches a step function, changing from zero to one as *m*_{0} crosses *m _{min}*(*α*) = 1 − *R*(*α*). Near this transition, *p*_{1} can be approximated by a form that converges to a step function exponentially fast with increasing *N*. In addition, *σ _{m}* is very small outside the narrow transition regime of *m*_{0} and shrinks to zero everywhere as *N* → ∞. From this, we conclude that for *N* → ∞, *p*(*m*) becomes a *δ* function, located either at *m* = 1 for *m*_{0} > *m _{min}*, or at a smaller value *m** which increases smoothly with *m*_{0}, starting from zero and reaching 1 as *m*_{0} increases from zero to *m _{min}*. Thus in large networks, the basins have a roughly spherical shape, such that virtually all initial conditions with *m*_{0} ≥ *m _{min}* converge to the memory, and all initial conditions with *m*_{0} < *m _{min}* converge to fixed points with partial overlap *m**.

## Supplementary Material

### I. HOLOGRAPHIC REDUCED REPRESENTATION

#### A. Overview

In holographic reduced representation (HRR) [1], the binding operation *g* is circular convolution

where the individual components of *g* are given by

with all subscripts defined modulo *N*. For objects and attributes *a, b* ∊ ℝ^{N}, this creates a representation *g*(*a, b*) ∊ ℝ^{N}. The circular convolution operation is both associative and commutative, i.e. cconv(*a, b*) = cconv(*b, a*). The corresponding decoding operation is given by the circular correlation

where the individual components of *f* are given by

In general the circular correlation operation is neither associative nor commutative. However, it can be related to circular convolution by defining the involution of a vector *a* as *a**, with components (*a**)_{i} = *a*_{−i} (indices modulo *N*). Then ccorr(*a, b*) = cconv(*a**, *b*), and *a** serves as an approximate inverse to *a* when both *a* and *b* are random vectors.
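This identity is easy to verify numerically; a quick sketch (ours, with arbitrary dimension and seed):

```python
import numpy as np

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def involution(a):
    """(a*)_i = a_{(-i) mod N}: a_0 stays in place, the rest is reversed."""
    return np.concatenate(([a[0]], a[:0:-1]))

rng = np.random.default_rng(4)
a = rng.normal(size=64)
b = rng.normal(size=64)

lhs = ccorr(a, b)
rhs = cconv(involution(a), b)       # identical up to floating point error
```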

#### B. Clean-up and SNR

To quantify the effect of the ML clean-up operation described in the main text, we compute various statistics of the estimator â_{ℓ} for the corresponding object *a _{ℓ}* decoded from a relational structure using the HRR unbinding operation. We consider the case where all objects and attributes are random vectors with components drawn iid from a Gaussian distribution, and decoding is done from a dictionary of size *D*.

Since the encoding operation is permutation invariant, the statistics for all â_{ℓ} decoded from the full structure should be the same. For simplicity we set ℓ = 1 and consider â_{1}. We index dictionary items so that indices ℓ ≤ *L* correspond to the dictionary items contained in the structure and indices *L* < ℓ ≤ *D* correspond to dictionary items that are not contained in it.

Using the expression for the unbinding operation in Eqn. 4, we can express the estimator â_{1} as

Using Eqn. 2, we can express this in terms of the *a _{ℓ}*'s and *b _{ℓ}*'s, writing â_{j,1} as a signal component plus three noise terms ϵ, *ξ*_{j,1}, and *ξ*_{j,2}. Here, ϵ is noise due to deviations in the normalization of *a*, *ξ*_{j,1} is noise from interference between different elements of the cue within the same attribute *b*_{1}, and *ξ*_{j,2} is noise coming from interference of attribute *b*_{1} with all other attributes *b _{ℓ}* contained in the structure. Since components of *a*_{1} appear an odd number of times in each term of the sums in *ξ*_{j,1} and *ξ*_{j,2}, we conclude that 〈*ξ*_{j,1}〉 = 〈*ξ*_{j,2}〉 = 0. Likewise, 〈ϵ〉 = 0. Using Eqns. 5 and 6, we find the second moments and correlation of the two noise terms, and from Eqn. 8, the noise in each component of the estimator.

and from Eqn. 8, we see the noise in each component of the estimator is

We now calculate the first two moments of the overlap of the estimator with the correct pattern, *a _{d}* · â_{1}, to obtain the SNR for HRR defined in Eqn. 20 of Methods. The overlap *a _{d}* · â_{1} can be expressed as

We see from Eqn. 17 that 〈*a*_{1} · â_{1}〉 = 1 and 〈*a _{d}* · â_{1}〉 = 0 for *d* ≠ 1, so the estimator is unbiased. The second moment is given by

To summarize

For *L* ≪ *N*, the SNR for overlaps defined in Eqn. 20 of the main text is then approximately

#### C. Binarization and Empirical SNR

When decoding from a binarized structure , the estimator â_{1} is given by

where

To compute the SNR, we start by evaluating the overlap between the estimator â_{1} and items in the dictionary, 〈*a _{d}* · â_{1}〉, which can be expressed as

While individual products of elements *a _{ℓ,i}b_{ℓ,j}* are not Gaussian distributed, 〈*a _{d}* · â_{1}〉 is the sum of *N*^{2} random variables whose correlations only occur at higher order, as the expectation of the product of different elements of each circular convolution is zero. Since all terms contained in the sums over indices *j* and *k* in Eqn. 24 are only correlated at higher order, we can approximate each term in the sum as independent, which gives us

We now define the following three random variables

where *z*_{1} is a signal term and *ξ* is a noise term within the sign function. We have 〈*z*_{1}〉 = 〈*z*_{2}〉 = 〈*ξ*〉 = 0, and the variances are given by

For *d* = 1, *z*_{1} = *z _{ℓ}* = *z*, and for *L* ≫ 1 we can treat *z* and *ξ* as Gaussian. In this approximation, 〈*a*_{1} · â_{1}〉 becomes

Likewise, the second moment is given by

Putting Eqns. 32 and 33 together, the SNR of overlap when decoding *a*_{1} from the binarized structure *σ* is given by

Comparing with Eqn. 21, we see that binarizing has decreased the SNR only by a constant factor.
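This constant-factor loss can be seen empirically. The sketch below (our own; parameters arbitrary) estimates the overlap SNR, the squared mean signal overlap divided by the variance of the overlap with an unrelated item, for the raw and the binarized structure:

```python
import numpy as np

def cconv(a, b):
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

rng = np.random.default_rng(5)
N, L, trials = 512, 8, 400

sig = {"raw": [], "bin": []}
noise = {"raw": [], "bin": []}
for _ in range(trials):
    A = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))
    B = rng.normal(0.0, 1.0 / np.sqrt(N), (L, N))
    s = np.sum([cconv(A[l], B[l]) for l in range(L)], axis=0)
    a_out = rng.normal(0.0, 1.0 / np.sqrt(N), N)   # item not in the structure
    for key, struct in [("raw", s), ("bin", np.sign(s))]:
        a_hat = ccorr(B[0], struct)
        sig[key].append(A[0] @ a_hat)     # overlap with the correct item
        noise[key].append(a_out @ a_hat)  # overlap with an unrelated item

snr = {k: np.mean(sig[k]) ** 2 / np.var(noise[k]) for k in ("raw", "bin")}
```

The binarized SNR remains of the same order as the raw SNR, consistent with a constant-factor reduction rather than a change in scaling with *N*/*L*.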

#### D. Memory Initialization

We now determine the average initial overlap *m*_{0} between a binarized retrieval structure *σ*_{0}, containing *L*_{0} of the *L* relations in the binarized full structure *σ*, in the case where the *a _{ℓ}*'s and *b _{ℓ}*'s are random vectors with components drawn iid as before. An expression for the average overlap *m*_{0} between *σ*_{0} and *σ* can be written as Eqn. 35, in which the overlap between components of the unbinarized structures under HRR binding is given by Eqn. 36. For large *N* we can approximate the sum in Eqn. 35 by treating the individual terms as independent, since there are no lower order correlations between terms. We can then define two auxiliary random variables, approximately Gaussian distributed with zero mean, whose variances are set by the cued and remaining relations respectively. Eqn. 35 can then be approximated as Eqn. 37, which for *L*_{0} ≪ *L* can be further simplified.

This implies that creating retrieval structures with *L*_{0} out of *L* relations will result in retrieval cues with average overlap *m*_{0} with the full memorized structure. The variance of the distribution of initial overlaps vanishes as *N* → ∞.

We use this relation to determine *l _{c}*(*α*) as a function of *L*_{0}/*L*, using the value of *m _{min}*(*α*) obtained for a Pseudo-inverse model with random binary patterns as memories [2]. Results for *m _{min}*(*α*) are shown in Fig. 1a, demonstrating a linear increase in the range *α* < 0.5. Substituting these results in Eqn. 31 closely predicts the value of *l _{c}*(*α*), as shown in Fig. 1b.

### II. DECODING ERROR

We derive a good approximation for the ML decoding error in terms of *D*, the size of the decoding dictionary, and the SNR, the signal-to-noise ratio of overlaps defined in Eqn. 20 of Methods. The overlap between a dictionary item *a _{d}* and the estimator â_{1} for *a*_{1} decoded from a structure is given by Eqn. 39, where, as in the previous section, we index items so that *d* ≤ *L* corresponds to overlaps with patterns within the structure and *L* < *d* corresponds to overlaps with other items in the Dictionary not contained in it. For ML decoding, an error occurs if any incorrect item has a larger overlap with the estimator than the correct item does. The probability of error *P _{ϵ}* is then given by

For large *N*, each essentially behaves as an independent Gaussian random variable with mean *μ*_{1d} and variance Σ_{1d}. For unbiased decoding schemes, *μ*_{1L} = *μ*_{1D} = 0. In general, we can consider decoding from a structure containing errors by setting *μ*_{11} = *μ*, where *μ* is related to the overlap of the corrupted structure with the correct structure. Then *P _{ϵ}* can be very well approximated as

where

For HRR, Σ_{11} ≈ Σ_{1L} ≈ Σ_{1D} to leading order. Defining Σ to be the leading order term in the variance, we obtain

Identifying with the SNR of overlaps, Eqn. 42 becomes

given in the main text.

#### A. Limitations for Good Decoding

We can understand the dependence of *P _{ϵ}* on the SNR and *D* in the large-SNR regime by obtaining a saddle-point approximation for the expression for *P _{ϵ}* given in Eqn. 43. We start by rewriting Eqn. 43 as an integral of an exponentiated function *f*(*z*). From Eqn. 42, we see that *P _{ϵ}* ≪ 1 requires a large SNR, so we are interested in the regime in which *f*(*z*) is very large, and the integral in *P _{ϵ}* can be approximated by its saddle-point value, with the saddle point *z*_{0} given by the stationarity condition of *f*. In this regime a further approximation allows Eqn. 47 to be solved for *z*_{0} (Eqn. 49). The saddle-point approximation for the error is then

This can be further approximated as

We can insert the SNR of overlaps into Eqn. 53 to determine the limits on the size of *D* for good decoding. For HRR the approximation for the error gives us

which implies that for an error of *δ* ≪ 1

From this, we expect the error to be small as long as the SNR is large and *D* is polynomial, and not exponential, in *N*.

#### B. Decoding After Memory Retrieval

To analyze the effect of memory retrieval on the decoding error, we start by considering decoding from a degraded binarized structure at a given Hamming distance from the uncorrupted binarized structure. In this case, the effective SNR is found by making the replacement 〈*a*_{1} · â_{1}〉 → *m* in Eqn. 34. From this, we see that SNR(*m*) should take the form

When â_{1} is instead decoded from an imperfectly retrieved structure from memory, we make a modified replacement in which *c* ~ *O*(1) is a constant factor accounting for differences in the overlap of the structure with relations within and outside the retrieval cue. Then SNR(*m*) takes the modified form

The decoding error after retrieval from memory is then given by

where *p*(*m*) takes the same form as in Methods. For *N* → ∞, *p*(*m*) → *δ*(*m* − *m**) and the decoding error approaches

### III. INSIGHTS FROM RANDOM PATTERNS

#### A. Network Order Parameters

For generic Hopfield networks, we can characterize the quality of memory retrieval by formally defining three network order parameters which quantify the overlap of the network state with the stored memories at each time step. The first is *m ^{μ}*(*t*), which represents the overlap of the network state *σ*(*t*) with the corresponding pattern *σ ^{μ}* at time *t*. If the initial state of the network has an *O*(1) overlap with a small number of patterns, i.e., *m*_{0} = *m ^{μ}*(0), the memory retrieval process can be sufficiently described by including an additional order parameter *r*(*t*), which represents the overlap of the state with all patterns except for *μ*. We can also define *m*^{0}(*t*) as the overlap of the state at time *t* with the initial state of the network after attempting to retrieve *σ ^{μ}*, where the overlap between *σ*(0) and *σ ^{μ}* is *m*_{0}. While *m*^{0} is typically not considered for random patterns, it becomes relevant in the case of structured knowledge, where the retrieval cue can be constructed from a subset of relations rather than simply a corrupted version of a memory.

For large *N*, the distributions of order parameters over fixed retrieval conditions, i.e., *p*(*m*), *p*(*r*) and *p*(*m*^{0}), are sharply peaked at the average values, denoted by 〈*m*〉, 〈*r*〉, and 〈*m*^{0}〉. In the limit *N* → ∞, these quantities depend solely on *m*_{0} and *α*.

#### B. Overlap Distributions

As discussed in Methods, the empirical distribution *p*(*m*) is bimodal and takes the general form

where *p*_{1} is the probability that a structure is perfectly retrieved from memory and *p*_{m<1}(*m*) corresponds to the distribution of *m* for imperfectly retrieved memories. The shape of the distribution is characterized by *p*_{1}, the width of the lower *m* mode, *σ _{m}*, and the mean of that mode, *m**. These quantities all depend on the initial overlap *m*_{0} used to retrieve the memory. Results for *p*_{1}, *m**, and *σ _{m}* are shown in Fig. 2a, for several values of *N* and two values of *α*.

We also compare, in Fig. 3, the overlap distributions for structures of length *L* with retrieval cues of length *L*_{0} = 2 to the overlap distribution for a Pseudo-inverse network with random patterns at the corresponding values of *m*_{0} given by Eqn. 35.

#### C. Retrieval Dynamics

When retrieving structures from memory (in the absence of noise), the update equation for the state of each neuron at time *t* is given by

For all of the simulations in the main text, we consider parallel updates in which all of the neurons 1 ≤ *i* ≤ *N* are updated simultaneously. We find that serial updates give qualitatively similar results to parallel updates. In general, for large *N* we find

where 〈*m*〉 → *m** as *N* → ∞. We find that serial updates obey the form in Eqn. 65 with a slightly smaller value of the coefficient *f* (*α*). Additionally, for serial dynamics, we can consider the robustness of our results under addition of noise in the updates. To do this, we use the Metropolis algorithm to obtain the final equilibrium state, where the amplitude of the noise is controlled by the inverse temperature *β*. At each update, the acceptance probability is given by

We see that Eqn. 66 becomes equivalent to Eqn. 64 in the noiseless case *β* = ∞. In Fig. 4, we show the average overlap 〈*m*〉 of the network state with the cued memory as a function of time for several different initial overlaps *m*_{0}, retrieved through parallel dynamics and through serial dynamics for several different values of *β*. In Fig. 5 we show 〈*m*〉, 〈*r*〉, and 〈*m*^{0}〉 as functions of the retrieval cue overlap *m*_{0}. We see that the highest overlap is achieved for parallel updates; the serial updates are done in random order (alternative orders are discussed in [2]). However, the results are qualitatively similar for the different dynamics shown, demonstrating that small amounts of noise have little effect.

In Fig. 5a and c, we see that outside of the basin of attraction of the cued memory, i.e., values of *m*_{0} where 〈*m*〉 < 1, the final network state retains some memory of the initial state reflected by 〈*m*^{0}〉 > *m*_{0}.
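The serial Metropolis dynamics can be sketched as follows (our illustration; *β*, the sizes, and the 20% cue corruption are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, beta = 300, 15, 20.0          # inverse temperature beta controls update noise

sigma = rng.choice([-1.0, 1.0], size=(P, N))
C = sigma @ sigma.T / N
J = sigma.T @ np.linalg.inv(C) @ sigma / N   # Pseudo-inverse weights
np.fill_diagonal(J, 0.0)

def metropolis_sweep(J, s, beta, rng):
    """One sweep of serial updates in random order with Metropolis acceptance."""
    for i in rng.permutation(len(s)):
        dE = 2.0 * s[i] * (J[i] @ s)         # energy cost of flipping s_i
        if dE < 0 or rng.random() < np.exp(-beta * dE):
            s[i] = -s[i]
    return s

s = sigma[0].copy()
s[rng.random(N) < 0.2] *= -1.0      # corrupt the cue down to overlap ~0.6
for _ in range(10):
    s = metropolis_sweep(J, s, beta, rng)
m = s @ sigma[0] / N                # should relax back close to the memory
```

At large *β* the acceptance rule effectively reduces to the deterministic serial update, consistent with the noiseless limit described above.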

#### D. Comparing Learning Rules

In the main text, we considered storing structured memories in recurrent neural networks where the synaptic weights are set via the Pseudo-inverse rule. This learning rule fully de-correlates linearly independent memories so that each memory is a perfect fixed point for *α* < 1. However, the Pseudo-inverse rule is both non-local and non-incremental. This makes it unlikely to be biologically implemented in a straightforward manner.

We now consider two learning rules that are both local and incremental. The first is the Hebb rule and the second is the Storkey rule proposed in [3–5] (see Eqns. (26)–(28) in Methods). The Storkey rule can be viewed as a biologically plausible approximation to the Pseudo-inverse rule for small *α* [4]. As a result, the storage capacity of this rule, *α _{c}* ≈ 0.4, is significantly higher than that of the Hebb rule (*α _{c}* ≈ 0.14), and the basins of attraction are larger and more even across different memories [5].

The storage and retrieval of random memories in Hopfield networks near saturation, i.e., when the number of memories *P* scales linearly with *N*, is limited by interference between different memories [6]. The overlaps between different memories are characterized by the matrix *C ^{μν}*. Note that *C ^{μν}* is a *P* × *P* matrix that has the form of a sample covariance matrix.

We can compare the Pseudo-inverse, Hebb, and Storkey rules by looking at the forms of the local fields given by

From Eqn. 64, we see that a memory is a fixed point if the local field satisfies *h _{i}^{μ}σ_{i}^{μ}* > 0 for all *i* = 1, …, *N*. The elimination of the self-coupling term *J _{ii}* greatly increases the basins of attraction of the patterns. In the limit *N, P* → ∞ this can be accomplished by replacing *J _{ij}* with *J _{ij}* − *αδ _{ij}* [2]. The interference between memories is contained in the noise term in the local field.

We start by considering the Hebb rule, with synaptic weight matrix given by

The local field for the Hebb rule can be expressed as

Next we consider the Pseudo-inverse rule with synaptic weight matrix expressed in terms of *C ^{μν}* as

It is useful to decompose the state of the network *σ _{i}*(*t*) into two parts: a component along the stored patterns, with coefficients *a ^{μ}*, plus a residual *δσ _{i}*(*t*) orthogonal to all of the patterns, where *a ^{μ}* is related to the order parameter *m ^{μ}*(*t*).

The local field for the Pseudo-inverse rule can then be expressed as

We can see how Eqn. 74 suppresses the effects of overlaps by considering the state *σ* = *σ*^{1}. Then *a ^{μ}* = *δ _{μ1}* regardless of any correlations between the memories. This implies that each memory is an eigenvector of the weight matrix with eigenvalue (1 − *α*), so that all memories are perfect fixed points for *α* < 1. By contrast, we see from Eqn. 70 that the local field for the Hebb rule contains additional noise terms arising from overlaps with the other patterns. This noise reduces the capacity of the network as well as the size of the basins of attraction for each memory, both of which are further reduced if the patterns are not random and contain correlations.

A simplified form of the Storkey rule, discussed in [4, 5, 7], is given in Eqns. 27 and 28 of Methods. Following the analysis in [7], we can find a more compact approximation for the resulting weight matrix. We start with *P* = 1, then consider *P* = 2, and for *P* memories, keeping terms up to the leading corrections, the weights can be expressed as a truncated series. We can relate this to the Pseudo-inverse rule given in Eqn. 71 by rewriting the latter using the series expansion of *C*^{−1} (Eqn. 81). Plugging this expansion back into Eqn. 71 gives an expression identical to the one in the last line of Eqn. 80, up to higher-order terms.

The expansion of *C* in Eqn. 81 converges if the eigenvalues of *C* are all contained in the interval [0, 2]. For random memories with components drawn iid, the distribution of the eigenvalues of *C ^{μν}* for *P*, *N* → ∞ is given by the Marchenko-Pastur distribution, with support *λ* ∈ [*λ*_{−}, *λ*_{+}], where *λ*_{±} = (1 ± √*α*)^{2}. Since *λ*_{+} < 2 requires *α* < 0.17, the expansion of *C* is valid for *α* < 0.17, implying that the Storkey rule approximates the Pseudo-inverse rule for sufficiently small *α* and large *N*.

In Fig. 6 we compare 〈*m*〉 as a function of *m*_{0} for two values of *α* for the three learning rules, and in Fig. 7 we show *m _{min}* (as defined in the main text) as a function of *α*.

#### E. Connection to Storage of Knowledge Structures

While the various learning rules and dynamics have quantitative effects on the values on the various network order parameters at fixed memory load and network size, their qualitative behavior as a function of the retrieval cue overlap *m*_{0} remains roughly the same for small values of *α* below capacity.

In general, we have found that outside of the basin of attraction, 〈*m*〉 (and *m**) scales linearly with *m*_{0} for all three learning rules and takes the form in Eqn. 65, where the coefficient *f*(*α*) depends on the learning rule.

In all cases, for large *N* where 〈*m*〉 → *m**, the SNR behaves as
As a result, while the choice of learning rule and retrieval dynamics modify the value of *f* (*α*) defined in Eqn. 65, they do not change the scaling behavior in Eqn. 85.

We compare *P _{ϵ}* for structures stored using the Hebb, Storkey, and Pseudo-inverse rules in Fig. 8a. In general, we find that for low values of *α* ≲ 0.3, the error of decoding from memories stored via the Storkey rule is very similar to that of the Pseudo-inverse rule. Both are significantly lower than the error obtained when decoding from memories stored using the Hebb rule. This suggests that the Storkey learning rule sufficiently decorrelates the different structures to allow for both efficient storage and retrieval of structured knowledge.

### IV. TEMPORAL SEQUENCES AS SEQUENCES OF ATTRACTORS

Previously in [8, 9], it was shown that a temporal sequence of memories could be stored in a Hopfield network by adding a second asymmetric synaptic interaction of the form

$$J^{S}_{ij} = \frac{\lambda}{N} \sum_{\mu=1}^{L-1} \xi_{i}^{\mu+1} \xi_{j}^{\mu}$$

to the synaptic weight matrix, where *L* < *P* and *λ* > 1.

We can store *Q* sequences in the network by summing together a contribution for each sequence, i.e.,

$$J^{S}_{ij} = \frac{\lambda}{N} \sum_{q=1}^{Q} \sum_{\mu=1}^{L_{q}-1} \xi_{i}^{q,\mu+1} \xi_{j}^{q,\mu},$$

as long as each memory is contained in only one sequence.

To retrieve a sequence, one starts with an initial state *σ*_{0} = *σ*(0) within the basin of attraction of the first memory in the sequence. The evolution of the network state is then given by

$$\sigma_{i}(t+1) = \mathrm{sgn}\left( \sum_{j} J_{ij}\,\sigma_{j}(t) + \sum_{j} J^{S}_{ij}\,\bar{\sigma}_{j}(t) \right),$$

where $\bar{\sigma}_{j}(t)$ is defined as

$$\bar{\sigma}_{j}(t) = \sum_{t' \le t} w(t - t')\,\sigma_{j}(t'),$$

where the function *w*(*t*) represents a dynamic memory characterized by time constant *τ*. A simple choice for *w*(*t*) uses the heaviside function, which we use for the simulation results shown in Fig. 6d of the main text. *J*_{ij} is given by a learning rule for the conventional Hopfield network such as the Hebb, Storkey, or Pseudo-inverse rules.

For long sequences, transitions between attractors will be equally spaced with period *t*_{0} after initial transients, and the system will be in pattern *μ* in the time interval ((*μ* − 1)*t*_{0}, *μt*_{0}). For a heaviside *w*(*t*), *t*_{0} is set by the time constant *τ*.
We can assess the performance of this sequence memory by looking at *m ^{μ}*(*t*), given in Eqn. 60, for each pattern in a sequence. Good performance requires *m ^{μ}*(*t*) ≈ 1 during the interval *t* ∈ ((*μ* − 1)*t*_{0}, *μt*_{0}) and *m ^{μ}*(*t*) ≪ 1 for all other *t*.
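The transition dynamics above can be sketched in a small simulation. This is a simplified discrete-time version with an assumed box-filter *w* that averages the last *τ* network states; the sizes, *λ*, and *τ* are illustrative, not the values used in the paper's figures.

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 500, 5                        # illustrative network size, sequence length
lam, tau = 2.0, 4                    # lambda > 1; memory time constant in steps

xi = rng.choice([-1.0, 1.0], size=(L, N))   # one sequence of L random patterns

J = xi.T @ xi / N                            # symmetric Hebb term
JS = lam * xi[1:].T @ xi[:-1] / N            # asymmetric term mapping mu -> mu+1

sgn = lambda x: np.where(x >= 0, 1.0, -1.0)

sigma = xi[0].copy()                         # cue: first pattern of the sequence
history = [sigma.copy()]
overlaps = []
for t in range(60):
    past = np.mean(history[-tau:], axis=0)   # box-filter "dynamic memory"
    sigma = sgn(J @ sigma + JS @ past)
    history.append(sigma.copy())
    overlaps.append(xi @ sigma / N)          # m^mu(t) for every pattern
```

The recorded overlaps show the network dwelling in each pattern for roughly *τ* steps before the delayed asymmetric drive pushes it to the next one, and settling on the final pattern of the sequence, which has no outgoing asymmetric connection.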

Note that if we store two sequences that share items in the same memory network, transitions between attractors in the dynamics given in Eqn. 87 are no longer unique.

## ACKNOWLEDGEMENTS

We thank Tankut Can, Naoki Hiratani, Mikhail Katkov, Kamesh Krishnamurthy, Andrew Saxe, Nimrod Shaham, and Misha Tsodyks for fruitful discussions. J.S. acknowledges support from the NSF through the Center for the Physics of Biological Function (PHY-1734030) and computational resources from the Princeton Research Computing at Princeton University, a consortium of groups led by the Princeton Institute for Computational Science and Engineering (PICSciE) and Office of Information Technology’s Research Computing. H.S. acknowledges support from the Swartz Program in Theoretical Neuroscience at Harvard, the Gatsby Charitable Foundation, and NIH grant NINDS (1U19NS104653).

## Footnotes

Fixed abstract placement

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].