Recall tempo of Hebbian sequences depends on the interplay of Hebbian kernel with tutor signal timing

Understanding how neural circuits generate sequential activity is a longstanding challenge. While foundational theoretical models have shown how sequences can be stored as memories with Hebbian plasticity rules, these models considered only a narrow range of Hebbian rules. Here we introduce a model for arbitrary Hebbian plasticity rules, capturing the diversity of spike-timing-dependent synaptic plasticity seen in experiments, and show how the choice of these rules and of neural activity patterns influences sequence memory formation and retrieval. In particular, we derive a general theory that predicts the speed of sequence replay. This theory lays a foundation for explaining how cortical tutor signals might give rise to motor actions that eventually become “automatic”. Our theory also captures the impact of changing the speed of the tutor signal. Beyond shedding light on biological circuits, this theory has relevance in artificial intelligence by laying a foundation for frameworks whereby slow and computationally expensive deliberation can be stored as memories and eventually replaced by inexpensive recall.


Introduction
An important class of animal behaviors are those that are consolidated into "automatic", well-practiced sequential routines. This perspective is embodied by models featuring temporally asymmetric Hebbian (TAH) learning rules. Despite the long and illustrious legacy of this class of models (31-41), previous studies were limited by the narrow class of learning rules that these models can describe, and by the fact that these learning rules do not correspond to those observed in the brain. In this work we extend the basic TAH models previously considered to a much richer class of models. Not only do we derive a theory that describes the essential elements of sequence dynamics, including tempo and noise robustness, for learning rules currently seen in the brain; our theory also describes a wide class of TAH learning rules that give rise to sequential dynamics, covering the case of new learning rules that may be discovered in biology in the future.
TAH network models are built upon Hebbian learning rules, wherein the network state is set by the tutor to a sequence of activity patterns, one after another. These activity patterns correspond to moments in behavioral sequences, and the Hebbian learning rules store the structure of these sequences in the network as a memory. The sequences can then be recalled by setting the network to the first state in a desired sequence, resulting in the execution of the now "automatic", well-practiced sequential behavior. These models can be seen as a generalization of Hopfield networks (42) to storing sequences rather than static patterns; therefore, many of the motivations and rationale behind the Hopfield model are applicable here, such as the ability to "complete" a sequence based on partial inputs (see Discussion for more in-depth interpretations of this model).
While our theory is very general, we demonstrate its applicability and utility by focusing on two fundamental and biologically relevant cases: (1) TAH learning rules with fast timescales relative to the tutor signal, and (2) double-sided exponential TAH learning rules as are commonly seen in biology (43). In both cases we find useful principles that are interpretable in the context of biology. For example, weight structures (and TAH learning rules) can be adjusted such that the tempo remains the same, but robustness to noise increases. We also show how networks with the same qualitative dynamical properties can arise from very different weight structures. For double-sided exponential TAH kernels, we show that the tempo of sequence recall depends linearly on the speed of the tutor signal, with a slope that depends in a simple way on
the parameters of the kernel. While we focus here on applications to neuroscience, given its generality our analysis may also be of interest to theorists in other domains studying dynamical systems with asymmetric weight structures.

Sequence model definition
The sequence is stored and produced by a recurrently connected network of N neurons. The connectivity is given by a recurrent weight matrix W, where W_ij is the connectivity strength of neuron j to neuron i. The dynamics of the network are then given by

ṙ = −r + ϕ(W r + η(t)),  [1]

where η(t) is white noise (⟨η_i(t)⟩ = 0 and ⟨η_i(t)η_i(t′)⟩ = ρ² δ(t − t′)) and ϕ is a sigmoidal nonlinearity (see Appendix 1 for more details). Note that it is common to add a timescale factor τ multiplying the term ṙ and a factor √τ multiplying η(t), which rescale the timescale of the dynamics.
Since this effect is relatively simple to understand, we simply take τ = 1 and focus our attention on other attributes.
The sequence is stored in this network by a tutor signal (Fig. 1a). Given a sequence of length P, this signal sets the state of the network sequentially to an ordered set of P patterns (ξ^µ)_{µ=1}^P. These P patterns represent P neural states that correspond to the states making up the sequential behavior. The recurrent network memorizes these patterns by synaptic plasticity at the recurrent synapses.
To store a memory of this sequence, this synaptic plasticity can take the form of a temporally asymmetric Hebbian (TAH) learning rule active during application of the tutor signal (31, 32, 37, 44). An example of a TAH learning rule commonly seen in the literature associates presynaptic activity with postsynaptic activity a fixed offset ∆t later,

∆W_ij ∝ ∫ dt r_i(t + ∆t) r_j(t),  [2]

which, applied during the tutor signal with ∆t matched to the pattern duration, yields the weights

W = (1/N) Σ_{µ=1}^{P−1} ξ^{µ+1} (ξ^µ)^⊤.  [3]

We refer to this weight structure as sequi-associative. Intuitively, this set of synaptic weights calculates the overlap between the current network state and the stored patterns, and if the overlap with one of the patterns dominates over the others (say pattern ν), it steers the network activity to the pattern that follows the dominant pattern (pattern ν + 1).
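A minimal simulation illustrates sequi-associative recall. The sketch below integrates Eq. (1) without noise, using binary patterns and a tanh nonlinearity as stand-ins for the choices made in the paper (which uses Gaussian patterns and an error-function-like ϕ); the qualitative behavior, an overlap bump passing through the stored sequence, is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 800, 6
xi = rng.choice([-1.0, 1.0], size=(P, N))       # binary patterns (stand-in choice)

# Sequi-associative weights: W = (1/N) sum_mu xi^{mu+1} (xi^mu)^T
W = sum(np.outer(xi[m + 1], xi[m]) for m in range(P - 1)) / N

phi = lambda x: np.tanh(2.0 * x)                # assumed sigmoidal nonlinearity
dt, T = 0.02, 15.0
steps = int(T / dt)
r = xi[0].copy()                                # initialize to the first pattern
overlaps = np.empty((steps, P))
for i in range(steps):
    overlaps[i] = xi @ r / N                    # q_mu(t) = (xi^mu . r) / N
    r += dt * (-r + phi(W @ r))                 # Euler step of dr/dt = -r + phi(W r)

peak_times = overlaps.argmax(axis=0) * dt
print(peak_times)                               # peaks occur in sequence order
```

The overlaps peak one after another, reproducing the sequential structure of Fig. 1d.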

While sequi-associative weights capture some temporal aspects of synaptic plasticity, they fall short in their ability to express the full complexity of plasticity dynamics observed in experiments (43, 45, 46), which can significantly affect synaptic structure and network dynamics (47-49). Mathematically, a more general Hebbian learning rule can be written as

∆W_ij ∝ ∫ ds ∫ dt w(s − t) r_i(s) r_j(t),  [4]

where w is a general kernel/filter that weights the Hebbian learning rule as a function of time offset. One sees that Eq. (2) corresponds to w(s − t) = δ(s − t − ∆t). However, biologically observed kernels take much richer and different forms.

In this paper, we consider the learning of sequences under a tutor signal with the general TAH rule (Eq. (4)). This rule can give rise to a much richer set of weights defined by

W = (1/N) Σ_{µ,ν} a_ν^µ ξ^µ (ξ^ν)^⊤,  [5]

where a_ν^µ are coefficients that satisfy

a_ν^µ = ∫_{t_ξ^µ}^{t_ξ^µ + T_ξ^µ} ds ∫_{t_ξ^ν}^{t_ξ^ν + T_ξ^ν} dt w(s − t).  [6]

Here t_ξ^ν is the time at which the tutor signal for pattern ν begins and T_ξ^µ = t_ξ^{µ+1} − t_ξ^µ is the duration for which pattern µ is shown to the network. Our goal is to understand the sequential dynamics and its tempo arising from such learning rules.
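The structure of the weights in Eq. (5) can be sketched numerically. The coefficient values below are hypothetical; the check at the end simply projects W back onto pattern pairs, which should recover the coefficients up to O(1/√N) sampling noise:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 2000, 8
xi = rng.standard_normal((P, N))                 # Gaussian patterns, as in the text
a = {-1: -0.1, 0: 0.2, 1: 0.7}                   # hypothetical offset coefficients a_k

# Eq. (5) with uniform tutor intervals: W = (1/N) sum_{mu,k} a_k xi^{mu+k} (xi^mu)^T
W = np.zeros((N, N))
for mu in range(P):
    for k, ak in a.items():
        if 0 <= mu + k < P:
            W += ak * np.outer(xi[mu + k], xi[mu]) / N

# Projecting W onto pattern pairs recovers the coefficients
proj = xi @ (W @ xi[3]) / N                      # entry nu: (xi^nu)^T W xi^3 / N
print(proj[2], proj[3], proj[4])                 # ~ a_{-1}, a_0, a_1
```

This also makes concrete why a forward coefficient a_1 pushes activity from each pattern to its successor: W maps ξ^µ (approximately) onto the combination Σ_k a_k ξ^{µ+k}.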
We should note that other works have also considered extensions to Eq. (3). For example, some works (32, 34, 50, 51) include an autoassociative component ξ^µ(ξ^µ)^⊤ in the weights; (32, 34) use discrete dynamics that selectively low-pass filter the sequi-associative component, and (50) multiplies the sequi-associative term with random noise. The autoassociative term modulates the pattern transition speed. The authors of (40) conduct simulation studies of a model with an additional prae-associative term ξ^{µ−1}(ξ^µ)^⊤ that has the same strength and opposite sign as the sequi-associative component.
(35) derive discrete dynamics in terms of the parameters of an exponentially decaying kernel, giving rise to terms of the form ξ^{µ+k}(ξ^µ)^⊤ for k ≥ 0; however, mathematical analysis of this model is limited to deriving the mean-field dynamics and computing some aspects of stability (there are no calculations related to tempo). (36) similarly consider terms of the form ξ^{µ+k}(ξ^µ)^⊤, but again focus on capacity and stability properties.
Returning to our model, to quantify the sequential behavior of the network dynamics, we define the overlaps

q_µ(t) = (1/N) (ξ^µ)^⊤ r(t),  [7]

which measure the similarity of the recurrent network activity r at time t to stored patterns ξ^µ. For our initial condition, we will always initialize the network state to the first pattern ξ^1. Note that the dynamics of the network in the basis of the neurons may not reveal the sequential structure (Fig. 1c), while the overlaps q_µ clearly display this structure (Fig. 1d).
To measure sequence progression, we track the location in time t_µ of the maxima of the overlaps q_µ(t); at time t_µ we say the sequence is in state µ (see vertical lines in Fig. 1d).* We also define p_µ := q_µ(t_µ) to be the height of the peaks. We are particularly interested in the speed of the sequence progression, which we measure by the peak difference d_µ := t_µ − t_{µ−1}. Note that the coefficients a_ν^µ can depend inhomogeneously on µ, resulting in a tempo of sequence recall that can be controlled along the sequence.

Mean-field analysis

To gain analytical insight into the behavior of our model, we extend the mean-field analysis of (37) to the general weights described by Eq. (5). We assume that the patterns are drawn identically and independently from a Gaussian distribution, ξ_i^µ ∼ N(0, 1), and take the network population size to infinity, N → ∞.

In this mean-field limit the overlaps evolve with dynamics given by (see Appendix 1)

q̇_µ = −q_µ + g(t) Σ_{ν=1}^P a_ν^µ q_ν,  [9]

for some gain function g(t) determined by a function G. We give the full functional form of G(x) in Appendix 1 (Eq. (20)), which depends on the shape of ϕ. For the analysis here, the important aspect of G is the value of G(ρ²), which serves as an upper bound on g(t). In particular, for ρ = 0, g(t) is bounded from above by G(0), and G(0) = 20/√(2π) ≈ 8 for our choice of parameters for ϕ.

The convergence with increasing N of the mean-field solutions to the full network solutions is illustrated in Fig. 2. This convergence holds both in the case of uniform tutor signal intervals (Figs. 2a and 2b) and non-uniform tutor intervals (Figs. 2c and 2d).
For the following, we assume that tutor signal intervals are uniform in temporal duration, so a_ν^µ = a_{µ−ν}. We will revisit the case of non-uniform intervals in the final subsection of Results.
We next analyze the mean-field theory defined by Eq. (9) to gain insight into the stability properties of sequence generation. To analyze Eq. (9), we use the assumption that g(t) asymptotes to a constant value over time, and define g∞ = lim_{t→∞} g(t). We validate this assumption in simulations below.
A fundamental quantity in determining sequence dynamics is

ḡ := (Σ_k a_k)^{−1},

where the sum is over all coefficients.
For the following we consider the case of large N, and reiterate that these analyses treat the case of uniform tutor signal intervals. In Appendix 2 we show that in this case, and under other mild assumptions, the solutions to the network dynamics Eq. (1) are stable if and only if g∞ = ḡ. Further, we reason and observe that the network will endeavor to satisfy this stability condition, so that ḡ < G(ρ²) is necessary and sufficient for g∞ = ḡ. Motivated by this, we define ā_k := ḡ a_k and consider the linear equations

q̇_µ = −q_µ + Σ_{ν=1}^P ā_{µ−ν} q_ν.  [10]

Eq. (10) can be expected to approximate Eq. (9) when the solutions to Eq. (9) are stable (equivalently, when ḡ < G(ρ²)).
When the solutions to Eq. (9) are unstable (when ḡ > G(ρ²)), these solutions will vanish with increasing µ while the solutions to Eq. (10) will remain stable. We will test the goodness of the linear approximation Eq. (10) to Eq. (9) in simulations.
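The linear dynamics Eq. (10) can be integrated directly. The sketch below uses an assumed coefficient pair ā_0 = 0.4, ā_1 = 0.6 (normalized so that they sum to one) and extracts the peak times of the overlaps:

```python
import numpy as np

P = 12
abar = {0: 0.4, 1: 0.6}                    # normalized coefficients: sum abar_k = 1
A = np.zeros((P, P))
for k, v in abar.items():
    A += v * np.eye(P, k=-k)               # entry (mu, nu) receives abar_{mu - nu}

dt, T = 0.01, 40.0
q = np.zeros(P); q[0] = 1.0                # initial condition q_mu(0) = delta_{mu,1}
steps = int(T / dt)
traj = np.empty((steps, P))
for i in range(steps):
    traj[i] = q
    q = q + dt * (-q + A @ q)              # Euler step of Eq. (10)

t_peak = traj.argmax(axis=0) * dt
d = np.diff(t_peak)
print(d)                                   # all close to 1/abar_1 = 1/(1 - abar_0)
```

The peak differences come out essentially constant, which is the behavior the one-forward-term analysis below predicts.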
Note that ḡ can be related to the TAH kernel w in a simple way. To see this, we first use Eq. (6):

ḡ^{−1} = Σ_ν a_ν^µ = ∫_{t_ξ^µ}^{t_ξ^µ + T_ξ} ds ∫_0^T dt w(s − t).

Assuming that w(δt) becomes vanishingly small for large |δt| and that T ≫ T_ξ, the inner integral can be approximately replaced with one over the real line:

ḡ^{−1} ≈ T_ξ ∫_{−∞}^{∞} w(δt) d(δt).

Note that this equation means the network can be stabilized by extending the duration of the tutor signal T_ξ. Therefore, a biologically plausible route to embedding stable sequences is to extend the tutor signal duration. As we will show below, increasing ḡ^{−1} also increases the robustness of the sequence dynamics to noise.

One forward term
We next turn to analyzing sequence progression in the recurrent neural network. We start by considering the case with one forward term in Eq. (10), where a_k = 0 for k ∉ {0, 1}. This case is relevant in the regime where the duration of sequence elements, T_ξ, is significantly longer than the timescale of w(t) (i.e., w(δt) ≈ 0 for δt > T_ξ and w(δt) = 0 for δt < 0). Figure 3a illustrates two possible kernels w corresponding to this case.

In this case Eq. (10) becomes

q̇_µ = −q_µ + ā_0 q_µ + ā_1 q_{µ−1}.

The solutions to these equations are (recall the initial conditions q_µ(0) = δ(µ − 1))

q_µ(t) = e^{(−1+ā_0)t} (ā_1 t)^{µ−1}/(µ − 1)!,  [11]

which have maxima at

t_µ = (µ − 1)/(1 − ā_0).

Note that ā_0 + ā_1 = 1 by definition, which places some restrictions on the values that ā_0 and ā_1 can take. For instance, ā_0 > 1 implies ā_1 < 0. This ensures that 0 ≤ q_µ(t) ≤ 1 for all choices of a_0 and a_1. The (inverse) speed as measured by peak difference is

d_µ = 1/(1 − ā_0) = 1/ā_1 = (a_0 + a_1)/a_1.

Figure 3 shows the simulation results for various values of a_0, a_1, and ρ. Figure 3b plots peak times t_µ for ρ = 0. The full network simulations with N = 30000 match the linear approximations Eq. (11) closely, except for parameter values where the solutions for the full network decay to zero (denoted by C). This happens because ḡ is larger than the critical value of G(0) ≈ 8. Note the significant role of a_0 in determining the sequence speed. In particular, for a_0 = 0 changing a_1 has no impact on the speed († symbol in Fig. 3b). Note also that, unlike in standard Hopfield models, the autoassociative term a_0 can be negative. The bottom panel of Fig. 3a shows an example of a kernel that gives rise to negative a_0 and positive a_1.
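The peak-time result can be checked directly against the closed-form solution of the one-forward-term case, reconstructed here as q_µ(t) = e^{(−1+ā_0)t}(ā_1 t)^{µ−1}/(µ−1)! with maxima at t_µ = (µ−1)/(1−ā_0); the coefficient values below are arbitrary choices satisfying ā_0 + ā_1 = 1:

```python
import numpy as np
from math import factorial

ab0, ab1 = 0.3, 0.7                        # arbitrary values with ab0 + ab1 = 1
t = np.linspace(1e-6, 30.0, 300000)
for mu in range(2, 8):
    q = np.exp((-1 + ab0) * t) * (ab1 * t) ** (mu - 1) / factorial(mu - 1)
    t_num = t[q.argmax()]                  # numerical maximum of the closed form
    t_th = (mu - 1) / (1 - ab0)            # predicted peak time
    print(mu, t_num, t_th)
```

Setting the derivative of q_µ(t) to zero gives (−1+ā_0) + (µ−1)/t = 0, which is the peak-time formula; the numerical maxima agree to the grid resolution.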

Figure 3c plots the function g(t). This plot supports our original assumption that g(t) asymptotes to a constant value, and shows the close match between g∞ and ḡ, except in the case where ḡ > G(0) ≈ 8.
To look more closely at the relative impacts of a_0 and a_1 on sequence speed, we plot sequence speed for different combinations of a_0 and a_1 in Fig. 3d. Figure 3d shows that the dependence of d_µ on a_0 is roughly linear in the range plotted, with a slope that decreases with a_1. Parameter ranges were chosen in an attempt to capture the full behavior of the dynamics within the region of stable sequence generation; in particular, this requires that a_1 > 0 and a_0 + a_1 > 1/G(0). Note that for a_0 < 0 sequence speed decreases, counter-intuitively, with increasing a_1, while the opposite relationship holds for a_0 > 0.
This can be seen from the expression for d_µ above. Note that a_0 and a_1 can be adjusted so that sequence progression speed d_µ remains constant, but stability is improved by decreasing ḡ. This can most easily be seen by considering lines of constant d_µ in Fig. 3d. The most straightforward example is the line d_µ = 1 in the right panel of Figure 3d, which clearly corresponds to many values of a_0 + a_1. This is particularly important in the presence of noise, as shown by Fig. 3e. This degree of freedom makes some kernels w(t) strictly better than others, even if they result in the same speed.

General approximate solution
While we were able to explicitly solve Eq. (10) in the case of two nonzero terms in the preceding section, in general an explicit solution is not available and we need to resort to approximations. To approximately solve Eq. (10), we replace the equation with one that is periodic in µ. A general approximate solution to the equations with these periodic boundary conditions can then be found (see Appendix 2):

t_µ ≈ (µ − 1)/α,  where α := Σ_k k ā_k.  [12]

Sequence speed as measured by peak difference is then d_µ ≈ 1/α. When α > 0, these equations describe forward propagation of the sequence.
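The general prediction d_µ ≈ 1/α can be checked against simulations of Eq. (10) for arbitrary coefficient sets, assuming the drift coefficient takes the form α = Σ_k k ā_k (the form consistent with the special cases treated here). The two coefficient sets below, one bidirectional and one with two forward terms, are hypothetical examples:

```python
import numpy as np

def peak_diffs(abar, P=24, dt=0.005, T=55.0):
    """Peak differences of the linear overlap dynamics, Eq. (10)."""
    A = np.zeros((P, P))
    for k, v in abar.items():
        A += v * np.eye(P, k=-k)           # entry (mu, nu) receives abar_{mu - nu}
    q = np.zeros(P); q[0] = 1.0
    best = q.copy(); arg = np.zeros(P, dtype=int)
    for i in range(int(T / dt)):
        q = q + dt * (-q + A @ q)          # Euler step
        upd = q > best                     # running argmax of each overlap
        best[upd], arg[upd] = q[upd], i + 1
    return np.diff(arg * dt)

for abar in ({-1: 0.1, 0: 0.3, 1: 0.6}, {0: 0.2, 1: 0.5, 2: 0.3}):
    alpha = sum(k * v for k, v in abar.items())
    print(1 / alpha, peak_diffs(abar)[10:20].mean())
```

In both cases the interior peak differences settle near 1/α, up to a constant offset in the peak times themselves.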

Three terms with bidirectional connectivity
Next we consider the case with three terms and bidirectional connectivity in Eq. (10), where a_k = 0 for k ∉ {−1, 0, 1}. Bidirectional connectivity occurs with TAH kernels commonly observed in biology, such as the double-sided exponential kernel (Fig. 1b, right).

Sequence speed as measured by peak difference is then

d_µ ≈ 1/(ā_1 − ā_{−1}) = (a_{−1} + a_0 + a_1)/(a_1 − a_{−1}).

The effect of changing a_{−1} on sequence speed is revealed by taking the derivative of d_µ with respect to a_{−1}:

∂d_µ/∂a_{−1} = (a_0 + 2a_1)/(a_1 − a_{−1})².

This shows that increasing a_{−1} increases d_µ provided a_0 + 2a_1 > 0. To compare the relative impact of changing a_{−1} and a_1 on sequence speed, we can take the ratio of derivatives:

(∂d_µ/∂a_{−1})/(∂d_µ/∂a_1) = −(a_0 + 2a_1)/(a_0 + 2a_{−1}).

Figure 4 compares the full network simulations, the simulations of the linear system Eq. (10), and the approximation Eq. (14). The peak times t_µ are plotted in Fig. 4b, showing that Eq. (14) is a good estimate of both the linear and nonlinear systems. Figure 4c shows the relationship between d_µ, a_0, and a_1 for fixed a_{−1} = −0.2. The left panel reveals that d_µ increases approximately linearly with a_0, with a slope that decreases with a_1. Comparison with Fig. 3d shows the impact of introducing a_{−1}, which can be mixed but is generally to decrease d_µ over the parameter regime considered.
Figure 4d shows the relationship between d_µ, a_0, and a_1 for fixed positive a_{−1} = 0.2. Again, the left panel reveals that d_µ increases approximately linearly with a_0, with a slope that decreases with a_1. Comparing with Fig. 3d shows how a_{−1} = 0.2 typically increases d_µ relative to a_{−1} = 0.
As in the case with two terms, the speed of the sequence can be held constant while the stability term ḡ is decreased. Indeed, the introduction of a_{−1} introduces an additional degree of freedom. Examples of keeping d_µ fixed while varying ḡ can be seen by following horizontal lines of fixed d_µ in Figs. 4c and 4d.
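The derivative statements above can be sanity-checked by finite differences on the reconstructed expression for d_µ; the coefficient values below are hypothetical:

```python
# Peak-difference expression for the three-term bidirectional case
d_mu = lambda am1, a0, a1: (am1 + a0 + a1) / (a1 - am1)

am1, a0, a1 = 0.1, 0.3, 0.8              # hypothetical coefficient values
eps = 1e-6
numeric = (d_mu(am1 + eps, a0, a1) - d_mu(am1 - eps, a0, a1)) / (2 * eps)
analytic = (a0 + 2 * a1) / (a1 - am1) ** 2
print(numeric, analytic)                 # positive: increasing a_{-1} slows recall
```

Both evaluations agree, confirming that for a_0 + 2a_1 > 0 a more positive backward coefficient lengthens the peak difference (slower recall).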

Three terms with forward connectivity
Now we consider the case with two forward terms, where a_k = 0 for k ∉ {0, 1, 2}. This illustrates the effects of slower-timescale TAH kernels, roughly equal to or a bit slower than the timescale of the tutor signal. Figure 5a illustrates a TAH kernel that would give rise to two forward terms. Using Eq. (12):

t_µ ≈ (µ − 1)/(ā_1 + 2ā_2).  [15]

Sequence speed as measured by peak difference is then d_µ ≈ 1/(ā_1 + 2ā_2). Figure 5 compares the full network simulations, the simulations of the linear system Eq. (10), and the asymptotic approximation of the peak times Eq. (15), showing a close match for all three quantities. Similar to the previous cases, the speed of the sequence can be held constant while ḡ is decreased.

Fig. 5. Plots show the full network simulations (solid lines), linear approximations Eq. (15) (dashed lines), and approximation Eq. (12) (dotted lines). b) Peak times t_µ for a variety of coefficient combinations (a_0, a_1, a_2), denoted by color. Dotted lines are the linear approximation given by Eq. (15). The † symbol denotes parameter values that give rise to the same peak times. c) Plots of peak differences d_µ as a function of a_0 and a_1. Here a_2 = 0.2. Shaded regions are 95% confidence intervals for µ ∈ {2, ..., 10}, and lines are the means. Left: a_0 is plotted on the horizontal axis and a_1 is denoted by color. Right: a_1 is plotted on the horizontal axis and a_0 is denoted by color. A) Overlaps q_µ(t) corresponding to parameter value (0.4, 0.6, −0.1). B) Overlaps q_µ(t) corresponding to parameter value (−0.4, 0.6, 0.4).

Exponential kernels
We compute in Appendix 2 that the approximate peak times are given by Eq. (12), where α is an explicit function of the kernel parameters τ_1, τ_2, and m_2 and the tutor interval T_ξ. Here T_ξ is the length of the tutor signal interval. Note in particular that the (inverse) sequence speed as measured by d_µ scales linearly with T_ξ. This dependence provides a biologically plausible mechanism for the tutor to control the speed of the recalled sequence.
The quantity d_µ is plotted in Figs. 6b and 6c for a variety of different kernels as a function of tutor interval T_ξ. These plots compare the full network simulations with the linear approximation Eq. (10) and the approximation Eq. (12), showing a close match, especially for small T_ξ.
In Fig. 6d we plot the dependence of inverse sequence speed d_µ on τ_1, τ_2, and m_2. The left panel shows that d_µ typically decreases with increasing τ_2, but this can be reversed for large enough τ_1. The center panel shows that, for the parameter values chosen, d_µ decreases with τ_1. The right panel shows the influence of changing m_2 on d_µ; while increasing m_2 increases d_µ, the effect is small. Parameter values were chosen to demonstrate the full spectrum of behaviors within the regime of stable, forward-propagating dynamics.
To look more closely at the interplay between the timescale of the TAH kernel and the timescale of the tutor signal, we consider the case where τ_1 is small relative to τ_2. Then α ≈ τ_2/T_ξ, indicating that in this case the ratio of the timescales sets the sequence speed.

(Partial figure caption: tutor signal intervals drawn from the ranges [3, 18], [20, 28], and [50, 80]; d) plot of overlaps q_µ(t), with dashed vertical lines demarcating the three regions. The figure illustrates that after transitioning, the sequence speeds up or slows down to match the new duration.)
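This scaling can be illustrated numerically. The sketch below takes the τ_1 → 0 limit as a purely forward, unit-amplitude exponential kernel (an assumed simplification of the double-sided form), integrates it over tutor intervals to obtain the a_k, and recovers both α ≈ τ_2/T_ξ and ḡ^{−1} ≈ T_ξ ∫w:

```python
import numpy as np

def coeffs(T, tau2, kmax=3, n=1500):
    """a_k from double-integrating w(d) = exp(-d/tau2), d > 0, over tutor intervals."""
    t = (np.arange(n) + 0.5) * T / n               # midpoint grid on pattern mu's interval
    cell = (T / n) ** 2
    a = {}
    for k in range(0, kmax + 1):
        s = k * T + t                              # grid on pattern (mu + k)'s interval
        diff = s[:, None] - t[None, :]
        a[k] = np.where(diff > 0, np.exp(-diff / tau2), 0.0).sum() * cell
    return a

T, tau2 = 5.0, 0.5
a = coeffs(T, tau2)
alpha = sum(k * v for k, v in a.items()) / sum(a.values())
print(alpha, tau2 / T)                             # alpha ~ tau2 / T_xi
print(sum(a.values()), tau2 * T)                   # sum a_k ~ T_xi * integral of w
```

Doubling T_ξ halves α, so the inverse speed d_µ ≈ 1/α doubles: the ratio of kernel and tutor timescales sets the recall tempo.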

Sequences with fast and slow parts
However, these transitions have undesirable characteristics. For one, in the first transition the sequence slows down just before entering the faster sequence section. A second issue is that the peak differences become more variable after each transition. A third issue is that the transition from T_2^µ to T_3^µ is very gradual and takes a significant amount of time before recovering the speed of the T_1^µ section. These issues, and particularly the third, are probably related to how the overlaps q_µ "spread out" with increasing index µ (see Fig. 1 for an example); that is, the width of the bumps traced out by q_µ(t) increases. This suggests a modified sequence model motif (1).

In addition, we consider only linear TAH learning rules, although we expect our analysis would extend in spirit to nonlinear learning rules such as those considered in (37).
In the long term, memory models should be combined with other models, such as models of motor cortex supporting flexible motor control. Tutor signals would then be generated by these other models, and the memory model would gradually take over control of motor actions as they become more practiced (56). In addition, mechanisms for refinement based on reward signals could be added to the memory modules. For instance, by gating TAH learning rules with reward signals, this learning process can be turned into a reinforcement learning process.
In summary, our analysis shows that sequence memories can be formed by simple, biologically plausible learning rules with general TAH kernels, and that sequence speed is set by the interplay between the TAH kernel and the duration for which the network is tutored. Our work indicates areas of potential improvement and future development that we believe are important for advancing the field.
Code for reproducing the plots can be found at https://github.com/msf235/Recall-tempo-of-Hebbian-sequences. For details of the parameters used in the simulations for producing the plots, see Appendix 3.

Derivation of mean-field equations
Here we derive the mean-field equations Eq. (9). This is a fairly straightforward generalization of the derivation in (37). The full network equations are

ṙ = −r + ϕ(W r + η),

where η(t) is a white noise vector (⟨η_i(t)⟩ = 0 and ⟨η_i(t)η_i(t′)⟩ = ρ² δ(t − t′)) and ϕ is a sigmoidal nonlinearity with parameters r_span, r_center, θ, and σ. The weights W are

W = (1/N) Σ_{µ,ν} a_ν^µ ξ^µ (ξ^ν)^⊤.

Here each pattern ξ^µ is standard normal: ξ_i^µ i.i.d. ∼ N(0, 1). The mean-field equations are written in terms of the overlaps q_µ(t) = (ξ^µ)^⊤ r(t)/N. If we let h = W r + η we can then write

h = Σ_µ (Σ_ν a_ν^µ q_ν) ξ^µ + η.

We now investigate the evolution of the overlaps q_µ:

q̇_µ = −q_µ + (ξ^µ)^⊤ ϕ(h)/N.

Let's look more closely at the second term on the right-hand side. As N → ∞ this term approaches an average by the law of large numbers, yielding (ξ^µ)^⊤ ϕ(h)/N → ⟨ξ^µ ϕ(h)⟩, where ξ^ν i.i.d. ∼ N(0, 1) and η is a scalar white noise term: ⟨η(t)⟩ = 0 and ⟨η(t)η(t′)⟩ = ρ² δ(t − t′). We next note that each term ν ≠ µ contributing to h is an independent normally distributed random variable with mean zero and variance (Σ_ℓ a_ℓ^ν q_ℓ)². Hence their sum, together with the noise and normalized by R_µ := (Σ_{ν≠µ} S_ν² + ρ²)^{1/2}, is a standard normal variable x, where we define S_ν := Σ_ℓ a_ℓ^ν q_ℓ. We can then write the average as a Gaussian integral:

⟨ξ^µ ϕ(h)⟩ = ∫ Dξ^µ ∫ Dx ξ^µ ϕ(S_µ ξ^µ + R_µ x).

The remainder of the calculation follows (37) closely. We next define the change of coordinates

v = (S_µ ξ^µ + R_µ x)/c,  u = (R_µ ξ^µ − S_µ x)/c,

where v and u are uncorrelated standard normal random variables and where we have defined c = (S_µ² + R_µ²)^{1/2}. To perform this change of variables we first compute the determinant of the Jacobian, which equals one. Hence we can use the substitution du dv = dξ^µ dx. Inverting the change of coordinate equations above, we find that ξ^µ = (S_µ v + R_µ u)/c. Using these substitutions, we can rewrite the integral as

⟨ξ^µ ϕ(h)⟩ = (S_µ/c) ∫ Dv v ϕ(cv).
We now define

F(c) := ∫ Dv v ϕ(cv).

Using the definition of ϕ, this integral can be evaluated with integration by parts. Our equation becomes

q̇_µ = −q_µ + (F(c)/c) S_µ.

Finally, we define

g(t) := F(c(t))/c(t) =: G(c(t)²),

which results in Eq. (9). Plots showing the convergence of the mean-field approximation to the network simulations for increasing N are shown in Fig. 2. Note that in subsequent analysis, the number of patterns P will also be taken to infinity. In order for these mean-field dynamics to hold exactly, P/N must vanish as N and P are both taken large.
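The key Gaussian identity used above, ⟨ξ ϕ(S ξ + R x)⟩ = (S/c) ∫Dv v ϕ(cv) with c = (S² + R²)^{1/2}, can be checked by Monte Carlo for any odd sigmoid; tanh is used here as a stand-in for the paper's ϕ, and the values of S and R are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
phi = np.tanh                                    # stand-in odd sigmoid
S, R = 0.8, 0.5
c = np.hypot(S, R)                               # c = sqrt(S^2 + R^2)

xi, x = rng.standard_normal((2, 2_000_000))
lhs = np.mean(xi * phi(S * xi + R * x))          # <xi phi(S xi + R x)>

v = rng.standard_normal(2_000_000)
rhs = (S / c) * np.mean(v * phi(c * v))          # (S/c) Int Dv v phi(c v)
print(lhs, rhs)                                  # agree to Monte Carlo error
```

The agreement reflects the rotation of Gaussian coordinates performed in the derivation: S ξ + R x has the same distribution as c v, and only the component of ξ along v survives the average.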
Note that stability is determined by the shape of the function G. For the values of ϕ considered in the main text (r_span = 2, r_center = 0, θ = 0, and σ = 0.1), G has the shape plotted in Fig. S1. In particular, G is monotonically decreasing from its maximum at x = 0. Since ∥h(t)∥² ≥ 0, the maximal value that g(t) can take during the ongoing dynamics is bounded from above by G(ρ²). Note that the function G is the same as in (37), and the supplementary material of (37) contains plots of G for various values of r_span, θ, and σ.
Noting that

I_µ(z) = (1/2π) ∫_{−π}^{π} e^{z cos s} cos(µs) ds

is the modified Bessel function of the first kind, this equation can be written

q_{µ+1}(t) = e^{(−1+ā_0)t} (ā_1/ā_{−1})^{µ/2} I_µ(2t (ā_1 ā_{−1})^{1/2}).

Instead of analyzing these equations further, we proceed with approximate methods that will generalize to more cases (i.e., more nonzero coefficients a_k).

Bidirectional connectivity - periodic boundary conditions
To further analyze the behavior, we convert the boundary conditions of Eq. (26) to periodic. This yields the following system of equations:

q̇_µ = −q_µ + ā_{−1} q_{µ+1} + ā_0 q_µ + ā_1 q_{µ−1},  with indices taken modulo P.

Note that the periodic equations may not match the solutions to the nonperiodic equations exactly, even as P → ∞. This makes our simulations important for verifying that the approximations are useful.
Solving the periodic system mode by mode and using the 2π-periodicity of the integrand, in the limit of large P the solution can be written

q_{µ+1}(t) = (e^{(−1+ā_0)t}/2π) ∫_{−π}^{π} e^{t(ā_1+ā_{−1}) cos s} e^{i(t(ā_1−ā_{−1}) sin s − µs)} ds.  [31]

We next have some choice of method for approximating this integral. A simple approach would involve taking a truncated Taylor expansion of the trigonometric functions in the integrand, with the hope that this resolves into an integral that can be evaluated. A more sophisticated and principled approach would be to use a saddle point approximation, where t is treated as a variable that is becoming large. We will start with saddle point approximations and show the limitations of this approach in this instance. We will then compare the result with the naive Taylor expansion approach. To get the essential information, the reader may wish to skip to this Taylor expansion approach.
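Before turning to the analytic approximations, the peak structure of Eq. (31) can be probed directly by numerical quadrature; the coefficient values below are hypothetical. Since the integrand is 2π-periodic and smooth, a uniform-grid rectangle rule is spectrally accurate:

```python
import numpy as np

ab_m1, ab0, ab1 = 0.1, 0.3, 0.6             # normalized coefficients, sum to 1
beta, alpha = ab1 + ab_m1, ab1 - ab_m1
n = 1024
s = -np.pi + 2 * np.pi * np.arange(n) / n   # uniform grid over [-pi, pi)
t = np.arange(0.0, 45.0, 0.02)

def q(mu):
    """Overlap q_{mu+1}(t) via the real part of the integral in Eq. (31)."""
    # combine e^{(-1+ab0)t} = e^{-beta t} with the cos term to avoid overflow
    expo = t[:, None] * beta * (np.cos(s)[None, :] - 1.0)
    osc = np.cos(t[:, None] * alpha * np.sin(s)[None, :] - mu * s[None, :])
    return (np.exp(expo) * osc).sum(axis=1) / n

peaks = np.array([t[q(mu).argmax()] for mu in range(11, 17)])
print(np.diff(peaks))                       # approach 1/(ab1 - ab_m1)
```

The peak differences converge toward 1/(ā_1 − ā_{−1}), which is the value the approximations derived next will predict.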

Bidirectional connectivity -saddle point approximation
We proceed by performing a saddle point approximation of the integral in Eq. (31). To do so, we write the integral as

I(t) = (1/2π) ∫_{−π}^{π} e^{tϕ(s)} ds,

where ϕ(s) = β cos s + iα sin s − iγs and β = ā_1 + ā_{−1}, α = ā_1 − ā_{−1}, and γ = µ/t. Note that we are treating µ/t as a finite scalar, so µ and t are growing large together.
For the saddle point approximation, we first extend ϕ to a function over the complex numbers and find the critical points of ϕ (ℜϕ has saddle points at these critical points in the complex plane). When γ is large enough, there are two saddle points that both lie on the imaginary axis. To find these, we can take s = iy and find the roots of ϕ′(iy). Letting u = e^y, ϕ′(iy) can be written

ϕ′(iy) = −iβ(u − 1/u)/2 + iα(u + 1/u)/2 − iγ.

The roots of the resulting quadratic are

u_± = (γ ± (γ² + β² − α²)^{1/2})/(α − β).

Here we see that the condition that these roots exist is that γ² ≥ α² − β², since u must be real. We next need to calculate the angle of approach for these saddle points. This is given by arg ϕ″ evaluated at the saddle points. This shows that the angles are −π/2 and 0 for the saddle points in the y > 0 positive half-space and y < 0 negative half-space, respectively. The contribution of the saddle point in the positive half-space is purely imaginary, so we can focus on the saddle point in the negative half-space, which we denote s*. Given the negative root u_− above, it is straightforward to compute ϕ and ϕ″ at s*, and the resulting saddle point approximation yields an explicit expression for q_{µ+1}(t). To find the peak times, we seek the roots of q′_µ(t). Given the saddle point approximation above, it is straightforward to compute q′_µ(t). This expression, even in the asymptotic limit t → ∞ (with µ/t constant), does not appear to have roots that can be expressed in an explicit equation for t.
Given that the roots of the saddle point approximation are unhelpfully complex expressions, we next try taking a saddle point approximation of q′_{µ+1}(t); i.e., instead of finding a saddle point approximation and then taking a derivative, we first take a derivative and then find the saddle point approximation. The derivative q′_{µ+1}(t) is

q′_{µ+1}(t) = e^{(−1+ā_0)t} ((−1 + ā_0) I(t) + I′(t)),

where I(t) is the integral defined above. Taking a saddle point approximation of this involves taking a saddle point approximation of I(t) and I′(t); we have already done the former, and the latter is simply

I′(t) ≈ ψ(s*) I(t),

where s* is the critical point of ϕ as above. Hence

q′_{µ+1}(t) ≈ e^{(−1+ā_0)t} ((−1 + ā_0) + ψ(s*)) I(t).

We again look for the peak times by finding the roots of this expression, which occur at (−1 + ā_0) + ψ(s*) = 0. Evaluating ψ(s*) is similar to evaluating ϕ″(s*) and yields

ψ(s*) = (β² − α² + γ²)^{1/2}.

Substituting γ = µ/t into this expression and solving (−1 + ā_0) + (β² − α² + γ²)^{1/2} = 0 for t yields

t_µ ≈ (µ − 1)/(ā_1 − ā_{−1}).  [33]

This is the same as Eq. (12) up to a constant offset.

Fig. S2. In these plots, the difference between the true peak times t_µ and their approximations is plotted across µ. Blue line, orange line, and green line correspond to the peaks of the saddle point approximation of q_µ(t), the roots of the saddle point approximation of q′_µ(t), and the peaks of the Taylor expansion approximation of q_µ(t), respectively. The Python scipy.signal utility find_peaks is used to find the peak times numerically for the blue line. True peak times (dashed lines) are found by numerical quadrature of the integral expression in Eq. (19), followed by using find_peaks.
While this is a fair approximation, the differences between this estimate and the true peak times typically remain nonvanishing even as t and µ approach infinity; see the blue and orange lines of Figs. S2a and S2b. However, the differences between the true peak differences d_µ = t_µ − t_{µ−1} and this estimate do vanish with increasing t and µ; see the blue and orange lines of Figs. S2c and S2d.
Considering that the saddle point method yields either (1) an estimate for t_µ that doesn't appear to have a simple closed-form expression, or (2) an estimate that is not asymptotically precise, we seek a simpler approach. This simpler approach, derived in the next section, has the additional advantage of generalizing to the case of arbitrarily many nonzero coefficients a_k.
While we can write this integral in terms of error functions, a simpler expression results if we are able to integrate over the whole real line. We note that across a range of reasonable choices of the ak, the integrand is vanishingly small outside of the interval [−π, π]. Hence we can indeed approximate the integral with one over the real line. The locations tµ of the peaks are given by the positive root of qµ(t) = 0, which when using the approximation above yields

tµ ≈ (−β + √(β² + 4α²(µ − 1)²)) / (2α²).

Taking the asymptotic limit of large µ and taking the positive root yields Eq. (14). Similar to the estimate yielded by taking a saddle point approximation of q′µ(t), this estimate is not asymptotically exact; see the green lines in Figs. S2a and S2b. Depending on the choice of ak, the approximation can be better or worse than the saddle point approximation Eq. (33); compare Fig. S2a with Fig. S2b. However, as before the differences dµ := tµ − tµ−1 are asymptotically correct; see Figs. S2c and S2d. Hence we can consider this approximation to be practically at least as useful as that yielded by taking saddle point approximations. An extra benefit of this approach is that it generalizes; see the next section.
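A minimal numerical sketch of this closed-form estimate, assuming the quadratic reading of the expression above (the constants α and β here are placeholder values, not derived from a particular kernel). It checks that the peak differences dµ, unlike the peak times themselves, converge to the asymptotic slope 1/α:

```python
# Sketch of the Taylor-expansion peak-time estimate: the positive root of a
# quadratic in t, t_mu = (-beta + sqrt(beta^2 + 4 alpha^2 (mu-1)^2)) / (2 alpha^2).
# alpha and beta are hypothetical placeholder constants.
import numpy as np

def t_peak(mu, alpha, beta):
    """Positive root of the approximate quadratic q_mu(t) = 0."""
    return (-beta + np.sqrt(beta**2 + 4 * alpha**2 * (mu - 1) ** 2)) / (2 * alpha**2)

alpha, beta = 0.8, 0.5            # placeholder coefficient-dependent constants
mu = np.arange(2, 200)
t_mu = t_peak(mu, alpha, beta)

# Peak differences d_mu = t_mu - t_{mu-1} approach the asymptotic slope 1/alpha,
# matching the claim that d_mu (but not t_mu itself) is asymptotically exact.
d_mu = np.diff(t_mu)
print(d_mu[0], d_mu[-1], 1 / alpha)
```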

Generalizing the computation to shift-invariant connectivity
The above technique of using Taylor expansions to find an approximate solution for qµ extends to the general case of Eq. (10). As before, the positive root of qµ(t) is asymptotically given by Eq. (12). This approximation will be used for the remainder of the Appendices.

Two forward terms
In the case of two forward terms, applying Eq. (12) to the corresponding system of equations yields the approximate peak times of Eq. (15).

Fig. 1. Illustration of network model and stereotypical behavior. a) The interaction of a tutor signal (the ξµ) with Hebbian temporal kernel w(δt) = w(s − t) gives rise to weight matrix W. b) Illustrations of different possible kernels. Left: kernel that would give rise to two terms for each µ, a0 ξµ(ξµ)T and a1 ξµ+1(ξµ)T. Second from left: kernel that would give rise to three terms for each µ, a0 ξµ(ξµ)T, a1 ξµ+1(ξµ)T, and a−1 ξµ−1(ξµ)T. Third from left: kernel that would give rise to five terms for each µ, corresponding to µ − 2, µ − 1, µ, µ + 1, and µ + 2. Right: double-sided exponential kernel. c) Activity rk of ten randomly selected units in the network for coefficient values a0 = 0 and a1 = 1. d) Overlaps qµ of the network activity with patterns ξµ (here a0 = 0 and a1 = 1). Color corresponds to pattern index µ. The vertical lines indicate the locations tµ of the peaks. The first overlap q1 is not included in the plot for ease of visualization.
the famous double-sided exponential STDP curve of Bi and Poo (43) (Fig. 1b, right). Note that the neuroscience literature often plots the kernel as a function of −δt = −(s − t) instead of δt = s − t (i.e. "pre-post" instead of the "post-pre" convention used here).
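A sketch of such a double-sided exponential kernel and the convention flip just described. The amplitudes and time constants borrow the values m1 = 5, τ1 = 0.5, m2 = 8, τ2 = 1 used later in the text; the sign placement (positive branch for δt ≥ 0) is illustrative:

```python
# Double-sided exponential kernel in the post-pre convention delta_t = s - t.
# Parameters and the sign of each branch are illustrative placeholders.
import numpy as np

def w(delta_t, m1=5.0, tau1=0.5, m2=8.0, tau2=1.0):
    """Double-sided exponential kernel, post-pre convention."""
    return np.where(delta_t >= 0,
                    m1 * np.exp(-delta_t / tau1),
                    -m2 * np.exp(delta_t / tau2))

def w_prepost(delta_t, **kw):
    # The pre-post convention common in the STDP literature is the mirror image.
    return w(-delta_t, **kw)

print(w(0.5), w(-0.5), w_prepost(0.5))
```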
Fig. 3. Sequence progression for two-term connectivity Wµ. Plots show the full network simulations (solid lines) and linear approximations Eq. (11) (dashed lines). Throughout, capital letters A, B, and C correspond to parameter values as indicated in the legend. a) Illustrations of two TAH kernels w that would give rise to two terms. b) Peak times tµ. The † symbol denotes parameter values that give rise to overlapping lines, and C denotes a parameter value for which the network equations are unstable. c) The value of g(t) through time during the simulations. Dashed lines denote ḡ. The function g(t) is bounded from above by Gmax ≈ 8, which is indicated by a dotted line. For the parameters corresponding to C there is a mismatch between ḡ and g∞ (the asymptote of the solid line). d) Plots of peak differences dµ as a function of a0 and a1. Shaded regions are 95% confidence intervals for µ ∈ {2, ..., 10}, and lines are the means. Left: a0 is plotted on the horizontal axis and a1 is denoted by color. Right: a1 is plotted on the horizontal axis and a0 is denoted by color. e) The peak height p70 corresponding to the maximum of q70(t) (µ = 70) over t, in the full network simulation, as a function of the sum of coefficients a0 + a1. Color denotes the noise strength ρ and dashed vertical lines show the critical points G(ρ²)⁻¹ where the network is predicted to pass from stable to unstable. Shaded regions are confidence intervals over a0 ∈ {−0.2, 0, 0.1}. A) Overlaps qµ(t) corresponding to parameter values (0.4, 0.6). B) Overlaps qµ(t) corresponding to parameter values (−0.4, 0.6). C) Overlaps qµ(t) corresponding to parameter values (0.0, 0.1).

Figure 3e quantifies the stability properties of sequences by plotting the peak magnitude pµ for a pattern of large index (here µ = 70) as a function of ḡ⁻¹ = a0 + a1. Color denotes the strength of noise injected into the network dynamics (ρ in Eq. (1)). Note that such noise could come from interference, such as that which arises if there are multiple stored sequences (37). The dashed vertical lines mark the critical values of stability where ḡ = G(ρ²). This plot illustrates how ḡ determines the stability of sequences, and that smaller ḡ (equivalently, larger a0 + a1) is required for stability in the presence of noise. The plot shows that, as predicted, peak magnitudes pµ have a sharp inflection point at ḡ = G(ρ²), where pµ goes from being approximately zero to nonzero.

found in biology, such as the double-sided exponential decay kernels (52) (see Fig. 4a for an illustration). Three terms in particular are relevant when the TAH kernel's timescale is faster than the tutor signal's, for instance when the support of the kernel vanishes outside of time windows spanning more than two pattern presentations. Figure 4a illustrates a temporally asymmetric kernel that would give rise to such three terms.
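To make the timescale argument concrete, one can estimate how many offsets k receive non-negligible coefficients ak for a given presentation window length T. The window-averaged definition of ak below is an assumption made for illustration; the paper's exact formula for the coefficients is not reproduced here:

```python
# Hypothetical estimate: couple pattern mu (window [0, T]) to pattern mu+k
# (window [k T, (k+1) T]) by averaging the kernel over the two windows.
import numpy as np

def w(d, m1=5.0, tau1=0.5, m2=8.0, tau2=1.0):
    # Double-sided exponential kernel (post-pre convention), parameters as in Fig. 7.
    return np.where(d >= 0, m1 * np.exp(-d / tau1), -m2 * np.exp(d / tau2))

def a_k(k, T, n=400):
    # Average of the kernel over the two presentation windows -- an assumed
    # stand-in for the paper's coefficient formula.
    s = np.linspace(k * T, (k + 1) * T, n)
    t = np.linspace(0.0, T, n)
    return w(s[:, None] - t[None, :]).mean()

T = 1.0
coeffs = {k: a_k(k, T) for k in range(-3, 4)}
print({k: round(float(a), 3) for k, a in coeffs.items()})
```

The magnitudes decay with |k| at a rate set by the kernel timescales relative to T: forward coefficients (k ≥ 1) are positive, backward ones negative, and offsets beyond ±1 become negligible once T is large compared to τ1 and τ2.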
(a0 + 2a−1)/(a0 + 2a1). This equation reveals that changes in a−1 will impact sequence speed more than changes in a1 when a−1 is smaller in magnitude than a1. Hence a−1 is a natural parameter to use to control speed.

Fig. 5. Sequence progression for two forward terms. a) Illustration of a TAH kernel w that would give rise to two forward terms. Plots show the full network simulations (solid lines), linear approximations Eq. (15) (dashed lines), and approximation Eq. (12) (dotted lines). b) Peak times tµ for a variety of coefficient combinations (a0, a1, a2), denoted by color. Dotted lines are the linear approximation given by Eq. (15). The † symbol denotes parameter values that give rise to the same peak times. c) Plots of peak differences dµ as a function of a0 and a1. Here a2 = 0.2. Shaded regions are 95% confidence intervals for µ ∈ {2, ..., 10}, and lines are the means. Left: a0 is plotted on the horizontal axis and a1 is denoted by color. Right: a1 is plotted on the horizontal axis and a0 is denoted by color. A) Overlaps qµ(t) corresponding to parameter values (0.4, 0.6, −0.1). B) Overlaps qµ(t) corresponding to parameter values (−0.4, 0.6, 0.4).

In naturalistic settings, different states of a sequence may have different durations. For instance, in playing a piece of music different notes are held for different durations. Our analysis indicates that changing these durations causes a change in the sequence progression speed, meaning that variable-duration tutor signals (durations Tµ) will result in sequences with faster and slower parts (Fig. 7). Consider a sequence with sections that pass from one set of uniform durations {T1}µ∈S1 to another {T2}µ∈S2. The overlaps qµ for µ ∈ S1 will be governed by the first interval length, with some disturbance from qν for ν ∈ S2 mediated by the backward coefficients ak for k > 0. For µ far from the boundary this disturbance should be small. On the other hand, qν far from the boundary will be governed by the second interval length with small disturbance from qµ, with the caveat that the evolution up to time t has been influenced significantly by the dynamics of qµ. With these caveats in mind, we should expect the dynamics in each of these regions far from the boundary to progress at speeds determined by the corresponding tutor signal interval length, with more complex behavior occurring near the boundary of the transition. In Fig. 7 we consider a case with two transitions, where T1 = 1 for µ ∈ [1, 20], T2 = 0.5 for µ ∈ [21, 30], and T3 = 1 for µ ∈ [31, 80] (slow to fast back to slow). Here we use a double-sided exponential kernel with τ1 = 0.5, m1 = 5, τ2 = 1, and m2 = 8.
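The claimed speed change can be illustrated numerically. Assuming a window-averaged stand-in for the coefficients ak (the paper's exact formula is not reproduced here) and the peak-time estimate tµ ≈ (µ − 1)/(ā1 − ā−1) from the text, halving the presentation window T increases a1 − a−1 and thus shortens the estimated inter-peak interval:

```python
# Hypothetical sketch of tutor-speed dependence: shorter presentation windows T
# yield larger effective a1 - a_{-1}, and hence faster replay via
# t_mu ~ (mu - 1)/(a1 - a_{-1}). Kernel parameters follow the Fig. 7 example.
import numpy as np

def w(d, m1=5.0, tau1=0.5, m2=8.0, tau2=1.0):
    return np.where(d >= 0, m1 * np.exp(-d / tau1), -m2 * np.exp(d / tau2))

def a_k(k, T, n=400):
    # Assumed window-averaged coefficient (not the paper's exact definition).
    s = np.linspace(k * T, (k + 1) * T, n)
    t = np.linspace(0.0, T, n)
    return w(s[:, None] - t[None, :]).mean()

gaps = {T: a_k(1, T) - a_k(-1, T) for T in (1.0, 0.5)}  # slow vs fast sections
for T, gap in gaps.items():
    print(f"T={T}: a1 - a_-1 = {gap:.3f}, est. inter-peak interval = {1/gap:.3f}")
```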

Fig. S2. Comparing approximations of tµ to true peak times. In these plots, the difference between the true peak times tµ and the approximations is plotted across µ. Blue, orange, and green lines correspond to the peaks of the saddle point approximation of qµ(t), the roots of the saddle point approximation of q′µ(t), and the peaks of the Taylor expansion approximation of qµ(t), respectively. The python scipy.signal utility find_peaks is used to find the peak times numerically for the blue line. True peak times (dashed lines) are found by numerical quadrature of the integral expression in Eq. (19), followed by using find_peaks. a) Coefficient values a−1 = −0.3, a0 = 0.1, and a1 = 1.8. b) Coefficient values a−1 = 0.2, a0 = 0.2, and a1 = 0.8. c)-d) Difference of peak differences dµ = tµ − tµ−1 with the approximations. Note that the orange and green lines coincide. Dotted lines denote the precision ceiling due to the step size dt = 0.001. c) Coefficient values a−1 = −0.3, a0 = 0.1, and a1 = 1.8. d) Coefficient values a−1 = 0.2, a0 = 0.2, and a1 = 0.8.

on the pattern state (i.e., the same sequence can have both fast and slow transitions between patterns, resulting in a dµ that depends strongly on µ). If the tutor signal presents patterns at regular intervals, then the coefficients take the form aνµ = aν−µ. In this case the coefficient for the ξν(ξµ)T term only depends on the difference between ν and µ rather than depending on µ independent of ν.

*We can also define the sequence state represented at a time t by arg maxµ(qµ(t)), but the tµ are more mathematically convenient objects to consider.

Fig. 4. Sequence progression for bidirectional terms. Plots show the full network simulations (solid lines), linear approximations Eq. (10) (dashed lines), and approximation Eq. (14). Throughout, capital letters A and B correspond to parameter values as indicated in the legend. a) Illustration of a TAH kernel w that would give rise to bidirectional terms. b) Peak times tµ plotted for a variety of coefficient combinations (a−1, a0, a1), denoted by color. c) Plots of peak differences dµ as a function of a0 and a1. Here a−1 = −0.2. Shaded regions are 95% confidence intervals for µ ∈ {2, ..., 10}, and lines are the means. Left: a0 is plotted on the horizontal axis and a1 is denoted by color. Right: a1 is plotted on the horizontal axis and a0 is denoted by color.