## Abstract

The intertwined effects of recency and practice on memory and learned behavior have long been studied in psychology and neuroscience. We develop a bottom-up, mechanistic theory addressing the combined effects of recency and practice, beginning with a classical single-neuron model, then applying insights from this model to the network level. Combining error-based and associative learning, we mathematically derive the forgetting curve for a single neuron and its dependence on practiced repetition, showing how highly practiced memories or behaviors can become far more resistant to being overwritten by later learning. At the network level, error-based learning is responsible for initial gains in performance, while associative learning gradually transfers control of the downstream population from one input pathway to the other. We interpret the model neurobiologically by identifying the inputs as cortex and thalamus, and the downstream population as striatum, providing a framework for understanding the neural basis of habit formation and the automatization of behavior through practice.

## Introduction

The acquisition and retention of memories and learned behaviors are complex phenomena dependent upon many factors. Among the most important of these are *recency*, i.e. how recently a memory or behavior has been acquired, and *practice*, i.e. how many times it has been repeated during training. The combined effects of these variables have been studied since the dawn of experimental psychology [1]. They were later codified by Thorndike as the Law of Recency and the Law of Exercise, respectively [2], which have since served as some of the basic organizing principles guiding research on human and animal behavior for more than a century.

Although studies on humans have historically tended to focus on memory, while animal studies have tended to focus on learned behaviors, more modern approaches have suggested that these processes can be understood within a common framework [3]. In studies of memory, the effects of recency and practice can be tested and quantified with forgetting curves, which portray the decline in recall performance over time. Studies addressing the interplay of both effects in human memory have shown that the strength of memories, as measured by retention duration [1, 4, 5, 6] or reaction times [7], continues to increase through practice, even after immediate recall performance is perfect. While phenomenological theories have been proposed to describe the process of memory formation and recall [8, 9], including the effects of practice, the mechanistic bases underlying such theories are not well understood. In studies of learned behavior, on the other hand, the framework within which to study and understand recency and practice is somewhat less clear. Different studies have tended to view the effects of practice through different lenses, including the improvement of speed and accuracy [10], formation of habit [11, 12], and decrease in cognitive effort [13]. It is not clear whether these are all distinct phenomena or different facets of a common underlying cause, and, if the latter, how they might be unified by a single mechanistic theory [14, 13, 15].

From a neurobiological perspective, memories and behaviors are understood to be encoded via the modification and consolidation of synaptic strengths, or “weights” [16, 17]. Multiple possibilities exist for how the modification of synaptic weights may occur. One possibility is that such modifications should make use of information about errors or rewards, making changes such that future behaviors will minimize errors or maximize rewards [18, 19]. Such learning rules have had great success in recent years as the basis for neural-network-based approaches to artificial intelligence [20]. Alternatively, synaptic modifications might be blind to errors and rewards, instead simply strengthening associations between neurons whenever they are coactive, as originally proposed by Hebb [21]. It is likely that both of these types of synaptic modification, and possibly others as well, are realized in the brain, and understanding where they are implemented and what computational purposes they serve is an ongoing challenge for neuroscience. The overwriting of these modifications by newer ones as new memories and behaviors are acquired provides a plausible mechanism for the effect of recency on retention [22, 23], while the continued strengthening of such modifications through repetition provides a possible mechanism for the effect of practice [24]. A unified theory describing the interplay of both effects at the neural level, however, has been lacking.

In recent years, artificial neural networks have emerged as powerful tools for modeling the ways in which circuits in the brain might learn to solve challenging tasks and implement behaviors [25]. However, such models typically do not meaningfully address the effects of recency and practice. First, because examples during training are typically presented to the network repeatedly and in random order, even in tasks with inherently episodic structure [26], there is no sense in which examples are learned and remembered sequentially. In addition, such models also typically fail to address the effects of practice. This is because their training is typically based on supervised or reinforcement learning, and, in both of these cases, performance improves during training but stops once the error or reward is optimized. Hence, such models are unable to address a key observation about practice from experimental studies, namely that it continues to have effects through repetition even after a behavior has been learned or a fact memorized. This simple observation suggests that there is an additional, covert process at play during practiced repetition. This process is not immediately obvious from behavioral observation, but has effects that can be probed later by testing recall performance.

In this work, we develop a bottom-up, mechanistic theory of learning and memory that addresses the combined effects of recency and practice. We begin by studying sequential learning in the perceptron, a well-studied classical model of a single neuron implementing supervised (i.e. error-based) learning [27], deriving a mathematical expression for the forgetting curve, which describes how old information is lost as new information is acquired. Because the simple perceptron cannot account for the effects of practice, however, we next extend this model to include a second input pathway consisting of synapses that are modified with association-based learning, deriving a generalized forgetting curve that includes the effects of both recency and practice. We go on to show that networks of such neurons are capable of performing more challenging computational tasks, with the first pathway facilitating initial learning, and the association-based pathway facilitating the retention of well-practiced behaviors. Finally, we connect the model to neurobiology, interpreting the two pathways as cortical and thalamic inputs to the striatum, the major site of reward-based learning in the vertebrate brain. We show that learning in this model corresponds to a process by which the fast adjustment of cortical weights leads to initial gains in performance, followed by the gradual transfer of control from cortical to subcortical inputs as the behavior is repeatedly practiced. This interpretation provides a quantitative theory for describing the automatization of behavior and the formation of habits through practiced repetition.

## Results

Learning and memory depend both on how recently learning took place and on the amount of practice during learning. In the following sections, we address the ways in which learning and memory are affected by recency and practice, beginning with the simplest possible case: a single neuron trained to classify input patterns. We then proceed to build on the insights obtained from this simple model by extending it, first by endowing it with associative learning in order to include the effects of practice, and then by investigating the emergent network-level phenomena in populations of such neurons. Finally, we apply the theory to a sensorimotor circuit, showing that it can provide a mechanism for habit formation and the automatization of learned behaviors.

### The single-neuron forgetting curve describes the effect of recency

In order to develop a quantitative theory of the effects of recency at the most basic level, we began by studying a classical model of a single neuron: the perceptron [27]. In this model, *N*_{x} input signals *x*_{i} are multiplied by synaptic weights *w*_{i} and then summed. If the summed value is greater than 0, the neuron’s output is +1; otherwise the output is −1. This is described by the equation *z*^{µ} = sgn(**w**^{µ} ⋅ **x**^{µ}), where *µ* denotes one of *P* possible input patterns (Figure 1a). The goal is then for the synaptic weights **w** to be adjusted so that, for each pattern *µ*, the output matches a target value *z̃*^{µ}. Thus, the task is binary classification, in which random input patterns **x**^{µ} (with entries drawn independently from *𝒩*, where *𝒩* denotes the standard normal distribution) are mapped onto random output values *z̃*^{µ} = ±1.

This learning problem can be illustrated geometrically in the *N*_{x}-dimensional space of synaptic weights, where the weight vector **w**^{µ} just before learning pattern *µ* is represented as a vector in this space (Figure 1b). In order for the classification to be correct, the weight vector after training should have positive overlap with the vector *z̃*^{µ}**x**^{µ}, which defines a classification boundary (red line in Figure 1b), with all weight vectors on one side of this boundary leading to correct classification. This can be effected by updating the weights to **w**^{µ} + ∆**w**^{µ}, where

∆**w**^{µ} = (*κ* − *z̃*^{µ}*u*^{µ}) *z̃*^{µ}**x**^{µ}/|**x**^{µ}|²,  if *z̃*^{µ}*u*^{µ} < *κ*;
∆**w**^{µ} = 0,  otherwise,  (1)

where *u*^{µ} = **w**^{µ} ⋅ **x**^{µ} is the summed input to the neuron. The first line of this equation says that, if the classification is initially incorrect or correct but within a margin *κ* of the classification boundary (dashed line in Figure 1b), then an update is made such that the new weight vector **w**^{µ} + ∆**w**^{µ} lies on the correct side of the classification boundary (Figure 1b) with margin *κ*. The second line, on the other hand, says that, if the classification is initially correct with a margin of at least *κ*, then no update needs to be made. Because *κ* can be absorbed into an overall rescaling of **w**, we henceforth set *κ* = 1.
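As a concrete illustration, the margin-based update (1) can be written in a few lines of code. This is a minimal sketch: the variable names are ours, and the step is normalized so that the updated weight vector lands exactly at margin *κ*, as described above.

```python
import numpy as np

def perceptron_update(w, x, z_target, kappa=1.0):
    """Margin-based perceptron update in the spirit of update rule (1).

    If the target z_target is not reproduced with margin kappa, move w
    just far enough along the input direction to reach margin kappa
    exactly; otherwise, make no update.
    """
    u = w @ x  # summed input to the neuron
    if z_target * u < kappa:
        # Chosen so that (w + dw) . x = kappa * z_target, i.e. the new
        # weight vector sits on the correct side of the classification
        # boundary with margin exactly kappa.
        dw = (kappa * z_target - u) * x / (x @ x)
        return w + dw
    return w
```

Applying the function twice with the same pattern leaves the weights unchanged on the second call, since the first call already places the weight vector at margin *κ*.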

Clearly a potential problem exists whenever more than one pattern is being learned, however. After applying (1) to ensure correct classification of pattern *µ*, followed by applying the same update rule for pattern *µ* + 1, there is no guarantee that the weight vector will still be on the correct side of the classification boundary for pattern *µ*. That is, the learning of new information in this model has the potential to interfere with and overwrite previously learned information. In the standard paradigm for learning in the perceptron, this problem is solved by cycling many times through all of the *P* patterns to be learned. In one of the most celebrated results in computational neuroscience, it has been shown that, in the large-N limit, a weight vector producing correct classification of all patterns can be obtained whenever *P* < 2*N*_{x} [28, 29].

In this work, we instead study the case in which the patterns are learned in sequence, so that no further training with pattern *µ* occurs after training with pattern *µ* + 1 has begun. Concretely, we first train the perceptron by applying the update rule (1) for each of the *P* patterns in the sequence. We then test the perceptron by evaluating the classification of each pattern using the final weight vector **w**^{P}. Intuitively, we expect that the most recently learned classifications will remain correct with high probability during testing since the weight vector will have been updated only a small number of times after learning these patterns. On the other hand, for classifications learned much earlier and hence overwritten by a large number of subsequent patterns, the probability of incorrect classification should approach chance level.

In order to describe this process of training, overwriting, and testing quantitatively, we calculated the probability of incorrect classification (or “error rate”) for each pattern during testing. The full calculation is presented in Appendix 1. To summarize, it begins by assuming the weight vector is **w**^{ν} just before learning some generic pattern *ν*, then applies the update rule (1) using randomly drawn **x**^{ν} and . For the following *P* − *ν* steps (*µ* = *ν* + 1,…, *P*), the updates ∆**w**^{µ} will be in directions that are random with respect to **w**^{ν}. Hence, the evolution of the weight vector from **w**^{ν} to **w**^{P} can be described as a drift-diffusion process characterized by the probability distribution *p*(**w**^{P} *|***w**^{ν} + ∆**w**^{ν}). Once the probability distribution for **w**^{P} is known, the error rate during testing is calculated by averaging *z*^{ν} = sgn(**w**^{P} ⋅ **x**^{ν}) over **w**^{P}, as well as over the initial weight vector **w**^{ν} and the random input pattern **x**^{ν}.

The result of this calculation is that the error rate during testing of pattern *ν* is given by

*p*_{ν} = *F*(*γ*),  (2)

where the form of the function *F* is given in Appendix 1, and *γ* = *e*^{−q(P − ν)/Nx}, where *q ≃* 0.798 is the probability that a nonzero update is made according to (1) during any particular step of training. This result defines the forgetting curve of the perceptron, which is shown in Figure 1c. (*N.b.* While the pattern index *ν* is the independent variable in the forgetting curve, we plot the curve as a function of (*P* − *ν*)/*N*_{x} so that earlier times appear on the left and later times on the right, and so that the curve is independent of *P*, which may be arbitrarily large.) As expected, the forgetting curve in Figure 1c shows that recently learned classifications are recalled with very low error rates, while those learned far in the past are recalled with error rates approaching chance level. We can further observe that, because the number of synapses *N*_{x} appears only through the parameter *γ*, the number of patterns that can be learned with error rate at or below any particular threshold is proportional to *N*_{x}. This means that the memory is extensive in the number of synapses, as in the classical result described above for the perceptron with cycled training. Finally, because the perceptron implements learning in a single step and may therefore be questionable as a neurobiological model of learning, we show using simulations that a similar forgetting curve is obtained using a gradient-descent learning rule, in which small updates are accumulated over many repetitions to minimize the readout error (Supplemental Figure 1).
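The forgetting curve can also be estimated empirically by simulating the sequential training and testing protocol described above. The following sketch uses illustrative sizes and the margin update with *κ* = 1; recently learned patterns should show a much lower empirical error rate than patterns learned long ago.

```python
import numpy as np

def forgetting_curve_sim(N_x=200, P=1000, kappa=1.0, seed=0):
    """Sequentially train a perceptron on P random patterns, then test
    recall of every pattern with the final weight vector."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, N_x)) / np.sqrt(N_x)  # random input patterns
    z = rng.choice([-1.0, 1.0], size=P)               # random target outputs
    w = np.zeros(N_x)
    for mu in range(P):  # one pass through the patterns, in sequence
        u = w @ X[mu]
        if z[mu] * u < kappa:  # update only if margin kappa is violated
            w += (kappa * z[mu] - u) * X[mu] / (X[mu] @ X[mu])
    return np.sign(X @ w) != z  # per-pattern errors at test time

err = forgetting_curve_sim()
recent_rate = err[-100:].mean()  # recently learned patterns
old_rate = err[:100].mean()      # patterns learned long ago
```

Plotting a running average of `err` against (*P* − *ν*)/*N*_{x} reproduces the qualitative shape of the forgetting curve, with `recent_rate` well below `old_rate`.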

In the case of the classical perceptron, in which patterns are cycled through repeatedly during training, the classification error rate would be zero for up to *P* = 2*N*_{x} patterns. However, no further classifications could be learned without drastically impairing the classification performance of the already-learned patterns. This phenomenon, known as catastrophic forgetting [30], is obviated in the sequentially trained model studied here. As previously pointed out in theoretical studies of related models [31, 23], although sequentially trained neurons and neural networks have a smaller overall memory capacity than models in which training is repeated, they have the advantage that learned patterns decay smoothly over time as new patterns are learned, even as *P*/*N*_{x} → ∞.

### A two-pathway model describes the effects of recency and practice

In the preceding section, we derived the forgetting curve for the perceptron in order to quantitatively address the effect of recency in learning. Because learning of any given pattern occurs in a single step and stops once the error is zero, however, the simple perceptron is unable to describe the effects of practice. In order to study the effects of practice, we made the simplest possible addition to the model by including a second input pathway in which synaptic weights are modified with associative Hebbian learning (Figure 2a). Specifically, as originally proposed by Hebb [21], these synapses are strengthened by coactivation of their inputs and outputs, so that the same inputs will in the future tend to produce the same outputs. This strengthening, which is independent of errors or rewards, continues to accumulate with each repetition if an input pattern is presented many times. Importantly, the Hebbian learning should be slow compared to the supervised learning, so that it will not strengthen an incorrect input-output association before the supervised learning has taken place. Intuitively, then, this picture suggests that repeated training of a particular classification should strengthen the association and make it more resistant to being overwritten by later learning. We shall show below that this is indeed the case.

Concretely, we extended the perceptron model by adding a second set of inputs **y**^{µ}, where the *N*_{y} inputs are random and independent of the first set of *N*_{x} inputs **x**^{µ}. The output is then given by *z*^{µ} = sgn(**w** ⋅ **x**^{µ} + **v** ⋅ **y**^{µ}), where the synaptic weights **v** are updated according to a Hebbian learning rule. Specifically, if **v**^{µ} is the weight vector just before training on pattern *µ*, then the weight vector just after training on pattern *µ* is given by **v**^{µ} + ∆**v**^{µ}, where

∆**v**^{µ} = −(*α* *n*_{µ}/(*n̄* *P*)) **v**^{µ} + (*β* *n*_{µ}/*N*_{y}) *z̃*^{µ}**y**^{µ}.  (3)

The second term in this equation is a modification to the synaptic weight that is proportional to the product of the pre- and post-synaptic activity, making this a Hebbian learning rule. Because the learning of **w** is assumed to be fast relative to this Hebbian learning, the downstream activity will have attained its target value by the time the Hebbian learning has any significant effect, allowing us to use *z̃*^{µ} rather than *z*^{µ} in (3). The *n*_{µ} appearing in (3) is interpreted as the number of times that the behavior *µ* is repeated when it is trained, and *n̄* is the average of *n*_{µ} over all patterns. The first term in (3) causes the weights to decrease slightly in magnitude with each repetition of each pattern, so that they do not become arbitrarily large after a large number of classifications have been learned. The constants *α* and *β* control how quickly the weights decay and how strong the initial updates are, respectively. The weights **w**, meanwhile, are updated as before, except that now the total summed input appearing in (1) is given by *u*^{µ} = **w** ⋅ **x**^{µ} + **v** ⋅ **y**^{µ}.
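In code, one application of the Hebbian update (3) might look as follows. This is a sketch only: the normalization of the decay and association terms is an illustrative assumption.

```python
import numpy as np

def hebbian_update(v, y, z_target, n_mu, n_bar, P, alpha=0.1, beta=1.0):
    """One application of the Hebbian rule for a pattern mu.

    The first (decay) term slowly shrinks the weights so they remain
    bounded; the second term strengthens the association between the
    input y and the target output z_target, scaled by the number of
    repetitions n_mu of the pattern.
    """
    N_y = y.size
    decay = -(alpha * n_mu / (n_bar * P)) * v   # slow weight decay
    hebb = (beta * n_mu / N_y) * z_target * y   # pre x post association
    return v + decay + hebb
```

Note that, unlike the supervised rule, this update is applied regardless of whether the current output is correct; it is blind to errors and rewards.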

For this two-pathway model, we derived a generalized forgetting curve using the same approach as that used for the perceptron, describing the evolution of both weight vectors **w** and **v** as drift-diffusion processes and computing statistical averages over weights and patterns. The generalized forgetting curve for the two-pathway model has the form

*p*_{ν} = *G*(*γ*, *n*_{ν}; *α*, *β*, *N*_{y}/*N*_{x}),  (4)

where *γ* = *e*^{−q(P − ν)/(Nx + Ny)}, and the function *G* is derived in Appendix 2. From (4), we see that the forgetting curve now additionally depends on the Hebbian learning rate *β*, the ratio *N*_{y}/*N*_{x}, the Hebbian decay rate *α*, and the number of repetitions *n*_{ν} for each pattern.

To begin, we kept *n*_{ν} constant for all patterns and investigated the dependence of the forgetting curve (4) on its other parameters. Starting with the Hebbian learning rate, we found that nonzero values of *β* shifted the forgetting curve slightly downward, modestly reducing the error rate for all patterns (Figure 2b). Whether this qualifies as a true improvement, however, depends somewhat on bookkeeping. For a fixed total number of synapses *N*_{x} + *N*_{y}, the error rate can be reduced by allowing for Hebbian learning. However, the error rate is reduced even more by eliminating the Hebbian synapses entirely (dashed curve in Figure 2b), which decreases the denominator *N*_{x} + *N*_{y} by setting *N*_{y} = 0. Stated differently, if the goal is to minimize the error rate for a fixed total number of synapses, this is accomplished most effectively by letting all of the synapses be updated with supervised learning rather than with Hebbian learning (Figure 2c). If, on the other hand, the goal is to minimize the error rate for a fixed number *N*_{x} of supervised synapses, then a benefit is obtained from including additional synapses with Hebbian learning. The exception to these conclusions occurs for small values of the Hebbian decay rate *α*, in which case very old memories can persist for longer with error rates below chance level (Figure 2d). In this case, there is a benefit to adding Hebbian synapses even if they are counted against the total *N*_{x} + *N*_{y}.

We further investigated the effects of the parameters *α* and *β*, while still holding *n*_{ν} constant for all patterns, by setting a threshold *p*_{θ} for the acceptable error rate, then defining the memory capacity as the integrated area between the forgetting curve and this threshold value, normalized by *p*_{θ} (Figure 3a). Because the forgetting curve was found to have relatively weak dependence on *β* (Figure 2b), we fixed *β* = 1 and considered the effects of *α* and of *p*_{θ} on the memory capacity. For a given choice of *p*_{θ}, we found the value of *α* that maximized the memory capacity (Figure 3b). This led to the conclusion that the larger the error tolerance *p*_{θ}, the smaller the forgetting rate *α* should be in order to maximize the memory capacity (Figure 3c). Finally, we found that the optimized memory capacity increases as the error tolerance becomes greater (Figure 3d). Consistent with our observations from Figure 2, we found that the memory capacity, which is normalized by the total number of synapses *N*_{x} + *N*_{y}, is improved by learning in the second pathway if *N*_{x} + *N*_{y} is held constant (solid vs. dashed curve in Figure 3d), but is even larger if the second pathway is left out entirely (dotted curve in Figure 3d), again indicating that adding supervised synapses is a better strategy than adding unsupervised synapses if the goal is to optimize the forgetting curve for a fixed total number of synapses.

This conclusion changes considerably, however, in the case where some patterns are repeated more than others during training. In this case, repeating particular patterns multiple times consecutively during training causes those patterns to be retained much better than nonrepeated patterns. If trained with enough repetitions, particular patterns can be recalled essentially perfectly even after *∼* (*N*_{x} + *N*_{y}) additional patterns have been learned (Figure 4a). Further, as long as the number of highly practiced patterns is much less than *N*_{x} + *N*_{y}, the error rate for the remaining patterns is not significantly increased (solid line versus dotted line in Figure 4a). This shows that it is possible for a neuron to become an “expert” at a small number of classifications without performing significantly worse on other classifications that have recently been learned.
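The retention advantage for practiced patterns can be illustrated by combining both update rules in a sequential-training simulation, with one pattern repeated many times while all others are trained once. This is a sketch: all sizes, rates, and the per-repetition bookkeeping are illustrative assumptions rather than the exact parameters used in the figures.

```python
import numpy as np

def train_two_pathway(P=500, N_x=100, N_y=100, n_rep=50, rep_idx=0,
                      kappa=1.0, alpha=0.1, beta=1.0, seed=1):
    """Sequentially train the two-pathway neuron on P patterns, with
    pattern rep_idx practiced n_rep times and all others trained once.

    Returns per-pattern test correctness, plus the Hebbian input that
    the second pathway contributes toward the repeated pattern's target.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, N_x)) / np.sqrt(N_x)
    Y = rng.standard_normal((P, N_y)) / np.sqrt(N_y)
    z = rng.choice([-1.0, 1.0], size=P)
    w, v = np.zeros(N_x), np.zeros(N_y)
    n_bar = (P - 1 + n_rep) / P  # average number of repetitions
    for mu in range(P):
        n_mu = n_rep if mu == rep_idx else 1
        for _ in range(n_mu):
            u = w @ X[mu] + v @ Y[mu]    # total summed input
            if z[mu] * u < kappa:        # fast supervised update of w
                w += (kappa * z[mu] - u) * X[mu] / (X[mu] @ X[mu])
            # slow Hebbian update of v, accumulating with each repetition
            v += -(alpha / (n_bar * P)) * v + (beta / N_y) * z[mu] * Y[mu]
    correct = np.sign(X @ w + Y @ v) == z
    hebb_signal = (v @ Y[rep_idx]) * z[rep_idx]
    return correct, hebb_signal

correct, hebb_signal = train_two_pathway()
```

After training, `hebb_signal` is large and positive for the practiced pattern, reflecting the association accumulated in the second pathway over the repeated presentations.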

We next investigated the dependence of the error rate for a particular pattern *ν*^{∗} on the interval between training and testing (the “testing interval”), as well as on the number of repetitions *n*_{ν∗} for that pattern, again assuming that all other patterns are presented only once during training. We found that the effect of practice is to shift the forgetting curves in a roughly parallel manner (Figure 4b). This parallel shifting of forgetting curves via practiced repetition is a well-known result from experimental psychology studies in human subjects performing memory tasks [4, 6, 5, 32, 3]. This shows that the error rate during testing, given a fixed amount of practice, decreases as a function of the recency of the practiced pattern.

Plotting the data instead as a function of the number of repetitions *n*_{ν∗} for a fixed testing interval shows the error rate smoothly decreasing with practice (Figure 4c), again bearing similarity to results from memory studies in humans [33]. This shows that the error rate during testing, for a fixed recency of the practiced pattern, decreases as a function of how many times that pattern was practiced.

We next asked, if just a single pattern *ν*^{∗} is repeated multiple times while all other patterns are trained just once, how many times must pattern *ν*^{∗} be repeated in order to obtain an error rate below a threshold *p*_{θ} during later testing of that pattern? The number of necessary repetitions was found to be a supralinear function of the interval between the training and testing of pattern *ν*^{∗} (Figure 4d). This dependence could potentially be measured experimentally (to our knowledge, it has not yet been), for example in human subjects learning paired associations (as in, e.g., Refs. [32, 3]), with certain associations presented multiple times in succession during training, and with varying intervals between training on these repeated patterns and testing.

Not all of the results obtained from the two-pathway model were in agreement with published results from experiments on human memory, however. The fact that memory recall performance decays roughly as a power law has been relatively well established experimentally [3]. However, the forgetting curves in the two-pathway model are better described by exponential decay than by power laws (Supplemental Figure 2). In addition, experimental study of spaced-repetition effects in human memory has shown that there exists an optimal interval between presentations during training [34], such that the interval between the two training presentations should match that between the second presentation and the time of testing. In the two-pathway model, however, we found that shorter training intervals always lead to better testing performance, regardless of the testing interval (Supplemental Figure 3). Of course, it is not surprising that a simple single-neuron model fails to capture the full range of experimental results on human memory. Characterizing the additional mechanisms that could lead to such effects as power law memory decay and optimal intervals for spaced repetition will be a promising direction for future study.

Thus far we have considered just a single-neuron model. We next asked whether the basic mechanism illustrated in Figure 4—namely, the enhanced retention of learned classifications that are practiced multiple times—could also be implemented in a neural network performing a more challenging task. We thus extended the two-pathway architecture to a neural network with one hidden layer and trained the network to perform a center-out reaching task (Figure 5a,b). In this task, the velocity of a cursor moving in two dimensions is controlled by a neural network, which receives as input the current location of the cursor together with a context input indicating which target to move toward. In each trial, the network was trained with supervised learning to follow a smooth, minimal-jerk trajectory to the specified target, with each trial having a duration of 10 timesteps. As in the single-neuron two-pathway model, the inputs were divided into two pathways, with supervised learning in one pathway and Hebbian learning in the other.

The network was first trained in a block of trials to reach to target 1, then in another block to reach to target 2, and so on, with more trials in the first block than in the later blocks (Figure 5c). When the network was tested after training was complete, we found that, when Hebbian learning was activated, the performance for target 1 was better than that for target 2 (*p* = 0.011), despite the fact that the training for target 2 had occurred more recently (Figure 5d). This did not occur in the network without Hebbian learning, where performance for target 1 was worse than that for other targets despite the larger amount of practice. This result shows that, in this trained network model as in the single-neuron case, associative Hebbian learning enables practiced behaviors to be better retained, protecting them from being overwritten as new behaviors are learned.

### Input alignment and control transfer in the two-pathway model

In the previous section, we showed that, due to the reinforcement of the input-to-output mapping that comes about by slow Hebbian learning in the single-neuron model, input from the second pathway can contribute to driving the readout unit to its target state. We next asked how the learning process in the two-pathway model could be described at the population level. Specifically, we defined **m** and **h** as the *N*_{z}-dimensional inputs to the readout population from the first and second pathway, respectively (Figure 6a), where **m**^{µ} = **Wx**^{µ} and **h**^{µ} = **Vy**^{µ}, and the weight vectors **w** and **v** have been promoted to weight matrices **W** and **V**. The activity in the readout population is then described by the *N*_{z}-dimensional vector **z** = sgn(**m** + **h**). In this case, applying the updates (1) and (3) for each synapse when training on pattern *µ* leads to updated input currents **m**^{µ} + ∆**m**^{µ} and **h**^{µ} + ∆**h**^{µ} in the two pathways. Thus, following the updates, both of the input currents obtain a component of magnitude ∼ *O*(1) along the target activity **z̃**^{µ}, so that they become aligned with the target activity and with each other. Further, for the Hebbian update, this component continues to grow as the pattern is repeated *n*_{µ} times in succession (Figure 6b).

As a single input pattern is repeated more times in succession, the normalized alignment between the inputs, defined as **m** ⋅ **h**/(*|***m***||***h***|*), grows, meaning that the inputs from the two pathways are driving the downstream population in similar ways (Figure 6c, top), a process that we refer to as *input alignment*. In addition, as the number of repetitions grows, the input from the second pathway along the target activity **z̃**^{µ} constitutes an increasingly dominant proportion of the total input along **z̃**^{µ} (Figure 6c, bottom), a process that we refer to as *control transfer*. Both of these processes—alignment of the two inputs and the transfer of control from one pathway to the other—are *covert*, in the sense that they occur gradually during repetition and do not have any obvious immediate behavioral correlate, since the learning in the first pathway causes the activity in the downstream population to attain its target value already by the first repetition.
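Both population-level quantities are simple functions of the two pathway inputs and the target activity; a minimal sketch (the function names are ours):

```python
import numpy as np

def input_alignment(m, h):
    """Normalized alignment m.h / (|m||h|) between the pathway inputs."""
    return (m @ h) / (np.linalg.norm(m) * np.linalg.norm(h))

def control_transfer(m, h, z_target):
    """Fraction of the total input along the target activity z_target
    that is contributed by the second (Hebbian) pathway."""
    return (h @ z_target) / ((m + h) @ z_target)
```

Input alignment approaches 1 as the two inputs come to drive the readout population in the same direction, while control transfer approaches 1 as the second pathway supplies an increasing share of the drive along the target activity.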

In order to further illustrate the implications of input alignment and control transfer, we sequentially trained a two-pathway network with a multi-neuron readout population to produce *P* classifications, generalizing the task from previous sections to the case *N*_{z} > 1. During training, one particular pattern was repeated many times in succession, while all others were trained only once (Figure 6d, top). We found that input alignment became large and positive during training and, particularly for the repeated pattern, remained large and positive during testing after training was complete (Figure 6d, second panel). Similarly, the ratio of the input **h** along the target activity **z̃** from the second pathway to the total input along **z̃** became large for the repeated pattern during training and remained large during testing, illustrating the occurrence of control transfer (Figure 6d, third panel). Finally, when testing the error rate for each pattern after training was complete, we performed “lesions” by shutting off each of the inputs during testing (Figure 6d, bottom). In the case **m** = 0, we observed that the error rate increased for all patterns *except* for the pattern that was repeated multiple times during training, which retained an error rate of zero. This shows that, via the mechanisms of input alignment and control transfer for the highly practiced behavior, the second pathway was able to compensate for the loss of the first pathway, driving the downstream population to produce the correct activity even in the absence of the first input. In the case **h** = 0, on the other hand, the repeated pattern was not recalled accurately, illustrating that the input from the second pathway is necessary to protect practiced patterns from being overwritten.

In summary, when an input pathway with fast supervised learning and another with slow associative learning drive a downstream population in tandem, the mechanisms of input alignment and control transfer enable the second input to gradually overtake the role of the first in driving the downstream activity. For highly practiced patterns, this enables the second input to produce the correct downstream activity even if the first input is removed completely. As we show in Supplemental Figure 4, these mechanisms also occur in the case where reinforcement learning is used rather than supervised learning in the first pathway, but not in the case where both pathways feature reward-driven learning, illustrating that the combination of Hebbian learning together with error- or reward-based learning is necessary in order for input alignment and control transfer to take place.
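The mechanisms summarized above can be illustrated with a minimal NumPy sketch. This is not the paper's exact network: the sizes, learning rates, Hebbian decay, and the use of the (sign of the) corrected downstream activity as the postsynaptic factor are all our assumptions. One pattern is repeated many times, the rest are trained once, and test-time "lesions" zero out one input or the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-pathway network: readout current a = W x + V y. W uses a
# single-step supervised rule with margin kappa = 1; V uses a slow
# Hebbian rule with decay. All parameters here are illustrative.
N, Nz = 100, 50          # input and readout population sizes (assumed)
P, reps = 150, 50        # 150 once-trained patterns; 1 pattern repeated 50x
eta_h, alpha = 8e-4, 5.0 # slow Hebbian rate and decay (assumed)

X = rng.choice([-1.0, 1.0], size=(P + 1, N))   # first-pathway patterns
Y = rng.choice([-1.0, 1.0], size=(P + 1, N))   # second-pathway patterns
Z = rng.choice([-1.0, 1.0], size=(P + 1, Nz))  # binary target activities
W = np.zeros((Nz, N))
V = np.zeros((Nz, N))

def present(mu):
    x, y, z = X[mu], Y[mu], Z[mu]
    a = W @ x + V @ y
    viol = z * a < 1.0                          # margin violations
    # supervised single-step update: jump straight to the margin
    W[viol] += np.outer(z[viol] * (1.0 - z[viol] * a[viol]), x) / (x @ x)
    post = np.sign(W @ x + V @ y)               # corrected downstream activity
    V[:] += eta_h * (np.outer(post, y) - alpha * V)  # slow associative step

for _ in range(reps):                            # pattern 0: many repetitions
    present(0)
m0, h0 = W @ X[0], V @ Y[0]                      # currents right after repetition
align = m0 @ h0 / (np.linalg.norm(m0) * np.linalg.norm(h0))
frac = np.linalg.norm(h0) / (np.linalg.norm(m0) + np.linalg.norm(h0))

for mu in range(1, P + 1):                       # all other patterns: once each
    present(mu)

def err(mu, lesion=None):                        # test-time "lesion"
    a = (0 if lesion == "m" else W @ X[mu]) + (0 if lesion == "h" else V @ Y[mu])
    return np.mean(np.sign(a) != Z[mu])

print(align, frac)                   # input alignment, Hebbian control fraction
print(err(0), err(0, "m"), err(0, "h"))
```

In this sketch the repeated pattern survives removal of the first input (**m** = 0) but not of the second (**h** = 0), while once-trained patterns near chance under **m** = 0, mirroring the lesion comparison in Figure 6d.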

### Implementation in the sensorimotor brain circuit: Automatization of behavior through practice

In the previous sections we showed that, for a single-layer neural network in which two input pathways operate with different types of learning, fast error-driven learning in the first pathway causes the downstream neurons to produce the correct output, while slow associative learning in the second pathway reinforces the association between the input to the network and the correct output. After many repetitions, this scheme enables the second input pathway to assume control for driving the downstream layer.

Could this framework for learning be realized in the brain? As a neurobiological interpretation, we propose identifying the readout population in the two-pathway model as sensorimotor striatum, a primary site of reward-based learning in the vertebrate brain. The descending pathway through sensorimotor striatum plays a major role in both the learning and execution of learned motor behaviors [35, 36, 37] as well as habit formation and the automatization of highly practiced behaviors [38, 39]. These processes are facilitated by dopamine-dependent plasticity at input synapses to striatum, where this plasticity plays a role in reinforcing behaviors that have previously led to reward [40].

The dominant cortical input to sensorimotor striatum is sensorimotor cortex [41, 42], and a great deal of research has implicated this pathway in motor learning [43, 44]. We propose to identify this as the first input in the two-pathway model, where this input is characterized by fast, error-driven learning. As the second input, which is characterized by slow, associative learning, we identify the parafascicular nucleus of the thalamus, which is the dominant subcortical input to sensorimotor striatum [45]. Although this pathway has received less attention than the corticostriatal pathway, it has also been shown to be important for the production of learned behaviors [46].

Taken together, these identifications suggest a picture in which information about state, including sensory feedback, cues, and context, is represented in cortex and thalamus, and in which plasticity at the synapses descending from these structures to striatum allows for state information to be mapped onto actions that will tend to minimize errors and maximize rewards (Figure 7a). The mechanisms and roles of learning in the two pathways are distinct, with fast changes to corticostriatal synapses serving to learn and produce the correct behavior, and slower changes to thalamostriatal synapses serving to associate a given state with a behavioral output, with this association becoming stronger and more persistent the more that it is practiced. As described in the previous section, the transfer of control from corticostriatal to thalamostriatal inputs would suggest that motor cortex should be necessary for the initial learning of a behavior, but not for the production of the behavior after a sufficient amount of practice. Evidence for this has recently been found in studies using lesions [47] or inactivations [48] to test the involvement of motor cortex in rodents performing skilled forelimb tasks. In addition, studies of learned motor behaviors in humans have suggested that motor cortex becomes less involved in driving behaviors with practice [49, 50], consistent with the proposed sensorimotor implementation of the two-pathway model.

With the neurobiological implementation proposed above, the two-pathway model also leads to a number of testable experimental predictions. First and most obviously, the theory predicts that cortico- vs. thalamostriatal synapses should be modified by different types of plasticity (fast and reward-driven vs. slow and associative, respectively). Because both types of synapses are glutamatergic, they have typically not been studied separately or even distinguished in past experiments [51]. Future experiments could test for differences between the types of modifications at these synapses, either *in vivo* or in slice preparations.

Second, because motor cortex plays less of a role in driving striatum as behaviors become highly practiced, lesioning, inactivating, or randomly perturbing activity in motor cortex during behavior should have a progressively smaller effect on the behavior the more it has been trained (Figure 7b). This effect has in fact already been observed in recent experiments using optogenetic inactivation throughout learning of a targeted forelimb task [48].

Third, online optogenetic inactivation could be used so that only motor cortex or only thalamus drives striatum in a given trial. The theory would predict that the striatal activities observed in these two cases would be correlated due to input alignment, with the degree of correlation increasing with the amount of training (Figure 7c, blue line). Similarly, striatal population activity should be similar before and after cortical lesion or inactivation in a trained animal, with the degree of similarity increasing with the amount of training (Figure 7c, red line).

Fourth, lesions of motor cortex and of thalamus are predicted to produce different results, especially if a substantial amount of time elapses between training and testing of a behavior, allowing for potential overwriting. As shown in the previous section and as observed in experiments [47], highly practiced behaviors are robust to motor cortical lesion due to input alignment and control transfer. They are not expected to be robust to thalamic lesions, however (Figure 6d, bottom panel). Comparing the effects of motor cortical and thalamic lesions in trained animals could thus provide a test of input alignment and control transfer in the two-pathway model.

Finally, if thalamostriatal plasticity is blocked during training of a behavior, then the model predicts that the behavior will still be learned due to the fact that motor cortex is responsible for driving initial learning. Since input alignment cannot occur in the absence of thalamostriatal plasticity, however, the learned behavior will no longer be robust to cortical lesion or inactivation. This would be a challenging experiment to perform due to the fact that both thalamo- and corticostriatal synapses are glutamatergic, making selective manipulation difficult. If it could be done, however, it would provide strong evidence for our proposed neurobiological implementation of the two-pathway model.

## Discussion

In this work, we have developed a bottom-up, mechanistic theory addressing the combined effects of recency and practice in memory and learned behavior. In particular, this theory proposes two simultaneously occurring learning processes: a fast process that minimizes errors, paired with a slow process that strengthens associations. From a computational perspective, this work provides a quantitative theoretical framework in which the effects of recency and practice can be analyzed for the simplest possible case, namely a single neuron. From a cognitive perspective, including both of these processes provides an explanation for why repeated practice can continue to have effects even after a memory or behavior has been learned to be performed with zero error. From a neurobiological perspective, this work provides a framework for understanding the automatization of learned behavior and the formation of habits, and proposes how these processes might be implemented in neural circuits.

In the Introduction, we described the multiple effects of practice: improved performance, habit formation, and decrease in cognitive effort [15]. The two-pathway model, and its proposed implementation in the brain’s sensorimotor circuitry in particular, provides a unified account of these effects. First, fast learning in the first pathway provides a mechanism for improved performance, with the greatest gains coming in the early stages of practice. Second, the two-pathway model provides a mechanistic description of habit formation, in which behaviors are produced consistently in response to particular stimuli, even when the behavior is no longer rewarded. Going further, it may even provide a description of addiction and compulsion, in which behaviors persist even if they lead to negative outcomes. Finally, the two-pathway model provides a mechanistic description of behavioral automatization, with the amount of conscious effort devoted to a behavior decreasing with increasing amounts of practice, as control of the behavior is transferred from a cortical to a subcortical circuit.

In Supplemental Figures 2 and 3 we showed that our simple model does not reproduce all of the relevant results from human memory studies. One successful approach for obtaining power law memory decay has been to allow individual synapses to have multiple states, with a different timescale associated with each state [52, 23]. Incorporating associative learning and the effects of practice into such multi-state models could be a promising direction for future work on the mechanistic modeling of learning and memory. More broadly, future work in this area will need to identify the network-level mechanisms that emerge in circuits with large numbers of neurons and multiple layers and might underlie phenomena such as power-law forgetting curves and the effects of spaced repetition. In addition, memory performance can also be affected by higher-level cognitive strategies such as chunking or creating narratives to link information meaningfully [3]. Clearly, such complex strategies cannot be addressed by a single-neuron model, though it might be hoped that such a model could be a building block for addressing such phenomena in the future.

Other promising directions for future theoretical work include modeling neurons with continuous output values or sign-constrained synapses, as well as scaling the two-pathway idea up to multilayer architectures. Future work should also extend the model from supervised learning to reinforcement learning, which is less theoretically tractable but more plausible as a model of reward-based learning at corticostriatal synapses [40]. Reinforcement learning is also more plausible from a behavioral perspective, since it allows for performance to improve gradually by trial and error, leading to a richer description of learned motor behaviors than the single-step perceptron learning rule can provide. While a first step in this direction was taken in Supplemental Figure 4, which shows that the mechanisms of input alignment and control transfer are realized in a circuit trained with reinforcement learning, extending the theoretical framework underlying the two-pathway model to the case of reinforcement learning is an important direction for future work. Finally, in our proposed neurobiological implementation, we have focused on sensorimotor striatum and its inputs. The thalamo-cortico-basal ganglia circuits that are involved in higher-level cognitive functions and that exist in parallel with the sensorimotor circuit could provide alternative interpretations of the model that we have presented [38, 53]. More broadly, this two-pathway motif might be found in other brain areas and could in part explain the large degree of apparent redundancy found throughout the neural circuitry of the brain.

## Materials and Methods

Parameters and equations for the simulations in Figures 1, 2, 3, 4, 6, and 7 are provided in the Main Text. For the simulations in Figure 5, *n* = 21 networks were trained for each of the conditions shown (SL only, SL+HL, and SL shuffled), where each network had *N*_{x} = *N*_{y} = 50, *N*_{z} = 10, and a 2-dimensional linear readout. The readout controlled the velocity of a cursor, which began at position **r** = **0** and moved to one of the four target positions: (1,0), (0,1), (−1,0), (0,−1). The target velocity in each trial was given by
where *t* is the timestep, *T* = 10 is the number of timesteps in each trial, and **r**^{i} denotes one of the four target positions. This bell-shaped velocity profile describes a minimum-jerk trajectory from the initial to the target point. Both input layer populations receive fixed random projections of the network readout from the previous timestep, as well as a tonic random input associated with one of the four targets.

During training, the mean squared difference between the network output and the target velocity was minimized by gradient descent on the weights **W** (labelled SL in Figure 5b), with learning rate *η*_{SL} = 0.001 (found by grid search). The weights **V** (labelled HL in Figure 5b) were either not trained (in the cases SL only and SL shuffled) or trained with the Hebbian update rule ∆*V*_{ij} = *η*_{HL}[−*V*_{ij} + *z*_{i}(*t*)*y*_{j}(*t*)], where *η*_{HL} = 10^{−6} was found by grid search. The networks were sequentially trained on each target, as shown in Figure 5c. The bars in Figure 5d show the median test loss over the trained networks for each target, with error bars denoting the 25th and 75th percentiles.
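The target-velocity equation itself did not survive extraction above, but the text identifies it as the bell-shaped minimum-jerk profile. A sketch assuming the standard minimum-jerk form (velocity ∝ 30*τ*²(1 − *τ*)², an assumption, not a quote of the paper's equation):

```python
import numpy as np

def minimum_jerk_velocity(r_target, T=10):
    # Standard minimum-jerk reach: position follows 10*tau^3 - 15*tau^4
    # + 6*tau^5, so the speed profile is the bell curve 30*tau^2*(1-tau)^2.
    tau = (np.arange(T) + 0.5) / T              # midpoint of each timestep
    profile = 30 * tau**2 * (1 - tau)**2 / T    # per-timestep displacement
    return np.outer(profile, r_target)          # (T, 2) velocity sequence

v = minimum_jerk_velocity(np.array([1.0, 0.0]))
# The T per-timestep velocities sum to the target displacement
print(v.sum(axis=0))   # ≈ [1., 0.]
```

The profile starts and ends at (near) zero velocity and peaks mid-movement, which is what makes the trajectory "minimum jerk."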

## Supplemental Information

### Learning with gradient descent

The perceptron learning rule, according to which synaptic weights are adjusted to produce the correct output in a single step, is mathematically convenient but biologically questionable. In order to address this, we simulated the forgetting curve using gradient descent learning, a widely used supervised learning algorithm in which small updates are accumulated over many repetitions to minimize the readout error.

In this case the output of the neuron is , where is drawn randomly for step *a*. The number of steps for each pattern was chosen to be *N*_{steps} = 100, and the learning rate *η* = 0.01 and noise amplitude *σ* = 0.2 were chosen by grid search to maximize the area between the forgetting curve and chance performance. The noise is not strictly necessary in gradient descent learning, but is included here so that a finite classification margin is obtained, as with the perceptron learning rule (Figure 1b). The synaptic weights were then updated according to the gradient-descent update rule , where

As shown in Supplemental Figure 1, the forgetting curve obtained for the perceptron with this alternative learning rule is similar to that obtained using the perceptron learning rule.
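A compact sketch of this gradient-descent variant is below. The paper's exact loss function is elided above, so a quadratic hinge loss on the classification margin is assumed here, with the stated learning rate, step count, and input noise; targets are absorbed into the patterns so every target is +1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sequential gradient-descent learning and forgetting (illustrative only;
# the loss below is an assumed quadratic hinge on the margin).
N, P = 100, 300
n_steps, eta, sigma = 100, 0.01, 0.2
X = rng.choice([-1.0, 1.0], size=(P, N))   # targets absorbed: all z = +1
w = np.zeros(N)

for mu in range(P):                         # train patterns one at a time
    for _ in range(n_steps):
        x_noisy = X[mu] + sigma * rng.standard_normal(N)
        f = w @ x_noisy / np.sqrt(N)        # noisy readout margin
        if f < 1.0:                         # gradient step on 0.5*(1 - f)^2
            w += eta * (1.0 - f) * x_noisy / np.sqrt(N)

recall = (X @ w > 0)                        # test recall with noiseless inputs
acc_old, acc_recent = recall[:60].mean(), recall[-60:].mean()
print(acc_old, acc_recent)                  # older patterns are forgotten
```

Recently trained patterns are recalled reliably while patterns trained long ago drift back toward chance, reproducing the qualitative shape of the forgetting curve.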

### Forgetting curves decay exponentially

Many previous studies on memory have shown that forgetting curves are well described by curves decaying as a power law [3], in which the probability of correct recall has the form ∼ *t*^{−a}, where *t* is time and *a* > 0. Supplemental Figure 2a shows that such a fitting function, which appears as a straight line in a log-log plot, does a relatively poor job of fitting the forgetting curves from the two-pathway model. In contrast, exponentially decaying functions, in which the probability of correct recall ∼ *e*^{−at} and which appear as straight lines on semilogarithmic plots, provide a good fit to the forgetting curves (Supplemental Figure 2b).
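The fitting comparison above can be checked on synthetic data: an exponential decay is exactly linear on semilog axes but curved on log-log axes, so a straight-line fit leaves much larger residuals in log-log coordinates (the decay constant below is an arbitrary illustrative choice):

```python
import numpy as np

# Exponential forgetting curve: linear on semilog axes, curved on log-log.
t = np.arange(1.0, 51.0)
recall = np.exp(-0.1 * t)              # probability of correct recall ~ e^(-a t)

# straight-line fits; polyfit(full=True) returns the sum of squared residuals
semilog_resid = np.polyfit(t, np.log(recall), 1, full=True)[1][0]
loglog_resid = np.polyfit(np.log(t), np.log(recall), 1, full=True)[1][0]
print(semilog_resid, loglog_resid)     # semilog fit is (near-)exact
```

For power-law data the situation reverses, which is the basis of the comparison in Supplemental Figure 2.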

### Spaced repetition

In simulations of the two-pathway model, the neuron was trained in sequence to perform *P* classifications. All of the input patterns were distinct, except for a single pattern that was presented twice during training, with an interval ∆_{train} between presentations. The interval between the second presentation and the testing phase was ∆_{test}. For any ∆_{test}, smaller values of ∆_{train} always led to a lower error rate for the repeated pattern during testing (Figure 3b).

In addition, we note that repetition of patterns has no significant effect in the single-pathway model without Hebbian weights. In this case, upon the second presentation of the repeated pattern during training, the weight vector will either not be updated at all (if classification is already correct with sufficient margin) or will be updated to lie on the correct side of the classification boundary plus a margin (if the classification is initially incorrect or correct with insufficient margin). In neither case is there a benefit to having seen the pattern before, since the same outcome would have occurred even if the first presentation had not taken place.

Thus, while the information accumulated in the Hebbian weights of the two-pathway model is an important ingredient for describing nontrivial effects due to spaced repetition, the single-neuron model appears to be unable to account for the experimentally observed existence of optimal repetition intervals [34], and thus leaves room for future work on this topic.
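The history-independence argument for the single-pathway model can be verified directly. Assuming the single-step margin rule described above (jump exactly to margin *κ* when violated, do nothing otherwise), the margin immediately after a presentation does not depend on whether the pattern was seen before:

```python
import numpy as np

rng = np.random.default_rng(3)

# Single-step margin rule: if classification of (x, z) is wrong or below
# margin kappa, jump straight to margin kappa; otherwise leave w unchanged.
def present(w, x, z, kappa=1.0):
    if z * (w @ x) < kappa:
        w = w + z * (kappa - z * (w @ x)) * x / (x @ x)
    return w

N = 50
x = rng.choice([-1.0, 1.0], size=N)
z = 1.0

# Two histories: a weight vector that has never seen (x, z), and one that has
w_naive = rng.standard_normal(N) * 0.1
w_experienced = present(w_naive.copy(), x, z)   # saw the pattern once before

m1 = z * (present(w_naive, x, z) @ x)           # margin after presentation
m2 = z * (present(w_experienced, x, z) @ x)
print(m1, m2)   # identical margins regardless of history
```

Whether the rule fires (both end exactly at *κ*) or not (both are left alone), the post-presentation margin is the same, so the earlier presentation confers no benefit.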

### Input alignment with reinforcement learning

We checked that the mechanisms of input alignment and control transfer could also be obtained using reinforcement learning (RL), which is a more plausible form of synaptic plasticity than the perceptron learning rule for describing changes at corticostriatal synapses. In this case, a noise current *ξ* was added to the readout layer, with , and the updates to input weights in the first pathway were given by [55, 56]

Over the course of many trials, where each trial consisted of a complete presentation of all *P* = 10 patterns, the input currents from the two pathways became aligned when one pathway was trained with RL and the other with HL, but not when both pathways were trained with RL (Supplemental Figure 4b). Further, due to input alignment and control transfer, the first input **m** can be removed once the network is trained, without significantly impairing performance (Supplemental Figure 4c).
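The RL variant can be sketched for a single readout unit and a single pattern. The paper's exact update rules [55, 56] are elided above, so a generic node-perturbation rule with a running reward baseline is assumed for the first pathway, and a decaying Hebbian rule for the second; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Node-perturbation RL in pathway 1 plus slow Hebbian learning in pathway 2
# (assumed forms, not the paper's exact rules).
N = 100
x = rng.choice([-1.0, 1.0], size=N)    # first-pathway input
y = rng.choice([-1.0, 1.0], size=N)    # second-pathway input
z_target = 1.0
m = np.zeros(N)                        # RL-trained weights
v = np.zeros(N)                        # Hebbian-trained weights
eta_rl, eta_h, alpha, sigma = 0.1, 0.002, 10.0, 0.5
r_bar = 0.0                            # running reward baseline

u_m_hist, u_v_hist = [], []
for t in range(3000):
    u_m, u_v = m @ x / np.sqrt(N), v @ y / np.sqrt(N)
    xi = sigma * rng.standard_normal()
    z = u_m + u_v + xi                 # noisy readout
    r = -(z - z_target) ** 2           # reward = negative squared error
    m += eta_rl * (r - r_bar) * xi * x / np.sqrt(N)  # node perturbation
    r_bar += 0.1 * (r - r_bar)         # update the baseline
    v += eta_h * (z * y - alpha * v)   # slow Hebbian step with decay
    u_m_hist.append(u_m); u_v_hist.append(u_v)

u_m_late = np.mean(u_m_hist[-500:])
u_v_late = np.mean(u_v_hist[-500:])
print(u_m_late, u_v_late)   # control transfers to the Hebbian pathway
```

RL first drives the total input toward its target; the Hebbian pathway then slowly absorbs that input, and RL compensates by relaxing the first pathway toward zero, so the trained behavior survives removal of **m**.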

## Acknowledgments

The authors are grateful to B. Ölveczky, members of the Ölveczky lab, and L.F. Abbott for helpful discussions and feedback. Support for this work was provided by the National Science Foundation NeuroNex program (DBI-1707398), the National Institutes of Health (R01 NS105349, DP5 OD019897, and U19 NS104649), and the Gatsby Charitable Foundation.

## Appendix 1: The perceptron forgetting curve

As described in the main text, we wish to train a perceptron having *N*_{x} = *N* inputs and subject to the update rule (1) to map *P* random input patterns onto randomly chosen binary outputs. The question that we seek to answer is, after *P* patterns have been trained using the update rule in (1), what will be the probability of misclassification if we then test the output produced by a particular pattern *ν* without any further learning? Clearly, the most recently learned patterns are likely to produce correct outputs, while those learned long in the past are more likely to produce errors due to accumulated changes in the weights **w** during subsequent learning. In general, the probability of an error when testing on pattern *ν* is, using the Heaviside step function Θ(⋅), given by

In this equation, we have, without loss of generality, redefined , so that the target output becomes *z*^{µ} = 1 for every pattern. We have also set the classification margin *κ* = 1, which amounts to a choice for scaling the overall magnitude |**w**|. Here, the weight vector just before training pattern *ν* is assumed to come from a distribution *p*(**w**^{ν}), which we shall derive below. The first line in (5) counts the cases in which the classification using this weight vector is initially incorrect or correct with margin less than *κ* = 1 (so 1 − **w**^{ν} ⋅ **x**^{ν} > 0) and in which, after making the initial update **w**^{ν} *→* **w**^{ν} + ∆**w**^{ν}, the weight vector evolves through *P* − *ν* successive updates into the final weight vector **w**^{P}, which leads to incorrect classification when pattern *ν* is again tested (−**w**^{P} ⋅ **x**^{ν} > 0). Similarly, the second line of (5) counts the cases in which the classification is initially *correct* with a sufficiently large margin and in which, after making successive weight updates, the final weight vector **w**^{P} again leads to incorrect classification.

At this stage, the probability distributions *p*(**w**^{ν}) and *p*(**w**^{P} *|***w**^{ν}) in (5) are unknown. For the latter distribution, however, we can track its evolution step by step using the update rule (1). Because **x** is a random variable, we first seek to find the distribution *p*(∆**w***|***w**) by averaging over **x**. In fact, it will be sufficient just to calculate the first two moments of this distribution. Let **x** = **x**^{||} + **x**^{⊥}, where **x**^{||} is the component along **w**, and the index *ν* has been dropped for simplicity. In this case the weight update (1), in the case where an update occurs, can be written as

Using (6), in the case where a weight update occurs, the first moment is given by

Similarly, the second moment of the distribution is where

Together, (7)-(9) describe a drift-diffusion process. If we neglect higher-order moments, then the single-step probability distribution is given by
where *q* (to be calculated below) is the probability that a weight update occurs in a given step, and *𝒩*(⋅, ⋅) is the multinormal distribution. The time evolution of the probability distribution of the weights is then given by the Fokker-Planck equation [54]:
where the initial condition from (5) is either *p*(**w**) = *δ*(**w**−**w**^{ν}) or *p*(**w**) = *δ*(**w**−**w**^{ν} −∆**w**^{ν}), and *τ* ≡ *q*(*P* − *ν*) is the effective time variable.

Because the coefficients in (11) depend on |**w**^{τ}|, the full solution is not known in general. However, it is straightforward to calculate the time evolution of the moments of *p*(**w**^{τ}) by multiplying both sides by powers of **w**^{τ} and using integration by parts [54]. Denoting the first two moments (not to be confused with the moments of ∆**w** defined above) as
the time evolution is given by

In the *N* → ∞ limit, we can assume (to be checked below) that |**w**^{τ}| is constant. In this case, (13) has the solutions
where terms *∼ O*(1/*N*^{2}) in the second equation have been dropped in the large-*N* limit (while the first term is kept because *τ* may be *∼ O*(*N*)).

With (14), the solution to the Fokker-Planck equation (11) when just the first two moments are kept is
where “⊗” denotes the outer product (i.e. [**w** ⊗ **w**]_{ij} = *w*_{i}*w*_{j}), and we have defined

In the *N* → ∞ limit, the anisotropic term can be ignored, giving
which is the probability density evolution corresponding to an Ornstein-Uhlenbeck stochastic process [54]. After a long time, *γ →* 0 and p(**w**^{P} |**w**^{ν}) from (15) approaches the steady-state distribution

Thus, the deterministic update rule (1) leads to a bounded steady-state weight distribution *p*(**w**) after a large number of classifications have been learned. This differs somewhat from previous models of sequential learning with random synaptic weight updates, in which such a bounded distribution as *P*/*N* → ∞ was achieved either by requiring that the synaptic weights be bounded [31] or that they decay slightly at each step [23].

Given the steady-state distribution (18), we can assume that is constant in the large-*N* limit and let . Then, with , (18) can be used to calculate the variance of **w**, leading to the self-consistent equation , which, using the definition of *g*(|**w**|) from (9), has the solution . This is close to, but differs somewhat from, the steady-state norm found in numerical simulations, from which . The discrepancy presumably arises from approximating *p*(∆**w**|**w**) using only the first two moments of the distribution in (10). In general, the higher-order moments do not vanish, and these contribute higher-order derivative terms in the Fokker-Planck equation (11). In turn, such terms lead to nonvanishing higher-order moments in the distribution *p*(**w**), beyond the two that were calculated in (12)-(14). It is presumably these higher-order terms that cause the discrepancy between the simulated result and the self-consistent calculation. In the theoretical curves shown in the Results section, we use the value of obtained from simulations. Using (18), we can also calculate the probability of making a weight update in a given step, which is given by
and evaluates to *q ≃* 0.798.
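The steady-state picture above can be checked numerically. Assuming the update-to-margin rule described earlier (with targets absorbed and *κ* = 1), a long stream of random patterns should yield an update probability near *q* ≃ 0.798 and a bounded weight norm:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sequentially train the margin perceptron (targets absorbed, kappa = 1) on
# a long stream of random patterns; measure the steady-state update
# probability q and the weight norm.
N, T = 500, 20000
w = np.zeros(N)
updates = []
for _ in range(T):
    x = rng.choice([-1.0, 1.0], size=N)
    gap = 1.0 - w @ x                  # update iff margin below kappa = 1
    if gap > 0:
        w += gap * x / (x @ x)         # jump straight to the margin
    updates.append(gap > 0)

q = np.mean(updates[T // 2:])          # update frequency at steady state
print(q, np.linalg.norm(w))
```

The norm settles at an *O*(1) value despite the rule having no explicit weight bound or decay, which is the point made in the surrounding paragraph.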

With the preceding points in mind, and making use of the probability distributions (17) and (18), we can proceed to evaluate the integrals in (5) to obtain the probability of incorrect classification when testing pattern *ν*. In order to factorize the arguments of the Heaviside step functions, we make use of the following identity:
With this trick, (5) becomes
where
and

Beginning with *I*_{1}, the integral over **w**^{P} can be performed, which, after simplification and using **x**^{2} = *N* in the *N* → ∞ limit, leads to

Performing the integrals over **w**^{ν} and **x** then leads to

Finally, the integrals over *v* and *v*′ can be performed exactly, then those over *u* and *u*′ can be evaluated using the complementary error function, yielding the final result

In a similar manner, the integrals in (23) can be evaluated to get

After performing the integrals over *v*, *v*′, and *u*′, then changing variables for the *u* integral, the final result is
where the final integral in this case must be performed numerically.

As a check, we can evaluate these results in their extreme limits. In the case *γ* = 1, which corresponds to testing the most recently learned pattern, we have *I*_{1} ∼ *I*_{2} ∼ erfc(∞) → 0, so that there is perfect classification for very recently learned patterns. In the opposite limit of *γ* = 0, which corresponds to testing patterns learned in the distant past, we obtain , which means that very old patterns are completely overwritten and so are classified at chance level.

## Appendix 2: The two-pathway forgetting curve

Let us introduce a second source of input to the downstream units, so that , where . Though it is not necessary, we will assume for notational simplicity that the numbers of units in the two input layers are the same, so that *N*_{x} = *N*_{y} = *N*. The weights **w** are again trained using supervised learning:

The second set of weights, meanwhile, is updated using the following rule:

This Hebbian update rule defines an *N*-dimensional Ornstein-Uhlenbeck stochastic process with time-dependent coefficients. The evolution of the probability distribution *p*(**v**, ∆*t*) is given by the master equation:
where

We can further note that, since the components of **v** are not coupled to one another in (30), we can, without loss of generality, consider the evolution of just a single component *v*_{i}. In this case the two sides of (31) can be expanded to obtain

Then, using (32) to obtain 〈∆*v*_{i}〉 and 〈(∆*v*_{i})^{2}〉, (33) leads to the Fokker-Planck equation, which describes the evolution of the probability distribution p(**v**) as new patterns are learned:

In this equation, we have taken the continuous-time limit by letting ∆*t* = 1/*N* and *t* ≡ (*µ* − *ν*)/*N*, where *µ* > *ν*, the number of repetitions to be *n*(*t*) ≡ *n*_{µ}, and the initial condition to be given by the distribution (to be calculated below) *p*(**v**^{ν}).

The Fokker-Planck equation (34) can be solved using the Fourier transform [54]

With this, (34) becomes

This equation can be solved by making the following ansatz:
which, by substituting into (36) and requiring that the terms at each order in **k** vanish, leads to the following equations for the time-dependent coefficients:

These equations then have the solutions

With this, and letting *t* = (*P − µ*)/*N*, we can take the inverse Fourier transform of (37) to obtain the distribution of the final weight vector given the weights at pattern *ν*:
where we have identified , with
and we have also identified , with

In what follows below, in order to keep expressions compact, we shall write and *ρ* = *ρ*(*P*, *ν*). Further, though it is not strictly necessary, these expressions simplify considerably if we assume that the average of *n*_{µ} in (42) over the last *P − ν* patterns is equal to its average over the full set of *P* patterns (or, in the case of (41), that can be replaced by ). In this case, (42) becomes , while (41), after performing the summation, becomes .

Returning to the distribution (40), we can see that it begins as a *δ* function at **v**^{ν} and, as more patterns are introduced and *ρ* → 0, evolves to the following steady-state distribution:

This is the distribution from which **v**^{ν} will be drawn in order to calculate the error rate for pattern *ν* below. In addition to driving a drift-diffusion process for the weights **v**, the updates to **v** also affect the evolution of the distribution of weights **w** due to the appearance of **v** in the update rule (29). In order to account for this change, the first two moments of the update ∆**w** must be reevaluated as in (7)-(8), but now averaging over the random variable **y** in addition to averaging over **x**. Because 〈*y*_{i}〉 = 0, the first moment in (7) remains unchanged. As for the second moment, (8) generalizes to

As before, we will assume that |**w**|^{2} and |**v**|^{2} can be replaced by their average values in the *N* → ∞ limit. Noting from (43) that 〈**v**^{2}〉 = *β*^{2}/*α*, we have the diffusion tensor , where , where, as before, is taken from numerical simulations. From this result, we see that the equations (17)-(18) determining the evolution of **w** can also be applied in the two-pathway case by making the substitution .

As in the previous section, our goal is to calculate the probability of incorrect classification when testing pattern *ν* after training *P* patterns in sequence. Including the second input pathway, this quantity is given by

As before, we have absorbed all into the definition of the input activity vectors **x**^{µ} and **y**^{µ} by letting and , effectively setting all . The first term in the integrand corresponds to cases in which the classification of pattern *ν* is initially incorrect, so that the weights **w** and **v** are both updated. The second term corresponds to cases in which the initial classification is correct, so that only the weights **v** are updated. In both cases, the weight distributions for **w** and **v** evolve according to drift-diffusion processes, as new patterns are learned up until pattern *P*, at which time, the classification of pattern *ν* is tested using the final weights **w**^{P} and **v**^{P} (the first Θ function appearing in the integrand).

Again using the trick (20) to represent the Heaviside step functions, the two terms in (45) can be written as where and

Beginning with *J*_{1}, we can shift the integration variable and use **x**^{2} = *N* in the *N* → ∞ limit to obtain
where
where the Hebbian decay term ∼ *α***v**^{ν}/*N* was dropped in the first line because it vanishes as *N* → ∞, and the integration variable **v**^{P} was shifted in the third line. Using this result, and noting that the integrals over **x**, **w**^{ν}, and **w**^{P} are the same as those appearing in (22) (with ), (49) becomes
where we have defined
and

As in the case without the Hebbian pathway, the last remaining integral in (51) must be performed numerically.

Equation (48) for *J*_{2} can be evaluated in a similar manner. To begin, we express it as
where

Using this result, and noting that the integrals over **x**, **w**^{ν}, and **w**^{P} are the same as those appearing in (23) (with ), (54) becomes
where we have defined
and

It is straightforward to check that, in the limit *β* → 0, which corresponds to shutting off the Hebbian input pathway, we recover *J*_{1,2} → *I*_{1,2} from the simple perceptron result.

Finally, we can calculate the probability of making a weight update in a given step, which in the two-pathway case is given by

In the case where *β*^{2}/*α* = 0, this evaluates to the earlier result *q* ≃ 0.798. For nonzero *β*, on the other hand, *q* decreases and approaches 0.5 as *β*^{2}/*α* → ∞.

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].