## Abstract

How are actions linked with subsequent outcomes to guide choices? The nucleus accumbens, which is implicated in this process, receives glutamatergic inputs from the prelimbic cortex and midline regions of the thalamus. However, little is known about whether and how representations differ across these input pathways. By comparing these inputs during a reinforcement learning task in mice, we discovered that prelimbic cortical inputs preferentially represent actions and choices, whereas midline thalamic inputs preferentially represent cues. Choice-selective activity in the prelimbic cortical inputs is organized in sequences that persist beyond the outcome. Through computational modeling, we demonstrate that these sequences can support the neural implementation of reinforcement learning algorithms, both in a circuit model based on synaptic plasticity and in one based on neural dynamics. Finally, we test and confirm predictions of our circuit models by direct manipulation of nucleus accumbens input neurons. Thus, we integrate experiment and modeling to suggest neural solutions for credit assignment.

## Introduction

Multiple lines of experimental evidence indicate that the nucleus accumbens (NAc, part of the ventral striatum) is critical to reward-based learning and decision-making (Apicella et al., 1991; Cador et al., 1989; Carelli et al., 1993; Cox and Witten, 2019; Di Ciano et al., 2001; Everitt et al., 1991; Parkinson et al., 1999; Phillips et al., 1993, 1994; Robbins et al., 1989; Roitman et al., 2005; Setlow et al., 2003; Stuber et al., 2011; Taylor and Robbins, 1986). The NAc is a site of convergence of glutamatergic inputs from a variety of regions, including the prefrontal cortex and the midline thalamus, along with dense dopaminergic inputs from the midbrain (Brog et al., 1993; Do-Monte et al., 2017; Groenewegen et al., 1980; Hunnicutt et al., 2016; Otis et al., 2017; Phillipson and Griffiths, 1985; Poulin et al., 2018; Reed et al., 2018; Swanson, 1982; Wright and Groenewegen, 1995; Zhu et al., 2016).

An important mechanism underlying reward-based learning and decision-making is thought to be dopamine-dependent synaptic plasticity of glutamatergic inputs to the NAc that are co-active with a reward prediction error (RPE) in dopamine neurons (Fisher et al., 2017; Gerfen and Surmeier, 2011; Reynolds and Wickens, 2002; Russo et al., 2010). Such strengthening of glutamatergic inputs is thought to be central to learning, allowing actions that are followed by a rewarding outcome to be more likely to be repeated in the future (Britt et al., 2012; MacAskill et al., 2014; Steinberg et al., 2013; Tsai et al., 2009; Witten et al., 2011).

A central question in reinforcement learning is how actions and outcomes become associated with each other, even when they are separated in time (Asaad et al., 2017; Gersch et al., 2014; Sutton, 1988; Wörgötter and Porr, 2005). A possible mechanism that could contribute to solving this problem of temporal credit assignment in the brain is that neural activity in the glutamatergic inputs to the NAc provide a neural memory trace of previous actions. This could allow action representations from glutamatergic inputs and outcome information from dopaminergic inputs to overlap in time.

Whether glutamatergic inputs to the NAc indeed represent memories of previous actions is unclear. More broadly, what information is carried by glutamatergic inputs to the NAc during reinforcement learning, and whether different inputs provide overlapping or distinct streams of information, has not been examined systematically. To date, there have been relatively few recordings of cellular-resolution activity of glutamatergic inputs to the NAc during reinforcement learning, nor comparison of multiple inputs within the same task, nor examination of the timescale with which information is represented within and across trials. Furthermore, if glutamatergic inputs do indeed provide memories of previous actions, construction of a neurally plausible instantiation of an algorithm for credit assignment based on the measured signals remains to be demonstrated (for review of biological instantiation of reinforcement learning algorithms, see Joel et al., 2002).

To address these gaps, we recorded from glutamatergic inputs to the NAc during a probabilistic reversal learning task that we previously demonstrated was dopamine-dependent. In this task, dopamine neurons that project to the NAc encode RPE, and inhibition of dopamine neurons substitutes for a negative RPE (Parker et al., 2016). To compare activity in major cortical and thalamic inputs to the NAc core, here we combined a retrograde viral targeting strategy with cellular-resolution imaging to examine the input from prelimbic cortex (“PL-NAc”, part of medial prefrontal cortex) and that from the midline regions of the thalamus (“mTH-NAc”). We found that PL-NAc neurons preferentially encode actions and choices relative to mTH-NAc neurons, with choice-selective sequential activity that persists until the start of the subsequent trial. The long timescale through which a prior action is encoded in cortical inputs to the NAc provides the information to bridge actions, outcomes, and the subsequent choice. In addition, we demonstrated with computational modeling that these choice-selective sequences can support neural instantiations of reinforcement learning algorithms, either through dopamine-dependent changes in downstream synaptic weights, or dopamine-dependent changes in neural dynamics. Finally, we test and confirm a prediction of our models through direct optogenetic manipulation of PL-NAc neurons. Thus, by recording and manipulating glutamatergic inputs to the NAc and integrating these data with computational modeling, we provide specific proposals for how reinforcement learning could be implemented by neural circuitry.

## Results

### Cellular resolution imaging of glutamatergic inputs to the NAc during a probabilistic reversal learning task

Mice performed a probabilistic reversal learning task while inputs from thalamus or cortex were imaged (**Figure 1a**). A trial was initiated when the mouse entered a central nose poke, which prompted the presentation of a lever on either side. Each lever had either a high (70%) or low (10%) reward probability, with the identity of the high and low probability levers reversing in an unsignaled manner after a variable number of trials (see Methods for block reversal probabilities). After a variable delay (0-1s), either a sound (CS+) was presented at the same time as a reward was delivered to a central reward port, or another sound (CS−) was presented that signaled the absence of reward.

As expected, mice switched the lever they were more likely to press following block reversals (**Figure 1b,c**). Similarly, mice were significantly more likely to return to the previously chosen lever (i.e. stay) following rewarded, as opposed to unrewarded, trials (**Figure 1d**; p<0.0001: paired, two-tailed t-test between stay probabilities following rewarded and unrewarded trials across mice, n=16 mice), meaning that, as expected, mice were using previous choices and outcomes to guide behavior. A logistic regression to predict choice based on previous choices and outcomes indicated that mice relied on ~3 previous trials to guide their choices (**Figure 1e**; see Methods for choice regression details).
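The structure of this kind of choice regression can be sketched as follows. This is a toy simulation with made-up parameters (an agent that only uses the previous trial), not the study's actual regression: past choices are coded ±1 and split into rewarded and unrewarded predictors at each lag, and a hand-rolled gradient-ascent logistic fit recovers the lagged coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_choices(n_trials=5000, beta_rew=1.5, beta_unr=-0.5):
    """Simulate an agent that tends to repeat rewarded choices (choices coded +1/-1)."""
    choices = np.zeros(n_trials)
    rewards = np.zeros(n_trials)
    for t in range(n_trials):
        drive = 0.0
        if t > 0:
            drive = (beta_rew if rewards[t - 1] else beta_unr) * choices[t - 1]
        p_right = 1.0 / (1.0 + np.exp(-drive))
        choices[t] = 1 if rng.random() < p_right else -1
        rewards[t] = rng.random() < 0.5  # placeholder reward schedule
    return choices, rewards

def lagged_design(choices, rewards, n_lags=5):
    """Predictors: rewarded and unrewarded past choices at each lag."""
    n = len(choices)
    X = np.zeros((n, 2 * n_lags))
    for lag in range(1, n_lags + 1):
        X[lag:, lag - 1] = choices[:-lag] * rewards[:-lag]                 # rewarded
        X[lag:, n_lags + lag - 1] = choices[:-lag] * (1 - rewards[:-lag])  # unrewarded
    return X

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-ascent logistic regression (y coded 0/1)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

choices, rewards = simulate_choices()
X = lagged_design(choices, rewards)
w = fit_logistic(X, (choices + 1) / 2)
# w[:5]: rewarded-choice coefficients, lags 1..5; w[5:]: unrewarded-choice coefficients
```

Because the simulated agent only uses the previous trial, the fit recovers a positive rewarded-choice coefficient and a negative unrewarded-choice coefficient at lag 1, with higher lags near zero; the mice's coefficients instead decay over ~3 trials.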

To image activity of glutamatergic input neurons to the NAc during this behavior, we injected a retroAAV or CAV2 to express Cre-recombinase in the NAc as well as an AAV2/5 to Cre-dependently express GCaMP6f in either the PL or mTH (**Figure 1f**). A gradient refractive index (GRIN) lens was implanted above either the PL or mTH (see **Supplementary Figure 1** for implant locations), and a head-mounted miniature microscope was used to image activity in these populations during behavior (**Figure 1f**, n=278 neurons in PL-NAc from n=7 mice, n=256 neurons in mTH-NAc from n=9 mice). An example field of view from a single recording session is shown for both PL-NAc neurons as well as mTH-NAc neurons (**Figure 1g**). Behavior between mice in the PL-NAc versus mTH-NAc cohorts was similar (**Supplementary Figure 2**).

### Actions are preferentially represented by PL-NAc neurons, while reward-predicting stimuli are preferentially represented by mTH-NAc neurons

Individual PL-NAc and mTH-NAc neurons displayed elevated activity when time-locked to specific behavioral events in the task (**Figure 2a**). Given the correlation between the timing of task events, as well as the temporal proximity of events relative to the time-course of GCaMP6f, we built a linear encoding model to properly relate neural activity to each event (Engelhard et al., 2019; Krumin et al., 2018; Lovett-Barron et al., 2019; Musall et al., 2019; Park et al., 2014; Parker et al., 2016; Pinto and Dan, 2015; Sabatini, 2019; Steinmetz et al., 2019). Briefly, time-lagged versions of each behavioral event (nose poke, lever press, etc.) were used to predict the GCaMP6f fluorescence in each neuron using a linear regression. This allowed us to obtain “response kernels”, which related each event to the GCaMP6f fluorescence in each neuron, while removing the potentially confounding (linear) contributions of correlated task events (**Figure 2b**; see Methods for details).
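The core of such an encoding model can be sketched on synthetic data. This is a minimal single-event version with an assumed exponential kernel; the actual model includes multiple behavioral events (see Methods). Each column of the design matrix is a time-lagged copy of the event regressor, so the least-squares coefficients form the response kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
n_t, n_lags = 3000, 20
true_kernel = np.exp(-np.arange(n_lags) / 5.0)   # assumed GCaMP-like decay
events = (rng.random(n_t) < 0.02).astype(float)  # sparse binary event train

# Simulated fluorescence: event train convolved with the kernel, plus noise
f = np.convolve(events, true_kernel)[:n_t] + 0.1 * rng.standard_normal(n_t)

# Design matrix: one column per time-lagged copy of the event regressor
X = np.zeros((n_t, n_lags))
for lag in range(n_lags):
    X[lag:, lag] = events[: n_t - lag]

# Least-squares fit recovers the response kernel despite the noise
kernel_hat, *_ = np.linalg.lstsq(X, f, rcond=None)
```

With several correlated event regressors included side by side, the same fit apportions fluorescence among events, which is what removes the confounds described above.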

To visualize the response kernels, we plotted them as a heatmap, where each row was the response kernel for a particular neuron associated with each behavioral event. This heatmap was then ordered by the time of peak kernel value across all behavioral events. Visual observation revealed a clear difference between the PL-NAc and mTH-NAc populations – PL-NAc neurons were robustly modulated by the action-events in our task (**Figure 2c**; kernel values associated with ‘nose poke’, ‘ipsilateral lever press’, ‘contralateral lever press’ and ‘reward consumption’) while mTH-NAc neurons appeared to be most strongly modulated by the stimulus-events, specifically the positive reward auditory cue (**Figure 2d**, kernel values associated with ‘CS+’).

Examination of the GCaMP6f fluorescence time-locked to each behavioral event (rather than the encoding model-derived response kernels) revealed similar observations of action encoding in PL-NAc and CS+ encoding in mTH-NAc (**Figure 2e,f**). While this time-locked GCaMP6f heatmap displays neurons that appear to respond to multiple events (**Figure 2e**, see neurons approximately 50-100 that show elevated activity to ‘ipsilateral lever press’, ‘levers out’ and ‘nose poke’), this impression is likely a result of the temporal correlation between neighboring behavioral events, which our encoding model accounts for. To illustrate this, we applied our encoding model to a population of simulated neurons that responded only to the lever press events. In this simulated data, we observed a similar multi-peak heatmap when simply time-locking the simulated GCaMP6f fluorescence, but this multi-peak effect is eliminated by the use of our encoding model, which recovers the true relationship between GCaMP6f fluorescence and behavior in the simulated data (**Supplementary Figure 3**).

This encoding model was used to identify neurons in the PL-NAc and mTH-NAc populations that were significantly modulated by each event in our task (significance was assessed by comparing the encoding model with and without each task event, see Methods). We found that a similar fraction of PL-NAc and mTH-NAc neurons were modulated by at least one task event (PL-NAc: n=121/278 neurons from 7 mice; mTH-NAc: n=98/256 neurons from 9 mice). Of these neurons that were selective to at least one task event, the selectivity for actions versus sensory stimuli differed between the two populations (**Figure 2g,h**). In particular, more PL-NAc neurons were modulated by at least one *action* event (nose poke, ipsilateral lever press, contralateral lever press and reward consumption; 102/121 PL-NAc neurons; 60/98 mTH-NAc neurons; P=0.0001: two-proportion Z-test comparing fraction of action-modulated neurons between PL-NAc and mTH-NAc). By contrast, a significantly larger fraction of mTH-NAc neurons were modulated by at least one stimulus cue (levers out, CS+ and CS-; 51/98 mTH-NAc neurons; 40/121 PL-NAc neurons; P=0.005, two-proportion Z-test comparing fraction of stimulus-modulated neurons between PL-NAc and mTH-NAc).
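The two-proportion Z-test used for these population comparisons can be computed directly. This sketch uses the standard pooled-variance form, with the counts from the action-modulation comparison above.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-proportion Z-test with pooled standard error; two-sided p-value."""
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two = math.erfc(abs(z) / math.sqrt(2))  # 2 * P(Z > |z|) for a standard normal
    return z, p_two

# Counts from the action-modulation comparison above (PL-NAc vs mTH-NAc)
z, p = two_proportion_z(102, 121, 60, 98)
```

With these counts (102/121 vs 60/98 action-modulated neurons), the test gives z ≈ 3.9 and p ≈ 0.0001, matching the value reported above.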

### PL-NAc neurons preferentially encode choice relative to mTH-NAc neurons

This preferential representation of actions in PL-NAc relative to mTH-NAc suggests that lever choice (contralateral versus ipsilateral to the recording site) could also be preferentially encoded in PL-NAc. Indeed, a significantly larger fraction of neurons were choice-selective in PL-NAc compared with mTH-NAc (**Figure 3a**; PL-NAc: 92/278 (33%); mTH-NAc: 42/256 (16%); P=9.9×10^{-6}: two-proportion Z-test; significant choice-selectivity was determined with a nested comparison of the encoding model with and without choice information, see Methods). A logistic regression population decoder supported this observation of preferential choice-selectivity in PL-NAc relative to mTH-NAc. Choice decoding using neural activity of simultaneously recorded PL-NAc neurons was significantly more accurate compared with decoding using mTH-NAc activity (**Figure 3b**; PL-NAc: 72±3%, mTH-NAc: 60±2%, choice decoding accuracy from a logistic regression with activity from multiple, random selections of 10 simultaneously imaged neurons, mean±s.e.m. across mice; P=0.0065: unpaired, two-tailed t-test comparing peak decoding accuracy across mice between PL-NAc, n=6 mice, and mTH-NAc, n=9 mice).
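A toy version of such a population decoder is sketched below: simulated activity with weak, random choice tuning for 10 neurons, a hand-rolled gradient-ascent logistic fit in place of a packaged solver, and held-out trials for the accuracy estimate. All parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_trials, n_neurons = 400, 10
choice = rng.integers(0, 2, n_trials)   # 0 = ipsilateral, 1 = contralateral

# Simulated single-trial activity: weak, random choice tuning plus noise
tuning = 0.5 * rng.standard_normal(n_neurons)
activity = choice[:, None] * tuning[None, :] + rng.standard_normal((n_trials, n_neurons))
X = np.hstack([activity, np.ones((n_trials, 1))])   # intercept column

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-ascent logistic regression (y coded 0/1)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)
    return w

# Fit on training trials, report cross-validated accuracy on held-out trials
train, test = np.arange(300), np.arange(300, 400)
w = fit_logistic(X[train], choice[train])
accuracy = float(((X[test] @ w > 0).astype(int) == choice[test]).mean())
```

Repeating this over many random draws of 10 simultaneously imaged neurons, as done above, yields the distribution of decoding accuracies compared between populations.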

In contrast to the preferential representation of choice in PL-NAc compared to mTH-NAc, there was a larger fraction of neurons in mTH-NAc that encoded outcome (CS identity or reward consumption) compared to PL-NAc (**Figure 3c**; mTH-NAc: 85/256 (33%), PL-NAc: 70/278 (25%); P=0.038: two-proportion Z-test; significant outcome-selectivity was determined using a nested comparison of the encoding model with and without CS identity and reward consumption information, see Methods). However, while outcome decoding accuracy in mTH-NAc was slightly better relative to PL-NAc (**Figure 3d**; mTH-NAc: 73±2%, PL-NAc: 69±1%), this difference was not statistically significant (P=0.11: unpaired, two-tailed t-test comparing peak decoding accuracy across mice between PL-NAc (n=6 mice) and mTH-NAc (n=9 mice)). These results suggest that, unlike the preferential choice representation observed in PL-NAc over mTH-NAc, outcome was more similarly represented between these two populations. This is presumably due to the fact that both CS+ and reward consumption responses contribute to outcome representation, and although more neurons encoded CS+ in mTH-NAc, the opposite was true for reward consumption (**Figure 2c,d**).

To determine if choice or outcome decoding in either population depended on recording location, we aligned GRIN lens tracks from the histology to the Allen atlas (see Methods). We found no obvious relationship between the strength of either choice or outcome decoding and recording location in either PL-NAc or mTH-NAc (**Supplementary Figure 4**).

### PL-NAc neurons display choice-selective sequences that persist into the next trial

We next examined the temporal organization of choice-selective activity in PL-NAc neurons. Across the population, choice-selective PL-NAc neurons displayed sequential activity with respect to the lever press that persisted for >4s after the press (**Figure 4a-c**; see **Supplementary Figure 5** for sequences without peak-normalization). These sequences were visualized by time-locking the GCaMP6f fluorescence of choice-selective neurons with respect to the lever press, rather than with the encoding model from the earlier figures. The robustness of these sequences was confirmed using a cross-validation procedure, in which the order of peak activity across the PL-NAc choice-selective population was first established using half the trials (**Figure 4b**, ‘train’), and then the population heatmap was plotted using the same established ordering and activity from the other half of trials (**Figure 4c**, ‘test’). To quantify the consistency of these sequences, we correlated the neurons’ time of peak activity in the ‘training’ and ‘test’ data, and observed a strong correlation (**Figure 4d**; R^{2}=0.81, P=6.4×10^{-23}, n = 92 neurons from 7 mice). Additionally, the ridge-to-background ratio, a metric used to confirm the presence of sequences (Akhlaghpour et al., 2016; Harvey et al., 2012; Kondo et al., 2017), was significantly higher when calculated using the PL-NAc choice-selective sequences compared with sequences generated using shuffled data (**Supplementary Figure 6a-c**; P<0.001, t-test between ratio calculated from unshuffled data and 500 iterations of shuffled data).
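The train/test procedure can be sketched as follows, on simulated sequential activity rather than the recorded data: split trials into halves, establish each neuron's peak time on one half, and correlate it with the peak time from the other half.

```python
import numpy as np

rng = np.random.default_rng(3)
n_neurons, n_trials, n_t = 50, 40, 100
t = np.arange(n_t)
peaks = np.sort(rng.uniform(10, 90, n_neurons))   # true sequential peak times

# Single-trial activity: a Gaussian bump at each neuron's peak, plus noise
traces = np.exp(-0.5 * ((t[None, None, :] - peaks[:, None, None]) / 5.0) ** 2)
traces = traces + 0.5 * rng.standard_normal((n_neurons, n_trials, n_t))

train_avg = traces[:, ::2].mean(axis=1)   # average over even ("train") trials
test_avg = traces[:, 1::2].mean(axis=1)   # average over odd ("test") trials

peak_train = train_avg.argmax(axis=1)     # ordering established on train data
peak_test = test_avg.argmax(axis=1)
r = float(np.corrcoef(peak_train, peak_test)[0, 1])
```

A high correlation between train and test peak times indicates that the sequential ordering is a reproducible property of the population, not an artifact of sorting noisy averages.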

In contrast, choice-selective sequential activity in the mTH-NAc population was significantly less consistent than in PL-NAc (**Supplementary Figure 7a-d**; Z = 2.34, P=0.0096: Fisher’s Z, comparison of correlation coefficients derived from comparing peak activity between ‘test’ and ‘training’ data from PL-NAc versus mTH-NAc). Additionally, while the ridge-to-background ratio of the sequences generated using mTH-NAc activity was significantly higher than that using shuffled data (**Supplementary Figure 6d-f**; P<0.001), this ratio was also significantly lower than that obtained from PL-NAc sequences (2.36±0.12 mean±s.e.m. ratio for choice-selective mTH-NAc neurons, n=42, vs 3.01±0.12 mean±s.e.m. ratio for choice-selective PL-NAc neurons, n=92; P=0.001: unpaired, two-tailed t-test comparing ratio between PL-NAc and mTH-NAc neurons). The ridge-to-background ratio of both the PL-NAc and mTH-NAc sequences did not significantly change across either a block or recording session (**Supplementary Figure 8a-d**).
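One common formulation of the ridge-to-background ratio is sketched below, on simulated sequences; the exact definition used in the cited studies may differ in detail. The "ridge" is each neuron's test-data activity at its training-data peak time, the "background" is its mean test-data activity elsewhere, and a circular-shift shuffle serves as the control.

```python
import numpy as np

rng = np.random.default_rng(4)
n_neurons, n_trials, n_t = 50, 40, 100
t = np.arange(n_t)
peaks = np.sort(rng.uniform(10, 90, n_neurons))

# Sequential bumps on a constant baseline, plus trial-to-trial noise
clean = 0.2 + np.exp(-0.5 * ((t[None, None, :] - peaks[:, None, None]) / 5.0) ** 2)
traces = clean + 0.5 * rng.standard_normal((n_neurons, n_trials, n_t))

train_avg = traces[:, ::2].mean(axis=1)
test_avg = traces[:, 1::2].mean(axis=1)

def ridge_to_background(train_avg, test_avg):
    """Test activity at each neuron's training-data peak time, divided by
    its mean test activity at all other times; averaged over neurons."""
    ratios = []
    for tr, te in zip(train_avg, test_avg):
        peak = tr.argmax()
        ratios.append(te[peak] / np.delete(te, peak).mean())
    return float(np.mean(ratios))

ratio_seq = ridge_to_background(train_avg, test_avg)
# Shuffle control: circularly shift each neuron's test trace by a random offset
shuffled = np.vstack([np.roll(te, rng.integers(n_t)) for te in test_avg])
ratio_shuf = ridge_to_background(train_avg, shuffled)
```

For genuinely sequential activity the ridge ratio clearly exceeds the shuffle control, which hovers near 1.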

A striking feature of these choice-selective sequences in PL-NAc was that they persisted for seconds after the choice, potentially providing a neural ‘bridge’ between action and outcome. To further quantify the timescale of choice encoding, both within and across trials, we used activity from simultaneously imaged neurons at each timepoint in the trial to predict the mouse’s choice (with a decoder based on a logistic regression using random combinations of 10 simultaneously imaged neurons to predict choice). Choice on the current trial could be decoded above chance for ~7s after the lever press, spanning the entire trial (including the time of reward delivery and consumption), as well as the beginning of the next trial (**Figure 4e**; P<0.01: unpaired, two-tailed t-test of decoding accuracy across mice at the time of the next trial nose poke compared with chance decoding of 0.5, n=6 mice). Choice on the previous or subsequent trial was not represented as strongly as current trial choice (**Figure 4e**; in all cases we corrected for cross-trial choice correlations with a weighted decoder, see Methods) and choice from two trials back could not be decoded above chance at any time point in the trial (**Supplementary Figure 8e**). We also examined the temporal extent of choice encoding in the mTH-NAc population (**Supplementary Figure 7e**). Similar to PL-NAc, we observed that decoding of the mice’s choice persisted up to the start of the next trial. However, the peak decoding accuracy across all time points in the trial was lower in mTH-NAc (59±0.2%) compared with PL-NAc (73±0.2%).

### Choice-selective sequences in PL-NAc neurons, in combination with known anatomy, can provide a substrate for temporal difference (TD) learning

Thus far, we observed that choice-selective sequences in PL-NAc neurons encoded the identity of the chosen lever for multiple seconds after the lever press. This sequential activity bridged the gap in time between a mouse’s action and reward feedback, and therefore contained information that could be used to solve the task (**Figure 4**). But how could a biologically realistic network use sequences to solve this task? In particular, how could the synaptic outputs of PL-NAc neurons that were active at the *initiation* of the choice-selective sequence be strengthened by an outcome that occurred toward the *end* of the sequence? This represents a temporal credit-assignment problem and may be critical to the ability of the system to learn to generate the appropriate choice-selective sequence for a given trial.

To address this question, we developed a circuit-based computational model that could perform this task based on the observed choice-selective sequences in PL-NAc neurons. This was achieved by a synaptic plasticity model that implemented a temporal difference (TD) reinforcement learning algorithm by combining the recorded choice-selective sequential activity of PL-NAc neurons with the known connectivity of downstream structures (**Figure 5a,b**). The goal of TD learning is for weights to be adjusted in order to predict the sum of future rewards, or “value” (Dayan and Niv, 2008; O’Doherty et al., 2003; Sutton and Barto, 1998; Tsitsiklis and Van Roy, 1997). When this sum of future rewards changes, such as when an unexpected reward is received or an unexpected predictor of reward is experienced, a TD reward prediction error (RPE) occurs and adjusts the weights to reduce this error. The error signal in the TD algorithm closely resembles the RPE signal observed in ventral tegmental area (VTA) dopamine neurons (Parker et al., 2016; Schultz, 1998; Schultz et al., 1997), but how this signal is computed remains an open question.

In our synaptic plasticity model, the PL-NAc sequences enabled the calculation of the RPE in dopamine neurons. Over the course of a series of trials, this error allowed a reward that occurred at the end of the PL-NAc sequence to adjust synaptic weights at the beginning of the sequence (**Figure 5a,b**). Our model generated an RPE in dopamine neurons based on minimal assumptions: i) the choice-selective sequences in PL-NAc neurons that we report here (**Figure 4b**), ii) established (but simplified) anatomical connections between PL, NAc, and VTA neurons (see circuit diagram in **Figure 5a**), and iii) dopamine- and activity-dependent modification of the synapses connecting PL and NAc neurons (with a synaptic eligibility decay time constant of 0.6s, consistent with Gerstner et al., 2018 and Yagishita et al., 2014). In addition to the proposed circuit architecture of **Figure 5a**, we present several variant circuits, which also are able to calculate an RPE signal using the recorded choice-selective PL-NAc sequences (**Supplementary Figure 9**).

In more detail, our synaptic plasticity model took as inputs experimental, single-trial recordings of choice-selective, sequentially active PL neurons (**Figure 5c**, see Methods). These inputs represented temporal basis functions (*f_i(t)* in **Figure 5a**) for computing the estimated value of making a left or right choice. These basis functions are weighted in the NAc by the strength *w_i* of the PL-NAc synaptic connection and summed together to create a (sign-inverted) representation of the estimated value, at time *t*, of making a left choice, *V_L(t)*, or right choice, *V_R(t)*. To create the RPE observed in DA neurons requires that the DA neuron population receive a fast, positive value signal *V(t)* and a delayed negative value signal *V(t − Δ)*, as well as a direct reward signal (**Figure 5b**). In the instantiation of our model shown in **Figure 5a**, the summation of NAc inputs and sign-inversion occurs in the ventral pallidum (VP), so that the fast value signal is due to direct VP to DA input. The delayed negative signal to the DA population is due to a slower, disynaptic pathway that converges first upon the VTA GABA neurons, so that these neurons encode a value signal as observed experimentally (Eshel et al., 2015). The temporal discounting factor *γ* is implemented through different strengths of the two pathways to the VTA DA neurons (**Figure 5b**).
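These computations can be summarized in equations (a schematic sketch consistent with the description above; signs and pathway scalings are simplified). The NAc/VP stage computes

$$
V_L(t)=\sum_i w_i^{L}\, f_i^{L}(t), \qquad V_R(t)=\sum_i w_i^{R}\, f_i^{R}(t),
$$

and the dopamine population combines the fast value signal, the delayed negative value signal, and reward into a temporal difference error

$$
\delta(t) = r(t) + \gamma\, V(t) - V(t-\Delta),
$$

where *V(t)* is the value associated with the chosen lever, *Δ* is the delay of the disynaptic pathway, and the factor *γ* arises from the relative strengths of the direct and delayed pathways onto the DA neurons.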

Learning is achieved through DA-dependent modification of the PL-NAc synaptic strengths. We assume that PL-NAc neuronal activity leads to an exponentially decaying “eligibility trace” (Gerstner et al., 2018; Sutton and Barto, 1998). The correlation of this presynaptically driven eligibility trace with DA input then drives learning (**Figure 5b**). Altogether, this circuit architecture (as well as the mathematically equivalent architectures shown in **Supplementary Figure 9**) realizes a temporal difference learning algorithm for generating value representations in the ventral basal ganglia, providing a substrate for the selection of the proper choice based on previous trial outcomes. The synaptic plasticity model was able to correctly perform the task and recapitulate the mice’s behavior, achieving a comparable rate of reward (47.5% for the model versus 47.6% for the mice). Similar to mice, the model alternated choice following block reversals (**Figure 5d,e**; compare to **Figure 1b,c**; choice based upon a probabilistic readout of the difference between right and left values at the start of the sequence, see Methods) and had a higher stay probability following rewarded trials relative to unrewarded trials (**Figure 5f**; compare to **Figure 1d**).

Appropriate task performance was achieved through the calculation of an RPE signal in VTA dopamine neurons. The RPE signal was evident within a trial, based on the positive response to rewarded outcomes and negative response to unrewarded outcomes (**Figure 5g, top**). The RPE signal was also evident across trials, based on the negative modulation of the dopamine outcome signal by previous trial outcomes (**Figure 5g, bottom**, multiple linear regression similar to Bayer & Glimcher, 2005 and Parker et al., 2016).
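The across-trial modulation can be illustrated with a toy regression in the style of the analyses cited above. Here the simulated "dopamine" signal is reward minus a leaky average of past rewards, a stand-in for an RPE rather than the circuit model itself; regressing the outcome response on current and previous outcomes then yields a positive current-outcome coefficient and negative coefficients for previous outcomes.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_lags = 5000, 5
r = (rng.random(n) < 0.5).astype(float)   # outcome sequence (rewarded = 1)

# Toy RPE: reward minus a leaky average of past rewards (the "expectation")
V = np.zeros(n)
for t in range(1, n):
    V[t] = V[t - 1] + 0.3 * (r[t - 1] - V[t - 1])
da = r - V   # simulated dopamine outcome response

# Regress the outcome response on current and previous trial outcomes
X = np.column_stack([np.ones(n)] + [np.roll(r, lag) for lag in range(n_lags + 1)])
X, y = X[n_lags:], da[n_lags:]   # drop rows contaminated by the wrap-around
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef[1]: current outcome; coef[2:]: previous outcomes (expected negative)
```

The negative coefficients on previous outcomes capture the signature shown in **Figure 5g, bottom**: the more recently the animal was rewarded, the smaller the dopamine response to the current reward.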

Several features of the model neuron responses resembled those previously observed experimentally. The model dopamine responses were similar to results obtained from recorded dopamine activity in the same task (**Supplementary Figure 10**; Parker et al., 2016). The VTA GABA interneuron had a sustained value signal, due to the converging input of the transient, sequential value signals from NAc/VP (**Supplementary Figure 11**), replicating the sustained value signal recently observed in VTA GABA interneurons, which are monosynaptic inputs to the VTA dopamine neurons (Cohen et al., 2012). We note that, alternatively, the VP neurons shown in **Figure 5a** could project to a second set of VP neurons that functionally take the place of the VTA GABA interneurons (**Supplementary Figure 9a,c,f**), leading to sustained, positive value-encoding VP neurons as were recently observed in VTA-projecting VP neurons (Tian et al., 2016). Additionally, the convergence of the right-choice and left-choice neurons could occur in VP rather than VTA, so that there is no explicit encoding of *V_L* and *V_R* (**Supplementary Figure 9a**).

We next asked how the same model using single-trial activity from choice-selective mTH-NAc neurons would perform (**Figure 5h**). In line with the less consistent sequential choice-selective activity in mTH-NAc relative to PL-NAc (**Supplementary Figure 7**; compare with **Figure 4**), we observed a substantial reduction in performance when using mTH-NAc activity as input (**Figure 5i-k**). In fact, the mTH-NAc model performed at chance levels (39.7% reward rate, compared with chance reward rate of 40%). Additionally, relative to the PL-NAc model, using mTH-NAc activity resulted in a reduction in the negative modulation of dopamine signal following an unrewarded outcome (**Figure 5l, top**; compare with **Figure 5g, top**), and less effect of previous trial outcome on dopamine response (**Figure 5l, bottom**; compare with **Figure 5g, bottom**). Compared with PL-NAc, using mTH-NAc as input also resulted in a reduction in the speed at which the correct value is learned within the NAc and VTA GABA neurons (**Supplementary Figure 11c,d**; compare with **Supplementary Figure 11a,b**).

The choice-selective sequences in PL-NAc neurons were critical to model performance, as they allowed the backpropagation of the RPE signal across trials. This was verified by comparing performance to a model in which PL-NAc activity was shifted to be synchronously active at the trial onset (**Figure 5m**). Unlike the sequential data, synchronous choice-selective PL-NAc activity was unable to correctly modulate lever value following block reversals, and therefore did not lead to correct choices, resulting in model performance near chance (**Figure 5n-p**; synchronous model reward rate: 39.8%, compared with chance reward rate of 40%). This was due to the fact that the synchronous model was unable to generate an RPE signal that was modulated properly within or across trials. Within a trial, negative outcomes did not result in the expected decrease in the dopamine signal (**Figure 5q, top**), while across trials, the influence of the previous trial’s outcome on the dopamine signal was disrupted relative to both the sequential model and recorded dopamine activity (**Figure 5q, bottom**, compare to **Figure 5g** and recorded dopamine activity in NAc in **Supplementary Figure 10**). Without a properly calculated RPE signal, the synchronous model was unable to generate value signals that correlated with the identity of the high probability lever in either NAc or the VTA GABA interneuron (**Supplementary Figure 11e,f**).

### Stimulation of PL-NAc (but not mTH-NAc) neurons decreases the effect of previous trial outcomes on subsequent choice in both the model and the mice

We next sought to generate experimentally testable predictions from our synaptic plasticity model by examining the effect of disruption of these sequences on behavioral performance. Towards this end, we simulated optogenetic-like neural stimulation of this projection by replacing the PL-NAc sequential activity in the model with constant, choice-independent activity across the population on a subset of trials (on 10% of trials, 65% of neurons were stimulated; **Figure 6a,b**). This generated a decrease in the probability of staying with the previously chosen lever following rewarded trials and an increase following unrewarded trials relative to unstimulated trials (**Figure 6c**). In other words, the effect of previous outcome on choice was reduced when PL-NAc activity was disrupted. This effect persisted for multiple trials, as revealed by including terms that account for the interaction between previous rewarded and unrewarded choices and stimulation in the logistic choice regression introduced in **Figure 1e** (**Figure 6d**; see Methods for details). This occurred because stimulation disrupted the choice-selectivity observed in the recorded PL-NAc sequences, thus allowing dopamine to indiscriminately adjust the synaptic weights (i.e. value) of *both* the right and left PL-NAc synapses following rewarded or unrewarded outcomes. The effect of stimulation was observed multiple trials back because the incorrect weight changes persisted for multiple trials in the synaptic plasticity model. In contrast, stimulation did not result in a difference in stay probability on the trial with stimulation, as choice was determined by spontaneous activity before the sequence initiated in PL-NAc neurons, combined with the synaptic weights between PL and NAc neurons (**Figure 6c**).

We tested these model predictions experimentally by performing an analogous manipulation in mice, which involved stimulating PL-NAc axon terminals with ChR2 on 10% of trials (**Figure 6e**). In close agreement with our model, mice had a significant decrease in their stay probability following a rewarded trial that was paired with stimulation (**Figure 6f**; P=0.001: paired, two-tailed t-test across mice, n=14, comparison between stay probability following rewarded trials with and without stimulation), while they were more likely to stay following an unrewarded trial paired with stimulation (**Figure 6f**; P=0.0005: paired, two-tailed t-test across mice, n=14, comparison between stay probability following unrewarded trials with and without stimulation). Similar to the synaptic plasticity model (**Figure 6d**), the effect of stimulation on the mouse’s choice persisted for multiple trials. Mice had a significant decrease in their stay probability following PL-NAc stimulation on rewarded choices one and two trials back (**Figure 6g**; P=0.001 for one trial back; P=0.02 for two trials back; one-sample, two-tailed t-test across mice of regression coefficients corresponding to the interaction term between rewarded choice and optical stimulation, n=14 mice).

Also similar to the model, and in contrast to these effects on the next trial, stimulation on the current trial had no significant effect on choice following either rewarded or unrewarded trials (**Figure 6f**; P>0.5: paired, two-tailed t-test across mice, n=14, comparison between stay probability on trials with and without stimulation following both rewarded and unrewarded trials).
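These stay-probability comparisons can be made concrete with a short sketch; the conditioning logic is generic, but the per-mouse values fed to the paired t-test below are synthetic placeholders rather than the reported data:

```python
import numpy as np
from scipy import stats

def stay_prob(choices, rewarded, stim, after_reward, after_stim):
    """P(stay) on trial t+1, conditioned on trial t's outcome and
    stimulation status. `choices` codes lever identity (e.g. 0/1)."""
    prev = np.arange(len(choices) - 1)
    mask = (rewarded[prev] == after_reward) & (stim[prev] == after_stim)
    stay = choices[prev + 1] == choices[prev]
    return stay[mask].mean()

# Hypothetical per-mouse summaries (n=14): one stay probability per
# condition, compared with a paired, two-tailed t-test across mice.
rng = np.random.default_rng(1)
n_mice = 14
stay_after_reward_no_stim = rng.uniform(0.7, 0.9, n_mice)
stay_after_reward_stim = stay_after_reward_no_stim - rng.uniform(0.05, 0.2, n_mice)
t, p = stats.ttest_rel(stay_after_reward_no_stim, stay_after_reward_stim)
```

The same function, with `after_stim` toggled, yields the stimulated-versus-unstimulated comparison for each outcome condition.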

We also observed an increase in the probability of mice abandoning trials with stimulation compared with trials without stimulation (P=0.0006: paired, two-tailed t-test comparing the percentage of abandoned trials on stimulated versus non-stimulated trials; 12.2+/-2.5% for stimulated trials, 0.9+/-0.2% for non-stimulated trials). Relatedly, we found an increase in the latency to initiate a trial following PL-NAc stimulation (**Supplementary Figure 12a,b**). Together, these results suggest that this manipulation had some influence on the mouse’s motivation to perform the task.

Given the relatively weak choice encoding in mTH-NAc compared to PL-NAc (**Figure 3a,b**), and the fact that the mTH-NAc did not support effective trial-by-trial learning in our model (**Figure 5h-k**), we hypothesized that optogenetic stimulation of the mTH-NAc projection might not impact choice (**Figure 6h**). Indeed, in contrast to PL-NAc stimulation, mTH-NAc stimulation had no significant effect on the mice’s stay probability on the subsequent trial, following either rewarded or unrewarded stimulation trials (**Figure 6i**; P=0.84: paired, two-tailed t-test across mice, n=8, comparison of stay probability following rewarded trials with and without stimulation; P=0.40: paired two-tailed t-test, comparison of stay probability following unrewarded trials with and without stimulation). Similarly, inclusion of mTH-NAc stimulation into our choice regression model from **Figure 1e** revealed no significant effect of stimulation on rewarded or unrewarded choices (**Figure 6j**; P>0.05 for all trials back: one-sample, two-tailed t-test of regression coefficients corresponding to the interaction term between rewarded or unrewarded choice and optical stimulation across mice, n=8 mice). Additionally, there was no effect on the mice’s stay probability for current trial stimulation (**Figure 6i**; P=0.59: paired, two-tailed t-test across mice, n=8, comparison of stay probability following rewarded trials with and without stimulation; P=0.50: paired, two-tailed t-test, comparison of stay probability following unrewarded trials with and without stimulation). 
Similar to PL-NAc stimulation, mTH-NAc stimulation generated an increase in the probability of abandoning a trial on stimulation trials compared with control trials (P=0.032: paired, two-tailed t-test comparing percentage of abandoned trials on stimulated versus non-stimulated trials; 22.1+/-7.9% for stimulated trials, 6.4+/-3.1% for non-stimulated trials) as well as an increase in the latency to initiate the next trial (**Supplementary Figure 12c**), indicating that laser stimulation of the mTH-NAc projection may affect motivation to perform the task while not affecting the mouse’s choice.

To control for non-specific effects of optogenetic stimulation, we ran a control cohort of mice that received identical stimulation but did not express the opsin (**Supplementary Figure 12e,f**). Stimulation had no effect on the mice’s choice behavior (**Supplementary Figure 12g,h**) nor on the probability of abandoning trials on stimulation versus control trials (P=0.38: paired, two-tailed t-test comparing percentage of abandoned trials on stimulated versus non-stimulated trials; 0.4+/-0.08% for stimulated trials, 0.4+/-0.01% for non-stimulated trials).

### Trial-by-trial modulation in neural dynamics, without changes in synaptic plasticity on the timescale of trials, provides an alternative explanation for the behavioral and neural results

In our synaptic plasticity model (**Figure 5**), plasticity in PL-NAc synapses was used to produce value correlates in the NAc, and these values are compared across the choices for action selection. Thus, this model requires fast, dopamine-mediated synaptic plasticity, on the timescale of a trial, to mediate behavior. Whether plasticity operates in the NAc on this timescale is unclear. An alternative hypothesis, which preserves the same conceptual structure of this model but does not require fast synaptic plasticity, is that the across-trial updating of values and corresponding selection of actions is accomplished through the dynamics of a recurrent neural network rather than the dynamics of synaptic plasticity (Botvinick et al., 2019, 2020; Doshi-Velez and Konidaris, 2016; Song et al., 2017; Wang et al., 2018). The initial learning of the neural network’s synaptic weights is based on a reinforcement learning algorithm, which models slow initial task acquisition; during task performance, synaptic weights remain fixed and the DA RPE serves only to alter neural dynamics.

We thus developed such a neural dynamics model based on the recorded PL-NAc activity (**Figure 7a**; see Methods for model details). Similar to the synaptic plasticity model, single-trial, experimentally recorded PL-NAc activity was input to a recurrent neural network that modeled NAc and other associated brain regions (the “critic network”) to calculate value. Next, the TD RPE was calculated in the DA neurons from the value signal using the same circuit architecture as the synaptic plasticity model. However, rather than reweighting PL-NAc synapses on the timescale of trials, the TD RPE was input to a second recurrent neural network that modeled dorsomedial striatum (DMS) and other associated brain regions (the “actor network”; Atallah et al., 2007; Lau and Glimcher, 2008; O’Doherty et al., 2004; Richard et al., 2016; Tsutsui et al., 2016). This actor network used the TD RPE input from the previous timestep, along with the action from the previous timestep and a “temporal context” sequence that may arise from hippocampus or other cortical areas (Howard and Eichenbaum, 2013; Leon and Shadlen, 2003), to select the choice (left choice, right choice, or no action). This choice, in turn, determined which PL-NAc choice sequence was initiated.

Similar to the synaptic plasticity model (**Figure 5d,e**), our neural dynamics model was able to rapidly modulate choice following a reversal in the identity of the high probability lever (**Figure 7b,c**). In addition, the neural dynamics model generated RPE signals in VTA dopamine neurons that resemble previous experimental recordings (**Figure 7d**; **Supplementary Figure 10**). By contrast, when we replaced the choice-selective sequences to the NAc with choice-selective persistent activity, the model failed to train within the same number of training episodes (**Figure 7e**). This suggests that temporal structure in this input is beneficial for efficient task learning.

Given the symmetry in the block structure, the average RPE signal will be statistically similar as a function of trial number within each block. However, the model must make different choices across blocks to achieve high performance, given that the choice that is most likely to lead to reward changes across blocks. Therefore, across blocks, the actor network needs to respond differently to similar RPE inputs. Consistent with this, the decision variable for a given RPE was different for left versus right blocks (**Figure 7f**).

To further understand how the model is able to appropriately modulate its choices, we analyzed the evolution of the actor network’s activity across trials. Toward this end, we performed dimensionality reduction on the actor network’s activity at the first time step of each trial (i.e., the time of decision) using principal component analysis. The actor network’s dynamics were low-dimensional, with the first three principal components explaining ~95% of the variance. The dynamics were different for the two blocks (**Figure 7g**), consistent with the model responding differently to similar RPE inputs across blocks (**Figure 7f**). At a block reversal, for example from a left block to a right block, the network activity rapidly transitioned from the approximately steady-state representation of the left block (cluster of blue-purple points in **Figure 7g**) to the approximately steady-state representation of the right block (cluster of red-yellow points). Furthermore, the model learned to align the first principal component of activity along the direction of the network readout weights that determine the actor’s choice *a*(*t*) (**Figure 7h**). Thus, the actor learned to generate an explicit representation of the decision variable in the first principal component of its activity.
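This dimensionality-reduction analysis can be sketched as follows; the actor activity here is synthetic (variance placed along a known readout direction), so the numbers only illustrate the analysis pipeline, not the paper's results:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic actor-network activity at the decision time step:
# (n_trials, n_units), with most variance along a "decision" axis.
n_trials, n_units = 400, 64
readout = rng.standard_normal(n_units)
readout /= np.linalg.norm(readout)            # readout weight direction
decision_var = (np.sign(rng.standard_normal(n_trials))
                + 0.1 * rng.standard_normal(n_trials))
activity = (np.outer(decision_var, readout)
            + 0.02 * rng.standard_normal((n_trials, n_units)))

pca = PCA(n_components=3)
pcs = pca.fit_transform(activity)
var_explained = pca.explained_variance_ratio_.sum()

# |cos| between PC1 and the readout weights; a value near 1 means the
# first principal component explicitly carries the decision variable.
alignment = abs(pca.components_[0] @ readout)
```

In this construction, low-dimensional activity and PC1-readout alignment fall out by design; in the actual model they were empirical findings.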

To solve the reversal learning task, the network needs to use its past history of choices and rewards to accumulate evidence for whether the current block is a left block or a right block. Rewarded left-side choices, or unrewarded right-side choices, represent evidence that the current block is a left block, while the converse represents evidence for a right block. In the synaptic plasticity model (**Figure 5b**), this accumulation occurs in the PL-NAc synaptic weights, which accumulate the product of RPE (representing whether a choice was rewarded) and the eligibility trace (which, due to the choice-selectivity of the PL-NAc activity, represents the current choice). To analyze whether the actor network uses a similar accumulation of evidence to solve the task, we linearly regressed the first principal component of actor activity (PC1, which correlated strongly with the decision variable as described above) against the past history of choices and RPEs, which serve as inputs to the network, as well as the product of these (‘Choice x RPE’), which represents evidence. PC1 most strongly depended upon the ‘Choice x RPE’ predictor, with coefficients that decayed on a timescale of approximately 3 trials, suggesting that the actor used a leaky accumulation of evidence over this timescale to solve the task (**Figure 7i**). In addition, like the mice and the synaptic plasticity model, the neural dynamics model tended to stay with its previous choices as evident from the positive coefficients corresponding to the previous choice regressors in **Figure 7i**. Therefore, both the synaptic plasticity model and the neural dynamics model follow the same principle of accumulating evidence across trials to perform fast reversal learning in addition to having a tendency to repeat their previous choices.
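The accumulation analysis can be illustrated with a toy regression; here PC1 is constructed as a leaky accumulator of evidence (decay of 2/3 per trial, a timescale of roughly 3 trials, matching the reported value), so the regression recovers decaying 'Choice x RPE' coefficients by construction; all signals are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_back = 2000, 5

choice = rng.choice([-1.0, 1.0], n_trials)  # signed choice on each trial
rpe = rng.standard_normal(n_trials)         # stand-in per-trial RPE
evidence = choice * rpe                     # the 'Choice x RPE' predictor

# Synthetic PC1: a leaky accumulation of evidence across trials.
pc1 = np.zeros(n_trials)
for t in range(1, n_trials):
    pc1[t] = (2.0 / 3.0) * pc1[t - 1] + evidence[t - 1]

def lagged(x, j):
    """Shift x back by j trials, zero-padding the start."""
    out = np.zeros_like(x)
    out[j:] = x[:-j]
    return out

# Regress PC1 on choice, RPE, and Choice x RPE at lags 1..n_back.
X = np.column_stack([lagged(v, j)
                     for v in (choice, rpe, evidence)
                     for j in range(1, n_back + 1)])
coefs, *_ = np.linalg.lstsq(X, pc1, rcond=None)
evidence_coefs = coefs[2 * n_back:]  # 'Choice x RPE' coefficients, lags 1..5
```

The recovered `evidence_coefs` decay geometrically across lags, the signature of leaky evidence accumulation described in the text.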

To determine if the neural dynamics model, similar to the synaptic plasticity model, is able to recapitulate our optogenetic experiments, we simulated optogenetic stimulation of a fraction of PL-NAc neurons in the same manner as described for the synaptic plasticity model. Like the synaptic plasticity model and optogenetic experiments, relative to unstimulated trials, stimulation decreased the probability of staying with the previously chosen lever following rewarded trials and increased the probability following unrewarded trials (**Figure 7j**). Also like the synaptic plasticity model, this effect persisted for multiple trials (**Figure 7k**). In the case of the neural dynamics model, this disruption was due to a cascade of events: PL-NAc stimulation disrupted the activity of the critic (NAc and associated regions), thus disrupting the RPE signal that is transmitted to the actor. This disrupted RPE then leads to the observed disruption in behavior. Thus, both the synaptic plasticity and neural dynamics models exhibit disrupted trial-by-trial learning when PL-NAc activity is stimulated in a manner mimicking our experimental optogenetic perturbations.

## Discussion

This work provides both experimental and computational insights into how the NAc and associated regions could contribute to reinforcement learning. Experimentally, we found that mTH-NAc neurons were preferentially modulated by a reward-predictive cue, while PL-NAc neurons more strongly encoded actions (e.g. nose poke, lever press). In addition, PL-NAc neurons display choice-selective sequential activity which persists for several seconds after the lever press action, beyond the time the mice receive reward feedback. Computationally, we demonstrate that the choice-selective and sequential nature of PL-NAc activity can contribute to performance of a choice task by implementing a circuit-based version of reinforcement learning (Sutton and Barto, 1998; Tesauro, 1992), based on either synaptic plasticity or neural dynamics. The models are able to i) perform the task, ii) replicate previous recordings in VTA dopamine and GABA neurons, and iii) make new predictions that we have experimentally tested regarding the effect of perturbing PL-NAc or mTH-NAc activity on trial-by-trial learning. Thus, this work suggests a computational role of choice-selective sequences, a form of neural dynamics whose ubiquity is being increasingly appreciated (Kawai et al., 2015; Kim et al., 2017; Long et al., 2010; Ölveczky et al., 2011; Picardo et al., 2016; Sakata et al., 2008).

### Relationship to previous neural recordings in the NAc and associated regions

To our knowledge, a direct comparison, at cellular resolution, of activity across multiple glutamatergic inputs to the NAc has not previously been conducted. This is a significant gap, given that these inputs are thought to contribute critically to reinforcement learning by providing the information to the NAc that dopaminergic inputs can modulate (Centonze et al., 2001; Nestler, 2001; Nicola et al., 2000; Shen et al., 2008; Wilson, 2004; Xiong et al., 2015). The differences in the representations in the two populations that we report were apparent due to the use of a behavioral task with both actions and sensory stimuli.

The preferential representation of actions relative to sensory stimuli in PL-NAc is somewhat surprising, given that previous studies have focused on sensory representations in this projection (Otis et al., 2017), and given that the NAc is heavily implicated in Pavlovian conditioning (Day and Carelli, 2007; Day et al., 2006; Di Ciano et al., 2001; Parkinson et al., 1999; Roitman et al., 2005; Wan and Peoples, 2006). On the other hand, there is extensive prior evidence of action correlates in PFC (Cameron et al., 2019; Genovesio et al., 2006; Luk and Wallis, 2013; Siniscalchi et al., 2019; Sul et al., 2010), and the NAc is implicated in operant conditioning in addition to Pavlovian conditioning (Atallah et al., 2007; Cardinal and Cheung, 2005; Collins et al., 2019; Hernandez et al., 2002; Kelley et al., 1997; Kim et al., 2009; Salamone et al., 1991).

Our finding of sustained choice-encoding in PL-NAc neurons is in agreement with previous work recording from medial prefrontal cortex (mPFC) neurons during a different reinforcement learning task (Maggi and Humphries, 2019; Maggi et al., 2018). Additionally, other papers have reported choice-selective sequences in other regions of cortex, as well as in the hippocampus (Harvey et al., 2012; Pastalkova et al., 2008; Terada et al., 2017). In fact, given previous reports of choice-selective (or outcome-selective) sequences in multiple brain regions and species (Kawai et al., 2015; Kim et al., 2017; Long et al., 2010; Ölveczky et al., 2011; Picardo et al., 2016; Sakata et al., 2008), the relative absence of sequences in mTH-NAc neurons may be more surprising than the presence in PL-NAc.

Our observation of prolonged representation of the CS+ in mTH-NAc (**Figure 2d,f**) is in line with previous observations of pronounced and prolonged encoding of task-related stimuli in the primate thalamus during a Pavlovian conditioning task (Matsumoto et al., 2001). Together with our data, this suggests that the thalamus is contributing information about task-relevant stimuli to the striatum, which could bridge the gap between a CS and US in a Pavlovian trace conditioning task (Campus et al., 2019; Do-Monte et al., 2017; Otis et al., 2019; Zhu et al., 2018). Consistent with the observation that mTH-NAc displayed weaker choice representation relative to PL-NAc, disruption of this mTH-NAc projection did not causally affect choice in our task (**Figure 3a,b; Figure 6h-j**). We also demonstrate that mTH-NAc stimulation results in an increase in both the number of abandoned trials and the latency to initiate the subsequent trial (**Supplementary Figure 12c**), suggesting that this population may play a role in motivation as opposed to action selection. This is in line with recent work demonstrating a role of thalamostriatal projections in the initiation and execution, but not the learning, of a motor task (Wolff et al., 2019).

### Choice-selective sequences implement TD learning in a neural circuit synaptic plasticity model of task performance

Given the widespread observation of choice-selective sequences across multiple behaviors and brain regions, including the PL-NAc neurons recorded in this study, a fundamental question is what computational function such sequences serve. Here, we suggest that these sequences may contribute to the neural implementation of TD learning, by providing a temporal basis set that bridges the gap in time between actions and outcomes. Specifically, neurons active in a sequence that ultimately results in reward enable the backpropagation in time of the dopaminergic RPE signal, because the earlier neurons in the sequence predict the activity of the later neurons in the sequence, which themselves overlap with reward. This causes the synaptic weights onto NAc of these earlier neurons to be strengthened, which in turn biases choice towards that represented by those neurons.
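This mechanism can be distilled into a minimal TD(0) simulation on a one-hot sequential basis; the parameters (10 sequence steps, gamma=0.98, learning rate 0.5) are illustrative, with the weights standing in for PL-NAc synaptic strengths:

```python
import numpy as np

# TD(0) on a one-hot sequential basis: neuron t is active only at time
# step t, and reward arrives at the end of the sequence. The weights w
# stand in for PL-NAc synaptic strengths (the value of each sequence
# state); step count, gamma, and learning rate are illustrative.
n_steps, gamma, alpha = 10, 0.98, 0.5
w = np.zeros(n_steps)

for trial in range(200):
    for t in range(n_steps):
        v_now = w[t]
        v_next = w[t + 1] if t + 1 < n_steps else 0.0
        r = 1.0 if t == n_steps - 1 else 0.0
        rpe = r + gamma * v_next - v_now   # dopamine-like TD error
        w[t] += alpha * rpe                # dopamine-gated weight update

# After learning, value has "backpropagated" through the sequence:
# w[t] -> gamma**(n_steps - 1 - t), so earlier neurons acquire
# predictive value even though only the final step overlaps reward.
```

Over repeated trials the RPE moves earlier and earlier in the sequence, strengthening the weights of the earliest neurons, which is exactly the backpropagation-in-time described above.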

While TD learning based on a stimulus representation of sequentially active neurons has previously been proposed for learning in the context of sequential behaviors (Fee and Goldberg, 2011; Jin et al., 2009), and for learning the timing of a CS-US relationship (Aggarwal et al., 2012; Carrillo-Reid et al., 2008; Gershman et al., 2014; Ponzi and Wickens, 2010), here we extend these ideas in several important ways:

First, we link these theoretical ideas directly to data, by demonstrating that choice-selective sequential activity in the NAc is provided primarily by PL-NAc (as opposed to mTH-NAc) input neurons, and that perturbation of the PL-NAc (but not mTH-NAc) projection disrupts action-outcome pairing, consistent with model predictions. Specifically, in our synaptic plasticity model, overwriting the choice-selective sequential activity with full-trial optogenetic-like stimulation of PL-NAc neurons disrupts the model’s ability to link choice and outcome on one trial to guide choice on subsequent trials (**Figure 6b-d**). In contrast to the effect of stimulation on subsequent trials’ choices, stimulation caused no disruption of choice on the stimulated trial, in either the synaptic plasticity model or the experimental results (**Figure 6c,f**). This is true in the synaptic plasticity model because choice is determined by the PL-NAc weights at the beginning of the trial, which are determined by previous trials’ choices and outcomes. Thus, the synaptic plasticity model provides a mechanistic explanation of a puzzling experimental finding: that optogenetic manipulation of PL-NAc neurons affects subsequent choices but not the choice on the stimulation trial itself, and that this stimulation creates oppositely directed effects following rewarded versus unrewarded trials.

Second, we extend these ideas to the performance of a choice task. For appropriate learning in a choice task, an RPE must update the value of the chosen action only. Given that dopamine neurons diffusely project to striatum, and given that the dopamine RPE signal is not action-specific (Lee et al., 2019), it is not immediately obvious how such specificity would be achieved. In our synaptic plasticity model, the specificity of the value updating by dopamine to the chosen action is possible because the PL-NAc neurons that are active for one choice are silent for the other choice. Similarly, even though dopamine neurons receive convergent inputs from both NAc/VP neurons that encode right-side value as well as others that encode left-side value (given, once again, that dopamine RPE signals are not action-specific), the model produces dopamine neuron activity that is dependent on the value of the chosen action (rather than the sum of left and right action values) by virtue of the fact that PL-NAc inputs that are active for one choice are silent for the other.

Third, our synaptic plasticity model replicates numerous experimental findings in the circuitry downstream of PL-NAc. The most obvious is the calculation of an RPE signal in dopamine neurons (Bayer and Glimcher, 2005; Parker et al., 2016), which allows proper value estimation and, thus, task performance. We had previously confirmed an RPE signal in NAc-projecting dopamine neurons in this task (Parker et al., 2016). In addition, GABA interneurons encode value (Cohen et al., 2012; Tian et al., 2016), as predicted by our model, and their activation inhibits dopamine neurons, again consistent with our model (Eshel et al., 2015). This value representation in the GABA interneurons is key to our model producing an RPE signal in dopamine neurons, as it produces the temporally delayed, sign-inverted signals required for the calculation of a temporally differenced RPE (**Figure 5a**). Although previous work had suggested other neural architectures for this temporal differencing operation (Aggarwal et al., 2012; Carrillo-Reid et al., 2008; Doya, 2002; Hazy et al., 2010; Ito and Doya, 2015; Joel et al., 2002; Pan et al., 2005; Suri and Schultz, 1998, 1999), these models have not been revisited in light of recent cell-type and projection-specific recordings in the circuit. Consistent with our synaptic plasticity model, electrical stimulation of VP generates both immediate inhibition of dopamine neurons and delayed excitation (Chen et al., 2019).
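In one standard indexing convention (our notation; the paper's exact convention may differ), the temporally differenced RPE carried by dopamine neurons can be written as

```latex
\delta(t) = r(t) + \gamma V(t) - V(t-1)
```

where the value-encoding input conveys the current term $\gamma V(t)$ to dopamine neurons, and the VTA GABA interneurons, which also encode value, supply the temporally delayed, sign-inverted term $-V(t-1)$ through their inhibition of dopamine neurons.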

Our specific proposal for temporal differencing by the VTA GABA interneuron is attractive in that it could provide a *generalizable* mechanism for calculating RPE: it could extend to any input that projects both to the dopamine and GABA neurons in the VTA, and that also receives a dopaminergic input that can modify synaptic weights.

### Limitations of the synaptic plasticity model and trial-by-trial changes in neural dynamics as an alternative model of the neural implementation of reinforcement learning

While the synaptic plasticity model replicates multiple features of neurons throughout the circuit, it does not replicate some findings on the relative timing and sign of the value correlates in the NAc versus the VP (Chen et al., 2019; Ottenheimer et al., 2018; Tian et al., 2016), likely because of anatomical connections that we omitted from our circuit for simplicity (e.g. VP→NAc; Ottenheimer et al., 2018; Wei et al., 2016; also see **Supplementary Figure 9** for discussion of feasible model variants that produce a greater diversity of signals in VP). Similarly, our synaptic plasticity model does not produce previously observed responses to conditioned stimuli in VP (Ottenheimer et al., 2019; Stephenson-Jones et al., 2020; Tian et al., 2016; Tindell et al., 2004), although inclusion of cue-responsive mTH-NAc inputs into our model would remedy this. Relatedly, our model does not take into account other glutamatergic inputs to the NAc, such as those from the basolateral amygdala and ventral hippocampus (Bagot et al., 2015; French and Totterdell, 2003). Further investigation into what information these inputs encode will be illuminating.

In the synaptic plasticity model, appropriate trial-and-error learning is mediated by a dopamine-dependent synaptic plasticity mechanism that operates on the timescale of trials (**Figure 5b**). There is extensive evidence to support a key role for dopamine in this task on this timescale. For example, we had previously demonstrated that dopamine neurons that project to the NAc encode an RPE signal in this task, and that inhibiting these neurons during single trials serves as a negative prediction error signal, consistent with a reinforcement learning mechanism (Parker et al., 2016). Similar evidence for dopamine serving as a reinforcement learning signal has been obtained in other related paradigms in both rodents (Hamid et al., 2016; Kwak et al., 2014; Lak et al., 2020) and human subjects (Rutledge et al., 2009). In addition, other studies have used pharmacology or ablation to implicate dopamine-receptor pathways and NMDA-receptors in NAc in reversal learning (Boulougouris et al., 2009; Ding et al., 2014; Izquierdo et al., 2006; Kruzich and Grandy, 2004; O’Neill and Brown, 2007; Taghzouti et al., 1985).

However, whether dopamine-mediated synaptic plasticity operates on a fast enough timescale to mediate plasticity during trial-and-error learning is not clear (but see Shonesy et al., 2020 and Young and Yang, 2005 for some evidence of fast dopamine-mediated synaptic plasticity in cortex and striatum). Furthermore, model-free TD learning cannot take advantage of additional task-structure information such as the reward probabilities within a block (Collins and Cockburn, 2020; Doll et al., 2012; but see **Supplementary Figure 14** for challenges in identifying this ability within tasks like ours). For these reasons, we presented an alternative model based on trial-by-trial modulation by dopamine of neural dynamics instead of synaptic plasticity (**Figure 7**; see related work by Botvinick et al., 2019, 2020; Doshi-Velez and Konidaris, 2016; Duan et al., 2016; Finn et al., 2017; Nagabandi et al., 2018; Rakelly et al., 2019; Sæmundsson et al., 2018; Song et al., 2017; Wang et al., 2018). In this model, dopamine affected choices by modifying dynamics in a recurrent neural network, while synaptic weights were fixed after initial training of the weights with an RPE. Thus, in the neural dynamics model, the DA RPE played a similar role during initial learning of the task as it played during ongoing task performance in our synaptic plasticity model. Similar to our synaptic plasticity model, the choice-selective sequences were input to the value-calculating network. Eliminating the temporal structure of this input while preserving the choice-selectivity made training of the network less efficient (**Figure 7e**), suggesting that temporal structure is beneficial in the calculation of value. Similar to our synaptic plasticity model, the neural dynamics model also replicates our key experimental observations. Both models also generate a value signal and an RPE signal using single-trial recorded PL-NAc activity and a reward signal as input.
Additionally, in both models, the calculation of the reward prediction error signal is accomplished using a similar VTA circuit architecture downstream of the value calculation (compare **Figures 5a and 7a**). Finally, we showed that, analogous to the accumulation of the past history of rewarded choices by the synaptic weights in the synaptic plasticity model (**Figure 5b**), the neural dynamics model appears to accumulate a similar variable in the output dynamics of its actor network (**Figure 7i**). Given the similarity in the outputs of these two models, future experiments will need to be designed to test predictions that distinguish the synaptic plasticity and neural dynamics models.

## Methods

### Mice

46 male C57BL/6J mice from The Jackson Laboratory (strain 000664) were used for these experiments. Prior to surgery, mice were group-housed with 3-5 mice/cage. All mice were >6 weeks of age prior to surgery and/or behavioral training. To prevent mice from damaging the implants of their cagemates, all mice used in imaging experiments were singly housed post-surgery. All mice were kept on a 12-h on/12-h off light schedule. All experiments and surgeries were performed during the lights-off period. All experimental procedures and animal care were performed in accordance with the guidelines set forth by the National Institutes of Health and were approved by the Princeton University Institutional Animal Care and Use Committee.

### Probabilistic reversal learning task

Beginning three days prior to the first day of training, mice were placed on water restriction and given per diem water to maintain >80% of original body weight throughout training. Mice performed the task in a 21 x 18 cm operant behavior box (MED Associates, ENV-307W). A three-stage shaping protocol was used to enable training and discourage the formation of a bias toward the right or left lever. In all stages of training, the start of a trial was indicated by illumination of a central nose poke port. After completing a nose poke, the mouse was presented with both the right and left levers after a temporal delay drawn from a uniform distribution from 0 to 1 s in 100 ms intervals. The reward probabilities of these two levers varied based on the stage of training (see below for details). After the mouse successfully pressed one of the two levers, both retracted and, after a temporal delay drawn from the same uniform distribution, the mouse was presented with one of two auditory cues for 500 ms indicating whether it was rewarded (CS+, 5 kHz pure tone) or not rewarded (CS−, white noise). Concurrent with the CS+ presentation, the mouse was presented with 6 μl of 10% sucrose reward in a dish located midway between the two levers, just interior to the central nose poke. The start time of reward consumption was defined as the moment the mouse first made contact with the central reward port spout following the delivery of the reward. The end of the reward consumption period (i.e. reward exit) was defined as the first moment at which the mouse was disengaged from the reward port for >100 ms. In all stages of training, trials were separated by a 2 s intertrial interval, which began either at the end of the CS on unrewarded trials or at the end of reward consumption on rewarded trials.

In the first stage of training (“100-100 debias”), during a two-hour session, mice could make a central nose poke and be presented with both the right and left levers, each with a 100% probability of reward. However, to ensure that mice did not form a bias during this stage, after five successive presses of either lever the mouse was required to press the opposite lever to receive a reward. In this case, a single successful switch to the opposite lever returned both levers to a rewarded state. Once a mouse received >100 rewards in a single session, it was moved to the second stage (“100-0”), in which only one of the two levers would result in a reward. The identity of the rewarded lever reversed after 10 rewarded trials plus a random number of trials drawn from the geometric distribution:
$$P(k) = p(1-p)^{k-1}$$

where *P*(*k*) is the probability of a block reversal *k* trials into a block and *p* is the success probability of a reversal for each trial, which in our case was 0.4. After 3 successive days of receiving >100 total rewards, the mice were moved to the final stage of training (“70-10”), during which on any given trial pressing one lever had a 70% probability of leading to reward (high-probability lever) while pressing the opposite lever had only a 10% reward probability (low-probability lever). The identity of the higher-probability lever reversed using the same geometric distribution as in the 100-0 training stage. On average, there were 23.23 +/- 7.93 trials per block and 9.67 +/- 3.66 blocks per session (mean +/- std. dev.). In this final stage, the mice were required to press either lever within 10 s of presentation; otherwise, the trial was considered an ‘abandoned trial’ and the levers retracted. All experimental data shown were collected while mice performed this final “70-10” stage.
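As a sanity check, the reversal rule can be sampled directly (a sketch; this simulates only the sampling rule of 10 rewarded trials plus a geometric draw, so its mean is not expected to match the empirically reported trials-per-block figure, which counts unrewarded trials as well):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.4  # per-trial reversal "success" probability from the task design

# Block length = 10 rewarded trials plus a geometrically distributed
# number of additional trials (numpy's geometric is supported on k >= 1,
# with mean 1/p).
block_lengths = 10 + rng.geometric(p, size=100_000)
```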

### Logistic choice regression

For the logistic choice regressions shown in **Figure 1e** and **Supplementary Figure 2a**, we modeled the choice of the mouse on trial *i* based on lever choice and reward outcome information from the previous *n* trials using the following logistic regression model:

$$\log\left(\frac{C(i)}{1 - C(i)}\right) = \beta_0 + \sum_{j=1}^{n} \beta_j^{R}\, R(i-j) + \sum_{j=1}^{n} \beta_j^{U}\, U(i-j)$$

where *C*(*i*) is the probability of choosing the right lever on trial *i*, and *R*(*i* – *j*) and *U*(*i* – *j*) are the choice of the mouse *j* trials back from the *i*-th trial for either rewarded or unrewarded trials, respectively. *R*(*i* – *j*) was defined as +1 when the *j*-th trial back was both rewarded and a right press, −1 when the *j*-th trial back was rewarded and a left press and 0 when it was unrewarded. Similarly, *U*(*i* – *j*) was defined as +1 when the *j*-th trial back was both unrewarded and a right press, −1 when the *j*-th trial back was unrewarded and a left press and 0 when it was rewarded. The calculated regression coefficients, $\beta_j^R$ and $\beta_j^U$, reflect the strength of the relationship between the identity of the chosen lever on a previously rewarded or unrewarded trial, respectively, and the lever chosen on the current trial.
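A minimal sketch of this regression, assuming the predictor coding described above (all names are hypothetical, a simulated win-stay/lose-switch session stands in for real behavior, and scikit-learn's `LogisticRegression` with a very weak penalty approximates the unregularized `glmfit` fit used in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_history_predictors(choices, rewards, n_back=10):
    """choices: +1 right / -1 left; rewards: 1 rewarded / 0 unrewarded.
    Columns 0..n_back-1 hold R(i-1..i-n); columns n_back.. hold U(i-1..i-n)."""
    R = choices * rewards          # +1/-1 on rewarded trials, 0 otherwise
    U = choices * (1 - rewards)    # +1/-1 on unrewarded trials, 0 otherwise
    X = np.zeros((len(choices), 2 * n_back))
    for j in range(1, n_back + 1):
        X[j:, j - 1] = R[:-j]              # rewarded-choice history
        X[j:, n_back + j - 1] = U[:-j]     # unrewarded-choice history
    return X

# simulated session from a (mostly) win-stay/lose-switch agent: the fit should
# recover a positive rewarded-choice and negative unrewarded-choice coefficient
rng = np.random.default_rng(1)
T = 5000
choices, rewards = np.empty(T), np.empty(T)
c = 1.0
for i in range(T):
    choices[i] = c
    rewards[i] = float(rng.random() < (0.7 if c == 1.0 else 0.3))
    c = (c if rewards[i] else -c) if rng.random() < 0.9 else rng.choice([-1.0, 1.0])

X = build_history_predictors(choices, rewards)
y = (choices == 1.0).astype(int)
model = LogisticRegression(C=1e6, max_iter=5000).fit(X[10:], y[10:])
```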

To examine the effect of optogenetic stimulation from multiple trials back on the mouse’s choice (**Figure 6d,g,j; Figure 7k; Supplementary Figure 12h & Supplementary Figure 13c-d**), we expanded our behavioral logistic regression model to include the identity of those trials with optical stimulation, as well as the interaction between the rewarded and unrewarded choice predictors and stimulation:

$$\log\left(\frac{C(i)}{1 - C(i)}\right) = \beta_0 + \sum_{j=1}^{n}\left(\beta_j^{R} R(i-j) + \beta_j^{U} U(i-j) + \beta_j^{L} L(i-j) + \beta_j^{R \times L} R(i-j) L(i-j) + \beta_j^{U \times L} U(i-j) L(i-j)\right)$$

where *L*(*i*) represents optical stimulation on the *i*^{th} trial (1 for optical stimulation, 0 for control trials), $\beta_j^{L}$ represents the coefficient corresponding to the effect of stimulation on choice *j* trials back, and $\beta_j^{R \times L}$ and $\beta_j^{U \times L}$ represent the coefficients corresponding to the interaction between rewarded choice × optical stimulation and unrewarded choice × stimulation, respectively.

To visualize the relative influence of stimulation on the mice’s choices compared with unstimulated trials, in **Figure 6d,g,j; Figure 7k; Supplementary Figure 12h & Supplementary Figure 13c-d**, the solid blue trace represents the sum of the rewarded choice coefficients $\beta_j^{R}$ (represented by the black trace) and the rewarded choice × stimulation coefficients $\beta_j^{R \times L}$. Similarly, the dashed blue trace represents the sum of the unrewarded choice coefficients $\beta_j^{U}$ (grey trace) and the unrewarded choice × stimulation coefficients $\beta_j^{U \times L}$. For all choice regressions, the coefficients for each mouse were fit using the *glmfit* function in MATLAB and error bars reflect means and s.e.m. across mice.

### Cellular-resolution calcium imaging

To selectively image neurons which project to the NAc, we used a combinatorial virus strategy. 16 mice (7 PL-NAc, 9 mTH-NAc) previously trained on the probabilistic reversal learning task were unilaterally injected with 500nl of a retrogradely transporting virus to express Cre-recombinase (CAV2-cre, IGMM vector core, France, injected at ~2.5 × 10^{12} parts/ml, or retroAAV-EF1a-Cre-WPRE-hGHpA, PNI vector core, injected at ~6.0 x 10^{13}) in either the right or left NAc core (1.2 mm A/P, +/-1.0 mm M/L, −4.7 D/V), along with 600nl of a virus to express GCaMP6f in a Cre-dependent manner (AAV2/5-CAG-Flex-GCaMP6f-WPRE-SV40, UPenn vector core, injected at ~1.27 x 10^{13} parts/ml) in either the mTH (−0.3 & −0.8 A/P, +/- 0.4 M/L, −3.7 D/V) or PL (1.5 & 2.0 A/P, +/- 0.4 M/L, −2.5 D/V) of the same hemisphere. 154 of 278 (55%, n=5 mice) PL-NAc neurons and 95 of 256 (37%, n=5 mice) mTH-NAc neurons were labeled using the CAV2-Cre virus; the remainder were labeled using the retroAAV-Cre virus. In this same surgery, mice were implanted with a 500 μm diameter gradient refractive index (GRIN) lens (GLP-0561, Inscopix) in the same region as the GCaMP6f injection - either the PL (1.7 A/P, +/- 0.4 M/L, −2.35 D/V) or mTH (−0.5 A/P, +/- 0.3 M/L, −3.6 D/V). 2-3 weeks after this initial surgery, mice were implanted with a base plate attached to a miniature, head-mountable, one-photon microscope (nVISTA HD v2, Inscopix) above the top of the implanted lens, at a distance that focused the field of view. All coordinates are relative to bregma and based on *The Mouse Brain in Stereotaxic Coordinates*, 2nd edition (Paxinos and Franklin, 2004). GRIN lens location was imaged using the Nanozoomer S60 Digital Slide Scanner (Hamamatsu) (location of implants shown in **Supplementary Figure 1**).
The subsequent image of the coronal section determined to be the center of the lens implant was then aligned to the Allen Brain Atlas (Allen Institute, brain-map.org) using the *Wholebrain* software package (wholebrainsoftware.org; Fürth et al., 2018).

Post-surgery, mice with visible calcium transients were retrained on the task while habituating to carrying a dummy microscope attached to the implanted baseplate. After the mice acclimated to the dummy microscope, they performed the task while images of the recording field of view were acquired at 10 Hz using the Mosaic acquisition software (Inscopix). To synchronize imaging data with behavioral events, pulses from the microscope and the behavioral acquisition software were recorded using either a data acquisition card (USB-201, Measurement Computing) or, when LED tracking (see below for details) was performed, an RZ5D BioAmp processor from Tucker-Davis Technologies. Acquired videos were pre-processed using the Mosaic software and spatially downsampled by a factor of 4. The downsampled videos then went through two stages of motion correction. First, rigid motion in the video was corrected using the translational motion correction algorithm based on Thévenaz et al. (1998) included in the Mosaic software (Inscopix; motion correction parameters: translation only, reference image: the mean image, speed/accuracy balance: 0.1, subtract spatial mean [r = 20 pixels], invert, and apply spatial mean [r = 5 pixels]). The video then went through multiple rounds of non-rigid motion correction using the NoRMCorre motion correction algorithm (Pnevmatikakis and Giovannucci, 2017; parameters: gSig=7, gSiz=17, grid size and grid overlap ranged from 12-36 and 8-16 pixels, respectively, based on the individual motion of each video). Videos underwent multiple (no greater than 3) iterations of NoRMCorre until non-rigid motion was no longer visible.
Following motion correction, the CNMFe algorithm (Zhou et al., 2018) was used to extract the fluorescence traces (referred to as ‘GCaMP6f’ throughout the text) as well as an estimated firing rate for each neuron (CNMFe parameters: spatial downsample factor=1, temporal downsample=1, Gaussian kernel width=4, maximum neuron diameter=20, tau decay=1, tau rise=0.1). Only those neurons with an estimated firing rate of four transients/minute or higher were considered ‘task-active’ and included in this paper – 278/330 (84%; each mouse contributed 49, 57, 67, 12, 6, 27, 60 neurons, respectively) of neurons recorded from PL-NAc passed this threshold, while 256/328 (78%; each mouse contributed 17, 28, 20, 46, 47, 40, 13, 13, 32 neurons, respectively) passed in mTH-NAc. Across all figures, to normalize neural activity across different neurons and between mice, we Z-scored each GCaMP6f recording trace using the mean and standard deviation calculated over the entire recording session.

### Encoding model to generate response kernels for behavioral events

To determine the response of each neuron attributable to each of the events in our task, we used a multiple linear encoding model with lasso regularization to generate a response kernel for each behavioral event (example kernels shown in **Figure 2b**). In this model, the dependent variable was the GCaMP6f trace of each neuron recorded during a behavioral session and the independent variables were the times of each behavioral event (‘nose poke’, ‘levers out’, ‘ipsilateral lever press’, ‘contralateral lever press’, ‘CS+’, ‘CS−’ and ‘reward consumption’) convolved with a 25 degrees-of-freedom spline basis set that spanned −2 to 6s around the time of action events (‘nose poke’, ‘ipsilateral press’, ‘contralateral press’ and ‘reward consumption’) and 0 to 8s following stimulus events (‘levers out’, ‘CS+’ and ‘CS−’). To generate these kernels, we fit the following linear regression with lasso regularization using the *lasso* function in MATLAB:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left[ \frac{1}{2T} \sum_{t=1}^{T} \left( F(t) - \beta_0 - \sum_{k=1}^{K} \sum_{j=1}^{N_{sp}} \beta_{jk} X_{jk}(t) \right)^2 + \lambda \sum_{k=1}^{K} \sum_{j=1}^{N_{sp}} \left| \beta_{jk} \right| \right]$$
where *F*(*t*) is the Z-scored GCaMP6f fluorescence of a given neuron at time *t*, *T* is the total time of recording, *K* is the total number of behavioral events used in the model, *N_{sp}* is the degrees-of-freedom for the spline basis set (25 in all cases), *β_{jk}* is the regression coefficient for the *j*^{th} spline basis function and *k*^{th} behavioral event, *β_{0}* is the intercept term and *λ* is the lasso penalty coefficient. The value of *λ* was chosen to minimize the mean squared error of the model, as determined by 5-fold cross validation. The predictors in our model, *X_{jk}*, were generated by convolving the behavioral events with a spline basis set, to enable temporally delayed versions of the events to predict neural activity:

$$X_{jk}(t) = \sum_{i=1}^{81} S_j(i)\, e_k(t - i)$$

where *S_{j}*(*i*) is the *j*^{th} spline basis function at time point *i*, with a length of 81 time bins (time window of −2 to 6s for action events or 0 to 8s for stimulus events sampled at 10 Hz), and *e_{k}* is a binary vector of length *T* representing the time of each behavioral event *k* (1 at each time point where a behavioral event was recorded using the MED Associates and TDT software, 0 at all other timepoints).

Using the regression coefficients *β_{jk}* generated from the above model, we then calculated a ‘response kernel’ for each behavioral event *k*:

$$\mathrm{kernel}_k(t) = \sum_{j=1}^{N_{sp}} \beta_{jk}\, S_j(t)$$

This kernel represents the (linear) response of a neuron to each behavioral event, while accounting for the linear component of the response of this neuron to the other events in the task.
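The kernel-estimation pipeline (event vector → basis convolution → lasso fit → kernel reconstruction) can be sketched as below on simulated data. Gaussian bumps stand in for the 25-dof spline basis and scikit-learn's `Lasso` stands in for MATLAB's `lasso`; all names and data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
T, dt = 3000, 0.1                         # 10 Hz sampling, as in the recordings
t_kernel = np.arange(0, 8.1, dt)          # 0-8 s stimulus window (81 bins)

# stand-in basis: 25 Gaussian bumps instead of the spline set used in the paper
centers = np.linspace(0, 8, 25)
B = np.exp(-0.5 * ((t_kernel[:, None] - centers[None, :]) / 0.5) ** 2)

events = (rng.random(T) < 0.01).astype(float)    # binary event vector e_k
true_kernel = np.exp(-t_kernel / 1.0)            # ground-truth response
F = np.convolve(events, true_kernel)[:T] + 0.1 * rng.standard_normal(T)

# predictors X_jk: the event vector convolved with each basis function
X = np.stack([np.convolve(events, B[:, j])[:T] for j in range(B.shape[1])], axis=1)
beta = Lasso(alpha=1e-4, max_iter=50000).fit(X, F).coef_
kernel_hat = B @ beta                            # recovered response kernel
```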

### Quantification of neural modulation to behavioral events

To identify neurons that were significantly modulated by each of the behavioral events in our task (fractions shown in **Figure 2g-h**), we used the encoding model described above, but without the lasso regularization:

$$F(t) = \beta_0 + \sum_{k=1}^{K} \sum_{j=1}^{N_{sp}} \beta_{jk} X_{jk}(t) + \varepsilon(t)$$

As above, *F*(*t*) is the Z-scored GCaMP6f fluorescence of a given neuron at time *t*, *K* is the total number of behavioral events used in the model, *N_{sp}* is the degrees-of-freedom for the spline basis set (25 in all cases), *β_{jk}* is the regression coefficient for the *j*^{th} spline basis function and *k*^{th} behavioral event and *β_{0}* is the intercept term. To determine the relative contribution of each behavioral event when predicting the activity of a neuron, we compared the full version of this model to a reduced model with the *X_{jk}* and *β_{jk}* terms associated with the behavioral event in question excluded. For each behavioral event, we first generated an F-statistic by comparing the fit of the full model containing all event predictors with that of the reduced model lacking the predictors associated with the event in question. We then calculated this same statistic on 500 instances of shuffled data, where shuffling was performed by circularly shifting the GCaMP6f fluorescence by a random integer. We then compared the F-statistic from the real data to the shuffled distribution to determine whether the removal of an event as a predictor compromised the model significantly more than expected by chance. If the resulting P-value was less than the significance threshold of P=0.01, after accounting for multiple comparison testing of each of the behavioral events by Bonferroni correction, then the event was considered significantly encoded by that neuron.
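The nested-model comparison with a circular-shift null can be sketched as follows (a minimal sketch using ordinary least squares on synthetic data; the function names are hypothetical and the design matrices stand in for the spline predictors):

```python
import numpy as np

def nested_f_stat(y, X_full, X_reduced):
    """F-statistic comparing a full OLS model against a nested reduced model."""
    def rss(X):
        Xi = np.column_stack([np.ones(len(y)), X])   # add an intercept column
        beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
        return float(np.sum((y - Xi @ beta) ** 2))
    rss_full, rss_red = rss(X_full), rss(X_reduced)
    df_num = X_full.shape[1] - X_reduced.shape[1]
    df_den = len(y) - X_full.shape[1] - 1
    return ((rss_red - rss_full) / df_num) / (rss_full / df_den)

def shuffle_p_value(y, X_full, X_reduced, n_shuffles=500, seed=0):
    """Null distribution built by circularly shifting the fluorescence trace
    by a random integer, as in the text."""
    rng = np.random.default_rng(seed)
    f_obs = nested_f_stat(y, X_full, X_reduced)
    f_null = [nested_f_stat(np.roll(y, rng.integers(1, len(y))), X_full, X_reduced)
              for _ in range(n_shuffles)]
    return float(np.mean(np.array(f_null) >= f_obs))
```

A predictor that truly drives the trace yields an observed F-statistic far outside the circular-shift null, giving a small P-value.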

To determine whether a neuron was significantly selective to the choice or outcome of a trial (‘choice-selective’ and ‘outcome-selective’, fractions of neurons from each population shown in **Figure 3a,c**), we utilized a nested model comparison test similar to that used to determine significant modulation of behavioral events above, where the full model used the following behavioral events as predictors: ‘nose poke’, ‘levers out’, ‘all lever press’, ‘ipsilateral lever press’, ‘all CS’, ‘CS+’ and ‘reward consumption’. For choice-selectivity, an F-statistic was computed for a reduced model lacking the ‘ipsilateral lever press’ predictors and significance was determined by comparing this value with a null distribution generated using shuffled data as described above. For outcome-selectivity, the reduced model used to test for significance lacked the predictors associated with both the ‘CS+’ and ‘reward consumption’ events.

By separating the lever press and outcome-related events into predictors that were either blind to the choice or outcome of the trial (‘all lever press’ and ‘all CS’, respectively) and those which included choice or outcome information (‘ipsilateral lever press’ or ‘CS+’ and ‘reward consumption’, respectively) we were able to determine whether the model was significantly impacted by the removal of either choice or outcome information. Therefore, neurons with significant encoding of the ‘ipsilateral lever press’ event (using the same P-value threshold determined by the shuffled distribution of F-statistics) were considered choice-selective, while those with significant encoding of the ‘CS+/reward consumption’ events were considered outcome-selective.

### Neural decoders

#### Choice decoder

In **Figure 3b**, we quantified how well simultaneously imaged populations of 1 to 10 PL-NAc or mTH-NAc neurons could be used to decode choice using a logistic regression:

$$\log\left(\frac{C(i)}{1 - C(i)}\right) = \beta_0 + \sum_{j=1}^{n} \beta_j X_j(i) + \varepsilon(i)$$ (**Equation 8**)

where *C*(*i*) is the probability the mouse made an ipsilateral choice on trial *i*, *β*_{0} is the offset term, *n* is the number of neurons (between 1 and 10), *β_{j}* is the regression weight for neuron *j*, *X_{j}*(*i*) is the mean z-scored GCaMP6f fluorescence of neuron *j* from −2s to 6s around the lever press on trial *i* and *ε*(*i*) is the error term.

Given that the mice’s choices were correlated across neighboring trials, we weighted the logistic regression based on the frequency of each trial type combination. This was to ensure that choice decoding of a given trial was a reflection of the identity of the lever press on the current trial as opposed to that of the previous or future trial. Thus, we classified each trial as one of eight ‘press sequence types’ based on the following ‘previous-current-future’ press sequences: ipsi-ipsi-ipsi, ipsi-ipsi-contra, ipsi-contra-contra, ipsi-contra-ipsi, contra-contra-contra, contra-contra-ipsi, contra-ipsi-ipsi, contra-ipsi-contra. We then used this classification to equalize the effects of press-sequence type on our decoder by generating weights corresponding to the inverse of the frequency of the press sequence type of that trial. These weights were then used as an input to the *fitglm* function in MATLAB, which was used to fit a weighted version of the logistic regression model above (**Equation 8**).

Decoder performance was evaluated with 5-fold cross-validation by calculating the proportion of correctly classified held-out trials. Predicted ipsilateral press probabilities greater than or equal to 0.5 were decoded as an ipsilateral choice and values less than 0.5 were decoded as a contralateral choice. This was repeated with 100 combinations of randomly selected, simultaneously imaged neurons from each mouse. Reported decoding accuracy is the average accuracy across the 100 runs and 5 combinations of train-test data for each mouse. Note that only 6/7 mice in the PL-NAc cohort were used in the decoder analyses as one mouse had fewer than 10 simultaneously imaged neurons.
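The inverse-frequency weighting can be sketched as below, assuming a binary ipsi/contra coding and toy neural activity (names are hypothetical; scikit-learn's `sample_weight` plays the role of the weights passed to MATLAB's `fitglm`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sequence_weights(choices):
    """Inverse-frequency weights for each (previous, current, future) press
    sequence type, so correlated trial histories don't dominate the decoder."""
    prev, cur, fut = choices[:-2], choices[1:-1], choices[2:]
    seq = prev * 4 + cur * 2 + fut                 # code the 8 sequence types 0..7
    counts = np.bincount(seq, minlength=8).astype(float)
    w = 1.0 / counts[seq]                          # inverse frequency per trial
    return w / w.mean()                            # normalize the mean weight to 1

rng = np.random.default_rng(4)
choices = (rng.random(400) < 0.5).astype(int)      # toy ipsi(1)/contra(0) presses
w = sequence_weights(choices)
neural = choices[1:-1, None] + 0.5 * rng.standard_normal((398, 10))  # toy activity
clf = LogisticRegression().fit(neural, choices[1:-1], sample_weight=w)
acc = clf.score(neural, choices[1:-1])
```

The first and last trials are dropped because they lack a previous or future press, respectively.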

#### Outcome decoder

For the outcome decoder in **Figure 3d**, we used the same weighted logistic regression used for choice decoding, except the dependent variable was the outcome of the trial (+1 for a reward, 0 for no reward) and the predictors were the average GCaMP6f fluorescence during the intertrial interval (ITI) of each trial. The ITI was defined as the time between CS presentation and either 1s before the next trial’s nose poke or 8s after the CS, whichever occurred first. This was used in order to avoid including any neural activity attributable to the next trial’s nose poke in our analysis.

To correct for outcome correlations between neighboring trials, we performed a similar weighting of predictors as performed in the choice decoder above using the following eight outcome sequence types: ‘reward-reward-reward’, ‘reward-reward-unreward’, ‘reward-unreward-unreward’, ‘reward-unreward-reward’, ‘unreward-unreward-unreward’, ‘unreward-unreward-reward’, ‘unreward-reward-reward’, ‘unreward-reward-unreward.’

#### Time-course choice decoder

To determine how well activity from PL-NAc and mTH-NAc neurons was able to predict the mouse’s choice as a function of time throughout the trial (**Figure 4e** & **Supplementary Figure 7e**), we trained separate logistic regressions on 500ms bins throughout the trial, using the GCaMP6f fluorescence of 10 simultaneously imaged neurons.

Because of the variability in task timing imposed by the jitter and the variability of the mice’s actions, we linearly interpolated the GCaMP6f fluorescence trace of each trial to a uniform length, *t_{adjusted}*, relative to behavioral events in our task. Specifically, for each trial *T*, we divided time into the following four epochs: (i) the 2s before the nose poke, (ii) the time from the nose poke to the lever press, (iii) the time from the lever press to the nose poke of the subsequent trial, *T* + 1, and (iv) the 3s following the next trial’s nose poke. For epochs ii and iii, *t_{adjusted}* was determined by interpolating the GCaMP6f fluorescence trace from each trial to a uniform length defined as the median time between the flanking events across all trials. Thus, *t_{adjusted}* within each epoch for each trial *T* was defined as:

$$t_{adjusted} = \begin{cases} t - t_{np}(T) & \text{epoch i} \\[4pt] \dfrac{t - t_{np}(T)}{t_{press}(T) - t_{np}(T)}\,\tilde{T}_{ii} & \text{epoch ii} \\[4pt] \tilde{T}_{ii} + \dfrac{t - t_{press}(T)}{t_{np}(T+1) - t_{press}(T)}\,\tilde{T}_{iii} & \text{epoch iii} \\[4pt] \tilde{T}_{ii} + \tilde{T}_{iii} + t - t_{np}(T+1) & \text{epoch iv} \end{cases}$$

where *t_{np}*(*T*) and *t_{press}*(*T*) are the times of the nose poke and lever press on the current trial, *t_{np}*(*T* + 1) is the time of the nose poke of the subsequent trial, and $\tilde{T}_{ii}$ and $\tilde{T}_{iii}$ are the median times across trials of epochs ii and iii.

The resulting time-adjusted GCaMP6f traces were divided into 500ms bins. For each bin, we fit the weighted logistic regression described above to predict choice on the current, previous or future trial from the activity of 10 simultaneously imaged neurons. Predictors were weighted based on press sequence type as described above. Decoding accuracy was assessed as described above using 100 combinations of 10 randomly selected neurons and 5-fold cross-validation. To determine if decoding was significantly above chance, which is 0.5, for each timepoint we performed a two-tailed, one-sample t-test.
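The per-trial piecewise-linear time adjustment can be sketched with `np.interp` (a minimal sketch with hypothetical event times and epoch medians; the real analysis stretches each epoch to its median duration as described above):

```python
import numpy as np

def warp_trial(t, F, t_np, t_press, t_np_next, med_ii, med_iii):
    """Map one trial's fluorescence onto the common adjusted timeline:
    epoch i  [-2, 0): pre nose poke (unstretched),
    epoch ii [0, med_ii): nose poke -> lever press,
    epoch iii [med_ii, med_ii + med_iii): press -> next nose poke,
    epoch iv: the 3 s after the next nose poke (unstretched)."""
    knots_real = [t_np - 2.0, t_np, t_press, t_np_next, t_np_next + 3.0]
    knots_adj = [-2.0, 0.0, med_ii, med_ii + med_iii, med_ii + med_iii + 3.0]
    t_adj = np.interp(t, knots_real, knots_adj)          # piecewise-linear warp
    grid = np.arange(-2.0, med_ii + med_iii + 3.0, 0.1)  # uniform 10 Hz grid
    return grid, np.interp(grid, t_adj, F)
```

Traces from all trials then share the same adjusted grid, so a decoder can be trained on each 500ms bin.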

### Statistics

All t-tests reported in the results were performed using either the *ttest* or *ttest2* function in MATLAB. In all cases, t-tests were two-tailed. In cases where multiple comparisons were performed, we applied a Bonferroni correction to determine the significance threshold. Two-proportion Z-tests (used to compare fractions of significantly modulated/selective neurons, **Figures 2g,h & 3a,c**) and Fisher’s Z (used to compare correlation coefficients, **Figure 4d & Supplementary Figure 7d**) were performed using Vassarstats.net.

For all t-tests in this paper, data distributions were assumed to be normal, but this was not formally tested. No statistical methods were used to predetermine sample sizes, but our sample sizes were similar to those generally employed in the field.

### Synaptic plasticity model

To computationally model how the brain could solve the reversal learning task *using fast dopamine-mediated synaptic plasticity*, we generated a biological instantiation of the TD algorithm for reinforcement learning (Sutton and Barto, 1998) by combining the recorded PL-NAc activity with known circuit connectivity in the NAc and associated regions (Hunnicutt et al., 2016; Kalivas et al., 1993; Otis et al., 2017; Watabe-Uchida et al., 2012). The goal of the model is to solve the “temporal credit assignment problem” by learning the value of each choice at the onset of the choice-selective PL-NAc sequence, when we assume the mouse makes its decision and which is well before the time of reward.

### Synaptic plasticity model description

#### The value function

Our implementation of the TD algorithm seeks to learn an estimate, at any given time, of the total discounted sum of expected future rewards, known as the value function *V*(*t*). To do this, we assume that the value function over time is decomposed into a weighted sum of temporal basis functions *f_{i}*(*t*) (Sutton and Barto, 1998) corresponding to the right-choice and left-choice preferring neurons:

$$V_R(t) = \sum_{i=1}^{n_R} w_i^R f_i^R(t), \qquad V_L(t) = \sum_{i=1}^{n_L} w_i^L f_i^L(t)$$

with the total value being given by the sum over both the left and right neurons as

$$V(t) = V_R(t) + V_L(t)$$

Here, *V_{R}*(*t*) and *V_{L}*(*t*) are the components of the value function encoded by the right- and left-preferring neurons respectively, *n_{R}* and *n_{L}* are the number of right- and left-preferring choice-selective neurons respectively, and $w_i^R$ and $w_i^L$ are the weights between the *i*^{th} PL neuron and the NAc, which multiply the corresponding basis functions. Thus, each term in *V_{R}*(*t*) or *V_{L}*(*t*) above corresponds to the activity of one of the striatal neurons in the model (**Figure 5a**). Note that, in our model, the total value *V*(*t*) sums the values associated with the left and right actions and is thus not associated with a particular action. At any given time on a given trial, the choice-selective activity inherent to the recorded PL-NAc neurons results predominantly in activation of the sequence corresponding to the chosen lever compared to the unchosen lever (see **Figure 5c**), so that a single sequence, corresponding to the chosen action, gets reinforced.

#### The reward prediction error (RPE)

TD learning updates the value function iteratively by computing errors in the predicted value function and using these to update the weights *w_{i}*. The RPE at each moment of time is calculated from the change in the estimated value function over a time step of size *dt* as follows:

$$\delta(t)\,dt = r(t)\,dt + e^{-dt/\tau}\, V(t) - V(t - dt)$$

where *δ*(*t*) is the reward prediction error per unit time. Here, the first two terms represent the estimated value at time *t*, which equals the sum of the total reward received at time *t* and the (discounted) expectation of rewards, i.e. value, at all times into the future. This is compared to the previous time step’s estimated value *V*(*t* – *dt*). The coefficient $e^{-dt/\tau}$ represents the temporal discounting of rewards incurred over the time step *dt*. Here *τ* denotes the timescale of temporal discounting and was chosen to be 0.8s.

To translate this continuous time representation of RPE signals to our biological circuit model, we assume that the RPE *δ*(*t*), is carried by dopamine neurons (Montague et al., 1996; Schultz et al., 1997). These dopamine neurons receive three inputs corresponding to the three terms on the right side of the above equation: a reward signal originating from outside the VTA, a discounted estimate of the value function *V*(*t*) that, in **Figure 5a**, represents input from the striatum via the ventral pallidum (Chen et al., 2019; Tian et al., 2016) and an oppositely signed, delayed copy of the value function *V*(*t* – Δ) that converges upon the VTA interneurons (Cohen et al., 2012).

Because the analytical formulation of TD learning in continuous time is defined in terms of the infinitesimal time step *dt*, but a realistic circuit implementation needs to be characterized by a finite delay time for the disynaptic pathway through the VTA interneurons, we rewrite the above equation approximately for a small but finite delay Δ as:

$$\delta(t)\,\Delta = r(t)\,\Delta + \gamma\, V(t) - V(t - \Delta)$$

where we have defined $\gamma = e^{-\Delta/\tau}$ as the discount factor corresponding to one interneuron time delay and, in all simulations, we chose a delay time Δ = 0.01*s*. Note that the discount factor is biologically implemented in the different strengths of the weights of the VP inputs to the GABA interneuron and the dopaminergic neuron in the VTA.
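The finite-delay RPE computation can be sketched in discrete time as follows (a minimal sketch with hypothetical names; the test value trace is a toy exponential that correctly anticipates a single reward, so the RPE vanishes before the reward arrives):

```python
import numpy as np

def rpe(r, V, dt=0.01, tau=0.8):
    """Discrete-time RPE with a one-step delay (Delta = dt):
    delta(t) * dt = r(t) * dt + gamma * V(t) - V(t - dt)."""
    gamma = np.exp(-dt / tau)                  # discount over one delay step
    return (r[1:] * dt + gamma * V[1:] - V[:-1]) / dt   # RPE per unit time
```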

The proposed circuit architecture of **Figure 5a** can be rearranged into several other, mathematically equivalent architectures (**Supplementary Figure 9**). These architectures are not mutually exclusive, so other more complicated architectures could be generated by superpositions of these architectures.

#### The eligibility trace

The RPE at each time step, *δ*(*t*)d*t*, was used to update the weights of the recently activated synapses, where the “eligibility” *E_{i}*(*t*) of a synapse for updating depends upon an exponentially weighted average of its recent past activity (Gerstner et al., 2018; Sutton and Barto, 1998):

$$E_i(t) = \int_{-\infty}^{t} e^{-(t - t')/\tau_e}\, f_i(t')\, dt'$$

which can be rewritten as

$$E_i(t) = e^{-dt/\tau_e}\, E_i(t - dt) + f_i(t)\, dt$$

or, in the limit *τ_{e}* ≫ *dt*,

$$\tau_e \frac{dE_i}{dt} = -E_i(t) + \tau_e f_i(t)$$

where *τ_{e}* defines the time constant of the decay of the eligibility trace, which was chosen to be 0.6s, consistent with (Gerstner et al., 2018; Yagishita et al., 2014).

#### Weight Updates

The weight of each PL-NAc synapse, *w_{i}*, is updated according to the product of its eligibility *E_{i}*(*t*) and the RPE rate *δ*(*t*) at that time, using the following update rule (Gerstner et al., 2018; Sutton and Barto, 1998):

$$w_i(t + dt) = w_i(t) + \alpha\, \delta(t)\, E_i(t)\, dt$$

where *α* = 0.009 (*spikes/s*)^{−1} was the learning rate. Note that the units of *α* derive from the units of weight being value · (spikes/s)^{−1}. The PL-NAc weights used in the model are thresholded to be non-negative so that the weights obey Dale’s principle.
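Putting the value, RPE, eligibility, and weight-update rules together, one trial of the plasticity model can be sketched as follows (a minimal sketch with hypothetical names; a single-step delay Δ = *dt* stands in for the interneuron delay, and toy sequential basis functions stand in for the recorded PL-NAc activity):

```python
import numpy as np

def td_update(w, f, r, dt=0.01, tau=0.8, tau_e=0.6, alpha=0.009):
    """One trial of the plasticity model: value from weighted basis functions,
    RPE with a one-step delay, exponentially decaying eligibility trace, and
    a non-negative (Dale's principle) weight update."""
    gamma = np.exp(-dt / tau)
    E = np.zeros_like(w)                     # eligibility reset at trial start
    V_prev = 0.0
    for t in range(f.shape[1]):
        V = w @ f[:, t]                      # value carried by the NAc population
        delta = (r[t] * dt + gamma * V - V_prev) / dt        # RPE rate
        E = E * np.exp(-dt / tau_e) + f[:, t] * dt           # eligibility trace
        w = np.maximum(w + alpha * delta * E * dt, 0.0)      # non-negative weights
        V_prev = V
    return w
```

With a reward arriving near the end of a sequence, the synapses active most recently before the reward carry the largest eligibility and therefore receive the largest credit.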

#### Action Selection

In the model, the decision to go left or right is determined by “probing” the relative values of the left versus right actions just prior to the start of the choice-selective sequence. To implement this, we assumed that the choice was read out in a noisy, probabilistic manner from the activity of the first 60 neurons in each (left or right) PL population prior to the start of the sequential activity. This was accomplished by providing a 50 ms long, noisy probe input to each of these PL neurons and reading out the summed activity of the left and the summed activity of the right striatal populations. The difference between these summed activities was then put through a softmax function (given below) to produce the probabilistic decision.

To describe this decision process quantitatively, we define the probability of making a leftward or rightward choice in terms of underlying decision variables *d_{left}* and *d_{right}* corresponding to the summed activity of the first 60 striatal neurons in each population:

$$d_{left} = \left\langle \sum_{i=1}^{60} w_i^L\, \eta_i^L(t) \right\rangle, \qquad d_{right} = \left\langle \sum_{i=1}^{60} w_i^R\, \eta_i^R(t) \right\rangle$$

where ⟨·⟩ denotes time-averaging over the 50 ms probe period and $\eta_i^L(t)$ and $\eta_i^R(t)$ denote the non-negative stochastic probe input, which was chosen independently for each neuron and each time step from a normal distribution (truncated at zero to enforce non-negativity) with mean prior to truncation equal to 0.05 s^{−1} (5% of peak activity) and a standard deviation of . Note that the weights used here correspond to the weights from the end of the previous trial, which we assume are the same as the weights at the beginning of the next trial. The probability of choosing the left or the right lever on a given trial *n* is modeled as a softmax function of these decision variables plus a “stay with the previous choice” term that models the tendency of mice in our study to return to the previously chosen lever irrespective of reward (**Figure 1d**), given by the softmax distribution

$$P_{left/right}(n) = \frac{\exp\left(\beta_{value}\, d_{left/right} + \beta_{stay}\, I_{left/right}\right)}{\exp\left(\beta_{value}\, d_{left} + \beta_{stay}\, I_{left}\right) + \exp\left(\beta_{value}\, d_{right} + \beta_{stay}\, I_{right}\right)}$$

where *I_{left/right}* is 1 if that action (i.e. left or right) was chosen on the previous trial and 0 otherwise, and *β_{value}* = 2500 and *β_{stay}* = 0.2 are free parameters that define the width of the softmax distribution and the relative weighting of the value-driven versus stay contributions to the choice.
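The softmax decision rule can be sketched as follows (hypothetical function names; the decision variables *d_{left}* and *d_{right}* are taken as precomputed, and the defaults use the parameter values given in the text):

```python
import numpy as np

def choose(d_left, d_right, prev_choice, beta_value=2500.0, beta_stay=0.2, rng=None):
    """Softmax over the decision variables plus a 'stay' bonus for the lever
    chosen on the previous trial. Returns (sampled choice, [P(left), P(right)])."""
    rng = rng or np.random.default_rng()
    stay = np.array([prev_choice == 'left', prev_choice == 'right'], dtype=float)
    logits = beta_value * np.array([d_left, d_right]) + beta_stay * stay
    logits -= logits.max()                          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return ('right' if rng.random() < p[1] else 'left'), p
```

With equal decision variables, the stay bonus alone biases the choice toward the previously chosen lever; because *β_{value}* is large, even small value differences dominate the stay term.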

### Synaptic plasticity model implementation

#### Block structure for the model

Block reversals were determined using the same criteria as in the probabilistic reversal learning task performed by the mice – the identity of the rewarded lever reversed after 10 rewarded trials plus a random number of trials drawn from the geometric distribution given by **Equation 1**. The model used p = 0.4 as in the reversal learning experiments. Given the variation in performance across the models that use PL-NAc, mTH-NAc or PL-NAc synchronous activity as input (see **Figure 5**), the average block length for each model varied as well (because block reversals depended upon the number of rewarded trials). The average block length for the single-trial PL-NAc model, single-trial mTH-NAc model and PL-NAc synchronous control were 22.6+/-6.3, 25.7+/-8.5 and 26.5+/-6.6 trials (mean +/- std. dev.), respectively. The PL-NAc model produced a similar block length as that of behaving mice (23.23 +/- 7.93 trials, mean +/- std. dev.). Because a block reversal in our task is dependent on the mice receiving a set number of rewards, the choices just prior to a block reversal are more likely to align with the identity of the block and result in reward (see **Figure 5e,j,o**). Thus, the increase in choice probability observed on trials close to the block reversal (most notably in the synchronous case in **Figure 5o**, but also notable in the mTH-NAc in **Figure 5j** and the PL-NAc in **Figure 5e**) is an artifact of this reversal rule and not reflective of the model learning choice values.

#### PL-NAc inputs to the neural circuit model

To generate the temporal basis functions *f_{i}*(*t*) (example activity shown in **Figure 5c**), we used the choice-selective sequential activity recorded from the PL-NAc neurons shown in **Figure 4b-c**. Spiking activity was inferred from calcium fluorescence using the CNMFe algorithm (Zhou et al., 2018) and choice-selectivity was determined using the nested model comparison used to generate **Figure 3a** (see *Quantification of neural modulation to behavioral events* above for details). Model firing rates were generated by Z-scoring the inferred spiking activity of each choice-selective PL-NAc neuron. The resulting model firing rates were interpolated using the *interp* function from Python’s numpy package to match the time step, *dt* = 0.01*s*, and smoothed using a Gaussian kernel with zero mean and a standard deviation of 0.2s using the *gaussian_filter1d* function from the ndimage module in Python’s SciPy package.

To generate a large population of model input neurons, we created a population of 368 choice-selective “pseudoneurons” on each trial. This was done as follows: for each simulated trial, we created 4 copies (pseudoneurons) of each of the 92 recorded choice-selective PL-NAc neurons using that neuron’s inferred spiking activity from 4 different randomly selected trials. The pool of experimentally recorded trials from which pseudoneuron activities were chosen was balanced to have equal numbers of stay and switch trials. This was done because the choices of the mice were strongly positively correlated from trial to trial (i.e., there were more stay than switch trials), which (if left uncorrected) could potentially lead to biases in model performance if activity late in a trial was reflective of the choice on the next, rather than the present, trial. To avoid choice bias in the model, we combined the activity of left- and right-choice-preferring recorded neurons when creating the pool of pseudoneurons. We then randomly selected 184 left-choice-preferring and 184 right-choice-preferring model neurons from this pool of pseudoneurons. An identical procedure, using the 92 most choice-selective mTH-NAc neurons, was followed to create the model mTH-NAc neurons. The identity of these 92 neurons was determined by ranking each neuron’s choice-selectivity using the p-value calculated to determine choice-selectivity (see **Quantification of neural modulation to behavioral events** above for details).
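The pseudoneuron construction can be sketched as follows (a minimal sketch; array shapes and names are hypothetical, and the stay/switch balancing simply draws equal numbers of trials from each pool):

```python
import numpy as np

def make_pseudoneurons(activity, stay_trials, switch_trials, n_copies=4, rng=None):
    """activity: (n_neurons, n_trials, n_time) inferred spiking. Each neuron
    contributes n_copies pseudoneurons per simulated trial, each drawn from a
    random real trial taken from a pool balanced between stay and switch trials."""
    rng = rng or np.random.default_rng()
    n_pool = min(len(stay_trials), len(switch_trials))
    pool = np.concatenate([rng.choice(stay_trials, n_pool, replace=False),
                           rng.choice(switch_trials, n_pool, replace=False)])
    picks = rng.choice(pool, size=(activity.shape[0], n_copies))
    # fancy indexing: neuron i keeps its own activity, from its picked trials
    return activity[np.arange(activity.shape[0])[:, None], picks]
```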

To generate the synchronous PL-NAc activity used in **Figure 5m-q**, the temporal basis function of each PL neuron was time-shifted so the peak probability of firing was 2s before the time of the lever press.

To mimic the PL-NAc activity during the optogenetic stimulation of PL-NAc neurons (**Figure 6b**), we set the model PL input activity *f_{i}*(*t*) equal to 0.3 for a randomly selected 65% of PL neurons, at all times *t* from the time of the simulated nose poke to 2s after the reward presentation. These ‘stimulation trials’ occurred on a random 10% of trials. 65% of PL neurons were activated to mimic the incomplete penetrance of ChR2 viral expression.

#### Reward input to the neural circuit model

The reward input *r*(*t*) to the dopamine neurons was modeled by a truncated Gaussian temporal profile centered at the time of the peak reward:

*r*(*t*) = *R*(*i*) exp(−(*t* − *μ _{r}*)^{2}/2*σ*^{2}) (**Equation 20**)

where *R*(*i*) is 1 if trial *i* was rewarded and 0 otherwise, *μ _{r}* is the time of peak reward, and *σ* defines the width of the Gaussian (0.2s in all cases, a width chosen to approximate the distribution of dopamine activity in response to reward stimuli observed in previous studies; Matsumoto and Hikosaka, 2009; Schultz et al., 1997). For each trial, a value of *μ _{r}* was randomly drawn from a uniform distribution spanning 0.2–1.2s from the time of the lever press. This distribution was chosen to reflect the 1s jitter between lever press and reward used in our behavioral task (see Methods above) as well as the observed delay between reward presentation and peak dopamine release in a variety of studies (Cohen et al., 2012; Matsumoto and Hikosaka, 2009; Parker et al., 2016; Saunders et al., 2018). To ensure that no residual reward response occurred before the time of the lever press, *r*(*t*) was set to 0 at any time *t* more than 0.2s before the time of the peak reward, *μ _{r}*.
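The truncated Gaussian reward input can be sketched as follows; the time grid is illustrative, and the exact Gaussian expression is an assumption consistent with the parameters given in the text.

```python
import numpy as np

def reward_input(t, rewarded, mu_r, sigma=0.2):
    """Truncated Gaussian reward input r(t): zero on unrewarded trials,
    and zero at any time more than 0.2 s before the peak time mu_r."""
    r = float(rewarded) * np.exp(-(t - mu_r) ** 2 / (2 * sigma ** 2))
    return np.where(t < mu_r - 0.2, 0.0, r)

t = np.arange(-3.0, 3.0, 0.01)        # time relative to lever press (s)
mu_r = np.random.uniform(0.2, 1.2)    # jittered peak reward time per trial
r = reward_input(t, rewarded=True, mu_r=mu_r)
```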

#### Initial weights

The performance of the model does not depend on the choice of the initial weights as the model learns the correct weights by the end of the first block irrespective of the chosen initial weights. We chose the initial weights to be zero.

#### Weight and eligibility update implementation

We assumed that the weight and eligibility trace updates start at the time of the simulated nose poke. The nose poke time, relative to the time of the lever press, varies due to a variable delay between the nose poke and the lever presentation, as well as variation in the time between lever presentation and lever press. To account for this, the weight and eligibility trace updates are initiated at time *t* = *t _{start}*, where *t _{start}* was drawn from a Gaussian distribution with a mean of −2.5s and a variance of 0.2s, which was approximately both the time of the nose poke and the time at which choice-selective sequences initiated in the experimental recordings. The eligibility trace is reset to zero at the beginning of each trial. We stopped updating the weights at the end of the trial, defined as 3s after the time of the lever press. The eligibility traces were updated according to **Equation 16**. The weights were updated by integrating **Equation 17** with a first-order forward Euler routine. In all simulations, we used a simulation time step d*t* = 0.01s.
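Since Equations 16–17 are not reproduced in this excerpt, the sketch below assumes a standard form: an exponentially decaying eligibility trace driven by presynaptic activity, and a weight update proportional to the dopamine RPE times the trace, both integrated with a first-order forward Euler routine as described. The time constant, learning rate, and activity traces are toy values.

```python
import numpy as np

dt = 0.01            # simulation time step (s), as in the text
tau_e = 0.5          # hypothetical eligibility-trace time constant (s)
alpha = 0.1          # hypothetical learning rate

t = np.arange(-2.5, 3.0, dt)                 # nose poke to 3 s after press
f = np.exp(-((t + 1.0) ** 2) / 0.1)          # toy presynaptic activity burst
delta = np.exp(-((t - 0.5) ** 2) / 0.1)      # toy dopamine RPE burst

e, w = 0.0, 0.0                              # trace reset at trial start
for f_t, d_t in zip(f, delta):
    e += dt * (-e / tau_e + f_t)             # Euler step: eligibility trace
    w += dt * alpha * d_t * e                # Euler step: synaptic weight
```

Because the eligibility trace outlasts the presynaptic burst, a later RPE can still strengthen the weight, which is the credit-assignment mechanism the model relies on.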

### Neural dynamics model

To computationally model how the brain could solve the reversal learning task *without fast dopamine-mediated synaptic plasticity*, we used an actor-critic network based on the meta-RL framework introduced by Wang et al. (2018). The model actor and critic networks are recurrent neural networks of Long Short Term Memory (LSTM) units whose weights are learned slowly during the training phase of the task. The weights are then frozen during the testing phase so that fast reversal learning occurs only through the activation dynamics of the recurrent actor-critic network. Like the synaptic plasticity model, we input recorded PL-NAc activity to a value-generating “critic” network (conceived of as NAc, VP, and associated cortical regions) to generate appropriate reward prediction error signals in dopamine neurons. Unlike the synaptic plasticity model, the reward prediction error signals in this model are sent to an explicit actor network (conceived of as DMS and associated cortical regions), where they act as an input to help generate appropriate action signals based on reward history.

### Neural dynamics model description

#### LSTM

The model comprises two separate fully connected, gated recurrent neural networks of LSTM units, one each for the actor and critic networks. An LSTM unit keeps track of a ‘long-term memory’ state (the ‘memory state’ **c**(*t*), also known as the cell state) and a ‘short-term memory’ state (the ‘output state’ **h**(*t*), also known as the hidden state) at all times. To regulate the information to be kept or discarded in the memory and output states, LSTMs use three types of gates: the input gate **i**(*t*) regulates what information is input to the network, the forget gate **ϕ**(*t*) regulates what information to forget from the previous memory state, and the output gate **o**(*t*) (not to be confused with the output state **h**(*t*)) regulates the output of the network. More precisely, the dynamics of an LSTM are defined by the following equations:

**i**(*t*) = *σ*(*W _{i}***x**(*t*) + *U _{i}***h**(*t* − 1) + *b _{i}*)

**ϕ**(*t*) = *σ*(*W _{ϕ}***x**(*t*) + *U _{ϕ}***h**(*t* − 1) + *b _{ϕ}*)

**o**(*t*) = *σ*(*W _{o}***x**(*t*) + *U _{o}***h**(*t* − 1) + *b _{o}*)

**c**(*t*) = **ϕ**(*t*) ⊙ **c**(*t* − 1) + **i**(*t*) ⊙ tanh(*W _{c}***x**(*t*) + *U _{c}***h**(*t* − 1) + *b _{c}*)

**h**(*t*) = **o**(*t*) ⊙ tanh(**c**(*t*))

where **x**(*t*) is the vector of external inputs to the LSTM network at time step *t*, *W _{q}* and *U _{q}* are the weight matrices of the input and recurrent connections, respectively (the subscript *q* denotes the state or gate being updated), *b _{q}* are the bias vectors, ⊙ denotes element-wise multiplication, and *σ* denotes the sigmoid function:

*σ*(*x*) = 1/(1 + *e*^{−*x*})
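A single LSTM update following the standard gate equations can be sketched in NumPy; the layer sizes and random weights below are illustrative, not the trained 128-unit networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM update. W, U, b are dicts keyed by 'i' (input gate),
    'f' (forget gate), 'o' (output gate) and 'g' (candidate memory)."""
    i = sigmoid(W['i'] @ x + U['i'] @ h + b['i'])     # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h + b['f'])     # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h + b['o'])     # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h + b['g'])     # candidate memory
    c_new = f * c + i * g                             # memory (cell) state
    h_new = o * np.tanh(c_new)                        # output (hidden) state
    return h_new, c_new

rng = np.random.default_rng(1)
n_in, n_units = 3, 5                                  # toy dimensions
W = {q: rng.normal(0, 0.1, (n_units, n_in)) for q in 'ifog'}
U = {q: rng.normal(0, 0.1, (n_units, n_units)) for q in 'ifog'}
b = {q: np.zeros(n_units) for q in 'ifog'}
h, c = np.zeros(n_units), np.zeros(n_units)           # zero initial states
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```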

#### Critic network

As in the synaptic plasticity model, the goal of the critic is to learn the value (discounted sum of future rewards) of a given choice at any time in a trial. The learned value signal can then be used to generate the RPE signals that are sent to the actor. The critic is modelled as a network of LSTM units that linearly project through trainable weights to a value readout neuron that represents the estimated value *V*(*t*) at time step *t*. The critic takes as input the reward received at the previous time step *r*(*t* − 1) and the experimentally recorded PL-NAc choice-selective sequential input *C*(*t*). The PL-NAc input provides the critic with a representation of the chosen side on the current trial as well as the time during the trial. This allows the critic to output an appropriately timed value signal (and consequently an appropriately timed RPE signal) corresponding to the chosen side. The reward input acts as a feedback signal to the critic that evaluates the correctness of the generated value signal.

To map the critic to a biological neural circuit, we hypothesize that NAc, together with VP and associated cortical areas, form the critic recurrent neural network (**Figure 7a**; Atallah et al., 2007; Lau and Glimcher, 2008; O’Doherty et al., 2004; Richard et al., 2016; Tsutsui et al., 2016). The choice-selective sequential input *C*(*t*) to the critic is provided by the recorded choice-selective sequential activity in PL-NAc neurons (**Figure 7a**).

#### The reward prediction error (RPE)

As in the synaptic plasticity model (**Figure 5a**), the RPE *δ*(*t*) is computed in the VTA DA neurons based on the value signal from the critic network (**Figure 7a**).

Unlike the synaptic plasticity model, the RPE signal is conveyed by the VTA dopamine neurons to the actor network.

#### Actor network

In contrast to the synaptic plasticity model, in which actions were directly read out from the activity of the value neurons early in the trial, we consider an explicit actor network that generates actions. The actor is modelled as a network of LSTM units that compute the policy, i.e., the probability of choosing an action *a*(*t*) at time step *t* given the current state of the network. The policy is represented by three policy readout neurons, corresponding to choosing left, right or ‘do nothing’, whose activities are given by a (trainable) linear readout of the activities of the actor LSTM units. The actor receives three inputs: (i) an efference copy of the action taken at the previous time step *a*(*t* – 1), (ii) a ‘temporal context’ input *ξ*(*t*), encoded as a vector of all 0s except for a value of 1 in the entry corresponding to the current time point *t*, that provides the actor with a representation of the time within the trial, and (iii) the reward prediction error at the previous time step *δ*(*t* – 1).

To map the actor to a biological neural circuit, we hypothesize that the DMS and associated cortical areas form the actor recurrent neural network (**Figure 7a**; Atallah et al., 2007; O’Doherty et al., 2004; Seo et al., 2012; Tai et al., 2012). The temporal sequence input *ξ*(*t*) to the actor is assumed to originate in the hippocampus or other cortical areas (**Figure 7a**; Hahnloser et al., 2002; Howard and Eichenbaum, 2013; Kozhevnikov and Fee, 2007; Zhou et al., 2020).

#### Training algorithm

To train the recurrent weights of the network, which are then held fixed during task performance, we implement the Advantage Actor-Critic algorithm (Mnih et al., 2016) on a slightly modified version of the reversal learning task (see “Block structure for training” section below). In brief, the weights of the neural network are updated via gradient descent and backpropagation through time. The loss function for the critic network, *L _{υ}*, defines the error in the estimated value function *V*(*t*). The synaptic weight parameters *θ _{υ}* of the critic network are updated through gradient descent on the critic loss function *L _{υ}*:

Δ*θ _{υ}* = *α β _{υ} δ _{t}*(*s _{t}*; *θ _{υ}*) ∇_{θ_υ}*V*(*s _{t}*; *θ _{υ}*)

where *α* is the learning rate, *s _{t}* is the state at time step *t*, *V* denotes the value function, and *β _{υ}* is the scaling factor of the critic loss term. *δ _{t}*(*s _{t}*; *θ _{υ}*) is the k-step return temporal difference error (not to be confused with the RPE input to the actor defined in **Equation 22**), defined as follows:

*δ _{t}*(*s _{t}*; *θ _{υ}*) = *R _{t}* − *V*(*s _{t}*; *θ _{υ}*)

where *R _{t}* is the discounted k-step bootstrapped return at time *t*, defined recursively as

*R _{t}* = *r _{t}* + *γ _{t}R _{t+1}*, with *R _{t+k}* = *V*(*s _{t+k}*; *θ _{υ}*)

where *r _{t}* is the reward received at time step *t*, *γ _{t}* is the discount factor at time step *t* (defined below), and *k* is the number of time steps until the end of an episode.

The loss function for the actor network, *L _{π}*, is given by a weighted sum of two terms: a policy gradient loss term, which enables the actor network to learn a policy *π*(*a _{t}|s _{t}*) that approximately maximizes the estimated sum of future rewards *V*(*s _{t}*), and an entropy regularization term that maximizes the entropy of the policy *π* to encourage the actor network to explore by avoiding premature convergence to suboptimal policies. The gradient of the actor loss function with respect to the synaptic weight parameters of the actor network, *θ*, is given by

∇_{θ}*L _{π}* = −*δ _{t}*(*s _{t}*; *θ _{υ}*) ∇_{θ} log *π*(*a _{t}|s _{t}*; *θ*) − *β _{e}* ∇_{θ}*H*(*s _{t}*; *θ*)

where *a _{t}* is the action at time step *t*, *π* is the policy, *β _{e}* is the scaling factor of the entropy regularization term, and *H*(*s _{t}*; *θ*) is the entropy of the policy:

*H*(*s _{t}*; *θ*) = −Σ_{a∈A} *π*(*a|s _{t}*; *θ*) log *π*(*a|s _{t}*; *θ*)

where *A* denotes the space of all possible actions.
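The bootstrapped return and the resulting TD errors used in the critic and actor losses can be sketched numerically; the rewards, values, and policy probabilities below are toy numbers with a constant discount factor, not model outputs.

```python
import numpy as np

gamma = 0.96                              # constant discount for this sketch
rewards = np.array([0.0, 0.0, 1.0])       # toy rewards r_t over one episode
values = np.array([0.5, 0.6, 0.8, 0.0])   # V(s_t), with V at episode end = 0

# Discounted bootstrapped returns R_t, computed backwards from episode end
R = np.zeros(len(rewards))
bootstrap = values[-1]
for t in reversed(range(len(rewards))):
    bootstrap = rewards[t] + gamma * bootstrap
    R[t] = bootstrap

delta = R - values[:-1]                   # TD errors delta_t = R_t - V(s_t)
critic_loss = 0.5 * np.sum(delta ** 2)    # up to the scaling factor beta_v

# Policy-gradient term for the chosen action at each step (entropy omitted)
log_pi_chosen = np.log(np.array([0.5, 0.6, 0.7]))   # toy pi(a_t | s_t)
actor_loss = -np.sum(log_pi_chosen * delta)
```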

### Neural dynamics model implementation

#### LSTM

Both the actor and critic LSTM networks consisted of 128 units each and were implemented using TensorFlow’s Keras API. The weight matrices *U*_{q} were initialized using Keras’s ‘glorot_uniform’ initializer, the weight matrices *W*_{q} were initialized using Keras’s ‘orthogonal’ initializer, and the biases **b** were initialized to 0. The output and memory states for both LSTM networks were initialized to zero at the beginning of each training or testing episode.

#### PL-NAc inputs to the critic

Input to the critic was identical to the smoothed, single-trial input used for the synaptic plasticity model described above, except i) activity was not interpolated because each time step in this model was equivalent to the sampling rate of the collected data (10 Hz), and ii) we chose to input only the activity from 2s before to 2s after the lever press (as compared to 3s after the lever press for the synaptic plasticity model) in order to reduce the computational complexity of the training process. To reduce episode length, and therefore training time, we also excluded those neurons whose peak activity occurred more than 2s after the lever press, reducing the final number of ‘pseudoneurons’ used as input to 306 (compared with 368 for the synaptic plasticity model).

Optogenetic-like stimulation of the PL-NAc population was performed in a similar manner to the synaptic plasticity model, with activity set to 0.15 for a randomly selected 60% of neurons for the duration of the trial.

#### Trial structure

Each trial was 4s long starting at 2s before lever press and ending at 2s after lever press. At any given time, the model has three different choices - choose left, choose right or do nothing. Similar to the synaptic plasticity model, the model makes its decision to choose left or right at the start of a trial, which then leads to the start of the corresponding choice-selective sequential activity. However, unlike the synaptic plasticity model, the model can also choose ‘do nothing’ at the first time step, in which case an activity pattern of all zeros is input to the critic for the rest of the trial. For all other time steps, the correct response for the model is to ‘do nothing’. Choosing ‘do nothing’ on the first time step or choosing something other than ‘do nothing’ on the subsequent time steps results in a reward *r*(*t*) of −1 at that time. If a left or right choice is made on the first time step, then the current trial is rewarded based on the reward probabilities of the current block (**Figure 1a**) and the reward input *r*(*t*) to the critic is modeled by a truncated Gaussian temporal profile centered at the time of the peak reward (**Equation 20**) with the same parameters as in the synaptic plasticity model.

#### Block structure for training

We used a slightly modified version of the reversal learning task performed by the mice in which the block reversal probabilities were altered in order to make the block reversals unpredictable. This was done to discourage the model from learning the expected times of block reversals and to instead mimic the results of our behavioral regressions (**Figure 1e**) suggesting that the mice use only the previous ~4 trials to make a choice. To make the block reversals unpredictable, the identity of the high-probability lever reversed after a random number of trials drawn from a geometric distribution (**Equation 1**) with *p* = 0.9.
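Sampling training block lengths from this geometric distribution is a one-liner; note that with *p* = 0.9 the mean block length is 1/*p* ≈ 1.1 trials, which is what makes reversal times unpredictable during training.

```python
import numpy as np

# Sketch of sampling training block lengths (in trials) from the
# geometric distribution with p = 0.9, as described in the text.
rng = np.random.default_rng(2)
block_lengths = rng.geometric(p=0.9, size=10000)
print(block_lengths.mean())  # close to 1/p, i.e. about 1.11 trials
```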

#### Training

Each training episode was chosen to be 15 trials long and the model was trained for 62000 episodes. The values of the training hyperparameters were as follows: the scaling factor of the critic loss term *β _{υ}* = 0.05, the scaling factor of the entropy regularization term *β _{e}* = 0.05, the learning rate *α* = 0.01 s^{−1} (*α* = 0.001 per time step), and the timescale of temporal discounting within a trial *τ* = 2.45s, leading to a discount factor *γ* = *e*^{−d*t*/*τ*} = *e*^{−0.1/2.45} ≈ 0.96 for all times except the last time step of a trial, when the discount factor was set to 0 to denote the end of the trial. The network’s weights and biases were trained using the RMSprop gradient descent optimization algorithm (Hinton et al., 2012) and backpropagation through time, which involved unrolling the LSTM network over an episode (630 time steps).

#### Block structure for testing

Block reversal probabilities for the testing phase were the same as in the probabilistic reversal learning task performed by the mice. The average block length for the PL-NAc neural dynamics model was 19.9 ± 5.3 trials (mean ± std. dev.).

#### Testing

The model’s performance (**Figures 7b-k**) was evaluated in a testing phase during which all network weights were held fixed so that reversal learning was accomplished solely through the neural dynamics of the LSTM networks. The network weights used in the testing phase were the weights learned at the end of the training phase. A testing episode was chosen to be 1500 trials long and the model was run for 120 episodes.

#### Actor network analysis

For **Figures 7f-i**, we tested the model’s performance on a slightly modified version of the reversal learning task in which block lengths were fixed at 30 trials. This facilitated the calculation and interpretation of the block-averaged activity on a given trial of a block. Dimensionality reduction of the actor network activity (**Figure 7g**) was performed using the *PCA* function from the decomposition module in Python’s scikit-learn package.

#### Replacing sequential input to the critic with persistent input

In **Figure 7e**, we analyzed how model performance changed when the temporal structure provided by the choice-selective sequential inputs to the critic were replaced during training by persistent choice-selective input. The persistent choice-selective input was generated by setting the activity of all the left choice-selective neurons to 1 and all the right-choice selective neurons to 0 for all time points on left-choice trials and vice versa on right-choice trials.

### Cross-trial analysis of RPE in dopamine neurons

To generate the regression coefficients in **Figure 5g,l,q, Figure 7d** and **Supplementary Figure 10c,d**, we performed a linear regression analysis adapted from Bayer and Glimcher (2005), which uses the mouse’s reward outcome history from the current and previous 5 trials to predict the average dopamine response to reward feedback on a given trial, *i*:

*D*(*i*) = Σ_{j=0}^{5} *β _{j}R*(*i* − *j*)

where *D*(*i*) is the average dopamine activity from 0.2 to 1.2s following reward feedback on trial *i*, *R*(*i* − *j*) is the reward outcome *j* trials back from trial *i* (1 if the trial *j* trials back was rewarded and 0 if unrewarded), and *β _{j}* are the calculated regression coefficients that represent the effect of reward outcome *j* trials back on the strength of the average dopamine activity, *D*(*i*). For the regression coefficients generated from recorded dopamine activity (**Supplementary Figure 10c,d**), we used the Z-scored GCaMP6f fluorescence from VTA-NAc terminal recordings of 11 mice performing the same probabilistic reversal learning task described in this paper (for details, see Parker et al., 2016). The regression coefficients for the experimental data, as well as the synaptic plasticity and neural dynamics model simulations, were fit using the *LinearRegression* function from the linear_model module in Python’s scikit-learn package.
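The reward-history regression can be sketched on synthetic data as follows; the paper fits it with scikit-learn’s *LinearRegression*, which an ordinary least-squares solve with an intercept reproduces. The toy dopamine trace here is constructed so that only the current-trial outcome matters.

```python
import numpy as np

rng = np.random.default_rng(3)
n_trials = 2000
outcomes = rng.binomial(1, 0.5, n_trials)          # R(i): 1 if trial i rewarded

# Design matrix: column j holds the reward outcome j trials back from trial i
X = np.column_stack([outcomes[5 - j : n_trials - j] for j in range(6)])
# Toy dopamine response D(i), driven only by the current-trial outcome
D = outcomes[5:] + 0.05 * rng.normal(size=n_trials - 5)

# Ordinary least squares with an intercept column (equivalent to the
# default behavior of sklearn's LinearRegression)
A = np.column_stack([X, np.ones(len(X))])
beta = np.linalg.lstsq(A, D, rcond=None)[0][:6]    # one coefficient per lag
```

For this synthetic trace, the lag-0 coefficient recovers the built-in dependence on the current outcome while the other lags stay near zero.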

### Optogenetic stimulation of PL-NAc neurons

22 male C57BL/6J mice were bilaterally injected in either the PL (n=14 mice, M–L ± 0.4, A–P 2.0 and D–V −2.5 mm) or mTH (n=8 mice, M–L ± 0.3, A–P −0.7 and D–V −3.6 mm) with 600nl AAV2/5-CaMKIIa-hChR2-EYFP (UPenn vector core, injected 0.6 μl per hemisphere of titer of 9.6 × 10^{13} pp per ml). Optical fibers (300 μm core diameter, 0.37 NA) delivering 1–2 mW of 447 nm laser light (measured at the fiber tip) were implanted bilaterally above the NAc Core at a 10 degree angle (M–L ±1.1, A–P 1.4 and D–V −4.2 mm). An additional cohort of control mice (n=8) were implanted with optical fibers in the NAc without injection of ChR2 and underwent the same stimulation protocol outlined below (**Supplementary Figure 12e-h**). Mice were anesthetized for implant surgeries with isoflurane (3–4% induction and 1–2% maintenance). Mice were given 5 days of recovery after the surgical procedure before behavioral testing.

During behavioral sessions, 5 ms pulses of 1–3 mW, 447 nm blue light were delivered at 20 Hz on a randomly selected 10% of trials, beginning when the mouse entered the central nose poke. Light stimulation on unrewarded trials ended 1s after the end of the CS− presentation. On rewarded trials, light administration ended either 1s after CS+ presentation (‘cohort 1’) or at the end of reward consumption, as measured by the mouse not engaging the reward port for 100ms (‘cohort 2’). See **Supplementary Figure 13** for a schematic of stimulation times as well as the behavior of the two cohorts. Mice alternated between sessions with and without stimulation; sessions without stimulation were excluded from analysis. Anatomical targeting was confirmed as successful in all mice through histology after the experiment, and therefore no mice were excluded from this data set.

To quantify the effect of laser stimulation on latency times shown in **Supplementary Figure 12a-d**, we ran a mixed effects linear model using the *fitglme* package in MATLAB. In this model, the median latency to initiate a trial of a mouse, defined as the time between illumination of the central nose poke (i.e., trial start) and the mouse initiating a trial via nose poke, was predicted using i) opsin identity (PL-NAc CaMKII-ChR2, mTH-NAc CaMKII-ChR2 or no-opsin controls), ii) laser stimulation on the current trial, iii) laser stimulation on the previous trial, iv) the interaction between opsin identity and laser stimulation on the current trial and v) the interaction between opsin and laser stimulation on the previous trial. To account for individual variation between mice, a random effect of mouse ID was included.

### Simulation of model-free versus model-based task performance

#### Overview

In order to identify possible RPE signatures that distinguish ideal observer (“model-based”) versus Q-learning (“model-free”) behavior in this task (**Supplementary Figure 14**), we simulated choices using the two models. Based on the dopaminergic signature of block reversal inference reported in Bromberg-Martin et al. (2010), we first confirmed that our ideal observer and Q-learning models gave rise to distinct dopamine signatures when performing the task used in Bromberg-Martin et al. (2010). In this task, reward probabilities were 100% and 0% for the “high probability” and “low probability” choices, respectively, and the reward probabilities reversed with a 5% probability on each trial. Next, we applied the same framework to our task, to determine if we could observe similar distinctions between the models. In this case, the reward probabilities were 70% and 10%, as in the task studied in this paper, and blocks reversed with a 5% probability on each trial, which resulted in block lengths comparable to those observed in our experiments.

#### Ideal observer model

The ideal observer model was provided with knowledge of the reward probabilities associated with each block and the probability of block reversal on each trial. The 5% block reversal probability on each trial can be written in terms of the block state transition probabilities as

*T _{ij}* = 0.95 if *i* = *j*, and *T _{ij}* = 0.05 if *i* ≠ *j*

where *T _{ij}* is defined as the transition probability between block state *s _{i}* on trial *t* and block state *s _{j}* on trial *t* + 1. Here, ‘block state’ refers to whether the current block has a higher probability of left or right choices being rewarded. The reward probabilities for each block were as follows:

*R _{ik}* = 0.7 if choice *c _{k}* is the high-probability choice in block state *s _{i}*, and *R _{ik}* = 0.1 otherwise (1 and 0, respectively, for the simulations of the task of Bromberg-Martin et al. (2010))

where *R _{ik}* is defined as the probability of reward for block state *s _{i}* and choice *c _{k}*.

On each trial, the ideal observer model selects the choice with the highest expectation of reward based on its belief about the current block state given the choice and reward history. The expectation of reward *ρ _{l}*(*t* + 1) for choice *l* on trial *t* + 1, given the entire reward history *r*(1: *t*) and choice history *c*(1: *t*) up until trial *t*, is given by

*ρ _{l}*(*t* + 1) = Σ_{i} Σ_{j} *R _{il}T _{ji}P*(*s*(*t*) = *s _{j}*|*r*(1: *t*), *c*(1: *t*)) (**Equation 28**)

where *l* can be either 1 (left choice) or 2 (right choice) and *P*(*s*(*t*) = *s _{j}*|*r*(1: *t*), *c*(1: *t*)) is the probability of being in block state *s _{j}* on trial *t* given the entire reward and choice history up to and including trial *t*.

**Equation 28** tells us that estimating the block state probability *P*(*s*(*t*) = *s _{j}*|*r*(1: *t*), *c*(1: *t*)) will provide us with an estimate of the expected reward for a given choice on trial *t* + 1, as *R _{il}* and *T _{ji}* are already known. Using Bayes’ theorem, we can estimate the block state probability as

*P*(*s*(*t*) = *s _{j}*|*r*(1: *t*), *c*(1: *t*)) = *P*(*r*(*t*)|*r*(1: *t* − 1), *c*(1: *t*), *s*(*t*) = *s _{j}*) *P*(*s*(*t*) = *s _{j}*|*r*(1: *t* − 1), *c*(1: *t*)) / *P*(*r*(*t*)|*r*(1: *t* − 1), *c*(1: *t*)) (**Equation 29**)

_{ji}The first term in the numerator of the right hand side of **Equation 29**, *P*(*r*(*t*)|*r*(1: *t* – 1), *c*(1: *t*), *s*(*t*) = *s _{j}*), is the probability of receiving reward

*r*(

*t*) (1 if rewarded and 0 if unrewarded) on trial

*t*given the current choice

*c*(

*t*) =

*c*, the block state

_{k}*s*, and the reward history

_{j}*r*(1:

*t*– 1) and the choice history

*c*(1:

*t*– 1) up to trial

*t*– 1. Because the past history of rewards and choices does not affect the reward probability once the block state is known, this can be rewritten as

The second term in the numerator of the right hand side of **Equation 29**, *P*(*s*(*t*) = *s _{j}*|*r*(1: *t* – 1), *c*(1: *t*)), is the probability that the current block state is *s _{j}* given the reward and choice history. This can be rewritten as

*P*(*s*(*t*) = *s _{j}*|*r*(1: *t* − 1), *c*(1: *t*)) = *P*(*s*(*t*) = *s _{j}*|*r*(1: *t* − 1), *c*(1: *t* − 1)) = Σ_{m} *T _{mj}P*(*s*(*t* − 1) = *s _{m}*|*r*(1: *t* − 1), *c*(1: *t* − 1))

In the second equality above, the dependence on *c*(*t*) has been removed because the choice on the current trial, in the absence of reward information on the current trial, does not provide any additional information about the current state beyond that provided by the past reward and choice history. Combining **Equations 29–31**, the block state probability on the current trial *t* can be written in terms of the known reward probabilities, known state transition probabilities and the previous block state probability as

*P*(*s*(*t*) = *s _{j}*|*r*(1: *t*), *c*(1: *t*)) ∝ *P*(*r*(*t*)|*c*(*t*), *s*(*t*) = *s _{j}*) Σ_{m} *T _{mj}P*(*s*(*t* − 1) = *s _{m}*|*r*(1: *t* − 1), *c*(1: *t* − 1))

where the proportionality constant is set by normalization over block states *s _{j}*.

The above equation allows us to estimate the current trial block state probability *P*(*s*(*t*) = *s _{j}*|*r*(1: *t*), *c*(1: *t*)) recursively, since it can be expressed in terms of the previous trial block state probability *P*(*s*(*t* − 1) = *s _{m}*|*r*(1: *t* − 1), *c*(1: *t* − 1)) and other known constant terms. This, combined with the known reward and block transition probabilities, allows the model to select the optimal choice according to **Equation 28**.
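The recursive block-state update and the expected-reward computation can be sketched as follows; the state and choice index conventions (0 = left, 1 = right) are hypothetical, and the matrices use the task’s 0.7/0.1 reward probabilities and 5% reversal probability.

```python
import numpy as np

T = np.array([[0.95, 0.05],
              [0.05, 0.95]])        # T[i, j]: P(state j next | state i now)
R = np.array([[0.7, 0.1],
              [0.1, 0.7]])          # R[i, k]: P(reward | state i, choice k)

def update_belief(belief, choice, reward):
    """One step of the recursion: propagate the belief forward through T,
    then reweight by the likelihood of the observed outcome and normalize."""
    prior = T.T @ belief                          # P(s(t) | history to t-1)
    lik = R[:, choice] if reward else 1 - R[:, choice]
    post = lik * prior
    return post / post.sum()

def expected_reward(belief):
    """rho_l(t+1) for each choice l, marginalizing over the next block state."""
    return R.T @ (T.T @ belief)

belief = np.array([0.5, 0.5])                     # uninformative initial belief
for _ in range(3):                                # three rewarded left choices
    belief = update_belief(belief, choice=0, reward=1)
rho = expected_reward(belief)                     # left now looks better
```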

#### Q-Learning model

To simulate trial-by-trial, model-free performance of the tasks, we used a Q-learning model in which the value of the chosen action is updated on each trial as follows:

*Q _{chosen}*(*t* + 1) = *Q _{chosen}*(*t*) + *α*(*r*(*t*) − *Q _{chosen}*(*t*))

where *Q _{right}* is the value for the right choice, *Q _{left}* is the value for the left choice, *t* is the current trial, and *α* is the learning rate, which was set to 0.612 per trial. *r*(*t*) is the outcome of trial *t* (1 for reward, 0 for no reward). Q-values for each choice were initialized to 0. The outcome *r*(*t*) was determined based on the reward probability for choice *c*(*t*) given the block. Choice was simulated using a softmax equation such that the probability of choosing right or left is given by

*P*(*c*(*t*) = right) = *e*^{*β _{value}Q _{right}*(*t*) + *β _{stay}I _{right}*(*t*)} / (*e*^{*β _{value}Q _{right}*(*t*) + *β _{stay}I _{right}*(*t*)} + *e*^{*β _{value}Q _{left}*(*t*) + *β _{stay}I _{left}*(*t*)})

where *β _{value}* is the inverse temperature parameter, which was set to 0.99, *β _{stay}* is a parameter accounting for how likely mice were to repeat their previous choice, which was set to 0.95, and *I _{left/right}* is 1 if that action (i.e., left or right) was chosen on the previous trial and 0 otherwise. Parameters for the Q-learning model were fit in Lee et al. (2019) to the behavior of mice in which dopamine neuron activity was recorded in Parker et al. (2016).

#### Comparison of RPE at block reversals

RPE for both the ideal-observer model and the Q-learning model (**Supplementary Figure 14**) was defined as the difference between the experienced reward *r*(*t*) and the expected reward for the chosen action (*ρ _{chosen}*(*t*) for the ideal-observer model or *Q _{chosen}*(*t*) for the Q-learning model) as follows:

RPE(*t*) = *r*(*t*) − *ρ _{chosen}*(*t*) (ideal observer); RPE(*t*) = *r*(*t*) − *Q _{chosen}*(*t*) (Q-learning)

To identify RPE signatures of model-free versus model-based performance of the two tasks, we compared the RPE from the ideal-observer model and the Q-learning model on trials around block reversals. Specifically, we compared the RPE from the two models on the first trial of a block with the RPE on the second trial of a block, restricted to cases in which the choice on trial 2 differed from the choice on trial 1. This ensured that any change in RPE from trial 1 to trial 2 reflected inference, because the new action-outcome relationship for the choice made on trial 2 had not been explicitly experienced in the new block.