Acetylcholine-dependent phasic dopamine activity signals exploratory locomotion and choices

Dopamine neurons from the Ventral Tegmental Area (VTA) switch from tonic to phasic burst firing in response to reward-predictive cues and actions. Bursting is influenced by nicotinic acetylcholine receptors (nAChRs), which are not implicated in reinforcement learning, but rather in exploration and uncertainty-seeking. The leading model assigns these functions to tonic dopamine firing. To investigate this paradox, we recorded the activity of VTA dopamine neurons during a spatial decision-making task. When reward was certain, mice adopted a stereotyped behavior, and dopamine neurons signaled reward. When confronted with uncertain rewards or a novel environment, mice exhibited exploration. Modulation of phasic, but not tonic, dopamine activity predicted uncertainty-seeking and locomotor exploration. Deletion of nAChRs disrupted the influence of uncertainty and novelty on dopamine firing and behavior, sparing reward signaling and learning. Hence, nAChR modulation of dopamine neurons can influence cognitive functions on a short timescale, through the modulation of phasic, synchronous bursting.


INTRODUCTION
VTA dopamine neurons play a major role in motivating goal-directed behaviors and reinforcing behaviors leading to reward (Glimcher, 2011;Schultz, 2007;Sutton and Barto, 1998). These cells encode a "reward prediction error", the difference between predicted and actual reward, which can be used to update the expected (mean) value of stimuli and actions (Glimcher, 2011;Schultz, 2007).
Dopamine neurons signal a reward prediction error through a switch from tonic (regular single-spike firing) to phasic (synchronous bursting) firing (Faure et al., 2014; Grace et al., 2007; Schultz, 2007). The bursting pattern in dopamine cells depends on the balance between glutamatergic and GABAergic inputs (Lobb et al., 2010; Paladini and Roeper, 2014; Zweifel et al., 2009), but also critically on cholinergic modulation (Dani and Bertrand, 2007; Faure et al., 2014; Grace et al., 2007; Mameli-Engvall et al., 2006). Understanding how mesopontine acetylcholine impacts decision-making through the modulation of firing in dopamine cells is of utmost importance, as dysregulations of these neuromodulatory systems are implicated in major psychiatric and neurological diseases such as schizophrenia, tobacco addiction or Parkinson's disease (Dani and Bertrand, 2007).
There is thus an apparent discrepancy between the proposed behavioral roles of VTA nAChRs (exploration, but not reward), the kind of dopamine firing that they are thought to control (switch from tonic to phasic) and the respective roles assigned to phasic (reward) and tonic (exploration) dopamine. In other words, direct evidence of how nAChRs affect dopamine firing and function during exploration or uncertainty-processing tasks is still lacking. Here we assessed dopamine encoding of uncertainty, novelty and exploration in a self-paced task, with no stimulus signaling the beginning of a trial, in order to give the mice full control over whether to exploit reward or explore the open-field. We found that dopamine neurons encoded expected reward and motivation for both reward-directed decisions and locomotion, i.e., exploitation, in purely predictable contexts. Importantly, we also found that the phasic activity of dopamine neurons summed uncertainty with reward, to signal motivation to explore in uncertain or novel environments. We then disrupted β2*nAChRs to relate alterations in uncertainty processing and exploration with dopaminergic signaling, and show that dopamine neurons in β2-/- mice did not encode uncertainty nor exploration bonuses, concomitantly with the disappearance of uncertainty-seeking and exploratory locomotion.

Synchronous bursting emerges with the learning of a self-paced task
Dopamine cells signal reward by a phasic activation, i.e., a time-locked increase in synchronous bursting activity. We thus assessed how the involvement of β2*nAChRs in synchronous bursting can affect phasic dopamine activity related to reward processing. Phasic activity of dopamine cells is usually conceptualized as a reward prediction error (Glimcher, 2011; Sutton and Barto, 1998): δ(t) = r(t) + γ V(s_{t+1}) - V(s_t), where δ(t), the reward prediction error, is assumed to correspond to phasic dopamine activity, r(t) to the actual reward eventually obtained, V(s_{t+1}) to the prediction of the value of the next choice (temporally discounted by a factor γ) and V(s_t) to the expected value of the current state of the animal. We determined whether the activity of VTA pDAn (126 from 12 WT and 103 from 10 β2-/- mice) is compatible with a reward prediction error, while mice performed a spatial version of the multi-armed bandit task (Naudé et al., 2016) (Methods). In an open-field, mice learned to associate three explicit locations with intracranial self-stimulation (ICSS; Carlezon and Chartoff, 2007) (Fig. 2A).
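As a minimal sketch of this computation (illustrative values only, not the paper's code; the discount factor γ = 0.9 is an arbitrary choice), the temporal-difference prediction error can be written as:

```python
def td_error(reward, v_next, v_current, gamma=0.9):
    """Reward prediction error delta(t) = r(t) + gamma*V(s_{t+1}) - V(s_t).

    gamma = 0.9 is an arbitrary illustration value, not a fitted parameter.
    """
    return reward + gamma * v_next - v_current

# After learning, a fully predicted reward yields delta near 0;
# an unexpected omission (r = 0 while V(s_t) > 0) yields delta < 0,
# matching the firing dip at the time of the expected reward.
```

This makes explicit why omission trials produce a negative error: the expected-value term is subtracted even though no reward arrives.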
After learning, pDAn emitted bursts of action potentials early in the trials (Fig. 2b), henceforth called "early bursting". Early bursting emerged in both WT and β2-/- animals, suggesting that the phasic increase in dopamine activity at action onset (Jin and Costa, 2010; Wassum et al., 2012) is β2*nAChR-independent. Early bursting was characterized by a series of features, in both WT and β2-/- mice, compatible with the encoding of a reward prediction error occurring during the self-initiation of an action that constitutes the first predictor of the next reward (Sutton and Barto, 1998; Wassum et al., 2012). First, early bursting was not triggered by the electrical stimulation, since it occurred when ICSS reward was unexpectedly omitted (Fig. 2c), in the first trials of the probabilistic setting used later on (see below). Hence, early bursting appeared to predict the next reward, rather than signaling the previous one. Second, during reward omissions (Fig. 2c), pDAn activity decreased at the time of the expected reward (t(18)=-3.07, p=0.007). This result is also consistent with dopamine cells signaling a reward prediction error (Glimcher, 2011; Schultz, 2007; Sutton and Barto, 1998), as the actual reward is lower than expected: in the first omission trials, animals expect a reward but do not obtain it (r(t) = 0, so that r(t) - V(s_t) < 0). Finally, the percentage of pDAn displaying bursting activity significantly larger than baseline firing at any time between two ICSS increased over learning sessions in WT mice (χ2=7.81, p=0.01, Fig. 2d). The increase in phasic activity was also apparent in pDAn from β2-/- animals (proportion of pDAn displaying phasic activity not different from WT: χ2=0.05, p=0.82). This gradual increase in the total number of VTA dopamine cells displaying phasic bursting activity is consistent with the emergence of a positive reward prediction error during learning.

Reorganization of firing with learning
We next analyzed how this change in dopamine firing at the single-trial scale impacted the average activity of dopamine neurons. Completion of learning induces the emission of bursts by dopamine cells at each trial, but also corresponds to an increase in the number of trials; the combination of these two features thus suggests an overall increase in pDAn firing frequency.

Dopamine neurons signal exploitation of rewards in predictable environments
Reinforcement models also state that after learning, dopaminergic reward prediction does not play any further role in ongoing behavior (Glimcher, 2011; Sutton and Barto, 1998). Recent studies have shown that phasic dopamine can also encode kinetic variables, but this mainly concerns dopamine neurons from the Substantia Nigra pars compacta (SNc) (Barter et al., 2015; Howe and Dombeck, 2016). We thus asked whether pDAn from the VTA encode motor variables. In our setup, early bursting occurred during the animal's dwelling time, before movement onset towards the next location (Fig. 3a). When assessing the correlations between early bursting and kinetic variables, we did not find any relation with immediate behavior (e.g., instantaneous speed or acceleration just following early bursting, Supplementary Fig. 4), as reported in studies of the role of the SNc in motor control (Barter et al., 2015; Howe and Dombeck, 2016). However, we observed that a higher frequency during early bursting corresponded to a shorter trial duration, i.e., the time needed by the animal to reach the next rewarded location (Fig. 3b). Sorting the trials by increasing "early bursting" frequency (i.e., frequency during the first second of the trial, see Methods) revealed that it correlated with the time-to-peak speed (Fig. 3c) and, more precisely, with the ratio of maximum speed over the time-to-maximum speed (Fig. 3d; WT: 62.5% modulated cells, median R2: 0.54; β2-/-: 76.0% modulated cells, median R2: 0.58). This ratio relates to the directness of the animal towards the goal: a high peak speed, a strong acceleration, a direct path or a short dwell-time all increase this measure.
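The directness measure can be sketched as follows (the frame interval dt is an assumed tracking parameter, not taken from the paper):

```python
import numpy as np

def directness(speed, dt=0.04):
    """Peak speed divided by time-to-peak speed, the 'directness'
    measure described in the text.

    speed: per-frame speed profile of one trial.
    dt: assumed frame interval (25 Hz tracking is a guess, not from the paper).
    Sample times are taken as (index + 1) * dt so the first frame has
    a nonzero timestamp.
    """
    i = int(np.argmax(speed))
    t_peak = (i + 1) * dt  # time at which the speed maximum occurs
    return speed[i] / t_peak
```

A high peak reached quickly (direct run to the goal) yields a large value; a late or low peak (pauses, detours) yields a small one.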
Since ICSS reward was equivalent at all locations (and thus, the reward prediction would be the same for all trials), our findings suggest that dopamine phasic activity encodes a locomotor signal related to the whole action sequence (from start to goal) (Wassum et al., 2012), on top of a reward prediction error.
Moreover, the proportion of cells showing directness-related firing was similar in WT and β2-/- mice (p=0.21, KS-test), indicating that VTA dopamine signaling of locomotor activity is β2*nAChR-independent in the context of exploitation.

Encoding of uncertainty by phasic bursting in dopamine neurons
We next investigated pDAn signaling in decision-making under uncertainty, using a probabilistic setup in which the three rewarded locations were respectively associated with different probabilities of receiving ICSS (Fig. 4A; p=100%, p=50% and p=25%). The three binary choices (here, 25% vs. 50%, 25% vs. 100% and 50% vs. 100% reward probabilities) allow characterizing the influence of two co-varying parameters (reward mean, which corresponds to p, and uncertainty, defined as the variance p(1-p)) on decisions. Overall, mice visited the locations associated with higher ICSS probability more often (Fig. 4b left), indicating a choice behavior.
However, when starting from the 25%-reward probability location, they chose the 50% reward as much as the 100% reward (U=35, p=0.49, Wilcoxon test, Fig. 4b middle), suggesting that mice are irrationally attracted by the 50% reward. We investigated this preference by fitting the binary choices with a computational model that takes into account the relative influences of reward mean and reward uncertainty on decisions (see Methods). This model-based analysis indicates that WT mice chose as if they assigned a positive value to the reward uncertainty (a "bonus" value added to expected reward) (Kakade and Dayan, 2002; Naudé et al., 2016) of the goal (Fig. 4b right), which is maximal for 50% reward probability (Fig. 4a). In other words, mice's choices appear suboptimal because they explore actions with uncertain consequences.
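A minimal numerical illustration of the bonus (using the variance definition of uncertainty given for the task; the bonus weight phi = 1.0 is an arbitrary illustration value, not the fitted one):

```python
def uncertainty(p):
    """Bernoulli reward variance p*(1-p), maximal at p = 0.5."""
    return p * (1.0 - p)

def option_value(p, phi=1.0):
    """Subjective value = expected reward + phi * uncertainty.

    phi = 1.0 is arbitrary; the paper fits this weight per mouse.
    """
    return p + phi * uncertainty(p)

# The 50% option carries the largest variance, so a positive phi
# inflates its value relative to the 25% and 100% options even
# though its expected reward is intermediate.
```

This is why a sufficiently large bonus weight can make the 50% and 100% options equally attractive despite identical stimulation on rewarded trials.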
We next recorded pDAn firing during this task (Fig. 4c) to assess whether the bonus value of uncertainty was encoded by pDAn. pDAn firing activity was sorted according to the reward probability of the goal during direct trials (duration shorter than 5 s; average median trial duration = 4.84±0.39 s, mean±s.e.m. in WT mice) in WT and β2-/- mice. Averaging trials with different durations by linearizing the time between two consecutive locations (Fig. 4D; Supplementary Fig. 5 displays all neurons) revealed that (1) early bursting, but also late, "ramping" activity, seemed modulated by reward probabilities (boxes in Fig. 4D) and (2) these dynamics seemed altered by the genetic deletion of β2*nAChRs (Fig. 4D left versus right). We thus centered the analysis of early bursting on the first 500 ms of trial onset (Methods). Early bursting was sensitive to the expected reward probability at the goal (F(2,187)=5.53, p=0.005, Fig. 5A), consistent with a reward prediction (Glimcher, 2011; Schultz, 2007). In addition, encoding of uncertain rewards (50%) relative to certain rewards (100%) by pDAn early bursting activity predicted the bonus (i.e., the positive value) mice assigned to uncertainty (R2=0.47, p=0.03, Fig. 5B). This suggests that the phasic activity of VTA dopamine neurons integrates an uncertainty bonus together with expected value (Kakade and Dayan, 2002; Lak et al., 2014).
Moreover, we found a significant correlation between the residuals of pDAn activity and of the next choices of the animal (R2=0.36, p<0.001, Fig. 5): if a mouse chose a rewarded location more than the group average, pDAn from this mouse displayed an activity for this location's reward probability higher than the population average. In other words, dopamine activity encodes the subjective value of the next choice rather than the objective reward attributes (e.g., reward probability). In addition, early bursting activity correlated with the peak speed/time-to-peak speed ratio, as in the certain setting, but to a lesser extent (Supplementary Fig. 6), as we discuss below when analyzing exploratory locomotion.

Uncertainty-seeking and encoding of uncertainty by dopamine neurons both depend on nicotinic acetylcholine receptors
Given the implication of β2*nAChRs in uncertainty-seeking and exploration (Naudé et al., 2016), we next assessed their roles in the probabilistic task. Overall, β2-/- mice chose the 50% location less than their WT counterparts (F(1,2)=4.19, p=0.02, repartition*genotype interaction; t(15)=2.31, p=0.035, Fig. 4b left). This was particularly clear in the binary choice between 50% and 100% reward, in which β2-/- animals chose the 50% alternative less than WT animals (t(15)=2.56, p=0.02, Fig. 4b middle). This alteration in decision-making can be modeled as an absence of valuation of uncertainty (uncertainty bonus in β2-/- versus WT mice: U(10,7)=114, p=0.02; not different from 0 in β2-/- mice, W(7) test), whereas the relation between pDAn activity and ongoing choices was not altered in β2-/- mice. Hence, uncertainty-seeking and encoding of uncertainty by VTA pDAn activity are both affected by β2*nAChR deletion, but not the relation between DA activity and choices.

Tonic dopamine activity encodes uncertainty but does not signal uncertainty-seeking
When no action is required from the animals, dopamine cells have been shown to encode reward uncertainty through a ramping increase in activity (Fiorillo et al., 2003), but this remains unclear in operant tasks, where uncertainty motivates the animal to explore uncertain options (Funamizu et al., 2012; Heilbronner, 2013; Naudé et al., 2016; Oudeyer, 2007). We thus sorted pDAn activity upon arrival at a location (i.e., the last second before trial end, Methods) by the reward probability associated with this location (Fig. 5D). This analysis indicated that pDAn ramping activity scaled with reward uncertainty.
In β2-/- mice, ramping was not modulated by reward probability (F(2,137)=1.62, p=0.20) and differed from WT mice (F(1,2)=4.42, p=0.01, probability*genotype interaction), indicating an implication of β2*nAChRs in the uncertainty-modulation of both phasic and tonic dopamine activity (Fig. 5D right). Yet, the modulation of tonic ramping was not correlated with uncertainty-seeking behavior (Supplementary Fig. 8). This result is consistent with ramping activity appearing long after the choice, and thus not being implicated in the decision. Moreover, we observed that while early bursting activity could be considered as phasic (i.e., both bursting and synchronous), late ramping activity was mostly composed of tonic spiking (Supplementary Fig. 9). This argues against the idea that ramping activity constitutes a backpropagation of the reward prediction error (Fiorillo et al., 2005; Niv et al., 2005). Our results thus suggest that phasic, rather than tonic, DA activity underlies uncertainty-seeking.

Dopamine neurons signal exploratory locomotion in uncertain or novel environments, under nicotinic regulation
Our results suggest a role of β2*nAChRs in exploration, in the framework of decision-making between clearly defined alternatives (here, the rewarded locations). Exploration also refers to locomotion in the rest of the open-field, when animals wander in-between rewarded locations in the probabilistic task, or when animals travel in a novel open-field.
We thus assessed whether pDAn activity (phasic and/or tonic) encodes motor variables of locomotor exploration in long trials (duration longer than 5 s, Fig. 6A), during which mice display pauses and tortuous trajectories in-between rewarded locations. At the beginning of these long trials, pDAn displayed bursting activity (Fig. 6B) that did not scale with reward probability (Fig. 6C; WT: F(2,173)=0.22, p=0.81; β2-/-: F(2,117)=0.42, p=0.66). This early bursting positively correlated with the travelled distance in WT (32.9% modulated pDAn, median R2=0.16) but not in β2-/- mice (4.0% modulated pDAn, median R2=0.03, Fig. 6D). Hence, pDAn could signal distance (Fig. 6) in long trials and speed in short trials (Fig. 3). Since locomotion in long, exploratory trials in the probabilistic setting consists of a sequence of movements interspersed with pauses, we refined the analysis of pDAn activity by distinguishing, in long trials, exploration (at low speed) from navigation (at higher speed). Periods of low speed in the open-field correspond to rearing, scanning, sniffing and grooming, all of which relate to collecting information, except grooming, which accounts for only 10% of low-speed epochs in the open-field. During long trials, the difference between bursting activity during slow and fast epochs correlated with the total travelled distance in WT (60.3% modulated cells, example in Fig. 6E left, median R2=0.40, distribution in Fig. 6E right). This correlation with travelled distance was abolished in β2-/- mice (10.0% modulated cells, median R2=0.04, Fig. 6E).
Moreover, bursting activity was higher during epochs of slow exploration than during epochs of fast navigation.

Nicotinic acetylcholine receptors control dopamine firing on different timescales
nAChRs are known to affect the transition of dopamine firing from regular spiking to bursting (Dautan et al., 2016; Grace et al., 2007). Here we delved further into how this cholinergic modulation relates to behavior. Bursting is observed during phasic dopamine activity (Fiorillo et al., 2003; Jin and Costa, 2010; Lak et al., 2014; Redgrave and Gurney, 2006; Schultz, 2007), usually described as a reward prediction error (Glimcher, 2011; Schultz, 2007; Sutton and Barto, 1998), while tonic fluctuations have been related to most of the other functions ascribed to dopamine, e.g., exploration (Frank et al., 2009; Humphries et al., 2012), locomotion (Niv, 2007), and encoding of reward uncertainty (Fiorillo et al., 2003). However, the mapping between behaviors and dopamine dynamics remains unclear (Barter et al., 2015; Berridge, 2012; Howe and Dombeck, 2016; Lak et al., 2014), and it is thus crucial to distinguish the firing patterns (regular or bursting) from the timescale (average activity or event-related) considered. When considering average activity in anesthetized animals or at rest, bursting occurs spontaneously in dopamine cells (Faure et al., 2014; Mameli-Engvall et al., 2006; Schiemann et al., 2012; Schultz, 2007), but this bursting propensity is distinct from phasic, stimulus- or action-locked activity. Conversely, averaging stimulus-related activity from repeated trials might blur the analysis of firing patterns (Fiorillo et al., 2005; Niv et al., 2005).
Hence, we based our analysis on a definition of phasic activity as large, time-locked increases in synchronous bursting, which may thus impact immediate decision-making.

What roles for the cholinergic control of dopamine in action, choices and exploration?
While there is a general agreement that dopamine is necessary for actions aimed at obtaining potential rewards (Berridge, 2012; McClure et al., 2003; Schultz, 2007), the contributions of dopamine to decision-making remain under investigation (Barter et al., 2015; Berridge, 2012; Howe and Dombeck, 2016; Nicola, 2010; Salamone and Correa, 2012; Schultz, 2007). In the reinforcement-learning framework, phasic dopamine signals a reward prediction error, which indirectly affects future choices by updating the cached values of options. However, there is an ever-growing debate about the direct role of dopamine at the time of choice (Glimcher, 2011; Schultz, 2007). This motivational and/or motor signal has been linked to phasic dopamine (Barter et al., 2015; Berridge, 2012; Howe and Dombeck, 2016) or to slower variations (Niv, 2007; Schultz, 2007). Even though we did not use "online" causal tools such as optogenetics (which can lead to erroneous inference about neuron-behavior relations (Jazayeri and Afraz, 2017)), direct evidence relating tonic versus phasic dopamine activity to exploration was lacking (Humphries et al., 2012; Naudé et al., 2016). We found that phasic activity correlated with uncertainty-seeking (discussed in-depth below) but that the modulation of tonic activity by uncertainty did not predict choices, and occurred at the end of the trial, presumably long after the decision process. Hence, it is doubtful that tonic encoding of uncertainty is directly implicated in choices under uncertainty; it may instead affect attentional processes (Fiorillo et al., 2003). We also observed that dopamine phasic activity at the beginning of trials could signal the distance travelled during long, exploratory trials. On a longer timescale, the extent of locomotor exploration was related to the modulation of bursting in dopamine cells when the animal was at low speed, rather than to average bursting propensity. Hence, we overall show that locomotor exploration and uncertainty-seeking are both signaled by the phasic activity of dopamine cells.
Our data add to the growing catalogue of functions (motor control (Barter et al., 2015; Howe and Dombeck, 2016), here exploration and uncertainty) long thought to be related to tonic dopamine (Niv, 2007; Schultz, 2007) that have proven to be signaled instead by fast, precisely timed transitions towards phasic dopamine activity. In this framework, the disruption in the encoding of exploration by dopamine neurons from β2-/- mice shows that cholinergic signaling can affect precisely-timed transitions of dopamine cells towards bursting, thereby affecting decision-making on a short timescale.

Cholinergic control of dopamine implements an uncertainty bonus
We also define how these fast effects of nicotinic receptors modulate dopamine encoding of decision variables. Together with previous studies, we show that phasic dopamine firing encodes not only expected reward but also other reward attributes, such as risk (reward uncertainty; Fiorillo, 2011; Fiorillo et al., 2003; Lak et al., 2014; Stauffer et al., 2014) or advance information about upcoming rewards (Bromberg-Martin and Hikosaka, 2009), and non-reward attributes, such as novelty or sensory surprise (Redgrave and Gurney, 2006). Rather than a "reward-only" prediction error, which would encode the objective properties of a reward (e.g., probability, magnitude), dopamine neurons would instead signal the subjective value of a choice (Lak et al., 2014; Stauffer et al., 2014). The "subjective utility" theory remains however agnostic about the determinants of one's preferences: it relates preferences for, e.g., a type of food, or for a risky rather than a safe option, to dopamine firing, but does not explain what renders dopamine neurons activated by risk or uncertainty. Here we suggest how uncertainty can affect the subjective value of a choice, in terms of both normative explanation and neural mechanisms.
We sought to explain the irrational preferences displayed by the animals in our setup: they display equal preference for the 50% and 100% options, even though the number and intensity of ICSS are the same for both outcomes. Using computational modeling (Daw et al., 2006; Funamizu et al., 2012; Naudé et al., 2016), we were able to extract the subjective value of uncertainty from animals' choices, and to relate it to increases in phasic dopaminergic activity, on top of the encoding of expected reward. In the framework of phasic dopamine signaling a reward (or utility) prediction error, this means that the value of uncertainty is added to that of the expected reward, in the form of a "bonus" value (Kakade and Dayan, 2002). Formalizing uncertainty-modulation of dopaminergic activity as a bonus explains suboptimal behavior as a motivation to obtain information about the current status of an unpredictable outcome (Anselme et al., 2013; Dayan, 2012; Vasconcelos et al., 2015), even if such information cannot be used to maximize reward in the future. Our task parameters, i.e., small stakes (there is no actual risk to lose, but just a risk not to win) and short delays between series of repeated gambles, are known to induce uncertainty-seeking (Heilbronner, 2013). In our set-up, the relative weight of uncertainty compared to expected reward was even close to one, meaning that mice priced (useless) information as much as rewards, reminiscent of playful or gambling behaviors (Oudeyer, 2007).
We also propose a neural mechanism for uncertainty-seeking. As VTA dopamine neurons from β2-/- mice encode reward expectation and directness but not uncertainty or exploration, these different types of motivation arise from distinct dynamical processes at the level of the VTA.

Animals
Experiments were performed on 14 wild-type (WT; C57BL/6J) and 11 knockout SOPF HO ACNΒ2 (β2-/-) male mice obtained from Charles River Laboratories France. β2-/- mice were generated as described previously (Picciotto et al., 1995). Even though WT and β2-/- mice are not littermates, the mutant line was generated more than 20 years ago, has been backcrossed for more than 20 generations with the wild-type C57BL/6J line, and is more than 99.99% C57BL/6J. All experiments were performed on mice between 2 and 4 months of age. Mice were housed in cages containing a maximum of 4 animals, in a 12 h light/dark cycle and temperature-controlled room (22 ± 1 °C) with food and water available ad libitum. Sample sizes in this study are similar to those generally employed in the field and were not pre-determined by a sample size calculation. Four wild-type and four β2-/- mice were excluded from the analysis due to improper electrode implantation (no dopamine neurons were found in these mice). All experiments were undertaken in compliance with French laws on animal experimentation and the directives of the European Community 219/1990 and 220/1990. The protocol was approved by the Committee on the Ethics of Animal Experiments of the University Pierre et Marie Curie (Permit Number: 01438.01).

Drive and electrodes
Hand-made poly-electrodes (bundles of 8 electrodes, "octrodes") were obtained by twisting eight polyimide-insulated 17 µm nickel-chrome wires (A-M Systems, USA). The use of eight channels relatively close together allows for a better discrimination of the different neurons. Before implantation and recording, the octrodes were cut at a suitable length and gold-plated. Putative DA neurons were confirmed pharmacologically: D2-receptor agonist administration induced at least a 30% decrease in their firing rate, while the D2 antagonist eticlopride restored firing above the baseline. Nevertheless, as continuous D2 pharmacology could have affected both baseline DA neuron firing and decision-making (Fobbs and Mizumori, 2014), we allowed the mice to recover for two days after this experiment. We thus performed pharmacological confirmation (1) when first encountering a putative DA neuron in a given mouse or (2) at the end of the week if at least one putative neuron was present during the behavioral experiment. Neurons were considered DA only if they responded to the pharmacology, or if they presented the electrophysiological characteristics defined above and were recorded between two positive pharmacological experiments.

Immunohistochemistry
Immunolabeling was used to assess the TH-positive phenotype of the neurons surrounding the electrode trace and was performed as follows. All neurons from mice in which the electrode trace was detected outside the VTA were excluded from the analysis. Following the death of the mice, brains were rapidly removed and fixed in 4% paraformaldehyde. After at least 3 days of fixation at 4 °C, serial 60-μm sections were cut from the midbrain with a vibratome. Free-floating VTA brain sections were incubated 1 h at 4 °C in a blocking solution of phosphate-buffered saline (PBS) containing 3% bovine serum albumin (BSA; Sigma, A4503) and 0.2% Triton X-100, and then incubated overnight at 4 °C with a mouse anti-tyrosine hydroxylase (TH) antibody (Sigma, T1299) at 1:200 dilution in PBS containing 1.5% BSA and 0.2% Triton X-100. The following day, sections were rinsed with PBS and then incubated 3 h at 22-25 °C with Cy3-conjugated anti-mouse secondary antibodies (Jackson ImmunoResearch, 715-165-150) at 1:200 dilution in a solution of 1.5% BSA in PBS. After three rinses in PBS, slices were wet-mounted using ProLong Gold. In some animals the electrode trace could not be unambiguously recovered, but dopamine neurons were still recorded in these animals. From 14 WT and 10 β2-/- mice at the beginning of the study (i.e., training), 10 WT and 7 β2-/- mice had dopamine neurons in the probabilistic sessions.

Analysis of decision-making
Animal trajectories were smoothed using a triangular filter. Time-to-goal measures the duration of one trial between one ICSS location and the next one. The speed profile corresponds to the instantaneous speed as a function of time.
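A sketch of this preprocessing (the triangular kernel width is an assumed parameter; the paper does not report one):

```python
import numpy as np

def triangular_smooth(x, half_width=5):
    """Smooth a 1-D position trace with a normalized triangular
    (Bartlett) kernel of 2*half_width + 1 samples.

    half_width = 5 is an assumed illustration value.
    """
    kernel = np.bartlett(2 * half_width + 1)
    kernel /= kernel.sum()
    return np.convolve(x, kernel, mode="same")

def instantaneous_speed(x, y, dt):
    """Speed profile from smoothed x/y coordinates sampled every dt seconds."""
    return np.hypot(np.diff(x), np.diff(y)) / dt
```

Smoothing before differentiation avoids amplifying tracking jitter in the speed profile.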
Analysis of decision-making was based on a previous study. In short, we expressed behavioral data as a series of choices between rewarding locations (labeled A, B, C).
We only considered choices made within an interval of 10 s after visiting the previous rewarding location. This experiment implements a Markovian Decision Process (MDP; Sutton and Barto, 1998) consisting of three states (A, B, C), corresponding to each rewarding location, and a transition function, corresponding to the proportions of choices in the three gambles. The repartition is defined as the proportion of states visited by the animal during a session. The transition matrix describes the proportion of transitions from one state to another. Because animals receive stimulations only when they alternate between rewarding locations, there is no repetition of states in the sequence, and the 3 × 3 transition matrix has null diagonal elements.
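The repartition and transition matrix described above can be computed as follows (a sketch; coding the locations A, B, C as states 0-2 is an assumption of this example):

```python
import numpy as np

def transition_matrix(states, n_states=3):
    """Row-normalized counts of transitions in a sequence of visited
    locations (coded 0..n_states-1). Forced alternation between
    rewarded locations keeps the diagonal at zero."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave rows of unvisited states as zeros
    return counts / row_sums
```

Each row then gives the empirical choice proportions for the binary gamble available from that state.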
We modeled mice decision strategies with a previously published computational model, an uncertainty-sensitive softmax rule, henceforth called the "uncertainty model". Because mice could not return to the same rewarding location, they had to choose between the two remaining states (rewarded locations). The uncertainty model determined the probability Pi of choosing the next state i as a function of a decision variable Qi, through the softmax choice rule:

Pi = exp(β Qi) / Σj exp(β Qj)    (1)

where β is an inverse temperature parameter ("decision noise") reflecting the sensitivity of choices to the difference between decision variables. In our model embedding an exploration bonus, the decision variable (the value of an option) depends on both expected reward and uncertainty (Daw et al., 2006; Frank et al., 2009; Funamizu et al., 2012):
Qi = Ri + φ pi(1 - pi)    (2)

where Ri is the expected reward of option i, pi(1 - pi) its reward uncertainty (the variance of a Bernoulli outcome with probability pi), and φ the weight of the uncertainty bonus. This compound value is then nested in the softmax choice rule. We fitted the free parameters (β, φ) by maximizing the likelihood of the observed choices c at all trials t for each mouse separately, using the population fit (fit of the average choice probabilities) as initial conditions.
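For a binary choice, the two equations combine into a single expression (a sketch of the uncertainty-sensitive softmax; parameter values below are illustrative, not the fitted ones):

```python
import math

def choice_probability(p_i, p_j, beta, phi):
    """Probability of choosing option i over option j under the
    uncertainty model: Q = expected reward + phi * variance,
    passed through a softmax with inverse temperature beta."""
    q_i = p_i + phi * p_i * (1.0 - p_i)
    q_j = p_j + phi * p_j * (1.0 - p_j)
    # Two-option softmax reduces to a logistic of the value difference.
    return 1.0 / (1.0 + math.exp(-beta * (q_i - q_j)))

# With phi = 0 the model reduces to reward maximization; a large
# enough bonus makes the 50% option as attractive as the 100% one.
```

In practice (β, φ) would be fitted by maximizing the log-likelihood of the observed choice sequence, e.g. with a generic numerical optimizer.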

Electrophysiology analysis
Both spontaneous activity and behavior-related neuronal activity were analyzed offline with custom-written Matlab code. Spontaneous DA cell firing was analyzed with respect to the average firing rate and the percentage of spikes within bursts (%SWB: number of spikes within bursts, divided by the total number of spikes). Bursts were identified as discrete events consisting of a sequence of spikes such that their onset is defined by two consecutive spikes within an interval < 80 ms, and they terminate with an interval > 160 ms. Phasic activity is defined as spikes falling into bursts, while tonic activity comprises spikes outside bursts. Synchrony between pairs of neurons was computed according to Kreuz et al. (2013).
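The burst criteria above can be sketched as follows (an illustrative Python reimplementation, not the authors' Matlab code; spike times are assumed to be in seconds):

```python
def burst_flags(spike_times, onset_isi=0.080, offset_isi=0.160):
    """Mark spikes within bursts: a burst starts with two consecutive
    spikes < 80 ms apart and terminates at an inter-spike interval
    > 160 ms (criteria from the text)."""
    flags = [False] * len(spike_times)
    i = 0
    while i < len(spike_times) - 1:
        if spike_times[i + 1] - spike_times[i] < onset_isi:
            j = i + 1
            # Extend the burst while intervals stay <= 160 ms.
            while (j < len(spike_times) - 1
                   and spike_times[j + 1] - spike_times[j] <= offset_isi):
                j += 1
            for k in range(i, j + 1):
                flags[k] = True
            i = j + 1
        else:
            i += 1
    return flags

def percent_swb(spike_times):
    """%SWB: percentage of spikes falling within bursts."""
    flags = burst_flags(spike_times)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

Spikes flagged True constitute the phasic component; the remainder are tonic spikes.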