Abstract
Rapid learning confers significant advantages to animals in ecological environments. Despite the need for speed, animals appear to learn to associate rewarded actions with predictive cues only slowly1–4. This slow learning is thought to be supported by a gradual expansion of predictive cue representation in the sensory cortex2,5. However, evidence is growing that animals learn more rapidly than classical performance measures suggest6–8, challenging the prevailing model of sensory cortical plasticity. Here, we investigated the relationship between learning and sensory cortical representations. We trained mice on an auditory go/no-go task that dissociated the rapid acquisition of task contingencies (learning) from its slower expression (performance)7. Optogenetic silencing demonstrated that the auditory cortex (AC) drives both rapid learning and slower performance gains but becomes dispensable at expert levels. Rather than enhancement or expansion of cue representations9, two-photon calcium imaging of AC excitatory neurons throughout learning revealed two higher-order signals that causally contributed to learning and performance. First, a reward prediction (RP) signal emerged rapidly within tens of trials, was present after action-related errors only early in training, and faded at expert levels. Strikingly, silencing at the time of the RP signal impaired rapid learning, suggesting it serves an associative and teaching role. Second, a distinct cell ensemble encoded and controlled licking suppression that drove the slower performance improvements. These two ensembles were spatially clustered but uncoupled from underlying sensory representations, indicating a higher-order functional segregation within the AC. Our results reveal that the sensory cortex manifests higher-order computations that separably drive rapid learning and slower performance improvements, reshaping our understanding of the fundamental role of the sensory cortex.
Despite the value of rapid learning in ecological environments, most laboratory models of rodent learning show that linking sensory cues with reinforced actions is a slow, gradual process1–4,10. An alternative view suggests that animals, including humans, rapidly infer the relationships between cues, actions, and reinforcement (i.e. learning)6 even if they continue to make ongoing performance errors7,8,11. Recent behavioral studies in rodents have begun to reconcile these views, arguing that latent task knowledge (i.e. discriminative contingencies) can emerge rapidly even though behavioral performance appears to improve only gradually7. How are these two dissociable behavioral processes—rapid acquisition of contingencies versus slower performance improvements—implemented in the brain?
An attractive brain region to consider is the sensory cortex, which is thought to subserve instrumental learning by enhancing or attenuating the representation of the sensory cues that drive behavior. Plasticity of cue-related responses in the sensory cortex is thought to underlie learning because it mirrors the slow, gradual improvement in behavioral performance1,2,5,10. This raises a fundamental challenge: if animals learn discriminative contingencies rapidly but cue representations in the sensory cortex change slowly1,2,9, the causal model linking cue-related plasticity to learning becomes problematic. One possible solution is that the sensory cortex plays a role beyond cue-related representational plasticity and directly represents higher-order signals that associate reinforced actions with predictive cues. Here we focused on the auditory cortex (AC) and asked whether and how it plays a higher-order role in cue-guided learning.
We trained head-fixed, water-restricted mice to lick to a target tone (S+) for water reward and to withhold licking to a foil tone (S−) to avoid a timeout (auditory go/no-go task, Fig. 1a). We used simple pure tones to prevent the AC from being recruited for complex sensory processing. To confirm this, two-photon imaging of AC excitatory neurons showed that stimulus identity could be accurately decoded from AC activity from the first training day, with no subsequent improvement throughout training (Supplementary Figure 1), suggesting that the AC was indeed not needed for perceptual sharpening in the task and thereby allowing us to identify possible associative functions. Performance was evaluated in each session in reinforced and non-reinforced (‘probe’) trials (Fig. 1b). Performance in probe trials revealed a rapid acquisition of task contingency knowledge that was only expressed much later in reinforced trials (Fig. 1c)7. Reinforcement feedback, although critical for learning, paradoxically masked the underlying task knowledge. By combining this behavioral procedure with optogenetics and longitudinal two-photon imaging, we aimed to determine how quickly animals learn stimulus-action contingencies and to define the fundamental role of the auditory cortex in sound-guided learning.
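The decoding analysis referenced above (Supplementary Figure 1) is not detailed in this section. As an illustration of the approach, the sketch below applies a simple cross-validated nearest-centroid decoder to a trial × neuron response matrix; the data shapes, function name, and decoder choice are assumptions, not the authors' pipeline.

```python
import numpy as np

def decode_stimulus(X, y, n_folds=5, seed=0):
    """Cross-validated nearest-centroid decoding of stimulus identity.

    X: (n_trials, n_neurons) tone-evoked response matrix (hypothetical shape).
    y: (n_trials,) binary stimulus labels (0 = S+, 1 = S-).
    Returns the fraction of held-out trials decoded correctly."""
    X, y = np.asarray(X, float), np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    correct = 0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        c0 = X[train][y[train] == 0].mean(axis=0)  # class centroids from training trials
        c1 = X[train][y[train] == 1].mean(axis=0)
        pred = (np.linalg.norm(X[fold] - c1, axis=1)
                < np.linalg.norm(X[fold] - c0, axis=1)).astype(int)
        correct += int((pred == y[fold]).sum())
    return correct / len(y)
```

Chance level is 0.5 for two stimuli; decoding near ceiling from the first training day, as reported, would correspond to held-out accuracies close to 1 at every training stage.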
a, Head-fixed mice were trained on an auditory go/no-go task with two pure tones spaced 3/4 octave apart. H: hit, M: miss, FA: false alarm, CR: correct reject. b, Every day during training, task knowledge is probed by omitting reinforcement for 20 trials. c, Two distinct learning trajectories are revealed: a fast acquisition of task contingencies (measured in probe trials; green) and a slower knowledge expression (measured in reinforced trials; black). d, Probabilistic optogenetic silencing of the auditory cortex over learning. e, Testing conditions. f, Accuracy in reinforced light-on trials (two-way ANOVA, p < 10−8). g, Action rate in reinforced light-on trials (HIT, p = 0.07; FA, p < 10−33). See also Supplementary Figure 4. h, Accuracy in probe light-off trials (two-way ANOVA, p < 10−4). i, Tone response index in S+ trials (see Methods; two-way ANOVA, p < 10−101). Black and gray lines are individual mice and dots indicate change points (see Methods). j, Maximal difference between hit and FA rates in probe light-off trials over the first 6 days (t-test, p < 10−3). k, Hit lick latency in probe light-off trials (median ± s.e.median; Wilcoxon test, p = 0.007). l, Accuracy in reinforced light-off trials (two-way ANOVA, p < 10−8). m, Action rate in reinforced light-off trials (two-way ANOVA, HIT: p = 0.57, FA: p < 10−8). n, Accuracy in reinforced light-off trials with inter-subject alignment to the day where probe accuracy ≥ 0.65 (green triangle) (two-way ANOVA, p < 10−5). See also Supplementary Figure 3a-c. o, Comparison of light-off versus light-on trials to measure the effect of auditory cortex silencing on online performance. p, Session density plot of accuracy in reinforced light-on against light-off trials. Top, control; bottom, PV-ChR2. See also Supplementary Figure 3d-g. q, Within-subject accuracy difference between reinforced light-on and light-off trials, aligned to the day where the FA rate < 0.3 in reinforced light-off trials (two-way ANOVA, p < 10−15).
r, Within-subject accuracy difference between reinforced light-on and light-off trials when silencing started at expert level (n = 4; t-test, p = 0.58). See also Supplementary Figure 6. mean ± s.e.m.; *p < 0.05; **p < 0.01; ***p < 0.001, n.s.: not significant.
The auditory cortex is the default pathway for sound-guided learning
Lesion studies have suggested that the AC may not be essential to learn or execute cue-guided tasks with simple sensory stimuli12–15. However, permanent lesions cannot determine whether the AC is normally used for, or causally produces16, learning in an intact brain. To address this, we exploited a transient silencing approach to prevent the recruitment of alternative pathways15,17–20 while also using a probabilistic design to allow assessment of learning as distinct from performance by measuring behavior on non-silenced trials, thereby avoiding direct effects of silencing on performance.
We examined the impact of bilateral silencing of the AC throughout learning. We probabilistically silenced the AC on 90% of reinforced trials throughout learning (‘light-on reinforced’, Fig. 1d), leaving 10% of reinforced trials (‘light-off reinforced’) and 100% of probe trials (‘light-off probe’) with intact AC activity. Silenced trials were pseudo-randomly sequenced and equally split between S+ and S−. Silencing was achieved by shining blue light bilaterally through cranial windows implanted above the AC of double transgenic mice (n = 8) expressing channelrhodopsin-2 (ChR2) in parvalbumin (PV) interneurons14,21 (Fig. 1d). We confirmed that the excitatory network was effectively silenced using this approach by combining two-photon calcium imaging of excitatory neurons and full-field optogenetic stimulation in PV-ChR2 mice (Supplementary Figure 2). Control mice (n = 8) received the same light stimulation but did not express ChR2. This experimental design allowed us to assay the impact of cortical silencing on performance (control vs PV-ChR2 performance on light-on reinforced trials) versus acquisition learning (control vs PV-ChR2 performance on light-off probe trials) and expression learning (control vs PV-ChR2 performance on light-off reinforced trials) (Fig. 1e).
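The probabilistic trial structure described above can be sketched as follows. Only the 90% light-on probability, the equal S+/S− split of silenced trials, and the pseudo-random sequencing come from the text; the trial count and function name are hypothetical.

```python
import numpy as np

def make_reinforced_block(n_trials=200, p_light=0.9, seed=0):
    """Sketch of the probabilistic silencing design for one session's reinforced
    trials: half S+ / half S-, ~90% of each stimulus light-on (AC silenced),
    pseudo-randomly interleaved. Trial count is an assumed example value."""
    rng = np.random.default_rng(seed)
    half = n_trials // 2
    n_on = int(round(half * p_light))  # equal split of silenced trials across stimuli
    trials = []
    for stim in ("S+", "S-"):
        trials += [(stim, "on")] * n_on + [(stim, "off")] * (half - n_on)
    order = rng.permutation(len(trials))  # pseudo-random sequencing
    return [trials[i] for i in order]
```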
We first compared performance in light-on reinforced trials between PV-ChR2 and control mice (Fig. 1e) and observed a large performance impairment in PV-ChR2 mice (Fig. 1f,g). To address whether this performance reduction was accompanied by an impairment in rapid learning, we compared performance in PV-ChR2 and control animals in light-off probe trials (Fig. 1e,h-k), when the AC was not silenced and knowledge acquisition could be accurately measured7. Accuracy was lower during probe trials in PV-ChR2 mice (Fig. 1h), with delayed S+-response learning (Fig. 1i), lower discrimination (Fig. 1j), and longer lick latency on hit trials (Fig. 1k). Rapid acquisition of task knowledge was therefore impaired in PV-ChR2 mice.
Accuracy was also lower in reinforced light-off trials in PV-ChR2 mice (Fig. 1l,m). This remained true even after controlling for their slower task acquisition (Fig. 1n, Supplementary Figure 3a-c). These impairments were also apparent in response latency and response vigor (Supplementary Figure 4). Together, these results suggest that the AC is the default pathway for sound-guided reward learning, even when not needed for perceptual sharpening.
The auditory cortex is used during learning but becomes dispensable at expert levels
We next sought to understand the contribution of AC activity to the expression of the learned behavior as animals transitioned to expert performance. Transient inactivation of the auditory cortex in expert animals has yielded conflicting results, with some reports showing degradation of sound-guided behavior14,17,22,23 and others not14,24,25. We exploited our probabilistic silencing strategy and compared performance in light-on (AC silenced) versus light-off (AC functional) reinforced trials within subjects (Fig. 1o). Performance on these two trial types was similar early in training, when performance was poor overall (Fig. 1p). As training progressed, performance remained poor on light-on trials but improved on light-off trials (Fig. 1p), demonstrating that the AC is used for task performance at early and intermediate time points during learning. Surprisingly, this performance deficit on light-on trials gradually waned (Fig. 1p,q), suggesting that while the AC was used during learning, it became dispensable once the mice had mastered the task.
Three alternative explanations could account for these results. First, the optogenetic manipulation per se may not interfere with a task-relevant process but instead could ‘distract’ the animal, necessitating more time to increase performance in light-on trials. We reasoned that bilateral silencing of another cortical region that is nominally unrelated to the task would serve as an important control. We bilaterally silenced the visual cortex throughout learning in PV-ChR2 mice and found no evidence of performance impairment in light-on trials (Supplementary Figure 5), demonstrating that the performance impairment was specific to AC silencing. Second, it is possible that AC silencing altered tone perception, increasing task difficulty at the perceptual level in light-on trials. Third, the reduction of impairment during light-on trials could be driven by a reduction of the silencing effect with time due, for example, to brain damage induced by repeated silencing. To address the second and third possibilities, we trained a separate cohort of PV-ChR2 mice without daily inactivation and, instead, inactivated the AC only after they reached expert performance (see Methods). We observed no impact of AC silencing (Fig. 1r, Supplementary Figure 6)14.
Altogether, these results show that the AC is engaged during learning but is dispensable at expert levels, potentially tutoring subcortical structures that take over once the associations are learned.
Unsupervised discovery of learning-related dynamics by low-rank tensor decomposition
We next sought to understand the nature and dynamics of auditory cortical activity underlying learning and performance. To do so, we performed longitudinal two-photon calcium imaging of thousands of excitatory neurons in mice learning the auditory go/no-go task (n = 5). A separate group of water-restricted mice was passively exposed to two pure tones over the same duration but with no association with reinforcement (n = 3, see Methods; Supplementary Figure 7). This design allowed us to use the passive network as a base-case model to isolate learning-related neural dynamics.
We expressed the genetically encoded calcium indicator GCaMP6f under the CaMKII promoter, targeting AC layer 2/3 pyramidal neurons. We imaged two planes ∼50 µm apart (Fig. 2a), allowing us to simultaneously record hundreds of neurons per animal (n = 7,137 neurons in 8 mice). All mice were passively presented with a series of pure tones (4 to 64 kHz, quarter-octave spaced) to characterize auditory tuning properties within the local area of expression. We computed single-neuron tuning curves and then constructed a ‘best frequency’ map confirming the location in the AC (Fig. 2b). For each mouse, we chose two stimuli that were similarly represented in the recorded population and were 3/4 octave apart (Fig. 2c). We used a custom head-fixation system that allowed for kinematic registration and tracked the activity of the same neurons across weeks, including pre- and post-learning tuning curve sessions (n = 4,643 neurons in 8 mice, see Methods; Fig. 2d-g).
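Best-frequency assignment and the selection of two equally represented stimuli 3/4 octave apart can be sketched as follows. The 17 quarter-octave tones from 4 to 64 kHz match the text; the response matrix and selection rule are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for trial-averaged tone-evoked responses (n_neurons x n_tones).
freqs = 4.0 * 2 ** (np.arange(17) / 4.0)          # 4 ... 64 kHz, quarter-octave spaced
rng = np.random.default_rng(0)
responses = rng.random((200, 17))

best_freq = freqs[np.argmax(responses, axis=1)]   # best frequency per neuron
counts = np.array([(best_freq == f).sum() for f in freqs])

# Pick two tones 3/4 octave apart (3 quarter-octave steps) with the most
# similar number of preferring neurons ("similarly represented in the population").
gap = 3
balance = [abs(int(counts[i]) - int(counts[i + gap])) for i in range(len(freqs) - gap)]
i = int(np.argmin(balance))
s_plus, s_minus = freqs[i], freqs[i + gap]
```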
a, Multi-plane, longitudinal two-photon calcium imaging of the layer 2/3 excitatory network in the auditory cortex during learning (n = 5 mice) or passive exposure (n = 3 mice; see Methods). b, Tonotopic organization of the field of view of one example mouse before learning (left). Cells are colored according to their best frequency, and tone-evoked responses of example cells circled in black to 17 pure tones ranging from 4 to 64 kHz are displayed on the right. c, Tone-evoked activity (top) and proportion of responsive cells (bottom) to pure tones. S+ and S− (filled and unfilled triangles, respectively) are chosen for training in the task based on their equal representation in the field of view in b. d, Six example cells tracked every day across weeks. e, Two planes recorded in one example mouse. Cells are colored according to the number of days tracked among the 19 recording sessions in this mouse. f, Distribution of the number of tracked days per cell in e. g, Cumulative distribution of tracked cells according to the percentage of recording sessions. Data for the mouse in e is the light blue line. h, Calcium data is arranged by neurons × time within trial (−1 to +4 s relative to tone onset, vertical line) × trials over time × trial outcomes. i, Activity from all Learning and Passive cells is concatenated to create a fourth-order tensor (megamouse; left). In the 3rd, ‘across trials’ dimension, data is aligned across mice according to learning phases: Acquisition (performance increases in probe trials), Expression (performance increases in reinforced trials), and Expert (high, stable performance in reinforced trials; see Methods and Supplementary Figure 8). j, Megamouse tensor decomposition identifies six neuronal dynamics (numbered; see Methods) that are characterized by a set of four factors: Neuron, Within trial, Across trial, and Outcome (see also Supplementary Figure 10). k, Projection of the tensor decomposition output onto the principal subspace.
WNr, WWr and WAr indicate the neuron, within-trial and across-trial weights for a component r, respectively. l, t-distributed stochastic neighbor embedding (t-SNE) projections of neuronal weights. Each dot represents a cell, colored according to the neuronal dynamic it contributed to the most. Bars (right) display the proportion of learning and passive cells among the highest contributors for each dynamic. Dynamics 1 and 2 are driven by the passive network (burgundy), while Dynamics 3 to 6 are driven by the learning network (blue). m, In the passive network, the highest contributing cells in Dynamic 1 define cell ensemble 1, and the highest contributing cells in Dynamic 2 define cell ensemble 2. Similarly, in the learning network, cell ensembles 3 to 6 consist of the highest contributing cells to Dynamics 3 to 6, respectively. n, Absolute weights of cell ensembles across the six identified dynamics. Neurons can participate in more than one dynamic.
From this high-dimensional dataset, we sought to identify single neurons and neuronal ensembles carrying learning-related information, resolve stimulus and non-stimulus related activity within a given trial, identify changes in representation across trials, and determine outcome-specific dynamics. To do so, we organized our data into a 4-dimensional array containing neurons × time in trial × trials across learning × trial outcome (Fig. 2h). To identify shared and distinct variability in neuronal populations recorded in passive mice (n = 2,339 neurons, ‘passive network’) and in learning mice (n = 2,304 neurons, ‘learning network’), we created a ‘megamouse’ by combining data from all mice and aligning neural activity to learning phase (n = 4,643 neurons, see Methods; Fig. 2i; Supplementary Figure 8). We then used low-rank tensor decomposition to allow unsupervised identification of demixed, low-dimensional neural dynamics across multiple (> 2) dimensions26,27 (Supplementary Figure 9 and Supplementary Figure 10a,b; see Methods). The tensor decomposition revealed six neuronal dynamics, each characterized by the four factors of the original tensor (see Methods; Fig. 2j, Supplementary Figure 10c,d, Supplementary Figure 11d). These six dynamics represented independent computations performed by the auditory cortical networks.
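Low-rank tensor decomposition (tensor component analysis, refs 26,27) factorizes the fourth-order tensor into one weight vector per mode for each component. Below is a minimal numpy sketch of CP decomposition by alternating least squares on a synthetic tensor; the authors' analysis likely used a dedicated implementation, so treat this as illustrative of the technique rather than their pipeline.

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Kronecker product of a list of factor matrices."""
    out, rank = mats[0], mats[0].shape[1]
    for m in mats[1:]:
        out = (out[:, None, :] * m[None, :, :]).reshape(-1, rank)
    return out

def unfold(x, mode):
    """Matricize a tensor along one mode (C-order over the remaining modes)."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def cp_als(x, rank, n_iter=200, seed=0):
    """CP decomposition by alternating least squares: one factor matrix per
    mode (here: neurons, time within trial, trials across learning, outcome)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((s, rank)) for s in x.shape]
    for _ in range(n_iter):
        for mode in range(x.ndim):
            others = [factors[m] for m in range(x.ndim) if m != mode]
            gram = np.ones((rank, rank))
            for m in others:
                gram *= m.T @ m  # Hadamard product of Gram matrices
            factors[mode] = unfold(x, mode) @ khatri_rao(others) @ np.linalg.pinv(gram)
    return factors

# Synthetic 4-way tensor: neurons x time-in-trial x trial-blocks x outcomes.
rng = np.random.default_rng(1)
true = [rng.standard_normal((s, 2)) for s in (30, 10, 8, 4)]
tensor = np.einsum('ir,jr,kr,lr->ijkl', *true)
est = cp_als(tensor, rank=2)
recon = np.einsum('ir,jr,kr,lr->ijkl', *est)
rel_err = np.linalg.norm(recon - tensor) / np.linalg.norm(tensor)
```

On noiseless low-rank data such as this synthetic tensor, ALS recovers the tensor to near machine precision; on real calcium data the rank is chosen by cross-validation or stability criteria.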
Projecting the product of the decomposition into principal component subspace showed that learning and passive networks exhibit almost orthogonal dynamics (Fig. 2k; Supplementary Figure 10f,g) and that the neural dynamics of different trial types evolved further apart in the learning network than in the passive network (Supplementary Figure 10h,i). Importantly, we ensured that the identified dynamics were not driven by isolated mice (Supplementary Figure 10e). Therefore, decomposition of the megamouse tensor discovered distinct dynamics exhibited by passive versus learning networks.
For further analyses, we attributed each dynamic to individual neurons based on the neuron’s maximum weight (‘unique participation’; Fig. 2l; see Methods and Supplementary Figure 11). This allowed us to map the six dynamics onto six distinct cell ensembles, i.e. groups of neurons maximally encoding a particular network-specific dynamic (Fig. 2m and Supplementary Figure 11d). It is important to note that individual neurons (and corresponding ensembles) could exhibit mixed selectivity for the six dynamics, which allows individual neurons to contribute to multiple, independent computations (Fig. 2n).
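The 'unique participation' rule reduces to an argmax over the magnitudes of each neuron's weights, while the full weight vector retains the mixed selectivity. A sketch, with the normalization step being an illustrative assumption:

```python
import numpy as np

def assign_ensembles(neuron_factor):
    """'Unique participation': label each neuron by the dynamic carrying its
    largest-magnitude weight, keeping the normalized weight vector to reflect
    mixed selectivity across dynamics (illustrative sketch)."""
    w = np.abs(np.asarray(neuron_factor, float))  # (n_neurons, n_dynamics)
    labels = np.argmax(w, axis=1)                 # one ensemble label per neuron
    share = w / w.sum(axis=1, keepdims=True)      # participation across dynamics
    return labels, share
```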
Learning counteracts tone-evoked habituation by maintaining stimulus selectivity in distinct cell populations
A prevailing view in sensory systems holds that sensory cortices subserve associative learning through plasticity of the cue representation5,28–36. This model posits that individual neurons (via changes in sensory tuning) and neural populations (via cortical map expansion) enhance the representation of behaviorally relevant cues for use by downstream regions37–39. These studies, however, measure neural tuning and map expansion outside of the task context in a ‘pre’ and ‘post’ learning design and infer that plasticity of cue representations reflects the mechanistic role of the sensory cortex. To assess this model, we initially focused on the cell ensembles that exhibited classical stimulus-evoked activity (Fig. 2j), namely cell ensembles 1-4.
We observed a prominent signature of stimulus-evoked habituation over hundreds to thousands of trials. This habituation dominated activity in passive networks, as seen in cell ensembles 1 and 2, which represented ∼77% (1,803/2,339) of all passive cells (Fig. 3a,d). These neurons exhibited stimulus-evoked activation (cell ensemble 1) or suppression (cell ensemble 2), both of which decreased in amplitude over time (Fig. 3b-c,e-f). These cell ensembles were not stimulus selective and displayed the same dynamic in both stimulus 1 (S1) and stimulus 2 (S2) trials (Fig. 3b,e). These ensembles thus reflected the broad-based suppression of non-selective neurons after long-term repeated presentation of the same sounds.
a, Representation of cell ensemble 1 in the Passive network. b, Average activity of cell ensemble 1 in S1 (black) and S2 (gray) trials across time in 80-trial blocks. Black triangles indicate tone onset, gray lines delimit averaged trial blocks. Black dashed lines separate time phases indicated by light to dark gray rectangles at the top: early, middle and late (see Methods). c, Cell ensemble 1 tone-evoked calcium responses across time phases for S1 and S2 trials combined (Friedman test, p = 1.26 × 10−291). d, Representation of cell ensemble 2 in the Passive network. e, Average activity of cell ensemble 2 in S1 and S2 trials across time. f, Cell ensemble 2 tone-evoked calcium responses across time phases for S1 and S2 trials combined (Friedman test, p = 7.32 × 10−121). g, Representation of cell ensemble 3 in the Learning network. h, Average activity of cell ensemble 3 in hit (green) and CR (yellow) trials across learning in 80-trial blocks. Black triangles indicate tone onset, gray lines delimit averaged trial blocks. Black dashed lines separate learning phases indicated by colored rectangles at the top: Acquisition, Expression and Expert (see Methods). i, Representation of cell ensemble 4 in the Learning network. j, Average activity of cell ensemble 4 in hit and CR trials across learning. k, Response index (response probability over learning; see Methods) of cell ensembles 1 and 2 (red) vs cell ensembles 3 and 4 (blue) (Wilcoxon test, p = 1.23 × 10−30). l, Selectivity index (see Methods) of cell ensembles 1 and 2 (red) vs cell ensembles 3 and 4 (blue) (Wilcoxon test, p = 1.37 × 10−94). m, Pre- (top row) and post-learning (bottom row) tonotopic maps (left), after spatial binning (middle) and restricted to surface with S+ (filled triangle) and S− (open triangle) best frequency (right) of one example mouse. n, Change in surface representation of S+ and S− pre- vs post-task learning (Learning) or pre- vs post-passive exposure (Passive) (binomial proportion tests).
o, Pre- vs post-learning change in the percentage of neurons responsive to S+ and S− (binomial proportion tests). p, Pre- vs post-learning change in tone-evoked responses of pre-task S+ and S− responsive neurons (KW test, p = 2.77 × 10−5). q, Pre- vs post-learning comparison of local best frequency differences in tonotopic maps. r, Distribution of local differences (from difference maps in q) in Learning versus Passive. median ± s.e.median; *p < 0.05; **p < 0.01; ***p < 0.001, n.s.: not significant.
Stimulus-evoked responses in learning networks were observed in cell ensembles 3 and 4, which were highly selective for the S− (cell ensemble 3) or S+ (cell ensemble 4) cues (Fig. 3g-j). Cell ensemble 3 comprised 19% of the learning cell population (Fig. 3g) and displayed a slight habituation but mainly a strong preference for the S− throughout learning (Fig. 3h), while cell ensemble 4 (12% of learning cells; Fig. 3i) exhibited S+ selectivity throughout learning (Fig. 3j). Cell ensembles 3 and 4 were more tone responsive and tone selective than cell ensembles 1 and 2 (Fig. 3k,l). Stimulus-evoked activity analyses across days of all recorded neurons (n = 7,137) also support these results (Supplementary Figure 12, Supplementary Figure 13). Therefore, learning counteracted tone-evoked habituation by maintaining distinct ensembles that selectively encoded either the S+ or the S−.
Learning was not associated with cortical map expansion
To directly test representational expansion and tuning shifts, we conducted a series of analyses focusing on stimulus-evoked responses before (pre-task) and after (post-task) learning, akin to classical measures of tuning and tonotopy. We computed the change in surface area occupied by S+- and S−-preferring cells in tuning curve sessions, outside the task (Fig. 3m). Surprisingly, we observed no increase in the map-level representation of the S+ or S− after learning and, instead, observed a modest decrease (Fig. 3m-n). In addition to the best frequency representation, the fraction of neurons responding to the S+ and S− decreased (Fig. 3o) and the response amplitude of neurons that were initially tuned to the S+ and S− was lower after learning (Fig. 3p). Interestingly, while we observed no increase in representation of the S+ and S−, learning networks favored the representation of frequencies in between S+ and S−, but not higher or lower as seen in passive networks (Fig. 3n). Finally, using our passive networks as a base-case comparison, we calculated the local changes in the tonotopic map structure (Fig. 3q). Learning networks were surprisingly stable and exhibited fewer local changes than passive networks (Fig. 3r). These pre- vs post-learning changes in responsiveness and tonotopy thus mirrored the responsiveness observed online during learning (in dynamics 1 and 2) in a stable, tracked network (n = 4,643 neurons, Fig. 3a-l), as well as when we included all neurons from each session (n = 7,137 neurons) (Supplementary Figure 13). Altogether, our results suggest that cortical map expansion and changes in single-neuron tuning are unlikely to be the substrate for associative learning40,41.
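The 'binomial proportion tests' used for the pre- versus post-learning map comparisons (Fig. 3n,o) are not specified in this section. One standard form is a two-proportion z-test on counts of map pixels or neurons; whether this matches the authors' exact test is an assumption, so the sketch below is illustrative only.

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided two-proportion z-test, e.g. comparing the fraction of the map
    preferring S+ before (k1 of n1) versus after (k2 of n2) learning."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value
```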
Tone-restricted silencing only partially impairs learning and performance
We next sought to understand the extent to which the maintenance of stimulus selectivity by learning networks was important for learning and performing the task. We performed daily bilateral silencing of the AC during stimulus presentation throughout learning (Supplementary Figure 14a). Tone-restricted AC silencing impaired task performance throughout learning (Supplementary Figure 14b-e), task acquisition (Supplementary Figure 14f-i), and online performance during learning, with gradual fading of the effect at expert performance (Supplementary Figure 14n-q). Accuracy and action rate were not affected in reinforced light-off trials (Supplementary Figure 14j-k), but PV-ChR2 mice licked more and faster to the S− (Supplementary Figure 14l-m), suggesting that tone-restricted AC silencing also impaired expression, but to a lesser extent than full-trial silencing. Altogether, these results showed that information carried by the AC network in the tone-evoked window is used during learning. Interestingly, tone-restricted silencing impacted learning less than full-trial silencing across nearly all measures (Fig. 1, Supplementary Figure 14), suggesting that activity after the tone-evoked window was critical for rapid contingency acquisition and performance during learning.
Rapid emergence of reward prediction activity in the auditory cortex
The sensory cortex is widely considered to be specialized for perception by interpreting complex sensory objects42,43 or adjusting representations of behaviorally relevant stimuli2,33,37,44,45. Recent evidence, however, suggests that sensory cortical neurons directly encode non-sensory variables such as movement46–49, reward timing50–53, expectation54,55, and context23,45,56–63. Conjoint representations of sensory and non-sensory variables in the same network could further hone perception or, alternatively, subserve more integrative associative processes.
Inspection of the within-trial dynamics of learning-driven cell ensembles 5 and 6 suggested that these neurons exhibited non-canonical activity in the form of a signal that occurred late in the trial, delayed from the tone-evoked response (Fig. 2j). This late-in-trial signal increased over learning and was trial-type selective (Fig. 2j). We next sought to further explore the encoding properties of these two cell ensembles. Cell ensemble 5 (n = 155 cells from the learning network) exhibited late-in-trial activity on hit trials (licking to the S+) that increased with learning (Fig. 4a). This delayed activity was not apparent on correct S− trials (correct reject, CR), where neurons exhibited a classical stimulus-evoked response that habituated over learning (Fig. 4b).
a, Heat map of cell ensemble 5 activity (n = 155 cells) across learning phases (delimited by horizontal white dashed lines) in hit trials (20-trial blocks). White trace represents the average trial trace. Insets (right) show average activity at times indicated by black triangles. Colored rectangles indicate learning phases: Acquisition (green), Expression (black) and Expert (blue). b, Heat map of cell ensemble 5 activity across learning phases (delimited by horizontal white dashed lines) in CR trials (20-trial blocks). c, Heat map of the activity of a fraction of cells from cell ensemble 5 (n = 20 cells) from one example mouse across consecutive S+ trials. Black dots indicate licks. Trial outcome is represented on the right (green circle: hit; blue stars: miss). d, Cell ensemble 5 activity in hit vs miss trials (time and number matched, see Methods and Supplementary Figure 15a). e, Area under the curve (AUC) quantification of data in the gray rectangle in d (Wilcoxon signed rank test, p = 6.78 × 10−21). f, Procedure of reinforced and probe hit trial (H) matching. g, Average cell ensemble 5 activity in reinforced hit trials immediately before (black) or after (gray) probe hit trials (green). h, AUC quantification of data in g (Friedman test, p = 0.3071). i, Lick PSTHs in reinforced hit trials immediately before (black) or after (gray) probe hit trials (green). j, Quantification of the number of licks in the 1-s window post-tone (KW test, p = 3.18 × 10−56). k, Average activity of cell ensemble 5 over the first five blocks of 40 reinforced hit trials in learning. l, Late peak activity in HIT trials across learning phases of cell ensemble 5 (green) and low-weighted cells (null, black). m, Procedure of reinforced and probe FA trial (fa) matching (top) and corresponding local accuracy quantification (bottom; see Methods; repeated measures ANOVA, p = 3.16 × 10−4). n, Average cell ensemble 5 activity in FA trials in the probe, non-reinforced context (orange).
AUC late-in-trial (gray rectangle) compared to zero (Wilcoxon signed rank test, p = 1.46 × 10−8). o, Average activity of cell ensemble 5 (n = 51 cells) from one example mouse in FA trials in the reinforced context (n = 423) after classification based on the detection of a reward prediction signal. Bottom, average activity of FA trials with (RP+, n = 101) or without (RP−, n = 322) a reward prediction signal, and activity during FA trials in the probe context (n = 19 trials, orange) reflecting ‘knowledge’ errors (see also Supplementary Figure 16). p, Heat map of the activity of a fraction of cells from cell ensemble 5 (n = 51 cells) from the same example mouse in o across consecutive FA trials in the reinforced context. Identification of a RP signal is represented by a black dot (right). q, Distribution of RP+ and RP− FA trials over learning in learning mice (binomial proportion tests, Acquisition, p = 1.65 × 10−7, Expression, p = 3.32 × 10−10, Expert, p = 0.22). r, Trial-specific closed-loop optogenetic AC inactivation over learning. s, Performance index (left, see Methods; two-way ANOVA, p = 2.11 × 10−21) and hit lick latency (right; two-way ANOVA, p = 0.013) in the probe context in post-hit silencing experiments. t, Performance index (left, see Methods; two-way ANOVA, p = 6.36 × 10−5) and hit lick latency (right; two-way ANOVA, p = 0.008) in the probe context in post-FA silencing experiments.
To understand the nature of the late-in-trial activity, we exploited our multiple trial types to disambiguate the contribution of sensory, motor, and reward signals. To assess whether the late-in-trial signal was a delayed form of sensory activity, we compared activity in hit trials to activity in trials where the same stimulus was presented but the mice did not lick and did not get rewarded (miss trials, Figs. 1a and 4c-e). To ensure an appropriate comparison between hit and miss trials, we generated a balanced set of trials that were matched in number (given that miss trials were less frequent) and occurred within the same time period (given that the signal amplitude evolved with learning) (Supplementary Figure 15a). Cell ensemble 5 did not exhibit late-in-trial activity on miss trials (Fig. 4c-e), ruling out the possibility that it reflected a delayed sensory response. We then asked whether this activity reflected reward consumption. We compared cell ensemble activity during hit trials in the reinforced context to activity during hit trials in the probe context (Fig. 4f), where the mice expected reward and thus correctly licked to the S+ but the reward was omitted (Fig. 1b). We matched the number of trials between reinforced and probe contexts and controlled for within-session and across-session changes by comparing probe hit trials to reinforced hit trials immediately before and after the probe block (Fig. 4f). Strikingly, late-in-trial activity was preserved in probe trials (Fig. 4g,h), indicating that it did not reflect reward consumption. Finally, although movement has been reported to decrease auditory cortical activity46,64–66, we sought to understand the degree to which this late-in-trial signal could be driven by licking itself. To do this, we first exploited probe hit trials, where the lick rate was strongly reduced compared to reinforced hit trials (Fig. 4i,j).
We observed no difference in the late-in-trial neural signal and could thus conclude that the signal was not due to ongoing licking (Fig. 4i,j). Second, we tested the possibility that this late-in-trial signal was driven by the initiation of a lick bout rather than by ongoing licking activity. We isolated spontaneous lick bouts in between training blocks and observed that the cell ensemble was not lick-responsive (Supplementary Figure 15b,c). In addition, if lick initiation drove this activity, we would also expect to see it on false alarm trials (incorrect licking to the S−). For this analysis, we focused on false alarms that occurred after task acquisition, as these are unlikely to be due to imperfect task knowledge. We observed no systematic late-in-trial activity on these trials (Supplementary Figure 15d), even though the licking pattern in false alarm trials was similar to that during probe hit trials (Supplementary Figure 15e). Taken together, the late-in-trial activity did not reflect the stimulus, reward consumption, ongoing licking, or lick initiation. Instead, these results showed that cell ensemble 5 encoded the higher-order process of reward prediction (RP).
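The matched-trial comparison described above (balancing rare miss trials against hits in both count and session time, then quantifying the late-in-trial AUC) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`match_trials`, `late_auc`), window bounds, and the synthetic ΔF/F-like traces are assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

def match_trials(hit_idx, miss_idx):
    """For each (rare) miss trial, pick the nearest-in-session hit trial,
    without replacement, yielding hit/miss sets matched in trial number
    and in time within the session."""
    available = np.asarray(hit_idx, dtype=float)
    matched = []
    for m in miss_idx:
        j = int(np.argmin(np.abs(available - m)))
        matched.append(int(available[j]))
        available = np.delete(available, j)  # sample without replacement
    return np.array(matched)

def late_auc(traces, t, window=(2.0, 4.0)):
    """Per-trial area under the curve in a late-in-trial window."""
    mask = (t >= window[0]) & (t < window[1])
    return traces[:, mask].sum(axis=1) * (t[1] - t[0])

# synthetic example: hit trials carry a late-in-trial signal, miss trials do not
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 4.0, 100)
hits = rng.normal(0.0, 0.1, (200, t.size)) + (t > 1.5) * 0.5
misses = rng.normal(0.0, 0.1, (20, t.size))
miss_times = rng.choice(200, size=20, replace=False)  # trial indices as a time proxy
hit_sel = match_trials(np.arange(200), miss_times)
stat, p = wilcoxon(late_auc(hits[hit_sel], t), late_auc(misses, t))
```

Matching without replacement keeps the two sets the same size while pairing each miss with a temporally adjacent hit, so learning-related drift in signal amplitude cannot masquerade as an outcome effect.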
We next sought to identify the precise moment when a contingency is formed by identifying the trials when this reward prediction signal emerged. Initially, these neurons exhibited classical tone-evoked responses but then, abruptly and within only 40 hit trials, developed robust reward prediction activity (Fig. 4k, Supplementary Figure 15f). This reward prediction signal continued to develop over Acquisition, strengthened during Expression, and then surprisingly receded at the Expert level, when learning is nominally complete (Fig. 4a,l, Supplementary Figure 15g). This longitudinal temporal dynamic mirrored our optogenetic results, which demonstrated that the AC is the default pathway for learning but then becomes dispensable at expert levels. Altogether, these results show that a reward prediction signal rapidly emerges at the timescale of Acquisition in auditory cortical networks.
Revealing the underlying cognitive drivers of errors
Identifying the cognitive drivers of errors is particularly challenging during learning4. Errors during learning are typically considered ‘mistakes’ while discriminative contingencies (task knowledge) are still forming. However, errors arise not only from knowledge-related mistakes (for which animals incorrectly expect reward), but also from factors such as impulsivity, disengagement, and exploration (for which animals do not expect reward). While detailed behavioral inspection has been a promising route to uncover the nature of errors11, an alternative approach is to use neural activity itself. Given our findings of reward prediction encoding on correct trials, we hypothesized that the same signal would be present when animals make ‘knowledge-related’ errors, i.e. when they incorrectly ‘expected’ reward on S− trials. To address this, we first focused on the occasional false alarms (FA) that occurred during probe trials, as they reflected errors of task knowledge (Fig. 4m)7. Strikingly, we observed robust reward prediction activity in these trials (Fig. 4n), strongly suggesting that animals were indeed expecting reward. We next reasoned that such knowledge errors should be present not only on probe trials, but also in a subset of reinforced trials, interspersed with non-knowledge errors. We classified individual FA trials in the reinforced context based on the presence of a reward prediction signal (see Methods; Supplementary Figure 16a). We identified a significant proportion of trials that exhibited robust reward prediction activity, but also many that did not (Fig. 4o, Supplementary Figure 16b). The reward prediction signal was identical to that observed in probe trials (Fig. 4o, Supplementary Figure 16d), providing further confidence that these were indeed knowledge errors.
These data suggest that we could isolate knowledge errors using neural data, which was not possible from behavioral inspection alone (Supplementary Figure 16c). Interestingly, we found that knowledge errors were interspersed with errors that did not elicit reward prediction activity (Fig. 4p). Finally, we hypothesized that knowledge errors should predominantly occur during the Acquisition phase of behavior, when animals are still learning the discriminative contingencies. We computed the fraction of RP+ (knowledge-related errors) and RP− (non-knowledge errors) over time and found that RP+ errors peaked during the Acquisition phase of learning and rarely occurred during the Expression or Expert phases of behavior (Fig. 4q, Supplementary Figure 15d). These results demonstrate that the internal cognitive drivers of errors may be accessible from neural data, which is particularly valuable when behavior alone is insufficient.
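The single-trial classification and phase comparison described above can be made concrete with a simple threshold rule on the late-in-trial window, followed by a two-proportion test across learning phases. This is a hedged sketch, not the authors' classifier (their procedure is in Methods and Supplementary Figure 16a); the baseline window, the 2-SD threshold, and the function names are assumptions.

```python
import numpy as np
from scipy.stats import norm

def classify_rp(traces, t, base_win=(-1.0, 0.0), late_win=(2.0, 4.0), n_sd=2.0):
    """Label a trial RP+ if its mean late-in-trial activity exceeds a
    baseline-derived threshold (pre-tone mean + n_sd * pre-tone SD)."""
    b = (t >= base_win[0]) & (t < base_win[1])
    lw = (t >= late_win[0]) & (t < late_win[1])
    thresh = traces[:, b].mean(axis=1) + n_sd * traces[:, b].std(axis=1)
    return traces[:, lw].mean(axis=1) > thresh

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided z-test for a difference between two binomial proportions,
    e.g. the fraction of RP+ FA trials in two learning phases."""
    p_pool = (k1 + k2) / (n1 + n2)
    se = np.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n1 + 1.0 / n2))
    z = (k1 / n1 - k2 / n2) / se
    return 2.0 * norm.sf(abs(z))

# synthetic check: one FA trial with a late reward-prediction-like signal, one without
rng = np.random.default_rng(1)
t = np.linspace(-1.0, 4.0, 100)
rp_plus = rng.normal(0.0, 0.05, t.size) + (t > 2.0) * 0.5
rp_minus = rng.normal(0.0, 0.05, t.size)
labels = classify_rp(np.vstack([rp_plus, rp_minus]), t)
```

For instance, `two_proportion_z(60, 100, 10, 100)` would reject equality of RP+ fractions between a hypothetical Acquisition and Expert sample, the kind of contrast shown in Fig. 4q.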
Reward prediction activity provides the core teaching signal
Learning theory proposes that animals learn from correct actions that are rewarded but also from incorrect actions that are not rewarded67. This allows animals to select the appropriate action after reward-predictive (S+) versus non-predictive (S−) cues. Given the presence of reward prediction activity on correct S+ trials (throughout learning) and incorrect S− trials (early in learning), we reasoned that silencing auditory cortical activity during the post-response period could impact learning and/or performance. To test this, we performed closed-loop probabilistic optogenetic silencing of the AC, whereby light was delivered upon lick detection in 90% of either S+ reinforced trials (n = 5 control, n = 8 PV-ChR2 mice) or, in a separate cohort, S− reinforced trials (n = 7 control, n = 8 PV-ChR2 mice; see Methods; Fig. 4r, Supplementary Figure 17a, Supplementary Figure 18a). No light was delivered in 10% of S+ reinforced trials and 100% of probe trials. Given that the light was delivered after the instrumental lick response, the manipulation could not affect the instrumental behavior on the current trial, only on subsequent ones. To confirm this, we calculated the difference in performance between light-on and light-off trials and observed no difference (Supplementary Figure 17b-d and Supplementary Figure 18b-d). In the S+ cohort, post-hit silencing weakened the stimulus-action association (Fig. 4s), delayed cue-response discrimination (Fig. 4s), but did not impact probe accuracy over the first 6 days (Supplementary Figure 17e-g). Importantly, the same silencing protocol over the visual cortex (n = 6 PV-ChR2 mice) had no effect on behavior, confirming that these effects were specific to the auditory cortex (Supplementary Figure 17k,l). In the S− cohort, post-FA silencing weakened the stimulus-action association as measured on hit trials (Fig.
4t), robustly delayed cue-response discrimination (Fig. 4t, Supplementary Figure 17g), and impaired probe accuracy over the first 6 days (Supplementary Figure 17e,f). Accuracy of PV-ChR2 mice was lower than that of controls in the reinforced context in both experiments (Supplementary Figure 17h and Supplementary Figure 18h), with lower hit rates and higher FA rates (Supplementary Figure 17i and Supplementary Figure 18i), and longer response latencies on hit trials (Supplementary Figure 17j and Supplementary Figure 18j), suggesting an impairment of expression. Overall, these closed-loop manipulations showed that AC activity at the time of the reward prediction signal in both hit and FA trials was used by the animal for task acquisition and expression. These data also demonstrate that learning is sensitive to cortical silencing on mistakes (FA trials), suggesting that in a go/no-go paradigm, feedback on error trials is crucial to the learning process. Altogether, these results suggest that reward prediction activity in auditory cortical networks is used as a teaching signal during learning.
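The closed-loop contingency described above reduces to a simple per-trial state machine: on the first detected lick of a reinforced trial of the targeted stimulus, trigger the light with 90% probability; never trigger in probe trials. The sketch below is purely illustrative; the class and method names are hypothetical, and the real protocol runs on acquisition hardware with a laser, not a boolean return value.

```python
import random

class ClosedLoopSilencer:
    """Minimal sketch of the post-response silencing rule (hypothetical names)."""

    def __init__(self, target_stim="S+", p_light=0.9, seed=None):
        self.target_stim = target_stim
        self.p_light = p_light
        self.rng = random.Random(seed)
        self.decided = False

    def start_trial(self, stim, context):
        # context is "reinforced" or "probe"; the decision flag resets each trial
        self.stim, self.context, self.decided = stim, context, False

    def on_lick(self):
        """Called at lick detection; returns True if the laser should fire.
        Because light follows the instrumental response, it cannot change
        the choice on the current trial, only on subsequent ones."""
        if self.decided or self.context == "probe" or self.stim != self.target_stim:
            return False
        self.decided = True  # at most one laser decision per trial
        return self.rng.random() < self.p_light
```

Gating on the first lick of the targeted trial type, with probe trials always light-off, is what lets the effect of silencing be read out on unmanipulated probe behavior.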
Encoding of action suppression enables task performance
A critical requirement in a go/no-go task is the ability to suppress responding to the non-rewarded S− cue. In our task, we demonstrated that mice have the capacity to withhold licking to the S− very early in learning (as shown in probe trials during the Acquisition phase) but continue to lick for hundreds to thousands of trials when reinforced and throughout Expression. Here, we asked the extent to which the AC mediates this form of action suppression. Neurons in cell ensemble 6 (n = 704, 31% of learning networks; Fig. 5a), but not non-member cells, exhibited late-in-trial activity when animals correctly withheld licking on S− trials (correct rejects, CR; Fig. 5b, Supplementary Figure 19a-b). This signal was stable throughout training despite the strong increase of the CR rate over learning (Fig. 5c, Supplementary Figure 19c-d). This all-or-none attribute suggested that the late-in-trial activation was tied to performance rather than being a signal used for learning. Once mice acquired the task contingencies, they essentially learned to inhibit a licking response to the S− tone. We therefore sought to test the hypothesis that late-in-trial activation in CR trials reflected action suppression. First, we reasoned that activity in FA and CR trials should be similar until the moment of suppression failure (i.e. the first lick). We compared the activity of cell ensemble 6 in CR vs FA trials, i.e. when mice failed to withhold licking (see Methods), exploiting the different first-lick latencies in FA trials (Fig. 5d). We observed that calcium activity dropped abruptly in FA trials at the time of the first lick compared to CR trials (Fig. 5d,e, Supplementary Figure 19e). Second, if lick suppression is an active, contingency-specific process, the late-in-trial activation should be specific to correct rejections of the S− tone and not observed when the animal did not lick in response to the S+ tone (miss trials).
Given that miss trials were rare and sporadic, we controlled for the effect of time over learning and for differences in the number of trials of each outcome type (see Methods), and did not observe late-in-trial activation on miss trials despite similar peak activity after tone onset in miss and CR trials (Fig. 5f). Third, we reasoned that if this activity reflects the active process of action suppression, the signal should decrease when the animal disengaged from the task. We therefore compared late-in-trial activity in CR trials immediately before, during, and after short blocks of disengagement (see Methods) and observed that the activity dropped significantly when mice transiently disengaged from the task (Fig. 5g). These data suggest that the auditory cortex integrates a higher-order action suppression signal.
a, Representation of cell ensemble 6 (n = 704 cells) in the Learning network. b, Average activity of cell ensemble 6 (yellow) versus cells that do not contribute to this dynamic (null, black) in CR trials in the Expert phase (Wilcoxon test, p = 7.44 × 10⁻¹⁷). c, Average activity of cell ensemble 6 in CR trials (top) and CR rate (bottom) during Acquisition (green), Expression (black) and Expert (blue) phases (KW test, p = 0.09). d, Heat map of cell ensemble 6 activity in hit, FA and CR trials. FA trials are binned according to lick latencies (white dots, latency range extrema; white cross, mean latency). e, Heat map of cell ensemble 6 activity in hit and FA trials significantly different from CR trials (Wilcoxon tests, red, higher; blue, lower; white, n.s.). f, Average cell ensemble 6 activity in miss and CR trials (time- and number-matched, see Methods; middle). Quantifications of tone-evoked activity (bottom left; Wilcoxon signed rank test, p = 0.84) and late-in-trial AUC (bottom right; Wilcoxon signed rank test, p = 5.24 × 10⁻²⁶). g, Procedure of reinforced and probe CR trial matching (top) and corresponding calcium activity (middle; Friedman test, p = 1.36 × 10⁻¹¹) and local hit rate (bottom; Friedman test, p = 3.45 × 10⁻¹¹). h, FA rate difference between light-on and light-off trials in PV-ChR2 mice (two-way ANOVA, p = 7.20 × 10⁻¹⁶; t-tests compared to 0, p = 4.96 × 10⁻⁴, p = 0.96, p = 0.002). The auditory or visual cortex was inhibited during the full trial (AC trial, n = 8; VC trial, n = 8), or the AC was silenced during tone presentation only (AC tone, n = 4). i, Average lick probability in FA light-on versus FA light-off trials (two-way ANOVA, p = 1.18 × 10⁻⁵; t-tests compared to 0, p = 1.94 × 10⁻⁶, p = 0.10, p = 0.68).
Finally, we wondered whether the action suppression activity in the AC was causal to performance during learning. To test this, we reasoned that silencing the AC network throughout S− trials should increase the FA rate but also the lick probability (since the action suppression neurons are silenced during this period). In contrast, silencing the AC network only during the stimulus period should increase the FA rate but not impact the lick probability when the light is off (Supplementary Figure 19f). We first compared the FA rate between light-on and light-off trials in PV-ChR2 mice during full-trial silencing (Fig. 5h) and observed a marked increase in FA rate and lick probability (Fig. 5h,i). Importantly, this effect was not the result of the perception of the optogenetic manipulation per se, as suppression of the visual cortex in PV-ChR2 mice did not have this effect (Fig. 5h,i). In contrast, restricting silencing to the stimulus period increased the FA rate while not affecting lick probability (Fig. 5h,i), suggesting that the late-in-trial activity in CR trials was critical for the maintenance of action suppression. Altogether, these results showed that action suppression is encoded in the auditory cortex and is instrumental for performance during learning.
Higher-order contingency ensembles are spatially clustered and uncoupled from sensory representations
We next asked the extent to which reward prediction and action suppression ensembles mapped onto the underlying stimulus properties of the AC. We exploited the spatial resolution of two-photon imaging to characterize the spatial distribution of reward prediction and action suppression neurons in the AC network. Strikingly, we observed that the two cell ensembles were spatially clustered (Fig. 6a-c). To determine whether this organization was driven by the neurons' pre-learning stimulus selectivity, we calculated the selectivity index (SI) of each neuron before training to test whether S+-selective neurons preferentially became reward prediction neurons and S−-selective neurons preferentially became action suppression neurons. We observed no difference in SI distribution between reward prediction and action suppression neurons (Fig. 6d,e), suggesting that pre-task stimulus selectivity was not predictive of either function. We then asked whether the spatial location of reward prediction and action suppression neurons aligned with the underlying tonotopic map. In other words, did action suppression neurons have the S− tone as best frequency, and were reward prediction neurons preferentially responsive to the S+ tone (Fig. 6f; see Methods)? We found that this was not the case (Fig. 6g,h), with similar proportions of S+- and S−-preferring neurons in the reward prediction and action suppression cell ensembles (Fig. 6i). Therefore, contingency-related ensembles clustered into spatial domains that were uncoupled from underlying stimulus selectivity and tonotopy, indicating a higher-order functional segregation within the AC.
a, Spatial distribution of reward prediction (purple circles) and action suppression (orange circles) cell ensembles in an example mouse. Color scale indicates neuronal weights in Dynamics 5 (purple) and 6 (orange). b, Median cell distance between cell ensembles compared to a shuffle distribution (n = 500) for the example mouse in a. The null hypothesis is that the distance between the two ensembles is no different from chance (i.e. no spatial organization). c, Z-scored distances between clusters per mouse (blue: significant; gray: non-significant). Red arrow points to the example mouse in a. d, Neuronal weights in Dynamics 5 and 6 of cells from learning mice (n = 1,216, left) and their pre-task stimulus selectivity index (right). e, Distributions of pre-task stimulus selectivity of cell ensembles 5 and 6 (KS test, p = 0.25; Wilcoxon test, p = 0.18). f, Pre-task tonotopic map of the example mouse in a. Cells are colored according to their best frequency (BF). Frequencies used as S+ and S− for training are indicated by full and empty triangles, respectively. g, Distribution of BF distance from S+ for the reward prediction cell ensemble (purple). Null hypothesis is that reward prediction cells have a BF as close to S+ as possible (black; see Methods; KS test, p = 3.81 × 10⁻⁹). h, Distribution of BF distance from S− for the action suppression cell ensemble (orange). Null hypothesis is that action suppression cells have a BF as close to S− as possible (black; see Methods; KS test, p = 9.21 × 10⁻¹⁶). i, Proportions of S+- and S−-preferring cells in reward prediction and action suppression cell ensembles (binomial proportion tests, S+, p = 0.17; S−, p = 0.53).
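The label-shuffle test for spatial clustering of the two ensembles (Fig. 6b,c) can be sketched as a permutation test on the median inter-ensemble distance. This is an illustrative reconstruction under assumptions (the exact distance statistic and shuffle are described in Methods); `clustering_z` is a hypothetical name and the cell coordinates below are synthetic.

```python
import numpy as np

def clustering_z(xy, labels, n_shuffle=500, seed=0):
    """Z-score the observed median inter-ensemble distance against a null
    built by shuffling ensemble labels across cells (i.e. assuming no
    spatial organization)."""
    rng = np.random.default_rng(seed)

    def median_between(lab):
        a, b = xy[lab == 0], xy[lab == 1]
        return np.median(np.linalg.norm(a[:, None] - b[None, :], axis=-1))

    obs = median_between(labels)
    null = np.array([median_between(rng.permutation(labels))
                     for _ in range(n_shuffle)])
    return (obs - null.mean()) / null.std()

# synthetic check: two ensembles in separate spatial domains vs. one mixed blob
rng = np.random.default_rng(2)
labels = np.array([0] * 40 + [1] * 20)
xy_sep = np.vstack([rng.normal([0.0, 0.0], 0.5, (40, 2)),
                    rng.normal([10.0, 0.0], 0.5, (20, 2))])
z_sep = clustering_z(xy_sep, labels)
z_mix = clustering_z(rng.normal(0.0, 1.0, (60, 2)), labels)
```

When the ensembles occupy separate domains, the observed inter-ensemble distance sits far above the shuffled null; when labels are arbitrary within one blob, the observed value is just another draw from the null.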
Competing Interest Statement
The authors have declared no competing interest.
Data availability
The data that support the findings of this study are available from the corresponding authors upon request.