Abstract
Can decisions be made solely by chance? To investigate this question, we designed a deterministic setting in which mice are rewarded for non-repetitive choice sequences, and modeled the experiment using reinforcement learning. We found that mice progressively increased their choice variability using a memory-free, pseudo-random selection, rather than by learning complex sequences. Our results demonstrate that a decision-making process can self-generate variability and randomness even when the rules governing reward delivery are neither stochastic nor volatile.
Main
Principles governing random behaviors are still poorly understood, despite well-known ecological examples ranging from vocal and motor babbling in trial-and-error learning 1,2 to unpredictable behavior in competitive setups (e.g. prey versus predators 3 or human competitive games 4). Dominant theories of behavior, notably reinforcement learning (RL), rely on exploitation, namely the act of repeating previously rewarded actions 5,6. In this context, choice variability is associated with exploration of environmental contingencies. “Directed” exploration aims at gathering information about environmental contingencies 7,8, whereas random exploration introduces variability regardless of the contingencies 9,10. Studies have shown that animals are able to produce variable, unpredictable choices 11,12, especially when the reward delivery rule changes 13,14, is stochastic 9,15,16 or is based on predictions about their decisions 17,18. However, even approaches based on predicting the animal’s behavior 17,18 retain the possibility of distributing reward stochastically, for example if no systematic bias in the animal’s choice behavior has been found 18,19. Thus, because of the systematic use of volatile or probabilistic contingencies, it has remained difficult to experimentally isolate variability generation from environmental conditions. To test the hypothesis that animals can adaptively adjust the randomness of their behavior, we implemented a task where the reward delivery rule is deterministic, predetermined and identical for all animals, but where a purely random choice strategy is successful.
Mice were trained to perform a sequence of binary choices in an open-field where three target locations were explicitly associated with rewards delivered through intra-cranial self-stimulation (ICSS) in the medial forebrain bundle. Importantly, mice could not receive two consecutive ICSS at the same location. Thus, they had to perform a sequence of choices 15 and at each location to choose the next target amongst the two remaining alternatives (Fig 1A). In the training phase, all targets had a 100% probability of reward. We observed that after learning, mice alternated between rewarding locations following a stereotypical circular scheme interspersed with occasional changes in direction, referred to as U-turn (Fig 1B). Once learning was stabilized, we switched to the complexity condition, in which reward delivery was non-stochastic and depended on sequence variability. More precisely, we calculated the Lempel-Ziv (LZ) complexity 20 of choice subsequences of size 10 (9 past choices + next choice) at each trial. Animals were rewarded when they chose the one target (out of the two options) associated with the highest complexity (given the previous nine choices). Despite its difficulty, this task is fully deterministic. Indeed, mice were asked to move along a tree of binary choices (see Fig 1A) where some paths ensured 100% rewards. Whether each node was rewarded or not was pre-determined in advance. Thus, choice variability could not be imputed to the inherent stochasticity of the outcomes. For each trial, if choosing randomly, the animal had either 100% or 50% chance of being rewarded depending on whether the two subsequences of size 10 (= 9 past choices + 1 choice out of 2 options) had equal or unequal complexities. Another way to describe the task is thus to consider all possible situations, not as sequential decisions made by the animal during the task but as the set of all possible subsequences of size 10 of which the algorithm may evaluate the complexity. 
From this perspective, there is an overall 75% probability of being rewarded if subsequences are sampled uniformly (Fig 1A). To summarize, theoretically, while a correct estimation of the complexity of the sequence leads to a success rate of 100%, a pure random selection leads to 75% of success, and a repetitive sequence (e.g. A,B,C,A,B,C,…) grants no reward.
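The 75% figure can be recovered by averaging the two cases described above. In the short derivation below, q denotes the fraction of size-10 situations in which the two options have equal complexity; the symbol q is introduced here for illustration only and does not appear in the original analysis.

```latex
P(\text{reward} \mid \text{random choice})
  = q \cdot 1 + (1 - q) \cdot \tfrac{1}{2}
  = \tfrac{1 + q}{2},
\qquad
\tfrac{1 + q}{2} = 0.75 \;\Longrightarrow\; q = 0.5
```

Under this reading, an overall success rate of 75% corresponds to about half of the uniformly sampled situations rewarding both options.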
We found that mice progressively increased the variability of their choice sequences (Fig 1B) and thus their success rate along sessions (Fig 1C). This increased variability in the generated sequences was demonstrated by an increase in the normalized LZ-complexity measure (hereafter NLZcomp) of the session sequences, a decrease in an entropy measure based on recurrence plot quantification and an increase in the percentage of U-turns (Fig 1D). Furthermore, in the last session, 65.5% of the sequences were not significantly different from surrogate sequences generated randomly (Supp Fig 1A). The success rate was correlated with the NLZcomp of the entire session of choice sequences (Fig 1E), suggesting that mice increased their reward through an increased variability in their choice. The increase in success rate was associated with an increase of the percentage of U-turns (Fig 1D left), yet mice performed a suboptimal U-turn rate of 30%, below the 50% U-turn rate ensuring 100% of rewards (Supp Fig 1B).
From a behavioral point of view, mice thus managed to increase their success rate in a highly demanding task. They did not achieve 100% success but reached performances that indicate a significant level of variability. Given that the task is fully deterministic, the most efficient strategy would be to learn and repeat one (or some) of the 10-choice long sequences that are always rewarded. This strategy ensures the highest success rate but incurs a tremendous memory cost. On the other hand, a purely random selection is another appealing strategy since it is less costly and leads to about 75% of reward. To differentiate between the two strategies and better understand the computational principles underlying variability generation in mice, we examined the ability of a classical RL algorithm to account for the mouse decision-making process under these conditions.
State-action values were learned using the Rescorla-Wagner rule 21 and action selection was based on a softmax policy 5 (Fig 2A; see ‘Methods’). By defining states as vectors including the history of previous locations instead of the current location alone, we were able to vary the memory size of simulated mice and to obtain different solutions from the model accordingly. We found that, with no memory (i.e. state = current location), the model learned equal values for both targets in almost all states (Fig 2B). In contrast, and in agreement with classical RL, with the history of the nine last choices stored in memory, the model favored the rewarded target in half of the situations by learning higher values (approximately 90 vs 10%) associated with rewarded sequences of choices (Fig 2B). This indicates that classical RL can find the optimal solution of the task if using a large memory. Furthermore, choosing randomly depended not only on the values associated with current choices, but also on the softmax temperature and the U-turn cost. The ratio between these two hyperparameters controls the level of randomness in action selection (see ‘Methods’). Intuitively, a high level of randomness leads to high choice variability and sequence complexity. Interestingly, however, the randomness hyperparameter had opposite effects on the model behavior with small and large memory sizes. While increasing the temperature always increased the complexity of choice sequences, it increased the success rate for small memory sizes but decreased it for larger memories (Fig 2C). The boundary between the two regimes lay between memory sizes of 3 and 4.
Upon optimization of the model to fit mouse behavior, we found that the performance improvement of mice over sessions was best accounted for by an increase of choice randomness using a small memory (Fig 2D). This model captured mouse learning better than when using fixed parameters throughout sessions (Bayes factor = 3.46; see ‘Methods’, and Supp Fig 2D and E). The model with a memory of size 3 best reproduced mouse behavior (Fig 2D), but only slightly better than versions with smaller memories (Supp Fig 2C). From a computational perspective, one possible explanation for the fact that a memory of size 1, although theoretically sufficient, fits less well than a memory of size 3 is that state representation is overly simplified in the model. Accordingly, altering the model’s state representation to make it more realistic should reduce the size of the memory needed to reproduce mouse performance. To test this hypothesis, we used a variant of the model in which we manipulated state representation ambiguity: each of the locations {A, B, C} could be represented by n ≥ 1 states, with n = 1 corresponding to unambiguous states (see ‘Methods’, and Fig 2E). As expected, the model fitted better with a smaller memory as representation ambiguity was increased (Fig 2E). We also found that the best fitting learning rate was higher with ambiguous representations while the randomness factor remained unchanged regardless of ambiguity level (Fig 2E). This corroborates that the use of additional memory capacity by the model is due to the model’s own limitations rather than an actual need to memorize previous choices. Hence, this computational analysis overall suggests that mice adapted the randomness parameter of their decision-making system to achieve more variability over sessions rather than remembering rewarded choice sequences. This conclusion was further reinforced by a series of behavioral arguments, detailed below, supporting the lack of memorization of choice history in their strategy.
We first looked for evidence of repeated choice patterns in mouse sequences using a Markov chain analysis (see ‘Methods’). We found that the behavior at the end of the complexity condition was Markovian (Fig 3A). In other words, the information about the immediately preceding transition (i.e. to the left or to the right) was necessary to determine the following one (e.g. P(L) ≠ P(L|L)), but looking two steps back was not informative about future decisions (e.g. P(L|LL) ≈ P(L|L)). The analysis of the distribution of subsequences of length 10 (see ‘Methods’) provides additional evidence of the lack of structure in the animals’ choice sequences. Indeed, while at the end of the training mice exhibited a peaked distribution with a strong preference for the highly repetitive circular patterns, the distribution was dramatically flattened under the complexity condition (Fig 3B). Furthermore, we tested whether mice used a win-stay-lose-switch strategy 17. Indeed, mice could have used this heuristic strategy when first confronted with the complexity condition after a training phase in which all targets were systematically rewarded. Changing directions in the absence of reward could have introduced enough variability in the animals’ sequences to improve their success rate. Yet, we found that being rewarded (or not) had no effect on the next transition, neither at the beginning nor at the end of the complexity condition (Fig 3C), thus eliminating another potential form of structure in mouse behavior under the complexity rule.
To further support the notion that mice did not actually memorize rewarded sequences to solve the task, we finally performed a series of experiments to compare the animals’ behavior under the complexity rule and under a probabilistic rule in which all targets were rewarded with a 75% probability (the same frequency reached at the end of the complexity condition). We first analyzed mice behavior when the complexity condition was followed by the probabilistic condition (Group 1 Fig 4A). We hypothesized that, if animals choose randomly at each node in the complexity setting (and thus do not memorize and repeat specific sequences), they would not detect the change of the reward distribution rule when switching to the probabilistic setting. In agreement with our assumption, we observed that as we switched to the probabilistic condition, animals did not modify their behavior although the optimal strategy would have been to avoid U-turns, as observed in the 100% reward setup used for training (Fig 4B and Supp Fig 3A). Hence, after the complexity setting, mice were likely stuck in a “random” mode given that the global statistics of the reward delivery were conserved. In contrast, when mice were exposed to the probabilistic distribution of reward right after the training session (Group 2 Fig 4A), they slightly changed their behavior but mostly stayed in a circular pattern with few U-turns and low sequence complexity (Fig 4B and Supp Fig 3A). Thus, animals from Group 2 exhibited lower sequence complexity and U-turn rate in the probabilistic condition than animals from Group 1, whether in the complexity or the probabilistic condition (Fig 4C). The distribution of patterns of length 10 in the sequences performed by animals from Group 2 during the last probabilistic session shows a preference for repetitive circular patterns that is very similar to that observed at the end of the training; contrasting with the sequences performed by animals from Group 1 (Fig 4D). 
A larger portion of sequences performed by animals from Group 1 were not significantly different from surrogate sequences generated randomly in comparison with animals from Group 2 (Supp Fig 3B). Last, if the sequences performed by mice from Group 2 had been executed under the complexity rule, these animals would have obtained a significantly lower success rate than animals from Group 1 in the complexity condition (Supp Fig 3C). In summary, mouse behavior under the probabilistic condition changed significantly depending on the preceding condition and the strategy that the animal was adopting. This further supports our initial claim that stochastic experimental setups made it difficult to unravel the mechanisms underlying random behavior generation. On the other hand, the complexity rule used in our experiments makes it possible to categorize animals’ behavior into one of three possible strategies (i.e. repetitive, random or optimal). Overall, our results indicate no evidence of sequence memorization nor any behavioral pattern that might have been used by mice as a heuristic to solve the complex task.
Whether and how the brain can generate random patterns has always been puzzling 22. In this study, we addressed two fundamental aspects of this matter: the implication of memory processes and the dependence upon external (environmental) factors. Regarding memory, one hypothesis holds that in humans the process of generating random patterns leverages memory 23, for example to ensure the equality of response usage 24, whereas a second hypothesis suggests that the lack of memory may help eliminate counterproductive biases 25,26. In our task, mice did not use their memory, suggesting that the brain is able to effectively achieve high variability by suppressing biases and structure, at least in some contexts. The second aspect is the degree of dependence upon external, environmental factors. Exploration and choice variability are generally studied by introducing stochasticity and/or volatility in environmental outcomes 15-18. However, such conditions make it difficult to interpret the animal’s strategy and to know whether the observed variability in the mouse choice is inherited or not from the statistics of the behavioral task. In this work, we took a step further toward understanding the processes underlying the generation of variability per se, independently from environmental conditions. Confronted with a deterministic task that nevertheless favors complex choice sequences, mice avoided repetitions by engaging in a behavioral mode where decisions were random and independent from their reward history. Animals adaptively tuned their decision-making parameters to increase choice randomness, which suggests an internal process of randomness generation.
Methods
Animals
Male C57BL/6J (WT) mice obtained from Charles River Laboratories France (L’Arbresle Cedex, France) were used. Mice arrived at the animal facility at 8 weeks of age, and were housed individually for at least 2 weeks before the electrode implantation. Behavioral tasks started one week after implantation to ensure full recovery. Since intracranial self-stimulation (ICSS) does not require food deprivation, all mice had ad libitum access to food and water except during behavioral sessions. The temperature (20-22 °C) and humidity were automatically controlled and a 12/12 h light-dark cycle (lights on at 8:30 a.m.) was maintained in the animal facility. All experiments were performed during the light cycle, between 9:00 a.m. and 5:00 p.m. Experiments were conducted at Sorbonne University, Paris, France, in accordance with the local regulations for animal experiments as well as the recommendations for animal experiments issued by the European Council (directives 219/1990 and 220/1990).
ICSS
Mice were placed in a stereotaxic frame and implanted unilaterally with bipolar stimulating electrodes for ICSS in the medial forebrain bundle (MFB, anteroposterior = 1.4 mm, mediolateral = ±1.2 mm, from the bregma, and dorsoventral = 4.8 mm from the dura). After recovery from surgery (1 week), the efficacy of electrical stimulation was verified in an open field with an explicit square target (side = 1 cm) at its center. Each time a mouse was detected in the area (D = 3 cm) of the target, a 200-ms train of twenty 0.5-ms biphasic square waves pulsed at 100 Hz was generated by a stimulator. Mice self-stimulating at least 50 times in a 5-minute session were kept for the behavioral sessions. In the training condition, ICSS intensity was adjusted so that mice self-stimulated between 50 and 150 times per session at the end of the training (ninth and tenth sessions). The current intensity was then kept the same throughout the different settings.
Complexity task
In the complexity condition, reward delivery was determined by an algorithm that estimated the grammatical complexity of animals’ choice sequences. More specifically, at a trial in which the animal was at the target location A and had to choose between B and C, we compared the LZ-complexity 20 of the subsequences comprised of the 9 past choices and B or C. Both choices were rewarded if those subsequences were of equal complexity. Otherwise, only the option making the subsequence of highest complexity was rewarded.
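The reward rule described above can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the parsing is one common Lempel-Ziv (1976) style phrase count, and the convention of not counting a trailing phrase that introduces no new symbol is an assumption, chosen here so that a strictly repetitive sequence (A,B,C,A,B,C,…) is never rewarded, as stated in the main text.

```python
def lz_complexity(seq):
    """Count phrases in a Lempel-Ziv (1976) style parsing of seq.

    Assumption: a trailing phrase that is a pure copy of earlier
    material (no new symbol) is not counted.
    """
    s = "".join(seq)
    n = len(s)
    i = count = 0
    while i < n:
        length = 1
        # Extend the phrase while it can still be copied from earlier text
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        if i + length > n:
            break  # sequence ended inside a copy: unfinished phrase
        count += 1
        i += length
    return count


def rewarded_targets(history):
    """Return the set of rewarded targets given the 9 past choices.

    history[-1] is taken to be the animal's current location.
    """
    current = history[-1]
    options = [t for t in "ABC" if t != current]
    scores = {t: lz_complexity(list(history) + [t]) for t in options}
    best = max(scores.values())
    # Both options are rewarded when their subsequences tie in complexity
    return {t for t in options if scores[t] == best}


# Circular pattern: continuing it (A) is not rewarded, the U-turn (B) is
print(rewarded_targets(list("ABCABCABC")))
```

With this convention, continuing a circular pattern yields the lower-complexity subsequence, so only the change of direction is rewarded.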
Measures of choice variability
Two measures of complexity were used to analyze mouse behavior. First, the normalized LZ-complexity (referred to as NLZcomp or simply complexity throughout the paper) which corresponds to the LZ-complexity divided by the average LZ-complexity of 1000 sequences of the same length generated randomly (a surrogate) with the constraint that two consecutive characters could not be equal, as in the experimental setup. NLZcomp is small for highly repetitive sequence and is close to 1 for uncorrelated, random signals. Second, the entropy of the frequency distribution of the diagonal length (noted ENT), taken from recurrence quantification analysis (RQA). RQA is a series of methods in which the dynamics of complex systems are studied using recurrence plots (RP) 27,28 where diagonal lines illustrate recurrent patterns. Thus, the entropy of diagonal lines reflects the deterministic structure of the system and is smaller for uncorrelated, random signals. RQA was measured using the Recurrence-Plot Python module of the “pyunicorn.timeseries” package.
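As an illustration of the normalization described above, the sketch below divides the LZ-complexity of a sequence by the mean LZ-complexity of random surrogates that respect the no-immediate-repeat constraint. The function names, the seed handling and the phrase-counting convention of `lz_complexity` are assumptions made for this example; the entropy measure (ENT) is not reproduced here.

```python
import random


def lz_complexity(s):
    """Lempel-Ziv (1976) style phrase count (trailing pure copy not counted)."""
    n = len(s)
    i = count = 0
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        if i + length > n:
            break
        count += 1
        i += length
    return count


def random_surrogate(length, rng, targets="ABC"):
    """Random sequence in which two consecutive characters are never equal."""
    seq = [rng.choice(targets)]
    while len(seq) < length:
        seq.append(rng.choice([t for t in targets if t != seq[-1]]))
    return "".join(seq)


def nlz_complexity(seq, n_surrogates=1000, seed=0):
    """LZ-complexity of seq divided by the mean over random surrogates."""
    rng = random.Random(seed)
    s = "".join(seq)
    baseline = sum(
        lz_complexity(random_surrogate(len(s), rng)) for _ in range(n_surrogates)
    ) / n_surrogates
    return lz_complexity(s) / baseline


# A circular sequence scores well below 1; a random one scores close to 1
print(nlz_complexity("ABC" * 20, n_surrogates=200))
print(nlz_complexity(random_surrogate(60, random.Random(1)), n_surrogates=200))
```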
Computational models
The task was represented as a Markov Decision Process (MDP) with three states s ∈ {A, B, C} and three actions a ∈ {GoToA, GoToB, GoToC}, respectively corresponding to the rewarded locations and the transitions between them. State-action values Q(s,a) were learned using the Rescorla-Wagner rule 21:
$$Q(\mathbf{s}_t, a_t) \leftarrow Q(\mathbf{s}_t, a_t) + \alpha \left( U - Q(\mathbf{s}_t, a_t) \right)$$
where $\mathbf{s}_t = [s_t, s_{t-1}, \ldots, s_{t-m}]$ is the current state, which may include the memory of up to the m-th past location, $a_t$ is the current action, α is the learning rate and U is the utility function. The utility combines the reward function r with the U-turn cost parameter κ, which models the motor cost or any bias against the action leading the animal back to its previous location. The U-turn cost was necessary to reproduce mouse stereotypical trajectories at the end of the training phase (see Supp Fig 2).
Action selection was performed using a softmax policy, meaning that in state $\mathbf{s}_t$ the action $a_t$ is selected with probability:
$$P(a_t \mid \mathbf{s}_t) = \frac{\exp\left(Q(\mathbf{s}_t, a_t)/\tau\right)}{\sum_{a} \exp\left(Q(\mathbf{s}_t, a)/\tau\right)}$$
where τ is the temperature parameter. A higher temperature reduces the sensitivity to the difference in action values, thus increasing the amount of noise or randomness in decision-making. The U-turn cost κ has the opposite effect since it represents a behavioral bias and constrains choice randomness. We refer to the hyperparameter defined as ρ = τ / κ as the randomness parameter.
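The learning and action-selection steps can be sketched as a small tabular simulation. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `softmax_probs`, `simulate` and `u_turn_rate` are hypothetical, the default parameter values are arbitrary, and the utility is assumed to be the reward minus κ when the action is a U-turn, which is only one plausible form of the utility function.

```python
import math
import random
from collections import defaultdict

TARGETS = "ABC"


def softmax_probs(q_values, tau):
    """Map {action: Q} to {action: probability} at temperature tau."""
    top = max(q_values.values())  # subtract the max for numerical stability
    weights = {a: math.exp((q - top) / tau) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def simulate(reward_fn, n_trials, m=1, alpha=0.1, tau=0.1, kappa=1.0, seed=0):
    """Run the agent; the state holds the current and the m previous locations."""
    rng = random.Random(seed)
    q = defaultdict(float)
    path = ["A", "B"]  # arbitrary start: the agent came from A and is at B
    for _ in range(n_trials):
        state = tuple(path[-(m + 1):])
        options = [t for t in TARGETS if t != path[-1]]
        probs = softmax_probs({a: q[(state, a)] for a in options}, tau)
        action = rng.choices(options, weights=[probs[a] for a in options])[0]
        reward = reward_fn(path, action)
        # Assumed utility: the U-turn cost is subtracted from the reward
        utility = reward - kappa if action == path[-2] else reward
        # Rescorla-Wagner update of the chosen state-action value
        q[(state, action)] += alpha * (utility - q[(state, action)])
        path.append(action)
    return path, q


def u_turn_rate(path):
    """Fraction of choices that return to the location visited two steps ago."""
    return sum(a == c for a, c in zip(path, path[2:])) / (len(path) - 2)


# Training-like condition: every choice is rewarded
path, q = simulate(lambda path, action: 1.0, n_trials=3000)
print(u_turn_rate(path))
```

When every choice is rewarded, the U-turn cost alone biases the agent toward circular trajectories, which is the role the text attributes to κ at the end of training.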
In the version referred to as BasicRL (see Supp Fig 2), we did not include any memory of previous locations nor any U-turn cost. In other words, m = 0 (i.e. $\mathbf{s}_t = [s_t]$) and κ = 0.
To manipulate state representation ambiguity (see Fig 2), each of the locations {A, B, C} could be represented by n ≥ 1 states. For simplicity, we used n = 1, 2 and 3 for all locations for what we referred to as ‘null’, ‘low’ and ‘med’ levels of ambiguity. This allowed us to present a proof of concept regarding the potential impact of using a perfect state representation in our model.
Model Fitting
The main model-fitting results presented in this paper were obtained by fitting the behavior of the mice under training and complexity conditions session by session independently. This process aimed to determine which values of the two hyperparameters m and ρ= τ / κ make the model behave as mice in terms of success rate (i.e. percentage of rewarded actions) and complexity (i.e. variability of decisions). Our main goal was to decide between the two listed strategies that can solve the task: repeating rewarded sequences or choosing randomly. Therefore, we momentarily put aside the question of learning speed and only considered the model behavior after convergence. α was set to 0.1 in these simulations.
Hyperparameters were selected through random search 29 (see ranges listed in Supplementary Table 1). The model was run for 2 × 10^6 iterations for each parameter set. The fitness score with respect to mice average data at each session was calculated from S̄ and C̄, the average success rate and complexity in mice, and Ŝ and Ĉ, the model success rate and complexity, all four ∈ [0,1]. Simulations were long enough for the learning to converge. Thus, instead of multiple runs for each parameter set, which would have been computationally costly, Ŝ and Ĉ were averaged over the last 10 simulated sessions. We considered that 1 simulated session = 200 iterations, which is an upper bound for the number of trials performed by mice in one actual session.
Since mice were systematically rewarded during training, their success rate under this condition was not meaningful. Thus, to assess the ability of the model to reproduce stereotypically circular trajectories in the last training session, we replaced Ŝ and S̄ in Equation (5) by Û and Ū, representing the average U-turn rates for the model and for mice respectively.
Additional simulations were conducted with two goals: 1) to test whether one single parameter set could fit mouse behavior without the need to change parameter values over sessions, and 2) to test the influence of state representation ambiguity on memory use in the computational model. Therefore, each simulation attempted to reproduce mouse behavior from training to the complexity condition. Hence, the learning rate α was optimized in addition to the previously mentioned m and ρ = τ / κ hyperparameters (see ranges listed in Supplementary Table 1). Each parameter set was tested over 20 different runs. Each run is a simulation of 4000 iterations, which amounts to 10 training sessions and 10 complexity sessions since simulated sessions consist of 200 iterations. The fitness score was computed as the average score over the last training session and the 10 complexity sessions using Equations (4) and (5). Using a grid search ensured comparable values for different levels of ambiguity (‘null’, ‘low’ and ‘med’; see previous section). Given the additional computational cost induced by higher ambiguity levels, we gradually decreased the upper bound of the memory size range in order to avoid long and useless computations in uninteresting regions of the search space.
Markov chain analysis
Markov chain analysis makes it possible to describe mathematically the dynamic behavior of a system, i.e. transitions from one state to another, in probabilistic terms. A process is a first-order Markov chain (or more simply Markovian) if the transition probability from a state A to a state B depends only on the current state A and not on the previous ones. Put differently, the current state contains all the information that could influence the realization of the next state. A classical way to demonstrate that a process is Markovian is to show that the sequence cannot be described by a zeroth-order process, i.e. that P(B|A) ≠ P(B), and that the second-order probability is not required to describe the state transitions, i.e. that P(B|A) = P(B|AC).
In this paper, we analyzed the 0th, 1st and 2nd order probabilities in sequences performed by each mouse in the last session of the complexity condition (c10). Using the targets A, B, and C as the Markov chain states would have provided a limited amount of data. Instead, we described states as movements to the left (L) or to the right (R), thereby obtaining larger pools of data (e.g. R = A → B, B → C, C → A) and a more compact description (e.g. two 0th order groups instead of three). To assess the influence of rewards on mouse decisions when switching to the complexity condition (i.e. win-stay-lose-switch strategy), we also compared the probability of going forward P(F) or backward P(U) with the conditional probabilities given the presence or absence of reward (e.g. P(F|rw) or P(U|unrw)). In this case, F = R → R, L → L and U = R → L, L → R.
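The recoding into left/right movements and the estimation of 0th, 1st and 2nd order probabilities can be sketched as follows. The helper names `to_moves` and `conditional_prob` are hypothetical, and the estimates are plain relative frequencies; the statistical comparison between orders is not shown.

```python
# Rightward movements, following the convention R = A -> B, B -> C, C -> A
RIGHT = {("A", "B"), ("B", "C"), ("C", "A")}


def to_moves(targets):
    """Recode a sequence of visited targets as a sequence of L/R movements."""
    return ["R" if (a, b) in RIGHT else "L" for a, b in zip(targets, targets[1:])]


def conditional_prob(moves, move, context=""):
    """Estimate P(move | context), where context is the string of preceding moves.

    An empty context gives the 0th order probability, "L" a 1st order
    probability such as P(L|L), and "LL" a 2nd order one such as P(L|LL).
    """
    k = len(context)
    hits = total = 0
    for i in range(k, len(moves)):
        if "".join(moves[i - k:i]) == context:
            total += 1
            hits += moves[i] == move
    return hits / total if total else float("nan")


moves = to_moves("ABCABACBA")
print(moves)
print(conditional_prob(moves, "L"), conditional_prob(moves, "L", "L"))
```

A direction change between two consecutive movements (R followed by L, or L followed by R) corresponds to a U-turn, so the same counts give P(F) and P(U).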
Analysis of subsequences distribution
All patterns starting by A were extracted and pooled from the choice sequences of mice in the last sessions of the three conditions (training, complexity, probabilistic). The histograms represent the distribution of these patterns following the decision tree structure. In other words, two neighbor branches shared the same prefix.
Bayesian model comparison
Bayesian model comparison aims to quantify the support for a model over another based on their respective likelihoods P(D|M), i.e. the probability that data D are produced under the assumption of model M. In our case, it is useful to compare the fitness of the model M_ind, fitted session by session independently, with that of the model M_con, fitted to all sessions in a continuous way. Since these models do not produce explicit likelihood measures, we used approximate Bayesian computation: considering the 15 best fits (i.e. the 15 parameter sets that granted the highest fitness score), we estimated the models’ likelihood as the fraction of (Ŝ, Ĉ) pairs that were within the confidence intervals of mouse data. Then, the Bayes factor was calculated as the ratio between the two competing likelihoods:
$$B = \frac{P(D \mid M_{ind})}{P(D \mid M_{con})}$$
B > 3 was considered to be substantial evidence in favor of M_ind over M_con 30.
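The approximate likelihood and the resulting Bayes factor can be illustrated with a short sketch. Everything below is hypothetical: the function names, the toy (Ŝ, Ĉ) pairs and the confidence intervals are invented for illustration and are not results from this study, and treating a pair as a match only when both values fall inside their intervals is an assumption.

```python
def abc_likelihood(fits, s_interval, c_interval):
    """Fraction of (success, complexity) pairs inside both confidence intervals."""
    inside = sum(
        s_interval[0] <= s <= s_interval[1] and c_interval[0] <= c <= c_interval[1]
        for s, c in fits
    )
    return inside / len(fits)


def bayes_factor(likelihood_ind, likelihood_con):
    """Ratio of the two approximate likelihoods, B = P(D|M_ind) / P(D|M_con)."""
    return likelihood_ind / likelihood_con


# Toy values only (not data from the paper)
s_ci, c_ci = (0.65, 0.75), (0.85, 0.95)
fits_ind = [(0.70, 0.90), (0.68, 0.88), (0.72, 0.93), (0.80, 0.90)]
fits_con = [(0.70, 0.90), (0.60, 0.80), (0.78, 0.90), (0.62, 0.70)]
b = bayes_factor(abc_likelihood(fits_ind, s_ci, c_ci),
                 abc_likelihood(fits_con, s_ci, c_ci))
print(b)
```

In the paper's procedure each list would hold the 15 best-fitting parameter sets of the corresponding model rather than four toy pairs.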
Statistical analysis
No statistical methods were used to predetermine sample sizes. Our sample sizes are comparable to many studies using similar techniques and animal models. The total number of observations (N) in each group as well as details about the statistical tests were reported in figure captions. Error bars indicate 95% confidence intervals. Parametric statistical tests were used when data followed a normal distribution (Shapiro test with p > 0.05) and non-parametric tests when they did not. As parametric tests, we used t-test when comparing two groups or ANOVA when more. Homogeneity of variances was checked preliminarily (Bartlett’s test with p > 0.05) and the unpaired t-tests were Welch-corrected if needed. As non-parametric tests, we used Mann-Whitney test when comparing two independent groups, Wilcoxon test when comparing two paired groups and Kruskal-Wallis test when comparing more than two groups. All statistical tests were applied using the scipy.stats Python module. They were all two-sided except Mann-Whitney. p > 0.05 was considered to be statistically non-significant.
Author Contributions
PF designed the behavioral experiment. PF, EB, RDC, MD, ED, TAY, BH and MC performed the behavioral experiments, PF and MB analyzed the behavioral data. MB developed the computational model, SD developed some acquisition tools. JN and OS contributed to modeling studies and data analysis. MB, PF, JN and OS wrote the manuscript with inputs from AM.
Competing interests
The authors declare no competing interests.
Acknowledgements
This work was supported by the Centre National de la Recherche Scientifique CNRS UMR 8246 and UMR 7222, the Labex SMART (ANR-11-LABX-65) supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-IDEX-0004-02, the Foundation for Medical Research (FRM, Equipe FRM DEQ2013326488 to P.F.) and the French National Cancer Institute Grant TABAC-16-022 (to P.F.). P.F.’s team is part of the École des Neurosciences de Paris Ile-de-France RTRA network and a member of LabEx Bio-Psy.
Footnotes
Introduction, task description and discussion updated to clarify; Figure 3 updated and Figure 4 added to describe additional results.