Abstract
How do children and adults differ in their search for rewards? We consider three different hypotheses that attribute developmental differences to either children’s increased random sampling, more directed exploration towards uncertain options, or narrower generalization. Using a search task in which noisy rewards are spatially correlated on a grid, we compare 55 younger children (age 7-8), 55 older children (age 9-11), and 50 adults (age 19-55) in their ability to successfully generalize about unobserved outcomes and balance the exploration-exploitation dilemma. Our results show that children explore more eagerly than adults, but obtain lower rewards. Building a predictive model of search to disentangle the unique contributions of the three hypotheses of developmental differences, we find robust and recoverable parameter estimates indicating that children generalize less and rely on directed exploration more than adults. We do not, however, find reliable differences in terms of random sampling.
Introduction
Alan Turing famously believed that in order to build a General Artificial Intelligence, one must create a machine that can learn like a child (1950). Indeed, recent advances in machine learning often contain references to child-like learning and exploration (Lake, Ullman, Tenenbaum, & Gershman, 2017; Riedmiller et al., 2018). Yet little is known about how children actually explore and search for rewards in their environments, and in what ways their behavior differs from adults.
In the course of learning through interactions with the environment, all organisms (biological or machine) are confronted with the exploration-exploitation dilemma (Mehlhorn et al., 2015). This dilemma highlights two opposing goals. The first is to explore unfamiliar options that provide useful information for future decisions, yet may result in poor immediate rewards. The second is to exploit options known to have high expectations of reward, but potentially forgo learning about unexplored options.
In addition to balancing exploration and exploitation, another crucial ingredient for adaptive search behavior is a mechanism that can generalize beyond observed outcomes, thereby guiding search and decision making by forming inductive beliefs about novel options. For example, from a purely combinatorial perspective, it only takes a few features and a small range of values to generate a pool of options vastly exceeding what could ever be explored in a lifetime. Nonetheless, humans of all ages manage to generalize from limited experiences in order to choose from amongst a set of potentially unlimited possibilities. Thus, a model of human search also needs to provide a mechanism for generalization.
Previous research has found extensive variability and developmental differences in children’s and adults’ search behavior, which not only result from a progressive refinement of basic cognitive functions (e.g., memory or attention), but also derive from systematic changes in the computational principles driving behavior (Palminteri, Kilford, Coricelli, & Blakemore, 2016). In particular, developmental differences in learning and decision making have been explained by appealing to three hypothesized mechanisms: children sample more randomly, explore more eagerly, or generalize more narrowly than adults.
In this paper, we investigate how these three mechanisms are able to explain developmental differences in exploration-exploitation behavior. We provide a precise characterization of these competing ideas in a formal model, which is used to predict behavior in a search task, where noisy and continuous rewards are spatially correlated. Using behavioral markers, interpreting parameter estimates from computational models, and analyzing judgments about unexplored options, our results converge on the finding that children generalize less, but engage in more directed exploration than adults. We do not, however, find reliable developmental differences in random exploration. These results enrich our understanding of maturation in learning and decision making, demonstrating that children explore using uncertainty-guided mechanisms rather than simply behaving more randomly.
A tale of three mechanisms
Development as cooling off
Because optimal solutions to the exploration-exploitation dilemma are generally intractable (Gittins & Jones, 1979), heuristic alternatives are frequently employed. In particular, learning under the demands of the exploration-exploitation trade-off has been described using at least two distinct strategies (Wilson, Geana, White, Ludvig, & Cohen, 2014). One such strategy is increased random exploration, which uses noisy, random sampling to learn about new options.
A key finding in the psychological literature is that children tend to try out more options than adults (Cauffman et al., 2010; Mata, Wilke, & Czienskowski, 2013). This has been interpreted as evidence for higher levels of random exploration in children, and has been loosely compared to algorithms of simulated annealing from computer science (Gopnik et al., 2017), where the amount of random exploration gradually reduces over time. Children can be described as having higher temperature parameters, where the learner initially samples very randomly across a large set of possibilities, before eventually focusing on a smaller subset (Gopnik, Griffiths, & Lucas, 2015). This temperature parameter is expected to “cool off” with age, leading to lower levels of random exploration in late childhood and adulthood.
Development as reduction of directed exploration
A second strategy to tackle the exploration-exploitation dilemma is to use directed exploration by preferentially sampling highly uncertain options in order to gain more information and reduce uncertainty about the environment. Directed exploration has been formalized by introducing an “uncertainty bonus” that values the exploration of lesser known options (Auer, 2002), with behavioral markers found in a number of studies (cf., Frank, Doll, Oas-Terpstra, & Moreno, 2009; Wu, Schulz, Speekenbrink, Nelson, & Meder, 2018).
Directed exploration treats information as intrinsically valuable by inflating rewards by their estimated uncertainty (Auer, 2002). This leads to a more sophisticated uncertainty-guided sampling strategy that could also explain developmental differences. Indeed, the literature on self-directed learning shows that children are clearly capable of exploring their environment in a systematic, directed fashion. Already infants tend to value the exploration of uncertain options (L. E. Schulz, 2015), and children can balance theory and evidence in simple exploration tasks (Bonawitz, van Schijndel, Friel, & Schulz, 2012) and are able to efficiently adapt their search behavior to different environmental structures (Ruggeri & Lombrozo, 2015). Moreover, children can sometimes even outperform adults in the self-directed learning of unusual relationships (Lucas, Bridgers, Griffiths, & Gopnik, 2014). Both directed and random exploration do not have to be mutually exclusive mechanisms, with recent research finding signatures of both types of exploration in adolescent and adult participants (Gershman, 2018; Somerville et al., 2017; Wilson et al., 2014).
Development as refined generalization
Rather than explaining development as a change in how we explore given some beliefs about the world, generalization-based accounts attribute developmental differences to the way we form our beliefs in the first place. Many studies have shown that human learners use structured knowledge about the environment to guide exploration (Acuna & Schrater, 2009; E. Schulz, Konstantinidis, & Speekenbrink, 2017), where the quality of these representations and the way that people utilize them to generalize across experiences can have a crucial impact on search behavior. Thus, development of more complex cognitive processes (Blanco et al., 2016), leading to broader generalizations, could also account for the observed developmental differences in sampling behavior.
The notion of generalization as a mechanism explaining developmental differences has a long standing history in psychology. For instance, Piaget (1964) assumed that children learn and adapt to different situational demands by the processes of assimilation (applying a previous concept to a new task) and accommodation (changing a previous concept in the face of new information). Expanding on Piaget’s idea, Klahr (1982) proposed generalization as a crucial developmental process, in particular the mechanism of regularity detection, which supports generalization and improves over the course of development. More generally, the implementation of various forms of decision making (Hartley & Somerville, 2015) could be constrained by the capacity for complex cognitive processes, which become more refined over the life span. For example, although younger children attend more frequently to irrelevant information than older children (Hagen & Hale, 1973), they can be prompted to attend to the relevant information by marking the most relevant cues, whereupon they eventually select the best alternative (Davidson, 1996). Thus, children may indeed be able to apply uncertainty-driven exploratory strategies, but lack the appropriate task representation to successfully implement them.
A task to study generalization and exploration
We study the behavior of both children and adults in a spatially correlated multi-armed bandit task (Fig. 1a; cf. Wu et al., 2018), where rewards are distributed on a grid characterized by spatial correlation (i.e., high rewards cluster together; see Fig. 1g) and the search horizon is vastly smaller than the number of options. Efficient search and accumulation of rewards in such an environment requires two critical components. First, participants need to learn about the underlying spatial correlation in order to generalize from observed rewards to unseen options. This is crucial because there are considerably more options than can be explored within the limited search horizon. Second, participants need a sampling strategy that achieves a balance between exploring new options and exploiting known options with high rewards.
Methods
Participants
We recruited 55 younger children (range: 7 to 9, 26 female, Mage=7.53; SD=0.50), 55 older children (range: 9 to 11, 24 female, Mage=9.95; SD=0.80), and 50 adults (25 female, Mage=33.76; SD= 8.53) from museums in Berlin, Germany. Participants were paid up to €3.50 for taking part in the experiment, contingent on performance (range: €2.00 to €3.50, Mreward=€2.67; SD=0.50). Informed consent was obtained from all participants.
Design
The experiment used a between-subjects design, where participants were randomly assigned to one of two different classes of environments (Fig. 1g), with smooth environments having stronger spatial correlations than rough environments. We generated 40 of each class of environments from a radial basis function kernel (see below), with λsmooth = 4 and λrough = 1. On each round, a new environment was sampled (without replacement) from the set 40 environments, which was then used to define a bivariate function on the grid, with each observation including additional normally distributed noise ϵ ∼ 𝒩 (0,1). The task was presented over ten rounds on different grid worlds drawn from the same class of environments. The first round was a tutorial round and the last round was a bonus round in which participants sampled for 15 trials and then had to generate predictions for five randomly chosen tiles on the grid. Participants had a search horizon of 25 trials per grid, including repeat clicks.
Materials and procedure
Participants were introduced to the task through a tutorial round and were required to correctly complete three comprehension questions prior to continuing the task. At the beginning of each round, one random tile was revealed and participants could click on any of the tiles (including re-clicks) in the grid until the search horizon was exhausted. Clicking an unrevealed tile displayed the numerical value of the reward along with a corresponding color aid, where darker colors indicated higher rewards. Per round, observations were scaled to a randomly drawn maximum value in the range of 35 to 45, so that the value of the global optima could not be easily guessed. Re-clicked tiles could show some variations in the observed value due to noise. For repeat clicks, the most recent observation was displayed numerically, while the color of the tile corresponded to the mean of all previous observations. In the bonus round, participants sampled for 15 trials and were then asked to generate predictions for five randomly selected and previously unobserved tiles. Additionally, participants had to indicate how certain they were about their prediction on a scale from 0 to 10. Afterwards, they had to select one of the five tiles before continuing with the round.
A combined model of generalization and exploration
We use a formal model that combines generalization with a sampling strategy accounting for both directed and random exploration (cf. Wu et al., 2018), and use it to predict out-of-sample search behavior. The generalization component is based on Gaussian Process (GP) regression, which is a Bayesian function learning approach theoretically capable of learning any stationary function (Rasmussen & Williams, 2006) and has been found to effectively describe human behavior in explicit function learning tasks (Lucas, Griffiths, Williams, & Kalish, 2015). The GP component is used to adaptively learn a value function, which generalizes the limited set of observed rewards over the entire search space using Bayesian inference.
The GP prior is completely determined by the choice of a kernel function k(x, x′), which encodes assumptions about how points in the input space are related to each other. A common choice of this function is the radial basis function: where the length-scale parameter λ encodes the extent of spatial generalization between options (tiles) in the grid. The assumptions of this kernel function are similar to the gradient of generalization historically described by Shepard (1987), which also models generalization as an exponentially decaying function of the stimuli’s similarity distance (see Fig. 1h), which has been observed across a wide range of stimuli and organisms. As an example, generalization with λ = 1 corresponds to the assumption that the rewards of two neighboring tiles are correlated by r = 0.6, and that this correlation decays to zero for options further than three tiles apart. We treat λ as a free parameter in our model comparison in order to assess age-related differences in the capacity for generalization.
Given different possible options x to sample from (i.e., tiles on the grid), GP regression generates normally distributed beliefs about rewards with expectation µ(x) and estimated uncertainty σ(x) (Fig. 1b,c). A sampling strategy is then used to map the beliefs of the GP onto a valuation for sampling each option at a given time. Crucially, such a sampling strategy must address the exploration-exploitation dilemma. One frequently applied heuristic for solving this dilemma is Upper Confidence Bound (UCB) sampling (Srinivas, Krause, Kakade, & Seeger, 2009), which evaluates each option based on a weighted sum of expected reward and estimated uncertainty: where β models the extent to which uncertainty is valued positively and therefore directly sought out. This strategy corresponds to directed exploration because it encourages the sampling of options with higher uncertainty according to the underlying generalization model (see Fig. 1i). We treat the exploration parameter β as a free parameter to assess how much participants value the reduction of uncertainty (i.e., engage in directed exploration). As an example, an exploration bonus of β = 0.5 means participants would prefer an option x1 expected to have reward µ(x1) = 30 and uncertainty σ(x1) = 10, over option x2 expected to have reward µ(x2) = 34 and uncertainty σ(x2) = 1. This is because sampling x1 is expected to reduce a larger amount of uncertainty, even though x2 has a higher expected reward (UCB(x1) = 35 vs. UCB(x2) = 34.5).
Finally, we use a softmax function to map the upper confidence bound values, UCB(x), of our proposed Gaussian Process-Upper Confidence Bound sampling model onto choice probabilities: where τ is the temperature parameter governing the amount of randomness in sampling behavior. If τ is high (higher temperatures), then participants are assumed to sample more randomly, whereas if τ is low (cooler temperatures), the choice probabilities are concentrated on the highest valued options (Fig. 1j). Thus, τ encodes a searchers’ tendency towards random exploration. We treat τ as a free parameter to assess the extent of random exploration in children and adults (see Supplemental Material for alternative implementations such as ϵ-greedy sampling).
In summary, GP-UCB contains three different parameters: the length-scale λ capturing the extent of generalization, the exploration bonus β describing the extent of directed exploration, and the temperature parameter τ modulating random exploration. These three parameters directly correspond to the three postulated mechanisms of developmental differences in complex decision making tasks and can also be robustly recovered (see Supplemental Material).
Results
Behavioral results
Participants gained higher rewards in smooth than in rough environments (Fig. 2a; t(158) = 10.51, p < .001, d = 1.66, BF > 100), suggesting they made use of the spatial correlations and performed better when correlations were stronger. Adults performed better than older children (Fig. 2a; t(103) = 4.91, p < .001, d = 0.96, BF > 100), who in turn performed somewhat better than younger children (t(108) = 2.42, p = .02, d = 0.46, BF = 2.68). Analyzing the distance between consecutive choices (Fig. 2b) revealed that participants sampled more locally (smaller distances) in smooth compared to rough environments (t(158) = −3.83, p < .001, d = 0.61, BF > 100). Adults sampled more locally than older children (t(103) = −3.9, p < .001, d = 0.76, BF > 100), but there was no difference between younger and older children (t(108) = 1.76, p = .08, d = 0.34, BF = 0.80). Importantly, adults sampled less unique options than older children (14.5 vs. 21.7; t(103) = 6.77, d = 1.32, p < .001, BF > 100), whereas the two children groups did not differ in how many options they sampled (21.7 vs. 22.7; t(108) = 1.27, d = 0.24 p = .21, BF = 0.4).
Looking at the learning curves (i.e., averaged rewards over trials; Fig. 2c), we found a positive rank-correlation between mean rewards and trial number (Spearman’s ρ = .12, t(159) = 6.12, p < .001, BF > 100). Although this correlation did not differ between the rough and smooth condition (t(158) = −0.43, p = .67, d = 0.07, BF = 0.19), it was significantly higher for adults than for older children (0.29 vs 0.08, t(103) = 5.90, p < .001, d = 1.15, BF > 100). The correlation between trials and rewards did not differ between younger and older children (0.04 vs 0.08; t(108) = −1.87, p = .06, d = 0.36, BF = 0.96). Therefore, adults learned faster, while children explored more extensively (see Supplemental Material for further behavioral analyses).
Model comparison
We compared the GP-UCB model with an alternative model that does not generalize across options but is a powerful Bayesian model for reinforcement learning across independent reward distributions (Mean Tracker; MT). Model comparisons are based on leave-one-round-out cross-validation error, where we fit each model combined with the Upper Confidence Bound sampling strategy to each participant using a training set omitting one round, and then assess predictive performance on the hold-out-round. Repeating this procedure for every participant and all rounds, we calculated the standardized predictive accuracy for each model (pseudo-R2 comparing out-of-sample log loss to random chance), where 0 indicates chance-level predictions and 1 indicates theoretically perfect predictions (see Supplemental Material for full model comparison with additional sampling strategies). The results of this comparison are shown in Fig. 2d. The GP-UCB model predicted participants’ behavior better overall (t(159) = 13.28, p < .001, d = 1.05, BF > 100), and also for adults (t(49) = 5.98, p < .001, d = 0.85, BF > 100), older (t(54) = 10.92, p < .001, d = 1.48, BF > 100) and younger children (t(54) = 6.77, p < .001, d = 0.91, BF > 100). The GP-UCB model predicted adults’ behavior better than that of older children (t(103) = 4.33, p < .001, d = 0.85, BF > 100), which in turn was better predicted than behavior of younger children (t(108) = 3.32, p = .001, d = 0.63, BF = 24.8).
Developmental differences in parameter estimates
We analyzed the mean participant parameter estimates of the GP-UCB model (Fig. 2e) to assess the contributions of the three mechanisms (generalization, directed exploration, and random exploration) towards developmental differences.
We found that adults generalized more than older children, as indicated by larger λ-estimates (Mann-Whitney-U = 2001, p < .001, rτ = 0.32, BF > 100), whereas the two groups of children did not differ significantly in their extent of generalization (U = 1829, p = .06, rτ = 0.15, BF = 1.7). Furthermore, older children valued the reduction of uncertainty more than adults (i.e., higher β-values; U = 629, p < .001, rτ = 0.39, BF > 100), whereas there was no difference between younger and older children (U = 1403, p = .51, rτ = 0.05, BF = 0.2). Critically, whereas there were strong differences between age groups for the parameters capturing generalization and directed exploration, there was no reliable difference in the softmax temperature parameter τ, with no difference between older children and adults (W = 1718, p = .03, rτ = 0.17, BF = 0.7) and only anecdotal differences between the two groups of children (W = 1211, p = .07, rτ = 0.14, BF = 1.4). This suggests that the amount of random exploration did not reliably differ by age group (see Supplemental Materials for other implementations of random exploration).
Thus, our modeling results converge on the same conclusion as the behavioral results. Children explore more than adults, yet instead of exploring randomly, children’s exploration behavior seems to be directed toward options with high uncertainty. Additionally, our parameter estimates are robustly recoverable (see Supplemental materials) and can be used to simulate learning curves that reproduce the differences between the age groups as well as between smooth and rough conditions (Fig. 2f).
Bonus round
In the bonus round, each participant predicted the expected rewards and the underlying uncertainty for five randomly sampled unrevealed tiles after having made 15 choices on the grid. We first calculated the mean absolute error between predictions and the true expected value of rewards (Fig. 3a). Prediction error was higher for rough compared to smooth environments (t(158) = 4.93, p < .001, d = 0.78, BF > 100), reflecting the lower degree of spatial correlation that could be utilized to evaluate unseen options. Surprisingly, older children were as accurate as adults (t(103) = 0.28, p = .77, d = 0.05, BF = 0.2), but younger children performed worse than older children (t(108) = 3.14, p = .002, d = 0.60, BF = 15). Certainty judgments did not differ between the smooth and rough environments (t(158) = 1.13, p = .26, d = 0.18, BF = 0.2) nor between the different age groups (max-BF = 0.1).
Of particular interest is how judgments about the expectation of rewards and perceived uncertainty relate to the eventual choice from amongst the five options (implemented as a 5-alternative forced choice). We standardized the estimated reward and confidence judgment of each participant’s chosen tile by dividing by the sum of the estimates for all five options (Fig. 3c). Thus, larger standardized estimates reflect a larger contribution of either high reward or high certainty on the choice. Whereas there was no difference between age groups in terms of the estimated reward of the chosen option (max-BF = 0.1), we found that younger children preferred options with higher uncertainty slightly more than older children (t(108) = 2.22, p = .03, d = 0.42, BF = 1.8), and substantially more than adults (t(103) = 2.82, p = .006, d = 0.55, BF = 6.7). This further corroborates our previous analyses, showing that children’s sampling behavior is more directed toward uncertain options than adults’.
Discussion
We examined three potential sources of developmental differences in a complex learning and decision-making task: random exploration, directed exploration, and generalization. Using a paradigm that combines both generalization and search, we found that adults gained higher rewards and exploited more strongly, whereas children sampled more unique options, thereby gaining lower rewards but exploring the environment more extensively. Using a computational model with parameters directly corresponding to the three hypothesized mechanisms of developmental differences, we found that children generalized less and were guided by directed exploration more strongly than adults. They did not, however, explore more randomly than adults. Our results paint a rich picture of developmental trajectories in generalization and exploration, casting children decision makers not as merely prone to noisy sampling behavior, but as directed explorers who are hungry for information in their environment. Our conclusions are drawn from converging evidence combining analysis of behavioral data and computational modeling. Moreover, our findings are highly recoverable and also hold for other formalizations of random exploration instead of using the softmax temperature parameter (see Supplemental Materials).
Interestingly, related work by Somerville et al. (2017) also found no developmental difference in random exploration, but increasing directed exploration across early adolescence, which stabilized in adulthood. We believe that our results are not necessarily incompatible with that finding. Somerville and colleagues defined directed exploration using horizon-sensitive exploration (i.e,. strategic planning of exploration), whereas we define directed exploration as uncertainty-guided exploration via a greedy upper confidence bound algorithm. Thus, children may have higher tendencies towards directed exploration in a stepwise-greedy fashion, but fail to exhibit such tendencies when planning ahead for multiple steps, perhaps due to cognitive limitations. This opens up further possibilities for studying different mechanisms of directed exploration and how they relate to one another.
Our results showing a developmental increase in generalization can also be related to previous findings showing a developmental increase in the use of task structure knowledge in model-based reward learning (Decker, Otto, Daw, & Hartley, 2016). Because the generalization parameter λ can be mathematically linked to the speed of learning about the underlying function (Sollich, 1999), generalization and learning are equivalent in our task. There are however other uses of the term “generalization” in the psychological literature. For example, children are known to generalize words or categories more broadly, a tendency that decreases over time, trading-off with the capacity to form more precise episodic memories (Keresztes et al., 2017). While we focus on generalization in the sense used by Shepard (i.e., generalization across stimuli), it is an outstanding question how this type of generalization relates to word and category generalization. It would be a fruitful avenue for future research to connect these two domains in a unifying theory of generalization.
Ultimately, our results suggest that to fulfill Alan Turing’s dream of creating a child-like AI, we need to incorporate generalization and curiosity-driven exploration mechanisms (Riedmiller et al., 2018).
Author Contributions
All authors developed the study concept and contributed to the study design. E. Schulz and C.M. Wu performed the data analysis and interpretation under the supervision of B. Meder and A. Ruggeri. E. Schulz and C.M. Wu drafted the manuscript, and B. Meder and A. Ruggeri provided critical revisions. All authors approved the final version of the manuscript for submission.
Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.
Funding
This work was supported by the Max Planck Society and DFG-grant ME 3717/2-2 to BM. ES is supported by the Harvard Data Science Initiative. CMW is supported by the International Max Planck Research School on Adapting Behavior in a Fundamentally Uncertain World.
Open Practices
All code and data have been made publicly available and can be accessed at
Reviewed Supplementary Materials
Statistical tests
We report both frequentist and Bayesian statistics throughout the paper. Whereas frequentist tests are reported as either Student’s t-tests (for the behavioral data and model comparisons) or Mann-Whitney-U tests (for parameter comparisons), we rely on Bayes factors (BF) to quantify the relative evidence the data provide in favor of the alternative hypothesis (HA) over the null (H0).
For testing hypotheses regarding the behavioral data and the model comparison, we use the default two-sided Bayesian t-test for independent samples using a Jeffreys-Zellner-Siow prior with its scale set to , as suggested by Rouder, Speckman, Sun, Morey, and Iverson (2009). The prior is truncated below 0 for the directional tests performed to create Figure S1, which shows pairwise comparisons between the different models. All other statistical tests are non-directional as defined by a symmetric prior.
For testing hypotheses regarding the model parameters, we use the frequentist Mann-Whitney-U test and report Kendall’s rτ as an effect size. The Bayesian test is based on performing posterior inference over the test statistics and assigning a prior by means of a parametric yoking procedure (van Doorn, Ly, Marsman, & Wagenmakers, 2017). This then leads to a posterior distribution for Kendall’s rτ, and via the Savage-Dickey density ratio test, also yields an interpretable Bayes factor. The null hypothesis posits that parameters do not differ between the two groups, while the alternative hypothesis posits an effect and assigns an effect size using a Cauchy distribution with the scale parameter set to .
Other forms of random exploration
A softmax function with a temperature parameter τ is only one way to define random exploration. Another approach towards assessing random exploration is ϵ-greedy exploration. Given k number of arms (64 in our experiment), epsilon-greedy exploration chooses where ϵ is a free parameter. We test the ϵ-greedy method of exploration by using it instead of a softmax function in combination with the GP regression model and a UCB-sampling strategy. The results of this comparison show that the ϵ-greedy exploration model was systematically worse at predicting behavior than the softmax model reported in the main text (mean predictive accuracy: R2 = 0.21, t(159) = 6.67, p < .001, d = 0.53, BF > 100). Additionally the softmax model also had better predictive accuracy than the ϵ-greedy exploration model for adults (mean predictive accuracy: R2 = 0.26, t(49) = 9.29 p < .001, d = 1.31, BF > 100), and for older children (mean predictive accuracy: R2 = 0.21, t(54) = 3.60, p < .001, d = 0.49, BF = 38.5), but not for younger children (mean predictive accuracy: R2 = 0.17, t(54) = 0.33, p = .74, d = 0.04, BF = 0.2).
Next, we looked for age-related differences in the parameter estimates of the ϵ-greedy model1, specifically the directed exploration parameter β and the alternative random exploration parameter ϵ. As in the softmax-parameterized models, we find larger λ-estimates for adults than for older children (0.99 vs. 0.24, U = 1975, rτ = 0.31, p < .001, BF > 100), whereas the two children groups do not differ in their λ-estimates (0.31 vs. 0.24, U = 1299, rτ = 0.10, p < .20, BF = 0.4). Furthermore, we find more directed exploration (larger β parameters) for older children than for adults (17.30 vs. 5.38, U = 555, rτ = 0.42, p < .001, BF > 100), but no difference between the two groups of children (17.20 vs. 17.30, U = 1684,rτ = 0.0, p = .3, BF = 0.27). We also found a small difference in ϵ-greedy exploration parameter between adults and older children (0.00012 vs. 0.00014, U = 960,rτ = 0.21 p = .007, BF = 6.95), but not between the two groups of children (0.00014 vs. 0.00016, U = 1774, rτ = 0.12, p = .11, BF = 0.46). Notice that the relative proportion of random exploration decisions according to the ϵ-parameter estimates is so small, that over the 200 choices in our task, this accounts for a difference of 1 in every 250 choices. Thus, there is almost no practical difference in participants’ ϵ-parameters. The overall age-related effect in this analysis was also larger for directed exploration than for ϵ-greedy exploration (rτ = 0.40 vs. rτ = 0.25).
Thus, there are two reasons to believe that children are driven more strongly by directed than by random exploration. Firstly, the GP-UCB model combined with a softmax formulation of random exploration predicted participants better than an ϵ-greedy model, and finds no age-related difference in terms of random exploration described by the temperature parameter τ. Secondly, parameter estimates of the ϵ-greedy model find only small and practically meaningless age-related difference in ϵ-exploration, but again we find large age-related differences in the directed exploration parameter β.
Unreviewed Supplementary Materials
Full modeling results
We report a full model comparison of two models of learning each combined with three different sampling strategies (see also Fig. S7-S8). Different models of learning (i.e., Gaussian Process regression and Mean Tracker) are combined with different sampling strategies, in order to make predictions about where a participant will search next, given the history of previous observations. Table S1 contains the predictive accuracy, the number of participants best described, and the parameter estimates of each combination of a learning model and a sampling strategy. In total, cross-validated model comparisons for both models required approximately two days of computation time distributed on a cluster of 160 Dell C6220 nodes.
Models of Learning
Gaussian Process
We use Gaussian Process (GP) regression as a Bayesian model of generalization. A GP is defined as a collection of points, any subset of which is multivariate Gaussian. Let f: 𝒳 → Rn denote a function over input space 𝒳 that maps to real-valued scalar outputs. This function can be modeled as a random draw from a GP: where m is a mean function specifying the expected output of the function given input x, and k is a kernel (or covariance) function specifying the covariance between outputs:
Here, we fix the prior mean to the median value of unscaled payoffs, m(x) = 25, and use the kernel function to encode an inductive bias about the expected spatial correlations between rewards (see Radial Basis Function kernel below).
Conditional on observed data , where is drawn from the underlying function with added noise , we can calculate the posterior predictive distribution for a new input x* as a Gaussian with mean and variance given by: where y = [y1, …, yt]T, K is the t × t covariance matrix evaluated at each pair of observed inputs, and k* = [k(x1, x*), …, k(xt, x*)] is the covariance between each observed input and the new input x*.
Radial Basis Function kernel
We use the Radial Basis Function (RBF) kernel as a component of the 𝒢 𝒫 algorithm of generalization. The RBF kernel specifies the correlation between inputs x and x′ as
This kernel defines a universal function learning engine based on the principles of Bayesian regression and can model any stationary function. Note that sometimes the RBF kernel is specified as whereas we use λ = 2l2 as a more psychologically interpretable formulation. Intuitively, the RBF kernel models the correlation between points as an exponentially decreasing function of their distance. Here, λ modifies the rate of correlation decay, with larger λ-values corresponding to slower decays, stronger spatial correlations, and smoother functions. As λ → ∞, the RBF kernel assumes functions approaching linearity, whereas as λ → 0, there ceases to be any spatial correlation, with the implication that learning happens independently for each discrete input without generalization (similar to the assumption of the Mean Tracker model described below). We treat λ as a free parameter, and use cross-validated estimates to make inferences about the extent to which participants generalize.
Mean Tracker
The Mean Tracker model is implemented as a Bayesian updating model, which assumes the average reward associated with each option is constant over time (i.e., no temporal dynamics), as is the case in our experiment. In contrast to the GP regression model (which also assumes constant means over time), the Mean Tracker learns the rewards of each option independently, by computing an independent posterior distribution for the mean µj for each option j. We implemented a version that assumes rewards are normally distributed (as in the GP model), with a known variance but unknown mean, where the prior distribution of the mean is a normal distribution. This implies that the posterior distribution for each mean is also a normal distribution:
For a given option j, the posterior mean mj,t and variance vj,t are only updated when it has been selected at trial t: where δj,t = 1 if option j is chosen on trial t, and 0 otherwise. Additionally, yt is the observed reward at trial t, and Gj,t is defined as: where is the error variance, which is estimated as a free parameter. Intuitively, the estimated mean of the chosen option mj,t is updated based on the difference between the observed value yt and the prior expected mean mj,t-1, multiplied by Gj,t. At the same time, the estimated variance vj,t is reduced by a factor of 1 − Gj,t, which is in the range [0, 1]. The error variance can be interpreted as an inverse sensitivity, where smaller values result in more substantial updates to the mean mj,t, and larger reductions of uncertainty vj,t. We set the prior mean to the median value of payoffs mj,0 = 25 and the prior variance to vj,0 = 250.
Sampling strategies
Given the normally distributed posteriors of the expected rewards, which have mean µ(x) and uncertainty (formalized here as standard deviation) σ(x), for each search option x (for the Mean Tracker, we let µ(x) = mj,t and , where j is the index of the option characterized by x), we assess different sampling strategies that (combined with a softmax choice rule, Eq. S14) make probabilistic predictions about where participants would search next.
Upper Confidence Bound sampling
Given the posterior predictive mean µ(x) and its attached standard deviation , we calculate the upper confidence bound using a weighted sum where the exploration factor β determines how much reduction of uncertainty is valued (relative to exploiting known high-value options). We estimate β as a free parameter indicating participants’ tendency towards directed exploration.
Mean Greedy Exploitation
A special case of the Upper Confidence Bound sampling strategy (with β = 0) is a greedy exploitation component that only evaluates points based on their expected rewards
This sampling strategy only samples options with high expected rewards, i.e. greedily exploits the environment.
Variance Greedy Exploration
Another special case of the Upper Confidence Bound sampling strategy (with β → ∞) is a greedy exploration component which only samples points based on their predictive variance
This sampling strategy only cares about reducing uncertainty without attempting to generate high rewards.
Model comparison
We use maximum likelihood estimation (MLE) for parameter estimation, and cross-validation to measure out-of-sample predictive accuracy. A softmax choice rule transforms each model’s prediction into a probability distribution over options: where q(x) is the predicted value of each option x for a given model (e.g., q(x) = UCB(x) for the UCB model), and τ is the temperature parameter. Lower values of τ indicate more concentrated probability distributions, corresponding to more precise predictions and therefore less random sampling. All models include τ as a free parameter. Additionally, we estimate λ (generalization parameter) for the Gaussian Process regression model and θ2 (error variance) for the Mean Tracker model. Finally, we estimate β (exploration bonus) for the Upper Confidence Bound sampling strategy.
Cross validation
We fit all combinations of models and sampling strategies—per participant—using cross-validated MLE implemented via a Differential Evolution algorithm for optimization. Parameter estimates are constrained to positive values in the range [exp(−5), exp(5)]. We use leave-one-out cross-validation to iteratively form a training set by leaving out a single round, computing a MLE on the training set, and then generating out-of-sample predictions on the remaining round. This is repeated for all combinations of training set and test set. This cross-validation procedure yields one set of parameter estimates per round, per participant, and out-of-sample predictions for 200 choices per participant overall (rounds 2-9 × 25 choices).
Predictive accuracy
Prediction error (computed as log loss) is summed up over all rounds, and is reported as predictive accuracy, using a pseudo-R2 measure that compares the total log loss prediction error for each model to that of a random model: where log ℒ (ℳ rand) is the log loss of a random model (i.e., picking options with equal probability) and log ℒ (ℳ k) is the log loss of model k’s out-of-sample prediction error. Intuitively, R2 = 0 corresponds to prediction accuracy equivalent to chance, while R2 = 1 corresponds to theoretical perfect predictive accuracy, since log ℒ (ℳ k)/ log ℒ (ℳ rand) → 0 when log ℒ (ℳ k) ≪log ℒ (ℳ rand).
Parameter estimates
All parameter estimates were included in the statistical analyses, although we exclude outliers larger than 5 in Figures 1e, S4, S7, S8, and in Table S1. For GP-UCB estimates, 1.1% of λ-estimates, for 3.4% of β-estimates, and 2.7% of τ-estimates were removed this way. The presence of these outliers motivated us to use non-parametric tests (without removing outliers) to compare the different parameters across age groups, in order to achieve more robustness. However, the significance of our test results do not change even if we use parametric t-tests, but remove outliers before performing these tests. After removing values higher than 5, we still find that adults show higher λ-estimates than older children (t(103) = 3.88, p < .001, d = 0.76, BF > 100), who do not differ in their λ-estimates from younger children (t(108) = 1.00, p = .32, d = 0.19, BF = 0.3). Moreover, the β-estimates are higher for older children than for adults (t(103) = 4.41, p < .001, d = 0.87, BF > 100), but do not differ between the two groups of children t(108) = 1.41, p = .160, d = 0.27, BF = 0.5). Crucially, the random exploration parameters τ do neither differ between older children and adults (t(103) = .55, p = .58, d = 0.11, BF = 0.2), nor do they differ between the two groups of children (t(108) = 1.74, p = .08, d = 0.33, BF = 0.79).
Our criterion for outlier removal is less strict than other criteria, such as removing values higher than 1.5×IQR. In our data, this would remove 1.8% of λ-estimates, 5.1% of β-estimates, and 11.2% of τ-estimates. For completeness, we also reanalyzed the data after performing Tukey’s procedure of outlier removal, where we find the same results. Adults still show a higher λ-estimates than older children (t(102) = 3.87, p < .001, d = 0.76, BF > 100), who do not differ from younger children (t(108) = 1.04, p = .29, d = 0.20, BF = 0.3). The β-estimates are again higher for older children than for adults (t(100) = 4.69, p < .001, d = 0.93, BF > 100), and do not differ between the two groups of children (t(103) = 1.06, p = .29, d = 0.21, BF = 0.3). Finally, estimates for τ-parameter do not differ between adults and older children (t(103) = 1.36, p = .17, d = 0.26, BF = 0.5), nor between the two groups of children (t(108) = 0.96, p = .34, d = 0.18, BF = 0.3). Taken together, our results hold no matter which method of outlier removal is applied.
Full model comparison results
Figure S1 shows the log-scaled Bayes Factor of every combination of each learning model with each sampling strategy, compared in terms of predictive accuracy to every other combination, and separated by the different age groups. Results are based on one-sided Bayesian t-tests. We find that the GP-UCB model wins against every other combination, resulting in large Bayes factors overall, but also for every individual age group.
In addition to these Bayes factors for comparisons between individual models, we also compare the cross-validated log-loss at the group level using (predictive) Bayesian model selection, estimating each model’s protected probability of exceedance, which is the likelihood that the proportion of participants generated using a given combination exceeds the proportion of participants generated using all other combinations, while controlling for the chance rate (Stephan, Penny, Daunizeau, Moran, & Friston, 2009). This analysis revealed that the overall hypothesis of all models performing equally well was rejected by a Bayesian omnibus test at p < .001. Moreover, the protected probability of exceedance was virtually 1 for the GP-UCB model, where 1 is the upper theoretical limit. This result is also obtained if breaking down the analysis by the different age groups, again always leading to a protected probability of exceedance of 1 for the GP-UCB model.
Finally, we report how many participants are best predicted by each combination of a model and a sampling strategy. The results of this comparison are shown in Figure S2 (see also Table S1). As before, the GP-UCB model is the best overall model and also predicts most participants best for every age group. Overall, the GP-UCB model predicted 106 out of 160 participant best. In the different age categories, the GP-UCB model best predicted 32 out of 55 younger children, 44 out of 55 older children, and 30 out of 50 adult participants. Therefore, the GP-UCB model was by far the best model in our comparison.
Model recovery
We present model recovery results that assess whether or not our predictive model comparison procedure allows us to correctly identify the true underlying model. To assess this, we generated data based on each individual participant’s mean parameter estimates (excluding outliers larger than 5 as before). More specifically, for each participant and round, we use the cross-validated parameter estimates to specify a given model, and then generate new data in the attempt to mimic participant data. We generate data using the Mean Tracker and the GP regression model. In all cases, we use the UCB sampling strategy in conjunction with the specified learning model. We then utilize the same cross-validation method as before in order to determine if we can successfully identify which model has generated the underlying data. Figure S3 shows the cross-validated predictive performance for the simulated data.
Recovery Results
Our predictive model comparison procedure shows that the Gaussian Process model is a better predictor for data generated from the same underlying model, whereas the Mean Tracker model is only marginally (if at all) better at predicting data generated from the same underlying model. This suggests that our main model comparison results are robust to Type II errors, and provides evidence that the better predictive accuracy of the GP model on participant data is unlikely due to differences in model mimicry.
When the Mean Tracker model generates data using participant parameter estimates, the same Mean Tracker model performs better than the GP model (t(159) = 1.42, p = .16, d = 0.11, BF = 4.1) and predicts 86 out of 160 simulated participants best. Notice, however, that both models perform poorly in this case, with both models achieving an average pseudo-r-squared of around R2 = −0.04. This also shows that the Mean Tracker is not a good generative model of human-like behavior in our task.
When the Gaussian Process regression model has generated the underlying data, the same model performs significantly better than the Mean Tracker model (t(159) = 18.8, p < .001, d = 1.48, BF > 100) and predicts 153 of the 160 simulated participants best. In general, both models perform better when the GP has generated the data with the GP achieving a predictive accuracy of R2 = .20 and the MT of achieving R2 = .12. This makes our finding of the Gaussian Process regression model as the best predictive model even stronger as –technically– the Mean Tracker model can mimic parts of its behavior. Moreover, the resulting values of predictive accuracy are similar to the ones we found using participant data. This indicates that our empirical values of predictive accuracy were as good as possible under the assumption that a GP model generated the data.
Parameter Recovery
Another important question is whether the reported parameter estimates of the GP-UCB model are reliable and recoverable. We address this question by assessing the recoverability of the three underlying parameters, the length-scale λ, the directed exploration factor β, and the random exploration (temperature) parameter τ of the softmax choice rule. We use the results from the model recovery simulation described above, and correlate the empirically estimated parameters used to generate data (i.e., the estimates based on participants’ data), with the parameter estimates of the recovering model (i.e., the MLE from the cross-validation procedure on the simulated data). We assess whether the recovered parameter estimates are similar to the parameters that were used to generate the underlying data. We present parameter recovery results for the Gaussian Process regression model using the UCB sampling strategy. We report the results in Figure S4, with the generating parameter estimate on the x-axis and the recovered parameter estimate on the y-axis.
The correlation between the generating and the recovered length-scale λ is rτ = .85, p < .001, the correlation between the generating and the recovered exploration factor β is rτ = 0.75, p < .001, and the correlation between the generating and the recovered softmax temperature parameter τ is rτ = 0.81, p < .001.
These results show that the correlation between the generating and the recovered parameters is very high for all parameters. Thus, we have strong evidence to support the claim that the reported parameter estimates of the GP-UCB model are recoverable, reliable, and therefore interpretable. Importantly, we find that estimates for β (exploration bonus) and τ (softmax temperature) are indeed recoverable, providing evidence for the existence of a directed exploration bonus, as a separate phenomena from random exploration in our behavioral data.
Next, we analyze whether or not the same differences between the parameter estimates that we found for the experimental data can also be found for the simulated data. Thus, we compare the recovered parameter estimates from the data generated by the estimated parameters for the different age groups. This comparison shows that the recovered data exhibits the same characteristics as the empirical data. The recovered λ-estimates for simulated data from adults was again larger than the recovered lambda estimates for older children (U = 2021, rτ = 0.33, p < .001, BF > 100), whereas there was no difference between the recovered parameters for the two simulated groups of children (U = 1800, p = .08, rτ = 0.14, BF = 1). As in the empirical data, the recovered estimates also showed a difference between age groups in their directed exploration behavior such that the recovered β was higher for simulated older children than for simulated adults (U = 730, p < .001, rτ = 0.33, BF > 100), whereas there was no difference between the two simulated groups of children (U = 1730, p = .19, rτ = 0.10, BF = 0.6). There was no difference between the different recovered τ-parameters (max-BF = 0.5). Thus, our model can also reproduce similar group differences between generalization and directed exploration as found in the empirical data.
Counter-factual parameter recovery
Another explanation of the finding that children differ from adult participants in their directed exploration parameter β but not in their random exploration parameter τ could be that the softmax temperature parameter τ can sometimes track more of the random behavioral difference between participants than the directed exploration parameter β. If the random exploration parameter τ indeed tends to absorb more of the random variance in the data than the directed exploration parameter β, then perhaps one is always more likely to find differences in β rather than differences in τ. To assess this claim, we simulate data using our GP-UCB model as before but swap participants’ parameter estimates of β with their estimates of τ and vice versa. Ideally, this simulation can reveal whether it is possible for our method to find differences in τ but not β in a counter-factual parameter recovery where the age groups differ in their random but not their directed exploration behavior. Thus, we generate data from the swapped GP-UCB model and then use our model fitting procedure to assess the GP-UCB-model’s parameters from this generated data. The results of this simulation reveal that simulated adults do not differ from simulated older children in their estimated directed exploration parameters β (U = 1170, rτ = 0.11, p = .18, BF = 0.7). Furthermore, the two simulated children groups also do not differ in terms of their directed exploration parameter β (U = 1696, rτ = 0.09, p = .27, BF = 0.4). However, the random exploration parameter τ is estimated to be somewhat higher for simulated older children than for simulated adults (U = 1771, rτ = 0.21, p = .01, BF = 2.5) and shows no difference between the two simulated children groups (U = 1631, p = .48, rτ = 0.06, BF = 0.4). This means that the GP-UCB model can pick up on differences in random exploration as well and therefore that our findings are unlikely due to a false positive.
Further Behavioral Analyses
Learning over trials and rounds
We analyze participants learning over trials and rounds using a hierarchical Bayesian regression approach. Formally, we assume the regression weight parameters θx for x ∈ {0, 1, 2} are hierarchically distributed as we further assume a weakly informative prior of the means and standard deviation over the regression equation, defined as:
We fit one hierarchical model of means and standard deviations over all participants using Hamiltonian Markov chain Monte Carlo sampling as implemented in thePyMC-environment. This yields hierarchical estimates for each regression coefficient overall as well as individual estimates for each participant. Doing so for a model containing an intercept (mean performance), a standardized coefficient of the effect of trials on rewards, as well as a standardized coefficient of the effect of rounds on rewards results into the posterior distributions shown in Figure S5.
The overall posterior mean of participants’ rewards is estimated to be 34.4 with a 95%-credible set of [33.5, 35.4]. The standardized effect of trials onto rewards is estimated to be 1.48 with a 95%-credible set of [1.1, 1.9], indicating a strong effect of learning over trials. The standardized effect of rounds onto participants’ rewards is estimated to be 0.11 with a 95%-credible set of [-0.4, 0.6], indicating no effect. Taken together, these results indicate that participants performed well above the chance-level of 25 overall and improved their score greatly over trials. They did not, however, learn or adapt their strategies across rounds.
Reaction times
We analyze participants log-reaction times as function of previous reward, age group, and condition. Reaction times are filtered to be smaller than 5000ms and larger than 100ms (3.05% removed in total). We find that reaction times are larger for the rough as compared to the smooth condition (see Fig. S6a; t(158) = 4.44, p < .001, d = 0.70, BF > 10). Moreover, adults are somewhat faster than children of age 9-11 (t(103) = −2.60, p = .01, d = 0.51, BF = 4). The difference in reaction times between children of age 9-11 and children of age 7-8 is only small (t(108) = 2.27, p = .03, d = 0.43, BF = 1.97).
We also analyze how much a previously found reward influences participants’ reaction times, i.e. whether participants slow down after a bad outcome and/or speed up after a good outcome (see Fig. S6b). The correlation between the previous reward and reaction times is negative overall with r = −0.18, t(159) = −15, p < .001, BF > 100, indicating that larger rewards lead to faster reaction times, whereas participants might slow down after having experienced low rewards. This effect is even stronger for the smooth as compared to the rough condition (t(158) = −4.43, p < .001, d = 0.70, BF > 100). Moreover, this effect is also stronger for adults than for older children (t(103) = −5.51, p < .001, d = 1.08, BF > 100) and does not differ between the two groups of children (t(108) = 1.93, p = .06, d = 0.37, BF = 1.05).
Acknowledgements
We thank all of the families who participated in this research, the Berlin Natural History Museum where we conducted the study, Andreas Sommer for collecting the data, and Federico Meini for programming the experiment.
Footnotes
↵1 Note that interpreting estimates of inferior computational models can be problematic and should only be done with caution.