ABSTRACT
From foraging for food to learning complex games, many aspects of human behaviour can be framed as a search problem with a vast space of possible actions. Under finite search horizons, optimal solutions are generally unobtainable. Yet how do humans navigate vast problem spaces, which require intelligent exploration of unobserved actions? Using a variety of bandit tasks with up to 121 arms, we study how humans search for rewards under limited search horizons, where the spatial correlation of rewards (in both generated and natural environments) provides traction for generalization. Across a variety of different probabilistic and heuristic models, we find evidence that Gaussian Process function learning—combined with an optimistic Upper Confidence Bound sampling strategy—provides a robust account of how people use generalization to guide search. Our modelling results and parameter estimates are recoverable, and can be used to simulate human-like performance, providing novel insights about human behaviour in complex environments.
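The model class named in the abstract, Gaussian Process function learning combined with Upper Confidence Bound sampling, can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: the kernel form k(x, x′) = exp(−||x − x′||²/λ), the observation noise level, and the exploration weight β are assumptions chosen for demonstration.

```python
import numpy as np

def rbf_kernel(X1, X2, lam):
    """RBF kernel with generalization parameter lambda: exp(-||x - x'||^2 / lambda)."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / lam)

def gp_posterior(X_obs, y_obs, X_all, lam, noise=0.1):
    """Posterior mean and variance of a GP (unit prior variance) over candidate arms."""
    K = rbf_kernel(X_obs, X_obs, lam) + noise * np.eye(len(X_obs))
    K_s = rbf_kernel(X_all, X_obs, lam)
    mu = K_s @ np.linalg.solve(K, y_obs)
    # diag(K_s K^-1 K_s^T) gives the reduction in variance from the observations
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s.T).T, axis=1)
    return mu, np.maximum(var, 0.0)

def ucb(mu, var, beta=0.5):
    """Optimistic Upper Confidence Bound: expected reward plus an uncertainty bonus."""
    return mu + beta * np.sqrt(var)
```

On each trial, the next arm would be chosen (softly, via a softmax in the paper) according to the UCB values, so that generalization from observed rewards directs exploration toward promising but uncertain options.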
Footnotes
* cwu@mpib-berlin.mpg.de
§ Note, sometimes the RBF kernel is specified as k(x, x′) = exp(−||x − x′||²/(2l²)), whereas we use λ = 2l² as a more psychologically interpretable formulation.
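The two parameterizations mentioned here describe the same kernel, related by λ = 2l². A minimal numerical check (variable names are illustrative):

```python
import numpy as np

def rbf_lengthscale(d, l):
    # Common formulation with lengthscale l
    return np.exp(-d**2 / (2 * l**2))

def rbf_lambda(d, lam):
    # Formulation used in the paper, with lambda = 2 * l^2
    return np.exp(-d**2 / lam)

d = np.linspace(0.0, 3.0, 50)  # distances between inputs
l = 0.8
assert np.allclose(rbf_lengthscale(d, l), rbf_lambda(d, 2 * l**2))
```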
¶ We use rτ to denote the rank-correlation, which should not be confused with the temperature parameter τ of the softmax function. Additionally, we calculate the Bayes Factor (BFτ) to quantify the evidence for the presence of a positive correlation, using non-informative, shifted, and scaled beta-priors as recommended by ref. 81.
|| In general, assumptions about the level of correlations in the environment (i.e., the extent of generalization λ) only influence rewards in the short term, and their effect can disappear over time once each option has been sufficiently sampled (ref. 28).
** Notice that these estimates are based on a linear regression, whereas learning curves are probably non-linear. Thus, this method might underestimate the true underlying effect of learning over time.