## ABSTRACT

Making good decisions requires people to appropriately explore their available options and generalize what they have learned. While computational models have successfully explained exploratory behavior in constrained laboratory tasks, it is unclear to what extent these models generalize to complex real-world choice problems. We investigate the factors guiding exploratory behavior in a data set consisting of 195,333 customers placing 1,613,967 orders from a large online food delivery service. We find important hallmarks of adaptive exploration and generalization, which we analyze using computational models. We find evidence for several theoretical predictions: (1) customers engage in uncertainty-directed exploration, (2) they adjust their level of exploration to the average restaurant quality in a city, and (3) they use feature-based generalization to guide exploration towards promising restaurants. Our results provide new evidence that people use sophisticated strategies to explore complex, real-world environments.

## Introduction

When facing a vast array of new opportunities, a decision maker has two key tasks: to acquire information (often through direct experience) about available options, and to apply that information to assess options not yet experienced. These twin problems of *exploration* and *generalization* must be tackled by any organism trying to make good decisions, but they are challenging to solve because optimal solutions are computationally intractable^{1}. Consequently, the means by which humans succeed in doing so—especially in the complicated world at large—have proven puzzling to psychologists and neuroscientists. Many heuristic solutions have been proposed to reflect exploratory behavior^{2–4}, inspired by research in machine learning^{5}. However, most empirical studies have used a small number of options and simple attributes^{6}. To truly ascertain the limits of exploration and generalization requires empirical analysis of behavior in the world outside the lab.

We study learning and behavior in a complex environment using a large data set of human foraging in the “wild”—online food delivery. Each customer has to decide which restaurant to pick out of hundreds of possibilities. How do they make a selection from this universe of options? Guided by algorithmic perspectives on learning, we look for signatures of adaptive exploration and generalization that have been previously identified in the lab. This allows us not only to characterize these phenomena in a naturally incentivized setting with abundant and multi-faceted stimuli, but also to weigh in on existing debates by testing competing theories of exploratory choice.

We address two broad questions. First, how do people strategically explore new options of uncertain value? Different algorithms have been proposed to describe exactly how uncertainty can guide exploration in qualitatively different ways, such as by injecting *randomness* into choice, or by making choices *directed* toward uncertainty^{2, 3, 7}. However, results have been mixed, and these phenomena remain to be studied under real-world conditions. Second, how do people generalize their experiences to other options? Modern computational theories make quantitative predictions about how feature-based similarity should govern generalization, which can in turn guide choice. But again it is unclear whether these theories can successfully predict real-world choices.

Our results indicate that customers explore (i.e., order from unexperienced restaurants) adaptively based on signals of restaurant quality, and make better choices over time. Exploration is indeed risky and leads to worse outcomes on average, but people are more willing to explore in cities where this downside is lower due to higher mean restaurant quality. Importantly, we show that customers’ exploratory behavior takes into account not only the prospective reward from choosing a restaurant, but also the degree of uncertainty in their reward estimates. Consistent with an optimistic uncertainty-directed exploration policy, they preferentially sample lesser known options and exhibit higher sampling entropy after a bad (compared to a good) outcome.

We find that choices are best fit by a model that includes both an “uncertainty bonus” for unfamiliar restaurants, and a mechanism for generalization by function learning (based on restaurant features). People appear to benefit from such generalization, as exploration yields better realized outcomes in cities where features have more predictive power. We also show that people generalize their experiences across different restaurants within the same broad cuisine type, defined both empirically within the data set, and by independent similarity ratings. As predicted by a combination of similarity-based generalization and uncertainty-directed exploration, good experiences encourage selection of other restaurants within the same category, while bad experiences discourage this to an even greater extent.

In order to set the stage for our analyses of purchasing decisions, we first review the algorithmic ideas that have been developed to explain exploration in the laboratory.

## Prior work on the exploration-exploitation dilemma

### Uncertainty-guided algorithms

Most of what we know about human exploration comes from *multi-armed bandit tasks*, in which an agent repeatedly chooses between several options and receives reward feedback^{8, 9}. Since the distribution of rewards for each option is unknown at the beginning of the task, an agent is faced with an *exploration-exploitation dilemma* between two types of actions: should she exploit the options she currently knows will produce high rewards while possibly ignoring even better options? Or should she explore lesser-known options to gain more knowledge but possibly forgo high immediate rewards? Optimal solutions only exist for simple versions of this problem^{1}, and even these are in practice difficult to compute for moderately large problems. Various heuristic solutions have therefore been proposed. Generally, these heuristics coalesce around two algorithmic ideas^{10, 11}. The first one is that exploration happens randomly, for example by occasionally sampling one of the options not considered to be the best^{12}, or by so-called soft-maximization of the expected utilities for each option, i.e., sampling each option with a probability proportional to its exponentiated value. The other idea is that exploration happens in a directed fashion, whereby an agent is explicitly biased to sample more uncertain options. This uncertainty-guidance is frequently formalized as an “uncertainty bonus”^{5, 13} which inflates an option’s expected reward by its uncertainty.
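To make the contrast between the two ideas concrete, the following minimal sketch implements a softmax rule (random exploration) and an uncertainty-bonus rule (directed exploration). The function names, the temperature and bonus weight `beta`, and the example ratings are illustrative assumptions, not values taken from our analyses.

```python
import math

def softmax_probs(values, temperature=1.0):
    """Random exploration: choice probability proportional to exp(value / temperature)."""
    top = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def ucb_scores(means, sds, beta=1.0):
    """Directed exploration: expected reward inflated by an uncertainty bonus."""
    return [m + beta * s for m, s in zip(means, sds)]

# Hypothetical options: a well-known good option and a lesser-known one.
means = [4.5, 4.2]  # expected rewards
sds = [0.1, 0.8]    # uncertainty about each expected reward

probs = softmax_probs(means)    # ignores uncertainty, favors option 0
scores = ucb_scores(means, sds)  # the bonus makes option 1 more attractive
best_ucb = max(range(len(scores)), key=scores.__getitem__)
```

Under these assumed values, the softmax rule assigns the higher probability to the option with the higher mean, whereas the uncertainty bonus tips the choice toward the lesser-known option.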

There has been considerable debate about whether or not directed exploration is required to explain human behavior^{14, 15}. For example, Daw and colleagues^{14} have shown that a softmax strategy explains participants’ choices best in a simple multi-armed bandit task. However, several studies have produced evidence for a directed exploration bonus^{4, 16, 17}. Recent studies have proposed that people engage in both random and directed exploration^{2, 7, 18, 19}. It has also been argued that directed exploration might play a prominent role in more structured decision problems^{20, 21}. However, evidence for such algorithms is still missing in real-world purchasing decisions^{6}.

### Generalization

Multiple studies have emphasized the importance of generalization in exploratory choice. People are known to leverage latent structures such as hierarchical rules^{22} or similarities between a bandit’s arms^{23}.

Gershman et al.^{24} investigated how generalization affects the exploration of novel options using a task in which the rewards for multiple options were drawn from a common distribution. Sometimes this common distribution was “poor” (options tended to be non-rewarding), whereas sometimes the common distribution was “rich” (options tended to be rewarding). Participants sampled novel options more frequently in rich environments than in poor environments, consistent with a form of adaptive generalization across options.

Schulz et al.^{25} investigated how contextual information (an option’s features) can aid generalization and exploration in tasks where the context is linked to an option’s quality by an underlying function. Participants used a combination of functional generalization and directed exploration to learn the underlying mapping from context to reward (see also^{21, 26}).

## Results

We looked for signatures of uncertainty-guided exploration and generalization in a data set of purchasing decisions taken from the online food delivery service *Deliveroo* (see Materials and Methods for more details). Further analyses and method details can be found in the Supplemental Information. In the first two sections of the Results, we provide some descriptive characterizations of the data set. In particular, we show that customers learn from past experience and adapt their exploration behavior over time. Moreover, their exploration behavior is systematically influenced by restaurant features and hence amenable to quantification. We then turn to tests of our model-based hypotheses. We find that customers’ exploration behavior can be clustered meaningfully, exhibits several signatures of intelligent exploration which have previously been studied in the lab, and can be captured by a model that generalizes over restaurant features while simultaneously engaging in directed exploration.

### Learning and exploration over time

We first assessed if customers learned from past experiences, as reflected in their order ratings over time (Fig. 1a). The order rating is defined as customers’ evaluation on a scale between 1 (poor) and 5 (great). Customers picked better restaurants over time: there was a positive correlation between the number of a customer’s past orders and her ratings (*r* = 0.073; 99.9% CI: 0.070, 0.076).

Next, we assessed exploratory behavior by creating a variable indicating whether a given order was the first time a customer had ordered from that particular restaurant—i.e., a signature of pure exploration^{24, 26}. Figure 1b shows the averaged probability of sampling a new restaurant over time (how many orders a customer had placed previously).

Customers sampled fewer new restaurants over time, leading to a negative overall correlation between the number of past orders and the probability of sampling a new restaurant (*r* = −0.139; 99.9% CI: −0.142, −0.136). Exploration also comes at a cost (Fig. 1c), such that explored restaurants showed a lower average rating (mean rating=4.257, 99.9% CI: 4.250, 4.265) than known restaurants (mean rating=4.518, 99.9% CI: 4.514, 4.522).

Customers learned from the outcomes of past orders. Figure 1d shows their probability of reordering from a restaurant as a function of their reward prediction error (RPE; the difference between the actual pleasure customers perceived after they had consumed the order, as indicated by their own rating of the order, and the expected quality of a restaurant, as measured by the restaurant’s average rating at the time of the order). RPEs are a key component of theories of reinforcement learning^{27}, and we therefore expected that customers would update their sampling behavior after receiving either a positive or a negative RPE. Confirming this hypothesis, customers were more likely to reorder from a restaurant after an experience that was better than expected (positive RPE: p(reorder)=0.518, 99.9% CI: 0.515, 0.520) than after an experience that was worse than expected (negative RPE: p(reorder)=0.394, 99.9% CI: 0.391, 0.398). The average correlation between RPEs and the probability of reordering was *r* = 0.110 (99.9% CI: 0.107, 0.114).
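As an illustration of how this quantity can be computed, the sketch below derives RPEs as a customer's own rating minus the restaurant's average rating at the time of the order, and compares reorder rates after positive and negative RPEs. The toy orders and all function names are hypothetical and serve only to show the computation.

```python
def reward_prediction_error(own_rating, expected_rating):
    """Positive when the order was better than expected, negative when worse."""
    return own_rating - expected_rating

# Hypothetical toy orders:
# (own rating, restaurant's mean rating at order time, reordered later?)
orders = [
    (5, 4.4, True), (3, 4.5, False), (5, 4.6, True),
    (2, 4.3, False), (4, 4.6, True), (5, 4.2, False),
]

def reorder_rate(rows):
    """Share of orders that were followed by a reorder from the same restaurant."""
    return sum(reordered for _, _, reordered in rows) / len(rows)

positive = [o for o in orders if reward_prediction_error(o[0], o[1]) > 0]
negative = [o for o in orders if reward_prediction_error(o[0], o[1]) < 0]

p_reorder_positive = reorder_rate(positive)
p_reorder_negative = reorder_rate(negative)
```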

### Determinants of exploration

In the next part of our analysis, we focused on what factors influenced customers’ decisions to explore a new restaurant. In particular, we assessed if exploration behavior was systematic and therefore looked at the following four restaurant features that were always visible to customers at the time of their order: the relative average price of a restaurant, its standardized estimated delivery time, the mean rating of a restaurant at the time of the order, and the number of people who had rated the restaurant before.

Customers preferred restaurants that were comparatively cheaper (Fig. 2a): the correlation between relative price and the probability of exploration was negative (*r* = −0.059; 99.9% CI: −0.0641, −0.0548). There was a non-linear relationship between a restaurant’s estimated delivery time and its probability of being explored (Fig. 2b): exploration was most likely for standardized delivery times between 1 and 2.5 (0.288, 99.9% CI: 0.285, 0.292), and less likely for delivery times below 1 (0.288, 99.9% CI: 0.285, 0.292) or above 2.5 (0.252, 99.9% CI: 0.229, 0.274). This indicates that customers might have taken into account how long it would take to plausibly prepare and deliver a good meal when deciding which restaurants to explore. The average rating of a restaurant also affected customers’ exploratory behavior (Fig. 2c): higher ratings were associated with a higher chance of exploration (*r* = 0.038; 99.9% CI: 0.0337, 0.0430). The number of ratings per restaurant also influenced exploration (Fig. 2d), with a negative correlation of *r* = −0.188 (99.9% CI: −0.192, −0.183). This correlation is relatively large because restaurants that have been tried more frequently are less likely to be explored for the first time. We therefore repeated this analysis for all restaurants that had been rated more than 500 times, yielding a correlation of *r* = −0.034 (99.9% CI: −0.042, −0.026).

We standardized and entered all of the variables into a mixed-effects logistic regression modeling the exploration variable as the dependent variable and adding a random intercept for each customer (see SI for full model comparison). We again found that a smaller number of total ratings (*β* = −0.475), a higher average rating (*β* = 0.086), and a lower price (*β* = −0.014) as well as a quadratic effect of time (*β*_{Linear} = −0.025, *β*_{Quadratic} = 0.015) were all predictive of customers’ exploration behavior. In summary, exploration in the domain of online ordering is systematic, interpretable and amenable to quantification. We next turned to an examination of our model-based hypotheses concerning uncertainty-directed exploration and generalization.

### Signatures of uncertainty-directed exploration

We probed the data for signatures of uncertainty-directed exploration algorithms that attach an uncertainty bonus to each option. One such signature is that directed and random exploration make diverging predictions about behavioral changes after either a positive or a negative outcome. Whereas random (softmax) exploration predicts no difference in sampling behavior after either a better or a worse-than-expected outcome, directed exploration predicts a stronger increase in sampling behavior after a worse-than-expected outcome (see SI). This is due to the properties of algorithms that assess an option’s utility by a weighted sum of its expected reward and its standard deviation. After a bad experience, the mean and standard deviation both go down, whereas after a good experience the mean goes up but the standard deviation goes down. Thus, there should be more changes in customers’ sampling behavior after a bad than after a good outcome.
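The asymmetry can be illustrated with a minimal sketch. It assumes a Gaussian (Kalman-filter style) belief update and a utility equal to the mean plus `beta` times the standard deviation; the prior, noise variance, `beta`, and outcome values are hypothetical choices for illustration only.

```python
import math

def gaussian_update(mean, var, outcome, noise_var=1.0):
    """Update a Gaussian belief about an option's value after one observed outcome."""
    gain = var / (var + noise_var)
    new_mean = mean + gain * (outcome - mean)
    new_var = (1.0 - gain) * var  # uncertainty shrinks regardless of the outcome
    return new_mean, new_var

def ucb(mean, var, beta=1.0):
    """Utility as a weighted sum of expected reward and standard deviation."""
    return mean + beta * math.sqrt(var)

prior_mean, prior_var = 4.0, 1.0  # hypothetical prior belief about a restaurant
before = ucb(prior_mean, prior_var)

after_good = ucb(*gaussian_update(prior_mean, prior_var, outcome=5.0))
after_bad = ucb(*gaussian_update(prior_mean, prior_var, outcome=3.0))

delta_good = after_good - before  # mean up, SD down: the changes partly cancel
delta_bad = after_bad - before    # mean down, SD down: the changes add up
```

With these assumed values the utility moves by roughly +0.21 after the good outcome and by roughly −0.79 after the bad one, which is why a bad outcome should reshuffle the ranking of options more strongly.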

We verified this prediction by calculating the Shannon entropy of customers’ next 4 purchases after having experienced either a better-than or a worse-than-expected order. The calculated entropy was higher for negative RPEs (Fig. 3a; 1.112, 99.9% CI: 1.109, 1.115) than for positive RPEs (1.082, 99.9% CI: 1.081, 1.084), in line with theoretical predictions of a directed exploration algorithm. This difference was present throughout time, such that the effect size when comparing the entropies for negative and positive RPEs for different numbers of past orders (of a given customer) always revealed a negative effect (Fig. 3b). However, this effect decreased over time with a negative correlation between the effect and customers’ past orders of *r* = −0.68 (99.9% CI: −0.90, −0.33). This means that customers learned over time, leading to lower entropies in their sampling behavior after unexpected outcomes as they gained more experience.
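For reference, the entropy of a short sequence of purchases can be computed as in the following sketch. The use of the natural logarithm and the example restaurant labels are assumptions made for illustration.

```python
import math
from collections import Counter

def shannon_entropy(choices):
    """Shannon entropy (in nats) of the empirical distribution over chosen restaurants."""
    counts = Counter(choices)
    n = len(choices)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical sequences of the next 4 purchases
repeat_same = shannon_entropy(["A", "A", "A", "A"])  # no variability in choices
all_different = shannon_entropy(["A", "B", "C", "D"])  # maximal variability
mixed = shannon_entropy(["A", "A", "B", "C"])
```

Entropy is zero when a customer orders from the same restaurant four times and is maximal (log 4 in nats) when all four purchases come from different restaurants.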

We assessed customers’ exploration behavior as a function of the difference between a given restaurant’s ratings and the average ratings of all restaurants within the same cuisine type (value difference). We also calculated each restaurant’s relative variance, i.e., how much more variance in its ratings a restaurant possessed as compared to the average variance per restaurant within the same cuisine type. The probability of exploring a new restaurant increased as a function of the restaurant’s value difference (Fig. 3c; *r* = 0.05, 99.9% CI: 0.045, 0.056). Additionally, a restaurant’s relative variance also correlated with its probability of being explored (Fig. 3c; *r* = 0.05; 99.9% CI: 0.045, 0.056). Comparing restaurants with a high vs. low relative variance in their ratings (based on a median split) revealed a shift of the choice function towards the left. In other words, restaurants with higher relative uncertainty (0.344; 99.9% CI: 0.341, 0.349) are preferred to restaurants with lower relative uncertainty (0.319; 99.9% CI: 0.317, 0.321), as predicted by uncertainty-directed exploration strategies^{2, 18}. This difference can also be observed when repeating the same analysis using a restaurant’s price (Fig. 3d): as restaurants get more expensive, they are less likely to be explored (*r* = −0.017; 99.9% CI: −0.023, −0.013). This function is again shifted for restaurants with higher relative uncertainty: given a similar price range, relatively more uncertain restaurants are more likely to be explored than less uncertain restaurants.

To further validate these findings, we fit a mixed-effects logistic regression, using the exploration variable as the dependent variable. For the independent variables, we used the mean difference in ratings between the restaurant and the average restaurant within the same cuisine type, a restaurant’s relative price, and its relative uncertainty (see Tab. 2). The average value difference (*β* = 0.114), the relative price (*β* = −0.0876) and the relative uncertainty (*β* = 0.084) all affected a restaurant’s probability of being explored. Thus, even when taking into account a restaurant’s price and its ratings, customers still preferred more uncertain options. This provides strong evidence for a directed exploration strategy.

### Signatures of generalization

We assessed if customers employed generalization to guide their purchasing decisions. We first looked at patterns of consecutive explorations (how one exploratory choice predicted the next one). Specifically, we looked at 20 frequent cuisine types and assessed how much exploring a restaurant from one type predicted exploring a restaurant from another type using a simple regression, and repeating this analysis for all combinations of types. This analysis revealed clusters of cuisine types within customers’ exploratory behavior (see Fig. 4a).

For example, exploring a restaurant from the cuisine type “Burgers” was predictive of exploring a restaurant from the cuisine type “American”, most of the Asian cuisine types clustered together, and so forth. As there were seven main clusters in total (see Materials and Methods and SI for details), this also allowed us to assess moves between different clusters after either a worse or a better-than-expected outcome. This led to four insights. First, customers were more likely to explore within the same cluster after a good outcome (+2.27%) and less likely after a bad outcome (−5.19%), while bad outcomes had a larger effect than good outcomes, again hinting at strategies of directed exploration. Secondly, customers’ switches between clusters were meaningful. For example, after a bad experience with the cluster “Asian”, customers frequently switched to the cluster “Curry” (+1.51%), which contained Indian and Thai cuisine. Thirdly, there were spill-over effects; for example, a positive experience with “Mediterranean” food also made exploring the cluster “Chicken” more likely (+0.5%). Finally, customers most often switched to exploring “Unhealthy” cuisine types after bad outcomes (+2.72%).

We also assessed if customers’ switches between different cuisine types could be related to a subjective understanding of similarity between types (Fig. 5a). We therefore asked 200 participants on Amazon’s Mechanical Turk to rate the similarity between 30 pairs of cuisine types sampled from the 20 above types on a scale from 0 (not at all similar) to 10 (totally similar). We then tested how much exploratory switches between cuisine types mapped onto the mean similarity ratings between cuisine types. There was a positive correlation between similarity ratings and the frequency of switches between cuisine types of *r* = 0.78. Thus, exploratory choices not only clustered into interpretable clusters, but were also predicted by subjective similarities between cuisine types.

We further tested how much customers’ exploration was guided by generalization. Gershman et al.^{24} showed that participants explore novel options more frequently in environments where all options are generally good. We found evidence for this phenomenon in our data (Fig. 5b): there was a positive correlation between a city’s average restaurant rating and the proportion of exploratory choices in that city (*r* = 0.32; 99.9% CI: 0.21, 0.49, see SI for partial correlations).

Next, we examined whether customers’ explorations were more successful in cities where ratings were more predictable, as assessed by how well individual ratings could be predicted from the features price, delivery time, mean rating, and number of ratings, using randomly sampled training and test sets of the same size for each city. Customers were more successful (i.e., gave higher ratings to sampled restaurants) in their exploratory choices in cities where ratings were generally more predictable (*r* = 0.73; Fig. 5d, 99.9% CI: 0.53, 0.84). Thus, customers took contextual features into account to guide their exploration, similar to findings in contextual bandit tasks^{25, 26}.

In an attempt to test algorithms of both directed exploration and generalization simultaneously, we compared three models of learning and decision making based on how well they captured the sequential choices of 3,772 new customers who had just started ordering food and who had rated all of their orders. The first model was a Bayesian Mean Tracker (BMT) that does not generalize across restaurants, only learning about a restaurant’s quality by sampling it. The second model used Gaussian Process regression to learn about a restaurant’s quality based on the four observable features (price, mean rating, delivery time, and number of past ratings). Gaussian Process regression is a powerful model of generalization and has been applied to model how participants learn latent functions to guide their exploration^{20, 21, 25}. This model was either paired with a mean-greedy sampling strategy (GP-M) or with a directed exploration strategy that sampled based on an option’s upper confidence bound (GP-UCB). We treated customers’ choices as the arms of a bandit and their order ratings as their utility, and then evaluated each model’s performance based on its one-step-ahead prediction error, standardizing performance by comparing to a random baseline. Since it was not possible to observe all restaurants a customer might have considered at the time of an order, we compared the different models based on how much higher in utility they predicted a customer’s final choice compared to an option with average features. The BMT model barely performed above chance (*r*^{2} = 0.013; 99.9% CI: 0.005, 0.022). Although the GP-M model performed better than the BMT model (*r*^{2} = 0.231; 99.9% CI: 0.220, 0.241), the GP-UCB model achieved by far the best performance (*r*^{2} = 0.477; 99.9% CI: 0.465, 0.477). Thus, a sufficiently predictive model of customers’ choices required both a mechanism of generalization (learning how features map onto rewards), and a directed exploration strategy (combining a restaurant’s mean and uncertainty to estimate its decision value).
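The difference between the GP-M and GP-UCB decision rules can be sketched as follows. This is a minimal illustration and not the fitted model: the radial basis function kernel, its length scale, the noise variance, the prior mean, `beta`, the use of only two features, and all data values are assumptions.

```python
import math

def rbf(x, y, length_scale=1.0):
    """Radial basis function kernel: similar features imply similar rewards."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * length_scale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(X, y, x_new, noise_var=0.1, prior_mean=0.0):
    """Posterior mean and standard deviation of the reward at x_new."""
    K = [[rbf(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_star = [rbf(a, x_new) for a in X]
    alpha = solve(K, [yi - prior_mean for yi in y])
    mean = prior_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

# Hypothetical standardized features (price, mean rating) of restaurants already tried
X = [(0.0, 0.0), (0.2, 0.1)]
y = [4.6, 4.4]  # the customer's own ratings of those orders

near = (0.1, 0.0)  # similar to known restaurants: confident prediction
far = (3.0, 3.0)   # unlike anything tried so far: uncertain prediction

beta = 1.0
mean_near, sd_near = gp_posterior(X, y, near, prior_mean=4.3)
mean_far, sd_far = gp_posterior(X, y, far, prior_mean=4.3)

gp_m_choice = "near" if mean_near > mean_far else "far"
gp_ucb_choice = "near" if mean_near + beta * sd_near > mean_far + beta * sd_far else "far"
```

Under these assumptions the mean-greedy rule selects the familiar-looking restaurant, whereas the upper confidence bound rule selects the restaurant whose features lie far from past experience, because its predicted reward is much more uncertain.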

## Discussion

We investigated customers’ exploration behavior in a large data set of online food delivery purchases. Customers learned from past experiences, and their exploratory behavior was affected by a restaurant’s price, average rating, number of ratings and estimated delivery time. Our results further provide strong evidence for several theoretical predictions: people engaged in uncertainty-directed exploration, and their exploration was guided by similarity-based generalization. Computational modeling showed that these patterns could be captured quantitatively.

Taken together, our results advance our understanding of human choice behavior in complex real-world environments. The results may also have broader implications for understanding consumer behavior. For example, we have found that customers frequently change to unhealthy food options after bad experiences. However, a potential strategy to increase the exploration of healthy food might be to increase healthy restaurants’ relative uncertainty by grouping healthy options with other frequently explored options such as Asian restaurants, which showed a comparatively lower relative uncertainty per restaurant.

While we have focused on using cognitive models to predict human choice behavior, the same issues come up for the design of recommendation engines in machine learning. These engines use sophisticated statistical techniques to make predictions about behavior, but do not typically try to pry open the human mind^{28}. This is a missed opportunity; as models of human and machine learning have become increasingly intertwined, insights from cognitive science may help build more intelligent machines for predicting and aiding consumer choice.

## Author contributions statement

ES, RB, and BB extracted and analyzed the data. BCL, MTT and SJG supervised the work. All authors wrote the paper.

## Methods

### The Deliveroo data set

The data consisted of a representative random subset of customers ordering food from the online food delivery service “Deliveroo”. The data set contained 195,333 fully anonymized customers. These customers placed 1,613,968 orders over two months (February and March 2018) in 197 cities. There were 30,552 restaurants in total, leading to an average of 155 restaurants per city. We arrived at this data set by filtering out customers with fewer than 5 orders (too few data points to analyze learning and exploration) and more than 100 orders (likely multiple people sharing an account).
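The filtering step can be sketched as below. The representation of orders as `(customer, restaurant)` pairs and the function name are hypothetical; only the thresholds of 5 and 100 orders are taken from the description above.

```python
from collections import Counter

def filter_customers(orders, min_orders=5, max_orders=100):
    """Keep orders of customers who placed between min_orders and max_orders orders (inclusive)."""
    counts = Counter(customer for customer, _ in orders)
    return [(c, r) for c, r in orders if min_orders <= counts[c] <= max_orders]

# Hypothetical toy data: customer "a" has too few orders, "c" has too many
orders = [("a", "r1")] * 3 + [("b", "r2")] * 5 + [("c", "r3")] * 101

kept = filter_customers(orders)
kept_customers = {c for c, _ in kept}
```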

### Clustering analysis

We removed the cuisine type “European” for this analysis as it was found to contain little information about customer choice behavior. This is unsurprising, given that manual cuisine type tags vary in quality and information content. Next, we analyzed for each cuisine type how much exploring this type at time point *t* was predictive of exploring another cuisine type at time point *t* + 1, using a linear regression model. Repeating this analysis for every combination of cuisine types led to the graph shown in Figure 4a. We then analyzed the resulting matrix of *r*^{2}-values by using hierarchical clustering.
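The construction of the *r*^{2} matrix can be sketched as follows. The sketch relies on the fact that the *r*^{2} of a simple linear regression with one predictor equals the squared Pearson correlation; the indicator coding, the function names, and the toy sequences are illustrative assumptions, and the subsequent hierarchical clustering step is not shown.

```python
def r_squared(x, y):
    """r^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance, nothing to predict
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def transition_r2(sequences, types):
    """r^2 for predicting 'explored type b at t + 1' from 'explored type a at t'."""
    pairs = [(s[i], s[i + 1]) for s in sequences for i in range(len(s) - 1)]
    return {
        (a, b): r_squared([int(p == a) for p, _ in pairs],
                          [int(q == b) for _, q in pairs])
        for a in types for b in types
    }

# Hypothetical sequences of explored cuisine types, one list per customer
sequences = [
    ["Burgers", "American", "Burgers", "American"],
    ["Sushi", "Chinese", "Sushi", "Chinese"],
    ["Burgers", "American", "Sushi", "Chinese"],
]
types = ["Burgers", "American", "Sushi", "Chinese"]

r2 = transition_r2(sequences, types)
```

In this toy example, exploring “Burgers” predicts a subsequent exploration of “American” much better than a subsequent exploration of “Chinese”.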

### Similarity judgments

To elicit similarity ratings between different cuisine types, we asked 200 participants on Amazon’s Mechanical Turk to rate the similarities between two randomly sampled types out of the 20 types used for the clustering analysis reported above. Participants were paid $1 and had to rate 50 pairs of cuisine types in total. The study took less than 10 minutes on average.

## Acknowledgements

ES was supported by the Harvard Data Science Initiative. SJG was supported by the Office of Naval Research under Grant N000141712984.

## Footnotes

1. In total, 94% of the restaurants had higher ratings than 4 and behavior for restaurants with an average rating lower than 4 was unstable. None of the main results change when analyzing the full data set, but estimates for this part of the space were unreliable.