## Abstract

Animals engage in routine behavior in order to efficiently navigate their environments. This routine behavior may be influenced by the state of the environment, such as the location and size of rewards. The neural circuits that track environmental information, and how that information impacts decisions to diverge from routines, remain unexplored. To investigate the representation of environmental information during routine foraging, we recorded the activity of single neurons in posterior cingulate cortex (PCC) in monkeys searching through an array of targets in which the location of rewards was unknown. Outside the laboratory, people and animals solve such traveling salesman problems by following routine traplines that connect nearest-neighbor locations. In our task, monkeys also deployed traplining routines, but as the environment became better known, they diverged from them despite the reduction in foraging efficiency. While foraging, PCC neurons tracked environmental information but not reward and predicted variability in the pattern of choices. Together, these findings suggest that PCC mediates the influence of information on variability in choice behavior.

## Introduction

Imagine you are at a horse race with six horses, where Local Field Potential is the underdog, facing 100:1 odds against. If LFP wins, a one-dollar bet will pay out $100. But in addition to the reward received from this bet, learning that, out of the six horses, LFP is the winner reduces your uncertainty about the outcome. Hence, LFP crossing the finish line first yields both reward and information.

Similar problems are often faced by organisms in their environment. Animals must learn not only the sizes of rewards but also their locations, times, or other properties. For example, hummingbirds will adapt their nectar foraging in response to unexpected changes in reward timing (Garrison and Gass 1999). Similarly, monkeys will adapt their foraging routines upon receiving information that a highly valued resource has become available (Menzel 1991). In general, animals can make better decisions by tracking such reward information. Perhaps once a reward has been received, it no longer pays to wait for more because the resource is exhausted or the inter-reward time is too great (McNamara 1982), such as occurs for some foraging animals. Or, perhaps receiving a reward also resolves any remaining uncertainty about an environment (Stephens and Krebs 1986). Keeping track of reward information independent of reward size thus serves as an important input into animals’ decision processes.

We designed an experiment to probe this oft-neglected informational aspect of reward-based decision making. Our experiment is based on the behavior of animals that exploit renewable resources by following a routine foraging path, a strategy known as traplining (Berger-Tal and Bar-David 2015). Trapline foraging has a number of benefits, including reducing the variance of a harvest and thereby attenuating risk (Possingham 1989), efficiently capitalizing on periodically renewing resources (Possingham 1989; Bell 1990; Ohashi, Leslie, and Thomson 2008), and helping adapt to changes in competition (Ohashi, Leslie, and Thomson 2013). Many animals trapline, including bats (Racey and Swift 1985), bees (Manning 1956; Janzen 1971), butterflies (Boggs, Smiley, and Gilbert 1981), hummingbirds (Gill 1988), and an array of primates including rhesus macaques (Menzel 1973), baboons (Noser and Byrne 2010), vervet monkeys (Cramer and Gallistel 1997), and humans (Hui, Fader, and Bradlow 2009). Wild primates foraging for fruit (Menzel 1973; Noser and Byrne 2010), captive primates searching for hidden foods (Gallistel and Cramer 1996; Desrochers et al. 2010), and humans moving through simulated (MacGregor and Chu 2011) and real (Hui, Fader, and Bradlow 2009) environments all use traplining to minimize total distance traveled and thereby maximize resource intake rates.

Though many primates trapline, information about the state of the environment, such as weather (Janmaat, Byrne, and Zuberbühler 2006), the availability of new foods (Menzel 1991), or possible feeding locations (Hemmi and Menzel 1995; Menzel 1996), can influence choices made while foraging. Such detours result in longer search distances and more variable choices (Noser and Byrne 2010; Hui, Fader, and Bradlow 2009) but allow animals to identify new resources (Menzel 1991) and engage in novel behaviors (Noser and Byrne 2010). These benefits are consistent with computer simulations showing that stochastic traplining yields better long-term returns than pure traplining by uncovering new resources or more efficient routes (Ohashi and Thomson 2005). Thus, environmental information may improve the long-run efficiency of routine foraging.

The neural mechanisms that track, update, and regulate the impact of environmental information on decision making remain unknown. Neuroimaging studies have revealed that the posterior cingulate cortex (PCC) is activated by a wide range of cognitive phenomena that involve rewards, including prospection (Benoit, Gilbert, and Burgess 2011), value representation (Kable and Glimcher 2007; Clithero and Rangel 2014), strategy setting (Wan, Cheng, and Tanaka 2015), and cognitive control (Leech et al. 2011). Intracranial recordings in monkeys have found that PCC neurons signal reinforcement learning strategies (Pearson et al. 2009), respond to novel stimuli during conditional visuomotor learning (Heilbronner and Platt 2013), and represent value (McCoy et al. 2003), risk (McCoy and Platt 2005), and task switches (Hayden and Platt 2010); moreover, microstimulation of PCC can induce shifts between options (Hayden et al. 2008). Together, these observations suggest that the PCC mediates the effect of environmental information on variability in routine behavior. However, no studies to date have attempted to separate the hedonic value of rewards from their informational value in PCC.

Here we tested the hypothesis that PCC preferentially tracks reward information by recording the activity of PCC neurons in monkeys foraging through an array of targets in which environmental information, operationalized as the pattern of rewards, was partially decorrelated from reward size. Monkeys developed traplines in which they moved directly between nearest neighbor targets. As they acquired more information about the state of the environment, their trapline foraging behavior was faster and less variable. While foraging, PCC neurons tracked environmental information but not reward and forecast response speed and variability in traplines. These findings support our hypothesis that PCC mediates the use of information about the state of the environment to regulate adherence to routines in behavior and cognition.

## Materials and Methods

### Task Analysis

Our task manipulates both reward and information. Reward was manipulated by varying the size of received rewards: on every trial, one of six targets had a large reward, one had a small reward that was half of the size of the large, and the remaining four had zero rewards. Information was manipulated by varying the spatiotemporal pattern of rewarding targets. Given a set of four null rewards, one small, and one large, there are 6! distinct permutations. We made the simplifying assumption that monkeys perceived the pattern of received rewards without distinguishing the different null rewards received. This assumption reduces the number of distinct patterns from 720 to 30.
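The reduction from 720 orderings to 30 distinct patterns can be verified with a short script. This is an illustrative sketch, not the original analysis code; reward sizes are coded arbitrarily as 0 (null), 1 (small), and 2 (large):

```python
from itertools import permutations

# Reward sizes coded arbitrarily: 0 = null, 1 = small, 2 = large.
REWARDS = (0, 0, 0, 0, 1, 2)

# All 6! = 720 orderings of six labeled rewards across the six targets...
all_orderings = list(permutations(REWARDS))

# ...collapse to distinct patterns once the four null rewards are indistinguishable.
distinct_patterns = set(all_orderings)

print(len(all_orderings))      # 720
print(len(distinct_patterns))  # 30
```

Because the four nulls are interchangeable, the count is 6!/4! = 30, matching the simplifying assumption above.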

Different patterns correspond to different series of received rewards. The environmental entropy *H_{E}* associated with receiving some reward (zero, small, or large) depends on the choice number *i* in the sequence and the number of possible sequences that remain:

$$H_{E,i} = \log_2 \left| \{P\}_i \right|$$

where |·| denotes cardinality, *P* is the set of possible permutations, and {*P*}_{i} is the set of remaining permutations after the *i*^{th} choice. The amount of information contained in some reward outcome is computed as the difference in entropy, that is, what has been learned about the current trial's pattern of received rewards by receiving the most recent outcome:

$$\Delta H_{E,i} = H_{E,i-1} - H_{E,i}$$

for the amount of environmental entropy *H_{E,i}* remaining after the *i*^{th} outcome. Expected information can then be computed as the mean amount of information to be gained by making the next choice, the weighted average over all possible next information outcomes given the pattern of received rewards thus far:

$$E[\Delta H_E]_i = \sum_{o} \frac{\left| \{P\}_{i,o} \right|}{\left| \{P\}_{i-1} \right|} \, [\Delta H_E]_{i,o}$$

for expected information *E*[Δ*H_{E}*]_{i} on the *i*^{th} choice, possible information outcomes [Δ*H_{E}*]_{i,o} for each possible next outcome *o*, remaining permutations {*P*}_{i,o} consistent with outcome *o*, and where |·| again denotes cardinality. Thus, as the animal proceeds through the trial, the amount of expected information varies as a function of how many possible patterns of returns have been eliminated thus far. Expected reward *ER_{i}* is computed simply as the amount of reward remaining to be harvested at choice *i* divided by the number of remaining targets:

$$ER_i = \frac{R_{\mathrm{remaining},i}}{N_{\mathrm{remaining},i}}$$

If the animal harvests all of the reward near the beginning of a trial, the expected reward will be zero. However, if the animal does not harvest the rewards until the end of a trial, the expected reward will increase across the duration of the trial.

Given the design of the task, some patterns of returns can have the same amount of expected information for the same choice number but distinct expected rewards (see supplementary information). This partial decorrelation of reward and information occurs as a result of the way expected reward and expected information are computed. Expected information is a function of the number of possible patterns to be excluded by the next outcome, whereas expected reward is a function of the identity of the pattern, and in particular, the amount of reward remaining to be harvested. The same amount of information can be expected from distinct outcomes when the number of patterns each outcome excludes is identical, but the expected rewards can differ because the outcomes themselves differ. Hence, our task partially de-confounds information and reward.
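These quantities can be computed exactly by enumerating the 30 patterns. The sketch below is our own illustration (not the original analysis code), coding rewards as 0/1/2 for null/small/large; it reproduces the case where two different first outcomes yield the same expected information but different expected rewards:

```python
from itertools import permutations
from math import log2

# Rewards coded 0 (null), 1 (small), 2 (large); 30 distinct patterns per trial.
PATTERNS = set(permutations((0, 0, 0, 0, 1, 2)))

def remaining(observed):
    """Patterns still consistent with the outcomes observed so far."""
    return {p for p in PATTERNS if p[:len(observed)] == tuple(observed)}

def entropy(patterns):
    """Environmental entropy: log2 of the number of remaining patterns."""
    return log2(len(patterns))

def expected_information(observed):
    """Mean entropy reduction from the next outcome, weighted by the
    fraction of remaining patterns consistent with each outcome."""
    before = remaining(observed)
    h_before = entropy(before)
    exp_info = 0.0
    for outcome in (0, 1, 2):
        after = {p for p in before if p[len(observed)] == outcome}
        if after:
            exp_info += (len(after) / len(before)) * (h_before - entropy(after))
    return exp_info

def expected_reward(observed):
    """Remaining reward (total = 3 units) divided by remaining targets."""
    return (3 - sum(observed)) / (6 - len(observed))

# Small reward first vs. large reward first: same expected information,
# but different expected reward (0.4 vs. 0.2 units per remaining target).
print(expected_information([1]), expected_reward([1]))
print(expected_information([2]), expected_reward([2]))
```

By symmetry, observing the small reward first or the large reward first leaves five possible patterns either way (identical expected information), while the reward left to harvest differs by a factor of two.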

### Behavioral Methods

For our behavioral entropy measures, we again used the standard definition of entropy. The probability of a particular step size was computed by counting the number of trials with that step size and dividing by the total number of trials. Action step sizes (from −2 to 3) and action step size probabilities (the probability of taking an action of a given size) were calculated for choices 1 to 2, 2 to 3, 3 to 4, and 4 to 5 (5 to 6 had a constant update of 1). Step sizes were calculated on each choice by determining how many targets around clockwise (positive) or counterclockwise (negative) the next choice was from the previous choice; already selected targets were not included in this calculation. Step size probabilities were calculated by holding fixed all of the covariates for a particular choice (information outcome from previous choice, information expectation for next choice, reward outcome from previous choice, reward expectation for next choice, and choice number), counting the frequencies for each step size, and dividing by the total number of trials with that set of covariates. For each unique combination of covariates (choice number, information outcome, information expectation, reward outcome, and reward expectation), we computed the choicewise behavioral entropy (*H_{B}*) for that combination as

$$H_B = -\sum_{s} p_s \log_2 p_s$$

for probability *p_{s}* of each step size *s*. Finally, a multilinear regression correlated these behavioral entropy scores with the covariates.
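As a concrete illustration (our own sketch, not the original code), choicewise behavioral entropy can be computed from the empirical step-size distribution within one covariate bin:

```python
from collections import Counter
from math import log2

def behavioral_entropy(step_sizes):
    """Choicewise behavioral entropy H_B = -sum_s p_s * log2(p_s), where p_s
    is the empirical probability of step size s among trials that share the
    same covariates (choice number, outcomes, and expectations)."""
    counts = Counter(step_sizes)
    n = len(step_sizes)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A routine covariate bin: the +1 (nearest-neighbor) step dominates -> low entropy.
routine = behavioral_entropy([1, 1, 1, 1, 1, 1, 1, -1])
# A divergent bin: steps spread across many sizes -> higher entropy.
divergent = behavioral_entropy([1, -1, 2, 3, 1, -2, 2, 1])
print(routine, divergent)
```

A bin in which the monkey always takes the same step has zero entropy; a bin with steps spread uniformly over four sizes has exactly 2 bits.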

Diverge choices were defined as choices that diverged from the daily dominant pattern (DDP). Determining the DDP relied on assessing the similarity between pairs of trials, for every possible pair on a given day, by computing the pair's Hamming distance (Hamming 1950). To compute the similarity between two trials, each trial's pattern of choices by target number is first coded as a digit string (e.g., 1, 2, 4, 5, 6, 3). The Hamming distance *D_{i,i′}* between two strings *i, i′* of equal length is the sum of the elementwise differences *d* between the entries of the strings:

$$D_{i,i'} = \sum_{k=1}^{n} d(x_k, y_k), \qquad d(x_k, y_k) = \begin{cases} 0, & x_k = y_k \\ 1, & x_k \neq y_k \end{cases}$$

for strings *x, y* of length *n*. We computed *D_{i,i′}* for every pair of trials and then, for each unique pattern of choices, computed its average Hamming distance $\bar{D}$ across all pairs containing that pattern. The daily dominant pattern corresponded to the pattern with the minimum $\bar{D}$.
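A minimal sketch of this procedure on hypothetical data (the function names and the example day are ours):

```python
def hamming(x, y):
    """Number of positions at which two equal-length choice strings differ."""
    return sum(a != b for a, b in zip(x, y))

def daily_dominant_pattern(trials):
    """Return the choice pattern with the minimum average Hamming distance
    to all trials on a given day."""
    def mean_distance(pattern):
        return sum(hamming(pattern, t) for t in trials) / len(trials)
    return min(set(trials), key=mean_distance)

# Hypothetical day: the 1-2-3-4-5-6 trapline dominates, with occasional detours.
day = [(1, 2, 3, 4, 5, 6)] * 8 + [(1, 2, 4, 3, 5, 6), (2, 1, 3, 4, 5, 6)]
print(daily_dominant_pattern(day))  # (1, 2, 3, 4, 5, 6)
```

Any choice on a trial that departs from the DDP at that position would then count as a diverge choice.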

Response times, defined as the time from the end of the saccade for the last decision to the end of the saccade for the current decision, and behavioral entropy were both regressed against a number of variables and their interactions as covariates with multilinear regression. Covariates included choice number in trial, expected information, expected reward, reward outcome from the previous choice, information outcome from the previous choice, and all 2-way interactions. See Table S2 for full results of the response time regression and Table S6 for full results of the behavioral entropy regressions.

### Neural Methods

Two male rhesus macaque monkeys were trained to orient to visual targets for liquid rewards before undergoing surgical procedures to implant a head-restraint post (Crist Instruments) and receive a craniotomy and recording chamber (Crist Instruments) permitting access to PCC. All surgeries were done in accordance with Duke University IACUC-approved protocols. The animals were maintained on isoflurane during surgery, received analgesics and prophylactic antibiotics after the surgery, and were permitted a month to heal before any recordings were performed. After recovery, both animals were trained on the traplining task, followed by recordings from BA 23/31 in PCC. MR images were used to locate the relevant anatomical areas and place electrodes. Acute recordings were performed over many months. In monkey L, approximately one fifth of the recordings were done using FHC (FHC, Inc., Bangor, ME) single-contact electrodes and four fifths using Plexon (Plexon, Inc., Dallas, TX) 8-contact axial array U-probes. No statistically significant differences in the proportion of task-relevant cells were detected between the populations recorded with the two types of electrodes (χ^{2}, p > 0.5). All recordings in monkey R were done using the U-probes. Recordings were performed using Plexon neural recording systems. All single-contact units were sorted online and then re-sorted offline with the Plexon offline sorter. All axial-array units were sorted offline with the Plexon offline sorter.

Neural responses often show non-linearities (Dayan and Abbott 2001), which can be captured using a generalized linear model (Aljadeff et al. 2016). We used a generalized linear model (GLM) with a log-linear link function and Poisson-distributed noise estimated from the data to analyze our neuronal recordings, effectively modeling neuronal responses as an exponential function of a linear combination of the input variables. We analyzed the neural data in two epochs: a 500 ms anticipation epoch, encompassing a 250 ms pre-saccade period and the 250 ms hold fixation period to register a choice, and more focally the 250 ms pre-saccade epoch itself. Covariates included choice number in the trial, expected information, expected reward, information outcome from the last choice, reward outcome from the last choice, and all 2-way interactions.
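The log-linear Poisson model can be sketched in a few lines. The fitting routine below is a generic Newton-Raphson (IRLS) implementation run on simulated data, not the authors' code; the covariate names and simulated effect sizes are our own assumptions:

```python
import numpy as np

def poisson_glm(X, y, n_iter=50):
    """Fit a Poisson GLM with a log link by Newton-Raphson (IRLS):
    spike counts y are modeled as Poisson with rate exp(X @ beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)            # predicted firing rates
        grad = X.T @ (y - mu)            # score (gradient of log-likelihood)
        hess = X.T @ (X * mu[:, None])   # Fisher information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
n = 1000
# Hypothetical covariates: choice number, expected information, expected reward.
cn = rng.integers(1, 7, n).astype(float)
ei = rng.uniform(0, 2, n)
er = rng.uniform(0, 0.6, n)
X = np.column_stack([np.ones(n), cn, ei, er])

# Simulated neuron: responds to expected information but not expected reward.
true_beta = np.array([1.0, 0.05, 0.4, 0.0])
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = poisson_glm(X, y)
print(beta_hat)  # approximately recovers true_beta
```

Fitted coefficients live on the log-rate scale, so a coefficient of 0.4 on expected information means each additional bit of expected information multiplies the firing rate by about e^0.4 ≈ 1.5.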

In addition to this GLM, we confirmed our model fits in two ways for each neuron: first, we plotted the residuals against the covariates, to check for higher-order structure, and second, we used elastic net regression, to check that our significant covariates were selected by the best-fit elastic net model (Zou and Hastie 2005). Plotting residuals revealed no significant higher-order structure. Furthermore, elastic net regression confirmed our original GLM results (see supplemental methods). None of the significant covariates identified by the original GLM received a coefficient of 0 from the elastic net regression, and the sizes of the significant coefficients identified by the original GLM were very close to the sizes of the coefficients computed by the elastic net regression (see supplemental results).

Perievent time histograms (PETHs) were created by binning spikes in 10 ms bins, time-locked to the event of interest. For the anticipation epoch, PETHs were centered on the end of the choice saccade. For the last informative outcome analysis, PETHs were time-locked to the time of last informative feedback, from two seconds before to two seconds after in 50 ms time bins. PETHs were smoothed with a Gaussian kernel (20 ms width). To analyze encoding of the last informative feedback, a log-linear GLM regression was then run on vectors of binned spike counts time-locked to the start of the trial, with time in window, time of last informative feedback (a binary covariate encoding whether or not the current time bin was before or after the last informative feedback), and their interaction as covariates.
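A generic PETH construction looks like the following (our own sketch; the window, bin width, and kernel width are parameters, with defaults loosely following the values above):

```python
import numpy as np

def peth(spike_times, event_times, window=(-0.5, 0.5), bin_size=0.01, sigma=0.02):
    """Perievent time histogram: bin spikes around each event, convert to
    spikes/s, and smooth with a Gaussian kernel (sigma in seconds)."""
    edges = np.arange(window[0], window[1] + bin_size / 2, bin_size)
    counts = np.zeros(len(edges) - 1)
    for t in event_times:
        counts += np.histogram(spike_times - t, bins=edges)[0]
    rate = counts / (len(event_times) * bin_size)  # average firing rate per bin
    k = np.arange(-3 * sigma, 3 * sigma + bin_size / 2, bin_size)
    kernel = np.exp(-k ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()                         # unit-area smoothing kernel
    return np.convolve(rate, kernel, mode="same")

# Example: a regular 50 Hz spike train looks flat at ~50 spikes/s around events.
spikes = np.arange(0.0, 100.0, 0.02)
events = np.array([10.0, 20.0, 30.0])
smoothed = peth(spikes, events)
```

Note that `mode="same"` leaves edge bins under-smoothed, which is why analysis windows are usually padded beyond the epoch of interest.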

Step sizes, step size probabilities, and choicewise entropies were linearly regressed against firing rates during the anticipation epoch, when actions were executed, as reflected in the step size. To assess choicewise entropy encoding before and after receipt of the last bit of information, we compared the number of neurons that significantly encoded choicewise entropy on choices before the receipt of this information to the number after. For the population response, we first separated trials by mean choicewise entropy across all choices. Next, we compared the normalized average population response for high average choicewise entropy trials to that for low average entropy trials during the two seconds before the receipt of the last information. Then we ran the same analysis on the normalized average response during the two seconds following receipt of this information; the results of these two analyses are reported below.

## Results

### Trapline Foraging in a Simulated Environment

To explore the effects of information on divergence from routines, two monkeys (*M. mulatta*) solved a simple traveling salesman problem. In this task, monkeys visually foraged through a set of six targets arranged in a circular array, only moving on to the next trial after sampling every target (Fig. 1B). On each trial, two of the targets were baited, one with a large reward and one with a small reward, with the identity of the baited targets varying from trial to trial. While foraging, monkeys gathered both rewards, herein defined by the amount of juice obtained, and information, herein defined as the reduction in uncertainty about the location of remaining rewards.

By varying the identity of the rewarding targets from trial to trial, reward and information were partially decorrelated. Reward was manipulated by varying the size of received rewards, with one small, one large, and four zero rewards available on every trial. Information was manipulated by varying the spatiotemporal pattern of rewarding targets. Different patterns correspond to different series of received rewards. Based on the series of rewards received up to a particular choice in the trial, some subset of the set of possible sequences remained, determining the remaining uncertainty for the current trial (see methods). By gathering rewards, monkeys reduced uncertainty about the current trial's pattern, thereby gathering information about the environment. The amount of information provided by a specific outcome during a trial was calculated as the difference in entropy before receiving the outcome compared to after. The expected reward for each target is the total remaining reward to harvest divided by the number of remaining targets, whereas the expected information is the mean amount of information to be gained by making the next choice, a function of the number of possible patterns excluded by the next outcome. As the animal proceeds through the trial, the amount of expected information varies as a function of how many possible patterns of returns have been eliminated so far. Distinct possible reward outcomes may offer the same information, and so our task partially de-confounds information and reward. Given the structure of the task, expected reward and expected information are partially decorrelated (linear regression, R^{2} = 0.13).

As monkeys progressed through the array, forecasts of information still available about the foraging environment influenced the speed of responses (all behavioral analyses: *n* = 145,080 choices, 24,180 trials). Though monkeys’ choice response times, the time between target acquisitions, were more sluggish over the course of a trial (Fig. 1C, top panel; linear regression; Both monkeys: *β* = 0.0134 ± 0.0004, p < 1×10^{−197}; Monkey L: *β* = 0.0148 ± 0.0005, p < 1×10^{−159}; Monkey R: *β* = 0.0100 ± 0.0007, p < 1×10^{−39}), the presence of remaining environmental information defined an information boundary generally separating speedy from sluggish choice behavior (Student’s t-test on responses when environmental information remained vs. after; Both monkeys: t(145,078) = −37.88, p < 1×10^{−311}; Monkey L: t(103,912) = −38.49, p < 1×10^{−321}; Monkey R: t(41,164) = −8.42, p < 1×10^{−16}; see supplement and Fig. S1 for full response time regression results). Response times were significantly longer when all information about the array was exhausted than when information remained even after controlling for the number of remaining eye movements (Fig. 1C, bottom panel, response times as function of expected information).

The influence of information on response times suggested a similar influence may hold for the pattern of choices that monkeys made, resulting in trial-to-trial changes in this pattern. Though monkeys usually chose the targets in the same order (the daily dominant pattern, DDP; Monkey R: same DDP across all 14 sessions; Monkey L: same DDP across 24 of 30 sessions; Fig. 1C; see methods), they occasionally diverged from this routine. We found that the informativeness of outcomes influenced this variability in the monkeys’ patterns of choices as measured by behavioral entropy, the expected degree of divergence. First, each of the monkeys’ choices during a trial was egocentrically coded by its step size, the number of targets to the left or right of the current trial’s previously chosen target (Fig. 1B). The probability of a particular step size was computed by counting the number of trials with that step size and dividing by the total number of trials. Anticipation of more informative choice outcomes significantly reduced the entropy of the monkeys’ choices (Student’s t-test; Both monkeys: t(96,718) = −19.25, p < 1×10^{−81}; Monkey L: t(69,274) = −3.24, p < 0.005; Monkey R: t(27,442) = −23.99, p < 1×10^{−125}; Fig. 1E, left panel; see supplement for full behavioral entropy regression results). While still harvesting information about the current trial, monkeys diverged less from routine traplines, but afterward they diverged more, becoming more variable in their choices (Student’s t-test on choice numbers (CN) 4 or 5; Both monkeys: t(48,358) = −125.98, p ~ 0; Monkey L: t(34,636) = −96.32, p ~ 0; Monkey R: t(13,720) = −71.79, p ~ 0; results also significant for each CN separately; Fig. 1E, right panel). Hence, monkeys diverged less while choices were still informative and more thereafter.

### Environmental Information Signaling by Posterior Cingulate Neurons

We next probed PCC activity during the trapliner task to examine information and reward signaling from 124 cells in two monkeys (Fig. 1A; monkey L = 84 neurons; monkey R = 40 neurons). In order to control for behavioral confounds, all choices where monkeys diverged from traplines were excluded from the analyses in this section (those neural findings are reported in Barack, Chang, and Platt 2017).

Overall, we found that during the anticipation epoch (500 ms encompassing a 250 ms prechoice period and a 250 ms hold fixation period), neurons in PCC preferentially signaled information expectations over reward expectations. An example cell (Fig. 2A) showed a phasic increase in firing rate during the anticipation epoch when expected information was higher for the same choice number in the trial (for example, choice number two (CN2): Student’s t-test, p < 0.0001; firing rate for 0.72 bits = 23.57 ± 1.33 spikes/sec, firing rate for 1.37 bits = 29.79 ± 0.85 spikes/sec). However, after controlling for choice number in the trial and expected information, the same neuron did not differentiate between different amounts of expected reward (Student’s t-test, p > 0.9; firing rate for 0.2 expected reward = 23.73 ± 2.17 spikes/sec, firing rate for 0.4 expected reward = 23.43 ± 1.64 spikes/sec; Fig. 2A, second row from bottom, left panel). The tuning curves for this same cell collapsed across all choice numbers for both expected information and expected reward illustrate the strong sensitivity to large amounts of information (Fig. 2B).

In our population of 124 neurons, significantly more cells were tuned to information than reward when controlling for choice number in trial. A generalized linear model (GLM) regression revealed that during the anticipation epoch, 36 (29%) of 124 neurons (Monkey L: 26 (31%) of 84 neurons; Monkey R: 10 (25%) of 40 neurons) signaled the interaction of choice number and expected information, but only 1 (1%) of 124 neurons (Monkey L: 1 (1%) of 84 neurons; Monkey R: 0 (0%) of 40 neurons) signaled the interaction of choice number and expected reward (all results, p < 0.05, Bonferroni corrected; Both monkeys: χ^{2} = 38.9138, p < 1×10^{−9}; Monkey L: χ^{2} = 27.5808, p < 5×10^{−7}; Monkey R: χ^{2} = 11.4286, p < 0.001; see methods; see supplement for results of full regression and individual monkey results). A direct test for signaling of expected reward, by comparing the average firing rates for different amounts of expected reward for the same choice number and expected information, revealed that only about 10% of neurons signaled expected reward, except on the last choice when all information had been received (Fig. 2C). In contrast, about 20% of neurons signaled expected information (Fig. 2C).

### PCC Neurons Index Response Speed and Variability

We have previously established that PCC neurons signal decisions to diverge from traplines during our task (Barack, Chang, and Platt 2017). However, the extent to which these cells track speed and variability of responses during the task remains to be explored. We next examined the extent to which firing rates in PCC neurons predicted response times across all trials. A high proportion of neurons predicted response times in our task during the pre-saccade epoch, the 250 ms before the end of the choice saccade (both monkeys, 54 (44%) of 124 neurons significantly predicted response times, linear regression, p < 0.05; Monkey L, 46 (55%) of 84 neurons; Monkey R, 14 (35%) of 40 neurons). In order to initially assess the possibility that expected information influenced response times through the PCC, a mediation analysis was run on the group of 11 cells (Monkey L: 9; Monkey R: 2; all variables were z-scored, including the firing rates, and a variable for cell identity was included in the analysis) that significantly encoded the interaction of expected information and choice number and that significantly predicted response times during that epoch. This analysis compared the direct effect of the interaction of expected information and choice number on saccade response time to the indirect, mediated effect of this variable on response time through PCC neurons (Hayes 2013). Mediation analysis revealed that both paths were small but significant: the direct pathway CN x EI → response times (p < 1 x 10^{−6}; estimated effect = −0.0121 ± 0.0024) as well as the indirect pathway CN x EI → firing rates → response times (p < 5 x 10^{−10}; estimated effect = −0.0014 ± 0.00022) both significantly sped up responses. Hence, initial indications suggest that PCC partially mediates the influence of expected information on response times.
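The logic of this mediation test can be illustrated on simulated, z-scored data (our own sketch; the variable names and effect sizes are arbitrary illustrations, not the paper's estimates). The direct path is the predictor's coefficient controlling for the mediator, and the indirect path is the product of the predictor-to-mediator and mediator-to-outcome coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical z-scored variables: the CN x EI interaction, PCC firing
# rate, and response time (RT); effect sizes chosen only for illustration.
cn_ei = rng.standard_normal(n)
firing = 0.3 * cn_ei + rng.standard_normal(n)                   # path a
rt = -0.012 * cn_ei - 0.01 * firing + 0.1 * rng.standard_normal(n)

def ols(X, y):
    """Least-squares slopes of y on X (an intercept column is prepended)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0][1:]

a = ols(cn_ei, firing)[0]                        # predictor -> mediator
c_direct, b = ols(np.column_stack([cn_ei, firing]), rt)
indirect = a * b                                 # mediated (indirect) effect
print(c_direct, indirect)
```

Here both the direct coefficient and the product a*b come out negative, mirroring the reported result that both pathways sped up responses; in practice the significance of the indirect path is assessed with bootstrap or Sobel-type tests.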

Given that PCC neurons predict response speed, we next asked whether these neurons index the degree to which behavior is variable, operationalized as behavioral entropy (see methods). During the anticipation epoch, behavioral entropy varied significantly with firing rate for 70 (56%) neurons (log-linear regression of behavioral entropy against firing rate, p < 0.05; results by choice number: 46 neurons (37%) for CN1, 27 neurons (22%) for CN2, 31 neurons (25%) for CN3, and 41 neurons (33%) for CN4). An example cell was more active for high entropy choices compared to low (Student’s t-test, p < 1×10^{−32}; Fig. 3A). Across the population, higher firing rates predicted greater behavioral entropy (Student’s t-test on mean normalized firing rates during anticipation epoch, p < 1×10^{−5}; Fig. 3B). Higher firing rates predicted greater behavioral entropy (BE) in the majority of both the significant subpopulation of 70 cells (GLM, p < 0.05, β_{BE} > 0 in 51 cells, β_{BE} ≤ 0 in 19 cells; mean β_{BE} = 0.0023 ± 0.0011 bits/spike, Student’s t-test against β_{BE} = 0, t(69) = 2.08, p < 0.05) and the whole population (124 neurons; β_{BE} > 0 in 83 cells, β_{BE} ≤ 0 in 41 cells; mean β_{BE} = 0.0058 ± 0.0020 bits/spike, Student’s t-test, t(123) = 2.85, p < 0.01). The percent modulation of behavioral entropy per 1 Hz increase in firing rate can be calculated by dividing the mean regression coefficient for the population by the mean behavioral entropy. In the subpopulation of significant neurons, these mean regression coefficients represent an increase of 2.02% in the average behavioral entropy for every additional spike, and across the whole population, of 1.49% for every additional spike.

Monkeys’ behavior could be characterized as occupying two states along a continuum from routine to divergent. In the more routine state they made faster, less variable choices, while in the more divergent state they made slower, more variable choices. Recall that a boundary defined by the receipt of the last bit of information, when the pattern of rewards on a given trial becomes fully resolved, generally separated periods of routine from divergent behavior. We next investigated whether PCC neurons signaled this information boundary. Using a follow-up regression of each trial’s binned spike counts against the time in the trial and the time of last informative outcome, we found that 98 (79%) of 124 neurons differentiated these two states (GLM, effect of interaction, p < 0.05). During a four-second epoch centered on the time of the last informative choice outcome, an example cell fired less before that outcome than after (Student’s t-test, p < 1×10^{−56}; Fig. 3C). The population of cells also signaled the transition from routine to divergent behavior, firing more after the last bit of information was received (Student’s t-test, p < 0.005; Fig. 3D).

Finally, behavioral entropy signals and information boundary signals were multiplexed in the PCC population. Significantly fewer cells (χ^{2}, p < 1×10^{−10}) predicted behavioral entropy after receiving all information (24 (19%) of 124 neurons) than before (74 (60%) neurons). A comparison of PCC population responses on choices with high versus low behavioral entropy revealed significant differences before receipt of the last informative outcome (Student’s t-test, p < 1×10^{−4}) but not after (Student’s t-test, p > 0.5; Fig. S2), with greater modulation for high entropy compared to low.

## Discussion

In this study, we show that environmental information influences responses during routine behavior and that the posterior cingulate cortex (PCC) signals this information and predicts response times and behavioral variability. Despite the fact that in our task monkeys could not utilize environmental information to increase their chance of reward, the receipt of environmental information and the exhaustion of uncertainty impacted behavioral routines. Monkeys’ responses were faster and less variable when there was more information to be gathered, but surprisingly slowed and became more variable once the environment became fully known. This pattern of slow responses after resolving all environmental uncertainty departs from the reward-rate-maximizing strategy of moving in a circle. While monkeys traplined, neurons in PCC robustly signaled information expectations but not reward expectations. These neurons also predicted the speed of responses as monkeys traplined, with neural activity in the subset of the population that signaled information expectations and predicted response times helping to mediate the influence of informativeness on responses. PCC neurons also differentiated the degree of behavioral variability before compared to after all information was received about the pattern of rewards, with increasing activity following receipt of the last informative outcome and decreased representations of behavioral variability. In sum, our experimental findings suggest that PCC tracks the state of the environment in order to influence routine behavior.

Monkeys generally chose targets in the same pattern regardless of which targets had been most recently rewarded or informative, consistent with previous findings of repetitive, stereotyped foraging in wild primate groups (Noser and Byrne 2007). They also generally moved in a circle, visiting the next nearest neighbor after the current target, likewise consistent with previous findings in groups of wild foraging primates (Menzel 1973; Garber 1988; Janson 1998). These foraging choices almost always result in straight-line routes (Janson 1998; Pochron 2001; Cunningham and Janson 2007; Valero and Byrne 2007) or a series of straight lines (Di Fiore and Suarez 2007; Noser and Byrne 2007). Experiments on captive primates have also observed nearest-neighbor or near-optimal path finding (Menzel 1973; Gallistel and Cramer 1996; Cramer and Gallistel 1997; MacDonald and Wilkie 1990). Our monkeys' choices are also consistent with human behavior on traveling salesman problems, wherein next-nearest-neighbor paths are usually chosen for low numbers of points (Hirtle and Gärling 1992; MacGregor and Ormerod 1996; MacGregor and Chu 2011).

The PCC, a posterior midline cortical region with extensive cortico-cortical connectivity (Heilbronner and Haber 2014) and elevated resting state and off-task metabolic activity (Buckner, Andrews-Hanna, and Schacter 2008), is at the heart of the default mode network (DMN) (Buckner, Andrews-Hanna, and Schacter 2008). The DMN is a cortex-spanning network implicated in divergent cognition including imagination (Schacter et al. 2012), creativity (Kühn et al. 2014), and narration (Wise and Braga 2014). Though implicated in a range of cognitive functions, activity in PCC may be unified by a set of computations related to harvesting information from the environment to regulate routine behavior. Signals in PCC that carry information about environmental decision variables such as value (McCoy et al. 2003), risk (McCoy and Platt 2005), and decision salience (Heilbronner, Hayden, and Platt 2011) may in fact reflect the tracking of information returns from the immediate environment. For example, in a two-alternative forced-choice task, neurons in PCC preferentially signaled the resolution of a risky choice with a variable reward over the value of choosing a safe choice with a guaranteed reward (McCoy and Platt 2005). Such signals may in fact reflect the information associated with the resolution of uncertainty regarding the risky option. PCC neurons also signal reward-based exploration (Pearson et al. 2009), and microstimulation in PCC can shift monkeys from a preferred option to one they rarely choose (Hayden et al. 2008). Both of these functions may reflect signaling of environmental information as well; for example, the signaling of exploratory choices may reflect the information from an increase in the number of recent sources of reward (Pearson et al. 2009).
Evidence from neuroimaging studies in humans similarly reveals PCC activation in a wide range of cognitive processes related to adaptive cognition, including imagination (Benoit, Gilbert, and Burgess 2011), decision making (Kable and Glimcher 2007), and creativity (Beaty et al. 2015). Interestingly, our monkeys adopted a behavioral strategy of slower responses once the state of the environment was fully known. These slower responses may be adaptive insofar as they permit re-evaluation of strategies and potential divergences from routines (Barack, Chang, and Platt 2017).

Uncovering the neural circuits that underlie variability in foraging behavior may provide insight into more complex cognitive functions. A fundamental feature of what we call prospective cognition (thoughts about times, places, and objects beyond the here and now) is the consideration of different ways the world might turn out. Various types of prospective cognition, including imagination, exploration, and creativity, impose a tradeoff between engaging well-rehearsed routines and deviating in search of new, potentially better solutions (Gottlieb et al. 2013; Andrews-Hanna, Smallwood, and Spreng 2014; Beaty et al. 2015). For example, creativity involves diverging from typical patterns of thought, such as occurs in generating ideas (Benedek et al. 2014) or crafting novel concepts (Guilford 1959; Barron 1955). During creative episodes, PCC shows increased activity while ideas are generated (Benedek et al. 2014) and higher connectivity with control networks while ideas are evaluated (Beaty et al. 2015), perhaps reflecting imagined, anticipated, or predicted variation in the environment. Exploration similarly involves diverging from the familiar, such as in locating novel resources (Ohashi and Thomson 2005) or discovering shorter paths (Sutton and Barto 1998) between known locations. Such prospective cognition requires diverging from routine thought, and identifying the neural circuits that mediate deviations from motor routines provides initial insight into the computations and mechanisms of prospective cognition. The discovery that PCC preferentially signals the state of the environment and predicts behavioral variability relative to that state is a first step toward understanding these circuits.

The reinforcement learning literature is replete with models in which exploration is driven by the search for information (Schmidhuber 1991; Johnson et al. 2012). These models hypothesize that agents should take actions that maximize the information gleaned from the environment, either by reducing uncertainty about the size of offered rewards (Schmidhuber 1991), reducing uncertainty about the location of rewards in the environment (Johnson et al. 2012), or otherwise maximizing evidence for making subsequent decisions. Furthermore, evidence from initial studies of information-based exploration shows that humans are avid information-seekers (Miller 1983; Fu and Pirolli 2007) and regulate attentional and valuational computations on the basis of information (Manohar and Husain 2013). In our task, the PCC represented environmental information and tracked when learning about the environment was complete, two variables central to information-based exploration. In particular, the dramatic change in firing rates associated with the end of information gathering suggests that PCC represents the information state of the environment and possibly also the rate of information intake, a central variable in information foraging models (Pirolli and Card 1999; Fu and Pirolli 2007; Pirolli 2007). PCC appears poised to regulate exploration for information.
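The information-maximizing action selection these models describe can be sketched in a few lines. Below is a hypothetical illustration (not any specific cited model): an agent holds a Bernoulli belief about each target's reward state, and because checking a target fully reveals that state, the expected information gain from a visit equals the entropy of the belief, so the agent visits the target it is most uncertain about.

```python
from math import log2

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) outcome."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def most_informative_target(beliefs):
    """Pick the target whose outcome resolves the most uncertainty.

    beliefs: dict mapping target -> subjective P(reward at target).
    Visiting a target reveals its reward state, so expected
    information gain equals the entropy of the current belief.
    """
    return max(beliefs, key=lambda t: binary_entropy(beliefs[t]))

beliefs = {"north": 0.9, "east": 0.5, "south": 0.1}
print(most_informative_target(beliefs))  # east (belief closest to 0.5)
```

Once every belief has collapsed to 0 or 1, expected information gain is zero everywhere, mirroring the point in our task at which the last informative outcome has been received.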

In sum, PCC neurons signaled both the harvesting of information and the speed and variability of behavior, suggesting a central role for PCC in determining how information drives exploration and possibly prospective cognition. Monkeys were sensitive to the amount of uncertainty remaining in the environment, with faster and more reliable patterns of choices while information remained and slower and more variable patterns after environmental uncertainty had been resolved. PCC neurons preferentially tracked this information and predicted the variability in monkeys' behavior. Our findings implicate the PCC in the regulation of foraging behavior, and specifically in the information-driven divergence from routines.

## Contributions

D.L.B. designed the experiment; D.L.B. collected and analyzed the data; and D.L.B. and M.L.P. prepared and revised the manuscript.

## Supplement

To assess the effect of information on behavior independently of reward, we designed a multi-option choice task that partially decorrelated information and reward. Monkeys performed an iterated variant of a Hamiltonian path problem, similar to the uncapacitated traveling purchaser problem, a generalization of the traveling salesman problem (Ramesh 1981; Boctor, Laporte, and Renaud 2003). Hamiltonian path problems are like traveling salesman problems except that the agent is not required to return to the starting location after the trip. In this problem, the agent must efficiently visit a series of markets that may or may not have the desired product, such that in addition to potentially purchasing the product (receiving reward), the agent also learns whether the product is available (receiving information). This version of the problem is uncapacitated because the problem assumes that the markets always have sufficient capacity (or supply of a product) to meet demand. Using this paradigm with varying reward sizes, reward and information can be partially decorrelated.
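The next-nearest-neighbor routine the monkeys deployed on this Hamiltonian path problem can be illustrated with a short greedy sketch. This is a hypothetical example for exposition (the target names and coordinates are invented), not the task or analysis code itself.

```python
from math import dist

def nearest_neighbor_path(targets, start):
    """Greedy Hamiltonian path: always visit the nearest unvisited target.

    targets: dict mapping target name -> (x, y) position.
    Returns the visit order starting from `start`; unlike a traveling
    salesman tour, there is no return leg to the starting location.
    """
    unvisited = dict(targets)
    path = [start]
    pos = unvisited.pop(start)
    while unvisited:
        nxt = min(unvisited, key=lambda t: dist(pos, unvisited[t]))
        path.append(nxt)
        pos = unvisited.pop(nxt)
    return path

layout = {"A": (0, 0), "B": (1, 0), "C": (2, 1), "D": (0, 2)}
print(nearest_neighbor_path(layout, "A"))  # ['A', 'B', 'C', 'D']
```

For small target arrays like those used here, this greedy rule produces the near-straight-line traplining routes described in the Discussion, while leaving the agent free to diverge from the greedy choice once the environment is fully known.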

## Acknowledgments

This work was supported by the National Eye Institute of the National Institutes of Health (R01 EY013496 to M.L.P.) and an Incubator Award from the Duke Institute for Brain Sciences. Correspondence and requests for materials should be addressed to dbarack@gmail.com, or to the Center for Science and Society, Fayerweather Hall 511, Columbia University, 1180 Amsterdam Ave., New York, NY 10027.