Nutrient-sensitive reinforcement learning in monkeys

ABSTRACT
Animals make adaptive food choices to acquire nutrients that are essential for survival. In reinforcement learning (RL), animals choose by assigning values to options and update these values with new experiences. This framework has been instrumental for identifying fundamental learning and decision variables, and their neural substrates. However, canonical RL models do not explain how learning depends on biologically critical intrinsic reward components, such as nutrients, and related homeostatic regulation. Here, we investigated this question in monkeys making choices for nutrient-defined food rewards under varying reward probabilities. We found that the nutrient composition of rewards strongly influenced monkeys' choices and learning. The animals preferred rewards high in nutrient content and showed individual preferences for specific nutrients (sugar, fat). These nutrient preferences affected how the animals adapted to changing reward probabilities: the monkeys learned faster from preferred nutrient rewards and chose them frequently even when they were associated with lower reward probability. Although more recently experienced rewards generally had a stronger influence on monkeys' choices, the impact of reward history depended on the rewards' specific nutrient composition. A nutrient-sensitive RL model captured these processes. It updated the value of individual sugar and fat components of expected rewards from experience and integrated them into scalar values that explained the monkeys' choices. Our findings indicate that nutrients constitute important reward components that influence subjective valuation, learning and choice. Incorporating nutrient-value functions into RL models may enhance their biological validity and help reveal unrecognized nutrient-specific learning and decision computations.


INTRODUCTION
According to the influential Reinforcement Learning (RL) framework, animals learn by updating reward values based on experience and choose by comparing these values between options 1. The RL framework has been critical for identifying fundamental learning and decision variables that guide animals' behaviour, including object values and action values, which provide essential decision inputs, and the reward prediction error, which updates values from experience. Direct physical implementations of these theoretical constructs have been discovered in the activity of primate dopamine neurons 2-5, striatum 6,7, amygdala 8,9, and prefrontal cortex 10-13. Despite its broad explanatory power, the RL framework does not explain how learning and choice depend on specific reward properties. For example, nutrients are biologically critical, intrinsic components of food rewards, and an animal's survival depends on its ability to make adaptive food choices that acquire specific nutrients. Investigating how nutrient rewards influence learning and choice could not only enhance the biological validity of RL models but also guide the discovery of so-far unrecognized nutrient-specific learning and decision computations, and their neuronal implementations.
Because nutrients are mainly acquired from food intake, an animal's ability to adapt its food choices to changing nutrient availabilities critically determines its nutrient balance and long-term health. To optimize nutrient intake, foraging animals adapt their feeding patterns in response to regional and seasonal variations of food resources 14-16. For instance, monkeys spend more time in food patches associated with a high probability of nutritious foods (e.g., nuts) while ignoring more frequent low-nutrient foods (e.g., leaves).
Primates, including humans, also exhibit individual subjective preferences for specific nutrients and sensory food qualities to regulate nutrient intake 17-24. Thus, ecological data suggest that animals consider both the nutritional value of food and the food's availability. However, the specific learning and decision computations underlying such nutrient-sensitive food choices remain unclear. Here, we examined the food choices of rhesus monkeys (Macaca mulatta) in a dynamic foraging task that involved choices between rewards with different nutrient (fat, sugar) components under varying reward probabilities.
Previous studies examined how monkeys adapt to changing reward probabilities 9-13,25-27. In probabilistic learning tasks, monkeys track the high-probability option based on past choices and reward outcomes and distribute their choices according to the reward probabilities of both options. This learning strategy has been modelled by linking subjectively weighted recent rewards to current choices ('reward history') using logistic regression 25,26 and by dynamic updating of option values based on reward outcomes via RL mechanisms 1. We followed these approaches and examined whether monkeys assigned higher value to more nutritious foods during learning and learned faster from high-nutrient rewards.
First, we characterized the monkeys' nutrient preferences and learning during probabilistic reward-based choices. If the monkeys preferred specific nutrients, they should choose high-nutrient rewards more frequently and track their changing probability more closely to maximize intake of the specific nutrient. We recently showed in a nutrient-choice task without learning requirement that macaques' choices reflect underlying, stable nutrient-value functions 22. Accordingly, we hypothesized that nutrient-value functions also govern choices during probabilistic reward learning.
Next, we examined whether the monkeys demonstrated nutrient-specific learning. We followed established approaches for characterizing the integration of past reward experiences into subjective values using logistic-regression and RL frameworks 10,11,25,26 to examine whether nutrient preferences modulated reward learning. If learning were nutrient-specific, the influences of recent reward and choice histories on current choice should be stronger for high-nutrient rewards. Accordingly, the value function in a formal RL model should incorporate higher preferences for high-nutrient rewards ('nutrient-value function'). In addition, the animals may assign higher weights to reward outcomes with particular nutrient content, as reflected by influences on the learning rate ('nutrient-specific learning rates').
Finally, based on behavioural evidence for nutrient-sensitive reinforcement learning, we propose candidate neuronal mechanisms necessary to implement nutrient-specific learning and decision computations, as a framework to guide future neurophysiological recordings.

RESULTS
Two monkeys performed a dynamic foraging task to obtain different nutrient-defined liquid rewards (Fig. 1A). In each choice trial, the monkeys were presented with two visual cues from a set of four, chose between the two cues, and received either a large amount ('rewarded') or a small amount ('non-rewarded') of the cue-associated liquid reward, depending on a prespecified reward probability (p). We used new, untrained visual cues in each session to avoid influences of prior experience. Session-specific visual cues were each associated with one of four different rewards; cue-reward associations were fixed within each session.
To examine whether fat and sugar biased learning from reward outcomes, we used liquid rewards from a 2 × 2 factorial design with fat and sugar levels as factors. In each session, two of the four rewards were associated with a high reward probability (p = 0.8) and the other two rewards were associated with a low reward probability (p = 0.2) (Fig. 1C, block A or block B). We reversed the reward probabilities every 100 trials throughout the session (p = 0.2 → 0.8; p = 0.8 → 0.2) to encourage continual learning from reward outcomes (Fig. 1D). Notably, this design offered the monkeys equal availability of fat and sugar in all choice trials irrespective of block type because there were always two high-probability and two low-probability options for both high-fat and high-sugar rewards. All liquids were matched in flavour (blackcurrant or peach) and other ingredients (protein, salt, etc.); therefore, differential learning and choice patterns could be attributed to the nutrient content of the rewards.
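As an illustration of the block design described above, the probability schedule can be sketched in a few lines. The reward labels and the choice of which pair starts at p = 0.8 are assumptions for illustration, not taken from the paper.

```python
import random

# Hypothetical labels for the 2 x 2 (fat x sugar) reward set:
# low/high fat (LF/HF) crossed with low/high sugar (LS/HS).
REWARDS = ["LFLS", "HFLS", "LFHS", "HFHS"]

def reward_probability(reward, trial, block_len=100):
    """p = 0.8 for two rewards and p = 0.2 for the other two; all four
    probabilities reverse (0.8 <-> 0.2) every `block_len` trials."""
    high_in_block_a = reward in ("LFLS", "HFHS")   # illustrative assignment
    in_block_a = (trial // block_len) % 2 == 0
    return 0.8 if high_in_block_a == in_block_a else 0.2

def sample_outcome(reward, trial, rng=random):
    """True for the large ('rewarded') amount, False for the small amount."""
    return rng.random() < reward_probability(reward, trial)
```

Note that at every trial two options sit at p = 0.8 and two at p = 0.2, so fat and sugar remain equally available in both block types, as the design intends.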

Nutrients bias reward learning and food choices
The behaviour in two example sessions (Fig. 2A) showed that both monkeys exhibited preferences for specific nutrients while tracking changing reward probabilities. Monkey Ya's choices (Fig. 2A, top) were dominated by a general preference for high-sugar rewards, with a smaller impact of reward probability on choice. Specifically, monkey Ya chose the high-sugar rewards frequently even when they were associated with a lower probability of obtaining a large reward amount; in addition, his choice frequencies tracked the changing reward probabilities, particularly for the high-sugar rewards. By contrast, monkey Ym's choices (Fig. 2A, bottom) reflected both a preference for high-nutrient content and a strong dependence on reward probability. Specifically, within a given trial block, monkey Ym preferred high-nutrient rewards over low-nutrient rewards with matched reward probabilities (compare red and yellow curves) but reduced his choices of the high-nutrient reward when it was associated with a relatively lower reward probability.

The patterns observed in single sessions were also observed in data averaged across sessions. Overall, the monkeys' choice probabilities increased when reward probabilities switched from low (p = 0.2) to high (p = 0.8), as evident in the averaged choice probabilities around probability-reversal points (Fig. 2B). Importantly, the monkeys responded differently to probability changes for rewards that differed in fat and sugar content, with more pronounced probability increases for high-nutrient rewards and specifically high-sugar rewards (Fig. 2B). When reward probabilities were stable (between reversal points), monkey Ya showed a strong preference for the high-sugar rewards irrespective of fat level, whereas monkey Ym showed graded preferences for both high-fat and high-sugar rewards over the low-nutrient option (Fig. 2C). Immediately following the probability reversals, the monkeys had shorter learning latencies for high-nutrient rewards: they adjusted their choices more quickly to the changed reward probabilities when high-sugar and high-fat rewards were offered, which indicated that learning was sensitive to the nutrient content of reward outcomes (Fig. 2D). Thus, the monkeys preferred high-nutrient rewards, tracked changing reward probabilities in a nutrient-dependent manner, and learned faster from high-nutrient reward outcomes.

The preferences for fat and sugar biased the monkeys' choices away from a pure probability-matching (PM) strategy, which predicts that choices are distributed according to the relative frequency of receiving large rewards from each option. In the two example sessions, choices of the high-sugar rewards accumulated more rapidly than predicted by the PM strategy, whereas choices of the low-sugar rewards accumulated more slowly (Fig. 2E). Specifically, compared to the PM strategy, monkey Ya significantly over-matched the high-sugar rewards and under-matched the low-sugar rewards, irrespective of reward-fat level. These patterns were much less pronounced in monkey Ym (Fig. 2F). Specifically, the choice ratios of monkey Ya were dominated by the sugar ratios, whereas those of monkey Ym were jointly determined by the probability ratios and the sugar ratios (Fig. 2G). Multiple regression confirmed that, in addition to the probability ratios, both the fat and sugar ratios significantly influenced the choice ratios (Fig. 2H). Notably, both monkeys' choices were explained by similar effect sizes of the probability ratios and the fat ratios. However, the effects of the sugar ratios were particularly strong in monkey Ya but slightly weaker than the influences of the probability ratios in monkey Ym.
Taken together, these results suggest that the specific nutrient composition of food rewards and the animals' individual preferences for sugar and fat biased learning and choice.

Nutrient-specific reward history and choice history influence monkeys' choices
One strategy for responding to unsignalled changes in reward probabilities is to choose based on recent choices and reward outcomes. Because choice outcomes reflect the underlying reward probability, this strategy adapts choices to the changing reward probabilities and can help to optimize reward rate and nutrient-intake levels.
Consistent with these notions, we found that monkey Ym tended to repeat his choices, particularly after receiving a large reward on the previous choice; this effect was evident across all reward types (Fig. 3A, right).

By contrast, the tendency to repeat choices was less pronounced for the low-sugar rewards in monkey Ya (Fig. 3A, left). This result suggested that both recent choices and reward outcomes increased choice repetition, but that these influences depended on individual nutrient preferences.
To formally characterize the learning from recent choices and reward outcomes, we modelled the trial-by-trial choices in a logistic regression model (history model, see Methods) that accounted for whether the option was chosen in previous offers (choice history) and whether the previous choices were rewarded (reward history) (Fig. 3B). The regression coefficients showed that both the choice and reward histories reinforced current choices and that these effects decayed for more remote past trials (Fig. 3C). Given the monkeys' preferences for fat and sugar, we next examined whether these reward- and choice-history effects also depended on the nutrient composition of reward outcomes and choice offers. We tested this possibility by including nutrient-history interaction regressors in the history model (nutrient model, see Methods). These interaction terms would capture any additional reinforcing effects of specific nutrients by decomposing the aggregated reward- and choice-history effects into the effects of the baseline low-nutrient liquid, the high-fat content, and the high-sugar content, depending on the fat and sugar levels of the offered reward types (Fig. 3D-E). Larger history regression coefficients for sugar compared to fat suggested that recently obtained high-sugar reward outcomes had a stronger impact on current-trial choice than recently obtained high-fat rewards in both monkeys. However, the two monkeys differed in their tendency to repeat choices for high-fat and high-sugar liquids, as indicated by the nutrient-specific choice-history coefficients.
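A minimal sketch of the history model's regressors, assuming a signed left-minus-right coding of the lagged choice and reward terms; the exact coding and the per-option, offered-trial bookkeeping of the paper's Methods are not reproduced here.

```python
import numpy as np

def history_design_matrix(choice_L, reward, n_back=5):
    """Build lagged reward-history and choice-history regressors for a
    logistic regression of current choice (left vs right).

    choice_L : 1/0 per trial, 1 if the left option was chosen
    reward   : 1/0 per trial, 1 if that trial yielded the large reward
    Each column is a signed (left = +1, right = -1) history term i trials back.
    """
    choice_L = np.asarray(choice_L, dtype=float)
    reward = np.asarray(reward, dtype=float)
    c_hist = 2 * choice_L - 1            # choice history: +1 left, -1 right
    r_hist = (2 * choice_L - 1) * reward # reward history: signed, rewarded only
    T = len(choice_L)
    X = np.zeros((T, 2 * n_back))
    for i in range(1, n_back + 1):
        X[i:, i - 1] = r_hist[:-i]            # reward history, lag i
        X[i:, n_back + i - 1] = c_hist[:-i]   # choice history, lag i
    return X
```

Fitting a logistic regression of current choice on these columns yields decaying reward- and choice-history weights of the kind shown in Fig. 3C; nutrient-interaction regressors would additionally multiply each column by the fat or sugar level of the corresponding reward.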
Monkey Ya repeated the high-sugar choices more frequently than choices of low-nutrient and high-fat rewards. By contrast, monkey Ym repeated choices slightly less frequently for the high-sugar rewards. Importantly, although the explanatory power of both models increased with history length, the nutrient model outperformed the history model in all history-length-matched comparisons (Fig. 3F). These history effects showed distinct temporal dynamics in the two monkeys, although they decayed in both the history model and the nutrient model (Fig. 3G).
These results indicated that both monkeys' choices depended on the recent histories of obtaining and choosing rewards with specific nutrient content.

Reinforcement learning based on nutrient-specific values
The temporal dynamics of the nutrient-specific reward- and choice-history effects suggested that the monkeys constantly updated their choices based on recent choices and reward outcomes. RL models that update trial-by-trial reward values for each option based on the reward outcomes are well suited to model such adaptive choices. However, canonical RL models typically do not account for the nutrient composition of food rewards and accordingly cannot explain the presently observed nutrient preferences and nutrient-specific learning effects. Therefore, we developed a nutrient-sensitive RL model that incorporated subjective nutrient values to model how specific nutrients (fat, sugar) differentially influenced the trial-by-trial updating of expected reward values and their influence on choice (Fig. 4A). In this model, the values of the unchosen and unoffered rewards were discounted according to a forgetting rate (α0), which would be 0 for perfect (but biologically implausible) value memory.

The results of fitting this nutrient-sensitive RL model to each monkey's choices and reward outcomes in each session confirmed that both monkeys assigned higher values to the high-sugar choice options and that monkey Ym, but not monkey Ya, assigned higher value to fat (Fig. 4B). The high-fat high-sugar reward was also valued higher than the low-nutrient reference, but the fat and sugar values showed no supra-additive effects in monkey Ya and a negative interaction in monkey Ym when determining the reward values (Fig. S1). The model-derived subjective values for fat and sugar accurately predicted the monkeys' choices (Fig. 4C). The nutrient-sensitive RL model outperformed alternative RL models involving combinatorial differential learning rates and nutrient-specific parameters (Fig. 4D; see Methods). Notably, there was no evidence for nutrient-specific learning rates, only a small but significant forgetting rate for monkey Ym (Fig. S2).

Thus, the monkeys' stochastic choices for rewards with specific nutrient compositions were well explained by a nutrient-sensitive RL model that assigned nutrient-specific values to reward outcomes.
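A compact sketch of such a nutrient-sensitive update, assuming a single learning rate, a forgetting rate applied to all non-chosen options, and illustrative nutrient weights (the names and values are not the fitted parameters).

```python
def nutrient_q_update(Q, chosen, rewarded, nutrient_value,
                      alpha=0.3, alpha0=0.05):
    """One trial of a nutrient-sensitive value update.

    Q              : dict, reward name -> current value estimate
    nutrient_value : dict, reward name -> subjective value of its large
                     outcome (e.g. higher for a preferred high-sugar reward)
    The chosen option moves toward the nutrient-weighted outcome via a
    prediction error; all other values decay toward 0 with forgetting
    rate alpha0 (alpha0 = 0 would keep perfect, biologically implausible,
    value memory).
    """
    outcome = nutrient_value[chosen] if rewarded else 0.0
    for r in Q:
        if r == chosen:
            Q[r] += alpha * (outcome - Q[r])   # prediction-error update
        else:
            Q[r] *= (1.0 - alpha0)             # value forgetting
    return Q
```

Because the outcome term is nutrient-weighted, a preferred high-sugar reward pulls its option value higher than a low-nutrient reward delivered with the same probability, reproducing the over-matching pattern described above.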

Value updating based on distinct sugar and fat value components
The nutrient-sensitive RL model implied that the animals can independently track values for the specific fat and sugar nutrients and integrate them into a scalar value that guides choices. To better understand the dynamics of this nutrient-specific value tracking and updating, we modelled the dynamic learning of individual nutrient values in a nutrient prediction error-based RL model (NPE-RL), in which the reward value on trial t, Q(t), was jointly determined by individual fat-value and sugar-value components (Fig. 5A, see Methods).
The NPE-RL model characterized how fat and sugar values could (i) separately adapt to changes in reward probabilities as indicated by experienced outcomes, and (ii) flexibly determine the integrated reward values for specific choice options, based on their nutrient composition. Specifically, the fat and sugar prediction errors for each reward updated the fat and sugar values, respectively, which were then combined into integrated reward values to guide choices (Fig. 5A). Decomposing the reward values into two independent nutrient components revealed each animal's idiosyncratic sensitivity of reward values to individual nutrient constituents. To illustrate the dynamic, nutrient-specific value updating, we plotted the evolving value trajectories within a session in a space defined by the separate fat and sugar value components (Fig. 5B). These trajectories indicated that the updating of reward values in monkey Ya was primarily based on the sugar value component over the fat value component, whereas both fat and sugar value components contributed to value learning in monkey Ym (Fig. 5B).
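The two-component updating can be sketched as follows; the parameter names (alpha_f, alpha_s, w_f, w_s) are assumptions standing in for the model's fitted learning rates and nutrient sensitivities.

```python
def npe_rl_update(V_fat, V_sug, has_fat, has_sug, rewarded,
                  alpha_f=0.3, alpha_s=0.3, w_f=0.5, w_s=1.0):
    """One trial of a nutrient prediction error (NPE) update.

    V_fat, V_sug    : current fat- and sugar-value components of the
                      chosen option
    has_fat/has_sug : whether the chosen reward is high in that nutrient
    Each component is updated by its own nutrient prediction error, and the
    integrated scalar value Q weights the components by subjective
    sensitivities (here w_s > w_f would mimic a sugar-dominant animal).
    """
    f_outcome = 1.0 if (rewarded and has_fat) else 0.0
    s_outcome = 1.0 if (rewarded and has_sug) else 0.0
    V_fat += alpha_f * (f_outcome - V_fat)   # fat prediction error update
    V_sug += alpha_s * (s_outcome - V_sug)   # sugar prediction error update
    Q = w_f * V_fat + w_s * V_sug            # integrated reward value
    return V_fat, V_sug, Q
```

Iterating this update across trials traces out exactly the kind of trajectory in the fat x sugar value plane that Fig. 5B plots for the two monkeys.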

The distinct sensitivities of reward values to specific nutrient components were illustrated by projections of the dynamic reward-value trajectories onto the nutrient value space, where 'iso-value contours' visualized levels of equal reward value (Fig. 5C). If reward values were equally sensitive to both the sugar and fat nutrient components, the value trajectories should fall onto the 45-degree diagonal line in nutrient value space. Because we normalized the nutrient values to the HFHS reward, the value trajectory for HFHS would be at the diagonal for both monkeys (Fig. 5C, red). However, higher sensitivity to the sugar value component compressed the low-sugar value trajectories along the sugar value axis and rotated the trajectories towards the fat value axis (clockwise); similarly, higher sensitivity to the fat value component rotated the low-fat trajectories towards the sugar value axis (counterclockwise). For example, monkey Ya showed a slightly counterclockwise-rotated LFHS trajectory and markedly clockwise-rotated low-sugar trajectories, indicating his weak preference for fat and strong preference for sugar, respectively. In contrast, monkey Ym showed only mildly clockwise-rotated low-sugar trajectories and negligible rotation of the LFHS value trajectory, reflecting his mild sugar preference and non-significant fat preference.
The rotation angles of the value trajectories in the nutrient value space quantified the relative sensitivity of reward values to the individual nutrient components and recapitulated the subjective nutrient values estimated by the nutrient-sensitive RL model (Fig. 5D).
Thus, subjective nutrient-value functions guided the dynamic updating and integration of reward values based on individual nutrient-specific components.
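The rotation argument above can be made concrete with a small helper that measures the angle of a value-update step in the fat x sugar value plane; the function name and axis convention are illustrative assumptions, not the paper's analysis code.

```python
import math

def trajectory_angle(fat_step, sugar_step):
    """Angle (degrees) of one value-update step in the fat x sugar value
    plane, measured from the sugar-value axis. 45 degrees means the step
    changed both nutrient-value components equally; smaller angles mean
    the update was dominated by the sugar component."""
    return math.degrees(math.atan2(fat_step, sugar_step))
```

For a sugar-dominant animal most steps would yield angles well below 45 degrees, matching the compressed, rotated low-sugar trajectories described for monkey Ya.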

DISCUSSION
We investigated monkeys' choices for different nutrient-defined rewards under varying reward probabilities. We found that the nutrient composition of rewards strongly influenced choices and learning. The animals generally preferred rewards that were high in nutrient content but also showed individual preferences for sugar and fat, consistent with the assignment of subjective values to choice options. The animals' nutrient preferences affected how they adapted their choices to changing reward probabilities. Specifically, the monkeys learned faster from preferred nutrient rewards and chose them frequently even under low reward probabilities (i.e., a low probability of obtaining large reward amounts). Influences of past rewards on current choice were well described by a reward-history analysis. As in previous studies 11,25, more recent rewards had a stronger influence on the monkeys' choices. Critically, we also found that the impact of reward history depended on the nutrient composition of past rewards: the effect of past rewards high in preferred sugar content was stronger than that of less preferred low-nutrient or fat rewards. The history of past choices, irrespective of reward outcomes, also had a significant and nutrient-dependent effect on choice, with stronger effects of past choices for preferred nutrient rewards. We proposed a nutrient-sensitive RL model that captured the influences of preferred nutrients on learning and choice. The model updated the value of individual sugar and fat components of expected rewards trial by trial, based on recently experienced rewards, and integrated these components into scalar values that explained the monkeys' choices. These results suggest that nutrients constitute important reward components that influence subjective valuation, learning and choice, and that canonical RL models can be usefully extended to capture such nutrient-specific values.

Previous studies of reinforcement learning in macaques revealed important influences on learning and choice, including effects of reward and choice history 5,7,11-13,25-27, the variance of recent rewards 10, novelty and reward rarity 9,28, and social observations 8. Importantly, these studies did not vary the composition of reward outcomes and thus could not test whether specific reward components differentially affected learning and choice. We reasoned that nutrients are biologically critical reward components that are essential for survival and that monkeys should prefer high-nutrient rewards and adapt their choices to optimise nutrient intake. By manipulating the sugar and fat content of our liquid rewards, we confirmed that the monkeys learned differently from these different rewards.

Previous studies demonstrated that macaques have sophisticated preferences for different reward types that comply with principles of economic choice theory 2,29-31 but did not examine how different rewards affect learning. Here we showed that subjective preferences for specific nutrients influenced how monkeys tracked the changing reward probabilities of choice options. Specifically, both animals learned faster from preferred nutrient rewards. Moreover, they based their choices on both subjective valuations of the offered reward types and estimates of the current reward probabilities. This latter finding confirms the result of a previous study showing that macaques integrate reward-type and probability information to express subjective preferences 29; different from that study, our monkeys were required to derive probability information from past reward experiences rather than from explicit visual cues.
Crucially, by varying the nutrient composition of rewards, we investigated reinforcement learning and choice for biologically important, universal reward components. Nutrients are basic building blocks of foods that are sensed by dedicated taste and oral-texture mechanisms 22,32-34 and engage physiological and homeostatic processes 35,36. Moreover, evidence from ecology and human metabolic sciences points to specific behavioural mechanisms that regulate nutrient intake. For example, ecological studies identify a 'nutrient-balancing mechanism' in wild macaques that promotes reproductive and survival success 14,37-39. In humans, reduced protein in ultra-processed foods increases energy intake by 'protein leveraging', a mechanism that regulates food choice to counter protein deficits 35,40-42. A 'fat-appetite mechanism' emerges in human monogenic obesity affecting melanocortin signalling 21.
We recently showed that in macaques, nutrients and sensory food qualities (taste, viscosity, oral friction) shape human-like economic preferences 22. Our approach makes a first step towards integrating the influential RL framework with these nutrient-dependent behavioural processes and thus enhancing its biological validity.
We described a nutrient-specific learning mechanism that updates value estimates for separate fat and sugar reward components and integrates this information to guide adaptive food choices. This mechanism implies parallel nutrient valuation systems that detect and evaluate the nutrient components depending on internal states. The neuronal implementation of this mechanism would require neurons that encode individual nutrient values (nutrient-value neurons) and dynamically update these nutrient values via nutrient prediction error signals (Fig. 6A). At a neural-network level (Fig. 6B), these nutrient-value neurons would extract nutrient-specific features from a food's sensory properties to guide food choices. Importantly, physiological-state signals could modulate the neural representations of nutrient values to allow for state-dependent valuation of food rewards. Therefore, we propose nutrient-value neurons and nutrient prediction error signals as potential substrates for nutrient-sensitive learning and choice.
Our findings within a nutrient-based RL paradigm and our proposed computational framework have implications for value-based learning and decision theories and their underlying neural mechanisms. Because nutrients provide energy and serve physiological functions essential for survival, animal reward systems should be shaped by nutrient availability in the environment and should have evolved dedicated mechanisms for adaptive nutrient-sensitive decision-making.
By decomposing the trial-by-trial reward values that guide reinforcement learning into nutrient-value components, we identified candidate signals that could be encoded by neurons in the reward and decision systems of the primate brain.

METHODS
History model
Notably, the history regressors for each option coded past trials in terms of the offered trials because the unoffered options did not carry information to influence current choices 51. Therefore, the n-back trials for the left option may not be the same choice trials as those for the right option, due to the randomized offers.

Nutrient model
Based on the history model, we further included nutrient-history interaction terms to characterize the influences of fat and sugar levels on the effects of recent choices and reward outcomes: the fat interaction regressors coded recent high-fat choices and high-fat rewarded trials, and the sugar interaction regressors coded recent high-sugar choices and high-sugar rewarded trials. The underlying history variables denoted whether the left or right option was chosen n trials earlier (1, chosen; 0, unchosen) and whether the left or right option was chosen and rewarded (1, chosen and rewarded; 0, otherwise).

Reinforcement learning (RL) models
Standard RL model (Q-learning)
We adopted a standard Q-learning algorithm that followed the Rescorla-Wagner updating rule: on each trial, the value of the chosen option was updated by the prediction error between the reward outcome and the current value estimate, scaled by a learning rate.

Alternative RL models
We systematically included differential learning rates and nutrient-specific learning parameters in the RL models. Specifically, we examined 9 combinatorial RL models with 3 differential learning-rate schemes (Standard, Asym, Forget) and 3 nutrient-specific learning parameterizations (Standard, NutVal, Alpha) (3 × 3 = 9 models), as below.

1. Differential learning rates (Standard, Asym, Forget)
We included differential learning rates for the rewarded (α+), unrewarded (α−), and unoffered (α0) options. In the Standard model, the agent equally updated both the rewarded and unrewarded options and kept perfect memory for the unoffered options (α+ = α−, α0 = 0). In the Asym model, the agent updated the rewarded and unrewarded options with different learning rates, while keeping perfect memory for the unoffered rewards (α+ ≠ α−, α0 = 0). In contrast, in the Forget model, the values of the unoffered rewards decayed due to value forgetting, but the rewarded and unrewarded options were updated equally (α+ = α−, α0 > 0).

2. Nutrient-specific learning models (NutVal, Alpha)
We examined nutrient preferences by including nutrient-specific values (NutVal) or nutrient-specific learning rates (Alpha). In the NutVal model, the reward values depended on the reward type, with separate subjective values for the fat and sugar components. In the Alpha model, the learning rate α+(t) for updating the value of the rewarded option on trial t was first transformed from [0,1] to the real line and then modified by the high-fat level (F), the high-sugar level (S), or their combination (F × S). The logistic transformation ensured that the learning rates always remained between 0 and 1.

Nutrient prediction error-RL model (NPE-RL model)
The NPE-RL model separately updated the fat and sugar value components via nutrient prediction errors and included a forgetting factor α0, as in the main nutrient-value RL model (Figure 4A).

[Supplementary figure caption, partial] B) Nutrient-specific learning rates and forgetting factors. Learning rates for the HFLS, LFHS, and HFHS rewards were all compared to the baseline learning rate for the LFLS reward; the forgetting factors were tested against perfect value memory (α0 = 0). Wilcoxon signed-rank test.
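The softmax choice rule implied by the Q-learning models and the logistic transformation of the learning rate in the Alpha model can be sketched as follows, with hypothetical parameter names (beta, b_fat, b_sugar) standing in for the fitted coefficients.

```python
import math

def softmax_choice_prob(q_left, q_right, beta=3.0):
    """Softmax probability of choosing the left option given two Q-values;
    beta is an inverse-temperature parameter (higher = more deterministic)."""
    return 1.0 / (1.0 + math.exp(-beta * (q_left - q_right)))

def nutrient_learning_rate(base_alpha, is_high_fat, is_high_sugar,
                           b_fat=0.0, b_sugar=0.0):
    """'Alpha'-model style learning rate: the baseline rate is mapped to the
    real line (logit), shifted by hypothetical fat/sugar terms, and mapped
    back through the logistic so it always stays between 0 and 1."""
    logit = math.log(base_alpha / (1.0 - base_alpha))
    logit += b_fat * is_high_fat + b_sugar * is_high_sugar
    return 1.0 / (1.0 + math.exp(-logit))
```

With b_fat = b_sugar = 0 this reduces to the nutrient-blind Standard scheme; non-zero terms let rewarded high-fat or high-sugar trials update values faster, which is the hypothesis the Alpha models were built to test.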