Optimal Response Vigor and Choice Under Non-stationary Outcome Values

Within a rational framework a decision-maker selects actions based on the reward-maximisation principle which stipulates they acquire outcomes with the highest values at the lowest cost. Action selection can be divided into two dimensions: selecting an action from several alternatives, and choosing its vigor, i.e., how fast the selected action should be executed. Both of these dimensions depend on the values of the outcomes, and these values are often affected as more outcomes are consumed, and so are the actions. Despite this, previous works have addressed the computational substrates of optimal actions only in the specific condition that the values of outcomes are constant, and it is still unknown what the optimal actions are when the values of outcomes are non-stationary. Here, based on an optimal control framework, we derive a computational model for optimal actions under non-stationary outcome values. The results imply that even when the values of outcomes are changing, the optimal response rate is constant rather than decreasing. This finding shows that, in contrast to previous theories, the commonly observed changes in the actions cannot be purely attributed to the changes in the outcome values. We then prove that this observation can be explained based on the uncertainty about temporal horizons; e.g., in the case of experimental protocols, the session duration. We further show that when multiple outcomes are available, the model explains probability matching as well as maximisation choice strategies. The model provides, therefore, a quantitative analysis of optimal actions and explicit predictions for future testing.


Introduction
According to normative theories of decision-making, actions made by humans and animals are chosen with the aim of earning the maximum amount of future reward whilst incurring the lowest cost . Within such theories individuals optimize their actions by learning about their surrounding environment so as to satisfy their long-term objectives. The problem of finding the optimal action is, however, argued to have two aspects: (1) choice, i.e., deciding which action to select from several alternatives; and (2) vigor, i.e., deciding how fast the selected action should be executed. For a rat in a Skinner box, for example, the problem of finding the optimal action involves selecting a lever (choice) and deciding at what rate to respond on that lever (vigor). High response rates can have high costs (e.g., in terms of energy consumption), whereas a low response rate could have an opportunity cost if the experimental session ends before the animal has earned sufficient reward. Optimal actions provide the right balance between these two factors and, based on the reinforcement-learning framework and methods from optimal control theory, the characteristics of optimal actions and their consistency with various experimental studies have been previously elaborated .
These previous models have assumed, however, that outcome values are stationary and do not change on-line over the course of a decision-making session. To see the limitations of such an assumption, imagine a rat in a Skinner box that has started to earn outcomes (e.g., food pellets) by taking actions. One can assume that, as a result of consuming rewards, the motivation of the animal to earn more food outcomes will decrease (e.g., because of satiety) and, therefore, over time, the outcomes earned will have a lower value. Such changes in value should affect both optimal choice and vigor but have been largely ignored in previous models. Yet in most experimental and real-world scenarios, outcome values are affected by the history of outcome consumption, a phenomenon known as the "law of diminishing marginal utility" in the economics literature and as "drive reduction theory" in psychological accounts of motivation, which holds that the drive for earning an outcome decreases as a consequence of prior consumption of that outcome.
Here, building on previous work, we introduce a new concept called the reward field, which captures non-stationary outcome values. Using this concept and methods from optimal control theory, we derive the optimal response vigor and choice strategy without assuming that outcome values are stationary. In particular, the results indicate that even when the values of outcomes are changing, the optimal response rate in an instrumental conditioning experiment is constant. This finding rules out previous suggestions that the commonly observed decrease in within-session response rates is due to decreases in outcome value. Instead, we show that decreases in within-session response rates can be explained by uncertainty regarding session duration. This latter analysis is possible because the session duration is explicitly represented in the current model, which is another dimension in which the current model extends previous work. The framework is then extended to choice situations and specific predictions are made concerning conditions under which the optimal strategy is maximization or probability matching.

Model Specification
The outcome space
We define the outcome space as a coordinate space with $n$ dimensions, where $n$ is the number of outcomes in the environment. For example, in a concurrent instrumental conditioning experiment in which the outcomes are water and food pellets, the outcome space will have two dimensions corresponding to water and food pellets. Within the outcome space, the state of the decision-maker at time $t$ is defined by two factors: (i) the amount of earned outcome up to time $t$, which is denoted by $x_t$ and can be thought of as the position of the decision-maker in outcome space; e.g., in the above example, $x_t = [1, 2]$ would indicate that one unit of water and two units of food pellet have been gained up to time $t$; and (ii) the outcome rate at time $t$, denoted by $v_t$, which can be considered the velocity of the decision-maker in the outcome space ($v_t = dx_t/dt$); e.g., if a rat is earning two units of water and one unit of food pellet per unit of time, then $v_t = [2, 1]$. In general, we assume that the outcome rate cannot be negative ($v \geq 0$), which means that the cumulative number of earned outcomes cannot decrease with time.

The reward
We assume that there exists an $n$-dimensional reward field, denoted by $A_{x,t}$, where each element of $A_{x,t}$ represents the per unit value of each of the outcomes. For example, the element of $A_{x,t}$ corresponding to food pellets represents the value of one unit of food pellet at time $t$, given that $x$ units of outcome have been previously consumed. As such, $A_{x,t}$ is a function of both time and the amount of outcome earned. This represents the fact that (i) the reward value of an outcome can change as a result of consuming previous outcomes, e.g., due to satiety (depending on $x$), and (ii) the reward value of an outcome can change purely with the passage of time; e.g., an animal can get hungrier over time, causing the reward value of food pellets to increase (depending on $t$).
In general, we assume that $A_{x,t}$ has two properties:
$$\frac{\partial A_{x,t}}{\partial x} \leq 0, \qquad \frac{\partial A_{x,t}}{\partial t} \geq 0, \tag{1}$$
which entail that (i) the outcome values decrease (or remain constant) as more outcomes are earned, and (ii) outcome values do not decrease with the passage of time.

Cost
Within the context of instrumental conditioning experiments, previous studies have expressed the cost of earning outcomes as a function of the delay between consecutive responses made to earn outcomes. For example, if a rat is required to make several lever presses to earn outcomes, then the cost will be higher if the delay between lever presses is short. More precisely, if the previous response has occurred $\tau$ time steps ago, then the cost of the current lever press will be:
$$\frac{a}{\tau} + b, \tag{2}$$
where $a$ and $b$ are constants. $b$ is the constant cost of each lever press, which is independent of the delay between lever presses, whereas the factor $a$ controls the rate-dependent component of the cost. Previous research has established that predictions derived from this definition of cost are consistent with experimental data. Note that costs such as basal metabolic rate and the cost of operating the brain, although consuming a high portion of the energy produced by the body, are not included in the above definition because they are constant and independent of response rate and, therefore, are not directly related to the analysis of response vigor and choice.
Here, we express cost as a function of outcome rate rather than the rate of action execution. We define the cost function $K_v$ as the cost paid at each time step for earning outcomes at rate $v$. In the specific case that the outcome space has one dimension (there is only one outcome), and under ratio schedules of reinforcement (fixed-ratio, variable-ratio, random-ratio) in which the decision-maker is required to perform either precisely or on average $k$ responses to earn one unit of outcome, the cost defined in equation 2 will be equivalent to:
$$K_v = a k^2 v^2 + k b v. \tag{3}$$
See Theorem A1 in Appendix for the proof. This definition of cost implies that it is only a function of outcome rate and is time-independent ($\partial K_v/\partial t = 0$). Although, in general, it may seem reasonable to assume that, as time passes within a session, the cost of taking actions will increase because of factors such as effector fatigue, here we make a time-independence assumption based on previous studies showing that factors such as effector fatigue have a negligible effect on response rate. In general, we assume that at least one response is required to earn an outcome ($k > 0$), and the cost of earning outcomes is non-zero ($a > 0$).
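The equivalence between the per-response cost and the per-time cost can be checked numerically. The following is a minimal sketch, assuming the per-response cost has the form a/τ + b and that responses are evenly spaced at rate k·v; the function names and the parameter values are illustrative and do not come from the original study.

```python
def per_response_cost(tau, a, b):
    """Cost of one response emitted tau time units after the previous one (assumed form a/tau + b)."""
    return a / tau + b


def cost_rate_from_responses(v, k, a, b):
    """Cost paid per unit time when outcomes are earned at rate v on a ratio-k schedule."""
    response_rate = k * v        # k responses are needed for each outcome
    tau = 1.0 / response_rate    # evenly spaced responses (an assumption of this sketch)
    return response_rate * per_response_cost(tau, a, b)


def cost_rate_closed_form(v, k, a, b):
    """Quadratic cost of earning outcomes at rate v: a*k^2*v^2 + k*b*v."""
    return a * k**2 * v**2 + k * b * v


# Illustrative values only: VR15-like ratio, arbitrary cost constants.
for v in (0.1, 0.5, 1.0):
    direct = cost_rate_from_responses(v, k=15, a=0.1, b=0.05)
    closed = cost_rate_closed_form(v, k=15, a=0.1, b=0.05)
    print(f"v={v:.1f}  per-response route={direct:.4f}  closed form={closed:.4f}")
```

The two routes agree for every rate, which illustrates why the cost can be written as a function of the outcome rate alone under ratio schedules.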

Value
The reward earned in each time step can be calculated as the reward of one unit of each of the outcomes ($A_{x,t}$) multiplied by the amount earned from each of the outcomes, which will be $v \cdot A_{x,t}$. The cost of earning this amount of reward is $K_v$, and therefore the net amount of reward earned will be:
$$L = v \cdot A_{x,t} - K_v. \tag{4}$$
A decision-making session starts at $t = 0$ and the total duration of that session is $T$. We denote the total reward gained in this period as $S_{0,T}$, which is the sum of the net rewards earned at each point in time:
$$S_{0,T} = \int_0^T L \, dt. \tag{5}$$
The quantity $S_{0,T}$ is called the value function, and the goal of the decision-maker is to find the optimal rate of earning outcomes that yields the highest value. The optimal rates that maximize $S_{0,T}$ can be found using different variational calculus methods, such as the Euler-Lagrange equation or the Hamilton-Jacobi-Bellman equation. The results presented in the next sections are derived using the Euler-Lagrange equation (see Appendix for details of the value function in non-deterministic schedules).
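To make the value function concrete, it can be approximated numerically for any candidate rate profile. The sketch below assumes a single outcome, a hypothetical linear reward field (value a0 - c·x per unit of outcome after x units have been earned) and a quadratic cost in the outcome rate; these functional forms and all parameter values are illustrative assumptions rather than the settings used in the original work.

```python
def reward_field(x, a0=10.0, c=0.05):
    """Hypothetical linear reward field: per-unit value after x units have been earned."""
    return max(a0 - c * x, 0.0)


def cost_rate(v, k=15, a=0.01, b=0.05):
    """Assumed quadratic cost per unit time of earning outcomes at rate v."""
    return a * k**2 * v**2 + k * b * v


def session_value(rate_fn, T, dt=0.001):
    """Forward-Euler approximation of the integral of (v * A_x - K_v) over [0, T].

    Returns the approximate session value and the total amount of outcome earned.
    """
    x, s = 0.0, 0.0
    steps = int(round(T / dt))
    for i in range(steps):
        v = rate_fn(i * dt)
        s += (v * reward_field(x) - cost_rate(v)) * dt
        x += v * dt
    return s, x


T = 30.0
# Two rate profiles that earn (almost exactly) the same total amount of outcome.
s_const, x_const = session_value(lambda t: 1.0, T)
s_ramp, x_ramp = session_value(lambda t: 2.0 * (1.0 - t / T), T)
print(f"constant rate: value={s_const:.2f}, outcomes={x_const:.2f}")
print(f"declining rate: value={s_ramp:.2f}, outcomes={x_ramp:.2f}")
```

With these illustrative settings the two profiles collect the same reward, but the evenly paced profile pays less cost because the cost is convex in the outcome rate.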

Optimal response vigor
In this section we use the model presented above to analyse optimal response vigor when there is one outcome and one response available in the environment. The analysis is divided into two sections. In the first section, we assume that the decision-maker is certain about session duration, i.e., that the session will continue for $T$ time units; in the next section we extend this analysis to a condition in which the decision-maker assumes a probability distribution over session lengths.
Response vigor when the session duration is known.
We maintain the following theorem:
Theorem 1 If the duration of the session is fixed and the time-dependent change in the reward field is zero ($\partial A_{x,t}/\partial t = 0$), then the optimal outcome rate is constant ($dv/dt = 0$). If the time-dependent change in the reward field is positive ($\partial A_{x,t}/\partial t > 0$), then the optimal outcome rate will be accelerating ($dv/dt > 0$).
See Appendix for a proof of this theorem. Note that the assumption $\partial A_{x,t}/\partial t = 0$ implies that the passage of time has no significant effect on the reward value of the outcome; e.g., a rat is not getting hungrier during an instrumental conditioning session, which is a reasonable assumption given the short duration of such experiments (typically less than an hour). Within this condition, the above theorem states that the optimal response rate is constant throughout the session, even under conditions in which the reward value of the outcome decreases within the session as a result of earning outcomes, e.g., because of satiety. As an intuitive explanation for why a constant rate is optimal, imagine a decision-maker who chooses a non-constant outcome rate that results in a total of $x_T$ outcomes during the session. If, instead of the non-constant rate, the decision-maker selects a constant rate $v = x_T/T$, then the total outcomes earned will be the same as before; however, the cost will be lower because cost is a quadratic function of the outcome rate and, therefore, it is better to earn outcomes at a constant rate. Nevertheless, although this prediction is clear enough, it is not consistent with the experimental results, described next.
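For the single-outcome case, Theorem 1 can be motivated by a short Euler-Lagrange sketch. This is an informal outline, not the Appendix proof; it assumes that the reward field is differentiable and that the cost $K_v$ is twice differentiable and strictly convex in $v$ (as a quadratic cost with $a > 0$ would be).

```latex
\begin{align*}
L(x, v, t) &= v\,A_{x,t} - K_v, \qquad v = \frac{dx}{dt},\\[4pt]
\frac{\partial L}{\partial x} - \frac{d}{dt}\frac{\partial L}{\partial v} &= 0,\\[4pt]
v\,\frac{\partial A_{x,t}}{\partial x}
  - \left( v\,\frac{\partial A_{x,t}}{\partial x}
  + \frac{\partial A_{x,t}}{\partial t}
  - \frac{d^2 K_v}{dv^2}\,\frac{dv}{dt} \right) &= 0,\\[4pt]
\frac{dv}{dt} &= \frac{\partial A_{x,t}/\partial t}{d^2 K_v/dv^2}.
\end{align*}
```

The satiety term $\partial A_{x,t}/\partial x$ cancels, so under these assumptions the sign of $dv/dt$ follows the sign of $\partial A_{x,t}/\partial t$: zero when the reward field does not change with time, positive when it increases with time.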

Within-session pattern of responses.
It has been established that in various schedules of reinforcement, including variable-ratio (VR) schedules, the rate of responding within a session has a particular pattern: the response rate reaches its maximum a short time after the session starts (warm-up period), and then gradually decreases toward the end of the session (Figure 1, left panel). Killeen proposed a mathematical description of this phenomenon, which can be expressed as follows:
$$\beta = \frac{r}{\delta \left( r + 1/\alpha \right)}, \tag{6}$$
where $\beta$ is the response rate, $\delta$ is the minimum delay between responses, $r$ is the resulting outcome rate, and $\alpha$ is called specific activation. The model suggests that as the decision-maker earns outcomes during the session, the value of $\alpha$ gradually declines due to satiety, which will cause a decrease in response rate. Although this model has been shown to provide a quantitative match to the experimental data, it is not consistent with Theorem 1, which posits that, even under conditions in which outcome values are changing within a session, the optimal response rate should not decrease during the session. As a consequence, the present model suggests that the cause of any decrease in the within-session response rate cannot be due purely to a change in outcome value. Note, however, that the optimal response rate advocated by Theorem 1 is not consistent with reports of decreasing response rates across a session, which implies that some of the assumptions made to develop the model may not be accurate. Although the form of the cost and reward functions is reasonably general, we assumed that the duration of the session, $T$, is fixed and known by the decision-maker. In the next section we show that relaxing this assumption such that the duration of the session is unknown results in much closer concordance between predictions from the model and the experimental data.


Figure 1. The pattern of within-session response rates (responses per minute). Left panel: Experimental data. The rate of responding per minute during successive intervals (each interval is 5 minutes) in a variable-ratio (VR15) schedule ($k = 15$). The figure is adapted from [11]. Right panel: The theoretical pattern of within-session responses predicted by the model in different conditions. Please see the text for details of each condition.

Response vigor when session duration is unknown.
In this section we assume that the decision-maker is uncertain about the session duration, which can be either because the session duration is in fact non-deterministic, or because of inherent inaccuracies in interval timing in animals (Gibbon, 1977). In this condition, a plausible way to calculate optimal response rates is to set an expectation as to how long the session will last and to calculate the optimal response rate based on that expectation. Based on this, if $t'$ time steps have passed since the beginning of the session, then the expected session duration is $E[T \mid T > t']$ and therefore the value of the rest of the session will be $S_{t', E[T \mid T > t']}$. The following theorem maintains that the optimal rate of outcome delivery that maximizes the value function is a decreasing function of the current time in the session $t'$, and therefore the optimal response rates will decrease throughout the session.

Theorem 2 Assuming $S_{t', E[T \mid T > t']}$ is the value function and that (i) the time-dependent change in the reward field is zero ($\partial A_{x,t}/\partial t = 0$), (ii) the probability that the session ends at each point in time is non-zero ($p(T) > 0$), and (iii) values of outcomes decrease as more are consumed ($\partial A_{x,t}/\partial x < 0$), then the optimal rate of outcome delivery is a decreasing function of $t'$:
$$\frac{dv}{dt'} < 0. \tag{7}$$
See Appendix for the proof of this theorem. Theorem 2 stems from two bases: (i) the optimal rate of outcome delivery is a decreasing function of session length, i.e., when the session duration is long the decision-maker can afford to earn outcomes more slowly, and (ii) when the session duration is unknown, the expected session duration should increase with the passage of time. This phenomenon, which has been elaborated within the context of delayed gratification, is more significant if the decision-maker assumes a heavy-tailed distribution over $T$. Putting (i) and (ii) together implies that the optimal response rate will decrease with the passage of time. Importantly, this suggests that, from a normative perspective, uncertainty about the session duration is necessary in order to explain within-session decreases in response rates.
For simulation of the model we characterized the session duration using a Generalized Pareto distribution following . Simulations of the model are depicted in Figure 1:right panel. Simulations show three different conditions. In condition (i) the session duration is known, and as the figure shows irrespective of whether the reward of outcomes is decreasing or fixed, the optimal response rate is constant. In condition (ii) session duration is unknown, but the reward of outcomes is constant. Again in this condition the optimal response rate is constant. In condition (iii) session duration is unknown and the reward decreases as more outcomes are consumed. Only in this condition, consistent with experimental data, response rates decrease as time passes (see Appendix for details of the simulations). Therefore, the simulations confirm that decreases in the reward of outcomes alone are not sufficient to explain within-session response rates, but by assuming uncertainty about session duration, the pattern of responses will be consistent with the experimental data. Note that a similar pattern will be obtained using any other distribution that assigns a non-zero probability to positive values of T .
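The qualitative pattern across the three conditions can be reproduced with a small re-planning sketch. It is not the authors' simulation code. It assumes a single outcome, a hypothetical linear reward field (a0 - c·x), a quadratic cost a·k²·v² + k·b·v, and a Generalized Pareto session length with location zero, whose mean residual life is (σ + ξt)/(1 - ξ) for ξ < 1. Under these assumptions, maximising the value over constant rates for a known remaining horizon H from state x gives v = (a0 - c·x - k·b)/(2·a·k² + c·H). All names and parameter values below are illustrative.

```python
def expected_remaining_gpd(t, sigma, xi):
    """E[T - t | T > t] for a Generalized Pareto(sigma, xi) session length, 0 <= xi < 1."""
    return (sigma + xi * t) / (1.0 - xi)


def optimal_rate(x, horizon, a0, c, k, a, b):
    """Best constant outcome rate over a known remaining horizon, for the assumed linear field."""
    return max((a0 - c * x - k * b) / (2.0 * a * k**2 + c * horizon), 0.0)


def simulate(remaining_fn, c, duration=30.0, dt=0.1, a0=10.0, k=15, a=0.01, b=0.05):
    """Re-plan at every step from the expected remaining time; return the rate trajectory."""
    x, rates = 0.0, []
    for i in range(int(round(duration / dt))):
        v = optimal_rate(x, remaining_fn(i * dt), a0, c, k, a, b)
        rates.append(v)
        x += v * dt
    return rates


# sigma=21, xi=0.3 gives a mean session length of 30 time units.
known = simulate(lambda t: 30.0 - t, c=0.05)                                    # (i) known duration
unknown_flat = simulate(lambda t: expected_remaining_gpd(t, 21.0, 0.3), c=0.0)  # (ii) constant reward
unknown_sat = simulate(lambda t: expected_remaining_gpd(t, 21.0, 0.3), c=0.05)  # (iii) satiety

for name, rates in (("known", known), ("unknown, flat", unknown_flat), ("unknown, satiety", unknown_sat)):
    print(f"{name:>17}: start={rates[0]:.3f}  end={rates[-1]:.3f}")
```

In this sketch the rate stays flat in the first two conditions and declines only when uncertainty about the session length is combined with a reward that decreases with consumption.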
Relationship to temporal discounting. There are, however, alternative explanations available based on changes in outcome value. One candidate explanation is based on the temporal discounting of outcome value according to which the value of future rewards is discounted compared to more immediate rewards. Typically, the discount value due to delay is assumed to be a function of the interaction of delay and outcome value. If, at the beginning of the session, outcome values are large (e.g., because a rat is more hungry), then any discount produced by selecting a slow response rate will be larger at that point than later in the session when the value of the outcome is reduced (e.g., due to satiety) and so a delay will have less impact. It could be argued, therefore, that it is less punitive to maintain a high response rate at the beginning of the session to avoid delaying outcomes and then to decrease response rate as time passes within the session. As such, temporal discounting predicts decreases in within-session response rates consistent both with experimental observations and with the argument that outcome value decreases within the session (e.g., the satiety effect).
Prediction. Although plausible, such explanations make very different predictions compared to the model. The most direct prediction from the model is that introducing uncertainty into the session duration without altering the average duration should nevertheless lead to a sharper decline in response rate within the session; e.g., if for one group of subjects the session lasts exactly 30 minutes whereas for another group the session length is uncertain and can end at any time (but ends on average after 30 minutes), then the model predicts that the response rate in the second group will be higher at the start and decrease more sharply than in the first group. This effect is not anticipated by the temporal discounting account of the effect.
Effect of experimental parameters: Relating the model to currently available data. Optimal response rates predicted by the model are affected by experimental parameters (e.g., reward magnitude), and these predictions can be compared against experimental data. In general, in an instrumental conditioning experiment, the duration of the session can be divided into three sections: (i) outcome handling/consumption time, which refers to the time that an animal spends consuming the outcome; (ii) post-reinforcer pause, which refers to the pause that occurs after consuming the outcome and before starting to make the next response (e.g., lever press), and which is consistently reported in previous studies using an FR schedule; and (iii) run time, which refers to the time spent making responses (e.g., lever pressing). Experimental manipulations have been shown to have different effects on the duration of these three sections of the session, and whether each of these sections is included when calculating response rates can affect the results. The predictions of the current model are with regard to response rate; whether manipulating experimental parameters is expected to change the two other measures (outcome handling and post-reinforcer pause) cannot be predicted by the current model. In the following sections, we briefly review the currently available data from instrumental conditioning experiments and their relationship to predictions of the model.

The effect of response cost (a and b).
Experimental studies in rats working on an FR schedule (Alling & Poling, 1995) indicate that increasing the force required to make responses causes increases in both inter-response time and the post-reinforcement pause. The same trend has been reported in squirrel monkeys (Adair & Wright, 1976). Consistent with this observation, the present model predicts that increases in the cost of responding, for example by increasing the effort required to press the lever (increases in $a$ and $b$), lead to a lower rate of earned outcomes and a lower rate of responding (Figure 2). The reason for this is that, when the cost increases, the response rate for each outcome should slow in order to compensate for the increase in the cost and so maintain a reasonable gap between the reward and the cost of each outcome.
The effect of ratio-requirement (k). Experimental studies mainly imply that the rate of earned outcomes decreases with increases in the ratio-requirement (Aberman & Salamone, 1999;Barofsky & Hurwitz, 1968), which is consistent with the general trend of the optimal rate of outcome earning implied by the present model (see below).
Experimental studies on the rate of responding on FR schedules indicate that the post-reinforcer pause increases with increases in the ratio-requirement (Ferster & Skinner, 1957, Figure 23; Felton & Lyon, 1966; Powell, 1968; Premack, Schaeffer, & Hundt, 1964). In terms of overall response rates, some experiments report that response rates increase with increases in the ratio-requirement up to a point beyond which response rates start to decline, in rats (Barofsky & Hurwitz, 1968; Kelsey & Allison, 1976; Mazur, 1982), pigeons (Baum, 1993) and mice (Greenwood, Quartermain, Johnson, Cruce, & Hirsch, 1974), although other studies have reported inconsistent results in pigeons (Powell, 1968), or a decreasing trend in response rate with increases in the ratio-requirement (Felton & Lyon, 1966; Foster, Blackman, & Temple, 1997). The inconsistency is partly due to the way that response rates are calculated in the different studies; for example, in some studies outcome handling and consumption time are not excluded when calculating response rates (Barofsky & Hurwitz, 1968), in contrast to other studies (Foster et al., 1997). As a consequence, the experimental data regarding the relationship between response rate and the ratio-requirement are inconclusive. With regard to this issue, the present model predicts that the relationship between response rate and the ratio-requirement is an inverted U-shaped pattern (Figure 3, left panel), which is consistent with the studies mentioned previously depending on the value of other experimental parameters. The model makes an inverted U-shaped prediction because, under a low ratio-requirement, the cost is generally low and, therefore, as the ratio-requirement increases, the decision-maker will make more responses to compensate for the drop in the amount of reward. In contrast, when the ratio-requirement is high, the cost of earning outcomes is high and the margin between the cost and the reward of each outcome becomes significantly tighter as the ratio-requirement increases. The margin can, however, be loosened by decreasing the response rate.
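The inverted U-shaped relationship can be illustrated with a small numerical sketch. It rests on assumptions that are not taken from the original simulations: a single outcome, a known session duration T, a hypothetical linear reward field (a0 - c·x), and a quadratic cost a·k²·v² + k·b·v. Under these assumptions the best constant outcome rate is v = (a0 - k·b)/(2·a·k² + c·T), and the response rate is k·v. All parameter values are illustrative.

```python
def optimal_outcome_rate(k, T=30.0, a0=10.0, c=0.05, a=0.01, b=0.05):
    """Best constant outcome rate for ratio-requirement k under the assumed linear field."""
    return max((a0 - k * b) / (2.0 * a * k**2 + c * T), 0.0)


def response_rate(k, **params):
    """Responses per unit time: k responses are needed for each outcome."""
    return k * optimal_outcome_rate(k, **params)


ratios = list(range(1, 101))
outcome_rates = [optimal_outcome_rate(k) for k in ratios]
response_rates = [response_rate(k) for k in ratios]

# Ratio-requirement at which the response rate peaks in this sketch.
peak_k = ratios[response_rates.index(max(response_rates))]
print("ratio-requirement with the highest response rate:", peak_k)
print("outcome rate at k=1, k=peak, k=100:",
      round(outcome_rates[0], 3), round(outcome_rates[peak_k - 1], 3), round(outcome_rates[-1], 3))
```

In this sketch the rate of earned outcomes falls monotonically with the ratio-requirement, whereas the response rate first rises and then falls, which mirrors the qualitative pattern described above.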
The effect of deprivation level. Experimental studies that have used FR schedules suggest that response rates increase with increases in deprivation (Ferster & Skinner, 1957, Chapter 4; Sidman & Stebbins, 1954). However, such increases are mainly due to decreases in the post-reinforcement pause, and not due to increases in the actual rate of responding after the pause (see Pear, 2001, p. 289, for a review of other reinforcer schedules). The model predicts that, with increases in deprivation, the rate of responding and of earned outcomes will increase linearly (Figure 3, middle panel). The rate of increase is, however, negligible when the session duration is long, in which case, even under high deprivation, sufficient time is available to earn sufficient reward and become satiated. This provides a potential reason why the effect of deprivation on response rate has not previously been observed in experimental data.
The effect of reward magnitude. Some studies show that post-reinforcement pauses increase as the magnitude of the reward increases (Powell, 1969), whereas other studies suggest that the post-reinforcement pause decreases (Lowe, Davey, & Harzem, 1974), although, in this latter study, the magnitude of the reward was manipulated within-session and a follow-up study showed that, at a steady state, the post-reinforcement pause increases with increases in the magnitude of the reward (Meunier & Starratt, 1979). Reward magnitude does not, however, have a reliable effect on the overall response rate (Keesey & Kling, 1961; Lowe et al., 1974; Powell, 1969).
Regarding predictions from the model, the effect of the reward magnitude on earned outcome and response rates is, again, predicted to take an inverted U-shaped relationship (Figure 3, right panel), and, therefore, depending on the value of the parameters, the predictions of the model are consistent with the experimental data. The model makes an inverted U-shaped prediction because, when the reward magnitude is large then, given high response rates, the animals will become satiated quickly and, therefore, the reward value of future outcomes will decrease substantially if the animal maintains this high response rate. As a consequence, under a high reward magnitude condition, increases in reward will cause response rates to decrease. Under a low reward magnitude condition, however, satiety has a negligible effect and a high response rate ensures that sufficient reward can be earned before the session ends.

Optimal choice and response vigor
In this section we address the choice problem, i.e., the case where there are multiple outcomes available in the environment and the decision-maker needs to make a decision about the response rate along each outcome dimension. An example of this situation is a concurrent instrumental conditioning experiment in which two levers are available and pressing each lever produces an outcome on a ratio schedule. Unlike the case with single outcome environments, the optimal rate of earning outcomes is not necessarily constant and can take different forms depending on whether the reward field is a conservative field or a non-conservative field, and whether the costs of responses along the outcome dimensions are independent of each other. Below, we derive the optimal choice strategy in each condition.
Conservative reward field. A reward field is conservative if the amount of reward experienced by consuming different outcomes does not depend on the order of consumption and depends only on the number of each outcome earned by the end of the session. This property holds in two conditions: (i) when the value of each outcome is unrelated to the prior consumption of other outcomes; and (ii) when the consumption of an outcome affects the value of other outcomes and this effect is symmetrical. As an example of condition (i), imagine an environment with two outcomes in which one of the outcomes only satisfies thirst and the other only satisfies hunger.
Here, consumption of one of the outcomes will not affect the amount of reward that will be experienced by consuming the other outcome and, therefore, the total reward during the session does not depend on the order of choosing the outcomes. As an example of condition (ii), imagine an environment with two outcomes in which both outcomes satisfy hunger and, therefore, consuming one of the outcomes reduces the amount of future reward produced by the other outcome. Here, if the effect of the outcomes on each other is symmetrical, i.e., consuming outcome O 1 , reduces the reward elicited by outcome O 2 by the same amount that consuming outcome O 2 reduces the reward elicited by outcome O 1 , then it will not matter which outcome is consumed first and the total reward during the session will be independent of the chosen order. As such, the reward field will be conservative.
More precisely, a reward field is conservative if there exists a scalar field D x such that: It can be shown that if a reward field satisfies equation 8 then the amount of reward experienced in a session depends on the total number of earned outcomes. Under this condition the optimal response rate will be constant for each outcome relative to the other. Intuitively, this is because, in terms of the total rewards per session, the only thing that matters is the final number of earned outcomes and, therefore, there is no reason why the relative allocation of responses to outcomes should vary within the session. The actual response rate for each outcome will, however, depend on whether the costs of the outcomes are independent, a point elaborated in the next section.

Conservative reward field and independent response cost.
The costs of various outcomes are independent if the decision-maker can increase their work for one outcome without affecting the cost of other outcomes. As an example, imagine a decision-maker that is using their left hand to make responses that earn one outcome and their right hand to make responses that earn the second outcome. In this case, the independence assumption entails that the cost of responding with one or the other hand is determined by the response rate on that hand; e.g., the decision-maker can increase or decrease the rate of responding on the left hand without affecting the cost of responses on the right hand. More precisely, the independence assumption entails that the Hessian matrix of $K_v$ is diagonal:
$$\frac{\partial^2 K_v}{\partial v_i\,\partial v_j} = 0, \quad i \neq j. \qquad (9)$$
In this situation, even if some of the outcomes have a lower reward or a higher cost (inferior outcomes) compared to other outcomes (superior outcomes), it is still optimal to allocate a portion of responses to the inferior outcomes. This is because responding for inferior outcomes has no effect on the net reward earned from superior outcomes and, therefore, as long as the response rate for inferior outcomes is sufficiently low that the reward earned from them is more than the cost paid, responding for them is justified. The portion of responses allocated to each outcome depends, however, on the cost and reward of each outcome. We maintain the following theorem:
Theorem 3 If (i) the reward field is conservative, i.e., there exists a scalar field $D_x$ such that equation 8 is satisfied, (ii) the time-dependent term of the reward field is zero ($\partial A_{x,t}/\partial t = 0$), and (iii) the cost function satisfies equation 9, then the optimal rate of earning outcomes $v^*$ will be constant ($dv/dt = 0$) and satisfies the following equation: See Appendix for the proof and for the specification of optimal responses.
As an example, imagine a concurrent fixed-ratio (FR) schedule in which an animal needs to make $k$ responses on the left lever in order to earn $O_1$, and $lk$ responses on the right lever in order to earn $O_2$, and both outcomes have the same reward properties. According to Theorem 3, the optimal response rate for $O_1$ (the outcome with the lower ratio-requirement) is $l$ times greater than the response rate for the second outcome $O_2$. Figure 4: left panel ("independent cost" condition) shows the simulation of the model and the optimal trajectory in the outcome space. As the figure shows, the rate of earning $O_1$ is higher than that of $O_2$; however, the relative proportion of earned outcomes remains the same throughout the session. The above results are generally in line with the probability matching notion, which states that a decision-maker allocates responses to outcomes based on the ratio of responses required for each outcome (Estes, 1950). Probability matching is often argued to violate rational choice theory because, within this theory, it is expected that a decision-maker works exclusively for the outcome with the higher probability (i.e., the lower ratio-requirement). However, based on the model proposed here, probability matching is the optimal strategy under independent response costs and is therefore consistent with rational decision-making.

Conservative reward field and dependent response cost.
In this section we assume that the cost of responses for an outcome is affected by the rate of responses for earning other outcomes. In other words, what determines the cost is the delay between subsequent responses, either for the same or for a different outcome; i.e., the cost is proportional to the rate of earning all of the outcomes. In instrumental conditioning this assumption entails, for example, that the cost of pressing, say, the right lever is determined by the time that has passed since the last press on either the right or the left lever. In this condition the optimal strategy is maximisation; i.e., to take the action with the higher reward (or lower ratio-requirement) and to stop taking the other action (see Theorem A2 in Appendix). The reason is that, unlike the case with independent costs, allocating more responses to earn an inferior outcome will increase the cost of earning superior outcomes and, therefore, it is better to pay the cost for the superior outcome only, which requires fewer responses per unit of outcome.
For example, under a concurrent FR schedule in which an animal needs to make $k$ responses on the left lever to earn $O_1$, and $lk$ responses on the right lever to earn $O_2$ ($O_1$ and $O_2$ have the same reward properties), the optimal response rate will be a constant response rate on the left lever and a zero response rate on the right lever. Figure 4: left panel ("dependent cost" condition) shows a simulation of the model and the optimal trajectory in outcome space, which shows that the subject only earns $O_1$.
As such, whether the outcome rate reflects a probability matching or a maximization strategy depends on the cost function and, in instrumental conditioning experiments, the cost that reflects the maximization strategy is more readily applicable. Regarding the experimental data, evidence from concurrent instrumental conditioning experiments in pigeons tested using VR schedules (Herrnstein & Loveland, 1975) is in line with the maximization strategy and consistent with predictions from the model. Within the wider scope of decision-making tasks, some studies are consistent with probability matching, whereas other studies provide evidence in line with the maximization strategy (see Vulkan, 2000, for a review). In many of these latter studies, however, the decision-making task involved making a single choice (e.g., a single button press) with immediate feedback (about whether the choice was rewarded) after which the next trial was initiated and, because such disjointed actions are unlikely to convey a rate-dependent cost, the structure of such studies cannot be readily related to the model proposed here.
Prediction. One way of testing the above explanation for maximization and matching strategies is to compare the pattern of responses when two different effectors are used to make responses for the outcomes vs. when a single effector is being used to earn both outcomes. In the first condition the costs of the two outcomes are independent of each other whereas in the second condition they are dependent on each other. As a consequence, the theory predicts that, in the first condition, response rates will reflect probability matching whereas in the second condition they will reflect the maximization strategy.

[Figure caption] The optimal trajectories in the outcome space when the reward field is non-conservative. The graph shows the optimal trajectory in the conditions that the session duration is short (T = 7), medium (T = 15.75) and long (T = 23).

Non-conservative reward field.
A reward field is non-conservative if the total amount of reward experienced during the session depends on the order of the consumption of the outcomes. Imagine an environment with two outcomes, say $O_1$ and $O_2$, where both outcomes have the same motivational properties, e.g., consumption of one unit of either $O_1$ or $O_2$ decreases hunger by one unit, but they generate different amounts of reward, e.g., one unit of $O_1$ generates more reward than one unit of $O_2$. Within such an environment, if hunger is high then consuming $O_1$ generates significantly more reward than $O_2$ and, therefore, early in the session it is better to allocate more responses to $O_1$, whereas later in the session (when hunger is presumably lower and the difference in the value of the outcomes is small) responses for $O_2$ can gradually increase. If we reverse this order, i.e., first $O_2$ is consumed and then $O_1$, then early consumption of $O_2$ will cause satiety and the decision-maker will lose the chance to experience high reward from $O_1$ when hungry. As such, the amount of experienced reward depends on the order of consuming the outcomes and, based on the above explanation, a larger amount of reward can be earned during the session if more responses are allocated to the outcome with the higher reward at the beginning of the session (see Theorem A3 in Appendix). Figure 4: right panel shows the simulations of the model under different session durations. In each simulation, the amount of earned outcomes is zero at the beginning of the session, and each line in the figure shows the trajectory of the amount earned from each outcome during the session. As the figure shows, in all conditions the rate of earning $O_1$ is higher than that of $O_2$, and this difference is more apparent under long session durations, in which a large amount of reward can be gained during the session and it makes a significant difference to earn $O_1$ first.
Prediction. A test of the above prediction would involve an experiment in which the subject responds for two food outcomes containing an equal number of calories (and therefore having the same impact on motivation) but with different levels of desirability (e.g., different flavors) and, therefore, different reward effects. Theorem A3 predicts that, if the session duration is long enough, early in the session the response rate for the outcome with the greater desirability will be higher whereas, later in the session, responses for the other outcome will increase.

Discussion
Computational models of action selection are essential for understanding decision-making processes in humans and animals, and here we extended them by providing a general analytical solution to the problem of response vigor and choice. There are two significant differences between the model proposed here and previous models of response vigor. Firstly, although the effect of between-session changes in outcome values on response vigor was addressed in previous models (Niv, Joel, & Dayan, 2006), the effects of on-line changes in outcome values within a session were not addressed. On the other hand, the effect of changes in outcome value on the choice between actions has been addressed in some previous models; however, their role in determining response vigor has not been investigated. We address such limitations directly in this model.
Secondly, previous work conceptualized the structure of the task as a semi-Markov decision process and derived the optimal actions that maximize the average reward per unit of time (average reward). Here, we used a variational analysis to calculate the optimal actions that maximize the reward earned within the session. One benefit of the approach taken in the previous works is that it extends naturally to a wide range of instrumental conditioning schedules such as interval schedules, whereas the extension of the model proposed here to the case of interval schedules is not trivial. Optimizing the average reward (as adopted in previous work) is equivalent to the maximization of reward in an infinite-horizon time scale; i.e., the session duration is unlimited. In contrast, the model used here explicitly represents the duration of the session which, as we showed, plays an important role in the pattern of responses.
Relationship to principle of least action. A basic assumption that we made here is that the decision-maker takes actions that yield the highest amount of reward. This reward maximization assumption has a parallel in the physics literature known as the principle of least action, which implies that among all trajectories that a system can take, the true trajectories are the ones that minimize the action. Here action has a different meaning from that used in the psychology literature, and it refers to the integral of the Lagrangian ($L$) along the trajectory. In the case of a charged particle with charge $q$ and mass $m$ in a magnetic field $B$, the Lagrangian will be: where $A$ is the vector potential ($B = \nabla \times A$). By comparing equation 11 with equations 4 and 5, we can see that the reward field $A_{x,t}$ corresponds to the vector potential, and the term $K_v$ corresponds to $\frac{1}{2}mv^2$ by assuming $m = 2ak^2$ and $b = 0$. This parallel can provide some insights into the properties of the response rates. For example, it can be shown that when the Lagrangian is not explicitly dependent on time (time-translational invariance), which here implies that $\partial A_{x,t}/\partial t = 0$, then the Hamiltonian ($H$, or energy) of the system is conserved. The Hamiltonian in the case of the system defined in equation 4 (with a single outcome) is: Conservation of the Hamiltonian implies that $ak^2v^2$ (and therefore $v$) is constant (the response rate is constant), as stated by Theorem 1. Further exploration of this parallel can be an interesting future direction.
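The conservation argument can be checked with a short derivation. The sketch below assumes that equation 4 has the form $L = vA_{x,t} - K_v$ and that equation 3 has the form $K_v = ak^2v^2 + kbv$; both forms are inferred from the Appendix text rather than quoted, and the sign convention $H = v\,\partial L/\partial v - L$ is also an assumption.

```latex
\begin{align*}
H &= v\,\frac{\partial L}{\partial v} - L
   = v\left(A_{x,t} - \frac{\partial K_v}{\partial v}\right) - \left(vA_{x,t} - K_v\right) \\
  &= K_v - v\,\frac{\partial K_v}{\partial v}
   = \left(ak^2v^2 + kbv\right) - v\left(2ak^2v + kb\right) \\
  &= -ak^2v^2 .
\end{align*}
```

Under these assumptions the linear term $kbv$ cancels, so a conserved $H$ implies a constant $v$ for any value of $b$, not only for $b = 0$.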

Value in non-deterministic schedules
The value of a trajectory in the outcome space is the sum of the net amount of rewards that can be earned during a session. However, the amount of reward earned during a session can be non-deterministic, as for example in the case of VR and RR schedules of reinforcement, actions lead to outcomes probabilistically. Similarly, the cost of earning outcomes will also be nondeterministic, since the number of responses required to earn outcomes is non-deterministic. Let's denote the cost of earning outcomes under such non-deterministic schedules by K v . Using this, we define the value function as the sum of the expected net amount of rewards that will be earned during a session: where the expectation is w.r.t the number of earned outcomes along each outcome dimension during d t time step. Following the above definition, we have: where L x,v,t is the expected net reward earned in d t time step. In the main text and in the following sections, E[v] is denoted by v for simplicity of notation. By replacing v by E[v] in equation 4 we get: In the main text, equation A14 (equation 4 in the main text) was used instead of equation A13, and the aim of this section is to show that equation A14 and equation A13 are equivalent. We first consider environments with one-dimensional outcome spaces, and then we extend it to the case of environments with multi-dimensional outcome spaces. We maintain the following theorem: Theorem A1 Assume that the cost of one response, given that the delay since the last response is τ, is as follows: Furthermore, assume that on average, or exactly, k responses are required to earn one unit of the outcome, and r is the number of outcomes earned. Then we have: where Proof. We first provide an intuitive explanation for why the cost defined in equation A15 is the same as the cost defined in equation A17 in the case of FR schedules of reinforcement (i.e., exactly k responses are required to earn an outcome). 
Earning the outcome at rate $v$ implies that the time it takes to earn the outcome is $1/v$, and since $k$ responses have been executed in this period, the delay between responses is:
$$\tau = \frac{1}{kv},$$
and therefore, using equation A15 (equation 2 in the main text), the cost of making one response will be $akv + b$. Since $k$ responses are required for earning each outcome, the total cost of earning one unit of the outcome will be $k$ times the cost of one response, which will be $k(akv + b)$. Since the total amount of outcome earned is $v\,dt$, the total cost per unit of time will be:
$$K_v = vk(akv + b) = ak^2v^2 + kbv,$$
which is equivalent to equation 3 and A17.
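The FR argument above can be checked numerically. The following is a minimal sketch, assuming the per-response cost is $a/\tau + b$ (as the text states for equation A15) and that the inter-response delay at outcome rate $v$ is $1/(kv)$; the function names are illustrative and not taken from the original work.

```python
def cost_per_response(tau, a, b):
    """Cost of one response made after a delay tau (form stated for equation A15)."""
    return a / tau + b


def cost_rate_from_responses(v, k, a, b):
    """Cost per unit time on an FR-k schedule, built up response by response."""
    tau = 1.0 / (k * v)                           # k responses in the 1/v needed per outcome
    cost_per_outcome = k * cost_per_response(tau, a, b)
    return v * cost_per_outcome                   # v outcomes are earned per unit time


def cost_rate_closed_form(v, k, a, b):
    """Closed form a*k^2*v^2 + k*b*v that the step-by-step argument arrives at."""
    return a * k**2 * v**2 + k * b * v


if __name__ == "__main__":
    for v in (0.5, 1.0, 4.0):
        print(v, cost_rate_from_responses(v, 15, 0.002, 0.1),
              cost_rate_closed_form(v, 15, 0.002, 0.1))
```

The two functions agree for any positive $v$, which is the equivalence the intuitive argument relies on.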
We now show that equation A16 and equation A13 are equivalent in order to prove Theorem A1. Equation A13 has two terms. As for the first term, A x,t can be assumed to be constant in d t time step, and therefore we have: As for the second term we maintain that: To show the above relation, assume that r is the number of outcomes earned after making one response. Since according to the definition of RR and VR schedules, out of N responses on average N /k will be rewarded, we have E r [r ] = 1/k and the expected rate of outcome earning will be: Furthermore, according to equation A15 the cost of one response is a/τ + b, and therefore, the cost per unit of time will be: Therefore: (using equation A17), which proves equation A21. Substituting equations A21, A20 in equation A13 yields equation A16, which proves the theorem. We now turn to the case of multi-dimensional outcome spaces. The aim is to show equation A13 is equivalent to equation A14. To show this, we first maintain that: which holds since A x,t can be assumed to be constant during d t time step. Next, we show that: which states that E[K v ] can be represented as a function of E [v]. To show this, assume r i is the number of outcomes earned after making one response for outcome i , and τ i is the delay between responses for outcome i . We have: and therefore τ i can be expressed as a function of E[v i ]. Next, assume that [C τ ] i is the cost of making one response for outcome i with delay τ i between the responses, and τ is a vector containing the delay between responses for each outcome (τ = [τ 1 . . . τ n ]). In d t time step, d t /τ i responses for outcome i are made, and therefore the total cost in d t time period will be [C τ ] i d t /τ i , which implies that the cost for outcome i per unit of time is [C τ ] i /τ i . Given this, the total cost paid for all the outcomes per unit of time will be: where we used the fact that τ in C τ can be expressed using

Optimal actions in one-dimensional outcome space
The aim is to derive optimal actions when the outcome space has one dimension. Given the reward field A x,t , the reward of gaining d x units of outcome will be A x,t d x, and since d x = vd t , the reward earned in each time step is v A x,t . Given that K v is the cost function (the cost paid in each time step), the net reward in each time step can be written as: and based on this, the value, which is the sum of net rewards in each time step, will be: The optimal rates that maximize S 0,T can be found using different variational calculus methods such as the Euler-Lagrange equation, or the Hamilton-Jacobi-Bellman equation . Here we use the Euler-Lagrange form, which sets a necessary condition for δS = 0: Furthermore, since the end-point of the trajectory is free (the amount of outcomes that can be gained during a session is not limited, but the duration of the session is limited to T ), the optimal trajectory will also satisfy the transversality conditions: which implies: where as mentioned T is the total session duration. By substituting equation A27 in equation A29 we will have: The term d A x,t /d t has two components: the first component is the change in A x,t due to the change in x and the second component is due to the time-dependent changes in A x,t : Furthermore we have: Substituting equations A33, A34 in equation A32 yields: In the condition that the rate of outcome earning is constant (d v/d t = 0), we have x T = vT , which by substituting in equation A31 yields: The above equation will be used in order to calculate the optimal rate of outcome earning.
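The steps described above can be written out explicitly. The following is a sketch of the derivation, assuming that the net reward per time step has the form $L = vA_{x,t} - K_v$ as described in the text. The equation numbers of the original are not reproduced because they cannot be matched to these lines with certainty.

```latex
\begin{align*}
L(x,v,t) &= vA_{x,t} - K_v, \qquad S_{0,T} = \int_0^T L\,dt
  && \text{(net reward and value)} \\
\frac{d}{dt}\frac{\partial L}{\partial v} &= \frac{\partial L}{\partial x}
  && \text{(Euler--Lagrange)} \\
\frac{dA_{x,t}}{dt} - \frac{\partial^2 K_v}{\partial v^2}\frac{dv}{dt}
  &= v\,\frac{\partial A_{x,t}}{\partial x}
  && \text{(substituting } L\text{)} \\
\frac{dA_{x,t}}{dt} &= v\,\frac{\partial A_{x,t}}{\partial x} + \frac{\partial A_{x,t}}{\partial t}
  && \text{(chain rule, } dx = v\,dt\text{)} \\
\frac{\partial^2 K_v}{\partial v^2}\frac{dv}{dt} &= \frac{\partial A_{x,t}}{\partial t}
  && \text{(combining the two lines above)} \\
\left.\frac{\partial L}{\partial v}\right|_{t=T} = 0
  \;&\Rightarrow\; A_{x_T,T} = \left.\frac{\partial K_v}{\partial v}\right|_{t=T}
  && \text{(free end-point)} \\
A_{vT,T} &= \frac{\partial K_v}{\partial v}
  && \text{(constant rate, } x_T = vT\text{)}
\end{align*}
```

Read this way, the sign of $dv/dt$ follows the sign of $\partial A_{x,t}/\partial t$ whenever the cost is strictly convex in $v$, and the last line is the condition used to compute the optimal constant rate.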

Theorem 1: Proof
The cost function $K_v$ defined in equation 3 satisfies the following relation:
$$\frac{\partial^2 K_v}{\partial v^2} > 0, \qquad (A37)$$
which holds as long as at least one response is required to earn an outcome ($k > 0$), and the cost of earning outcomes is non-zero ($a > 0$). Assuming that $\partial A_{x,t}/\partial t = 0$, and given equation A37, the only admissible solution to equation A35 is:
$$\frac{dv}{dt} = 0.$$
Furthermore, assuming $\partial A_{x,t}/\partial t > 0$, and given equation A37, the only admissible solution to equation A35 is:
$$\frac{dv}{dt} > 0,$$
which proves Theorem 1.

Theorem 2: Proof and simulation details
Proof of Theorem 2. In order to prove the theorem, we first provide a lemma. Assuming that the total session duration (T ) has the probability density function f (T ) and that f (T ) > 0, here we show that the expectation of the total session duration never decreases as time passes throughout the session.

Lemma 1 Let's denote the expectation of the session duration at time $t$ with T and assume $T$ has the following probability density function: Then:
Proof. We have: where $F(T)$ is the cumulative distribution function of $T$.
Based on the above lemma, we show that the optimal response rate is a decreasing function of $t$. Based on equation A31, the optimal response rate satisfies the following equation: Taking the derivative w.r.t. $t$ we get: Theorem 2 assumes that $\partial A_{x,t}/\partial x < 0$ and $\partial A_{x,t}/\partial T = 0$, which, given equations A42 and A37 and that $v > 0$, yields: which implies that the rate of earning outcomes decreases as time passes within a session.

Simulation details.
The simulation of the model depicted in Figure 1:right panel requires defining (i) the reward field, (ii) the cost function, and (iii) a probability distribution over the session duration. As for the probability distribution of the session duration, following McGuire et al , we assumed that T follows a Generalized Pareto distribution: where k is a shape parameter (note that k is not the ratio-requirement here) and σ is a scale parameter, and the third parameter (location µ) was assumed to be zero. Furthermore we have: which has the following expected value: which as we expect is an increasing function of t . For the simulation of the model we assumed that k = 0.9 and σ = 6, which represents that the initial expectation for the session duration is 60 minutes.
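The expected session duration can be illustrated with a short sketch. It assumes the standard mean-excess property of the Generalized Pareto distribution with location zero: for shape $0 \le k < 1$, the expected excess over a threshold $t$ is $(\sigma + kt)/(1 - k)$. This closed form is a textbook property and is not quoted from the original equations, which are not shown above; the function name is illustrative.

```python
def expected_session_duration(t, k=0.9, sigma=6.0):
    """E[T | T > t] for a Generalized Pareto T with shape k, scale sigma, location 0.

    Uses the mean-excess formula (sigma + k*t) / (1 - k), valid for 0 <= k < 1.
    """
    if not 0.0 <= k < 1.0:
        raise ValueError("the mean excess is only finite and in this form for 0 <= k < 1")
    if t < 0:
        raise ValueError("elapsed time t must be non-negative")
    return t + (sigma + k * t) / (1.0 - k)


if __name__ == "__main__":
    # With k = 0.9 and sigma = 6 the initial expectation is about 60 minutes.
    for t in (0, 10, 30, 60):
        print(t, expected_session_duration(t))
```

With the parameters given in the text the expectation starts near 60 and grows linearly with elapsed time, which matches the statement that it is an increasing function of $t$.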
For the cost function, in all the simulations the cost defined in equation 3 was used, which is equivalent to the cost function used in the previous works .
For the definition of the reward field, we used the framework provided by Keramati et al , which provides a computational model for how the values of outcomes change with the consumptions of the outcomes. They suggested that the dependency of the reward field on the amount of outcome earned is indirect and it is through the motivational drive. They conceptualized the motivational drive as the deviations of the internal states of a decision-maker from their homeostatic set-points. For example, let's assume that there is only one internal state, say hunger, where H denotes its homeostatic set-point (which corresponds to the deprivation level, assuming that initial value of x is zero), and there is an outcome which consuming each unit of it satisfies l units of the internal state. In this condition, the motivational drive at point x, denoted by D x , will be: Keramati et al  showed that such a definition of the motivational drive has implications that are consistent with the behavioral evidence. According to the framework, the reward generated by earning δx units of the outcome is proportional to the change in the motivational drive, which can be expressed as: As equation A51 suggests, with earning more outcomes (increase in x) A x,t decreases. Given the above reward field, the optimal response rate of outcome earning, obtained by equation A36, will be: Equation A52 was used in the simulations of the "decreasing reward and unknown session duration" condition in Figure 1:right panel. The simulation of this condition was done using parameters k = 15, l = 0.1, a = 0.002, b = 0.1, H = 80. As for T , in each time t within the session, the expected session duration (E[T |T > t ]) was calculated using equation A49, and was used as T in equation A52. For the "known session duration (fixed or decreasing reward)" condition in Figure 1:right panel, the same parameters as the previous condition were used, but the session duration was fixed to T = 60. 
For the "fixed reward (known or unknown session duration)" condition, we assumed that the reward field is independent of the amount of reward earned: Given the above reward field, the optimal rate of outcome earning is: The simulation of this condition was done using parameters k = 15, l = 0.1, a = 0.002, b = 0.1, H = 40. Note that in this condition the response rate was independent of the session duration.
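The "decreasing reward, known session duration" condition can be sketched as follows. The code assumes a quadratic motivational drive $D_x = \frac{1}{2}(H - lx)^2$, so that the reward field is $A_x = l(H - lx)$, and solves the constant-rate condition $A_{vT} = \partial K_v/\partial v$ with $K_v = ak^2v^2 + kbv$. The quadratic drive is an assumption made for illustration (the original expressions for the drive and for equation A52 are not shown above); it was chosen because it reproduces the closed-form rates quoted for Figure 4.

```python
def optimal_outcome_rate(T, k, l, a, b, H):
    """Constant outcome rate v* solving l*(H - l*v*T) = 2*a*k^2*v + k*b.

    Assumes (hypothetically) a quadratic drive D_x = 0.5*(H - l*x)^2.
    The rate is clipped at zero because negative rates are not admissible.
    """
    v = (l * H - k * b) / (l**2 * T + 2.0 * a * k**2)
    return max(0.0, v)


def optimal_response_rate(T, k, l, a, b, H):
    """Response rate: k responses are needed per unit of outcome."""
    return k * optimal_outcome_rate(T, k, l, a, b, H)


if __name__ == "__main__":
    # Parameters quoted for Figure 1 (right panel), known session duration T = 60.
    print(optimal_response_rate(T=60, k=15, l=0.1, a=0.002, b=0.1, H=80))
```

Under this assumed drive, a longer session duration $T$ enlarges the denominator and so lowers the optimal rate, which is the mechanism behind the declining rate when the expected duration grows within a session.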
The response rates in all the conditions were obtained by multiplying the outcome rates by k (since k responses are required to earn one unit of outcome).
Simulation details of Figures 2, 3. The simulations depicted in Figure 2 and Figure 3 use equation A52 with the following parameters (note that the optimal response rates were obtained by multiplying v * by k). For Figure 2: right panel, the simulation parameters are T = 50, k = 1, l = 1, a = 1, H = 8. Parameter b was varied from 3 to 7 in order to generate the plot.
For Figure 3: left panel, the simulation parameters are T = 50, l = 1, a = 0.3, b = 0.05, H = 100. Parameter k was varied from 1 to 100 in order to generate the plot.
For Figure 3: middle panel, the simulation parameters are T = 50, k = 1, l = 1, a = 0.3, b = 0.05. Parameter H was varied from 10 to 100 in order to generate the plot.
For Figure 3: right panel, the simulation parameters are T = 50, k = 1, a = 0.1, b = 0.1, H = 100. Parameter l was varied from 0 to 1 in order to generate the plot.
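The inverted U-shaped effect of reward magnitude can be illustrated with the Figure 3 (right panel) parameters. The sketch below sweeps $l$ and reuses a hypothetical closed form $v^* = (lH - kb)/(l^2T + 2ak^2)$, which follows from an assumed quadratic drive $D_x = \frac{1}{2}(H - lx)^2$; the original equation A52 is not shown above, so this form is an assumption and the numbers are illustrative only.

```python
def optimal_response_rate(l, T=50.0, k=1.0, a=0.1, b=0.1, H=100.0):
    """Response rate k*v* under an assumed quadratic drive (hypothetical form of A52)."""
    v = (l * H - k * b) / (l**2 * T + 2.0 * a * k**2)
    return k * max(0.0, v)          # negative rates are not admissible


def sweep_reward_magnitude(n_steps=1000):
    """Evaluate the response rate for l on an even grid from 0 to 1."""
    magnitudes = [i / n_steps for i in range(n_steps + 1)]
    rates = [optimal_response_rate(l) for l in magnitudes]
    return magnitudes, rates


if __name__ == "__main__":
    magnitudes, rates = sweep_reward_magnitude()
    peak = max(range(len(rates)), key=rates.__getitem__)
    print("peak at l =", magnitudes[peak], "rate =", rates[peak])
    print("rate at l = 1:", rates[-1])
```

In this sketch the rate rises steeply for small $l$, peaks at a small intermediate magnitude, and then falls as satiety dominates, which is the inverted U-shaped pattern described in the main text.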

Optimal actions in multi-dimensional outcome space
The aim of this section is to derive the optimal actions in the condition that the outcome space is multi-dimensional. Optimal trajectory will satisfy the Euler-Lagrange equation along each outcome dimension: where: Furthermore since the end point of the trajectory is free (the total amount of outcomes is not fixed) we have: Using equation A55, A56 we have: Proof. Using equation A66 we get: Note that the use of Schwarz's theorem is based on the assumption that D x is twice differentiable, which holds in the circumstances that we consider here.

Theorem 3: Proof and simulation details
Proof of Theorem 3. Theorem 3 assumes that (i) the costs of earning outcomes are independent (equation A63), (ii) the reward field is conservative (equation A66), and (iii) the reward field is independent of time ($\partial A_{x,t}/\partial t = 0$). Based on Lemma 2, Lemma 3 and equation A61 we have: Given that equation A37 holds along each outcome dimension ($\partial^2 K_v/\partial v^2 > 0$), the only admissible solution to equation A68 is $dv/dt = 0$, which shows that the optimal rate of earning outcomes is constant. Since the optimal rate is constant, we have $x_T = Tv^*$, which by substituting in the boundary conditions implied by equation A57 yields equation 10: which completes the proof of the theorem.
Simulation details. For the simulation of the model in Figure 4: left panel ("independent cost" condition), it is assumed that the two outcomes have the same reward effect, but earning the second outcome requires $l$ times more responses. Following Keramati et al, since the two outcomes have the same reward properties we defined the motivational drive as follows: where, as mentioned, $D_x$ is the motivational drive and it represents the deviation of the internal state of the decision-maker from its homeostatic set-point ($H$). $x_1$ is the amount of $O_1$ earned and $x_2$ is the amount of $O_2$ earned, and the current motivational drive for earning outcomes depends on the difference between the total amount of earned outcomes ($x_1 + x_2$) and the homeostatic set-point ($H$). Given the motivational drive, the amount of reward generated by consuming each outcome will be equal to the amount of change in the motivational drive due to the consumption of the outcomes (equation 8), and therefore, we have: The above equation was used as the reward field in the simulations. As for the cost function, earning one unit of $O_1$ requires $k$ responses on the left lever, and earning one unit of $O_2$ requires $lk$ responses on the right lever.
Based on this and using equation 3, the cost function will be: where $v_1$ is the rate of earning $O_1$ and $v_2$ is the rate of earning $O_2$. Using Theorem 3, the optimal response rate will be:
$$\begin{cases} \dfrac{k l^2 H}{T l^2 + 2 a k^2 l^2 + T} & \text{for left lever} \\[2ex] \dfrac{k l H}{T l^2 + 2 a k^2 l^2 + T} & \text{for right lever} \end{cases}$$
where, as mentioned in the main text, "left lever" is the response that should be taken for earning $O_1$, and "right lever" is the response that should be taken for earning $O_2$. Parameters used for simulations are $k = 1$, $l = 2$, $a = 1$, $b = 0$, $H = 100$.
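The closed-form rates above can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' simulation code. It assumes a reward field of the form $A = H - (x_1 + x_2)$ for both outcomes, an independent cost $K_v = a k^2 v_1^2 + a (lk)^2 v_2^2$ with $b = 0$, and the end-point condition $A_{x_T} = \partial K_v/\partial v_i$ with $x_T = T v^*$. The session duration `T = 50` and both function names are hypothetical; the text does not state a value for $T$.

```python
def independent_cost_rates(k, l, a, H, T):
    """Closed-form response rates (left, right) quoted in the text, for b = 0."""
    denom = T * l**2 + 2 * a * k**2 * l**2 + T
    return k * l**2 * H / denom, k * l * H / denom


def first_order_residuals(v1, v2, k, l, a, H, T):
    """Residuals of the assumed end-point conditions A(x_T) = dK/dv_i."""
    marginal_reward = H - T * (v1 + v2)  # assumed reward field at x_T = T * v
    return (marginal_reward - 2 * a * k**2 * v1,
            marginal_reward - 2 * a * (l * k)**2 * v2)


k, l, a, H = 1, 2, 1, 100  # parameters from the text (b = 0)
T = 50                     # hypothetical session duration, not from the source

left, right = independent_cost_rates(k, l, a, H, T)
# Convert response rates back to outcome rates: O1 needs k responses, O2 needs l*k.
v1, v2 = left / k, right / (l * k)
residuals = first_order_residuals(v1, v2, k, l, a, H, T)
print(left, right, residuals)
```

Both residuals should vanish up to floating-point error, and the ratio of left-lever to right-lever response rates equals $l$, independently of $T$.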

Theorem A2: Definition, proof and simulation details
Proof of Theorem A2. The aim of this section is to derive optimal actions when the costs of earning outcomes depend on each other. In this condition, one can assume that what determines the cost is the delay between subsequent responses, whether for the same or for a different outcome, i.e., the cost is proportional to the rate of earning all of the outcomes. In particular, if earning $O_1$ requires $k$ responses and earning $O_2$ requires $lk$ responses ($l > 1$), then the delay between subsequent responses ($\tau$) will be $1/(k v_1 + l k v_2)$. Given equation 2, the cost of earning one unit of $O_1$ will be $k[a(k v_1 + l k v_2) + b]$, and the cost of earning one unit of $O_2$ will be $kl[a(k v_1 + l k v_2) + b]$. Such a cost function can be obtained by defining the cost as follows (equation A74):
$$K_v = v_1\left[ak(k v_1 + l k v_2) + kb\right] + v_2\left[akl(k v_1 + l k v_2) + klb\right].$$
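As a short worked step, the dependent cost function can be factorised to show that it depends on the two outcome rates only through the combined response rate $k(v_1 + l v_2)$. This rearrangement is not part of the original derivation; it is only an algebraic restatement of the cost defined above.

```latex
\begin{align*}
K_v &= v_1\left[ak(kv_1 + lkv_2) + kb\right] + v_2\left[akl(kv_1 + lkv_2) + klb\right] \\
    &= ak^2 (v_1 + l v_2)\,v_1 + ak^2 l\,(v_1 + l v_2)\,v_2 + kb\,(v_1 + l v_2) \\
    &= ak^2 (v_1 + l v_2)^2 + kb\,(v_1 + l v_2).
\end{align*}
```

Written this way, each unit of $O_2$ contributes $l$ times as much as a unit of $O_1$ to the cost, which for $l > 1$ is why responding for $O_2$ is the more expensive way to earn the same amount of outcome.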
In the following theorem, we show that given the above cost function, the optimal actions are to make no response for $O_2$ and to respond for $O_1$ at a constant rate.
Theorem A2. Given the cost function defined in equation A74 and assuming that the two outcomes have the same reward properties, i.e.: Then the optimal actions satisfy the following equations:
Proof. By substituting equation A74 in equation A56 we have: Using the boundary condition mentioned in equation A57 we have: Using equation A75 we get: which is not admissible given the constraints $v_1 \geq 0$ and $v_2 \geq 0$, and therefore either $v_1$ or $v_2$ must be equal to zero. The trajectory has a higher value when $v_2$ is set to zero, since $O_2$ has a higher cost, and therefore the optimal solution is $v_2 = 0$. Since $v_2 = 0$, the problem degenerates to a one-dimensional problem, in which according to Theorem 1 the optimal response rate is constant. The rate of responding for $O_1$ is therefore constant, which proves the theorem.
Simulation details. For the simulation of the model in Figure 4 (left panel, "dependent cost" condition), it is assumed that $k$ responses on the left lever are required to earn $O_1$ and $lk$ responses on the right lever are required to earn $O_2$. As in the "independent cost" condition described in the previous section, the reward field was assumed as follows: Since the response rate for one of the outcomes will be zero (according to Theorem A2), the problem degenerates to an environment with one action and one outcome. Using Theorem 1 and equation A36, the optimal response rate will be: Parameters used for simulations are $k = 1$, $l = 2$, $a = 1$, $b = 0$, $H = 100$.
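The claim that no responses should be made for $O_2$ can be illustrated with a brute-force search over constant-rate trajectories. This is a sketch under stated assumptions, not the authors' simulation. It assumes a reward field $A = H - (x_1 + x_2)$ for both outcomes, the dependent cost of equation A74 with $b = 0$, and a hypothetical session duration `T = 50`. The one-dimensional closed form `v_star` is derived here from those assumptions rather than copied from the text, and `trajectory_value` is a hypothetical helper name.

```python
def trajectory_value(v1, v2, k, l, a, H, T):
    """Total reward minus total cost for constant outcome rates v1, v2 on [0, T].

    Assumes reward field A = H - (x1 + x2) for both outcomes and the
    dependent cost of equation A74 with b = 0.
    """
    s = v1 + v2                                # total rate of earning outcomes
    reward = s * H * T - 0.5 * s**2 * T**2     # integral of s * (H - s*t) over [0, T]
    cost = a * k**2 * (v1 + l * v2)**2 * T     # integral of K_v over [0, T]
    return reward - cost


k, l, a, H = 1, 2, 1, 100  # parameters from the text (b = 0)
T = 50                     # hypothetical session duration, not from the source
step = 0.01
grid = range(0, 301)       # candidate rates 0.00, 0.01, ..., 3.00

# Exhaustive search over non-negative constant rates (v1, v2).
best = max(((i * step, j * step) for i in grid for j in grid),
           key=lambda v: trajectory_value(v[0], v[1], k, l, a, H, T))

# Assumed one-dimensional optimum for b = 0: H - T*v = 2*a*k^2*v.
v_star = H / (T + 2 * a * k**2)
print(best, k * v_star)
```

On this grid the best trajectory has a zero rate for $O_2$, and its rate for $O_1$ lies within one grid step of `v_star`; multiplying by $k$ converts the outcome rate into a left-lever response rate.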

Theorem A3: Definition, proof and simulation details
Proof of Theorem A3. The aim of Theorem A3 is to derive optimal actions when the reward field is non-conservative and the costs of actions are independent. An example of a non-conservative reward field is one in which the amount of reward produced by consuming an outcome is greater or smaller than the change in the motivational drive. For example, assume that there are two outcomes available, and the consumption of both outcomes has a similar effect on the motivational drive: but the reward that the second outcome generates is $l$ times larger ($l \neq 1$) than the change it creates in the motivational drive: