Field and experimental data together with computational models reveal how cleaner fish adjust decisions in a biological market

While it is generally straightforward to quantify individual performance in cognitive experiments, identifying the underlying cognitive processes remains a major challenge. Often, different mechanistic underpinnings yield similar performances, and Lloyd Morgan's canon warrants acceptance of the simpler explanation. Alternatively, when the different mechanisms interact with environmental conditions, variation in performance across environments may allow us to infer statistically which mechanism is responsible. We illustrate this point by fitting computational models to experimental data on the performance of wild-caught cleaner fish Labroides dimidiatus in an ephemeral reward task, together with cleaner and client fish densities from the locations of capture. Using Bayesian statistics to fit the model parameters to the performance data revealed that cleaner fish most likely estimate the future consequences of an action, while it appears unlikely that the removal of the ephemeral reward acts as psychological punishment (negative reinforcement). Incorporating future consequences also yields performances that can be considered the result of locally optimal decision rules, in contrast to the negative reinforcement mechanism. We argue that the combination of computational models with data is a powerful tool to infer the mechanistic underpinnings of cognitive performance.

Author summary

Performance in behavioural experiments is often used to assess the cognitive abilities of animals. However, animals can arrive at the same outcome in alternative ways; thus, the outcome of an experiment does not tell us how animals achieve their performance. To overcome this limitation, we used a set of computational models, which provide predictions on how alternative cognitive mechanisms perform in the face of varying environmental conditions, and compared their predictions to performance data from individuals that experienced different conditions throughout their lives.
Our study system is the cleaner fish Labroides dimidiatus, a coral reef fish that feeds on ectoparasites and dead tissue from the skin of other reef fishes. Cleaner fish often have to choose among alternative clients seeking their cleaning service, and the optimal choice is often hard to achieve. Our performance experiments mimic the natural choices cleaner fish make. The combination of experiments and computational models points to cleaner fish being able to account for the future effects of their choices by estimating the long-term expected value of those choices. Estimating long-term value is a mechanism involved in human foresight.

This logic follows Lloyd Morgan's canon, which states that the simpler explanation (mechanism) should be accepted. Ideally, however, alternative hypotheses should be evaluated in light of their explanatory power. One potential dimension that could provide the necessary variation among distinct cognitive mechanisms is environmental variation. Controlled laboratory conditions often mask between- and within-species variation that arises among different environments. Thus, if predictions from the alternative mechanisms depend on environmental conditions, it is possible to assess their relative explanatory power by performing cognitive tests in the context of different environmental conditions.

From an evolutionary perspective, mechanisms are likely selected because of how they allow individuals to respond to environmental variation. For example, biological market theory predicts that the exchange rate of goods and/or services traded between cooperative partners adjusts to the law of supply and demand when individuals have some degree of partner choice [4]. Supply and demand conditions, which typically depend on the abundance of the species involved, certainly vary in time and space. Therefore, natural selection should favour the ability to flexibly adjust decisions and behavioural output to current market conditions. Indeed, such adjustments have been repeatedly documented [5]. In animals, an obvious general candidate mechanism for this strategic adjustment is the cognitive machinery. However, it is not clear which cognitive mechanisms allow individuals to adjust their behaviour to varying conditions. More specifically, the question arises to what extent mechanisms beyond basic associative learning may be involved.

One example of strategic adjustment in a biological market is the marine cleaning mutualism involving the cleaner fish Labroides dimidiatus and 'client' fish.
Client fish seek cleaner fish services at the cleaners' territories (so-called "cleaning stations") and offer themselves as food patches to get their ectoparasites removed, which provides cleaners with food and clients with improved health [6–9]. Given the capacity of some client fish to swim larger distances and access multiple cleaning stations, while others access the only cleaning station in their territory, it is crucial to categorize clients as either "visitors" or "residents," respectively. During cleaning interactions, a cleaner fish often faces a choice between a visitor and a resident client seeking its cleaning services simultaneously. Visitors have the option to switch to another cleaner fish if made to wait, while residents must wait for inspection. Indeed, visitors have been observed to use their partner choice option in that way [10], which may explain why cleaners gave visitors service priority in a field study in the Red Sea [11].

Identifying cues as salient is not the only challenge that wild-caught cleaner fish face in the ephemeral reward task, as revealed by proximate learning models. Such models allow varying the cognitive toolkit and evaluating which minimal kit is necessary to solve the task at hand (e.g. [21]).

Applied to the ephemeral reward task, learning models showed that basic reinforcement learning does not suffice to solve it [22,23]. This is particularly so when models assume the more complex natural situation in which cleaner fish face not only resident-visitor pairs but also visitor-visitor and resident-resident pairs, as well as a resident alone or a visitor alone. To be able to give visitors priority over residents, cleaners need to be able to assess a client's value separately for the three possible scenarios (alone, paired with a fish with the same strategic option, paired with a fish with the alternative strategic option) [23]. The ability to distinguish and value a stimulus differently when alone versus in compound versions of it has been termed configurational learning, chunking, or segmentation (see references in [22]).
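The state space this implies can be made concrete with a small sketch. The following is a hypothetical illustration, not the authors' code: a configurational learner keeps a separate value for each client type in each of the three scenarios, rather than a single value per client type.

```python
# A configurational learner stores one value per (client type, context)
# combination, so a visitor paired with a resident can come to be valued
# differently from a visitor alone or a visitor paired with another visitor.
values = {
    (client, context): 0.0
    for client in ("visitor", "resident")
    for context in ("alone", "with_visitor", "with_resident")
}

# An elemental learner, by contrast, would collapse these six entries into
# just two: one value for "visitor" and one for "resident".
```

Only with the six-entry table can learning in the resident-visitor scenario proceed independently of what is learned about each client type alone.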

In addition to configurational learning, cleaners also need to account for the future consequences of current decisions. In the model by Quiñones et al. [23], this could be achieved in two non-mutually exclusive ways: through low temporal discounting of future effects, also termed 'chaining' [24]; and/or through perceiving a visitor client leaving as psychological punishment (i.e. as a negative reinforcer). Low temporal discounting means that individuals include in their valuation of an action the reward effects that the action will have in the future. This is done by combining in a single valuation the reward obtained at the current time with all the reward that comes after, discounted by how far in the future it is accrued. 'Chaining' the rewards of these different time steps allows individuals to take actions that increase long-term reward at the sacrifice of short-term considerations [24]. Even though 'chaining' can be readily implemented computationally in learning models [24,25], cognitively it seems to be a complex adaptation [2]. On the other hand, using client behaviour as a negative reinforcer is, in principle, easier to implement. Negative reinforcement is part of the ubiquitous learning mechanism termed operant conditioning [26,27]. Thus, the standard logic of Lloyd Morgan's canon demands that operant conditioning, as the simpler explanation, be accepted by default. Ideally, however, the two mechanisms should be evaluated in light of how well they explain the available data. Note that different fields interested in cognition and decision making use different words to refer to negative reinforcers [23,25]. Here, for the sake of simplicity and clarity, we will use the word 'penalty' to refer to this mechanism, which involves a negative reinforcer.
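The two mechanisms can be sketched in a few lines of code. The following is a minimal, hypothetical illustration (not the implementation of Quiñones et al. [23]): a temporal-difference learner facing the resident-visitor choice, where gamma weights the food available at the next time step ('chaining') and eta turns a departing visitor into a negative reward ('penalty'). All function names and parameter values are assumptions chosen for illustration.

```python
import random

def train(gamma=0.0, eta=0.0, alpha=0.05, p_leave=0.8, n_trials=5000, seed=1):
    """Learn action values for the resident-visitor state.

    Each inspected client yields an immediate reward of 1. Choosing the
    visitor keeps the resident waiting (food at the next step); choosing
    the resident makes the visitor leave with probability p_leave.
    """
    random.seed(seed)
    q = {"resident": 0.0, "visitor": 0.0}
    for _ in range(n_trials):
        # epsilon-greedy choice between the two clients
        a = max(q, key=q.get) if random.random() > 0.1 else random.choice(list(q))
        reward = 1.0                        # one client inspected this step
        if a == "visitor":
            left, future_food = False, 1.0  # resident is still waiting
        else:
            left = random.random() < p_leave
            future_food = 0.0 if left else 1.0
        if left:
            reward -= eta                   # 'penalty': departure as negative reward
        target = reward + gamma * future_food   # 'chaining': discounted future food
        q[a] += alpha * (target - q[a])
    return q
```

With gamma = eta = 0 the two actions converge to the same value, mirroring the result that basic reinforcement learning yields no visitor preference; setting either gamma > 0 or eta > 0 makes the visitor the higher-valued choice, via the two distinct routes described above.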

Over the last decade, our team tested over a hundred wild-caught cleaner fish in the exact same paradigm of the ephemeral reward task [13,28–31]. These fish came from different reef locations around the study field site at Lizard Island, Great Barrier Reef, Australia. Further investigation of the local eco-sociological conditions revealed that cleaner and client fish population densities have a substantial impact on cleaner fish performance in the task. Cleaner fish from reef sites with relatively low densities were more likely to fail at solving the task [28–30]. The explanation for this pattern is twofold. First, low cleaner fish population densities imply low supply of and high demand for cleaning services. Under such conditions, there are fewer occasions on which failing to give priority to a visitor translates into an empty cleaning station. Second, and also related to the state of the cleaning market, visitor clients at low cleaner fish population densities are less likely to exert their partner choice options and are hence more likely to wait for inspection if not given cleaning priority [31]. In the absence of partner choice, cleaner fish should no longer give priority to visitors.

The interplay between the underlying cognitive machinery and local ecological conditions apparently generates the documented variation in performance among individuals of the same species. Our approach of fitting the computational model to the empirical data on fish densities, cleaner fish censuses, and performance in the ephemeral reward task aimed at: (i) determining which mechanism cleaner fish use to incorporate future consequences of current decisions, by testing whether chaining, penalty, or a combination of both best explains their performance; and (ii) determining whether the two mechanisms differ with respect to the ecological conditions that are likely to cause high versus low performance in the ephemeral reward task. Additionally, we assessed which mechanism yields optimal performance patterns. Relying on the logic of biological market theory, we predicted that the appropriate performance is a low preference for visitors under a low local cleaner-to-client ratio.

The model is based on reinforcement learning [25,32]. Given new information, the update is carried out in the direction that leads to more reward being obtained. In the long run, the probability of choosing a visitor over a resident converges in the model. The probability to which the model converges depends on the relative abundances of cleaners, visitors, and residents, as well as on the probability of visitors leaving the cleaning station when made to wait. Further details of the model implementation can be found in Quiñones et al. [23].

The model shows that agents need to find a way to incorporate future consequences of current choices. In the model, this can be achieved with either of two parameters, which can also work together. First, γ measures how much individuals include future rewards in their decision updates. If γ = 0, individuals only use the immediate reward obtained from a cleaning interaction. As γ increases, individuals increasingly include the rewards that follow. Second, η measures the strength of the penalty experienced when a visitor leaves the cleaning station.

The cognitive performance data were from a total of 120 cleaners (10 individuals for each of 12 site/year combinations) tested in the ephemeral reward task [30,31]. The authors housed all captured cleaners individually in glass aquaria (62 cm × 27 cm × 37 cm) and provided them with PVC pipes (10 cm × 1 cm) as shelters. The task consisted of exposing the cleaner fish to substitute models of client fish in the form of two Plexiglas plates offering the same amount of food (one item of mashed prawn). The two plates differed in colour and pattern (horizontal green stripes or vertical pink stripes) but had equal size (10 cm × 7 cm). Importantly, the two plates played different roles as either a visitor (ephemeral food source) or a resident (permanent food source). That is, if a cleaner fish inspected the resident plate first, the experimenter withdrew the visitor plate from the aquarium as a consequence. Choosing the visitor plate first, however, granted access to both plates.
The equal size of the plates forced cleaner fish to learn to give service priority to the visitor plate based solely on the plates' behaviour cue rather than a size cue [18]. Triki et al. [30,31] tested the fish for a maximum of 200 trials, with 20 trials a day: 10 trials in the morning and 10 trials in the afternoon. They randomized and counterbalanced the plates' spatial location (i.e. left or right) between trials.

Similarly, they counterbalanced the plates' decoration (colour and pattern) and the plates' role (visitor or resident) between the tested fish. In the original studies, once a fish reached a learning criterion, that is, performing significantly above chance level (choosing the visitor plate in more than 50% of trials, p < 0.05), it passed to a reversal version of the task in which the roles of the visitor/resident Plexiglas plates were swapped (see Triki et al. 2019, 2020). Here, we instead used a subset of these data in order to gauge cleaner fish preferences for the visitor plate, even if they did not reach the learning criterion within 200 trials.
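Such a criterion amounts to a one-sided binomial test against chance. The snippet below is a hypothetical illustration of that computation, not the original analysis code:

```python
from math import comb

def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): one-sided test against chance."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# For a 20-trial block, 15 or more visitor-first choices are significantly
# above chance at p < 0.05, while 14 are not.
```

This is why a fixed number of correct choices per session (rather than a bare majority) is needed before a fish can be said to have learned the contingency.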

To do so, we first extracted the trial-by-trial outcomes from the last two sessions (20 trials) for those fish that never reached the learning criterion for the visitor plate (N = 45 cleaner fish). For those that reached the learning criterion at some point during the test and passed to a reversal phase, we extracted the trial-by-trial outcomes from the last session (10 trials) before they passed to reversal and from the last 10 trials they were exposed to in the test (N = 75 cleaner fish). We chose this combination of initial and reversal trials to quantify each fish's preference.

The aim of the analyses is to fit the key model parameters, γ and η, to the empirical data from Triki et al. [30,31], to test whether each of these effects, or their combination, better explains the pattern seen in the data. We used the ecological variables (cleaner, visitor client, and resident client abundances, and the probability of visitor clients leaving) as input to the models. As the response variable, we used cleaners' preference for visitors over residents in the ephemeral reward task. Finally, we used the preference for visitors resulting from the model simulations as the prediction for the response variable. We kept all other parameter values used for the model simulations constant (see Table S1).

To capture with the model the relationship between the ecological variables and cleaner fish preferences for visitors, we needed to scale the absolute population densities of cleaner fish from the empirical data to a measure of relative abundances that captures client visitation patterns. This is because, in the model, relative abundances of clients define not only the probabilities of residents and visitors but also how often the cleaning station is empty (i.e. there are no clients to be cleaned).

Once we calculated the relative abundances, we obtained predictions from the model for each of the locations and ran a Markov chain. We started the chain with the parameter values given in Table S1. Subsequently, we ran the model and calculated the likelihood with the new parameter set. We then used the ratio of the two likelihoods to decide which parameter set to keep. New parameter sets with a higher likelihood than the old set replaced old ones, and those with a lower likelihood replaced current ones with a probability equal to the likelihood ratio.
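This accept/reject rule is the standard Metropolis step. The sketch below is a hypothetical, runnable illustration of the procedure, not the authors' code: in the real analysis, `log_lik` would run the learning model on the ecological inputs and score the observed visitor choices, so here it is replaced by a stand-in Gaussian log-likelihood centred on assumed values of (γ, η).

```python
import math
import random

def log_lik(params):
    """Stand-in log-likelihood with a mode at gamma = 0.5, eta = 0."""
    gamma, eta = params
    return -((gamma - 0.5) ** 2 + eta ** 2) / 0.02

def metropolis(n_steps=20000, step_sd=0.05, seed=0):
    random.seed(seed)
    params = (0.0, 0.0)                 # starting values, cf. Table S1
    ll = log_lik(params)
    samples = []
    for _ in range(n_steps):
        # propose a new parameter set near the current one
        prop = tuple(p + random.gauss(0, step_sd) for p in params)
        ll_prop = log_lik(prop)
        # accept with probability equal to the likelihood ratio
        if math.log(random.random()) < ll_prop - ll:
            params, ll = prop, ll_prop
        samples.append(params)
    return samples[1000::100]           # burn-in and thinning, as in the text
```

Under this stand-in target, the retained samples concentrate around the assumed mode, illustrating how the marginal posteriors of γ and η reported below are obtained.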

Given that we only used the likelihoods in this decision, we effectively used an uninformative prior. Once we decided whether the new parameter set would replace the old one, we ran the model again to sample the likelihood distribution of the parameter set. We then started the cycle again by proposing a new set of parameters, and repeated the process for 10^5 steps. We ran 5 independent chains, discarded the first 1000 samples of each chain as burn-in, and after that kept 1 in every 100 samples to avoid autocorrelation.

To compare the fit of the three alternative models, we used the distribution of the pseudo-R² proposed by McFadden [34]. McFadden's pseudo-R² is a standard measure of fit for logistic regression. In that context, the pseudo-R² uses the log-likelihood of the data given the model, relative to the log-likelihood of the data given a model without covariates, as a measure of fit. Our model is not a logistic regression; therefore, we measured the pseudo-R² in relation to the log-likelihood of a model with parameters γ and η set to zero. This, in practice, amounts to a model that has a neutral preference between the two options. We computed the pseudo-R² for all the samples of the posterior from the MCMC and used these distributions of pseudo-R² values as a measure of fit.
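McFadden's measure is simply one minus the ratio of the two log-likelihoods. A minimal sketch with hypothetical numbers (not the study's data), where the null model assigns every binary visitor-vs-resident choice probability 0.5:

```python
import math

def mcfadden_r2(ll_model, ll_null):
    """McFadden's pseudo-R^2: 1 - LL(model) / LL(null)."""
    return 1.0 - ll_model / ll_null

# 20 observed choices; the neutral model (gamma = eta = 0) scores each at p = 0.5
ll_null = 20 * math.log(0.5)
# a fitted model that assigns the observed choices probability 0.7 on average
r2 = mcfadden_r2(20 * math.log(0.7), ll_null)
# a model that scores the data worse than neutral yields a negative pseudo-R^2
r2_bad = mcfadden_r2(20 * math.log(0.4), ll_null)
```

Because both log-likelihoods are negative, a model that fits better than the neutral one gives a value between 0 and 1, while a worse fit gives a negative value, which is why the sign of the pseudo-R² distribution is informative in the model comparison below.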

Estimation of the parameter values for the three models (full model, chaining, and penalty) supports chaining as the only mechanism cleaners use to account for the future effects of their actions, and thus to solve the ephemeral reward task. In the estimation of the parameter values of the full model, which includes both chaining and penalty, the bulk of the marginal posterior distribution of η, which controls the strength of the penalty, is around 0 (Fig. 1). As for γ, which controls chaining, the 95% credible interval also includes zero, but the mode of the posterior is around 0.5 (Fig. 1a). In the chaining model, where η is set to zero, the distribution of γ shifted to higher values, and zero is no longer part of the 95% credible interval of the parameter (Fig. 1c). In contrast, when we look at the model with only penalty, the posterior distribution of η is still centred around zero (Fig. 1d). Thus, the analysis of the estimates of individual parameter values in the three models only supports a strong effect of chaining. Furthermore, the comparison of the models' fit favours the chaining model. In panel e of figure 1 we show the distributions of pseudo-R² calculated using samples from the posterior distributions shown before. Note that the pseudo-R² can take negative values, which happens when the log-likelihood of the model is lower than that of a model with neutral preferences. Even though the peaks of the three pseudo-R² distributions were not very different, the model with only chaining produced a distribution of pseudo-R² in which more values were positive (to the right of the black line in Fig. 1f). This indicates that the chaining-only model provides the best fit to the data.

The main reason why chaining and penalty yield different predictions is the way that cleaner fish relative abundance influences the preference for visitor clients.

Visitor leaving probability has a similar positive effect on the probability of choosing the visitor client in all three models: they all predict an increased preference for visitors as the visitor leaving probability increases (Fig. 2). In contrast, the models differ in how cleaner fish relative abundance affects the preference for the visitor (Fig. 2a,e). Note, however, that we calculated the preferences shown in the left panels of figure 2 using only the mode of the posterior distributions, and holding constant the balance between resident and visitor abundances. The panels on the right

show how close the predictions are to the observed data, allowing the balance between client types to vary and using a set of samples from the posterior distribution.

In this study, our main aim was to unravel which of two potential cognitive mechanisms, chaining of events or penalty, or their combination, best explains wild-caught cleaner fish performance in the ephemeral reward task, while accounting for their ecological conditions. To evaluate the merits of the two mechanisms separately and combined, we considered cleaner fish performance in the lab test to originate from the rule these fish applied in their natural environment. That is, individuals that solved the task already had a preference for visitor clients and generalized this rule to the lab conditions once familiar with the task.

While all three models captured well the positive relationship between visitor leaving behaviour and cleaner fish performance in the market task [30], only the chaining mechanism predicted that cleaner fish performance in the task should be low in habitats with low cleaner-to-client ratios, regardless of the visitor leaving probability. In contrast, models including the penalty predicted the highest performance in the ephemeral reward task when relative cleaner fish abundance is low, particularly together with a high probability of visitors leaving (Fig. 2). Low relative cleaner fish abundance means the market has an excess of demand for cleaning services. In the models, this translates into a cleaning station that is frequently full. Thus, when a visitor leaves, it is likely that the cleaner fish will have access to another client in the next time step. Therefore, there will not be much difference in future reward between choosing a visitor and a resident, and cleaners will not develop a preference for the visitor under these conditions. The effect of the penalty on cleaner fish preference in a busy cleaning station is the opposite to that of chaining: cleaner fish will experience the resident-visitor state more often and will develop a preference for the visitor faster. At high cleaner fish abundances, the resident-visitor state becomes so rare that neither mechanism is very efficient at generating a preference for visitors. When facing the resident-visitor choice, it is still best to choose the visitor; however, the learning machinery will not be able to develop this preference efficiently. Overall, the models suggest that chaining is the cognitive mechanism that allows cleaner fish to adaptively adjust to the ecological conditions of their biological market.
Local conditions may thus generate variation in social competence that translates into site-specific variation in performance in the ephemeral reward task.

Our models are inspired by the general processes of associative learning, in which short-term rewards are translated into decision making; thus, they ignore alternative channels of information that could be relevant in market-like situations. For example, the model does not investigate whether cleaner fish actually assess the frequency of client visits or a mean frequency of visitors leaving. The updating learning mechanism for the development of preferences works on a trial-by-trial basis. In the model, cleaner fish do not need to assess the actual state of the market, i.e. their own abundance, the abundance of residents and visitors, or client visitation rate as an indicator of demand. They only need to assess the short-term consequences of their own decisions on food intake and chain them. Also, for the sake of simplicity, the model ignores the process by which cleaner fish discriminate resident and visitor clients. A model that accounts for this discrimination would probably involve the development of preferences for morphological or behavioural features that are statistically associated with visitors or residents. For example, visitors are on average larger than residents in body size [11], and, contrary to residents, they are less likely to chase a cleaner fish that fails to cooperate and instead cheats its client by taking a bite of mucus [12]. Given these associations, chaining might produce the decision rule "choose the larger client and/or the less aggressive client," which is not a useful rule in the standard ephemeral reward task.

In conclusion, our study shows that variation in cognitive performance as a function of local ecological conditions may set the stage for the use of mechanistic modelling to identify underlying cognitive mechanisms. It has been argued, for example, that a theory-of-mind task needs to be solved in the first trial in order to exclude fast conditioning [3]. Similarly, subjects need to solve a social learning task on the first trial to accept imitation as a mechanism over stimulus/local enhancement. Such strict conditions are virtually never met. For example, potato washing by Japanese macaques, an iconic example of social learning, took several years to spread within the group [41], meaning that any learner had been repeatedly exposed to demonstrations before acquisition.

Importantly, Galef [42] refuted imitation as a mechanism not simply because of the repeated exposure, but because a (rather qualitative) analysis of the spread of potato washing across individuals did not follow the prediction based on imitation learning (see also [43]). In our case, the number of trials it took cleaners to learn the solution to the ephemeral reward task would never allow excluding an important role of penalty based on the data alone. However, fitting model predictions to our comprehensive empirical data set revealed that a more complex mechanism, estimation of future reward, fits the data better.

Figure caption: Relation between the response variable used in this study and the criteria used in previous studies to assess performance in the ephemeral reward task. On the x axis, we classified the performance of cleaner fish according to whether they developed a preference for the visitor in the initial round, in both the initial and reversal rounds, or in neither.
On the y axis, we sum the choices from two experimental sessions: the panel on the left uses one session from the initial round and one from the reversal round when possible (as described in the main text); the panel on the right uses two sessions from the initial round for all fish.