## Abstract

The precision with which items are encoded in working memory and attention decreases with the number of encoded items. Current theories typically account for this “set size effect” by postulating a hard constraint on the allocated amount of some kind of encoding resource, such as samples, spikes, slots, or bits. While these theories have produced models that are descriptively successful, they offer no principled explanation for the very existence of set size effects: given their detrimental consequences for behavioral performance, why have these effects not been weeded out by evolutionary pressure, for example by allocating resources proportionally to the number of encoded items? Here, we propose a theory that is based on an ecological notion of rationality: set size effects are the result of an optimal trade-off between behavioral performance and the neural costs associated with stimulus encoding. We derive models for four visual working memory and attention tasks and show that they account well for data from eleven previously published experiments. Our results suggest that set size effects have a rational basis and that ecological costs should be considered in models of human behavior.

Cognitive performance is strongly constrained by set size effects in working memory and attention: the precision with which these systems encode information declines rapidly with the number of items, as observed, for example, in delayed-estimation, change-detection, visual-search, and multiple-object-tracking tasks (1–8). By contrast, set size effects seem to be absent in long-term memory, where fidelity has been found to be independent of set size (9). The existence of set size effects is thus not a general property of neural coding, but rather a phenomenon that requires explanation. Despite an abundance of models, such an explanation is still lacking.

A common way to model set size effects has been to postulate that stimuli are encoded using a fixed total amount of resources, formalized as “samples” (1, 3, 10), slots (11), information bit rate (12), Fisher information (8), or neural firing (13): the larger the number of encoded items, the lower the amount of resource available for each item and, therefore, the lower the precision per item. There are two problems with this explanation. Firstly, it predicts that precision is inversely proportional to set size, which is often not the case (e.g., (14–16)). Secondly, it is unclear whether keeping the amount of encoding resources constant across set sizes serves any ecologically relevant function: why has the brain not evolved to counter set size effects by increasing the amount of allocated resource as more items are encoded, as seems to be the case in long-term memory? To address the first problem, models have been proposed in which encoding precision is postulated to be a power-law function of set size (5, 6, 15–21). These models tend to provide excellent fits, but have been criticized for lacking a principled motivation for the postulated power law (22, 23), thus failing to address the second problem.

Here, we take an ecological approach to explain set size effects, starting from the principle that neural firing is energetically costly (24–26). This cost may have pressured the brain to balance behavioral benefits of high precision against the neural loss that it induces (8, 25, 27, 28). What level of encoding precision establishes a good balance might depend on multiple factors, such as set size, task, and motivation. Indeed, performance on perceptual decision-making tasks can be improved by increasing monetary reward (29–31), which suggests that the total amount of resource spent on encoding has some flexibility that is driven by ecological factors. Based on these considerations, we hypothesize that set size effects on encoding precision reflect an ecologically rational strategy that balances behavioral performance against neural costs. Below, we derive formal models from this hypothesis for four visual working memory and attention tasks, fit them to data from eleven previously published experiments, and discuss implications of our findings.

## THEORY

We first present our theory in the context of the delayed-estimation paradigm (6) and will later show how it generalizes to other tasks. In single-probe delayed-estimation tasks, subjects briefly hold a set of items in memory and report their estimate of a randomly chosen target item (Fig. 1A; Table 1). Estimation error *ε* is the (circular) difference between the subject’s estimate and the true stimulus value *s*. Set size effects in this task are visible as a widening of the estimation error distribution (Fig. 1B). As in previous work (4, 5, 14, 15, 18, 32), we assume that a memory *x* follows a Von Mises distribution with mean *s* and concentration parameter *κ*, and define encoding precision *J* as Fisher information (33), which is one-to-one related to *κ* (see Supplementary Information). We assume that response noise is negligible, such that the estimation error is equal to the memory error, *ε*=*x*−*s*. Moreover, we assume variability in *J* across items and trials (5, 15, 18, 32, 34), which we model using a gamma distribution with mean *J̅* and scale parameter *τ* (see Supplementary Information).

The key novelty of our theory is the idea that stimuli are encoded with a level of mean precision, *J̅*, that minimizes a combination of *behavioral loss* and *neural loss*. Behavioral loss is induced by making an error *ε*, which we formalize using a mapping *L*_{behavioral}(*ε*). This mapping may depend on both internal incentives (e.g., intrinsic motivation) and external ones (e.g., the reward scheme imposed by the experimenter). For the moment, we choose a power-law function, *L*_{behavioral}(*ε*)=|*ε*|^{*β*} with *β*>0 as a free parameter, such that larger errors correspond with larger loss. The *expected* behavioral loss, denoted *L̅*_{behavioral}, is obtained by averaging loss across all possible errors, weighted by the probability that each error occurs,

$$\bar{L}_{\text{behavioral}}(\bar{J}, N) = \int L_{\text{behavioral}}(\varepsilon)\, p(\varepsilon \mid \bar{J}, N)\, d\varepsilon,$$

where *p*(*ε* | *J̅*, *N*) is the estimation error distribution for given mean precision and set size. In single-probe delayed-estimation tasks, the expected behavioral loss is independent of set size and subject to the law of diminishing returns (Fig. 1C, black curve).

The second kind of loss is a *neural loss* induced by the neural spiking activity that represents a stimulus. For many choices of spike variability, including the common one of Poisson-like variability (35), the precision (Fisher information) of a stimulus encoded in a neural population is proportional to the neural spiking rate (36, 37). Moreover, it has been estimated that the energetic loss induced by each spike increases with spiking rate (24, 25). When combining these two premises, the expected neural loss associated with the encoding of an item is a supralinear function of encoding precision, which can be modeled using for example a power-law function, *L̅*_{neural}(*J̅*)=*αJ̅*^{*ω*}, with *α* and *ω* as free parameters. However, to minimize the number of free parameters, we assume for the moment that the function is linear (*ω*=1) and will later present a mathematical proof that our theory generalizes to supralinear functions (*ω*>1; condition (iv) at the end of this section). Further assuming that stimuli are encoded independently of each other, neural loss is also proportional to the number of encoded items, *N*. We thus obtain

$$\bar{L}_{\text{neural}}(\bar{J}, N) = \alpha \bar{J} N,$$

where *α* is a free parameter that represents the amount of neural loss incurred by a unit increase in mean precision (Fig. 1C, colored lines).

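We combine the two types of expected loss into a total expected loss function (Fig. 1D),

$$\bar{L}_{\text{total}}(\bar{J}, N) = \bar{L}_{\text{behavioral}}(\bar{J}, N) + \lambda\, \bar{L}_{\text{neural}}(\bar{J}, N),$$

where the weight *λ*≥0 represents the importance of keeping neural loss low relative to the importance of good performance. Since *λ* and *α* have interchangeable effects on the model predictions, they can be fitted as a single free parameter, *λ͂* ≡ *λα*. We refer to the level of mean precision that minimizes the total expected loss as *optimal mean precision*,

$$\bar{J}_{\text{optimal}}(N) = \underset{\bar{J}}{\operatorname{argmin}}\; \bar{L}_{\text{total}}(\bar{J}, N).$$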

Under the loss functions proposed above, we find that *J̅*_{optimal} is a decreasing function of set size (Fig. 1D), which is qualitatively consistent with set size effects observed in experimental data (cf. Fig. 1B). Moreover, it can be shown (see Supplementary Information) that this prediction does not depend on the specific loss functions chosen here. The model predicts a set size effect for any choice of loss functions, as long as four rather general conditions are satisfied: (i) the expected behavioral loss is a strictly decreasing function of encoding precision, i.e., an increase in precision results in an increase in performance; (ii) the expected behavioral loss is subject to a law of diminishing returns (38): the higher the initial precision, the smaller the behavioral benefit obtained from an increase in precision; (iii) the expected neural loss is an increasing function of encoding precision; (iv) the expected neural loss associated with a fixed increase in precision increases with precision. Next, we evaluate the model by fitting it to data from a range of previously published experiments.
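The loss minimization described in this section can be sketched numerically. The Python sketch below is an illustration under simplifying assumptions, not the fitting code used for the paper: it ignores variability in precision (the gamma distribution collapses to its mean), replaces a continuous optimizer with a grid search over the Von Mises concentration *κ*, and uses the standard identity that the Fisher information of a Von Mises distribution equals *κ* times the mean of cos *ε*. The function names and the parameter values (a neural-loss weight of 0.01 and *β*=1) are hypothetical choices for illustration only.

```python
import math

def von_mises_summary(kappa, beta, n_grid=360):
    """Return (expected |error|**beta, Fisher information J) for a Von Mises
    error distribution with concentration kappa, using midpoint integration."""
    step = 2.0 * math.pi / n_grid
    norm = loss = cos_sum = 0.0
    for i in range(n_grid):
        eps = -math.pi + (i + 0.5) * step
        # Subtracting 1 in the exponent avoids overflow; it cancels in the ratios.
        weight = math.exp(kappa * (math.cos(eps) - 1.0))
        norm += weight
        loss += abs(eps) ** beta * weight
        cos_sum += math.cos(eps) * weight
    # J = kappa * I1(kappa) / I0(kappa), and I1/I0 equals the mean of cos(eps).
    return loss / norm, kappa * cos_sum / norm

def optimal_precision(set_sizes, lam, beta, kappa_max=50.0, n_kappa=501):
    """For each set size N, grid-search the precision J that minimizes
    expected behavioral loss + lam * J * N (assumption: no precision variability)."""
    grid = [von_mises_summary(kappa_max * k / (n_kappa - 1), beta)
            for k in range(n_kappa)]
    result = {}
    for n in set_sizes:
        _, j_best = min(grid, key=lambda pair: pair[0] + lam * pair[1] * n)
        result[n] = j_best
    return result

# Hypothetical parameter values, chosen only to illustrate the qualitative effect.
j_opt = optimal_precision(set_sizes=[1, 2, 4, 8], lam=0.01, beta=1.0)
for n, j in j_opt.items():
    print(f"N={n}: J_optimal={j:.2f}, J_total={n * j:.2f}")
```

Because the neural loss term grows with *N* while the behavioral loss does not, the minimizing precision in this sketch cannot increase as *N* grows, which is the set size effect in its simplest form.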

## RESULTS

### Model fits

We used maximum-likelihood estimation to fit the three model parameters (λ͂, *τ*, and *β*) to 67 individual-subject data sets from a delayed-estimation benchmark set^{+} (Table 1). The model accounts well for the raw error distributions (Fig. 2A) and the two statistics that summarize these distributions (Fig. 2B). Model comparison based on the Akaike Information Criterion (AIC) (39) indicates that the goodness of fit is comparable to that of a descriptive model variant in which the relation between encoding precision and set size is assumed to follow a power law (ΔAIC=5.27±0.70 in favor of the rational model). Hence, the rational model provides a principled explanation of set size effects in delayed-estimation tasks without sacrificing quality of fit.
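For readers unfamiliar with this kind of comparison, the following Python sketch shows how per-subject AIC differences could be computed and summarized. It is a minimal illustration: the log-likelihood values are made-up placeholders rather than data from the paper, the parameter count of the power-law variant is an assumption, and treating the reported ± values as standard errors across subjects is also an assumption.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L_max)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def mean_and_sem(values):
    """Mean and standard error of the mean across subjects."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

# Placeholder maximum log likelihoods for three hypothetical subjects
# (NOT values from the paper).
loglik_rational = [-512.3, -498.7, -530.1]
loglik_power_law = [-514.0, -501.9, -531.2]

# The rational model has 3 free parameters (from the text); using 3 for the
# power-law variant as well is an assumption made for this sketch.
delta_aic = [aic(lp, 3) - aic(lr, 3)
             for lr, lp in zip(loglik_rational, loglik_power_law)]
mean_delta, sem_delta = mean_and_sem(delta_aic)
print(f"Delta AIC = {mean_delta:.2f} +/- {sem_delta:.2f} (positive favors the rational model)")
```

With equal parameter counts the AIC difference reduces to twice the difference in maximum log likelihood, so the penalty term only matters when the compared models differ in complexity.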

### Comparison with an unconstrained model

We next try to falsify our theory by testing whether a mapping between set size and encoding precision can be found that fits the data better than the relation imposed by the loss-minimization strategy of the rational model. To this end, we fit an unconstrained variant of the model in which memory precision is fitted as a free parameter at each set size. We find only a minimal difference in goodness of fit (ΔAIC=3.49±0.93 in favor of the unconstrained model), suggesting that the fits of the rational model are close to the best possible fits. This finding is corroborated by examination of the fitted parameter values: the estimated precision values in the unconstrained model closely match the precision values in the rational model (Fig. 2C). Hence, it seems that no relation exists that fits these data substantially better than the constrained set of relations that are possible in the rational model.

### Total amount of allocated resource as a function of set size

We estimate the total amount of allocated encoding resource as the mean precision (Fisher information) per item summed across all items, *J̅*_{total} = *N J̅*. In fixed-resource models, *J̅*_{total} is by definition constant, and in power-law models it is a monotonic function of set size. However, an interesting qualitative feature of the rational model is that in some of the experiments the best-fitting parameters produce a *non-monotonic* relation between *J̅*_{total} and set size (Fig. 2D, gray curves). This means that at small set sizes it apparently sometimes pays off (in terms of total-loss minimization) to increase the total amount of allocated resource when an item is added, while the opposite is true at large set sizes^{†}. Using the best-fitting precision values in the unconstrained model as an estimate of how much encoding resource subjects allocated on average at each set size, we find that the data show clear signs of a similar non-monotonicity (Fig. 2D, black circles); to our knowledge, this has not previously been reported.
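As a rough check on when such non-monotonicity can arise, it may help to work through a simplified limit. The following derivation is a sketch under assumptions that are not part of the fitted model: precision is high enough that the error distribution is approximately Gaussian with variance 1/*J̅*, variability in precision is ignored, and the neural loss is linear with weight *λ͂*. The constant $c_\beta$ is introduced here for convenience.

```latex
\begin{align}
\bar{L}_{\text{behavioral}}(\bar{J}) &\approx c_\beta\, \bar{J}^{-\beta/2},
  \qquad c_\beta = \frac{2^{\beta/2}\,\Gamma\!\left(\frac{\beta+1}{2}\right)}{\sqrt{\pi}}, \\
\frac{\partial}{\partial \bar{J}}\left[c_\beta\, \bar{J}^{-\beta/2} + \tilde{\lambda}\,\bar{J} N\right] = 0
  \quad&\Rightarrow\quad
  \bar{J}_{\text{optimal}}(N) = \left(\frac{\beta\, c_\beta}{2\tilde{\lambda} N}\right)^{\frac{2}{\beta+2}}
  \propto N^{-\frac{2}{\beta+2}}, \\
\bar{J}_{\text{total}}(N) = N\,\bar{J}_{\text{optimal}}(N) &\propto N^{\frac{\beta}{\beta+2}}.
\end{align}
```

In this limit the optimal precision is itself a power law in *N*, and the total resource grows monotonically with set size for any *β*>0. A non-monotonic relation would therefore presumably have to come from factors that this approximation leaves out, such as the bounded, circular error domain at low precision or variability in precision.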

### Alternative loss functions

To evaluate the necessity of a free parameter in the behavioral loss function, *L*_{behavioral}(*ε*), we also test the following three parameter-free choices: |*ε*|, *ε*^{2}, and −cos(*ε*). Model comparison favors the original model with AIC differences of 14.0±2.8, 24.4±4.1, and 19.5±3.5, respectively. While there may be other parameter-free functions that give better fits, we expect that a free parameter is unavoidable here, as it is likely that the error-to-loss mapping differs across experiments (due to differences in external incentives) and possibly also across subjects within an experiment (due to differences in internal incentives). We also test a two-parameter function that was proposed recently (Eq. (5) in (40)). The main difference from our original choice is that this alternative function allows for saturation effects in the error-to-loss mapping. However, this extra flexibility does not increase the goodness of fit sufficiently to justify the additional parameter, as the original model outperforms this variant with an AIC difference of 5.3±1.8.

### Generalization to other tasks

We next examine the generality of our theory, by testing whether it can also explain set size effects in two change detection tasks (Table 1). In these experiments, the subject is on each trial sequentially presented with two sets of stimuli and reports whether there was a change at any of the stimulus locations (Fig. 3A). A change was present on half of the trials, at a random location and with a random change magnitude. The behavioral error, *ε*, takes only two values in this task: “correct” and “incorrect”. Therefore, *p*(ε |*J̅*, *N*) specifies the probabilities of correct and incorrect responses for a given level of precision and set size, which depend on the observer’s decision rule. Following previous work (14, 32), we assume that subjects use the Bayes-optimal rule (see Supplementary Information) and that there is random variability in encoding precision. This decision rule introduces one free parameter, *p*_{change}, specifying the subject’s degree of prior belief that a change will occur. Due to the binary nature of *ε* in this task, the free parameter of the behavioral loss function drops out of the model, as its effect is equivalent to changing parameter *λ͂* (see Supplementary Information). The model thus has three free parameters (*λ͂*, *τ*, and *p*_{change}). We find that the maximum-likelihood fits account well for the data in both experiments (Fig. 3B).
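The reason the behavioral loss parameter drops out for binary judgments can be seen in a single step. As an illustration, let *a* and *b* (with *b*>*a*) denote the losses assigned to a correct and an incorrect response; these two symbols are introduced here and do not appear in the original text.

```latex
\begin{align}
\bar{L}_{\text{total}}(\bar{J}, N)
  &= a\,\bigl[1 - p_{\text{incorrect}}(\bar{J}, N)\bigr] + b\,p_{\text{incorrect}}(\bar{J}, N) + \lambda\alpha\,\bar{J} N \\
  &= a + (b - a)\left[\,p_{\text{incorrect}}(\bar{J}, N) + \frac{\lambda\alpha}{b - a}\,\bar{J} N\right].
\end{align}
```

The additive constant *a* and the positive factor *b*−*a* do not change where the minimum lies, so minimizing this expression is equivalent to minimizing the probability of an incorrect response plus a neural loss term whose weight is rescaled by 1/(*b*−*a*). Any choice of behavioral loss values can thus be absorbed into the single fitted weight on the neural loss.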

So far, we have considered tasks with continuous and binary judgments. We next consider two change localization experiments (Table 1) in which judgments are non-binary but categorical. The task is identical to change detection, except that a change is present on every trial and the observer reports the location at which the change occurred (out of 2, 4, 6, or 8 locations). We again assume variable precision and an optimal decision rule (see Supplementary Information). Although the rational model has only two free parameters (*λ͂* and *τ*), it accounts well for both datasets (Fig. 3C).

The final task to which we apply our theory is a visual search experiment (4) (Table 1). Unlike the previous three tasks, this is not a working memory task, as there was no delay period between stimulus offset and response. Set size effects in this experiment are thus likely to stem from limitations in attention rather than memory, but our theory applies without any additional assumptions. Subjects judged whether a vertical target was present among *N* briefly presented oriented ellipses (Fig. 3D). The distractors were drawn from a Von Mises distribution centered at vertical. The width of the distractor distribution determined the level of heterogeneity in the search display. Each subject was tested under three different levels of heterogeneity. We again assume variable precision and an optimal decision rule (see Supplementary Information). This decision rule has one free parameter, *p*_{present}, specifying the subject’s prior degree of belief that a target will be present. We fit the three free parameters (*λ͂*, *τ*, and *p*_{present}) to the data from all three heterogeneity conditions at once and find that the model accounts well for the dependencies of the hit and false alarm rates on both set size and distractor heterogeneity (Fig. 3E).

## DISCUSSION

A key strength of our theory is that it uses a single principle of rationality and relatively few parameters to produce well-fitting models across a range of quite different tasks. Nevertheless, consideration of additional mechanisms could further improve the fits and lead to more complete models of human behavior. For example, previous studies have incorporated response noise (15, 18), non-target responses (17), and a (variable) limit on the number of remembered items (12, 15, 41) to improve fits. We did not consider such components here, as they come with additional parameters, some are task-specific (such as non-target responses), and they have so far not been motivated in a principled manner. Regarding the latter point, it might be possible to treat some of these mechanisms using an ecologically rational approach as well. For example, the level of response noise might be set by optimizing a trade-off between performance and motor control effort (42).

Our findings suggest that set size effects in working memory and attention may reflect a near-optimal compromise reached by a system that strives to simultaneously maximize performance and minimize spiking activity. A possible explanation for why long-term memory does not seem to suffer from set size effects (9) and has much larger capacity (43) is that the loss incurred by maintaining synaptic connections is likely to be lower than the loss incurred by persistent activity.

The work presented in this paper speaks to the relation between descriptive and rational theories in psychology and neuroscience. The main motivation for rational theories is to reach a deeper level of understanding by analyzing a system in the context of the ecological needs and constraints that it evolved under. Besides the large literature on ideal-observer decision rules (44–47), rational approaches have been used to explain properties of receptive fields (48–50), tuning curves (51–53), neural wiring (54, 55), and neural network modularity (56). A transition from descriptive to rational explanations might be an essential step in the maturation of theories of biological systems, and in psychology there certainly seems to be more room for this kind of explanation.

Although several previous models in the field of working memory and attention contain rational aspects, none of them accounts for set size effects in a principled way. Sims and colleagues have examined how errors in visual working memory can be minimized by optimally taking into account statistics of the stimulus distribution, but assume a fixed total amount of available encoding resource (12, 57). Moreover, in our own previous work on visual search (4, 5), change detection (14, 32), and change localization (18), we used optimal-observer models for the decision stage, but assumed an ad hoc power law for the encoding stage. An alternative explanation of set size effects has been that the brain is unable to keep neural representations of multiple items segregated from one another (23, 58–60): as the number of encoded items increases, so does the level of interference in their representations, resulting in lower task performance. However, these models offer no principled justification for the existence of interference and some require additional mechanisms to account for set size effects; for example, the model by Oberauer and colleagues requires three additional components – including a set-size dependent level of background noise – to fully account for set size effects (23). That being said, our theory does not rule out the possibility of interference, and it could be added onto any of the models we presented.

Our approach shares both similarities and differences with the concept of bounded rationality (61), which states that human behavior is guided by mechanisms that provide “good enough” solutions rather than optimal ones. The main similarity is that both approaches acknowledge that human behavior is constrained by various cognitive limitations. However, an important difference is that bounded rationality postulates these limitations as a given fact, while our approach explains them as rational outcomes of ecological optimization processes. The suggestion that cognitive limitations are themselves subject to optimization may also have implications for theories outside the field of psychology. One example concerns recent models of value-based decision-making that incorporate constraints imposed by working memory and attention limitations (e.g., (62)). Another example is the theory of “rational inattention” in behavioral economics, which examines optimal decision-making under the assumption that decision makers have a fixed limit on the total amount of attention that they can allocate to process economic data (63). It might be interesting to extend that theory by treating the amount of allocable attention as the outcome of an optimization process rather than a constant.

While our results show that set size effects can in principle be explained as the result of an optimization strategy, they do not necessarily imply that encoding precision is fully optimized on every trial in any given task. First, encoding precision in the brain most likely has an upper limit, due to irreducible sources of noise such as Johnson noise and Poisson shot noise (64), as well as suboptimalities early in sensory processing (65). This prevents subjects from reaching the near-perfect performance levels that our model may predict when the behavioral loss associated with errors is huge. Second, precision might have a lower limit: task-irrelevant stimuli are sometimes automatically encoded (66), perhaps because in natural environments few stimuli are ever completely irrelevant. This would prevent subjects from sometimes encoding nothing at all, in contradiction to what our theory predicts to happen at very large set sizes. Third, all models that we tested incorporated variability in encoding precision. Part of this variability is possibly due to stochastic factors such as neural noise, but part of it may also be systematic in nature (e.g., particular colors and orientations may be encoded with higher precision than others (67, 68)). Whereas the systematic component could have a rational basis (e.g., higher precision for colors and orientations that occur more frequently in natural scenes (53, 69)), this is unlikely to be true for the random component. Indeed, when we jointly optimize *J̅* and *τ*, we find estimates of *τ* that are consistently 0, meaning that any variability in encoding precision is suboptimal from the perspective of our model. Finally, even if set size effects are the result of a rational trade-off between behavioral and neural loss, it may be that the solution that the brain settled on only works well on average rather than being tailored to every possible situation. In that case, set size effects could be more rigid across environmental changes (e.g., in task or reward structure) than predicted by a fully optimal model that incorporates every such change.

Future work could further examine optimality of encoding precision in working memory and attention by experimentally varying factors that affect the loss functions. In delayed estimation, an obvious choice for this would be the delay period. Assuming that working memories are maintained in persistent activity (70, 71), a longer delay would induce a higher cost and decrease optimal encoding precision. Another experimental parameter that could be varied is the error-to-loss mapping. A previous study that performed this manipulation found a behavioral effect in one experiment, but did not vary set size (72). None of the experiments modeled here contained this manipulation (DE4-DE6 imposed an explicit loss function but did not vary it; the other experiments had no explicit scoring system). Future studies could measure effects of changes in explicitly imposed scoring systems and test how well a rational model accounts for such effects. Related to this, it would be relevant to examine whether subjects are able to internalize experimental loss functions in the timespan of a single experiment and, if not, to further characterize their "natural" loss functions (40). Another line of possible future work would be to examine whether our theory can be generalized to the level of objects (73, 74).

Developmental work has shown that working memory capacity estimates change with age (75, 76). Viewed from the perspective of our proposed theory, this raises the question of why the optimal trade-off between behavioral and neural loss would change with age. A speculative answer could be that a subject’s encoding efficiency (formalized by parameter *α* in Eq. (2)) may improve during childhood. An increase in encoding efficiency (i.e., lower *α*) has the same effect in our model as a decrease in the set size (i.e., lower *N*), which we know is accompanied by an increase in encoding precision. Hence, our model would predict subjects to increase encoding precision over time, which is qualitatively consistent with the findings of the developmental studies.

Finally, our results raise the question what neural mechanisms could implement the kind of near-optimal resource allocation strategy that is the core of our theory. Some form of divisive normalization (13, 77) would be a likely candidate, as it has the effect of lowering the gain when set size is larger. Moreover, divisive normalization is already a key operation in neural models of attention (78) and visual working memory (13, 58).

## METHODS

### Data and code sharing

All data analyzed in this paper and model fitting code are available at [url to be inserted].

### Model fitting

*Delayed estimation*. We used Matlab’s fminsearch function to find the parameter vector **θ**={*λ͂*, *β*, *τ*} that maximizes the log likelihood function, Σ_{*i*=1}^{*n*} log *p*(*ε*_{*i*} | *N*_{*i*}, **θ**), where *n* is the number of trials in the subject’s data set, *ε*_{*i*} the estimation error on the *i*^{th} trial, and *N*_{*i*} the set size on that trial. To reduce the risk of converging to a local maximum, initial parameter estimates were chosen based on a coarse grid search over a large range of parameter values. The predicted estimation error distribution for a given parameter vector **θ** was computed as follows. First, *J̅*_{optimal} was computed by applying Matlab’s fminsearch function to Eq. (5). In this process, the integrals over *ε* and *J* were approximated numerically by discretizing the distributions of these variables into 100 and 20 equal-probability bins, respectively. Next, the gamma distribution over precision with mean *J̅*_{optimal} and scale parameter *τ* was discretized into 20 equal-probability bins. Thereafter, the predicted estimation error distribution was computed under the central value of each bin. Finally, these 20 predicted distributions were averaged. We verified that our results are robust under changes in the number of bins used in the numerical approximations.
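The discretization step described above can be sketched as follows. This Python sketch is an illustration rather than the original Matlab code: it approximates the bin centers from sorted random samples instead of an inverse cumulative distribution function, it interprets the "central value" of a bin as its median quantile (an assumption), and it uses a Gaussian density with variance 1/*J* as a placeholder for the Von Mises error distribution. All function names are hypothetical.

```python
import math
import random

def gamma_bin_centers(mean_precision, tau, n_bins=20, n_samples=100_000, seed=0):
    """Approximate the central (median) values of n_bins equal-probability bins
    of a gamma distribution with the given mean and scale tau, via sorted samples."""
    shape = mean_precision / tau  # the mean of a gamma distribution is shape * scale
    rng = random.Random(seed)
    samples = sorted(rng.gammavariate(shape, tau) for _ in range(n_samples))
    # Bin i covers probabilities [i/n_bins, (i+1)/n_bins); take its midpoint quantile.
    return [samples[int((i + 0.5) / n_bins * n_samples)] for i in range(n_bins)]

def mixture_density(density, centers, eps):
    """Average a conditional density p(eps | J) over the equal-probability bin centers."""
    return sum(density(eps, j) for j in centers) / len(centers)

def gaussian_density(eps, precision):
    """Placeholder error density with variance 1/precision (stand-in for Von Mises)."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(-0.5 * precision * eps ** 2)

# Hypothetical values for the mean precision and scale parameter.
centers = gamma_bin_centers(mean_precision=10.0, tau=2.0)
p_zero = mixture_density(gaussian_density, centers, eps=0.0)
print(f"{len(centers)} bin centers, predicted density at zero error: {p_zero:.3f}")
```

Because every bin carries the same probability mass, the mixture over precision reduces to an unweighted average of the per-bin predictions, which is what makes the equal-probability scheme convenient.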

*Change detection*. Model fitting in the change detection task consisted of finding the parameter vector **θ**={*λ͂*, *τ*, *p*_{change}} that maximizes Σ_{*i*=1}^{*n*} log *p*(*R*_{*i*} | *N*_{*i*}, Δ_{*i*}, **θ**), where *n* is the number of trials in the subject’s data set, *R*_{*i*} is the response (“change” or “no change”), Δ_{*i*} the magnitude of change, and *N*_{*i*} the set size on the *i*^{th} trial. For computational convenience, Δ was discretized into 30 equally spaced bins. To find the maximum-likelihood parameters, we first created a table with predicted probabilities of “change” responses for a large range of (*J̅*_{optimal}, *τ*, *p*_{change}) triplets. One such table was created for each possible (Δ, *N*) pair. Each value *p*(*R*=“change” | *N*, Δ, *J̅*_{optimal}, *τ*, *p*_{change}) in these tables was approximated using the optimal decision rule (see Supplementary Information) applied to 10,000 Monte Carlo samples. Next, for a given set of parameter values, the log likelihood of each trial response was computed in two steps. First, the expected total loss was computed as a function of *J̅*, using *L̅*_{total}(*J̅*, *N*) = *p*_{incorrect}(*J̅*, *N*) + *λ͂J̅N*, where *p*_{incorrect}(*J̅*, *N*) was estimated using the pre-computed tables. Second, we looked up log *p*(*R*_{*i*} | *N*_{*i*}, Δ_{*i*}, *J̅*_{optimal}, *τ*, *p*_{change}) from the pre-computed tables, where *J̅*_{optimal} is the value of *J̅* for which expected total loss was lowest. To estimate the best-fitting parameters, we performed a grid search over a large set of parameter combinations, separately for each subject.
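The two-step computation for a single trial can be sketched as follows. This Python sketch only illustrates the control flow: the table entries are generated by a made-up placeholder function, whereas the real tables come from the optimal decision rule and Monte Carlo simulation described in the Supplementary Information, and they are additionally indexed by the scale parameter and the prior. The function names, grid, and parameter values are hypothetical.

```python
import math

def optimal_precision_index(j_grid, p_incorrect, lam, set_size):
    """Step 1: index of the J on the grid that minimizes
    p_incorrect(J, N) + lam * J * N."""
    losses = [p + lam * j * set_size for j, p in zip(j_grid, p_incorrect)]
    return min(range(len(j_grid)), key=losses.__getitem__)

def trial_log_likelihood(response_is_change, p_change_response):
    """Step 2: log probability of the observed response, given the
    looked-up probability of a "change" response."""
    p = p_change_response if response_is_change else 1.0 - p_change_response
    return math.log(p)

# Placeholder tables for one (change magnitude, set size) pair.
# These numbers are NOT derived from the optimal decision rule.
j_grid = [0.5 * k for k in range(1, 41)]
p_incorrect = [0.5 * math.exp(-0.3 * j) for j in j_grid]
p_change_response = [1.0 - 0.5 * math.exp(-0.3 * j) for j in j_grid]

best = optimal_precision_index(j_grid, p_incorrect, lam=0.005, set_size=4)
log_lik = trial_log_likelihood(True, p_change_response[best])
print(f"J_optimal on grid: {j_grid[best]}, trial log likelihood: {log_lik:.3f}")
```

Summing such per-trial log likelihoods over all trials, and repeating this for every parameter combination on the search grid, gives the quantity that the grid search maximizes.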

*Change localization and visual search*. Model fitting methods for the change-localization and visual-search tasks were identical to the methods for the change-detection task, except for differences in the parameter vector (no prior in the change localization task; *p*_{present} instead of *p*_{change} in visual search) and the optimal decision rules (see Supplementary Information).

## Footnotes

^{+} The original benchmark set (15) contains 10 data sets with a total of 164 individuals. Two of these data sets were published in papers that were later retracted and another one contained data for only two set sizes, which is not very informative for our present purposes. While our model accounts well for these data sets (Fig. S1 in Supplementary Information), we decided to exclude them from the main analyses.

† Upon reflection, it is perhaps not surprising to occasionally find a non-monotonic relation: when multiplying a decreasing function of set size (*J̅* as a function of *N*) with an increasing one (*N* itself), it is easy to obtain a function that is not monotonic but peaks at an intermediate value.