Abstract
Limits on the storage capacity of working memory have been investigated for decades, but the nature of those limits remains elusive. An important but largely overlooked consideration in this research concerns the relationship between the physical properties of stimuli used in visual working memory tasks and their psychological properties. Here, we show that the relationship between physical distance in stimulus space and the psychological confusability of items as measured in a perceptual task is non-linear. Taking into account this relationship leads to a parsimonious conceptualization of visual working memory, greatly simplifying the models needed to account for performance, allowing generalization to new stimulus spaces, and providing a mapping between tasks that have been thought to measure distinct qualities. In particular, performance across a variety of working memory tasks can be explained by a one-parameter model implemented within a signal detection framework. Moreover, despite the system-level distinctions between working and long-term memory, after taking into account psychological distance we find a strong affinity between the theoretical frameworks that guide both systems, as performance is accurately described using the same straightforward signal detection framework.
Introduction
Working memory is typically conceptualized as a fixed capacity system, with a discrete number of items, each represented with a certain degree of precision (e.g., Cowan, 2001; Luck & Vogel, 2013). It is thought to be a core cognitive system (Baddeley, 2003; Ma, Husain & Bays, 2014), with individual capacity differences strongly correlating with measures of broad cognitive function such as fluid intelligence and academic performance (Alloway & Alloway, 2010; Fukuda et al., 2010). As a result, many researchers are deeply interested in understanding and quantifying working memory capacity.
A task commonly used to probe the contents of visual working memory and to estimate its capacity is the continuous report procedure, where subjects are asked to report a remembered feature of a probed item (e.g., a color) by responding in a continuous space (e.g., a color wheel) (Wilken & Ma, 2004; Zhang & Luck, 2008). Responses are typically analyzed using a mixture model (Figure 1), the parameters of which represent the psychological status of the targets. In the simplest mixture model, the two parameters reflect distinct psychological states that correspond to (1) the proportion of trials in which subjects “guess” because the presented items are not available in working memory (represented by a uniform distribution), and (2) the precision of the representations associated with the items that are successfully represented in working memory (represented by a Gaussian-like distribution centered on the target color, with the standard deviation indicating precision). More elaborate mixture models allow for variability in the precision of target representations from item to item or trial to trial (Fougnie, Suchow & Alvarez, 2012; Ma, Husain & Bays, 2014) and, in some cases, rather than assuming guessing, posit that some items are represented with extremely low precision (van den Berg et al., 2012). Although the details of the models differ, they all assume that quantifying the relevant mixture distribution parameters provides an assessment of a fundamental limit on working memory capacity.
Variants of the mixture model have been extremely influential in shaping our current understanding of the nature of working memory. For example, experimental data quantified in terms of the parameters of the mixture model have been interpreted to mean that as the retention interval increases, representations in working memory do not decay gradually but instead “die” in an all-or-none fashion (Zhang & Luck, 2009), as ‘guess rate’ changes much more than ‘precision’ with delay. Other findings have been interpreted as showing that when working memory capacity is pressured, the precision of these representations plateaus near estimates of long-term memory precision, suggesting a shared constraint on performance (Brady et al., 2013). Mixture models have also been broadly applied to interpret the effect of many other variables on memory capacity, including aging (Peich, Husain & Bays, 2013), Parkinson’s disease (Zokaei et al., 2014), sleep deprivation (Wee, Asplund & Chee, 2012), sleep disorders (Rolinski et al., 2015), video game training (Blacker, Curby, Klobusicky & Chein, 2014), transcranial magnetic stimulation (Rademaker, van de Ven, Tong & Sack, 2017), mental imagery (Keogh & Pearson, 2011), language and perception (Souza & Skora, 2017), auditory stimuli (Kumar et al., 2013), speech (Joseph et al., 2015), visual motion (Zokaei et al., 2011), audio-visual events (Olivers, Awh & Van der Burg, 2016), medial temporal lobe damage (Pertzov et al., 2013), and the race of a face (Zhou, Mondloch & Emrich, 2018).
Here we show that the mathematical framework of fitting mixture models to characterize representations in working memory is based on a foundational assumption that is incorrect. Specifically, the models assume that error measured in degrees along the response wheel is a linear measure (the measure plotted on the x-axis of Figure 1B) that can be directly modeled to understand memory. However, we demonstrate that, consistent with nearly all cases of discriminability and generalization (Fechner, 1860; Shepard, 1987), the ‘psychological distance’ that is relevant for understanding which items will be confused in memory is not linear with respect to physical distance. As a result, the assumptions guiding these models are untenable (i.e., the x-axis of Figure 1B is not directly interpretable). Taking into account the relationship between physical distance and psychological distance leads to a parsimonious single-parameter conceptualization of working memory, greatly simplifying the models needed to account for performance, allowing generalization to new stimulus spaces, and providing a mapping between tasks that have been thought to measure distinct qualities.
Results
We take memory for visual features, in particular color, as our primary case study because it has become a dominant tool for formalizing working memory capacity. When using color memory to investigate working memory capacity, researchers focus on the distribution of errors people make measured in degrees along the response wheel, x, where x ranges from 0° (for the color that matches the target) to 180° (for the most distant color from the target on the response wheel).
Existing models of working memory implicitly assume that because the perceptual space is locally perceptually uniform (i.e., any two nearby points on the color wheel are approximately equally discriminable; Brainard, 2003; Allred & Flombaum, 2014; Bae et al., 2015), the internal psychological distance between items relevant for memory performance is also linear as a function of (non-local) distance in this physical space. In reality, as with nearly all cases of discriminability (Fechner, 1860), perceptual differences (Maloney & Yang, 2003) and generalization (Shepard, 1987), we show that there is an approximately exponential fall-off in psychological distance with physical distance in color space. Critically, the long tails that are typically observed in working memory tasks when performance is plotted as a function of physical distance (Figure 1B) -- and which have been interpreted to reflect guessing associated with non-represented items -- do not exist when performance is more appropriately plotted as a function of psychological distance, f(x) (i.e., after correcting for the exponential fall-off with distance).
Measuring psychological distance. To determine the relationship between physical distance along the color wheel, x, and psychological distance, f(x), we tested how accurately participants could determine which of two colors was closest to a target color using a triad task (Torgerson, 1958; Maloney & Yang, 2003). This triad task is a perceptual task, but is analogous to the working memory situation where participants have a target color in mind and are asked to compare many other colors to that target (i.e., all the colors on the color wheel). In the triad task, even with a fixed 30° distance between the two colors that had to be compared to the target, participants were much more accurate the closer the two colors were to the target (Figure 2A shows a subset of data; ANOVA F(12,384) = 71.8, p<0.00001). In other words, participants largely cannot tell, even in a perceptual task, whether a color 120° from the target or a color 150° from the target is closer to the target, whereas this task is trivial if the colors are 5° and 35° from the target. Using additional triad task data, we determined the implied psychological distance of colors at different physical distances along the color wheel using the psychological scaling technique of Maloney and Yang (2003). Psychological distance falls off as a nonlinear, exponential-like function of physical distance (Figure 2B). The nonlinear relationship between physical distance, x, and psychological distance, f(x), is consistent with classic psychophysical studies of discrimination and generalization in a wide variety of domains (Fechner, 1860; Shepard, 1987). A second, non-psychological factor is also relevant to this nonlinearity. In particular, physical distance in color space should be a function of the linear distance between two colors in 3D color space (e.g., in CIELa*b*), not distance along the circumference of the response wheel (Figures S1-3 [Supplemental Material]).
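As a concrete illustration of such a function, the sketch below uses a normalized exponential with a hypothetical decay constant; the empirically measured f(x) need not take exactly this parametric form.

```python
import numpy as np

def psych_distance(x, tau=30.0):
    """Hypothetical exponential-like psychological distance f(x),
    normalized so f(0) = 0 and f(180) = 1.
    x: physical distance in degrees (0..180); tau: decay constant (illustrative)."""
    raw = 1.0 - np.exp(-np.asarray(x, dtype=float) / tau)
    return raw / (1.0 - np.exp(-180.0 / tau))

# Nearby colors are stretched apart; distant colors are compressed together.
near_gap = psych_distance(35) - psych_distance(5)    # 5 deg vs. 35 deg from target
far_gap = psych_distance(150) - psych_distance(120)  # 120 deg vs. 150 deg from target
assert near_gap > 5 * far_gap
```

Under this form, the 30° separation between 5° and 35° spans a large psychological distance, while the same 30° separation between 120° and 150° spans almost none -- mirroring the triad-task accuracy pattern described above.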
A key implication of these scaling considerations is that the “long tail” of errors from continuous report data – often assumed to reflect distinct no- or low-information guessing – is a measurement artifact. What appears to be a long tail in physical space (e.g., Figure 1B) is a very short distance in psychological space. In particular, since participants are incapable, even in a perceptual task, of discerning whether an item 120° or 180° from the target in color space is closer to the target, it is not surprising that they confuse these colors equally often with the target in memory. As an analogy, consider a person standing on a road lined in both directions with telephone poles, with adjacent poles equally spaced in physical distance. As the person looks one way or the other, each telephone pole in the receding distance becomes less and less distinctive from its neighbors. Even though the local separation between the poles remains constant, the perceived (psychological) separation between them falls off with distance (Crowder, 1976). Analogously, adjacent points on the color wheel become increasingly indistinguishable from each other as the distance between the target and those adjacent points increases (Figure 2). Thus, colors far from the target on the color wheel are by necessity equally confusable with the target. The fact that participants are approximately equally likely to confuse the target with all of these far away colors, resulting in the “long tail” of errors, need not reflect a distinct guessing or low-information state, but instead is a natural consequence of psychological scaling.
Incorporating psychological distance into an equal-variance signal detection model. The results of our psychological scaling experiment provide the opportunity to investigate whether there is a simple relationship between psychological distance and memory that has been obscured by relying on physical distance (error in degrees) to model memory performance. Indeed, we find that an extremely straightforward signal detection model that treats the memory-match signal of each item as arising from its psychological distance to the target (Figure 3) accurately fits the key data from visual working memory studies.
This model is the same model used to understand how people perform 2-AFC in long-term memory, with two additional considerations: (1) other colors on the color wheel serve as additional lures (e.g., continuous report is like a 360-AFC task), and (2) the mean of the memory strength distributions for these lures corresponds to their measured psychological distance to the target, f(x). In particular, according to this model, explained in Figure 3, when participants are probed on the color of an item and shown a color wheel to choose their response, each color x on the wheel (0° ≤ x ≤ 180°) generates a memory signal, mx, conceptualized as a random draw from that color’s memory-match distribution, which is centered on d′x (i.e., the mean memory-match signal strength for color x). That is, mx ~ N(d′x, 1). Participants then respond by choosing the color corresponding to the maximum mx. The mean memory-match signal for a given color x on the color wheel is given by that color’s psychological distance to the target, i.e., d′x = d′(1 − f(x)), where d′ is the model’s only free parameter and f(x) is the empirically determined psychological scaling function. When x = 0° (minimum distance), f(x) = 0, so d′0 = d′. By contrast, when x = 180° (maximum distance), f(x) = 1, so d′180 = 0. Because of the nonlinear scaling, colors in the ~90° to 180° physical distance range (i.e., the long-tail range) do not cover a great expanse but instead all cluster near f(x) ≈ 1, such that d′x ≈ 0. In other words, all of these colors are essentially equally distant from the target color (Figure 2).
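The decision rule just described can be sketched in a few lines of code. In this illustration the scaling function f(x) is a stand-in exponential rather than the empirically measured function, and the d′ value is arbitrary; this is a sketch of the sampling scheme, not the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, tau=30.0):
    # Hypothetical exponential-like stand-in for the empirically
    # measured psychological distance function.
    x = np.abs(np.asarray(x, dtype=float))
    return (1 - np.exp(-x / tau)) / (1 - np.exp(-180.0 / tau))

def simulate_tcm(d_prime, n_trials=20000):
    """Simulate continuous report under the Target Confusability Model:
    every color on the wheel draws a memory-match signal m_x ~ N(d'(1 - f(x)), 1)
    and the response is the color with the maximum draw."""
    # Signed error of each wheel color relative to the target (degrees).
    errors = np.arange(-179, 181)
    means = d_prime * (1 - f(errors))      # d'_x = d'(1 - f(x))
    m = rng.normal(means, 1.0, size=(n_trials, errors.size))
    return errors[np.argmax(m, axis=1)]    # reported error on each trial

resp = simulate_tcm(d_prime=3.0)
# Most responses cluster near the target, but a flat "long tail"
# of far responses remains -- with no guessing state in the model.
assert np.mean(np.abs(resp) <= 30) > np.mean(np.abs(resp) > 150)
```

Because the ~300 far colors all share d′x ≈ 0, each occasionally wins the max by chance, which is what produces the uniform-looking tail in the simulated error histogram.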
Importantly, this model makes a strong novel prediction: rather than the continuous report distributions reflecting two or three distinct parameters (guessing, precision, and, in many models, variation in precision), they reflect a single d′ parameter -- exactly as memory strength is conceived of in long-term memory models -- which is then combined with the a priori measured psychological distance function, f(x).
Remarkably, this straightforward Target Confusability Model (TCM) can explain the key stimulus-driven features of working memory. Of most importance, it accurately characterizes memory performance for color report across different set sizes (Figure 4), including both the relative height of the “long tails” of the distribution compared to the width of the central distribution, and aspects of the “peakiness” of the central distribution (thought to be the hallmark of variability in precision of working memory; see van den Berg et al., 2012). This is demonstrated by the correlation between the binned data and model fits (set size 1: R2=0.996, set size 3: R2=0.980, set size 6: R2=0.992, all p<0.001).
Thus, taking into account psychological distance metrics permits a greatly simplified model of working memory performance -- simply the standard signal detection model with added lures whose mean memory strengths are fixed based on psychological distance. This model also allows us to address other important theoretical questions about working memory. For example, an important debate in the working memory literature centers on the question of whether all items are represented or whether participants have a fixed capacity limit of ~3-4 items, beyond which they remember nothing about the remaining items (e.g., Luck & Vogel, 2013). Our model holds that all items are approximately equally represented, even at set size 6, as we need only a single d′ parameter for all items at this set size to fit the data. A hybrid model based on the TCM but mixed with ‘guessing’ (e.g., one that assumes only a subset of items is represented and the remainder have d′=0) is both more complex and provides a worse fit when adjusted for complexity (of 296 participants at set size 6, 93% have a lower AIC for the standard TCM than for the version with the guess parameter). Across participants, we find a reliably better fit for the no-guess model (t(295)=20.6, p<0.00001). Thus, with only a single parameter, no concept of capacity, and no variability in precision, our model accurately and parsimoniously characterizes working memory response distributions across a range of set sizes and other factors (see Supplemental Material).
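The complexity adjustment above uses the standard AIC penalty of two points per free parameter. A hypothetical illustration (the likelihood values are invented for the example, not the reported fits):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln L; lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Illustrative (hypothetical) fit results for one participant:
# the one-parameter TCM vs. a two-parameter TCM + guessing hybrid.
ll_tcm, k_tcm = -512.3, 1
ll_hybrid, k_hybrid = -511.9, 2   # tiny likelihood gain from the extra parameter

# The hybrid's log-likelihood advantage (0.4) is smaller than the penalty
# for its extra parameter, so the standard TCM is preferred.
assert aic(ll_tcm, k_tcm) < aic(ll_hybrid, k_hybrid)
```

The same comparison, run per participant, is what yields the 93% preference for the no-guess model reported above.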
Model broadly applicable across working memory research. An advantage of taking into account psychological distance is that it provides a unifying theory of working memory across many tasks previously thought to measure distinct psychological constructs. To demonstrate the strength of this model and its claim, we next illustrate several ways in which, despite having only one parameter, the TCM generalizes to new tasks in ways that have not been previously considered possible.
One extremely important task for the literature on working memory is detecting large changes to a set of items across a delay (i.e., a change-detection task). For example, participants see 5 colors in a study display, and after a brief delay see a single probe color that may have changed or not changed from the study display (Figure 5A). This task has been widely used in the literature (e.g., Luck & Vogel, 1997), including in important work on clinical populations (Gold et al., 2003; Pisella, Berberovic & Mattingley, 2004; Olson, Moore, Stark & Chatterjee, 2006; Lee et al., 2010; Parra et al., 2010; Moriya & Sugiura, 2012). However, this task has recently been rejected as insufficient because, when working in physical space rather than psychological space, change detection appears not to fully characterize working memory performance (e.g., it does not measure ‘precision’, only ‘guess rate’; Fougnie et al., 2010). Our psychological scaling findings, and the need for only a single d′ parameter in the TCM, raise an alternative possibility: that the change detection task yields the same d′ parameter that characterizes continuous report data (once psychological distance is taken into account). If so, the TCM should naturally generalize from change detection tasks to continuous report tasks and vice-versa.
To test this prediction, we had participants perform a change detection task at both set size 2 and set size 5, with the probe colors being either an exact match to the item from that location or 180° away in physical color space (Figure 5A). We then obtained estimates of d′ at set size 2 and 5 from memory performance data. Importantly, this simple measurement of d′ in a change detection task allowed us to generalize and infer the distribution of responses from a continuous report task with the same stimuli. In particular, by scaling up from a standard two-distribution signal detection model to a 360-distribution signal detection model (i.e., the TCM), with the additional 358 foil color memory-match signals centered at their psychological distances (a known and previously measured quantity, f(x)), we could predict what the entire histogram of data from a continuous report task using the same stimuli should look like (Figure 5B).
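Under the equal-variance signal detection framework, d′ for a change detection task can be estimated from hit and false-alarm rates in the standard way. A minimal sketch (the performance values below are hypothetical, not the collected data):

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Equal-variance signal detection estimate: d' = z(H) - z(FA)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical change-detection performance at two set sizes.
d2 = d_prime(0.95, 0.10)   # set size 2: high accuracy -> d' ~ 2.93
d5 = d_prime(0.75, 0.25)   # set size 5: lower accuracy -> d' ~ 1.35
assert d2 > d5 > 0
```

The resulting d′ value for each set size is the only quantity the TCM then needs, together with the measured f(x), to predict the full continuous report histogram.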
We found an excellent fit between this prediction and collected continuous report data at set size 2, R2=0.991, p<0.00001, and at set size 5, R2=0.995, p<0.00001 (Figure 5C). The TCM provides an accurate fit to the data without any information other than how well participants can detect large (180°) changes and the measured psychological distance function f(x). This demonstrates (1) that the single parameter, d′, is sufficient to completely explain the distribution from continuous report studies, and (2), importantly for the field at large, that change detection, even with extremely large and easily discriminable changes, is sufficient to fully characterize working memory, thereby allowing the integration of a huge literature on this task that has generally been discarded as insufficient to assess working memory (e.g., Fougnie et al., 2010). Note also that, with an independent estimate of d′ in hand, the TCM is parameter-free.
Generalization across stimulus variation. The TCM predicts not only that we should be able to generalize across different tasks previously thought to measure distinct constructs, but also that we should be able to generalize to new stimulus spaces, as long as we take into account the psychological distance structure of the relevant stimulus space.
To assess this prediction, we collected new data from a continuous color report experiment where rather than using the standard color wheel that spans a large range of color space, we instead showed participants only colors in the “green/yellow” family and asked participants to report their memory on this smaller color wheel (see Figure S8 [Supplemental Material]). According to the TCM, the same model and d′ parameters should apply without modification to this distinct task once we take into account the different psychological distance structure of the green/yellow foils offered to participants. Indeed, we find that the best fitting d′ parameters from the standard color wheel at set sizes 1, 3 and 6 generalize extremely well to this new color wheel using the TCM (with no free parameters; set size 1: R2=0.986, set size 3: R2=0.970, set size 6: R2=0.900, all p<0.001; see Figure S10 [Supplemental Material]). This generalization across stimulus space is not captured or predicted by previous theoretical frameworks and models guiding working memory.
Thus, in a working memory task using a distinct color space, after taking into account the new psychological distance structure of this color space, the TCM generalizes with no free parameters -- effectively showing that the same memory strength (d′) is evident in both tasks even though people make quite different distributions of errors in physical space (which would otherwise result in very different parameter values, and different psychological interpretations, under previous models).
The TCM harmoniously captures existing working memory data. The TCM makes a strong prediction that previously observed trade-offs between the parameters of traditional mixture modeling analyses actually reflect variation in only a single underlying parameter (d′). To evaluate this prediction, and to ask whether the TCM accurately characterizes the parameters that researchers have obtained by fitting the mixture model in other studies, we conducted an analysis of studies that reported precision and guessing parameters estimated using the dominant two-parameter mixture model developed by Zhang and Luck (2008). These studies vary considerably in set size, encoding time and other manipulations, and generally treat the two parameters obtained as estimates of distinct psychological concepts. We compared the trade-off between parameter values obtained by fitting the standard mixture model to empirical data to the trade-off in those same parameter values, but this time obtained by fitting the mixture model to simulated data obtained from the TCM over a range of d′ values. As in our main analysis, we focused on studies that tested memory for color (15 papers, 56 data points; Table S1 [Supplemental Material]). We find that across a wide swath of papers, the two mixture model parameters indeed trade-off in the manner predicted by d′ changes in simulated TCM data (Figure 6). This is also true when fitting data from individual papers that claim to find manipulations of a single parameter (e.g., Zhang & Luck, 2009; Brady et al., 2013), which, when mapped onto this plot, clearly show that a decrease in d′ is sufficient to account for their findings.
This provides additional evidence that once psychological distance is taken into account, one parameter is sufficient to capture much of the data observed in continuous report tasks.
These results also serve to test the TCM’s assumption that no items are poorly represented (i.e., low precision) or unrepresented (i.e., guesses). In particular, a capacity-limited model in which only a subset of items is represented predicts that the TCM should require multiple d′ parameters to adequately characterize the data, especially at set sizes that are thought to exceed a fixed capacity of ~3-4 items. Using a set size of 6 as an example, imagine that a fit of the standard mixture model indicated that 4 items are represented and 2 are unrepresented (guess rate = 2 / 6 = 0.333). To accurately fit data from a condition in which 2 items are not represented, the TCM should require a mixture of at least two d′ values (with d′ for some items set to 0 and d′ for other items set to a value greater than zero) rather than a single d′ value for all items of a given set size. When fit with just a single d′ value, if two items are in fact unrepresented, the TCM should underestimate the height of the tails (and thus the ‘guess’ parameter). We found no evidence of this in our analysis of the literature (mean “guess rate” in literature reviewed with set size greater than four: 0.393, SEM: 0.037; mean predicted by the single-d′ TCM from the corresponding SD parameters: 0.385). These results support the idea that all items are equally represented regardless of set size.
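The logic of an apparent “guess rate” emerging from a no-guessing model can be illustrated by simulating a single-d′ TCM and reading a crude mixture-model-style estimate off the flat tail. The scaling function, d′ value, and tail-based estimator here are all simplifying assumptions for illustration, not the actual mixture-model fits:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, tau=30.0):
    # Hypothetical exponential-like psychological distance function.
    x = np.abs(np.asarray(x, dtype=float))
    return (1 - np.exp(-x / tau)) / (1 - np.exp(-180.0 / tau))

# Simulate continuous report from a single-d' TCM (no guessing state).
errors = np.arange(-179, 181)
means = 2.0 * (1 - f(errors))                  # one d' shared by every item
m = rng.normal(means, 1.0, size=(20000, errors.size))
resp = errors[np.argmax(m, axis=1)]

# Crude estimator: a uniform ("guess") component puts half its mass beyond
# 90 deg of error while the central component puts almost none there,
# so guess_rate ~= 2 * P(|error| > 90).
guess_rate = 2 * np.mean(np.abs(resp) > 90)
# A substantial apparent "guess rate" emerges despite no guessing state.
assert guess_rate > 0.05
```

Lowering d′ in this simulation raises the apparent guess rate smoothly, which is the trade-off pattern the literature analysis above recovers.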
Discussion
For more than half a century, and especially over the last decade, efforts have been made to quantify the capacity of working memory. The major debate in the field is whether or not there is a finite limit on the number of items that working memory can hold (Miller, 1956; Luck & Vogel, 1997; Luck & Vogel, 2013; van den Berg, Awh & Ma, 2014), and in the last decade, this debate has been addressed largely by fitting mixture models to continuous report data. Although their details differ, all of the models agree that the items within a large working memory set are represented heterogeneously in terms of precision, including some items that are totally or almost totally unrepresented. However, our findings suggest these models are fundamentally incorrect. Once the nonlinearity of psychological distance with respect to physical distance is taken into account, neither guessing nor extremely low precision items are needed to account for working memory performance. Instead, the Target Confusability Model (TCM) parsimoniously explains working memory data under the assumption that all items are represented with a single memory strength parameter (d′) that varies across different set-size conditions. Furthermore, taking into account the non-linear relationship between physical distance and psychological distance leads to a new conception of working memory performance, one that greatly simplifies the models needed to account for performance and that also generalizes between stimulus spaces and different tasks that were previously thought to measure distinct concepts.
At a broader level, the TCM provides a compelling connection to the literature on long-term recognition memory, which is often conceptualized within a signal detection framework. In this framework, there is no inherent capacity limitation, item memory strengths are assumed to vary continuously over a significant range, and mean memory strength is influenced by a variety of different variables (including similarity of the lures to the targets, or other factors known to affect working memory performance, such as encoding time or interference). This signal detection framework can also be naturally adapted to explain a number of findings that are common to the working memory and long-term memory literatures but have been difficult to explain with previous working memory models, like the relationship between confidence and accuracy (Rademaker et al., 2012; see also Wixted & Wells, 2017) and the ability of participants to respond correctly when given a second chance even if their first response was a ‘guess’ (Fougnie, Brady & Alvarez, in revision).
The fact that the TCM assumes no inherent upper limit on how many items are represented -- although d′ does decrease with set size -- places it in agreement with some variable-precision models of working memory. However, in contrast to those models, the TCM assumes that the items in a working memory set – like the items on a list in a study of long-term memory – are represented with a single, fixed precision (that is, no items are assumed to be unrepresented or very weakly represented), which calls into question the basis of most inferences about a fundamental working memory capacity limit. Rather than a single specific constraint on the system generally attributed to set size, performance in working memory, like long-term memory, is likely limited by a variety of factors (such as lure similarity, encoding time, and delay). It is important to emphasize that our claim is not that working memory and long-term memory are indistinguishable; instead, our claim is that the principles that govern working memory and long-term memory are much more similar than they are generally considered to be. Despite research on working and long-term memory operating largely independently of one another, we have provided a potential unified framework for investigating the distinctions and similarities in memory across both domains.
Methods
Fixed distance triad experiment. N=40 participants on Amazon Mechanical Turk judged which of two colors presented was more similar to a target color. Mechanical Turk users are drawn from a representative subset of adults in the United States (Berinsky, Huber, & Lenz, 2012; Buhrmester, Kwang, & Gosling, 2011), and data from Turk are known to closely match data from the lab on visual cognition tasks (Brady & Alvarez, 2011; Brady & Tenenbaum, 2013), including providing extremely reliable and high-agreement color report data (Brady & Alvarez, 2015). The target color was chosen randomly from 360 color values that were evenly distributed along a circle in the CIE L*a*b* color space. This circle was centered in the color space at (L = 54, a = 18, b = -8) with a radius of 59. The pairs of colors were chosen to be 30 degrees apart from one another, with the offset (in degrees) of the closest color to the target being 0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 120, or 150 (e.g., in the 150-degree offset condition, the two choice colors were 150 and 180 degrees away from the target color; in the 0-degree offset condition, one choice exactly matched the target and the other was 30 degrees away).
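For concreteness, the wheel described above can be generated as follows. The angular origin and direction of the circle are arbitrary choices not specified here, so they are assumptions of this sketch:

```python
import math

def wheel_color(deg, center=(54.0, 18.0, -8.0), radius=59.0):
    """CIE L*a*b* coordinates of the wheel color at a given angle (degrees),
    on a circle of constant L* in the a*-b* plane, per the Methods.
    The phase (where 0 deg falls) is an arbitrary assumption."""
    L, a0, b0 = center
    theta = math.radians(deg)
    return (L, a0 + radius * math.cos(theta), b0 + radius * math.sin(theta))

c0 = wheel_color(0)     # (54.0, 77.0, -8.0)
c90 = wheel_color(90)   # approximately (54.0, 18.0, 51.0)
```

Because L* is constant and the circle lies in the a*-b* plane, any two colors separated by a fixed angle are separated by the same chord length in 3D color space.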
Participants were asked to make their judgments solely based on intuitive visual similarity and to repeat the word ‘the’ for the duration of the trial to prevent the use of words or other verbal information. Each participant did 130 trials, including 10 repeats of each of the 13 offset conditions, each with a different target color, and trials were conducted in a random order. The trials were not speeded, and the colors remained visible until participants chose an option. To be conservative about the inclusion of participants, we excluded any participant who made an incorrect response in any of the 10 trials where the target color exactly matched one of the choice colors, leading to the exclusion of 7 of the 40 participants. In addition, based on an a priori exclusion rule, we excluded trials with reaction times <200ms or >5000ms, which accounted for 1.75% (SEM: 0.5%) of trials.
Variable distance triad experiment. N=100 participants on MTurk judged which of two colors presented was more similar to a target color, as in the fixed distance triad experiment.
However, the pairs of colors now varied in offset from each other and from the target, to allow us to accurately estimate the psychological distance function. In particular, the closest choice item could be 0, 3, 5, 8, 10, 13, 15, 20, 25, 30, 35, 45, 55, 65, 75, 85, 100, 120, 140, 160, or 180 degrees away from the target color. If we refer to these offsets as oi, such that o1 is 0 degrees offset and o21 is 180 degrees offset, then given a first choice item at offset oi, the second choice item was equally often at offset oi+1, oi+2, oi+3, oi+4, or oi+8.
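The trial count reported below follows directly from this design; a quick enumeration confirms that it yields 87 valid offset pairs, so 3 repeats of each gives 261 trials:

```python
# Enumerate the variable-distance triad design: 21 target offsets, and for a
# first choice at offset index i, a second choice at index i+1, i+2, i+3,
# i+4, or i+8 (whenever that index exists).
offsets = [0, 3, 5, 8, 10, 13, 15, 20, 25, 30, 35, 45, 55, 65, 75, 85,
           100, 120, 140, 160, 180]
steps = [1, 2, 3, 4, 8]
pairs = [(offsets[i], offsets[i + s])
         for i in range(len(offsets))
         for s in steps
         if i + s < len(offsets)]

assert len(offsets) == 21
assert len(pairs) == 87          # valid (first, second) offset pairs
assert 3 * len(pairs) == 261     # 3 repeats per pair = 261 trials, as reported
```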
Participants were asked to make their judgments solely based on intuitive visual similarity and to repeat the word ‘the’ for the duration of the trial to prevent the use of words or other verbal information. Each participant did 261 trials, including 3 repeats of each of the possible pairs of offset conditions, and trials were conducted in a random order. The trials were not speeded, and the colors remained visible until participants chose an option. We excluded any participant whose overall accuracy was more than 2 standard deviations below the mean (M=77.5%), leading to the exclusion of 8 of the 100 participants. In addition, based on an a priori exclusion rule, we excluded trials with reaction times <200ms or >5000ms, which accounted for 1.7% (SEM: 0.26%) of trials.
To compute psychological distance from this data, we used the model proposed by Maloney and Yang (2003), the Maximum Likelihood Difference Scaling method. This method assigns scaled values for each stimulus (in this case, each offset) designed to accurately predict observers’ judgments, such that equally different stimuli in the scaled space are discriminated with equal performance.
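The core of the difference-scaling idea can be illustrated as follows. This is a Python sketch of a triad likelihood under Gaussian decision noise, with hypothetical variable names; Maloney and Yang's published method also handles the maximum likelihood estimation of the scale values themselves, which we omit here:

```python
import math
from statistics import NormalDist

def triad_nll(psi, trials, sigma=1.0):
    """Negative log-likelihood of triad judgments under difference scaling.

    psi    : dict mapping offset -> scaled psychological distance from target
    trials : list of (offset_1, offset_2, chose_first) tuples, where
             chose_first is True if choice 1 was judged more similar
    Assumes Gaussian decision noise with standard deviation sigma.
    """
    Phi = NormalDist().cdf
    nll = 0.0
    for o1, o2, chose_first in trials:
        # Probability of judging choice 1 more similar to the target:
        # larger when its scaled distance psi[o1] is smaller than psi[o2].
        p = Phi((psi[o2] - psi[o1]) / sigma)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        nll -= math.log(p if chose_first else 1 - p)
    return nll

# Toy check: a scale where distance grows with offset should fit
# "chose the closer item" responses better than a flat scale.
trials = [(5, 30, True), (10, 45, True), (30, 120, True)]
graded = {o: o / 180 for o in (5, 10, 30, 45, 120)}
flat = {o: 0.5 for o in (5, 10, 30, 45, 120)}
print(triad_nll(graded, trials) < triad_nll(flat, trials))
```

Fitting then amounts to choosing the psi values that minimize this negative log-likelihood, so that equally separated stimuli in the scaled space are discriminated equally well.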
Similarity experiment. N=50 participants on MTurk judged the similarity of two colors presented simultaneously on a Likert scale, ranging from 1 (least similar) to 7 (most similar).
The colors were chosen from a wheel consisting of 360 color values that were evenly distributed along a commonly used response circle in the CIE L*a*b* color space. This circle was identical to that used in the triad experiment. The pairs of colors were chosen by first generating a random start color from the wheel and then choosing an offset (in degrees) to the second color, from the set 0, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 120, 150, 180. Participants were given instructions by showing them two examples: (1) in example 1, the two colors were identical (0 deg apart on the color wheel) and participants were told they should give trials like this a 7; (2) in example 2, the two colors were maximally dissimilar (180 deg apart on the color wheel) and participants were told they should give this trial a 1. No information was given about how to treat intermediate trials. Participants were asked to make their judgments solely based on intuitive visual similarity and to repeat the word ‘the’ for the duration of the trial to prevent the usage of words or other verbal information. Each participant did 140 trials, including 10 repeats of each of the 14 offset conditions, each with a different starting color, and trials were conducted in a random order. The trials were not speeded, and the colors remained visible until participants chose an option. 2 participants were excluded for failing a manipulation check (requiring similarity >6 for trials where the colors were identical). Based on an a priori exclusion rule, we excluded trials with reaction times <200ms or >5000ms, which accounted for 3.0% (SEM: 0.4%) of trials.
Description of Target Confusability Model. The model is a standard signal detection model of long-term memory adapted to the case of continuous report, which we treat as a 360-alternative forced-choice (360-AFC) for the purposes of the model. When probed on a single item and asked to report its color, (1) each of the 360 colors on the color wheel generates a memory-match signal mx, with the strength of this signal drawn from a Gaussian distribution, mx ~ N(d′x, 1) (Figure 3A); (2) participants report whichever color x has the maximum mx; (3) the mean of the memory-match signal for each color, d′x, is determined by its psychological distance from the target according to the empirically measured psychological distance function f(x), such that d′x = d′(1 − f(x)) (Figure 2).
Rather than using the outputs of the Maloney and Yang (2003) model for f(x), the psychological distance function, we instead use the directly measured data from the similarity experiment. In that experiment, similarity between two colors separated by x° was measured using a 7-point Likert scale, where Smin = 1 and Smax = 7 (Figure 2B). To generate f(x), we simply normalize these data to range from 0 to 1, giving a psychological similarity metric, and then subtract that metric from 1 to convert it into a psychological distance, such that f(x) = 1 - ((Sx - Smin) / (Smax - Smin)).
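This normalization is a one-line computation. The Python sketch below uses illustrative mean ratings, not the measured Sx values from the similarity experiment:

```python
# Hypothetical mean similarity ratings S_x (7-point Likert) by offset x;
# placeholders for illustration, not the measured values.
S = {0: 7.0, 30: 4.6, 90: 2.1, 180: 1.0}
S_min, S_max = 1.0, 7.0

# f(x) = 1 - (S_x - S_min) / (S_max - S_min): 0 at x = 0, 1 at x = 180.
f = {x: 1 - (s - S_min) / (S_max - S_min) for x, s in S.items()}
print(f[0], f[180])  # 0.0 for identical colors, 1.0 at maximal distance
```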
Thus, according to the model, the mean memory-match signal for a given color x on the working memory task is given by d′x = d′(1 − f(x)), where d′ is the model’s only free parameter. When x = 0, f(x) = 0, so d′0 = d′. By contrast, when x = 180, f(x) = 1, so d′180 = 0. Then, as noted above, at test each color on the wheel generates a memory-match signal, mx, conceptualized as a random draw from that color’s distribution centered on d′x. That is, mx ~ N(d′x, 1). The response on a given trial is made to the color on the wheel that generates the maximum memory-match signal, r = argmax(m).
Thus, once the psychological distance function has been collected (Figure 2A) and interpolated to have a value between 0 and 1 for every 1 degree in distance from the target, the full code for sampling an absolute value of error from the model is only two lines of MATLAB:
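Those two lines do not appear in this excerpt; an equivalent sketch of the sampling procedure, written here in Python rather than MATLAB with our own variable names and a placeholder linear distance function, is:

```python
import random

random.seed(0)
d_prime = 2.0  # the model's single free parameter

# Psychological distance at each absolute offset 0..180 deg (a linear
# placeholder here; the real f is interpolated from the similarity data).
f = [x / 180 for x in range(181)]

# Each wheel color, indexed by its signed error e relative to the target,
# generates a memory-match signal m_e ~ N(d'(1 - f(|e|)), 1).
m = {e: random.gauss(d_prime * (1 - f[abs(e)]), 1) for e in range(-180, 180)}

# The reported color is the one with the maximum match signal; the trial's
# absolute error is then |e| for that color.
abs_error = abs(max(m, key=m.get))
print(abs_error)
```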
Although the model assumes that every possible color probe elicits a memory-match signal depending on how well it matches the color of the encoded item from the probed location (i.e., the task is a 360-AFC), the predictions of the model are not significantly affected if we instead assume that subjects randomly sample a subset of the colors on the color wheel, or if we change the number of options (e.g., n = 36 or n = 36,000), when trying to decide which color appeared in the cued location (only the inferred d′ changes as a result, not the shape of the predicted distributions).
The model can also be adapted to include a motor error component. Whereas existing mixture models predict the shape of the response distribution directly and thus confound motor error with the standard deviation of memory (see Fougnie et al. 2010 for an attempt to de-confound these), our model makes predictions about the actual item that participants wish to report. Thus, if participants do not perfectly pick the exact location of their intended response on a continuous wheel during every trial, a small degree of Gaussian motor error can be included, e.g., the response on a given trial, rather than being argmax(m), can be assumed to include motor noise of some small amount, for example, 2°:
r ~ N(argmax(m), 2°)
In MATLAB notation:
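Since the MATLAB code itself does not appear in this excerpt, we give a self-contained Python sketch of the motor-noise step; the signal values and names below are toy stand-ins, not the model fit:

```python
import random

random.seed(0)

# Toy memory-match signals indexed by signed error e (in the full model
# these are draws from N(d'(1 - f(|e|)), 1); here the target at e = 0
# simply gets a stronger mean signal).
m = {e: random.gauss(2.0 if e == 0 else 0.0, 1) for e in range(-180, 180)}

# r ~ N(argmax(m), 2 deg): Gaussian motor noise around the intended
# response, wrapped back onto the circular response wheel.
intended = max(m, key=m.get)
r = (round(random.gauss(intended, 2)) + 180) % 360 - 180
print(-180 <= r < 180)
```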
In the model fitting reported in the present paper, we include a fixed normally distributed motor error with SD = 2°, although we found the results are not importantly different if we do not include this in the model.
Set size 1, 3 and 6 continuous report data. The continuous color report data used for fitting the model at set sizes 1, 3 and 6 was previously reported in Brady and Alvarez (2015). In that data, N=300 participants performed 48 trials each of continuous color report at set size 1, a distinct set of N=300 participants at set size 3, and a distinct set of N=300 at set size 6. At both set size 3 and set size 6, participants were probed on 3 of the colors on each trial, so the effects of delay and response interference can be examined in this data as well (see Figure S11 [Supplemental Material]). These experiments used the same color wheel as the current studies and were also conducted on MTurk, allowing direct comparison to the current similarity data.
Change detection experiment and associated set size 2 and 5 continuous report data. N=60 participants on MTurk performed a change detection task. This data was used to estimate d′ at set sizes 2 and 5 under this particular set of task conditions. A distinct set of N=50 participants then performed a continuous report task under similar conditions, also at set sizes 2 and 5. The d′ from the change detection task was then used to predict, a priori, the response distribution of the participants in the continuous report task. Both tasks used the standard fixed-luminance CIE L*a*b* color wheel.
Change detection task. Participants performed 200 trials of a change detection task, 100 at set size 2 and 100 at set size 5. The display consisted of 5 placeholder circles. Colors were then presented for 1000ms, followed by an 800ms ISI. At set size 2, the colors appeared at 2 random locations and placeholders remained in the other 3 locations. Colors were constrained to be at least 15° apart along the response wheel. After the ISI, a single color reappeared at one of the positions where an item had been presented. On 50% of trials at each set size, this was the same color that had previously appeared at that position; on the other 50% of trials, it was a color from the exact opposite side of the color wheel (180° from the color shown at that position). Participants’ task was to indicate whether the color that reappeared was the same as or different from the color that had initially been presented at that location. We used this data to calculate a d′ separately at set size 2 and set size 5.
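The d′ computation here is the standard old/new signal detection formula, d′ = z(H) − z(F). A Python sketch with illustrative hit and false-alarm rates (not values from the experiment):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # the z-transform (inverse cumulative normal)

def change_detection_dprime(hit_rate, false_alarm_rate):
    """Old/new signal detection sensitivity: d' = z(H) - z(F)."""
    return z(hit_rate) - z(false_alarm_rate)

# Illustrative rates only.
print(round(change_detection_dprime(0.85, 0.20), 2))  # 1.88
```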
Continuous report task. The task was identical to the change detection task except that rather than a probe color reappearing at test, one of the placeholder circles was marked with a thicker outline, and participants were asked to respond on a continuous color wheel to indicate what color had been presented at that location. Error was calculated as the number of degrees on the color wheel between the probed item and the response.
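Because the response wheel is circular, this error is conventionally the shortest angular distance between probe and response, e.g.:

```python
def circular_error(probed_deg, response_deg):
    """Shortest angular distance (deg) on the 360 deg color wheel."""
    diff = abs(probed_deg - response_deg) % 360
    return min(diff, 360 - diff)

print(circular_error(10, 350))  # wraps around the wheel: 20, not 340
print(circular_error(90, 270))  # maximal possible error: 180
```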
Fitting data from continuous report with change detection data. If its assumptions are correct, TCM can use the measured change detection d′ for each set size (involving test trials with discrete colors either 180° or 0° from the target) to predict the corresponding continuous report data (involving a color wheel with colors ranging from 0° to ±180° from the target). The change detection d′ is directly related to, but is not identical to, the d′ value used by TCM. The reason is that the actual TCM d′ value is determined by the (unknown) number of colors sampled at test. To estimate the TCM d′ from the change detection d′, we followed these steps:
(1) Evaluate the probability density function of TCM at -180° and 0° (the two options in the change detection task).
(2) Assume that in a forced-choice task where participants were given those two options, they would select them proportional to their probability density function values. Thus, use the ratio of the probability density values for 0° and -180° to calculate a d′ value for a 2-AFC task that we would expect for this TCM d′ value.
(3) To adjust for the fact that our participants performed an old/new (change detection) task, not a 2-AFC task, divide the corresponding d′ by √2 to arrive at the corresponding change detection d′ for this TCM d′ value (Macmillan & Creelman, 2004).
Thus, to map from a d′ value in the change detection task (d′cd) to a d′ in the TCM model, we make use of the following relationship:

d′cd = Φ⁻¹( gd′(0°) / (gd′(0°) + gd′(−180°)) )

where gd′(x) is the probability density function of TCM for a given TCM d′ value, and Φ⁻¹ is the inverse cumulative normal distribution.
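The three steps above can be approximated by simulation. In the Python sketch below, the densities in step (1) are estimated by Monte Carlo rather than analytically, the linear distance function is a placeholder for the measured f(x), and all names are our own; note that the √2 factors of steps (2) and (3) cancel, leaving a simple z-transform of the choice probability:

```python
import random
from statistics import NormalDist

def predicted_dcd(tcm_dprime, n_sims=1000, seed=1):
    """Monte Carlo version of steps (1)-(3): map a TCM d' to the change
    detection d' it implies, under a placeholder linear distance function."""
    rng = random.Random(seed)
    f = [x / 180 for x in range(181)]  # stand-in for the measured f(x)
    counts = {0: 0, -180: 0}
    for _ in range(n_sims):
        # Simulate one continuous report trial under TCM.
        m = {e: rng.gauss(tcm_dprime * (1 - f[abs(e)]), 1)
             for e in range(-180, 180)}
        r = max(m, key=m.get)
        if r in counts:
            counts[r] += 1
    # Relative density at 0 vs. -180 gives 2-AFC percent correct (step 2);
    # dividing the 2-AFC d' by sqrt(2) (step 3) leaves z(percent correct).
    p_correct = (counts[0] + 1) / (counts[0] + counts[-180] + 2)
    return NormalDist().inv_cdf(p_correct)
```

In practice one would tabulate this mapping over a range of TCM d′ values and invert it to recover the TCM d′ implied by a measured d′cd.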
Literature analysis. To assess our model’s prediction that previously observed trade-offs between different psychological states are measuring the same underlying parameter (d′), we conducted a literature analysis of data from color working memory research. In particular, we examined the two parameters most commonly reported by those fitting mixture models to their data, precision (in terms of SD) and guessing.
We searched for papers that used these techniques by finding papers that cited the most prominent mixture modeling toolboxes, Suchow, Brady, Fougnie & Alvarez (2013) and Bays et al. (2009). We used liberal inclusion criteria in order to obtain as many data points as possible. Our inclusion criteria were: 1) there was some delay between the working memory study array and test; 2) instructions were to remember all the items; 3) SD/guess values were reported or graph axes were clearly labeled; 4) participants were normal, healthy, and between ages 18-35; 5) colors used were in CIE L*a*b* color space (as different color spaces may have different scaling non-uniformities). For papers that did not report SD/guess values, these values were obtained by digitizing figures with clear axis labels (Rohatgi, 2011; see also Mukherjee, Seok, Vieland & Das, 2013).
These inclusion criteria resulted in a diverse set of data points, including studies using sequential or simultaneous presentation, feedback vs. no feedback, cues vs. no cues, varying encoding time (100-2000 ms), and variable delay (1-10 sec). A total of 15 papers (56 data points) were included.
Acknowledgements
We thank Viola Stormer, Rosanne Rademaker and Daryl Fougnie for comments on these ideas and on the manuscript. For funding, we would like to acknowledge NSF CAREER (BCS-1653457) to TFB.