Abstract
Names for colors vary widely across languages, but color categories are remarkably consistent [1–5]. Shared mechanisms of color perception help explain consistent partitions of visible light into discrete color vocabularies [6–10]. But the mappings from colors to words are not identical across languages, which may reflect communicative needs – how often speakers must refer to objects of different color [11]. Here we quantify the communicative needs of colors in 130 different languages, using a novel inference algorithm. Some regions of color space exhibit 30-fold greater demand for communication than other regions. The regions of greatest demand correlate with the colors of salient objects, including ripe fruits in primate diets. Using the mathematics of compression we predict and empirically test how languages map colors to words, accounting for communicative needs. We also document extensive cultural variation in communicative demands on different regions of color space, which is partly explained by differences in geographic location and local biogeography. This account reconciles opposing theories for universal patterns in color vocabularies, while opening new directions to study cross-cultural variation in the need to communicate different colors.
The color word problem
What colors are “green” to an English speaker? Are they the same as what a French speaker calls “vert”? Berlin & Kay [1] and Kay et al. [5] studied this question on a world-wide scale, surveying the color vocabularies of 130 linguistic communities using a standardized set of color stimuli (Fig. 1a). They found that color vocabularies of independent linguistic origin are remarkably consistent in how they partition color space [1]. In languages with two major color terms, one term typically describes white and warm colors (red/yellow) and the other describes black and cool colors (green/blue). If a language has three color terms, there is typically a term for white, a term for red/yellow, and a term for black/green/blue. Languages with yet larger color vocabularies remain largely predictable in how they partition the space of perceivable colors into discrete terms [2–4, 12] (Fig. 1b). What accounts for this apparent universality? This “color word problem” has sparked controversy [13–15], but it is now close to resolution based on the mathematics of compression.
Two different lines of work seek to explain the universality of color vocabularies. The first approach contends that the geometry of perceptual color space accounts for consistency across languages [6, 8, 9]. Judgements of color appearance by humans with normal color vision are remarkably stable despite genetic variability in photoreceptor spectral sensitivities [16], age-dependent variability in light filtering of the eye [17], and variation in the relative number of different classes of retinal cone photoreceptors [18]. The shared psychophysics of perception provides a common metric for color similarity, and common limits on the gamut of perceivable colors, which have each been proposed to explain commonalities in color naming [1, 6, 19–25]. Using the CIE Lab color space [26, 27] – a space in which distances approximate human judgements of color similarity – Regier et al. [8] showed that vocabularies in the World Color Survey (WCS) tend to minimize the average distance between colors that share the same term. This work suggests that the “shape” of color space and shared perception of color similarity drive universal patterns of color naming across languages.
On the other hand, recent work [11, 28, 29] found that color terms tend to reflect how often speakers need to refer to different colors, with a trend that emphasizes communication about warm hues (red/yellow) over cool hues (blue/green). In this view, the shared trend in communicative needs of colors accounts for universality, while local differences in needs explain why vocabularies are similar but not identical (Fig. 1b). Shared communicative needs may derive from the statistics of surface reflectances in natural scenes [7], or they might emphasize colors of greatest biological significance to ancestral humans – such as ripe fruits or dangerous animals [30]. However, the ability of shared scene statistics to account for color naming is disputed [31], and the relative importance of communicative need versus color perception is a topic of vigorous debate [10, 32].
Here we reconcile these two hypotheses for the origins of color naming – determined either by shared mechanisms of perception, or by shared communicative needs. We show that these explanations are in fact compatible with each other, and can accommodate cross-cultural variation, under the compression theory of color naming [7, 8, 10, 33]. We show how empirical color vocabularies across 130 languages can be understood as optimal solutions to the problem of representing the vast space of human perceivable colors with a discrete set of symbols (color terms), given a distribution of communicative needs across colors. To do this, and in contrast to prior work [32, 34], we derive a novel algorithm to directly solve the inverse problem: given a color vocabulary, we infer the distribution of communicative needs necessary to arrive at that particular compressed representation of colors. Applying this algorithm to the empirical color vocabularies recorded in the WCS, and using the metric of perceptual distortion identified by Regier et al. [8], we infer language-specific distributions of communicative needs that are consistent with the empirical findings of Gibson et al. [11]. This account provides a unifying view of the color word problem, consistent with both the psychophysics of color perception and a distribution of communicative needs across colors. Our results also provide a framework to study what factors govern cross-cultural variation in the demand to speak about different colors.
Color naming as a compression problem
In the compression model of color naming, a color x, in the set of all perceivable colors X, needs to be communicated with some probability, p(x), to a listener. The speaker cannot be infinitely precise when referring to x, and must instead use a term, x̂, from their shared color vocabulary, X̂. Many colors in X map to the same term, so that a listener hearing x̂ will not know exactly which color x was referenced. Color naming is then distilled to the following problem: how do we choose the mapping from colors to color terms? Rate-distortion theory [35–38], the branch of information theory concerned with lossy compression, provides an answer.
Mapping colors to a limited set of terms necessarily introduces imprecision or “distortion” in communication. The amount of distortion depends on a listener’s expectation about what color, x, a speaker is referencing when she utters color term x̂. Under the rate-distortion hypothesis, a language’s mapping from colors to terms allows a listener to glean as much information as possible about color x from a speaker’s choice of term (Fig. 1c).
Each color is identified with a unique position, denoted x, in a perceptually uniform color space. Here we use CIE Lab as in Regier et al. [8]. The coordinates corresponding to a color term x̂ are given by its centroid: the weighted average of all colors a speaker associates with that term. The distortion introduced when a speaker uses x̂ to refer to x is simply the squared Euclidean distance between x and the centroid of x̂ in CIE Lab, denoted d(x, x̂). Intuitively, colors that are near the centroid of x̂ are more likely to be assigned to that term than colors that are far (Fig. 1c).
The mathematics of compression provides optimal ways to represent information for a given level of tolerable distortion. The size of a compressed representation, X̂, is measured by the amount of information it retains about the uncompressed source, X, given by the mutual information I(X; X̂). Terms represent colors by specifying the probability of using a particular term x̂ to refer to a given color x, denoted q(x̂|x). Rate-distortion efficient mappings are choices of the mapping q(x̂|x) that minimize I(X; X̂) such that the expected distortion, E[d(x, x̂)], does not exceed a given tolerable level. Efficient mappings and centroid positions can be found for a large class of distortion functions known as Bregman divergences, which includes the CIE Lab measure of perceptual distance (SI Sec. A).
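Rate-distortion efficient mappings of this kind can be computed by alternating updates in the style of Blahut–Arimoto, or equivalently Bregman soft clustering. The Python sketch below is our own illustration, not the paper’s code (the paper’s computations were done in R); the function name, the trade-off parameter beta, and the initialization are all assumptions, shown here for squared-Euclidean distortion:

```python
import math
import random

def rd_color_naming(colors, p_x, n_terms, beta, n_iter=200, seed=0):
    """Sketch of a rate-distortion efficient mapping q(term | color).

    colors : list of CIE Lab coordinate triples
    p_x    : communicative need p(x) over the colors
    beta   : trade-off between rate and expected distortion (assumed scale)
    """
    rng = random.Random(seed)
    # initialize centroids at randomly chosen colors
    cent = [list(colors[rng.randrange(len(colors))]) for _ in range(n_terms)]
    p_t = [1.0 / n_terms] * n_terms  # marginal term frequencies
    q = []
    for _ in range(n_iter):
        # update q(term | color) ∝ p(term) * exp(-beta * squared distance)
        q = []
        for x in colors:
            w = [p_t[t] * math.exp(-beta * sum((a - b) ** 2
                                               for a, b in zip(x, cent[t])))
                 for t in range(n_terms)]
            z = sum(w)
            q.append([wi / z for wi in w])
        # update term frequencies and centroids (need-weighted averages)
        p_t = [sum(p_x[i] * q[i][t] for i in range(len(colors)))
               for t in range(n_terms)]
        for t in range(n_terms):
            if p_t[t] > 1e-12:
                cent[t] = [sum(p_x[i] * q[i][t] * colors[i][j]
                               for i in range(len(colors))) / p_t[t]
                           for j in range(3)]
    return q, cent
```

Larger beta tolerates less distortion and sharpens the partition; the centroid update is exactly the weighted average that defines each term’s focal position.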
Communicative needs of colors
Rate-distortion theory provides an efficient mapping from colors to terms that depends on three choices: the distortion function in color space, the degree of distortion tolerated by the language, and the probability p(x) that each color needs to be referenced during communication, called the “communicative need.” Previous studies have assumed that communicative needs are either uniform across the WCS color stimuli [3], correlated with the statistics of natural images [7], or approximated by a “capacity achieving prior” [10]. As a result, prior studies have drawn conflicting conclusions about whether communicative needs matter for color naming, and whether the compression model provides an accurate account of vocabularies at all. Here we resolve these questions by directly estimating the communicative needs of colors for each of the 130 languages in the combined B&K+WCS dataset.
Algorithm to infer communicative needs
How can we infer the underlying communicative needs of colors from limited empirical data? Here we derive an algorithm that finds the maximum-entropy estimate of the underlying communicative needs p(x) consistent with a rate-distortion optimal vocabulary with known centroid coordinates and term frequencies p(x̂), for any Bregman divergence measure of distortion.
The estimate of communicative needs has the form

p*(x) = Σx̂ p(x̂) q*(x|x̂), with q*(x|x̂) ∝ exp( z(x̂) · x ).
In words, the optimal q*(x|x̂) is the choice of q(x|x̂) that maximizes the entropy, H(X), among the set of conditional probability distributions Q whose predicted focal color coordinates match the observed coordinates for each color term. We construct this solution via a novel iterative alternating maximization algorithm, in which the vectors z(x̂) are chosen so that predicted focal color coordinates match the observed coordinates (see SI Sec. B for the derivation).
This algorithm provably converges to a unique, globally optimal, maximum-entropy estimate of the true communicative need p(x) (SI Sec. B.1 and B.2). Remarkably, we can construct this solution knowing only that the observed coordinates are rate-distortion optimal centroids, without knowledge of the specific distortion measure (SI Sec. B.3; SI Fig. B1).
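As a rough, hypothetical illustration of this moment-matching idea (not the paper’s actual algorithm, which is derived in SI Sec. B), one can fit, for each color term, an exponentially tilted distribution over the stimulus set whose mean matches the observed focal color, and then mix these conditionals by term frequency. All names and numerical choices below are ours; the fixed step size assumes roughly unit-scale coordinates:

```python
import math

def tilt(colors, z):
    """q(x) ∝ exp(z · x) over a discrete set of colors."""
    logits = [sum(zj * xj for zj, xj in zip(z, x)) for x in colors]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w]

def match_centroid(colors, target, lr=0.05, n_iter=5000):
    """Find z so that the mean of q(x) ∝ exp(z·x) equals the target centroid."""
    z = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        q = tilt(colors, z)
        mean = [sum(q[i] * colors[i][j] for i in range(len(colors)))
                for j in range(3)]
        # gradient of (z·target − log Z(z)) is (target − mean); ascend it
        z = [zj + lr * (t - mj) for zj, t, mj in zip(z, target, mean)]
    return z

def infer_need(colors, centroids, term_freq):
    """Maximum-entropy-style estimate: p(x) = Σ_t p(t) q(x | t)."""
    p = [0.0] * len(colors)
    for c, f in zip(centroids, term_freq):
        q = tilt(colors, match_centroid(colors, c))
        p = [pi + f * qi for pi, qi in zip(p, q)]
    return p
```

The exponential-family log-partition function is concave, so the gradient ascent in match_centroid converges to the unique moment-matching solution when the target lies inside the convex hull of the stimuli.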
Inference from focal colors
Our algorithm infers a language’s communicative needs from knowledge of the centroids associated with its color terms. Berlin & Kay measured the “focal color” of each color term by asking native speakers to choose from among the Munsell stimuli (Fig. 1a) the “best example” of that term. We propose that the measured focal colors are in fact the centroids for each term.† This hypothesis may appear problematic since laboratory experiments suggest focal colors and category centroids are distinct points in color space [39–41]. However, centroids in those studies were calculated under the implicit assumption of uniform communicative needs, leaving open the possibility that focal colors are centroids under the true distribution of non-uniform needs (SI Sec. A.3).
Our approach provides an unbiased inference of communicative needs. Prior work on this problem relied on strong assumptions about the form of p(x) (SI Sec. B.3), which produce implausible inferences for languages in the WCS (SI Fig. C4ab). Moreover, unlike prior work, our inference procedure does not rely on knowing the empirical mapping from colors to terms, q(x̂|x), which is the quantity that we ultimately wish to predict from any theory of color naming.
Different colors, different needs
Our analysis reveals extensive variation in the demand to speak about different regions of color space (Fig. 2a). Averaged over all 130 B&K+WCS languages, the inferred communicative needs emphasize some colors (e.g. bright yellows and reds) up to 36-fold more strongly than others (e.g. blue/green pastels and browns). This conclusion stands in sharp contrast to prior work that assumed a uniform distribution of needs [8] and attributed color naming to the shape of color space alone.
Our ability to predict the color vocabulary of a language is substantially improved once we account for non-uniform communicative needs (Fig. 2b). We find improvement in an absolute sense, as measured by the root mean squared error (RMSE) between predicted and empirically measured focal colors, and also in a relative sense, measured by percent improvement over a uniform distribution of needs. The typical change in predicted focal color once accounting for non-uniform needs is easily perceivable, corresponding to a median change of two WCS color chips (Fig. 2b right). Not only are the predicted focal points in better agreement with the empirical data, once accounting for non-uniform needs, but the entire partitioning of colors into discrete terms is substantially improved, as seen in the example languages Múra-Pirahã and Colorado (Fig. 2c).
We infer communicative needs and predict color terms using data from the first of two experiments in the WCS, which measured focal colors (Fig. 3a). This inference and prediction require fitting one parameter that controls the “softness” of the partitioning and one hyperparameter to control over-fitting (SI Sec. C). Without any additional fitting, we can then compare the predicted mappings from colors to terms to the empirical term maps measured in the second WCS experiment. For nearly all of the WCS languages analyzed (n = 110), the color term maps predicted by rate-distortion theory are significantly improved once accounting for non-uniform communicative needs (improvement in 84% of languages, Fig. 3b). Only 15% of languages show little or no improvement, plus one outlier, Huave (Huavean, Mexico), that may violate model assumptions in some significant way (see Discussion). The substantial improvement in predicted term maps can be attributed both to universal patterns in communicative needs, shared across languages, and to language-specific variation in needs (Fig. 3c).
In contrast to prior work on the compression model of color naming [8, 10], no part of our inference or prediction procedure uses empirical data on a language’s mapping from colors to terms, q(x̂|x).† Nor are our predicted color terms simply an out-of-sample prediction, since the predicted quantities, q(x̂|x), are not used to parameterize the model. And so our analysis is not simply a fit of the compression model to data, but rather an empirical test of its ability to predict color naming from first principles.
Communicative needs and the colors of salient objects
We can interpret the inferred communicative needs of colors by comparing them to what is known about the colors of salient objects. Prior work [11] suggests a warm-to-cool trend in communicative need, related to the frequency of colors that appear in foreground objects as identified by humans in a large dataset of natural images [42] (Fig. 4a). We find that the same correlation holds, at least when restricting to the middle range of lightness (color chips in rows C–H; Spearman’s ρ = 0.3, p < 0.001, n = 240). However, the pattern of communicative needs is more complex than this warm–cool gradient alone. Pastels that are greenish blue or blue, as well as brownish greens, need to be communicated less often than dark green or dark blue, for example. Moreover, dark colors in general (e.g. color chips in rows I–J) show a relatively high communicative need under our inference compared to their frequency in foreground objects of natural images (Fig. 4a).
We also compared communicative needs to spectral measurements by Sumner & Mollon [43, 44] of unripe and ripe fruit in the diets of catarrhine primates, which have trichromatic color vision and spectral sensitivities similar to humans. When projected onto the WCS color chips (see SI Fig. C6), unripe, midripe, and ripe fruit occupy distinct regions of perceptual color space (Fig. 4c) corresponding to low, medium, and high values of inferred communicative need, respectively (Fig. 4d). The morphological characteristics of fruit, including color, are known to be adapted to the sensory systems of frugivores that act as their seed dispersers, for vertebrates in general [45–47] and primates in particular [48–50]. And so our results support the hypothesis that communicative need in human cultures emphasizes the colors of salient objects that stand out or attract attention in our shared visual system across a typical range of environments‡.
Cross-cultural variation
Languages vary considerably in their needs to communicate about different parts of color space (Fig. 5a; Fig. SI C8–C24). The inferred needs for the language Waorani (Ecuador), for example, emphasize white and mid-value blues, while de-emphasizing yellows and greens, relative to the average needs of all B&K+WCS languages. Martu-Wangka (Australia), by contrast, emphasizes pinks and mid-value reds, as well as light greens, while de-emphasizing blues and dark purples (Fig. 5a). In fact, the median distance between language-specific communicative needs and the across-language average needs is nearly as large as the distance between the average needs and uniform needs (9.9 and 11.2, respectively, in units of ΔE*).
Why do language communities vary in their needs to communicate different colors? Detailed study of this question requires language-specific investigation beyond the scope of the present work. However, we can at least measure how variation in linguistic origin, geographic location, and local biogeography (Fig. 5b) relate to differences in communicative needs. We quantified these factors for pairs of languages by determining: (1) whether or not they belong to the same linguistic family in glottolog [51]; (2) the geodesic distance between communities of native speakers; and (3) whether or not language communities share the same “ecoregion,” a measure of biogeography [52] that delineates boundaries between terrestrial biodiversity patterns [53]. Our statistical analysis also controls for differences in the number of color terms between languages, because we seek to understand cross-cultural variation above and beyond any relationship between vocabulary size and (inferred) communicative needs (SI Sec. C.3). While language differences are largely idiosyncratic, we find a small but measurable impact of distance and biogeography on communicative needs (Fig. 5c, Methods: Correlates of cross-cultural differences in communicative need). In particular, increasing the geodesic distance between language communities by a factor of 10 decreases the mean similarity in their communicative needs by 2.9% ([1.7%, 4.2%] 95% CI), while sharing the same ecoregion increases the mean similarity by 8.4% ([3.9%, 12.7%] 95% CI). By contrast, we find no significant effect of language genealogy on communicative needs, at least at the coarse scale of language family. Taken together, these results suggest that color vocabularies are adapted to the local context of language communities.
Discussion
We have inferred language-specific needs to communicate about different colors, using a novel algorithm that applies to any rate-distortion Bregman clustering. Accounting for non-uniform needs substantially improves our ability to predict color vocabularies across 130 languages. In contrast to prior work, our inference and predictions do not use any information on the mappings from colors to terms, allowing us to test the compression model of color naming against empirical data.
The distribution of communicative needs, averaged across languages, reflects a warm-to-cool gradient, as hypothesized in Gibson et al. [11]; and it is related to object salience more generally, as indicated by the positioning of ripe fruit coloration in regions of highest need. We also document extensive variation across languages in the demands on different regions of color space, correlated with geographic location and the local biogeography of language communities.
Our analysis provides clear support for the compression model of color naming. Whereas prior work has established the role of shared perceptual mechanisms for universal patterns in color naming, our results highlight communicative need as a source of cross-cultural variation that must be included for agreement with empirical measurements. A catalogue of language-specific needs (Fig. SI C8–C24) will enable future study into what drives cultural demands on certain regions of color space, and how they relate to contact rates between linguistic communities, shared cultural history, and local economic and ecological contexts. Our methodology also provides a theoretical framework and inference procedure to study categorization in other cognitive domains, including other perceptual domains of diverse importance worldwide [55], and even in non-human cognitive systems that exhibit categorization (e.g. the zebra finch, Taeniopygia guttata, a songbird [56, 57]).
Several languages have been advanced as possibly invalidating the universality of color categories [13–15]. Languages are known to vary in the degree to which different sensory domains are coded [55, 58, 59], and in Pirahã and Warlpiri the existence of abstract terms for colors has been disputed [60, 61]. Moreover, the color vocabularies in Karajá and Waorani notably lack alignment with the shape of perceptual color space [3]. Once we account for communicative needs, however, we find that the color terms of Karajá and Waorani are well explained by rate-distortion theory. Likewise, while Pirahã may seem exceptional when assuming uniform communicative needs, we recover accurate predictions once accounting for a non-uniform distribution of needs (Fig. 2c, Fig. 3b)†.
Nevertheless, several languages show little or no improvement in predicted term maps using inferred versus uniform communicative needs, and Warlpiri is among these cases. Before drawing conclusions about exceptionalism, however, we note that several technical assumptions of our analysis may be violated for these languages. For one, we assumed that basic color terms are used with equal frequency, to first approximation. This is a reasonable assumption given that basic color terms are elicited with roughly equal frequency under a free naming task in e.g. English [41]. Moreover, the inferred distributions of needs for WCS and B&K languages are relatively insensitive to non-uniformity of color term frequency, up to variation by a factor of 1.5 (SI Sec. C.2, SI Fig. C2d). Still, this assumption may not be accurate enough for all languages, and the frequencies of color terms require future empirical study. Another possibility is that the WCS stimuli themselves, i.e. the set of Munsell chips, may work well for identifying focal colors in most languages but may be too restrictive for the languages that show little improvement. Future field and lab work could remedy this by broadening the range of color stimuli used in surveys.
Another limitation of the WCS is variability in chroma across the Munsell color chips used as stimuli, which might bias participants’ choice of focal color positions [29, 62–65]. We do indeed find a small but statistically significant correlation (Spearman’s ρ = 0.13, p = 0.019) between Munsell chroma and the average inferred distribution of communicative need across WCS languages. However, if this bias dominated the choice of focal colors in the WCS, then we would not expect distributions of need inferred from focal colors to improve predictions of color term maps. The fact that we do see substantial improvement suggests that any bias this effect introduces is evidently not large enough to disrupt the relationship between focal color positions and color term maps for most languages. Nor would chromatic bias in stimuli explain the cross-cultural variation in communicative needs that we observe, since the set of stimuli was held constant across languages.
Our study has focused on how languages partition the vast space of perceivable colors into discrete terms, and how communicative needs shape this partitioning. Why some languages use more basic color terms than others remains an open topic for cross-cultural study. In principle, the issue of tolerance to imprecision in color communication is orthogonal to the distribution of communicative needs in a community. In practice, the number of color terms has a small impact on the resolution of inferred needs (SI Fig. C3a), which we control for in cross-cultural comparisons (SI Fig. C3b). Nonetheless, languages that have similar vocabulary sizes tend to have more similar communicative needs across colors, and this covariation is greater than any effect of vocabulary size on the resolution of our inferences (SI Fig. C3). These results suggest that causal factors driving vocabulary size may also influence a culture’s communicative demands on colors – a hypothesis for future research.
Future empirical work may begin to unravel why cultures vary in their communicative demands on different regions of color space. It is already known that natural environments vary widely in their color statistics [66, 67], and this variation matters for color salience [68]. The need to reference certain objects, as well as their salience relative to similar backgrounds, may help explain why communities that share environments prioritize similar regions of color space, as we have seen. And so shared environment, physical proximity, and shared linguistic history at a finer scale than language family are all plausible avenues for future study on the determinants of color demands. Beyond these factors, there remains substantial interest in cultural features that we have not studied here, including religion, agriculture, trade, access to pigments and dyes, and different ways of life, all of which can shape a community’s needs to refer to different colors, and the resulting language that emerges.
Methods
World Color Survey
Berlin & Kay [1] and Kay et al. [5] surveyed color naming in 130 languages around the world using a standardized set of color stimuli. The stimuli (Fig. 1a), a set of Munsell† color chips, were designed to cover the gamut of human perceivable colors at maximum saturation, across a broad range of lightness values. Native speakers were asked to choose among the basic color terms in their language to name each color chip, one at a time, in randomized order. The WCS study surveyed 25 native speakers in each of 110 small, pre-industrial language communities; the B&K study surveyed one native speaker in each of 20 languages from a mixture of both large (e.g. Arabic, English, and Mandarin) and comparatively small (e.g. Ibibio, Pomo, and Tzeltal) pre- and post-industrial societies.
The stimuli provided by the Munsell color chips are a function of the color pigment of the chips and the ambient light illuminating them. The ambient light source was approximately controlled by conducting the survey at noon and outdoors in shade, corresponding to CIE standard illuminant C. To the extent possible, participants were surveyed independently, although preventing the discussion of responses among participants was not always possible (discussed in Regier et al. [9]).
In our treatment of the color naming data, for each language we include all recorded terms that had an associated focal color, were used by at least two surveyed speakers (unless the language is from B&K, in which case only one speaker was surveyed), and were considered the best choice for at least one WCS color chip.
The 20 B&K languages were included in our analyses where appropriate: comparisons based on focal colors and inferred communicative needs. They were excluded from term map comparisons because term maps in B&K were estimated differently than in the WCS [69], and they do not provide straightforward estimates of q(x̂|x). In addition, B&K languages with significant geographic extent, e.g. Arabic and English, were excluded from statistical analysis of the correlates of cross-cultural differences in communicative needs, because estimating geographic distance or local biogeography would make little sense for these languages.
RMSE of focal color predictions
Language-specific focal color positions were compared to model predictions using the root mean squared error (RMSE) between observation and prediction in units of CIE Lab ΔE*, computed for each WCS language i according to

RMSEi = √[ (1/(3 ni)) Σx̂∈X̂i Σj ( ŷx̂(j) − yx̂(j) )² ],

where the superscript (j) runs over the three coordinates in the CIE Lab color space of the position vectors ŷx̂ and yx̂, corresponding respectively to the predicted and empirically observed coordinates of the focal color for term x̂ in language i’s vocabulary, X̂i. Here ni = |X̂i| denotes the number of basic color terms in language i’s vocabulary.
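In Python (the paper’s calculations were done in R), this RMSE comparison can be sketched as follows; the per-coordinate normalization by 3n is our reading of the prose description and should be checked against the original formula:

```python
import math

def focal_rmse(pred, obs):
    """Per-coordinate RMSE, in CIE Lab units, between predicted and observed
    focal colors. pred and obs map each color term to an (L*, a*, b*) triple."""
    assert pred.keys() == obs.keys()
    n_terms = len(pred)  # number of basic color terms in the vocabulary
    sq = sum((p - o) ** 2
             for term in pred
             for p, o in zip(pred[term], obs[term]))
    return math.sqrt(sq / (3 * n_terms))
```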
Spectral measurements of ripening fruit
Spectral measurements of ripening fruit in the diets of catarrhine primates were obtained from the Cambridge database of natural spectra.‡ Reflectance data for fruit taken from the Kibale Forest, Uganda, were converted to CIE XYZ 1931 color space coordinates using CIE standard illuminant C. We then converted points from XYZ to CIE Lab space using the XYZ values for CIE standard illuminant C (2° standard observer model) as the white point, in order to match the WCS construction of CIE Lab color chip coordinates. Calculations were performed in R (v3.6.3) using the package colorscience (v1.0.8).
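The XYZ-to-CIE-Lab conversion with an illuminant C white point can be sketched directly (a Python stand-in for the R colorscience routine; the default white-point values below are the standard CIE illuminant C, 2° observer coordinates):

```python
def xyz_to_lab(X, Y, Z, white=(98.074, 100.0, 118.232)):
    """Convert CIE XYZ coordinates to CIE Lab.

    The default white point is CIE standard illuminant C (2-degree observer),
    matching the construction of the WCS chip coordinates.
    """
    def f(t):
        # CIE-standard cube-root function with its linear segment near zero
        return t ** (1.0 / 3.0) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = (f(c / w) for c, w in zip((X, Y, Z), white))
    L = 116 * fy - 16
    a = 500 * (fx - fy)
    b = 200 * (fy - fz)
    return L, a, b
```

By construction the white point itself maps to (L*, a*, b*) = (100, 0, 0).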
Indicators of fruit ripeness include non-visual cues, such as odor, in addition to color. Therefore, to measure visual salience we considered only fruit that had a discernible (in terms of CIE Lab ΔE*) difference between unripe and ripe measurements (see Fig. C6a for determination of the statistical threshold on change in chromaticity). For fruits with detectable changes in chromaticity, we projected their unripe, midripe, and ripe positions onto the WCS color chips such that absolute lightness, L*, and the ratio of a* to b* were preserved (Fig. C6b).
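A hypothetical sketch of such a projection: for each fruit measurement, pick the chip that best matches its lightness and its hue angle (which fixes the a*/b* ratio). The cost function, and in particular the weight balancing lightness against hue mismatch, are our own assumptions rather than the paper’s procedure:

```python
import math

def project_to_chips(lab, chips):
    """Project a CIE Lab point onto the chip whose lightness L* and hue angle
    atan2(b*, a*) best match. A simple nearest-match sketch."""
    L, a, b = lab
    hue = math.atan2(b, a)
    def cost(chip):
        Lc, ac, bc = chip
        diff = hue - math.atan2(bc, ac)
        dh = math.atan2(math.sin(diff), math.cos(diff))  # wrapped hue difference
        return (L - Lc) ** 2 + (50 * dh) ** 2  # weight 50 is an assumed scale
    return min(chips, key=cost)
```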
Measuring distance between distributions over colors
We quantified the perceptual difference between any two distributions over the WCS color chips in terms of their Wasserstein distance (used in Fig. 5c), defined as

W(p, q) = minr∈R Σx,x′ r(x, x′) ‖x − x′‖,

where R is the set of joint distributions satisfying Σx r(x, x′) = p(x′) and Σx′ r(x, x′) = q(x). The CIE Lab coordinates of x and x′ are given by the vectors x and x′, respectively, and the Euclidean distance between them approximates their perceptual dissimilarity, by design of the CIE Lab system. Under this measure, a small displacement in CIE Lab space of distributional emphasis is distinguishable from a large displacement. For example, consider a discrete distribution p(x) that places mass α on a single chip xp and is uniform elsewhere, and let the distribution q(x) be defined identically except substituting chip xq for xp. Then the Wasserstein distance between p and q will increase with the Euclidean distance between xp and xq, whereas e.g. the Kullback–Leibler divergence between p and q would remain constant for any xp ≠ xq.
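For intuition, the Wasserstein distance between two uniform distributions on equally many points reduces to an optimal one-to-one assignment, which can be brute-forced for tiny supports (the analyses in the paper use a general transport solver, the emdist package in R; the function below is our own illustrative sketch):

```python
import math
from itertools import permutations

def wasserstein_uniform(points_p, points_q):
    """Exact Wasserstein distance between uniform distributions on two
    equally sized sets of CIE Lab points, by brute force over assignments.
    Only practical for very small supports."""
    assert len(points_p) == len(points_q)
    n = len(points_p)
    def d(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # minimize the average transport cost over all one-to-one assignments
    return min(sum(d(points_p[i], points_q[pi]) for i, pi in enumerate(perm)) / n
               for perm in permutations(range(n)))
```

For two point masses the distance is simply the Euclidean distance between their locations, so it grows with displacement, unlike the Kullback–Leibler divergence.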
We used a generalization of this distance measure to quantify the match between predicted and measured term maps. To make this comparison we find the minimum-CIE-ΔE* partial matching between predicted and measured term map categories for each term x̂ (used in Fig. 3b). To do this we find the minimum cost achievable by any assignment of chips empirically labeled x̂ to those predicted to be labeled x̂, weighted by the measured and predicted q(x̂|x), respectively. The best partial matching accommodates the fact that predicted and measured categories can differ in total weight. This measure is known as the Earth mover’s distance [70] (EMD), which has the Wasserstein distance as a special case with matching total weights. Both measures were computed in R (v3.6.3) using the emdist (v0.3-1) package.
Correlates of cross-cultural differences in communicative need
We modeled the pairwise dissimilarity in communicative need between B&K+WCS languages as a log-linear function of the geodesic distance between language communities, shared linguistic family, and shared ecoregion, using a maximum-likelihood population-effects (MLPE) model structure to account for the dependence among pairwise measurements [54]. For languages j = 2, …, n, i = 1, …, j − 1, we use a generalized linear mixed-effects model with log link of the form

log E[w_(ij)] = β₀ + β₁ d_(ij) + β₂ f_(ij) + β₃ e_(ij) + β₄ k_(ij) + τ_i + τ_j,

where w_(ij) is the Wasserstein distance between the inferred distributions of communicative need for languages i and j; d_(ij) is their estimated geodesic distance (Haversine method) in standardized units (normalized by standard deviation) based on geographic coordinates in glottolog (restricting to languages with small geographic extent); f_(ij) is a binary indicator of being in the same linguistic family or not (1 or 0, respectively); e_(ij) is a binary indicator of being in the same ecoregion or not; and k_(ij) is the difference in their number of color terms, which we include as a control. The random effects τ₁, …, τ_n model the dependence structure of the pairwise measurements. Model diagnostics suggest reasonable behavior of residuals using a log link function (SI Fig. C7). Fitted coefficients indicate an increase in dissimilarity with geodesic distance and a decrease in dissimilarity with shared ecoregion, but no significant effect of shared language family (SI Fig. C7). GLMM fits were performed in R (v3.6.3) using the lme4 (v1.1-21) package, with MLPE structure based on code from resistanceGA [71]. Model diagnostics based on simulated residuals were done using the DHARMa (v0.2.6) package.
Pseudo-R² measuring overall model fit was computed as the squared correlation between w_(ij) and ŵ_(ij), where ŵ_(ij) is the model-predicted value for w_(ij), based on Zheng & Agresti [72]. However, there is no standard, single measure of R² for models with mixed effects. A recent proposal [73, 74] suggests reporting two separate quantities: a conditional R², which can be interpreted as measuring the variance explained by fixed and random effects combined, and a marginal R², which measures the variance explained by the fixed effects alone. We computed these as

R²_conditional = (σ²_fixed + σ²_τ) / (σ²_fixed + σ²_τ + σ²_residual) and R²_marginal = σ²_fixed / (σ²_fixed + σ²_τ + σ²_residual),

respectively, where σ²_fixed is the variance of the fixed-effect predictions, based on Nakagawa et al. [74]. We based the inclusion of fixed effects on AIC (Fig. 5c) following best practices for MLPE models [75].
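A minimal sketch of the Zheng & Agresti style pseudo-R², the squared Pearson correlation between observed responses and model-fitted values. The simulated "observed" and "fitted" vectors below are stand-ins for illustration, not output of the actual MLPE fit.

```python
import numpy as np

def pseudo_r2(observed, fitted):
    # squared Pearson correlation between observations and fitted values
    return np.corrcoef(observed, fitted)[0, 1] ** 2

rng = np.random.default_rng(0)
w = rng.gamma(shape=2.0, scale=1.0, size=200)      # stand-in for distances w_(ij)
w_hat = 0.8 * w + rng.normal(scale=0.3, size=200)  # stand-in for fitted values
print(pseudo_r2(w, w_hat))
```

A perfect fit gives pseudo-R² of exactly 1; uncorrelated predictions give a value near 0.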
Supplementary Information
A Rate-distortion theory
Rate-distortion theory [1, 2] provides a mathematical treatment of the problem of lossy compression, based on information-theoretic quantities. In information theory [1], the entropy of a discrete random variable, X, defined as

H(X) = −Σ_{x ∈ 𝒳} p(x) log p(x),

provides a measure of the average length of the shortest description ("amount of information") needed to specify the outcome of random variable X, with outcomes x in the set 𝒳 occurring with probability p(x). The joint entropy of X and a second random variable Y, H(X, Y), is defined similarly in terms of the joint distribution of X and Y, p(x, y), and measures the average length of the shortest description needed to specify the outcomes of both random variables together. When the outcome of X is related to the outcome of Y in some (possibly nonlinear and stochastic) way, then the shortest description of both X and Y together may be shorter than the shortest descriptions of each of X and Y separately. In general, H(X, Y) ≤ H(X) + H(Y), with equality if and only if X and Y are statistically independent. The mutual information between X and Y, defined as

I(X; Y) = H(X) + H(Y) − H(X, Y),

then gives a non-negative measure of the average amount of information X and Y contain about each other, which is nonzero if and only if X and Y are not independent.
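These definitions can be checked numerically; the toy joint distribution below is illustrative only.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                     # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))

# toy joint distribution p(x, y) over two binary variables (rows: x, cols: y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H_x, H_y = entropy(p_x), entropy(p_y)
H_xy = entropy(p_xy.ravel())         # joint entropy H(X, Y)
I_xy = H_x + H_y - H_xy              # mutual information I(X; Y)

print(H_x, H_y)     # 1.0 1.0: both marginals are uniform
print(I_xy)         # positive, since X and Y are dependent
```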
In the lossy-compression context, for a given source (random variable) X and a description of that source, X̂, the mutual information I(X; X̂) measures the amount of information the description contains about X, and it is this quantity we wish to minimize for compression, subject to a loss function, i.e. a measure of distortion. This can be formalized as

R(D) = min_{p(x̂|x) : E[d(x, x̂)] ≤ D} I(X; X̂),

where the loss is measured in terms of an expected distortion, E[d(x, x̂)] = Σ_{x, x̂} p(x) p(x̂ | x) d(x, x̂), with p(x) a property of the source, and the mapping of x to x̂ chosen to achieve on average the smallest description size possible, R(D), for a given allowable average distortion, D. Intuitively, the minimum compressed description size, R(D), increases as the allowable average distortion, D, decreases, in a manner that depends on the details of the source, X, and the loss function, d.
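The tradeoff can be traced numerically with the standard Blahut–Arimoto iteration for a fixed tradeoff parameter β; the discrete source, reproduction points, and squared-error distortion below are illustrative choices, not the paper's perceptual setup.

```python
import numpy as np

def rd_point(p_x, d, beta, iters=500):
    """One point on the R(D) curve: rate (nats) and expected distortion at beta."""
    m = d.shape[1]
    q_hat = np.full(m, 1.0 / m)                     # marginal over reproductions
    for _ in range(iters):
        q_cond = q_hat[None, :] * np.exp(-beta * d) # q(xhat|x) ∝ q(xhat) e^{-beta d}
        q_cond /= q_cond.sum(axis=1, keepdims=True)
        q_hat = p_x @ q_cond                        # update the marginal
    joint = p_x[:, None] * q_cond
    rate = np.sum(joint * np.log(q_cond / q_hat[None, :]))
    return rate, np.sum(joint * d)

p_x = np.full(4, 0.25)                              # uniform source on four symbols
x = np.array([0.0, 1.0, 2.0, 3.0])
xhat = np.array([0.5, 2.5])                         # two reproduction points
d = (x[:, None] - xhat[None, :]) ** 2               # squared-error distortion

r_lo, d_lo = rd_point(p_x, d, beta=0.1)             # permissive: low rate, high distortion
r_hi, d_hi = rd_point(p_x, d, beta=10.0)            # strict: high rate, low distortion
print(r_lo <= r_hi and d_lo >= d_hi)  # True
```

Sweeping β traces out the R(D) curve: larger β buys lower distortion at the cost of a higher rate.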
A.1 Bregman clustering
The classical formulation of the rate-distortion tradeoff gives an optimal mapping of X to X̂ for fixed X̂. When every x and x̂ has coordinates in a vector space, denoted x and x̂, respectively, then for a large family of distortion measures known as Bregman divergences, optimal coordinates for each x̂ can be found [3] in addition to the optimal mapping between X and X̂. For a Bregman divergence

d_ϕ(x ‖ x̂) = ϕ(x) − ϕ(x̂) − ⟨x − x̂, ∇ϕ(x̂)⟩,

defined with convex function ϕ, gradient ∇ϕ(x̂) evaluated at x̂, and inner product denoted ⟨·, ·⟩, the centroid of the mapping from x̂ to X is the minimizer of the average distortion for x̂, i.e.

x̂ = Σ_x p(x | x̂) x.   (14)
Solutions to rate-distortion Bregman clustering (RDBC) problems have the property that each x̂ satisfies Eq. 14.
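A minimal RDBC sketch with squared Euclidean distance (a Bregman divergence with ϕ(x) = ‖x‖²), alternating a soft assignment with the centroid update of Eq. 14. The two-cluster data, initial centroids, and β are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),    # "chips" around (0, 0)
               rng.normal(2.0, 0.2, (50, 2))])   # "chips" around (2, 2)
p_x = np.full(len(X), 1.0 / len(X))              # uniform communicative need
centroids = np.array([[0.5, 0.5], [1.5, 1.5]])   # illustrative initialization
p_hat = np.full(2, 0.5)
beta = 50.0                                      # tradeoff (inverse temperature)

for _ in range(100):
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # soft assignment q(xhat|x) ∝ p(xhat) exp(-beta d); shift d row-wise for stability
    q = p_hat[None, :] * np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
    q /= q.sum(axis=1, keepdims=True)
    p_hat = p_x @ q                              # term frequencies p(xhat)
    p_x_given_hat = (p_x[:, None] * q) / p_hat[None, :]
    centroids = p_x_given_hat.T @ X              # Eq. 14: centroid = E[x | xhat]

lo, hi = centroids[np.argsort(centroids[:, 0])]
print(lo.round(1), hi.round(1))  # one centroid near each cluster mean
```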
A.2 Compression model of color naming
The first (implicitly) RDBC model of color naming appears in work by Yendrikhovskij [4]. Using a perceptual measure of distortion, Yendrikhovskij [4] showed that efficient solutions to a tradeoff between average perceptual distortion and vocabulary size account for color categories based on natural image statistics. While the results are likely sensitive to the exact, unreported choice of “natural images” used to produce the image statistics [5], the conceptual link to a rate-distortion tradeoff has proved highly productive. Using the same RDBC-based compression model but disregarding scene statistics, relying instead on the neurophysiological constraints of perceptual discrimination and gamut alone, Regier et al. [6] showed that the compression model of color naming can qualitatively explain many of the typical vocabularies of natural languages in the WCS. Subsequent work by Zaslavsky et al. [7] investigated a “soft” partitioning variant of this same conceptual framework, allowing for uncertainty in the mapping between terms and colors. In all cases, implicitly or explicitly, we can equivalently restate these compression-based accounts of color naming in terms of RDBC with a perceptual measure of distortion.
A.3 Focal colors as category centroids
In the World Color Survey, participants were asked to identify among the WCS color chips the “best example” of each basic color term identified in their vocabulary. In the WCS instructions to scientists conducting the field work, this is intended to elicit a response in the participant that identifies a color chip that “… is a good, typical, or ideal… ”† example of a given color term. In this work, we hypothesize that focal colors are observations of the centroids defined by Eq. 14. Two objections to this hypothesis immediately arise.
First, past work has shown that empirical measurements of category centroids differ from focal point positions [8–10], which would seem to invalidate our hypothesis. However, the discrepancy can be resolved by understanding how past work measured category centroids. Sturges and Whitfield [9], following earlier work by Boynton and Olson [8], conducted a color naming experiment similar to the WCS but in controlled laboratory conditions (and for English speakers only). As in the WCS, participants were asked to name presented color chips one by one in randomized sequence, and both the response and its timing were recorded. The chips with the shortest response times were considered the focal colors, and despite the difference in method these appear to be in good agreement with the “best example” focal colors recorded by Berlin & Kay for English speakers.
For each participant, the centroid of a category was computed as the average of all the color chips (in a given color space) that the participant named with that category's color term (e.g. “red,” “green,” etc.). To write this out mathematically, we have a sequence of participant responses, x̂(1), x̂(2), …, x̂(n), where each response is a color term, i.e. x̂(i) ∈ 𝒳̂, elicited by an experimenter-presented color chip, x(1), x(2), …, x(n), where x(i) ∈ 𝒳. Note that each color chip in 𝒳 was presented more than once in the sequence of n presentations. Then the centroid for category x̂ was computed as

c_x̂ = Σ_i 1(x̂(i) = x̂) x(i) / Σ_i 1(x̂(i) = x̂),

where 1(·) is the indicator function equal to 1 if its argument is true and 0 otherwise, and x(i) gives the coordinates of color x(i) in color space. Let m(x̂, x) count the occurrences of response x̂ given presentation of color chip x; then we have

c_x̂ = Σ_x m(x̂, x) x / Σ_x m(x̂, x).
Let n(x) count the occurrences of x in the sequence. Then n(x) = Σ_x̂ m(x̂, x), and p(x̂ | x) = m(x̂, x)/n(x) gives the fraction of times x̂ was used to name x, out of a total of n(x) occurrences. Since each color chip was presented the same number of times, we have further that n(x) = m for all x. Then we have equivalently

c_x̂ = Σ_x p(x̂ | x) x / Σ_x p(x̂ | x).
Lastly, note that Σ_x p(x̂ | x) = |𝒳| p(x̂), where |𝒳| is the total number of color chips used, i.e. the cardinality of 𝒳, and p(x̂) is the fraction of occurrences of x̂ in the sequence. Thus we have

c_x̂ = Σ_x p(x̂ | x) x / (|𝒳| p(x̂)),

which by Bayes' rule is equivalent to our definition of centroid with a uniform distribution of communicative need over the color chips, i.e. p(x) = 1/|𝒳|. Thus in past work centroids have been shown to differ from focal colors when a uniform distribution of communicative need over color chips is assumed. In this paper, by contrast, we show that by inferring and using a non-uniform distribution of communicative need we better predict both empirical color term maps and focal point positions, and that focal point positions coincide with category centroids under this non-uniform distribution of needs.
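The dependence of a category centroid on the assumed distribution of need can be seen in a few lines: the same naming data p(x̂ | x) yields different centroids under uniform versus non-uniform p(x). The one-dimensional chip coordinates and naming probabilities below are invented for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])               # chip coordinates
p_term_given_x = np.array([1.0, 1.0, 0.5, 0.0])  # p(xhat|x) for one term

def centroid(p_x):
    # Bayes: p(x|xhat) ∝ p(x) p(xhat|x); centroid = sum_x p(x|xhat) x
    w = p_x * p_term_given_x
    return np.sum(w * x) / np.sum(w)

uniform = np.full(4, 0.25)
skewed = np.array([0.1, 0.2, 0.3, 0.4])          # greater need for larger x
print(centroid(uniform), centroid(skewed))       # 0.8 vs ~1.11: same naming
                                                 # data, different centroid
```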
The second objection stems from work by Abbott et al. [11] investigating a measure of the “representativeness” of focal colors based on color category extents for the WCS. Representative colors of a given category x̂ are not necessarily those with the highest likelihood, i.e. maximizing p(x | x̂), but instead are the most likely relative to their likelihood given any other category, weighted by the prior of that category, i.e. maximizing p(x | x̂) / Σ_{x̂′ ≠ x̂} p(x̂′) p(x | x̂′). This appears problematic for the hypothesis that category centroids are equivalent to focal points, due to the bijection between Bregman divergences and regular exponential-family distributions, and the equivalence between Bregman divergence minimization and maximum-likelihood estimation [3, 12]. Again, it is crucial to examine definitions to see that the discrepancy is resolved by the assumption placed on the form of p(x | x̂). In Abbott et al. [11], p(x | x̂) was assumed to be normally distributed, whereas under the compression hypothesis the maximum likelihood is taken over the mixture model as a whole, and the form of p(x | x̂) is not normally distributed in general. The broader message of Abbott et al. [11] is that focal color positions reflect a balance between typicality within a category and distinction from other categories; this interpretation agrees with our identification of focal colors as category centroids when category centroids “compete” to represent different parts of color space, as in the compression model of color naming.
B Inverse inference of source distribution
In this section we address the general problem of inferring an unknown source distribution, p(x), from knowledge of its compressed representation (i.e. a representation that lies on the rate-distortion curve for some unknown value of the tradeoff parameter, β). Concretely, we wish to find the q(x) that best approximates the unknown distribution p(x) using only what we know about X and X̂ from the compressed representation, with no other assumptions. For a fixed marginal distribution p(x̂) over X̂, this can naturally be expressed as a problem of finding the conditional distributions q(x | x̂) that together maximize the entropy of the marginal distribution over X, subject to a set of constraints enforcing that we recover the known compressed representation, i.e.

max_{q(x|x̂)} H(X),

where H(X) is the Shannon entropy of q(x) = Σ_x̂ p(x̂) q(x | x̂).
We show that a numerical solution to this problem can be found via an alternating maximization strategy of the kind used by Blahut and Arimoto in their solutions to the channel-capacity and rate-distortion problems [13, 14] and later generalized by Csiszár & Tusnády [15]. To do so, we first note that the objective function can be rewritten as

H(X) = H(X | X̂) + I(X; X̂),

using the fact that q(x) = Σ_x̂ p(x̂) q(x | x̂). Here Q is the set of all conditional probability distributions q(x̂ | x) such that Σ_x̂ q(x̂ | x) = 1 for all x. Since the mutual information term can be written as a maximization over q(x̂ | x) ∈ Q [13, 14], i.e.

I(X; X̂) = max_{q(x̂|x) ∈ Q} Σ_{x, x̂} p(x̂) q(x | x̂) log ( q(x̂ | x) / p(x̂) ),

and H(X | X̂) is constant with respect to varying q(x̂ | x), we can rewrite our objective function as a double maximization of the function

J[q(x | x̂), q(x̂ | x)] = H(X | X̂) + Σ_{x, x̂} p(x̂) q(x | x̂) log ( q(x̂ | x) / p(x̂) ),

changing the problem into one of alternating maximizations over q(x | x̂) and q(x̂ | x).
The inner maximization over q(x̂ | x) for constant q(x | x̂) is given by the posterior

q(x̂ | x) = p(x̂) q(x | x̂) / Σ_{x̂′} p(x̂′) q(x | x̂′),

as previously shown by Blahut and Arimoto. The outer maximization over q(x | x̂) must respect a set of constraints ensuring that we recover x̂ as a minimum-distortion representation of x and that we have a valid probability distribution, i.e.

Σ_x q(x | x̂) x = x̂ for all x̂ ∈ 𝒳̂,   (27)
Σ_x q(x | x̂) = 1 for all x̂ ∈ 𝒳̂,
q(x | x̂) ≥ 0 for all x ∈ 𝒳, x̂ ∈ 𝒳̂,

where x denotes the coordinates of x. Eq. 27 enforces that there is no difference between the true compressed-representation centroids and those generated by the estimated q(x | x̂), while the remaining two constraints ensure that q(x | x̂) is a proper probability distribution.
Temporarily setting aside the non-negativity constraint (it will be enforced by the form of the solution), the Lagrangian for fixed q(x̂ | x) is

L = J + Σ_x̂ ⟨ν_x̂, Σ_x q(x | x̂) x − x̂⟩ + Σ_x̂ μ_x̂ (Σ_x q(x | x̂) − 1).

Taking the derivative with respect to q(x | x̂) and setting it equal to zero, we have

q(x | x̂) ∝ (q(x̂ | x) / p(x̂)) e^{⟨ν_x̂, x⟩ + μ_x̂},

where we absorb a factor of 1/p(x̂) into each Lagrange multiplier. If the function d is a Bregman divergence, i.e. it can be written as d_ϕ(u ‖ v) = ϕ(u) − ϕ(v) − ⟨u − v, ∇ϕ(v)⟩ for some convex function ϕ, then for a rate-distortion-optimal representation, with q(x̂ | x) ∝ p(x̂) e^{−β d_ϕ(x ‖ x̂)}, the x̂-dependence of the distortion enters the exponent only linearly in x, through the gradient term ∇ϕ(x̂). For the constraint Σ_x q(x | x̂) = 1 to hold, the Lagrange multipliers μ_x̂ must act as a normalization factor, giving us

q(x | x̂) = w_x̂(x) e^{⟨ν_x̂, x⟩} / Σ_{x′} w_x̂(x′) e^{⟨ν_x̂, x′⟩}, where w_x̂(x) = q(x̂ | x)/p(x̂).

This also satisfies the non-negativity constraint for each q(x | x̂), since w_x̂(x) ≥ 0 and e^z ≥ 0 for any z ∈ ℝ. Finally, we can combine the unknown normalizing scalar and the unknown vector into the single unknown vector ν_x̂, giving

q(x | x̂) ∝ w_x̂(x) e^{⟨ν_x̂, x⟩},

where ν_x̂ must be chosen such that Eq. 27 is true.
For any Bregman divergence, d_ϕ(u ‖ v) = 0 iff u = v (see Banerjee et al. [3]). Thus to enforce Eq. 27, we need to find ν_x̂ such that Σ_x q(x | x̂) x = x̂. Writing w_x̂(x) = q(x̂ | x)/p(x̂), let

Λ(ν) = log Σ_x w_x̂(x) e^{⟨ν, x⟩}.

Then the vector of partial derivatives of Λ with respect to ν is given by

∇Λ(ν) = Σ_x w_x̂(x) e^{⟨ν, x⟩} x / Σ_x w_x̂(x) e^{⟨ν, x⟩} = Σ_x q(x | x̂) x.

Since Λ is strictly convex, we have by Legendre transform its convex conjugate dual, Λ*(y) = sup_ν (⟨ν, y⟩ − Λ(ν)), with vector of partial derivatives ∇Λ*. By the strict convexity of Λ and the definition of the Legendre transform we have that ∇Λ* = (∇Λ)⁻¹, i.e. there is a unique choice of ν for each attainable value of ∇Λ(ν). The unique choice of ν guaranteeing Σ_x q(x | x̂) x = x̂ is then simply ν_x̂ = ∇Λ*(x̂), which can be computed numerically via e.g. BFGS.
The alternating maximization algorithm is then to iterate

q_t(x̂ | x) = p(x̂) q_t(x | x̂) / Σ_{x̂′} p(x̂′) q_t(x | x̂′),   (39)
q_{t+1}(x | x̂) ∝ (q_t(x̂ | x) / p(x̂)) e^{⟨ν_x̂, x⟩},   (40)

with ν_x̂ chosen at each step so that Eq. 27 holds, starting from any initial q₀(x | x̂). By construction, the choice of q_t(x̂ | x) maximizes J for fixed q_t(x | x̂), and q_{t+1}(x | x̂) maximizes J for fixed q_t(x̂ | x), subject to their respective constraints. We thus have a sequence, indexed by t, of non-decreasing values of J, which converges whenever the maximum entropy is finite. The solution for the marginal distribution of X is then given by q(x) = Σ_x̂ p(x̂) q(x | x̂).
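The loop can be sketched as follows for a one-dimensional toy support, solving the inner ν problem with BFGS. The support, focal points, and term frequencies are invented for illustration, and this is a simplified sketch rather than the paper's full implementation.

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(0.0, 3.0, 31)[:, None]    # support points (1-D "colors")
focals = np.array([[0.75], [2.25]])       # known centroids xhat
p_hat = np.array([0.5, 0.5])              # known term frequencies p(xhat)

q_x_given_hat = np.full((len(focals), len(x)), 1.0 / len(x))
for _ in range(50):
    # Eq. 39: posterior q(xhat|x) from the current q(x|xhat)
    joint = p_hat[:, None] * q_x_given_hat
    q_hat_given_x = joint / joint.sum(axis=0, keepdims=True)
    # Eq. 40: exponential-family update; choose nu so the centroid matches
    for k, target in enumerate(focals):
        w = q_hat_given_x[k] / p_hat[k]
        def dual(nu):                     # Lambda(nu) - <nu, xhat>, convex in nu
            return np.log(np.sum(w * np.exp(x @ nu))) - nu @ target
        nu = minimize(dual, np.zeros(x.shape[1]), method="BFGS").x
        qk = w * np.exp(x @ nu)
        q_x_given_hat[k] = qk / qk.sum()

q_x = p_hat @ q_x_given_hat               # inferred source distribution q(x)
print(np.allclose(q_x_given_hat @ x, focals, atol=1e-3))  # True: centroids matched
```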
B.1 Convergence to the global optimum
In this section we will show that the alternating maximization algorithm defined by Eq. 39 and Eq. 40 converges to the global maximum of J for any initial choice of q₀(x | x̂). We will do this using a geometric approach developed by Csiszár & Tusnády [15]†, which, for example, can be used to prove convergence to the global optimum of the alternating minimization algorithm proposed by Blahut [14] for numerical solutions of the rate-distortion problem. First, note that maximizing J is equivalent to minimizing D[p, q] = −J[p, q], where p denotes the conditional q(x | x̂) ∈ P and q denotes the conditional q(x̂ | x) ∈ Q. Then by Theorems 1 and 2 of Csiszár & Tusnády [15], to show convergence to the global minimum via alternating minimizations of D it is sufficient to show that the “three points property” and “four points property” both hold for D and a choice of functional, δ.
(From Csiszár & Tusnády [15]) Let δ[p, p′] be a non-negative valued function on P × P such that δ[p, p] = 0 for each p ∈ P. Given D and δ, for a p ∈ P the three points property holds if

δ[p, p_{t+1}] + D[p_{t+1}, q_t] ≤ D[p, q_t]

whenever p_{t+1} = arg min_p D[p, q_t]. The four points property holds for a p ∈ P if for every q ∈ Q

D[p, q_t] ≤ δ[p, p_t] + D[p, q]

whenever q_t = arg min_q D[p_t, q].
We will show that the three and four points properties hold for D and the following choice of δ:

δ[p, p′] = Σ_x̂ p(x̂) DKL[ p(x | x̂) ‖ p′(x | x̂) ].   (43)
Non-negativity of Eq. 43 follows directly from the non-negativity of the KL-divergence and of p(x̂), as does the fact that equality holds iff p = p′.
We will also make use of the fact that we can rewrite both the definition of δ given by Eq. 43 and D in terms of the following Bregman divergence,

d_ψ(x ‖ y) = Σ_i w_i Σ_j ( x_ij log (x_ij / y_ij) − x_ij + y_ij ),

where w_i are constant non-negative weights that sum to one, and x_ij, y_ij ≥ 0, not necessarily summing to one. In this case ψ is the strictly convex function ψ(x) = Σ_i w_i Σ_j x_ij log x_ij. Then with w_i = p(x̂_i), i indexing elements of 𝒳̂, and j indexing elements of 𝒳, we have that

δ[p, p′] = d_ψ(p ‖ p′)   (45)

and, up to terms that do not vary during the alternating minimization, D[p, q] can likewise be expressed in terms of the same divergence (Eq. 50).
Lemma 1. The three points property, δ[p, p_{t+1}] + D[p_{t+1}, q_t] ≤ D[p, q_t], where p_{t+1} = arg min_p D[p, q_t], holds.
Proof. Rewriting the claim using Eq. 45 and Eq. 50, and cancelling the common terms, the required inequality follows immediately from the Generalized Pythagoras Theorem [3] together with the fact that, by construction, solutions of Eq. 40 maximize J for fixed q(x̂ | x).
Lemma 2. The four points property, D[p, q_t] ≤ δ[p, p_t] + D[p, q], where q_t = arg min_q D[p_t, q], holds.
Proof. From the definitions of D and δ, and subtracting the terms common to both sides, it suffices to verify the resulting inequality after substitution. Denoting q_t(x) = Σ_x̂ p(x̂) q_t(x | x̂), from Eq. 39 we have that q_t(x̂ | x) = p(x̂) q_t(x | x̂) / q_t(x). Substituting and simplifying, using the facts that Σ_x̂ q(x̂ | x) = 1 and Σ_x q(x | x̂) = 1 (Eqs. 61 and 62), the claim reduces to the statement that 0 ≤ DKL[q(x) ‖ q_t(x)], which is true by non-negativity of the KL-divergence.
Theorem B.1. The sequence of alternating maximizations defined by Eq. 39 and Eq. 40 converges to the global maximum of J for any initial choice of q₀(x | x̂).
Proof. Theorem B.1 follows from the five points property of Csiszár & Tusnády [15], which is implied by the three and four points properties established in Lemma 1 and Lemma 2, respectively.
B.2 Uniqueness
In the previous section we showed that the solution found by the alternating maximization algorithm is globally optimal. Here we show that the optimal q(x) distribution is also unique.
The distribution q*(x) achieving the maximum of H(X) is unique.
Proof. Assume q*(x) is not unique, i.e. that there exists a distinct solution q′(x) ≠ q*(x) that also achieves the maximum of H(X). Then two things must be true.
First, since q*(x) and q′(x) are distinct, 0 < DKL[q*(x) ‖ q′(x)]. From the definition of the KL-divergence, and using the fact that q(x) = Σ_x̂ p(x̂) q(x | x̂), we have that

0 < Σ_x̂ p(x̂) Σ_x q*(x | x̂) log ( q*(x) / q′(x) ),   (65)

since after expanding the marginal in the first argument and cancelling, none of the remaining terms depend on x except the marginals q*(x) and q′(x).
Second, since both q*(x) and q′(x) achieve the global optimum, we must have that H(q*) = H(q′). Then after cancelling signs we have

Σ_x q*(x) log q*(x) = Σ_x q′(x) log q′(x).

From the definition of q(x | x̂) in Eq. 40 and the equivalence q(x) = Σ_x̂ p(x̂) q(x | x̂), both sides can be expanded over terms x̂ in the same way as above. Then, cancelling the common terms from both sides, and using the fact that Σ_x q*(x) = Σ_x q′(x) = 1, we have

Σ_x̂ p(x̂) Σ_x q*(x | x̂) log ( q*(x) / q′(x) ) ≤ 0.
But this contradicts the inequality established by Eq. 65. Thus q*(x) must be unique.
B.3 Example inference and comparison to prior work
As an illustrative example, we present the results of the inverse inference method above for a known distribution p(x). This “toy” example allows us to study the properties of the inverse inference when the ground truth, p(x), is known. We also use this example to compare our inference method to a different method used in the literature on color naming, called the “capacity-achieving prior” (CAP). Rather than solving for the maximum-entropy distribution consistent with a rate-distortion-optimal vocabulary, the CAP method assumes instead that the true p(x) will be one such that, given a vocabulary of term mappings p(x̂ | x), we only ever need to communicate the x's that are maximally unambiguous to specify with that vocabulary. The CAP distribution is the one that achieves the maximum channel capacity for the given term map (the specification of the channel from X to X̂), i.e. satisfying

p_CAP(x) = arg max_{p(x)} I(X; X̂),

where the mutual information is evaluated with the channel p(x̂ | x) held fixed. This is a strong assumption in general, and when it is violated, as we will see in this section, the CAP can be a very poor approximation of the true distribution p(x).
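The CAP baseline can be computed with the standard Blahut–Arimoto capacity iteration for a fixed channel; the three-chip, two-term channel below is invented for illustration.

```python
import numpy as np

def capacity_achieving_prior(channel, iters=2000):
    """Blahut-Arimoto capacity iteration: channel[i, k] = p(xhat_k | x_i)."""
    p = np.full(channel.shape[0], 1.0 / channel.shape[0])
    for _ in range(iters):
        q_hat = p @ channel                        # current output marginal
        with np.errstate(divide="ignore"):
            logratio = np.where(channel > 0, np.log(channel / q_hat), 0.0)
        D = (channel * logratio).sum(axis=1)       # D_i = KL(p(.|x_i) || q_hat)
        p = p * np.exp(D)
        p /= p.sum()
    return p

# Chip 1 is named ambiguously (either of two terms, 50/50); chips 0 and 2 are
# unambiguous. The CAP concentrates (nearly) all mass on the unambiguous chips,
# regardless of how often the ambiguous chip actually needed to be communicated.
channel = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0]])
print(capacity_achieving_prior(channel).round(3))  # [0.5 0.  0.5]
```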
In our toy example, x ∈ 𝒳 covers the unit grid (n = 100 × 100) with an arbitrary but specified distribution p(x), as shown in Fig. B1a (ground truth). The figure also shows the RDBC solutions for 4 and 8 terms (Fig. B1a and B1b, respectively; this example uses squared Euclidean distance as the distortion measure). The ground-truth distribution p(x) was chosen to be non-uniform, with a broad probability gradient from (0, 0) to (1, 1), and a smaller-scale oscillation in probability (low to high to low to high) along the x-axis. The RDBC centroids and Voronoi (nearest-centroid) regions show a non-uniform division of 𝒳 into clusters (or “terms,” to link this to the terminology of the color naming problem), as a result of using a non-uniform p(x).
Based only on the positions of the focal terms x̂ and the term frequencies p(x̂), our inverse method produces an estimate of p(x) that recapitulates the broad-scale features of the ground truth. The inverse inference performs well even with as few as 4 terms (Fig. B1a), with some additional fine-scale details captured when inferring from 8 terms (Fig. B1b). The corresponding CAP distributions, which are not based on x̂ and p(x̂) alone but in addition require knowing the full term map p(x̂ | x), deviate significantly from the ground truth (Fig. B1a and B1b; note different scale).
In Fig. B1c, the entropies of inferred and CAP solutions are shown for a broader range of vocabulary sizes (from 2 to 10). The figure also quantifies the dissimilarity between the ground truth and the estimated distributions, based on their KL-divergence. Successive iterations of the inverse inference algorithm show monotonic convergence to a maximal entropy value that lies between the ground-truth entropy and the unconstrained maximum-entropy distribution (uniform over 𝒳). Note there are only small differences between the maximum entropy values achieved when varying the vocabulary size used (the equivalent of the number of basic color terms). While not directly constrained by the inverse inference method, since the ground-truth distribution is assumed unknown, the inverse method converges to distributions that are very close to the true distribution. Solutions become closer to ground truth as the vocabulary size increases, but even small vocabularies provide inferences that closely approximate the ground truth. By comparison, CAP solutions have entropies that are substantially lower than the maximum or even the ground-truth entropy, and they are sensitive to vocabulary size. CAP solutions are orders of magnitude more divergent from ground truth, compared to the results of the inverse inference method we have developed.
C Application to color categories
We use the inverse inference method of SI Sec. B to find the distributions of communicative need for empirical color vocabularies via the following correspondence (outlined in Fig. 1c). In this application, the source, X, denotes the visible colors that need to be communicated, here the WCS stimulus set. Each WCS stimulus color, x, in the set of WCS stimuli, 𝒳, has a position x in CIE Lab, a perceptually uniform color space. The unknown distribution of communicative need we wish to infer is p(x). Our estimate of p(x) will be the one that best matches the known position, x̂, of each “best-example,” or focal, color for each term, x̂, in the language's color vocabulary, 𝒳̂, and is otherwise maximally unbiased (i.e. maximizes the entropy of the inferred distribution).
Intuitively, in the inverse inference procedure (SI Sec. B), the vectors ν_x̂ can be thought of as “pulling” on the inferred distribution such that the inferred centroids match the positions of the true centroids. In the example shown in SI Sec. B.3, the positions of the true centroids lie well within the interior of the region spanned by the x positions. To match prior work and the WCS itself, we use the WCS color chips (Fig. 1a) as the support set for the inverse inference. Since WCS participants selected focal colors from this same set, the average focal color position across participants could lie on or near the boundary of the support set if there is high agreement among participants. To match these positions with the given support set, the inverse method would be forced to “pull” with overly large magnitudes towards these remote points, when this is just an artifact of the constraints on participants and the choice of support.
To check if this was the case, and to mitigate any impact it may have, we constrained the maximum magnitude that any ν_x̂ could have, and varied this bound as a parameter, λ. At λ = 0 the inverse method makes no attempt to match the language centroids, and we recover only the uniform distribution over the WCS color chips. At λ = ∞, we recover the unconstrained inverse inference method. At intermediate values of λ, pathologically large magnitudes have limited impact on the inference. If indeed there are pathologically large magnitudes at play, then there should be a large difference between the entropy at λ = ∞, where the inferred distribution becomes overly concentrated at the problematic focal point, and at intermediate values of λ for which nearly the same RMSE between inferred and true focal points is achieved. Fig. C2a shows that this is exactly the case, and suggests that λ ≤ 0.25 is sufficient to achieve RMSEs close to the unconstrained solutions, while maintaining substantially higher entropies.
Note that RMSE is measured using the empirical focal points and the positions of the focal points for the optimal rate-distortion fit using the inferred distribution at a given value of λ. Rate-distortion solutions were found using the standard alternating minimization algorithm (see Banerjee et al. [3]), where q(x̂ | x) ∝ p(x̂) e^{−β d(x, x̂)}, and β is a parameter that acts as an “inverse temperature,” controlling the “softness” (low values of β) or “hardness” (high values) of the boundaries between terms given by q(x̂ | x). Since RDBC solutions are not unique, we run the algorithm starting from many different initial conditions (initial positions drawn uniformly at random from the set of WCS color chips) until convergence (change in positions between iterations < 1 × 10⁻⁵, or the maximum number of iterations reached; at most 1 × 10⁴ iterations in searches for the optimal value of β, and 5 × 10⁴ for calculation of the RDBC solution at the optimal β), and keep the solution with the lowest mean squared error. We used a standard derivative-free nonlinear optimization method (bound optimization by quadratic approximation [17], via the nloptr (v1.2.1) package for R v3.6.3) to search for the value of β with the lowest mean squared error.
For each B&K+WCS language, the minimal RMSE for inverse inference with λ ≤ 0.25 is shown in Fig. C2b (y-axis), and compared with the minimal RMSE (same optimization procedure for non-unique RDBC solutions and choice of β) for a uniform distribution of need (x-axis). In all cases, use of the inferred distribution reduces RMSE compared to uniform (all points below the 1–1 line). As useful references, we quantified the RMSE for within-language variability in focal point positions among participants (via bootstrap resampling of participant responses, measuring their RMSE with respect to the mean focal point positions for that language), as well as the RMSE when all terms are off by one WCS color chip. Most inferred-distribution RMSEs fall between the median values of these two reference quantities, which is not the case for uniform.
Similarly, in Fig. C2c we show the absolute improvement in term map predictions for the WCS languages shown in Fig. 3b, comparing the Earth mover's distance (EMD) between predicted and empirical term maps based on inferred (y-axis) and uniform (x-axis) distributions. WCS languages were used for term map comparisons both for the ability to resample from among speaker responses (the B&K data surveyed only one speaker per language) to assess confidence intervals on improvement in Fig. 3b, and because the B&K study design differed substantially from the WCS in the color naming task.† In the WCS, color naming was assayed for each color chip, whereas in B&K participants selected chips out of the full set of stimuli [18]. While the B&K term maps are related to p(x̂ | x), they are not straightforward estimates of p(x̂ | x) as in the WCS, and behave qualitatively differently. As useful reference points, we computed the EMD between empirical vocabularies and rotations thereof, approximated by cycling WCS columns 2:41. This transform preserves the structure of each vocabulary while increasing the displacement (in hue) between the true and rotated terms, and has been used in prior work on color naming [6]. Here it provides a more meaningful distance scale for the EMD measurements than, e.g., chip-wise randomization.
C.1 RMSE reference points
We provide three points of comparison for the RMSE distributions shown in Fig. 2b. First, the “WCS variability” reference line was computed by resampling participant focal point choices by language, recomputing the mean focal point across resampled participants, and measuring the RMSE between the recomputed focal points and the actual language focal points. We used the median computed RMSE as a useful reference point approximating a lower bound on how well predicted focal points might be expected to perform. Second, the “off-by-one” reference line was computed by repeatedly offsetting each focal point by one WCS chip sampled uniformly at random from the neighborhood of WCS color chips in Fig. 1a and measuring the RMSE between the set of perturbed focal points for a language and the actual focal points. The median computed RMSE in this case gives an intermediate point of comparison for predicted focal point RMSE distributions. Third, the “random” reference line was computed by resampling each language’s focal points from the WCS color chips uniformly at random without replacement, then assigning each resampled focal point to the nearest true focal point, and measuring the RMSE of the two sets of focal points under this assignment. This gives an approximate upper bound on how poorly a predicted set of focal points might perform, using the same procedure for assigning predicted focal points under the rate-distortion model to actual language focal points.
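The “off-by-one” reference can be sketched as follows; the focal positions and the neighbor offsets (longer steps along one axis than the other, loosely mimicking unequal chip spacing in a perceptual embedding) are invented for illustration, not taken from the WCS lattice.

```python
import numpy as np

rng = np.random.default_rng(0)
focals = np.array([[2.0, 3.0], [7.0, 1.0], [5.0, 6.0]])        # toy focal points
# one-chip offsets in a toy embedding with unequal spacing along the two axes
neighbors = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

def off_by_one_rmse(focals, n_rep=1000):
    vals = []
    for _ in range(n_rep):
        # perturb each focal point by one randomly chosen neighboring chip
        shift = neighbors[rng.integers(0, len(neighbors), size=len(focals))]
        perturbed = focals + shift
        err2 = np.sum((perturbed - focals) ** 2, axis=1)
        vals.append(np.sqrt(err2.mean()))
    return np.median(vals)

print(off_by_one_rmse(focals))  # between 1 (all short steps) and 2 (all long steps)
```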
C.2 Sensitivity of inferred distributions to term frequencies, p(x̂)
The inverse inference algorithm uses the frequencies of vocabulary terms, p(x̂), as part of the inference process for determining p(x). This information is not available for color vocabularies in the B&K and WCS datasets, and further field work would be required to estimate these quantities directly (with the additional caveat that vocabularies may have changed since they were originally surveyed). The present work therefore uses the simplifying assumption of a uniform p(x̂). This is a reasonable approximation, given the WCS selection criteria for basic color terms and evidence in English that basic color terms are elicited with approximately equal frequency under a free naming task [10]. Nevertheless, to investigate the sensitivity of inferred distributions to this choice, we compared inferred distributions under increasingly asymmetric (“skewed”) distributions for p(x̂), sampled from a linearly increasing set of probabilities between the minimum and maximum p(x̂). Fig. C2d shows the Wasserstein distance between the inferred distributions under uniform and skewed p(x̂), as a function of the ratio between the maximum and minimum p(x̂). As a useful reference point, we computed the median Wasserstein distance between inferred distributions under the uniform assumption, re-sampled from the WCS language speaker populations. Ratios in usage greater than approximately 1.5 would be needed before non-uniformity would begin to have a larger impact on inferred distributions than the among-speaker variability inherent in the data. While this suggests the choice of a uniform p(x̂) is reasonable to a first approximation, the extent to which the assumption of uniformity may be violated in some languages remains an open, but potentially tractable, question for future field work.
C.3 Sensitivity of inferred distributions to vocabulary size
Under the rate-distortion hypothesis, color vocabularies optimize the information a listener can infer about the color being referenced, based on the color term chosen by a speaker. Because there are far fewer terms than perceivable colors, compressing colors into terms necessarily loses some information. As a result, the size of a vocabulary (its number of terms) should affect our ability to infer the underlying distribution of communicative needs, p(x): larger vocabularies should provide more resolution and more detail in the inferred distribution. For the B&K+WCS languages, we can therefore expect that fewer terms will allow recovery of only broad-scale features of a language’s communicative needs, while more terms allow for additional detail.
This effect is demonstrated in Fig. C3a. Here we generated rate-distortion efficient vocabularies for simulated languages, each drawn from the same underlying distribution of communicative needs and differing only in the number of color terms. We then inferred the distribution of needs from the simulated focal color positions, using our inverse inference method. As expected, more color terms allow more detail to be recovered in the inferred distribution of needs, although the results are qualitatively similar across a range of vocabulary sizes.
We also investigated the relationship between vocabulary size and inferred needs more systematically. To do so, we again generated rate-distortion efficient vocabularies for pairs of languages sharing the same underlying communicative needs and differing only in vocabulary size. We used the B&K+WCS average inferred distribution as a “ground truth,” and restricted the number of terms in each simulated vocabulary to the range of vocabulary sizes found in the B&K+WCS data. We then inferred the communicative needs for each simulated vocabulary in a pair and measured the Wasserstein distance between them. Fig. C3b shows a small but statistically significant effect of differences in vocabulary size on the distance between inferred distributions of need, which arises because vocabulary size affects the resolution of the inference. For comparison to these simulations, in which the underlying needs are held constant, we also plotted the distances between inferred needs for all pairs of B&K+WCS languages in the empirical data. The empirical distances are much larger than can be explained by the simulated data, implying that differences in vocabulary size alone cannot account for the large differences observed among B&K+WCS inferred communicative needs. Moreover, the relationship between differences in vocabulary size and differences in inferred needs is substantially stronger in the empirical data than in the simulated languages. This suggests that there may be typical ways in which communicative needs evolve as the vocabularies of languages change in size – an interesting hypothesis for future study.
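The comparisons above rely on the Wasserstein (earth mover's) distance between inferred distributions. Over the two-dimensional chip grid this requires solving an optimal-transport problem, but as a minimal illustration of the quantity being computed, the one-dimensional case on a shared support has a closed form as the integral of the absolute difference between the two CDFs (the helper name is our own):

```python
import numpy as np

def wasserstein_1d(p, q, support):
    """First Wasserstein distance between probability vectors p and q
    defined on the same sorted one-dimensional support: the integral of
    |CDF_p - CDF_q| over the support."""
    cdf_diff = np.cumsum(p) - np.cumsum(q)
    widths = np.diff(support)
    return float(np.sum(np.abs(cdf_diff[:-1]) * widths))

# Moving all probability mass one unit to the right costs exactly 1.
d = wasserstein_1d([1.0, 0.0], [0.0, 1.0], support=[0.0, 1.0])
```

Unlike pointwise measures such as total variation, this distance grows with how far mass must be moved, which is why it is a natural metric for distributions over a perceptual color space.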
C.4 Capacity-achieving distributions for individual WCS languages
The capacity-achieving distributions, referred to as capacity-achieving priors (CAPs) in the literature, should not in general be expected to approximate the true distribution of communicative need, as shown in SI Sec. B.3. Here we reproduce the average CAP across the WCS languages reported by Zaslavsky et al. [7, 19]. The average CAP differs by several orders of magnitude from the average distribution p(x) inferred in this paper (Fig. C4a). The language-specific CAPs for Waorani and Martu-Wangka are shown in Fig. C4b: each differs radically from the communicative needs estimated by our inference method. The CAP distributions feature implausible variation in communicative need across nearby colors.
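For concreteness, the capacity-achieving prior of a discrete channel can be computed with the standard Blahut-Arimoto iteration. This is a generic textbook sketch, not the study's code; here the channel would be the conditional distribution of chips given terms induced by a vocabulary, but the routine applies to any row-stochastic channel matrix:

```python
import numpy as np

def capacity_achieving_prior(channel, n_iter=200):
    """Blahut-Arimoto iteration for the input distribution r(x) achieving
    the capacity of a channel with conditionals channel[x, y] = p(y|x).
    Alternates a posterior step q(x|y) with a prior update
    r(x) proportional to exp(sum_y p(y|x) log q(x|y))."""
    n_x = channel.shape[0]
    r = np.full(n_x, 1.0 / n_x)  # start from the uniform prior
    for _ in range(n_iter):
        joint = r[:, None] * channel
        q = joint / joint.sum(axis=0, keepdims=True)   # posterior q(x|y)
        log_r = np.sum(channel * np.log(q + 1e-300), axis=1)
        r = np.exp(log_r - log_r.max())                # stabilized update
        r /= r.sum()
    return r

# A symmetric binary channel has a uniform capacity-achieving prior.
cap = capacity_achieving_prior(np.array([[0.9, 0.1], [0.1, 0.9]]))
```

The point made in the text is that this object is determined entirely by the channel's geometry, which is why it can vary implausibly across nearby colors and need not resemble the empirical distribution of communicative need.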
C.5 Field work variability
Variability in how the WCS field work was conducted for different languages does not appear to explain the instances of non-improvement in the Fig. 3b term map predictions. For the WCS, native speakers were asked to use only the basic color terms of their language, as previously identified according to a set of specific linguistic criteria. In some cases, however, native speakers apparently were not so constrained, whether by experimenter or participant choice. Based on the identification of these two modes in the WCS by Gibson et al. [20] in their supplementary materials, there was no apparent relationship between the choice of methodology and whether a language showed improvement under the inferred distribution versus the uniform distribution.
Footnotes
† Or by a mixture of the best-choice focal colors when there was more than one best choice.
† More precisely, we propose the measured focal colors are the best approximation to the true centroid among the set of WCS color stimuli.
† Nor do we use empirical term maps for selection among the small set of non-unique rate-distortion optimal solutions. In this study, selection is based on focal points alone. See SI Sec. C.
‡ Note that these results do not imply communicative needs are determined by the need to name fruit specifically.
† Although this does not necessarily imply that color terms in Pirahã are abstract; see Regier et al. [9].
† The Munsell color system was created as a means to index human-perceivable color by hue, value, and chroma, at empirically measured perceptually uniform intervals along each dimension. In the WCS notation, rows correspond to equally spaced Munsell values, and columns 1–40 correspond to equally spaced Munsell hues. For column 0 the Munsell chroma is 0; for all other columns the Munsell chroma was chosen as the maximum for the given hue and value.
† The focal color assays of B&K and the WCS were, however, essentially the same; hence both data sets are included in the other analyses, where only focal color estimates are necessary.