Abstract
Language is acquired and processed in complex and dynamic naturalistic contexts, involving simultaneous processing of connected speech, faces, bodies, objects, etc. How words and their associated concepts are encoded in the brain during real-world processing is still unknown. Here, the representational structure of concrete and abstract concepts was investigated during movie watching to address the extent to which brain responses dynamically change depending on contextual information. First, averaging across contexts, concrete and abstract concepts are shown to encode different experience-based information in separable sets of brain regions. However, these differences diminish when multimodal context is considered. Specifically, the response profile of abstract words becomes more concrete-like when they are processed in visual scenes highly related to their meaning. Conversely, when the visual context is unrelated to a given concrete word, the activation pattern more closely resembles that of abstract conceptual processing. These results suggest that while concepts encode habitual experiences on average, the underlying neurobiological organization is not fixed but depends dynamically on available contextual information.
Significance Statement The capability of extracting and representing meaningful concepts from words is a unique function of human cognition. It allows us to think, communicate, and behave in goal-directed ways. Previous studies have used isolated sentences or words to study the underlying neurobiological mechanisms. However, humans learn and process language in rich, multimodal, and dynamic contexts: how does this information affect conceptual processing? We used functional MRI (fMRI) to analyze patterns of brain activity corresponding to concepts processed in naturalistic context. We found that multimodal contextual information can alter the organization of conceptual representation in the brain, suggesting that the underlying neurobiology is context dependent and organized in a distributed way.
Humans acquire and process language in situated multimodal contexts, through dynamic interactions with their environment. For example, children may learn what the word “tiger” means primarily via sensory-motor experience: they see one on TV, or they are told that a tiger looks like a big cat. Conversely, the experience required for understanding the more abstract concept behind “good” will likely include an evaluation of the rational and emotional motives underlying intentional actions. Consequently, while more concrete concepts have external physical references (they refer to objects or actions that are easily perceived in the world) (1), more abstract concepts do not necessarily have such reference (they generally refer more to inner states of the mind, culture, or society) (2). Is this difference reflected in concrete and abstract representations in the brain during naturalistic processing? And are these static, or can they change as a function of the multimodal contexts in which processing occurs?
Most studies of concrete and abstract processing are not naturalistic in that they present words or sentences in isolation from the rich contexts in which we usually process them. Collectively, these studies suggest that concrete and abstract concepts engage separate brain regions involved in processing different types of information (3–7). Concrete words engage regions involved in experiential processing (8, 9). For example, motor-related cortices activate during the processing of action verbs like “throw” (10) or action-related nouns like “hammer” (11, 12), auditory cortices for sound-related words like “telephone” (13, 14), and visual cortices for color-related words like “yellow” (15, 16). These results are consistent with the view that we learn and neurobiologically encode concrete concepts in terms of the sensory and motor experiences associated with their referents. In contrast, some studies of abstract concepts have found greater activation in brain regions associated with general linguistic processing (5, 17, 18). These findings suggest that abstract concepts are learnt by understanding their role in linguistic context, including semantic relationships with other words (e.g., “democracy” is understood through its relationships to words like “people”, “parliament”, “politics”, etc.) (12, 19). However, neurobiological data also support the view that abstract conceptual representation retains sensorimotor information (20) as well as internal experiences like emotion and interoception (7, 21), which are also important for learning abstract concepts (22).
A limitation of these studies is that they have only investigated processing of decontextualized concepts (see, e.g., Table 1 in a recent review (23)). That is, the methodologies used (perhaps implicitly) assume that conceptual representation in the brain is the product of a stable set of regions processing different types of information depending on whether a concept is concrete or abstract. However, this dichotomy may not account for the way in which we typically process concepts (24), if the information encoded during conceptual processing depends on the contextual information available. For example, the situated processing of concrete concepts like “chair” could be linked to many internal elements associated with abstract concepts like goals (“I want to rest”), motivations (“I have been standing for two hours”), emotions (“I would like to feel comfortable”), and theory of mind (“are those older people more in need of this chair than me?”). Conversely, an abstract concept like “truth” is no longer particularly abstract when used in reference to a perceived physical situation (such as “it is true that it is snowing”) that matches the utterance’s meaning (that it is snowing). Here, “truth” refers to a concrete state of the world (25).
Indeed, previous behavioral data support the view that contextual information can affect conceptual processing (26–28). For example, when an object is depicted in a context consistent with its use, the action associated with using the object is more readily available than when the context is more consistent with picking the object up (29). There is also neurobiological evidence that objects visually present in a situation can influence conceptual processing (30, 31). For example, the task-related color-congruency of objects correlates with less activation of brain regions involved in color perception during processing, likely because less retrieval of detailed color knowledge is necessary (15). However, no studies have examined the extent to which context can modulate the representational nature of concepts in the brain. The present study aims to fill this gap and test the following two predictions.
First, we submit that results from previous investigations of conceptual processing, which generally depict a stable dichotomy between concrete and abstract words, reflect the average experiential information of the type of situational context in which those words are habitually experienced. Therefore, we predict that concrete concepts, because they retain experiences related to their physical referents that are predominantly characterized by sensory and motor information, will be associated with brain regions involved in sensory and motor processing (32, 33). In contrast, because their representations mostly reflect information related to internal experience as well as more general linguistic processing, we expect abstract concepts to activate brain regions associated with emotional, interoceptive, and general linguistic processing (34).
Second, the reviewed work also suggests that these habitual representations are not necessarily stable and might change during naturalistic processing depending on the specific contextual information available. We specify two context conditions: a concept is displaced if its context offers little or no visual information related to the concept’s external sensory-motor features, and situated if its context contains visual objects related to its meaning. We predict that when a concrete concept is processed in displaced situations (e.g., “cat” processed while the general character traits of cats vs dogs are being discussed), response profiles will include regions related to processing of emotional, interoceptive, and general linguistic information shared with abstract concepts. In contrast, when an abstract concept is processed in a situated context (for example, the word “love” processed in a scene with people kissing), its representation will more heavily draw on regions involved in sensory and motor processing that are shared with concrete concepts.
We propose that this erosion of the concrete/abstract dichotomy during contextualized processing shows that both concrete and abstract concepts draw on information related to experience as well as linguistic association. Which associated neurobiological structures are engaged during processing depends dynamically on the contextual information available. This way of thinking about the representational nature of conceptual knowledge may help reconcile contradictory evidence concerning the involvement of different brain areas during conceptual processing.
Results
Conceptual Processing Across Contexts
First, in line with previous studies, we predicted that, across naturalistic contexts, concrete and abstract concepts are processed in separable sets of brain regions. To test this prediction, we used data from 86 participants who watched one of 10 movies during fMRI (35). The preprocessed data were analyzed using amplitude-modulated deconvolution regression (36). This has the advantage of not assuming an ideal shape for the hemodynamic response function, which is instead derived for each individual participant.
Specifically, we estimated impulse response functions for concrete and abstract words in the movies, matched for number, frequency, and length, over a 20 s window at one-second time steps starting at word onset. Amplitude modulation at these 20 timepoints was estimated using independent ratings of the concreteness and abstractness of each word (excluding function words) (37) as well as three nuisance variables: for each word, average luminance (controlling for correlated “low-level” visual information), loudness (controlling for “low-level” auditory information), and duration (controlling for “low-level” lexical processing). The deconvolution model itself included regressors for concrete and abstract words, all remaining words in the movie, and all “other” TRs (i.e., those which did not include any speech). The latter two were included to regress out unwanted activity from low-level lexical and audio-visual processing. The resulting 20 concrete and 20 abstract modulation betas were then used in a group-level linear mixed-effects model (38), which included a random slope for participant as well as control variables for age, gender, and movie. Subsequent post-hoc general linear tests were conducted between abstract and concrete betas at each time-point and corrected for multiple comparisons using cluster-size thresholding at an alpha level of 0.01.
Collapsing over all time points, the general linear tests revealed where concrete word modulation was greater than abstract word modulation (Figure 1, red). In the frontal lobes, this included greater activity in the right posterior inferior frontal gyrus (IFG) and the precentral sulcus and gyrus. In the temporal lobes, greater modulation occurred in the bilateral transverse temporal gyrus and sulcus, planum polare and temporale, and superior temporal gyrus (STG) and sulcus (STS). In the parietal and occipital lobes, greater concrete-modulated activity was found in large bilateral swaths of the precuneus, occipital cortices (running into the inferior temporal lobe), and the ventral visual stream. Finally, subcortically, the dorsal and posterior medial cerebellum were more active bilaterally for concrete word modulation.
Neurobiology of conceptual processing across contexts. Colored regions show group-level results from a linear mixed-effects model and subsequent general linear tests contrasting activity for concrete (red) versus abstract (blue) word modulation at each of 20 timepoints after word onset. Overlapping regions (yellow) indicate a concrete and abstract difference at one of these timepoints. Results are thresholded and corrected for multiple comparisons at alpha = 0.01 and displayed with a cluster size ≥ 20 voxels.
Conversely, activation for abstract was greater than for concrete word modulation in a grossly different set of regions (Figure 1, blue). In the frontal lobes, this included the bilateral anterior cingulate and lateral and medial aspects of the superior frontal gyrus. In the left frontal lobe, further abstract-modulated activity was found in lateral and medial prefrontal cortices; in the right, in the orbital gyrus, posterior IFG, and precentral sulcus and gyrus. In the parietal lobes, activity was greater in the angular gyri (AG) and inferior parietal lobules, including the postcentral gyrus. In the temporal lobes, activity was restricted to the STS. Subcortically, activity was bilaterally greater in the anterior thalamus and nucleus accumbens, and greater in the left amygdala, for abstract word modulation. Finally, there was overlap in activity between concrete and abstract word modulation (Figure 1, yellow). In the frontal, parietal, and temporal lobes, this was primarily in the left IFG, AG, and STG, respectively. In the occipital lobe, processing overlapped primarily around the calcarine sulcus. The fact that we found overlapping activation despite contrasting the conditions suggests that concrete and abstract words are processed in these areas at different time scales.
Broadly, these results suggest that concrete words engage sensory and motor regions more, whereas abstract words engage regions more associated with semantic as well as internal (e.g., emotional and interoceptive) processing. Both categories overlap in regions typically associated with word processing. However, these interpretations are based on informal reverse inference. To evaluate this distinction between concrete and abstract words more formally and quantitatively, we employed meta-analytic description and reverse correlation analyses. Both test whether brain regions involved in concrete and abstract conceptual processing reflect different types of habitual experience (i.e., sensory-motor vs internal).
Meta-analytic Descriptions
We extracted the coordinates of the centers of mass of each cluster from each condition in Figure 1. We then input these into the Neurosynth automated neuroimaging meta-analysis package, a database of more than 14,000 studies (accessed July 2022) (39). This returns a term-based meta-analytic coactivation map and the associated high-frequency terms at that voxel for that map. Only terms with a z-score greater than 3.12 (p < 0.001) were extracted. We then compared counts of terms associated with concrete and abstract activation clusters with a Kruskal-Wallis test.
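As an illustration, the term-count comparison can be sketched in a few lines of Python; the term sets below are illustrative stand-ins for the scraped Neurosynth output, not the actual analysis code.

```python
from scipy.stats import kruskal

# One set of significant Neurosynth terms per cluster (illustrative contents).
concrete_terms = [{"movement", "motor"}, {"movement", "visual"}, {"movement"}]
abstract_terms = [{"valence", "pain"}, {"valence", "reward"}, {"autobiographical memory"}]

def presence(term, cluster_term_sets):
    # Binary indicator per cluster: 1 if the term was associated with it, else 0.
    return [int(term in s) for s in cluster_term_sets]

# Is "movement" associated with more concrete than abstract clusters?
h, p = kruskal(presence("movement", concrete_terms),
               presence("movement", abstract_terms))
print(f"Movement: H = {h:.2f}, p = {p:.3f}")
```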
Results demonstrate that significantly more concrete clusters are related to the term “Movement” compared to abstract clusters (H(2) = 12.4, p < 0.001; Figure 2 and Table 1, red). In contrast, abstract clusters are more related to terms that are arguably associated with internal processing compared to concrete activation clusters, i.e., “Autobiographical Memory”, “Nausea”, “Pain”, “Reward/Motivation”, and “Valence” (all ps < 0.05; Figure 2 and Table 1). Finally, “Language” was the only term more associated with overlapping clusters than both concrete (H(2) = 7, p < 0.001) and abstract clusters (H(2) = 4, p = 0.045; Figure 2 and Table 1). For meta-analytic associations of each individual cluster, see Supplementary Material, Section A (Table S1).
Meta-analytic description of conceptual processing across contexts. We used the Neurosynth meta-analysis package to find the terms associated with the centers of mass for each concrete (red), abstract (blue), and overlap (yellow) cluster from Figure 1. Numbers refer to the number of activation clusters associated with each meta-analytic term. There were significantly more concrete than abstract clusters for the term “Movement” (p < .001), whereas there were more abstract compared to concrete clusters for “Autobiographical Memory”, “Nausea”, “Pain”, “Theory of Mind”, and “Valence” (all ps < 0.05). The term “Language” was significantly more associated with overlap clusters compared to concrete (p < 0.001) and abstract clusters (p = 0.045).
Peak and Valley Analysis
The meta-analytic approach suggests that brain activity correlated with concrete and abstract modulations relates to separable experiential domains that roughly map onto a sensory-motor vs internal distinction, respectively. We attempted to further validate this result using the specific stimulus information presented to participants. We did so with a variant of reverse correlation in neuroimaging (40), called the “peak and valley” analysis (41). The latter works by assuming that, if a brain region is involved in encoding a specific stimulus feature, that feature will be more likely to occur when the brain response is rising and, therefore, processing information (i.e., at peaks) compared to when the response is falling and, therefore, inhibiting or not processing information (i.e., at valleys). Thus, we predicted that concrete clusters would be more likely to encode sensory-motor properties of words at peaks compared to valleys, whereas abstract clusters would encode internal properties.
To accomplish this, we first extracted each participant’s timeseries from the voxels of each concrete and abstract activation cluster and averaged them. Next, we calculated whether each timepoint was rising (labeled a peak) or falling (labeled a valley; Figure 4, (1)). This timeseries was delayed by 6 seconds to align with the words mentioned in the subtitles (i.e., to correspond roughly to the peak of a canonical hemodynamic response function; see Supplementary Material, Section B for analyses with 4 s (Figure S1) and 5 s (Figure S2) lags). We then extracted the words in the movie that occurred during these peaks and valleys for each cluster (Figure 4, (2)).
These words were then converted into a 12-dimensional experience-based representation using the Lancaster sensorimotor norms (42) and the norms for valence and arousal (43)(Figure 4, (3)). For each of the 12 dimensions, we created two categorical arrays for each participant, one for peaks and one for valleys, inputting 0 if the word mentioned at a peak or valley was not highly rated and 1 if it was highly rated on a dimension (i.e., larger than one standard deviation from the dimension’s mean rating). We then concatenated all the peakand valley-arrays of all participants and conducted a Kruskal Wallis test between the concatenated arrays to determine whether a given experiential dimension occurred significantly more at peaks than valleys for each cluster (Figure 4, (4)). Finally, we compared the number of concrete and abstract activation clusters with significantly more peaks than valleys for each dimension using another Kruskal-Wallis test (figure 3).
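The logic of the peak-and-valley procedure for one cluster can be sketched as follows; the timeseries, word onsets, and rating flags are synthetic stand-ins, and the real analysis concatenated arrays across participants before testing.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Stand-ins: a cluster-averaged BOLD timeseries sampled at 1 Hz, word onsets
# from the subtitles, and a flag per word marking whether it is rated > 1 SD
# above the mean on one dimension (e.g., "Haptic").
bold = rng.standard_normal(400).cumsum()
words = [(f"word{i}", 3.0 * i + 2.0) for i in range(100)]
high_on_dim = {w: rng.random() > 0.5 for w, _ in words}

LAG = 6                            # seconds; approximate HRF peak
rising = np.diff(bold) > 0         # True where the signal is rising (peak)

at_peaks, at_valleys = [], []
for word, onset in words:
    t = int(round(onset)) + LAG    # BOLD sample reflecting this word
    if t < len(rising):
        (at_peaks if rising[t] else at_valleys).append(int(high_on_dim[word]))

# Does the dimension occur more often at peaks than at valleys?
h, p = kruskal(at_peaks, at_valleys)
print(f"H = {h:.2f}, p = {p:.3f}")
```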
Peak and valley analysis results for understanding conceptual processing across contexts. We extract the type of information processed in each activation cluster by looking at experience-based features of movie words that are aligned with significantly more peaks than valleys (see Figure 3). Words highly rated on the sensory-motor dimensions “Haptic”, “Hand-Arm”, and “Torso” were significantly more associated with concrete clusters (red, all ps < 0.05), “Valence” with abstract clusters (blue, p < 0.001) and “Mouth” with overlap clusters (yellow, ps < 0.05).
Overview of the peak and valley analysis method. First, we average the fMRI timeseries for each participant, for each abstract, concrete, and overlap cluster of activity from Figure 1. Then we label peaks and valleys in these (1) and map them onto word on- and off-set times (2). Finally, we estimate sensorimotor as well as valence and arousal representations for each word (3) and determine which dimensions are associated with significantly more peaks than valleys across participants in each cluster using a Kruskal-Wallis test (4).
For concrete clusters, we expected significantly more sensory-motor dimensions (i.e., “Foot-Leg”, “Hand-Arm”, “Haptic”, “Visual”, and “Torso” in the corpora) to be associated with peaks rather than valleys in the timeseries compared to abstract clusters. Conversely, we expected significantly more experiential dimensions related to internal processing (i.e., “Interoception”, “Valence”, and “Arousal” in the corpora) to be associated with peaks compared to valleys for abstract relative to concrete clusters. It was not clear to us whether the remaining dimensions (“Auditory”, “Head”, “Mouth”, and “Gustatory”) were more related to internal or sensory-motor processing; we therefore had no predictions for them. Comparing dimensions for abstract vs concrete clusters, we found significantly more concrete than abstract clusters associated with the dimension “Torso” (H(2)=7, p<0.001). Three concrete clusters were associated with “Haptic” and “Mouth”, which was also significantly more than for abstract clusters (all tests H(2)=5.2, all p=0.02). Two concrete clusters were associated with “Foot-Leg” compared to zero abstract clusters; this difference was not significant, but the trend was in the expected direction (H(2)=3.4, p=0.06). Conversely, eight abstract clusters were associated with the dimension “Valence”, significantly more than for concrete clusters (H(2)=8.3, p < .001). Three abstract clusters were associated with the dimension “Auditory”, which was not significantly more than for concrete clusters (H(2)=1.9, p=0.17). Finally, five overlap clusters were associated with the dimension “Mouth”, significantly more than the two concrete clusters (H(2)=5.1, p=0.03) and zero abstract clusters (H(2)=7.8, p < .001). For results of this analysis for each individual cluster, see Supplementary Material (Table S2).
Conceptual Processing in Context
The previous analyses suggest that, when activation is averaged across contexts, concrete and abstract words are processed in separable brain regions encoding different types of experiential information. Here, we test if these response profiles dynamically change depending on the visual object context.
Specifically, we predicted that abstract concepts become more concrete-like when situated and that concrete concepts become more abstract-like when displaced. First, we derived a measure of how contextually embedded a given abstract or concrete word in a movie is within its visual environment. This was done by first extracting labels for objects visually present in a 2 s (or 60 frame) window before each word, using two pretrained visual recognition models: Faster R-CNN (44) and OmniSource (45) (Figure 5, (1)). Next, we averaged GloVe vector embeddings (46) for all labels within this window. These vectors represent the meaning of each label based on global word-word co-occurrence statistics. We then calculated the cosine similarity between the averaged object-label embedding and the GloVe vector embedding of the movie word following the 2 s window (Figure 5, (2)).
Method for estimating contextual embeddedness for each concrete and abstract word to model context-dependent modulation of conceptual encoding. We use visual recognition models to automatically extract labels for objects visually present in the scene (60 frames, 2 seconds) before a given word was mentioned in the movie (1). We then correlate an average GloVe vector embedding of all these labels with the GloVe vector embedding of that word to estimate how closely related the labels of objects in the scene are to the word (2). Displayed are four randomly extracted measures of situated abstract (blue frame) and concrete (red frame) words (3), together with the objects that were visually present in the scene.
Next, we defined situated and displaced contextual embeddedness. Concrete and abstract words were considered highly situated if the cosine similarity was greater than one standard deviation above the median (Theta > 0.6) and displaced if it was more than one standard deviation below the median (Theta < 0.4; for examples, see Figure 5, (3)). We then applied the same amplitude modulation approach as in the previous deconvolution analysis to this subset of situated and displaced concrete and abstract words. The difference is that, rather than two conditions, we now have four: situated and displaced concrete modulators and situated and displaced abstract modulators. Again, a group-level linear mixed-effects model was fit, subsequent general linear tests were conducted between situated and displaced contexts for concrete and abstract modulation, and all results were corrected for multiple comparisons at alpha = 0.01.
The resulting interaction between modulation-type (abstract/concrete) and contextual embeddedness (situated/displaced) revealed activity in regions that partially overlap with both the concrete and abstract activation clusters from the analysis of conceptual processing across contexts (compare Figure 1 to Section C, Figure S3 in the Supplementary Material). These regions include the middle and superior frontal gyri, cingulate, precuneus, and AG. The unthresholded interaction map was “decoded” using the Neurosynth package, revealing that it is most correlated with meta-analyses of the default mode network: “DMN” (r(1500) = 0.194, p < 0.001), “Default Mode” (r(1500) = 0.219, p < 0.001), and “Default” (r(1500) = 0.226, p < 0.001).
To better understand the nature of this interaction, we contrasted situated vs displaced embeddedness with modulation-type held constant (Figure 6). Displaced concrete words (Figure 6, concrete, blue) were processed in the frontal lobes bilaterally in the anterior cingulate, and in the temporal lobes bilaterally in posterior parts of the inferior temporal sulcus as well as in the STS and STG. In the parietal lobes, there was more activation in the postcentral gyrus bilaterally. Finally, subcortical activation was found bilaterally in the anterior thalamus. Situated concrete words (Figure 6, concrete, red) were processed in the frontal lobes in the left superior and inferior frontal sulci as well as bilaterally in the central and precentral sulci. In the parietal lobes, there was more activation in the postcentral gyrus, and more activity was found in medial aspects of the occipital lobe. Displaced abstract words (Figure 6, abstract, blue) elicited more activity in the frontal lobes in the superior frontal gyrus bilaterally as well as in the IFG. In the temporal lobes, more activity was found in the posterior STS bilaterally and in anterior aspects of the STS in the right hemisphere. In the parietal lobes, more activity was found in the precuneus bilaterally. Subcortically, activation was found in the thalamus and amygdala bilaterally. Finally, situated abstract words (Figure 6, abstract, red) elicited more bilateral activity in the posterior parietal cortex, medial aspects of the occipital lobe, and the fusiform gyrus.
GLT contrasts between situated (red) and displaced (blue) context conditions for concrete and abstract conceptual processing. All p-values corrected for multiple comparisons at alpha = 0.01 and displayed with a cluster size ≥ 20 voxels.
To further understand the nature of the modulation-type by contextual embeddedness interaction, we also contrasted concrete vs abstract modulation with contextual embeddedness held constant. In the displaced condition, a small activation cluster was found for concrete words in the occipital pole (Figure 7, displaced, red). In the situated condition, differential activation for abstract words (Figure 7, situated, blue) was found bilaterally in superior and medial prefrontal regions, in the left STG and IFG, and in the right AG and precuneus. In contrast, concrete words yielded more activity in the left inferior supramarginal gyrus as well as superior aspects of the occipital lobe (Figure 7, situated, red).
GLT contrasts between concrete (red) and abstract (blue) conceptual processing in displaced and situated contexts. No differential activation was found for low context that survived correction for multiple comparisons. All p-values corrected for multiple comparisons at alpha = 0.01 and displayed with a cluster size ≥ 20 voxels.
How do the activation profiles of concrete and abstract concepts in situated and displaced conditions relate to activation profiles across contexts? We used cosine similarity to spatially correlate the unthresholded displaced and situated context results with the unthresholded results obtained in the first analysis (across contexts) for both concrete and abstract modulations. This shows that, in displaced context conditions, the spatial pattern of activity for concrete words more closely resembles the across-context pattern of abstract words (r(72964) = 0.49, p < 0.001) than that of concrete words (r(72964) = 0.12, p < 0.001). Thresholded activation maps for displaced concrete words (alpha = 0.01) overlap bilaterally with abstract words (across contexts) in the ACC, medial prefrontal regions, and anterior thalamus, as well as the left hemisphere IFG and AG (Figure 8, top). Conversely, in the situated context condition, the organization of the brain during processing of abstract words more closely resembles the across-context pattern of concrete words (r(72964) = 0.62, p < 0.001) than that of abstract words (r(72964) = 0.25, p = 0.476). Thresholded activation maps for situated abstract words overlap with across-context concrete words in the occipital lobe, particularly in occipitotemporal cortex as well as the fusiform gyrus, associated with the ventral visual stream. Other regions include the transverse temporal gyrus, STG, and anterior regions of the temporal lobe, the posterior insula, and the precentral gyrus and sulcus in the frontal lobe (Figure 8, bottom).
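A minimal sketch of this spatial-similarity computation, assuming the unthresholded group maps have been saved as NIfTI files (the file names below are illustrative):

```python
import numpy as np
import nibabel as nib

def spatial_cosine(path_a, path_b):
    """Cosine similarity between two unthresholded statistical maps."""
    a = nib.load(path_a).get_fdata().ravel()
    b = nib.load(path_b).get_fdata().ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

r = spatial_cosine("concrete_displaced_unthresholded.nii.gz",
                   "abstract_across_contexts_unthresholded.nii.gz")
print(f"spatial cosine similarity = {r:.2f}")
```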
Understanding the effects of conceptual processing in context compared to across contexts. Spatial overlap between thresholded statistical brain images of concrete and abstract conceptual processing obtained from the original contextually averaged model and from the model with contextually embedded regressors. Results show overlap of displaced concrete concepts with contextually averaged abstract concepts (left) in the anterior thalamus and prefrontal regions. Conversely, brain activity for situated abstract concepts overlaps with contextually averaged concrete concepts in the ventral visual stream, secondary auditory, as well as motor regions (right). All maps were thresholded at alpha = 0.01 with a cluster size ≥ 5 voxels.
These overlap profiles suggest that displaced concrete words are processed in a set of regions previously attributed to abstract word processing, which our earlier analyses associated with internal as well as general linguistic processing. Conversely, situated abstract words are processed in a set of regions previously attributed to concrete word processing, which were primarily involved in processing sensory-motor information.
Discussion
Conceptual processing is typically investigated in experiments where words are stripped away from their naturally occurring context: most studies use isolated words, and sometimes sentences (see Table 1 in (23)). However, conceptual processing in its natural ecology occurs in rich multimodal contexts. Our study investigated naturalistic conceptual processing during movie-watching to begin to understand the effect of multimodal context on the neurobiological organization of real-world conceptual representation.
Conceptual Processing Across Contexts
First, we asked where in the brain concrete and abstract concepts are generally processed as well as what type of information they encode. Given the hypothesis that conceptual representation reflects contextual information, we expected a set of regions corresponding to the most typical set of experiences (e.g., as encountered during word learning in development) to activate across different contexts. Specifically, we expected concrete conceptual encoding to activate regions more involved in sensory and motor processing and abstract conceptual encoding to activate regions associated with more internal (emotional and interoceptive) as well as general linguistic processing (5).
Indeed, we found a central tendency for concrete and abstract words to activate regions associated with different experiences (Figure 1). Specifically, concrete words were associated with multiple regions involved in sensory and motor processing (47), including most of the dorsal and ventral visual systems (48) and the right frontal motor system (49). In contrast, abstract words engaged regions typically associated with internal processing, like interoception (anterior thalamus, somatosensory cortex) (2, 50), autobiographical memory (anterior medial prefrontal regions) (51), and emotional processing and regulation (anterior medial prefrontal regions, orbital prefrontal cortex, dorsolateral prefrontal cortex, nucleus accumbens, and amygdala) (7). In line with this, both the meta-analytic descriptions and the peak and valley analyses showed that concrete regions were more highly associated with sensory-motor properties (e.g., “Movement” and “Hand-Arm”) whereas abstract regions were associated with internal properties (e.g., “Valence”; Figures 2–4). Collectively, these results provide evidence that concrete and abstract concepts encode different types of experiences on average.
However, the regions involved in processing concrete and abstract concepts across contexts did not imply a fully dichotomous encoding of experiences. First, we found that regions involved in sensory (mostly in visual cortices) and motor processing are involved in processing both types of words (Figure 1). Moreover, we found overlapping activation in regions associated with language processing in general (52; Figure 1). Such results are in line with proposals in which both concrete and abstract representations rely on experiential information as well as their linguistic relationships with other words (e.g., (53–55)).
Conceptual Processing in Context
Though results across contexts presumably represent a form of experiential central tendency, the behavioral and neuroimaging literature suggests that conceptual representations might not be stable and may vary as a function of context (56–58). For this reason, we conducted a second set of analyses with the goal of understanding the extent to which the representations associated with concrete and abstract conceptual processing in the brain change as a function of context (1).
We find that brain activation underlying concrete and abstract conceptual processing fundamentally changes as a function of visual context. Displaced concrete words engage more regions related to internal and general linguistic processing than when they are situated in context (Figure 6, top). Conversely, situated abstract words engage more sensory-motor regions than when they are displaced or averaged (Figure 6, bottom). Consequently, the concrete/abstract distinction is neurobiologically less stable than commonly assumed. Brain regions “switch alliance” during concrete or abstract word processing depending on context.
To confirm this last point, we compared the activation profiles of concrete and abstract concepts in displaced and situated visual object context with brain images obtained when collapsing across contexts. Our results show that, indeed, concrete concepts become more abstract-like in displaced contexts with less relevant visual information (Figure 8, top). Conversely, abstract concepts become more concrete-like when they are highly situated (Figure 8, bottom). We propose that this is because, when a concrete concept is processed in displaced context, its encoding will relate more to internal variables and linguistic associations, which are usually encoded by abstract concepts. Conversely, an abstract concept processed in situated visual context relates more to external situational information, which is usually encoded by concrete concepts.
Contextual Modulation in the Brain
What is the neurobiological mechanism behind contextual modulation of conceptual representation in the brain? Our results indicate that variance in visual context interacted with word-type (both concrete and abstract) in regions commonly defined as the default mode network (DMN), as well as a set of prefrontal regions associated with semantic control (Supplementary Material, Section C, Figure S3) (59, 60).
Recent literature on the role of the DMN suggests that these regions activate memories (61–65), possibly to form contextually relevant situation models (66, 67) in order to guide behavior in response to the external environment (68). A related view, focusing less on the role of external information, suggests that multimodal experiential information is encoded across the DMN and accessed during concept retrieval in general (69–71). These views are compatible with our results contrasting brain activity during displaced and situated conceptual processing (see Figures 6 and 7).
Overall, conceptual information involves more processing in displaced context conditions, likely because displaced contexts are naturally more challenging and afford more retrieval of conceptual knowledge. Semantic control refers to the ability to select and manipulate conceptual information on the basis of contextual demands and is associated with a set of frontal regions, including the left hemisphere IFG as well as the posterior middle temporal gyrus (pMTG) (72, 73). We found that these brain regions activated more in displaced contexts for abstract concepts (Figure 6), but not for concrete concepts. Displaced contexts provide few or no cues to a given concept’s meaning and therefore should engage semantic control to a greater extent. However, concrete concepts tend to be processed faster and more easily across the board (74) and may therefore engage semantic control regions less (75).
Implications
Our results imply that conceptual processing in the brain is dynamically distributed and contextually determined. This is consistent with previously proposed predictive brain models in which spreading activation within conceptual networks activates other representations to predict the acoustic information arriving in auditory cortices and thereby constrain linguistic ambiguity (41, 76, 77). These more distributed regions are typically averaged away when indiscriminately analyzed together and following thresholding because (i) they are more variable, given that they are associated with many different experiences, and (ii) there are individual differences in those experiences (78). These suppositions are supported by the fact that averages across different concept-types and contexts show overlap with the more typical perisylvian regions (see Supplementary Material, Section D, Figure S6).
In addition to implications for the nature of the neurobiological organization underlying conceptual representation, we want to highlight a new challenge for computational models of natural language processing (NLP) suggested by our results: namely, understanding and modeling the dynamic influences of multimodal contextual information on representation. The recent innovation of transformer-based artificial neural nets in NLP has pushed performance towards almost-human level (79, 80), and advances in this field promise to have a profound impact on daily human life. The key factor differentiating transformer-based models is that they predict upcoming words dynamically depending on linguistic context. Even though they seem to share this computational principle with the human brain (81, 82), brain-model comparisons have been limited to unimodal models only.
Yet, models trained on multimodal datasets show a promising match to human brain activity during naturalistic language processing (83). In combination with the present set of results, this suggests that including multimodal context effects (e.g. visual and discourse context) could help push models towards more human-like representations and efficiency.
Conclusion
It is clear from our study that contextual information modulates the neurobiological basis of conceptual representation in a profound way. Given this result, future work should analyze specific linguistic categories and processes in multimodal context. Furthermore, research should begin to evaluate and quantify different types of context measures (for example, object-related, action-related, emotion-related, etc.) and examine how these affect conceptual representation in the brain. Such work might not only help us understand the neurobiology of conceptual representation but also aid in the development of better artificial models. Beyond commercial applications (such as artificial assistants like “Siri” or “Alexa”), such developments might bear important implications for clinical domains, e.g., in progress towards helping patients who have lost the ability to speak by real-time decoding of non-invasive brain recordings (84).
Materials and Methods
The present study analyzed the “Naturalistic Neuroimaging Database” (NNDb) (85).
Participants and Tasks
The Naturalistic Neuroimaging Database ((85), https://openneuro.org/datasets/ds002837/versions/1.0.6) includes 86 right-handed participants (42 females, range of age 18–58 years, M = 26.81, SD = 10.09 years) undergoing fMRI while watching one of 10 full-length movies selected across a range of genres. All had unimpaired hearing and (corrected) vision. None had any contraindication for magnetic resonance imaging (MRI), history of psychiatric or neurological disorder, or language-related learning disabilities. All participants gave informed consent, and the study was approved by the University College London Ethics Committee.
Word Annotation
Words were annotated in the movies using automated approaches, with a machine learning based speech-to-text transcription tool from Amazon Web Services (AWS; https://aws.amazon.com/transcribe/). The resulting transcripts contained on- and offset timings for individual words. However, as not all words were transcribed, or transcribed accurately, timings were subsequently hand-corrected.
Concreteness
We used measures of abstractness and concreteness for each word in the movies (excluding function words), taken from existing norms (37). In these norms, 40,000 words were rated on a scale from 0 (not experience based) to 5 (experience based) by 4,000 participants. We only included content words from the movies (concrete or abstract, based on a median split on the concreteness scale), matched for frequency and length to avoid confounds from these lexical variables. After this matching process, we were left with 880 words (half concrete, half abstract) per movie on average. The mean concreteness rating was 1.83 for abstract words and 3.22 for concrete words. The mean log frequency and mean length were 4.81 and 5.27 for abstract words, compared to 5.21 and 4.91 for concrete words. T-tests revealed no significant differences between the two groups on these lexical variables.
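A toy sketch of the median split and matching check; the column names and values are illustrative stand-ins for the actual norms.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Illustrative norms table (not the real data).
norms = pd.DataFrame({
    "word":         ["tiger", "chair", "cat", "good", "truth", "love"],
    "concreteness": [4.6,     4.4,     4.8,   1.5,    1.9,     2.0],
    "log_freq":     [4.9,     5.0,     5.4,   5.3,    5.2,     5.5],
    "length":       [5,       5,       3,     4,      5,       4],
})

median_c = norms["concreteness"].median()
concrete = norms[norms["concreteness"] > median_c]
abstract = norms[norms["concreteness"] <= median_c]

# Verify the two word groups are matched on the lexical variables
for col in ["log_freq", "length"]:
    t, p = ttest_ind(concrete[col], abstract[col])
    print(f"{col}: t = {t:.2f}, p = {p:.2f}")
```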
Control Variables
Possible confounds in our naturalistic movie task included nonspecific visual and auditory activation from the movies. Therefore, our model included measures of average luminance (a low-level visual variable) and volume (a low-level auditory variable) for each word as nuisance regressors. Luminance and loudness were measured for each frame and averaged across the full duration of each word using the “Librosa” package for music and audio analysis in Python (86). These low-level auditory and visual amplitude modulators are highly correlated with other potentially confounding auditory and visual variables. For example, luminance correlates significantly with stimulus contrast (87), and volume correlates with pitch (88) as well as prosody (89) and speaking rate (90). Therefore, these variables should regress out much unwanted auditory and visual activation.
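A hedged sketch of how such per-word nuisance measures could be computed. The paper cites Librosa for these measures; the OpenCV-based luminance calculation here is our assumption for illustration, and the file paths and word timings are stand-ins.

```python
import numpy as np
import librosa
import cv2

y, sr = librosa.load("movie_audio.wav", sr=None)  # illustrative path

def loudness(onset, offset):
    """Mean RMS energy of the audio spanning the word."""
    seg = y[int(onset * sr):int(offset * sr)]
    return float(librosa.feature.rms(y=seg).mean())

def luminance(onset, offset, path="movie.mp4", fps=25.0):
    """Mean grayscale intensity of the frames spanning the word."""
    cap = cv2.VideoCapture(path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, int(onset * fps))
    vals = []
    for _ in range(max(1, int((offset - onset) * fps))):
        ok, frame = cap.read()
        if not ok:
            break
        vals.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean())
    cap.release()
    return float(np.mean(vals))

print(loudness(12.40, 12.85), luminance(12.40, 12.85))
```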
Conceptual Processing Across Context
In this analysis, we aimed to test the prediction that, when collapsing across contexts, the neurobiological organization of conceptual processing reflects experiential information broadly in line with previous studies: sensory and motor information for concrete concepts, and internal as well as more general linguistic information for abstract concepts. All statistical analyses on the preprocessed NIfTI files were carried out in AFNI (91, 92). Individual AFNI programs are indicated parenthetically or in italics in subsequent descriptions.
Deconvolution Analysis
We used an amplitude-modulated deconvolution regression to estimate brain activity. In contrast to a standard convolved regression analysis, this type of model does not assume a canonical hemodynamic response function. Instead, the response function is estimated over a time window from stimulus onset using a multiple basis function for each participant. This yields a better account of shape differences between individual hemodynamic response functions and achieves higher statistical power at both the individual and group level (36). We chose a 20 s time window because this should be sufficient to capture the hemodynamic response function for the average word length. We selected “CSPLIN” over the Tent basis function to deconvolve the BOLD signal because the former offers more interpolation between time points, which might result in a more precise estimate of the individual response function (but is computationally more costly). The final amplitude-modulated betas of this analysis represent the BOLD response beyond the average response across words. Since we included three nuisance amplitude modulators in the model and estimated the BOLD response at each timepoint over a 20 s window for both concrete and abstract words, this resulted in (20 * 3 * 2) 120 betas overall. All activity reported relates to the 20 betas associated with concrete and abstract conceptual processing beyond the amplitude of the nuisance modulators as well as the mean amplitude of the original signal (within each movie). This is an appropriately conservative methodology for naturalistic stimuli (https://afni.nimh.nih.gov/pub/dist/edu/class-lectures).
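For concreteness, a sketch of how per-word onsets and modulator values might be written out as an AFNI "married" timing file for such a model. The time*amp1,amp2,... syntax is our reading of the -stim_times_AM2 convention and should be verified against the AFNI documentation; the onsets and modulator values are illustrative.

```python
# One (onset_s, concreteness, luminance, loudness, duration_s) tuple per word;
# values are illustrative, not taken from the study.
events = [
    (12.40, 4.6, 0.31, 0.12, 0.45),
    (33.02, 4.1, 0.27, 0.09, 0.38),
    (51.77, 4.8, 0.35, 0.15, 0.52),
]

# Write one row per run in AFNI's "married" timing format (assumed syntax).
with open("concrete_married.1D", "w") as f:
    row = " ".join(f"{t:.2f}*{c},{lum},{loud},{d}"
                   for t, c, lum, loud, d in events)
    f.write(row + "\n")
```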
Group Analysis
Group analysis was performed on these 20 betas with linear mixed-effects modeling (LMM) using “3dLME”. The within-subject factors were concreteness, with two levels (concrete and abstract), and time, with twenty levels (one for each timepoint). We included an interaction term between these factors. The model also included a centered variable for participant age, as well as gender (2 levels) and movie ID (10 levels). A random intercept for participant and an autocorrelation structure for modeling the effect estimates of the multiple basis function were also included.
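The structure of this group model can be illustrated with a synthetic data frame in statsmodels. This is an illustrative re-expression only: the actual analysis used AFNI's 3dLME, and the autocorrelation structure over the basis-function estimates is omitted here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "beta":         rng.standard_normal(n),
    "concreteness": rng.choice(["concrete", "abstract"], n),
    "time":         rng.integers(0, 20, n),
    "age":          rng.integers(18, 58, n).astype(float),
    "gender":       rng.choice(["f", "m"], n),
    "movie":        rng.choice([f"movie{i}" for i in range(10)], n),
    "participant":  rng.choice([f"sub{i:02d}" for i in range(86)], n),
})
df["age"] -= df["age"].mean()  # centered age, as in the paper

# Concreteness-by-time interaction plus covariates; random intercept per participant.
model = smf.mixedlm("beta ~ concreteness * time + age + gender + movie",
                    data=df, groups=df["participant"])
print(model.fit().summary())
```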
Correction for Multiple Comparisons
To correct for multiple comparisons in our LMM, we used a multi-threshold approach rather than choosing an arbitrary p-value at the individual voxel level. In particular, we used a cluster simulation method to estimate the probability of noise-only clusters, using the spatial autocorrelation function from the residuals of each LMM (“3dFWHMx” and “3dClustSim”). This yielded the cluster sizes needed to achieve a corrected alpha value of 0.01 at nine different p-values (i.e., 0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005, 0.0002, and 0.0001). We thresholded each map at the corresponding z-value and cluster size for each of these nine p-values. We then combined the resulting maps, leaving each voxel with its original z-value.
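The combination step can be sketched as follows; the z-map and the per-threshold minimum cluster sizes (in practice taken from 3dClustSim) are illustrative.

```python
import numpy as np
from scipy.ndimage import label
from scipy.stats import norm

z = np.random.default_rng(2).standard_normal((40, 48, 40))  # stand-in group z-map
p_levels     = [0.05, 0.02, 0.01, 0.005, 0.002, 0.001, 0.0005, 0.0002, 0.0001]
min_clusters = [220,  160,  120,  90,    70,    55,    45,     38,     30]  # illustrative

combined = np.zeros_like(z)
for p, k in zip(p_levels, min_clusters):
    mask = np.abs(z) >= norm.isf(p / 2)   # two-sided voxelwise threshold
    labels, _ = label(mask)
    sizes = np.bincount(labels.ravel())   # voxels per cluster (index 0 = background)
    big = np.nonzero(sizes >= k)[0]
    big = big[big != 0]                   # drop the background label
    keep = np.isin(labels, big)
    combined[keep] = z[keep]              # retain original z-values

print(int((combined != 0).sum()), "voxels survive correction")
```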
Estimating Regional Clusters of Activation
We used “3dMerge” with the “-1clust” option to form clusters with a 1 mm connection distance and a minimum size of 20 voxels, clipping off all other voxels. We determined the center of mass for each of these clusters using “3dCM”. The resulting spatial activation map is displayed in Figure 1.
Meta-Analytic Descriptions
The resulting coordinates of the center of mass of each cluster were input into Neurosynth (https://neurosynth.org/), an online tool which includes activation maps of 14,371 neuroscientific studies. Neurosynth automatically mines all words in the titles and abstracts of these articles and performs a two-way ANOVA, testing for a non-zero association between terms used in abstracts or titles and reported activation that overlaps with the input location. We scraped all terms (excluding those related to specific brain regions) with z-scores above 3 (p < 0.001) and repeated this procedure for each cluster to determine terms functionally associated with concrete and abstract activation clusters (see Figure 2). We then tested whether any terms were more important for concrete or abstract clusters using a Kruskal-Wallis test (not correcting for multiple comparisons).
Peak and Valley Analysis
We extracted the averaged timeseries across voxels for each activation cluster using 3dMerge. Next, we determined peaks and valleys by calculating the discrete difference along the timeseries using the “Numpy” Python package (93) (Figure 3, (1)). The difference is positive if the next number is larger (rising timeseries, therefore a peak) and negative if the next number is smaller (falling timeseries, therefore a valley). Given that the canonical model of the hemodynamic response function peaks at around 6 s after stimulus onset, we extracted the words mentioned at each peak and valley in a given cluster’s timeseries with a 5- and 6-second lag (Figure 3, (2)). We then used the Lancaster sensorimotor norms (42) and norms for valence and arousal (43) to determine a 12-dimensional experience-based representation for each word (Figure 3, (3)), which included the dimensions “Auditory”, “Gustatory”, “Haptic”, “Interoception”, “Visual”, “Hand-Arm”, “Foot-Leg”, “Torso”, “Mouth”, “Head”, “Valence”, and “Arousal”.
For each of these dimensions, we created two categorical arrays, one for peaks and one for valleys, entering 0 if the word mentioned at a peak or valley was not highly rated on the dimension and 1 if it was. Extreme ratings were defined as deviating at least one standard deviation from the mean. Given the distributional nature of these data, we then conducted a Kruskal-Wallis test between the arrays to determine whether a given experiential dimension occurred significantly more at peaks than at valleys in the averaged timeseries of a cluster (Figure 3, (4)). We repeated this procedure for both the 5 s and 6 s timeseries lags and conducted a cosine-similarity test between the results (using the “Sklearn” package in Python (94)) to ensure that they were significantly similar.
Conceptual Processing in Context
We wanted to see how contextual information modulates brain activity during the processing of concrete and abstract concepts. We predicted that when abstract concepts are situated in a highly related context, they can engage neurobiological structures that are usually involved in processing concrete concepts. Conversely, when concrete concepts are displaced from their surrounding context, we predicted they would engage more abstract-like regions in the brain.
Estimating Contextual Embeddedness
To test these predictions, we estimated a measure of contextual embeddedness for each word of interest. To that end, we utilized two pre-trained visual recognition models, Faster R-CNN (95) and OmniSource (45), to extract object features using their respective computer vision toolboxes (47). For each prediction frame (about every four frames, i.e., 4 * 0.04 = 0.16 s), the object recognition model generated a list of detected object labels, and we kept those that had a prediction confidence of 90 percent or greater. Final object features were represented as the average of the vectorized object labels using GloVe (46), which represents the meaning of each label via global co-occurrence statistics.
We then estimated a representation of a 2 s (or 60 frame) context window (which corresponds to the average length of an utterance) for each word by listing all the labels of objects visually present in each frame within that window. Finally, we calculated the cosine similarity between the vector representation of each word and its context average, using the “Sklearn” package in Python (94), to estimate contextual embeddedness for each concept (a value between 0 and 1). We then modeled the amplitude modulation of the brain response to concrete and abstract concepts as a function of high and low contextual embeddedness (values of x > 0.6 and x < 0.4, respectively) using an amplitude-modulated deconvolution regression with an estimated response function.
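A minimal sketch of the embeddedness computation and the situated/displaced classification; the embeddings and detected labels below are random stand-ins for real GloVe vectors and detector output.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(3)
vocab = ["cat", "sofa", "window", "tiger"]
glove = {w: rng.standard_normal(300) for w in vocab}  # stand-in embeddings

scene_labels = ["cat", "sofa", "window"]  # detections in the 2 s window (conf >= 0.9)
word = "tiger"

context_vec = np.mean([glove[l] for l in scene_labels], axis=0)
sim = cosine_similarity(context_vec.reshape(1, -1),
                        glove[word].reshape(1, -1))[0, 0]

# Paper thresholds: situated if > 0.6, displaced if < 0.4 (median +/- 1 SD)
condition = "situated" if sim > 0.6 else "displaced" if sim < 0.4 else "excluded"
print(f"{word}: embeddedness = {sim:.2f} -> {condition}")
```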
Potential Confounds
There may be confounding visual information when estimating contextual embeddedness: high contextual situatedness may correlate positively with the number of objects present and therefore “naturally” engage the ventral visual stream for situated abstract concepts. To alleviate this concern, we determined that there were in fact more objects present in the abstract displaced condition (218,038 across movies) than in the abstract situated condition (77,211 across movies). A related critique is that, rather than the number of objects, certain types of objects could drive activation in the ventral visual stream for situated abstract concepts. To address this, we conducted a qualitative analysis of the visual context driving situatedness and displacement (Supplementary Material, Section C, Figures S4 and S5). This shows that no specific type of object plausibly driving any of the observed context effects occurs significantly more in the concrete or abstract displaced and situated conditions.
Deconvolution Analysis
As in the previous analysis, the response function was estimated over 20 s through the CSPLIN multiple basis function for each participant individually at word onset. This time, we only used the subset of matched abstract and concrete words that were situated or displaced according to our measure (268 concrete and abstract words on average per movie). The model included the nuisance regressors luminance, volume, and duration as amplitude modulators, together with our measure of contextual embeddedness, and estimated the BOLD response at each timepoint over a 20 s window for both concrete and abstract words in high and low context. This resulted in (20 * 3 * 2 * 2) 240 betas overall.
Group Analysis
Group analysis was performed on the deconvolved brain images with linear mixed effect modeling on the 20 betas associated with displaced and situated concrete and abstract conceptual processing respectively, beyond the amplitude of the nuisance modulators as well as the mean amplitude in the original signal (within each movie). The within-subject factors were concreteness with two levels (concrete and abstract), contextual embeddedness with two levels (situated and displaced) and time with twenty levels (one for each time-point). In addition to interactions between each of the factor pairs, the model included a centered variable for age of participant, their gender (2 levels) as well as movie ID (10 levels) as controls. A random intercept for participant and an autocorrelation structure for modeling the effect estimates of the multiple basis function were also included. Finally, we estimated the cosine similarity between the unthresholded activity maps from this analysis and the unthresholded maps obtained from Analysis (1) using the “Scipy” package in Python (96). Correction for multiple comparisons was conducted in the same way as for the previous model.
Supporting Information
Peak and Valley Analysis for a 4s lag. A Kruskal-Wallis test shows that the distribution between sensorimotor and interoceptive/emotional dimensions for concrete and abstract words is significantly different from that at a 5s (H(2)=4.8, p=0.03) and 6s (H(2)=5.3, p=0.02) lag.
Peak and Valley Analysis for a 5s lag. Internal dimensions Valence and Arousal are significantly more associated with peaks in abstract compared to concrete clusters (Valence: H(2) = 5.8, p = .02; Arousal: H(2) = 6.7, p = .01). Conversely, concrete clusters are more associated with sensorimotor dimensions (Hand-Arm, Foot-Leg, and Visual), though not significantly so. Overlap is significantly more associated with “Mouth” (H(2) = 6.2, p = .02).
Interaction between context- and word-type. All p-values corrected for multiple comparisons at alpha = 0.01.
Comparison between object counts for abstract and concrete words in situated context.
Comparison between object counts for abstract and concrete words in displaced context.
Comparison between the language network (outline) and overlap activity for both concrete and abstract words across context.
ACKNOWLEDGMENTS
VK would like to thank the creators of the NNDb for their hard work on this amazing database. He would also like to thank Bangjie Wang for help in using image recognition software, Dr Sarah Aliko for help with neuroimaging analysis, and Paulina Schulz for comments on earlier drafts. The work reported here was supported in part by a European Research Council Advanced Grant (ECOLANG, 743035) and a Royal Society Wolfson Research Merit Award (WRM R3 170016) to GV.
Footnotes
The authors have no competing interests to declare.