Abstract
Recent studies show that linguistic representations predict the response of high-level visual cortex to images, suggesting an alignment between visual and linguistic information. Here, using iEEG, we tested the hypothesis that such alignment is limited to textual descriptions of an image's visual content and does not extend to abstract textual descriptions. We generated two types of textual descriptions for images of famous people and places: visual-text, describing the visual content of the image, and abstract-text, based on their Wikipedia definitions, and extracted their relational-structure representations from a large language model. We used these linguistic representations, along with visual representations of the images derived from a deep neural network, to predict the iEEG responses to images. Neural relational structures in high-level visual cortex were similarly predicted by visual-image and visual-text representations, but not by abstract-text representations. These results demonstrate that visual-language alignment in high-level visual cortex is limited to visually grounded language.
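The abstract does not specify the exact analysis pipeline, but the comparison it describes follows the general logic of representational similarity analysis: compute pairwise relational structures (dissimilarity matrices) from each candidate feature space and correlate them with the neural relational structure. The sketch below illustrates that logic under stated assumptions; all variable names, dimensions, and the random placeholder data are hypothetical, and correlation distance with a Spearman comparison is one common choice, not necessarily the authors' method.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def relational_structure(features):
    """Condensed pairwise dissimilarity (1 - Pearson r) across stimuli.
    features: (n_stimuli, n_dims) array, one row per stimulus."""
    return pdist(features, metric="correlation")

# Hypothetical feature matrices, one row per image stimulus:
#   llm_visual_text   - LLM embeddings of visual-content descriptions
#   llm_abstract_text - LLM embeddings of Wikipedia-based descriptions
#   dnn_image         - DNN embeddings of the images themselves
#   ieeg_resp         - iEEG response patterns (e.g., electrode features)
rng = np.random.default_rng(0)
n_stim, n_dim = 28, 512
llm_visual_text = rng.normal(size=(n_stim, n_dim))
llm_abstract_text = rng.normal(size=(n_stim, n_dim))
dnn_image = rng.normal(size=(n_stim, n_dim))
ieeg_resp = rng.normal(size=(n_stim, 64))

# Correlate each model's relational structure with the neural one.
neural_rdm = relational_structure(ieeg_resp)
for name, feats in [("visual-text", llm_visual_text),
                    ("abstract-text", llm_abstract_text),
                    ("visual-image (DNN)", dnn_image)]:
    rho, p = spearmanr(relational_structure(feats), neural_rdm)
    print(f"{name}: Spearman rho = {rho:.3f} (p = {p:.3g})")
```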
Competing Interest Statement
The authors have declared no competing interest.