RT Journal Article
SR Electronic
T1 Incorporating natural language into vision models improves prediction and understanding of higher visual cortex
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2022.09.27.508760
DO 10.1101/2022.09.27.508760
A1 Aria Y. Wang
A1 Kendrick Kay
A1 Thomas Naselaris
A1 Michael J. Tarr
A1 Leila Wehbe
YR 2022
UL http://biorxiv.org/content/early/2022/09/29/2022.09.27.508760.abstract
AB We hypothesize that high-level visual representations contain more than the representation of individual categories: they represent complex semantic information inherent in scenes that is most relevant for interaction with the world. Consequently, multimodal models such as Contrastive Language-Image Pre-training (CLIP), which construct image embeddings to best match embeddings of image captions, should better predict neural responses in visual cortex, since image captions typically contain the most semantically relevant information in an image for humans. We extracted image features using CLIP, which encodes visual concepts with supervision from natural language captions. We then used voxelwise encoding models based on CLIP features to predict brain responses to real-world images from the Natural Scenes Dataset. CLIP explains up to R^2 = 78% of variance in stimulus-evoked responses from individual voxels in the held-out test data. CLIP also explains greater unique variance in higher-level visual areas compared to models trained only with image/label pairs (ImageNet-trained ResNet) or text (BERT). Visualizations of model embeddings and Principal Component Analysis (PCA) reveal that, with the use of captions, CLIP captures both global and fine-grained semantic dimensions represented within visual cortex. Based on these novel results, we suggest that humans' understanding of their environment forms an important dimension of visual representation.
Competing Interest Statement: The authors have declared no competing interest.
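
The abstract describes a voxelwise encoding analysis: image features from CLIP are used to predict per-voxel responses, and prediction accuracy is scored as held-out R^2. The sketch below illustrates that general recipe only; it uses synthetic stand-in data and scikit-learn ridge regression, and the feature extraction, dataset handling, and regularization choices are assumptions rather than the authors' actual pipeline.

```python
# Minimal sketch of a voxelwise encoding analysis, assuming precomputed CLIP
# image features and per-voxel responses. Synthetic data stands in for the
# Natural Scenes Dataset; RidgeCV is an assumed choice of linear model.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for real data: CLIP embeddings (n_images x n_features) and
# stimulus-evoked responses for a set of voxels (n_images x n_voxels).
n_images, n_features, n_voxels = 2000, 512, 100
clip_features = rng.standard_normal((n_images, n_features))
true_weights = rng.standard_normal((n_features, n_voxels)) * 0.05
voxel_responses = clip_features @ true_weights + rng.standard_normal((n_images, n_voxels))

X_train, X_test, Y_train, Y_test = train_test_split(
    clip_features, voxel_responses, test_size=0.2, random_state=0
)

# One linear encoding model per voxel; RidgeCV selects the regularization strength.
model = RidgeCV(alphas=np.logspace(-2, 4, 7))
model.fit(X_train, Y_train)

# Held-out R^2 per voxel, analogous to the voxelwise prediction accuracy
# summarized in the abstract.
r2_per_voxel = r2_score(Y_test, model.predict(X_test), multioutput="raw_values")
print(f"median held-out R^2 across voxels: {np.median(r2_per_voxel):.3f}")
```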