Abstract
Language transformers, like GPT-2, have demonstrated remarkable abilities to process text, and now constitute the backbone of deep translation, summarization and dialogue algorithms. However, whether these models actually understand language is highly controversial. Here, we show that the representations of GPT-2 not only map onto the brain responses to spoken stories, but also predict the extent to which subjects understand the narratives. To this end, we analyze 101 subjects recorded with functional Magnetic Resonance Imaging while listening to 70 min of short stories. We then fit a linear model to predict brain activity from GPT-2 activations, and correlate this mapping with subjects’ comprehension scores as assessed for each story. The results show that GPT-2’s brain predictions significantly correlate with semantic comprehension. These effects are bilaterally distributed in the language network and peak with a correlation above 30% in the infero-frontal and medio-temporal gyri as well as in the superior frontal cortex, the planum temporale and the precuneus. Overall, this study provides an empirical framework to probe and dissect semantic comprehension in brains and deep learning algorithms.
In less than two years, language transformers like GPT-2 have revolutionized the field of natural language processing (NLP). These deep learning architectures are typically trained on very large corpora to complete partially-masked texts, and provide a one-size-fits-all solution to translation, summarization, and question-answering tasks (1).
Critically, their hidden representations have been shown to – at least partially – correspond to those of the brain: single-sample fMRI (2–4), MEG (2, 4), and intracranial responses to spoken and written texts (3, 5) can be significantly predicted from a linear combination of the hidden vectors generated by these deep networks. Furthermore, the quality of these predictions directly depends on the models’ ability to complete text (3, 4).
In spite of these achievements, strong doubts persist about whether language transformers actually generate meaningful constructs (6). When asked to complete “I had $20 and gave $10 away. Now, I thus have $”, GPT-2 predicts “20”∗. Similar trivial errors can be observed for geographical locations, temporal ordering, pronoun attribution and causal reasoning. These results have thus led some to argue that “the system has no idea what it is talking about” (7).
Here, we propose to investigate semantic comprehension, not by assessing whether GPT-2’s predictions match human behavior, but by comparing its hidden representations to those of the brain. First, we compare GPT-2’s activations with the functional Magnetic Resonance Imaging of 101 subjects listening to 70 min of seven short stories. Second, we evaluate how these representations – shared between GPT-2 and the brain – vary with semantic comprehension, as individually assessed by a questionnaire at the end of each story.
GPT-2’s activations linearly map onto fMRI responses to spoken narratives
To assess whether GPT-2 generates representations similar to those of the brain, we first evaluate, for each voxel, subject and narrative independently, whether the fMRI responses can be predicted from a linear combination of GPT-2’s activations (Figure 1A). To mitigate the limited spatial resolution of fMRI and the need to correct each observation for multiple statistical comparisons, we here report either (i) the average scores across voxels or (ii) the average score within 314 regions of interest (following a subdivision of the Destrieux atlas (8), cf. SI.1). Consistent with previous findings (2, 4, 9), these brain scores are significant over a distributed and bilateral cortical network, and peak in the middle- and superior-temporal gyri and sulci, as well as in the supra-marginal and infero-frontal cortex (2, 4, 9) (Figure 1B).
By extracting GPT-2 activations from multiple processing stages, called “layers” (from layer one to layer twelve), we confirm that middle layers best align with the brain (Figure 1C), as reported in previous studies (2, 4, 9). The following analyses focus on the activations extracted from the ninth layer of GPT-2.
GPT-2’s brain predictions correlate with speech comprehension
Does the linear mapping between GPT-2 and the brain reflect a fortunate correspondence (4)? Or, on the contrary, does it reflect similar representations of high-level semantics? To address this issue, we correlate these brain scores with comprehension scores, assessed for each subject-story pair. On average across all voxels, this correlation reaches ℛ = 0.33 (p < 10−6, Figure 1D, as assessed with the Pearson p-value provided by SciPy). This correlation is significant across a wide variety of bilateral temporal, parietal and prefrontal areas typically linked to language processing (Figure 1E). Together, these results suggest that the representations shared between GPT-2 and the brain relate to semantic comprehension.
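This correlation analysis can be sketched as follows with SciPy. The data here are simulated for illustration only (the variable names and values are hypothetical, not the actual brain or questionnaire scores):

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-(subject, narrative) values: one brain mapping score and
# one comprehension score per pair (237 pairs in the paper).
rng = np.random.default_rng(0)
comprehension = rng.uniform(0, 1, size=237)  # questionnaire scores in [0, 1]
brain_scores = 0.05 * comprehension + rng.normal(0, 0.02, size=237)  # toy scores

# Correlate brain mapping scores with comprehension across the pairs,
# as done per voxel (or ROI) in the paper.
r, p = pearsonr(brain_scores, comprehension)
print(f"R = {r:.2f}, p = {p:.1e}")
```

In the paper, this correlation is computed separately for each voxel or region of interest, yielding one ℛ map per feature space.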
Low-level processing only partially accounts for the correlation between comprehension and GPT-2’s mapping
Is this correlation driven by a modulation of low-level processing by uncontrolled factors, such as subjects’ attention? To test this possibility, we evaluate the predictability of fMRI signals given low-level phonological features: the word rate, phoneme rate, phonemes, stress and tone of the narrative. The corresponding brain scores correlate with the subjects’ understanding (ℛ = 0.17, p < 10−2), but less so than the brain scores of GPT-2 (Δℛ = 0.16). These low-level correlations with comprehension peak in the left superior temporal cortex (Figure 1F). Overall, this result suggests that the link between comprehension and GPT-2’s brain mapping may be partially explained by, but not reduced to, a modulation of low-level processing.
High-level processing best explains the link between comprehension and GPT-2’s predictions
Is the correlation between comprehension and GPT-2’s mapping driven by a lexical effect as opposed to a high-level ability to meaningfully combine words? To tackle this issue, we extract the non-contextualized word embeddings from GPT-2, compute their brain mapping, and, again, evaluate the correlation between comprehension and these lexical mappings. On average across voxels, the resulting ℛ scores are higher than those of phonological features, and lower than those of GPT-2’s ninth layer (Δℛ = 0.03).
Comparisons between phonological, word-embedding and GPT-2 ninth layer’s ℛ scores are displayed in Figure 1F. Lexical effects (word-embedding versus phonological) peak in the superior-temporal lobe and in pars triangularis. In addition, higher-level effects (GPT-2 ninth layer versus word-embedding) peak in the superior-frontal, posterior superior-temporal gyrus, in the precuneus and in both the pars opercularis and pars triangularis – a network typically associated with high-level language comprehension (10).
Comprehension effects are mainly driven by individuals’ variability
The variability in comprehension scores could result from inter-stimulus variability – some stories may be harder to comprehend than others, for GPT-2 and/or for subjects. In addition, longer narratives could be easier to understand, and their brain mapping higher because of the larger amount of data. To control for these effects, we fit a linear mixed model to predict comprehension scores given brain scores, specifying the narrative as a random effect (cf. SI.1). The fixed effect of brain score (shared across narratives) is highly significant (β = 5.9, p < 10−6, cf. SI.1). However, the random effect (slope specific to each single narrative) is not (β = 1.4, p = .88). Overall, these analyses confirm that the link between GPT-2 and speech comprehension is mainly driven by subjects’ individual differences in their ability to make sense of the narratives.
Discussion
Our analyses reveal a positive correlation between semantic comprehension and the degree to which GPT-2’s activations map onto those of the brain. Thus, GPT-2’s activations – at least partially – relate to the semantic representations our brain generates when we process a spoken narrative.
Our fMRI results strengthen and complete prior work on the neural correlates of semantic comprehension. In particular, previous studies have used inter-subject brain correlation to reveal the brain regions associated with understanding (11). For example, Lerner et al. presented subjects with normal texts, or texts scrambled at the word, sentence or paragraph level, in order to parametrically manipulate their level of comprehension (10). The corresponding fMRI signals correlated across subjects in the primary and secondary auditory areas even when the input was scrambled below the lexical level. By contrast, they only correlated in the bilateral infero-frontal and temporo-parietal cortex when the scrambling was performed at the level of sentences or paragraphs. Our model-based results are remarkably consistent with this hierarchical organization. In sum, the present work illustrates that language processes can be directly investigated with uncontrolled stimuli, and may not necessitate artificial – and costly – experimental conditions (12).
The present results do not mean that GPT-2 fully understands what it reads. First, the correlation between speech comprehension and GPT-2’s ability to map onto the brain reaches, at best, 36% (in the STS), i.e. 13% of explained variance, indicating that the variation in comprehension remains mostly unaccounted for. Second, language comprehension is here operationally defined as subjects’ ability to correctly answer a set of questions after each text. This measure may thus be noisy because of memory and attentional processes. Third, the signal-to-noise ratio and temporal resolution of fMRI limit our ability to determine whether GPT-2’s representations of specific events, goals, and mental states match those of the brain.
Finally, the present study strengthens and clarifies the similarity between the brain and deep language models, repeatedly observed in the past three years (2–4, 9, 13). Together, these findings reinforce the relevance of deep language models for understanding the neural bases of narrative comprehension.
Materials and Methods
Our analyses rely on the “Narratives” dataset (14), composed of the brain signals, recorded using fMRI, of 345 subjects listening to 27 narratives.
Narratives and comprehension score
Among the 27 stories of the dataset, we selected the seven for which subjects were asked to answer a comprehension questionnaire at the end, and for which the answers varied across subjects (more than ten different comprehension scores across subjects), resulting in 70 minutes of audio stimuli in total, from four to 19 minutes per story (Figure 2). Questionnaires consisted of multiple-choice questions, fill-in-the-blank questions, or open questions (answered with free text) rated by humans (14). Here, we used the comprehension score computed in the original dataset, which was either the proportion of correct answers or the sum of the human ratings, scaled between 0 and 1 (14). It summarizes the comprehension of one subject for one narrative, and is thus specific to each (narrative, subject) pair.
Brain activations
The brain activations of the 101 subjects who listened to the selected narratives were recorded using fMRI, as described in (14). Following the recommendations of the original paper, noisy (subject, narrative) recordings were excluded, resulting in 237 pairs in total.
GPT-2 activations
GPT-2 (1) is a high-performing neural language model trained to predict a word given its previous context (it does not have access to succeeding words), given millions of examples (e.g. Wikipedia texts). It consists of multiple Transformer modules (twelve, each called a “layer”) stacked on a non-contextual word embedding (a look-up table that outputs a single vector per vocabulary word) (1). Each layer l can be seen as a nonlinear system that takes a sequence of w words as input, and outputs contextual vectors of dimension (w, d), called the “activations” of layer l (d = 768). Intermediate layers were shown to better encode syntactic and semantic information than input and output layers (15), and to better map onto brain activity (2, 4). Thus, for our analyses, we selected the activations of the ninth layer of GPT-2, elicited by the same narratives as the subjects.
In practice, the narratives’ transcripts were formatted (replacing special punctuation marks such as “–” and duplicated marks such as “?.” by dots), tokenized with the GPT-2 tokenizer, and input to the pretrained GPT-2 model provided by Huggingface‡. The activation vectors between successive fMRI measurements were summed to obtain one vector per measurement. To match the fMRI measurements and the GPT-2 vectors over time, we used the speech-to-text correspondences provided in the fMRI dataset (14).
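The summing of word-level activations between successive fMRI measurements could be sketched as follows. This is a minimal NumPy illustration under assumed inputs (the function name, TR value and toy data are hypothetical; the word onset times would come from the dataset’s speech-to-text alignment):

```python
import numpy as np

def pool_activations_per_tr(word_vectors, word_onsets, n_scans, tr=1.5):
    """Sum word-level activation vectors within each fMRI measurement window.

    word_vectors: (n_words, d) array of GPT-2 activations (one per word).
    word_onsets:  (n_words,) word onset times in seconds.
    n_scans:      number of fMRI volumes; tr: repetition time in seconds.
    """
    d = word_vectors.shape[1]
    pooled = np.zeros((n_scans, d))
    # Assign each word to the scan during which it was heard.
    scan_index = np.minimum((word_onsets // tr).astype(int), n_scans - 1)
    for i, vec in zip(scan_index, word_vectors):
        pooled[i] += vec  # sum all words falling between two TRs
    return pooled

# Toy example: 5 words of dimension 3, 2 scans with TR = 1.5 s.
vecs = np.ones((5, 3))
onsets = np.array([0.2, 0.8, 1.4, 1.6, 2.9])
X = pool_activations_per_tr(vecs, onsets, n_scans=2)
print(X[:, 0])  # → [3. 2.]: three words in the first scan, two in the second
```

The result is one activation vector per fMRI volume, which can then be entered into the linear mapping described below.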
Linear mapping between GPT-2 and the brain
For each (subject, narrative) pair, we measure the mapping between (i) the fMRI activations elicited by the narrative and (ii) the activations of GPT-2 (layer nine) elicited by the same narrative. To this end, a linear spatiotemporal model is fitted on a train set to predict the fMRI scans given the GPT-2 activations as input. Then, the mapping is evaluated by computing the Pearson correlation between predicted and actual fMRI scans on a held-out set:

ℳ(X) = ℒ(f ∘ g(X(w)), Y(s,w))

with f ∘ g the fitted estimator (g: temporal and f: spatial mappings), ℒ Pearson’s correlation, X(w) the activations of GPT-2 and Y(s,w) the fMRI scans of subject s, both elicited by the narrative w.
In practice, f is an ℓ2-penalized linear regression (following the scikit-learn implementation§) and g is a finite impulse response (FIR) model with five delays, where each delay sums the GPT-2 activations of the words presented between two successive TRs. For each subject, this procedure is repeated across five train (80% of the fMRI scans) and disjoint test (20% of the fMRI scans) folds, and the Pearson correlations are averaged across folds to obtain a single score per (subject, narrative) pair. This score, denoted ℳ(X) in Figure 1A, measures the mapping between the activation space X and the brain of one subject, elicited by one narrative.
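A minimal sketch of this mapping procedure on synthetic data is given below. The helper names, ridge penalty and dimensions are illustrative only, and details such as delay handling or regularization strength may differ from the actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def fir_expand(X, n_delays=5):
    """Stack delayed copies of X: [X(t), X(t-1), ..., X(t-4)] per scan."""
    delayed = [np.roll(X, d, axis=0) for d in range(n_delays)]
    for d, Xd in enumerate(delayed):
        Xd[:d] = 0  # zero-pad the start instead of wrapping around
    return np.hstack(delayed)

def brain_score(X, Y, n_delays=5, n_folds=5, alpha=1.0):
    """Average held-out Pearson correlation between predicted and actual fMRI."""
    Xd = fir_expand(X, n_delays)
    scores = []
    for train, test in KFold(n_splits=n_folds).split(Xd):
        model = Ridge(alpha=alpha).fit(Xd[train], Y[train])
        pred = model.predict(Xd[test])
        # Pearson correlation per voxel, averaged over voxels
        r = [np.corrcoef(pred[:, v], Y[test][:, v])[0, 1]
             for v in range(Y.shape[1])]
        scores.append(np.nanmean(r))
    return np.mean(scores)

# Toy data: 200 scans, 10 model dimensions, 4 voxels driven by the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
W = rng.normal(size=(10, 4))
Y = X @ W + 0.1 * rng.normal(size=(200, 4))
print(f"brain score: {brain_score(X, Y):.2f}")
```

The returned value plays the role of ℳ(X): one score per (subject, narrative) pair, high when the feature space linearly predicts the held-out fMRI signal.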
Significance
Significance was assessed either (i) with a second-level Wilcoxon test (two-sided) across subject-narrative pairs, testing whether the mapping (one value per pair) was significantly different from zero (Figure 1B), or (ii) with the first-level Pearson p-value provided by SciPy¶ (Figure 1D-G). In Figure 1B, E, F, p-values were corrected for multiple comparisons (2 × 157 ROIs) using the False Discovery Rate (Benjamini/Hochberg)‖.
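The Benjamini/Hochberg step-up procedure can be sketched in a few lines of NumPy (a minimal illustration; the exact library routine used for the correction is not detailed here):

```python
import numpy as np

def fdr_bh(pvals, q=0.05):
    """Benjamini/Hochberg FDR: return a boolean mask of rejected hypotheses."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = q * (np.arange(1, m + 1) / m)  # BH step-up thresholds q*i/m
    below = p[order] <= thresh
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])    # largest i with p_(i) <= q*i/m
        rejected[order[: k + 1]] = True
    return rejected

# Toy example over 2 x 157 = 314 ROI p-values: 10 strong effects, 304 nulls.
pvals = np.concatenate([np.full(10, 1e-4), np.random.uniform(0.2, 1, 304)])
print(fdr_bh(pvals).sum())  # → 10
```

Rejecting up to the largest index whose sorted p-value falls below its threshold controls the expected proportion of false discoveries at level q.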
Supporting Information (SI)
Brain parcellation
In Figure 1B, E, and F, we used a subdivision of the parcellation from the Destrieux atlas (8). Regions with more than 400 vertices were split into smaller regions, so that each region contains fewer than 400 vertices. The original parcellation consists of 75 regions per hemisphere; our custom parcellation consists of 157 regions per hemisphere.
In Figure 1G, we use the original parcellation for simplicity, and the following acronyms:
STG/STS (Superior temporal gyrus/sulcus)
MTG/MTS (Medial temporal gyrus/sulcus)
SFG/SFS (Superior frontal gyrus/sulcus)
IFG/IFS (Inferior frontal gyrus/sulcus)
Tri/Op (Pars triangularis/opercularis)
IPG/IPS (Inferior parietal gyrus/sulcus)
TTransverse (Temporal transverse sulcus)
PCG (Posterior cingulate gyrus)
PLF (Posterior lateral fissure)
Mixed-effect model
Not all subjects listened to the same stories. To check that the ℛ scores (correlation between comprehension and brain mapping) were not driven by the variability of narratives and questionnaires, a linear mixed-effect model was fit to predict the comprehension of a subject given their brain mapping score, specifying the narrative as a random effect. More precisely, if xi(w) ∈ ℝ corresponds to the mapping score of the ith subject that listened to the story w, and yi(w) ∈ ℝ refers to their comprehension score, we estimate the fixed-effect parameters β ∈ ℝ and η ∈ ℝ (shared across narratives), and the random-effect parameters βw ∈ ℝ and ηw ∈ ℝ (specific to the narrative w) such that:

yi(w) = (β + βw) xi(w) + (η + ηw) + εi

with ε a vector of i.i.d. normal errors with mean 0 and variance σ². In practice, we use the statsmodels∗∗ implementation of linear mixed-effect models. Significance of the coefficients was assessed with a t-test, as implemented in statsmodels.
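Such a mixed-effect fit can be sketched with statsmodels as follows, on a hypothetical (subject, narrative) table (the column names and simulated values are illustrative only, not the actual data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical table: one row per (subject, narrative) pair, with the
# narrative identifier, a brain mapping score, and a comprehension score.
rng = np.random.default_rng(0)
n = 237
df = pd.DataFrame({
    "narrative": rng.integers(0, 7, size=n).astype(str),  # 7 stories
    "brain_score": rng.normal(0.05, 0.02, size=n),
})
# Simulate comprehension with a fixed slope shared across narratives.
df["comprehension"] = 5.9 * df["brain_score"] + rng.normal(0.5, 0.1, size=n)

# Comprehension ~ brain score, with a random intercept and slope per narrative.
model = smf.mixedlm("comprehension ~ brain_score", df,
                    groups=df["narrative"], re_formula="~brain_score")
result = model.fit()
print(result.params["brain_score"])  # fixed-effect slope (beta)
```

A significant fixed-effect slope with a negligible random slope, as reported above, indicates that the link between brain mapping and comprehension holds across narratives rather than being driven by any single story.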
Footnotes
* As assessed using the Huggingface interface (https://github.com/huggingface/transformers) and the GPT-2 pretrained model with temperature=0.