Words in context: tracking context-processing during language comprehension using computational language models and MEG

The meaning of a word depends on its lexical semantics and on the context in which it is embedded. At the basis of this lies the distinction between lexical retrieval and integration, two basic operations supporting language comprehension. In this paper, we investigate how lexical retrieval and integration are implemented in the brain by comparing MEG activity to word representations generated by computational language models. We test both non-contextualized embeddings, representing words independently of their context, and contextualized embeddings, which instead integrate contextual information in their representations. Using representational similarity analysis over cortical regions and over time, we observed that brain activity in the left anterior temporal pole and inferior frontal regions shows higher similarity with contextualized word embeddings compared to non-contextualized embeddings, between 300 and 500 ms after word presentation. On the other hand, non-contextualized word embeddings show higher similarity with brain activity in the left lateral and anterior temporal lobe at earlier latencies, areas and latencies related to lexical retrieval. Our results highlight how lexical retrieval and context integration can be tracked in the brain using word embeddings obtained with computational models. These results also suggest that the distinction between lexical retrieval and integration might be framed in terms of context-independent and contextualized representations.

Introduction

… these two basic operations.
In the present study, we use computational semantic models to investigate the neural basis of lexical retrieval and integration. We use two classes of computational semantic models, representing linguistic items either as independent of or dependent on their context of occurrence. In the computational linguistics literature, these models are usually referred to as non-contextualized and contextualized embeddings, respectively [8,9]. By comparing them to MEG data collected during sentence comprehension, we aim to gain more insight into the neural basis of these processes by showing that integration is approximated by contextualized embeddings and that it is a separate process (both functionally and physiologically) from semantic memory. The main methodological contribution of computational modeling to the study of language processing in the brain lies in the fact that it helps avoid the limitations of task-oriented studies by exploiting the richness of naturally occurring sentences and by relying on a fleshed-out model of the investigated processes. In other words, computational linguistic modeling provides a more direct implementation of the process and does not require the assumption that the process can be decomposed into orthogonal subprocesses that can be controlled by specific experimental tasks.

Integration as contextualization

The distinction between retrieval and integration can be grounded on the observation …

(1) In order to open a new account, you should go to a bank.

(2) A fisherman is sitting with his rod on the bank of the river Thames.

For instance, humans distinguish the meaning of bank as "building or financial institution" or as "the shore of a river" depending on whether it is encountered in the context of Sentence 1 or 2.
The presence of the string "a new account" in Sentence 1 steers the interpretation towards the financial domain, whereas the string "a fisherman" acts as a bias towards a river-related interpretation of the word bank.

A distinct yet similar disambiguation problem arises with tokens that are not lexically ambiguous, such as dog in Sentences 3 and 4. Although the basic meaning of the term is the same in the two sentences, the two tokens take on two distinct sentential roles: as the object of an action (4) or as the subject of a statement realized as a compound nominal predicate (3).

(3) The domestic dog is a member of the genus Canis, which forms part of the wolf-like canids.

(4) I took my dog out for a walk in the park.

It has been suggested that the human brain creates representations of words that differ according to such contextual cues [10]. … stimulus [15][16][17]. It is therefore crucial to show that the putative similarity between a model and a brain process concerns not only areas associated with that process, but also that it does so in a time frame that is compatible with the time course of language processing. For this reason, we use a magnetoencephalographic (MEG) dataset collected during sentence reading. MEG records brain activity at the level of milliseconds, and with a reasonable anatomical resolution, making it ideal for a study interested in the when, and not only the where, of a specific neural process [18].

Lexical processing in the brain

… cortex. An important role is also hypothesized for the anterior portions of the temporal lobe (anterior temporal pole, ATP). The involvement of the ATP is confirmed both by studies on semantic dementia [21,22] and by a large body of neuroimaging literature [23][24][25][26]. These findings have been summarized by Patterson & al. (2007) [27] and led to the formulation of the hub-and-spoke model, which posits that concepts are represented by a network of sensorimotor representations converging in the ATP, which acts as a hub collecting and controlling modality-specific features in order to produce supra-modal representations.

Integration is a process that operates on representations retrieved from semantic memory. In its most basic formulation, integration consists of merging two linguistic tokens (e.g., two words) into a larger unit, such as a phrase or, more simply, a bi-gram. Integration is an operation that takes a token and embeds it into the context represented, for instance, by the other tokens making up the sentence in which it is presented. Brain imaging and brain lesion studies suggest that the inferior frontal gyrus, in interaction with areas in the perisylvian and temporal cortex, plays an essential role in lexical integration [17,28].

Integration also involves anterior temporal areas. By contrasting the activity recorded during the reading of sentences and of word lists, works such as Mazoyer & al. (1993) [29], Stowe & al. (1998) [30], Friederici & al. (2000) [15], Humphries & al. (2006) [31], and Humphries & al. (2007) [32] reported an increase in activity in the ATP for the former condition as compared to the latter. The role of the ATP in processing integration is confirmed by another series of studies narrowing down the scope of the analysis. Rather than working with sentences as a whole, these analyses focused on simple phrasal processing, covering a wide range of phrasal and syntactic compositional types, across languages and modalities [33][34][35][36][37].

Timing of processes

Besides the cortical loci of processing, sentence processing is characterized by a specific temporal profile that describes the timing of each of its sub-processes [16,32]. The earlier stages mainly concern the recognition of the word from its auditory (for spoken words) or graphic (for written words) form, and involve primary auditory or visual areas between the onset of a word and 150-200 ms. The phases that interest our analysis are the so-called Phase 1 and Phase 2, as described by Friederici [16].

Phase 1 takes place after the word form has been identified, and can be broken … composing the sentence that is processed [38][39][40][41].

For the purpose of this study, we use two types of computational models developed for generating word representations: non-contextualized models and contextualized models. These models create so-called word embeddings, which consist of vectors of real numbers populating a high-dimensional space. In other words, a model M takes a word w and returns a real vector w representing w in a high-dimensional space S.

The first type of model generates representations w that are independent of the context (sentence, paragraph, etc.) in which the represented word w is located. We call this type of model non-contextualized; it is represented here by the popular word2vec model (Section 4) [8].
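The defining property of a non-contextualized model can be sketched as a fixed lookup table mapping each word type to a single vector. The toy 4-dimensional vectors and the `embed` helper below are invented for illustration and do not come from a trained word2vec model:

```python
import numpy as np

# Toy lookup table standing in for a trained word2vec model.
# The values are made up; a real model maps each word type to
# one dense vector (typically around 300 dimensions).
embeddings = {
    "bank": np.array([0.2, -0.5, 0.7, 0.1]),
    "dog":  np.array([0.9, 0.3, -0.2, 0.4]),
}

def embed(word):
    """Return the single, context-independent vector for a word type."""
    return embeddings[word]

# The same vector is returned regardless of the sentence:
v1 = embed("bank")  # "... you should go to a bank"
v2 = embed("bank")  # "... on the bank of the river Thames"
assert np.array_equal(v1, v2)
```

Because the lookup ignores the surrounding sentence, the two occurrences of bank in Sentences 1 and 2 receive identical representations.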

Besides non-contextualized models, we also consider a contextualized model: ELMo (Section 4) [9]. This model, contrary to word2vec, assigns representations w that … one for each of the two contexts in which it is found.

As shown in Figure 1, this becomes evident when we compute the similarity between the embeddings. The cosine similarity between the word2vec-generated dog … For the purpose of the present study, it is important to point out the role of context with regard to the way word2vec is trained and used to assign word embeddings.
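The comparison can be reproduced on toy vectors. The values below are invented for illustration; in the actual study the vectors come from word2vec and ELMo (Figure 1):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# word2vec assigns one vector per type, so the two tokens of "dog"
# (Sentences 3 and 4) are identical and their cosine similarity is 1.
dog_static = np.array([0.9, 0.3, -0.2, 0.4])
assert abs(cosine(dog_static, dog_static) - 1.0) < 1e-9

# A contextualized model assigns each token its own vector (toy values),
# so the similarity between the two tokens is below 1:
dog_in_3 = np.array([0.8, 0.5, -0.1, 0.3])   # "The domestic dog is a member..."
dog_in_4 = np.array([0.6, -0.1, 0.5, 0.7])   # "I took my dog out for a walk..."
sim = cosine(dog_in_3, dog_in_4)
assert sim < 1.0
```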

Context indeed plays a crucial role during the training of the model. In both cases, the context of a word is present in the pipeline, either as the input or as the target of the training function. Nonetheless, once the model is trained, its application is blind to the context and to the relations that the words have.

Contextualized word embeddings (ELMo)

The contextualized word embedding model ELMo [9] relies on the properties of …

Hypotheses

The hypothesis presented in this paper is that brain activity relative to lexical retrieval can be modelled by non-contextualized embeddings, whereas that of integration can be modelled by contextualized embeddings instead.

We used the MEG data belonging to the MOUS dataset [18]. … For details, we refer to the original paper and to Schoffelen & al. 2018 [18,45].
MEG data acquisition and pre-processing

The data were down-sampled to a sampling frequency of 300 Hz.
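The down-sampling step can be sketched as follows, assuming a hypothetical original sampling rate of 1200 Hz on a synthetic test signal (the actual acquisition parameters are given in the dataset paper [18,45]):

```python
import numpy as np
from scipy.signal import decimate

fs_orig = 1200                      # assumed original rate (hypothetical here)
t = np.arange(0, 1.0, 1.0 / fs_orig)
signal = np.sin(2 * np.pi * 10 * t)  # 10 Hz test sinusoid, 1 s long

# Decimate by a factor of 4 (1200 Hz -> 300 Hz); decimate() applies an
# anti-aliasing low-pass filter before keeping every 4th sample.
downsampled = decimate(signal, 4)
assert len(downsampled) == len(signal) // 4
```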

Source reconstruction was performed using a linearly constrained minimum variance (LCMV) beamformer [46], estimating a spatial filter at 8,196 locations of the subject-specific reconstructed midcortical surface. The dimensionality of the data was reduced by applying an atlas-based parcellation scheme based on a refined parcellation of the Conte69 atlas (191 parcels per hemisphere). After that, spatial filters were concatenated across the vertices comprising a parcel. The first two spatial components were selected for each parcel. For more details on this procedure we refer to Schoffelen & al. 2018 [45].

Representational Similarity Analysis (RSA)

RSA is conducted by computing the similarity score between the brain similarity matrices and the model similarity matrix. Here, the brain similarity matrix consists of the pairwise similarity scores among trials at anatomical region a and time window i. The similarity score is estimated by taking Pearson's correlation coefficient between the upper off-diagonal triangles of the [n × n] symmetric paired similarity matrices (ss_M and ss_Bai) (Fig 6). These scores quantify the extent to which the similarity across stimuli is similarly represented by the model M and by brain activity in anatomical region a at time i. These measures are repeated across times t and anatomical regions r (Fig 7).
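The RSA computation can be sketched as follows. The data here are random stand-ins for the embeddings and source-level activity, and Pearson correlation is used for the trial-pair similarities as well as for the final model-brain comparison; the study's exact choice of trial-level similarity measure may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_model_dims, n_brain_dims = 20, 50, 30

# Stand-ins for the real data: one row per trial (word token).
model_vectors = rng.standard_normal((n_trials, n_model_dims))  # embeddings
brain_vectors = rng.standard_normal((n_trials, n_brain_dims))  # activity at region a, window i

def similarity_matrix(X):
    """[n x n] matrix of pairwise Pearson correlations between trials (rows)."""
    return np.corrcoef(X)

def rsa_score(ss_model, ss_brain):
    """Pearson correlation between the upper off-diagonal triangles."""
    iu = np.triu_indices(ss_model.shape[0], k=1)
    return np.corrcoef(ss_model[iu], ss_brain[iu])[0, 1]

score = rsa_score(similarity_matrix(model_vectors),
                  similarity_matrix(brain_vectors))
assert -1.0 <= score <= 1.0
```

In the actual analysis, `score` would be computed once per model, anatomical region, and time window, yielding the model-brain similarity maps reported in the Results.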

Therefore, for each anatomical region, we obtained a representation of the similarity between model and brain activity as a function of time, from 0 to 500 ms after word … In our analysis, we computed t-statistics over subjects for each region/time combination independently. We computed a one-sided t-score for each of our computational models (W2V, ELMo). We also computed a one-sided t-score between …
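The group-level statistic for one region/time combination can be sketched as a one-sample t-test over subjects' RSA scores. The number of subjects and the scores below are hypothetical stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects = 24                      # hypothetical subject count

# RSA scores for one region/time combination, one value per subject:
scores = rng.normal(loc=0.02, scale=0.05, size=n_subjects)

# One-sample t-test against 0; halve the two-sided p-value to obtain
# the one-sided p for the alternative "scores > 0".
t_stat, p_two_sided = stats.ttest_1samp(scores, popmean=0.0)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
```

Repeating this over all region/time combinations produces the t-maps on which the significance thresholds are applied.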

The results of the analyses are split into two main parts. In the first part, we report the results of each of the two embedding models (word2vec and ELMo) separately. In the second part, the similarity scores of word2vec and ELMo are contrasted with each other.

Results are provided at the whole-brain level, displaying the model-brain similarities at 5 distinct time points: 150, 250, 350, 450, and 550 milliseconds after word onset.

In this section, we report the results of the RSA analyses of the two computational models separately.

The embedding model that does not include contextual information, word2vec (Fig 8), returns lower similarities overall when correlated with brain activity, and more so from about 300 ms post word onset. At earlier latencies, word2vec shows significant similarity with activity in the left middle and inferior temporal gyri.

Significant similarity with brain activity is also observed around 400 ms in the left posterior superior temporal gyrus. The contextualized model, ELMo (Fig 9), instead exhibits an overall significant similarity with brain activity between 300 and 500 ms in the left frontal, prefrontal, and left anterior temporal regions. In particular, ELMo shows significant similarity with the left inferior temporal gyrus and the left anterior temporal cortex around 400 ms. ELMo shows significantly higher similarity to brain activity as compared to …

In Section 4, we observed that contextualized and non-contextualized models yield qualitatively different results with regard to the timing and location of their similarity to MEG-recorded brain activity. In this section, we discuss the implications of these findings in light of the nature of the models and of the brain's processing of natural language, as introduced in Sections 4 and 4.

We believe that computational word embedding models help in probing the nature of the neural representations correlated with memory retrieval and with integration. This is because they make the distinction between these two phenomena more computationally specific. When discussing the nature of retrieval and integration, Baggio & Hagoort (2011) [7] argue that approaches based on formal semantics might not be realistic models of how the brain implements these two operations. In agreement with Seuren (2009) [47], they state that formal semantics disregards natural language as a psychological phenomenon. They continue by stating their desire to develop an account "that adheres to cognitive realism, in that it explains how language users derive meaning and how the human brain instantiates the neural architecture necessary to achieve this feat". We believe that distributional semantic models, of which contextualized embeddings are the most advanced version, have already proven their cognitive realism by being good models of human behavioral data (e.g., semantic similarity judgments [48]) and of neural data [49,50]. Moreover, at the dawn of the field, distributional models, e.g., Latent Semantic Analysis [51], were actually developed as cognitive models to answer questions on how children acquire word meaning and how humans react to semantic similarity and relatedness. In light of the above considerations, we think that the models presented in this study might offer a cognitively realistic approximation of what goes on in the brain during memory retrieval and integration.

In the remainder of this section, we will discuss the effect of contextualization on the similarities between computational representations and brain activity (Section 4). We will specifically focus on the implications of these findings regarding the role of the anterior temporal lobe (Section 4) and of activity peaking around 400 ms after stimulus onset (Section 4). We will also discuss the plausibility of the models chosen for the present study (Section 4).

The effect of contextualization on model-brain similarity

… with regard to brain activity can be reconciled with the division of labor predicated by models such as the MUC model [28], especially concerning the distinction between retrieval from memory and contextual integration.

The role of the anterior temporal lobe in integration

… in line with studies that link the role of the N400 to context processing [38,39].

The plausibility of bi-directional RNNs

ELMo is essentially a bi-directional recurrent neural network-based language model that integrates a word with its preceding and following context. Bi-directional recurrent language models seem to violate the assumption that human language processing proceeds left-to-right and word-by-word. Although this is trivially true for listening, it is worth noting that studies of reading behavior tend to describe a more nuanced … languages such as Arabic [55]. Moreover, a number of "jump-ahead" eye movements are also commonly observed, indicating that humans either skip information that is deemed irrelevant for the processing of a linguistic item or that they look ahead in order to … theories of the architecture of the language system adopted in this study.
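The bi-directional idea can be illustrated with a minimal numpy sketch, not of ELMo itself (which uses multi-layer LSTMs over character-convolution inputs and a learned weighting of layers), but of the general scheme: each token's representation concatenates a left-to-right and a right-to-left recurrent pass. All weights and inputs below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, d_hid = 8, 6
# Toy input: a sentence of 5 token embeddings (random stand-ins).
sentence = rng.standard_normal((5, d_emb))

def rnn_pass(tokens, W_x, W_h):
    """Simple tanh RNN; returns one hidden state per token."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in tokens:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

W_x_f, W_h_f = rng.standard_normal((d_hid, d_emb)), rng.standard_normal((d_hid, d_hid))
W_x_b, W_h_b = rng.standard_normal((d_hid, d_emb)), rng.standard_normal((d_hid, d_hid))

forward  = rnn_pass(sentence, W_x_f, W_h_f)              # left-to-right pass
backward = rnn_pass(sentence[::-1], W_x_b, W_h_b)[::-1]  # right-to-left, realigned

# Each token's contextual representation sees both its left and right context.
contextual = np.concatenate([forward, backward], axis=1)
assert contextual.shape == (5, 2 * d_hid)
```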

By highlighting a parallelism between models and brain activity, our results offer a contribution to the understanding of the division of labor at the cortical level between areas encoding lexical items in isolation and areas sensitive to the use of those items in context.