Parallel processing in speech perception: Local and global representations of linguistic context

Speech processing is highly incremental. It is widely accepted that listeners continuously use the linguistic context to anticipate upcoming concepts, words and phonemes. However, previous evidence supports two seemingly contradictory models of how predictive cues are integrated with bottom-up evidence: Classic psycholinguistic paradigms suggest a two-stage model, in which acoustic input is represented fleetingly in a local, context-free manner, but quickly integrated with contextual constraints. This contrasts with the view that the brain constructs a single unified interpretation of the input, which fully integrates available information across representational hierarchies and predictively modulates even the earliest sensory representations. To distinguish these hypotheses, we tested magnetoencephalography responses to continuous narrative speech for signatures of unified and local predictive models. Results provide evidence for some aspects of both. Local context models, one based on sublexical phoneme sequences and one based on the phonemes in the current word alone, each uniquely predict part of the early neural responses; at the same time, even early responses to phonemes also reflect a unified model that incorporates sentence-level constraints to predict upcoming phonemes. Neural source localization places the anatomical origins of the different predictive models in non-identical parts of the superior temporal lobes bilaterally, although the more local models tend to be right-lateralized. These results suggest that speech processing recruits both local and unified predictive models in parallel, reconciling previous disparate findings. Parallel models might make the perceptual system more robust, facilitate processing of unexpected inputs, and serve a function in language acquisition.


Introduction

Acoustic events in continuous speech occur at a rapid pace, and listeners face pressure to process the speech signal rapidly and incrementally 1. One strategy that listeners employ to achieve this is to organize internal representations in such a way as to minimize the processing cost of future language input 2. This is reflected in a variety of measures that suggest that more predictable words are easier to process 3-5. For instance, spoken words are recognized more quickly when they are heard in a meaningful context 6, and words that are made more likely by the context are associated with reduced neural responses, compared to less expected words 7-11. This contextual facilitation occurs broadly and is sensitive to language statistics 12-14 as well as discourse level meaning 15,16.

Words are predictable because they occur in sequences that form meaningful messages. Similarly, phonemes are predictable because they occur in sequences that form words. For example, after hearing the beginning /ɹɪv/, /ɝ/ would be a likely continuation forming river; /i/ would be more surprising, because riviera is a less frequent word, whereas /ʊ/ would be highly surprising because there are no common English words starting with that sequence. Phonemes that are thus inconsistent with known word forms elicit a mismatch response 17, and responses to valid phonemes are proportionately larger the more surprising the phonemes are 18-20. Predictive processing is not restricted to linguistic representations, as even responses to acoustic features in early auditory cortex reflect expectations based on the acoustic context 21,22. Thus, there is little doubt that the brain uses context to facilitate processing of upcoming information, at multiple levels of representation.
Here we investigate a fundamental question about the underlying cognitive organization: Does the brain develop a single, unified representation of the input? In other words, one representation that is consistent across hierarchical levels, effectively propagating information from the sentence context across hierarchical levels to anticipate even low-level features of the sensory input such as phonemes? Or do cognitive subsystems differ in the extent and kind of context they use to interpret their input? This question has appeared in different forms, for example in early debates about whether sensory systems are modular 23 or whether sensory input and contextual constraints are combined immediately in speech perception 6,24. A similar distinction has also surfaced more recently between the local and global architectures of predictive coding 25.

A strong argument for a unified, globally consistent model comes from Bayesian frameworks, which suggest that, for optimal interpretation of imperfect sensory signals, listeners ought to use the maximum amount of information available to them to compute a prior expectation for upcoming sensory input 26,27. An implication is that speech processing is truly incremental, with a unified linguistic representation that is updated at the phoneme (or an even lower) time scale 5. Such a unified representation is consistent with empirical results suggesting that word recognition can bias subsequent phonetic representations 28, that listeners weight cues like a Bayes-optimal observer during speech perception 29,30, and that they immediately interpret incoming speech with regard to communicative goals 31,32. A recent implementation proposed for such a model is the global variant of hierarchical predictive coding, which assumes a cascade of generative models predicting sensory input from higher level expectations 25,33,34. However, a unified model is also assumed by classical interactive models of speech processing, which rely on cross-hierarchy interactions to generate a globally consistent interpretation of the input 35-37.

However, there is also evidence for incomplete use of context in speech perception. Results from cross-modal semantic priming suggest that, during perception of a word, initially multiple meanings are activated regardless of whether they are consistent with the sentence context or not, and contextually appropriate meanings only come to dominate at a later stage 38,39. Similarly, eye tracking suggests that lexical processing activates candidates that should be excluded by the syntactic context 40. Such findings can be interpreted as evidence for a two-stage model, in which an earlier retrieval process operates without taking into account the wider sentence context, and only a secondary process of selection determines the best fit with context 41.

Distinguishing among these possibilities requires a task that encourages naturalistic engagement with the context, and a non-intrusive measure of linguistic processing. To achieve this, we analyzed magnetoencephalography (MEG) responses to continuous narrative speech. Previous work using a similar paradigm has tested either only for a local or only for a unified context model, by either using only the current word up to the current phoneme as context 45,46 or by using predictions from a complete history of phonemes and words 47.
However, because these two context models include overlapping sets of constraints, their predictions are correlated and they need to be assessed jointly. Furthermore, some architectures predict that both kinds of context model should affect brain responses separately. For example, a two-stage architecture predicts an earlier stage of lexical processing that is sensitive to lexical statistics only, and a later stage that is sensitive to the global sentence context. Here we directly test such possibilities by comparing the ability of different context models to jointly predict brain responses.

Expressing the use of context through information theory

The sensitivity of speech processing to different definitions of context is formalized through conditional probability distributions (Figure 1). Each distribution reflects an interpretation of ongoing speech input, at a given level of representation. We here use word forms and phonemes as units of representation (Figure 1-A), but this is a matter of methodological convenience, and similar models could be formulated using a different granularity 5. Figure 1-B shows an architecture in which each level uses local information from that level, but information from higher levels does not affect beliefs at lower levels. In this architecture, phonemes are classified at the sublexical level based on the acoustic input and possibly a local phoneme history. The word level decodes the current word from the incoming phonemes, but without access to the multi-word context. Finally, the sentence level updates the sentence representation from the incoming word candidates, and thus selects those candidates that are consistent with the sentence context. In such a model, apparent top-down effects such as perceptual restoration of noisy input 48,49 are generated at higher level decision stages rather than at the initial perceptual representations 50. In contrast, Figure 1-C illustrates the hypothesis of a unified or global context model, in which priors at lower levels take advantage of information available at the higher levels. Here, the sentence context is used in decoding the current word by directly altering the prior over word candidates, and this sentence-appropriate prior is in turn used to alter expectations for upcoming phonemes.

Figure 1. (A) … a phoneme can be invoked as part of a sublexical phoneme sequence, ph_k, or as part of word_j, ph_{j,i}.
(B) Each box stands for a level of representation, characterized by its output and a probability distribution describing the level's use of context. For example, the sublexical level's output is an estimate of the current phoneme, ph_k, and the distribution for ph_k is estimated as a probability for different phonemes based on the sound input and a sublexical phoneme history. At the sentence level, sentence_{j,i} stands for a temporary representation of the sentence at time j,i. Boxes represent functional organization rather than specific brain regions. Arrows reflect the flow of information: each level of representation is updated incrementally, combining information from the same level at the previous time step (horizontal arrows) and the level below (bottom-up arrows).
(C) The unified architecture implements a unified, global context model through information flowing down the hierarchy, such that expectations at lower levels incorporate information accumulated at the sentence level.
Relevant differences from the local context model are in red. Note that while the arrows only cross one level at a time, the information is propagated in steps and eventually crosses all levels.

These hypotheses make different predictions for brain responses sensitive to language statistics. Probabilistic speech representations, as in Figure 1, are linked to brain activity through information theoretic complexity metrics 51. The most common linking variable is surprisal, which is equivalent to the difficulty incurred in updating an incremental representation of the input 4. A second information theoretic measure that has been found to independently predict brain activity is entropy 45,47, a measure of the uncertainty in a probability distribution. Because entropy is a function of a distribution, entropy differs depending on the unit of classification. This allows distinguishing between the entropy of recognizing the current partial word, and the entropy of predicting the next phoneme (see Methods for details). Entropy might relate to neuronal processes in at least two ways. First, the amount of uncertainty might reflect the amount of competition among different representations, which might play out through a neural process such as lateral inhibition 36. Second, uncertainty might also be associated with increased sensitivity to bottom-up input, because the input is expected to be more informative 52,53.

Models for responses to continuous speech

To test how context is used in continuous speech processing, we compared the ability of three different context models to predict MEG responses, corresponding to the three levels in Figure 1-B (see Figure 2). The context models all incrementally estimate a probability distribution at each phoneme position, but they differ in the amount and kind of context they incorporate. Throughout, we used n-gram models to estimate sequential dependencies because they are powerful language models that can capture effects of language statistics in a transparent manner, with minimal assumptions about the underlying cognitive architecture 4,5,54.

Sublexical context model: A 5-gram model estimates the prior probability for the next phoneme given the 4 preceding phonemes. This model reflects simple phoneme sequence statistics 42,43 and is unaware of word boundaries. Such a model is thought to play an important role in language acquisition 55-57, but it is unknown whether it has a functional role in adult speech processing. The sublexical model predicted brain responses via the phoneme surprisal and entropy linking variables.

Word context model: This model implements the cohort model of word perception 58, applied to each word in isolation. The first phoneme of the word generates a probability distribution over the lexicon, including all words starting with the given phoneme, with each word's probability proportional to the word's relative unigram frequency. Each subsequent phoneme trims this distribution by removing words that are inconsistent with that phoneme. Like the sublexical model, the lexical model can be used as a predictive model for upcoming phonemes, yielding phoneme surprisal and entropy variables. In addition, the lexical model generates a probability distribution over the lexicon, which yields a cohort entropy variable.
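To make the word context model concrete, here is a minimal sketch (ours, not from the paper) of the incremental cohort computation. The three-word ARPAbet lexicon and its frequency values are invented for illustration; the actual model used a full pronunciation lexicon with COCA-based priors:

```python
import math
from collections import defaultdict

# Toy lexicon: word -> (phoneme sequence, unigram frequency).
# Frequencies are invented for illustration; the paper uses COCA counts.
LEXICON = {
    "river":   (("R", "IH", "V", "ER"), 120.0),
    "riviera": (("R", "IH", "V", "IY", "EH", "R", "AH"), 2.0),
    "rivet":   (("R", "IH", "V", "AH", "T"), 5.0),
}

def cohort_state(heard):
    """Distribution over word candidates given word-initial phonemes heard so far."""
    cohort = {w: freq for w, (phones, freq) in LEXICON.items()
              if phones[:len(heard)] == tuple(heard)}
    total = sum(cohort.values())
    return {w: f / total for w, f in cohort.items()}

def entropy(dist):
    """Entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def next_phoneme_dist(word_dist, n_heard):
    """Marginal distribution over the next phoneme, implied by the cohort."""
    dist = defaultdict(float)
    for word, p in word_dist.items():
        phones = LEXICON[word][0]
        if len(phones) > n_heard:
            dist[phones[n_heard]] += p
    return dist

heard = ["R", "IH", "V"]                      # the word so far: /rIv/
words = cohort_state(heard)                   # cohort after /rIv/
phones = next_phoneme_dist(words, len(heard))
print("cohort entropy:", entropy(words))      # uncertainty about the word
print("phoneme entropy:", entropy(phones))    # uncertainty about next phoneme
print("surprisal of /ER/:", -math.log2(phones["ER"]))  # the 'river' continuation
```

The sentence context model differs from this sketch only in the prior: instead of relative unigram frequencies, each candidate word would start with a probability derived from a lexical 5-gram model given the preceding words.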
Sentence context model: The sentence model is closely related to the lexical model, but each word's prior probability is estimated from a lexical 5-gram model. While a 5-gram model misses longer-range linguistic dependencies, we use it here as a conservative initial approximation of high level linguistic and interpretive constraints 5. The sentence model implements cross-hierarchy predictions by using the sentence context in concert with the partial current word to predict upcoming phonemes. Brain activity is predicted from the same three variables as from the word context model.

We evaluated these different context models in terms of their ability to explain held-out MEG responses, and the latency of the brain responses associated with each model. An architecture based on local context models, as in Figure 1-B, predicts a temporal sequence of responses as information passes up the hierarchy, with earlier responses reflecting lower order context models. In contrast, a unified architecture, as in Figure 1-C, predicts that the sentence context model should exhaustively explain brain responses, because all representational levels use priors derived from the sentence context. Finally, architectures that entail multiple kinds of models predict that different context models might explain response components, possibly in different anatomical areas.

Figure 2. Models for predictive speech processing based on the sentence, lexical and sublexical context, used to predict MEG data
(A) Example of word-by-word surprisal. The sentence (5-gram) context generally leads to a reduction of word surprisal, but the magnitude of the reduction differs substantially between words.
(B) Sentence level predictions propagate to phoneme surprisal, but not in a linear fashion. For example, in the word happened, the phoneme surprisal based on all three models is relatively low for the second phoneme /æ/ due to the high likelihood of word candidates like have and had. However, the next phoneme is /p/ and phoneme surprisal is high across all three models. On the other hand, for words like find, on and Ohio, the sentence-constrained phoneme surprisal is disproportionately low for subsequent phonemes, reflecting successful combination of the sentence constraint with the first phoneme.
(C) Phoneme-by-phoneme estimates of information processing demands, based on different context models, were used to predict MEG responses through multivariate temporal response functions (mTRFs) 59. mTRFs were estimated jointly such that each predictor, convolved with the corresponding TRF, predicted a partial response, and the point-wise sum of partial responses constituted the predicted MEG response. See Methods for details.
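The forward model in (C) can be written compactly. The following numpy sketch is our illustration, ignoring negative lags and the estimation procedure; all variable names and values are hypothetical:

```python
import numpy as np

def predict_response(predictors, trfs):
    """Sum of per-predictor partial responses.

    predictors: dict name -> time series (n_times,), e.g., impulses scaled
        by surprisal at phoneme onsets.
    trfs: dict name -> TRF kernel over latencies (n_lags,).
    """
    n_times = len(next(iter(predictors.values())))
    y = np.zeros(n_times)
    for name, x in predictors.items():
        # Convolution: each impulse adds a scaled copy of the TRF kernel
        y += np.convolve(x, trfs[name])[:n_times]
    return y

# Hypothetical example at 100 Hz sampling: 80 phonemes in 10 s of speech
rng = np.random.default_rng(0)
surprisal = np.zeros(1000)
surprisal[rng.choice(1000, 80, replace=False)] = rng.gamma(2.0, 1.0, 80)
trf = np.exp(-np.arange(50) / 10.0)  # invented kernel, 0-490 ms latencies
predicted = predict_response({"surprisal": surprisal}, {"surprisal": trf})
```

Estimating the TRF kernels themselves is the inverse problem, addressed with the boosting algorithm described in the Methods.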

Results
Twelve participants listened to ~45 minutes of a nonfiction audiobook. Multivariate temporal response functions (mTRFs) were used to jointly predict held-out, source localized MEG responses (Figure 2-C). To test whether each context model is represented neurally, the predictive power of the full model including all predictors was compared with the predictive power of a model that was estimated without the predictor variables belonging to this specific context model.

Phoneme-, Word- and Sentence-constrained models co-exist in the brain

Each context model significantly improves the prediction of held-out data, even after controlling for acoustic features and the other two context models (Figure 3-A). Each of the three context models' source localization is consistent with sources in the superior temporal gyrus (STG), thought to support phonetic and phonological processing 60. In addition, the sentence constrained model also extends to more ventral parts of the temporal lobe, consistent with higher-level language processing 61,62. For comparison, the predictive power of the acoustic features is shown in Figure …

The predictive power of the local context models is inconsistent with the hypothesis of a single, unified context model (Figure 1-C). Instead, it suggests that different neural representations incorporate different kinds of context. We next pursued the question of how these different representations are organized hierarchically. While surprisal depends on the conditional probability of a discrete event and is agnostic to the underlying unit of representation 4,5, entropy depends on the units over which probabilities are calculated. Entropy can thus potentially distinguish between whether brain responses reflect uncertainty over the next phoneme, or uncertainty over the word currently being perceived. This distinction is particularly interesting for the sentence context model: if predictions are constrained to using context within a hierarchical level, as in Figure 1-B, then the sentence context should affect uncertainty about the upcoming word, but not uncertainty about the upcoming phoneme. On the other hand, a brain response related to sentence-conditional phoneme entropy would constitute evidence for cross-hierarchy predictions, with sentence level information predicting upcoming phonemes.

Even though phoneme and cohort entropy were highly correlated (sentence context: r = .92; word context: r = .90), each of the four representations was able to explain variability in the MEG responses that could not be attributed to any of the other representations (Figure 3-C; all t11 ≥ 2.49, p ≤ .030). This suggests that the sentence context model is not restricted to predicting upcoming words, but also generates expectations for upcoming phonemes. This is thus evidence for cross-hierarchy top-down information flow, indicative of a unified language model that aligns representations across hierarchical levels. Together, these results thus indicate that the brain does maintain a unified context model, but that it also maintains more local context models.

Different context models affect different neural processes

All three context models individually contribute to neural representations, but are these representations functionally separable?
While all three context models improve predictions in both hemispheres, the sentence constrained model does so symmetrically, whereas the lexical and sublexical models are both more powerful in the right hemisphere than in the left hemisphere (Figure 3-A). The sublexical context model is indeed significantly more right-lateralized than the sentence model (t11 = 4.33, p = .001; Figure 3-…) … distinction among the other models (possibly due to lower power, given the weaker effects in the left hemisphere for all but the sentence model). In sum, these results suggest that the different context models are maintained by at least partially separable neural processes.

Sentence context affects early responses and dominates late responses

The TRFs estimated for the full model quantify the influence of each predictor variable on brain responses over a range of latencies (Figure 2-C). Figure 4 shows the response magnitude to each predictor variable as a function of time, relative to phoneme onset. For an even comparison between predictors, TRFs were summed in the anatomical region in which any context model significantly improved predictions. Note that responses prior to 0 ms are plausible due to coarticulation, by which information about a phoneme's identity can already be present in the acoustic signal prior to the conventional phoneme onset 65,66. Figure 5 shows the anatomical distribution of responses related to the different levels of context.

Surprisal quantifies the incremental update to a context model due to new input. A brain response related to surprisal therefore indicates that the input is brought to bear on a neural representation that uses the corresponding context model. Consequently, the latencies of brain responses related to different context models are indicative of the underlying processing architecture. In an architecture in which information is sequentially passed to higher level representations with broadening context models (Figure 1-B), responses should form a temporal sequence from narrower to broader contexts. However, in contrast to this prediction, the observed responses to surprisal suggest that bottom-up information reaches representations using sentence- and word-level contexts simultaneously at an early response peak (Figure 4-A; sentence: 78 ms, SD = 24 ms; word: 76 ms, SD = 11 ms). Sublexical surprisal is associated with a lower response magnitude overall, but also exhibits an early peak at 94 ms (SD = 26 ms). This suggests a parallel processing architecture in which different context representations are activated simultaneously by new input. Later in the timecourse the responses dissociate more strongly, with a large, extended response reflecting the sentence context, but not the word context, starting at around 205 ms (tmax = 5.27, p = .007). The lateralization of the TRFs is consistent with the trend observed for predictive power: a symmetric response reflecting the unified sentence context, and more right-lateralized responses reflecting the more local contexts (Figure 4-B).

Brain responses related to entropy indicate that neural processes are sensitive to uncertainty or competition in the interpretation of the speech input. Like surprisal, such a response suggests that the information has reached a representation that has incorporated the corresponding context.
In addition, because entropy measures uncertainty regarding a categorization decision, the response to entropy can distinguish between different levels of categorization: uncertainty about the current word (cohort entropy) versus uncertainty about the next phoneme (phoneme entropy).

The TRFs to cohort entropy suggest a pattern similar to those to surprisal (Figure 4 C-D). …

In contrast to surprisal and cohort entropy, the responses to phoneme entropy are similar for all levels of context, dominated by an early and somewhat broader peak (Figure 4 E-F). There is still some indication of a second, later peak in the response to sentence-constrained phoneme entropy, but this might be due to the high correlation between cohort and phoneme entropy. A direct comparison of sentence-constrained cohort and phoneme entropy indicates that early processing is biased towards phoneme entropy (though not significantly) while later processing is biased towards cohort entropy (tmax = 4.74, p = .017 at 231 ms).

In sum, the entropy results suggest that all context representations drive a predictive model for upcoming phonemes. This is reflected in a short-lived response in STG, consistent with the fast rate of phonetic information. Simultaneously, the incoming information is used to constrain the cohort of word candidates matching the current input, with lexical activations primarily driven by a unified model that incorporates the sentence context.

Mid-latency, sentence-constrained processing engages larger parts of the temporal lobe

Source localization suggests that early activity originates from the vicinity of the auditory cortex in the upper STG, regardless of context (Figure 5). …

No evidence for a trade-off between contexts

We interpret our results as evidence that different context models are maintained in parallel. An alternative possibility is that there is some trade-off between contexts used, and it only appears in the averaged data as if all models were operating simultaneously. This alternative predicts a negative correlation between the context models, reflecting the trade-off in their activation. No evidence was found for such a trade-off, as correlations between context models were generally neutral or positive across subjects and across time (see Supplementary Figure 1).

Discussion
The present MEG data provide clear evidence for the existence of a neural representation of speech that is unified across representational hierarchies. This representation incrementally integrates phonetic input with information from the multi-word context within about 100 ms. However, in addition to this globally unified representation, brain responses also show evidence of separate neural representations that use more local contexts to process the same input.

Parallel representations of speech using different levels of context

The evidence for a unified global model suggests that there is a functional brain system that processes incoming phonemes while building a representation that incorporates constraints from the multi-word context. A possible architecture for such a system is the one shown in Figure 1-C, in which a probabilistic representation of the lexical cohort mediates between sentence and phoneme level representations: the sentence context modifies the prior expectation for each word, which is in turn used to make low-level predictions about the phonetic input. While there are different possible implementations for such a system, the key feature is that the global sentence context is used to make predictions for and interpret low-level phonetic, possibly even acoustic 68, input.

A second key result from this study, however, is evidence that this unified model is not the only representation of speech. Brain responses also exhibited evidence for two other, separate functional systems that process incoming phonemes while building representations that incorporate different, more constrained kinds of context: one based on a local word context, processing the current word with a prior based on context-independent lexical frequencies, and another based on the local phoneme sequence regardless of word boundaries. Each of these three functional systems generates its own predictions for upcoming phonemes, resulting in parallel responses to phoneme entropy. Each system is updated incrementally at the phoneme rate, reflected in early responses to surprisal. However, each system engages an at least partially different configuration of neural sources, as evidenced by the localization results.

Together, these results suggest that multiple predictive models process speech input in parallel. An architecture consistent with these observations is sketched in Figure 6: three different neural systems receive the speech input in parallel. Each representation is updated incrementally by arriving phonemes. However, the three systems differ in the extent and kind of context that they incorporate, each generating its own probabilistic beliefs about the current word and/or future phonemes. For instance, the sublexical model uses the local phoneme history to predict upcoming phonemes. The incrementality of the updates is reflected in the inputs to the sublexical model at time k+1, combining the state of the sublexical model at time k and the phoneme input from time k. The same incremental update pattern applies to the word and sentence models.

Figure 6. An architecture for speech perception with multiple parallel context models
A model of information flow, consistent with brain signals reported here. Brain responses associated with information theoretic variables provided separate evidence for each of the probability distributions in the colored boxes.
From left to right, the three different context models (sentence, lexical and sublexical) update incrementally as each phoneme arrives. The cost of these updates is reflected in the brain response related to surprisal. Representations also include probabilistic representations of words and upcoming phonemes, reflected in brain responses related to entropy.

A listener whose goal is comprehending a discourse-level message might be expected to rely primarily on the unified, sentence constrained context model. Consistent with this, there is some evidence that this model has a privileged status. Among the linguistic models, the unified model has the most explanatory power and clearly bilateral representation (Figure 3). In addition, while activity in local models was short-lived, the unified model was associated with extended activation for up to 600 ms and recruitment of more ventral regions of the temporal lobe (Figures 4 and 5). This suggests that the update in the unified model is normally more extensive than in the local models, and could indicate that the unified model most commonly drives semantic as well as form representations, while the short-lived local models might be restricted to form-based representations. …

The parallel model suggested in Figure 6 has a special theoretical appeal over the two-stage explanation: Bayesian accounts of perception suggest that listeners generate a prior, reflecting an estimate of future input, and compare this prior to the actual input to compute a posterior probability, or interpretation of the sensory percept. In architectures that allow different priors at sequential hierarchical levels (such as Figure 1-B), higher levels receive the posterior interpretation of the input from the lower levels, rather than the unbiased input itself. This is suboptimal when considering a Bayesian model of perception, because the prior of lower levels is allowed to distort the bottom-up evidence before it is compared to the prior generated by higher levels 73. …

There is broad agreement that language processing involves prediction, but the exact nature of these predictions is more controversial 74-78. Much of the debate is about whether humans can represent distributions over many likely items, or just predict specific items. Previous research showing an early influence of sentence context on speech processing 7-9,79 has typically relied on specifically designed, highly constraining contexts which are highly predictive of a specific lexical item. In such highly predictive contexts, listeners might indeed predict specific items, and such predictions might be linked to the left-lateralized speech production system 44,77. However, such a mechanism would be less useful in more representative language samples, in which highly predictable words are rare 69. In such situations of limited predictability, reading time data suggest that readers instead make graded predictions, over a large number of possible continuations 5,69. Alternatively, it has been suggested that what looks like graded predictions could actually be pre-activation of specific higher-level semantic and syntactic features shared among the likely items 69,77,80-82, without involving prediction of form-based representations.
The present results, showing brain responses reflecting sentence-constrained cohort and phoneme entropy, provide a new kind of evidence in favor of graded probabilistic predictions, involving predictive representations at least down to the phoneme level.

Bilateral pathways to speech comprehension

Our results suggest that lexical/phonetic processing is largely bilateral. This is consistent with extensive clinical evidence for bilateral receptive language ability 83,84,61, and suggestions that the right hemisphere might even play a distinct role in complex, real-world language processing 85,86. In healthy participants, functional lateralization of sentence processing has been studied using visual half-field presentation 87. Overwhelmingly, results from these studies suggest that lexical processing in both hemispheres is dominated by sentence meaning 87-90. This is consistent with the strong bilateral representation of the unified model of speech found here. As in the visual studies, the similarity of the response latencies in the two hemispheres implies that right-hemispheric effects are unlikely to be due to inter-hemispheric transfer from the left hemisphere (Figure 4). Nevertheless, response patterns are not identical between hemispheres. Hemispheric differences in visual half-field studies have been interpreted as indicating that the left hemisphere processes language in a maximally context-sensitive manner, whereas the right hemisphere is more biased towards a bottom-up interpretation of sensory input 44. Our results suggest a modification of this proposal, indicating that both hemispheres rely on sentence-based graded predictions, but that the right hemisphere additionally maintains stronger representations of local contexts. Finally, lateralization might also depend on task characteristics such as stimulus familiarity 45, and in highly constraining contexts the left hemisphere might engage the language production system to make specific predictions 44,77.

Conclusions
Prior research on the use of context during language processing has often focused on binary distinctions, such as asking whether context is or is not used to predict future input. Such questions assumed a single serial or cascaded processing stream. Here we show that this assumption might have been misleading, because different predictive models are maintained in parallel. Our results suggest that robust speech processing is based on probabilistic predictions using different context models in parallel and cutting across hierarchical levels of representations.

Acknowledgements

Work …

Methods

Participants

Twelve native speakers of English were recruited from the University of Maryland community (6 female, 6 male, age mean = 21 years, range 19-23). None reported any neurological or hearing impairment. According to self-report using the Edinburgh Handedness Inventory 91, 11 were right-handed and one left-handed. All subjects provided informed consent in accordance with the University of Maryland Institutional Review Board. Subjects either received course credit (n=4) or were paid for their participation (n=8).

Stimuli

Stimuli consisted of eleven excerpts from the audiobook version of The Botany of Desire by Michael Pollan 92. Each excerpt was between 210 and 332 seconds long, for a total of 46 minutes and 44 seconds. Excerpts were selected to create a coherent narrative and were presented in chronological order to maximize deep processing for meaning.

Procedure

During MEG data acquisition, participants lay in a supine position. They were allowed to keep their eyes open or closed to maximize comfort. Stimuli were delivered through foam pad earphones inserted into the ear canal at a comfortably loud listening level. After each segment, participants answered 2-3 questions relating to its content and had an opportunity to take a short break.

Data acquisition and preprocessing

Brain responses were recorded with a 157 axial gradiometer whole head MEG system (KIT, Kanazawa, Japan) inside a magnetically shielded room (Vacuumschmelze GmbH & Co. KG, Hanau, Germany) at the University of Maryland, College Park. Sensors (15.5 mm diameter) are uniformly distributed inside a liquid-He dewar, spaced ~25 mm apart, and configured as first-order axial gradiometers with 50 mm separation and sensitivity better than 5 fT/√Hz in the white noise region (> 1 kHz). Data were recorded with an online 200 Hz low-pass filter and a 60 Hz notch filter at a sampling rate of 1 kHz.

Recordings were pre-processed using mne-python 93. Flat channels were automatically detected and excluded. Extraneous artifacts were removed with temporal signal space separation 94. Data were filtered between 1 and 40 Hz with a zero-phase FIR filter (mne-python 0.20 default settings). Extended infomax independent component analysis 95 was then used to remove ocular and cardiac artifacts. Responses time-locked to the speech stimuli were extracted, low pass filtered at 20 Hz and resampled to 100 Hz.

Five marker coils attached to participants' heads served to localize the head position with respect to the MEG sensors. Head position was measured at the beginning and at the end of the recording session and the two measurements were averaged.
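A sketch of this preprocessing pipeline in mne-python might look as follows. This is our illustration, not the authors' script: the file name and the excluded ICA component indices are hypothetical, and the tSSS buffer length is assumed, as it is not specified in the text:

```python
import mne

# Hypothetical file name; one recording per subject
raw = mne.io.read_raw_kit("R0001_speech.sqd", preload=True)
# Temporal signal space separation (tSSS) for extraneous artifacts;
# st_duration enables the temporal variant (value assumed here)
raw = mne.preprocessing.maxwell_filter(raw, st_duration=10.0)
# 1-40 Hz zero-phase FIR band-pass (mne-python default filter settings)
raw.filter(1.0, 40.0)
# Extended infomax ICA; ocular/cardiac components identified by inspection
ica = mne.preprocessing.ICA(method="infomax", fit_params=dict(extended=True))
ica.fit(raw)
ica.exclude = [0, 1]  # hypothetical component indices
ica.apply(raw)
# Speech-locked responses are then low-passed at 20 Hz and resampled
raw.filter(None, 20.0)
raw.resample(100)
```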
The FreeSurfer 96 "fsaverage" template brain was coregistered to each participant's digitized head shape (Polhemus 3SPACE FASTRAK) using rotation, translation, and uniform scaling. A source space was generated using four-fold icosahedral subdivision of the white matter surface, with source dipoles oriented perpendicularly to the cortical surface. Regularized minimum ℓ2-norm current estimates 97,98 were computed for all data using an empty room noise covariance (λ = 1/6). The temporal response function analysis was restricted to brain areas of interest by excluding the occipital lobe, insula and midline structures based on the "aparc" FreeSurfer parcellation 99. Excluded areas are shaded gray in Figure 3. A preliminary analysis (see below) was restricted to the temporal lobe (superior, middle and inferior temporal gyri, Heschl's gyrus and superior temporal sulcus).

Predictor variables

Acoustic model

To control for brain responses to acoustic features, all models included an 8 band gammatone spectrogram and an 8 band acoustic onset spectrogram 100, both covering frequencies from 20 to 5000 Hz in equivalent rectangular bandwidth (ERB) space 101 and scaled with exponent 0.6 102.
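As an illustration of how these two acoustic predictors relate to each other, a minimal numpy sketch (ours; the gammatone filterbank itself is assumed to come from an auditory toolbox and is not shown, and the onset computation is a simplified stand-in for the neurally inspired edge-detection model cited as reference 100):

```python
import numpy as np

def acoustic_predictors(gammatone_power):
    """gammatone_power: (8, n_times) band power from a gammatone filterbank
    with ERB-spaced center frequencies between 20 and 5000 Hz.
    Returns the scaled spectrogram and a simple onset spectrogram."""
    spec = gammatone_power ** 0.6  # power-law compression, exponent 0.6
    # Simplified onset detector: half-wave rectified temporal difference
    # per band, a crude stand-in for the auditory edge-detection model
    onset = np.maximum(np.diff(spec, axis=1, prepend=spec[:, :1]), 0.0)
    return spec, onset
```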

Word and phoneme segmentation
A pronunciation dictionary was generated by combining the Carnegie-Mellon University pronunciation dictionary with the Montreal Forced Aligner 103 dictionary and adding any additional words that occurred in the stimuli. Transcripts were then aligned to the acoustic stimuli using the Montreal Forced Aligner 103 version 1.0.1. All models included control predictors for word onsets (equal value impulse at the onset of each word) and phoneme onsets (equal value impulse at the onset of each non-word initial phoneme).

Context-based predictors

All experimental predictor variables consisted of one value for each phoneme and were represented as a sequence of impulses at all phoneme onsets. The specific values were derived from three different linguistic context models.

Sublexical context model

The complete SUBTLEX-US corpus 104 was transcribed by substituting the pronunciation for each word and concatenating those pronunciations across word boundaries (i.e., no silence between words). Each line was kept separate since lines are unordered in the SUBTLEX corpus. The resulting phoneme sequences were then used to train a 5-gram model using KenLM 105. This 5-gram model was then used to derive phoneme surprisal and entropy.

The surprisal of experiencing phoneme ph_k at time point k is inversely related to the likelihood of that phoneme, conditional on the context (measured in bits):

S(ph_k) = -log2 P(ph_k | context)

In the case of the 5-phone model this context consists of the preceding 4 phonemes, ph_{k-4}, ..., ph_{k-1}. The entropy H (Greek Eta) at phoneme position ph_k reflects the uncertainty about what the next phoneme, ph_{k+1}, will be. It is defined as the expected (average) surprisal of the next phoneme:

H(ph_k) = -Σ_ph P(ph_{k+1} = ph | context) log2 P(ph_{k+1} = ph | context)

Based on the 5-phone model, the context here is ph_{k-3}, ..., ph_k.
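In code, these two quantities follow directly from any conditional phoneme model. The sketch below is our illustration; `model` is a hypothetical callable wrapping, for example, the KenLM 5-gram:

```python
import math

def surprisal(model, phoneme, context):
    """S(ph_k) = -log2 P(ph_k | context), in bits.
    `model(ph, context)` returns the conditional probability P(ph | context);
    for a phoneme 5-gram, `context` is the preceding 4 phonemes."""
    return -math.log2(model(phoneme, context))

def next_phoneme_entropy(model, context, inventory):
    """H = expected surprisal of the next phoneme, summed over the
    phoneme inventory."""
    return sum(
        -model(ph, context) * math.log2(model(ph, context))
        for ph in inventory
        if model(ph, context) > 0
    )
```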
Lexical context model

The lexical context model takes into account information from all phonemes that are in the same word as, and precede, the current phoneme 45 and is based on the cohort model of word perception 58. At word onset, the prior for each word is proportional to its frequency in the Corpus of Contemporary American English (COCA) 106. With each subsequent phoneme, the probability for words that are inconsistent with that phoneme is set to 0, and the remaining distribution is renormalized. Phoneme surprisal and entropy are then calculated as above, but with the context being all phonemes in the current word so far. In addition, lexical entropy is calculated at each phoneme position as the entropy in the distribution of the cohort. …

For estimation using 4-fold cross-validation, each subject's data were concatenated along the time axis and split into 4 contiguous segments of equal length. The mTRFs for predicting the responses in each segment were trained on the remaining 3 segments. Each of the 4 training runs in turn consisted of 3 iterations, in which the 3 segments were divided into 2 training segments and 1 validation segment. In each training run, an mTRF was estimated using an iterative coordinate descent algorithm 108 to minimize the ℓ1 error. The mTRF was iteratively modified based on the maximum error reduction in the training set (the steepest coordinate descent) and validated based on the error in the validation set. Whenever a training step caused an increase of error in the validation set, the TRF for the predictor responsible for the increase was frozen, and training continued until the whole mTRF was frozen. The 3 mTRFs from the 3 training runs were then averaged to predict responses in the left-out testing segment.
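The following is a minimal sketch of this boosting/coordinate-descent procedure, under our simplifying assumptions: a single train/validation split, a fixed step size delta, and naive recomputation of the full prediction at every step (reference 108 describes the actual algorithm):

```python
import numpy as np

def boosting(x, y, n_lags, delta=0.01, n_val=1000):
    """x: (n_predictors, n_times) predictor array; y: (n_times,) response.
    Returns the mTRF h with shape (n_predictors, n_lags)."""
    n_pred, n_times = x.shape
    h = np.zeros((n_pred, n_lags))
    frozen = np.zeros(n_pred, dtype=bool)
    train, val = slice(0, n_times - n_val), slice(n_times - n_val, None)

    def l1_error(kernel, seg):
        # Predicted response: sum of predictors convolved with their TRFs
        yhat = sum(np.convolve(x[i], kernel[i])[:n_times] for i in range(n_pred))
        return np.abs(y - yhat)[seg].sum()

    while not frozen.all():
        # Steepest coordinate descent: best single-coefficient step (training data)
        best_err, best_h, best_pred = np.inf, None, None
        for i in np.flatnonzero(~frozen):
            for lag in range(n_lags):
                for step in (delta, -delta):
                    trial = h.copy()
                    trial[i, lag] += step
                    err = l1_error(trial, train)
                    if err < best_err:
                        best_err, best_h, best_pred = err, trial, i
        if best_err >= l1_error(h, train):
            break                                # no step improves training error
        if l1_error(best_h, val) > l1_error(h, val):
            frozen[best_pred] = True             # freeze the offending predictor's TRF
        else:
            h = best_h                           # accept the step
    return h
```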
Model comparisons

Model quality was quantified through the ℓ1 norm of the residuals. For this purpose, the predicted responses for the 4 test segments, each based on mTRFs estimated on the other 3 segments, were concatenated again. To compare the predictive power of two models, the difference in the residuals of the two models was calculated at each virtual source dipole. This difference map was smoothed (Gaussian window, SD = 5 mm) and tested for significance using a mass-univariate one-sample t-test with threshold-free cluster enhancement (TFCE) 109 and a null distribution based on the full set of 4095 possible permutations of the 12 difference maps. For effect size comparison we report tmax, the largest t-value in the significant (p ≤ .05) area.

The full model consisted of the following predictors: acoustic spectrogram (8 bands); acoustic onset spectrogram (8 bands); word onsets; phoneme onsets; sublexical context model (phoneme surprisal and phoneme entropy); lexical context model (phoneme surprisal, phoneme entropy and word entropy); sentence context model (phoneme surprisal, phoneme entropy and word entropy).

For each of the tests reported in Figure 3, mTRFs were re-estimated using a corresponding subset of the predictors in the full model. For instance, to calculate the predictive power for a given level of context, the model was re-fit using all predictors except the predictors of the level under investigation. Each plot thus reflects the variability that can only be explained by the level in question. This is generally a conservative estimate for the predictive power because it discounts any explanatory power based on variability that is shared with other predictors.

To express model fits in a meaningful unit, the explainable variability was estimated through the largest possible explanatory power of the full model (maximum across the brain of the measured response minus residuals, averaged across subjects). All model fits were then expressed as % of this value. For visualization, brain maps are not masked by significance, to accurately portray the continuous nature of MEG source estimates.

To allow for univariate analyses of predictive power, an ROI was used including a region responsive to all context models (white outline in Figure 3-A). This ROI was defined as the posterior 2/3 of the combined Heschl's gyrus and STG "aparc" label, separately in each hemisphere.

Tests of lateralization

For spatio-temporal tests of lateralization (Figure 3-A and D) the difference map was first morphed to the symmetric "fsaverage_sym" brain 110, and the data from the right hemisphere was morphed to the left hemisphere. Once in this common space, a mass-univariate repeated measures t-test with TFCE was used to compare the difference maps from the left and right hemispheres.

Tests of localization difference

A direct comparison of two localization maps can have misleading results due to cancellation between different current sources 63 as well as the continuous nature of MEG source estimates 111. However, a test of localization difference is possible due to the additive nature of current sources 64. Specifically, for a linear inverse solver as used here, if the relative amplitude of a configuration of current sources is held constant, the topography of the resulting source localization is also unchanged. Consequently, we employed a test of localization difference that has the null hypothesis that the topography of two effects in source space is the same 64. Localization tests were generally restricted to an area encompassing the major activation seen in Figure 3, based on "aparc" labels 99: the posterior 2/3 of the superior temporal gyrus and Heschl's gyrus combined, the superior temporal sulcus, and the middle 3/5 of the middle temporal gyrus. For each map, the values in this area were extracted and z-scored (separately for each hemisphere). For each comparison, the two z-scored maps were subtracted, and the resulting difference map was analyzed with a one-way repeated measures ANOVA with factor source location (left hemisphere: 180 sources; right hemisphere: 176 sources). According to the null hypothesis, the two maps should be (statistically) equal, and the difference map should only contain noise. In contrast, a significant effect of source location would indicate that the difference map reflects a difference in topography that is systematic between subjects.

TRF analysis

For the analysis of the TRFs, all 12 mTRFs estimated for each subject were averaged (4 test segments × 3 training runs). TRFs were analyzed in the normalized scale that was used for model estimation.

TRF time-course

To extract the time course of response functions, an ROI was generated including all virtual current sources for which at least one of the three context models significantly improved the response predictions. To allow a fair comparison between hemispheres, the ROI was made symmetric by morphing it to the "fsaverage_sym" brain 110 and taking the union of the two hemispheres. With this ROI, the magnitude of the TRFs at each time point was then extracted as the sum of the absolute current values across source dipoles. These time courses were resampled to 1000 Hz. Peak times were determined by finding the maximum value within a given window for each subject. Time-courses were statistically compared using mass-univariate related measures t-tests, with a null distribution based on the maximum statistic in the 4095 permutations (no cluster enhancement).
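A sketch of such a max-statistic permutation test for one-sample (or paired-difference) data, in our reading of the procedure: with 12 subjects there are 2^12 = 4096 sign assignments, and excluding the identity leaves the 4095 permutations mentioned above. TFCE, used for the spatial maps, is omitted here:

```python
import itertools
import numpy as np
from scipy import stats

def max_stat_permutation_test(diff):
    """diff: (n_subjects, n_tests) array of per-subject differences
    (e.g., a difference map, or paired TRF time courses).
    Returns observed t-values and max-statistic-corrected p-values."""
    n_subjects = diff.shape[0]
    t_obs = stats.ttest_1samp(diff, 0.0, axis=0).statistic
    null_max = []
    for signs in itertools.product((1.0, -1.0), repeat=n_subjects):
        if all(s == 1.0 for s in signs):
            continue                      # skip the identity permutation
        flipped = diff * np.asarray(signs)[:, None]
        t_perm = stats.ttest_1samp(flipped, 0.0, axis=0).statistic
        null_max.append(np.abs(t_perm).max())
    null_max = np.asarray(null_max)       # (4095,) for 12 subjects
    # Corrected p per test: fraction of permutations whose maximum |t|
    # reaches the observed |t|
    p = (null_max[:, None] >= np.abs(t_obs)[None, :]).mean(axis=0)
    return t_obs, p
```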

TRF localization
To analyze TRF localization, TRF magnitude was quantified as the summed absolute current values in three time windows, representing early (−50 to 150 ms), mid-latency (150 to 350 ms) and late (350 to 550 ms) responses (see Figure 5). Maps were smoothed (Gaussian window, SD = 5 mm) and tested for localization differences with the same procedure as described above (Tests of localization difference).

Analysis of trade-off between context models

Several analyses were performed to detect a trade-off between the use of the different context models.

Trade-off by subject

One possible trade-off is between subjects: some subjects might rely on the sentence context more than on local models, whereas other subjects might rely more on local models. For example, for lexical processing, this hypothesis would predict that for a subject for whom the sentence context model is more predictive, the lexical context model should be less predictive, and vice versa. According to this hypothesis, the predictive power of the different context models should be negatively correlated across subjects. To evaluate this, we computed correlations between the predictive power of the different models in the mid/posterior STG ROI (see Supplementary Figure 1-A).
Trade-off over time

A second possible trade-off is across time: subjects might change their response characteristics over time to change the extent to which they rely on lower- or higher-level context. For example, the depth of processing of meaningful speech might fluctuate with the mental state of alertness. According to this hypothesis, the predictive power of the different context models should be anti-correlated over time. To evaluate this, we calculated the residuals for the different model fits at each time point, res_t = |y_t − ŷ_t|, aggregating by taking the mean in the mid/posterior STG ROI (separately for each subject). The predictive power was calculated for each model by subtracting the residuals of the model from the absolute values of the measured data (i.e., the residuals of a null model without any predictor). The predictive power for each level of context was then computed by subtracting the predictive power of a corresponding reduced model, lacking the given level of context, from the predictive power of the full model. Finally, to reduce the number of data points the predictive power was summed in 1 second bins.

For each subject, the trade-off between each pair of contexts was quantified as the partial correlation 112 between the predictive power of the two contexts, controlling for the predictive power of the full model (to control for MEG signal quality fluctuations over time). To test for a significant trade-off, a one-sample t-test was used for each pair and in each hemisphere, with the null hypothesis that the correlation between contexts over time is 0 (see Supplementary Figure 1-B).
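A sketch of this partial-correlation analysis, using simulated data (our illustration; all variable values are hypothetical):

```python
import numpy as np

def partial_corr(a, b, control):
    """Partial correlation of a and b, controlling for `control`: correlate
    the residuals after regressing `control` (plus intercept) out of each."""
    X = np.column_stack([np.ones_like(control), control])
    res_a = a - X @ np.linalg.lstsq(X, a, rcond=None)[0]
    res_b = b - X @ np.linalg.lstsq(X, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

# Simulated example: per-second predictive power of two context models,
# both co-varying with overall signal quality (the full model's power)
rng = np.random.default_rng(1)
full = rng.normal(size=2700)                  # ~45 minutes in 1 s bins
sentence = 0.6 * full + rng.normal(size=2700)
lexical = 0.5 * full + rng.normal(size=2700)
r = partial_corr(sentence, lexical, full)     # near 0 here; a trade-off
print(r)                                      # would show as negative r
```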