Increased connectivity among sensory and motor regions during visual and audiovisual speech perception

In everyday conversation, we usually process the talker's face as well as the sound of their voice. Access to visual speech information is particularly useful when the auditory signal is degraded. Here we used fMRI to monitor brain activity while adult humans (n = 60) were presented with visual-only, auditory-only, and audiovisual words. The audiovisual words were presented in quiet and at several signal-to-noise ratios. As expected, audiovisual speech perception recruited both auditory and visual cortex, with some evidence for increased recruitment of premotor cortex in some conditions (including in substantial background noise). We then investigated neural connectivity using psychophysiological interaction (PPI) analysis with seed regions in both primary auditory cortex and primary visual cortex. Connectivity between auditory and visual cortices was stronger in audiovisual conditions than in unimodal conditions, extending to a wide network of regions in posterior temporal cortex and prefrontal cortex. In addition to whole-brain analyses, we also conducted a region-of-interest analysis on the left posterior superior temporal sulcus (pSTS), implicated in many previous studies of audiovisual speech perception. We found evidence for both activity and effective connectivity in pSTS for visual-only and audiovisual speech, although these effects were not significant in whole-brain analyses. Taken together, our results suggest a prominent role for cross-region synchronization in understanding both visual-only and audiovisual speech that complements activity in "integrative" brain regions such as pSTS.


Introduction
Understanding speech in the presence of background noise is notoriously challenging, and when visual speech information is available, listeners make use of it: performance on audiovisual (AV) speech in noise is better than on auditory-only speech in noise (Sumby and Pollack, 1954). Although there is consensus that listeners make use of visual information during speech perception, there is little agreement either on the neural mechanisms that support visual speech processing or on the way in which visual and auditory speech information are combined during audiovisual speech perception.

One longstanding perspective on audiovisual speech has been that auditory and visual information are processed through separate channels, and then integrated at a separate processing stage (Grant and Seitz, 1998; Massaro and Palmer, 1998). Audiovisual integration is thus often considered an individual ability at which some people are better and some are worse, regardless of their unimodal processing abilities (Mallick et al., 2015).

However, more recent data have brought this traditional view into question. For example, Tye-Murray and colleagues (2016) showed that unimodal auditory-only and visual-only word recognition scores accurately predicted AV performance, and factor analyses revealed two unimodal ability factors with no evidence of a separate integrative ability factor. These findings suggest that rather than depending on a separate stage of audiovisual integration, AV speech perception may depend most strongly on the coordination of auditory and visual inputs (Sommers, 2021).

Theoretical perspectives on audiovisual integration have also informed cognitive neuroscience approaches to AV speech perception. Prior functional neuroimaging studies of audiovisual speech processing have largely focused on identifying brain regions supporting integration. One possibility is that the posterior superior temporal sulcus (pSTS) combines auditory and visual information during speech perception. The pSTS is anatomically positioned between auditory cortex and visual cortex, and has the functional properties of a multisensory convergence zone (Beauchamp et al., 2004). During many audiovisual tasks, the pSTS is differentially activated by matching and mismatching auditory-visual information, consistent with a role in integration (Stevenson and James, 2009). Moreover, functional connectivity between the pSTS and primary sensory regions varies with the reliability of the information in a modality (Nath and Beauchamp, 2011), suggesting that the role of the pSTS may be related to combining or weighing information from different senses.

A complementary proposal is that regions of premotor cortex responsible for representing articulatory information are engaged in processing speech (Okada and Hickok, 2009). The contribution of motor regions to speech perception is hotly debated. Evidence consistent with a motor contribution includes a self-advantage in both visual-only and AV speech perception (Tye-Murray et al., 2013, 2015), and effects of visual speech training on speech production (Fridriksson et al., 2009; Venezia et al., 2016). However, premotor activity is not consistently observed in neuroimaging studies of speech perception, and in some instances may also reflect non-perceptual processing (Szenkovits et al., 2012; Nuttall et al., 2016).
It is also possible that premotor regions are only engaged in certain types of speech perception situations (for example, when there is substantial background noise, or when lipreading); individual differences in hearing sensitivity or lipreading ability may also affect the involvement of premotor cortex.

In addition to looking for brain regions that support visual-only or AV speech perception, we therefore broadened our approach to study the role played by effective connectivity between auditory, visual, and motor regions. If a dedicated brain region is necessary to combine auditory and visual speech information, we would expect to see it active during audiovisual speech. If changes in effective connectivity (Friston, 1994; Stephan and Friston, 2010), that is, task-based synchronized activity, underlie visual-only or audiovisual speech processing, we would expect to see greater connectivity between speech-related regions during these conditions relative to auditory-only speech. In service of these questions we tested auditory-only speech perception and AV speech perception at a range of signal-to-noise ratios (SNRs) and obtained out-of-scanner measures of lipreading ability from our participants (Figure 1).

Materials and Methods

Stimuli

The words selected ranged from 10% to 93% correct in the lipreading-only behavioral tests. They were distributed among the six conditions that included visual information (AV in quiet, AV +5 SNR, AV 0 SNR, AV -5 SNR, AV -10 SNR, and visual-only) so that, on average, the conditions would be equivalent in lipreading difficulty. The words used in the auditory-only condition were selected from the remaining words.
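The paper does not describe the assignment algorithm, but one simple way to equate average lipreading difficulty across conditions is a sorted round-robin deal. The Python sketch below is a hypothetical illustration only: the condition labels mirror the study's six visual conditions, while the procedure itself and the placeholder word norms are our assumptions.

```python
import random

# The six conditions that included visual information.
CONDITIONS = ["AV quiet", "AV +5", "AV 0", "AV -5", "AV -10", "V-only"]

def balanced_assignment(words):
    """words: list of (word, percent_correct) pairs from the norming test.
    Sort by difficulty, then deal successive groups of six across conditions,
    shuffling each group so no condition systematically gets easier items."""
    ordered = sorted(words, key=lambda w: w[1])
    bins = {c: [] for c in CONDITIONS}
    for i in range(0, len(ordered), len(CONDITIONS)):
        chunk = ordered[i:i + len(CONDITIONS)]
        random.shuffle(chunk)
        for cond, item in zip(CONDITIONS, chunk):
            bins[cond].append(item)
    return bins

# Placeholder norms spanning roughly the 10-93% range reported in the text.
words = [(f"word{i:02d}", score) for i, score in enumerate(range(10, 94, 2))]
bins = balanced_assignment(words)
```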

Participants
We collected data from 60 participants ranging in age from 18 to 34 years (M = 22.42, SD = 3.24; 45 female). All were right-handed native speakers of American English (no languages other than English before age 7) who self-reported normal hearing and an absence of neurological disease. All provided informed consent under a protocol approved by the Washington University in Saint Louis Institutional Review Board.

Procedure
Before being tested in the fMRI scanner, all participants gave consent, completed a safety screening, and completed an out-of-scanner lipreading assessment. The behavioral lipreading assessment consisted of 50 single-word clips selected in the same way and taken from the same corpus of recorded material used in the scanner. The lipreading assessment was completed by presenting each video clip to the participant on a laptop. Participants were encouraged to verbally provide their best guess for each clip. Only verbatim responses to the stimuli were considered correct.

Participants were positioned in the scanner with insert earphones and a viewing mirror placed above the eyes to see a two-sided projection screen located at the head end of the scanner. Those who wore glasses were provided scanner-safe lenses that fit their prescription. Participants were also given a response box that they held in a comfortable position on their torso during testing. Each of the imaging runs presented trials with auditory-only, visual-only, or audiovisual speech recordings, or printed text, via an image projected on the screen that was visible to the participant through the viewing mirror. A camera positioned at the entrance to the scanner bore was used to monitor participant movement. A well-being check and short conversation occurred before each run and, if needed, participants were reminded to stay alert and asked to try to reduce their movement.

Six runs were completed during the session, each lasting approximately 5.5 minutes. The first five runs were perception runs and contained 98 trials each. The stimuli were presented in blocks of five experimental trials plus two null trials for each condition, yielding 14 blocks comprising 70 experimental trials and 28 null trials per run. All trials included 800 ms of quiet without a visual presentation before the stimuli began. During null trials participants were presented with a fixation cross instead of the audiovisual presentation. The auditory-only condition did not include visual stimuli; instead, a black screen was presented. The blocks were quasi-randomized so that two blocks from the same condition were never presented consecutively and one null trial never occurred right after another.

To keep attention high, half of the experimental trials required a response from the participant. On response trials, a set of two dots appeared on the screen after the audiovisual/audio presentation; the right-side dot was green and the left-side dot was red. The participant was instructed to use the right-hand button on the response box to indicate "yes" if they were confident that they had been able to identify the previous word, and to use the left-hand button if they felt they had not identified the previous word correctly.

After the initial five runs, a final run of 60 trials was presented in which participants saw a series of written words projected on the screen. The items were the same 50 words used for the behavioral visual-only assessment, which did not appear in any of the other fMRI conditions. Each word stayed on the screen for 2.3 seconds, followed by two green dots that appeared for 2.3 seconds. Participants were asked to say aloud the word that was presented during the period that the dots were on the screen.
Ten null trials were randomly distributed throughout the sequence. Null trials lasted 1.5 seconds and included a fixation cross on the screen. The reading task was always the final run.
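As a concrete illustration of the quasi-randomization constraints described above (no two same-condition blocks in a row, no two adjacent null trials), here is a minimal Python sketch. The condition labels and the rejection-sampling approach are our assumptions; the authors do not specify how the orders were generated.

```python
import random

CONDITIONS = ["A-only", "V-only", "AV quiet", "AV +5", "AV 0", "AV -5", "AV -10"]

def block_order():
    """Shuffle 14 blocks (2 per condition) until no two same-condition
    blocks are adjacent."""
    blocks = CONDITIONS * 2
    while True:
        random.shuffle(blocks)
        if all(a != b for a, b in zip(blocks, blocks[1:])):
            return blocks

def trials_within_block(condition):
    """Order 5 experimental trials and 2 null trials within one block."""
    trials = [condition] * 5 + ["null"] * 2
    random.shuffle(trials)
    return trials

def build_run():
    """Assemble a 98-trial perception run, rejecting any order that places
    two null trials back to back (including across block boundaries)."""
    while True:
        run = [t for c in block_order() for t in trials_within_block(c)]
        if all(not (a == b == "null") for a, b in zip(run, run[1:])):
            return run

run = build_run()
assert len(run) == 98  # 70 experimental + 28 null trials
```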

Behavioral data analysis
The out-of-scanner lipreading assessment was scored as the percentage of correct responses made by each participant, which we used as a covariate in the fMRI analyses, allowing us to explore patterns of brain activity related to more successful lipreading. In-scanner lipreading was scored similarly, except that scores were based on participants' own judgments of their accuracy. Because we had no way to verify lipreading accuracy in the scanner, we used these scores to assess qualitative differences in difficulty across conditions rather than for formal statistical analyses.
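A minimal sketch of this scoring logic in Python, using simulated placeholder data (the variable names and synthetic responses are ours, not the study's):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder data standing in for the study's measures (n = 60 participants,
# 50 verbatim-scored lipreading trials each, plus an in-scanner self-rating).
trial_correct = rng.random((60, 50)) < 0.48
out_of_scanner = 100.0 * trial_correct.mean(axis=1)   # percent correct
in_scanner = out_of_scanner + rng.normal(0, 15, 60)   # noisy self-report proxy

# Rank-based association between the two measures (the study reported
# Spearman rho = 0.38 between out-of-scanner accuracy and in-scanner ratings).
rho, p = spearmanr(out_of_scanner, in_scanner)
print(f"Spearman rho = {rho:.2f}")
```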

MRI data acquisition and analysis
MRI images were acquired on a Siemens Prisma 3T scanner using a 32-channel head coil. Structural images were acquired using a T1-weighted MPRAGE sequence with a voxel size of 0.8 × 0.8 × 0.8 mm. Functional images were acquired using a multiband sequence (Feinberg et al., 2010) in axial orientation with an acceleration factor of 8 (TE = 37 ms), providing full-brain coverage with a voxel size of 2 × 2 × 2 mm. Each volume took 0.770 s to acquire. We used a sparse imaging paradigm (Edmister et al., 1999; Hall et al., 1999) with a repetition time of 2.47 s, leaving 1.7 s of silence on each trial. We presented words during this silent period, and during the repetition task we likewise instructed participants to speak during the silent period, to minimize the influence of head motion on the data.

Analysis of the MRI data was performed using Automatic Analysis version 5.

Results

We first examined whole-brain univariate effects by condition, shown in Figure 2. We observed temporal lobe activity in all conditions, including visual-only, and visual cortex activity in all conditions except auditory-only.

We next related activity during visual-only speech to the out-of-scanner lipreading score (Figure 1b). Across participants, lipreading accuracy ranged from 4% to 74% (mean = 47.75, SD = 15.49) and correlated with in-scanner ratings (Spearman rho = 0.38). We included out-of-scanner lipreading as a covariate to see whether individual differences in out-of-scanner scores related to visual-only activity; we did not find any significant relationship (positive or negative).
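One way to picture the covariate analysis is as a group-level GLM with a mean-centered lipreading regressor. The sketch below is a schematic with placeholder data, not the authors' actual pipeline (which used Automatic Analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60

# Placeholder per-participant values: the visual-only contrast at one voxel
# and each participant's out-of-scanner lipreading score (4-74% in the study).
contrast = rng.normal(size=n)
lipreading = rng.uniform(4, 74, n)

# Group-level design: intercept (group mean effect) plus the mean-centered
# lipreading covariate; beta[1] tests whether activity scales with lipreading.
X = np.column_stack([np.ones(n), lipreading - lipreading.mean()])
beta, *_ = np.linalg.lstsq(X, contrast, rcond=None)
```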

Following the univariate analyses, we examined effective connectivity using psychophysiological interaction (PPI) models. We started with a seed region in left visual cortex. As seen in Figure 3, compared to auditory-only speech, visual-only and all audiovisual conditions showed increased connectivity with the visual cortex seed, notably including bilateral superior temporal gyrus and auditory cortex. The same was true with an auditory cortex seed, shown in Figure 4: here, compared to the visual-only condition, we saw increased connectivity with visual cortex in all conditions except the auditory-only condition.
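For readers unfamiliar with PPI, the interaction regressor is formed by multiplying a mean-centered psychological (task) regressor with the physiological timeseries from the seed region, and testing its weight alongside the main effects. The following Python sketch is a simplified illustration with placeholder data; standard implementations additionally deconvolve the seed signal to the neural level before forming the interaction, a step omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
tr = 2.47                      # repetition time used in the study (s)
n_scans = 400
t = np.arange(n_scans) * tr

# Psychological regressor: placeholder blocks coding, e.g., AV vs. auditory-only.
psych = np.where((t // 60) % 2 == 0, 1.0, -1.0)

# Physiological regressor: seed timeseries (e.g., left V1); random noise
# stands in for real data here.
seed = rng.normal(size=n_scans)

# Psychophysiological interaction: element-wise product of the mean-centered
# task code and the seed signal.
ppi = (psych - psych.mean()) * seed

# GLM with intercept, main effects, and interaction; the PPI effect at each
# voxel is the fitted weight on the final column.
X = np.column_stack([np.ones(n_scans), psych, seed, ppi])
y = rng.normal(size=n_scans)   # placeholder target-voxel timeseries
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```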

Finally, to complement the whole-brain analyses above, we conducted an ROI analysis focusing on pSTS, shown in Figure 5. For the whole-brain univariate and PPI analyses described above, we extracted values from left pSTS and used one-sample t-tests to assess whether activity was significantly different from 0. Significance (p < .05, Bonferroni corrected for 19 tests, giving p < .00263) is indicated above each condition.

Figure 5. Region-of-interest analyses highlighting the role of the left pSTS in speech processing. a. pSTS activity for univariate analyses (cf. Figure 2). b. PPI-based effective connectivity with V1 (cf. Figure 3). c. PPI-based effective connectivity with A1 (cf. Figure 4). Significant differences from 0, corrected for multiple comparisons, are indicated with an asterisk.
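The testing logic for this ROI analysis can be sketched as below; the data are simulated placeholders, and only the corrected threshold (0.05/19 ≈ .00263) comes from the text.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(2)
alpha = 0.05 / 19    # Bonferroni-corrected threshold (.00263) from the text

# Placeholder ROI values: 19 tests (conditions x analyses) x 60 participants.
roi_values = rng.normal(0.2, 1.0, size=(19, 60))

for i, vals in enumerate(roi_values):
    t_stat, p = ttest_1samp(vals, 0.0)           # one-sample t-test against 0
    flag = "*" if p < alpha else ""
    print(f"test {i:2d}: t = {t_stat:5.2f}, p = {p:.4g} {flag}")
```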

Discussion

We studied brain activity associated with visual-only and audiovisual speech perception. We found that connectivity between auditory, visual, and premotor cortex was enhanced during audiovisual speech processing relative to unimodal processing, and during visual-only speech processing relative to auditory-only speech processing. These findings are broadly consistent with a role for synchronized interregional neural activity in supporting visual and audiovisual speech perception.

Dedicated regions for multisensory speech processing
Although understanding audiovisual speech requires combining information from multiple modalities, the way this happens is unclear. One possibility is that heteromodal brain regions such as the pSTS act to integrate inputs from unisensory cortices. In addition to combining inputs to form a unitary percept, regions such as pSTS may also give more weight to more informative modalities (for example, to the visual signal when the auditory signal is noisy) (Nath and Beauchamp, 2011).

Activity in pSTS for visual-only or AV speech was suggested by both our whole-brain and ROI-based analyses. In particular, we observed pSTS activity for AV speech in which the auditory and visual signals were always congruent, consistent with a role for pSTS in integrating or combining auditory and visual information. Of course, pSTS activity is not always observed for AV speech. Multisensory effects have also been reported in primary sensory regions themselves; for example, visual input can modulate oscillations in auditory cortex, which was hypothesized to increase sensitivity to auditory stimuli. In at least one human MEG study, audiovisual effects appeared sooner in auditory cortex than in pSTS (Möttönen et al., 2004), and visual speech may speed processing in auditory cortex (van Wassenhove et al., 2005). These findings suggest that multisensory effects are present in primary sensory regions, and that auditory and visual information do not require a separate brain region in which to "integrate".

In the current data, we observed stronger connectivity between auditory and visual cortex for visual-only and audiovisual speech conditions than for unimodal auditory-only speech, and stronger connectivity in audiovisual speech conditions than in unimodal visual-only speech. That is, using a visual cortex seed we found increases in effective connectivity with auditory cortex, and using an auditory cortex seed we found increases in effective connectivity with visual cortex. These complementary findings indicate functionally coordinated activity between primary sensory regions that increases during audiovisual speech perception.

Beyond primary sensory cortices, we also observed effective connectivity changes with premotor cortex for both visual-only speech and several audiovisual conditions. The functional synchronization between visual cortex, auditory cortex, and premotor cortex is consistent with a distributed network that orchestrates activity in response to visual-only and audiovisual speech.

Finally, our ROI analysis showed increased effective connectivity between pSTS and V1, but not A1, under most experimental conditions (Figure 5). These effective connectivity changes with V1 are consistent with a role for pSTS in audiovisual speech processing. However, they are not easily reconciled with studies reporting connectivity differences between pSTS and both A1 and V1 (Nath and Beauchamp, 2011). Although the location and size of any chosen pSTS ROI no doubt matter, we used the same ROI for the PPI analyses with both the A1 seed and the V1 seed, so ROI definition alone does not seem to explain the qualitative difference between the two.

It may be worth considering whether the pSTS plays different roles in relation to A1 and V1. Just because pSTS responds to both auditory and visual information does not necessarily mean it treats them equally, or integrates them in a modality-agnostic manner.
Indeed, just as "unisensory" cortices show multisensory effects and anatomical connections (Cappe and Barone, 2005), heteromodal or multisensory regions can exhibit modality preferences (Noyce et al., 2017). In many audiovisual tasks, auditory information appears to be preferentially processed (Grondin and McAuley, 2009; Grondin and Rousseau, 1991; Grahn et al., 2011; Recanzone, 2003). Thus, pSTS may be particularly important in integrating visual information into an existing auditory-dominated percept. Relatedly, it could also be that multimodal information is inextricably bound at early stages of perception (Rosenblum, 2008), a process which may rely on pSTS.

The emerging picture is one in which coordination of large-scale brain networks (that is, effective connectivity reflecting time-locked functional processing) is associated with visual-only and audiovisual speech processing. What might be the function of such distributed, coordinated activity? Visual and audiovisual speech appear to rely on multisensory representations. For audiovisual speech, it may seem obvious that successful perception requires combining auditory and visual information. However, visual-only speech has also been consistently associated with activity in auditory cortex (Calvert et al., 1997; Okada et al., 2013). These activations may correspond to visual-auditory associations, and auditory-motor associations, learned from audiovisual speech that are automatically reactivated even when the auditory input is absent.

Interestingly, our out-of-scanner lipreading scores did not correlate with any of the whole-brain results. It should be noted, however, that our sample size, while large for fMRI studies of audiovisual speech processing, may still be too small to reliably detect individual differences in brain activity patterns (Yarkoni and Braver, 2010). Moreover, there may be multiple ways that brains can support better lipreading, and such heterogeneity in brain patterns would not be evident in our current analyses. Future studies with larger sample sizes may be needed to quantitatively assess the degree to which participants' activity falls into distinct neural strategies, and the degree to which these strategies relate to lipreading performance.

It is worth highlighting an intriguing aspect of our data: auditory cortex was always engaged, even in visual-only conditions, whereas the reverse was not true for visual cortex (which was only engaged when visual information was present) (Figure 2). This observation may relate to deeper theoretical issues regarding the fundamental modality of speech representation. That is, if auditory representations have primacy (at least for hearing people), we might expect these representations to be activated regardless of the input modality (i.e., for both auditory and visual speech). This is exactly what we observed. Although these findings do not directly speak to the level of detail contained in visual cortex speech representations (Bernstein and Liebenthal, 2014), they are consistent with asymmetric auditory and visual speech representations.

Different perspectives on multisensory integration during speech perception
An enduring challenge for understanding multisensory speech perception lies in differing uses of the word "integration". During audiovisual speech perception, listeners use both auditory and visual information, and so from one perspective both kinds of information are necessarily "integrated" into a listener's (unified) perceptual experience. However, such use of both auditory and visual information does not necessitate a separable cognitive stage for integration (Tye-Murray et al., 2016; Sommers, 2021), nor does it necessitate a region of the brain devoted to integration. The interregional coordination we observed here may accomplish the task of integration, in that both auditory and visual modalities shape perception. In this framework, there is no need to first translate visual and auditory speech information into some kind of common code (see also Altieri et al., 2011).

With any study it is important to consider how the specific stimuli used influenced the results. Here, we examined processing of single words. Visual speech can inform perception along multiple dimensions (Peelle and Sommers, 2015), including by providing clues to the speech envelope (Chandrasekaran et al., 2009). These clues may be more influential in connected speech (e.g., sentences) than in single words, as other neural processes may come into play with connected speech.

Conclusions
Our findings demonstrate a scaffolding of connectivity between auditory, visual, and premotor cortices that supports visual-only and audiovisual speech perception. These findings suggest that the binding of multisensory information need not be restricted to heteromodal brain regions (e.g., pSTS), but may also emerge from coordinated unimodal activity throughout the brain.