Linking Activity in Human Superior Temporal Cortex to Perception of Noisy Audiovisual Speech

Regions of the human posterior superior temporal gyrus and sulcus (pSTG/S) respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech. We hypothesized that these multisensory responses in pSTG/S underlie the observation that comprehension of noisy auditory speech is improved when it is accompanied by visual speech. To test this idea, we presented audiovisual sentences that contained either a clear auditory component or a noisy auditory component while measuring brain activity using BOLD fMRI. Participants reported the intelligibility of the speech on each trial with a button press. Perceptually, adding visual speech to noisy auditory sentences rendered them much more intelligible. Post-hoc trial sorting was used to examine brain activations during noisy sentences that were more or less intelligible, focusing on multisensory speech regions in the pSTG/S identified with an independent visual speech localizer. Univariate analysis showed that less intelligible noisy audiovisual sentences evoked a weaker BOLD response, while more intelligible sentences evoked a stronger BOLD response that was indistinguishable from clear sentences. To better understand these differences, we conducted a multivariate representational similarity analysis. The pattern of response for intelligible noisy audiovisual sentences was more similar to the pattern for clear sentences, while the response pattern for unintelligible noisy sentences was less similar. These results show that for both univariate and multivariate analyses, successful integration of visual and noisy auditory speech normalizes responses in pSTG/S, providing evidence that multisensory subregions of pSTG/S are responsible for the perceptual benefit of visual speech.

Significance Statement

Enabling social interactions, including the production and perception of speech, is a key function of the human brain. Speech perception is a complex computational problem that the brain solves using both visual information from the talker's facial movements and auditory information from the talker's voice. Visual speech information is particularly important under noisy listening conditions when auditory speech is difficult or impossible to understand alone. Regions of the human cortex in the posterior superior temporal lobe respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech. We show that the pattern of activity in this cortex reflects the successful multisensory integration of auditory and visual speech information in the service of perception.


Introduction
Enabling social interactions, including the production and perception of speech, is a key function of the human brain. Speech perception is a complex computational problem that the brain solves using both visual information from the talker's facial movements and auditory information from the talker's voice. Visual speech information is particularly important under noisy listening conditions when auditory speech is difficult or impossible to understand alone (reviewed in 1).
The neural substrates for the integration of auditory and visual speech are a subject of active investigation (2). In non-human primates, single neurons in pSTG/S respond to both auditory and visual social communication signals (3,4,5). In humans, small populations of neurons in pSTG/S recorded with intracranial electrodes respond to both auditory and visual speech (6,7). While the idea that pSTG/S integrates visual speech information with noisy auditory speech in the service of comprehension seems reasonable, it is supported by only limited empirical evidence. In a recent study, repetitive transcranial magnetic stimulation (rTMS) was used to disrupt processing in pSTG/S during perception of noisy audiovisual sentences (8). rTMS resulted in small but significant decreases (between one and two dB) in the ability to understand noisy speech. However, a BOLD fMRI study failed to find evidence that pSTG/S integrates noisy auditory and visual speech. Bishop and Miller presented noisy audiovisual syllables and used post-hoc sorting to separate trials into those that were understood and those that were not (9). Differential BOLD fMRI responses to the two types of trials were observed in seventeen different brain areas, but pSTG/S was not one of them.
As in most published neuroimaging studies of noisy speech perception, Bishop and Miller performed a univariate analysis, examining the response amplitude in individual voxels and regions of interest. However, in many circumstances, multivariate analyses of the pattern of activity across multiple voxels reveal information hidden from univariate analyses (10).
Bishop and Miller also performed a volumetric group analysis, in which each participant's brain was aligned to a template and analysis was conducted at the group level. While pSTG/S was classified as a single cytoarchitectonic area by Brodmann (BA 22), modern studies suggest that it contains a dozen or more functional subdivisions with boundaries that vary between individuals (11). Volumetric group analysis ignores this anatomical variability and assumes that a given coordinate in standard space is functionally equivalent across participants, sometimes leading to incorrect inferences (12).
We have recently developed a localizer that identifies regions important for processing audiovisual speech in the pSTG/S of individual participants (13). Inspired by earlier work (14,15), the localizer contrasts responses to silent videos of actors making mouth movements with silent videos of actors making eye movements. Even though the localizer contains only unisensory visual stimuli, it successfully locates regions of pSTG/S that respond to voices and prefer voices to environmental sounds (16). A follow-up study demonstrated that mouth-preferring and eye-preferring subregions of pSTG/S could be identified during free viewing of talking faces by measuring eye movements made by participants in the MR scanner (17).
We applied this localizer to re-examine the relationship between comprehension of noisy audiovisual speech and BOLD responses in pSTG/S. Because the localizer identifies a population of voxels likely to be important for multisensory speech processing, it allows for the application of multivariate analysis techniques that compare the response pattern across conditions. Unlike volumetric group analyses, the localizer compares "apples to apples" since it identifies functionally equivalent regions across participants.

Methods
Twenty-two healthy right-handed participants (14 females, mean age 25, range 18-34) with normal or corrected-to-normal vision and normal hearing provided written informed consent under an experimental protocol approved by the Committee for the Protection of Human Subjects of the Baylor College of Medicine, Houston, TX.

Localizer fMRI Experiment
The localizer fMRI experiment was adapted from Zhu and Beauchamp (2017). The visual stimuli consisted of silent 2-second movies of two different actors making a variety of mouth movements and eye movements. Following each video, there was a 1-second response period during which participants pressed a button to identify the actor in the video. Stimuli were presented in a block design, with each block containing movies of a single type, either mouth movements or eye movements. Each 30-second block contained ten 3-second trials (2-second stimulus + 1-second response) and was followed by 10 seconds of fixation baseline. Three mouth and three eye blocks were presented alternately during each scan series, for a total duration of 240 seconds; two localizer scan series were collected for each participant.
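For concreteness, the block timing described above can be laid out programmatically. The following is a minimal sketch (the function name and event labels are ours, not the authors' presentation code); it simply encodes the stated structure of one scan series:

```python
# Minimal sketch of the localizer timing described above; illustrative only.

def build_localizer_schedule():
    """Return a list of (onset_s, event) tuples for one 240-second scan series."""
    events = []
    t = 0.0
    for block_idx in range(6):                        # 3 mouth + 3 eye blocks, alternating
        block_type = "mouth" if block_idx % 2 == 0 else "eye"
        for _ in range(10):                           # ten 3-second trials per block
            events.append((t, block_type + "_video")) # 2-second silent movie
            events.append((t + 2.0, "response"))      # 1-second button response
            t += 3.0
        events.append((t, "fixation"))                # 10-second fixation baseline
        t += 10.0
    assert t == 240.0                                 # matches the stated series duration
    return events
```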

Main fMRI Experiment
The main fMRI experiment used an event-related design. Each 6-second trial consisted of the presentation of a 3-second recording of a sentence, followed by a 3-second response period.
The sentences were recorded from a single male talker and were presented in five different formats: audiovisual (AV, video and clear audio), auditory (A, only clear audio), visual (V, only video), noisy audiovisual (AnV, video and noisy audio), and noisy auditory (An, only noisy audio) (Figure 1). To create noisy sentences, the audio recordings were combined with pink noise at a signal-to-noise ratio (SNR) of -16 dB. Pink noise, defined as noise with decreasing energy at increasing frequency, is less aversive than white noise and is commonly used in speech perception research.
Four scan series were collected for each participant. Each series had a total duration of 300 seconds and contained 40 sentence trials (8 from each of the 5 formats) and 60 seconds of fixation baseline, in a pseudo-random optimal sequence generated by the program optseq2 (Dale, 1999; https://surfer.nmr.mgh.harvard.edu/optseq).
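As an illustration of how noisy stimuli of this kind can be constructed, the sketch below generates pink noise by frequency-domain shaping and mixes it with a speech waveform at a target SNR. This is a minimal NumPy sketch under our own assumptions, not the authors' stimulus-generation code:

```python
import numpy as np

def pink_noise(n_samples, rng=None):
    """Generate pink (1/f) noise by shaping white noise in the frequency domain."""
    rng = rng or np.random.default_rng()
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples)
    freqs[0] = freqs[1]                       # avoid division by zero at DC
    spectrum /= np.sqrt(freqs)                # amplitude ~ 1/sqrt(f), so power ~ 1/f
    pink = np.fft.irfft(spectrum, n_samples)
    return pink / np.max(np.abs(pink))        # normalize to unit peak amplitude

def add_noise_at_snr(speech, noise, snr_db=-16.0):
    """Mix speech and noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```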

MRI Acquisition
Participants were scanned in a 3 tesla Siemens Trio MRI scanner equipped with a 32-channel head coil.

Perceptual Ratings
After each sentence presentation in the main fMRI experiment, participants rated intelligibility using a three-alternative forced choice (to minimize perceptual learning, sentences were not repeated within participants). The rating choices were: "understood everything" (all words in the sentence understood); "understood something" (at least one word in the sentence understood); "understood nothing" (no words in the sentence understood). To ensure sufficient trial numbers for post-hoc trial sorting, the "everything" and "something" choices were grouped together, resulting in two types of perceptually-sorted trials: "yes-some or all of the sentence intelligible" (Y) and "not at all intelligible" (N). This grouping was necessary because there were few "understood everything" responses (8% in the auditory-only noisy condition).

Regions of Interest

Across subjects, the average location of the ROI midpoint was y = −25 ± 1.5 mm (left hemisphere) and y = −24 ± 0.8 mm (right hemisphere); coordinates in MNI standard space (N27). pSTG/S mouth ROIs were created by finding voxels in the posterior temporal anatomical ROI that showed a significant overall omnibus F-test (F > 5, q < 0.0001, false discovery rate corrected) and a significant preference for mouth compared with eye videos in the localizer fMRI experiment (q < 0.05). pSTG/S eye ROIs were defined as posterior temporal voxels that showed the reverse preference (eye videos > mouth videos, q < 0.05). Sample ROIs from 4 participants are shown in Figure S1.
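The voxel-selection criteria can be summarized in code. The sketch below implements Benjamini-Hochberg false discovery rate correction and the two-criterion mouth-ROI definition; the input arrays (per-voxel p-values and beta weights restricted to the anatomical mask) are hypothetical stand-ins for the outputs of a voxel-wise GLM:

```python
import numpy as np

def fdr_q(p):
    """Benjamini-Hochberg q-values for a vector of p-values."""
    p = np.asarray(p, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    q = np.empty(n)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

def define_mouth_roi(p_omnibus, p_mouth_vs_eye, beta_mouth, beta_eye):
    """Boolean mask of mouth-preferring voxels, mirroring the criteria above."""
    responsive = fdr_q(p_omnibus) < 0.0001           # significant omnibus F-test (FDR)
    prefers_mouth = (fdr_q(p_mouth_vs_eye) < 0.05) & (beta_mouth > beta_eye)
    return responsive & prefers_mouth
```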

Mixed-effects models
Analysis across participants was conducted using linear mixed-effects (LME) models.
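The section does not specify the software used to fit the models, so the sketch below uses the statsmodels package as one possibility; the long-format table layout ('psc' for percent signal change, plus 'condition' and 'subject' columns) is likewise a hypothetical illustration:

```python
import pandas as pd
import statsmodels.formula.api as smf

def fit_lme(df: pd.DataFrame):
    """Fit a mixed model: fixed effect of condition, random intercept per subject."""
    model = smf.mixedlm("psc ~ condition", data=df, groups=df["subject"])
    return model.fit()
```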

Multivariate analysis
Multivariate analysis was performed within individual participant ROIs. For a given condition, the multivariate response pattern was defined as the percent signal change in each voxel within the ROI evoked by that condition. Pair-wise comparisons between conditions were calculated as the simple linear correlation between the patterns evoked by each condition (19). It should be noted that unlike some analyses (e.g. 20) we did not subtract the mean value across conditions within each voxel before performing the correlation. This means that all reported pattern correlations are positive, reflecting the positive BOLD response to sentences in the pSTG/S.
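A minimal sketch of this pattern-correlation computation, assuming each condition's response pattern is stored as a per-voxel vector of percent signal change (the names and data layout are ours):

```python
import numpy as np

def pattern_correlation_matrix(patterns):
    """Pair-wise Pearson correlations between condition response patterns.

    `patterns` maps condition name -> 1-D array of percent signal change, one
    value per ROI voxel. As noted above, the voxel-wise mean across conditions
    is NOT subtracted before correlating.
    """
    names = list(patterns)
    n = len(names)
    r = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            r[i, j] = r[j, i] = np.corrcoef(patterns[names[i]],
                                            patterns[names[j]])[0, 1]
    return names, r
```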

Perceptual Data
In the main experiment, participants were presented with 5 different types of sentences in the MR scanner (Fig. 1). Participants rated each sentence as intelligible (some or every word in the sentence understood) or unintelligible (no words understood). Consistent differences in intelligibility across conditions were observed (Figure 1). Audiovisual sentences (AV) were the most intelligible (99% of sentences rated intelligible), followed by clear auditory-only sentences (A, 84%), audiovisual sentences with pink noise added to the auditory track (AnV, 80%), auditory-only noisy sentences (An, 53%), and visual-only sentences (V, 10%). It should be noted that the echo-planar pulse sequence used for MR acquisition created a moderate degree of auditory noise that was equivalent across conditions.

Functional Localizer: Identification of mouth-preferring cortex in pSTG/S
In each participant, the localizer fMRI experiment identified regions of the posterior temporal cortex that preferred visually-presented silent mouth movements to silent eye movements. Figure 2A shows an illustration of the mouth-preferring ROI on a group-averaged cortical surface (see Fig. S1 for results in 8 example hemispheres).

BOLD Response to Clear and Noisy Sentences
We examined the response of mouth-preferring pSTG/S to the 5 types of sentences presented in the main experiment.

BOLD Response to Noisy Sentences, Sorted by Perception
If the pSTG/S is the neural locus for the perceptual benefit of visual speech, we might expect to find a relationship between BOLD responses in the pSTG/S and sentence intelligibility. More effective integration of visual and auditory speech might lead to greater neural responses (higher BOLD signal) and improved perception. To examine this relationship, we applied the procedure of post-hoc sorting of fMRI data by perception. For each stimulus category, sentences rated as intelligible were placed in one bin and sentences rated as unintelligible were placed in the other (Figure 3). To minimize any confounding effects of stimulus, the initial analysis was only conducted within stimulus category, so that stimulus factors (such as the presence or absence of added noise) were equivalent between bins.
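The binning step can be sketched as follows; the trial-record fields ('format' and 'rating') are hypothetical names used for illustration:

```python
from collections import defaultdict

def sort_trials_by_perception(trials):
    """Bin trials by stimulus format and by the grouped intelligibility report."""
    bins = defaultdict(list)
    for trial in trials:
        # "everything" and "something" responses are grouped as intelligible (Y)
        rating = "Y" if trial["rating"] in ("everything", "something") else "N"
        bins[(trial["format"], rating)].append(trial)
    return bins  # e.g., bins[("An", "Y")] holds intelligible noisy-auditory trials
```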
The response to noisy auditory sentences rated as intelligible (An-Y, 0.50%) was significantly greater than the response to noisy auditory sentences rated as unintelligible (An-N).

Multivariate analysis
Next, we analyzed the multivariate response pattern in pSTG/S. For each pair of conditions, we correlated the response pattern of voxels within each participant's mouth-preferring pSTG/S ROI, and then averaged across participants to calculate a mean r. Figure 5A shows the complete matrix of r-values between all pair-wise conditions. The lowest correlations were between visual-only and auditory-only sentences, the two conditions that differed the most in both sensory content and intelligibility. The average correlation between the response patterns for visual-only speech (which always received the lowest intelligibility ratings) and auditory-only speech (which generally received high intelligibility ratings) was r(A, V) = 0.47.
In contrast, there were much higher correlations between sentences with similar sensory content. Critically, intelligibility also influenced pattern correlation values, even when sensory factors were kept constant. There was a high correlation between the pSTG/S response pattern to clear auditory speech and noisy auditory sentences rated as intelligible, r(A, An-Y) = 0.88. The correlation was significantly weaker for unintelligible noisy auditory sentences, r(A, An-N) = 0.83 (0.88 vs. 0.83, χ²(1) = 7.13, p = 0.0076; Figure 5B).
The influence of intelligibility on the response pattern was even more pronounced for audiovisual speech: there was very high similarity between the response patterns to clear audiovisual sentences and noisy audiovisual sentences rated as intelligible.

Control Analysis: Identification of eye-preferring cortex in pSTG/S
In our primary analysis, we focused on regions of the pSTG/S that preferred visual presentation of faces making mouth movements to faces making eye movements. To determine if the observed effects were specific to mouth-preferring pSTG/S, in a control analysis we considered eye-preferring pSTG/S. Voxels within each hemisphere meeting three criteria (anatomical location in the pSTG/S; significant BOLD response to visual stimulation compared with fixation baseline; and a greater response to eye movements than to mouth movements) were grouped into an ROI. pSTG/S eye ROIs were successfully created in each of the 44 hemispheres that were examined (see Figure S1 for ROIs in 8 sample hemispheres). For visualization only, a group map of pSTG/S eye-preferring regions was created on the cortical surface (Figure 6A).
Mouth-preferring voxels were located more anteriorly in the pSTG/S, while eye-preferring voxels were located more posteriorly. To quantify this effect, we calculated the center of mass of the activation for the mouth and eye ROIs. The average Euclidean distance between the centers of mass of the mouth and eye ROIs was 15 ± 5 mm (SD) in standard space in the left hemisphere (t-test against zero: t(21) = 12, p = 9.5 × 10⁻¹¹) and 14 ± 7 mm in the right hemisphere (t-test against zero: t(21) = 10.5, p = 7.8 × 10⁻¹⁰). The primary driver of this effect was a more anterior location for mouth-preferring voxels (results for all cardinal axes are shown in Table 1). There was no significant difference in volume of activation between the mouth and eye ROIs in either the left hemisphere (average volume: 2600 ± 2100 mm³ for eye and 2200 ± 2100 mm³ for mouth; t-test, t(21) = 0.48, p = 0.64) or the right hemisphere (3200 ± 2300 mm³ for mouth and 1900 ± 1900 mm³ for eye; t(21) = 1.6, p = 0.13).
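A minimal sketch of this center-of-mass analysis, assuming per-participant arrays of ROI voxel coordinates in standard space (the function names are ours):

```python
import numpy as np
from scipy import stats

def center_of_mass(voxel_coords_mm):
    """Center of mass of an ROI given an (n_voxels, 3) array of mm coordinates."""
    return np.mean(np.asarray(voxel_coords_mm), axis=0)

def mouth_eye_separation(mouth_rois, eye_rois):
    """Per-participant mouth-eye center distance and a one-sample t-test vs. zero."""
    dists = np.array([
        np.linalg.norm(center_of_mass(m) - center_of_mass(e))
        for m, e in zip(mouth_rois, eye_rois)
    ])
    t, p = stats.ttest_1samp(dists, 0.0)
    return dists, t, p
```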

Univariate comparison of pSTG/S mouth and eye ROIs
Mouth-preferring pSTG/S showed significantly greater responses to AV than to A or V sentences, while eye-preferring pSTG/S did not. To determine if this difference was itself significant, a linear mixed-effects model was fit with fixed factors of ROI and stimulus condition (A, An, AV, AnV, V). There were significant main effects of ROI (χ²(1) = 36, p = 2.3 × 10⁻⁹) and stimulus condition (χ²(6) = 90, p < 2.2 × 10⁻¹⁶), as well as a significant interaction between ROI and condition (χ²(6) = 37, p = 1.8 × 10⁻⁷). Complete results are shown in Table 2.

Multivariate comparison of pSTG/S mouth and eye ROIs
Response patterns for the different kinds of sentences were more similar in mouth-preferring pSTG/S than in eye-preferring pSTG/S (Figure 7A).

Discussion
We used a localizer to identify subregions of the pSTG/S selective for the visual mouth movements that comprise visual speech in individual participants (13,17). A control ROI consisted of pSTG/S subregions selective for viewed eye movements. In the main experiment, we measured responses in both subregions of the pSTG/S to clear and noisy auditory-only and audiovisual sentences. As expected, adding visual speech to noisy auditory speech greatly improved intelligibility (reviewed in 1). Post-hoc trial sorting was used to measure BOLD fMRI responses to sentences that were more or less intelligible.
In the univariate analysis, we observed stronger responses for clear auditory-only speech vs. noisy auditory-only speech in both mouth-preferring and eye-preferring subregions of pSTG/S, consistent with many previous reports of stronger BOLD signals for clear speech throughout lateral temporal cortex (9,21-23).
We observed significantly stronger responses for intelligible vs. unintelligible speech (both auditory-only and audiovisual) in mouth-preferring but not eye-preferring pSTG/S, and this difference between subregions was significant when tested with a linear mixed-effects model. Most previous studies used volumetric group analysis, blurring together mouth-preferring and eye-preferring regions due to anatomical variability between participants. This may partially explain why Bishop and Miller (9) did not find an effect of intelligibility in pSTG/S. In support of this idea of anatomical heterogeneity, Evans and colleagues described a small region in mid-posterior STG that was positively correlated with a post-scanning behavioral task across participants (23).
In addition to the univariate analysis, the localizer allowed for an examination of the multivariate pattern of activity across the population of voxels in the mouth-preferring pSTG/S subregion. The largest differences between response patterns were between sentences with different sensory components, such as the responses to auditory-only and visual-only speech.
However, there was also a large difference in response patterns between intelligible and unintelligible speech (with sensory information held constant): auditory-only noisy speech that was rated as intelligible evoked a response pattern more similar to the pattern evoked by clear speech. This link between pSTG/S activity and perception may be relevant for efforts to restore or augment speech perception (25), including through brain-computer interfaces (26).

Comparison with previous localizer results
The freely-available stimuli from the pSTG/S localizer of (13) were used to identify mouth-preferring and eye-preferring regions of pSTG/S. The present study replicated the main findings of (13) in a different group of participants. First, mouth-preferring and eye-preferring regions of pSTG/S were identified in every participant. Second, mouth-preferring regions responded significantly more to auditory-only speech than eye-preferring regions, even though the localizer contained only visual stimuli. The present results extend the original study by showing that mouth-preferring subregions, but not eye-preferring subregions, demonstrate the characteristic neuroimaging signature of multisensory integration, with a significantly greater response to audiovisual speech than to auditory-only or visual-only speech (27). The present study also adds support for the idea that the mouth-preferring subdivision has a special role in multisensory speech perception, since it responded differently (in both univariate and multivariate analyses) when speech was intelligible compared with when it was not. Eye-preferring and mouth-preferring regions of the pSTG/S can be identified during free viewing of natural speech, in which fixations of the mouth region last only a few hundred milliseconds (17). Therefore, techniques with faster temporal resolution than BOLD fMRI, such as MEG (28), EEG (29) and iEEG (30), will be important in further elucidating the multisensory computations performed by pSTG/S.