SUMMARY
Audiovisual speech perception relies on our expertise in mapping a speaker’s lip movements onto speech sounds. This multimodal matching is facilitated by salient syllable features that align lip movements and acoustic envelope signals in the 4 – 8 Hz theta band (Chandrasekaran et al., 2009). The predominance of theta rhythms in speech processing has been firmly established by studies showing that neural oscillations track the acoustic envelope in the primary auditory cortex (Giraud & Poeppel, 2012). Similarly, theta oscillations in the visual cortex entrain to lip movements (Park et al., 2016), and the auditory cortex is recruited during silent speech perception (Bourguignon et al., 2020; Cross et al., 2019; Calvert et al., 1997). These findings suggest that neuronal theta oscillations play a functional role in organising information flow across visual and auditory sensory areas. To test whether entrainment to lip movements enslaves the auditory system and drives behavioural outcomes, we presented silent speech movies while participants performed a pure tone detection task. We showed that auditory detection varied depending on the ongoing theta phase conveyed by lip movements in the movies. In a complementary experiment presenting the same movies while recording participants’ electroencephalogram (EEG), we found that silent lip movements entrained neural oscillations in the visual and auditory cortices, with the visual phase leading the auditory phase. These results support the idea that the visual cortex, entrained by lip movements, acts as a filtering modulator relying on theta phase synchronisation to increase the sensitivity of the auditory cortex at time-windows relevant for speech comprehension.
RESULTS
When hearing gets difficult, people often visually focus on their interlocutors’ mouth to match lip movements with sounds and to improve speech perception. Mouth opening indeed shares common features with the auditory speech envelope, the two signals temporally synchronizing at the dominant 4 – 8 Hz theta rhythms imposed by syllables (Park et al., 2016; Chandrasekaran et al., 2009; Luo & Poeppel, 2007). Neural oscillations in the auditory cortex track the auditory envelope structure during speech perception, suggesting that this “entrainment” reflects signal analysis (Keitel et al., 2018; Peelle & Davis, 2012; Gross et al., 2013; Giraud & Poeppel, 2012). Although the term entrainment is currently under debate (Meyer, Sun & Martin, 2019; Obleser & Kayser, 2019; Haegens & Zion Golumbic, 2018; Rimmele et al., 2018), here we use it to describe neural patterns tracking salient features conveyed in speech signals which occur at theta frequency (4 – 8 Hz). Previous studies demonstrated that the visual perception of silent moving lips entrains theta oscillations in the visual cortex and recruits auditory processing regions (Bourguignon et al., 2020; Cross et al., 2019), even in the absence of sound (Cross et al., 2015; Calvert et al., 1997). Further, information specific to lip movements is represented not only in the visual cortex but also in the auditory cortex (Park et al., 2018). These results raise the question of whether visual perception of lip movements modulates the auditory cortex in a functional way. In other words, do purely visually induced theta speech rhythms impose time windows that render the auditory cortex more sensitive to input in a phasic manner? If the answer to this question is yes, then visually focusing on your interlocutor’s mouth when you have trouble understanding them would indeed be an effective filtering modulator to increase auditory sensitivity.
Entrainment to lip movements during silent speech drives behavioural performance
To address this question, we adapted an auditory tone detection paradigm in which continuous white noise was presented simultaneously with silent movies displaying speakers engaged in conversations (Figure 1 and STAR Methods). Participants were instructed to press a key as fast and accurately as possible every time they detected a pure tone (1 kHz, 100 ms) embedded in the white noise at individual threshold (determined with a calibration task). In the condition of interest, there were two target tones: the first tone occurred randomly in the first half of the trial (0 to 2.5 s after trial onset; early window) and the second tone occurred randomly in the second half of the trial (2.5 to 5 s; late window). Two additional conditions containing zero or one single tone were introduced to estimate the false alarm (FA) rates and to reduce the predictability of the second tone from the occurrence of the first one. The three conditions were counterbalanced and randomised across six blocks of 50 trials (100 trials per condition). To test the first hypothesis of visual entrainment affecting auditory processing, participants were asked to attend carefully to the silent movies, centred on the speakers’ nose, even though they were non-informative for the detection task. Crucially, the videos were preselected such that lip movements occurred in the 4 – 8 Hz theta range: we determined at which theta frequency the vertical mouth aperture and the auditory speech envelope of the original clips showed significant dependencies using a mutual information method (see STAR Methods). This paradigm allowed us to link directly the onsets of detected tones with the phase of the ongoing theta activity conveyed by the lip movements. As neural entrainment builds up over time (Thut et al., 2011; Hanslmayr, Axmacher & Inman, 2019), we compared behavioural performance between the early and late time-windows (containing the first and second tones, respectively).
We compared the mean theta phase distributions between first and second tone onsets across participants (Figure 2A; see STAR Methods). For each participant, the theta phases of the ongoing lip activity at the onsets of detected first and second tones were averaged across hit trials. Individual mean theta phases were then averaged across subjects to estimate the phase locking of hits to the visually conveyed theta signal in the first and second tone time-windows. Two Rayleigh’s uniformity tests were performed on the first and second grand average theta phase distributions separately. For the first tone window, the Rayleigh’s test did not reject the hypothesis of uniform distribution (n = 24; µ = 1.944 rad or 111.384°; r = 0.282; p = 0.148, Bonferroni-corrected). In contrast, the Rayleigh’s test revealed that mean phases were not uniformly distributed in the second tone window (n = 24; µ = −0.999 rad or 302.763°; r = 0.44; p < 0.01, Bonferroni-corrected). Further, a permutation test was performed on the difference in resultant vector length (r) between the first and second tones to test whether visual entrainment was significantly stronger in the second tone window than in the first, which indeed was the case (permutations: 10,000; effect size = 0.158; p = 0.015; Figure 2A; see STAR Methods). We performed two additional permutation tests on the resultant vector length difference between hits and misses in the first and second tone windows separately to test whether visual entrainment was related to successful auditory processing. No significant difference in vector length was found in the first tone window (permutations: 10,000; effect size = 0.041; p = 0.405), whereas in the second tone window the resultant vector for hits tended to be longer than that for misses (permutations: 10,000; effect size = 0.228; p = 0.052).
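The circular statistics used here (mean phase direction, resultant vector length r, and the Rayleigh uniformity test) can be sketched as follows. The original analyses were run in Matlab; this Python sketch is illustrative only, using the standard finite-sample approximation for the Rayleigh p-value, and the function names are ours rather than the authors’.

```python
import numpy as np

def mean_resultant(phases):
    """Mean direction (rad) and resultant vector length r of circular data."""
    z = np.exp(1j * np.asarray(phases)).mean()
    return np.angle(z), np.abs(z)

def rayleigh_test(phases):
    """Rayleigh test of uniformity on a sample of phases (rad).
    Returns (r, p) using the finite-sample approximation of the p-value."""
    n = len(phases)
    _, r = mean_resultant(phases)
    R = n * r  # resultant length
    p = np.exp(np.sqrt(1 + 4 * n + 4 * (n**2 - R**2)) - (1 + 2 * n))
    return r, p
```

Perfectly concentrated phases give r = 1 and a vanishing p, while a uniform grid of phases gives r ≈ 0 and p ≈ 1.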
Following up, we investigated whether tone detection differed between the first and second tone windows, as such a difference might reflect an auditory bias by visual inputs (Figure 2B). First, two independent one-sample t-tests established that participants detected the first and second tones in the two tones condition, as the d’ scores were greater than zero (first tone: T(1,23) = 20.014; p < 0.001, two-tailed; second tone: T(1,23) = 21.124; p < 0.001, two-tailed). Second, a paired-samples t-test confirmed that the second tones were better detected than the first ones (T(1, 23) = −4.488; p < 0.001; two-tailed; Figure 2B). Third, a paired-samples t-test applied to the hit reaction times showed that participants responded faster to second than to first tones (T(1, 23) = 5.486; p < 0.001; two-tailed; Figure 2C). Importantly, the improved detection of the second tone could not be attributed to a simple attentional effect due to the presence of the preceding first tone, as the single tone condition replicated the performance of the two tones condition (i.e. when sorting the single tones as first/second tones according to their onsets; see Figure S1B). Finally, a paired-samples t-test performed on the FA rates in the no tone condition confirmed that the detection performance modulations did not reflect a change in response bias between the two windows (T(1, 23) = 0.627; p = 0.537; two-tailed). Altogether, these results established that entrainment to theta lip activity increased over time and coincided temporally with increases in auditory detection. In the next step, we aimed to establish whether such audiovisual communication, relying on the critical theta organisation of information flow, was reflected in the brain.
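A d’ score of the kind reported above separates detection sensitivity from response bias by contrasting z-transformed hit and false alarm rates. As a minimal illustration (not the authors’ code; the log-linear correction for extreme rates is one common choice, assumed here):

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate, n_signal, n_noise):
    """Sensitivity index d' = z(hit rate) - z(false alarm rate).
    A log-linear correction (assumed here) keeps z-scores finite
    when an observed rate equals 0 or 1."""
    hr = (hit_rate * n_signal + 0.5) / (n_signal + 1)
    fr = (fa_rate * n_noise + 0.5) / (n_noise + 1)
    z = NormalDist().inv_cdf
    return z(hr) - z(fr)
```

With 100 signal and 100 noise trials, a 75 % hit rate against a 25 % false alarm rate yields d’ ≈ 1.3, while equal hit and false alarm rates yield d’ = 0.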
Visual cortex leads synchronization to left auditory cortex via theta oscillations during silent lips perception
The above results suggest that visual speech stimuli may recruit the auditory regions via entrainment, rendering some time-windows more sensitive for auditory detection than others. To test this hypothesis at the neural level, we recorded the EEG of 23 new participants during the perception of the same 60 silent movies used in the previous tone detection task. Participants were instructed to attend to each movie and rate its emotional content based on the speaker’s face. The movies were presented in a single block and randomised across participants. First, the sources of interest responding to speakers’ lip movements were identified by applying a linearly constrained minimum variance beamforming method. Neural entrainment to lip movements was estimated by computing mutual information (MI) on the theta phase between the EEG epochs and the corresponding lip signals in the equivalent first (0 to 2.5 s) and second (2.5 to 5 s) tone windows. Just as in the behavioural data, we assessed whether entrainment increased over time by contrasting the MI between the first and second time windows. Second, the EEG data at the identified visual and auditory sources were reconstructed to perform single-trial phase coupling analysis. The synchrony between visual and auditory sources was reflected by the distribution of theta phase angle differences ϕA-V = ϕaudio – ϕvisual at each time-point within the first and second tone windows, and the directionality of the coupling was inferred from the sign of ϕA-V (i.e. a mean ϕA-V of 0 would mean perfect phase alignment, while ϕA-V < 0 would mean that the visual phase leads the auditory phase; see STAR Methods).
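The phase angle difference ϕA-V can be obtained from band-limited analytic signals of the two source time-series. The sketch below is a simplified Python illustration (FFT-based band-pass filtering and Hilbert transform stand in for the actual beamformed source pipeline; function names are ours):

```python
import numpy as np

def theta_phase(x, fs, f_lo=4.0, f_hi=8.0):
    """Instantaneous theta phase of a 1-D signal: FFT band-pass
    into the theta band, then the angle of the analytic signal."""
    n = len(x)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    X = np.fft.rfft(x)
    X[(freqs < f_lo) | (freqs > f_hi)] = 0   # keep theta band only
    band = np.fft.irfft(X, n)
    # analytic signal: zero out negative frequencies (Hilbert transform)
    F = np.fft.fft(band)
    h = np.zeros(n)
    h[0] = 1
    h[1:(n + 1) // 2] = 2
    if n % 2 == 0:
        h[n // 2] = 1
    return np.angle(np.fft.ifft(F * h))

def phase_lag(audio_src, visual_src, fs):
    """Circular mean of phi_AV = phi_audio - phi_visual; a negative
    value indicates that the visual phase leads the auditory phase."""
    d = theta_phase(audio_src, fs) - theta_phase(visual_src, fs)
    return np.angle(np.exp(1j * d).mean())
```

For two 6 Hz signals where the auditory one lags the visual one by 0.5 rad, `phase_lag` returns approximately −0.5, i.e. visual leading.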
Source localisation analysis revealed that the maximum increases in MIsecond as compared to MIfirst were localised in the expected left visual and auditory cortices, as well as, to a lesser extent, in the right visual cortex (Figure 3A). This result supports the recruitment of both sensory areas during the perception of speakers’ lip movements even in the absence of speech sound. Two separate Rayleigh tests confirmed non-uniform distributions of ϕA-V in the first (n = 23; µ = −1.99 rad or −114.45°; r = 0.768; p < 0.001, Bonferroni-corrected) and second tone windows (n = 23; µ = −0.92 rad or −52.79°; r = 0.875; p < 0.001, Bonferroni-corrected). An additional Kuiper two-sample test confirmed that the mean ϕA-V distributions in the first and second tone windows converged towards two different preferred angles (k = 3.614 × 10^5; p < 0.001). Further, an increase in theta phase synchrony between visual and auditory areas should be reflected by a more consistent distribution of ϕA-V towards zero degrees. To quantify the modulation of phase coupling with entrainment, we computed the resultant vector length r of the distance between the observed ϕA-V in the data and a fixed zero ϕA-V in the first and second windows separately (zero ϕA-V = 0, meaning that visual and auditory theta phases are perfectly aligned with a constant offset of zero at each time-point of the time-window). A paired-samples t-test showed that the resultant vector length r of the distance between the observed ϕA-V and the zero ϕA-V was significantly greater in the second tone window (T(1, 22) = −2.135; p = 0.044; two-tailed), confirming that synchrony between auditory and visual sources improved with time (Figure 3B, C and D).
The negative theta phase angle differences ϕA-V in both the first and second tone windows confirmed that the visual phase led the auditory phase, in line with the idea of visual oscillations responding first to the lip inputs and then enslaving theta oscillations in the auditory cortex. Altogether, these results support our hypothesis that the visual cortex led synchronization to the left auditory cortex via theta oscillations during silent lip perception.
DISCUSSION
In two complementary experiments, we first established that visual entrainment to the theta lip phase modulated auditory detection, even though the information from the silent movies was irrelevant to the task. Second, the perception of silent lip movements entrained theta oscillations in the visual cortex, which in turn synchronized with the auditory cortex. Together, these results suggest that the brain’s natural reaction to visual speech stimuli might be to align the excitability of the auditory cortex with sharp mouth-openings, because that is when one expects to hear the corresponding acoustic syllable edges (Hickok & Poeppel, 2007; Giraud & Poeppel, 2012; Peelle & Sommers, 2015; Park et al., 2016; Chandrasekaran et al., 2009). Such a neural process could be a very effective filtering method to increase the sensitivity of the auditory cortex at these time windows relevant for speech comprehension.
Our EEG results suggest that theta oscillations in the left visual cortex encoded the lip activity first; information then travelled to the left auditory cortex via phase coupling to shape its activity. Previous findings reported that the auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation during audiovisual movie perception (Luo, Liu and Poeppel, 2010). Other studies reported that the perception of silent lips also recruits the auditory regions (Bourguignon et al., 2020; Cross et al., 2015; Calvert et al., 1997). Our findings go further by establishing how theta oscillations orchestrate the visual and auditory cortices through phase coupling to ensure cross-region communication even in a unimodal condition. Furthermore, it is commonly agreed that entrainment takes several cycles of rhythmic input to build up (Doelling et al., 2014; Lakatos et al., 2008; Thut et al., 2011; Zoefel et al., 2018). The behavioural and neural indicators of entrainment reported here consistently increased from the first to the second half of the trial in both experiments. This supports the idea that we indeed observed neural entrainment to lip movements, and sheds light on the functional relevance of visual inputs modulating auditory theta rhythms.
As visual onsets naturally lead corresponding auditory onsets by 100 to 300 ms in audiovisual speech (Chandrasekaran et al., 2009; van Wassenhove et al., 2005; Pilling, 2009), visual entrainment to lips may act as a filter by increasing excitability in the auditory cortex during windows containing relevant acoustic features. This hypothesis is corroborated by our phase coupling results, where the visual theta phase systematically led the auditory theta phase during silent movie presentation. However, whether such filtering reflected direct enslavement of the auditory cortex or involved top-down modulations remains unclear. Indeed, higher-level sensorimotor areas also activate during speech perception (Park et al., 2016; 2018; Cogan & Poeppel, 2011; Pulvermüller et al., 2005; Wilson et al., 2004). Assaneo & Poeppel (2018) recently demonstrated that activity in the motor and auditory cortices couples at theta rate during syllable perception, correlating with the strength of coupling between the speech signal and EEG in the auditory cortex. On the other hand, motor areas play a role in the temporal analysis of rhythmic sensory stimulation (Biau & Kotz, 2018; Arnal et al., 2015; Fujioka et al., 2015; Morillon et al., 2019). Entrainment to lip movements may provide the temporal theta structure of the speech signal to the motor cortex, which in turn adjusts auditory excitability in a top-down fashion at critical windows containing the corresponding acoustic features (in line with Park et al., 2015). Alternatively, mouth-opening perception may target internal articulatory representations and help to identify the corresponding sounds in the auditory signal. Theta activity in the auditory cortex would then reflect the contribution of endogenous oscillations bearing linguistic inferences generated from the activation of motor representations. Although speculative, this could partially explain why the increase of entrainment in the auditory cortex was left-lateralized, i.e. 
by recruiting language-related representations classically associated with the left hemisphere. This hypothesis fits into recent debates on whether neural tracking during speech processing reflects online cooperation between pure entrainment to external salient features and endogenous rhythms providing abstract representations (Meyer, Sun & Martin, 2019; Obleser & Kayser, 2019; Haegens & Zion Golumbic, 2018; Rimmele et al., 2018). However, this would not explain why visual speech information improved the detection of unrelated pure tones here, a question to be addressed in future experiments. Additional data-driven analyses suggested that two subpopulations of participants showed distinct visual theta phases shaping auditory perception in the tone detection task (Figure S2). One could hypothesize that the “good” subpopulation (i.e. group 2) was fine-tuned to a preferred visual theta phase representing an optimal time-window. This optimal window allowed information to travel to the auditory cortex (either directly or via top-down modulations) and to reset auditory activity at “perfect” moments when a tone occurred. Returning to our filtering hypothesis, visual theta entrainment in this “good” subpopulation would increase auditory excitability in coincidence with more of the time windows containing a tone, regardless of the nature of the sounds.
As a final note, although the auditory signal alone often provides enough structural information for the early analytic steps of continuous speech (e.g. in telephone conversations), a visual filter may be especially helpful to sharpen auditory perception when hearing is impaired or in older adults (Grant et al., 1998). Our results provide an important step toward understanding how visual information functionally drives auditory speech perception, and suggest future directions for investigating hearing loss compensation, i.e. improving lip-reading along with hearing correction.
AUTHOR CONTRIBUTION
E.B., H.P. and S.H. designed the experiments and paradigms. E.B. and D.W. collected and analysed the data. E.B., D.W., H.P., O.J. and S.H. wrote the paper. All authors discussed the results and commented on the manuscript.
DECLARATION OF INTEREST
The authors declare no conflicts of interest.
STAR METHODS
Key resources table
Contact for reagent and resource sharing
Further information and requests for resources and data should be directed to and will be fulfilled by the Lead Contact, Emmanuel Biau (e.biau@bham.ac.uk). Summarized data (cell means) are available; data for individual participants are available, as consent for sharing data at the level of the individual participant was received.
Experimental model and subject details
Tone detection experiment: Twenty-eight healthy native English speakers (mean age = 19 ± 0.69 years; 21 females) took part in the first behavioural experiment. Five participants were left-handed. All reported normal or corrected-to-normal vision and hearing. All participants were granted experimental participation credit. The data from four participants were excluded because of extreme overall performance, and the final analyses were performed on twenty-four data sets.
Silent movie perception-EEG experiment
Twenty-five healthy native English speakers (mean age = 21.52 ± 3.86 years; 17 females) took part in the EEG experiment. All reported normal or corrected-to-normal vision and hearing, and were right-handed. Twenty-one participants were granted experimental participation credit and five participants received financial compensation (£20). The data from two participants were excluded from the final analyses due to excessively noisy EEG data. In both experiments, all participants signed informed consent, and ethical approval was granted by the University of Birmingham Research Ethics Committee, in compliance with the Declaration of Helsinki.
Method details
Apparatus
The two tasks were programmed in Matlab (R2018a; The MathWorks, Natick, MA, USA) and presented with Psychophysics Toolbox (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007). In both tasks, the silent videos were presented on a 21-inch CRT display with a screen refresh rate of 75 Hz (nVidia Quadro K600 graphics card: 875 MHz graphics clock, 1024 MB dedicated graphics memory; Nvidia, Santa Clara, CA, USA). The auditory stimuli in the tone detection task were presented through EEG-compatible insert earphones (ER-3C; Etymotic Research, Elk Grove Village, IL). In the tone detection task, the accuracy of movie and sound presentation timing was verified by detecting, with a photodiode (ThorLabs DET36A, thorlabs.de), a small white square displayed on the left of the first frame of each visual stimulus, together with Psychophysics Toolbox (PsychPortAudio and ASIO4ALL extensions for Matlab). Additionally, a parallel audio port was used to record the online audio signal of each trial during presentation. Continuous photodiode and audio data during trials were recorded through a BioSemi Analog Input Box (AIB) adding two separate channel inputs into the BioSemi ActiveTwo system: the BioSemi AD-box was connected to the AIB through optical fibres. The input from the photodiode was connected through a BNC connector and the input from the microphone through the 3.5 mm audio jack. These two inputs were connected to the AIB through a 37-pin Sub-D connector. Data were digitized using the BioSemi ActiView software, with a sampling rate of 2048 Hz. Offline analyses were performed to calculate the actual delay between visual and auditory stimulus onsets using in-house Matlab code. Any lag between visual and auditory stimulus onsets was later compensated in the data analyses when computing the visual theta phase corresponding to the tone onsets. The experiments were run from a solid-state hard drive on a Windows 7-based PC (3.40 GHz processor, 16 GB RAM).
Participants used a standard computer keyboard to respond to the tasks.
Stimuli of the Tone detection Task
Movies
Sixty five-second movies were extracted from natural face-to-face interviews published on YouTube (www.youtube.com) by various university channels and downloaded via a free online application. Suitable movies containing meaningful content (i.e. one complete sentence, speaker facing the camera) were edited using Shotcut (Meltytech, LLC).
For each movie, the video and the sound were exported separately (Video: .mp4 format, 1280 × 720 resolution, 25 frames per second, 200 ms linear ramp fade in/out; Audio: .wav format, 44,100 Hz sampling rate, mono).
Lip movements’ detection
The lip contour signal was extracted for each video using in-house Matlab code. We computed the area information (area contained within the lip contour), the major axis information (horizontal axis within the lip contour) and the minor axis information (vertical axis within the lip contour) as described in Park et al. (2016). In the present study, we used the vertical aperture of the lip contour (i.e. the aperture between the superior and inferior lips) to establish the theta correspondence between lips and auditory speech, but using the area information gave very similar results, as also reported in Park et al. (2016). The lip time-series were resampled at 250 Hz for further analyses with the corresponding auditory speech envelope.
Auditory speech signal
The amplitude envelope of each movie sound was computed using in-house Matlab code (Park et al., 2018; 2016; Chandrasekaran et al., 2009). First, eight frequency bands equidistant on the cochlear map in the range 100 – 10,000 Hz were constructed (Smith et al., 2002). Sound signals were then band-pass filtered in these bands with a fourth-order Butterworth filter (forward and reverse). A Hilbert transform was applied to obtain the amplitude envelope of each band. These envelopes were then averaged across bands, resulting in a single wideband amplitude envelope per sound signal. Each final signal was resampled to 250 Hz for further theta correspondence analyses.
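The envelope pipeline can be sketched as follows. The original used Matlab, fourth-order Butterworth filters and the cochlear map of Smith et al. (2002); this Python sketch substitutes log-spaced band edges and FFT-based filtering, so it illustrates the pipeline rather than reproducing it, and the function names are ours.

```python
import numpy as np

def cochlear_bands(f_min=100.0, f_max=10000.0, n_bands=8):
    """Band edges log-spaced between f_min and f_max (a simplification
    of the cochlear map of Smith et al., 2002, assumed here)."""
    edges = np.logspace(np.log10(f_min), np.log10(f_max), n_bands + 1)
    return list(zip(edges[:-1], edges[1:]))

def wideband_envelope(sound, fs, out_fs=250):
    """Average of per-band Hilbert envelopes, decimated to out_fs.
    FFT band-pass stands in for the paper's Butterworth filtering."""
    n = len(sound)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    envs = []
    for lo, hi in cochlear_bands():
        X = np.fft.rfft(sound)
        X[(freqs < lo) | (freqs > hi)] = 0       # keep one band
        band = np.fft.irfft(X, n)
        # analytic signal -> magnitude = amplitude envelope
        F = np.fft.fft(band)
        h = np.zeros(n)
        h[0] = 1
        h[1:(n + 1) // 2] = 2
        if n % 2 == 0:
            h[n // 2] = 1
        envs.append(np.abs(np.fft.ifft(F * h)))
    env = np.mean(envs, axis=0)
    step = int(fs / out_fs)                      # crude decimation to 250 Hz
    return env[::step]
```

For a pure tone falling inside a single band, the wideband envelope is flat at 1/8 of the tone amplitude, since only one of the eight band envelopes is non-zero.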
Mutual information between lip movements and corresponding auditory speech signal
To identify the main oscillatory activity conveyed by the lip movements in each visual stimulus, we determined at which theta frequency the auditory and visual speech signals showed significant dependencies. To do so, we examined the audiovisual speech frequency spectrum (1 to 20 Hz) and computed the mutual information (MI) between the minor axis information and the speech envelope signals sampled at 250 Hz. MI measures the statistical dependence between two variables with no prior hypothesis, and with a meaningful effect size measured in bits (Ince et al., 2017; Shannon, 1948). We applied the Gaussian Copula Mutual Information (GCMI) approach described in Ince et al. (2017), in which the MI between two signals corresponds to the negative entropy of their joint copula-transformed distribution. This method provides a robust, semiparametric lower-bound estimator of MI by combining the statistical theory of copulas with the closed-form solution for the entropy of Gaussian variables, allowing good estimation for circular variables such as phase, as well as for power. For each movie, the complex spectrum was normalized by its amplitude to obtain a 2D representation of the phase as points lying on the unit circle for both the lip movement and auditory envelope time-series. The real and imaginary parts of the normalized spectra were rank-normalized separately, and the phase dependence at each frequency between the two 2D signals was estimated using the multivariate GCMI estimator, giving a lower-bound estimate of the MI between the phases of the two signals. Here, we applied the GCMI analyses in two conditions to determine the frequency of interest in each movie: first, we computed MI between corresponding lip and envelope signals as well as between non-matching signals (i.e. lip time-series paired with random auditory envelope signals).
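A stripped-down version of this phase-MI estimate can be written as follows (illustrative Python only: it omits the bias corrections of the actual GCMI implementation of Ince et al., 2017, and the function names are ours):

```python
import numpy as np
from statistics import NormalDist

def copnorm(x):
    """Copula-normalize: empirical CDF ranks -> standard normal quantiles."""
    ranks = np.argsort(np.argsort(x)) + 1
    z = NormalDist().inv_cdf
    return np.array([z(r / (len(x) + 1)) for r in ranks])

def gcmi_phase(phase_x, phase_y):
    """Lower-bound MI (bits) between two circular variables, each
    represented as a 2-D point (cos, sin) on the unit circle and
    copula-normalized per dimension (after Ince et al., 2017;
    bias correction omitted for brevity)."""
    X = np.column_stack([copnorm(np.cos(phase_x)), copnorm(np.sin(phase_x))])
    Y = np.column_stack([copnorm(np.cos(phase_y)), copnorm(np.sin(phase_y))])
    XY = np.hstack([X, Y])

    def ldet(M):
        # log-determinant of the sample covariance
        return np.linalg.slogdet(np.cov(M, rowvar=False))[1]

    # I(X;Y) = 0.5 * (ln|Cxx| + ln|Cyy| - ln|Cxy|), converted to bits
    return 0.5 * (ldet(X) + ldet(Y) - ldet(XY)) / np.log(2)
```

Phases that are a deterministic shift of each other yield a large MI, while unrelated phases yield an MI near zero (the estimate is non-negative by construction, since the joint determinant never exceeds the product of the block determinants).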
For the matching signals, the averaged MI spectrum showed a clear peak in the expected 4 – 8 Hz theta band. In contrast, there was no relationship between random auditory and visual signal pairs, which produced a flat profile along the whole spectrum (see Supplementary Information Figure S3A). These results are well in line with previous studies using coherence or MI measures, and confirm the temporal coupling between lips and auditory speech streams at the expected syllable rate in our videos (Park et al., 2016; 2018; Chandrasekaran et al., 2009). Second, for each movie, we performed a peak detection on the MI spectrum to determine which specific frequency carried most theta information, in order to maximize entrainment in the tone detection and silent movie perception tasks (4 Hz peak: 16 videos; 5 Hz peak: 15 videos; 6 Hz peak: 9 videos; 7 Hz peak: 13 videos; 8 Hz peak: 7 videos; see Supplementary Information Figure S3B).
Audio tones and white noise
Pure auditory tones and white noise stimuli were generated using in-house Matlab code. The target tone consisted of a 100 ms sinusoidal signal at 1 kHz (sampling rate: 44,100 Hz). The noise consisted of a Gaussian white noise lasting two seconds in the calibration task and five seconds in the tone detection task (the white noise was generated only once and loaded during each procedure to ensure that all participants were tested with the same noise; sampling rate: 44,100 Hz). Both the tone and the white noise signals were normalized between −1 and 1 (arbitrary units).
Tone onsets
For each trial, the target tones were embedded in the white noise at predetermined pseudo-random onsets counterbalanced across conditions (zero, one or two tones per trial, 100 trials per condition). In the calibration task, which served to determine the individual threshold of target tone detection (see below for the general procedure), there was at most one tone per trial. In the one tone condition, the onset of the target tone always occurred randomly between 300 and 1400 ms after trial onset, allowing participants to detect it properly and respond before the end of the trial. In the zero tone condition, the auditory track consisted of two seconds of white noise only. In the tone detection task, there could be zero, one or two tones per trial. In the one tone condition, the onset of the target always occurred randomly between 300 and 4500 ms after trial onset. In the two tones condition, the first tone occurred randomly in a time-window centred on the first half of the trial, between 300 and 3000 ms (mean first tone onset = 1.68 ± 0.78 s). The second tone occurred in a time-window centred on the second half of the trial, between 1000 ms after the first tone onset and 4500 ms after trial onset (mean second tone onset = 3.45 ± 0.62 s). This design gave participants enough time to detect and respond to both tones, and kept the two tones temporally unrelated to each other. In the zero tone condition, the auditory track consisted of five seconds of white noise only. The signal-to-noise ratio between target tones and white noise was determined individually for each participant from the calibration task performance and adjusted accordingly in the subsequent tone detection task (see below).
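The onset constraints for the tone detection task can be made concrete with a small sketch (hypothetical Python helper; the actual stimuli were prepared with in-house Matlab code, and the condition labels are ours):

```python
import random

def draw_tone_onsets(rng, condition):
    """Pseudo-random tone onsets (s) for one 5-s detection-task trial,
    following the constraints described above; rng is random.Random."""
    if condition == "none":
        return []
    if condition == "one":
        return [rng.uniform(0.3, 4.5)]
    # two tones: first in the first half, second at least 1 s later
    first = rng.uniform(0.3, 3.0)
    second = rng.uniform(first + 1.0, 4.5)
    return [first, second]
```

Every draw in the two tones condition then satisfies both the first-half constraint on the first tone and the 1 s minimum separation before the second.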
Procedure of the calibration task and tone detection task (TDT)
The experiment began after the completion of a safety-screening questionnaire and the provision of informed consent. Participants sat in a well-lit testing room at approximately 60 cm from the centre of the screen and wore the insert earphones for sound presentation. Participants first performed a short pure tone detection task with no visual stimuli (i.e. the calibration task). This task served to determine the individual threshold at which each participant detected ∼70 – 80 % of the target tones in the auditory modality alone, and hence the signal-to-noise ratio (SNR) between the amplitude of the target tones and the white noise to be implemented in the following tone detection task (TDT). The calibration task was composed of a four-trial practice to identify the target tone itself, followed by five blocks of 20 trials each. Each trial began with a black fixation cross (500 – 1000 ms duration, jittered) followed by the presentation of a red cross over a grey background for two seconds, indicating the period of possible target tone occurrence. A continuous white noise was played during the red cross presentation. In 50 % of the trials, a single audio tone was embedded in the white noise at an unpredictable onset, and participants had to press the “1” key as fast and accurately as possible only when they perceived a target tone. The pseudo-random sequence of the procedure ensured that there were never more than two consecutive trials of the same condition. Participants received no feedback and the procedure continued to the next trial after the end of the two-second white noise. The signal-to-noise ratio was adjusted following an adapted two-down, one-up staircase procedure (see Leek, 2001): for the first five trials, the SNR was fixed (mean white noise power of 0.981) and served as a common starting point across participants.
After each trial, the participant’s keypress response was stored to adjust the SNR for the next trial as follows: after two successive hits, the SNR was decreased by 2 % of the starting signal energy on the next trial. After two successive correct rejections (i.e. no response when no tone occurred), or one correct rejection following a hit, the SNR was kept identical for the next trial. After a miss or a false alarm, the SNR was always increased by 2 % of the starting signal energy. At the end of the calibration task, the individual SNR was averaged over the last 30 trials and stored for the subsequent tone detection task (mean calibration accuracy rate: 0.75 ± 0.05). Participants took a short break and were reminded of the instructions before starting the tone detection task proper. The calibration task lasted approximately seven minutes.
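For illustration, the adaptive staircase rule described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors’ implementation: the function name and the decision to reset the hit counter after a correct rejection are our assumptions.

```python
def staircase_snr(responses, start_snr=1.0, step=0.02):
    """Two-down, one-up staircase sketch (illustrative names).

    responses: list of trial outcomes, each one of
        'hit', 'miss', 'correct_rejection', 'false_alarm'.
    Returns the SNR trajectory (one value per trial, starting at start_snr).
    """
    snr = start_snr
    trajectory = [snr]
    consecutive_hits = 0
    for outcome in responses:
        if outcome == 'hit':
            consecutive_hits += 1
            if consecutive_hits == 2:        # two successive hits -> harder
                snr -= step * start_snr      # step is 2 % of starting energy
                consecutive_hits = 0
        elif outcome in ('miss', 'false_alarm'):
            snr += step * start_snr          # any error -> easier
            consecutive_hits = 0
        else:                                # correct rejection: no change
            consecutive_hits = 0             # (resetting the streak is our assumption)
        trajectory.append(snr)
    return trajectory
```

Averaging the final 30 entries of the returned trajectory would then give the SNR carried forward to the TDT.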
The main structure of the TDT was the same as in the preceding calibration task. The TDT comprised a short four-trial practice followed by 300 trials divided into six blocks of 50 trials each, separated by breaks (the sixty silent movies were each repeated five times to generate the 300 trials). Each trial began with a red fixation cross (500 – 1250 ms duration, jittered). Then, a random five-second silent movie was presented with a black fixation cross in the centre of the screen to give participants a point to gaze at and reduce saccades. Continuous white noise was played together with the silent movie according to the three random conditions: no tone (100 trials), one single tone (100 trials) or two tones (100 trials) hidden in the white noise. Participants were instructed to press the “1” key as fast and accurately as possible only when they perceived a target tone. Participants received no feedback on their responses, and the procedure continued with the next trial after the end of the silent movie. The SNR between the tones and the white noise was the one determined in the preceding calibration task, as explained above. The TDT lasted approximately 50 minutes.
TDT conditions
The condition of interest containing two tones (i.e. first and second tone) served to assess our main hypothesis that entrainment builds up over time with the perception of visual information conveyed by the speaker’s lip movements. Accordingly, the second tones should be better detected and associated with greater theta entrainment than the first tones, reflecting enslavement of the auditory system by the visual system entrained to lip movements. The zero and single tone conditions were additional controls: the zero tone condition served to determine false alarm rates (i.e. participants’ keypresses in the absence of a tone) and to check whether participants tended to respond more often at later tone onset delays (i.e. a time-dependent response bias). The single tone condition served to counterbalance the number of trials containing two tones and to control for the predictability of the second tone. Replicating the performance observed in the two tones condition by sorting single tones according to whether their onsets fell in the first or second tone window would confirm that the better detection of the second tone is not due to its predictability from a preceding tone, but only to its position in time. The pseudo-random sequence of the procedure ensured that there were never more than three consecutive trials of the same condition.
Perception of silent movies-EEG task
Movies
The movies presented during the silent lips perception task were the exact same 60 movies used in the previous tone detection task. The order of movies was randomized across participants.
Procedure
Participants first completed a safety-screening questionnaire and provided informed consent, then sat in a well-lit testing room at approximately 60 cm from the centre of the screen. After the EEG cap had been fitted, participants were instructed to attend to all the movies quietly and to avoid moving during the presentation. Each trial was preceded by a central fixation cross (500 – 1250 ms duration, jittered) followed by the presentation of a random five-second movie. A central fixation cross was displayed during the movie presentation to give participants a point to gaze at and reduce excessive saccades. Participants were instructed to attend to each movie carefully and to rate its emotional content based on the speaker’s facial gestures, using the number keys on the keyboard after the presentation (i.e. 1 for neutral through 5 for very emotional; results not reported). The presentation of all sixty movies lasted approximately 10 minutes.
Online EEG recordings: Continuous EEG was recorded using a 128-channel BioSemi ActiveTwo system (BioSemi, Amsterdam, Netherlands). Vertical and horizontal eye movements were recorded from additional electrodes placed approximately one cm to the left of the left eye, one cm to the right of the right eye, and one cm below the left eye. Online EEG signals were digitized using BioSemi ActiView software at a sampling rate of 2048 Hz. For each participant, the positions of the electrodes on the scalp were tracked using a Polhemus FASTRAK device (Colchester) and recorded with Brainstorm (Tadel et al., 2011) implemented in MATLAB.
Offline EEG preprocessing: EEG data were preprocessed offline using the Fieldtrip (Oostenveld et al., 2011) and SPM8 toolboxes (Wellcome Trust Centre for Neuroimaging). Continuous EEG signals were bandpass filtered between one and 100 Hz and bandstop filtered (48 – 52 Hz and 98 – 102 Hz) to remove line noise at 50 and 100 Hz. Data were epoched from 2000 ms before to 7000 ms after stimulus onset, and downsampled to 512 Hz. Bad trials and channels with artefacts were excluded by visual inspection and numerical criteria (e.g. variance and kurtosis) before applying an independent component analysis (ICA) to remove components related to ocular artefacts. Bad channels were then interpolated using triangulation of nearest neighbours. After re-referencing the data to the average reference, trials with artefacts were rejected manually in a final visual inspection. On average, 4.48 ± 2.48 trials were removed and 4.04 ± 1.82 channels were interpolated per participant.
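The filtering and downsampling steps can be illustrated with a minimal Python/SciPy sketch. This is not the Fieldtrip/SPM8 implementation used in the study: the Butterworth design, filter orders, and the decimation-by-4 approach are our assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs=2048.0):
    """Zero-phase filtering sketch approximating the described pipeline.

    eeg: array with time as the last axis, sampled at fs Hz.
    """
    # 1-100 Hz bandpass (order and filter family are assumptions)
    b, a = butter(4, [1.0, 100.0], btype='bandpass', fs=fs)
    out = filtfilt(b, a, eeg, axis=-1)
    # bandstop filters removing line noise at 50 Hz and its 100 Hz harmonic
    for band in [(48.0, 52.0), (98.0, 102.0)]:
        b, a = butter(2, band, btype='bandstop', fs=fs)
        out = filtfilt(b, a, out, axis=-1)
    # downsample 2048 Hz -> 512 Hz; plain decimation is safe here because
    # the 100 Hz lowpass edge is well below the new 256 Hz Nyquist limit
    return out[..., ::4]
```

The zero-phase `filtfilt` call matters for the later phase analyses, since a causal filter would introduce frequency-dependent phase delays.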
Head models
For the 22 participants without individual MRI scans, the MNI-MRI and volume conduction templates provided by Fieldtrip were used to construct the head models. Electrode positions of each participant were aligned to the template head model. Source models were prepared with the template volume conduction model and the aligned individual electrode positions, following standard procedures. One participant provided his own MRI scans, and his head model was built from his structural scans (Michelmann et al., 2016): the MRI scans were segmented into four layers (i.e. brain, CSF, skull and scalp) using the Statistical Parametric Mapping 8 (SPM8; http://www.fil.ion.ucl.ac.uk/spm) and Huang toolboxes (Huang et al., 2013). The volume conduction model was constructed using the dipoli method implemented in Fieldtrip, and the participant’s electrode positions were aligned to his individual head model. Finally, his MRI was warped onto the Fieldtrip MNI template MRI, and the inverse of the warp was applied to a template dipole grid so that each grid point was positioned in the same normalized MNI space as for the other participants, allowing further group analyses.
Source localization during silent movie perception
Source analyses on EEG data recorded during silent movie presentation were run using individual electrode positions, grid positions and the template volume conduction model. For the participant who provided his MRI scans, source analyses were calculated using normalized grid positions instead. Source activity was reconstructed using the linearly constrained minimum variance beamforming method implemented in Fieldtrip (LCMV; see Van Veen et al., 1997). Neural entrainment to lip movements at the source level was determined by computing mutual information (MI) between EEG epochs and the lip movements during silent movie presentation (i.e. the lips time-series from the silent movie presented during the trial). To test our hypothesis that entrainment builds up in time with perceived theta lips activity, we contrasted the MI in the time-window equivalent to the second tone window (MIsecond) with the MI in the time-window equivalent to the first tone window (MIfirst) of the previous TDT. Accordingly, we expected first to observe an increase of theta activity in the visual cortex reflecting entrainment to lip movements, and second an equivalent pattern in the auditory correlates reflecting tuning by visual activity. For each single trial, MI was first computed separately in the equivalent first (0 to 2.5 seconds after trial onset; MIfirst) and second time-windows (2.5 to 5 seconds after trial onset; MIsecond) at the 2020 virtual electrodes, using the same approach described in the stimuli analysis section (i.e. where we established which frequency carried most correspondence between lips and envelope signals for each video, using a wavelet transform to compute the phase). Second, for each single trial, the MI spectrum was realigned with respect to the frequency bin (± 2 Hz) corresponding to the peak of MI between lips and envelope signals established in the movie analyses.
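As an illustration of MI between two phase time-series, a simple plug-in histogram estimator can be sketched as follows. This is a stand-in: the bin count, and whether the actual analysis used a different (e.g. bias-corrected) estimator, are not specified here.

```python
import numpy as np

def phase_mi(phase_x, phase_y, n_bins=8):
    """Mutual information (bits) between two phase time-series,
    estimated by discretising each phase into equal-width bins.
    A simple plug-in estimator for illustration only."""
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    bx = np.digitize(phase_x, edges[1:-1])   # bin indices 0..n_bins-1
    by = np.digitize(phase_y, edges[1:-1])
    joint = np.histogram2d(bx, by, bins=n_bins)[0]
    pxy = joint / joint.sum()                # joint probability table
    px = pxy.sum(axis=1, keepdims=True)      # marginals
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                             # avoid log(0) terms
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

Two identical phase series yield an MI near the entropy of the binned phase (log2 of the bin count for uniform phases), while independent series yield an MI near zero, up to a small positive sampling bias.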
This step made it possible to average all trials together while taking into account the dominant theta activity carried by each individual movie. For instance, if the peak of MI between lip movements and auditory envelope was found at 4 Hz in video number 1, the realigned MI spectrum between EEG and lips signals for trials presenting video number 1 spanned 4 ± 2 Hz (2 to 6 Hz; 1 Hz bins), ensuring that the central bin of each single trial corresponded to the objectively determined frequency peak of theta activity. Third, the realigned MIs of single trials were averaged across trials within each participant for further group analyses. For each participant, we calculated the normalized difference of MI at the frequency bin of interest (MI normalization: (MIsecond − MIfirst)/MIfirst; third bin in the realigned MI spectrum), i.e. the equivalent second tone window minus the equivalent first tone window, at all 2020 virtual electrodes. Finally, the normalized difference of MI between the second and first tone time-windows was grand averaged across participants, and the grand average was interpolated onto the MNI MRI template. The coordinates of the auditory and visual sources of interest were determined by finding the maxima of the MIsecond − MIfirst difference in regions corresponding to the auditory and visual areas, defined using the automated anatomical labelling atlas (AAL).
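The realignment and normalization steps above can be sketched as follows, assuming a 1 Hz frequency grid; the function names are illustrative, not the authors’ code.

```python
import numpy as np

def realign_mi(mi_spectrum, freqs, peak_hz, halfwidth_hz=2):
    """Select the MI values in peak_hz +/- halfwidth_hz so that the
    central bin of every trial corresponds to that movie's
    lip/envelope MI peak (sketch with assumed 1 Hz bins)."""
    sel = (freqs >= peak_hz - halfwidth_hz) & (freqs <= peak_hz + halfwidth_hz)
    return mi_spectrum[sel]

def normalised_mi_difference(mi_first, mi_second):
    """Normalized MI change: (MI_second - MI_first) / MI_first."""
    return (mi_second - mi_first) / mi_first
```

For a movie whose lips/envelope MI peaked at 4 Hz, `realign_mi` returns the 2 – 6 Hz bins with the 4 Hz bin in the centre, so averaging across trials stacks each movie’s own theta peak onto the same central bin.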
Source reconstruction
We performed time-series reconstruction to investigate theta synchronization between the two sources of interest during silent movie presentation. The time-series data were reconstructed and extracted at the visual and auditory coordinates determined by the source localization analysis. LCMV beamformer reconstruction can yield arbitrary source dipole orientations, which can ultimately affect phase analysis results. To get around this issue, the event-related potentials (ERPs) time-locked to movie onsets at the visual and auditory sources were plotted to identify the visual component, i.e. the N1-P2-N2 waveform (Wang et al., 2018). After visual inspection, the reconstructed time-series were flipped in sign by multiplying them by −1 whenever a visual or auditory source ERP showed the opposite of the expected direction of a visual component (i.e. negative-positive-negative polarity). This “flipping” correction was applied consistently across all trials before sorting data between the first and second time-windows, so it did not bias results towards our hypothesis. The same phase coupling analyses were computed with unflipped source data as a control. Phase angle differences between visual and auditory theta activities in the first and second windows were also non-uniformly distributed according to Rayleigh tests, with significantly different mean angles according to a Kuiper’s test, confirming that the flipping procedure merely allowed the modulation of phase coupling with entrainment to be better captured.
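A sign-flip correction of this kind can be sketched as follows. Note that this sketch automates the decision with a correlation criterion against an expected ERP template, whereas the study relied on visual inspection of the N1-P2-N2 waveform; the function name and criterion are our assumptions.

```python
import numpy as np

def fix_dipole_sign(trials, erp_template):
    """Flip a beamformer source time-series if its trial-averaged ERP
    correlates negatively with the expected ERP template.

    trials: (n_trials, n_times) source time-series for one source.
    erp_template: (n_times,) expected ERP waveform.
    """
    erp = trials.mean(axis=0)
    # negative correlation -> the dipole points the "wrong" way;
    # multiply every trial by -1 so the correction cannot favour
    # one time-window over another
    if np.corrcoef(erp, erp_template)[0, 1] < 0:
        return -trials
    return trials
```

Because the same multiplication is applied to every trial of a source, relative phase differences between the first and second time-windows are untouched, which is why the correction cannot bias the windowed comparison.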
Theta phase coupling between auditory and visual sources
First, the auditory signal was projected orthogonally onto the visual signal by applying a Gram-Schmidt process (GSP; Hipp et al., 2012) to single trials before computing phase information. This was done to reduce noise correlation patterns reflecting activity from a common source (i.e. volume conduction) captured at different electrodes (in that case, phase alignment reflects the same source activity rather than phase coupling between two distinct sources). The GSP increases the signal-to-noise ratio by leaving intact the genuine activities conveyed at the two distinct sources while reducing the weight of correlated noise (see Hipp et al., 2012). Second, for each trial the instantaneous theta phases of the auditory and visual orthogonalized time-series were computed by applying a Hilbert transform with a bandpass filter centred on the frequency bin of the MI peak ± 2 Hz, according to the mean theta frequency of the video presented during the trial. Third, the difference of unwrapped instantaneous phase between the auditory and visual sources was computed for each single trial at each time-point in two windows corresponding to the first (0.5 to 2 seconds after trial onset) and second tone windows (3 to 4.5 seconds after trial onset). The first and last 500 ms at the edges of the epoch were excluded from phase coupling analyses to avoid trial onset and offset responses. Additionally, the phase-slope index (PSI) was calculated in the first and second time-windows between the left auditory and visual sources to estimate the directionality of information flow between our two sources of interest, using the Fieldtrip procedure (Nolte et al., 2008). The PSI analysis revealed negative values in both time-windows (psifirst tone = −0.035 ± 0.055 and psisecond tone = −0.041 ± 0.035, respectively), confirming that the left visual source led the left auditory source during silent movie perception.
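The orthogonalization and phase-difference steps can be sketched as follows. This is a simplified, real-valued, time-domain Gram-Schmidt step (Hipp et al., 2012, orthogonalize in the complex spectral domain), and the filter design is an assumption.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def theta_phase_difference(visual, auditory, fs, peak_hz):
    """Instantaneous theta phase difference (auditory minus visual)
    after removing the zero-lag component shared with the visual
    signal from the auditory signal. Illustrative sketch only.

    visual, auditory: 1-D source time-series at sampling rate fs.
    peak_hz: the movie's lips/envelope MI peak; band is peak_hz +/- 2 Hz.
    """
    # Gram-Schmidt step: subtract the auditory signal's projection onto
    # the visual signal (the component attributable to a common source)
    proj = (auditory @ visual) / (visual @ visual) * visual
    auditory_orth = auditory - proj
    # band-limit to the movie-specific theta band, zero-phase
    b, a = butter(3, [peak_hz - 2.0, peak_hz + 2.0], btype='bandpass', fs=fs)
    pv = np.angle(hilbert(filtfilt(b, a, visual)))
    pa = np.angle(hilbert(filtfilt(b, a, auditory_orth)))
    return np.unwrap(pa) - np.unwrap(pv)
```

In the study, the resulting per-time-point differences were then evaluated in the 0.5 – 2 s and 3 – 4.5 s windows, discarding 500 ms at each epoch edge.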
Quantification and statistical analysis
The tone detection and silent lips perception tasks used a within-subject design.
Tone detection performances
Hit rates (i.e. correctly detected tones) and false alarm rates (i.e. keypress responses during the zero tone condition, allocated to the first or second tone window depending on their onsets) were computed to calculate each participant’s mean sensitivity index (d’) in the single tone and two tones conditions. The reaction times of the hits were used to calculate each participant’s mean reaction times in the same two conditions. Additionally, we calculated each individual’s mean correct response rates and reaction times with the two conditions concatenated, in order to exclude potential outliers blindly, without favouring the results towards our hypothesis, as follows: participants performing below chance level (correct response rate < 0.5) or perfectly (correct response rate = 1), or with mean reaction times outside the range of the grand-averaged reaction times ± two standard deviations, were excluded. Accordingly, four participants were excluded from the analyses (two performed below chance level, one performed perfectly, and one had reaction times slower than the grand average mean + two standard deviations). Paired-samples t-tests were conducted on the averaged d’ scores and hit reaction times between the first and second tones, separately for the two tones and single tone conditions. Additionally, a paired-samples t-test was performed on false alarm rates from the first and second windows in the zero tone condition to control for any response bias over time.
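The sensitivity index follows the standard signal-detection formula d’ = z(hit rate) − z(false alarm rate). A minimal sketch (the clipping constant guarding against rates of exactly 0 or 1 is our assumption):

```python
from scipy.stats import norm

def d_prime(hit_rate, fa_rate, clip=0.999):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).
    Rates are clipped away from 0 and 1 so the normal quantile
    function stays finite (clipping constant is illustrative)."""
    h = min(max(hit_rate, 1 - clip), clip)
    f = min(max(fa_rate, 1 - clip), clip)
    return norm.ppf(h) - norm.ppf(f)
```

For example, a 75 % hit rate with a 25 % false alarm rate gives d’ ≈ 1.35, whereas equal hit and false alarm rates give d’ = 0 (no sensitivity).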
Visual entrainment to theta activity conveyed by lip movements
To bridge visual entrainment and auditory processing, we related the tone target onsets to the theta activity conveyed by the lip movements during silent movie perception. First, for each movie the theta phase of the lip movements’ time-series was computed by applying a Hilbert transform with a bandpass filter centred on the frequency bin of the MI peak ± 2 Hz, according to the mean theta frequency determined in the MI stimuli analyses. Second, we computed the instantaneous theta phase of the lips signal at the onset of each tone occurring during the trial. All further circular statistics on angular data were performed using the CircStat toolbox in MATLAB (Berens, 2009). Circular uniformity in the first and second tone windows, within and across participants, was estimated separately by applying Rayleigh tests to calculate the mean direction and resultant vector length from hit/miss trials. To statistically assess the strength of visual entrainment between the first and second tone windows in the two tones condition (hits only), we performed a permutation test on the difference in resultant vector length (z-value), second tone minus first tone, reflecting the effect size. For each participant, we generated 10000 iterations as follows: first, the hit trial labels were shuffled between the first and second tones in the two tones condition. Second, two balanced subsamples of shuffled trials were selected, with a number matching the smaller number of trials between the first and second tone hits. Third, the mean phases of the first and second tone shuffled trials were computed for each iteration and each participant. Fourth, a Rayleigh test of uniformity was applied to the mean phases to determine a resultant vector length for the first and second tones per iteration (i.e. a z-value).
For each iteration, we computed the difference z-value(second tone) − z-value(first tone) to quantify the effect size, and the resulting 10000 z-value differences were sorted in descending order. To estimate the final p-value and test the null hypothesis, the z-value difference between the original first and second tone data was ranked within the sorted permuted z-value differences, and its rank was divided by the total number of permutations + 1. If the p-value was smaller than α = 0.05, we rejected the null hypothesis H0 that there is no difference in resultant vector length between the first and second tones (i.e. visual entrainment is significantly greater in the second tone window). The exact same approach was applied to the single tone condition, as well as to the hit versus miss entrainment comparisons.
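The permutation logic can be sketched as follows. This sketch pools and reshuffles trial phases in one step per iteration (a simplification of the per-participant, balanced-subsample procedure described above) and computes the Rayleigh z as n·R², with R the resultant vector length.

```python
import numpy as np

def rayleigh_z(phases):
    """Rayleigh z statistic: z = n * R^2, with R the resultant
    vector length of the phase sample."""
    phases = np.asarray(phases)
    r = np.abs(np.mean(np.exp(1j * phases)))
    return len(phases) * r ** 2

def permutation_test(first_phases, second_phases, n_iter=10000, seed=0):
    """One-sided permutation p-value for z(second) - z(first) > 0
    (simplified sketch of the described procedure)."""
    rng = np.random.default_rng(seed)
    n = min(len(first_phases), len(second_phases))   # balanced subsamples
    observed = rayleigh_z(second_phases[:n]) - rayleigh_z(first_phases[:n])
    pooled = np.concatenate([first_phases, second_phases])
    null = np.empty(n_iter)
    for i in range(n_iter):
        perm = rng.permutation(pooled)               # shuffle window labels
        null[i] = rayleigh_z(perm[:n]) - rayleigh_z(perm[n:2 * n])
    # rank of the observed difference within the null distribution
    return (np.sum(null >= observed) + 1) / (n_iter + 1)
```

A small observed difference lands in the middle of the null distribution (large p), while strong clustering confined to the second window lands in its extreme tail (p near 1/(n_iter + 1)).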
Subpopulations of group 1 and group 2
Participants were sorted into two subpopulations according to their preferred theta phase in the second tone window, where visual entrainment had supposedly taken place after sufficient lip movement input in the condition of interest (i.e. the two tones condition). Rayleigh tests revealed non-uniform distributions of preferred phase at the second tones for group 1 (n = 11; µ = 348.83°; p < 0.001) and group 2 (n = 10; µ = 249.63°; p < 0.001). A Kuiper two-sample test confirmed that the mean preferred phases differed between groups 1 and 2 (k = 121; p < 0.01).
Audio-visual theta synchrony in the silent lips perception-EEG task
The phase coupling between the auditory and visual sources was estimated through their theta phase angle difference in time-windows equivalent to the first and second tone windows of the TDT. To quantify the improvement of phase coupling with entrainment, we computed the resultant vector length r of the distance between the observed ϕA−V in the data and a fixed ϕA−V = 0, in the first and second windows separately (ϕA−V = 0 meaning that visual and auditory theta phases are perfectly aligned, with a constant offset of zero at each time-point of the window). Better synchrony between visual and auditory areas would be reflected by a more consistent distribution of ϕA−V towards a particular angle (i.e. 0). For each trial, we calculated the resultant vector length of the distance between the real auditory-visual phase offset and the theoretical phase offset of 0° at each time-point in the two time-windows. The resultant vector length was collapsed across time in the first and second windows separately, resulting in two values per trial. Single-trial values in the first and second windows were then averaged across trials for each participant, and the difference in phase entrainment values was assessed with a paired-samples t-test.
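One plausible implementation of this alignment measure is sketched below: the mean resultant vector of the auditory-visual phase offsets is read out as its projection onto the 0° direction, so that offsets clustering at 0° score high while offsets clustering at 180° score low. The exact computation used in the analysis may differ.

```python
import numpy as np

def alignment_to_zero(phase_diff):
    """Score in [-1, 1] for how consistently the A-V phase offsets
    cluster at the theoretical 0-degree offset.

    phase_diff: per-time-point auditory-visual phase differences (rad).
    Returns the real part of the mean resultant vector: 1 when all
    offsets are exactly 0, about 0 for uniform offsets, -1 at pi.
    """
    v = np.mean(np.exp(1j * np.asarray(phase_diff)))
    return float(v.real)
```

Computed per trial in each window and then averaged across trials, such values would support the described paired comparison between the first and second windows.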
SUPPLEMENTARY INFORMATION
Supplemental Information includes four figures, one silent movie (Video_1) and one white noise containing two tones (Audio_1) used in the TDT and EEG tasks.
Distinct preferred phases showed performance differences between two subpopulations of listeners
The behavioural data of the TDT suggest that two separate subpopulations entrained to different preferred theta phases in the second tone time-window (see Figure 2A, lower panel, and Figure S2B below). In a post-hoc analysis, we assessed whether these apparently distinct populations also differed in tone detection performance. Arguably, any difference should be pronounced only once visual entrainment had eventually taken place (second tone window), but not early in the trial. Participants were sorted into two groups based on their mean theta phase in the second tone window (ngroup1 = 11 and ngroup2 = 10; see STAR Methods), and we compared detection performance (d’) in the two tones condition by means of a repeated-measures ANOVA with factors tone position and group.
The ANOVA on d’ scores revealed a significant interaction between tone position and group (F(1, 9) = 5.893; p = 0.038). Bonferroni-corrected pairwise t-tests showed that participants from group 2 were better than group 1 at detecting the second tones (T(1, 9) = −0.786; p = 0.028) but not the first tones (T(1, 9) = −0.208; p = 0.629). The results also replicated the effect of tone position (F(1, 9) = 7.715; p = 0.021), with greater d’ for the tones sorted as second than as first. Finally, no main effect of group was found (F(1, 9) = 2.103; p = 0.181). No difference in threshold (SNRgroup1 = 1.39 × 10−3 ± 2.31 × 10−3; SNRgroup2 = 1.43 × 10−3 ± 3.47 × 10−3; T(1, 24) = −0.66; p = 0.95; two-tailed) nor in hit rates (hitgroup1 = 0.761 ± 0.003; hitgroup2 = 0.763 ± 0.004; T(1, 24) = −0.23; p = 0.84; two-tailed) was found in the calibration task, ruling out any hearing difference between the groups.
ACKNOWLEDGMENTS
This work was supported by a Sir Henry Wellcome Postdoctoral Fellowship awarded to E.B. (grant reference number: 210924/Z/18/Z), as well as grants from the ERC (Consolidator Grant 647954) and the ESRC (ES/R010072/1) awarded to S.H., who is further supported by the Wolfson Foundation and the Royal Society. The authors would like to thank the members of the Memory and Attention Lab and David Poeppel for their valuable comments and input during the preparation of this manuscript.