Acoustic contamination of electrophysiological brain signals during speech production and sound perception

A current challenge of neurotechnologies is the development of speech brain-computer interfaces to restore communication in people unable to speak. To achieve a proof of concept of such a system, neural activity can be investigated in patients implanted for clinical reasons while they speak. Using such simultaneously recorded audio and neural data, decoders can be built to predict speech features from features extracted from brain signals. A typical neural feature is the spectral power of field potentials in the high-gamma frequency band (between 70 and 200 Hz), a range that happens to overlap with the fundamental frequency of speech. Here, we analyzed human electrocorticographic (ECoG) and intracortical recordings during speech production and perception, as well as rat micro-electrocorticographic (µ-ECoG) recordings during sound perception. We observed that electrophysiological recordings, obtained with different recording setups, often contain spectrotemporal features of the sound, especially within the high-gamma band. Further analysis and in vitro replication suggest that these correlations are caused by a microphonic effect, which transforms sound vibrations into an undesired electrical noise that contaminates the biopotential measurements. This study does not question the existence of relevant physiological neural information underlying speech production or sound perception in the high-gamma frequency band, but alerts to the fact that care should be taken to evaluate and eliminate any possible acoustic contamination of neural signals when investigating cortical dynamics underlying speech production and auditory perception.


Introduction
The development of brain-computer interfaces (BCI) to restore speech (Guenther et al., 2009; Brumberg et al., 2010; Leuthardt et al., 2011) is a long-term quest that seems within possible reach. Several advances have indeed been made over the past decade regarding the decoding of intracranial brain signals underlying either speech perception (Pasley et al., 2012; Chan et al., 2013; Pasley and Knight, 2013; Fontolan et al., 2014; Hyafil et al., 2015; Yildiz et al., 2016; Akbari et al., 2019) or production (Bouchard et al., 2013; Martin et al., 2014, 2016; Mugler et al., 2014; Cheung et al., 2016; Chartier et al., 2018; Anumanchipalli et al., 2019), and most recent works have tackled with noticeable success the prediction of continuous speech from ongoing brain activity. Because of the difficulty of recording from individual neurons with microelectrodes inserted in speech areas (Bartels et al., 2008; Kennedy et al., 2011; Tankus et al., 2012; Chan et al., 2013), most speech decoding studies use field potential signals in the high-gamma frequency range, which typically covers frequencies from 70 to 200 Hz.
A noticeable feature of acoustic speech signals is the fundamental frequency f0 of the human voice, which corresponds to the vibrational source of speech produced by the vocal folds in the larynx and further modulated by the vocal tract to produce the variety of speech sounds. The fundamental frequency depends on the size of the vocal folds and typically falls around 125 Hz for men and 215 Hz for women (Small, 2012). High-gamma frequencies thus generally overlap the range of the fundamental speech frequency.
Here, we analyzed human electrocorticographic (ECoG) and intracortical recordings during speech production and perception as well as rat microelectrocorticographic (µ-ECoG) recordings during sound perception. We found that electrophysiological recordings are usually contaminated by spectrotemporal features of the sound produced when speaking or played by the loudspeaker. This contamination seems to be the result of a microphonic effect and affects the range of high-gamma frequencies and above. These findings suggest that care should be taken to avoid including these artifacts when investigating cortical signals underlying speech production and perception.

Participants
The present study was conducted as part of the Brainspeak clinical trial (NCT02783391) approved by the French regulatory agency ANSM (DMDPT-TECH/MM/2015-A00108-41) and the local ethical committee. It is based on electrophysiological recordings obtained in 3 patients: a 42-year-old male (P2) and a 29-year-old male (P3) undergoing awake surgery for tumor resection, and a 38-year-old female (P5) implanted for 7 days as part of a presurgical evaluation of her intractable epilepsy. These 3 patients gave their informed consent to participate in the study.

Electrophysiological recordings
Brain activity from participants P2 and P3 was recorded during awake surgery in the operating room just before tissue resection. For participant P2, a 256-electrode array (PMT Corp., USA) was positioned after opening the skull and the dura mater over the left sensorimotor cortex and the tumor (Figure 1, left). Ground and reference electrodes were integrated on the back side of the array and kept wet using compresses soaked with saline. The 16 electrode pigtails were connected to eight 32-channel Cabrio Connectors (Blackrock Microsystems, USA) connected by shielded cables to two front-end amplifiers (FEA, Blackrock Microsystems, USA) for amplification and digitization at 10 kHz. The digitized signals were then transmitted over an optic fiber to two synchronized Neural Signal Processors (NSP, Blackrock Microsystems, USA) interfaced with a computer. For participant P3, a 96-channel intracortical Utah microelectrode array (UEA, Blackrock Microsystems, USA) was inserted in the pars triangularis of Broca's area (Figure 1, middle), at a location that was subsequently resected to access the tumor for its removal. The pedestal serving as ground was screwed to the skull. Two wires with deinsulated tips were inserted below the dura, and one was used as the reference. The electrodes were connected via a Patient Cable (Blackrock Microsystems, USA) to a FEA, where signals were digitized at 30 kHz and further transmitted through an optic fiber to a NSP.
Brain activity from participant P5 was recorded in her room at the hospital. This participant was implanted with a 72-electrode ECoG array (PMT Corp., USA) covering a large portion of her left hemisphere, as well as a 4-electrode strip (PMT Corp., USA) over the left ventral temporal lobe and a 96-electrode UEA inserted in the left ventral sensorimotor cortex (Figure 1, right). Connections to the electrodes were made through the transcutaneous pigtails of the ECoG grid and through the pedestal of the UEA screwed to the skull. The ECoG grid was connected to a FEA linked to a NSP. The Utah array was connected to a Cereplex E headstage (Blackrock Microsystems, USA) ensuring signal amplification and digitization before transmission to a second NSP through a digital hub. For ECoG recordings, one electrode of the strip was used as the reference and another as the ground. For intracortical recordings, the reference was a wire deinsulated at its tip and inserted below the dura, and the ground was the pedestal. Data from both electrode arrays were sampled at 30 kHz and recorded on the two synchronized NSPs.

Audio recordings
For all participants, produced speech was recorded along with the neural data. A microphone (Shure Beta 58A) was positioned about 10-20 cm from the mouth. The signal was amplified using an audio interface (Roland OCTA-CAPTURE) and digitized by one of the NSPs, at the same rate and synchronously with the neural data (see figure 2a).

Task and stimuli
All three participants performed an overt speech production task. They were asked to read aloud short French sentences, which were part of a large articulatory-acoustic corpus acquired previously (Bocquelet et al., 2016b) and made freely available (https://doi.org/10.5281/zenodo.154083). Participant P5 also took part in a protocol involving speech perception, in which she was exposed to the sound of computer-generated vowels delivered by a loudspeaker positioned about 50 cm to her left.

Electrophysiological and audio recordings
To consider data recorded in a different condition, we also analyzed electrophysiological recordings obtained over the left auditory cortex of a ketamine (90 mg/kg)-xylazine (2 mg/kg) anesthetized 600-g adult Sprague Dawley rat using a 64-electrode micro-ECoG array (E64-500-20-60-H64, NeuroNexus Inc, USA). These data were obtained in compliance with European (2010-63-EU) and French (decree 2013-118 of rural code articles R214-87 to R214-126) regulations on animal experiments, following the approval of the local Grenoble ethical committee ComEth C2EA-12 and the ministry authorization 04815-02. A bone screw was used for the ground, and a stainless steel wire inserted below the skin anterior to bregma was used for the reference. Signals were acquired using the RHD2000 acquisition system and two 32-channel RHD2132 headstages (Intan Technologies, USA). To avoid any possible crosstalk inside the Intan acquisition system, the sounds delivered to the rat were recorded on an independent CED Micro1401 (Cambridge Electronic Design, UK). Both acquisition devices were interfaced and synchronized by the Spike2 software with the IntanTalker module (CED programs), and signals were digitized at 33.3 kHz. The time jitter between sound and neural signals was checked to be below 2 ms.

Audio stimuli
Pure tones (3-ms rise, 167-ms plateau and 30-ms fall times) with frequencies ranging from 0.5 to 16 kHz were presented with pseudo-random inter-stimulus intervals of 1.8-2.2 seconds. Sounds were delivered at about 80-90 dB SPL in an open field configuration using a TDT MF1-S speaker. The three lowest tone frequencies further considered in the present study were 0.5, 1 and 2 kHz.
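For illustration, a tone with the trapezoidal envelope described above can be sketched as follows; the linear ramp shape and unit amplitude are assumptions of this example, since only the segment durations are specified:

```python
import numpy as np

def pure_tone(freq_hz, fs=33300, rise_ms=3, plateau_ms=167, fall_ms=30):
    """Pure tone with a trapezoidal amplitude envelope (linear ramps assumed)."""
    rise = int(fs * rise_ms / 1000)
    plateau = int(fs * plateau_ms / 1000)
    fall = int(fs * fall_ms / 1000)
    t = np.arange(rise + plateau + fall) / fs
    envelope = np.concatenate([
        np.linspace(0.0, 1.0, rise),      # 3-ms rise
        np.ones(plateau),                 # 167-ms plateau
        np.linspace(1.0, 0.0, fall),      # 30-ms fall
    ])
    return envelope * np.sin(2 * np.pi * freq_hz * t)

tone = pure_tone(500)  # a 500-Hz stimulus, 200 ms in total
```

The onset and offset ramps avoid the broadband clicks that abrupt sound edges would otherwise introduce.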

In vitro recordings in PBS solution
A 24-electrode ECoG array was placed in a plastic container filled with 1X phosphate-buffered saline (PBS). Two of the electrodes were used as the ground and reference electrodes, respectively. All electrodes were plugged into a clinical headbox (Blackrock Microsystems, USA) connected by shielded cables to the FEA and NSP used for human recordings. A microphone was placed close to the plastic container. The audio data was acquired using the same hardware as for human recordings (see section 2.1.3). Data was acquired at 30 kHz. A plastic box with a removable lid, soundproofed with cotton fiber insulation, was used to reduce sound propagation between the loudspeaker and the array. Twenty 4-s pure tones, with frequencies ranging from 25 to 975 Hz in 50-Hz steps, were played by the same loudspeaker used with participant P5, positioned in front of the plastic container. Each tone was played four times with 2-s inter-stimulus intervals.

Data selection
Ten-minute intervals with consistent speech production were selected from the P2, P3 and P5 recordings. For the P5 recording in the perception condition, the perception intervals were extracted, amounting to approximately 5.5 min. For the animal recording, a 10-min segment was selected. For each recording in PBS solution, the total duration of 9 minutes was kept for analysis. All recordings were visually inspected. For participant P2, 112 electrodes were removed due to several loose connections at the level of the Cabrio Connectors. For participant P5, 1 ECoG electrode showing saturating noise was removed. Segments containing high-power transient artifacts were excluded from further analysis using a threshold-based detection method.
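The threshold-based exclusion step can be sketched as follows; the window length, the use of RMS power, and the five-standard-deviation threshold are illustrative assumptions, as the exact detection criteria are not reported:

```python
import numpy as np

def artifact_mask(signal, fs, win_s=0.2, thresh_sd=5.0):
    """Flag windows whose RMS amplitude exceeds thresh_sd standard
    deviations above the median RMS (simple transient detector)."""
    win = int(win_s * fs)
    n_win = len(signal) // win
    rms = np.array([
        np.sqrt(np.mean(signal[i * win:(i + 1) * win] ** 2))
        for i in range(n_win)
    ])
    limit = np.median(rms) + thresh_sd * np.std(rms)
    return rms > limit  # True = window contains a high-power transient
```

Flagged windows would then simply be dropped from the spectrogram correlation analysis.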

Data pre-processing
The NSPs applied a built-in analog band-pass filter to the recorded data (0.3-2500 Hz at the 10-kHz sampling rate and 0.3-7500 Hz at the 30-kHz sampling rate). Common average referencing was applied only for comparison with the original data, as in figure 2d. To center audio signals, a moving average computed over 1-second windows was subtracted.
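The moving-average centering of the audio signal amounts to subtracting a 1-second boxcar-smoothed baseline, which can be sketched as follows (the convolution-based implementation is an assumption of this example):

```python
import numpy as np

def center_audio(audio, fs, win_s=1.0):
    """Subtract a moving average computed over win_s-second windows
    to remove slow baseline drift from the audio signal."""
    win = int(win_s * fs)
    kernel = np.ones(win) / win
    baseline = np.convolve(audio, kernel, mode="same")
    return audio - baseline
```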

Spectrograms computation
In the present study, a spectrogram refers to the time-varying power spectral density (PSD) computed over a recording channel. For all analyses, spectrograms of neural and audio data were computed at a rate of 50 Hz using 200-ms time windows (weighted by a Hamming function). Mean sound PSDs were computed by averaging the spectrograms of audio signals over all selected time samples. For display purposes, the spectrograms in figure 2a and 2b were computed with higher frequency and time resolutions. These spectrograms were also z-scored within each frequency bin using artifact-free data segments containing the displayed extracts (60-and 30-second segments respectively).
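A sketch of this computation, using 200-ms Hamming windows advancing at 50 Hz (i.e. a 20-ms hop), is given below; the one-sided PSD normalization follows the standard convention and is an assumption of this example:

```python
import numpy as np

def compute_spectrogram(x, fs, win_s=0.2, rate_hz=50):
    """PSD spectrogram: win_s Hamming windows advancing at rate_hz frames/s."""
    nperseg = int(win_s * fs)          # 200-ms window
    hop = int(fs / rate_hz)            # 20-ms hop -> 50 frames per second
    window = np.hamming(nperseg)
    n_frames = 1 + (len(x) - nperseg) // hop
    frames = np.stack([x[i * hop:i * hop + nperseg] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    spec /= fs * np.sum(window ** 2)   # one-sided PSD normalization
    spec[:, 1:-1] *= 2                 # double all bins except DC and Nyquist
    freqs = np.fft.rfftfreq(nperseg, d=1 / fs)
    return freqs, spec.T               # shape (n_freq_bins, n_frames)
```

Averaging the resulting frames over time yields the mean PSD used for the sound signal.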

Spectrograms correlations
For all recordings, the correlations between the neural and the audio spectrograms were computed. For each electrode, the sample Pearson correlation coefficient r between the power amplitudes across time of the electrode and audio signals was computed for each frequency bin separately. For each value of r, a p-value was computed using Student's t-test to test the null hypothesis that r = 0. These p-values were corrected using the Bonferroni method (i.e. by multiplying them by the number of frequency bins times the number of electrodes in the recording) and then used to determine the statistical significance of each correlation coefficient (with a significance level α = 0.05).
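A sketch of this per-bin correlation analysis for one electrode is given below, using `scipy.stats.pearsonr`, whose p-value is based on the same Student's t-test; the array layout is an assumption of this example:

```python
import numpy as np
from scipy import stats

def spectrogram_correlations(neural_spec, audio_spec, n_tests, alpha=0.05):
    """Per-frequency-bin Pearson correlation between neural and audio power.

    neural_spec, audio_spec: arrays of shape (n_freq_bins, n_frames).
    n_tests: total number of comparisons in the recording
             (n_freq_bins x n_electrodes), used for Bonferroni correction.
    """
    n_bins = neural_spec.shape[0]
    r = np.empty(n_bins)
    p = np.empty(n_bins)
    for f in range(n_bins):
        r[f], p[f] = stats.pearsonr(neural_spec[f], audio_spec[f])
    significant = p * n_tests < alpha   # Bonferroni: multiply p by n_tests
    return r, significant
```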

Correlation between ECoG and sound signals during speech production
We observed strong correlations between ECoG and sound spectrograms in participant P2 during speech production. Participant P2's brain activity was recorded with an ECoG grid while he was reading sentences aloud. Simultaneously, a microphone was used to capture the sound of his voice (see figure 2a). Figure 2b shows a portion of the z-scored spectrograms of the sound signal (top) and of an electrode of the ECoG grid (bottom). In this example, the ECoG signal shows a very similar spectrotemporal structure to that of the sound. The time-frequency patterns observed are consistent with human speech and are unlikely to be brain activity. This interpretation is further supported by the in vitro tests described below (see section 3.4), which show equally strong correlations when electrodes are simply immersed in PBS. We therefore attribute the high degree of similarity between the two signals to an acoustic contamination of the electrophysiological measure. We quantitatively assessed this phenomenon by computing, for each frequency bin, the correlation between the power of each electrode signal and that of the sound signal. As shown in figure 2c and in the top of figure 2d, correlations up to 0.6 could be observed depending on the electrode. Up to 370 Hz, the strongest correlations were observed at the frequencies most present in the sound signal, in particular between 115 and 145 Hz, which corresponded to the range of the fundamental frequency of the subject's voice. This correspondence between the peaks of the mean sound PSD and the frequency bins showing high correlations supports the hypothesis that at least part of the correlation is caused by acoustic contamination of the neural data. Above 370 Hz, correlations were low even at frequencies for which the power of the speech signal remained high.
As shown in figure 2d, the correlations between sound and ECoG spectrograms were still present and even exacerbated after common average re-referencing of the ECoG signals.
We carried out the same analysis for the ECoG data of participant P5 during speech production (figure 4a, top). In this case, the significant correlation coefficients reached values up to 0.2. Most of the high-valued coefficients were observed in a narrow band around 225 Hz, which corresponded to the range of the fundamental frequency of the participant's voice.

Correlation between intracortical and sound signals during speech production
In the P3 recording, we further observed statistically significant correlations between the spectrograms of intracortical signals recorded using a Utah array and that of the produced speech signal. Figure 3a shows a portion of the z-scored spectrograms of the subject's voice (top) and of one electrode of the array (bottom). The spectrogram of the selected microelectrode clearly shows spectrotemporal features also observed in the sound spectrogram (between 200 Hz and 400 Hz). Statistically significant correlation coefficients up to 0.7 were observed, with peaks falling in the range of frequencies where the sound signal showed high power (figure 3b). Noticeably, correlations between intracortical and sound signals during speech production were much weaker in participant P5 (figure 4b).

Correlation between electrode and sound signals during sound perception
Statistically significant correlations between electrode and sound signals were not only present during speech production as reported above, but also during sound perception. This phenomenon was observed in human and animal recordings using completely different recording instrumentations.

Human recording
Participant P5 also participated in a paradigm where artificially synthesized speech sounds were presented to her through a loudspeaker positioned on her left. Brain activity was recorded from both ECoG electrodes and intracortical microelectrodes. The sound produced by the loudspeaker was also recorded simultaneously. Performing the same analysis as for the speech production data, we found that ECoG signals showed strong correlations with the sound signal, with peaks up to 0.9 (figure 4c). As observed in recordings during speech production, frequencies showing strong correlations were mostly found in the bands that concentrate most of the sound power. These bands correspond mainly to the pitch of the synthesized sound (185 Hz) and its harmonics. By comparison, the spectrograms of intracortical signals were poorly correlated with that of the sound (figure 4d).

Figure 4. (a, b) Results for the speech production condition using ECoG and UEA data, respectively. (c, d) Results for the speech perception condition using ECoG and UEA data, respectively.

Rat recording
In order to verify that the correlations were not due to our clinical recording system in particular, we performed the same type of analysis on data obtained from an experiment in a rat. The left auditory cortex was recorded using a commercial µ-ECoG grid connected to an Intan neural recording system (figure 5a). In this case, pure tones were delivered in an open field paradigm. As shown in figure 5b, we again observed strong correlations between the electrode and sound spectrograms, with sharp peaks at the specific frequencies of the pure sound stimuli (500, 1000 and 2000 Hz).

In vitro evidence of acoustic contamination
Finally, we used a reduced experimental setup to investigate the cause of the observed correlations (see figure 6). The experiment was designed to verify that the correlations between the sound and the electrode recordings can be obtained without brain activity and to attempt to demonstrate that the correlations originate from the mechanical transmission of sound vibrations. The electrical potentials of ECoG electrodes placed in PBS were recorded while pure tone sounds were played by a loudspeaker. In order to evaluate the intensity of the incident sound, a microphone was placed near the container filled with PBS. A soundproof box was used to insulate either the loudspeaker or the ECoG array. The function of the box was to reduce the propagation of sound from the loudspeaker to the electrodes without substantially interfering with other parameters of the experiment. To determine the impact of sound propagation on the spectrogram correlations, we analyzed the data in open and closed box conditions.
In the first configuration, the loudspeaker was placed in the open box (figure 6a). As in the in vivo experiments, we found that high correlations occurred at some of the frequencies of the sound stimuli. For some electrodes, the value of the correlation coefficient at 125 Hz was larger than 0.9. This result demonstrates that spectrogram correlations similar to those described in sections 3.1-3.3 occur in the absence of any brain activity. In the second configuration, the loudspeaker was placed in the closed box (figure 6b). The reduction of the power of the incident sound due to the insulation is confirmed by the mean sound PSD (figure 6b, top). We observe that most of the correlation coefficients also have much lower values. This result supports the hypothesis of acoustic contamination, i.e. that the spectrogram correlations between sound and electrode data originate from the mechanical propagation of sound to the neural recording hardware. In the third and fourth configurations, the electrode array and the microphone were placed in the box but the rest of the acquisition chain (cables, connector box, FEA, NSP) was left outside. When the box was left open (figure 6c), we observed high correlations at the frequencies of the stimuli, similarly to the previous open box condition (figure 6a). The differences in frequency responses visible in the mean sound PSD can be explained by the modification of the arrangement of the experimental setup. In the last configuration, the box was closed over the electrodes and microphone (figure 6d). The sound insulation provided by the box was confirmed by the large reduction of the mean PSD of the sound stimuli (figure 6d, top). However, as shown in the bottom graph of figure 6d, the spectrogram correlations remained largely unaffected by the closing of the lid over the electrode array, contrary to the previous experiment where the lid was closed over the loudspeaker (figure 6b).
This suggests that the acoustic contamination of the electrical potential measurement may not only occur at the electrode level but also at other levels of the acquisition chain.

Discussion
Data considered in this study include human and animal recordings during speech production and/or sound perception tasks. Across these different setups, we observed statistically significant correlations between the spectrograms of electrophysiological and simultaneously recorded audio signals. These correlations occurred at the frequencies most present in the sound signal. The recordings used ECoG and intracortical microelectrode arrays, interfaced with different data acquisition systems. It is not possible to draw quantitative conclusions about the influence of these factors, but the variety of cases suggests that the observed correlations are a widespread phenomenon.
Motion artifacts are classically seen in electrophysiological signals. In particular, mechanical vibrations may create variations in biopotential measurements (Luna-Lozano and Pallas-Areny, 2010). Such undesired signals may have different origins, including the bending of electrode wires and electrochemical changes at the electrode-electrolyte interface induced by small displacements of the electrodes (Salatino et al., 2017; Nicolai et al., 2018). Here, we observed in different setups a microphonic effect on neural signals, arising from the impact of sound vibrations on the acquisition chain. We could reproduce the phenomenon in a minimal in vitro setup, confirming that sound-electrode correlations do not originate from brain activity. The experiments shown in section 3.4 further suggest that, in the tested setup, the microphonic effect does not necessarily take place at the level of the electrodes, but in the rest of the recording chain. Additional investigation would be required to identify precisely where in the chain, and how, the conversion of acoustic vibrations into electrical signals takes place.
The extent to which the acoustic noise spectrally overlaps with the measured brain activity depends on the nature of the sound and on the studied activity. In the case of ECoG recordings during speech production paradigms (see section 3.1), the overlap between the range of the voice fundamental frequency and the high-gamma band might make it difficult to record an artifact-free signal in this band. As suggested by the results in section 3.3, sound stimuli, and by extension any sound present during the recording, could contaminate the recorded data in any frequency band. It can thus be expected that high-frequency components of the sound might also influence the detection of multi-unit activity in microelectrode recordings (see section 3.2).
In conclusion, the purpose of this study is to alert readers to possible microphonic contamination of neural signals, so that care can be taken to evaluate and eliminate this problem, especially when building decoders of neural activity underlying overt speech production or sound perception. Experimental setups might be improved to become less sensitive to microphonic effects, and signal-processing techniques might be developed to eliminate sound contamination in neural recordings. It should be noted that this report does not question the existence of relevant physiological neural information in high-gamma frequency signals underlying speech production or sound perception. Indeed, it has been shown by several groups that spectral features of imagined speech or silent articulation can be predicted from low- or high-gamma signals recorded in patients not overtly speaking (Pei et al., 2011; Ikeda et al., 2014; Martin et al., 2014, 2016; Bocquelet et al., 2016a; Anumanchipalli et al., 2019). Future developments of speech prostheses should thus build upon these findings.