Duplex perception reveals brainstem auditory representations are modulated by listeners’ ongoing percept for speech

So-called duplex speech stimuli with perceptually ambiguous spectral cues to one ear and an isolated low- vs. high-frequency third formant “chirp” to the opposite ear yield a coherent percept supporting their phonetic categorization. Critically, such dichotic sounds are only perceived categorically upon binaural integration. Here, we used frequency-following responses (FFRs), scalp-recorded potentials reflecting phase-locked subcortical activity, to investigate brainstem responses to fused speech percepts and to determine whether FFRs reflect binaurally integrated category-level representations. We recorded FFRs to diotic and dichotic stop-consonants (/da/, /ga/) that either did or did not require binaural fusion to properly label, along with perceptually ambiguous sounds without clear phonetic identity. Behaviorally, listeners showed clear categorization of dichotic speech tokens, confirming they were heard with a fused, phonetic percept. Neurally, we found FFRs were stronger for categorically perceived speech relative to category-ambiguous tokens and also differentiated phonetic categories for both diotically and dichotically presented speech sounds. Correlations between neural and behavioral data further showed FFR latency predicted the degree to which listeners labeled tokens as “da” vs. “ga”. The presence of binaurally integrated, category-level information in FFRs suggests human brainstem processing reflects a surprisingly abstract level of the speech code typically circumscribed to much later cortical processing.

offer another viable test of perceptual correlates in FFR because information must be fused centrally to generate a category label. Indeed, there is already some evidence that the FFR reflects binaural auditory percepts (Galbraith et al. 1998; Krishnan and McDaniel 1998; Bidelman and Krishnan 2009). Here, we extend these ideas by using stimuli that support "duplex" speech perception (Preisig and Sjerps 2019) to further probe phonetic-level coding in the FFR.
Duplex perception refers to the binaurally fused categorical percept of dichotic stop-consonant stimuli in which an ambiguous portion of the spectrum (i.e., low-frequency "base") is presented to one ear while a disambiguating portion of the spectrum containing the third formant (F3) (i.e., high-frequency "chirp") is presented to the other (Rand 1974; Liberman et al. 1981).

To this end, the current study aimed to evaluate whether speech-FFRs are modulated by listeners' perception, reflect binaural integration, and carry category-level information of the speech signal. We measured FFRs in response to dichotic duplex and diotic speech stimuli in younger adults during a novel categorization paradigm that allows for simultaneous recording of brainstem responses during real-time behavior (e.g., Bidelman 2015; Carter and Bidelman 2023). Critically, the high-frequency bandwidth of our stimuli (> 250 Hz) was designed to far exceed the low-frequency (< 100 Hz) phase-locking capacity of cortical neurons (Joris et al.).

The sample included N=16 monolingual English-speaking young adults (age range=22-28 years, 15 female) with an average of 18.25 ± 1.29 years of education. Participants all had normal hearing (pure tone thresholds ≤25 dB HL; 250-8000 Hz), had varied musical experience (mean=5.56 ± 6.4 years), and were mostly right-handed (mean=76% ± 27%; Edinburgh Handedness Inventory; Oldfield 1971). Each participant provided written consent in compliance with a protocol approved by the Institutional Review Board of The University of Memphis. Participants were paid $10 an hour for their time.

Stimuli and task
Stimuli consisted of synthetic /da/ and /ga/ consonant-vowel (CV) speech tokens that were presented diotically or dichotically (duplex stimuli) to listeners (Figure 1) (stimuli acquired from: ...).

bioRxiv preprint doi: https://doi.org/10.1101/2023.05.09.540018; this version posted May 9, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

... Isolated chirps contained either a high (~2.9 kHz) or low (~2.7 kHz) F3 contour, promoting a /da/ or /ga/ percept, respectively. Critically, these dichotic stimuli require listeners to combine cues from both ears through binaural integration to properly arrive at a categorical label (i.e., "da" vs. "ga"); they cannot be classified via a single ear alone. In addition to duplex tokens, the base and chirp were presented diotically as control conditions. Listeners easily classify these latter tokens since the acoustic signal itself contains all category-relevant cues. The ambiguous base by itself served as an additional control. Each token was 160 ms in duration and tokens were gated with 5 ms ramps. In total, there were five stimulus conditions: ambiguous base + high F3 (promoting the percept of "da"), ambiguous base + low F3 (promoting the percept of "ga"), diotic /da/, diotic /ga/, and the ambiguous base alone.

Single tokens were presented in blocks of 30 repetitions with a rapid ISI (10 ms). After the clustered block of tokens ended, the ISI was slowed to 300 ms and a single token was presented to cue the behavioral response. Participants then indicated their percept (/da/ or /ga/) as quickly and accurately as possible via the keyboard. Following the behavioral response and a period of silence (250 ms), the next trial cluster commenced. This paradigm allowed 1980 presentations per token for input to FFR analysis and 66 tokens for the behavioral response.

Behavior data analysis
We calculated percent identification (percent of presentations of a token identified as /ga/) and reaction times (RTs) per stimulus condition. Improbable RTs (i.e., ≤ 250 ms or > 2500 ms) were treated as fast guesses and lapses of attention, respectively, and were removed from the analysis as outliers (Bidelman and Walker 2017).
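The behavioral scoring described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The trial layout (a list of label and RT pairs), the function name `summarize_behavior`, and the choice to compute percent /ga/ after RT screening are assumptions made for illustration. The example responses are made up.

```python
from statistics import mean

RT_MIN_MS = 250    # RTs at or below this are treated as fast guesses
RT_MAX_MS = 2500   # RTs above this are treated as lapses of attention

# Inferred from the text: 66 behavioral responses per token, each following
# a cluster of 30 repetitions, gives 1980 presentations per token.
N_CLUSTERS = 66
REPS_PER_CLUSTER = 30
assert N_CLUSTERS * REPS_PER_CLUSTER == 1980


def summarize_behavior(trials):
    """Return (percent_ga, mean_rt_ms) for one token after RT screening.

    `trials` is a list of (label, rt_ms) pairs with label "da" or "ga".
    This layout is hypothetical and chosen only for this sketch.
    """
    kept = [(label, rt) for label, rt in trials if RT_MIN_MS < rt <= RT_MAX_MS]
    if not kept:
        return None, None
    percent_ga = 100.0 * sum(label == "ga" for label, _ in kept) / len(kept)
    return percent_ga, mean(rt for _, rt in kept)


# Made-up responses for one token (not data from the study).
example = [("ga", 410), ("ga", 520), ("da", 630),
           ("ga", 200), ("da", 2600), ("ga", 840)]
percent_ga, mean_rt = summarize_behavior(example)
```

In this toy example the 200 ms and 2600 ms responses are discarded, so the summary is computed over the remaining four trials.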

FFR data analysis
We used Brainstorm to generate Fast Fourier Transforms (FFTs) for each token to quantify the spectral information in each response. We then measured the amplitude of the spectral peaks in ...

We should note that the F0 used in the current study is considerably higher than those in nearly all previous work. Although our tokens all had identical voice pitch, we expected changes in F0 amplitude across tokens, indicating a modulation in the strength of the FFR dependent on listeners' online percept (Carter and Bidelman 2023) and category cues integrated from the other ear. Onset latency was measured as the peak in the cross-correlation function between the FFR and evoking stimulus waveform in a 5.5-10 ms search window, the expected onset latency of the brainstem response (Galbraith and Brown 1990; Bidelman and Momtaz 2021).
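As a rough illustration of the two FFR measures described above, the following Python sketch estimates the amplitude at F0 and the stimulus-to-response lag on a synthetic signal. It is a sketch under stated assumptions, not the Brainstorm pipeline used in the study. A single-bin discrete Fourier transform stands in for the full FFT. The function names, the 10 kHz sampling rate, the 7 ms delay, and the choice of the largest positive cross-correlation value are all hypothetical choices for this example.

```python
import cmath
import math


def amplitude_at(signal, fs, freq):
    """Amplitude of the `freq` Hz component, estimated with a single-bin DFT."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * freq * i / fs)
              for i, x in enumerate(signal))
    return 2.0 * abs(acc) / n


def xcorr_latency_ms(stimulus, response, fs, lo_ms=5.5, hi_ms=10.0):
    """Lag (ms) within [lo_ms, hi_ms] that maximizes the cross-correlation."""
    lo = math.ceil(lo_ms * fs / 1000.0)
    hi = math.floor(hi_ms * fs / 1000.0)
    best_lag, best_val = lo, -math.inf
    for lag in range(lo, hi + 1):
        # Pair stimulus[i] with response[i + lag]; zip stops at the shorter one.
        val = sum(s * r for s, r in zip(stimulus, response[lag:]))
        if val > best_val:
            best_lag, best_val = lag, val
    return 1000.0 * best_lag / fs


# Synthetic example (assumed values, not study data).
fs = 10000                     # Hz, sampling rate chosen for this sketch
f0 = 247.0                     # Hz, the F0 reported for the model analysis
n = round(0.160 * fs)          # 160 ms token
stimulus = [math.sin(2 * math.pi * f0 * i / fs) for i in range(n)]
delay = 70                     # samples, i.e., a made-up 7 ms neural delay
response = [0.0] * delay + [0.5 * s for s in stimulus]

f0_amp = amplitude_at(response[delay:], fs, f0)
latency_ms = xcorr_latency_ms(stimulus, response, fs)
```

Restricting the search to the 5.5-10 ms window matters for periodic signals: the cross-correlation of a 247 Hz waveform repeats roughly every 4 ms, so an unconstrained search could lock onto a neighboring cycle.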

Statistical analysis
We used one-way mixed model ANOVAs (R; lme4 package, version 1.1-32) to analyze the FFR data (F0 amplitude and latency). The model included a fixed effect for token (5 levels: /da/, /ga/, duplex /da/, duplex /ga/, ambiguous base) and a random effect for subjects. Identical models were run for the behavioral measures (percent /ga/, RTs). We normalized the FFR amplitude measures between 0-1 (within each subject) to mirror the behavioral percent /ga/ identification data, which are similarly bounded between 0-100%. This allowed us to focus on the relative changes ...

... observed in the auditory periphery, including cochlear filtering, level-dependent gain (i.e., compression) and bandwidth control, long-term adaptation, as well as two-tone suppression. Model tuning curves were fit to the characteristic frequency (CF)-dependent variation in threshold and bandwidth for high-spontaneous rate (SR) fibers in normal-hearing cats (Miller et al. 1997). The stochastic nature of auditory nerve (AN) responses is accounted for by a modified non-homogeneous Poisson process, which includes effects of both absolute and relative refractory periods and captures the major stochastic properties of single-unit AN responses (e.g., Young and Barta 1986).

We used the AN model to simulate scalp-recorded speech-FFRs (Bidelman 2014; Carter and Bidelman 2023) (Fig. 6A). This approach is based on the assumption that the far-field FFR recorded at the scalp is a convolution of an elementary unit waveform (i.e., impulse response; akin to the click-evoked ABR) with the instantaneous discharge rate from a given auditory ...

To simulate dichotic FFRs, we generated model FFR outputs separately for the left and right audio channels of our duplex stimuli. We then summed the monaural FFRs to simulate binaurally fused responses as measured in the actual FFR experiment. As with the empirical FFR recordings, we then measured model F0 (247 Hz) amplitudes from response spectra. This allowed us to compare true FFR with model responses, which similarly reflect the output of cochlear processing (e.g., spectral decomposition, nonlinearities) but are not subject to attention, perception, and/or top-down cortical modulation as in the empirical recordings.

Results

Behavioral data
All speech tokens were perceived categorically by listeners (Figure 2A) ...

... with high/low F3) were identified with similar labels as their diotic counterparts: "da" vs. "ga", respectively. Ambiguous base stimuli, which do not carry category-relevant cues, were categorized at chance levels (dotted line). Da, duplex da, duplex ga, ga, and ambig tokens here correspond to the stimulus tokens 1, 9 (high F3), 9 (low F3), 17, and ambig as described in ...

FFR data
Grand average FFR time waveforms and response spectra for each token are shown in Figure 3A-B, respectively. Note the robust periodicity of FFR waveforms, reflecting phase-locked neural activity to both diotic and dichotic speech stimuli. Despite identical acoustics in the low-frequency portion of the acoustic stimulus spectrum (i.e., Fig. 1 ...

... (Figs. 4A-B). Tukey-Kramer adjusted contrasts revealed the amplitude effect was driven by stronger responses to both the diotic (p = .062) and dichotic "da" (p = .004) stimuli relative to the ambiguous control. Similarly, the latency effect was driven by earlier responses to both /da/ stimuli compared to the ambiguous control (ps < .042). Responses were also faster on the whole to /da/ vs. /ga/ category stimuli (Student's t-test: duplex/diotic /da/ vs. duplex/diotic /ga/ contrast; t(54.63) = -2.24, p = .03). These findings suggest FFR latency distinguished stimuli from opposing ...

To better visualize whether a combination of neural (FFR) and/or behavioral measures revealed category structure, we constructed a series of bivariate plots showing relationships between measures in perceptual-neural space (Figure 4C). We found perceptual speech categories "separate" from each other in perceptual-neural space (Fig. 4C). More critically, when considering FFR variables alone, the combination of neural amplitude and latency measures cleanly separated responses to /da/ vs. /ga/ stimulus classes in both dimensions (Fig. 4D). Stimuli perceived as "da" clustered with one another and vice versa for those perceived as "ga." Moreover, responses categorically perceived as speech clearly segregated in multidimensional neural space from the ambiguous control stimulus. These findings indicate that FFRs not only differentiated speech compared to nonspeech sounds but, more critically, clustered according to their phonetic identity even when their labeling required the fusion of binaural speech cues.

Brain-behavior repeated measures correlations revealed a positive relation between FFR latency and behavioral identification [r_rm(44) = 0.38, p = 0.01] (Fig. 5). Later FFR latencies were associated with a greater preponderance of "ga" percepts within individuals. Correlations between F0 amplitude and PC were not significant.

... in the true FFR data (cf. Fig. 4).

FFRs are stronger in response to speech than non-speech sounds
We found that FFRs were enhanced to speech compared to non-speech (i.e., ambiguous) stimuli, corroborating previous studies. FFRs to forward speech are enhanced relative to FFRs to the same speech tokens time-reversed, indicating brainstem neural coding is enhanced when otherwise acoustically similar stimuli are perceptually relevant (Galbraith et al. 2004). Additionally, FFRs to sine-wave speech are enhanced for trained listeners who hear these stimuli as speech compared to naïve listeners who do not (Cheng 2021). Further evidence that ...

Our finding that speech FFRs were stronger to diotic and duplex speech tokens than to the ambiguous base alone corroborates findings that FFRs are enhanced for speech signals. Although we expected behavioral differences in RTs for duplex and ambiguous stimuli mirroring our FFR findings, the absence of this effect may have been due to the unique stimulus presentation paradigm with clustered presentation, allowing listeners to anticipate their responses, leading to faster, more uniform reaction times across all tokens. That FFRs may be enhanced for relevant signals supports the theoretical notions that "speech is special" in terms ...

Here, the enhancement of FFRs to speech tokens supports the idea that speech is afforded special processing in the brain by showing this privilege also extends to a subcortical level.

FFRs reflect binaural integration
A novel finding here is that FFRs reflect binaural integration of duplex stimuli. Critically, our stimuli require binaural integration to be categorically perceived; listeners cannot arrive at a phonetic label without integrating speech cues from the two ears. The fact that FFRs mirrored the behavioral reports supports the notion that FFRs carry information about binaural integration ...

... that this network may also involve subcortical structures, earlier in the auditory system than previously thought.

Category representations emerge subcortically
We found that listeners' categorical percepts of duplex stimuli modulated subcortical responses, suggesting category-level information is available to the brain prior to auditory cortex. Critically, the F0 (indeed all spectral cues) in our stimuli exceeded 250 Hz, which is substantially higher ...

... behavior in addition to the lower-level, neuro-acoustic sound representation that is presumably more automatic in nature. Our use of duplex stimuli teases apart these dissociations between acoustic and perceptual FFR representations since stimulus acoustics and monaural (peripheral) processing alone fail to account for our data (e.g., Fig. 6).

Conclusions
We show brainstem FFRs were enhanced for behaviorally relevant speech signals and were modulated by listeners' categorization of binaurally integrated speech cues. Our findings support notions that category representations are present in subcortical auditory processing.