Multilevel Modelling of Gaze from Hearing-impaired Listeners following a Realistic Conversation

Purpose: There is a need for outcome measures that predict real-world communication abilities in hearing-impaired people. We outline a potential method for this and use it to answer the question of when, and how much, hearing-impaired listeners look towards a new talker in a conversation.

Method: Twenty-two older hearing-impaired adults followed a pre-recorded two-person audiovisual conversation in the presence of babble noise. We compared their eye-gaze direction to the conversation in two multilevel logistic regression (MLR) analyses. First, we split the conversation into events classified by the number of active talkers within a turn or a transition, and we tested whether these events predicted the listener's gaze. Second, we mapped the odds that a listener gazed towards a new talker over time during a conversation transition.

Results: We found no evidence that our conversation events predicted changes in the listener's gaze, but the listener's gaze towards the new talker during a silence-transition was predicted by time: the odds of looking at the new talker increased in an S-shaped curve from at least 0.4 seconds before to 1 second after the onset of the new talker's speech. A comparison of models with different random effects indicated that more variance was explained by differences between individual conversation events than by differences between individual listeners.

Conclusion: MLR modelling of eye-gaze during talker transitions is a promising approach to studying a listener's perception of realistic conversation. Our experience provides insight to guide future research with this method.


Introduction
Difficulty with everyday conversation is a debilitating consequence of hearing loss that is not well captured by traditional audiometric tests (Kiessling et al., 2003). In recent years, the field of hearing science […] to capture potentially subtle indications of turn-end perception, (2) its use of hierarchical groupings of data […] comparisons between new conditions, populations, and individual differences. With this MLR approach, we address the research question of when, and how much, hearing-impaired listeners look towards a new talker. Furthermore, we examine how much participants and stimuli differ from one another in our design. Through this work we aim to lay a foundation that can advise future research on how the gaze behavior of hearing-impaired listeners reflects their experience of everyday conversation.

Methods
Participants

The data were collected as part of a separate experiment that investigated how to use in-ear electrodes to measure eye-movements (Skoglund et al., 2022). The full experiment, including the use of the […]

[…] for each loudspeaker was equalized with a finite impulse response filter based on the inverse of the […]. Furthermore, the signal loudness was modified for each participant individually to compensate for the hearing loss in their better ear (Moore & Glasberg, 1998; Skoglund et al., 2022).

A pair of Tobii Pro 2 Eye Tracking Glasses (Tobiipro, Sweden, https://www.tobiipro.com) was used to collect eye-movement data, which were stored directly on an SD card within the device. The glasses were fitted with a set of six reflective markers that were used to track the position of the eye-tracker in the room via a four-camera motion capture system (Vicon Vero, UK, https://www.vicon.com).

Data from the motion capture system were recorded on a desktop computer that also controlled the experiment presentation. Audio was interfaced through two linked soundcards (RME Fireface 400, Hammerfall DSP Multiface II). In addition to connecting to the loudspeakers, the soundcards transmitted a trigger signal to the eye-tracker recording system (via a custom-made trigger box) and to the motion capture system for the purpose of synchronizing the measurements with the audio. Behavioral responses from the participants were recorded through a Bluetooth-connected number pad.

The stimuli were controlled with Matlab 2019a ([…]). Participants were positioned such that their head was approximately 1.5 metres from the television screen. After fitting the […], each trial began with a black background, accompanied by the onset of the multitalker babble noise. After three seconds, the conversation speech and video began. The listeners were asked to hold their heads stationary, to look at the fixation cross when it appeared on the screen, to listen to the conversation, and to answer (by button-press) the multiple-choice question that appeared on the screen at the conclusion of each clip. The question and the response options were also explained. The next trial started automatically after the participant's response. The participants were given 2 blocks of practice trials (19 trials total) and 2 blocks of experiment trials (26 trials total), with the option of breaks between blocks. These blocks were interleaved with testing blocks for other experimental conditions that were not of interest to the current investigation and are discussed elsewhere (Skoglund et al., 2022).

Conversation Data Processing
From the individual audio tracks of each talker in our stimuli, we generated voice activity timeseries by segmenting the audio into 5 ms-long windows that overlapped by 1 ms, and applying an individually selected root-mean-square threshold to the segments to remove cross-talk during silent transitions, similar to Sørensen et al. (2021). In several tracks with substantial cross-talk, periods of the talkers' silence were manually edited to exclude cross-talk. Individual tracks were normalized to the same energy level for the voice-active portions within a clip, to compensate for variation in the speaking level of the talkers. These voice activity timeseries were used to classify each clip's timecourse into five different conversation event types. We applied the algorithm from Sørensen et al. (2021) to identify periods of turns and transitions, and to label periods of silence and overlap. From this, the conversation was divided into discrete events based on the number of talkers (silence, solo, or overlap for 0, 1, or 2 simultaneous talkers) for turns and transitions separately. This resulted in the following labels, which are illustrated in Figure 2: silence-turn, silence-transition, solo, overlap-turn, and overlap-transition. Similar to Heldner and Edlund (2010), periods of silence-turn that were less than 180 ms in duration were bridged, and voice activity less than 70 ms in duration was excluded. The frequencies and durations of the resulting conversation event types can be seen in Table 1.

To answer our research question, we needed to identify points in the stimuli where we expected listeners to transfer their gaze to the new talker due to a transition in the conversation. Given the low occurrence of overlap-transitions in our stimuli, we selected time periods surrounding the onset of a new talker after silence-transition events only. To exclude transition events that were uncharacteristically long or short, we selected only those in which the durations of the silent period and the subsequent turn were roughly within the interquartile range for each event type (minimum duration for inclusion, turn: 0.4 s, silence: 0.14 s; maximum time included, turn: 2.92 s, silence: 0.508 s). This resulted in a selection of 68 out of the total 120 silence-transition events within our stimuli.
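To make this processing concrete, the following is a minimal R sketch of the voice-activity labelling and duration rules described above. The sampling rate, the threshold value, and the `audio` vector are illustrative assumptions; the original pipeline (Sørensen et al., 2021) may differ in detail, and the bridging rule is simplified here to apply to all short silences rather than to silence-turn events specifically.

```r
fs  <- 44100                      # assumed audio sampling rate (Hz)
win <- round(0.005 * fs)          # 5 ms analysis windows ...
hop <- round(0.004 * fs)          # ... overlapping by 1 ms (4 ms hop)
rms_threshold <- 0.01             # selected individually per track in the study

# Root-mean-square energy per window -> logical voice-activity timeseries
starts <- seq(1, length(audio) - win + 1, by = hop)
vad <- sapply(starts, function(i) sqrt(mean(audio[i:(i + win - 1)]^2))) > rms_threshold

# Bridge silences shorter than 180 ms, then drop voiced stretches shorter
# than 70 ms (cf. Heldner & Edlund, 2010)
frame_ms <- 1000 * hop / fs
runs <- rle(vad)
runs$values[!runs$values & runs$lengths * frame_ms < 180] <- TRUE
runs <- rle(inverse.rle(runs))    # re-merge adjacent equal-valued runs
runs$values[runs$values & runs$lengths * frame_ms < 70] <- FALSE
vad <- inverse.rle(runs)
```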

Eye-gaze Data Processing
Since previous research has established that listeners prefer looking at faces over other elements of a scene (see Hessels, 2020, for a review), our eye-gaze data processing worked from the assumption that our participants did so, and focussed on describing how gaze was divided between the talkers in relation to their speech activity. This assumption allowed us to process the data with few a priori assumptions, and without physically restricting the participants' heads. Our motivation here was to develop a method that can translate well into future experimental setups that have less control over data quality and more stimulus variability, which is a necessary compromise for moving into research with even more realistic scenarios. Note that, although we did not compensate for the listener's head position in our eye-gaze analysis, motion capture data confirmed that participants held their heads mostly stationary during the experiment. Only 4% of trials showed a change in yaw larger than 2 degrees from the average head position in the trial. Only 5% showed translations larger than 14.6 mm on the horizontal azimuth from the average head position in the trial, which corresponds to less than 1 degree of visual angle in our setup. We therefore conclude that head movements had a negligible impact on our analysis.

For each block of trials, we selected the eye that had the lowest proportion of missing data in each participant, and we used the gaze direction coordinates to calculate the gaze angle in degrees on the azimuth. Gaze samples were assigned to each talker in an automatic, data-driven manner by k-means clustering (k = 2) applied to the azimuth measurement within each trial. This method ignores that some of the gaze samples belonged to saccades between the talkers, or to elements of the scene outside of the talkers. We treated these latter gaze samples as inherent noise, and assumed that this noise was not systematically different between the talkers across the full experiment.

Prior to the k-means clustering, we removed gaze samples outside of a 50-degree range on the azimuth, centered at the trial's mean gaze angle. These were assumed to be noise due to their spatial extremeness (keeping in mind that the talkers were separated by approximately 20 degrees). This cleaning mitigated cluster biases from extreme values. After k-means labelling, we further processed the gaze to conform to physiologically based assumptions. We assumed that (1) a change in region-of-interest (the moment the gaze crossed the midline between the two talkers) would cover 10 degrees, given that the talkers were approximately 20 degrees apart, (2) a saccade of this size would last approximately 40 ms, and (3) a fixation on a talker would last at least 100 ms. Therefore, changes of label that lasted less than 180 ms were considered noise and were filled in with the previous label. Likewise, chunks of missing data shorter than 180 ms that did not span a label change were filled in with the previous label.

After this preprocessing, data from 12 out of 572 trials were excluded for missing more than 40% of the trial's data. The remaining trials were missing a total of 8.4% of their data, 3% of which was due to the preprocessing steps described above and the remainder of which was due to signal loss during data collection.
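A minimal R sketch of this per-trial labelling follows, assuming `azimuth` holds one trial's gaze angles in degrees at an assumed 50 Hz sampling rate; the names and the rate are illustrative, not the study's exact implementation, and the gap-filling rule is simplified (it fills all short gaps, not only those that do not span a label change).

```r
sample_ms <- 1000 / 50                      # assumed eye-tracker sample period (ms)

# Discard spatially extreme samples: keep a 50-degree range centred
# on the trial's mean gaze angle
centre <- mean(azimuth, na.rm = TRUE)
azimuth[abs(azimuth - centre) > 25] <- NA

# Two-cluster k-means assigns each valid sample to one of the two talkers
valid  <- !is.na(azimuth)
labels <- rep(0L, length(azimuth))          # 0 marks missing samples
labels[valid] <- kmeans(azimuth[valid], centers = 2)$cluster

# Treat label changes and gaps shorter than 180 ms as noise and carry
# the previous label over them
runs  <- rle(labels)
short <- runs$lengths * sample_ms < 180
for (r in seq_along(runs$values)) {
  if (short[r] && r > 1) runs$values[r] <- runs$values[r - 1]
}
labels <- inverse.rle(runs)
labels[labels == 0L] <- NA                  # longer gaps remain missing
```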

Target Talker Labelling
Our modelling analysis required that we first decide which talker to label as the target in each type of conversation event. To do so, we aligned data from all participants by their timestamps, and classified these according to one of the five conversation event types shown in Figure 2 at each timepoint, to test whether the distribution of the group's gaze was different from random. We then calculated the proportion of timepoints at which the group gazed towards each talker during overlap-transition, silence-transition, solo, silence-turn, and overlap-turn events. The resulting proportions revealed that the group preferred to look at the finishing talker during transitions and the turn-holder during turns. Specifically, during overlap-transition and silence-transition events, 34.1% and 36.3% of collected timepoints showed that the group looked towards the finishing talker, versus 12.4% and 11.8% towards the starting talker. During solo, overlap-turn, and silence-turn events, 65.1%, 42.8%, and 45.7% of collected timepoints showed that the group looked towards the turn-holder, versus 4.4%, 4.7%, and 0.0% towards the non-turn-holder. Based on this, the finishing and turn-holding talkers are herein referred to as the target for any given timepoint. This denotation was used to code the data as either on or off target for the subsequent modelling.
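In R, this group-level tally reduces to a cross-tabulation. The sketch below assumes a hypothetical long-format data frame `gaze` with one row per collected timepoint and columns `event_type` (the five event labels) and `gaze_at` (which talker, if any, the gaze was on); both names are illustrative.

```r
# Percentage of timepoints spent on each talker, within each event type
round(100 * prop.table(table(gaze$event_type, gaze$gaze_at), margin = 1), 1)
```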

Results
We tested two different models (Equation 1 and Equation 2) to address our research question. In both, the dependent variable was the probability of gaze-to-target transformed to log odds. We modelled an intercept for each individual transition occurrence (γ1) as a random effect, in order to account for differences between individual events in their tendency to predict gaze to the target talker. We further included an intercept for each participant (γ2) as a random effect, which allowed the model to account for differences between participants in their overall tendency to look at the target talker. We estimated the effect size of our models by comparing the variance explained by the fixed effects alone with that explained by the complete model (including both fixed and random effects). This was done by calculating the pseudo-R-squared value for generalized mixed-effects models from the MuMIn package (https://cran.r-project.org/web/packages/MuMIn/MuMIn.pdf).
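As a hedged sketch of this model structure in R, one could fit the two multilevel logistic regressions with lme4 and compute the pseudo-R-squared with MuMIn. The data frame and column names (`on_target`, `event_type`, `time`, `event_id`, `participant`) are assumptions, and the exact forms of Equations 1 and 2 are not recoverable from this excerpt.

```r
library(lme4)
library(MuMIn)

# Model 1: event type as a categorical fixed effect; random intercepts per
# conversation event (gamma1) and per participant (gamma2), plus a
# per-participant random slope over event type. `on_target` is binary (0/1).
m1 <- glmer(on_target ~ event_type + (1 | event_id) + (1 + event_type | participant),
            data = gaze, family = binomial(link = "logit"))

# Model 2: gaze to the new talker as a function of time relative to the new
# talker's onset, with the same random intercepts (fixed-effect form assumed)
m2 <- glmer(on_target ~ time + (1 | event_id) + (1 | participant),
            data = transitions, family = binomial(link = "logit"))

# Marginal R2 (fixed effects only) versus conditional R2 (fixed + random)
r.squaredGLMM(m1)
```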

Model 1: Effects of talker overlap and silence in turns and transitions
Prior to answering our research question, we first wanted to establish how the tendency to look at the active talker changed in events with overlap or silence during turns or transitions. Given that realistic conversation has many sources of potential variation to control for if it is to be used in an experiment, we wanted to know whether conversation event type is a meaningful explanatory factor. This system of classification has the advantage that it can be applied automatically by examining the timeseries of voice activity, and it is therefore more efficient than manual conversational analysis. Therefore, for our first model, the independent variable was event type, with five categorical levels (silence-transition, silence-turn, solo, overlap-transition, and overlap-turn). This was modelled as a fixed effect alongside the random effects described above (Equation 1). In addition, we included an independent slope parameter for each participant as a random effect, to account for differences between participants in how they behaved for each event type. […]

[…] This suggests that differences in conversation event type are not predictive of where listeners look when viewing a multitalker conversation. Nonetheless, the model allowed us to describe where a listener looked during each conversation event type as a probability with confidence intervals (Table 2). Furthermore, the model assessed how much each event type changed the odds of gaze-to-target, as compared to solos (Figure 3). Only during silence in a turn were the odds of gazing at the target talker significantly different from the odds in a solo: gaze-to-target was 1.77 times more likely (p < 0.001, 95% CI [1.29, 2.44]). Given that we found no evidence that any conversation event type decreased the odds of gaze-to-target, turns that included overlap and silence were not excluded from our subsequent analysis.

Discussion

In the current investigation, we introduced a new paradigm for studying a listener's experience of a realistic multitalker conversation that is intended to capture the perception of turn-end cues. Using MLR […]

From the results of Model 2, we observed that the likelihood of the listeners' gaze towards the new talker peaked at 1 s after the new talker's speech onset, with odds of 2.14 (Figure 4). These odds translate into a 68% chance that the listeners' gaze was on the new talker. Table 3 summarizes previous research that has described the gaze trajectory of listeners with normal hearing who similarly followed talker transitions in realistic conversation. The values reported in the current investigation are close to those observed in the normal-hearing population, with perhaps a longer time to the peak of gaze towards the new talker. We cannot draw conclusions from this comparison, however, due to substantial differences in methodology between the current and previous works. Notably, in our experiment the conversation audio was accompanied by multitalker babble noise, whereas in previous research the audio was always presented without noise.

The data presented in Figure 4 […]

[…] Nonetheless, this analysis allowed us to quantify where a hearing-impaired listener looks during conversation (Table 2). Similar to previous research, we observed that listeners preferred to look at the active talker when only one talker was speaking, with odds of 2.21, which translates to 68.85% of the time. […]
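For reference, the odds values reported above convert to probabilities as

\[
p = \frac{\text{odds}}{1 + \text{odds}}, \qquad \text{e.g.}\quad \frac{2.14}{1 + 2.14} \approx 0.68 \quad\text{and}\quad \frac{2.21}{1 + 2.21} \approx 0.6885 .
\]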
"Target" is defined as the turn-holding talker during turns, and as the finishing talker during 593 transitions. Note that the confidence intervals were calculated with each event type, respectively, as the 594 baseline intercept for the model.