Disentangling listening effort and memory load beyond behavioural evidence: Pupillary response to listening effort during a concurrent memory task

Recent research has demonstrated that pupillometry is a robust measure for quantifying listening effort. However, pupillary responses in listening situations where multiple cognitive functions are engaged and sustained over a period of time remain hard to interpret. This limits our conceptualisation and understanding of listening effort in realistic situations, because rarely in everyday life are people challenged by one task at a time. Therefore, the purpose of this experiment was to reveal the dynamics of listening effort in a sustained listening condition using a word repeat and recall task. Words were presented in quiet and in speech-shaped noise at three signal-to-noise ratios (SNR): 0dB, 7dB and 14dB. Participants were presented with lists of 10 words, and were required to repeat each word after its presentation. At the end of the list, participants either recalled as many words as possible or moved on to the next list. Simultaneously, their pupil dilation was recorded throughout the whole experiment. When only word repeating was required, peak pupil dilation (PPD) was bigger at 0dB than in the other conditions; whereas when recall was required, PPD showed no difference among SNR levels, and PPD at 0dB was smaller than in the repeat-only condition. Baseline pupil diameter and PPD followed different patterns across the 10 serial positions within a block for conditions requiring recall: baseline pupil diameter built up progressively and plateaued in the later positions (but shot up when listeners were recalling the previously heard words from memory); PPD decreased at a quicker pace than in the repeat-only condition. The current findings demonstrate that additional cognitive load during a speech intelligibility task could disturb the well-established relation between pupillary response and listening effort. Both the magnitude and temporal pattern of the task-evoked pupillary response differ greatly in complex listening conditions, calling for more listening effort studies in complex and realistic listening situations.

Introduction

decode the incoming speech input embedded in various types of background noise, retain some information for mental processing, ponder over the best choice of words and articulate a verbal response (potentially monitoring the feedback of one's own voice), all of which require sustained cognitive processing over time. Understanding the pupillary response to speech understanding in those situations is essential to conceptualise and quantify listening effort in ecological conditions, especially in the case of hearing aid or cochlear implant users.

Specifically, the relation between single task demand and pupil dilation has been shown and well-replicated in studies manipulating speech intelligibility and memory load [24-29]. However, there are only a handful of pupillometry studies involving multiple and sustained tasks within hearing science. For instance, Karatekin et al. [16] found that pupil diameter increased progressively with more digits to remember during a digit span task and a dual task (digit span with visual response time task), but the rate of increase was shallower in the dual task than in the single task. McGarrigle et al. [42] asked normal-hearing (NH) participants to listen to one short passage per trial, presented with multi-talker babble noise, and at the end of each passage judge whether images presented on the screen were mentioned in the previous passage. A steeper decrease in (baseline corrected) pupil size was found for the difficult than for the easy SNR, but only in the second paragraph. This was interpreted as an index of the onset of fatigue in listening conditions requiring sustained effort. However, paragraphs were between 13-18s and the target word was periodically varied inside the paragraph, making it difficult to measure directly the pupillary response evoked by recognising and encoding the target item. In Zekveld et al. [43], participants had to recall the four-word cues (either related or unrelated to the following sentence) presented visually before the onset of the sentence embedded in a speech masker. The 7dB SNR difference between two sentence-in-noise conditions (-17dB and -10dB) elicited a difference in intelligibility, but not in peak and mean pupil dilation. This contrasted with the well-established effect of auditory task demands on the pupillary response, suggesting that an external cognitive load (i.e., memory) during speech processing could nullify the intelligibility effects on the pupil dilation response. Overall, these studies point to a complicated, but under-investigated, relation between the speech task and the pupil dilation response when other cognitive task load is present.

Therefore, the current experiment starts addressing the lack of systematic investigation on the dynamics of pupillary response in complex and sustained listening situations. To do so, we designed a behavioural paradigm including two TASKs with different demands in cognitive resources: a repeat-only condition where participants listen to and repeat one word at a time for ten consecutive words, and a repeat-with-recall condition where, after listening to and repeating each of the ten words, they need to recall as many words as possible at the end of the tenth word. Using words instead of digits or paragraphs, the paradigm utilises natural speech, yet still provides precise time-locking to the canonical task-evoked pupil response. The recall task poses a substantial and

only once in each list. They were then masked by speech-shaped noise (filtered on the long-term excitation pattern of the entire material, respectively in English or French) at three SNR levels of 0 dB, 7 dB and 14 dB. A quiet condition was also included, making a total of 4 LISTENING conditions.
(measured using a luxometer with the sensor positioned at the same height as the participants' left eye and facing the screen). The luminance levels were fixed throughout the experiment, to avoid changes in light level inducing task-unrelated pupillary responses. All audio stimuli were presented through Beyerdynamic DT 990 Pro headphones via an external soundcard (Edirol UA), calibrated at 65 dB SPL.

Experiments were run in Matlab 2016b, using Psychtoolbox and custom software.

After the task had been demonstrated and the procedure explained, participants practised with one repeat-with-recall condition at 14 dB to familiarise themselves with the test sequence and the requirements for pupil recording.

Before each test block, participants were notified by words on the screen to either recall (printed in red) or not recall (printed in black) at the end of the ten words. Three seconds after the notification, a black fixation cross appeared and stayed for another 1s, to indicate the start of the first trial and eliminate any carry-over effect from reading the coloured words in the pre-block notification. In each trial within the block, the presentation of the speech-shaped noise masker (or quiet in the quiet condition) started 1.5s before the onset of the word. This was to provide time for the pupils to recover from the previous trial, tempering any carry-over effect (0.5s), and to measure baseline pupil diameter (1s). SNR was varied by fixing the masker level and adjusting the target level. In this way, listeners could not estimate the upcoming task difficulty based on the noise level [30]. Participants were instructed to fixate on the black fixation cross displayed at the centre of the screen. After 1.5s, the word was played, and the presentation of the masker noise (or quiet in the quiet condition) was turned off 1s after the word offset.
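The level manipulation described above (fixed masker, adjusted target) can be illustrated with a minimal sketch. It assumes that SNR is defined on RMS amplitudes, which the text does not state, and the function names and toy signals are hypothetical and not taken from the experiment software.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def target_gain_for_snr(target, masker, snr_db):
    """Linear gain for the target so that
    20*log10(rms(gain*target) / rms(masker)) equals snr_db.
    The masker is left untouched (fixed masker level)."""
    return rms(masker) / rms(target) * 10 ** (snr_db / 20)

def scale(samples, gain):
    """Apply a linear gain to every sample."""
    return [gain * s for s in samples]

# Hypothetical toy signals standing in for a word and the speech-shaped noise.
word = [0.5, -0.5, 0.5, -0.5]
noise = [0.1, -0.1, 0.1, -0.1]

for snr_db in (0, 7, 14):
    gain = target_gain_for_snr(word, noise, snr_db)
    achieved = 20 * math.log10(rms(scale(word, gain)) / rms(noise))
    print(f"SNR {snr_db:>2} dB: target gain {gain:.3f}, achieved {achieved:.1f} dB")
```

Because only the target is rescaled, the noise heard during the 1.5s before word onset is identical across SNR conditions, which is the property the design relies on.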

Upon the masker offset, the fixation cross turned into a circle, and this prompted participants to repeat back the word. They were instructed to fixate on the black circle during the verbal response. The experimenter then typed in the repeated word and pressed ENTER to proceed to the next trial. Words were scored automatically based on

words that were correctly recalled misperceptions (similar to [44]), dissociating the impact of intelligibility from recall performance.

At the end of each block, regardless of the TASK condition, participants were asked verbally to rate how effortful the last block was from 1 to 10, 10 being most effortful.

Test sequence in a block. Before each block, participants were presented with either the words 'please listen, repeat and recall' in red or the words 'please listen, repeat and no recall' in black against a white screen, indicating whether the incoming block was a repeat-only or a repeat-with-recall condition. Three seconds after the notification, a black fixation cross appeared and stayed for another 1s, to signal the start of the first trial. The trial started with acoustic presentation of 0.5s of speech-shaped noise (or quiet in the quiet condition) and visual presentation of a black fixation cross ('intertrial'). Another 1s of baseline measurement followed, with the same acoustic and visual presentation ('baseline'). The word was then played at 1.5s into the trial, followed by noise presentation (or quiet in the quiet condition) for 1s ('waitpeak'), with the same visual presentation. Upon the offset of 'waitpeak', the black fixation cross turned into a black circle to prompt listeners to repeat back the word ('repeat'). If the block was a repeat-with-recall condition, at the end of the 10th word, participants were prompted by the word RECALL followed by a black circle on the screen to start recalling previously repeated words. At the end of the block, participants were verbally reminded to rate how effortful the last block was from 1 to 10, 10 being most effortful.

The experiment lasted for 1 hour.

using between-subjects two-tailed t-tests. Therefore, data were first aggregated over language (as this played no role and was not a factor of interest in our study).

Mixed effect models allow for controlling the variance associated with random factors without data aggregation. Therefore, by using LISTENER as a random effect factor in the model, we controlled for the variance in overall performance (random intercept) and dependency on other fixed factors (random slope) that were associated with LISTENER.

Data processing and analysis
Models were constructed using the lme4 package [45] in R [46], and figures were produced using the ggplot2 package [47]. Fixed and random effect factors entered the

To examine the effect of background noise on stated word recall performance, a logistic mixed-effect model was fitted on the number of words correctly recalled, with LISTENING condition as fixed effect factor and LISTENER as random effect factor, following the same procedure reported above. Note that the recall performance was counted as stated word correct, and as such a word could be misunderstood and yet correctly recalled.

Matlab, to further decrease the impact of the obscured pupil from blinks. Trials that had over 20% of the data points coded as blinks from the start of baseline to the start of the next trial were excluded. Trials containing blinks longer than 0.4s were also excluded, because they were more likely to be artefacts than normal blinks [49]. Three participants had more than 20% of the overall trials discarded and were excluded from the pupillometry analysis (but kept for behavioural and subjective rating analysis).
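The exclusion rules above can be written down compactly. The following is only a sketch under stated assumptions: blinks are represented as a per-sample boolean flag, a blink's duration is taken as the length of a run of consecutive flagged samples, and the sampling rate (`fs_hz`) is a hypothetical parameter because the eye-tracker rate is not given in this section. The function names are illustrative and do not come from the original Matlab code.

```python
def blink_runs(is_blink):
    """Lengths (in samples) of each run of consecutive blink samples."""
    runs, current = [], 0
    for flag in is_blink:
        if flag:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def keep_trial(is_blink, fs_hz, max_blink_prop=0.20, max_blink_s=0.4):
    """Keep a trial unless over 20% of its samples are blinks
    or it contains a single blink longer than 0.4 s."""
    if not is_blink:
        return False  # an empty trace carries no usable data
    blink_prop = sum(is_blink) / len(is_blink)
    longest_s = max(blink_runs(is_blink), default=0) / fs_hz
    return blink_prop <= max_blink_prop and longest_s <= max_blink_s

def keep_listener(trial_kept_flags, max_discard_prop=0.20):
    """Keep a listener unless more than 20% of their trials were discarded."""
    discarded = trial_kept_flags.count(False) / len(trial_kept_flags)
    return discarded <= max_discard_prop

# Hypothetical example at an assumed 50 Hz sampling rate.
short_blink = [True] * 10 + [False] * 90    # 10% blinks, 0.2 s run
long_blink = [True] * 21 + [False] * 179    # 10.5% blinks, 0.42 s run
print(keep_trial(short_blink, fs_hz=50), keep_trial(long_blink, fs_hz=50))
```

The two criteria are kept separate on purpose: a trial can pass the 20% proportion check and still be dropped for containing one long blink, as in the second example.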

All valid traces were low-pass filtered at 10 Hz with a first-order Butterworth filter to preserve only cognitively related pupil size modulation [50]. Processed traces were then aligned by the onset of the response prompt (the display of the circle signalling participants to repeat back the word) and aggregated per listener, by each WORD

Pupil data analysis
Three indices of pupil response (baseline pupil diameter, peak pupil dilation (PPD) and peak latency) were obtained from processed traces, consistent with the method in [17]. PPD was the maximum diameter of pupil measurements from word onset to response

low listening effort. LISTENER was entered in the model as a random effect factor.

Model building followed the same procedure as above.

To further explore the sequence of different cognitive processing stages, pupil

and recall. In other words, it was examined as a predictive tool: predict whether a given word would be correctly understood or not, and recalled or forgotten, from the particular shape of a pupil trace.

April 29, 2020

Finally, to explore the impact of LISTENING condition on the pupillary response during recall, pupil traces from the recall onset cue to 15s after the cue were first baseline-corrected by subtracting the average diameter of all previous word trials in the block. They were then de-blinked and low-pass filtered using the same parameters as above. Processed traces were then aggregated per listener by LISTENING condition.
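As a rough illustration of two of the steps just described (baseline correction against the preceding word trials, then a 10 Hz first-order low-pass), here is a minimal sketch. Several points are assumptions and not the authors' implementation: the baseline is taken as the mean over all samples of the preceding word trials, the filter is a first-order Butterworth derived by the bilinear transform and run once in the forward direction (the original may have used zero-phase filtering), de-blinking is omitted, and the 60 Hz sampling rate is hypothetical.

```python
import math

def baseline_correct(recall_trace, word_trial_traces):
    """Subtract the mean diameter over all preceding word trials in the block."""
    samples = [s for trial in word_trial_traces for s in trial]
    baseline = sum(samples) / len(samples)
    return [s - baseline for s in recall_trace]

def butter1_lowpass(trace, cutoff_hz, fs_hz):
    """First-order Butterworth low-pass (bilinear transform), forward pass only.

    Difference equation: y[n] = b*(x[n] + x[n-1]) - a*y[n-1],
    with k = tan(pi*fc/fs), b = k/(k+1), a = (k-1)/(k+1).
    """
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    b = k / (k + 1.0)
    a = (k - 1.0) / (k + 1.0)
    out = []
    prev_x = prev_y = trace[0]  # start in steady state to avoid an onset transient
    for x in trace:
        y = b * (x + prev_x) - a * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

# Hypothetical numbers: word trials averaging 4.0 mm, recall trace near 4.3 mm.
word_trials = [[4.0, 4.1, 3.9], [4.0, 4.0, 4.0]]
recall = [4.3, 4.35, 4.25, 4.3]
corrected = baseline_correct(recall, word_trials)
smoothed = butter1_lowpass(corrected, cutoff_hz=10, fs_hz=60)
print([round(v, 3) for v in smoothed])
```

A constant input passes through this filter unchanged (unit gain at 0 Hz), so the baseline-corrected mean level during recall is preserved while sample-to-sample jitter is attenuated.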

The mean of the trace during word recall was calculated. A mixed effect model was fitted on the mean pupil diameter during recall, with LISTENING condition as fixed effect factor and LISTENER as random effect factor.

conditions as fixed effect factors and LISTENER as random effect factor, and following the same procedure reported above.

In a final attempt to delineate different components of the pupillary dynamics, each participant's pupillary responses (baseline diameter and PPD) were correlated with their age, word recognition, word recall and subjective rating performance.

All best fitting models and summary output are reported in the Supplementary Materials S1

repeat-with-recall than in repeat-only condition (β = 0.27, se = 0.12, p = 0.03).

Note that these correlations should be considered with caution because no correction was applied.

Consistent with our first hypothesis, results showed that noise impaired both word recognition and recall. Fewer stated words were recalled at 0dB than at 7dB, 14dB and in quiet. Note that to dissociate the impact of word recognition from recall performance, word recall scoring was based on whether the recalled words matched the words repeated by participants, rather than the transcripts (similar to [44])

performance in the word recall task [4,8,54-56].

Surprisingly, we found a possible interference from the recall task on the word

At 0dB, listeners might be more attentive and ready to engage more cognitive resources overall when they were notified at the beginning of the block that they should recall at the end of the 10th word, because they anticipated the incoming block to be demanding. When no recall was required, they might have judged beforehand that the incoming block was not worth mobilising many resources for, hence the worse recognition performance. Furthermore, in quiet in the repeat-with-recall condition, listeners should have had sufficient capacity to reach a better primary task performance (as shown by a higher word recognition in the repeat-only condition), but instead, they performed worse in the word recognition task than in the repeat-only condition. This might suggest that they did not prioritise the word recognition task (although they were instructed explicitly to do so by the experimenter), and may have shifted some resources to the recall task, probably because it was more interesting and rewarding [57-60].

This interference warrants further investigation, because it concerns the validity of using a dual-task paradigm in measuring listening effort. In order to safely interpret the difference in secondary task performance as a result of listening effort, implicit assumptions of the dual-task paradigm need to be reviewed [61]. Firstly, the paradigm assumes that participants have a limited pool of cognitive resources, but the Framework for Understanding Effortful Listening (FUEL) model also notes that the resources available to be allocated fluctuate with other factors besides overall task demands [3,4]. In other words, the relationship between task difficulty and effort is not linear, but modulated by factors like fatigue, motivation and (dis)pleasure [33,62-67].

Secondly, the paradigm assumes that listeners, under explicit instructions, will prioritise the primary task by investing as many resources as possible, leaving only whatever resources remain for the secondary task. However, individual differences and task characteristics might affect listeners' actual strategy [3]. For instance, older adults may differ from younger adults in the extent to which they prioritise one task over another [57-59]. And when the primary task is too complex or the secondary task more novel, participants may consciously or unconsciously shift more resources to the secondary task relative to the primary task [68-70]. Although the recall paradigm from previous studies is sensitive to the relative allocation of cognitive resources, there is no direct method to gauge the total amount of resources deployed and how they are allocated [61]. As illustrated in the current experiment, listeners might not mobilise and/or allocate the same amount of cognitive resources for the speech recognition task when a secondary recall task was anticipated, even under explicit instruction.

This might be of particular concern when applying the test to listener groups who are susceptible to fatigue and task interference, for instance hearing impaired populations and children, because they might either give up or not be fully engaged in the first place, even when the available capacity can meet the processing demand [3,66,68-70].

reported similar trends, they were using different materials and test designs, making it hard to demonstrate clearly the impact of an additional memory task on listening effort in both magnitude and dynamics. For instance, within one speech perception task, pupil diameter gradually decreased with increasing trial numbers, due to task/stimuli habituation [17,38-40]. However, when listeners needed to remember the digits or pseudo-words presented auditorily, pupil diameter increased progressively, until the memory span was exceeded [16,24,26,71]. Note that in the current experiment, listeners needed to continuously decode words embedded in noise, which was more effortful than listening to digits or pseudo-words in quiet. The more demanding primary speech recognition task led to more accumulated and sustained effort over time. This might explain the earlier plateau in baseline diameter in our experiment than observed in those studies. We observed a quadratic trend of baseline pupil diameter from the 1st to the 10th word within a list. Cabestrero et al. [26] reported the plateau at the 9th digit for young adults, and [72] reported the plateau at the 6th digit for children and the 8th digit for adults. Our results are in good agreement with such estimates, and confirm that an additional memory task places a heavier and sustained load on cognitive effort. More specifically, baseline diameter could reveal the impact on cognitive effort from the additional task, and the rate of increase in baseline diameter could be suggestive of the magnitude of sustained effort in a test paradigm with multiple sources of cognitive effort.

However, the steeper decrease of PPD in the repeat-with-recall condition compared to the repeat-only condition was unexpected. PPD has been shown to be sensitive to memory load; therefore, with more words to be remembered, we expected PPD to increase accordingly over time [16,25,26]. A decrease in PPD was reported when listeners tended to give up in tasks that were impossibly difficult [27,29]. In those cases, performance level was typically low (around 0%). But we did not observe a decrease in recognition and recall performance for words in the later part of the list in our results, or a worse word recognition performance in the repeat-with-recall condition at the difficult 0dB condition (in fact, word recognition was higher in the repeat-with-recall than in the repeat-only condition). This suggests that listeners did not give up at the later part of the word list, or at 0dB. Similarly, a smaller PPD at 0dB in the repeat-with-recall than in the repeat-only condition was surprising. An additional recall task at a difficult SNR is certainly more demanding than a single task; therefore, we expected PPD to be larger in the repeat-with-recall condition and at the difficult SNR level. But we observed the opposite: PPD actually decreased in the repeat-with-recall condition. We do not believe that these are spurious results. This marked contrast with the well-established effect of task demands on the pupillary response was also observed in Zekveld et al. [43]. In Zekveld et al. [43], participants had to recall the four-word cues (either related or unrelated to the following sentence) presented visually before the onset of the sentence embedded in a speech masker. The 7dB SNR difference between two sentence-in-noise conditions (-17dB and -10dB) elicited a difference in intelligibility, but not in peak and mean pupil dilation. Zekveld et al. [43] interpreted the absence of a pupillary difference between the two SNRs as participants prioritising the central factors (memory task) over peripheral factors (sentence recognition task). There are a few characteristics that distinguish our design from Zekveld et al. [43]. Firstly, the memory and sentence recognition tasks in Zekveld et al. [43] were more independent: participants read the cue words for 5s before the auditory stimulus onset; after the auditory stimulus offset, participants repeated either the sentence or the cue words. This separation between the two tasks could facilitate intentional prioritisation of the memory over the speech recognition task.

Secondly, participants in Zekveld et al. [43] only needed to memorise a four-word cue at the start of each trial, with no accumulation of memory load over time. In comparison, the memory task in our paradigm placed a greater demand on the limited cognitive resources: participants had to complete both word recognition and memorising tasks within the same time window, and they needed to keep retaining more words in memory from the 1st to the 10th word. Therefore, it is not surprising that we observed not only a lack of correlation between task demands and pupillary response at easier SNR levels, but also a reversal of that relation in the most cognitively demanding condition (0dB and repeat-with-recall).

repeat-with-recall than repeat-only condition but the baseline diameter was similar. As the baseline diameter grew bigger and plateaued in the repeat-with-recall condition, PPD did not have much space to grow, so it decreased faster than in the repeat-only condition.
Similarly, in the repeat-with-recall condition, baseline diameter was already bigger than in the repeat-only condition for all SNR levels to start with, leaving little room for PPD to increase further during the task. It looks as if, under sustained listening conditions, there is a limit on the magnitude of pupil dilation, beyond which no further increase is possible. This interpretation is tempting in its logic. However, this limit cannot be imposed by a physiological constraint of the iris muscles, because at the onset of the recall, pupil diameter increased dramatically, on average by 0.3mm, equivalent to an effect six times bigger than the average PPD at the 10th word (also seen in Cabestrero et al. [26] and discussed in Zekveld et al. [43]). Instead, this limit might be of a

manner. In Cabestrero et al. [26], the biggest 'jump' at the onset of recall was when 5 digits were to be recalled (low load), and the smallest 'jump' was when 11 digits were to be recalled (overload), suggesting that this sharp increase in pupil diameter is proportionate to the cognitive resources left for the recall task. Arguably, how cognitive resources are allocated to different tasks could also depend on individual cognitive capacity and cognitive abilities. Listeners with bigger cognitive capacity and better abilities to process speech in noise might allocate fewer resources (lower limit) to word recognition and encoding, because they will be more efficient in completing the task [75,76]. Therefore, to fully test this hypothesis, future studies need to include more individual cognitive ability measurements and different types of manipulations on cognitive load.

due to a latent variable, for instance individual cognitive capacity [23,27,44,53,78,79].

To summarise, while behavioural performance (i.e., recall) and subjective rating

been developed yet to reliably adjust pupil diameter for these factors. Special care was taken during the experiment and data preprocessing: participants were instructed to keep fixating on the fixation circle during verbal responses; we extrapolated points in the pupil traces where the centre of gaze was beyond 3SD from the centre and excluded trials where over 20% of the traces were either blinks or erratic gazing.

Although this led to loss of data, we ensured that the data left for analysis were valid.

Nevertheless, speech production following the response cue could potentially interfere with the pupillary response corresponding to memory encoding. Individual differences in the timing of responding could also interfere with the correspondence between memory encoding and pupillary response. However, this artefact was present for every word because participants needed to repeat words in all conditions. Therefore, the difference in pupil trace observed within this time window could not be entirely due to production confounds.

It is possible that prior motivational, emotional and cognitive factors and social pressure could disturb the relation between pupillary response and listening effort that is well-established in research settings.

Supporting information

S1 File. Alternative method to calculate PPD. Results and discussions on the alternative method to perform baseline correction using the averaged pupil trace 1s before the first word in the list.

S1 factor LISTENING is 0dB, for the factor TASK is repeat-only.