Modulation of the primary auditory thalamus when recognising speech in noise

Recognising speech in background noise is a strenuous daily activity, yet most humans can master it. A mechanistic explanation of how the human brain deals with such sensory uncertainty is the Bayesian brain hypothesis. In this view, the brain uses a dynamic generative model to simulate the most likely trajectory of the speech signal. Such a simulation account can explain why there is a task-dependent modulation of sensory pathway structures (i.e., the sensory thalami) for recognition tasks that require tracking of fast-varying stimulus properties (i.e., speech) in contrast to relatively constant stimulus properties (e.g., speaker identity), despite the same stimulus input. Here we test the specific hypothesis that this task-dependent modulation for speech recognition increases in parallel with the sensory uncertainty in the speech signal. In accordance with this hypothesis, we show - using ultra-high-resolution functional magnetic resonance imaging in human participants - that the task-dependent modulation of the left primary sensory thalamus (ventral medial geniculate body, vMGB) for speech is particularly strong when recognising speech in noisy listening conditions in contrast to situations where the speech signal is clear. Exploratory analyses showed that this finding was specific to the left vMGB; it was not present in the midbrain structure of the auditory pathway (left inferior colliculus, IC). The results imply that speech-in-noise recognition is supported by modifications at the level of the subcortical sensory pathway providing driving input to the auditory cortex.


Introduction
Honking horns and roaring engines, the hammering from a construction site, the mix of music and speech at a restaurant or pub, the chit-chat of many children in a classroom - these are just some examples of the background noises that continuously accompany us. Nevertheless, humans have a remarkable ability to hear and understand a conversation partner even under such severe listening conditions (Cherry, 1953).
To date it is by and large unclear how the human brain accomplishes robust speech-in-noise perception. Understanding human speech-in-noise recognition on a mechanistic level is important, as it would advance our understanding of why some clinical populations have difficulties with speech-in-noise perception. Furthermore, a more mechanistic understanding of how the human brain recognises speech in noise might also offer new insight into why artificial speech recognition systems still have difficulties when speech is presented in noise (Gupta et al., 2016; Qian et al., 2016; Scharenborg, 2007).
One mechanistic account of brain function that attempts to explain how the human brain deals with noise or uncertainty in the stimulus input is the Bayesian brain hypothesis. It assumes that the brain represents information probabilistically and uses an internal generative model and predictive coding for the most effective processing of sensory input (Friston, 2005;Friston and Kiebel, 2009;Kiebel et al., 2008;Knill and Pouget, 2004).
This type of processing has the potential to explain why the human brain is robust to sensory uncertainty, e.g., when recognising speech despite noise in the speech signal (Knill and Pouget, 2004; Srinivasan et al., 1982). Although predictive coding is often discussed in the context of cerebral cortex organization (Hesselmann et al., 2010; Shipp et al., 2013), it may also be a governing principle of the interactions between the cerebral cortex and subcortical sensory pathway structures (Adams et al., 2013; Bastos et al., 2012; Huang and Rao, 2011; Mumford, 1992; Seth and Friston, 2016; von Kriegstein et al., 2008). In accordance with this suggestion, studies in animals found that feedback from cerebral cortex areas changes processing in the sensory pathway, i.e., in the sensory thalamus and brainstem nuclei (Krupa et al., 1999; Sillito et al., 1994, 2006; Wang et al., 2018).
In humans, responses in the auditory sensory thalamus (medial geniculate body, MGB) are higher for speech tasks (which emphasise recognition of fast-varying speech properties) than for control tasks (which require recognition of relatively constant properties of the speech signal, such as the speaker identity or the sound intensity level). This response difference holds even though the stimulus input is the same (Díaz et al., 2012; von Kriegstein et al., 2008). This task-dependent modulation seems to be behaviourally relevant for speech recognition: performance in auditory speech recognition was positively correlated with the amount of task-dependent modulation in the MGB of the left hemisphere (Mihai et al., 2019; von Kriegstein et al., 2008). This behaviourally relevant task-dependent modulation was located in the ventral part of the MGB (vMGB), which is the primary subsection of the MGB, but not in other MGB subsections (Mihai et al., 2019). These findings could fit the Bayesian brain hypothesis on cortico-subcortical interactions: cerebral cortex areas provide dynamic predictions about the incoming sensory input to the sensory thalamus to optimally encode the trajectory of the fast-varying and predictable speech input (Díaz et al., 2012; von Kriegstein et al., 2008). If this is the case, a specific hypothesis ensues: the task-dependent modulation of the vMGB should be especially strong when the fast dynamics of speech have to be recognised in conditions with high sensory uncertainty (Díaz et al., 2012; Feldman and Friston, 2010; Van de Cruys et al., 2014; Yu and Dayan, 2005), for example when the incoming signal is disturbed (Feldman and Friston, 2010; Friston and Kiebel, 2009; Gordon et al., 2017; Yu and Dayan, 2005). The present study aimed to test this hypothesis.
An ecologically valid way to increase uncertainty about the speech input is the presentation of speech in background noise (Chandrasekaran and Kraus, 2010a). We therefore tested whether the task-dependent modulation of the left vMGB for speech is higher when the speech stimuli are heard in a noisy as opposed to a clear background. We used ultra-high field fMRI at 7 T and a design that has been shown to elicit task-dependent modulation of the MGB in previous studies (Díaz et al., 2012; von Kriegstein et al., 2008). We complemented the design with a noise factor: the speech stimuli were presented with and without background noise. The experiment was a 2 × 2 factorial design with the factors task (speech task, speaker task) and noise (noise, clear). To test our hypothesis, we performed a task × noise interaction analysis. We predicted that the task-dependent modulation of the left vMGB increases with decreasing signal-to-noise ratio (i.e., increasing uncertainty about the speech sounds). We focused on the left vMGB for two reasons. First, its response showed behavioural relevance for speech recognition in previous studies (Mihai et al., 2019; von Kriegstein et al., 2008). Second, a study on developmental dyslexia - a condition that is often associated with speech-in-noise recognition difficulties (Chandrasekaran et al., 2009; Ziegler et al., 2009) - showed reduced task-dependent modulation of the left MGB in comparison to controls (Díaz et al., 2012).
In addition to testing our main hypothesis, the design also (i) served to test the replicability of previous findings on the involvement of the MGB in speech recognition as well as its relevance for speech recognition behaviour (Mihai et al., 2019; von Kriegstein et al., 2008), and (ii) allowed us to explore the role of the inferior colliculus (IC) - the midbrain station of the auditory sensory pathway - in speech and speech-in-noise recognition.

Results
Participants listened to blocks of auditory syllables (e.g., /ada/, spoken by three different speakers) and performed either a speech or a speaker task (Figure 1). In the speech task, participants reported via button press whether the current syllable was different from the previous one (1-back task). In the speaker task, participants reported via button press whether the current speaker was different from the previous one. The speakers' voices were resynthesized from the recordings of one speaker's voice to only differ in constant speaker-individuating features (i.e., the vocal tract length and the fundamental frequency of the voice). This ensured that the speaker task could not be done on dynamic speaker-individuating features (e.g., idiosyncrasies in pronunciations of phonemes). Participants listened to stimuli either embedded in speech-shaped noise (noise condition) or without background noise (clear condition).

Figure 1. Design and trial structure of the experiment. In the speech task, listeners performed a one-back syllable task. They pressed a button whenever there was a change in syllable in contrast to the immediately preceding one, independent of speaker change. The speaker task used precisely the same stimulus material and trial structure. The task was to press a button when there was a change in speaker identity in contrast to the immediately preceding one, independent of syllable change. An initial task instruction screen informed participants about which task to perform. Participants heard stimuli either with concomitant speech-shaped noise (noise condition) or without background noise (clear condition). Thus the experiment had four conditions: speech task/noise, speaker task/noise, speech task/clear, speaker task/clear. Stimuli in the speech and speaker tasks were precisely identical.

Behavioural results
Participants performed well above chance level in all four conditions (> 82% correct; Table 1; Figure 2A). Performing the tasks in background noise was more difficult than without background noise for both the speech and the speaker task (Figure 2B; for details on statistics, see figure legend). The rate of hits in the speech task was the same as in the speaker task (Figure 2C). There was a detectable interaction between task and noise (Figure 2D/E), but the simple main effects (i.e., speech task/noise - speaker task/noise, Figure 2F, and speech task/clear - speaker task/clear, Figure 2G) were not detectable.

The task-dependent modulation of the left vMGB increased when recognising speech in noise in contrast to the clear speech condition
We localised the left vMGB based on an independent functional localizer (see section 4, Materials and Methods). In accordance with our hypothesis, there was an increased BOLD response for the task × noise interaction [(speech task/noise - speaker task/noise) - (speech task/clear - speaker task/clear)] in the left vMGB (Figure 3A/B). The interaction had a large mean effect size, ranging from a small to a very large effect (g* = 2.549 [0.211, 5.066]; Figure 3C and 3D). The 95% HPD of the interaction effect excluded 0, indicating that this was a robust effect (Bunce and McElreath, 2017; McElreath, 2018). Simple main effect analyses showed that the direction of the interaction was as expected. The speech task/noise condition yielded higher left vMGB responses than the speaker task/noise condition, ranging from a medium to a very large effect (g* = 1.104 [0.407, 1.798]; Figure 3E). Conversely, the left vMGB response difference between the speech task and the speaker task in the clear condition (peak at MNI coordinate -11, -28, -6) had a small mean effect size (g* = 0.243 [-0.366, 0.854]; Figure 3F), ranging from a negative medium to a positive large effect, and the HPD contained zero.
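The effect sizes above are bias-corrected standardised mean differences (g*) with intervals derived from the Bayesian posterior. Purely as an illustration of the underlying quantity, a minimal sketch of the classical Hedges' g (function name and inputs are hypothetical; the reported g* values and their intervals come from the posterior distributions, not from this closed-form formula):

```python
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Classical Hedges' g: Cohen's d scaled by the small-sample
    correction factor J = 1 - 3 / (4 * (n1 + n2) - 9)."""
    # Pooled standard deviation of the two conditions
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp                    # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)       # bias correction for small samples
    return d * j
```

With n = 17 per condition, J shrinks d by roughly 2%, which matters for the small samples typical of ultra-high-field fMRI.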
These results show that the task-dependent modulation of the left vMGB (speech in contrast to speaker task) was increased when the tasks were performed in background noise compared to without background noise. This finding cannot be explained by differences in stimulus input, as the same stimulus material was used for the speech and the speaker task. The results are also unlikely to be due to differences in task difficulty between conditions, as the behavioural results showed no detectable differences in performance for the simple main effects.

Test for replication of previous findings
In addition to addressing the main hypothesis of the present paper, the data also allowed testing for replication of previous findings (Díaz et al., 2012; Mihai et al., 2019; von Kriegstein et al., 2008). Main effect of task: The Bayesian analysis of the parameter estimates showed a large effect size for the contrast speech - speaker task in the left vMGB (0.679 [0.192, 1.200]); the HPD excluded zero. In the corresponding figure, crosshairs point to MNI coordinate (-12, -27, -7), the white outline shows the boundary of the vMGB mask, and the orientation of the images is the same as in Figure 3A/B.
Correlation between main effect of task and speech recognition performance: There was no significant correlation between the task-dependent modulation (i.e., parameter estimates for the contrast speech - speaker) and the proportion of hits in the speech task; the effect size was very small and non-significant (mean Pearson's r = 0.15, p = 0.566; Figure S1). Such a correlation had been present in previous studies (von Kriegstein et al., 2008, with n = 17 participants; Mihai et al., 2019, with n = 33 participants), but was also not significant in one previous study (Díaz et al., 2012, with n = 14 participants). Since the previous studies did not include the factor noise, we also computed correlation coefficients between the simple main effect of task (speech/clear - speaker/clear) and the proportion of hits in the speech/clear condition. Correlation coefficients were small and non-significant (r = 0.03, p = 0.917; Figure S1B).
To avoid wrongly treating variable results across studies as indicating a null effect, we additionally assessed the correlation across studies in a meta-analysis (see below). We attribute the non-significant correlation between the task-dependent modulation and the proportion of hits in the speech task in the present study to the fact that ~11% of the behavioural data in the speech task were at or near ceiling, resulting in reduced correlation values (Bland and Altman, 2011). Many of the behavioural values huddled towards the ceiling when plotted against BOLD responses (Figure S1). This was not the case in previous studies, where there were no ceiling effects in the behavioural data (Díaz et al., 2012; Mihai et al., 2019; von Kriegstein et al., 2008).

Exploratory analyses on the inferior colliculus
In exploratory analyses, we investigated the involvement of the bilateral inferior colliculi (IC) during speech processing. The reason for these exploratory analyses was that studies using auditory brainstem responses (ABR) during passive listening to speech sounds have shown that the quality of speech sound representation (i.e., as measured by the frequency following response, FFR) explains inter-individual variability in speech-in-noise recognition abilities (Chandrasekaran et al., 2009; Schoof and Rosen, 2016; Selinger et al., 2016; Song et al., 2010). These findings indicated that there might be subcortical nuclei beyond the MGB that are involved in speech-in-noise perception, potentially including sources in the auditory brainstem, particularly the IC (Chandrasekaran and Kraus, 2010b). Four previous fMRI experiments, however, have shown no significant task-dependent modulation (i.e., higher BOLD responses for a speech in contrast to a control task on the same stimuli) of the inferior colliculus (Díaz et al., 2012; Mihai et al., 2019; von Kriegstein et al., 2008).
Two of these experiments showed a significant positive correlation between the BOLD response difference between a speech and a control task in the left IC and speech recognition performance across participants (von Kriegstein et al., 2008), but the others did not. Thus the role of the IC in speech recognition and speech-in-noise recognition is to date unclear. In the present data, there was a small effect of task in the left IC (speech - speaker task). To assess the correlation between this task-dependent modulation and speech recognition performance across studies, we performed a meta-analysis including von Kriegstein et al. (2008) (n = 17), Díaz et al. (2012) (n = 14), Mihai et al. (2019) (n = 33), and the current study (n = 17).

Discussion
We showed that the task-dependent modulation of the left hemispheric primary sensory thalamus (vMGB) for speech is particularly strong when recognising speech in noisy listening conditions in contrast to conditions where the speech signal is clear. This finding confirmed our a priori hypothesis, which was based on explaining sensory thalamus function within a Bayesian brain framework. Exploratory analyses showed that there was no influence of noise on the responses for the contrast between speech and speaker task in the auditory midbrain, i.e., the inferior colliculi (IC). Besides addressing our main hypothesis, we also provide three additional key findings. First, we replicated results from previous experiments (Díaz et al., 2012; von Kriegstein et al., 2008) that showed task-dependent modulation of the MGB for speech, and localised the task-dependent modulation in the vMGB (Mihai et al., 2019). Second, a meta-analysis of five studies showed that there was a positive correlation between the task-dependent modulation for speech in the left MGB and behavioural performance in the speech task across studies. Third, the same meta-analysis revealed a positive correlation between the task-dependent modulation for speech and the behavioural performance in the speech recognition task in the left IC.
Our main hypothesis in the present paper was based on the assumption that predictive coding might be a governing principle of how the human brain deals with background noise during speech recognition. Bayesian approaches to brain function propose that the brain uses internal dynamic models to predict the trajectory of the sensory input (Friston, 2005; Friston and Kiebel, 2009; Kiebel et al., 2008; Knill and Pouget, 2004). Thus, slower dynamics of the internal dynamic model (e.g., syllable and word representations) could be encoded by auditory cerebral cortex areas (Davis and Johnsrude, 2007; Giraud et al., 2000; Hickok and Poeppel, 2007; Mattys et al., 2012; Price, 2012; Wang et al., 2008), which provide predictions about the faster dynamics of the input arriving at lower levels of the anatomic hierarchy (Kiebel et al., 2008; von Kriegstein et al., 2008). In this view, dynamic predictions modulate the response properties of the first-order sensory thalamus to optimise the early stages of speech recognition (Mihai et al., 2019). In speech processing, such a mechanism might be especially useful, as the signal includes rapid dynamics that are partly predictable (e.g., due to co-articulation or learned statistical regularities in words) (Saffran, 2003), and often has to be computed online under conditions of (sensory) uncertainty. Uncertainty refers to the limited reliability of sensory information about the world (Knill and Pouget, 2004).
Examples include the density of hair cells in the cochlea that limits frequency resolution, the neural noise induced at different processing stages, or - as was the case in the current study - background environmental noise that surrounds the stimulus of interest. An internal generative model of the fast sensory dynamics of speech (Friston, 2005; Friston and Kiebel, 2009; Kiebel et al., 2008; Knill and Pouget, 2004) could lead to an enhanced stimulus representation in the subcortical sensory pathway and thereby provide improved signal quality to the auditory cortex. Such a mechanism would result in more efficient processing when taxing conditions, such as background noise, confront the perceptual system. The interaction between task and noise in the left vMGB is congruent with such a mechanism: it shows that the task-dependent modulation of the left vMGB is increased in a situation with high sensory uncertainty in contrast to a situation with lower sensory uncertainty.
Speech-in-noise recognition abilities are thought to rely (i) on additional cognitive resources that are recruited when recognising speech in noise (reviewed in Peelle, 2018) and (ii) on the fidelity of speech sound representation in brainstem nuclei, as measured by auditory brainstem response recordings (reviewed in Chandrasekaran and Kraus, 2010b). For example, studies investigating speech-in-noise recognition at the level of the cerebral cortex found networks that include areas involved in linguistic, attentional, working memory, and motor planning processes (Bishop and Miller, 2008; Salvi et al., 2002; Scott et al., 2004; Wong et al., 2008). These results suggest that during speech recognition in challenging listening conditions, additional cerebral cortex regions are recruited that likely complement the processing of sound in the core speech network (reviewed in Peelle, 2018). The present study showed that, besides this additional cerebral cortex recruitment, a specific part of the sensory pathway is also modulated during speech-in-noise recognition: the left vMGB.
Auditory brainstem response (ABR) recordings during passive listening to speech sounds have shown that the quality of speech sound representation (i.e., as measured by the frequency following response, FFR) explains inter-individual variability in speech-in-noise recognition abilities (Chandrasekaran et al., 2009; Schoof and Rosen, 2016; Selinger et al., 2016; Song et al., 2010) and can be modulated by attention to speech in situations with two competing speech streams (Forte et al., 2017). It is difficult to directly relate the results of these FFR studies on participants with varying speech-in-noise recognition abilities (Chandrasekaran et al., 2009; Schoof and Rosen, 2016; Selinger et al., 2016; Song et al., 2010) to the studies on task-dependent modulation of structures in the subcortical sensory pathway (Díaz et al., 2012; Mihai et al., 2019; von Kriegstein et al., 2008): they involve very different measurement modalities and the FFR studies focus mostly on speech-in-noise perception in passive listening designs. One major candidate for the FFR source is the inferior colliculus. Particularly for speech, the FFR, as recorded by EEG, seems to be dominated by brainstem and auditory nerve sources (Bidelman, 2018; reviewed in Chandrasekaran et al., 2014). The results of the present study, however, do not provide evidence for a specific involvement of the inferior colliculus when recognising speech-in-noise. Whether the inferior colliculus plays a different role in speech-in-noise processing is an open question.
We speculate that the task-dependent vMGB modulation might be a result of feedback from cerebral cortex areas. The strength of the feedback could be enhanced when speech has to be recognised in background noise. The task-dependent feedback may emanate directly from primary auditory or association cortices, or indirectly via other structures such as the reticular nucleus with its inhibitory connections to the MGB (Rouiller and de Ribaupierre, 1985). Feedback cortico-thalamic projections from layer 6 in A1 to the vMGB, but also from association cortices such as the motion-sensitive planum temporale (Tschentscher et al., 2019), may modulate information ascending through the lemniscal pathway, rather than convey information to the vMGB (Lee, 2013; Llano and Sherman, 2008). A dysfunction of such cortico-thalamic modulation might contribute to the speech-in-noise recognition difficulties associated with developmental dyslexia (Boets et al., 2007; Díaz et al., 2012; Ziegler et al., 2009).
In conclusion, the results presented here suggest that the left vMGB is particularly involved in decoding speech as opposed to identifying the speaker if there is background noise. This enhancement may be due to top-down processes that act upon subcortical sensory structures, such as the auditory thalamus, to better predict dynamic incoming signals in conditions with high sensory uncertainty.

Participants
Participants were tested on reading speed and reading comprehension (Schneider et al., 2007). The cut-off for both reading scores was set to the level described in the test instructions as the "lower average and above" performance range (i.e., 26% - 100% of the calculated population distribution). None of the participants performed below the cut-off (mean 68.7%, SD 20.6%, lowest mean score: 36%). In addition, participants were tested on rapid automatized naming (RAN) of letters, numbers, and objects (Denckla and Rudel, 1976).

Stimuli
We recorded 79 different vowel-consonant-vowel (VCV) syllables with an average duration of 784 ms, SD 67 ms. These were spoken by one male voice (age 29 years), recorded with a video camera (Canon Legria HFS10, Canon, Japan) and a Røde NTG-1 microphone (Røde Microphones, Silverwater, NSW, Australia) connected to a pre-amplifier (TubeMP Project Series, Applied Research and Technology, Rochester, NY, USA) in a sound-attenuated room.
The sampling rate was 48 kHz at 16 bit. Auditory stimuli were cut and flanked by Hamming windows of 15 ms at the beginning and end, converted to mono, and root-mean-square equalised using Python 3.6 (Python Software Foundation, www.python.org). The 79 auditory files were resynthesized with TANDEM-STRAIGHT (Banno et al., 2007) to create three speaker identities that differed in vocal tract length (VTL) and glottal pulse rate (GPR), yielding 237 stimuli in total. The parameter choice (VTL and GPR) was motivated by the fact that a VTL difference of 25% and a GPR difference of 45% suffice for listeners to hear different speaker identities (Gaudrain et al., 2009; Kreitewolf et al., 2014). Additionally, we conducted pilot experiments (12 pilot participants who did not participate in the main experiment) to fine-tune the combination of VTL and GPR so that behavioural accuracy was balanced between the speech and speaker tasks. The pilot experiments were conducted outside the scanner, and each run included continuous recordings of scanner gradient noise to simulate a real scanning environment.
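The windowing and level equalisation steps described above can be sketched in plain Python. This is a simplified illustration, not the actual pipeline; the sampling rate argument and the target RMS level are placeholders:

```python
import math

def hamming_flanks(samples, sr=48000, ramp_ms=15):
    """Apply rising/falling Hamming-window ramps of `ramp_ms` at the
    beginning and end of a signal to avoid onset/offset clicks."""
    n = int(sr * ramp_ms / 1000)
    out = list(samples)
    for i in range(min(n, len(out))):
        # Rising half of a Hamming window: 0.08 at the edge, 1.0 at the top
        w = 0.54 - 0.46 * math.cos(math.pi * i / (n - 1))
        out[i] *= w        # fade in
        out[-1 - i] *= w   # fade out (mirrored)
    return out

def rms_equalise(samples, target_rms=0.1):
    """Scale a signal so all stimuli share the same root-mean-square level."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return [x * target_rms / rms for x in samples]
```

For example, `rms_equalise(hamming_flanks(stimulus))` brings every stimulus to a common level after its edges have been smoothed.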
The 237 stimuli were embedded in background noise to create the stimuli for the condition with background noise. The background noise consisted of normally distributed random (white) noise filtered with a speech-shaped envelope. We calculated the envelope from the sum of all VCV stimuli presented in the experiment. We used speech-shaped noise as it has a stronger masking effect than stationary random non-speech noise (Carhart et al., 1975).
Before each experimental run, the noise was computed and added to the stimuli included in the run at a signal-to-noise ratio (SNR) of 2 dB. The SNR choice was based on a pilot study that showed a performance decrease of at least 5% but no greater than 15% between the clear and the noise condition. In the pilot study, we started at an SNR of -10 dB and increased this value until we converged on an SNR of 2 dB. Calculations were performed in Python 3.6.
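Mixing signal and noise at a fixed SNR amounts to scaling the noise so that the signal-to-noise power ratio equals 2 dB. A simplified sketch (the speech-shaping step, i.e., imposing the summed-VCV spectral envelope on white noise, is omitted here; function names are illustrative):

```python
import math

def rms(x):
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def add_noise_at_snr(signal, noise, snr_db=2.0):
    """Scale `noise` so that 20*log10(rms(signal)/rms(noise)) == snr_db,
    then mix it into the signal sample by sample."""
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(signal, noise)]
```

Note the factor 20 (not 10) in the exponent: SNR in dB is defined on power, which is the square of the RMS amplitude.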

Procedure
We conceived the experiment as a 2 × 2 factorial design with the factors task (speech, speaker) and background noise (clear, noise). Participants listened to blocks of auditory VCV syllables and were asked to perform two types of tasks: a speech task and a speaker task. In the speech task, participants reported via button press whether the current syllable was different from the previous one (1-back task). In the speaker task, participants reported via button press whether the current speaker was different from the previous one.
The blocks had either syllables with background noise (noise condition) or without background noise (clear condition).
Task instructions were presented for two seconds before each block and consisted of white written words on a black background (German words "Silbe" for syllable, and "Person" for person). After the instruction, the block of syllables started (Figure 1). Each block contained twelve stimuli. Each stimulus had a duration of approximately 784 ms, and the stimulus presentation was followed by 400 ms of silence. Within one block, both syllables and speakers changed at least twice, with a theoretical maximum of nine changes. The theoretical maximum was derived from random sampling of seven instances from four possible change types: no change, speech change, speaker change, and change of both speech and speaker. The average length of a block was 15.80 seconds, SD 0.52 seconds.
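One way such a block structure could be sampled under the two constraints (at least two syllable changes and at least two speaker changes per block) is rejection sampling over the transitions between consecutive stimuli. This sketch is illustrative only and not necessarily the exact sampling procedure used; the type names are hypothetical:

```python
import random

# Possible transition types between two consecutive stimuli
TYPES = ["none", "syllable", "speaker", "both"]

def make_block(n_stimuli=12, rng=None):
    """Sample transition types for a block of `n_stimuli` stimuli,
    resampling until syllable and speaker each change at least twice."""
    rng = rng or random.Random()
    while True:
        trans = [rng.choice(TYPES) for _ in range(n_stimuli - 1)]
        syl = sum(t in ("syllable", "both") for t in trans)
        spk = sum(t in ("speaker", "both") for t in trans)
        if syl >= 2 and spk >= 2:
            return trans
```

With twelve stimuli there are eleven transitions, so the constraints are almost always satisfied on the first draw and the rejection loop rarely repeats.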
The experiment was divided into four runs. The first three runs had a duration of 12:56 min and included 40 blocks: 10 for each of the four conditions (speech task/noise, speaker task/noise, speech task/clear, speaker task/clear). A fourth run had a duration of 6:32 min and included 20 blocks (5 for each of the four conditions). For two participants, only the first three runs were recorded due to time constraints. Participants could rest for one minute between runs.
Participants were familiarised with the three speakers' voices to ensure that they could perform the speaker-identity task of the main experiment. The speaker familiarisation took place 30 minutes before the fMRI experiment. It consisted of a presentation of the speakers and a test phase. In the presentation phase, the speakers were presented in six blocks, each containing nine pseudo-randomly chosen VCV stimuli from the 237 total. Each block contained one speaker-identity only. Participants were alerted to the onset of a new speaker identity block by the presentation of white words on a black screen indicating speaker 1, speaker 2, or speaker 3. Participants listened to the voices with the instruction to memorise the speaker's voice. In the following test phase participants were presented with four blocks of nine trials that each contained randomly chosen syllable pairs spoken by the three speakers. The syllable pairs could be from the same or a different speaker. We asked participants to indicate whether the speakers of the two syllables were the same by pressing keypad buttons "1" for yes and "2" for no. Participants received visual feedback for correct (the green flashing German word for correct: "Richtig") and incorrect (the red flashing German word for incorrect: "Falsch") answers. The speaker familiarisation consisted of three 2:50 min runs (each run contained one presentation and one test phase).
If participants scored below 80% on the last run, they performed an additional run until they scored above 80%. All participants exceeded the 80% cut-off value.

Data Acquisition and Processing
MRI data were acquired using a Siemens Magnetom 7 T scanner (Siemens AG, Erlangen, Germany) with an 8-channel head coil. We settled on the 8-channel coil because its spaciousness allowed the use of higher-quality headphones (manufactured 2008; MrConfon GmbH, Magdeburg, Germany). Functional MRI data were acquired using echo-planar imaging (EPI) sequences. We used partial brain coverage with 30 slices. The volume was oriented in parallel to the superior temporal gyrus such that the slices encompassed the MGB, the inferior colliculi (IC), and Heschl's gyrus.
To address geometric distortions in EPI images we recorded gradient echo based field maps which had the following acquisition parameters: TR = 1500 ms, TE1 = 6.00 ms, TE2 = 7.02 ms, flip angle 60°, 0% phase oversampling, matrix size 100, FoV 220 mm x 220 mm, phase partial Fourier off, voxel size 2.2 mm isotropic resolution, interleaved acquisition, anterior to posterior phase-encode direction. Resulting images from field map recordings were two magnitude images and one phase difference image.

Behavioural Data Analysis
Button presses (hits, misses) follow a binomial distribution and were thus modelled using a binomial logistic regression, which predicts the probability of a correct button press in each of the four conditions (speech task/noise, speaker task/noise, speech task/clear, speaker task/clear), in a Bayesian framework (McElreath, 2018).
To pool over participants and runs, we modelled the correlation between intercepts and slopes. For the model implementation and data analysis, we used PyMC3 3.5 (Salvatier et al., 2016), a probabilistic programming package for Python 3.6. We sampled with a No-U-Turn Sampler (Hoffman and Gelman, 2014) with four parallel chains. Per chain, we drew 5,000 samples after 5,000 warm-up samples. The effects of interest were: the main effects (clear - noise, speech task - speaker task), the interaction [(speech task/noise - speaker task/noise) - (speech task/clear - speaker task/clear)], and the simple main effects (speech task/noise - speaker task/noise, speech task/clear - speaker task/clear). For the effects of interest, we calculated means from the posterior distributions and 95% highest posterior density intervals (HPD). The 95% HPD is the narrowest interval within which the parameter lies with 95% probability (Gelman et al., 2013; McElreath, 2018). If the posterior probability distribution of odds ratios does not strongly overlap one (i.e., the HPD excludes one), then it is assumed that there is a detectable difference between conditions (Bunce and McElreath, 2017; McElreath, 2018).
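To make the HPD definition concrete: given MCMC samples of a parameter, the 95% HPD is the narrowest interval that contains 95% of them. A minimal sketch (PyMC3 and its successors provide equivalent summary functions; this is only for illustration):

```python
import math

def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    s = sorted(samples)
    n = len(s)
    k = math.ceil(mass * n)                          # samples inside the interval
    widths = [s[i + k - 1] - s[i] for i in range(n - k + 1)]
    i = widths.index(min(widths))                    # narrowest window wins
    return s[i], s[i + k - 1]
```

Unlike an equal-tailed interval, the HPD of a skewed posterior hugs the mode, which is why it is the preferred summary here.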
The predictors included in the behavioural data model were: task (xS: 1 = speech task, 0 = speaker task) and background noise (xN: 1 = noise, 0 = clear). We also included the two-way interaction of task and noise condition. Because data were collected across participants and runs, we included random effects for both in the logistic model. Furthermore, since ~11% of the data exhibited ceiling effects (i.e., some participants scored at the highest possible level), which would result in underestimated means and standard deviations (Uttl, 2005), we treated these data as right-censored and modelled them using a Potential class (Jordan, 1998; Lauritzen et al., 1990) as implemented in PyMC3. This method integrates out the censored values using the log of the complementary normal cumulative distribution function (Gelman et al., 2013; McElreath, 2018). In essence, we sampled twice: once for the observed values without the censored data points, and once for the censored values only.
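The censoring scheme can be illustrated outside PyMC3: the likelihood combines the exact density for observed points with the log complementary normal CDF (survival function) for each right-censored point. A stdlib-only sketch, assuming a normal observation model for illustration (the study used a PyMC3 Potential; all function names here are hypothetical):

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def normal_logsf(x, mu, sigma):
    """Log of the complementary CDF (survival function), P(X > x).
    Naive version: underflows far in the upper tail."""
    return math.log(0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2))))

def censored_loglik(observed, n_censored, ceiling, mu, sigma):
    """Log-likelihood of right-censored data: exact density for observed
    points, log complementary CDF for each point censored at `ceiling`."""
    ll = sum(normal_logpdf(x, mu, sigma) for x in observed)
    ll += n_censored * normal_logsf(ceiling, mu, sigma)
    return ll

# two observed accuracy scores plus three participants at ceiling (1.0)
ll = censored_loglik([0.80, 0.92], 3, 1.0, 0.85, 0.10)
```

Each censored point contributes the probability of exceeding the ceiling rather than a (biased) density at the ceiling value itself.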

Structural and functional MRI data preprocessing
We used the template image created for a previous study (Mihai et al., 2019) from the structural MP2RAGE images of the 28 participants of that study. We chose this template because 15 participants of the current study are included in it, and because the vMGB mask (described below) is in the same space as the template image. This common template reduces warping artefacts, which would be introduced with a different template, as both the vMGB mask and the functional data of the present study would then need to be warped to a common space. The template was created and registered to MNI space with ANTs (Avants et al., 2008) and the MNI152 template provided by FSL 5.0.8 (Smith et al., 2004). All MP2RAGE images were preprocessed with FreeSurfer (Fischl et al., 2004; Han and Fischl, 2007) using the recon-all command to obtain the boundaries between grey and white matter, which were later used in the functional-to-structural registration step.
Preprocessing and statistical analysis pipelines were coded in nipype 1.1.2 (Gorgolewski et al., 2011). Head motion and susceptibility-distortion-by-movement interactions in the functional runs were corrected using the Realign and Unwarp method (Andersson et al., 2001) in SPM 12. This step uses a voxel displacement map (VDM), which addresses geometric distortions in EPI caused by magnetic field inhomogeneity. The VDM was calculated from the field map recordings, which provided the magnitude and phase difference images, using the FieldMap Toolbox (Jezzard and Balaban, 1995).

Physiological data
Physiological data (heart rate and respiration rate) were processed by the PhysIO Toolbox (Kasper et al., 2017) to obtain Fourier expansions of each, in order to enter these into the design matrix (see section 4.6.3 Testing our hypothesis in the left vMGB). Since heartbeats and respiration result in undesired cortical and subcortical artefacts, regressing these out increases the specificity of fMRI responses to the task of interest (Kasper et al., 2017). These artefacts occur in abundance around the thalamus (Kasper et al., 2017).
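The principle behind these Fourier expansions can be sketched as follows: given the cardiac (or respiratory) phase at each volume acquisition, RETROICOR-style regressors are the cosines and sines of that phase at the first few harmonics. A simplified stdlib sketch (not the PhysIO Toolbox implementation; `fourier_regressors` is an illustrative name):

```python
import math

def fourier_regressors(phase, order):
    """RETROICOR-style nuisance regressors: cos/sin of the physiological
    phase at harmonics 1..order, one row per fMRI volume."""
    return [
        [f(k * ph) for k in range(1, order + 1) for f in (math.cos, math.sin)]
        for ph in phase
    ]

# toy cardiac phase (radians) at four volume acquisition times
phase = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]
regs = fourier_regressors(phase, order=3)   # 3 harmonics -> 6 columns per volume
```

Adding these columns to the design matrix lets the GLM absorb signal fluctuations locked to the cardiac and respiratory cycles.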

Testing our hypothesis in the left vMGB
Models were set up in SPM 12 using the native space data for each participant. We modelled five conditions of interest: speech task/noise, speaker task/noise, speech task/clear, speaker task/clear, and task instruction. Onset times and durations were used to create boxcar functions, which were convolved with the hemodynamic response function (HRF) provided by SPM 12. The design matrix also included the following nuisance regressors: three cardiac, four respiratory, and a cardiac × respiratory interaction regressor. We additionally entered the outlier regressors from the ArtifactDetect step.
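The construction of a task regressor from boxcars can be sketched as follows, using an illustrative double-gamma HRF (the parameter values approximate the canonical shape and are not SPM 12's exact implementation):

```python
import math

def double_gamma_hrf(n, dt=1.0):
    """Illustrative canonical double-gamma HRF sampled every dt seconds:
    a positive peak minus a scaled, later gamma for the undershoot."""
    def gpdf(t, shape, scale):
        if t <= 0:
            return 0.0
        return (t ** (shape - 1) * math.exp(-t / scale)
                / (math.gamma(shape) * scale ** shape))
    return [gpdf(i * dt, 6.0, 1.0) - gpdf(i * dt, 16.0, 1.0) / 6.0
            for i in range(n)]

def convolve(signal, kernel):
    """Discrete convolution truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        for j in range(len(kernel)):
            if i + j < len(out):
                out[i + j] += signal[i] * kernel[j]
    return out

# boxcar: a 5-s block starting at t = 2 s, 1-s sampling over 30 s
boxcar = [1.0 if 2 <= t < 7 else 0.0 for t in range(30)]
regressor = convolve(boxcar, double_gamma_hrf(30))
```

The convolved regressor, not the raw boxcar, is what enters the design matrix, because the BOLD response lags and smooths the underlying neural events.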
Parameter estimates were computed for each condition at the first level using restricted maximum likelihood (REML) as implemented in SPM 12. Parameter estimates for each of the four conditions of interest (speech task/noise, speaker task/noise, speech task/clear, speaker task/clear) were registered to the MNI structural template using a two-step registration in ANTs. First, a quick registration was performed on the whole head using rigid, affine and diffeomorphic transformations (using Symmetric Normalization, SyN), and the mutual information similarity metric. Second, the high-quality registration was confined to the volume that was covered by the 30 slices of the EPI images. These volumes include the IC, MGB, and primary and secondary auditory cortices. This step used affine and SyN transformations and mean squares and neighbourhood cross-correlation similarity measures. We performed the registration to MNI space by linearly interpolating the contrast images using the composite transforms from the high-quality registration.
We extracted parameter estimates for each of the four conditions of interest per participant, averaged over all voxels of the region of interest, i.e., the left vMGB. To locate the left vMGB, we used the mask from Mihai et al. (2019), which was based on data including 15 of the 17 participants of the present study.
We analysed the extracted parameter estimates in a Bayesian framework (McElreath, 2018). The model was implemented in PyMC3 with a No-U-Turn Sampler with four parallel chains. Per chain, we drew 5,000 posterior samples after 5,000 warm-up samples. The predictors included in the model were: task (xS: 1 = speech task, 0 = speaker task) and background noise (xN: 1 = noise, 0 = clear). We also included the two-way interaction of task and noise condition. Because data were collected across participants, it was reasonable to include random effects. To pool over participants, we modelled the correlation between intercepts and slopes over participants. The interaction model was:

y_i ~ Student-T(ν, μ_i, σ)
μ_i = (α + α_p[i]) + (βS + βS,p[i]) xS_i + (βN + βN,p[i]) xN_i + (βSN + βSN,p[i]) xS_i xN_i

where α + α_p[i], βS + βS,p[i], βN + βN,p[i], and βSN + βSN,p[i] are the sub-models for the intercept, the speech task - speaker task slope, the clear - noise slope, and the interaction slope, respectively. The participant-level effects (α_p, βS,p, βN,p, βSN,p) were drawn from a multivariate distribution with covariance matrix Σ, whose correlation matrix was given an LKJ prior (Lewandowski et al., 2009). Weakly informative priors for the intercept (α) and the additional coefficients (e.g., βS), the random effects for participant (α_p, βS,p), and the multivariate priors for participants identify the model by constraining μ_i to reasonable values. Here we used Student's-T distributions as priors.
From the model output, we calculated posterior distributions for each condition of interest (speech task/noise, speaker task/noise, speech task/clear, speaker task/clear). Posterior distributions, in contrast to point estimates, have the advantage of quantifying the uncertainty about each parameter. We summarised each posterior distribution by its mean as a point estimate (posterior mean) together with a 95% highest posterior density interval (HPD). The 95% HPD is the narrowest interval that contains the parameter with 95% posterior probability (Gelman et al., 2013; McElreath, 2018), i.e., we are 95% sure the parameter lies within the specified interval bounds. We computed the following contrasts of interest: the interaction [(speech task/noise - speaker task/noise) - (speech task/clear - speaker task/clear)]; the simple main effects (speech task/noise - speaker task/noise) and (speech task/clear - speaker task/clear); and the main effect of task (speech task - speaker task). Differences between conditions were converted to effect sizes [Hedges g*; Hedges and Olkin (1985)]. Hedges g*, like Cohen's d (Cohen, 1988), estimates the difference in means between two variables normalised by the pooled standard deviation, with the benefit of a correction for small sample sizes. Following Cohen (1988), we interpreted effect sizes on a spectrum ranging from small (g* ≈ 0.2) to medium (g* ≈ 0.5) to large (g* ≈ 0.8) and beyond. If the HPD did not overlap zero, we considered the effect robust (Bunce and McElreath, 2017; McElreath, 2018). However, we caution readers that an HPD that includes zero does not mean the effect is absent (Amrhein et al., 2019). Instead, we quantify and interpret the magnitude (the point estimate) and its uncertainty (the HPD) provided by the data and our assumptions (Anderson, 2019).
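The effect-size conversion can be sketched as follows; the small-sample correction factor J uses the common approximation 1 - 3/(4(n1 + n2) - 9) (Hedges and Olkin, 1985):

```python
import math

def hedges_g(x, y):
    """Hedges' g*: difference in means normalised by the pooled SD,
    multiplied by the small-sample correction factor J."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    j = 1.0 - 3.0 / (4.0 * (nx + ny) - 9.0)   # small-sample correction
    return j * (mx - my) / pooled_sd

# toy parameter estimates for two conditions (hypothetical values)
g = hedges_g([1.2, 0.8, 1.0, 1.4], [0.6, 0.9, 0.5, 1.1])
```

For the small samples typical of ultra-high-field fMRI studies, J shrinks the estimate noticeably, which is why g* is preferred over uncorrected Cohen's d here.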

Test for replication of previous findings: Main effect of task in the left and right vMGB
We tested for replication of previous studies that found a task-dependent modulation (speech - speaker task) in the left and the right MGB (Díaz et al., 2012; von Kriegstein et al., 2008). To do this, we adopted the same procedure as described in section 4.6.3.1. For the right vMGB we used the mask described in Mihai et al. (2019). Posterior means and 95% HPDs were used to summarise the results.

Test for replication of previous findings: Correlation between the main effect of task and speech recognition performance in the left vMGB
To test for a correlation between the main effect of task in the BOLD response and speech recognition performance across participants, we computed a Pearson correlation between the parameter estimates from the Bayesian model across subjects in the left vMGB for the speech - speaker contrast and the proportion of hits in the speech task. Additionally, we computed the correlation between the simple main effect of task (speech task/clear - speaker task/clear) and the speech task/clear accuracy score.

Meta-analysis of the correlation (speech - speaker task correlated with speech accuracy score) in the left MGB
The lack of statistical significance for the correlation between the speech - speaker task contrast and the proportion of hits in the speech task raised the question of whether the correlation effect in the left MGB differs from those reported previously (Díaz et al., 2012; Mihai et al., 2019; von Kriegstein et al., 2008). We performed a random-effects meta-analysis to test whether the absent task-dependent modulation effect in the present study differed from the studies that have reported a correlation in the MGB. We included five studies in the meta-analysis: two experiments from von Kriegstein et al. (2008), the control participants of Díaz et al. (2012), the result of Mihai et al. (2019), and the current study.
Pearson correlation values were Fisher-z transformed (Fisher, 1915) to z-values and standard errors. These were then entered into a random-effects model that was estimated with restricted maximum likelihood using JASP 0.9 (jasp-stats.org). The resulting z-value was converted back to a correlation value for easier interpretation.
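The pooling procedure can be sketched as follows. Note two deliberate simplifications: JASP's REML estimator is replaced here by the simpler DerSimonian-Laird estimator of the between-study variance, and the correlation values and sample sizes are hypothetical:

```python
import math

def fisher_z(r):
    """Fisher r-to-z transform."""
    return math.atanh(r)

def z_se(n):
    """Standard error of a Fisher-z value from a sample of size n."""
    return 1.0 / math.sqrt(n - 3)

def random_effects_pool(zs, ses):
    """Random-effects pooling of Fisher-z values with the
    DerSimonian-Laird between-study variance estimator."""
    w = [1.0 / se ** 2 for se in ses]
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    df = len(zs) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)              # between-study variance
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]
    z_pooled = sum(wi * zi for wi, zi in zip(w_star, zs)) / sum(w_star)
    return math.tanh(z_pooled)                 # back-transform to r

# hypothetical correlations and sample sizes for five studies
rs = [0.6, 0.5, 0.4, 0.55, 0.1]
ns = [16, 14, 17, 28, 17]
pooled_r = random_effects_pool([fisher_z(r) for r in rs],
                               [z_se(n) for n in ns])
```

Pooling on the z scale keeps the sampling distribution approximately normal with a variance that depends only on n, which is why the transform is applied before, and inverted after, the meta-analytic model.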

Analyses of the left inferior colliculus
To analyse the task × noise interaction and the main effect of task in the bilateral IC, we used the same analysis procedures as described for the left vMGB (see section 4.6.3 Testing our hypothesis in the left vMGB). As regions of interest, we used the IC masks described in Mihai et al. (2019). Furthermore, to analyse the correlation (speech - speaker task correlated with the speech accuracy score) in the left IC, we followed the same procedures as for the left vMGB (see section 4.6.5 Test for replication of previous findings: Correlation between the main effect of task and speech recognition performance in the left vMGB).
Additionally, we computed a meta-analysis for the correlation between the speech - speaker task contrast and the proportion of hits in the speech task in the left IC. We focused on the left IC because previous studies reported correlations only in the left IC (Díaz et al., 2012; von Kriegstein et al., 2008). We included correlation coefficients from five studies: four previous experiments [two experiments from von Kriegstein et al. (2008), the control participants of Díaz et al. (2012), and one experiment from Mihai et al. (2019)] and the current study. Díaz et al. (2012) did not report the correlation coefficient in the IC; we took this value from the original study data in our research group's archive. Pearson correlation coefficients were Fisher-z transformed (Fisher, 1915) to z-values and standard errors.
These were then entered into a random-effects model that was estimated with restricted maximum likelihood using JASP 0.9 (jasp-stats.org). The resulting z-value was converted back to a correlation value for easier interpretation.