Improvement in auditory spatial discrimination from ambiguous visual stimuli is not explained by ideal observer causal inference

In order to survive and function in the world, we must understand the content of our environment. This requires us to gather and parse complex, sometimes conflicting, information. Yet, the brain is capable of translating sensory stimuli from disparate modalities into a cohesive and accurate percept with little conscious effort. Previous studies of multisensory integration have suggested that the brain’s integration of cues is well-approximated by an ideal observer implementing Bayesian causal inference. However, behavioral data from tasks that include only one stimulus in each modality fail to capture what is in nature a complex process. Here we employed an auditory spatial discrimination task in which listeners were asked to determine on which side they heard one of two concurrently presented sounds. We compared two visual conditions in which task-uninformative shapes were presented in the center of the screen, or spatially aligned with the auditory stimuli. We found that performance on the auditory task improved when the visual stimuli were spatially aligned with the auditory stimuli—even though the shapes provided no information about which side the auditory target was on. We also show that a model of a Bayesian ideal observer performing causal inference cannot explain this improvement, demonstrating that humans deviate systematically from the ideal observer model.

As we navigate the world, we gather sensory information about our surroundings from multiple sensory modalities. Information gathered from a single modality may be ambiguous or otherwise limited, but by integrating information across modalities, we form a better estimate of what is happening around us. While our integration of multisensory information seems effortless, the challenge to the brain is non-trivial. The brain must attempt to determine whether incoming information originates from the same source, as well as estimate the reliability of each modality's cues so that they may be appropriately weighted.

Studies of multisensory integration have explained how a Bayesian ideal observer could solve this problem by combining reliability-weighted evidence from multiple sensory modalities. In the forced integration model, an observer gathers evidence from multiple modalities and combines it according to each modality's reliability [1]. Importantly, this allows the most reliable sensory estimate to dominate the percept while noisier measurements have less influence; however, it also implies that percepts of distinct stimuli that in actuality originate from independent sources must nonetheless be perceptually influenced by each other. More recently, causal inference has expanded upon the forced integration model by allowing the observer to treat stimuli as originating from different sources. The observer first determines whether both pieces of evidence are likely to come from a common source and, if so, weights them by their reliabilities as in the forced integration model to generate a combined percept [2]. In their basic forms, neither model attempts to contend with scenes more complex than a single stimulus in each modality.

March 25, 2019 1/17
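The reliability-weighted (inverse-variance) combination at the heart of the forced integration model can be sketched in a few lines. This is a minimal illustration, not code from the study; the cue locations and variances are illustrative values.

```python
import numpy as np

# A minimal sketch of reliability-weighted ("forced") integration: each cue
# is weighted by its inverse variance, so the more reliable cue dominates.
def fuse(means, variances):
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    weights = precisions / precisions.sum()
    fused_mean = float(weights @ means)
    fused_var = float(1.0 / precisions.sum())  # always below the best single cue
    return fused_mean, fused_var

# A noisy auditory cue at 10 deg fused with a reliable visual cue at 0 deg:
# the percept is pulled strongly toward the visual location.
mean, var = fuse([10.0, 0.0], [4.0, 1.0])
```

Note that the fused variance is smaller than that of either cue alone, which is why integration is advantageous when the cues really do share a source.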
Numerous experiments have shown that humans behave as ideal or near-ideal Bayesian observers performing forced integration [3–6] or causal inference [7–10]. There have even been efforts to reveal which brain structures contribute to Bayesian computations [11,12]. However, those studies have not considered scenarios in which many sources in an environment give rise to multiple cues within each modality. Here we test the Bayesian causal inference model using a new paradigm, and in doing so introduce a key question missing from these prior studies but common in the natural world: which auditory and visual stimuli will be integrated when multiple stimuli exist in each modality?

In the case of a single stimulus in each modality, visual influence on auditory location has largely been demonstrated by studies of perceptual illusions. Notably, the ventriloquist effect, a bias of auditory location toward visual location when cues of both modalities are presented simultaneously [13], has been extensively characterized. The influence of the visual location depends mainly on two factors: the discrepancy between the two stimuli (with visually induced bias waning as the spatial separation becomes too large) [14], and the size of the visual stimulus (smaller, more reliable visual stimuli yielding a larger bias) [4]. Dependence on discrepancy points to a causal inference structure, while size dependence indicates a weighting by the quality of the location estimates (larger visual stimuli are localized less accurately). Agreement with the Bayesian causal inference model [2,4] would indicate that the bias is due to an integration of the two cues in which the brain produces a combined estimate of location. Therefore, congruent auditory and visual evidence should result in a more accurate estimate of object location than auditory evidence alone.
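The causal inference structure described above can be made concrete as a posterior over a shared cause. The sketch below follows the closed-form likelihoods of the Kording et al. [2] model, assuming a zero-mean Gaussian prior over source location; all parameter values are illustrative, not fitted to data.

```python
import numpy as np

# Posterior P(C = 1 | x_a, x_v): probability that auditory and visual cues
# share a single cause, per the Kording et al. causal inference structure.
def p_common(x_a, x_v, sig_a=2.0, sig_v=1.0, sig_p=10.0, prior_c=0.5):
    va, vv, vp = sig_a**2, sig_v**2, sig_p**2
    # One cause: the shared source location is integrated out analytically.
    var1 = va * vv + va * vp + vv * vp
    like1 = np.exp(-0.5 * ((x_a - x_v)**2 * vp + x_a**2 * vv + x_v**2 * va)
                   / var1) / (2 * np.pi * np.sqrt(var1))
    # Two causes: each cue is explained by its own independent source.
    like2 = (np.exp(-0.5 * (x_a**2 / (va + vp) + x_v**2 / (vv + vp)))
             / (2 * np.pi * np.sqrt((va + vp) * (vv + vp))))
    return like1 * prior_c / (like1 * prior_c + like2 * (1 - prior_c))

# Nearby cues are judged likely to share a cause; distant cues are not.
near, far = p_common(1.0, 0.0), p_common(15.0, 0.0)
```

This dependence of the common-cause posterior on cue discrepancy is exactly the waning of the ventriloquist bias with separation described in the text.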

In this study we engaged listeners in a concurrent auditory spatial discrimination task to look for a benefit from spatially aligned, task-uninformative visual stimuli. We presented two sounds, a tone and noise, with centrally located or spatially aligned visual stimuli of per-trial random color and shape. Listeners were asked to report which side the tone was on. Importantly, those shapes do not provide information about the correct choice in either condition, only providing information about the separation of the two auditory stimuli in the spatially aligned condition. We investigated whether subjects nonetheless benefited from this additional information and improved their performance on the task, as one might predict from an extrapolation of the ventriloquist effect. Our results show a benefit due to the spatially aligned shapes. However, an extension of the ideal Bayesian causal inference model for two auditory and two visual stimuli could not explain any difference in auditory performance between the two visual conditions.

Fig 1. Listeners fixate while concurrently hearing two auditory stimuli on either side of the fixation dot and seeing two random shapes that are either centrally located or spatially aligned with the auditory stimuli. Shapes are presented in alternating frames to avoid overlap.

We engaged listeners in an auditory spatial discrimination task to see if they could benefit from spatially aligned, task-uninformative visual stimuli. Listeners were presented with two simultaneous sounds (a tone complex and a noise token with the same spectral shape) localized symmetrically about zero degrees azimuth and asked to report which side the tone was on. Concurrently, two task-uninformative visual stimuli of per-trial random shape and hue were presented. In two conditions (Figure 1), visual stimuli were either spatially aligned with the auditory stimuli ("Matched" condition) or in the center of the screen ("Central" condition) as a control. For both conditions, auditory separations ranged from 1.25 degrees to 20 degrees. We measured the improvement in performance due to the spatially aligned shapes as the difference in percent correct between the Matched and Central conditions at each separation (Figure 2A). Averaging across separations, the improvement was significant (p = 0.007, t-test). The effect was individually significant at moderate and large separations (5 degrees, p = 0.003; 20 degrees, p = 0.02).

To further understand the effect, we calculated 75% thresholds for each condition by fitting a psychometric function to each subject's response data (Figure 2B). A decrease in threshold (Figure 2B) is necessarily paired with an increase in percent correct at threshold (dotted line, Figure 2B) due to the fit method: the slope and lapse rate of the sigmoid were determined from responses to both conditions, and only the threshold parameter of the function was allowed to differ between the two conditions. Nonetheless, we find that improvements in threshold (and consequently, performance at threshold) are significant across the population (p = 0.0002, sign test). The average threshold improvement across the population is a 1.1 degree decrease, and the size of the effect increases as baseline auditory spatial ability gets worse. On average, someone with a 5 degree Central-condition threshold experiences a 0.5 degree (10%) improvement in threshold, while someone with a 15 degree Central threshold experiences a 3 degree (20%) improvement. The average improvement in performance at the Central threshold is 2.2%.
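The constrained fit described above (one slope and one lapse rate shared across conditions, with only the threshold free to differ) can be sketched as a joint maximum-likelihood fit. The response counts below are hypothetical, for illustration only; the sigmoid parameterization is one reasonable choice, not necessarily the one used in the study.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical per-condition data: separations (deg) and correct counts.
seps = np.array([1.25, 2.5, 5.0, 10.0, 20.0])
n_trials = 60                                   # trials per separation
correct = {"central": np.array([33, 38, 45, 52, 57]),
           "matched": np.array([35, 41, 49, 55, 58])}

def pc(sep, thresh, slope, lapse):
    # Sigmoid rising from chance (0.5) toward 1, with a lapse floor/ceiling.
    p = 0.5 + 0.5 * norm.cdf(sep, loc=thresh, scale=slope)
    return np.clip(lapse * 0.5 + (1 - lapse) * p, 1e-9, 1 - 1e-9)

def nll(params):
    # Shared slope and lapse; per-condition thresholds only.
    t_central, t_matched, slope, lapse = params
    total = 0.0
    for cond, t in (("central", t_central), ("matched", t_matched)):
        p = pc(seps, t, slope, lapse)
        k = correct[cond]
        total -= np.sum(k * np.log(p) + (n_trials - k) * np.log(1 - p))
    return total

res = minimize(nll, x0=[8.0, 6.0, 5.0, 0.02],
               bounds=[(0.1, 30), (0.1, 30), (0.5, 30), (0.0, 0.2)])
t_central, t_matched = res.x[:2]
# With uniformly better matched-condition data, the fitted matched threshold
# comes out below the central threshold.
```

Because the slope and lapse are shared, a lower threshold in one condition mechanically implies higher percent correct at the other condition's threshold, as noted in the text.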

We developed an ideal observer model for our task in order to investigate whether our data are compatible with an optimal combination of auditory and visual cues in this task. Our model (details in Methods) follows Kording et al. [2] in performing inference over whether two cues are the result of the same event or of different events ("causal inference"). Cues stemming from the same event are combined in an optimal manner according to their relative reliabilities. This results in a posterior belief about the location of the auditory tone. If this posterior has more mass left of the midline, the ideal observer responds "left", otherwise "right".

Fig 2. Threshold improvement versus Central-condition threshold. Line of best fit and 95% confidence intervals are also shown. The marginal distribution of threshold improvement, shown to the right, has more mass toward positive threshold improvement than negative.

Fig 3. Schematic showing that visual combination cannot provide a benefit to the ideal observer. Listeners use the knowledge that the tone and noise are presented symmetrically to compute a combined auditory likelihood P(tone location | X_a_tone, X_a_noise) from the individual likelihoods P(tone location | X_a_tone) and P(noise location | X_a_noise). Then, for each side, they combine this auditory likelihood with a visual likelihood similarly derived from both visual shapes' likelihoods. Listeners determine which side the tone is on by picking the side of the posterior with more probability mass. Whether or not they combine evidence across modalities, the observer responds "right".

While the ideal observer's performance follows an approximately sigmoidal shape as a function of auditory azimuth, as expected, the two model fits, corresponding to the two visual conditions, are identical. Because the stimuli are presented symmetrically, the observer can compute a within-modality combined belief, weighting each cue by its relative reliability as in Ernst & Banks [1] [e.g., P(S_a_tone | X_a_tone, X_a_noise) for the auditory cues]. Importantly, when combining with the bimodal visual likelihood, the observer must separately consider two possible scenarios: the tone is on the right, or the tone is on the left. Using the visual observation to refine their estimate of the tone location, the observer combines auditory and visual information for each scenario and must base their final decision on a weighted combination of these two scenarios. Even weighting the two scenarios equally, there is more evidence in favor of the tone being on the right, the same side as that implied by just the auditory observations. In reality, the weights depend on the proximity of the auditory and visual observations, favoring the visual cue that falls on the same side of the midline as the subject's belief about the tone, and will therefore yield a response identical to the one obtained by considering just the auditory observations. Equivalently, the side with the greater posterior mass is the same whether or not the visual evidence is combined. As a result, using the visual stimuli to refine the final posterior does not change the side with more probability mass (Figure 3), and therefore cannot benefit the ideal observer.
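This argument can be checked numerically with grid-based inference over the signed tone location. The sketch below is an illustration under assumed noise parameters (not the fitted model): because the symmetric visual likelihood is an even function of the candidate tone location, it reshapes the posterior without ever changing which side of the midline holds more mass.

```python
import numpy as np

# Numeric check: refining the auditory posterior with the task-uninformative
# visual likelihood never flips the decision of an ideal observer.
rng = np.random.default_rng(0)
S = np.linspace(-40, 40, 4001)            # candidate signed tone locations (deg)
sig_at, sig_an, sig_v = 6.0, 8.0, 2.0     # assumed sensory noise s.d.s

def choose_side(x_at, x_an, x_vl, x_vr, use_visual):
    # Auditory likelihood: tone at S, noise at -S (symmetric presentation).
    logp = (-0.5 * ((x_at - S) / sig_at) ** 2
            - 0.5 * ((x_an + S) / sig_an) ** 2)
    if use_visual:
        # Matched condition: visual cues sit at +/-|S| whichever side the
        # tone is on. This term is even in S, so it cannot favor one side.
        logp += (-0.5 * ((x_vr - np.abs(S)) / sig_v) ** 2
                 - 0.5 * ((x_vl + np.abs(S)) / sig_v) ** 2)
    p = np.exp(logp - logp.max())
    return 1 if p[S > 0].sum() > p[S < 0].sum() else -1

flips = 0
for _ in range(500):
    a = rng.choice([-5.0, 5.0])           # true tone azimuth
    x_at, x_an = rng.normal(a, sig_at), rng.normal(-a, sig_an)
    x_vl, x_vr = rng.normal(-abs(a), sig_v), rng.normal(abs(a), sig_v)
    if choose_side(x_at, x_an, x_vl, x_vr, True) != \
       choose_side(x_at, x_an, x_vl, x_vr, False):
        flips += 1
# flips remains 0: the visual refinement never changes the decision.
```

The intuition is that the combined auditory posterior is a single Gaussian, so whichever side its mean falls on dominates pointwise, and multiplying by any side-symmetric visual term preserves that dominance.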

Here we show that normal-hearing listeners improve their performance in an auditory spatial discrimination task when spatially aligned but task-uninformative visual stimuli are present. We further show that these findings cannot be explained by an ideal observer performing the discrimination task.

Even though the shapes presented on any given trial give no indication of which side the tone is on, subjects' behavioral performance suggests that they nonetheless exploit the spatial information the shapes carry. In either scenario, we posit that visual stimuli can act as anchors that attract auditory location. The brain may therefore correct errors in auditory spatial discrimination by refining one or both auditory locations, as long as it is able to correctly determine which auditory and visual stimuli to pair. Additional work must be done in order to understand how the brain accomplishes multisensory pairing.

Another interpretation of the visual benefit would be that the visual shapes help direct auditory spatial attention. The time required to shift auditory spatial attention is on the order of 300 ms [15], rendering it unlikely that attention drives the present results. Visual stimuli preceded the auditory stimuli by 100 ms and the auditory stimuli were only 300 ms long, a duration insufficient for the brain to redirect attention to either of the visual locations, let alone both (splitting auditory attention can have profound behavioral costs [16,17]).

We find that the ideal Bayesian causal inference model cannot account for the benefit provided by spatially aligned visual stimuli. In particular, we have proven that the visual stimuli in our task can reduce the variance of the auditory estimate, but the side on which most of the probability mass lies, and hence the decision, never changes. This means that an ideal observer model is insufficient to explain the benefit we measured behaviorally.

The nature of the suboptimality that allows listeners to benefit from task-uninformative visual stimuli remains unknown. Different decision strategies for the causal inference model have been tested with traditional audiovisual localization tasks, suggesting that individuals may use a suboptimal probability-matching strategy [18]. Such a strategy could also drive the effect we measured. Alternatively, a model of an observer whose low-level auditory coding is biased by visual stimuli (i.e., an early integration model [19]) might also explain the effect. There is some evidence of this type of early integration in the visual cortex during audiovisual localization [12]. Future work exploring such nonideal perceptual mechanisms and their effects on behavior is necessary.

Here we show that listeners use task-uninformative visual stimuli to improve their performance on an auditory spatial discrimination task. This finding demonstrates that the brain can pair auditory and visual stimuli in environments more complex than the single-source scenes studied previously.

Stimuli

Two auditory stimuli were generated in the frequency domain with energy from 220 to 4000 Hz and a 1/f envelope (−3 dB/octave). One was pink noise ('noise') and the other was composed of harmonics of 220 Hz ('tone'). With the exception of one subject, who was run with a frozen noise token, the noise was randomly generated for each trial.
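The frequency-domain construction described above can be sketched as follows. Only the band (220 to 4000 Hz), the −3 dB/octave envelope, and the 220 Hz harmonic structure come from the text; the sampling rate, duration, and random-phase choice are assumptions made for illustration.

```python
import numpy as np

fs, dur = 44100, 0.3
n = int(fs * dur)
freqs = np.fft.rfftfreq(n, 1 / fs)
rng = np.random.default_rng(1)

# 1/f power envelope (-3 dB/octave), restricted to 220-4000 Hz.
env = np.zeros_like(freqs)
band = (freqs >= 220) & (freqs <= 4000)
env[band] = 1.0 / np.sqrt(freqs[band])    # amplitude ~ f^-1/2 -> power ~ 1/f

# 'noise': random phase at every frequency bin in the band.
spec_noise = env * np.exp(1j * rng.uniform(0, 2 * np.pi, freqs.size))
noise = np.fft.irfft(spec_noise, n)

# 'tone': the same envelope, but energy only at harmonics of 220 Hz.
spec_tone = np.zeros(freqs.size, dtype=complex)
for h in range(220, 4001, 220):
    k = int(round(h * n / fs))            # FFT bin nearest each harmonic
    spec_tone[k] = env[k] * np.exp(1j * rng.uniform(0, 2 * np.pi))
tone = np.fft.irfft(spec_tone, n)
```

Building both stimuli from the same spectral envelope gives them the matching spectral shape the task requires, differing only in their fine structure.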

Data were similar for the subject with the frozen noise token and were therefore not excluded. To set the location of each sound, the stimuli were convolved with head-related transfer functions (HRTFs) from the CIPIC library [20]. Visual shapes were drawn in a per-trial random hue such that the two shapes in a trial had opposite hues. Each shape was presented during alternating frames at 144 frames per second so that both shapes were visible even in cases where they would overlap (in a manner similar to [22]).
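The HRTF spatialization step above amounts to convolving the mono stimulus with a left/right impulse-response pair. In this sketch, real HRIRs would be loaded from the CIPIC database; the stand-in responses here (a pure delay plus attenuation per ear) are hypothetical and only illustrate the operation.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 44100
mono = np.random.default_rng(2).standard_normal(fs // 10)

def fake_hrir(delay_samples, gain, length=128):
    # Stand-in head-related impulse response: delay plus attenuation.
    h = np.zeros(length)
    h[delay_samples] = gain
    return h

# A source on the right arrives earlier and louder at the right ear.
hrir_left, hrir_right = fake_hrir(20, 0.6), fake_hrir(5, 1.0)
binaural = np.stack([fftconvolve(mono, hrir_left),
                     fftconvolve(mono, hrir_right)])
```

With measured HRIRs, the same two convolutions impose the full interaural time, level, and spectral cues for the chosen azimuth rather than this simplified delay-and-gain pair.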

Task

During each trial, the tone and noise were presented symmetrically about zero degrees azimuth, with visual onset leading auditory onset by 100 ms. Subjects were asked to report which side the tone was on by pressing one of two buttons. Before the experiment, subjects were given 10 sample trials and then asked to complete a training session. Their responses to training trials with auditory stimuli at 20 degrees separation were logged until at least 20 trials had been completed and the probability of achieving that number of correct responses by chance (assuming a binomial distribution) was under 5%. If the training criterion was not satisfied after 20 trials, subjects were allowed to re-attempt once. Four subjects were dismissed when they did not pass the second attempt.

Ideal observer model

Let a_tone denote the true azimuth of the tone on a trial (the noise lies at −a_tone) and v_right the true location of the right visual cue (the left cue lies at −v_right). In this notation, sign(a_tone) denotes the correct response for that trial. Using the notation N(x; µ, σ²) to denote the probability density function of a normal random variable with mean µ and variance σ², the observed tone location (X_a_tone), noise location (X_a_noise), left visual cue location (X_v_left), and right visual cue location (X_v_right) for the trial are randomly drawn with probability

P(X_a_tone | a_tone) = N(X_a_tone; a_tone, (σ_a_tone)²)    (1)
P(X_a_noise | a_tone) = N(X_a_noise; −a_tone, (σ_a_noise)²)    (2)
P(X_v_right | v_right) = N(X_v_right; v_right, (σ_v)²)    (3)
P(X_v_left | v_right) = N(X_v_left; −v_right, (σ_v)²)    (4)

where (σ_a_tone)², (σ_a_noise)², and (σ_v)² are the uncertainties associated with the observed tone, noise, and visual cue locations, respectively.

It is important to note that the subject does not have access to the true variables a_tone and v_right and must make their decision from the observed variables. We model subject perception as inference in a hierarchical generative model of the sensory inputs (shown in the figure). Let S_a_tone and S_v_right be the perceived tone and right visual cue locations, whose likelihoods are given as follows:

P(X_a_tone | S_a_tone) = N(X_a_tone; S_a_tone, (σ_a_tone)²)    (5)
P(X_a_noise | S_a_tone) = N(X_a_noise; −S_a_tone, (σ_a_noise)²)    (6)
P(X_v_right | S_v_right) = N(X_v_right; S_v_right, (σ_v)²)    (7)
P(X_v_left | S_v_right) = N(X_v_left; −S_v_right, (σ_v)²)    (8)

Eqs. 5–8 assume that subjects can account for their uncertainty accurately based on prior sensory experience. We assume that the subject has learned that the auditory and visual stimuli are symmetric about zero degrees azimuth, which allows them to collapse X_a_tone and X_a_noise (or X_v_left and X_v_right) into unimodal estimates. The priors over S_a_tone and S_v_right can be conditioned on whether the subject perceived the tone to be from the left or right (denoted R = −1 or R = 1, respectively) and on whether they perceived the auditory and visual cues to be from the same cause or not (denoted C = 1 or C = 0, respectively). Assuming a flat prior over location for S_a_tone and S_v_right (the results still hold for symmetric proper priors), this can be written as

P(S_a_tone, S_v_right | R, C) ∝_R H(R·S_a_tone) [δ(S_v_right − R·S_a_tone)]^C    (9)

where ∝_R indicates that the proportionality constant is independent of R, H(x) denotes the Heaviside function, and δ denotes the Dirac delta (for C = 0 the prior over S_v_right is flat).
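The generative draws in Eq. 1–4 can be simulated directly; the observer's inference then operates only on these four noisy observations. The noise standard deviations below are illustrative, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
sig_at, sig_an, sig_v = 6.0, 8.0, 2.0   # assumed sensory noise s.d.s

def sample_observations(a_tone, v_right):
    x_at = rng.normal(a_tone, sig_at)    # Eq. 1: tone observation
    x_an = rng.normal(-a_tone, sig_an)   # Eq. 2: noise at the mirror location
    x_vr = rng.normal(v_right, sig_v)    # Eq. 3: right visual cue
    x_vl = rng.normal(-v_right, sig_v)   # Eq. 4: left visual cue
    return x_at, x_an, x_vr, x_vl

# Many matched-condition trials with the tone at +5 deg.
samples = np.array([sample_observations(5.0, 5.0) for _ in range(20000)])
```

Averaged over trials, each observation is centered on its true (signed) location, with the visual observations far less variable than the auditory ones, which is what makes them useful anchors.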

Having inferred R, we note that the ideal observer makes their choice (Ch) by choosing the side with the higher posterior mass, i.e.,

Ch = argmax_R P(R | X_a_tone, X_a_noise, X_v_left, X_v_right)    (10)
Calculating the posterior

Before comparing the probability mass on either side, we must evaluate the posterior over R. To do so, we marginalize over the cause variable C (writing X for the full set of observations X_a_tone, X_a_noise, X_v_left, X_v_right):

P(R | X) = Σ_{C∈{0,1}} P(R, C | X)    (11)

We can evaluate the term inside the sum by first using Bayes' rule,

P(R, C | X) ∝ P(X | R, C) P(R, C)    (12)

and then simplifying under the assumption that the priors over R and C are independent, i.e.,

P(R, C) = P(R) P(C)    (13)

Assuming equal priors for the left and right sides, i.e., P(R) = 0.5 (the ideal observer has no response bias), we obtain

P(R | X) ∝ Σ_{C∈{0,1}} P(X | R, C) P(C)    (14)
We can then expand the expression for the side with the higher posterior mass by considering both values of the cause variable C, which, using Eq. 14, can be rewritten as

Ch = argmax_R [P(C=0) P(X | R, C=0) + P(C=1) P(X | R, C=1)]    (15)

In general, the likelihood can be evaluated by averaging over all possible auditory and visual cue locations:

P(X | R, C) = ∫∫ P(X | S_a_tone, S_v_right) P(S_a_tone, S_v_right | R, C) dS_a_tone dS_v_right    (16)

Using the independence relations implied by the generative model, we can simplify the previous equation to get

P(X | R, C) = ∫∫ P(X_a_tone | S_a_tone) P(X_a_noise | S_a_tone) P(X_v_right | S_v_right) P(X_v_left | S_v_right) P(S_a_tone, S_v_right | R, C) dS_a_tone dS_v_right    (17)

Substituting expressions for the likelihoods of each cue (Eq. 5–8) and the prior (Eq. 9), we can evaluate Eq. 17 by repeated multiplication of normal probability density functions to get expressions for both C = 0 and C = 1.

No audiovisual combination (C=0)

For C = 0, substituting the cue likelihoods (Eq. 5–8) and the prior (Eq. 9) into Eq. 17 gives

P(X | R, C=0) ∝_R ∫ H(R·S_a_tone) N(X_a_tone; S_a_tone, (σ_a_tone)²) N(X_a_noise; −S_a_tone, (σ_a_noise)²) dS_a_tone    (18)

(the visual terms integrate to a constant that is independent of R and are absorbed into the proportionality). Multiplying the Gaussian likelihoods in Eq. 18, we can pull the terms independent of R into a proportionality constant to get

P(X | R, C=0) ∝_R ∫ H(R·S_a_tone) N(S_a_tone; X_a_tone,noise, α_a_tone,noise (σ_a_tone)²) dS_a_tone    (19)

where

α_a_tone,noise = (σ_a_noise)² / ((σ_a_tone)² + (σ_a_noise)²)

is the weight given to the tone location while combining with the noise location, and

X_a_tone,noise = α_a_tone,noise X_a_tone − (1 − α_a_tone,noise) X_a_noise

is the combined auditory observation. The integral in Eq. 19 evaluates to

P(X | R, C=0) ∝_R Φ(0; −R·X_a_tone,noise, α_a_tone,noise (σ_a_tone)²)    (20)
where Φ(x; µ, σ²) denotes the cumulative distribution function evaluated at x for a normal random variable with mean µ and variance σ².

Audiovisual cue combination (C=1)

For C = 1, the prior (Eq. 9) forces S_v_right = R·S_a_tone, so we can integrate over S_v_right by evaluating all functions of S_v_right at R·S_a_tone:

P(X | R, C=1) ∝_R ∫ H(R·S_a_tone) N(S_a_tone; X_a_tone,noise, α_a_tone,noise (σ_a_tone)²) N(X_v_right; R·S_a_tone, (σ_v)²) N(X_v_left; −R·S_a_tone, (σ_v)²) dS_a_tone    (22)

Multiplying the Gaussian likelihoods in Eq. 22, we can pull the terms independent of R into a proportionality constant to get

P(X | R, C=1) ∝_R N(X_a_tone,noise; R·(X_v_right − X_v_left)/2, α_a_tone,noise (σ_a_tone)² + 0.5 (σ_v)²) ∫ H(R·S_a_tone) N(S_a_tone; X_a,v_combined, α_av α_a_tone,noise (σ_a_tone)²) dS_a_tone    (23)

where

α_av = 0.5 (σ_v)² / ((σ_a_tone)² α_a_tone,noise + 0.5 (σ_v)²)

is the weight given to the auditory location while combining with the visual location, and

X_a,v_combined = α_av X_a_tone,noise + R (1 − α_av) (X_v_right − X_v_left)/2

is the weighted combination of the visual and auditory cues. The integral in Eq. 23 evaluates to

P(X | R, C=1) ∝_R N(X_a_tone,noise; R·(X_v_right − X_v_left)/2, α_a_tone,noise (σ_a_tone)² + 0.5 (σ_v)²) Φ(0; −R·X_a,v_combined, α_av α_a_tone,noise (σ_a_tone)²)    (24)

Using the fact that Φ(0; µ, σ²) is a decreasing function of µ, the maximum of Eq. 20 simplifies to

argmax_R P(X | R, C=0) = sign(X_a_tone,noise)    (25)

We note that (X_v_right − X_v_left)/2 > 0 (by definition of the left and right labels). Using that fact, the inequality N(sign(µ)·x; µ, σ²) > N(−sign(µ)·x; µ, σ²) for x > 0, and the decreasing nature of Φ(0; µ, σ²), the maximum of Eq. 24 simplifies to

argmax_R P(X | R, C=1) = sign(X_a_tone,noise)    (26)

Since both terms of Eq. 15 are maximized at the same R, their positively weighted combination is maximized there as well. Importantly, the side with the higher posterior mass is therefore independent of the cause C, and the ideal observer's choice reduces to

Ch = sign(X_a_tone,noise)    (27)
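The closed-form result can be spot-checked numerically: for randomly drawn observations, the R maximizing the C = 0 expression (Eq. 20) and the C = 1 expression (Eq. 24) is always sign(X_a_tone,noise). The sensory variances below are illustrative, not fitted values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
var_at, var_an, var_v = 36.0, 64.0, 4.0
alpha = var_an / (var_at + var_an)   # weight on the tone observation
var_a = alpha * var_at               # variance of the combined auditory cue

def likelihood(R, x_at, x_an, x_vl, x_vr, C):
    x_a = alpha * x_at - (1 - alpha) * x_an      # combined auditory observation
    if C == 0:                                   # Eq. 20
        return norm.cdf(0, loc=-R * x_a, scale=np.sqrt(var_a))
    a_av = 0.5 * var_v / (var_a + 0.5 * var_v)   # weight on the auditory cue
    d = 0.5 * (x_vr - x_vl)                      # combined visual observation (> 0)
    x_av = a_av * x_a + R * (1 - a_av) * d
    return (norm.pdf(x_a, loc=R * d, scale=np.sqrt(var_a + 0.5 * var_v))
            * norm.cdf(0, loc=-R * x_av, scale=np.sqrt(a_av * var_a)))  # Eq. 24

mismatches = 0
for _ in range(2000):
    a = rng.choice([-5.0, 5.0])
    x_at, x_an = rng.normal(a, 6.0), rng.normal(-a, 8.0)
    x_vl, x_vr = np.sort([rng.normal(-abs(a), 2.0), rng.normal(abs(a), 2.0)])
    x_a = alpha * x_at - (1 - alpha) * x_an
    for C in (0, 1):
        best = max((1, -1), key=lambda R: likelihood(R, x_at, x_an, x_vl, x_vr, C))
        if best != np.sign(x_a):
            mismatches += 1
# mismatches stays 0: the decision is independent of C and of the visual cues.
```

Sorting the two visual observations enforces the labeling convention X_v_right > X_v_left, which is what guarantees (X_v_right − X_v_left)/2 > 0 in the proof.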

Generating a psychometric curve

To evaluate the probability that the subject will choose right at each auditory azimuth (the psychometric curve), we need

P(Ch = 1 | a_tone, v_right) = ∫ P(Ch = 1 | X) P(X | a_tone, v_right) dX    (29)

Using the independence relations implied by the generative model and the decision rule (Eq. 27), we can simplify the previous equation to get

P(Ch = 1 | a_tone, v_right) = P(X_a_tone,noise > 0 | a_tone)    (30)

Substituting Eq. 1–4 and 27 into Eq. 29 and simplifying, we get

P(Ch = 1 | a_tone, v_right) = Φ(0; −a_tone, α_a_tone,noise (σ_a_tone)²)

Assuming the subject lapses with probability λ and then responds randomly with equal probability, the model-predicted psychometric curve is

P(Ch = 1 | a_tone, v_right) = λ(0.5) + (1 − λ) Φ(0; −a_tone, α_a_tone,noise (σ_a_tone)²)    (31)

It is important to note that the ideal observer response is not affected by the visual observations, so the model predicts identical psychometric curves for the Central and Matched conditions. To fit the model, the number of correct responses at each separation was modeled as Bin(n, p), where Bin(n, p) denotes the binomial probability mass function with parameters n and p; the probability parameter p is obtained from Eq. 31 for the given parameter values.
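The model-predicted curve in Eq. 31 can be evaluated directly. Note that it depends only on the tone azimuth, the combined auditory variance, and the lapse rate; the visual condition drops out entirely. Parameter values here are illustrative.

```python
import numpy as np
from scipy.stats import norm

var_at, var_an, lapse = 36.0, 64.0, 0.04
alpha = var_an / (var_at + var_an)       # weight on the tone observation
sd_comb = np.sqrt(alpha * var_at)        # s.d. of the combined auditory cue

def p_right(a_tone):
    # Eq. 31: lapse * 0.5 + (1 - lapse) * Phi(0; -a_tone, sd_comb^2)
    return lapse * 0.5 + (1 - lapse) * norm.cdf(0, loc=-a_tone, scale=sd_comb)

azimuths = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])
curve = p_right(azimuths)
# The curve passes through 0.5 at zero azimuth, is antisymmetric about that
# point, and saturates near 1 - lapse/2 at large separations.
```

Because the same curve applies to both visual conditions, any measured difference between Matched and Central performance falls outside what this ideal observer can produce.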