ABSTRACT
Background Ventriloquism aftereffect (VAE), observed as a shift in the perceived locations of sounds after audiovisual stimulation, requires reference frame (RF) alignment since hearing and vision encode space in different RFs (head-centered, HC, vs. eye-centered, EC). Experimental studies examining the RF of VAE found inconsistent results: a mixture of HC and EC RFs was observed for VAE induced in the central region, while a predominantly HC RF was observed in the periphery. Here, a computational model examines these inconsistencies, as well as a newly observed EC adaptation induced by AV-aligned audiovisual stimuli.
Methods The model has two versions, each containing two additively combined components: a saccade-related component characterizing the adaptation in auditory-saccade responses, and auditory space representation adapted by ventriloquism signals either in the HC RF (HC version) or in a combination of HC and EC RFs (HEC version).
Results The HEC model performed better than the HC model in the main simulation considering all the data, while the HC model was more appropriate when only the AV-aligned adaptation data were simulated.
Conclusion Visual signals in a uniform mixed HC+EC RF are likely used to calibrate the auditory spatial representation, even after the EC-referenced auditory-saccade adaptation is accounted for.
1. Introduction
Auditory spatial perception is highly adaptive and visual signals often guide this adaptation. In the “ventriloquism aftereffect” (VAE), the perceived location of sounds presented alone is shifted after repeated presentations of spatially mismatched visual and auditory stimuli [1–3]. Complex transformations of spatial representations in the brain are necessary for the visual calibration of auditory space to function correctly, as visual and auditory spatial representations differ in many important ways. Here, we propose a computational model and perform a behavioral data analysis to examine the visually guided adaptation of auditory spatial representation in VAE and the related transformations of the reference frames (RFs) of auditory and visual spatial encoding.
Several previous models were developed to describe the ventriloquism aftereffect in humans and birds. The bird models examined VAE in barn owls [4, 5], which cannot move their eyes and therefore do not need to re-align the auditory and visual RFs. The human models mainly focused on spatial and temporal aspects of the ventriloquism aftereffect [6–8], not considering the differing RFs. There are models of audio-visual reference frame alignment, but those only consider audio-visual integration [9] and multi-sensory integration [10] when the auditory and visual stimuli are presented simultaneously, as in the ventriloquism effect, not the adaptation and transformations underlying VAE.
Here, we primarily examine the reference frame (RF) in which VAE occurs. While visual space is initially encoded relative to the direction of the eye gaze, the cues for auditory space are computed relative to the orientation of the head [11]. A means of aligning these RFs is necessary by the stage at which the visual signals guide auditory spatial adaptation. Our previous studies suggest that a mixture of eye-centered and head-centered RFs is associated with recalibration in the central region of the audiovisual field [12] while the head-centered RF dominates for VAE locally induced in a single hemifield in the visual periphery [13]. These results imply that the RF used in VAE is location dependent, possibly due to non-homogeneity in the auditory spatial representation. Specifically, recent evidence suggests that, in mammals, auditory space encoding is based on two or more spatial channels roughly aligned with the left and right hemifields of the horizontal plane [14, 15]. The current modeling explores an alternative hypothesis about the location-dependence of the RF of VAE. It assumes that the RF transformations are the same across the audio-visual field, and that the observed location-dependence is due to other adaptive processes, e.g., related to auditory saccade adaptation, as saccades were used to measure behavioral responses in the Kopco et al. [12, 13] studies. The main modeling goal is then to determine whether such a uniform, location-independent spatial adaptation is only driven by head-orientation referenced visual signals, or whether signals in eye-centered RF also contribute.
The second question explored here is how to separate the effect of auditory saccade adaptation from the ventriloquism-induced auditory space adaptation. Previous studies show that auditory saccades can overestimate or underestimate the actual sound locations [16] and that the amount of visually induced adaptation does not depend on whether the resulting saccades are hypometric or hypermetric [17]. Here, in the Appendix, we analyze the data from Kopco et al. [12, 13] to determine whether the ventriloquism effect and aftereffect show asymmetries depending on the resulting adaptation type (hypometric vs. hypermetric), as well as on the saccade amplitude. Based on this analysis, the current model assumes that the magnitude of the ventriloquism aftereffect is proportional to the magnitude of the ventriloquism effect, independent of whether these shifts result in hypometric or hypermetric saccades, and independent of the saccade magnitude.
Finally, Kopco et al. [13] observed a new adaptive phenomenon induced by aligned audiovisual stimuli presented in the periphery, exhibited as a shift in responses to sounds presented alone in the central region. The shift magnitude depended on the gaze direction and, thus, was at least partly in the eye-centered RF. However, no such shift was observed when aligned audiovisual stimuli were presented in the central region [12]. The current model proposes a mechanism of a priori biases in the saccade responses, possibly due to auditory saccade adaptation, that can describe this phenomenon.
In the paper, we first summarize the Kopco et al. [12, 13] data modeled here, and, in the Appendix, provide a new analysis of these data to examine 1) how VAE magnitude depends on whether it results in hypometric vs. hypermetric saccades, and 2) how the VAE magnitude relates to the magnitude of the ventriloquism effect. Then, the model is introduced and two versions of it are examined in 4 simulations, each focusing on different aspects of the data and model components. The main result of the simulations is that a common location-independent mechanism can describe the data best when visual signals adapt the auditory spatial map in both head-centered and eye-centered reference frames, consistent with the idea that the reference frame of ventriloquism aftereffect is mixed.
2. Experimental data
This section summarizes the experimental methods and results from Kopco et al. [12, 13]. Additionally, the Appendix presents the results of a new analysis of the data aimed at examining how the results depend on the properties of the auditory saccades that subjects used for responding.
In the experiments, ventriloquism was induced by audio-visual training trials either in the central or peripheral subregion of the horizontal audio-visual field while the eyes fixated one location (red ‘+’ symbol; upper and middle panels of Figure 1(A)). The aftereffect was evaluated on interleaved auditory-only probe trials using a wide range of target locations while the eyes fixated one of two locations (lower panel of Figure 1(A)). The listener’s task in both audio-visual and auditory-only trials was to perform a saccade to the perceived location of the auditory stimulus/component from the FP. It was expected that the AV stimuli with displaced visual component would induce a local ventriloquism aftereffect when measured with the eyes fixating the training FP (red dash-dotted lines in Figure 1(B) illustrate this prediction for the peripheral-training experiment). Confirming this expectation, the red solid and dashed lines in Figure 1(B) show that maximum ventriloquism was induced in the peripheral and central training subregion, respectively. The critical manipulation of these experiments was that a subset of probe trials was performed with eyes fixating a new, non-training fixation point (blue ‘+’ symbol), located 23.5° to the left of the training fixation. As illustrated by the blue dash-dotted line in Figure 1(B), if the RF of VAE is purely head-centered, then moving the eyes to a new location is expected to have no effect, resulting in the same pattern of ventriloquism for the non-training and training FPs. On the other hand, if the RF is purely eye-centered, the observed pattern of induced shifts is expected to move with the eyes when the eyes are moved to a new location, as illustrated by the cyan dash-dotted line. 
The experimental data showed that, in the central experiment, moving the fixation resulted in a smaller ventriloquism aftereffect with the peak moving in the direction of eye gaze (blue dashed line), while in the peripheral experiment no effect of eye gaze position was observed (blue solid line). To better visualize these results, the lower panels of Figure 1(B) show predictions and data expressed as the difference between responses from training vs. non-training FPs from the respective upper panels. The head-centered RF always predicts that the effect would be identical for the two FPs. Thus, all head-centered predictions (brown lines) are always at zero. The yellow dash-dotted line shows a hypothetical prediction for eye-centered RF, obtained by subtracting the cyan from the red dash-dotted line. Similarly, the solid and dashed yellow lines show, respectively, for the peripheral and central data, the eye-centered RF predictions obtained by subtracting from the red lines the same red lines shifted 23.5° to the left. Finally, the black solid and dashed lines show the actual differences between the respective red and blue data from the upper panels. For the central data, the black dashed line falls approximately in the middle between the head-centered and eye-centered predictions, showing a mixed nature of the RF of VAE induced in this region. On the other hand, the black solid line is always near zero, confirming that the RF of VAE induced in the periphery is predominantly head-centered. The current model aims to describe these differences by considering a uniform representation and adaptation process that is guided by signals in both eye-centered and head-centered reference frames.
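The two pure-RF predictions described above can be illustrated with a minimal sketch. The Gaussian aftereffect profile and its parameters (`center_deg`, `width_deg`, `peak_deg`) are hypothetical placeholders, not the measured data; only the 23.5° fixation shift and the 9 probe azimuths come from the text.

```python
import math

FP_SHIFT_DEG = 23.5  # the non-training FP lies 23.5 deg to the left of the training FP

def trained_profile(azimuth_deg, center_deg=0.0, width_deg=10.0, peak_deg=5.0):
    """Hypothetical aftereffect (deg) measured from the training FP: a Gaussian bump."""
    return peak_deg * math.exp(-((azimuth_deg - center_deg) ** 2) / (2.0 * width_deg ** 2))

def predict_non_training(azimuth_deg, frame):
    """Aftereffect expected from the non-training FP under a pure HC or pure EC frame."""
    if frame == "HC":
        # Head-centered: the pattern stays at the same head-referenced azimuths.
        return trained_profile(azimuth_deg)
    if frame == "EC":
        # Eye-centered: the pattern moves with the eyes, i.e., 23.5 deg to the left.
        return trained_profile(azimuth_deg + FP_SHIFT_DEG)
    raise ValueError("frame must be 'HC' or 'EC'")

def fp_difference(azimuth_deg, frame):
    """Training-FP minus non-training-FP aftereffect (as in the lower panels of Figure 1B)."""
    return trained_profile(azimuth_deg) - predict_non_training(azimuth_deg, frame)

targets = [-30.0 + 7.5 * i for i in range(9)]  # 9 probe azimuths from -30 to +30 deg
hc_diff = [fp_difference(a, "HC") for a in targets]
ec_diff = [fp_difference(a, "EC") for a in targets]
```

Under these assumptions the HC difference is zero everywhere, while the EC difference is positive near the training region and negative about 23.5° to its left, which is the qualitative shape of the brown and yellow prediction lines.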
The results described in Figure 1(B) are based on ventriloquism aftereffect induced by visual stimuli displaced to the left or to the right of the corresponding auditory stimuli. Figure 1(C) shows the baseline data obtained in runs with auditory and visual stimuli aligned. In the central-training experiment, the responses from the two FPs were similar, unbiased at the central locations and with a slight expansive bias in the periphery (both red and blue dotted lines are near zero in the center, negative in the left-hand portion and positive in the right-hand portion of the graph). On the other hand, in the peripheral-training experiment the responses in the central region differed between the two fixations, where the non-training FP responses fell well below the training-FP responses (compare the red and blue solid lines).
Thus, the peripheral AV-aligned stimuli induced a fixation-dependent adaptation in the auditory-only responses in the central region. The black dashed and solid lines in Figure 1(C), showing the difference between the corresponding training and non-training FP data, highlight the FP-dependence of the peripheral experiment in contrast to the FP-independence in the central experiment. The current model assumes that these adaptive effects can be explained by a combination of biases in visual saccades to auditory stimuli and a visually guided adaptation in the spatial auditory representation.
3. Model Description
3.1 Overview
Figure 2A shows the outline of the model. The model predicts the azimuthal bias in the saccade response to an auditory-only probe (the “Response” block in panel A) as a function of the probe azimuth, with additional parameters of the fixation location on a given trial (“Probe stimulus and fixation” block) and the audio-visual training locations and the measured audio-visual response biases in a given experimental training session (“Ventriloquism” block). Thus, the model does not require information about the direction of audio-visual stimulus displacement during training (whether the visual stimuli were shifted to the left, right, or aligned with the auditory stimuli). Instead, it only uses the information about where the training occurred and what the resulting ventriloquism effect was. Here, the model assumes that there is a direct relation between the observed ventriloquism effect and aftereffect, as shown in the Appendix. The ventriloquism aftereffect prediction is then modeled as an additive combination of two components, a saccade-related bias in eye-centered reference frame and a saccade-independent visually guided adaptation of auditory space representation (square blocks in panel A). The saccade-related bias is present a priori and it is not directly adapted by ventriloquism, while the auditory spatial representation is locally adapted by the ventriloquism signals in different reference frames and its size also depends on the saccade-related bias.
Two versions of the model are evaluated, differing only by the assumed form of adaptation of the auditory space representation. First, in the HC model, the visual signals adapt the auditory spatial representation exclusively in the head-centered reference frame (the “HC” arrow in panel A), so the signals are assumed to be transformed to HC before inducing adaptation. In the HEC model, the visual signals adapt the auditory spatial representation in both head-centered and eye-centered RFs (“HC” and “EC” arrows) such that the relative contribution of the HC and EC RFs can be arbitrary. I.e., the HEC model reduces to the HC model if the weight of the EC path is set to zero, or it can produce predictions using only EC RF if the HC weight is set to zero.
In summary, both models assume that the spatial representations and adaptations are uniform, predicting the same results independent of whether the training occurs in the center or in the periphery. The main difference between the two models is that the HC model assumes that the auditory space adaptation occurs purely in head-centered coordinates, while it is the gaze-direction-referenced properties of the auditory saccades that cause any eye-centered effects observed in the data. On the other hand, the HEC model assumes that, even after accounting for the saccade-related effects, the auditory spatial representation receives the adaptive visual signals in both reference frames, causing adaptation that always depends on the position of the stimuli relative to the eye gaze direction. Importantly, the model assumes that if the ventriloquism aftereffect is not induced and measured by auditory saccades, as used in the Kopco et al. [12, 13] studies, the saccade-related bias would not affect the performance.
3.2 Detailed Specification
The following model specification applies to the more general HEC model version, with the differences applying to the HC model described as needed. Panels B-D of Figure 2 provide visualizations of the behavior of different parts of the model.
Equation 1 describes the predicted bias in responses r̂ to a given auditory stimulus location s as a weighted sum of a saccade-related bias rE and a ventriloquism-related adaptation in auditory spatial representation rV:

$$\hat{r}(s, f, s_{AV}, r_{AV}) = r_E(s, f) + w \cdot r_V(s, f, s_{AV}, r_{AV}) \tag{1}$$

where w ∈ [0, ∞) is a free parameter specifying the relative weight of the ventriloquism adaptation. In addition to the stimulus location s, the prediction (illustrated in Fig. 2D) also depends on the fixation point on a given trial f, on the training region specified by the training AV stimulus locations sAV, and on the observed biases in AV stimulus responses at these locations rAV (all variables in units of degrees).
The saccade-related bias at a specific location x for eyes fixating the location f is modeled as a sigmoidal function where h, k, and c are free parameters characterizing the sigmoid. The saccade-related bias (Figure 2B) is broad and referenced to the FP (i.e., it uses EC RF), exhibiting a combination of underestimations and overestimations commonly observed in studies of auditory saccades [9, 16, 18]. However, the specific shape of the functions used here was chosen to best fit the peripheral and central no-shift data shown in Fig. 1C. Specifically, the predictions roughly follow the values observed at each location in Fig. 1C when no audiovisual training is used at a given location (the central-experiment data for the right-most location triplet, the peripheral-experiment data for the central triplet, and data from both experiments for the left-most triplet). Thus, it is assumed that this saccade-related bias is present a priori, independent of the induced ventriloquism. Also, it is assumed that the bias only depends on the probe location relative to the FP location, which, for the current data, means that the bias graphs for training and non-training FPs are symmetrical about the origin with respect to each other (blue and red lines in Fig. 2B).
The ventriloquism-driven auditory space adaptation causes bias defined at location x, for eyes fixating the location f, and for ventriloquism induced at training locations sAV and resulting in AV response biases rAV, as a weighted sum:

$$r_V(x, f, s_{AV}, r_{AV}) = \sum_{i=1}^{N} w_{v,i}(x)\,\bigl(r_{AV,i} - r_E(s_{AV,i})\bigr) \tag{3}$$

where N is the number of training locations (N = 3 for the current study), i is an index through these locations, sAV,i is the i-th training location azimuth, and rAV,i is the AV response bias observed at the i-th training location. The differences rAV,i - rE(sAV,i) represent the disparity between the AV response biases (green diamonds in Figure 2B) and the saccade-related bias (red/blue lines in Figure 2B) at the training locations. The disparities are shown in Figure 2C by the red and blue full diamonds. wv,i(x) is the strength with which the disparity at the i-th training location adapts the spatial representation at the location x. In the HEC model, this value is a weighted sum of the adaptation strengths in head-centered and eye-centered reference frames, defined as:

$$w_{v,i}(x) = (1 - w_E)\, w_{vH,i}(x) + w_E\, w_{vE,i}(x) \tag{4}$$

where wE ∈ [0, 1] is a parameter determining the relative weight of the EC reference frame vs. the HC RF (in the HC model, wE = 0). Finally, wvH,i and wvE,i use normalized Gaussian functions centered at training locations as a measure of influence of the i-th training location on the target location x, in the two reference frames (Eqs. 5 and 6).
In Eqs. 5 and 6, the parameters σH and σE represent the width of the influence of the ventriloquism shift at individual training locations, separately for the two reference frames. wvH,i (Eq. 5) is always centered on the i-th training location in the HC RF, whereas wvE,i (Eq. 6) is centered on the i-th training location in the EC RF (for the training FP, the two RFs are aligned). Finally, the Gaussian functions are normalized (Eq. 7) such that the maximum wvH,i or wvE,i after summing across the three training locations is 1 (the normalization locations 7.5 · (i - 2) are specific to the current training and need to be modified for other data with different training locations).
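A minimal sketch of the adaptation weights and the resulting ventriloquism bias is given below. The exact functional forms are assumptions for illustration: unnormalized Gaussians exp(-(x - c)²/(2σ²)), a normalizer taken as the maximum over azimuth of the three summed Gaussians, and an EC center obtained by shifting each training location by the fixation offset. The names `fp_offset`, `normalizer`, `adaptation_weights`, and `ventriloquism_bias` are hypothetical.

```python
import math

TRAINING_AZIMUTHS = [-7.5, 0.0, 7.5]  # central-training locations (deg)

def gaussian(x, center, sigma):
    """Unnormalized Gaussian influence of a training location on azimuth x."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def normalizer(sigma, centers=TRAINING_AZIMUTHS, step=0.05):
    """Assumed normalization: maximum over azimuth of the summed Gaussians."""
    lo, hi = min(centers), max(centers)
    n = int(round((hi - lo) / step))
    grid = [lo + step * j for j in range(n + 1)]
    return max(sum(gaussian(x, c, sigma) for c in centers) for x in grid)

def adaptation_weights(x, fp_offset, sigma_h, sigma_e, w_e, centers=TRAINING_AZIMUTHS):
    """Weights w_{v,i}(x) for each training location i.

    fp_offset is the current FP minus the training FP in degrees
    (0 for the training FP, -23.5 for the non-training FP).
    """
    norm_h = normalizer(sigma_h, centers)
    norm_e = normalizer(sigma_e, centers)
    weights = []
    for c in centers:
        w_hc = gaussian(x, c, sigma_h) / norm_h              # fixed in head coordinates
        w_ec = gaussian(x, c + fp_offset, sigma_e) / norm_e  # moves with the eyes
        weights.append((1.0 - w_e) * w_hc + w_e * w_ec)
    return weights

def ventriloquism_bias(x, fp_offset, disparities, sigma_h, sigma_e, w_e):
    """Weighted sum of the disparities (AV response bias minus saccade-related bias)."""
    weights = adaptation_weights(x, fp_offset, sigma_h, sigma_e, w_e)
    return sum(w * d for w, d in zip(weights, disparities))
```

With `w_e = 0` the weights no longer depend on `fp_offset`, which corresponds to the HC model; with `w_e = 1` the whole weight pattern moves with the fixation point.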
Figure 2C shows the operation of the ventriloquism adaptation. As mentioned above, the red and blue filled diamonds are the disparities at the individual training locations driving the adaptation in HC RF. The blue open diamonds are identical to the blue filled diamonds except that they are shifted to the left by the difference between the two FPs to illustrate how the eye gaze shift affects where the adaptation is expected to occur in the EC RF. The red and blue lines are then the resulting biases rv for the two fixation locations, each corresponding to the sum of Gaussians centered at different training locations in the two RFs (and with widths defined by the σ’s). Parameter wE determines the relative weights of the peaks in the blue line corresponding to the open diamonds vs. those corresponding to the filled diamonds. In summary, the blue and red lines show how visually guided adaptation is local and RF-dependent, decreasing with distance from location at which AV stimuli were present in HC and EC RFs. It also shows that since adaptation causes shifts from the saccade-bias response locations towards AV response locations, if AV responses fall on saccade bias locations, no visually guided adaptation is predicted to occur.
Finally, Figure 2D shows that the model prediction is a sum of the saccade bias (from Figure 2B) and the ventriloquism bias (Figure 2C) weighted by the parameter w (note that no scaling parameter is needed for the saccade bias, as the parameters of the sigmoid can already make this bias arbitrarily large).
4. Methods
4.1 Stimuli
The data from the studies of Kopco et al. [12, 13], simulated here, were obtained by inducing ventriloquism with training stimuli whose visual component was either shifted to the left, shifted to the right, or aligned with the auditory component, while the eyes fixated one location (Fig. 1A; upper and middle panels). The aftereffect was always measured by presenting auditory-only stimuli while the eyes fixated one of the two FPs (Figure 1A; lower panel). Thus, nominally, there were 6 conditions (3 shift directions by 2 training regions), corresponding to AV locations and responses shown by triplets of open symbols in Figure A1A. For these conditions, predictions could be compared to data for 9 locations at 2 FPs. However, the main experimental results simulated here were observed when differences between FPs were considered on aftereffect magnitude data, obtained by subtracting the negative-shift data from the positive-shift data and halving the result (Figure 1B; lower right panel; note that the latter difference is equivalent to averaging the magnitudes of “positive shift – no shift” and “negative shift – no shift”). These “double differential” (“positive – negative” difference of “training FP – non-training FP” difference) data were the most stable, as they eliminated much of the between-subject variability related to individual biases in responses (as will be illustrated later). Therefore, to focus the model on these important differences, the data were also transformed into the difference representation in two steps.
First, the data for the two training FPs were orthogonally transformed such that instead of using training and non-training FP, a sum and a difference across the two FPs was used. I.e., instead of having for each condition 18 data points corresponding to 9 locations at 2 FPs, we used 18 data points consisting of 9 locations summed across the two FPs and 9 locations for difference across the 2 FPs.
Second, the positive-shift and negative-shift condition data were transformed in a similar way, such that instead of positive and negative shift we used the aftereffect magnitude (i.e., a halved difference between the two shifts) and average across the two shifts. The no-shift data were left unmodified.
The complete data set therefore consisted of 108 data points [9 (locations) x 2 (transformed FPs) x 3 (transformed shifts) x 2 (training regions)]. Across-subject mean and standard deviation data were used in the simulations.
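The two transformation steps can be sketched as follows, assuming the “positive – negative” sign convention and plain (unscaled) sums and differences across FPs. The dictionary layout, key names, and toy numbers are hypothetical and only serve to show how the 108 data points arise.

```python
N_LOCATIONS = 9

def sum_and_difference(training_fp, non_training_fp):
    """Step 1: replace the two FP data sets by their sum and difference across FPs."""
    fp_sum = [t + n for t, n in zip(training_fp, non_training_fp)]
    fp_diff = [t - n for t, n in zip(training_fp, non_training_fp)]
    return fp_sum, fp_diff

def magnitude_and_average(positive, negative):
    """Step 2: replace positive/negative-shift data by aftereffect magnitude and average."""
    magnitude = [(p - n) / 2.0 for p, n in zip(positive, negative)]
    average = [(p + n) / 2.0 for p, n in zip(positive, negative)]
    return magnitude, average

def transform_region(data):
    """data[shift][fp] is a list of 9 biases; shift in pos/neg/none, fp in train/non."""
    by_shift = {}
    for shift in ("pos", "neg", "none"):
        fp_sum, fp_diff = sum_and_difference(data[shift]["train"], data[shift]["non"])
        by_shift[shift] = {"fp_sum": fp_sum, "fp_diff": fp_diff}
    out = {"magnitude": {}, "average": {}, "none": by_shift["none"]}  # no-shift left as is
    for key in ("fp_sum", "fp_diff"):
        mag, avg = magnitude_and_average(by_shift["pos"][key], by_shift["neg"][key])
        out["magnitude"][key] = mag
        out["average"][key] = avg
    return out

# Toy biases (deg), identical at all 9 locations, for one training region.
toy = {
    "pos": {"train": [3.0] * N_LOCATIONS, "non": [1.0] * N_LOCATIONS},
    "neg": {"train": [-3.0] * N_LOCATIONS, "non": [-1.0] * N_LOCATIONS},
    "none": {"train": [0.5] * N_LOCATIONS, "non": [0.0] * N_LOCATIONS},
}
regions = {"central": transform_region(toy), "peripheral": transform_region(toy)}
n_points = sum(len(values)
               for region in regions.values()
               for shift in region.values()
               for values in shift.values())
```

In this layout, `regions[...]["magnitude"]["fp_diff"]` holds the “double differential” data on which the main experimental results were observed.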
4.2 Simulations
Four simulations were performed in this study, each assessing both the HC and HEC models on a different subset of the Kopco et al. [12, 13] data. The first two simulations, the No Shift and All Data simulations, tested two main hypotheses about the current data and reference frame. Two supplementary simulations, the Central Data and Peripheral Data simulations, were performed to confirm that the model behavior matches the conclusions of the Kopco et al. [12, 13] studies when the two data sets are considered separately.
No Shift simulation assessed the models on the AV-aligned baseline no-shift data from both experiments (Figure 1C), examining the interaction between the saccade-related bias and visual signals when no ventriloquism is induced.
All Data simulation is the main simulation of this study. In this simulation the models were fitted on the complete dataset from both experiments (Figure 1B and C) to examine whether a uniform representation of the reference frame of ventriloquism aftereffect is mixed or purely head-centered.
Central Data simulation fitted only the central-training data from the positive-shift and negative-shift conditions (dashed lines in Figure 1B) while predictions were generated for all the data. The main goal was to examine the reference frame in which the ventriloquism aftereffect is induced in the central region.
Peripheral Data simulation fitted only the peripheral-training data from the positive-shift and negative-shift conditions (solid lines in Figure 1B) while predictions were generated for all the data. The main goal was to examine the reference frame in which the ventriloquism aftereffect is induced in the audiovisual periphery.
4.3 Model Fitting and Evaluation
Each simulation was performed by fitting the two models to the corresponding subset of the transformed data using a two-step procedure. First, a systematic search through the parameter space was performed, using all combinations of 10 values for each parameter, listed in Table 1 (the HEC model used all 7 parameters, while the HC model only used 5 of them). The limits of the range were chosen by piloting to cover the expected range of behaviors of the model. Note that quadratic spacing was chosen for parameters k and c, as the behavior of the sigmoidal function varies non-uniformly with the parameter values (k was sampled more densely at the lower end of the range, c at the higher end). Then we selected the best 100 parameter combinations in terms of weighted MSE, in which each data point was weighted by the inverse of the across-subject standard deviation in that data point. These parameter combinations were then used as starting positions for a non-linear iterative least-squares fitting procedure (Matlab function lsqnonlin) which, again, minimized the weighted MSE. The parameter values obtained by the best of these fits were chosen as the optimal values.
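The first fitting step can be sketched as below. This is an illustration, not the original Matlab code: the linear `toy_model` stands in for the actual model, the weighting is assumed to divide each residual by the across-subject standard deviation before squaring, and the second step (iterative least-squares refinement from the best starting points) is not reproduced. All function names are hypothetical.

```python
import itertools

def linear_grid(lo, hi, n=10):
    """n equally spaced values between lo and hi."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def quadratic_grid(lo, hi, n=10, dense_low=True):
    """n quadratically spaced values, denser at the low or at the high end."""
    fractions = [(i / (n - 1)) ** 2 for i in range(n)]
    if not dense_low:
        fractions = [1.0 - f for f in reversed(fractions)]
    return [lo + (hi - lo) * f for f in fractions]

def weighted_mse(predict, params, xs, means, sds):
    """Mean squared error with each residual divided by its standard deviation."""
    total = 0.0
    for x, mean, sd in zip(xs, means, sds):
        total += ((predict(x, *params) - mean) / sd) ** 2
    return total / len(xs)

def grid_search(predict, grids, xs, means, sds, keep=100):
    """Score every parameter combination and return the `keep` best (score, params)."""
    scored = [(weighted_mse(predict, params, xs, means, sds), params)
              for params in itertools.product(*grids)]
    scored.sort(key=lambda item: item[0])
    return scored[:keep]

def toy_model(x, a, b):
    """Placeholder model with two parameters."""
    return a * x + b

xs = [-1.0, 0.0, 1.0, 2.0]
means = [toy_model(x, 2.0, 1.0) for x in xs]  # synthetic "data"
sds = [0.5, 1.0, 1.0, 2.0]
grids = [linear_grid(0.0, 9.0), linear_grid(0.0, 9.0)]
best = grid_search(toy_model, grids, xs, means, sds, keep=5)
```

Each entry of `best` would then serve as a starting position for a non-linear least-squares routine minimizing the same weighted error.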
To compare the models’ performance while accounting for the number of parameters used by each model, we computed the Akaike information criterion AICc [19, 20] for each optimal fit, defined as:

$$\mathrm{AICc} = n \ln\!\left(\frac{\mathrm{SSE}(J)}{n}\right) + 2K + \frac{2K(K+1)}{n-K-1} \tag{8}$$

where n is the number of experimental data points, K is the number of fitted parameters, and SSE(J) is the sum of squares of errors across the data points (i.e., differences between predictions and across-subject mean data xi), weighted for each data point by the inverse of its across-subject standard deviation. In general, the model with the lower AICc is considered to be a better fit for the data. Then, to determine whether the data provide substantial support for one model over the other, we computed ΔAIC as the difference in AICc values of the model with the higher AICc vs. the one with the lower AICc. We used the following rule to determine whether the model with the lower AICc is substantially better than the other model [19]: “Models having ΔAIC < 2 have substantial support (evidence), those in which 4< ΔAIC < 7 have considerably less support, and models having ΔAIC > 10 have essentially no support.” Thus, only if ΔAIC is substantially larger than 2 is the result interpreted as evidence in favor of the model with the lower AICc.
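The model comparison can be sketched as follows, assuming the standard least-squares form of AICc with small-sample correction and assuming that K counts only the fitted model parameters. The function names are hypothetical, and the numbers in the usage example are arbitrary, not the fitted results.

```python
import math

def weighted_sse(predictions, means, sds):
    """Sum of squared errors, each residual divided by its across-subject SD."""
    return sum(((p - m) / sd) ** 2 for p, m, sd in zip(predictions, means, sds))

def aicc(sse, n, k):
    """Least-squares AICc for n data points and k fitted parameters (assumed form)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return n * math.log(sse / n) + 2 * k + (2 * k * (k + 1)) / (n - k - 1)

def delta_aic(aicc_a, aicc_b):
    """Difference between the higher and the lower AICc of two models."""
    return abs(aicc_a - aicc_b)

# Arbitrary example: same weighted SSE, 7 vs. 5 parameters, 108 data points.
hec_score = aicc(sse=108.0, n=108, k=7)
hc_score = aicc(sse=108.0, n=108, k=5)
penalty_only = delta_aic(hec_score, hc_score)
```

With identical SSE the difference reflects only the parameter penalty, so the 7-parameter model must reduce the weighted error enough to offset it before it is preferred.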
5. Simulation Hypotheses and Results
The results of the 4 simulations performed in this study are summarized in Table 2, which shows for each simulation and model the fitted model parameter values and the model’s performance measured using the AICc criterion.
5.1 No-shift simulation
This simulation focused on the AV aligned data, examining the hypothesis that the saccade-related bias combined with auditory space adaptation in HC RF causes the training-region-dependent differences in the AV-aligned baseline data (Figure 1C). I.e., it was predicted that EC visual signals adapting the auditory space representation do not need to be considered to explain the different adaptation effects observed in central vs. peripheral AV-aligned training. This hypothesis would be confirmed if the two models, HC and HEC, captured the behavioral data equally well.
Figure 3 presents the results of the simulation of the AV-aligned baseline no-shift condition from both experiments. Panel A shows the biases of the two model components (rows) for each of the two models (colors) with the fitted parameters as listed in Table 2, separately for the two fixation points (columns). The same fitted model parameters apply to both the central and peripheral training experiments. For the saccade-related bias (upper row), that means that the plotted graphs apply equally to both data sets. However, for the auditory space adaptation component (lower row), the plotted graphs apply to central training, since they show the effect of training at the 3 central locations (-7.5°, 0°, +7.5°). The graphs need to be shifted to the right by 22.5° to see their effect for the peripheral training data.
Panel B shows the data (circles with error bars corresponding to the standard error of the mean) and predictions of the two models (lines), separately for the two fixation points (upper and middle rows), as well as for the difference between the FPs (lower row). The columns represent the two training regions. Each prediction in the upper and middle rows is, roughly, a weighted sum of the corresponding components from panel A, while the predictions in the lower row of panel B show the differences of the predictions from the upper and middle rows.
Considering the model predictions of the mean data, both models captured all the significant trends in these data. Specifically, for the central training data, both models predicted the slight expansion of space, identical for both FPs (upper and middle row of the left-hand column), as well as the FP-dependence of the peripheral training data at the central locations (upper and middle row of the right-hand column). Most importantly, both models captured very well the difference data, which are near zero for the central training experiment and have a positive deviation for the peripheral training (bottom row). This conclusion is confirmed by the AICc evaluation, which showed no evidence that either of the models should be preferred (ΔAIC = 2.4).
The data in panel B are replotted from Figure 1C, now also including the error bars. These error bars show that there was a lot of across-subject variability when the individual FPs were considered (upper and middle row), while a large portion of that variability was eliminated when the differences in biases across the FPs were computed (lower row). This illustrates why the models were fitted on the transformed data, as those were much more consistent across subjects, and, with the transformation, the fitting weighed the difference data (lower row) more as they were much more reliable. Note that the second transformed data set, the average across FPs, is not shown, as it can be easily estimated from the individual FP data in the upper two rows of panel B.
Panel A illustrates how the models achieved the correct prediction. Both models predicted similar saccade-related bias, consisting of expansion at the peripheral target locations (+/-15°, +/-22.5°, and +/-30°) and bias towards the fixation location for the central 3 locations (upper row). This saccade-related bias was then modulated by the auditory space adaptation such that at the training locations the model predictions were shifted towards the AV responses, which were near zero for both the central and peripheral training (Figure A1A). The HC model predicts that this “corrective” ventriloquism shift only occurred in HC RF (brown lines in the lower row of panels), while the HEC model predicts a considerable contribution of the EC RF (magenta lines at locations -30° to -15° at the bottom right). However, that contribution only had a small effect on the overall predictions, as shown by the small differences between the brown and magenta lines in panel B.
5.2 All Data simulation
This was the main simulation of this study. The two models were fitted on the positive-shift and negative-shift data, in addition to the no-shift data from the previous simulation (Figure 1B and C). Also, the simulation was performed on the data from both experiments. Thus it assumed that the reference frame of ventriloquism aftereffect is uniform across the audiovisual field, as the models were optimized to fit both the central and peripheral training data simultaneously. The simulation further assumed that the saccade-related component of the model accounts for all the saccade-related effects (which are EC-referenced), an assumption supported by the results of the No Shift simulation. With these assumptions, the simulation examined the hypothesis that the RF is mixed, using visual signals in both head-centered and eye-centered coordinates. This hypothesis would be confirmed if the HEC model, using both HC and EC referenced visual signals, captured the behavioral data significantly better than the HC model, which only uses HC RF for the ventriloquism adaptation of the auditory space.
Figure 4 presents the results of this simulation. Panel A shows the biases of the two model components for the fitted parameter values from Table 2, in a format similar to panel A of Figure 3. Panel B shows the data (circles with error bars corresponding to the standard error of the mean) and predictions of the two models (lines). For this simulation, panel B shows only the difference between the Training and Non-training FP data, equivalent to the black lines in Figure 1B and 1C. The upper row of panel B shows the no-shift data replotted from Figure 1C (also shown in the bottom row of Figure 3B), while the lower row shows the difference between the positive-shift and negative-shift data, equivalent to a doubling of the aftereffect magnitude data from Figure 1B (black solid and dashed lines).
The data and model predictions addressing the main hypothesis of this simulation are in the lower row of panel B. The central training data show a large positive deviation in the middle of the target range, corresponding to the mixed reference frame, while the peripheral training data are always close to zero, evidence of the head-centered RF. The HEC model (magenta line) approximates this pattern by predicting a positive deviation in both training regions accompanied by a negative deviation of similar size for the targets to the left of the training regions. This pattern captures the main characteristics of the data even though the predicted positive deviation is weaker than that observed for the central-training data. On the other hand, the HC model (brown line) always predicts no deviation from zero, as that model assumes that the adaptation always occurs in the HC RF. These differences between the models confirm the hypothesis that auditory representation is adapted uniformly by visual signals in both head-centered and eye-centered reference frames. This conclusion is confirmed by the AICc evaluation, which showed almost no support for the HC model compared to the HEC model (ΔAIC = 7.9).
The model predictions for the no-shift data (upper row of panel B) are almost identical for the two models. Thus, the difference in performance between the models cannot be explained by differences in accounting for the no-shift data. Notably, the predictions for the two training regions are fairly similar to each other, and slightly worse than those obtained in the No Shift simulation. However, they still capture the pattern of biases fairly well. Finally, note that the predictions for the average of positive and negative shift data are not shown, even though these transformed data were also used for fitting. These data were omitted as both the data and model predictions are very similar to the no-shift results shown in the upper row of panel B.
Looking at across-subject variability in the data, the error bars in panel B tend to be smaller for the positive-vs-negative shift plots (lower row) than for the no-shift plots (upper row). This difference is in fact much larger, since the plotted error bars are for the difference between the two shift directions, whereas the aftereffect magnitude equal to half of the difference was used in the fitting. This shows that additional between-subject variability was caused by idiosyncratic biases in each subject’s responses that are consistent within each subject, and which therefore cancel out when the difference between positive and negative shift data is computed. This again shows the importance of fitting the models on the transformed data, which resulted in weighting the positive-vs-negative shift difference data (lower row) even more than the no-shift training-vs-non-training FP data (upper row).
Panel A illustrates the behavior of individual components that resulted in the models’ predictions. The saccade-related bias is almost identical for the two models (upper row), and overall similar to the pattern observed in the No Shift simulation (Figure 3A). The auditory space adaptation is broad for both models, and only slightly different between the models (magenta vs. brown lines in the lower row of Figure 4B). The size of the difference is mainly determined by parameter wE (see Table 2), which defines the relative contribution of the eye-centered vs. head-centered RF to the combined representation in the HEC model (in this simulation wE = 0.15, indicating that the EC RF only had a 15% weight in the mixed reference frame). Thus, even though this contribution is highly significant, the HC RF still has a dominant role when a uniform representation of the auditory space is assumed.
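The AICc comparisons reported for the simulations can be illustrated with a minimal sketch. It assumes the common least-squares form of AICc (i.i.d. Gaussian residuals); the exact formulation, the number of fitted data points, and the parameter counts used in this study are not restated here, so the function arguments are placeholders. The helper `relative_likelihood` converts an AICc difference into the likelihood of the weaker model relative to the better one.

```python
import math


def aicc_least_squares(rss, n_points, n_params):
    """AICc of a least-squares fit, assuming i.i.d. Gaussian residuals."""
    if n_points - n_params - 1 <= 0:
        raise ValueError("AICc requires n_points > n_params + 1")
    aic = n_points * math.log(rss / n_points) + 2 * n_params
    correction = 2 * n_params * (n_params + 1) / (n_points - n_params - 1)
    return aic + correction


def relative_likelihood(delta_aicc):
    """Likelihood of the higher-AICc model relative to the lower-AICc one."""
    return math.exp(-delta_aicc / 2.0)


# Differences reported in the text for the No Shift and All Data simulations.
for delta in (2.4, 7.9):
    print(f"delta = {delta}: relative likelihood = {relative_likelihood(delta):.3f}")
```

Under this reading, a difference of 2.4 leaves the weaker model with a relative likelihood of roughly 0.3, whereas a difference of 7.9 leaves it with roughly 0.02, in line with the statements that neither model is preferred in the first case and that the HC model has almost no support in the second.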
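The cancellation of idiosyncratic biases described above can be shown with a small numerical sketch. All values are hypothetical: each simulated subject has a constant response offset that is added to both the positive-shift and the negative-shift data, and the aftereffect magnitude is taken as half of the difference between the two.

```python
import statistics


def half_difference(positive_shift, negative_shift):
    """Aftereffect magnitude: half of the positive-minus-negative difference."""
    return [(p - n) / 2.0 for p, n in zip(positive_shift, negative_shift)]


# Hypothetical subjects: a shared aftereffect plus a subject-specific offset (deg).
AFTEREFFECT_DEG = 2.0
subject_offsets = [-3.0, -1.0, 1.5, 4.0]
positive_shift = [offset + AFTEREFFECT_DEG for offset in subject_offsets]
negative_shift = [offset - AFTEREFFECT_DEG for offset in subject_offsets]

magnitude = half_difference(positive_shift, negative_shift)
raw_sd = statistics.stdev(positive_shift)      # dominated by the offsets
transformed_sd = statistics.stdev(magnitude)   # offsets cancel in the difference
print(raw_sd, transformed_sd)
```

In this idealized case the offsets cancel completely, so the transformed data have no across-subject spread. In real data the cancellation is only partial, but the same mechanism explains why the difference data are more reliable than the untransformed responses.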
5.3 Central and Peripheral Data simulations
Two additional simulations were performed, each of them fitting separately the data for only one training region. The main goal of the simulations was to verify that, when the models are fitted to the two data sets separately, they will confirm the conclusions of the behavioral experiments about the mixed reference frame for the central-training data and the head-centered reference frame for the peripheral-training data. Additionally, these simulations only fitted the transformed positive-shift and negative-shift data, while also producing model predictions for the no-shift data. Thereby, the simulations tested whether the behavior of the saccade-related model component observed in the previous simulations is dependent on the presence of the no-shift data, or whether the models would find a similar predicted pattern even if only the positive/negative shift data are considered.
The Central Data simulation fitted only the central-training data from the positive-shift and negative-shift conditions (dashed lines in Figure 1B). The main hypothesis tested in the simulation was that the RF is mixed when VAE is induced in the central region. This hypothesis would be confirmed if the HEC model is significantly better than the HC model. Figure 5 presents the results of this simulation using a layout identical to Figure 4. The lower row of panel B shows the predictions of the two models for the difference data. As expected, the HEC model (magenta) fits the central-training data well (better than in the All Data simulation) while the HC model’s prediction (brown) is again fixed at zero. This difference confirms the hypothesis that the EC RF contributes significantly to the ventriloquism adaptation in the central region, a conclusion also confirmed by the AICc evaluation (HEC model better than HC model; ΔAIC = 5.9). However, it is also noticeable that the HEC model underestimates the central data for targets at azimuths around 0° while it predicts a negative deviation at azimuths around -20°, not observed in the data. This negative deviation is due to the structure of the model, which always predicts that a positive deviation is accompanied by a negative deviation at locations shifted in the direction of the new, non-training FP location. For the peripheral experiment, the HEC model predictions depart considerably from the data, as expected since the data do not show a strong EC RF contribution. On the other hand, for the no-shift data, both models largely capture the main trends even though they were not fitted on these data (upper row of panel B). This confirms that the FP-dependent adaptation observed in the no-shift data is not specific to these data, as the model generalizes to predict it even if only trained on the positive and negative shift data.
Considering the individual model components (Panel A), the results are overall similar to the All Data simulation (Figure 4). The main difference in the current simulation is that the EC-referenced contribution to auditory spatial adaptation in the HEC model is considerably stronger, resulting in larger differences between the two models (bottom row). However, even here the HC RF still has more weight (wE = 0.3 in Table 2), suggesting that it is the more dominant RF for ventriloquism aftereffect in general.
The Peripheral Data simulation fitted only the peripheral-training data from the positive-shift and negative-shift conditions (dashed lines in Figure 1B). The main goal was to confirm the hypothesis that the RF is head-centered when VAE is induced in the peripheral region, in agreement with the behavioral results. This hypothesis would be confirmed if the HEC and HC models performed similarly in the simulation.
Figure 6 presents the results of this simulation using a layout identical to Figure 4. The lower row of panel B shows the predictions of the two models for the positive vs. negative shift difference data. As expected, both models fit the near-zero peripheral-training data well, while failing to predict the central-training data. This confirms that the EC RF does not contribute to the ventriloquism adaptation in the peripheral region, a conclusion also supported by the AICc evaluation, in which the HC model is better than the HEC model (ΔAIC = 5.6 in Table 2). Similar to the Central Data simulation, for the no-shift data, both models largely captured the main trends even though they were not fitted on these data (upper row of panel B). These results are also confirmed when considering the individual model components (Panel A). First, the saccade-related bias component (upper row) again behaves identically in the two models, as in the previous simulations. Second, the auditory space adaptation component (lower row) behaves nearly identically for the two models, as determined by the low relative weight of the EC RF in the HEC model (wE = 0.04 in Table 2).
5.4 Model parameter values
The behavior of the models in different conditions can be analyzed by looking at the fitted values of the model parameters. Here, the first main modeling question concerned the ability of the models to predict the EC-dependence of the no-shift data observed in the peripheral, but not in the central, training condition. The critical model parameters here are the parameters h and w, which determine the relative strength of the saccade-related and auditory space adaptation components of the model (Figure A1 and Table 2). The values of the two parameters are overall similar in all simulations, suggesting that both components contributed critically to all the predictions.
The parameter wE determined the relative strength of the EC RF contribution to the ventriloquism-driven auditory spatial adaptation, while the parameters σH and σE determined, respectively, how broad vs. specific the influence of the HC and EC RFs was. The value of wE was always much smaller than 0.5 (in relevant simulations smaller than or equal to 0.3) and σH was always much larger than σE. Both these observations indicate that while the EC-referenced signals influence the ventriloquism adaptation significantly, their effect is mostly modulatory, while the HC-referenced signals dominate.
Finally, the fitted values of parameters k and c did not change dramatically across the simulations, always resulting in similar predictions about the saccade-related bias component of the model.
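The roles of wE, σH, and σE can be illustrated with a simplified sketch. It is not the model's actual formulation: it assumes that the strength of adaptation at a target is a linear mixture of an HC-referenced and an EC-referenced Gaussian neighborhood around the training location, and it uses hypothetical fixation points and widths (only wE = 0.15 is taken from the text). The function names are illustrative.

```python
import math


def gaussian(distance, sigma):
    """Unnormalized Gaussian neighborhood (peak value 1 at zero distance)."""
    return math.exp(-0.5 * (distance / sigma) ** 2)


def adaptation_weight(target_hc, fp_test, train_hc, fp_train, w_e, sigma_h, sigma_e):
    """Hypothetical adaptation strength at a target tested from fp_test.

    The HC distance ignores eye position; the EC distance compares locations
    relative to the fixation point at test and at training. w_e = 0 is HC-only.
    """
    hc_term = gaussian(target_hc - train_hc, sigma_h)
    ec_term = gaussian((target_hc - fp_test) - (train_hc - fp_train), sigma_e)
    return (1.0 - w_e) * hc_term + w_e * ec_term


def fp_difference(target_hc, train_hc, fp_train, fp_non, w_e, sigma_h, sigma_e):
    """Training-FP minus non-training-FP adaptation strength at one target."""
    at_train_fp = adaptation_weight(target_hc, fp_train, train_hc, fp_train, w_e, sigma_h, sigma_e)
    at_non_fp = adaptation_weight(target_hc, fp_non, train_hc, fp_train, w_e, sigma_h, sigma_e)
    return at_train_fp - at_non_fp


# Hypothetical geometry and widths (degrees); only w_e = 0.15 comes from the text.
TRAIN_HC, FP_TRAIN, FP_NON = 0.0, 10.0, -10.0
SIGMA_H, SIGMA_E = 20.0, 8.0
for target in (-30.0, -20.0, -10.0, 0.0, 10.0):
    hc_only = fp_difference(target, TRAIN_HC, FP_TRAIN, FP_NON, 0.0, SIGMA_H, SIGMA_E)
    mixed = fp_difference(target, TRAIN_HC, FP_TRAIN, FP_NON, 0.15, SIGMA_H, SIGMA_E)
    print(f"target {target:+6.1f}: HC-only {hc_only:+.3f}, mixed {mixed:+.3f}")
```

Under these assumptions the HC-only variant gives a zero difference between the fixation points at every target, while the mixed variant gives a positive deviation at the training location and a negative deviation of equal size at the location shifted toward the non-training FP. This mirrors the qualitative behavior of the HC and HEC models described for the difference data.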
6. Summary and Discussion
The HC/HEC model introduced here aims to characterize the reference frame in which auditory and visual signals are combined to induce the ventriloquism aftereffect. It focuses on the experimental data in which ventriloquism was induced locally in either the audiovisual center or periphery, in which a change in fixation point was used to dissociate the head-centered from the eye-centered reference frame, and in which saccades were used for responding during training and testing [12, 13]. The model assumes a population of adaptive units representing the auditory space with auditory and visual inputs, similar to the channel processing model proposed in [21]. However, instead of explicitly implementing a population of units, it describes the adaptive effects by only considering the locations from which the auditory components of audiovisual training stimuli were presented. Then, for each unit there is a Gaussian neighborhood in which the AV training affects the A-only responses in either HC-only RF (HC model) or in a combined HC+EC RF (HEC model). Also, the model assumes that there are intrinsic biases associated with auditory saccade responses, and that the effect of ventriloquism is to shift the auditory-only responses from these saccade-related biases towards the locations of the responses on the audiovisual training trials.
Since the model only uses the responses on audiovisual training trials to guide adaptation, independent of the direction of audiovisual disparity used during training, and independent of whether the adaptation results in hypometric or hypermetric saccades, it is assumed that there is a direct relation between the audiovisual responses during training and the auditory-only responses during testing. Specifically, the assumed relationship is that the ratio of observed ventriloquism aftereffect to the observed ventriloquism effect is constant, as confirmed by our behavioral data analysis (see Appendix), which found a ratio of approximately 0.5. This ratio is not affected by whether the aftereffect results in hypometric or hypermetric saccades, consistent with Pages and Groh [17]. However, the analysis also found that there is an asymmetry in the ventriloquism effect when measured using audiovisual saccades. Specifically, the effect reaches 100% of audio-visual disparity if resulting in hypometric saccades, while it is only 80% of the disparity when resulting in hypermetric saccades. Future studies will need to determine whether there is really a difference in the presence/absence of the hypo/hypermetric asymmetry when saccades are used for ventriloquism effect and aftereffect measurement, or whether the current results are different for the effect vs. aftereffect only because the aftereffect data are noisier.
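The constant-ratio assumption amounts to simple arithmetic, sketched below. The ratio of 0.5, the capture fractions of 100% and 80%, and the 5° audiovisual separation are the approximate values reported in the text and the Appendix; the function names are illustrative only.

```python
VAE_TO_VE_RATIO = 0.5  # approximate ratio found in the Appendix analysis


def ventriloquism_effect(disparity_deg, capture_fraction):
    """Shift of the audiovisual response toward the visual component (deg)."""
    return capture_fraction * disparity_deg


def predicted_aftereffect(effect_deg, ratio=VAE_TO_VE_RATIO):
    """Aftereffect implied by a constant aftereffect-to-effect ratio (deg)."""
    return ratio * effect_deg


DISPARITY_DEG = 5.0  # audiovisual separation in the analyzed training stimuli
for label, fraction in (("hypometric", 1.0), ("hypermetric", 0.8)):
    effect = ventriloquism_effect(DISPARITY_DEG, fraction)
    aftereffect = predicted_aftereffect(effect)
    print(f"{label}: VE = {effect:.1f} deg, predicted VAE = {aftereffect:.1f} deg")
```

With these values the implied aftereffects are 2.5° and 2.0°, so a strictly constant ratio would predict only a 0.5° hypo/hypermetric difference in the aftereffect. Whether such a small difference is absent or merely masked by noise is the open question raised above.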
The four simulations presented here showed that the HC/HEC model can describe the different phenomena observed in the Kopco et al. [12, 13] studies. First, in the No-Shift simulation, the simpler HC model accurately predicted the newly reported adaptation by AV-aligned stimuli [13] as a combination of the intrinsically present saccade-related biases locally “corrected” by the visually guided adaptation at the training locations. Thus, the model predicts that this AV-aligned adaptation for the peripheral-training data is purely driven by some adaptive processes affecting the motor representations related to audiovisual/auditory saccades. This, as well as the existence of the saccade-related bias component of the model, can be tested in future studies, as the currently available data are not consistent as to whether auditory saccades are predominantly hypermetric or hypometric [16, 18]. Both these predictions can be experimentally tested by performing ventriloquism experiments in which saccades are not used for responding [22].
The second, All Data simulation addressed the main question of this study about the reference frame of the ventriloquism aftereffect. Its results provide evidence that a uniform auditory spatial representation uses a mixed reference frame, with visual signals adapting the auditory spatial representation in both head-centered and eye-centered RFs, as implemented in the HEC model and consistent with physiological studies [23, 24]. Importantly, the current results suggest that, in the mixed frame, the relative contribution of the EC RF is only 15% vs. 85% for the HC RF. Moreover, even when only the central-training data are considered (Central Data simulation), the relative contribution of the EC RF only reaches 30%. Thus, the HC RF is always dominant for the ventriloquism aftereffect adaptation, an observation that is further supported by the comparison of the fitted sigma parameters (which showed that the HC-referenced adaptation is broader than the EC-referenced adaptation). The second simulation also showed that the model in its current form always predicts the same difference in biases between the FPs, independent of the training region. This effect is mainly due to the implicit model assumption that the distribution of the spatial channels is uniform across space. If the model assumed a denser representation of space near the midline (e.g., see [25]), it could predict adaptation that is stronger in the center than in the periphery.
Importantly, the current model was fitted on data transformed so that the differences between the two FPs and differences between the positive and negative shift data were used. This was particularly critical for this simulation, in which the EC contribution is visible when the double difference is computed, and it was also important since, in this representation, a lot of noise in the data is removed. Note that when the All Data simulation was repeated on untransformed data, the AICc evaluation did not find a significant difference between the HC and HEC models, since the across-subject variability in the responses considered separately for the two FPs was too large, dominating over the differences between the FPs critical to evaluate the reference frames (data not shown).
The final two simulations examined the model behavior when fitted separately to the central vs. peripheral training data. In both simulations the model predictions were in agreement with the behavioral data. Specifically, the HEC model using a mixed reference frame better predicted the central data, while the HC model using the head-centered reference frame better predicted the peripheral data. The central-data simulation also showed one weakness of the model: in its current form it always predicts that if there is a region in which VAE magnitude is larger for the training-FP than non-training-FP data, then there also has to be a region in which the relationship is reversed. An extension of the model which would make the strength of the adaptation depend not only on the distance from the training stimuli, but also on the distance from the training FP, could correct this discrepancy.
Finally, the Central and Peripheral Data simulations accurately captured the no-shift data, even though the models were not fitted on them, confirming that the pattern of adaptation exhibited in these data is also present in the positive-shift and negative-shift data from which it can generalize to the no-shift data. However, as discussed above, the no-shift data biases are most likely related to the saccade responses, not to the spatial representation adapted by ventriloquism, which is of primary interest here.
The neural mechanisms of the ventriloquism aftereffect and its reference frame are not well understood. Cortical areas involved in ventriloquism aftereffect likely include Heschl’s gyrus, planum temporale, intraparietal sulcus, and inferior parietal lobule [26–29]. Multiple studies found some form of hybrid representation or mixed auditory and visual signals in several areas of the auditory pathway, including the inferior colliculus [30], primary auditory cortex [31], the posterior parietal cortex [23, 32, 33], as well as in the areas responsible for planning saccades in the superior colliculus and the frontal eye fields [34, 35]. In the current model, the saccade-related component likely corresponds to the saccade-planning areas. The auditory space representation component likely corresponds to the higher auditory cortical areas or the posterior parietal areas, not the primary cortical areas. This can be expected because there is growing evidence that, in mammals, auditory space is primarily encoded non-homogeneously, based on two spatial channels roughly aligned with the left and right hemifields of the horizontal plane [14, 15, 36–38] and the ventriloquism adaptations modeled here are local (within a hemifield or just in the central region), not consistent with broad adaptation predicted by the hemifield code. However, note that there are also theories which incorporate additional channels, such as a central channel, in addition to the hemifield channels [39]. Such extended models might be compatible with the current data.
Even though most previous recalibration studies examined the aftereffect on the time scales of minutes [1, 2, 40, 41], recent studies demonstrated that it can be elicited very rapidly, e.g., by a single trial with audio-visual conflict [42]. If it is the case that the adaptive processes underlying the ventriloquism aftereffect occur on multiple time scales, as also suggested in several models of slower ventriloquism aftereffect [6, 7], then an open question is whether the reference frame is the same at the different scales or whether it is different. The current results are mostly applicable to the slow adaptation on the time scale of minutes, while the RF on the shorter time scales has not been previously explored.
In summary, while some previous models considered the reference frame of the ventriloquism effect [9, 10], the current HC/HEC model is, to our knowledge, the first one to focus on the RF of the ventriloquism aftereffect. In addition, it also considers how saccade-related adaptation might influence auditory saccades. In the future, it can be combined with the existing models of spatial and temporal characteristics of the ventriloquism aftereffect to obtain a more general model of this important multisensory phenomenon.
Acknowledgments
This work was supported by VEGA-1/0355/20 and VVGS UPJS VVGS-2020-1514.
Appendix
To examine whether auditory saccades used for responding have properties that might be important for the current modeling, responses to auditory and audiovisual stimuli in the training regions of both experiments were further analyzed (Figure A1). Two questions were addressed. First, we examined whether the observed saccades were longer or shorter depending on whether the presence of visual component/adaptation resulted in saccades that were hypometric (shorter than needed to reach the auditory target) or hypermetric (longer than needed to reach the auditory target). Such asymmetry, if observed, would suggest that some of the effects described in Section 2, e.g., the eye-centered RF effects, might have been caused by the saccade responses. Second, we evaluated whether the ratio of the magnitudes in auditory-only responses to the respective AV responses for a given AV stimulus is constant for all combinations of audiovisual stimuli. If that is the case, then, independent of any possible hypo/hypermetric dependence, the model can assume that the predicted ventriloquism aftereffect is directly related to the measured ventriloquism effect.
Figure A1A shows the biases in saccade responses from the training FP for targets in the training regions from both experiments (circles vs. squares). Open symbols represent audio-visual responses, filled symbols auditory-only responses. Black symbols represent the AV-aligned runs, while the magenta and cyan symbols represent, respectively, the runs in which the response shifts towards the visual component/adaptation resulted in saccades that were hypometric and hypermetric. Specifically, the magenta circles represent the central-training data with visual component shifted to the right, i.e., towards the fixation point, while the magenta squares represent the peripheral-training data with visual component shifted to the left, i.e., again towards the fixation point (the cyan data then represent the corresponding data for visual components shifted in the opposite direction). Note that the filled symbols here show the same data as the red lines in the training regions of Figure 1B, C.
The black symbols in Figure A1A show that, in both experiments, all the saccades in the AV-aligned runs were fairly accurate. Specifically, responses to the AV stimuli were within +/-0.5° (open black symbols) while the saccades to the auditory targets (filled black symbols) tended to be hypometric (rightward bias for targets to the left of the FP and leftward for the targets to the right) by up to 1°, except for one data point (7.5°), discussed in detail later.
Comparison of the respective magenta and cyan symbols shows that the adaptation direction (i.e., visual component displacement) that led to hypometric saccades tended to result in larger biases than the direction leading to hypermetric saccades (for example, all the magenta filled circles are clustered around the value of 3, while the corresponding cyan filled circles are around -1). To analyze this asymmetry while accounting for the biases in the AV-aligned responses, Figure A1B shows the hypometric and hypermetric data from panel A referenced to the respective baselines and plotted such that positive values always represent bias in the direction of the visual component displacement (i.e., all the cyan circles and magenta squares had their signs flipped after subtracting the baseline). The magenta open symbols show that, independent of the training region, the VE responses measured in conditions resulting in hypometric saccades were aligned with the visual component (which was separated by 5°), while the responses resulting in hypermetric saccades (open cyan symbols) only reach approximately 80% of the visual component displacement. A mixed ANOVA with a between-subject factor of Experiment (Central, Peripheral) and within-subject factors of Shift Direction (Hypometric, Hypermetric) and Azimuth (Small, Medium, Large) performed on these data confirmed these results, showing a significant main effect of Shift Direction (F(1,12) = 5.78; p = 0.033). The ANOVA also found a significant Azimuth × Experiment interaction (F(2,24) = 9.71; p = 0.006) reflecting a dependence of the effect on the target location that is not further considered here, and no other significant main effects or interactions (p > 0.1). On the other hand, for the VAE data, no significant difference between hypometric and hypermetric saccades was observed (a similar ANOVA on these data only found a main effect of Azimuth; F(2,24) = 7.94; p = 0.002).
Thus, the strong asymmetry between the hypometric and hypermetric AV data in panel A (filled cyan vs. magenta symbols) can be ascribed to overall hypometry of the auditory saccades exhibited also by the No-Shift data (black filled symbols). Also note that there is one hypermetric AV data point for which the response referenced to baseline is near 0 (left-most filled cyan circle), not following the pattern observed for all the other points. Most likely, this inconsistency is caused by some specific characteristic of the baseline auditory-only saccades, as this point corresponds to the only black filled symbol that shows hypermetry instead of hypometry in panel A (the black filled circle at the 7.5° location).
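The baseline referencing and sign flipping used for panel B can be written as a short sketch. The raw biases of 3 and -1 are the approximate cluster values quoted above for the filled circles; the baseline of 1° is an assumed illustrative value (the text only states that baseline hypometry was up to 1°), and the function name is hypothetical.

```python
def referenced_bias(response_bias, baseline_bias, visual_shift_deg):
    """Bias relative to the AV-aligned baseline, signed toward the visual shift."""
    if visual_shift_deg == 0:
        raise ValueError("the AV-aligned condition is the baseline itself")
    direction = 1.0 if visual_shift_deg > 0 else -1.0
    return direction * (response_bias - baseline_bias)


# Biases in degrees, positive = rightward. The baseline value is an assumption.
BASELINE_BIAS = 1.0
rightward_shift_run = referenced_bias(3.0, BASELINE_BIAS, visual_shift_deg=+5.0)
leftward_shift_run = referenced_bias(-1.0, BASELINE_BIAS, visual_shift_deg=-5.0)
print(rightward_shift_run, leftward_shift_run)
```

With this assumed baseline, the raw biases of 3 and -1 both correspond to a referenced shift of 2° toward the visual component, which illustrates how an apparent asymmetry in panel A can arise from baseline hypometry alone.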
Finally, panel C shows the observed VAE as a proportion of the observed VE (i.e., each symbol in panel C shows the ratio of the corresponding filled and open symbols from panel B). In this analysis, one subject was identified as an outlier (in at least one data point the subject differed from the across-subject mean by more than 3 standard deviations). This subject is plotted separately (crosses) and not included in the across-subject graphs. For the remaining subjects, Figure A1C shows that there is a constant relationship between the induced ventriloquism effects and aftereffects such that the aftereffect is always approximately one half of the effect (with a slight tendency to grow with the target amplitude), independent of whether the shift is hypo/hypermetric or of the training region. Confirming this observation, an ANOVA with the same factors as above only found a main effect of Azimuth (F(2,22) = 10.34, p = 0.0007). The only other factor that approached significance was Training Region (F(1,11) = 3.83, p = 0.076) while all the other factors and interactions were not significant (p > 0.15). These results are used in the current modeling, in which it is assumed that there is a constant relationship between the induced ventriloquism effect and aftereffect, independent of whether the induced shift is hypometric or hypermetric.
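The outlier criterion described above can be sketched as follows. The sketch assumes that the mean and standard deviation for each data point are computed across all subjects, including the candidate outlier, and that the sample standard deviation is used; the text does not specify either choice. The data are hypothetical.

```python
import statistics


def find_outlier_subjects(data, n_sd=3.0):
    """Indices of subjects deviating by more than n_sd SDs in any data point.

    data[s][c] is the value for subject s in condition c. The mean and SD of
    each condition are computed across all subjects (an assumption).
    """
    n_conditions = len(data[0])
    outliers = set()
    for c in range(n_conditions):
        column = [subject[c] for subject in data]
        mean = statistics.mean(column)
        sd = statistics.stdev(column)
        for s, value in enumerate(column):
            if sd > 0 and abs(value - mean) > n_sd * sd:
                outliers.add(s)
    return sorted(outliers)


# Hypothetical VAE/VE ratios: 13 typical subjects, one extreme subject, 2 conditions.
ratios = [[0.50 + 0.01 * i, 0.45 + 0.01 * i] for i in range(13)] + [[5.0, 0.5]]
flagged = find_outlier_subjects(ratios)
kept = [row for s, row in enumerate(ratios) if s not in flagged]
print(flagged, len(kept))
```

When the candidate is included in the sample statistics, a single subject among n can deviate by at most (n - 1)/√n standard deviations, which is about 3.47 for 14 subjects. Under that assumption, the 3-SD criterion flags only very extreme values in a group of this size.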