Abstract
Despite recent progress in multisensory research, the absence of stimulus-computable perceptual models fundamentally limits our understanding of how the brain extracts and combines task-relevant cues from the continuous flow of natural multisensory stimuli. In previous research, we demonstrated that a correlation detector originally proposed for insect motion vision can predict the temporal integration of minimalistic audiovisual signals. Here, we show how a population of such units can process natural audiovisual stimuli and accurately account for human, monkey, and rat behaviour across simulations of 69 classic psychophysical, eye-tracking, and pharmacological experiments. Given only the raw audiovisual stimuli (i.e., real-life footage) as input, our population model replicated the observed responses with an average correlation exceeding 0.97. Despite relying on between zero and four free parameters, the population model provides an end-to-end account of audiovisual integration in mammals, from individual pixels and audio samples to behavioural responses. Remarkably, the population response to natural audiovisual scenes generates saliency maps that predict spontaneous gaze direction, Bayesian causal inference, and a variety of previously reported multisensory illusions. This study demonstrates that the integration of audiovisual stimuli, regardless of their complexity, can be accounted for in terms of elementary joint analyses of luminance and sound level. Beyond advancing our understanding of the computational principles underlying multisensory integration in mammals, this model provides a bio-inspired, general-purpose solution for multimodal machine perception.
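The abstract does not reproduce the model equations, so the sketch below is only a rough illustration of what an "elementary joint analysis of luminance and sound level" could look like: a single correlation-detector unit in the spirit of the insect-motion-vision detector the abstract refers to, applied to one pixel's luminance and the sound-level envelope. The filter form, time constants, subunit arithmetic, and all names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lowpass(signal, tau, dt):
    """Causal first-order low-pass filter via convolution with an exponential kernel."""
    t = np.arange(0.0, 5.0 * tau, dt)
    kernel = np.exp(-t / tau)
    kernel /= kernel.sum()
    return np.convolve(signal, kernel)[: len(signal)]

def correlation_unit(vis, aud, dt, tau_fast=0.05, tau_slow=0.15):
    """
    Hypothetical audiovisual correlation-detector unit (illustrative only).
    vis: luminance time course at one spatial location (1-D array)
    aud: sound-level envelope sampled at the same rate
    Returns a 'correlation' output (evidence for a common audiovisual cause)
    and a 'lag' output (sign indicating which modality leads).
    """
    v_fast, v_slow = lowpass(vis, tau_fast, dt), lowpass(vis, tau_slow, dt)
    a_fast, a_slow = lowpass(aud, tau_fast, dt), lowpass(aud, tau_slow, dt)
    # Two mirror-symmetric subunits cross-multiply the filtered signals.
    u1 = v_slow * a_fast
    u2 = v_fast * a_slow
    corr = u1 * u2   # peaks for temporally correlated audiovisual input
    lag = u1 - u2    # signed audiovisual temporal order
    return corr, lag

# Illustrative population use on synthetic data: one detector per pixel of a
# (T x H x W) luminance movie, sharing the audio envelope; averaging the
# correlation output over time gives a crude audiovisual saliency map of the
# kind the abstract describes.
T, H, W = 300, 48, 64
rng = np.random.default_rng(0)
video = rng.random((T, H, W))
audio_env = rng.random(T)

saliency = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        corr, _ = correlation_unit(video[:, i, j], audio_env, dt=1 / 30)
        saliency[i, j] = corr.mean()
```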
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Revised version with additional simulations, including psychophysical data from rats and eye-tracking data from monkeys.