Abstract
In face-to-face communication, audio-visual (AV) stimuli can be fused, combined, or perceived as mismatching. While the left superior temporal sulcus (LSTS) is widely acknowledged as the locus of AV integration, the process leading to combination remains unknown. Analysing behaviour and time-/source-resolved human MEG data, we show that fusion and combination both involve early detection of a discrepancy between AV physical features in the LSTS, but that this initial registration is followed, in combination only, by the activation of AV asynchrony-sensitive regions (auditory and inferior frontal cortices). Based on dynamic causal modelling and neural signal decoding, we further show that the outcome of AV speech integration depends primarily on whether or not the LSTS quickly converges onto an existing multimodal syllable representation, and that combination results from the subsequent temporal re-ordering of the discrepant AV stimuli in time-sensitive regions of the prefrontal and temporal cortices.