Abstract
Animals continuously combine information across sensory modalities and time, and use these combined signals to guide their behaviour. Picture a predator watching their prey sprint and screech through a field. To date, a range of multisensory algorithms have been proposed to model this process, including linear and nonlinear fusion, which combine the inputs from multiple sensory channels via either a sum or a nonlinear function. However, many multisensory algorithms treat successive observations independently, and so cannot leverage the temporal structure inherent to naturalistic stimuli. To investigate this, we introduce a novel multisensory task in which we provide the same number of task-relevant signals per trial but vary how this information is presented: from many short bursts to a few long sequences. We demonstrate that multisensory algorithms that treat different time steps as independent perform sub-optimally on this task. However, simply augmenting these algorithms to integrate across sensory channels and short temporal windows allows them to perform surprisingly well, and comparably to fully recurrent neural networks. Overall, our work: highlights the benefits of fusing multisensory information across channels and time, shows that small increases in circuit/model complexity can lead to significant gains in performance, and provides a novel multisensory task for testing the relevance of this in biological systems.
Key Points
We introduce a novel multisensory task in which we provide task-relevant evidence via bursts of varying duration, amidst a noisy background.
Prior multisensory algorithms perform sub-optimally on this task, as they cannot leverage temporal structure.
However, they can perform better by integrating across sensory channels and short temporal windows.
Surprisingly, this allows for comparable performance to fully recurrent neural networks, while using less than one tenth the number of parameters.
2 Introduction
Picture a predator trying to track prey through a dense field. How should they approach this challenge? One solution would be to rely solely on either visual or auditory cues, such as sightings of, or screeches from, their prey. However, these unisensory strategies will be sub-optimal in many situations, like poor lighting conditions or noisy environments. Consequently, many animals combine information across their senses and base their decisions on these merged signals: a fundamental process termed multisensory integration [Trommershauser et al., 2011, Fetsch et al., 2013].
To date, numerous algorithms have been proposed to describe how animals implement this process [Jones, 2016]. For example, n-look algorithms suggest that observers examine the inputs from n channels, but form their “multisensory” output using only one - which could be the channel with the strongest or fastest signal [Townsend and Wenger, 2004, Otto and Mamassian, 2012]. In contrast, fusion algorithms form their outputs by combining their inputs across sensory channels either linearly [Fetsch et al., 2012, Drugowitsch et al., 2014, Hou et al., 2019, Coen et al., 2023] or nonlinearly [Jones, 2016, Parise and Ernst, 2016, Ghosh et al., 2024]. These algorithms can all be interpreted as instantaneous input-output mappings, or coupled to drift-diffusion models and used to explore how observers integrate multisensory evidence over time - for example, how animals determine their heading direction from visual and vestibular cues [Fetsch et al., 2012, Hou et al., 2019]. However, in general, these algorithms treat successive observations independently, meaning they are unable to leverage the temporal structure inherent to naturalistic signals.
In contrast, experiments using visual [Pascucci et al., 2023], auditory [Dyson, 2017] and multisensory stimuli [Lau and Maus, 2019, Kayser and Kayser, 2018] have demonstrated that our perception at any given moment is strongly influenced by our recent observations - a phenomenon termed serial dependence. For example, when observers are presented with a sequence of oriented Gabors and asked to report the orientation of each, their responses will be accurate on average across the experiment, but systematically biased by recent trials on a trial-by-trial basis [Fischer and Whitney, 2014]. This phenomenon has been framed as being both advantageous - as integrating information over time will improve the signal-to-noise ratio - and disadvantageous - as recent stimuli could render an instantaneous response sub-optimal [Kiyonaga et al., 2017]. However, much of this research has focused on trial-to-trial dependence, rather than moment-to-moment changes in dependence within a trial.
Here, we explore moment-to-moment multisensory integration in three steps. First, we introduce a novel multisensory task in which we provide the same number of task-relevant stimuli per trial (in a background of noise), but vary how this information is presented: from many short bursts to a few long sequences. Next, we demonstrate that prior multisensory algorithms perform sub-optimally on this task, though they perform better by simply considering short temporal windows. Finally, we explore the more naturalistic case, in which information is structured at multiple timescales.
3 Results
3.1 Multisensory algorithms are blind to temporal structure
We previously introduced a family of multisensory tasks in which observers must track prey using sequences of multisensory signals over time (Fig. 1A) [Ghosh et al., 2024]. In these tasks, prey either hide or emit signals, which provide clues about their direction of motion, at every time step, and observers must estimate their direction of motion (e.g. left or right). However, like many prior multisensory tasks, each time step is independent, meaning that consecutive sequences of signals are no more informative than those same signals spaced out over time.
Here, we began by creating a new multisensory task with two key properties:
Prey emit bursts of cues which are informative of their direction of motion. The length of these bursts is set by a parameter we term k. Low k generates short bursts, while high k generates long sequences (Fig. 1B). When k = 1 there is no serial dependence and each time step is independent.
As k increases we decrease the number of bursts, such that the total number of time steps where the signal is present, or signal sparsity, is constant across trials (on average). For example, for a trial of a given length and signal sparsity, we would provide n bursts for a burst length of 4, 2n bursts for a burst length of 2 and 4n bursts for a burst length of 1.
Together, these properties generate trials in which the total duration of the signal is constant, but how this information is presented varies: from many short bursts to a few long sequences. Notably, as the time steps in this task are not independent, calculating optimal performance, in the Bayesian sense, is computationally intractable [Ma et al., 2023] and so we cannot easily compute an upper bound on performance.
We then varied the burst length k from 1 to 8, and trained and tested two multisensory algorithms: linear and nonlinear fusion (LF, NLF; Fig. 1C-D). Both algorithms achieved good test accuracy (Fig. 2A), and consistent with prior work [Ghosh et al., 2024], NLF outperformed LF. At lower levels of signal sparsity, this difference would increase substantially [Ghosh et al., 2024], though here our focus is on temporal structure, not linear vs nonlinear fusion. Notably, both algorithms’ accuracies decrease slightly as a function of k, despite the additional temporal information available and the consistent signal sparsity. The reason for this is that, while the average number of signal events remains constant across k, the distribution of the number of signal events varies, and accuracy depends on both the mean and the distribution.
Overall, these results illustrate the intuitive point that, as often implemented, linear and nonlinear fusion across channels cannot leverage temporal structure.
3.2 Incorporating time into multisensory algorithms
To capture temporal information, we adapted the nonlinear fusion algorithm to process sliding input windows of length w; a family of models we term NLFw (Fig. 1E). As the number of parameters in these models scales unfavourably (9^w), we focus on NLF2 and NLF3. To capture dependencies over longer timescales, we also trained recurrent neural network models (RNNs). Our models have the following numbers of learnable parameters: LF - 6, NLF - 9, NLF2 - 81, NLF3 - 729, RNN - 10,903 (Table 1).
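The parameter scaling can be checked directly: each of the Nc = 2 channels takes one of three values {−1, 0, 1} at each of the w time steps in a window, giving 3^(Nc·w) distinct window patterns, and hence one learnable weight per pattern. A quick sketch (the two-channel count is taken from the LF/NLF parameter counts above):

```python
def nlf_table_size(w, n_channels=2):
    # Each of n_channels channels takes a value in {-1, 0, 1} at each of
    # w time steps, so there are 3**(n_channels * w) joint window patterns.
    return 3 ** (n_channels * w)

for w in (1, 2, 3):
    print(w, nlf_table_size(w))  # 1 -> 9, 2 -> 81, 3 -> 729
```

With two channels this reduces to 9^w, matching the NLF, NLF2 and NLF3 counts listed above.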
3.2.1 In distribution
We first considered how well these models performed in distribution; that is, when each model is trained and tested on bursts of the same length (k): train on k = 2, test on k = 2, etc. (Fig. 2A).
When k = 1, NLF2/3 and the RNN models all perform equivalently to NLF, as each time step is independent and there is no additional temporal structure to detect. However, as k increases these models outperform LF and NLF, as they are able to leverage the additional temporal information available. Though, how each model’s accuracy varies with k differs: NLF2/3 improve and then plateau, while RNN performance is flat and then rises. As such, while the RNNs excel at detecting longer sequences, the simpler NLFw models are better at detecting shorter bursts and surprisingly good at longer sequences, even when those sequences are longer than their window length (or memory).
Together, these results demonstrate the benefits of fusing information, not only over channels, but also time. However, in naturalistic settings, predators must perform well not only in response to motion patterns they have experienced, but also, in novel, unseen situations.
3.2.2 Generalisation
We next considered how well these models generalise; that is, how well they perform when they are trained on one burst length (k) and tested on another (Fig. 2B-D).
When fit on k = 1 and tested on k > 1, all three models’ accuracies decrease slightly as a function of k (dark blue lines, Fig. 2B1-D1). This reflects the fact that, while these models have the capacity to detect sequences, there is no benefit to learning to do so when your training data has no temporal structure (k = 1). As such, they perform equivalently to NLF (Fig. 2A). Similarly, when NLF3 is trained on k = 2 it only performs as well as NLF2, and perhaps even slightly worse (Fig. 2C1), as there is no benefit to learning to detect longer sequences.
In contrast, when these models learn from longer sequences - beyond w in the case of NLF2/3 - all three models generalise reasonably well; the maximum difference in accuracy we observe is less than 8% (Fig. 2B2-D2). Though, again, we observe a notable difference between the NLFw and RNN models. Specifically, both NLF models generalise better when tested on longer sequences than they were trained on, and less well on shorter sequences. In contrast, the RNN generalises better to shorter rather than longer sequences (Fig. 2B2-D2).
Overall, these results demonstrate that all three models generalise well when tested out of distribution. However, these scenarios are still unrealistic, in the sense that the prey always emit bursts of similar length.
3.3 Capturing multi-timescale structure
To add further realism to our task, we next considered a variant in which, within each trial, prey emit bursts of varied length, drawn from either a uniform or a Lévy distribution. We chose the latter distribution as it describes animal behaviours such as foraging [Viswanathan et al., 1999], which are composed of many short bursts (local exploitation) interspersed with occasional long flights (exploration).
We found that all three models (NLF2, NLF3 and RNNs) performed well when trained and tested on burst lengths drawn from either uniform or Lévy distributions. Furthermore, models trained on our fixed length tasks generalised well to these mixed length tasks (Fig. 3A-C).
However, NLFw models trained on mixed distributions tended to generalise better than those trained on fixed distributions (Fig. 3A-B), though the same was not true for the RNNs (Fig. 3C). For example, at k = 8, the NLFw models trained on mixed distributions outperformed those trained on fixed distributions (Fig. 3A-B), while the RNNs trained on fixed distributions outperformed their counterparts trained on mixed distributions (Fig. 3C). This suggests that learning to detect a mixture of short bursts, in the case of NLF2/3, yields a parsimonious strategy that generalises well to longer sequences, while, with more resources (i.e. parameters), the RNNs can learn more specialised strategies (for each value of k).
Together, these results demonstrate that in a more complex situation, where prey emit signals in multiple channels and at multiple timescales, all three models perform reasonably well. However, across settings (testing in and out of distribution, on both fixed and mixed length signals), NLF3 often performs best or close to best (Fig. 4), underscoring the benefit of fusing information not only over channels, but also over short temporal windows.
4 Discussion
To date, many psychophysical tasks have been used to explore how animals combine information across their senses. In classical multisensory tasks, each sensory modality (or channel) provides evidence about an underlying target independently [Fetsch et al., 2012, Drugowitsch et al., 2014, Hou et al., 2019, Coen et al., 2023, Farahmandi et al., 2024]. As such, the optimal solution to these tasks is to linearly fuse (or integrate) evidence across channels. Though, in the case when channels are co-dependent the optimal solution is to nonlinearly fuse evidence across channels [Ghosh et al., 2024]. This case seems likely to arise in natural conditions; consider the relation between lip movements and sounds, for example. However, in both of these cases evidence should be fused linearly across time. This is because within these tasks, each time step is independent, meaning the time points within each trial could be shuffled without changing the results.
Here, we explored another scenario that seems likely to arise in natural conditions, where the evidence at any given moment depends on recent moments: like the bursts of sensory signals prey darting from cover to cover would emit. To do so, we adapted a task from [Ghosh et al., 2024] to make the target (prey) emit bursts of varied length; thereby introducing a sequential time dependence. From the point of view of estimating the target value (the predator’s perspective), this means that if you have recently seen evidence suggesting that the target is currently emitting a signal, you may wish to increase the relative weighting of the current evidence.
This intuition suggested two simple models which could perform this computation, and plausibly be implemented by neural circuits. The first combines information across channels and short temporal windows of a fixed length; a family of models we term nonlinear fusion over w time steps, or NLFw. The second are recurrent neural networks (RNNs) which combine the current evidence from each channel with their prior hidden states, and so can capture structure at multiple timescales. Notably, the Bayes optimal solution to this problem involves a combinatorial computation over all time steps [Ma et al., 2023], which is computationally intractable and unlikely to be implemented by neural circuits.
As expected, both models (NLFw and RNNs) were able to leverage temporal structure, and outperform time-independent algorithms (linear and nonlinear fusion). Though, across settings (in and out of distribution, and on both fixed and mixed length signals), NLF2/3 performed well, and were surprisingly comparable to the RNNs despite having less than one tenth the number of parameters. This suggests that in these tasks, short bursts are sufficiently informative of the target’s motion and there is little benefit to detecting longer sequences. In practice, this is akin to change detection, and would also allow for faster reaction times than waiting to observe longer sequences.
So, which is a more plausible model of how neural circuits integrate information over channels and time? NLFw could be implemented by a range of simple mechanisms. For example, a multisensory neuron receiving w inputs per channel, each offset with a different temporal delay would fuse information across channels and time. In contrast, the RNN model requires a population of densely connected units, and hence a higher energetic cost. For the tasks we explore here, this cost does not seem merited by the increase in performance. However, in tasks with even more detailed temporal structure, for example, complex multistep, multisensory sequences, it seems likely that a recurrent network would outperform the simpler model.
In conclusion, our results demonstrate the benefits of combining multisensory information across short temporal windows (NLFw) or prior states (RNNs), either of which could easily be implemented by neural circuits. More broadly, our results underscore the importance of exploring more complex multisensory tasks, and highlight the fact that, despite their apparent complexity, these tasks are often solvable with simple, biologically plausible extensions to existing models.
5 Methods
In short, we build on the detection task from Ghosh et al. [2024] - in which observers must estimate the heading direction of a target from sequences of information in n channels, which represent different sensory modalities or independent sources of information from the same modality. In this task, at some (unknown) time points, the target emits signals and the sensory observations provide information about the target direction, while at other times, the sensors receive only noise. In Ghosh et al. [2024] we showed that this task structure requires nonlinear fusion across channels to solve optimally, but since each time step is independent, only linear fusion across time steps was needed. Here we introduce limited temporal dependency in the signals by having signals switch on at unknown times but then remain on for a number of steps (where this number is either a constant or chosen from a random distribution). It is not feasible to compute the optimal classifier in this case, and so we investigated two classifiers with different types of short-term memory: one based on a linear sum of nonlinear functions operating on a window that slides across all time steps, and one based on a recurrent neural network. Below, we detail these tasks and inference models.
5.1 Tasks
We start by sampling a random direction M = ±1 with equal probability. The task is to estimate M .
We define a sequence of binary-valued random variables Et to indicate whether the target is emitting a signal at time t (where t = 1, …, n). When Et = 1 the distribution of the observations Xit in channel i follows a signal distribution (giving the correct value of M with probability pc, the incorrect value with probability pi and the neutral value (0) with probability 1 − pc − pi). When Et = 0, Xit follows a noise distribution, taking the correct/incorrect values of M with equal probability (1 − pn)/2 and the neutral value with probability pn.
To generate the sequence Et, we sample a generating sequence Gt of signal start times, so that whenever Gt = 1 the signal Et = 1 and will stay 1 for some period Lt (which can either be a constant or be drawn from a distribution). Finally, to ensure that the task is solvable, we filter out all cases where Et = 0 for all t. We set the probability that Gt = 1 so that the fraction of the time that Et = 1 is equal to a value pe that we choose.
5.1.1 General task structure
In more detail, the task variables are related by a graphical model with the following components:
M ∈ {−1, 1} is the target direction, with each of the two values being equally likely.
Gt ∈ {0, 1} are the start times for emission periods, taking value 1 with probability pg to indicate the start of an emission period.
Lt ∈ {1, … , Lmax} is the length of the emission period starting at t, and follows a different distribution for different tasks.
Et ∈ {0, 1} is whether or not the target is emitting a signal at time t, and has a deterministic dependence on Gt and Lt.
Xit ∈ {−1, 0, 1} is the signal received in channel i ∈ {1, …, Nc} at time t. Its distribution depends on M and Et as described below.
The Gt are independent variables except that if E1 = E2 = …= En = 0 then values are resampled, introducing a small time dependence that we ignore in the analysis.
The distribution of Xit depends on whether or not there is a signal being emitted. If not, it follows a noise distribution that is independent of M:

P(Xit = 1) = P(Xit = −1) = (1 − pn)/2,  P(Xit = 0) = pn.

If a signal is being emitted, then it depends on M as follows:

P(Xit = M) = pc,  P(Xit = −M) = pi,  P(Xit = 0) = 1 − pc − pi.
5.1.2 Task variants
In the time-dependent detection task with on-time k (Detk), we have Lt = k for all t. In the Lévy flight task, Lt is instead drawn from a Lévy distribution over {1, …, Lmax}.
5.2 Normalising for signal sparsity
In order to make performance comparable between the different task variants, we normalise certain parameters so that the average amount of useful information (ignoring time) is the same across tasks. Specifically, we choose pg so that the expected fraction of the time that a signal is being emitted 𝔼 [Et] is equal to a value pe that we choose. We outline the calculation of this normalisation for the Detk and Lévy flight tasks below.
5.2.1 Detk
In this task, we sample values Gt ∈ {0, 1} with P(Gt = 1) = pg independently. We compute Et = max_{t−k<s≤t} Gs and then resample if all E1 = … = En = 0. Note that to compute E1 we need to have a value for G1−k+1 (and onwards to Gn). We start by computing P(Et = 1) without the resampling procedure, and then compute the effect of resampling.
Without resampling,

P(Et = 0) = P(Gt−k+1 = 0, …, Gt = 0).

Each of these values Gt−k+1 to Gt are independent, and therefore we can write

P(Et = 0) = (1 − pg)^k.

We can then compute

P(Et = 1) = 1 − (1 − pg)^k.
Now write F for the event that E1 = … = En = 0 that we filter out. We want to compute the fraction of the time Et = 1 when this event does not take place, P(Et = 1 | ¬F). To calculate this, we note that either F does or does not take place, so computing the total probability over both these two possibilities we get

P(Et = 1) = P(Et = 1 | F) P(F) + P(Et = 1 | ¬F)(1 − P(F)).

We computed P(Et = 1) above and by the definition of F we know that P(Et = 1 | F) = 0, and therefore we get that

P(Et = 1 | ¬F) = P(Et = 1) / (1 − P(F)).

It remains to calculate

P(F) = P(G1−k+1 = … = Gn = 0) = (1 − pg)^(n+k−1).

This gives us

pe = P(Et = 1 | ¬F) = (1 − (1 − pg)^k) / (1 − (1 − pg)^(n+k−1)).
We can then numerically invert this to get the correct value for pg given a desired value pe. Note that by taking limits as pg → 0, the smallest pe achievable is k/(n + k − 1).
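Since pe is increasing in pg, the inversion can be sketched with simple bisection (function names are ours, not the repository's; pe must exceed the k/(n + k − 1) floor noted above):

```python
def pe_from_pg(pg, k, n):
    # Fraction of time the signal is on, conditioned on at least one burst:
    # p_e = (1 - (1 - p_g)^k) / (1 - (1 - p_g)^(n + k - 1))
    return (1 - (1 - pg) ** k) / (1 - (1 - pg) ** (n + k - 1))

def pg_from_pe(pe, k, n, tol=1e-12):
    # p_e is monotonically increasing in p_g, so invert by bisection on (0, 1).
    lo, hi = tol, 1.0 - tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pe_from_pg(mid, k, n) < pe:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```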
5.2.2 Lévy flight
The calculation for the Lévy flight task is similar to the Detk task but a little more involved.
To simplify notation later, write qℓ = P(Lt = ℓ) and Qs = P(Lt ≤ s) = Σ_{ℓ≤s} qℓ (so that Q0 = 0).
To compute P(Et = 0) we need to consider various possibilities depending on the on-lengths Ls for s ≤ t. For Et = 0, the first thing that needs to be true is that Gt = 0 (because if Gt = 1 then automatically Et = 1). The next condition we need to be true is that either Gt−1 = 0, or Gt−1 = 1 but Lt−1 = 1, meaning that the on-period generated by Gt−1 = 1 was only of length 1 and therefore did not cause Et = 1. And so on, until we have gone back to the maximum number of previous steps Lmax that could cause Et = 1. Each of these events depends on Gs and Ls for a different value of s, and they are therefore independent, so

P(Et = 0) = ∏_{s=0}^{Lmax−1} (1 − pg + pg Qs).
This gives us P(Et = 1) = 1 − P(Et = 0) without filtering, and to compute P(Et = 1 | ¬F) we need to compute P(F). We break this event down into independent events Fℓ defined as Gt = 0 for t = 1 − ℓ + 1, …, n whenever Lt = ℓ. In other words, there can be no length-1 on-periods that cause an Et = 1 (event F1), no length-2 on-periods (event F2) and so on. These are each independent events and therefore

P(F) = ∏_{ℓ=1}^{Lmax} P(Fℓ) = ∏_{ℓ=1}^{Lmax} (1 − pg qℓ)^(n+ℓ−1),

where qℓ = P(Lt = ℓ). Putting it together,

pe = P(Et = 1 | ¬F) = (1 − ∏_{s=0}^{Lmax−1} (1 − pg + pg Qs)) / (1 − P(F)),

where Qs = P(Lt ≤ s).
As before, this can be inverted numerically to compute the pg value that gives a desired pe value.
5.3 Inference
The task is to estimate M from Xit with unknown hidden variables Gt, Et and Lt. It is straightforward to write down the optimal maximum a posteriori (MAP) estimator using Bayes’ theorem. We simply want to choose m that maximises P(M = m | X) = P(X | M = m)P(M = m)/P(X), and given P(M = m) = 1/2 and P(X) doesn’t depend on m, we just need to maximise P(X | M = m). We can marginalise over the hidden variables to get

P(X | M = m) = Σ_{G,L} P(X | M = m, E(G, L)) P(G) P(L),

where E(G, L) is the emission sequence determined by G and L. However, given that the time steps are not independent, and the sum is over (2 + Lmax)^n terms, it is hard to simplify this any further in a way that could be easily computable by the brain.
Instead, we consider two types of estimators with different types of short-term memory. The first type, w-step nonlinear fusion (or NLFw), computes, for each time step, a likelihood based on the previous w time steps (which can be implemented as a table look-up over all the possible observations Xit over w time steps), and then sums this over all time steps. In more detail, we enumerate every possible Xit for t = 1, …, w (of which there are 3^(Nc·w) possibilities). For every trial, we then count how often each of these possibilities occurs for every window t = [1 … w], [2 … w + 1], …, [n − w + 1 … n], and make these counts into a feature vector F. We then train a logistic regression classifier to estimate M from F using scikit-learn [Pedregosa et al., 2011]. The parameters of the linear part of the model can be interpreted as the log-likelihoods of each possible observation in a window, and using counts is equivalent to summing these over all windows.
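The window-counting step can be sketched as follows (a simplified NumPy version of the feature construction; the function name and array layout are our assumptions, not the repository code):

```python
import numpy as np

def window_features(X, w):
    # X: observations, shape (n_channels, n_steps), entries in {-1, 0, 1}.
    # Each length-w window across all channels is encoded as a base-3
    # integer, and the feature vector counts how often each of the
    # 3**(n_channels * w) possible patterns occurs across the trial.
    n_channels, n_steps = X.shape
    counts = np.zeros(3 ** (n_channels * w))
    for t in range(n_steps - w + 1):
        idx = 0
        for d in (X[:, t:t + w] + 1).ravel():  # shift {-1,0,1} -> {0,1,2}
            idx = idx * 3 + int(d)
        counts[idx] += 1
    return counts
```

A logistic regression trained on these count vectors (e.g. scikit-learn's LogisticRegression) then recovers the summed per-window log-likelihood interpretation described above.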
The second type of estimator, RNN, uses a recurrent neural network with one hidden layer and 100 hidden units, fed at each time step with inputs Xit and its previous state ht−1, and the output of this network is treated as a likelihood. In more detail, write ht for the activity of the hidden layer at time step t, and ot for the activity of the output layer. Then

ht = f(Wih xt + Whh ht−1 + bh),  ot = g(Who ht + bo),

for nonlinear activation functions f, g, weight matrices Wih, Whh and Who, and biases bh and bo. The rectified linear unit function is defined as ReLU(x) = max(0, x).
The hidden state h0 is initialized to zero, and outputs are summed across time steps to produce the final prediction. The network is trained with a standard cross-entropy loss on the output layer using PyTorch [Paszke et al., 2019] and the Adam optimiser [Kingma and Ba, 2014] with a learning rate of 10^−6.
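The forward pass can be sketched in plain NumPy as follows (random, untrained weights purely to illustrate the shapes; here g is taken as the identity on logits, with the softmax absorbed into the cross-entropy loss, which is an assumption on our part):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def rnn_predict(X, Wih, Whh, Who, bh, bo):
    # Forward pass: h_t = relu(Wih x_t + Whh h_{t-1} + bh), o_t = Who h_t + bo,
    # with h_0 = 0; per-step outputs are summed to form the final prediction.
    h = np.zeros(Whh.shape[0])
    total = np.zeros_like(bo)
    for x in X:                      # X: (n_steps, n_inputs)
        h = relu(Wih @ x + Whh @ h + bh)
        total += Who @ h + bo
    return total

# Illustrative sizes: 2 input channels, 100 hidden units, 2 output classes.
rng = np.random.default_rng(0)
Wih = rng.normal(size=(100, 2)) * 0.1
Whh = rng.normal(size=(100, 100)) * 0.1
Who = rng.normal(size=(2, 100)) * 0.1
bh, bo = np.zeros(100), np.zeros(2)
logits = rnn_predict(rng.choice([-1, 0, 1], size=(50, 2)), Wih, Whh, Who, bh, bo)
```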
6 Supplementary information
6.1 Task code
The following Python code demonstrates how we generate trials for the tasks described in this study. The complete working code is provided in the repository.
6.1.1 Detection Task
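A minimal sketch of trial generation for the detection task (each time step independent), following the signal and noise distributions of Section 5.1; the function name and default parameter values are illustrative, and the all-noise resampling filter is omitted for brevity:

```python
import numpy as np

def detection_trial(n_steps=100, n_channels=2, pe=0.1,
                    pc=0.6, pi=0.1, pn=0.8, rng=None):
    # One trial: each time step emits a signal independently with prob. pe.
    rng = np.random.default_rng() if rng is None else rng
    M = rng.choice([-1, 1])                  # target direction, equally likely
    E = rng.random(n_steps) < pe             # independent emission times
    X = np.zeros((n_channels, n_steps), dtype=int)
    for t in range(n_steps):
        if E[t]:
            # Signal distribution: correct pc, incorrect pi, neutral otherwise.
            X[:, t] = rng.choice([M, -M, 0], size=n_channels,
                                 p=[pc, pi, 1 - pc - pi])
        else:
            # Noise distribution: +/-1 with prob (1 - pn)/2 each, 0 with prob pn.
            X[:, t] = rng.choice([1, -1, 0], size=n_channels,
                                 p=[(1 - pn) / 2, (1 - pn) / 2, pn])
    return M, E, X
```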
6.1.2 Time-dependent Detection Task
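A minimal sketch of the emission sequence for Detk (fixed on-time k), including the resampling step that filters out all-zero trials; names and defaults are illustrative, and observations are then sampled from the signal/noise distributions described in the Methods:

```python
import numpy as np

def emission_sequence(n_steps, k, pg, rng):
    # G covers times 1-k+1 ... n_steps so that E_1 is well defined;
    # E_t = 1 iff some G_s = 1 with t-k < s <= t (bursts of fixed length k).
    while True:
        G = rng.random(n_steps + k - 1) < pg
        E = np.array([G[t:t + k].any() for t in range(n_steps)])
        if E.any():          # resample trials with no signal at all
            return E
```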
6.1.3 Lévy flights
Generation of Lévy flights can be broken down into a few simple steps.
First, we define a Lévy distribution to determine the lengths of emission bursts.
Next, we create a sparse base emission sequence using a Bernoulli process.
We then apply the Lévy flight principle to extend each emission point according to lengths drawn from the Lévy distribution.
Finally, we generate observations based on this emission sequence, taking into account the target motion and the probabilities for correct, incorrect, and neutral observations.
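The steps above can be sketched as follows; the truncated power-law form and its exponent are illustrative assumptions standing in for the exact Lévy distribution used in the repository code, and observations are then drawn from the emission sequence as in the other tasks:

```python
import numpy as np

def levy_lengths(alpha, l_max):
    # Truncated power-law over burst lengths 1..l_max; alpha and l_max are
    # illustrative assumptions, not the paper's exact parameters.
    lengths = np.arange(1, l_max + 1)
    q = lengths ** -float(alpha)
    return lengths, q / q.sum()

def levy_emission_sequence(n_steps, pg, alpha=2.0, l_max=8, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    lengths, q = levy_lengths(alpha, l_max)
    # Sparse base sequence of burst starts (Bernoulli process) ...
    G = rng.random(n_steps) < pg
    E = np.zeros(n_steps, dtype=bool)
    # ... with each start extended by a length drawn from the distribution.
    for s in np.flatnonzero(G):
        L = rng.choice(lengths, p=q)
        E[s:s + L] = True
    return E
```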
7 Acknowledgements
SA is supported by the Landesgraduiertenförderung Abschlussstipendium, issued by the Graduate Funding of the Land of Baden-Württemberg (LGFG). MG is supported by the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Sciences program. We thank members of the Neural Reckoning group at Imperial for their helpful input.