## Abstract

The brain has a hugely diverse, heterogeneous structure. By contrast, many functional neural models are homogeneous. We compared the performance of spiking neural networks trained to carry out difficult tasks, with varying degrees of heterogeneity. Introducing heterogeneity in membrane and synapse time constants substantially improved task performance, and made learning more stable and robust across multiple training methods, particularly for tasks with a rich temporal structure. In addition, the distribution of time constants in the trained networks closely matches those observed experimentally. We suggest that the heterogeneity observed in the brain may be more than just the byproduct of noisy processes, but rather may serve an active and important role in allowing animals to learn in changing environments.

**Summary** Neural heterogeneity is metabolically efficient for learning, and optimal parameter distribution matches experimental data.

## Introduction

The brain is known to be deeply heterogeneous at all scales (Koch and Laurent 1999), but it is still not known whether this heterogeneity plays an important functional role or if it is just a byproduct of noisy developmental processes and contingent evolutionary history. A number of hypothetical roles have been suggested (reviewed in (Gjorgjieva, Drion, and Marder 2016)), in efficient coding (Shamir and Sompolinsky 2006; Chelaru and Dragoi 2008; Osborne et al. 2008; Marsat and Maler 2010; Padmanabhan and Urban 2010; Hunsberger, Scott, and Eliasmith 2014; Zeldenrust, Gutkin, and Denéve 2019), reliability (Lengler, Jug, and Steger 2013), working memory (Kilpatrick, Ermentrout, and Doiron 2013), and functional specialisation (Duarte and Morrison 2019). However, previous studies have largely used simplified tasks or networks, and it remains unknown whether or not heterogeneity can help animals solve complex information processing tasks in natural environments. Recent work has allowed us, for the first time, to train biologically realistic spiking neural networks to carry out these tasks at a high level of performance, using methods derived from machine learning. We used two different learning models (Nicola and Clopath 2017; Neftci, Mostafa, and Zenke 2019) to investigate the effect of introducing heterogeneity in the time scales of neurons when performing tasks with realistic and complex temporal structure. We found that it improves overall performance, makes learning more stable and robust, and learns neural parameter distributions that match experimental observations, suggesting that the heterogeneity observed in the brain may be a vital component to its ability to adapt to new environments.

## Results

### Time scale heterogeneity improves learning on tasks with rich temporal structure

We investigated the role of neural heterogeneity in task performance by training recurrent spiking neural networks to classify visual and auditory stimuli with varying degrees of temporal structure. The model used three layers of spiking neurons: an input layer, a recurrently connected layer, and a readout layer used to generate predictions (Figure 1A), a widely used minimal architecture (e.g. Maass, Natschläger, and Markram 2002; Neftci, Mostafa, and Zenke 2019). Heterogeneity was introduced by giving each neuron an individual membrane and synaptic time constant. We compared four different conditions: initial values could be either homogeneous or heterogeneous, and training could be either standard or heterogeneous (Figure 1B). Time constants were either initialised with a single value (homogeneous initialisation), or randomly according to a gamma distribution (heterogeneous). The models were trained using surrogate gradient descent (Neftci, Mostafa, and Zenke 2019). Synaptic weights were always plastic, while time constants were either held fixed at their initial values in the standard training regime, or could be modified in the heterogeneous training regime.

We used five different datasets with varying degrees of temporal structure. Neuromorphic MNIST (N-MNIST; Orchard et al. 2015), Fashion-MNIST (F-MNIST; Xiao, Rasul, and Vollgraf 2017, and the DVS128 Gesture dataset (Amir et al. 2017) feature visual stimuli, while the Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC) datasets (Cramer et al. 2020) are auditory. N-MNIST and DVS128 use a neuromorphic vision sensor to generate spiking activity, by moving the sensor with a static visual image of handwritten digits (N-MNIST) or by recording humans making hand gestures (DVS128). F-MNIST is a dataset of static images that is widely used in machine learning, which we converted into spike times by treating the image intensities as input currents to model neurons, so that higher intensity pixels would lead to earlier spikes, and lower intensity to later spikes. Both SHD and SSC use a detailed model of the activity of bushy cells in the cochlear nucleus, in response to spoken digits (SHD) or commands (SSC). Of these datasets, N-MNIST and F-MNIST have minimal temporal structure, as they are generated from static images. DVS128 has some temporal structure as it is recorded motion, but it is possible to perform well at this task by discarding the temporal information. The auditory tasks SHD and SSC by contrast have very rich temporal structure.

We found that heterogeneity in time constants had a profound impact on performance on those training datasets where information was encoded in the precise timing of input spikes (Table 1, Figure 2A). On the most temporally complex auditory tasks, accuracy improved by a factor of around 15-20%, while for the least temporally complex task N-MNIST we saw no improvement at all. For the gesture dataset DVS128 we can identify the source of the (intermediate) improvement as the heterogeneous models being better able to distinguish between spatially similar but temporally different gestures, such as clockwise and anticlockwise versions of the same gesture (Figure 2B-C). This suggests that we might see greater improvements for a richer version of this dataset in which temporal structure was more important.

We verified that our results were due to heterogeneity and not simply to a better tuning of time constants in two ways. Firstly, we performed a grid search across all homogeneous time constants and used the best values for our comparison. Secondly, we observe that the distribution of time constants after training is very similar and heterogeneous regardless of whether it was initialised with a homogeneous or heterogeneous distribution (Figure 3A), indicating that the heterogeneous distribution is optimal.

Introducing heterogeneity allows for a large increase in performance at the cost of only a very small increase in the number of parameters (0.23% for SHD, because the vast majority of parameters are synaptic weights), and without using any additional neurons or synapses. Heterogeneity is therefore a metabolically efficient strategy. It is also a computationally efficient strategy of interest to neuromorphic computing, because adding heterogeneity adds *O*(*n*) to memory use and computation time, while adding more neurons adds *O*(*n*^{2}). Further, in some neuromorphic systems like BrainScaleS this heterogeneity is already present as part of the manufacturing process (Schmitt et al. 2017).

Note that it is possible to obtain better performance using a larger number of neurons. For example, Neftci, Mostafa, and Zenke (2019) obtained a performance of 83.2% on the SHD dataset without heterogeneity using 1024 neurons and data augmentation techniques, whereas we obtained 82.7% using 128 neurons and no data augmentation. We focus on smaller networks here for two reasons. Firstly, we wanted to systematically investigate the effect of different training regimes, and current limitations of surrogate gradient descent mean that each training session takes several days. Secondly, with larger numbers of neurons, performance even without heterogeneity approaches the ceiling on these tasks (which are still simple in comparison to those faced by animals in real environments), making it more difficult to see the effect of different architectures. Even with this limitation to small networks, heterogeneity confers such an advantage that our results for the SSC dataset are state of the art (for spiking neural networks) by a large margin.

We also tested the effect of introducing heterogeneity of other neuron parameters, such as the firing threshold and reset potential, but found that it made no appreciable difference. This was because for our model, changing these is almost equivalent to a simple scaling of the membrane potential variable. By contrast, Bellec et al. (2019) found that introducing an adaptive threshold did improve performance, presumably because it allows for much richer temporal dynamics.

### Predicted time constant distributions match experimental data

In all our tasks, the distribution of time constants after training approximately but not exactly fit a log normal or gamma distribution (with different parameters for each task), and are consistent across different training runs (Supplementary Materials Figure 5), suggesting that the learned distributions may be optimal. Using publicly available datasets including time constants recorded in large numbers of neurons in different animals and brain regions (P. B. Manis, Kasten, and Xie 2019; P. Manis, Kasten, and Xie 2019; Lein et al. 2007; Hawrylycz et al. 2012), we found very similar distributions to those we predicted (Figure 3B). The parameters for these distributions are different for each animal and region, just as for different tasks in our simulations. Interestingly, the distribution parameters are also different for each cell type in the experimental data, a feature not replicated in our simulations as all cells are identical. This suggests that introducing further diversity in terms of different cell types may lead to even better performance.

### Heterogeneity improves speech learning across time scales

Sensory signals such as speech and motion can be recognised across a range of speeds. We tested the role of heterogeneity in learning a circuit that can function across a wide range of speeds. We augmented the SHD spoken digits datasets to include faster or slower versions of the samples, multiplying all spike times by a temporal scale as an extremely simplified model that captures a part of the difficulty of this task (Figure 3C). During training, temporal scales were randomly selected from a distribution roughly matching human syllabic rate distributions (Lerner et al. 2014). The heterogeneous configurations performed as well or better at all time scales (Figure 3D), and in particular were able to generalise better to time scales outside the training distribution (e.g. accuracy of 47% close to a time scale of 4, around 7% higher than the homogeneous network, where chance performance would be 5%). Heterogeneous initialisation alone was sufficient to achieve this better generalisation performance, while fine tuning the distribution of time constants with heterogeneous training improved the peak performance but gave no additional ability to generalise.

### Heterogeneity improves robustness against mistuned learning

We tested the hypothesis that heterogeneity can provide robustness with two experiments where the *hyperparameters* were mistuned, that is where the initial distributions and learning parameters were chosen to give the best performance for one distribution, but the actual training and testing is done on a different distribution.

In the first experiment, we took the augmented SHD spoken digits dataset from the previous section, selected the hyperparameters to give the best performance at a time scale of 1, but then trained and tested the network at a different time scale (Figure 3E). With standard training (of weights only), hyperparameter mistuning leads to decreased performance, particularly for high time scales. This makes sense because at higher time scales, the degree of heterogeneity of the time constants will be smaller relative to the time scales of the stimuli, and therefore it will operate increasingly like a network with homogeneous initialisation. However, with heterogeneous training, performance is flat and therefore robust to hyperparameter mistuning.

In the second experiment, we switched to a very different learning paradigm, FORCE training of spiking neural networks (Nicola and Clopath 2017) to replay a learned signal, in this case a recording of a zebra finch call (Figure 4A; from Blättler and Hahnloser 2011). This method does not allow for heterogeneous training (tuning time constants), so we only tested the role of untrained heterogeneous neuron parameters. We tested three configurations: fully homogeneous (single 20ms time constant as in the original paper); intermediate (each neuron randomly assigned a fixed fast 20ms or slow 100ms time constant); or fully heterogeneous (each neuron randomly assigned a time constant drawn from a gamma distribution).

Nicola and Clopath (2017) showed that network performance is highly dependent on two hyperparameters (*G* and *Q* in their paper). We therefore tuned these hyperparameters for a network of a fixed size (*N* = 1000 neurons) and ran the training and testing for networks of different sizes (Figure 4B). As the network size started to diverge, the homogeneous network began to make large errors, while the fully heterogeneous network was able to give low errors for all network sizes. The intermediate network was able to function well across a wider range of network sizes than the fully homogeneous network, but still eventually failed for the largest network sizes. At these large network sizes, the homogeneous network becomes saturated, leading to poor performance (Figure 4C). The robustness of the heterogeneous version of the network can be measured by the area of the hyperparameter space that leads to good performance (Figure 4D). Adding partial or full heterogeneity leads to an improvement in learning for all points in the hyperparameter space. again suggesting that it can improve robustness of learning in a wide range of situations.

## Discussion

We trained spiking neural networks at difficult classification tasks, either forcing all time constants to be the same (homogeneous) or allowing them to be different (heterogeneous). We found that introducing heterogeneity improved overall performance across a range of tasks and training methods, but particularly so on tasks with richer intrinsic temporal structure. Learning was more robust, in that the networks were able to learn across a range of different environments, and when the hyperparameters of learning were mistuned. When the learning rule was allowed to tune the time constants as well as synaptic weights, a consistent distribution of time constants was found, akin to a log normal or gamma distribution, and this qualitatively matched time constants measured in experimental data. Note that we do not claim that the nervous system tunes time constants during its lifetime to optimise task performance, we only use our methods to find the optimal distribution of time constants.

We conclude from this that neural heterogeneity is a metabolically efficient strategy for the brain. Heterogeneous networks have no additional cost in terms of the number of neurons or synapses, and perform as well as homogeneous networks with an order of magnitude more neurons. This gain also extends to neuromorphic computing systems, as adding heterogeneity to the neuron model adds an additional time and memory cost of only *O*(*n*), while adding more neurons has a cost of *O*(*n*^{2}). In addition to their overall performance being better, heterogeneous networks are more robust and able to learn across a wider range of environments, which is clearly ethologically advantageous. Again, this has a corresponding benefit to neuromorphic computing and potentially machine learning more generally, in that it reduces the cost of hyperparameter tuning, which is often one of the largest costs for developing these models.

The question remains as to the extent of time constant tuning in real nervous systems. It could be the case that the heterogeneous distribution of time constants observed in different animals and brain regions (Figure 3B) is simply a product of noisy developmental processes. This may be so, but our results show that these distributions closely match the optimal ones found by simulation which confer a substantial computational advantage, and it therefore seems likely that the brain makes use of this advantage. We found that any degree of heterogeneity improves performance, but that the best performance could be found by tuning the distribution of time constants to match the task. Without a more detailed model of these specific brain regions and the tasks they solve, it is difficult to conclude whether or not the precise distributions observed are tuned to those tasks or not, and indeed having a less precisely tuned distribution may lead to greater robustness in uncertain environments.

A number of studies have used heterogeneous or tunable time constants (Fang et al. 2020; Quax, D’Asaro, and Gerven 2020; Yin, Corradi, and Bohté 2020), but these have generally been focussed on maximising performance for neuromorphic applications, and not considering the potential role in real nervous systems. In particular, we have shown that: heterogeneity is particularly important for the type of temporally complex tasks faced in real environments, as compared to the static ones often considered in machine learning; heterogeneity confers robustness allowing for learning in a wide range of environments; optimal distributions of time constants are consistent across training runs and match experimental data; and that our results are not specific to a particular task or training method.

The methods used here are very computationally demanding, and this has limited us to investigating very small networks (hundreds of neurons). Indeed, we estimate that in the preparation of this paper we used approximately 2 years of GPU computing. Finding new algorithms to allow us to scale these methods to larger networks will be a critical task for the field.

Beyond this, it would be interesting to see to what extent different forms of heterogeneity confer other advantages, such as spatial heterogeneity as well as temporal. We observed that in the brain, different cell types have different stereotyped distributions of time constants, and it would be interesting to extend our methods to networks with multiple cell types, including more biophysically detailed cell models.

Our computational results show a compelling advantage for heterogeneity, and this makes intuitive sense. Having heterogeneous time constants in a layer allows the network to integrate incoming spikes at different time scales, corresponding to shorter or longer memory trace, thus allowing the readout layer to capture information at several scales and represent a richer set of functions. It would be very valuable to extend this line of thought and find a rigorous theoretical explanation of the advantage of heterogeneity.

## Funding

The support of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems (HiPEDS, Grant Reference EP/L016796/1) and the Imperial College President’s PhD Scholarship is gratefully acknowledged.

## Author contributions

N.P.-N.: Conceptualisation, Methodology, Software, Validation, Visualisation, Writing - Original Draft; V.C.H.L.: Methodology, Software, Validation, Visualisation, Writing - Review & Editing; P.L.D.: Writing - Review & Editing, Supervision, Project administration; D.F.M.G.: Conceptualisation, Resources, Data Curation, Writing - Review & Editing, Supervision, Project administration.

## Data and materials availability

The code for this work and links to the different datasets used is available at `https://github.com/npvoid/neural_heterogeneity`.

## List of Supplementary Materials

Materials and Methods

Table S1

Figures S2-S5

## Supplementary materials

## Materials and Methods

### Neuron and synaptic models

We use the Leaky Integrate and Fire (LIF) neuron model in all our simulations. In this model the membrane potential of the *i*-th neuron in the *l*-th layer varies over time following (1).

Here, *τ _{m}* is the membrane time constant,

*U*

_{0}is the resting potential and is the input current. When the membrane potential reaches the threshold value

*U*a spike is emitted,

_{th}*U*(

_{i}*t*) resets to the reset potential

*U*and then enters a refractory period that lasts

_{r}*t*seconds where the neuron cannot spike.

_{ref}Spikes emitted by the *j*-th neuron in layer *l*−1 at a finite set of times can be formalised as a spike train defined as in (2)

The input current is obtained from the spike trains of all presynaptic neurons *j* connected to neuron *i* following (3)

Here *τ _{s}* is the synaptic time constant, is the feed-forward synaptic weight from neuron

*j*in layer

*l*−1 to neuron

*i*in layer

*l*and is the recurrent weight from neuron

*j*in layer

*l*to neuron

*i*in layer

*l*.

Thus, a LIF neuron is fully defined by six parameters *τ _{m}, τ_{s}, U_{th}, U*

_{0},

*U*plus its synaptic weights and . We refer to these as the

_{r}, t_{ref}**neuron parameters**and

**weights**respectively.

Since we are considering the cases where these parameters may be different for each neuron in the population we should actually refer to *τ _{m,i}*,

*τ*,

_{s,i}*U*,

_{th,i}*U*

_{0,i},

*U*,

_{r,i}*t*. However, for notational simplicity we will drop the

_{ref,i}*i*subscript and it will be assumed that these parameters can be different for each neuron in a population.

### Neural and synaptic model discretisation

In order to implement the LIF model in a computer it is necessary to discretise it. Assuming a very small simulation time step Δ*t*, (3) can be discretised to

With *α* = exp(−Δ*t*/*τ _{s}*). Similarly, (1) becomes

With *β* = exp(−Δ*t*/*τ _{m}*). Finally, the spiking mechanism

Notice how the last term in (5) introduces the membrane potential resetting. This would only work if we assume that the neuron potential at spiking time was exactly equal to *U _{th}*. This may not necessarily be the case since membrane potential update that crossed the threshold may result in and then the resetting mechanism will not set the membrane potential to

*U*. However, we found that this has a negligible effect in our simulations.

_{r}### Surrogate Gradient Descent Training

With the discretisation introduced in the previous section, a spiking layer consists of three cascaded sub-layers: current (4), membrane (5) and spike (6). The current and membrane sub-layers have access to its previous state and thus, they can be seen as a particular case of recurrent neural network (RNN). Note that while each neuron is a recurrent unit since it has access to its own previous state, different neurons in the same spiking layer will only be connected if any of the non-diagonal elements of *V*^{(l)} is non-zero. In other words, all SNNs built using this model are RNNs but not all SNNs are RSNNs.

We can cascade *L* spiking layers to conform a Deep Spiking Neural Network analogous to a conventional Deep Neural Network and train it using gradient descent. However, since equation (6) is non-differentiable, we need to modify the backwards pass as in (Neftci, Mostafa, and Zenke 2019) so that BPTT algorithm can be used to update the network parameters.

This means that while in the forward pass the network follows a step function as in (6), in the backwards pass it follows a sigmoidal function (7), with steepness set by *ρ*.

We can now use gradient descent to optimise the synaptic weights *W*^{(l)} and *V*^{(l)} as in conventional deep learning. We can also optimise the spiking neuron specific parameters *U _{th}, U*

_{0},

*U*since they can be seen as bias terms. The time constants can also be indirectly optimised by training

_{r}*α*and

*β*which can be seen as forgetting factors.

We apply a clipping function to *α* and *β* after every update.

In order to ensure stability, the forgetting factors have to be less than 1. Otherwise, the current and membrane potential would grow exponentially. Secondly to make the system causal these factors cannot be less than zero. This however, would allow for arbitrarily small time constants which would not have any meaning given a finite time resolution Δ*t*. Thus, we constrain the time constants to be at least 3Δ*t*. We also set clipping limits for *U _{th}, U*

_{0},

*U*such that they are always between the ranges specified in Table 2.

_{r}There are several ways in which the neuron parameters may be trained. One possibility is to make all neurons in a layer share the same neuron parameters. That is, a single value of *U _{th}, U*

_{0},

*U*is trained and shared by all neurons in a layer. Another possibility is to optimise each of these parameters in each neuron individually as we have done in our experiments. We also always trained the weight matrices

_{r}, α, β

*W*^{(l))}and

*V*^{(l))}. Training was done by using automatic differentiation on PyTorch ((Paszke et al. 2019)) and Adam optimiser with learning rate 10

^{−3}and betas (0.9, 0.999).

In all surrogate gradient descent experiments a single recurrent spiking layer with 128 neurons received all input spikes. This recurrent layer is followed by a feedforward readout layer with *U _{th}* set to infinity and with as many neurons as classes in the dataset.

For the loss function, we follow the *max-over-time* loss in (Cramer et al. 2020) to take the maximal membrane potential over the entire time in the readout layer. We then take these potentials and compute the cross-entropy loss
where *class* corresponds to the readout neuron index of the correct label for a given sample. The loss is computed as the average of *N _{batch}* training samples. This was repeated for a total of

*N*.

_{epochs}In order to improve generalisation we added noise to the input by adding spikes following a Poisson process with rate 1.2 Hz and deleting spikes with probability 0.001.

The parameters used for the network are given in Table 2 and Table 3 unless otherwise specified

All states *I _{i}*(

*l*)[0] and

*U*(

_{i}*l*)[0] are initialised to 0. For the weights

**and**

*W***, we independently sampled from a uniform distribution , with**

*V**k*being the number of afferent connections ((LeCun et al. 1998)).

### FORCE Training

The FORCE method is used to train a network consisting of a single recurrent layer of LIF neurons as in (Nicola and Clopath 2017). In this method, there are no feedforward weights and only the recurrent weights ** V** are trained. We can express these weights as

The first term in (10), namely *G v*

^{0}, remains static during training and it is initialised to set the network into chaotic spiking. The learned component of the weights

*∈ ℝ*

**ϕ**^{T}^{K×N}is updated using the Recursive Least Squares algorithm. The vector

**∈ ℝ**

*η*^{N×K}serves as a decoder and it is static during learning. The constants

*G*and

*Q*govern the ratio between chaotic and learned weights.

With this definition of ** V** we can write the currents into the neurons as the sum

**[**

*I**t*] =

*[*

**I**_{G}*t*] +

*[*

**I**_{Q}*t*] (we dropped the layer

*l*superscript since we only have a single layer) where we define

In order to stabilise the network dynamics we add a High Dimensional Temporal Signal (HDTS) as in (Nicola and Clopath 2017). This is an *M* dimensional periodic signal ** z**[

*t*]. Given the HDTS period

*T*, we split the interval [0,

*T*] into

*M*subintervals

*I*= 1,…,

_{m}, m*M*such that each of the components of

**[**

*z**t*] is given by

This signal is then projected onto the neurons leaving equation (4) as
where vector ** μ** ∈ ℝ

^{N×M}is just a decoder similar to

**in (10)**

*η*The aim of FORCE learning is to approximate a *K*-dimensional time varying teaching signal ** x**[

*t*]. The vector

**[**

*r**t*] is used to obtain an approximant of the desired signal

The weights are updated using the RLS learning rule according to:

During the training phase, the teaching signal ** x**(

*t*) is used to perform the RLS update. Then, during the testing phase, the teaching signal is removed.

We used the parameters given in Table 4 for all FORCE experiments unless otherwise specified.

The period *T* was chosen to be equal to length of the teaching signal ** x**[

*t*]. The membrane potential were randomly initialised following a uniform distribution . Vectors

**and**

*η***are randomly drawn from . The static weights**

*μ*

*v*^{0}are drawn from a normal distribution , then these weights are set to 0 with probability

*p*= 0.1. All other variables are initialised to zero unless otherwise specified.

### S1. Heterogeneity on other spiking neuron hyperparameters

### S2. Membrane time constant distribution consistency across trials

### S3. Synaptic time constant distribution consistency across trials

### S4. Confusion matrices DVS-gesture dataset

The classes correspond to

Hand clapping

Right hand wave

Left hand wave

Right arm clockwise

Right arm counter-clockwise

Left arm clockwise

Left arm counter-clockwise

Arm roll

Air drum

Air guitar

### S5. Grid search of parameters

## Acknowledgments

We are immensely grateful to the Allen Institute and Paul Manis for publicly sharing their databases that allowed us to estimate time constant distributions in the brain. Releasing this data is not only immensely generous and essential for this work, but more generally it accelerates the pace of science and represents an optimistic vision of the future.

## Footnotes

Minor tweaks before submission.