## ABSTRACT

Inspired by more detailed modeling of biological neurons, spiking neural networks (SNNs) have been investigated both as more biologically plausible and potentially more powerful models of neural computation, and with the aim of matching biological neurons’ energy efficiency; the performance of such networks has, however, remained lacking compared to classical artificial neural networks (ANNs). Here, we demonstrate how a novel surrogate gradient combined with recurrent networks of tunable and adaptive spiking neurons yields state-of-the-art performance for SNNs on challenging benchmarks in the time domain, such as speech and gesture recognition. This performance also exceeds that of standard classical recurrent neural networks (RNNs) and approaches that of the best modern ANNs. As these SNNs exhibit sparse spiking, we show that they are theoretically one to three orders of magnitude more computationally efficient than RNNs with comparable performance. Together, this positions SNNs as an attractive solution for AI hardware implementations.

## Introduction

The success of brain-inspired deep learning in AI is naturally focusing attention back onto those inspirations and abstractions from neuroscience^{1}. One such example is the abstraction of the sparse, pulsed and event-based nature of communication between biological neurons into neural units that communicate real values at every iteration or timestep of evaluation, taking the rate of firing of biological spiking neurons as an analog value (Figure 1**a**). Spiking neurons, as more detailed neural abstractions, are theoretically more powerful compared to analog neural units^{2} as they allow the relative timing of individual spikes to carry significant information. A real-world example in nature is the efficient sound localization in animals like the barn owl using precise spike-timing^{3}. The sparse and binary nature of communication similarly has the potential to drastically reduce energy consumption in specialized hardware, in the form of neuromorphic computing^{4}.

Since their introduction, numerous approaches to learning in spiking neural networks have been developed^{5–9}. All such approaches define how input signals are transduced into sequences of spikes, and how output spike-trains are interpreted with respect to goals, learning rules, or loss functions. For supervised learning, approaches that calculate the gradient of weights with respect to the loss have to deal with the discontinuous nature of the spiking mechanism inside neurons. Local linearized approximations like SpikeProp^{5} can be generalized to approximate “surrogate” gradients^{10}, or even calculated exactly in special cases^{11}. The use of surrogate gradients in particular has recently resulted in rapidly improving performance on select benchmarks, closing the performance gap with conventional deep learning approaches for smaller image recognition tasks like CIFAR10 and (Fashion) MNIST, and demonstrating improved performance on temporal tasks like TIMIT speech recognition^{12}. Still, spiking neural networks (SNNs) have struggled to demonstrate a clear advantage compared to classical artificial neural networks (ANNs)^{13, 14}.

Here, we introduce Spiking Recurrent Neural Networks (SRNNs), networks that include recurrently connected layers of spiking neurons (Figure 1**b**). We demonstrate how these networks can be trained to high performance on hard benchmarks, exceeding the existing state-of-the-art in SNNs, and approaching or exceeding the state-of-the-art in classical recurrent artificial neural networks. The high performance of SRNNs is achieved by applying back-propagation-through-time (BPTT)^{15} to spiking neurons using a novel Multi-Gaussian surrogate gradient and using adaptive spiking neurons whose internal time-constant parameters are co-trained with the network weights. The Multi-Gaussian surrogate gradient is constructed to include negative slopes, inspired by the ELU activation function^{16}: we find that the Multi-Gaussian surrogate gradient consistently outperforms other existing surrogate gradients. Similarly, co-training the internal time-constants of adaptive spiking neurons proved consistently beneficial. We demonstrate that these ingredients jointly improve performance to a competitive level while maintaining sparse average network activity.

We demonstrate the superior performance of SRNNs for well-known benchmarks that have an inherent temporal dimension, like ECG wave-pattern classification, speech (Google Speech Commands, TIMIT), radar gesture recognition (SoLi), and classical hard benchmarks like sequential MNIST and its permuted variant. We find that the SRNNs need very little communication, with the average spiking neuron emitting a spike once every 3 to 30 timesteps depending on the task. Calculating the theoretical energy cost of computation, we then show that in SRNNs cheap Accumulate (AC) operations dominate over more expensive Multiply-Accumulate (MAC) operations. Based on relative MAC vs. AC energy cost^{13, 14}, we argue that these sparsely spiking SRNNs have an energy advantage ranging from one to three orders of magnitude over RNNs and ANNs with comparable accuracy, depending on network and task complexity.

### Spiking Recurrent Neural Networks

We focus here on multi-layer networks of recurrently connected spiking neurons, as illustrated in Figure 1**b**; variations include networks that receive bi-directional input (bi-SRNNs; Figure S2**a**).

**Spiking neurons** are derived from models that capture the behavior of real biological neurons^{17}. While biophysical models like the Hodgkin-Huxley model are accurate, they are also costly to compute^{18}. Phenomenological models like the leaky integrate-and-fire (**LIF**) neuron model trade-off levels of biological realism for interpretability and reduced computational cost: the LIF neuron model integrates input current in a leaky fashion and emits a spike when its membrane potential crosses its threshold from below, after which the membrane potential is reset to the reset membrane potential; the current leak is determined by a decay time-constant *τ*_{m}.
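The discrete-time LIF update described above can be sketched in a few lines; the step size, time constant, and threshold below are illustrative values, not those used in the paper:

```python
import math

def lif_step(u, in_current, dt=1.0, tau_m=20.0, theta=1.0, u_reset=0.0):
    """One discrete-time update of a leaky integrate-and-fire neuron.

    The membrane potential decays with time constant tau_m while
    integrating the (weighted) input current; a spike is emitted when
    the potential crosses the threshold from below, after which the
    potential is reset to u_reset.
    """
    alpha = math.exp(-dt / tau_m)           # single-step leak factor
    u = alpha * u + (1.0 - alpha) * in_current
    spike = 1 if u >= theta else 0
    if spike:
        u = u_reset                          # reset after the spike
    return u, spike
```

Driving this neuron with a constant current above threshold produces regular spiking; a subthreshold current (here, one whose steady-state potential stays below *ϑ*) produces none.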

As an exceedingly simple spiking neuron model, the LIF neuron lacks much of the complex behavior of real neurons, including responses that exhibit longer history dependency like spike-rate adaptation^{18}. Bellec et al.^{19} demonstrated how using a spiking neuron model that uses a generic form of adaptation improved performance in their SNNs. In this adaptive LIF (**ALIF**) neuron, the LIF neuron model is augmented with an adaptive threshold that is increased after each emitted spike, and which then decays exponentially with time-constant *τ*_{adp}. Both LIF and ALIF neurons can be thought of as neural units with self-recurrency, as illustrated in Figure 1**c**.
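A minimal sketch of the ALIF dynamics: the effective threshold *b*_{0} + *βη* rises after every spike and decays back with *τ*_{adp}. The value *β* = 1.8 follows the default stated in the Methods; the remaining constants are illustrative:

```python
import math

def alif_step(u, eta, in_current, spike_prev, dt=1.0, tau_m=20.0,
              tau_adp=200.0, b0=1.0, beta=1.8, u_reset=0.0):
    """One update of an adaptive LIF (ALIF) neuron.

    The adaptation variable eta is bumped by the previous spike and
    decays with time constant tau_adp; the dynamic threshold
    b0 + beta*eta therefore rises after spikes, implementing
    spike-rate adaptation.
    """
    alpha = math.exp(-dt / tau_m)        # membrane leak per step
    rho = math.exp(-dt / tau_adp)        # threshold decay per step
    eta = rho * eta + (1.0 - rho) * spike_prev
    theta = b0 + beta * eta              # dynamic threshold
    u = alpha * u + (1.0 - alpha) * in_current
    spike = 1 if u >= theta else 0
    if spike:
        u = u_reset
    return u, eta, spike
```

Setting *β* = 0 recovers the plain LIF neuron; with *β* > 0 the same input current yields fewer spikes, as the threshold adapts upward.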

### BPTT, Surrogate-Gradient and Multi-Gaussian

Given a loss-function ℒ(*t|θ*) defined over neural activity at a particular time *t*, the error-backpropagation-through-time (BPTT) algorithm^{15} updates network parameters *θ* in the direction that minimizes the loss by computing the partial gradient *∂ℒ*(*t*)/*∂θ* using the chain-rule. Here, the parameters *θ* include both the synaptic weights and the respective neural time-constants. In recurrently connected networks, past neural activations influence the current loss and by unrolling the network the contribution of these past activations to a current loss is accounted for. The roll-out of network activity through which the gradient is computed is illustrated in Figure 1**d**.

The discontinuous nature of the spiking mechanism in spiking neurons makes it difficult to apply the chain-rule connecting the backpropagating gradient between neural output and neural input^{5}; in practice, replacing the discontinuous gradient with a smooth gradient function, a “**surrogate gradient**”, has proven effective^{10, 12, 21} and has the added benefit of allowing the mapping of spiking neural networks to recurrent neural networks in optimized Deep Learning frameworks like PyTorch and Tensorflow^{10}. Multiple surrogate gradient functions have been proposed and evaluated, including Gaussian, linear^{19} and SLayer^{6} functions; for these functions, however, no significant differences in performance are reported^{10}.

Inspired by the ELU^{16} and LeakyReLU activation functions, we here define the Multi-Gaussian (MG), a novel surrogate gradient comprised of a weighted sum of multiple Gaussians, where the hyperparameters *h* and *s* are chosen such that the Multi-Gaussian contains negative parts:

∂*S*_{t}/∂*u*_{t} = (1 + *h*) 𝒩(*u*_{t} − *ϑ* | 0, *σ*²) − *h* 𝒩(*u*_{t} − *ϑ* | *σ*, (*sσ*)²) − *h* 𝒩(*u*_{t} − *ϑ* | −*σ*, (*sσ*)²),

where *u*_{t} is the spiking neuron’s membrane potential and *ϑ* its internal threshold. The negative parts of the ELU and LeakyReLU functions are thought to alleviate the “dying ReLU” problem^{22}, where the sum of a neuron’s input is negative for all inputs and the neuron effectively does not participate in the network computation. The shape of the Multi-Gaussian (MG) and various other surrogate gradient functions is illustrated in Figure 1**e**.
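As a sketch of how such a surrogate can be evaluated, the function below combines a tall Gaussian centred on the threshold with two wider, negatively weighted Gaussians, producing the negative side lobes; the exact centres and widths are assumptions consistent with the description, with *h*, *s*, and *σ* set to the values reported in the Methods:

```python
import math

def gaussian(x, mu, sigma):
    """Value of the normal density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def multi_gaussian_grad(u, theta=1.0, h=0.15, s=6.0, sigma=0.5):
    """Multi-Gaussian surrogate gradient (sketch): a tall Gaussian at
    the threshold minus two wider, offset, negatively weighted
    Gaussians, giving a gradient with negative parts away from the
    threshold (the offsets/widths here are illustrative assumptions)."""
    v = u - theta  # distance of the membrane potential from threshold
    return ((1.0 + h) * gaussian(v, 0.0, sigma)
            - h * gaussian(v, sigma, s * sigma)
            - h * gaussian(v, -sigma, s * sigma))
```

The gradient peaks at the threshold and turns negative on both flanks, which is the qualitative shape shown in Figure 1**e**.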

### Computational Cost

To estimate the efficiency of SNNs and compare them to ANNs, we calculate the number of computations required in terms of accumulation (AC) and multiply-and-accumulate (MAC) operations^{23}. We do this for an SRNN network with LIF or ALIF neurons and compare to a complex recurrent ANN structure like an LSTM^{24} in Figure 2 – for other ANNs, see Figure S2b. In ANNs, the contribution from one neuron to another requires a MAC for every timestep, multiplying each input activation with the respective weight before adding to the internal sum. In contrast, for a spiking neuron a transmitted spike requires only an AC at the target neuron, adding the weight to the potential, and where spike inputs may be quite sparse. In addition, the spiking neuron’s internal state requires updating every timestep at the cost of several MACs depending on the spiking neuron model complexity^{14}. As calculating MACs is much more energetically expensive compared to ACs (e.g., 31x on 45nm CMOS^{25}), the relative efficiency of SNNs is determined by the number of connections times activity sparsity and the spiking neuron model complexity. Additionally, we remark that in hardware, multiplication circuits require substantially more die area compared to addition circuits^{26}.
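This back-of-the-envelope comparison can be made concrete as follows. The per-operation energies (3.1 pJ per MAC, 0.1 pJ per AC, matching the roughly 31x ratio for 45 nm CMOS cited above) and the two MACs per neuron state update are illustrative assumptions:

```python
# Illustrative 45 nm CMOS energy estimates (pJ), consistent with the
# ~31x MAC/AC ratio cited in the text.
E_MAC = 3.1
E_AC = 0.1

def rnn_energy_per_step(n_in, n_rec):
    """Analog recurrent layer: every input and recurrent connection
    costs one MAC per timestep."""
    macs = (n_in + n_rec) * n_rec
    return macs * E_MAC

def srnn_energy_per_step(n_in, n_rec, fire_rate, macs_per_neuron=2):
    """Spiking recurrent layer: a connection only costs an AC when its
    source neuron fires (probability fire_rate per step); each neuron
    additionally pays a few MACs per step for its state update (count
    is model dependent, assumed 2 here)."""
    acs = fire_rate * (n_in + n_rec) * n_rec
    macs = macs_per_neuron * n_rec
    return acs * E_AC + macs * E_MAC
```

With 40 inputs, 256 recurrent neurons, and a firing probability of 0.1 per step, this sketch yields roughly a 100x energy advantage for the spiking layer, in line with the one-to-three orders of magnitude argued above.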

## Results

### Tasks

Recurrent neural networks (RNNs) provide state-of-the-art performance in various sequential tasks that require memory^{35}, typically in small and compact networks, and can operate in an online fashion. We distinguish two kinds of sequential tasks: streaming tasks, where many inputs map to many specified outputs (many-to-many), and classification tasks, where an input sequence maps to a single output value (many-to-one). Sequential classification tasks can additionally be computed in an online fashion, where a classification is determined at each timestep.

We selected benchmark tasks that have an inherent temporal dimension and can also be computed with neural networks of modest size, to fit the dynamics and constraints of spiking neural networks. For these tasks, we trained several different SRNN network architectures with various gradients, hyperparameters, and spiking neuron models and compared them to classical and state-of-the-art RNN architectures.

The electrocardiogram (**ECG**)^{36} signal is composed of six different characteristic waveforms – P, PQ, QR, RS, ST, and TP – whose shape and duration inform clinicians on the functioning of the cardiovascular system. The task requires the continuous recognition of all six waveforms, where we use signals from the QTDB dataset^{36}. The ECG-wave labeling is an online and streaming task using only past information.

The sequential and permuted-sequential **S/PS-MNIST** datasets are standard sequence classification tasks of length 784 derived from the classical MNIST digit recognition task by presenting pixels one at a time. The permuted version additionally applies a fixed permutation to the pixels of each digit, removing spatial information.

The Spiking Heidelberg Dataset (**SHD**) and Spiking Speech Command (**SSC**) Dataset^{27} are SNN specific sequence classification benchmarks comprised of audio converted into spike trains based on a detailed ear model.

The **SoLi** dataset^{31} gesture recognition task is comprised of a set of gestures in the form of sequences of radar returns collected with the SoLi solid-state millimeter-wave radar sensor. We treat the SoLi task as both an online streaming task and a classification task by processing frames sequentially - we thus obtain two measures for the SoLi task: per-frame accuracy and whole-sequence accuracy, for streaming and classification respectively.

Both the Google Speech Commands (**GSC**) dataset^{37} and the **TIMIT** dataset^{38} are classical speech recognition benchmarks. For TIMIT, we compute the Frame Error Rate (FER) and, similar to^{12}, apply a bidirectional architecture such that future information is also used to classify each frame (illustrated in Figure S2**a**). Samples from the ECG, SHD and SoLi datasets are shown in Figures 3**a-c**.

As shown in Table 1, we find that these SRNNs achieve a novel state-of-the-art for spiking neural networks on all tasks, exceed conventional RNNs like LSTM models, and approach or exceed the state-of-the-art of modern RNNs. We see that SRNNs substantially close the accuracy gap (SHD, SSC, GSC) with non-recurrent architectures like convolutional neural networks (CNNs) and attention-based networks - the latter networks, however, are typically comprised of many more neurons or parameters and cannot be computed in an online or streaming fashion.

We plot the accuracy for the various tasks using different surrogate gradients in Figure 3**d**: while we see little difference between previously developed gradients like Gaussian, Linear, and SLayer, we find that the Multi-Gaussian function consistently outperforms these gradients. We also find that, independently of the surrogate gradient used, training the time-constants in the adaptive LIF neurons consistently improves performance, as shown in the ablation study in Figure 3**e**: leaving *τ*_{m}, *τ*_{adp}, or both untrained reduces performance. Much of the power of the SRNNs seems to derive from their multi-layer recurrent and self-recurrent architecture. When we make the spiking neurons non-spiking by eliminating the spiking mechanism and communicating the ReLU value of the membrane potential, we find that for almost all tasks we achieve performance that slightly exceeds that of the spiking SRNNs.

The trained SRNNs communicate sparingly: most networks exhibit sparseness less than 0.1, and only the ECG task requires more spikes as it was tuned to use the smallest SRNN network (46 neurons). Sparseness of neural activity, expressed as average firing probability per timestep per neuron, is plotted in Figure 3**f**. In general, we find that increasing network sizes improves accuracy while decreasing average sparsity (Figure 3**h-i**) – though the total number of spikes used in the network increases. The total average number of spikes required per sample (SOPs) and per sample per step (SOP/step) for the highest performing SRNNs are given in Figure 3**g**.

Plotting the performance of networks using either ALIF or LIF neurons, we find that using ALIF neurons consistently improves both performance and activity sparseness in the networks (Figure 3**h**). Similarly, splitting a single large recurrent layer into two layers of recurrently connected layers in the SRNN architecture improves both performance and sparsity in the SHD task (Figure 3**i**); we observed similar improvements in the other tasks.

We carried out a grid search on the SoLi and SHD datasets for the *h* and *s* hyperparameters to determine the optimal parameter values for the Multi-Gaussian surrogate gradient. We find that there is a range of values where we can obtain both competitive accuracy and high sparsity (orange dotted line in Figure 4**a-b**) – we used a similar parameter search for the other tasks. The training procedure also substantially ‘learns’ the time-constants for the respective tasks: as shown in Figure 4**c** for the SHD task, starting from a tight distribution of time-constants, the spiking neurons in the trained network converge to using a wide variety of time-constants - the same effect is observed in the other tasks (not shown).

The streaming and online nature of several of the tasks allows the network to make any-time decisions. Figure 4**d** shows the classification for the various ECG waveforms for every timestep. When a new wave is presented, there is a brief delay before this class is correctly identified. In Figures 4**e-i**, the average online classification performance is shown for the S-MNIST, PS-MNIST, SHD, SSC, and SoLi datasets. We see that the S-MNIST and PS-MNIST digits can be recognized reliably early in the sequence, while the SSC sounds require distinctly more time. The SHD sound recognition is much more erratic, and inspection of the data shows that this is caused by the various classes being placed at different times in the sound clip. Figure 4**i** plots the accuracy as a function of the number of frames shown for the SoLi task. Most gestures can be recognized reliably already after having presented only 25 out of the 42 frames - comparing favorably with^{31}: the SRNN allows decisions to be made earlier and with better accuracy.

Given the relative AC and MAC energy cost from^{14, 25, 39} and the computational complexity calculations from Figure 2**a**, we plot in Table 2 the relative energy efficiency of the various networks. We see that for the more complex tasks, SRNNs are theoretically at least 59x more energy efficient compared to RNNs at equivalent performance levels, where for most tasks the non-spiking (ReLU) SRNN compares most favourably. More classical RNN structures like LSTMs require many more parameters and operations, often being 1000x less efficient – we also calculate similar estimates for other RNN structures in Table S2.

## Discussion

We showed how multi-layered recurrent network structures are able to achieve new state-of-the-art performance for SNNs on sequential and temporal tasks. This was accomplished by using adaptive spiking neurons with learned temporal dynamics trained with backpropagation-through-time using a novel surrogate gradient, the Multi-Gaussian, which proved to consistently outperform the other surrogate gradients. These results approach or equal the accuracy of conventional RNNs. When expressed in terms of computational operations, they demonstrate a decisive theoretical energy advantage of one to three orders of magnitude over conventional RNNs. This advantage furthermore increases for more complex tasks that require larger networks to solve accurately.

Neither the SRNNs nor the presented RNNs were optimized beyond accuracy and (for the SRNNs) sparsity: no optimizations like pruning and quantization were applied. When we compare the SRNN for the GSC task against the Attention-based CNN-network TinySpeech^{23}, representing the recent state-of-the-art in efficiency-optimized speech recognition, we find that at an equivalent performance level, the SRNN still requires 19.6x fewer MACs, and where, unlike TinySpeech, the SRNN operates in an online and streaming fashion (data in Table S2).

We focused on temporal or sequential problems with relatively limited input dimensionality. With RNNs, such problems can be solved with relatively small neural networks and hold direct promise for implementation in ultra-low power EdgeAI solutions. This also was the reason for emphasizing streaming or online solutions where no or fixed preprocessing and buffering is required: problems where a temporal stream first has to be segmented and where these segments are then classified greatly increase the complexity of such solutions. As we demonstrated, most classification decisions could be made early with near-optimal accuracy.

Using surrogate-gradients, the BPTT-gradient in the SRNNs can be computed using standard deep learning frameworks, where we used PyTorch^{40}. With this approach, complicated architectures and spiking neuron models can be trained with state-of-the-art optimizers, regularizers, and visualization tools. At the same time, this approach is costly in terms of memory use and training time, as the computational graph is fully unrolled over all timesteps, and the abundant spatial and temporal sparsity is not exploited in the frameworks. This also limits the size of the networks to which this approach can be applied: for significantly larger networks, either dedicated hardware and/or sparsity optimized frameworks are needed^{41}. Approximations to BPTT like eProp^{12} or alternative recurrent learning methods like RTRL^{42} may also help alleviate this limitation.

We remark that the energy advantage of SRNNs we computed is theoretical: while the computational cost in terms of MACs is well-accepted^{23, 39}, this measure ignores real-world realities like the presence or absence of sufficient local memory, the cost of accessing memory, and the potential cost of routing spikes from one neuron to another. In many EdgeAI applications, the energy cost of conventional sensors may also dominate the energy equation. At the same time, the numbers we present are unoptimized in the sense that, other than optimizing the surrogate gradient for both sparsity and accuracy, we did not prune the networks or apply other standard optimization and quantization techniques. Substantial improvements here should be fairly straightforward. Training parameters of spiking neuron models in the SRNNs can be extended further to approaches that include parameterized short-term plasticity^{43} and more complicated spiking neuron models.

The effectiveness of adjusting time-constant parameters to the task may also have implications for neuroscience: though effective time-constants of real spiking neurons are variable and dynamic^{17}, the benefit of training these parameters in SRNNs suggests these neural properties may be subject to learning processes in biology.

## Methods

In the SRNNs, the LIF spiking neuron is modeled as:

*u*_{t} = *α* *u*_{t−1} + (1 − *α*) *R*_{m}*I*_{t},  *S*_{t} = *f*_{s}(*u*_{t}, *ϑ*),

where *I*_{t} = Σ_{i} *w*_{i}*S*_{i,t} + *I*_{inj,t} is the input signal comprised of spikes from presynaptic neurons *i* weighted by weights *w*_{i} and/or an injected current *I*_{inj,t}; *u* is the neuron’s membrane potential, which decays exponentially with time-constant *τ*_{m}, *ϑ* is the threshold, and *R*_{m} is the membrane resistance (which we absorb into the synaptic weights). The function *f*_{s}(*u*_{t}, *ϑ*) models the spike-generation mechanism as a function of the threshold *ϑ*: it is 1 when the neuron spikes and 0 otherwise (in the backward pass, its derivative is replaced by the approximating surrogate gradient ∂*S*_{t}/∂*u*_{t}). After a spike, the membrane potential is reset to the reset potential *u*_{r}, which was set to zero. The ALIF neuron is similarly modeled as:

*u*_{t} = *α* *u*_{t−1} + (1 − *α*) *R*_{m}*I*_{t} − *ϑ*_{t−1}*S*_{t−1},
*η*_{t} = *ρ* *η*_{t−1} + (1 − *ρ*) *S*_{t−1},
*ϑ*_{t} = *b*_{0} + *βη*_{t},

where *α*, *ρ* are parameters related to the temporal dynamics, *α* = exp(−*dt/τ*_{m}) and *ρ* = exp(−*dt/τ*_{adp}); *ϑ*_{t} is a dynamical threshold comprised of a fixed minimal threshold *b*_{0} and an adaptive contribution *βη*_{t}; *ρ* expresses the single-timestep decay of the threshold with time-constant *τ*_{adp}. The parameter *β* is a constant that controls the size of the threshold adaptation; we set *β* to 1.8 for adaptive neurons as default. Similarly, *α* expresses the single-timestep decay of the membrane potential with time-constant *τ*_{m}.

The SRNNs were trained using BPTT with various spiking neuron models with plastic time-constants and various surrogate gradients. Apart from the SSC and SHD datasets, analog input values are encoded into spikes either using spikes generated by a level-crossing scheme (ECG) or by directly injecting a proportional current into the first spiking layer (S-MNIST, PS-MNIST, SoLi, TIMIT, GSC). To decode the output of the network, we used one of two methods: spike-counting over the whole time-window for the (P)S-MNIST tasks, or a softmax over the output membrane potentials, computed from non-spiking LIF neurons (TIMIT, SHD, SoLi, and GSC) or spiking ALIF neurons (ECG). With spike-counting, classification is decoded from the sum of the output spikes as *ŷ*_{i} = softmax_{i}(Σ_{t} *S*_{out,i,t}), where *S*_{out,i,t} is the spike of output neuron *i* at time *t*. For both non-spiking LIF neurons and spiking ALIF neurons as outputs, a softmax classification is computed from the output neurons’ membrane potentials *u*_{out,t} at each timestep as *ŷ*_{t} = softmax(*u*_{out,t}). For ECG, we used spiking ALIF neurons as outputs as they performed best, which we believe is related to the fact that this is the only task where the classification switches within the sample - the spiking then effectively functions as a reset.
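The spike-count decoding above can be sketched in a few lines; the binary output spike trains in the usage example are hypothetical:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def decode_spike_count(spike_trains):
    """Classify a sequence from output spike counts: sum each output
    neuron's spikes over the whole window, then apply a softmax to
    obtain class probabilities."""
    counts = [float(sum(train)) for train in spike_trains]
    return softmax(counts)
```

For example, three output neurons emitting 3, 1, and 0 spikes over the window yield the highest class probability for the first neuron.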

We use a standard BPTT approach^{12} to minimize the cross-entropy (CE) or negative-log-likelihood (NLL) loss for each task using the Adam^{44} optimizer, where the network is unrolled over all input timesteps and the error is propagated backward from the last timestep to the first. The error-gradient is calculated and accumulated through all timesteps, after which the weights are updated. For the spiking neurons, the BPTT gradient is calculated backward through the self-recurrence circuits as well. As shown in Figure 1**d**, given an input sequence *X* = *x*_{0}, *x*_{1}, *x*_{2}, ..., *x*_{T} and a neuron with initial states {*u*_{h,0}, *u*_{o,0}, *S*_{h,0}, *S*_{o,0}}, we obtain for each timestep *t* ∈ {0, ..., *T*} the spiking neuron states {*u*_{h,t}, *S*_{h,t}, *u*_{o,t}, *S*_{o,t}}, where *S*_{h,t} refers to a neuron firing-or-not in a hidden layer and *S*_{o,t} to an output neuron (if spiking), and *u*_{h,t} and *u*_{o,t} denote hidden and output neurons’ membrane potentials. We then obtain a classification *ŷ*(*t*) either for each timestep or for the whole sequence *ŷ*, and an associated loss. In classification tasks with *C* classes, the prediction probability *ŷ*_{c} of class *c* is computed after having read the whole sequence, and the loss of the network is calculated as ℒ = −Σ_{c} *y*_{c} log *ŷ*_{c}, where *y*_{c} is the target probability of class *c*. In streaming tasks (ECG, SoLi), the total loss is computed as the sum of the loss at each timestep: ℒ = Σ_{t} ℒ_{t}. For the BPTT-derived gradient, we compute ∂ℒ/∂*W*_{h2h} = Σ_{t′} ∂ℒ/∂*W*_{h2h}|_{t′} for the recurrent weights *W*_{h2h}, and ∂ℒ/∂*W*_{h2o} = Σ_{t′} ∂ℒ/∂*W*_{h2o}|_{t′} for the output weights *W*_{h2o}, where each term can be computed at each timestep *t′* as ∂ℒ/∂*W*_{h2h}|_{t′} = (∂ℒ/∂*S*_{h,t′}) (∂*S*_{h,t′}/∂*u*_{h,t′}) (∂*u*_{h,t′}/∂*W*_{h2h}) and ∂ℒ/∂*W*_{h2o}|_{t′} = (∂ℒ/∂*ŷ*_{t′}) (∂*ŷ*_{t′}/∂*u*_{o,t′}) (∂*u*_{o,t′}/∂*W*_{h2o}), and where *W*_{h2h} refers to weights between neurons in the hidden layers and *W*_{h2o} to weights between hidden and output neurons. The discontinuous spiking function enters the gradient as the term ∂*S*_{h,t′}/∂*u*_{h,t′}, and here we use the differentiable surrogate gradients^{10}.

For the Multi-Gaussian surrogate gradient, we found effective parameter values *h* = 0.15 and *s* = 6 based on a grid search, and we set *σ* to 0.5. The standard surrogate gradients were defined following^{10}, with the Linear surrogate gradient as max(0, 1 − *α*_{linear}|*u*_{t} − *ϑ*|), the SLayer^{6} gradient as exp(−*α*_{slayer}|*u*_{t} − *ϑ*|), and the Gaussian surrogate gradient as 𝒩(*u*_{t} | *ϑ*, *σ*_{G}²); for all gradients, *α* is positive. We optimized all surrogate gradient hyperparameters in the experiments using grid searches; in the experiments we used *α*_{linear} = 1.0, *α*_{slayer} = 5.0, and *σ*_{G} = 0.5.

### Network initialization

Compared to ANNs, SRNNs require initializing both the weights and the spiking neurons’ hyperparameters (i.e., neuron type, time constants, thresholds, starting potential). We randomly initialize the time constants following a normal distribution 𝒩(*μ*, *σ*) with per-layer specific parameters given in Table S1. For all neurons, the starting value of the membrane potential is initialized with a random value distributed uniformly in the range [0, *ϑ*]. The bias weights of the network are initialized to zero and all feedforward weights are initialized using Xavier-uniform initialization; weights for recurrent connections are initialized as orthogonal matrices. We compared networks with constant, uniform, and normal initializers for the time-constants and found that the normal initializer achieved the best performance (Figure S1).

For the various tasks, the loss-function, sequence length, maximum number of epochs, learning rate and decay schedule, and minibatch-size are specified in Table S1. Unless specified otherwise, the network architecture consists of inputs densely connected to one or more fully recurrently connected layers of spiking neurons connected to a layer of output neurons, as illustrated in Figure 1**b**. For the **ECG** task, the QTDB dataset^{36} consists of two channels of ECG signals. We apply level-crossing encoding^{45} on the normalized ECG signal to convert the original continuous values into a spike train, where each channel was transformed into two spike trains representing a value-increasing or value-decreasing event respectively. The level-crossing encoding we used is defined as

*S*_{+,t} = 1 if *x*_{t} − *x*_{ref} > *L*_{+},  *S*_{−,t} = 1 if *x*_{ref} − *x*_{t} > *L*_{−},

where *x*_{ref} is the signal value at the most recent encoded spike, *S*_{+}, *S*_{−} denote spikes for the positive and negative spike-train respectively, and we used *L*_{+} = 0.3 and *L*_{−} = 0.3.
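A sketch of such a level-crossing encoder, under the assumption that the reference level is reset to the current signal value whenever either train emits a spike:

```python
def level_crossing_encode(signal, l_pos=0.3, l_neg=0.3):
    """Convert an analog signal into two binary spike trains: a
    positive spike when the signal has risen by more than l_pos since
    the last spike, and a negative spike when it has fallen by more
    than l_neg (reference-reset behavior is an assumption)."""
    s_pos, s_neg = [], []
    ref = signal[0]                  # value at the last emitted spike
    for x in signal:
        up = 1 if x - ref > l_pos else 0
        down = 1 if ref - x > l_neg else 0
        if up or down:
            ref = x                  # reset the reference level
        s_pos.append(up)
        s_neg.append(down)
    return s_pos, s_neg
```

A steadily rising signal then produces spikes only on the positive train, and a falling signal only on the negative train.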

For the **Spiking Heidelberg Dataset (SHD)**, the audio recordings were aligned to 1 s by cutting or padding with zeros. As in^{27}, two speakers were held out for the test dataset, and 5% of samples from the other speakers were also added to the test dataset. The training dataset thus comprised 8156 samples and the test dataset 2264 samples. For the **Spiking Speech Command Dataset**, the speech commands were also uniformly aligned to 1 s at a 250 Hz sampling frequency, and the dataset was randomly split into training, validation, and test sets with a ratio of 72%-8%-20% respectively. For the **SoLi dataset**, the sequence of 40 Range-Doppler images (RDIs) was fed into the model frame-by-frame as input and split into training and test sets as in^{31}. The original RDIs have 4 channels, but we found empirically that using one channel was sufficient. For the SoLi task, the first layer of the SRNN is a feedforward spiking dense layer, followed by a recurrent layer. As in^{31}, separate networks were trained for per-frame accuracy (Acc_{s}) and per-sequence accuracy (Acc_{c}), for the streaming and classification versions of the task respectively. In the **S-MNIST** task, the network reads the image pixel by pixel; for the **PS-MNIST** task, pixels are read into the network using a sliding window of size 4 with stride 1. For both tasks, the pixel values are fed directly into the network as injected current into the neurons of the first hidden layer through a fully connected layer with its own weights. We use the **Google Speech Commands** dataset v1^{37}. For preprocessing, Log Mel filters and their first- and second-order derivatives are extracted from the raw audio signals using Librosa^{46}. For the FFTs, a window of 30 ms and a hop of 10 ms is used. The timestep of the simulation is 10 ms. We calculate the logarithm of 40 Mel filter coefficients using the Mel scale between 20 Hz and 4 kHz.
Additionally, the spectrograms are normalized to ensure that the signal in each frequency band has unit variance across time; the Log Mel coefficients and their two derivative orders then form three distinct input channels. The input to the SRNN is thus a sequence of 101 frames, where each frame comprises a 40-by-3 matrix.

The **TIMIT** database contains 3696 and 192 samples in the training and test data respectively. We preprocessed the original audio data as in^{12} using MFCC encoding; 10% of the training dataset was randomly selected as validation dataset, and the network was trained on the remainder. Similar to bi-directional LSTMs, we use a bi-directional adaptive SRNN for this task, illustrated in Figure S2**a**: we use two SRNN layers in the network, reading the sequence in the forward and backward direction respectively. The mean of these layers’ outputs is then fed into the last layer, an integrator, to generate the class prediction.

## Author contributions statement

B.Y., F.C. and S.B. conceived the experiment(s), B.Y. conducted the experiment(s), B.Y., F.C. and S.B. analysed the results. All authors reviewed the manuscript.

## Additional information

### Competing interests

The authors declare no competing interests.

## Supplementary Information

## Acknowledgements

BY is funded by the NWO-TTW Programme “Efficient Deep Learning” (EDL) P16-25. The authors gratefully acknowledge the support from the organizers of the Capo Caccia Neuromorphic Cognition 2019 workshop and Neurotech CSA, as well as Jibin Wu and Saray Soldado Magraner, for helpful discussions.