## Abstract

Modern well-performing approaches to neural decoding are based on machine learning models such as decision tree ensembles and deep neural networks. The wide range of algorithms that can be utilized to learn from neural spike trains, which are essentially time-series data, results in the need for diverse and challenging benchmarks for neural decoding, similar to the ones in the fields of computer vision and natural language processing. In this work, we propose a spike train classification benchmark, based on open-access neural activity datasets and consisting of several learning tasks such as stimulus type classification, animal’s behavioral state prediction and neuron type identification. We demonstrate that an approach based on hand-crafted time-series feature engineering establishes a strong baseline performing on par with state-of-the-art deep learning based models for neural decoding. We release the code allowing to reproduce the reported results.

**Author summary** Machine learning-based neural decoding has been shown to outperform the traditional approaches like Wiener and Kalman filters on certain key tasks [1]. To further the advancement of neural decoding models, such as improvements in deep neural network architectures and better feature engineering for classical ML models, there need to exist common evaluation benchmarks similar to the ones in the fields of computer vision or natural language processing. In this work, we propose a benchmark consisting of several *individual neuron* spike train classification tasks based on open-access data from a range of animals and brain regions. We demonstrate that it is possible to achieve meaningful results in such a challenging benchmark using the massive time-series feature extraction approach, which is found to perform similarly to state-of-the-art deep learning approaches.

## Introduction

The latest advances in multi-neuronal recording technologies such as two-photon calcium imaging [2], extracellular recordings with multi-electrode arrays [3], Neuropixels probes [4] allow producing large-scale single-neuron resolution brain activity data with remarkable magnitude and precision. Some of the neural spiking data recorded in animals has been released to the public in the scope of data repositories such as CRCNS.org [5]. In addition to increasing experimental data access, various neural data analysis tools have been developed, in particular for the task of neural decoding, which is often posed as a supervised learning problem [1]: given firing activity of a population of neurons at each time point, one has to predict the value of a certain quantity pertaining to animal’s behaviour such as its velocity at a given point in time.

Such a formulation of the neural decoding task implies that it is a multivariate time-series regression or classification problem. An array of supervised learning methods focused specifically on general time-series data has been developed over the years, ranging from classical approaches [6] to deep neural networks for sequential data [7]. It is not fully clear, however, how useful these methods are for the specific tasks of learning from neural spiking data. In order to establish a sensible ranking of these algorithms for neural decoding, there is a need for a common spiking activity recognition benchmark. In this work, we propose a diverse and challenging spike train classification benchmark based on several open-access neuronal activity datasets. This benchmark incorporates firing activity from different brain regions of different animals (retina, prefrontal cortex, motor and visual cortices) and comprises distinct task types such as visual stimulus type classification, animal’s behavioral state prediction from individual spike trains and interneuron subtype recognition from firing patterns. All of these tasks are formulated as univariate time-series classification problems, that is, one needs to predict the target category based on an individual spike train chunk recorded from a single neuron. The formulation of the classification problems implies that the predicted category is stationary across the duration of the given spike train sample.

Our main contributions can be summarized as follows:

We propose a diverse spike train classification benchmark based on open-access data.

We show that global information such as the animal’s behavioral state or stimulus type can be decoded (with high accuracy) from

*single-neuron*spike trains containing several tens of interspike intervals.We establish a strong baseline for spike train classification based on hand-crafted time-series feature engineering that performs on par with state-of-the-art with deep learning models.

Well-established machine learning techniques such as gradient boosted decision tree ensembles and recurrent neural networks have been successfully applied both to neural activity decoding (predicting stimuli/action from spiking activity) [1, 8] as well as neural encoding (predicting neural activity from stimuli) [9]. Neural decoding tasks are often formulated as regression problems, wherein binned spiking count time-series of a single fixed neural population are used to predict the animal’s position or velocity in time.

A number of previous studies on feature vector representations of spike trains also focused on defining a spike train distance metric [10] for identification of neuronal assemblies [11]. Several different definitions of the spike train distance exist such as van Rossum distance [12], Victor-Purpura distance [13], SPIKE- and ISI-synchronization distances [14] (for a thorough list of existing spike train distance metrics see [10]). These distance metrics were used to perform spike train clustering and classification based on the k-Nearest-Neighbors approach [15]. Jouty et al. [16] employed ISI and SPIKE distance measures to perform clustering of retinal ganglion cells based on their firing responses to a given stimulus.

In addition to characterization with spike train distance metrics, some previous works relied on certain statistics of spike trains to differentiate between cell types. Charlesworth et al. [17] calculated basic statistics of multi-neuronal activity from cortical and hippocampal cultures and were able to perform clustering and classification of activity between these culture types. Li et al. [18] used two general features of the interspike interval (ISI) distribution to perform clustering analysis to identify neuron subtypes. Such approaches represent neural activity (single or multi-neuron spiking patterns) in a low-dimensional feature space where the hand-crafted features are defined to address specific problems and might not provide an optimal feature representation of spiking activity data for a general decoding problem. Finally, not only spike timing information can be used to characterize neurons in a supervised classification task. Jia et al. [19] used waveform features of extracellularly recorded action potentials to classify them by brain region of origin.

The aforementioned works were aimed at, to some extent or another, trying to decode properties of neurons or stimuli given recorded spiking data. In some of the cases the datasets used were not released to be openly available, in some of the cases the predictive models used constituted quite simple baselines for the underlying decoding/cell identification tasks. In this work, we aim to propose a benchmark base on open-access datasets that is diverse and challenging enough to robustly demonstrate gains of advanced time-series machine learning approaches as compared to some of the simple baselines used in previous works. We release the code allowing to reproduce the reported results.

## Materials and methods

### Overview of time series classification methods

We applied general time series feature representation methods [6] for classification of neuronal spike train data. Most approaches in time series classification are focused on transforming the raw time series data into an effective feature space representation before training and applying a machine learning classification model. Here we give a brief overview of state-of-the-art approaches one could utilize in order to transform time series data into a feature vector representation for efficient neural activity classification.

#### Neighbor-based models with time series distance measures

A strong baseline algorithm for time series classification is k-nearest-neighbors (kNN) with a suitable time series distance metric such as the Dynamic Time Warping (DTW) distance or the edit distance (ED) [6]. In this work, we evaluated performance of nearest-neighbor models for generic distance measures such as *l _{p}* and DTW distance, converting spike trains to the interspike-interval (ISI) time-series representation prior to calculating the spike-train distances. Some of the distance metrics we also used for evaluation are essentially distribution similarity measures (e.g. Kolmogorov-Smirnov distance, Earth mover’s distance) which allow comparing ISI value distributions within spike trains. Such a spike train distance definition would only use the information about the ISI distribution in the spike train, but not about its temporal structure. Alternatively, one can keep the original event-based representation of the spike train and compute the spike train similarity metrics such as van Rossum or Victor-Purpura distances or ISI/SPIKE distances [10].

The choice of the distance metric determines which features of the time series are considered as important. Instead of defining a complex distance metric, one can explicitly transform time series into a feature space by calculating various properties of the series that might be important (e.g. mean, variance). After assigning appropriate weights to each feature one can use kNN with any standard distance metric. Moreover, such a representation allows the application of any state-of-the-art machine learning classification algorithm beyond kNN to obtain better classification results. In the following, we discuss approaches using various feature space representations available for time series data.

#### Models using hand-crafted time series features

One of the useful and intuitive approaches in time series classification is focused on manually calculating a set of descriptive features for each time series (e.g. their basic statistics, spectral properties, other measures used in signal processing and so on) and using these feature sets as vectors describing each sample series. There exist approaches which enable automated calculation of a large number of time series features which may be typically considered in different application domains. Such approaches include automated time series phenotyping implemented in the *hctsa* MATLAB package [20] and automated feature extraction in the *tsfresh* Python package [21]. Here we utilize the *tsfresh* package which enables calculation of 779 descriptive time series features for each spike train, ranging from Fourier and wavelet expansion coefficients to coefficients of a fitted autoregressive process.

Once each time series (spike train) is represented as a feature vector, the spiking activity dataset has the standard form of a matrix with size [*n*_{samples}, *n*_{features}] rather than the raw dataset with shape [*n*_{samples}, *n*_{timestamps}]. This standardized dataset can be then used as an input to any machine learning algorithm such as logistic regression or gradient boosted trees [22]. We found this approach to set a strong baseline for all of the classification tasks we considered.

#### Deep learning models

Lastly, there are deep learning based approaches working well for time-series classification [7] such as deep recurrent networks like LSTMs and GRUs [23] and 1D convolutional neural networks (1D-CNNs) [24, 25]. While rather generic model architectures have been typically applied to neural decoding tasks [1, 8], there exist models specifically designed for time-series classification and regression tasks, like InceptionTime [25], achieving state-of-the-art results on benchmarks like the UCR Time Series Classification Archive [26]. Recent developments in deep learning model for time series also include the Time Series Transformer [27] and convolutional architectures like the Omniscale-CNN [28]. Perhaps surprisingly, we found that deep learning models could not significantly outperform the baseline with hand-crafted time series features on the spike train classification tasks, oftentimes performing worse than the baseline.

### The proposed spike train classification benchmark

We propose a spike train classification benchmark comprising several different open-access datasets and distinct classification tasks. The datasets used for the benchmark are as follows:

**Retinal ganglion cell stimulus type classification**based on the published dataset [29, 30]: Spike time data from multi-electrode array recordings of salamander retinal ganglion cells under four stimulus conditions: a white noise checkerboard, a repeated natural movie, a non-repeated natural movie, and a bar exhibiting random one-dimensional motion. We define the 4-class classification task to predict the stimulus type given the spike train chunk, also considering binary classification tasks for pairs of stimuli types (e.g. “white noise checkerboard” vs. “randomly moving bar”).**WAKE/SLEEP classification**based on fcx-1 dataset [31, 32]from CRCNS.org [5]: Spiking activity and Local-Field Potential (LFP) signals recorded extracellularly from frontal cortices of male Long Evans rats during wake and sleep states without any particular behavior, task or stimulus. Around 1100 units (neurons) were recorded, 120 of which are putative inhibitory cells and the rest is putative excitatory cells. Figure 1 shows several examples of spiking activity recordings that can be extracted from the fcx-1 dataset. The authors classified cells into an inhibitory or excitatory class based on the action potential waveform (action potential width and peak time). Sleep states (SLEEP activity class) were labelled semi-automatically based on extracted LFP and electromyogram features, and the non-sleep state was labelled as the WAKE activity class. We define the binary classification task as the prediction of WAKE or SLEEP animal state given a spike train chunk recorded from a putative excitatory cell.**Interneuron subtype classification task**based on the Allen Cell Types dataset [33]: Whole cell patch clamp recordings of membrane potential in neurons of different types. We selected the PV, VIP and SST interneurons from the whole dataset, as these interneuron groups comprise the majority of inhibitory cells in the prefrontal cortex [34]. We selected the spike trains recorded under the naturalistic noise stimulation protocol (as a proxy for the*in vivo*spontaneous activity in these cells). The non-trivial prediction task is defined for VIP vs. SST spike train classification, since the PV interneuron spike trains can be easily distinguished from the other interneuron types. The latter is due to a significantly higher firing frequency in PV interneurons that we found in the Allen Cell Types dataset.**Unsupervised temporal structure recognition task.**We defined a set of spike train classification tasks constructed in a self-supervised manner [35]. In such tasks, we take any set of (unlabelled) neuronal spike train recordings and generate a additional set of spike trains by applying a given transformation to the original data. The target classification task is to determine whether a given spike train chunk belongs to the original dataset or to the transformed one. Note that this task can be constructed for any spiking dataset without the need for the ground truth labels, i.e. in an unsupervised way. The spike train transformations we consider here are (i) adding spike timing jitter via, in particular, timing noise following a truncated normal distribution, (ii) random shuffling of the interspike intervals in the spike train, (iii) reversing the spike train. The models trained in such tasks learn to detect the temporal structure of the original spike trains, since the order/precise values of interspike intervals have been disrupted by the transformation (e.g. by ISI shuffling), while the ISI value distribution is preserved by some of the transformations (e.g. by the shuffling and reversal operations). The final trained model accuracy in a shuffled vs. non-shuffled spike train classification task can thus be thought of as a measure of temporal structure in the original spiking dataset (test set accuracy would be on the chance level if the ISI values in the original spike trains were independently sampled from a fixed value distribution, i.e. the exact ordering of the ISIs did not contain any predictive information). We consider the the fcx-1 and retinal ganglion cell datasets described above to construct the temporal structure recognition tasks.

### Validation scheme and data preprocessing

Suppose we are given a dataset containing data from several animals each recorded multiple times with a large number of neurons captured in each recording. For each recorded neuron, we have a corresponding spike train captured over a certain period of time (assuming that the preprocessing steps like spike sorting or spiking time inference from fluorescence traces were performed beforehand). The number of spikes within each spike-train is going to be variable. A natural way to standardize the length of spike-train sequences would be dividing the full spike train into chunks of *N* spike times, where N is fixed for each chunk. We produce these spike train chunks for all datasets by moving a fixed-size sliding window across each single-neuron spike train. Hence, the pre-processing pipeline for all dataset is as follows:

Encode all of the spike trains in the interspike interval format (i.e. time-series [

*ISI*_{1},*ISI*_{2},*ISI*_{3},…])Apply a rolling window of size

*N*with step*S*to each single-neuron spike train separately, producing several spike train chunks (of*N*interspike intervals) for each neuron.Pool all the chunks from all neurons together to form a data matrix of size [

*M, N*] where*M*is the total number of samples (chunks) in the training/test dataset and*N*is the chunk size.(Optional) Apply a logarithmic transform to each sample (due to the originally heavy-tail distribution of ISI values in the data) and standard scaling to the dataset. Alternatively, encode the spike train chunks using a binned spike count representation.

We train a decoder (classifier) using the training set data matrix, so a single decoder is trained for all the neurons in the training set from single-neuron spike train chunks.

The validation strategy we use in this work is based on group splits, which means we determine the split into the training and the test datasets based on animal identifiers available in the original data. The motivation is that, in cases recordings are performed in several animals and corresponding animal identifiers are available, the set of animals used to construct the training dataset and the set of animals for the test dataset should not overlap in order to test whether the trained decoding models could generalize across different animals. In case animal identifiers are not available, we split the dataset into training and testing based on non-overlapping neuron identifiers in train and test.

Most of the datasets in the benchmark have an imbalanced class distribution. For performance evaluation, we consider two different setups in this work: (i) keeping all the available data in the training/test datasets, and measuring performance with metrics that are robust to class imbalance, such as the Cohen’s kappa score [36], and the geometric mean score [37] and (ii) balancing the class distribution in the training and testing datasets by data undersampling and measuring standard classification metrics such as accuracy and AUC-ROC. The second approach allows to rank models without the effect of class imbalance on training and evaluation [38], however some data points are being lost in the undersampling process. Although AUC-ROC is generally considered to be a performance metric less affected by the class imbalance than e.g. accuracy, we do not report AUC-ROC values for the full (imbalanced) data setting because of its possible skeweness [39].

The base task we consider for all benchmark datasets is classification given an individual spike train chunk (part of a single-neuron recording). However, prediction performance can be improved by aggregating predictions from spike trains of several neurons (in case such data is available) or from several chunks of a large single-neuron spike train. If the final classification in such a setting is done by majority voting from all single spike-train chunk predictions and we assume that recorded spike-trains chunks are randomly sampled from the whole set, the optimistic estimate for accuracy improvement with the number of spike train chunks *N*_{chunks} would be
where *μ* is the probability that the majority vote prediction is correct, *p* is the probability of a single classifier prediction being correct (single spike train prediction accuracy), *N _{chunks}* is the number of predictions made,

*m*= [

*N*/2] + 1 is the minimal majority of votes. We found that the empirical values of accuracy improvement are close to the optimistic analytical estimate (1) in both cases when the spike train chunks are sampled from different neurons and from a large spike train of a single neuron (see Fig. S1).

_{chunks}## Results

### Visual stimulus type classification from retinal spike trains

We first start with looking the at the retinal ganglion cell spike train classification task. Recorded spike trains in the dataset are associated with one of the four categories corresponding to different visual stimulus types, labelled with “white noise checkerboard”, “randomly moving bar”, “repeated natural movie” and “unique natural movie”. The classification task is, given a chunk of the spike train recording, predict the corresponding stimulus type category. The number of neurons in the dataset belonging to each category is 155, 140, 178, 152, respectively. The number of interspike intervals is quite variable among individual cells (due to firing rate variability) ranging from 100 ISIs per recording to as much as 60000 ISIs per recording.

We focus on the binary classification task aimed at predicting one of the two types of stimuli: “randomly moving bar” (corresponding to a label of 0) or “white noise checkerboard” (corresponding to a label of 1). We select recorded spike trains corresponding to those stimuli types and split 70% of recorded neurons (108 neurons) for the training part of the dataset and the remaining 30% (46 neurons) for the testing dataset. We encode spike trains using the ISI representation and apply a rolling window of size equal to 200 ISIs with a stride (step) of 100 ISIs to each recorded neuron. This results in 8006 training samples and 3650 testing samples, each sample containing 200 ISIs. The average target value in the training set is 0.7660 and 0.7882 in the testing set, hence class imbalance is present in the retinal dataset.

#### Nearest-neighbor models for spike train classification

We evaluated performance of nearest-neighbor models with different distance metrics on the retinal stimulus classification task, results are shown in Table 1. The results in presented in Table 1 were obtained without performing undersampling to balance the class distribution, hence we looked at imbalance-robust metrics such as Cohen’s kappa and geometric mean scores. We found that the nearest neighbor model with the DTW distance is amongst the best performing ones, but is still outperformed by the 1-NN model with the Kolmogorov-Smirnov (KS) distance, suggesting that differences in ISI distributions contain significant discriminative information helpful for the classification task at hand. We further include the results obtained with the 1-NN KS-distance model as a baseline to compare against other methods.

#### Hand-crafted feature extraction + classification models

The kNN results clearly suggest that characteristics of the interspike-interval distribution of the given spike train are predictive of the category label in our classification task. At the same time, one would expect the the temporal (sequential) information contained in the spike train also has certain predictive power. A straightforward way to incorporate both types of features in the model is to build a corresponding vector embedding of the spike train time-series. An efficient way to do so is to use a set of hand-crafted time-series features, like for example the set of 779 features provided in the *tsfresh* Python package. In order to compute vector embeddings for the spike trains in the training and testing datasets, one has to convert spike times into a time-series, which in principle could be done using either an interspike-interval encoding (the time-series is the sequence of ISIs) or a spike-count encoding (time is binned and spike counts in each time bin comprise the time series). The latter type of encoding depends on an additional hyperparameter which is the size of the time bin while ISI-encoding is parameter-free. We tested both types of spike-train encoding for our task and observed that models trained using the ISI-encoding of spikes generally perform better than the ones using binned spike counts. Furthermore, we found that combining features corresponding to both encoding types leads to better performance compared to using a single encoding scheme (see Table 4 and Figure 2).

For each spike-train encoding type, we computed the 779-dimensional *tsfresh* time-series embeddings independently for each sample in the training and testing datasets (no statistic aggregation across samples is performed). We then performed simple pre-processing steps by (i) removing low-variance features from the embedding (features *f* satisfying std(*f*)/(mean(*f*) + *ε*) < *θ* with *θ* = 0.2 and *ε* = 10^{−9} were removed) and (ii) performing standard scaling for each feature using mean and variance statistics collected over the training dataset. Note that since the number of spikes in each data sample is fixed and interspike intervals are highly variable, the number of time bins also becomes variable from sample to sample. The fixed-size *tsfresh* embeddings can nevertheless be computed since they are applicable to variable-length time series.

We then trained classification models on the resulting spike-train vector embeddings. We chose a representative set of classification models comprising (i) a linear model, namely logistic regression with an *l*_{2} regularization penalty and (ii) several types of tree-based ensembles: a random forest classifier (via the scikit-learn’s RandomForestClassifier implementation), randomized decision trees (via the scikit-learn’s ExtraTreesClassifier), and a gradient boosted decision tree (GBDT) ensemble (via the XGBoost implementation). The classifier hyperpameter values we used are specified in the Supplementary Materials. We have not included validation metric values of the logistic regression models in figures and tables of the main text because we found that linear models always perform worse compared to decision tree ensembles; results for linear models can be found in Supplementary Materials.

Classification results obtained with the described tsfresh-based approach are presented in Figure 2 and Tables 2, 3, 4. Note that in cases where we report balanced test set metrics (accuracy and AUC-ROC) we have performed class balancing via undersampling on the dataset and also have randomly sampled 70% of the training set data over several trials to estimate the variance in validation metrics. In cases when we don’t perform any undersampling, we report the values of Cohen’s kappa and geometric mean score metrics.

We were able to reach significant performance levels (> 0.88 accuracy, > 0.95 AUC-ROC) with our best *tsfresh*-based models on the binary retinal stimulus classification task (“randomly moving bar” vs. “white noise checkerboard”). To make better sense of these metric values, we compared our *tsfresh-based* models against two simple baselines: (a) a logistic regression model on ISI-encoded spike-trains represented by 6 basic statistical features - the mean, median, minimum and maximum ISI values, the standard deviation and the absolute energy of the ISI-sequence (the mean of squared ISI values) and (ii) an XGBoost model trained directly on “raw” ISI-encoded spike trains.

The best performing model using the 6 basic features of the ISI time-series got a median balanced test set accuracy of 83.36, while training an XGBoost model directly on the ISI time-series gave an accuracy of 80.86. Using full *tsfresh* embeddings on ISI time-series boost the accuracy to 88.53 with the best model (XGBoost). We generally found that using the binned spike count encoding of spike trains works worse with *tsfresh* compared to the ISI encoding (see Table 4), while combining feature vectors obtained from both encodings results in improved accuracy (89.46 for the XGBoost model on the retina dataset).

We also evaluated the performance of state-of-the-art deep learning models in our retinal stimulus classification tasks, using implementations from the *tsai* package [40]. Accuracy results for a range of convolutional architectures (FCN, InceptionTime, XceptionTime, ResNet) are shown in Tables 2, 3. We found that convolutional neural nets generally outperform manual-feature-based methods, although by a relatively small margin both in cases of class-balanced and imbalanced datasets.

We then performed the same pre-processing steps for the WAKE/SLEEP and VIP/SST datasets as for the retinal stimulus classification dataset. The rolling window of size equal to 200 ISIs and a stride of 100 ISIs for the WAKE/SLEEP data produced a dataset of 24363 training samples (from 78 neurons) and 10634 testing samples (from 35 neurons) with average target values of 0.3796 and 0.3211 in the training and testing sets, correspondingly. A rolling window of size equal to 50 ISIs and a stride of 20 ISIs for VIP/SST interneuron data produced a dataset of 2690 training samples and 1217 testing samples, with mean target values of 0.6026 and 0.8504, correspondingly.

We observed similar trends both for the WAKE/SLEEP state and VIP/SST interneuron classification tasks, shown in Table 3. The best performing non-deep-learning models were found to be *tsfresh*-based ones using the combined ISI and spike count encodings of the underlying spike trains. Convolutional neural networks were found to outperform the classical models, sometimes by a considerable margin (i.e. the Allen cell types dataset).

#### Unsupervised spike train temporal structure recognition

The spike train temporal structure recognition task is defined as follows: for a set of spike train activity data, we generate a binary classification task by producing an additional category of spiking data consisting of spike trains from the original dataset with a certain transformation applied to them. We consider the following spike train transformations: (i) ISI shuffling inside the spike train (random shuffling applied to the ISI time series), (ii) reversing the ISI time series and (iii) adding spike timing jitter sampled from the truncated normal distribution to the time series. Note that the first two transformation types do not change the value distribution of the time series, only its temporal structure (the exact ordering of the interspike intervals of the spike train). Hence, if it is possible to construct a classification model capable of distinguishing between the two activity classes (original spiking activity versus the transformed one), then one could say that the model has learned to detect the temporal structure in the time series. In the case of classifiers trained on *tsfresh* feature vectors, the classification accuracy metrics obtained can be thought of as measures of the amount of temporal structure contained in the spike trains (to the extent encoded in *tsfresh* features). The classification results (AUC-ROC values) for different base spiking data and different transforms is shown in Table 5. Notably, one could observe higher AUC-ROC values for the retinal ganglion cell spiking data in case of the randomly moving bar stimulus as compared to the white noise checkerboard stimulus for all of the three transforms considered. The same difference in accuracy values is observed for the fcx-1 dataset whereby classification AUC-ROC value is higher for all of the three transforms when the SLEEP state is used as the base spiking dataset as opposed to the WAKE state. The accuracy values that we observe for the temporal structure recognition tasks are above the chance level in most cases, with low values for fcx-1 WAKE-state data in case of reverse and noise transforms.

#### The set of most discriminative spike train features

Being able to estimate feature importance ranks from trained decision tree ensembles allows us to detect the most discriminating features of ISI time-series for a particular classification problem. In order to select the important groups of discriminative features, we trained several random forest classification models (with different random seeds) on each dataset and used the feature importance scores extracted from the classifiers to rank the *tsfresh* features. We have used the mean scores averaged over different datasets to identify the features that are discriminative for spiking data across different scenarios. According to this feature ranking procedure, the following groups of *tsfresh* features are selected (see also Fig. S2):

*median, kurtosis, quantile_q*– simple statistics of the ISI value distribution in the series like the median ISI value,*q*quantiles and kurtosis of the ISI value distribution*change_quantiles*– this feature is calculated by fixing a corridor of the time series values (defined by lower and higher quantile bounds,*q*and_{l}*q*, which are hyperparameters), then calculating a set of consecutive change values in the series (differencing) and then applying an aggregation function (mean or variance). Another boolean hyperparameter_{h}*is_abs*determines whether absolute change values should be taken or not.*fft_coefficient*– absolute values of the fast Fourier transform coefficients (individual coefficient values and aggregates).*entropy*– values of the sample entropy, the approximate entropy and the binned entropy of the power spectral density of the time series.*agg_linear_trend*– features from linear least-squares regression (standard error in particular) for the values of the time series that were aggregated over chunks of a certain size (with different aggregation functions like min, max, mean and variance). Chunk sizes vary from 5 to 50 points in the series.

To visualize class separation for the WAKE vs. SLEEP state spike trains from the fcx-1 dataset as point clouds in two dimensions, we took the top-20 importance *tsfresh* features identified during the above feature selection procedure. We then used dimensionality reduction techniques on this reduced 20-dimensional dataset to visualize the structure of the data with respect to the WAKE/SLEEP state labels of the series. Results are shown in Fig. 3 for two Uniform Manifold Approximation and Projection, UMAP [41] low-dimension embedding algorithms. We also applied other methods such as PCA and t-SNE (t-distributed Stochastic Neighbor Embedding) which gave essentially the same results (not shown). In all cases, classes cannot be linearly separated in two-dimensional embedding spaces, however, there is a separation of large fraction of the points of the WAKE and SLEEP state classes.

We conclude that both the convolutional neural networks and the hand-crafted feature engineering approach combined with strong tree-based learning models set a strong baseline for spike train classification for all of the tree studied tasks.

## Discussion

In this work, we have introduced a diverse neuronal spike train classification benchmark to evaluate neural decoding algorithms. The benchmark consists of several single-neuron spike train prediction tasks spanning stimulus type prediction, neuron type identification and animal behavioural state prediction. The interneuron type classification task (SST vs. VIP interneurons) is shown to be non-trivial due to the firing patterns of these two neuron types being quite similar in properties (as opposed to, for instance, PV interneurons, which typically have significantly higher firing rates compared to SST/VIP neurons as well as pyramidal cells). The other two tasks relate spiking activity of individual neurons to the global state of the underlying neural circuit, which is in one of the cases stationary during a considerable time period (fcx-1 dataset) and the other one could be viewed as a transient stimulus-driven one (retinal dataset). In both cases, we have demonstrated that individual neuronal spike trains contain information related to the global state of the neural circuit and this information can be decoded (from a relatively small spike train) if appropriate time-series learning models are used. Extensive experiments on several datasets that we have conducted imply that not only ISI value distribution is important for global state identification but also the temporal information contained in the spike trains, that is, features related to the exact sequences of interspike intervals in neural firing. We have identified groups of features highly informative for neural decoding tasks and established that this feature encoding combined with strong supervised learning algorithms such as gradient boosted tree ensembles establishes a strong baseline on the proposed benchmark that performs on par with state-of-the-art deep learning approaches. It was shown that significantly large accuracy values can be obtained on all of the proposed tasks using the hand-crafted feature encoding approach on single-neuron spike train chunks containing as low as 50 interspike intervals. We suggest that accuracy values can further be improved by hyperparameter search, model ensembling and test-time data augmentation. We propose that neural decoding models be evaluated on diverse and challenging tasks including the proposed benchmark (as well as regression tasks used to evaluate decoding models previously [1, 8]) in order to establish a sensible model performance ranking similarly to what is done for computer vision and natural language understanding problems. We believe that this would drive further development of highly accurate neural decoding/neural activity mining approaches enabling their application in precision-critical tasks such as identifying pathological disease-related firing activity patterns in the brain.

## Conclusion

To summarize our contributions, we have proposed a challenging and diverse benchmark for *individual cell* spike train classification to evaluate neural decoding models.

We have shown that a classical machine learning baseline comprised of massive time-series feature extraction from different spike train encodings coupled with well-performing classification approaches such as gradient boosting produces results on par with deep learning models, although with deep neural nets still slightly outperforming the classical methods. Furthermore, we have shown that the firing of individual neurons contains information about the global state of the organism as well as the information about the neuron type that can be decoded with machine learning approaches. This approach was further generalized to the unsupervised (self-supervised) setting, which helped reveal interesting structural properties of the spiking data we considered, in particular the WAKE-state time-reversal invariance and spiking jitter robustness of the cortical activity in the fcx-1 dataset. The massive time-series feature engineering approach helped detect groups of time-series features that have discriminative power over a set of different tasks in our benchmark and might thus be useful in general neural decoding tasks.

## Supplementary materials

### S1 Appendix. Classifier hyperparameter values

Listed below are hyperparameter values and implementation references for all of the classifier types we used. For more reference, see the example script.

Random Forest: sklearn implementation,

*n_estimators*= 500,*max-depth*= 10.Extra Trees Classifier: sklearn implementation,

*n_estimators*= 500,*max_depth*=*None*(no limit on depth).Logistic Regression: sklearn implementation, l

_{2}penalty,*C*= 0.001XGBoost: xgboost implementation

–

*max_depth*= 8–

*learning_rate*= 0.1–

*n_estimators*= 500–

*objective*=*binary*:*logistic*–

*booster*=*gbtree*–

*gamma*= 0–

*min_child_weight*= 1–

*max_delta_step*= 0–

*subsample*= 0.7–

*colsample_bytree*= 1–

*colsample_bylevel*= 1–

*colsample_bynode*= 1–

*reg_alpha*= 0–

*reg_lambda*= 1–

*scale_pos_weight*= 1–

*base_score*= 0.5

FCN, InceptionTime, XceptionTime, ResNet: tsai implementation

–

*epochs*= 200–

*max Jr*= 0.1–

*optimizer*=*sgd*–

*weight_decay*= 1e — 4–

*batch_size*= 128–

*lr_schedule*=*cosine*–

*best_model*= with the largest*cohemkappa*or*accuracy*on the validation set

We have looked at how the performance of a CNN is robust to the above training hyperparameters using the XceptionTime architecture on the retinal stimulus classification dataset (see Table S1).

We found that generally the CNN performance is not significantly affected by the changes in the main hyperparameters, being relatively robust with respect to these changes. We do, however, expect that a thorough hyperparameter search (including search over the architecture’s parameters) could result in a significant improvement in performance.

## Footnotes

Training/validation splits of the benchmark datasets have been fixed, the codebase was updated so that one could fully reproduce the results reported in the manuscript.