ABSTRACT
Calcium imaging is a key method to record patterns of neuronal activity across populations of identified neurons. Inference of temporal patterns of action potentials (‘spikes’) from calcium signals is, however, challenging and often limited by the scarcity of ground truth data containing simultaneous measurements of action potentials and calcium signals. To overcome this problem, we compiled a large and diverse ground truth database from publicly available and newly performed recordings. This database covers various types of calcium indicators, cell types, and signal-to-noise ratios and comprises a total of >35 hours from 298 neurons. We then developed a novel algorithm for spike inference (CASCADE) that is based on supervised deep networks, takes advantage of the ground truth database, infers absolute spike rates, and outperforms existing model-based algorithms. To optimize performance for unseen imaging data, CASCADE retrains itself by resampling ground truth data to match the respective sampling rate and noise level. As a consequence, no parameters need to be adjusted by the user. To facilitate routine application of CASCADE we developed systematic performance assessments for unseen data, we openly release all resources, and we provide a user-friendly cloud-based implementation.
INTRODUCTION
Imaging of somatic calcium signals using organic or genetically encoded fluorescent indicators has emerged as a key method to measure the activity of many identified neurons simultaneously in the living brain1, 2. However, calcium signals are only an indirect, often non-linear and low-pass-filtered proxy of the more fundamental variable of interest, somatic action potentials (spikes)3–5. The relationship between calcium signals and spike rates is ideally assessed directly by simultaneous electrophysiological recordings (preferably in the minimally disruptive juxtacellular configuration) and optical imaging of a calcium indicator signal in the same neuron. These dual recordings can serve as ground truth to calibrate and optimize algorithms for the inference of spike rates from other calcium imaging data (Fig. 1a). Based on such ground truth datasets, various model-based methods6–16 as well as supervised machine learning algorithms15, 17–20 for spike inference have been developed.
Ideally, an algorithm should be applicable to infer spike rates in unseen calcium imaging datasets for which no ground truth is available. The relationship between spikes and the evoked calcium signals depends on multiple factors including the neuron type, the type of calcium indicator and its concentration, the optical resolution, the sampling rate and the noise level. Many of these parameters can vary substantially between experiments and even from neuron to neuron within the same experiment. As a consequence, experimental conditions of novel datasets are often not well matched to those of available ground truth data. It is therefore not clear how an algorithm based on a specific ground truth dataset generalizes to other datasets, which causes problems for the inference of spike rates from calcium imaging data under most experimental conditions12,13,21,22.
Here, we address the issue of generalization systematically. To assemble a large ground truth database, we performed juxtacellular recordings and two-photon calcium imaging using different calcium indicators and in different brain regions of zebrafish and mice. This database was then augmented with a carefully curated selection of publicly available ground truth datasets. Using this large database, we developed a supervised method for calibrated spike inference of calcium data using deep networks (CASCADE). CASCADE includes methods to resample the original ground truth datasets in order to match their sampling rate and noise level to a specific calcium imaging dataset of interest. This procedure allowed us to train machine learning algorithms upon demand on a broad spectrum of resampled ground truth datasets, matching a wide range of experimental conditions. Finally, we tested the performance of CASCADE systematically when applied to unseen data. CASCADE was robust with respect to any hyper-parameter choices and performed better than existing algorithms in benchmark tests performed across all ground truth datasets and across noise levels. The CASCADE algorithm can be used directly via a cloud-based web application and is also available together with the ground truth datasets as a simple and user-friendly Python-based toolbox.
RESULTS
A large dataset of curated ground truth recordings
To extend the spectrum of existing ground truth datasets, we performed simultaneous electrophysiological recordings and calcium imaging in adult zebrafish and mice (Fig. 1b-h; Table 1). In zebrafish, a total of 47 neurons in different telencephalic regions were recorded in the juxtacellular configuration in an explant preparation of the whole adult brain23 using the synthetic calcium indicators Oregon Green BAPTA-1 (OGB-1) and Cal-520 as well as the genetically encoded calcium indicator GCaMP6f. Additional ground truth recordings were performed in head-fixed, anesthetized mice in hippocampal area CA3 using the genetically encoded indicator R-CaMP1.0724. Furthermore, we extracted ground truth recordings from raw data of previous publications using Cal-520 and R-CaMP1.07, respectively, in mouse primary somatosensory cortex (S1; total of 21 neurons)25, 26, and from interneurons in mouse primary visual cortex (V1) using OGB-1 in vivo and GCaMP6f in slices (total of 69 neurons)27, 28, complemented with a small new in vivo interneuron dataset using GCaMP6f (4 neurons). In addition, we surveyed openly accessible datasets and extracted ground truth from raw movies (when available) or preprocessed calcium imaging data15, 18, 29–33. Rigorous quality control (Methods) reduced the original number from a total of 193 available neurons to 157 neurons. Together with our own recordings, we assembled 27 datasets comprising a total of 298 neurons, 8 calcium indicators and 9 brain regions in 2 species, totaling ∼38 hours of recording and 495,077 spikes.
Recording durations, imaging frame rates and spike rates varied greatly across ground truth datasets (Table 1). Typical spike rates spanned more than an order of magnitude, ranging from 0.4 to 11.6 Hz, and frame rates varied between 7.7 Hz and >160 Hz (Table 1; Fig. 1g). We used regularized deconvolution to compute the linear ΔF/F kernel evoked by the average spike and found that the area under the kernel curve varied substantially across datasets, even when data were obtained with the same indicator, and was substantially smaller for interneuron datasets, especially for parvalbumin-positive (PV) interneurons (Fig. 1h). Interestingly, kernels varied substantially even across neurons within the same dataset (Fig. 1h, Fig. S1). This diversity highlights the challenge faced by any algorithm that is supposed to generalize to unseen data.
Inference of spike rates with a deep convolutional network
Several favorable properties make supervised deep learning approaches well suited for spike inference from calcium imaging data. First, deep learning generally tends to outperform other classification or regression methods if the amount of training data is sufficiently high (typically >1000 data points for each category in classification tasks)34–36. Second, the cost function can easily be modified to optimize the metric of interest, e.g., correlation with ground truth or mean squared error, without changing network architecture. Third, the temporal extent of receptive fields of deep networks can be adapted to account for history-dependent effects such as the dependence of action potential-evoked calcium transients on previous activity (see Fig. S2 for an example). Finally, deep networks are intrinsically non-linear, allowing them to fit non-linear behaviors of calcium indicators.
We designed a simple convolutional network that uses a segment of the calcium signal trace (expressed as percentage fluorescence change ΔF/F) around a time point t to infer the spike rate at t. Compared to two-dimensional image classification and object labeling34, 37, 38, requirements on computational hardware are low because datasets are small and the inference task is only one-dimensional (time). For example, ImageNet39, a dataset used for visual object identification and detection in the deep learning field, is typically used at a resolution of 256 x 256 = 65,536 data points per sample, whereas the input used for spike inference in this study was smaller by approximately three orders of magnitude, typically consisting of a segment of the ΔF/F trace with 64 data points.
We used a network architecture with a standard convolutional design, consisting of rectifying linear units (ReLUs) that were distributed across three convolutional layers, two pooling layers and a single dense layer. The final dense layer projected to a single output unit that reported the estimated spike rate for the current time t (Fig. 2a; see Methods for more details).
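The elementary building block of such an architecture can be written out explicitly. The sketch below (plain Python; function name, weights and values are our own illustration, not code from the released toolbox) applies a single 1-D convolutional filter followed by a ReLU to a ΔF/F trace; the actual network stacks many learned filters of this kind across its three convolutional layers.

```python
def conv1d_relu(trace, kernel, bias=0.0):
    """One 1-D convolutional filter followed by a rectifying linear
    unit (ReLU): each output sample is a weighted sum over a sliding
    window of the dF/F trace, clipped at zero."""
    k = len(kernel)
    out = []
    for i in range(len(trace) - k + 1):
        s = bias + sum(w * v for w, v in zip(kernel, trace[i:i + k]))
        out.append(max(0.0, s))  # ReLU non-linearity
    return out
```

In the trained network, the kernel weights and bias are found by backpropagation rather than set by hand.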
Resampling of ground truth data for noise-matching
The key idea underlying our approach is that the ground truth (training data) is as important as the algorithm itself and should match as well as possible the noise level and sampling rate of the unseen population calcium data of interest (test data). We therefore devised a workflow where noise level and sampling rate are extracted from the test data and then used to generate noise- and rate-matched training data from the ground truth database (Fig. 2b) by temporal resampling and addition of noise. To facilitate gradient descent, the ground truth spike rate is smoothed with a Gaussian kernel (σ = 0.2 s, unless otherwise indicated; Methods).
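The smoothing of the ground truth spike rate can be sketched as follows (plain Python; the function and variable names are our own): each spike contributes a unit-area Gaussian of width σ, so the resulting trace is a continuous rate in spikes/s whose integral equals the spike count.

```python
import math

def smoothed_rate(spike_times, t_grid, sigma=0.2):
    """Convolve discrete spike times (s) with a unit-area Gaussian
    kernel (sigma in s) to obtain the smooth target spike-rate trace
    used for training; units of the output are spikes/s."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [sum(norm * math.exp(-0.5 * ((t - s) / sigma) ** 2)
                for s in spike_times)
            for t in t_grid]
```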
To extract ΔF/F noise levels, we computed a standardized noise metric ν that is robust against outliers and approximates the standard deviation of ΔF/F baseline fluctuations. This metric was normalized by the square root of the frame rate to allow for comparison of noise measurements across datasets. Consequently, ν has units of %·Hz^(−1/2), which for simplicity we usually omit (Methods; Fig. S3). To generate training data with pre-defined ΔF/F noise levels, we explored several approaches based on sub-sampling of ROIs or additive artificial noise (Supplementary Note ‘Noise-matching of resampled ground truth data’; Fig. S4). We identified the addition of artificial Poisson-distributed noise as the most suitable approach to transform the ground truth data into appropriate training data for the deep network.
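A minimal estimator consistent with the description of ν above (a sketch, not necessarily the exact formula from the Methods) takes the median absolute difference between consecutive ΔF/F samples, which is insensitive to sparse calcium transients, and divides it by the square root of the frame rate:

```python
import math
from statistics import median

def noise_level(dff_percent, frame_rate_hz):
    """Standardized noise metric nu in % * Hz^(-1/2): the median
    absolute first difference of the dF/F trace (robust against
    outliers such as calcium transients), normalized by the square
    root of the frame rate so that values are comparable across
    imaging frame rates."""
    diffs = [abs(b - a) for a, b in zip(dff_percent, dff_percent[1:])]
    return median(diffs) / math.sqrt(frame_rate_hz)
```

Note how a brief, large transient barely moves the median of the frame-to-frame differences, whereas it would dominate a standard-deviation estimate.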
To quantify deep network performance, we developed a set of complementary metrics for the accuracy of spike inference (equations and illustrations in Fig. S5). Following previous studies, we calculated the Pearson correlation between ground truth spike rates and inferred spike rates15, 18. However, this correlation measure of performance leaves the absolute magnitude of the inferred instantaneous spike rate unconstrained. We therefore also computed the positive and negative deviations of the absolute spike rate from the ground truth spike rate. The sum of the absolute deviations was defined as the error while the sum of the signed deviations was defined as the bias of the inference (Methods; Fig. S5). Error and bias were both normalized by the number of true spikes to obtain relative metrics that can be compared between datasets. Among these three metrics (correlation, error, bias), correlation is arguably the most important one because it estimates the similarity of inferred and true spike rates. Error and bias are relevant for the inference of absolute spike rates because they identify spike rates that are either incorrectly scaled or systematically too large or small.
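Under these definitions, the three metrics can be sketched as follows (plain Python with our own naming; both rates are assumed to live on a common time grid with bin width dt, in spikes/s):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between ground truth and inferred rates."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    return cov / var ** 0.5

def error_and_bias(true_rate, inferred_rate, dt):
    """Error: summed absolute deviation of the inferred from the true
    rate; bias: summed signed deviation. Both are normalized by the
    number of true spikes (the integral of the true rate) so they can
    be compared between datasets."""
    n_spikes = sum(true_rate) * dt
    dev = [(b - a) * dt for a, b in zip(true_rate, inferred_rate)]
    return (sum(abs(d) for d in dev) / n_spikes,  # error
            sum(dev) / n_spikes)                  # bias
```

A prediction that overshoots and undershoots symmetrically can thus have a large error but zero bias, which is why both quantities are reported separately.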
The performance of the deep network degraded considerably when the noise level of the test dataset deviated substantially from the noise level of the ground truth. As expected, a network that had only seen almost noise-free data during training failed to suppress fluctuations in noisier recordings. Conversely, we also observed that a network trained on very noisy calcium signals was unable to fully benefit from low-noise calcium recordings, inferring only an imprecise approximation of the ground truth (Fig. 2c). A systematic iteration across combinations of noise levels for training and test datasets showed that for each test noise level the best model had been trained with a similar or slightly higher noise level (Fig. 2d-g; even more apparent when normalizing the metrics for each test noise level in Fig. S6). Very low noise levels (ν < 2) represent a special case (Fig. 2d,e): since some neurons of a given ground truth dataset do not reach the desired noise level even without addition of noise (cf. Table 1), the effective size of the training dataset decreases, resulting in slightly lower performance. In general, however, the results show that it is beneficial to train with noise levels that are adapted to the calcium data for which the algorithm will be applied after training.
Parameter-robustness of spike inference
Traditional models to infer spiking activity typically contain a small number of parameters10–12, 14 that describe biophysical quantities and are adjusted by the user. Deep networks, in contrast, contain thousands or millions of parameters adjusted during training that have no obvious biophysical meaning13, 15. The user can modify only a small number of hyper-parameters that define general properties of the network such as the loss function, the number of features per layer, or the receptive field size, i.e., the size of the input window shown in Fig. 2a. We therefore tested how spike inference performance depends on these hyper-parameters.
We found that the performance of the network was robust against variations of all hyper-parameters (Supplementary Note ‘Dependence of performance on hyper-parameters, overfitting and network architecture’; Fig. S7a-e), allowing us to leave all parameters unchanged for all conditions. Moreover, overfitting was moderate despite prolonged training, indicating that the abundance of noise and sparseness of events act as a natural regularizer (same Supplementary Note; Fig. S7f-h). Finally, we tested different deep learning architectures including non-convolutional or recurrent long short-term memory (LSTM) networks. While very large networks tended to slightly overfit the data, most networks performed almost equally well (same Supplementary Note; Fig. S8). Hence, the expressive power of moderately deep networks and the robustness of back-propagation with gradient descent enable multiple different networks to find good models for spike inference irrespective of the network architecture, hyper-parameter settings and the chosen learning procedure. This robustness of the deep learning approach practically eliminates the need for manual adjustment of hyper-parameters.
Generalization across neurons within the same dataset
Ideally, the ground truth data used to train a network should match the experimental conditions in the test dataset (calcium indicator type, labeling method, concentration levels, brain region, cell type, etc.). To explore spike inference under such conditions we measured how well spike rates of a given neuron within a ground truth dataset can be predicted by networks that were trained using the other neurons in the dataset. First, all ground truth calcium ΔF/F data were resampled to a common sampling rate and adjusted to the same noise levels by adding Poisson noise. If the initial noise level of a given ground truth neuron was higher than the target noise level, the neuron was excluded from the respective noise level analysis. We then evaluated the performance of CASCADE as a function of the noise levels of the (re-sampled) datasets. As expected, correlations increased and errors decreased for lower noise levels, while average biases seemed not to be systematically affected (Fig. S9a-d). Performance metrics also varied considerably across different neurons within a single dataset when resampled at the same noise level ν. To better understand this variability, we performed additional analyses.
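The noise-matching step can be illustrated with a simplified sketch (all names are our own; for simplicity it adds Gaussian rather than the Poisson-distributed noise used in the actual pipeline, and assumes independent noise components add in quadrature):

```python
import math
import random

def add_matched_noise(dff, frame_rate_hz, nu_target, nu_current, seed=0):
    """Raise a ground-truth trace's standardized noise from nu_current
    to nu_target by adding zero-mean Gaussian noise (a stand-in for
    the Poisson-distributed noise described in the Methods)."""
    if nu_target <= nu_current:
        return list(dff)  # neuron already at/above target noise level
    # Only the missing noise component is injected (quadrature rule).
    nu_add = math.sqrt(nu_target ** 2 - nu_current ** 2)
    # For iid Gaussian noise of std s per frame, the median absolute
    # first difference is ~0.6745 * s * sqrt(2); invert that relation
    # to find the per-frame noise std that yields nu_add.
    s = nu_add * math.sqrt(frame_rate_hz) / (0.6745 * math.sqrt(2.0))
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, s) for x in dff]
```

Neurons whose intrinsic noise already exceeds the target are returned unchanged here; in the analysis described above they were instead excluded from the respective noise level.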
First, we found spike-evoked calcium transients to be variable across neurons from the same dataset (Fig. 1h, Fig. S1). Large errors and biases, as well as low correlations, were observed when spike-evoked calcium transients of a neuron deviated strongly from those of other neurons (red arrow in Fig. S9d; cf. Fig. S1r for the respective linear kernels of DS#18).
Second, spike inference may be complicated by movement artifacts or neuropil contamination. Movement artifacts typically had slow onset and offset kinetics (Fig. S10a), or a faster, quasi-periodic temporal structure related to breathing (Fig. S10d-e). Neuropil contamination is often very difficult to distinguish from somatic calcium signals and is particularly severe when neurons are tightly packed and densely labeled1, 40–43 (Fig. S10b). For a subset of datasets, we tested the effect of simple center-surround subtraction of the neuropil signal29. Because subtraction is not perfect, decontaminated datasets still contained residual neuropil signals (Fig. S10b) or negative transients (Fig. S10c). Nonetheless, spike inference was significantly improved by neuropil decontamination (Fig. S11). More detailed inspection of the results showed that CASCADE was able to learn to ignore negative transients and movement artifacts, but only as long as they were distinguishable from true calcium transients. For example, artifacts were only sometimes correctly dissociated from spikes (Fig. S10a-c), limiting the precision of spike inference.
Third, we found that the activity of sparsely spiking neurons is less well predicted since the calcium signal of single action potentials is more likely to be overwhelmed by shot noise, particularly in the high-noise regime (arrows in Fig. S9a,c). We therefore evaluated conditions required for single-spike precision and observed that either shot noise or other noise sources were too prominent in all ground truth datasets to allow for reliable single-spike detection. The trained network thus systematically underestimated single spikes (Fig. S12). This observation held not only for GCaMP indicators, which show a strongly nonlinear relationship between calcium concentration and fluorescence and are therefore less sensitive to isolated single spikes occurring during low baseline activity, but also for synthetic dyes (Fig. S12). These observations indicate that the network needs to learn a tradeoff between false-positive detections of noise events and false-negative detections of single spikes. Further details related to single-spike precision and the possibility to discretize inferred spike rates are discussed in the Supplementary Note ‘Discrete spikes and single-spike precision’.
In summary, we showed that CASCADE is able to generalize to unseen neurons from the same ground truth training set. Not surprisingly, the precision of this generalization decreases for higher noise levels, in particular when spike rates are low. The precision is fundamentally limited by the variability of calcium kernels across neurons and probably also by the non-linearity of GCaMP-like indicators, and precision is further reduced when additional noise (motion artifacts, neuropil contamination) is prominent.
Generalization across datasets
We next explored how spike inference by a network trained on one ground truth dataset generalizes to other datasets. Using all available datasets, we quantified the median performance metrics across all possible combinations of datasets for training and testing (Fig. 3a,c,e) and analyzed the performance of each trained model across test datasets (Fig. 3b,d,f). In most training/test combinations, correlations were high whereas errors and biases remained low. However, models trained with some datasets showed low performance across datasets (e.g., DS#01), and some datasets, often with motion or neuropil contamination artifacts that were larger than typical calcium transients, could not be predicted well by any model trained on another dataset (e.g., DS#01-02, DS#21-23).
The entries of the matrix in Fig. 3 remained highly similar when parameters such as the resampling rate, temporal smoothing of the ground truth or the noise level were modified (Fig. S13). Interestingly, models trained on datasets of excitatory neurons (DS#01-21) also produced high-quality predictions of spike rate variations in interneurons (DS#22-26) (Fig. 3a,b). However, the separate analysis of error and bias revealed that absolute spike rates were substantially underestimated (Fig. 3c-f)44.
Near-maximal correlation for a given dataset was often achieved by multiple models (Fig. 3a). In some datasets, the highest correlation was even achieved when the model was trained on ground truth from another dataset, rather than from the same dataset. Interestingly, the performance of training/testing combinations showed no obvious clustering related to indicator type (e.g., genetically encoded vs. organic indicators) or species (zebrafish vs. mouse). An attempt to explain the mutual predictability of datasets by more refined statistical dataset descriptors such as the mean spike rate or decay times was not very successful (Fig. S14). It is therefore not obvious how to select an optimal training dataset to predict spike rates for an unseen dataset.
To optimize dataset selection and network training for practical applications, we tested an alternative and simpler approach by training a model on all excitatory datasets except DS#01, henceforth called the ‘global EXC model’. We found that this global model performed better than all other models in cross-dataset tests (Fig. 3a-f; the test dataset was always excluded from the training data), not only due to the size but also due to the diversity of the training set (Fig. S15). Compared to randomly selecting a single dataset with excitatory neurons for training, correlations were increased by 0.05±0.05, errors were reduced by 0.05±0.05, and absolute biases were reduced by 0.25±0.90 (median ± s.d.). In addition, the global EXC model performed better than any of the 21 single models in all cross-dataset tests (p < 0.001 for all comparisons, paired signed-rank test). Compared to predictions across neurons within the same dataset (Fig. S9; diagonal elements in Fig. 3), the correlations resulting from the global EXC model were decreased by 0.02±0.04 (p = 0.04, Wilcoxon signed-rank test), errors were increased by 0.33±0.53 (p = 0.01), while the absolute bias was slightly decreased (0.40±0.40, p = 0.002). Hence, using dataset-specific ground truth can yield performance significantly better than the global EXC model. In the absence of such specific calibration data, however, training the algorithm with all available data is a simple and effective strategy to generate a model that generalizes robustly to unseen datasets.
Not surprisingly, a global INH model trained on all interneuron datasets generalized less well than the EXC model (Fig. 3). Indeed it was not more successful than the EXC model in predicting interneuron activity, most likely because the diversity of interneuron types is higher and because the amount of available ground truth was lower.
We also trained a model on a large artificial dataset (250 neurons) that was generated using the calcium imaging simulation environment NAOMi45 (Methods). The model performed well but worse than the global EXC model (correlation reduced by 0.05±0.04, p=0.0003; error slightly increased by 0.06±0.22, p=0.0006; bias not significantly changed, p=0.67; Fig. 3). We hypothesize that some relevant sources of variability at the neuronal level (e.g., variable decay times, transient shapes and non-linearities) are captured by experimental ground truth but not by simulated ground truth recordings. For new indicator types, where biophysical parameters are known but experimental ground truth is not available, ground truth generated by NAOMi could nonetheless be a useful way to train supervised spike inference algorithms.
Comparison with existing methods
To benchmark the performance of CASCADE, we compared it to five other model-based methods. In summary, we found that CASCADE performed better than all other algorithms across datasets, across noise levels, and for different temporal precisions. Moreover, CASCADE showed a less pronounced bias towards underestimating high spike rates.
We compared CASCADE to the fast online deconvolution procedure OASIS with two distinct implementations in CaImAn and Suite2p14, 42, 46, to the discrete change-point detection algorithm by Jewell and Witten47 (here referred to as Jewell&Witten), and to two more complex algorithms, Peeling and MLSpike. Peeling uses iterative template-subtraction to infer discrete spikes10. MLSpike was chosen because it outperformed various other methods in previous applications11, 15. Although model-based methods are, in principle, non-supervised, several parameters need to be tuned to achieve maximal performance on a given dataset12. To avoid sub-optimally tuned algorithms and to make the comparison with CASCADE as fair as possible, we used extensive grid searches to optimize parameter tuning of each algorithm-dataset combination (Methods; see Table 2 for the best model parameters for each dataset as a function of noise). This procedure allowed us to minimize the same loss function for all algorithms (mean squared error between ground truth and the inferred spike rate), using grid search for model-based approaches and backpropagation for CASCADE. Importantly, the neuron used for testing was always omitted during the training/fitting period (leave-one-out strategy). We refer to these models as “tuned” for specific datasets, as opposed to CASCADE’s “global EXC model” that was trained on other datasets (Fig. 3). The Peeling and Jewell&Witten algorithms infer discrete spikes rather than spike rates, which may result in a slight disadvantage. To convert their output to continuous rates, predicted spikes were convolved with a Gaussian kernel of a width that minimized the mean squared error.
The tested algorithms showed systematic differences in performance (Fig. 4a, Fig. S16). A quantitative comparison across all datasets for a fixed noise level revealed that performance varied strongly across ground truth datasets, single neurons, and algorithms (Fig. 4b, Fig. S17). Neurons that could be predicted well by one algorithm could often also be predicted well by other algorithms, in particular with respect to errors and biases (Fig. S17), suggesting that outlier neurons within datasets exhibit unusual properties that lead to biased predictions (cf. Fig. S9). The tuned CASCADE model and CASCADE’s global EXC model produced good predictions for the broadest set of neurons across datasets. The model-based algorithms achieved high performance in fewer datasets. For example, in multiple neurons from diverse datasets, the performance of MLSpike was lower than that of CASCADE (Fig. 4b; datasets #7-8 [GCaMP6f in fish], #15-16 [GCaMP6s in V1], #24-26 [GCaMP6f in interneurons]). These datasets had relatively high (Table 1) and slowly changing spike rates rather than discrete bursts. Interestingly, the Peeling algorithm performed relatively well on some of these datasets. To compare performance across neurons more directly, we calculated the difference in correlation achieved by CASCADE and each other algorithm for each neuron. The resulting distributions (Fig. 4c) show that CASCADE yields better inferences for the majority of neurons across all compared algorithms (p < 10^−10 for all comparisons with other algorithms; p = 0.068 when compared with CASCADE’s global EXC model; paired Wilcoxon signed-rank test).
We found that the performance of CASCADE was consistently better across different recording conditions. First, based on the finding that noise levels affect spike inference more strongly than other parameters (Fig. S13), we repeated the benchmarking in Fig. 4b across multiple noise levels and found that the performance ranking across algorithms was largely maintained (Fig. 4d), with the global EXC model achieving performance close to the tuned CASCADE model (significant difference: p = 0.039, signed-rank test), followed by MLSpike, Peeling, Suite2p and CaImAn with similar performances, then followed by Jewell&Witten (p < 10^−10 for all algorithms). Although the error computed from CASCADE’s predictions was significantly lower than for most other algorithms (p < 0.005 for Jewell&Witten, Suite2p, MLSpike and CASCADE’s global EXC model; p < 0.01 for CaImAn; but p > 0.05 for Peeling; paired Wilcoxon signed-rank test), variability was high and relative effect sizes were low (Fig. 4e). Therefore, errors are not a very sensitive readout of performance. Finally, biases of predictions were negative (indicating underestimates of true spike rates) for all tuned model-based algorithms except for the tuned CASCADE model (Fig. 4f). CASCADE’s global EXC model exhibited the smallest overall bias.
We further found that all algorithms systematically underestimated spike rates as spike rates increased, but this effect was smallest for CASCADE. To visualize these results, we plotted the number of spikes for ground truth and predictions within each 2-s time bin (Fig. S18) and extracted the median lines of these distributions (Fig. 4g). An underestimate of high spike rates may be expected since periods of high spike rates are less frequent. False positive predictions of high spike rates would thus lead to larger performance drops than false negative omissions of rare events.
For spike inference evaluated at higher temporal precision, the performance (correlation with ground truth) dropped for all algorithms, but this effect was more modest for CASCADE than for all model-based algorithms (Fig. S19). We trained all algorithms on a ground truth that was smoothed in time to a variable degree (Gaussian smoothing kernel between σ = 0 ms and σ = 333 ms; default: σ = 200 ms). Example predictions highlight that several algorithms also make impressive predictions under these more difficult conditions (Fig. S19a), but some algorithms, in particular those based on discrete events (Peeling, Jewell&Witten), could not express graded certainty about spike times and therefore performed less well (Fig. S19b). However, the performance of MLSpike, CaImAn and Suite2p also dropped faster than that of CASCADE when spike rates were evaluated with increasing temporal precision (Fig. S19c).
Overall, predictions across algorithms were not only similar at the level of neurons (Fig. 4b, Fig. S17) but were also highly correlated, together with their errors, over the time course of single neurons (Fig. 4h-j). The shared variability, measured as the median correlation between predictions, was particularly high for the two closely related algorithms, Suite2p and CaImAn, but also high between CASCADE and MLSpike. Indeed, the correlation between CASCADE and MLSpike was as high as the correlation between CASCADE and the ground truth (Fig. 4h, bottom). To better understand these similarities, we explored false predictions shared by algorithms. To this end, we computed the similarities (correlations) of the unexplained, residual variances across algorithms. These shared errors were relatively prominent across algorithms (Fig. 4i,j). In particular, errors made by CaImAn, Suite2p, Peeling and Jewell&Witten were often correlated, but CASCADE and MLSpike also shared a relatively large fraction of unexplained variance. We further divided the unexplained variance into false positives (predictions higher than the ground truth) and false negatives (predictions lower than the ground truth). False negatives but not false positives were highly correlated across most algorithms, with the exception of Suite2p and CaImAn, which also shared false positives (Fig. 4j, right). Shared false negatives are clearly visible in typical predictions (red arrows in Fig. 4a, and more prominently in Fig. S16). Together, these analyses reveal highly similar predictions and similar missed spike events across algorithms.
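The residual analysis described here can be made concrete with a short sketch (our own naming): the unexplained residual of each prediction is split into a false-positive and a false-negative component, which can then be correlated between algorithms to quantify their shared errors.

```python
def residual_split(true_rate, predicted_rate):
    """Split the unexplained residual (prediction minus ground truth)
    into a false-positive component (overestimates) and a
    false-negative component (underestimates)."""
    resid = [p - t for p, t in zip(predicted_rate, true_rate)]
    false_pos = [max(r, 0.0) for r in resid]  # predicted too high
    false_neg = [min(r, 0.0) for r in resid]  # predicted too low
    return resid, false_pos, false_neg
```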
Finally, we compared practical aspects arising during the application of different algorithms. With respect to processing speed, we found that CASCADE (based on a GPU), CaImAn, and Jewell&Witten performed similarly fast (200k-300k samples/s). They were only outperformed by Suite2p (more than 5M samples/s). Peeling (ca. 5k samples/s) and in particular MLSpike (ca. 0.8k samples/s) were much slower, owing to their more complex fitting procedures. For optimization, CASCADE uses backpropagation, which is almost as fast as inference, resulting in a total training time of <10 min for a typical ground truth dataset with 5M data points and a realistic number of 20 iterations (epochs) through the dataset. For model-based algorithms, extensive grid searches across parameters must be performed. We usually performed grid searches in a 2D parameter space with 100-500 parameter combinations, which is feasible within minutes for Suite2p, CaImAn and Jewell&Witten. For Peeling and MLSpike, this procedure would take several days for a single model. In our analyses, we therefore reduced the number of training samples for MLSpike and Peeling such that the respective grid searches lasted approximately 2 hours per model for MLSpike. Furthermore, we found that best-fit parameters for model-based approaches changed systematically with different noise levels, suggesting that new models have to be fit for each noise level (Table 2, supplementary material), an effect that was more pronounced for algorithms that do not use the noise level as an input (Suite2p and Jewell&Witten). As another possible drawback of some model-based algorithms, we found that inferred spike rates were often temporally shifted to later time points, and this delay was variable across datasets (0.16±0.14 s for MLSpike, 0.03±0.09 s for Peeling, 0.31±0.23 s for CaImAn, 0.29±0.22 s for Suite2p and 0.27±0.19 s for Jewell&Witten; mean delay ± s.d. across datasets). We corrected these shifts for all analyses presented here.
However, using these algorithms without such a correction could result in misleading interpretations of the temporal dynamics of the inferred spike rates. This can be prevented by using an algorithm that does not induce strong delays (e.g., Peeling) or a model-free algorithm that directly learns the best delay from ground truth (CASCADE). Together, these aspects show that, unlike model-based algorithms, CASCADE can make use of ground truth datasets in an efficient and natural way.
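Such delays can be quantified as the lag that maximizes the cross-correlation between inferred and ground truth rates; the sketch below is a minimal numpy illustration under our own conventions, not the procedure used to compute the delays reported above.

```python
import numpy as np

def estimate_delay(inferred, truth, frame_rate, max_lag_s=1.0):
    """Estimate the temporal shift of an inferred spike rate relative to the
    ground truth as the lag (in seconds) that maximizes the cross-correlation
    between the two mean-subtracted traces."""
    max_lag = int(max_lag_s * frame_rate)
    lags = np.arange(-max_lag, max_lag + 1)
    a = inferred - inferred.mean()
    b = truth - truth.mean()
    cc = [np.dot(np.roll(a, -lag), b) for lag in lags]
    return lags[int(np.argmax(cc))] / frame_rate

# A prediction shifted by 3 frames at 30 Hz should yield a delay of 0.1 s:
truth = np.zeros(300)
truth[[50, 120, 200]] = 1.0
delayed = np.roll(truth, 3)
delay = estimate_delay(delayed, truth, frame_rate=30.0)
```

Shifting the inferred trace back by the estimated delay then aligns it with the ground truth before any further comparison.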
Application to large-scale population calcium imaging datasets
A transformation of calcium signals into estimates of spike rates may be desired for multiple reasons. First, the reconstruction of spike rates can recover fast temporal structure in neuronal activity patterns that is obscured by slower calcium signals4, 5. Second, calcium signals contain shot noise and potentially other forms of noise that are unrelated to neuronal activity. A method that infers spike rates while ignoring noise can therefore de-noise activity measurements without the detrimental effects of over-expressed indicators1, 48 and without compromising temporal resolution. Third, while calcium signals usually represent relative changes in activity, spike rates provide absolute activity measurements that can be compared more directly across experiments. With these potential goals in mind, we applied CASCADE to different large-scale calcium imaging datasets.
In a brain explant preparation of adult zebrafish23, we measured odor-evoked activity in the posterior part of telencephalic area Dp (pDp), the homolog of piriform cortex, using OGB-1. Multi-plane two-photon imaging49 was performed with the same recording conditions as in DS#04 at a noise level of 2.36±0.97 (% · Hz−1/2; median ± s.d.) across 1,126 neurons. Under these conditions, predictions are expected to be highly accurate (Fig. S9a,e; correlation to ground truth: 0.87±0.06 for a standardized noise level of 2, median ± s.d.; Gaussian smoothing of the ground truth with σ=0.2 s). Consistent with previous electrophysiological recordings50, spiking activity estimated by CASCADE with a model trained on DS#04 was sparse (0.6±1.1 spikes during the initial 2.5 s of the odor response; mean ± s.d.; Fig. 5a) and variable across neurons (Fig. 5b), with a clear difference between the anatomically distinct dorsal and ventral regions of pDp (0.07±0.11 vs. 0.21±0.11 Hz; entire recording).
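The standardized noise level quoted here (in % · Hz−1/2) can be computed as sketched below, assuming the metric is the median absolute difference between consecutive ΔF/F samples scaled by the square root of the frame rate (our reading of the paper's noise measure; a minimal numpy sketch).

```python
import numpy as np

def noise_level(dff_percent, frame_rate):
    """Standardized noise level in % / sqrt(Hz): median absolute difference
    between consecutive dF/F samples (in %), divided by sqrt(frame rate),
    which makes the estimate comparable across sampling rates."""
    return np.median(np.abs(np.diff(dff_percent))) / np.sqrt(frame_rate)

# Pure Gaussian noise of 1% s.d. recorded at 30 Hz:
rng = np.random.default_rng(0)
dff = rng.normal(0.0, 1.0, 30000)
nu = noise_level(dff, 30.0)
```

Because the median is taken over successive differences, slow calcium transients contribute little, and the measure mostly reflects the shot-noise floor of the recording.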
The comparison of ΔF/F signals and inferred spike rates showed that CASCADE detected phases of activity but effectively suppressed small irregular fluctuations in activity traces, indicating that spike inference suppressed noise. Consistent with this interpretation, spike inference by CASCADE increased the correlation between time-averaged population activity patterns evoked by the same odor stimuli in different trials (Fig. 5c,d).
Previous studies showed that odor-evoked population activity in pDp is dynamic50, 51 but the fine temporal structure has not been explored in detail. We applied an unsupervised non-negative matrix factorization method for sequence detection (seqNMF, ref. 52) to the inferred spike rate patterns to identify recurring short (2.5 s) sequences of population activity (hereafter referred to as factors) embedded in the overall population activity. This analysis required a high effective temporal resolution because factors exhibited rich temporal structure on a sub-second timescale (Fig. 5e). We found multiple factors that occurred with high precision and in a stimulus-specific manner at different phases of the odor response. For example, factors #2 and #4 in Fig. 5e were transient and associated with response onset, factor #5 persisted during odor presentation, and factor #6 was activated after stimulus offset (Fig. 5e). Odor-evoked population activity in pDp therefore shows complex dynamics on timescales that cannot be resolved without temporal deconvolution of calcium signals. The transformation of calcium signals into spike rate estimates by CASCADE therefore provides interesting opportunities to use calcium imaging for the analysis of fast network dynamics.
We next analyzed the Allen Brain Observatory Visual Coding dataset, comprising >400 experiments in mice with transgenic GCaMP6f expression, each consisting of approximately 100-200 neurons recorded at very low noise levels (0.94±0.25 % · Hz−1/2; mean ± s.d.; Fig. 6a)53. Using the global EXC model of CASCADE, we estimated the absolute spike rates across all neurons for different transgenic lines (Fig. 6b; Fig. S20; Gaussian smoothing of the ground truth with σ=0.05 s). Spike rates across all 38,466 neurons were well described by a lognormal distribution centered around 0.1-0.2 Hz (Fig. 6c). Given the sampling rate (30 Hz) and noise level of this dataset, we expect a correlation of 0.89±0.18, an error of 0.70±0.96 and a bias of 0.27±1.00 (median ± s.d. across neurons), based on our previous cross-dataset comparisons, which also covered ground truth for transgenic lines included in this population imaging dataset (Fig. 3). Since we could not broadly test the generalization across a large set of interneuron datasets (Fig. 3), we did not include interneuron experiments in our analysis. Spike rates varied systematically across cortical layers, with highest activity in layer 5 (Fig. 6d,e), across transgenic lines (Fig. 6d) and across the stimuli presented, with highest activation during naturalistic stimuli (natural scenes or movies; Fig. 6e). These results provide a comprehensive description of neuronal activity in the visual system of the mouse and reveal systematic differences in neuronal activity across cell types, brain areas, cortical layers, and stimuli.
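The lognormal characterization can be illustrated with synthetic data (all numbers hypothetical, chosen near the reported range): a lognormal distribution is fully described by the mean and s.d. of the log rates, and its center (median) equals exp(mean(log r)).

```python
import numpy as np

# Synthetic per-neuron spike rates drawn from a lognormal distribution
# (center 0.15 Hz and spread 0.8 log-units are hypothetical values).
rng = np.random.default_rng(1)
rates = rng.lognormal(mean=np.log(0.15), sigma=0.8, size=38466)

# Fit in log space: the median of the lognormal is exp(mean(log r)).
log_r = np.log(rates)
median_rate = float(np.exp(log_r.mean()))
```

In practice, plotting a histogram of log rates and checking it against a Gaussian is a quick visual test of lognormality.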
Raw ΔF/F traces often exhibited correlated noise, visible as vertical striping in matrix plots, which was small for individual neurons but tended to dominate the mean ΔF/F across neurons, possibly due to technical noise or neuropil signal (Fig. 6f). CASCADE effectively eliminated these artifacts (Fig. 6g). As a consequence, correlations between activity traces of different neurons were reduced across all experiments by 38±43% (mean ± s.d.; Fig. 6h; p < 10⁻¹⁵, paired signed-rank test). Many analyses of neuronal population activity require accurate measurements of pairwise neuronal correlations53–56. Noise suppression by spike inference can therefore help to make these analyses more reliable.
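Why removing a shared noise component reduces pairwise correlations can be seen in a toy simulation (purely illustrative numpy, not CASCADE itself): independent neurons contaminated by a common artifact appear correlated, and subtracting the artifact restores near-zero correlations.

```python
import numpy as np

def mean_pairwise_corr(traces):
    """Mean Pearson correlation over all neuron pairs; traces is (neurons, time)."""
    c = np.corrcoef(traces)
    iu = np.triu_indices_from(c, k=1)
    return float(c[iu].mean())

# Toy simulation: independent neurons plus a shared, neuropil-like artifact.
rng = np.random.default_rng(2)
shared = rng.normal(0.0, 1.0, 1000)
traces = rng.normal(0.0, 1.0, (20, 1000)) + shared
corr_raw = mean_pairwise_corr(traces)                # inflated by the artifact
corr_denoised = mean_pairwise_corr(traces - shared)  # close to zero
```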
Together, these examples illustrate how calibrated spike inference by CASCADE can be applied to perform comprehensive analyses of neuronal activity, to identify complex temporal structure in neuronal population dynamics, and to remove shot noise and other noise from calcium imaging data.
A user-friendly toolbox for spike inference
The deployment of spike inference tools often suffers from practical problems. First, the difficulty of setting up a computational pipeline might prevent widespread use. To address this problem, we generated a cloud-based solution using Colab Notebooks that can be applied without local installations. We also set up a well-documented Github repository (https://github.com/HelmchenLabSoftware/Cascade) containing ground truth datasets, pre-trained models, notebooks and demo scripts that can be easily integrated into existing analysis pipelines such as CaImAn, SIMA or Suite2p42, 46, 57. Since the algorithm works on regular laptops and workstations without GPU support, the main installation difficulties of typical deep learning applications are circumvented.
In a typical workflow, first the noise level for each neuron in a calcium imaging dataset is determined. Then, a pre-trained model that has been trained on noise-matched resampled ground truth data is loaded from an online model library and applied to the ΔF/F data without any need to adjust parameters. In addition, CASCADE can be easily modified and retrained to address additional specific needs, for example, more complex loss functions22 or a modified architecture. Also, the resampled ground truth can be adapted directly if desired. For example, we used a Gaussian kernel to smooth the ground truth spike rate, but this standard procedure can be disadvantageous to precisely determine the onset timing of discrete events. In CASCADE, it is simple to replace the Gaussian kernel by a causal smoothing kernel to circumvent this problem (Fig. S21).
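The noise-matched model-selection step of this workflow can be sketched as follows (model names and the discrete noise grid are hypothetical, not the actual model library): each neuron's measured noise level is mapped to the pre-trained model with the closest training noise level.

```python
import numpy as np

# Hypothetical model library: pre-trained models indexed by the noise level
# (in % / sqrt(Hz)) of the resampled ground truth they were trained on.
available_models = {2: "GCaMP6f_30Hz_noise2",
                    4: "GCaMP6f_30Hz_noise4",
                    8: "GCaMP6f_30Hz_noise8"}

def select_model(neuron_noise_level):
    """Pick the pre-trained model whose training noise level is closest to
    the noise level measured for this neuron."""
    grid = np.array(sorted(available_models))
    best = grid[int(np.argmin(np.abs(grid - neuron_noise_level)))]
    return available_models[int(best)]
```

The actual toolbox automates this lookup so that the user never touches model parameters directly.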
A second problem is that experimenters may need additional tools and documentation to interpret the results. To address this issue, we included graphical outputs and guiding comments that are accessible to non-specialists throughout the demo scripts. Together with existing literature on the appropriate interpretation of raw calcium data4, 5, 22, 45, 58, these subsidiary tools will help to focus attention on data quality and make users aware of the potentials and limitations of raw and deconvolved data.
DISCUSSION
We have created new tools and resources for spike inference from calcium imaging data. Any spike inference approach, and in particular any method based on deep learning, critically depends on the availability and quality of ground truth datasets. We therefore created a ground truth database that is larger and more diverse than previous datasets15, 18, spanning multiple calcium indicators, brain regions and species (Fig. 1). Moreover, we developed CASCADE, a novel algorithm for spike inference based on deep learning. The central idea of CASCADE is to optimize the match between the training data and experimental datasets, rather than to invest primarily in the optimization of the inference algorithm itself. Previous supervised spike inference algorithms typically trained models with an existing, immutable ground truth15, 18, 19. Our algorithm, in contrast, resamples the ground truth datasets on demand to match both frame rate and noise level in an automated fashion for each neuron (Fig. 2). Training with an appropriately resampled and noise-matched ground truth resulted in improved inference, highlighting the importance of training the algorithm not only with realistic calcium signals but also with realistic noise patterns.
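The noise-matching part of this resampling can be sketched as follows (a simplified illustration assuming additive Gaussian noise and a median-difference noise metric; the actual resampling additionally matches the frame rate): noise is added to a cleaner ground truth trace until its standardized noise level reaches that of the unseen data.

```python
import numpy as np

def match_noise(dff_percent, frame_rate, target_noise):
    """Add Gaussian noise to a cleaner ground truth dF/F trace (in %) so that
    its standardized noise level (median absolute difference of consecutive
    samples divided by sqrt(frame rate)) reaches the target of the unseen
    dataset. Simplified sketch under a Gaussian-noise assumption."""
    def nu(x):
        return np.median(np.abs(np.diff(x))) / np.sqrt(frame_rate)
    current = nu(dff_percent)
    if current >= target_noise:
        return dff_percent  # already at least as noisy as the target
    # For iid Gaussian noise of s.d. sigma, nu = sigma * sqrt(2) * 0.6745
    # / sqrt(frame_rate); independent noise sources add in quadrature.
    needed = np.sqrt(target_noise ** 2 - current ** 2)
    sigma = needed * np.sqrt(frame_rate) / (np.sqrt(2) * 0.6745)
    rng = np.random.default_rng(0)
    return dff_percent + rng.normal(0.0, sigma, dff_percent.size)

out = match_noise(np.zeros(30000), 30.0, 2.0)  # target: 2 % / sqrt(Hz)
measured = np.median(np.abs(np.diff(out))) / np.sqrt(30.0)
```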
The generalization of spike inference methods across unseen datasets had been investigated sporadically12, 13, 21 but never systematically, presumably due to the lack of extensive ground truth data. Taking advantage of our diverse ground truth database, we explored how predictions depend on species (zebrafish or mouse), indicator type and brain region (Fig. 3), as well as on other experimental parameters that are presumed to strongly influence spike inference. Surprisingly, we found that some training datasets allowed for efficient generalization across these parameters, and a combined training dataset achieved very high performance across all test data. This result was obtained for both excitatory neurons and interneurons, although absolute spike rates were underestimated for interneuron data. Moreover, some datasets performed poorly as training sets while others performed poorly as test sets, even when compared against datasets with a similar indicator and/or from the same brain region. These observations suggest that generalization is significantly affected by experimental differences that are difficult to identify, such as indicator concentration or baseline calcium concentrations. However, this problem could be overcome by training networks on a diverse ground truth database, indicating that networks can learn to take these variations into account when sufficient information is provided during training. Highly efficient generalization across excitatory datasets was obtained by the ‘global EXC model’ that was trained on all available excitatory ground truth datasets. This global model is therefore well-suited for practical applications of spike inference in unseen datasets.
In all investigated situations, our algorithm outperformed existing approaches (Fig. 4). Predictions were not only more precise, as measured by correlation metrics, but also less biased towards underestimates of true spike rates. We reason that the balance between spike detection and noise suppression is crucial for reliable spike inference. Our results suggest that less expressive models have to suppress signal together with noise when selective noise suppression is not possible. In contrast, the expressiveness of the employed deep networks enables CASCADE to better distinguish signal from noise, while their relatively small size prevents overfitting. It is however possible that other algorithms might perform better in regimes that are not covered by the current ground truth database (e.g., recordings with extremely low noise levels that can easily resolve single spikes, or tonically spiking neurons that are transiently inhibited59).
CASCADE was not sensitive to user-adjustable hyper-parameters or to the class of deep network tested. This insensitivity has two consequences. First, it seems more valuable to us to optimize the acquisition of more specific and diverse ground truth and the preprocessing of calcium data rather than to focus on improvements of the deep networks. Second, because hyper-parameters do not need to be adjusted by the user, the application of spike inference becomes simple in practice. While some previous studies assumed that user-adjustable parameters in model-based algorithms increase the interpretability of the model10–13, we argue here that (1) biophysical model parameters are often ambiguous12 and therefore not directly interpretable, and (2) it is more important to focus on the interpretability of the results rather than of the model. To this end, our toolbox provides methods to estimate the errors made during spike inference. Moreover, we included detailed documentation in the Colaboratory Notebook to help users interpret the results.
Quantitative inference of spike rates is critical for the analysis of existing and future calcium imaging datasets4, 5, 22. The approach usually requires single-neuron resolution and is less well suited when ΔF/F values are poorly defined (e.g., in endoscopic one-photon data with high background fluorescence, fiber photometry or wide-field calcium imaging). Moreover, ΔF/F can, in theory, only report spike rate changes. Nevertheless, we found that absolute spike rates can be reliably inferred when the baseline activity is sufficiently sparse to enable the determination of the fluorescence baseline level F0, which was the case in all datasets examined here (Fig. 5,6). The enhanced temporal resolution will be particularly useful for the analysis of neuronal activity during natural stimulus sequences and behaviors that occur on timescales shorter than typical durations of calcium transients. For example, the deconvolution of calcium signals can increase the effective temporal resolution to timescales below 100 ms, which will allow for the analysis of neuronal representations across theta cycles60 and for the resolution of early and late dynamics in cortical responses to sensory inputs that have been associated with different processing steps61. Moreover, the inference of absolute spike rates will help improve the calibration of precisely patterned optogenetic manipulations62–64 and the extraction of constraints, e.g., absolute spike rates, for computational models of neural circuits.
The reliability of spike inference obviously depends on the recording quality of the calcium imaging data. To improve the quality of ΔF/F signals, future work should focus on the reduction of movement artifacts and neuropil contamination, both by experimental design45, 65 and by extraction methods40–43, including the correct estimation of the F0 baseline despite unknown background fluorescence. In the long term, the development of more linear calcium indicators66 and especially the acquisition and integration of more specific ground truth, e.g., for additional interneurons and subcortical brain regions, will enable quantitative spike inference for an even broader set of experimental conditions. We envision that our set of ground truth recordings will grow over time, making it possible to train increasingly specific models for reliable inference of spike rates.
CONTRIBUTIONS
P.R. conceived the project, developed the algorithm, performed ground truth recordings (datasets 4-8), performed all analyses, developed the toolbox and wrote the paper. S.C. performed ground truth recordings (datasets 18 and 19). A.H. developed the toolbox. M.E. and K.K. (dataset 3), A.K. and Y.D. (datasets 2, 22 and 23), A.B. and S.H. (datasets 24-27) performed and preprocessed ground truth recordings. F.H. supervised ground truth recordings (datasets 18 and 19) and the development of the toolbox, and wrote the paper. R.W.F. supervised ground truth recordings (datasets 4-8) and the development of the algorithm, and wrote the paper.
COMPETING INTERESTS
The authors declare no competing interests.
METHODS
Simultaneous juxtacellular recordings and calcium imaging in adult zebrafish
All zebrafish experiments were approved by the Veterinary Department of the Canton Basel-Stadt (Switzerland). For the recordings in DS#04 and DS#05, the adult zebrafish brain was dissected ex vivo as described previously23 and OGB-1 AM or Cal-520 AM was injected and incubated in posterior Dp (pDp) as described previously67. During the dissection, the dura mater above pDp was carefully removed to prevent clogging of the patch pipette. Calcium indicators were injected for 1-2 min at two locations (injection 1: ∼210 μm dorsal from the ventralmost aspect of Dp and ∼130 μm from the lateral surface of Dp; injection 2: 180 μm and 60 μm), and the injections were monitored by snapshot multiphoton images. The pressure was adjusted to avoid fast swelling of the tissue.
Juxtacellular recordings were performed >1h and <4h after the dye injection. Patch pipettes were pulled from 1 mm borosilicate glass capillaries (Hilgenrath), with a pipette resistance of 5-8 MΩ. Micropipettes were backfilled with ACSF (in mM: 124 NaCl, 2 KCl, 1.25 KH2PO4, 1.6 MgSO4, 22 D-(+)-Glucose, 2 CaCl2, 24 NaHCO3; pH 7.2; 300-310 mOsm) with 0.05 mM Alexa 594.
The explant preparation was rotated about the anterior-posterior axis to allow for optical access from the side (sagittal imaging). Using a multiphoton microscope, images generated from fluorescence and from the asymmetry of the signal on a four-quadrant detector for transmitted light were used to target the pipette to pDp, while continuous low pressure (30-40 mbar) was applied to prevent clogging of the pipette. The pipette then entered the tissue with initial high pressure (90-110 mbar) that was lowered after a few seconds. Neurons were approached using the shadow-patching technique described previously51, 68, but with lower pressure. Juxtacellular recordings were performed after establishing a loose seal (typically 30-50 MΩ) with a target neuron. In some cases, a small negative pressure was applied initially to improve the electrical contact with the target cell. In several cases, single micropipettes were used multiple times. Recordings were performed in voltage-clamp mode with the voltage adjusted such that the resulting current approximated zero69.
For DS#06-08, which were based on a transgenic line expressing GCaMP6f in the forebrain49, the experimental procedures were similar except for the injection of synthetic dyes. Because the baseline brightness of GCaMP6f is low, it was often difficult to identify individual neurons. Upon application of odor stimuli, stimulus-responsive neurons that expressed GCaMP6f became brighter, which permitted reliable visual identification for targeted patching. For regions in dD (DS#07) with no obvious odor responses, cells were patched randomly based on shadow images generated by the ejected Alexa dye68.
Simultaneous recordings of fluorescence and extracellular spikes of the same neuron were synchronized using Scanimage 3.8 for imaging70 and Ephus for electrophysiology71. Calcium imaging was performed at intermediate zoom (see Fig. 1) with a frame rate of 7.5 or 7.8125 Hz for DS#04 and DS#05 and at high zoom with a frame rate of 30 Hz for DS#06-08. Electrophysiological recordings were low pass-filtered at 4 kHz (4-pole Bessel filter) and sampled at 10 kHz.
Recordings were performed in 120-s episodes, with repeated stimulation of the nose with food extract odorants as described previously51. In pDp, spike rates are usually very low. When no spiking activity was observed, the holding potential of the pipette was set to higher values (between +5 and +30 mV), leading to an extracellular current that depolarized the neuron if the seal resistance was sufficiently high, resulting in artificially generated spikes. If no spikes could be elicited over the full duration of the recording, the recording was not included in the ground truth dataset.
Anatomical location and further information about neurons in zebrafish ground truth datasets
DS#04: OGB-1, injected in the posterior part of the olfactory cortex homolog (pDp) in adult zebrafish. Recordings were performed throughout dorsal and ventral compartments of pDp and OGB-1 was injected as described previously67. Because OGB-1 localizes predominantly to the nucleus and because the resolution was high, neuropil contamination is negligible in this dataset.
DS#05: Cal-520, injected in the posterior part of the olfactory cortex homolog (pDp) in adult zebrafish. Same brain region as DS#04. Unlike OGB-1, Cal-520 is primarily cytoplasmic, resulting in considerable neuropil contamination. Cal-520 spread less than OGB-1 after injection and labeled only a small central volume in pDp.
DS#06: tg(NeuroD:GCaMP6f), anterior part of the olfactory cortex homolog (aDp) in adult zebrafish. In this transgenic line, GCaMP6f is strongly expressed throughout Dp. Recording location and framerate were chosen to match previous experiments51.
DS#07: tg(NeuroD:GCaMP6f), dorsal part of the dorsal pallium (dD) in adult zebrafish. All recorded neurons in dD were mapped onto brain regions Dm, Dl, rDc and cDc based on neuroD expression in the dorsal part of the dorsal pallium (Fig. S22, following Huang et al., 2020, ref. 72). Although the dorsal pallium is not known to be directly involved in olfactory processing, we noticed that several neurons were clearly inhibited during odor stimulation (duration, 10 - 30 s).
DS#08: tg(NeuroD:GCaMP6f), olfactory bulb (OB) in adult zebrafish. In the olfactory bulb of this transgenic line, GCaMP6f is restricted to a distinct, small subset of putative mitral cells and interneurons49. Neurons #1-#3, #5 and #7 were identified as interneurons based on their small size and morphology, while neurons #4, #6, #8 and #9 were classified as putative mitral cells.
Simultaneous juxtacellular recordings and calcium imaging in anesthetized mice
For virus-induced expression of R-CaMP1.07 (DS#19), AAV1-EFα1-R-CaMP1.07 and AAV1-EFα1-DIO-R-CaMP1.07 were stereotactically injected under isoflurane anaesthesia in barrel cortex of C57BL/6J mice and hippocampal area CA3 of tg(Grik4-cre)G32-4Stl mice as described previously25. We performed combined electrophysiology and in vivo calcium imaging in acute experiments in anesthetized animals (n = 3; at least two weeks after virus injection) as described previously for barrel cortex recordings25. A stainless steel plate was fixed to the exposed skull using dental acrylic cement. A 1x1 mm2 craniotomy was made over barrel cortex. The dura mater was cleaned with Ringer’s solution (containing in mM: 135 NaCl, 5.4 KCl, 1.8 CaCl2, 5 HEPES, pH 7.2 with NaOH) and carefully removed. To reduce tissue motion caused by heartbeat and breathing, the craniotomy was filled with low-concentration agarose gel and gently pressed with a glass coverslip. For CA3 recordings (dataset #18), a 4-mm Ø craniotomy was centred over the injection site. The overlying cortex was aspirated until the corpus callosum became visible. The cavity was filled with 1% agarose gel to reduce tissue motion. Juxtacellular recordings from R-CaMP1.07-expressing neurons were obtained with glass pipettes (4–6 MΩ tip resistance) containing Ringer’s solution. For pipette visualization, Alexa-488 (Invitrogen) was added to the solution or pipettes were coated with BSA Alexa-594 (Invitrogen). Action potentials were recorded in current clamp at 10 kHz sampling rate using an Axoclamp 2B amplifier (Axon Instruments, Molecular Devices) and digitized using Clampex 10.2 software. All experimental procedures were conducted in accordance with the ethical principles and guidelines for animal experiments of the Veterinary Office of Switzerland and were approved by the Cantonal Veterinary Office in Zurich.
For dataset #03, C57BL/6J male mice were anesthetized by intraperitoneal injection of 1.9 mg/g urethane and the skull was partly exposed and attached to a stainless steel frame as described previously26. In a small craniotomy of the barrel cortex, we removed the dura, filled the cranial window with 1.5% agarose and placed a coverslip over the agarose to minimize brain movements26. Cal-520 AM together with an Alexa dye were bolus-loaded in layer 2/3 of the barrel cortex (200–300 μm deep below the surface) and monitored by two-photon imaging on the Alexa channel26. Calcium imaging was performed more than 30 min after dye ejection. For simultaneous calcium imaging and loose-seal cell-attached recordings, we filled glass pipettes (5–7 MΩ) with the extracellular solution containing Alexa 594 (50 μM), inserted pipettes into the barrel and targeted Cal-520-loaded somata. At about 10 min after the establishment of the cell-attached configuration, we performed simultaneous loose-seal cell-attached recording and high-speed line-scan calcium imaging (500 Hz) on the soma of cortical neurons as described26. The electrophysiological data were filtered at 10 kHz and digitized at 20 kHz by using Multiclamp 700B and Digidata 1322A (Molecular Devices), and acquired using AxoGraph X (AxoGraph).
Datasets #24-#26 were recorded in slices of mouse visual cortex as described previously27. Interneurons were targeted by injecting GCaMP6f-expressing AAV1 virus into PV-Cre, VIP-Cre or SOM-Cre mice. Coronal slices were cut with a thickness of 350 µm and loose-patch recordings were performed at 32°C in ACSF. To induce activity in otherwise quiet slices, a potassium-based solution was applied to the slice through a second pipette. Simultaneous calcium imaging was performed with a two-photon microscope recording at 34 Hz through a 16x water immersion objective (0.8 NA, Nikon) as described27. Dataset #27 was recorded in anaesthetized mice as described previously73. Adult (> 8 weeks) PV-tdTomato mice (cross between Rosa-CAG-LSL-tdTomato (JAX: 007914) and PV-Cre (JAX: 008069)) were injected with GCaMP6f-AAV (AAV1.Syn.GCaMP6f.WPRE.SV40, UPENN) in primary visual cortex (V1, ∼2.5 mm lateral, ∼0.7 mm anterior of the posterior suture). Acute recordings were performed at least 2 weeks after the initial injection. Mice were initially anaesthetized with a mixture of fentanyl (0.05 mg/ml), midazolam (5.0 mg/kg), and medetomidin (0.5 mg/kg), a metal headplate was fixed on the skull and a craniotomy was opened above V1. Anesthesia was maintained with a low concentration of isoflurane (0.5% in O2). Borosilicate glass pipettes (6–8 MΩ) filled with a solution containing 110 mM potassium gluconate, 4 mM NaCl, 40 mM HEPES, 2 mM ATP-Mg, 0.3 mM GTP-NaCl, and 0.03 mM Alexa 594 (adjusted to pH 7.2 with KOH, ∼290 mOsm) were lowered into the visual cortex. Neurons expressing GCaMP6f and tdTomato were targeted for juxtacellular recording in loose-cell configuration under a two-photon microscope. For simultaneous electrophysiological and optical recordings, calcium signals were recorded with Scanimage70 at 30 Hz, and juxtacellular voltage was recorded using a Multiclamp 700B amplifier (Axon Instruments, USA), filtered at 20 kHz and digitized at 10 kHz (National Instruments, USA). 50 Hz line noise was reduced using a noise eliminator (Humbug).
Datasets #02, #22 and #23 were recorded in mouse primary visual cortex as described previously28. GFP-GIN mice were used to target SOM interneurons, PV-Cre mice crossed with loxP-flanked tdTomato reporter mice were used to target PV interneurons, and CaMKIIα-Cre mice crossed with loxP-flanked tdTomato mice were used to target excitatory neurons. 1 h after loading of OGB-1 into V128, two-photon microscopy was used to target neurons 150-300 µm below the brain surface with the recording pipette, while the mouse was anesthetized with intraperitoneal injection of urethane and chlorprothixene. Two-photon imaging of neurons was performed with a 40x objective at a frame rate of 15.6 Hz while voltage was recorded in a loose-cell configuration from the same neuron as described28.
Analysis of simultaneous juxtacellular recordings and calcium imaging
Movies of calcium indicator fluorescence images were corrected off-line for movement artifacts, i.e., slow drifts due to relaxation of the brain tissue for zebrafish data or fast movement artifacts for recordings in anesthetized mice. Ground truth recordings from DS#03 were not corrected for movement artifacts due to the scanning modality (line-scan). Afterwards, regions of interest (ROIs) were manually drawn using a custom-written software tool (https://git.io/vAeKZ)51 for each trial to select pixels that reflected the calcium activity of the neuron. Fluorescence traces were extracted either as average across the ROI or individually for each pixel to allow for both natural and artificial sub-sampling of calcium signal noise levels (Fig. S4).
Spike times were extracted from juxtacellular recordings using a custom-written template-matching algorithm. In brief, peaks of the first derivative of a 1 kHz-filtered electrophysiological signal were detected using a threshold that differed between recordings and that was manually adjusted to safely exclude false positives. The original waveforms of the detected events were then averaged and used in a second step as a template to detect all events across the full recording more precisely via cross-correlation of the template with the original signal. Action potential events were then extracted with a threshold that was adjusted manually for each recorded neuron. This process of first generating a template that is subsequently used to detect stereotypic signals increased the signal-to-noise ratio of event detection, similar to previous uses of template matching in electrophysiology74, 75.
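A minimal sketch of this two-step template matching (thresholds, window sizes and the synthetic test waveform are ours, not the actual analysis code): a conservative derivative threshold seeds the template, and cross-correlation with the template then recovers all events.

```python
import numpy as np

def detect_spikes(signal, rough_threshold, template_halfwidth=10):
    """Two-step template matching (sketch): (1) detect large peaks in the
    first derivative with a conservative threshold and average their
    waveforms into a template; (2) cross-correlate the template with the
    signal and threshold the result to recover all events."""
    deriv = np.diff(signal)
    seed_idx = np.where(deriv > rough_threshold)[0]
    # collapse consecutive suprathreshold samples to single seed events
    seed_idx = seed_idx[np.insert(np.diff(seed_idx) > 1, 0, True)]
    w = template_halfwidth
    snippets = [signal[i - w:i + w] for i in seed_idx if w <= i < len(signal) - w]
    template = np.mean(snippets, axis=0)
    template -= template.mean()  # zero-mean template for correlation
    cc = np.correlate(signal - signal.mean(), template, mode="same")
    events = np.where(cc > 0.5 * cc.max())[0]
    return events[np.insert(np.diff(events) > 1, 0, True)]

# Synthetic trace with three stereotyped spike-like events:
signal = np.zeros(1000)
for p in (100, 300, 700):
    signal[p] = 1.0
    signal[p + 1] = -0.3
events = detect_spikes(signal, rough_threshold=0.5)
```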
Quality control
All electrical spiking events were inspected visually and compared to simultaneously recorded calcium transients. Any recordings that were ambiguous due to low electrophysiological signal-to-noise of action potentials were discarded and not used for any further analysis. Calcium recordings with excessive movement artifacts or apparent inconsistencies between juxtacellular and calcium recordings were discarded entirely. Excessive movement artifacts were defined as events during which the neuron visibly moved out of the imaging plane, such that transients generated by these movements were almost as frequent and prominent as true calcium transients. Apparent inconsistencies were identified as recordings where no spike events corresponded to visible calcium transients and where a spike-triggered average (Fig. S1) did not show any signal, indicating that juxtacellular and calcium recordings were performed from different neurons. In addition, neurons were discarded when they did not spike at all even after application of currents, or when they became visibly brighter after establishing a loose seal due to unknown, possibly mechanical reasons. When the calcium recording clearly contained events without corresponding electrophysiological action potentials, the calcium trace of the manually drawn ROI and the calcium traces of adjacent neurons or neuropil were inspected together with the electrophysiological recordings in order to assess optical bleed-through, and ROIs were adjusted if necessary to avoid contamination. Occasionally, we also noted that mechanical stress exerted by the recording pipette can increase the brightness of the recorded neuron30, possibly through the release of calcium from internal stores. Recordings made during and after such events were discarded. Bursting can lead to adaptation of the extracellularly measured spike amplitude. Such recordings (e.g., in DS#18 with bursts of >10 APs with an inter-spike interval of ca. 5 ms) were carefully inspected for missed low-amplitude action potentials, in particular during these bursts.
Extraction of ground truth from publicly available ground truth datasets
Additional ground truth was extracted from publicly available datasets and quality-controlled for each neuron15,18,29,31–33.
The Allen Institute datasets
For DS#10-13 from Huang et al. (2020)29, raw fluorescence traces were extracted from the processed datasets downloaded from https://portal.brain-map.org/explore/circuits/oephys. Neuropil signal was subtracted using the same standard scaling value for all neurons (neuropil contamination ratio 0.7) to make recordings comparable with other datasets, despite the caveats associated with this procedure29. A 6-s running 10% lowest-percentile window was typically used to compute F0 for ΔF/F0 calculation; percentile values were adjusted to the noisiness of each recording and window durations to its baseline activity. Simultaneous juxtacellular and calcium imaging recordings were inspected for each ground truth neuron together with the raw movie, as described in the Methods section ‘Quality control’.
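The neuropil subtraction and running-percentile baseline can be sketched as follows. The window length and percentile are the typical values stated above; both were adapted per recording in practice, and the function names are illustrative.

```python
import numpy as np

def neuropil_correct(f_soma, f_neuropil, r=0.7):
    """Subtract the scaled neuropil signal (contamination ratio r = 0.7)."""
    return f_soma - r * f_neuropil

def delta_f_over_f(raw, fs, window_s=6.0, percentile=10):
    """ΔF/F0 with F0 from a running lowest-percentile window.

    raw        : neuropil-corrected fluorescence trace
    fs         : frame rate (Hz)
    window_s   : baseline window (s); 6 s is the typical value
    percentile : lowest-percentile used as baseline estimate
    """
    half = int(window_s * fs / 2)
    f0 = np.empty(len(raw))
    for t in range(len(raw)):
        window = raw[max(0, t - half):t + half + 1]
        f0[t] = np.percentile(window, percentile)
    return (raw - f0) / f0
```

A low percentile tracks the quiescent baseline even through calcium transients, as long as the window is long compared to the transient duration.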
The Spikefinder datasets
For DS#01, DS#15 and DS#16 from Theis et al. (2016)18, the ground truth recordings at their native sampling rates as released during the Spikefinder challenge15 were processed. This Spikefinder dataset consists of 5 separate datasets. Datasets 1 and 4 were excluded since fluorescence baseline and scaling were unknown. The other datasets were extracted as fluorescence traces, F0 was computed as the 10th percentile value (adjusted depending on the spike rate of each neuron) and used to compute ΔF/F0. Some ground truth neurons were discarded due to a highly unstable calcium recording baseline, but no strict quality control was possible since the raw calcium imaging data were not available. As found during a previous study, some datasets of the Spikefinder challenge come with calcium recordings that are delayed with respect to the electrophysiological recordings15. We therefore manually corrected for delays of the calcium recording with respect to the electrophysiological recording based on visual alignment of extracted linear kernels. The same correction delay was applied across all neurons of a given dataset.
The GENIE datasets
Datasets DS#09, DS#14, DS#17, DS#20 and DS#21 were downloaded from http://crcns.org/data-sets/methods (refs. 31–33, 76, 77). For DS#09 and DS#14 (ref. 33), ROIs were extracted from raw calcium imaging data using the same approach as described above for R-CaMP1.07 data. Recordings with excessive movement artifacts or apparent inconsistencies of juxtacellular and calcium recordings were discarded entirely. Neuropil was subtracted using the same standard scaling value for all neurons (neuropil contamination ratio 0.7)33. F0 values were computed using percentile values that were adjusted to the noisiness of the recording, and over window durations that were adjusted to the baseline activity.
For datasets DS#17, DS#20 and DS#21, no raw calcium imaging data were available, which precluded strict quality control based on the raw calcium recordings. Neuropil signal was subtracted from raw fluorescence using the same standard scaling value for all neurons (neuropil contamination ratio 0.7)31, 32. F0 values were computed using percentile values that were adjusted to the noisiness of the recording, and over window durations that were adjusted to the baseline activity.
Population calcium imaging with OGB-1 in zebrafish pDp
Ex vivo surgeries, OGB-1 AM injections and calcium imaging were performed as described for juxtacellular recordings. Calcium imaging in Dp was performed using a custom-built multiplane multiphoton microscope based on a voice-coil motor for fast z-scanning as described49. Laser power below the objective was 29-35 mW (central wavelength 930 nm, temporal pulse width below the objective 180 fs), with higher laser power for deeper imaging planes.
Imaging was performed in 8 planes (256x512 pixels, ca. 100x200 µm) at 7.5 Hz over a z-range of approximately 100 µm. Due to slowly relaxing brain tissue, movement correction was applied every 5 min by acquiring local z-stacks with a z-range of ±6 µm. The maximum cross-correlation between a reference stack acquired before the experiment and the local z-stack indicated the optimal positioning which was targeted using the stage motors of the microscope.
For odor stimulation, amino acids (His, Ser, Ala, Trp; Sigma) were diluted to a final concentration of 10^-4 M and bile acid (TDCA; Sigma) was diluted to 10^-5 M in ACSF immediately before the experiment. Food extract was prepared as described51. Odors were applied for 10 s through a constant stream of ACSF using a computer-controlled peristaltic pump51 in pseudo-random order with three repetitions of each odor presentation.
Extraction of linear kernels from ground truth data
To extract linear kernels, we used simple regularized deconvolution using the deconvreg(Calcium,Spikes) function in Matlab (Mathworks). This function computes the kernel which, when convolved with the observed Spikes, results in the best approximation of the Calcium trace. Linear kernels were similar on average when extracted using different deconvolution algorithms (Wiener deconvolution, Lucy-Richardson algorithm; data not shown).
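A Python stand-in for this step is Tikhonov-regularized deconvolution in the Fourier domain, which solves the same least-squares problem as Matlab's deconvreg. The regularization weight `lam` is an assumed free parameter, not the value used in the original analysis.

```python
import numpy as np

def linear_kernel(calcium, spikes, lam=1e-3, kernel_len=100):
    """Estimate the linear spike-to-calcium kernel k such that
    spikes * k ≈ calcium (regularized least-squares deconvolution).

    lam        : Tikhonov regularization weight (assumed value)
    kernel_len : number of kernel samples to return
    """
    n = len(calcium)
    S = np.fft.rfft(spikes, n)
    C = np.fft.rfft(calcium, n)
    # Tikhonov-regularized spectral division; lam suppresses frequencies
    # at which the spike train carries little power
    K = np.conj(S) * C / (np.abs(S) ** 2 + lam)
    return np.fft.irfft(K, n)[:kernel_len]
```

With noise-free data and a small `lam`, this recovers the true kernel almost exactly; with noisy data, larger `lam` trades bias for variance, much like the regularization parameter of deconvreg.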
To compute the variability of linear kernels across neurons within and across datasets (Fig. S1), we split the ground truth recording of each neuron in five separate parts and computed the linear kernels for each of the segments separately. If the coefficient of variation (standard deviation divided by mean) across these five values was lower than 0.5, the computation of the kernel amplitude was considered reliable and included in the plots in Fig. S1.
Computation of noise levels
In the shot-noise limited case, the mean fluorescence F0 scales with N, the number of photons collected by the detector per second, and the fluorescence baseline fluctuations σF scale with √N. Thus, the ΔF/F baseline noise scales with 1/√N. If the fluorescence signal is sampled at frame rate fr, the number of photons collected per frame reduces to N/fr, and σΔF/F therefore scales with √(fr/N). To define a noise measure that is independent of frame rate, we normalized σΔF/F for this shot-noise effect and defined the standardized noise ν as:

ν = σΔF/F / √fr
The units for ν are % · Hz^-1/2, which for the purpose of readability we omit in the text. When computed for ΔF/F data in this way, ν is quantitatively comparable across datasets. A value of ν = 1 will always be a very low noise level, while ν = 8 will always be high, independent of frame rate.
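As a sketch, with σΔF/F estimated via the median absolute difference between adjacent frames (an assumed robust estimator for illustration; any estimator of the high-frequency baseline noise could be substituted):

```python
import numpy as np

def standardized_noise(dff, frame_rate):
    """Standardized noise ν in % · Hz^(-1/2).

    The per-frame ΔF/F noise (here estimated as the median absolute
    frame-to-frame difference, expressed in %) is divided by the square
    root of the frame rate to remove the shot-noise dependence on the
    sampling rate, so values are comparable across recordings.
    """
    dff_percent = 100.0 * np.asarray(dff, dtype=float)
    sigma = np.median(np.abs(np.diff(dff_percent)))
    return sigma / np.sqrt(frame_rate)
```

The median of frame-to-frame differences is insensitive to the sparse, slow calcium transients and therefore reflects the baseline noise rather than the signal.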
Metrics to quantify performance of spike inference algorithms: Correlation, Error and Bias
The ground truth spike rates were generated from detected discrete spikes by convolution with a Gaussian smoothing kernel (except in Fig. S21, where a non-Gaussian, causal kernel was applied). The precision of the ground truth was adjusted by tuning the standard deviation of the smoothing Gaussian (σ = 0.2 s for 7.5 Hz recordings and σ = 0.05 s for 30 Hz recordings). The ground truth spike rate was then compared to the inferred spike rate.
There is no single metric to reliably reflect the goodness of performance of a spike inference algorithm. Correlation between the inferred spike rate and the ground truth is widely used15 but does not contain information about different absolute scaling or offsets. F1-scores combine false positives and negatives11 but are difficult to compare across datasets when the baseline spike rates vary (which is the case for our database). Other metrics try to combine the strengths of the correlation measure with a sensitivity to the correct number of spikes78 but are less intuitive.
We defined three intuitive and complementary metrics (illustrated as color-coded equations in Fig. S5). First, we used the Pearson’s correlation between ground truth spike rate and inferred spike rates as a standard measure of the similarity. Second, the relative error (abbreviated as error) results from the sum of false positives and false negatives when subtracting the ground truth from the inferred spike rate, normalized by the absolute number of spikes in the ground truth. For example, an error of 0.7 would indicate that the number of either incorrectly inferred or omitted spikes is about 70% of the number of spikes in the ground truth. Third, the (relative) bias is defined as the difference of false positives and false negatives, again normalized by the absolute number of spikes in the ground truth. Algorithms that systematically underestimate spike rates will tend towards the minimum of the bias, -1, whereas other algorithms may tend to systematically overestimate spike occurrences (bias > 0). Importantly, the error can be very high when the number of false positives and false negatives is high, but the bias may still be zero. Error and bias are therefore two metrics that describe the absolute errors in terms of spike rates, thus complementing the correlation metric.
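A compact sketch of the three metrics, taking false positives and false negatives as the positive and negative parts of the residual rate integrated over time (an illustrative reading of the definitions; see Fig. S5 for the exact equations):

```python
import numpy as np

def metrics(inferred, ground_truth):
    """Correlation, relative error and relative bias between an inferred
    spike rate and the (smoothed) ground truth spike rate."""
    residual = inferred - ground_truth
    fp = residual[residual > 0].sum()    # excess (falsely inferred) rate
    fn = -residual[residual < 0].sum()   # missed rate
    n_spikes = ground_truth.sum()        # total ground truth spikes
    corr = np.corrcoef(inferred, ground_truth)[0, 1]
    error = (fp + fn) / n_spikes         # total deviation, sign-insensitive
    bias = (fp - fn) / n_spikes          # net over-/underestimation
    return corr, error, bias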
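A compact sketch of the three metrics, taking false positives and false negatives as the positive and negative parts of the residual rate integrated over time (an illustrative reading of the definitions; see Fig. S5 for the exact equations):

```python
import numpy as np

def metrics(inferred, ground_truth):
    """Correlation, relative error and relative bias between an inferred
    spike rate and the (smoothed) ground truth spike rate."""
    residual = inferred - ground_truth
    fp = residual[residual > 0].sum()    # excess (falsely inferred) rate
    fn = -residual[residual < 0].sum()   # missed rate
    n_spikes = ground_truth.sum()        # total ground truth spikes
    corr = np.corrcoef(inferred, ground_truth)[0, 1]
    error = (fp + fn) / n_spikes         # total deviation, sign-insensitive
    bias = (fp - fn) / n_spikes          # net over-/underestimation
    return corr, error, bias
```

Note how error and bias separate: a prediction with many compensating false positives and false negatives has a large error but zero bias.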
Architecture of the default convolutional network
The default network consists of a standard convolutional network with in total 6 hidden layers, including 3 convolutional layers. The input consists of a window of 64 time points symmetric around the time point for which the inference is made. The three convolutional layers have relatively large but decreasing filter sizes (31, 19, 5 time points), with an increasing number of features (20, 30, 40 filters per layer). After the second and the third layer, maximum pooling layers are inserted. A final densely connected hidden layer consisting of 10 neurons relays the result to a single output neuron. While all neurons in hidden layers are based on rectified linear units (ReLUs), the output neuron is based on a linear identity transfer function. In total, the model consists of 18’541 trainable parameters.
The properties of the calcium imaging data are accounted for by resampling the ground truth with the appropriate noise levels and the matching frame rate. For resampling, the ground truth is smoothed either with a time-symmetric Gaussian kernel (standard deviation 0.2 s at 7.5 Hz and 0.05 s at 30 Hz, unless otherwise indicated) or with a causal kernel (inverse Gaussian distribution) to facilitate gradient descent.
Training deep networks for spike inference
To train the deep networks, the mean squared error between the smoothed ground truth spike rates and inferred spike rates was used as the loss function. This loss function not only optimizes the similarity of both signals (correlation), but also the absolute magnitude of the inferred spike rates. Based on errors computed via backpropagation, gradient descent was performed using a standard optimizer (adagrad; cf. Fig. S7). Based on a given resampled ground truth dataset, the network was trained using every single data point from this set, completing an epoch. Typically, training lasted for 10-20 epochs (except when analyzing overfitting in Fig. S7 and Fig. S8).
Crucially, in all spike inferences presented here, without exception, a leave-one-out strategy was employed. For example, to infer the spike rates of a given neuron in a dataset, the network was trained on all neurons of this dataset except the neuron of interest. To infer spike rates for a given set of datasets, the training set always excluded the dataset for which inferences were made. This strategy of cross-validation is absolutely crucial and is strictly distinct from the process of fitting parameters for a neuron or a dataset, which would yield better results for a given neuron but would fail to generalize to new data.
Architecture of alternative deep learning networks
All deep learning architectures (Fig. S8) were trained with the same loss function, the same input and the same optimizer as the default network.
Small convolutional filters network
The same architecture as the default network, with the only difference that smaller convolutional filter sizes were used: (15, 9, 3) instead of (31, 19, 5). Total of 9’891 trainable parameters.
Single convolutional layer network
Consisting of the first convolutional layer of the default network, a single max pooling layer and a single dense layer of 10 neurons. Total of 1’021 trainable parameters.
Deeper convolutional network (5 CNN layers)
Consisting of 5 convolutional layers with filter sizes (11, 9, 7, 5, 3) and filter numbers (20, 30, 40, 40, 40), with three max pooling layers after the second, fourth and fifth convolutional layers, and a final dense layer expansion of 10 neurons. The reduction of the filter sizes compared with the default network is necessary since no zero-padding was applied, resulting in a decrease of the size of the 1D trace with increasing layer depth. Total of 27’421 trainable parameters.
Deeper convolutional network (7 CNN layers)
Consisting of 7 convolutional layers with filter sizes (7, 6, 5, 4, 3, 3, 3) and filter numbers (20, 30, 40, 40, 40, 40, 40), with three max pooling layers after the second, fifth and seventh convolutional layers, and a final dense layer expansion of 10 neurons. Total of 31’221 trainable parameters.
Batch normalization
Same as the default network, but with batch normalization79 for regularization after each convolutional and dense layer, applied before the respective ReLU transfer function of each network layer. Total of 18’741 trainable parameters.
Locally connected network
Same as the default network, but with locally connected filters instead of convolutional filters. For convolutional filters, filter weights are shared across each position in the image space (here, in the temporal window), while the filters are different for each position for locally connected networks. The rationale behind this architecture is that different filters can be learned for each position, which is intuitive given that spike detection is not invariant to the position of the calcium transient in the window. Using different weights for each position of the filter sets results in a total of 229’231 trainable parameters.
Naïve LSTM model
LSTM units are complex neuronal units with internal states and gates that are used in recurrent networks to overcome the problem of vanishing gradients when backpropagating through time80, 81. The time points of the input window are fed sequentially into the recurrent network; earlier time points are retained through recurrent activity or LSTM states and influence the processing of later time points. The investigated model consisted of two layers of 25 LSTM units each, with ReLU activation functions, followed by a simple dense expansion layer of 50 neurons with ReLU activation functions. Total of 4’051 trainable parameters.
Bi-directional LSTM model
The time points of the input window (64 data points) are split into past (32 data points) and future (32 data points) with respect to the time point used for spike inference (“presence”). Past and a reversed version of the future are each fed into a recurrent network based on a single layer of 25 LSTM units (with ReLU activations), such that the time point closest to “presence” is fed in last82, 83. The output of the two recurrent networks for past and future is concatenated and connected with a dense fully connected layer of 50 simple units (ReLU activations). Total of 8’001 trainable parameters.
Linear network
Same as the default network, but with linear activation functions instead of rectifying linear units (ReLUs). The network is therefore entirely linear, but is based on the same architecture in terms of connectivity. Total of 18’541 trainable parameters.
Discretization of spiking probabilities
To obtain discrete spiking events from inferred probabilities, a brute-force fitting procedure was applied. The Gaussian kernel used to smooth the ground truth served as a prior for the inferred spike rate that corresponds to a single action potential. The fit therefore consisted of optimally fitting a set of Gaussian kernels of the expected width and height to the inferred spike rate. We made a first guess that was then optimized by random modifications. The first guess was generated using Monte Carlo importance sampling, such that the overall number of discrete spikes matched the integral of inferred probabilities. Next, events were ranked by their contribution to the fit, assessed by comparing the fit quality when single events were omitted. Lowest-ranking events were discarded and replaced by newly drawn events, again using importance sampling based on the residual probability distribution. Finally, each spike was shifted randomly over the entire duration and the best fit was used. This approach is relatively slow but results in a reliable fit. To speed up the procedure, spiking probabilities were divided into contiguous sequences of non-zero support (divide-and-conquer strategy). For Fig. S12 and to allow for comparison against raw inferred spike rates, the resulting discrete spikes were convolved with the Gaussian smoothing kernel that had been used to generate the ground truth. We provide a demo script that infers discrete spikes from spike rates predicted with CASCADE (available on Github: https://git.io/JtZe4).
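The importance-sampling first guess can be sketched as follows. Only this first step is shown; the names `rate` (inferred spike rate per frame) and `dt` (frame interval, used to convert the rate into expected spike counts) are illustrative assumptions.

```python
import numpy as np

def initial_spike_guess(rate, dt, rng):
    """Monte Carlo importance sampling for the first guess of the
    discretization: draw discrete spike times with probability
    proportional to the inferred rate, with the total spike count
    matching the integral of the inferred probabilities."""
    n_spikes = int(round(rate.sum() * dt))     # expected total spike count
    p = rate / rate.sum()                      # sampling distribution
    return np.sort(rng.choice(len(rate), size=n_spikes, p=p))
```

The subsequent random modifications (event replacement and time shifts) then refine this guess against the Gaussian-kernel fit.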
Generalized linear model to fit predictability across datasets
To predict how well a model trained on a given ground truth dataset (e.g., DS#08) is able to infer activity for another dataset (e.g., DS#14), a set of descriptors (regressors) was extracted for each dataset, and a generalized linear model (GLM) was trained to predict this relationship based on the regressors of the two respective datasets (Fig. S14). In total, 8 predictors were used, separately or together.
First, indicator species was set to 1 if training and test dataset had the same indicator species (synthetic dyes vs. genetically encoded dyes) and 0 otherwise. Animal species was set to 1 if training and test dataset had the same animal species (zebrafish vs. mouse) and 0 otherwise. Spike rate was computed as the absolute difference between the median spike rates across neurons of training and test datasets. Burstiness was computed as the mean number of spikes occurring within 50 ms of a given spike; this metric quantifies the likelihood that a given spike is surrounded by other spikes. The Fano factor was computed by dividing the variance of inter-spike intervals (ISIs) by the mean of ISIs84. Measured Fano factors were broadly distributed across datasets with a median of 3.7 and a standard deviation of 5.9, and an outlier dataset DS#18 in mouse CA3 with a Fano factor of 30.0. The area of the linear kernel was computed by summing up the area under the curve of the extracted linear kernel for each dataset. The kernel decay constant was computed without an exponential fit, by directly measuring the time between the rise and decay time points of the kernel; these time points were identified as the first and last time points at which the kernel surpassed 1/e of its maximum amplitude. The correlation time course was computed as the correlation between the kernels of training and test dataset.
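Two of these regressors can be sketched directly from their stated definitions (the helper names are hypothetical):

```python
import numpy as np

def kernel_decay_time(kernel, dt):
    """Kernel decay constant without an exponential fit: time between
    the first and last sample at which the kernel exceeds 1/e of its
    maximum amplitude (dt is the sampling interval in seconds)."""
    above = np.where(kernel >= kernel.max() / np.e)[0]
    return (above[-1] - above[0]) * dt

def fano_factor(spike_times):
    """Fano factor of the inter-spike intervals: var(ISI) / mean(ISI)."""
    isi = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    return isi.var() / isi.mean()
```

A perfectly regular spike train yields a Fano factor of 0, a Poisson process a value near its mean ISI scaling, and bursty datasets such as DS#18 much larger values.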
The GLM was fitted based on these regressors using the glmfit() command in Matlab with an identity linker function.
Artificial ground truth generated with NAOMi
The package NAOMi was used to generate simulated two-photon calcium recordings of neurons with known spike patterns45. We used the default parameters, which had been optimized for the simulation of GCaMP6f based on previous calibrations33. Artificial ground truth was generated at 30 Hz with a detection NA of 0.6 and an excitation NA of 0.8 at a depth of 100 µm below the cortical surface, in a volume of 250x250x100 µm3. To increase the signal-to-noise ratio of the simulated recordings we used a relatively high simulated laser power of 70 mW. We simulated recordings of the central plane of five such volumes over a duration of 166 s. We extracted the cleanest components of each simulation by selecting the spatial components (from the ideal components returned by NAOMi) that correlated most highly with the known ground truth signals (correlation with a known somatic ground truth signal >0.80). We chose to only include the best-matching components since other components typically had much stronger neuropil contamination than our experimentally obtained ground truth recordings. Then, we extracted the fluorescence of the selected components and performed neuropil subtraction with a 2 pixel-ring around the detected component using a factor of 0.45 for neuropil subtraction. Afterwards, we computed the ΔF/F signal, using the 2nd percentile across the entire recording to determine F0. This procedure resulted in ground truth recordings from a total of 250 simulated neurons.
Adaptation of model-based spike inference algorithms
The MLSpike algorithm was downloaded from https://github.com/MLspike/spikes and used within Matlab 2017a (ref. 11). Parameter settings were manually explored for several datasets using the graphical demo user interface. Then, some parameters (noise level sigma and inverse frame rate dt) were fixed to the values constrained by the ground truth. The drift parameter was set to 0.1. For synthetic dyes (DS#01-05, DS#22-23), a saturating non-linearity (saturation = 0.01) was used, whereas for all other datasets a GCaMP-like nonlinearity (pnonlin = [1.0 0.0]) was defined and kept the same across datasets, since predictions have been described to only slightly depend on the precise values of the non-linearity11. Based on manual exploration, the two parameters tau (decay time constant) and amplitude (amplitude of a single action potential) were explored in a grid search for all ground truth datasets and all noise levels separately. The grid search ranged from 0.1 to 5 s for tau and from 0.01 to 0.35 for amplitude.
The Peeling algorithm was downloaded from https://github.com/HelmchenLab/CalciumSim and used within Matlab 2017a (ref. 10). A single-exponential linear model with default values was used. A grid search was performed over two parameters for all ground truth datasets: time constant of the exponential decay (tau1) and the amplitude of a single spike (amp1). Grid search ranged from 0.25 – 5 s for tau1 and from 2.5 – 35 for amp1. Discrete spike predictions were convolved with a Gaussian kernel such that the resulting trace optimized the loss function (mean squared error between predictions and ground truth).
The Python implementation of the L1-regularized OASIS algorithm in CaImAn was downloaded from https://github.com/j-friedrich/OASIS and used within Python 3.7 (ref. 14). The constrained version of OASIS was used to reduce the number of free parameters, leaving a single free parameter g, which relates to the exponential fluorescence decay time constant τ and the frame rate f via g = e^(-1/(τ·f)). Grid search was performed for g in the range between 0.02 and 0.98, with a granularity of 0.02. The Python implementation of the FastL0SpikeInference algorithm from Jewell et al. (henceforth called Jewell&Witten) was downloaded from https://github.com/jewellsean/FastLZeroSpikeInference and used within Python 3.7 (ref. 47). A grid search was performed over two parameters for all ground truth datasets: optimization was performed between 0.10 and 0.95 for the decay constant parameter gamma, and between 0.0001 and 0.75 for the L0 parameter penalty. Discrete spike predictions were convolved with a Gaussian kernel such that the resulting trace optimized the loss function (mean squared error between predictions and ground truth).
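The stated relation between g and τ, together with its inverse, as a quick sketch:

```python
import numpy as np

def g_from_tau(tau, frame_rate):
    """AR(1) coefficient g used by OASIS, from the fluorescence decay
    time constant tau (s) and the frame rate f (Hz): g = exp(-1/(tau*f))."""
    return np.exp(-1.0 / (tau * frame_rate))

def tau_from_g(g, frame_rate):
    """Inverse mapping: tau = -1 / (f * ln g)."""
    return -1.0 / (frame_rate * np.log(g))
```

For example, a GCaMP-like decay of τ = 0.5 s imaged at 30 Hz corresponds to g ≈ 0.94, near the upper end of the searched grid.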
The Python implementation of the OASIS algorithm in Suite2p was downloaded from https://github.com/MouseLand/suite2p and used within Python 3.7 (ref. 46). Out of three tunable parameters (tau, sig_baseline and win_baseline), only the first two significantly affected the performance of the algorithm in our hands. win_baseline was set to 150 for all analyses. A grid search was performed over the two remaining parameters for all ground truth datasets: Optimization was performed between 0.5 and 3 for the decay time constant parameter tau, and between 2.5 and 20 for the parameter sig_baseline.
The optimal parameters resulting from the grid searches, which optimized the mean squared error between ground truth and inferred spike rates, are listed in Table 2 and provided as a csv file via Github (https://git.io/JtZe0). In addition to these parameters, we further used Gaussian smoothing kernels of variable standard deviation to find the amount of smoothing for each algorithm and dataset that optimized the mean squared error. Finally, to compensate for the propensity of several model-based algorithms to infer spike rates with a temporal lag compared to ground truth spike rates, we tested time shifts between -1 and +1 s and used the respective shift that optimized the mean squared error for a given dataset to evaluate the algorithm in our analyses.
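The parameter optimization shared by all model-based algorithms can be sketched generically. Here `infer` stands for any of the algorithms wrapped as a function of the calcium trace and two parameters; the names are illustrative, not part of any of the packages.

```python
import numpy as np
from itertools import product

def grid_search(infer, calcium, ground_truth_rate, taus, amps):
    """Exhaustive search over two parameter grids, keeping the pair
    that minimizes the mean squared error to the ground truth rate."""
    best_params, best_mse = None, np.inf
    for tau, amp in product(taus, amps):
        prediction = infer(calcium, tau, amp)
        mse = np.mean((prediction - ground_truth_rate) ** 2)
        if mse < best_mse:
            best_params, best_mse = (tau, amp), mse
    return best_params, best_mse
```

In the actual analyses, the same loop additionally scanned the smoothing kernel width and the temporal shift described above.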
Computational cost of spike inference
The six investigated algorithms exhibit different behaviors when scaling up the length of the calcium traces. For example, MLSpike and Peeling suffer from supra-linear cost when the duration of an analyzed calcium trace is increased, while CASCADE shows the opposite behavior due to its capability to parallelize spike inference. Therefore, all 26 full ground truth datasets, resampled at a noise level of 2 and a frame rate of 7.5 Hz, were used as a benchmark, consisting of recordings ranging from tens of seconds up to several minutes. Processing time was averaged across all data points from all datasets. The time required to load the data from hard disk was not included. For CASCADE, the time for pre-processing the raw calcium data in order to generate a 64 point-wide segment for each time point was included in the benchmarking.
Unsupervised sequence extraction using seqNMF
The Matlab-based toolbox seqNMF was used to extract temporal patterns for Fig. 5 in an unsupervised fashion52. Based on initial parameter exploration we used the following settings: K=7, L=20 and λ=0.002. K indicates the number of extracted patterns, L the number of time points for each pattern, and λ serves as a regularizer to decorrelate the detected patterns52. The result of this unsupervised non-negative matrix factorization is a set of K=7 temporal patterns, each associated with a temporal loading that indicates when the pattern became active. The temporal patterns and the temporal loadings provide low-complexity factors that break down the more complex population dynamics (Fig. 5).
Allen Brain Observatory data
The complete calcium imaging data of the Allen Brain Observatory Visual Coding dataset were downloaded from http://observatory.brain-map.org/visualcoding via the AllenSDK with a Python interface. Layers were assigned based on imaging depth as described53. Imaging depth, transgenic lines, cortical areas and fluorescence traces were extracted from NWB files. For analysis, neuropil-corrected calcium traces from the Allen Brain Observatory dataset were used. Since all recordings were performed at an imaging rate of approximately 30 Hz, a single set of CASCADE models (‘global EXC model’ at 30 Hz, Fig. 3a) was used to predict spiking activity.
Statistical tests and box plots
Statistical analysis was performed in Matlab 2017a and R. Only non-parametric tests were used. The Mann-Whitney rank sum test was used for non-paired samples (e.g., comparison across datasets) and the Wilcoxon signed-rank test for paired samples (e.g., comparison of predictions for the same set of neurons using two different algorithms). Two-sided tests were applied unless noted otherwise. Effect sizes Δ±CI (pseudo-median Δ and 95% confidence intervals CI unless otherwise indicated) were computed in R. Boxplots used standard settings in Matlab, with the central line at the median of the distribution, the box at the 25th and 75th percentile and the whiskers at the location of approximately 99.3 percent coverage for the case of a normal distribution.
Supplementary information
- Supplementary Note: Noise-matching of resampled ground truth data
- Supplementary Note: Dependence of performance on hyper-parameters, overfitting and network architecture
- Supplementary Note: Discrete spikes and single-spike precision
- Supplementary Figures
SUPPLEMENTARY NOTE: NOISE-MATCHING OF RESAMPLED GROUND TRUTH DATA
To ensure reliable inference of spike rates it is advantageous to train the supervised deep network with a training dataset that matches the noise level of the test neuron from the population imaging data. In practice, the noise level of each neuron from the population imaging data is determined and an existing model trained with approximately the same noise levels is loaded for spike inference. Noise-suppression is only effective if the noise statistics of the ground truth used for training the network resemble the noise statistics of the calcium data used for testing. To generate a ground truth that matches this requirement, we tested two different approaches to increase the noise of ground truth recordings to match population recordings (Fig. S4).
First, we used the raw imaging data to extract not only the mean fluorescence trace, but the fluorescence trace of each pixel of the region of interest (ROI) that defines the neuron. To achieve a given noise level ν, a random subset of pixels was drawn from the ROI pixels until the average fluorescence trace of this sub-ROI reached the desired noise level (Fig. S4a). This method generates realistic noise characteristics through spatial subsampling but is computationally costly and cannot be applied to ground truth datasets when raw fluorescence movies are not available.
Alternatively, we extracted only the mean fluorescence trace of the ROI of a ground truth neuron and added artificial noise until the overall high-frequency noise matched the test calcium dataset (Fig. S4a). This procedure can in theory be repeated to produce an infinite number of examples (replicas) from a single ground truth recording. However, since the mean ΔF/F of a ground truth recording is already associated with a certain noise level, these noise patterns would be correlated across replicas. To avoid this undesired effect, which could lead to overfitting to correlated noise during training, the number of replicas was restricted to a number n computed from the noise level of the mean ΔF/F of a neuronal ROI, ν_ROI, and the target noise level, ν_target, as n = (ν_target / ν_ROI)², capped at a maximum of n = 500. We tested both simple Gaussian noise as well as Poisson noise, where the variance of the noise is proportional to the signal amplitude, as is typical for photon shot noise.
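The replica cap from the formula above, as a one-line sketch:

```python
def n_replicas(nu_target, nu_roi, n_max=500):
    """Number of noise replicas n = (nu_target / nu_roi)^2 that can be
    generated from one ground truth trace before the shared baseline
    noise becomes correlated across replicas, capped at n_max = 500."""
    return min(int((nu_target / nu_roi) ** 2), n_max)
```

Intuitively, the added artificial noise dominates the shared baseline noise only as long as the target noise level is well above the intrinsic noise of the ROI trace, which the quadratic scaling reflects.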
To test whether artificial noise enables the network to learn to suppress natural noise patterns, we generated ground truth with either natural (spatial sub-sampling) or artificial (Gaussian or Poisson) noise for a subset of the available ground truth datasets (Fig. S4a-c). Using models trained on artificial rather than natural noise slightly but significantly decreased the correlation of predictions with the ground truth when applied to test datasets based on natural noise (decrease for Gaussian noise: Δ1 = 0.010±0.004, pseudo-median ± 95% C.I., p < 1e-4, paired Wilcoxon test; decrease for Poisson noise: Δ1 = 0.006±0.004, p < 0.005). This decrement indicates how much better a model trained with naturally sampled noise would likely perform. Conversely, testing models trained with artificially generated ground truth on artificially instead of naturally sampled ground truth slightly increased the correlation of predictions (Gaussian noise: Δ2 = 0.006±0.008, p = 0.13; Poisson noise: Δ2 = 0.005±0.009, p = 0.28). This increment indicates how much the correlation with the ground truth will be overestimated when training and testing are done with artificially generated ground truth only. Differences were generally more pronounced for larger noise levels but overall remained very small compared to the absolute values (not shown). We therefore conclude that artificial noise allows deep networks to effectively learn noise patterns that transfer to natural noise, with only minor performance loss compared to spatially subsampled recordings, and that Poisson-distributed noise yields slightly better performance than artificial Gaussian noise.
SUPPLEMENTARY NOTE: DEPENDENCE OF PERFORMANCE ON HYPER-PARAMETERS, OVERFITTING AND NETWORK ARCHITECTURE
We tested how spike inference performance depends on the choice of hyper-parameters and network architecture. Networks were trained on a specific ground truth dataset using all neurons except one, which was held out for testing (leave-one-out strategy). The algorithm turned out to be highly robust with respect to changes of the optimizer for gradient descent, the batch size during learning, the number of convolutional features per layer, the number of neurons in the dense layer, and the extent of the temporal window of the receptive field (Fig. S7a-e). Performance remained stable over a surprisingly large range of each of these hyper-parameters, indicating that network performance depends only weakly on their exact choice.
In addition, we investigated potential overfitting to the training dataset. We observed that the performance of the network was already high after a single training epoch (i.e., as soon as every sample had been seen once by the network), reached a maximum after 10-30 training epochs, and slightly decreased thereafter (Fig. S7f). This learning behavior suggests only moderate overfitting, even though the training loss decreased monotonically (Fig. S7g). We believe that the high abundance of noise and the sparseness of events act as a natural regularizer that prevents overfitting. More importantly, while the learning curve was smooth on average (Fig. S7f), individual network instances sometimes reached unfavorable states. As expected from known properties of deep networks85, this effect could be easily eliminated by ensemble averaging over 5 networks (Fig. S7h).
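Ensemble averaging over independently trained network instances can be sketched in a few lines; a generic `predict` interface returning a spike-rate trace is assumed here for illustration.

```python
import numpy as np

def ensemble_predict(models, calcium_trace):
    """Average the spike-rate predictions of several independently
    trained model instances. Averaging suppresses the occasional
    instance that ended training in an unfavorable state (sketch; a
    `model.predict(trace) -> spike-rate array` interface is assumed)."""
    preds = np.stack([m.predict(calcium_trace) for m in models])
    return preds.mean(axis=0)
```

Because each instance is trained from a different random initialization, their errors are largely uncorrelated, so the mean prediction is more reliable than any single instance.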
Finally, we tested performance with network architectures different from the standard convolutional architecture (henceforth called 'default'). We tested a large variety of standard deep learning architectures, including recurrent LSTM networks and non-convolutional deep networks, with only the input, output and loss function remaining unchanged (see Methods for detailed descriptions and explanations). Most of these networks performed well, and the performance of several networks was statistically indistinguishable: the default convolutional network, a convolutional network with reduced filter size, a locally connected network, and a bi-directional LSTM network (Fig. S8a). Much smaller networks (single convolutional layer, Fig. S8a) tended to underfit the ground truth. On the other hand, networks with larger numbers of parameters (the deeper convolutional networks and the locally connected network) overfitted the data when training continued (dashed lines in Fig. S8b), consistent with previous observations15.
The locally connected network and the bi-directional LSTM network performed as well as the default convolutional network despite their very different architectures. However, some architectures that were not adapted to spike inference showed lower performance, for example the naïve LSTM network, whose recurrent design prevents the network from looking precisely at the time point of interest (see Methods for details). Another example is a network identical to the default convolutional network but with purely linear activation functions (Fig. S8a,b), which prevents the algorithm from non-linearly adjusting decision boundaries.
SUPPLEMENTARY NOTE: DISCRETE SPIKES AND SINGLE-SPIKE PRECISION
Many existing spike inference methods aim to infer not spike rates but discrete spikes10, 11, 14. Previous publications reported that the average ΔF/F value triggered by a single spike is larger than zero and that the calcium transients corresponding to single spikes can be detected in selected neurons33. However, the precise identification of individual spikes is more challenging in practical situations, since a detection scheme must also work on unseen data. In addition, the task may depend in unknown ways on the variable expression of calcium indicators, on shot noise and other noise sources, on low sampling rates, and on the non-linear response of calcium indicators3, 21, 29, 30, 86. It is therefore not clear whether discrete spikes can be reliably inferred in realistic scenarios. We devised two approaches to test whether single-spike precision can be achieved. First, we focused on single, isolated spikes in the ground truth and compared them with the inferred spike rates for the same time window. This approach is discussed in the main text. Second, we transformed spike rates into discrete spikes and analyzed whether the discretization improved spike inference.
With respect to the second approach, we argue that the spike rates inferred by the deep network will exhibit a tendency to quantize if the predictions approach single-spike precision. A procedure that takes into account the prior of discretized spiking could therefore improve the inferred spike rates. We thus applied an algorithmic procedure that uses prior knowledge of the spike rate (spiking probability) waveform associated with a single spike to fit discrete spikes to the inferred probability trace, using an optimization based on Monte-Carlo importance sampling (Methods).
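A strongly simplified stand-in for such a procedure — greedy Monte-Carlo placement of single-spike kernels rather than the full importance-sampling scheme of the Methods — could look like this; all names and the acceptance rule are ours.

```python
import numpy as np

def discretize(prob, kernel, n_iter=2000, rng=None):
    """Greedy Monte-Carlo sketch of spike discretization: repeatedly
    draw a candidate spike time with probability proportional to the
    positive residual, and keep the spike only if it lowers the squared
    error between the probability trace and the sum of single-spike
    kernels. Simplified illustration, not the paper's algorithm."""
    if rng is None:
        rng = np.random.default_rng()
    half = len(kernel) // 2
    recon = np.zeros_like(prob)
    spikes = []
    for _ in range(n_iter):
        residual = np.clip(prob - recon, 0.0, None)
        if residual.sum() <= 0:
            break
        # importance-sample a candidate time where probability mass
        # is not yet explained by the current spike set
        t = rng.choice(len(prob), p=residual / residual.sum())
        trial = recon.copy()
        lo, hi = max(0, t - half), min(len(prob), t - half + len(kernel))
        trial[lo:hi] += kernel[lo - (t - half): hi - (t - half)]
        if np.sum((prob - trial) ** 2) < np.sum((prob - recon) ** 2):
            recon = trial
            spikes.append(t)
    return np.sort(np.array(spikes)), recon
```

On a clean probability trace built from isolated single-spike kernels, this sketch recovers roughly one spike per event; on noisy traces, the acceptance step is what suppresses small noise events or amplifies small errors, as discussed below.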
We found that spiking probabilities that were almost correctly inferred by the deep network were improved by suppression of noise or by rounding of close matches (blue arrows in Fig. S12a-d). However, this procedure can also amplify small false-positive or false-negative errors (red arrows in Fig. S12a-d). Over all ground truth datasets, the correlation metric slightly but consistently decreased when spike rates were discretized (Fig. S12e), indicating that the data quality did not allow for efficient use of the prior. Although discretized spike rates tended to decrease the error (Fig. S12e), this effect was primarily due to suppression of small noise events in the absence of spiking, and we found that this positive effect could be achieved without a reduction of the correlation by simply thresholding the inferred spike rates (Fig. S12d,f). Together, this suggests that the available datasets do not permit discretization of predicted spike rates without performance loss.
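The thresholding mentioned above amounts to a very simple operation; how the threshold θ is chosen relative to the noise floor is left open in this sketch.

```python
import numpy as np

def threshold_rates(rate, theta):
    """Suppress small noise events in an inferred spike-rate trace by
    zeroing values below a threshold theta, leaving larger events
    untouched (sketch; theta would be chosen relative to the noise
    floor of the predictions)."""
    out = rate.copy()
    out[out < theta] = 0.0
    return out
```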
Despite these caveats, discretization of spike rates might still be useful for two reasons. First, discrete spike events may be a more intuitive visualization of activity than smooth probabilities. Second, whereas inferred spike rates are smoothed with a Gaussian kernel for each spike, a single discrete spike that optimally explains such a smoothed event provides better temporal resolution. Here we see a potentially useful application of discrete spike inference, which is, however, beyond the scope of this study. We include the algorithm to discretize spike rates as a script in the public repository (https://git.io/JtZe4).
ACKNOWLEDGEMENTS
We thank the members of the GENIE project, the Allen Institute and the Spikefinder project for publicly providing existing ground truth datasets together with excellent documentation. We thank Philipp Berens and Emmanouil Froudarakis for providing additional information on the Spikefinder datasets. We thank Gwendolin Schoenfeld for helpful discussions on dataset 18, and Hendrik Heiser, Nesibe Temiz, Chie Satou, Gwendolin Schoenfeld and Henry Luetcke for testing earlier versions of the toolbox. This work was supported by grants to F.H. from the Swiss National Science Foundation (Project grant 310030-127091; Sinergia grant CRSII5-18O316) and by the European Research Council (ERC Advanced Grant BRAINCOMPATH, grant agreement no. 670757), by grants to K.K. from MEXT, Japan (Scientific Research for Innovative Areas, no. 17H06313), by grants to R.W.F. from the Swiss National Science Foundation (Project grant 310030B-152833/1) and from the European Research Council (ERC Advanced Grant MCircuits, grant agreement no. 742576), by the Novartis Research Foundation, by a UZH Forschungskredit and a fellowship from the Boehringer Ingelheim Fonds to P.R..
Footnotes
† Contributed equally
Addition of new datasets with interneuron data; completely revised comparison with other algorithms (new Fig. 4); comparison with artificial ground truth simulated with NAOMi (see Fig. 3); and various smaller additions: new Fig. S15 to analyze the performance of CASCADE as a function of ground truth dataset size, and Fig. S19 to analyze the performance of CASCADE and other algorithms as a function of temporal precision of predictions.