Deep learning based feature extraction for prediction and interpretation of sharp-wave ripples

Local field potential (LFP) deflections and oscillations define hippocampal sharp-wave ripples (SWR), one of the most synchronous events of the brain. SWR reflect firing and synaptic current sequences emerging from cognitively relevant neuronal ensembles. Current spectral methods fail to capture their mechanistic complexity, thus limiting progress. Here, we show how one-dimensional convolutional networks operating over high-density LFP hippocampal recordings allowed for automatic identification of SWR. When applied to ultra-dense hippocampus-wide recordings, we discovered physiologically relevant processes associated to the emergence of SWR, prompting for novel classification criteria. To gain interpretability, we developed a method to interrogate the operation of the artificial network. We found it relied in feature-based specialization, which permit identification of spatially segregated oscillations and deflections, as well as synchronous population firing. Thus, using deep learning based approaches may change the current heuristic for a better mechanistic interpretation of these relevant neurophysiological events.


Introduction 29
Interpreting brain signals is essential to understand cognition and behavior. Biologically 30 relevant oscillations are considered reliable markers of brain operation (Buzsáki et al.,31 2012; Friston et al., 2015). Thus, analysis of either surface electroencephalography 32 (EEG) or intracranial LFP is typically based on spectral methods relying on gold-33 standard definitions (Niedermeyer and Lopes da Silva, 2005). However, other features 34 of EEG/LFP signals such as the slope, polarity and latency to events are equally 35 important (Modi and Sahin, 2017). While interpreting neurophysiological signals is 36 strongly influenced by this heuristics, methodological issues limit further advances. 37 During memory consolidation and retrieval, the hippocampal system releases short 38 memory traces in the form of neuronal sequences 39 Pfeiffer, 2017; Pfeiffer and Foster, 2015). Such activity comes often in tandem with 40 spatially segregated oscillations (100-250 Hz) and LFP deflections dubbed sharp-wave 41 ripples (Buzsáki, 2015). They result from active recruitment of dedicated cell-type 42 specific microcircuits ( in capturing the complexity of SWR events across hippocampal layers. 52 Here, we exploit the extraordinary capability of Convolutional Neural Networks (CNN) 53 for real-time recognition to identify SWR (Bai et al., 2018). Instead of adopting standard 54 approaches used for temporal data such as in speech recognition, we chose to rely on 55 unfiltered LFP profiles across hippocampal strata as individual data points making up 56 an image. The one-dimensional object is equivalent to a clip of one-row pixels with as 57 many colors as LFP channels. We show how one-dimensional CNN operating over 58 high-density LFP hippocampal signals overcome spectral methods in detecting SWR. 59 Moreover, we develop a strategy to decode and explain CNN operation. In doing so, 60 we discovered some features of SWR that permit their detection at distant layers when 61 applied to Neuropixels recordings (Jun et al., 2017). Using these tools allow for a more 62 comprehensive interpretation of SWR signatures across the entire hippocampal 63 system. 64 to provide a single-output probability for the occurrence of a SWR event in a given 74 temporal window (Fig.1A, bottom trace). Therefore, the input "object" is equivalent to a 75 stream of pixels (1 x number of data samples) with 8-channels instead of colors. 76 Convolutional layers search for particular features in the input data by using kernels. 77 The kernels of the first layer (L1) have dimensions of 8-channels x length, with length 78 reflecting the number of data samples. They advance along the temporal axis moving 79 forward a similar number of non-overlapping samples defined by the stride (Fig.1B). 80 The result of this operation is the kernel activation signal, which reflects the presence 81 of some input features. L1 kernel length should be defined by considering the desired 82 output resolution of the network. To ease subsequent online applications, we chose 83 either 32 ms (CNN32, L1 kernel length 5) or 12.8 ms resolution (CNN12, L1 kernel 84 length 2). 85

94
A probability threshold can be used to identify SWR events. Note some events can be predicted well in 95 advance. F, Offline P-R curve (mean is dark; sessions are light) (left), and F1 score as a function of 96 normalized thresholds for the CNN at 32 and 12.8 ms resolution as compared with the Butterworth filter 97 (right). Data reported as mean ± 95% confidence interval for sessions (n=15 sessions; n=5 mice). H,

98
Online detection performance of CNN12 as compared with the Butterworth filter (n=6 sessions, n=3 mice; 99 p=0.033). I, Mean and per session P-R curve (left), and F1 score as a function of the optimized threshold 100 for online sessions, as analyzed post-hoc (right). Data from n=6 sessions from 3 mice.

102
Our CNN operates by receiving the 8-channels input into each of the four kernels of L1 103 (Fig.1C). Kernels process the LFP and output a kernel activation signal (Fig.1D). 104 Therefore, after passing through L1, the 8-channels are transformed into 4-channels, 105 one per kernel (e.g. L1K1, L1K2, etc.). L1 output is then transformed by a Batch Norm 106 layer (L2), and a Leaky ReLU layer (L3), before entering the next block (L4-L5-L6 and 107 so on; Fig.1C). The size of subsequent kernels is defined by the input data from the 108 convolutional layers of the previous block (see Methods). Inspired by YOLO, we  109  staggered blocks with kernels of large and short length to allow for alternate  110  convolution of the temporal and channel axes. As data are processed along these  111  blocks, resolution decreases and hence the kernel length becomes progressively  112 shorter. 113 We defined a suitable number of blocks that optimized the input (8 channels) and 114 output features (1 channel output at 32 ms or 12.8 resolution), resulting in 7 blocks for 115 a total of 21 layers (Fig.1C). The final layer (L22) is a Dense Layer with a sigmoidal 116 activation function, so that the CNN output (between 0 and 1) can be interpreted as the 117 SWR probability. A SWR event can be detected using an adjustable probability 118 threshold (Fig.1E). Note that our CNN network operates along all streamed LFP data 119 without any specification of the ongoing oscillatory state (i.e. theta or non-theta 120 segments accompanying running and immobility periods, respectively). 121

CNN performance offline and online 123
Having defined the main network architecture, we used a dataset manually tagged by 124 an expert for training and initial validation (1794 events, 2 sessions from 2 mice; 125 Sup. Table.1). An important decision we made was manually annotating the start and 126 the end of the SWR event so that the CNN could learn recognizing them early from the 127 onset. 128 Given the large number of parameter combinations, we run two optimization rounds 129 using training and test chunks from the previous dataset. We first tested a subset of 130 hyper-parameters to look for the 10-best networks (Fig.S1A, green shaded), and chose 131 the one with the lower and more stable learning curve (Fig.S1B, arrowhead). 132 Stabilization of the loss function error for the training and test subsets along epochs 133 excluded potential overfitting (Fig.S1B an additional non-ripple channel was used to veto detection of common artifacts. We 160 found better online performance of the CNN at 12.8 ms resolution as compared with 161 the filter ( Fig.1H; p<0.033). When it came to the ability to anticipate SWR events 162 online, the CNN slightly overtook the Butterworth filter (time-to-SWR-peak for CNN12: -163 6.6 ± 1.9 ms; Butterworth filter: -4.2 ± 2.3 ms; Mann-Whitney U test, p<0.00001). A post 164 hoc evaluation of online sessions confirmed a better performance of the CNN versus 165 the filter, for all normalized thresholds (Fig.1I). 166

Detection limits of SWR and CNN operation 184
Are there any practical detection limit for SWR? How good is CNN performance and 185 how much is it determined by the expert heuristics? First, we sought to compare CNN 186 and the filter at its maximal capability using data from all sessions (21 sessions from 8 187 mice). To this purpose, we equated the methods using the best possible detection 188 threshold for each session (the one that optimized F1) and found roughly similar values 189 ( Fig.2A were some natural differences between experts in a session-by-session basis (Fig.2B). 203 Interestingly, when we confronted the network detection with the consolidated ground 204 truth, we noted that the CNN could be actually detecting many more SWR events than 205 initially accounted by each individual expert (one-way ANOVA, CNN12: F(2)=4.75, 206 p=0.016; CNN32: F(2)=4.22, p=0.024). In contrast, the filter failed to exhibit such an 207 improvement (Fig.2B). Notably, an expert acting as a classifier of the other expert's 208 ground truth provided a mean reference of best performance at 0.69 ± 0.14 ( Fig.2C). 209

220
We used the hc-11 dataset (Grosmark and Buzsáki, 2016) at the CRCNS public repository 221 (https://crcns.org/data-sets/hc/hc-11/about-hc-11) to evaluate the effect of the definition of the ground 222 truth. The data consisted in 10-channel high-density recordings from the CA1 region of freely moving rats.

223
We randomly selected 8-channels to cope with inputs dimension of our CNN, which was not retrained. The 224 dataset comes with annotated SWR events (dark shadow) defined by stringent criteria (coincidence of 225 both population synchrony and SWR). CNN False Positives defined by this partially annotated ground truth 226 were re-reviewed and validated (light shadow). E, Performance of the original CNN, without retraining, at 227 both temporal resolutions over the originally annotated (dark colors) and after False Positives validation 228 (light colors). Performance of the Butterworth filter is also shown. Data from 3 sessions, 2 rats. See 229 Supp. Table 1.   230   231 To evaluate this point further, and to test the capability of the CNN to generalize 232 beyond its original training using head-fixed mice data, we used an externally 233 annotated dataset of SWR recorded with high-density silicon probes from freely moving 234 rats (Grosmark and Buzsáki, 2016) ( Fig.2D; 1056 events; 3 sessions from 2 rats; 235 Sup. Table.1). In that work, SWR detection was conditioned on the coincidence of both 236 population synchrony and LFP definition thus providing a "partial ground truth". 237 Consistently, the network recalled most of the annotated events (R=0.80 ± 0.18), but 238 precision was apparently low (P=0.42 ± 0.18) (Fig.2E). Hence, we evaluated all False 239 Positive predictions and found that many of them were actually unannotated SWR (839 240 events), meaning that precision was actually higher (P=0.72 ± 0.12 for CNN32, P=0.83 241 ± 0.10; for CNN12, both at p<0.05 for paired t-test; Fig.2E). As above, the filter failed to 242 improve performance (Fig.2E). Importantly, a CNN trained in data from head-fixed mice 243 was able to generalize to freely moving rats. 244 Altogether, this analysis indicates that detection limits of SWR may be determined by 245 the expert's criteria. CNN performance improves when confronted with the extended 246 ground truth, suggesting that it learns to generalize beyond existing data. 247 248

Unveiling SWR latent features 249
Interpretability is a major issue in modern machine learning ( textures of an image, we found the kernels focused in detecting distinct LFP features. 261 Consistently with data above, kernels from the first layers specialized in detecting 262 rhythmic and periodic patterns (e.g. L1K1 and L1K2), while later layers seem to focus 263 in identifying these patterns along time (e.g. L19K18; Fig.3B). By computing the 264 pattern-matching function between saliency maps and the 8-channels LFP, we 265 evaluated how the kernels accounted for different features of True Positive events, i.e. 266 SWR (Fig.3C). For example, L1K1 was maximally activated at the peak of ripple 267 oscillations, while L1K2 and L19K18 were maximal at the onset, supporting the network 268 ability to anticipate SWR. Pattern-matching between true SWR events and the saliency 269 map of the output layer L22 provided an idea of what the CNN recognized as an ideal 270 "object". In contrast, pattern-matching values in the absence of SWR events (True 271 Negative events) were typically lower as compared with those obtained from the 272 ground truth (Fig.3D). 273  As shown above, the CNN ability relies on feature extraction by the different kernels. 310 To gain explanatory power on how this applies to SWR detection, we sought to 311 visualize and quantify the CNN kernel operation. 312 First, we examined the weights of the first layer kernels, which act directly over high-313 density LFP inputs. We noted that their profiles were especially suited for assessing 314 critical LFP features, such as the laminar organization of activity. For example, L1K1 315 seemed to act along the spatial scale by differentially weighting LFP channels along the 316 somatodendritic axis and deep-superficial layers (Fig.4A), consistent with the saliency 317 map shown above. In contrast, weights from L1K2 likely operated in the temporal scale 318 with major differences along the kernel length (Fig.4A). In this case, by positively 319 weighting upper channels at later samples this filter may help to anticipate some SWR 320 motifs as shown before (see Fig.3C). Interestingly, there were opposing trends 321 between top and bottom channels, suggesting some spatial effect as well. L1K3 and 322 L1K4 provided less obvious integration across the spatial and temporal scales. The same reasoning applies to the next layers. However, since CNN acts to transform 343 an LFP 'object' into a probability value, the spatial and temporal features of SWR 344 events become increasingly abstract. Notwithstanding, their main features are still 345 recognized. For example, L4K1 and L4K2 outputs likely reflected the spatiotemporal 346 organization of the input SWR event, in particular the slower components and uneven 347 distribution of ripples (see Fig.1D and Fig.S2A). 348 To quantify these observations we evaluated how the different kernels were activated 349 by a similar number of LFP events centered at either the ground truth or at a random 350 timing (Fig.4B, 7491  probability in a 2-dimensional cloud ( Fig.4C; Fig.S2C), supporting that the entire CNN 360 is coding for different features of SWR across layers (Fig.S2D). 361 (undetected ground truth) and True Negative (unannotated and undetected events). 375 We found striking segregation across the UMAP cloud with True Positive and True 376 Negative events falling apart ( Fig.4D; Fig.S2C). False Negatives were mostly located at 377 the intermediate region, suggesting they could be detected with less conservative 378 thresholds. Interestingly, False Positive predictions were scattered all around the cloud, 379 supporting the idea that they reflect heterogeneous events as seen above. Mapping all 380 the previously validated False Positive events (see Fig.3F, G) over the UMAP cloud 381 confirmed that those corresponding to population synchrony and sharp-waves without 382 ripples distributed over the ground truth, while those corresponding to artifacts mostly 383 fell apart (Fig.4E). 384 Altogether, these analyses permitted us to understand how the one-dimensional CNN 385 operates to detect SWR events. Our study suggests that a CNN relying on feature-386 based detection allows to capture a wider diversity of SWR events, in contrast to 387 spectral filtering. The new method could potentially facilitate discovery and 388 interpretation of the complex neurophysiological processes underlying SWR. 389 390

Leveraging CNN capabilities to interpret SWR dynamics 391
Equipped with this tool we sought to understand the dynamics of SWR across the 392 entire hippocampus. To this purpose, we obtained Neuropixels recordings from different 393 rostro-caudal penetrations in head-fixed mice ( Fig.5A; n=4 sessions, 4 mice; 394 Sup. Table.1). Detailed post-hoc histological analysis validated the probe tracks passing 395 through a diversity of brain regions, including several thalamic nuclei as well as the 396 dorsal and ventral hippocampus (Fig.5B, Fig.S3A). 397 By exploiting the ultra-dense configuration of Neuropixels, we simulated consecutive 398 penetrations with 8-channel high-density probes covering the entire dorsoventral axis 399 (Fig.5A). We run offline detection using eight neighboring Neuropixels channels as the 400 inputs, then move four channels downward/upward and repeat detection again, up to 401 the end of the probe. We used the original CNN32 without retraining, the Butterworth 402 filter and RippleNET to evaluate detection performance against the ground truth. 403 Consistent with data above, we found successful detection of SWR events from the 404 dorsal CA1 region (Fig.5A). While detection was optimal at the CA1 cell layer (stratum 405 pyramidale; SP), we noted many events were actually identified from SWR-associated 406 LFP signatures at the radiatum (SR) and lacunosum-moleculare (SLM) layers ( Fig.5C; 407 Fig.S3B). When evaluated per layer, detection of SWR was better at the dorsal than at 408 the ventral hippocampus, except for SR and SLM (Fig.5D, left). We found no major 409 differences except for precision, when all layers were pooled together (Fig.5D, right). 410 In spite that only a subset of SWR could be identified from recordings at SR and SLM 411 (i.e. R-values were low), precision was very high (i.e. over 80% of predictions were 412 consistent with the ground truth). A close examination of the morphology of these 413 events confirmed they exhibited LFP and oscillatory features consistent with the kernel 414 saliency maps (Fig.5E, Fig.S3C). Remarkably, both the Butterworth filter and 415 RippleNET failed to identify SWR associated signatures beyond the dorsal SP 416 (Fig.S3D,E). 417  To gain insights into the underlying physiology and to discard for potential volume 440 conduction effects, we simulated linear penetrations through the dorsal hippocampus 441 and estimated the associated current-source density (CSD) signals of events detected 442 at different layers (Fig.5F, top). We found larger sinks and sources for SWR that can be 443 detected at SLM and SR versus those detected at SP only (Fig.5G). We also exploited 444 Neuropixels to isolate activity from putative pyramidal cells (n=99) and interneurons 445 (n=29, all penetrations) during the different SWR event types (Fig.5F, bottom). For 446 pyramidal cells, we found striking reorganization of the firing rate and timing during 447 SWR detected at SO, SR and SLM versus those detected only at SP (Fig.5H). 448 Interneurons exhibited similar variability (Fig.S3E). Timing and rate differences of 449 pyramidal cell and interneuronal firing with respect to SWR events detected at different 450 layers support the idea that they reflect activation of different hippocampal ensembles. 451 Our CNN thus provide unique opportunities to study the so far elusive dynamics 452 accompanying SWR responses. Here, we report how one-dimensional convolutional networks operating over high-469 density LFP recordings allows for unprecedented detection and interpretation of 470 hippocampal SWR events. While the network was trained in a subset of LFP data 471 recorded around the dorsal CA1 cell layer of head-fixed mice, detection generalized 472 across strata, brain locations (e.g. ventral hippocampus), preparations (i.e. freely 473 moving) and species (i.e. rats). Our CNN exhibited a much higher stability, less 474 threshold-dependent sensitivity and overall higher performance as compared with the 475 spectral filter and RippleNET, a recurrent neural network solution. This unique 476 capability of our CNN relies on feature-based analysis of LFP signals, which provide 477 similar explanatory power as standard LFP profiling. Such a developmental potential of 478 convolutional neural networks permits challenging the interpretation of brain signals 479 (Frey et al., 2021), and SWR in particular (this study). 480 From a physiological perspective, studying brain function relies in understanding 481 activity in relation to behavior and cognition (Cohen, 2017;Friston et al., 2015 SLM strata independently on their alleged local generation at the CA1 cell layer. 521 The configuration of the current sinks and sources associated to independently 522 detected SWR events suggest that the weighted interaction between fluctuating input 523 pathways may entail contribution by different factors across behavioral states (Buzsáki, 524 2015 SWR events suggest that brain-wide subcircuits inherit the different representational 539 dynamics of a variety of replays (Sosa et al., 2020). The detection unfolding of CNN 540 thus permit an unbiased categorization without relying on more elusive spectral criteria. 541 Critically, both the filter and RippleNET failed to capture SWR diversity across strata 542 further confirming the suitability of CNN to capture critical LFP features accompanying 543 a wealth of events. 544 Our method also identified events beyond the expert ground truth. Careful examination 545 of those False Positives reveal sharp-waves associated to population synchronous 546 firing without ripples, as well as other unclassified forms of oscillatory activities. While 547 there is no specific report of these events, the ability of CA3 inputs to bring about 548 gamma oscillations and multi-unit firing associated with sharp-waves is already 549 recognized ( Since the power spectral level operationally defines the detection of SWR, part of this 552 microcircuit intrinsic variability may be escaping analysis when using spectral filters. 553 Understanding how the brain encodes for memory is challenging. Recent data suggest 554 that replay emerging from SWR is more complex than originally thought (Hannah R Joo  555 and Frank, 2018). Cell-type specific subcircuits operating over a variety of 556 interneuronal classes and under the influence of different input pathways, provide 557 mechanistic support for a wealth of SWR events (de la Prida, 2020 Convolutional layers, the kernel size and stride were set to the same value so that the 863 network operates similarly offline an online. The kernel size and stride determined the 864 duration of the input window, so they were tuned in order to fit either a 32ms window 865 (CNN32) or a 12.8ms window (CNN12). Values of kernel size and stride for 866 Convolutional layers 1, 4, 7, 10, 13, 16 and 19 of CNN32 were: 5, 1, 2, 1, 2, 1, 2, 867 respectively. For CNN12, the values were 2, 1, 2, 1, 2, 1 and 2. Since max-pooling 868 layers can be replaced by Convolutional layers with increased stride, we chose not 869 using max-pooling layers to avoid issues with the input window size (Springenberg et  870 al., 2014). 871 872 The number of kernels and kernel regularizers were selected after performing an initial 873 parametric search (initial learning rate, number of kernels factor and batch size; 874 Fig.S1A). For the Dense layer we used a sigmoid activation function operating over 1 875 unit to produce 1 channel output. All the other parameters for the Convolutional layers, 876 as well as for the Dense layer, were set initially at default values. Additionally, we also 877 tested whether adding two LSTM layers before the final Dense layer improved 878 performance in the preliminary parameter tests. 879 880 We selected our CNN32 as that with the lower and more stable training evolution (see 881 below) from the 10-best networks in the initial parameter search (out of 107; Fig S1B) annotated. An important decision we made was to manually annotate the start and the 900 end of SWR events so that the network could learn anticipating events in advance. The 901 start of the event was defined at the first ripple or the sharp-wave onset. The end of the 902 event was defined at the latest ripple or when sharp-wave resumed. While there was 903 some level of ambiguity on these definitions, we opted for including these marks in 904 order to facilitate transition to ground truth detection. An additional expert (new expert) 905 tagged SWR independently using a subset of sessions, to allow for comparisons 906 between experts in the same lab. 907 908 Data preparation 909 Datasets used for training and development of the CNN were created by loading a 910 number of experimental sessions and storing them in two different 3-dimensional 911 matrices, Y and Y. 912 913 Matrix X stored several chunks of 8-channels LFP recordings. From each session, LFP 914 data from all probe shanks displaying any SWR were loaded, unless specific shanks 915 were selected. If a shank had more than 8 channels, then they were randomly selected 916 while giving priority to those located at the stratum pyramidale of CA1. All LFP signals 917 were down-sampled to 1250 Hz and normalized using z-score. LFP signals were sliced 918 into chunks of 57.6 seconds, which is exactly divisible by 0.032s and 0.0128s, in order 919 to keep a consistent matrix shape even when various sessions of different durations 920 were used. This chunk size maintains the properties of long duration signals, which is 921 essential for the CNN to reach a high performance score when fed with continuous 922 data. Chunks with no SWR events would be discarded, but that was an extremely rare 923 case. At the end of this process the result is a matrix X with dimension (n, 72000, 8), 924 where n is the number of chunks of 72000 samples (57.6 seconds sampled at 1250 925 Hz) for each of the 8 LFP channels. 926 927 Matrix Y contained the annotated labels for each temporal window (32 or 12.8 ms) 928 stored in X. 929 To create Y, each chunk was separated in windows of 32 or 12.8 ms and then assigned 930 a label, a number between 0 and 1, depending on the percentage of the window 931 occupied by a SWR event. Therefore, dimension of matrix Y was (n, 1800, 1) for 932 CNN32, since there are 1800 32ms windows in a chunk of 57.
where N is the number of windows, y i is the label of window I, and p(y i ) is the 948 probability predicted for window i. 949 950 In order to evaluate the network performance, two different datasets were used: the 951 training set described above, and the validation set, consisting of 15 sessions from 5 952 different animals that were not used for training or development (Sup. Table.1).  953  954 To detect SWR event, we set a probability threshold to identify windows with positive 955 and negative predictions. Accordingly, predictions were classified in four categories: 956 True Positive (TP) when the prediction was positive and the ground truth window did 957 contain a SWR event; False Positive (FP) when the prediction was positive in a window 958 that did not contain any SWR; False Negative (FN) when the prediction is negative but 959 the window contained a SWR; and True Negative (TN) when the prediction was 960 negative and the window did not contain any SWR event. 961 962 Intersection over Union (IOU) methodology was employed to classify predictions into 963 those four categories. It was calculated by dividing the intersection (overlapping) of two 964 windows by the union of them: