SongExplorer: A deep learning workflow for discovery and segmentation of animal acoustic communication signals

Many animals produce distinct sounds or substrate-borne vibrations, but these signals have proved challenging to segment with automated algorithms. We have developed SongExplorer, a web-browser-based interface wrapped around a deep-learning algorithm that supports an interactive workflow for (1) discovery of animal sounds, (2) manual annotation, (3) supervised training of a deep convolutional neural network, and (4) automated segmentation of recordings. Raw data can be explored by simultaneously examining song events, both individually and in the context of the entire recording, watching synced video, and listening to song. We provide a simple way to visualize many song events from large datasets within an interactive low-dimensional visualization, which facilitates detection and correction of incorrectly labelled song events. The machine-learning model we implemented displays higher accuracy than existing heuristic algorithms and accuracy similar to that of two expert human annotators. We show that SongExplorer allows rapid detection of all song types from new species and of novel song types in previously well-studied species.


Animals produce diverse sounds (Kershenbaum et al., 2016), vibrations (Hill, 2006), and periodic electrical signals (Zakon et al., 2008) for many purposes, including as components of courtship, to sense their surroundings, and to localize prey. Quantitative study of these "sounds" is facilitated by automated segmentation. However, heuristic segmentation algorithms sometimes have low accuracy and fail to generalize across species (Arthur et al.).

Quantitative study of animal sounds typically starts with supervised discovery of sounds. For species that produce loud and stereotyped sounds, like frogs and birds, it can be straightforward to identify individual types of songs. For other species, like many small insects, the initial step of identifying song types often requires examination of long recordings of audio and possibly video because of the sparse and quiet nature of their songs. Recent work has demonstrated that largely unsupervised clustering of song events can highlight distinct song types (Clemens et al., 2018). Therefore, to accelerate discovery, we provide methods for unsupervised clustering and visualization of song events.

The data view region on the left side of the browser window includes a box (Figure 1C) that displays data in a dimensionally reduced form, either as a UMAP or t-SNE representation, in either two or three dimensions. Samples of sound waveforms or frequency spectra can be projected into these reduced-dimensionality spaces (Clemens et al., 2018). However, we have found that the hidden-layer activations of a trained neural network provide more discrete representations of unique song types than do the original song events, which can facilitate identification of new song types (see later).

In SongExplorer, the reduced-dimensionality space can be navigated rapidly with a modifiable "lens." A sample of up to 50 sound events within the lens is represented in the adjacent window (Figure 1D). Raw sound traces are presented along with a spectrogram and a label indicating the song type. In the example shown in Figure 1, sounds were automatically detected (within SongExplorer) using thresholds either for high-amplitude events (labelled "time") or for events with a relatively strong signal in a subset of the spectrogram (labelled "frequency"). Clicking on a song event reveals that event within a longer context of the recording in a separate window (Figure 1E). The context window (Figure 1E) can be navigated by zooming in or out and panning using buttons located below the context window. We have found that Drosophila song can sometimes be discriminated from other sounds most easily by listening to the song and watching the associated video. Therefore, SongExplorer facilitates annotation by allowing the portion of the recording currently shown in Figure 1E to be played as audio and, if video data are available, the corresponding section of the video to be played.
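To make the embedding step described above concrete, below is a minimal sketch, not SongExplorer's own code, of projecting the hidden-layer activations of a trained classifier into two dimensions with UMAP (t-SNE could be substituted). The file paths, layer name, and UMAP parameters are hypothetical placeholders.

import numpy as np
import tensorflow as tf
import umap

# Minimal sketch (not SongExplorer's actual code): embed the hidden-layer
# activations of a trained classifier with UMAP for interactive browsing.
# The file paths and layer name below are hypothetical placeholders.
model = tf.keras.models.load_model("trained_classifier.h5")

# Sub-model that returns the activations of one hidden layer.
hidden = tf.keras.Model(inputs=model.input,
                        outputs=model.get_layer("conv_final").output)

# Detected sound events, one fixed-length waveform snippet per row.
snippets = np.load("detected_snippets.npy")            # shape: (n_events, n_samples)

activations = hidden.predict(snippets)
activations = activations.reshape(len(snippets), -1)   # flatten per event

# Two-dimensional UMAP embedding; use n_components=3 for a 3-D view.
embedding = umap.UMAP(n_components=2, n_neighbors=30).fit_transform(activations)

Each row of the resulting embedding can then be plotted and coloured by detection type or label to produce the kind of cluster view shown in Figure 1C.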

Individual song types can be named in boxes below the context window and labelled by using a computer mouse to double-click on events or to drag over ranges of continuous sounds. The numbers of annotated and automatically detected events are tracked below the context window.

The software is designed to encourage users to follow analysis pipelines that are enabled by "wizard" buttons (Figure 1B). For example, a user can explore a new dataset containing unlabeled data by selecting the "label sounds" wizard button, which enables five buttons that can be activated, from left to right, to perform the following steps: (1) automatically detect sounds above user-defined thresholds.

To examine the accuracy of the deep-learning neural network, we aligned pulse annotations to the nearest peak within five milliseconds and labelled all other points in time as "other" or "ambient", depending on whether a time-domain threshold was exceeded or not, respectively. We withheld five of the 25 recordings for validation and used the remaining 20 recordings to train the classifier.

We then tested the classifier on recordings of courtship song made in chambers that were smaller than the chambers employed for the recordings described above. This is a challenging test case, because the noise characteristics differed systematically between the two sets of recordings (Figure S4). Two human experts independently performed dense annotations of pulse events for 23 randomly chosen one-minute segments of these recordings, without prior discussion of how they would label events. Different humans often disagree on the annotation of low signal-to-noise events and, indeed, Person 1 labelled more events than Person 2. Both annotators and the classifier agreed on most pulse events (Figure 2C). Overall, the classifier and the two humans displayed similar levels of unlabeled events, suggesting that the classifier, even when trained on different sets of recordings, performed approximately as well as the "average" human.

Given the high accuracy of the classifier at detecting pulse song, we asked whether a classifier could be trained to accurately predict many song types across multiple species. We performed sparse annotations of all definable courtship song types from nine additional species, systematically labelled inter-pulse intervals between labelled pulse events, and trained a classifier to recognize each of the 37 song types (Figure 3A). Since these samples were labelled sparsely, we cannot examine accuracy as we did above for dense annotations. Instead, we examined the likelihood that an event was labeled correctly, given that there was an annotated song event at a particular time, and present the results as a confusion matrix (Figure 3B). The classifier assigned most events with greater than 90% accuracy, suggesting that the neural network architecture that we employed can be used to classify song from many Drosophila species.
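The dense-annotation evaluation described above (snapping each pulse annotation to the nearest waveform peak within five milliseconds and labelling all remaining time points as "other" or "ambient" according to a time-domain threshold) can be sketched as follows. This is an illustrative reconstruction rather than the code used for the paper; the 10 kHz sampling rate and the threshold value are assumptions.

import numpy as np
from scipy.signal import find_peaks

# Illustrative reconstruction of the evaluation labelling step; the
# sampling rate and amplitude threshold are assumptions.
FS = 10_000                  # assumed sampling rate (Hz)
TOL = int(0.005 * FS)        # 5 ms alignment tolerance, in samples
AMP_THRESH = 0.05            # assumed time-domain amplitude threshold

def align_and_label(song, annotated_samples):
    """Per-sample labels: 'pulse' at aligned peaks, else 'other' or 'ambient'."""
    peaks, _ = find_peaks(np.abs(song))
    labels = np.where(np.abs(song) > AMP_THRESH, "other", "ambient")
    for t in annotated_samples:
        nearby = peaks[np.abs(peaks - t) <= TOL]
        if nearby.size:
            labels[nearby[np.argmin(np.abs(nearby - t))]] = "pulse"
    return labels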

The classifier discriminated with high accuracy between similar song types within a species and between very similar song types from closely related species. For example, D. persimilis produces two different pulse types (Figure 3D-E) and the classifier accurately discriminated between these subtypes (Figure 3B). Surprisingly, the classifier also accurately distinguished between the very similar pulse events of the sister species D. simulans and D. mauritiana recorded on the same set of microphones, which, in our experience, cannot be discriminated by humans (Figure 4F-K). The classifier employs a context window of 204.8 ms surrounding each event, and the classifier may therefore have used information about the diverged inter-pulse intervals of these species to discriminate between their pulse types.
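As a worked example of the context window, and assuming a 10 kHz sampling rate (the actual rate is not stated in this excerpt), 204.8 ms corresponds to 2048 samples centred on each event, which typically spans several inter-pulse intervals. A minimal sketch of extracting such a window:

import numpy as np

# Minimal sketch: extract the 204.8 ms context window around an event.
# The 10 kHz sampling rate is an assumption; 204.8 ms is then 2048 samples.
FS = 10_000
WINDOW = int(round(0.2048 * FS))     # 2048 samples

def context_window(song, event_sample):
    """Waveform centred on event_sample, zero-padded at the recording edges."""
    half = WINDOW // 2
    start, stop = event_sample - half, event_sample + half
    out = np.zeros(WINDOW, dtype=song.dtype)
    lo, hi = max(start, 0), min(stop, len(song))
    out[lo - start:hi - start] = song[lo:hi]
    return out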

Some song types were not discriminated well, such as D. erecta sine 1 vs sine 2 and D. persimilis sine 1 vs sine 2. In both cases, the alternative labels were assigned during manual annotation prior to the availability of SongExplorer. Post hoc examination of songs within the SongExplorer interface revealed that sine song is rare in both species and that there is no compelling evidence for multiple sine types, suggesting that the classifier correctly failed to discriminate between sine song subtypes in these species because the species do not produce multiple subtypes. Alternatively, it is possible that the classifier failed to discriminate multiple sine song types because these songs are rare. However, we found only weak dependence of classifier accuracy and precision on song event sample size (Figure 3C), and sample sizes above 100 yielded similar accuracies.

Manual annotation is not only time-consuming and tedious, but it is also subjective and can frustrate identification of rare sounds. We therefore sought a method to rapidly identify both common and rare sounds. We found that the activities of the hidden-layer neurons of a trained network exhibit considerable structure related to song types (Figure 5) and allow rapid identification of novel song types. We illustrate two modes of this discovery approach.

First, we trained a network on manually labelled male Drosophila melanogaster pulse and sine song. We also included automatically generated labels for inter-pulse intervals and ambient noise.
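The network architecture itself is not reproduced in this excerpt. As a rough, hypothetical illustration only, a small one-dimensional convolutional classifier over fixed-length waveform windows with four classes (pulse, sine, inter-pulse interval, ambient) could be set up as below; every layer size, the window length, and the training settings are assumptions, not SongExplorer's actual configuration.

import tensorflow as tf

# Hypothetical illustration of a small 1-D convolutional classifier; the
# layer sizes and hyperparameters are assumptions, not SongExplorer's.
WINDOW = 2048        # assumed context-window length, in samples
N_CLASSES = 4        # pulse, sine, inter-pulse interval, ambient

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.Conv1D(32, kernel_size=16, strides=2, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=16, strides=2, activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=16, strides=2, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_windows, train_labels, validation_data=(val_windows, val_labels))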

This allowed discovery of new song types in species of the Drosophila nasuta species group. One previous paper has reported song types from some of these species (Hongguang et al., 1997). Notably, we identified multiple song types in several species that were not identified in the earlier study (Figure S6). We estimate that, using SongExplorer, we discovered many or all of the song types for each species within approximately 20 minutes of exploring songs per species.

The validation data were used to estimate the best threshold for the classifier probabilities. Each time point was originally annotated with one of two classes, "pulse" vs "not-pulse". We added an additional, automatically defined class, "other pulse", for pulse-like events originally labeled as "not-pulse".

Training the species classifier using automatically generated annotations for unsupervised discovery

Sound events from recordings of each species were detected using both time- and frequency-domain thresholds, as described above, and all time points were labeled with the species class label.
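A minimal sketch of the kind of dual-threshold detection described above, assuming a 10 kHz sampling rate and arbitrary threshold and band values (the detector actually used by SongExplorer may differ): the time-domain detector flags high-amplitude samples, and the frequency-domain detector flags spectrogram frames whose power in a chosen band is strong relative to the recording's median.

import numpy as np
from scipy.signal import spectrogram

# Minimal sketch of time- and frequency-domain sound detection; the
# sampling rate, thresholds, and frequency band are assumptions.
FS = 10_000                        # assumed sampling rate (Hz)

def detect_time(song, amp_thresh=0.05):
    """Sample indices whose absolute amplitude exceeds the threshold."""
    return np.flatnonzero(np.abs(song) > amp_thresh)

def detect_frequency(song, band=(100, 400), ratio_thresh=5.0):
    """Sample indices of spectrogram frames with strong power in the chosen band."""
    f, t, sxx = spectrogram(song, fs=FS, nperseg=256, noverlap=128)
    in_band = sxx[(f >= band[0]) & (f <= band[1])].sum(axis=0)
    frames = np.flatnonzero(in_band > ratio_thresh * np.median(in_band))
    return (t[frames] * FS).astype(int)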

After training, the classifier was applied to the same detected sounds, and the resulting hidden-layer activations were used to generate low-dimensional embeddings for interactive discovery of song types.

A - Analysis pipeline to assess the ability of the classifier to recognize songs from many Drosophila species. Song events from recordings of ten species were sparsely annotated, and species-specific song-type labels were used to train the classifier. Classifier performance was tested by assessing the frequency with which the classifier correctly assigned a song type at manually annotated events.

D-G - Classifier performance using different latencies relative to manual annotation by two humans.

Performance was considerably lower for isolated pulses relative to pulses within trains. Classifier performance for the first and last pulse of each train was also lower than for pulses within trains.

Figure S6. Song types discovered for nine species of the D. nasuta species group. The phylogeny of the species examined is shown on the left, with the samples used for song analysis in the same color font as the songs shown on the right. One previous study had found one song type each for six of the species we studied and none for two of the species we studied. In contrast, we identified song in all species we studied, and from two to seven apparently distinct song types in different species.