Machine Source Localization of Tursiops truncatus Whistle-like Sounds in a Reverberant Aquatic Environment

Most research into bottlenose dolphins’ (Tursiops truncatus’) capacity for communication has centered on tonal calls termed whistles, in particular individually distinctive contact calls referred to as signature whistles. While “non-signature” whistles exist, and may be important components of bottlenose dolphins’ communicative repertoire, they have not been studied extensively. This is in part due to the difficulty of attributing whistles to specific individuals, a challenge that has limited the study of not only non-signature whistles but the study of general acoustic exchanges among socializing dolphins. In this paper, we propose the first machine-learning-based approach to identifying the source locations of semi-stationary, tonal, whistle-like sounds in a highly reverberant space, specifically a half-cylindrical dolphin pool. We deliver time-of-flight and normalized cross-correlation measurements to a random forest model for high-feature-volume classification and feature selection, and subsequently deliver the selected features into linear discriminant analysis, linear and quadratic Support Vector Machine (SVM), and Gaussian process models. In our 14-point setup, we achieve perfect classification accuracy and high (MAD of 0.6557 m, IQR = 0.3395 - 1.5694) regression accuracy with fewer than 10,000 features. The regression models yielded better accuracy than the established Steered-Response (SRP) method when all training data were used, and comparable accuracy – even when interpolating at several meters – in the lateral directions when prompted to infer missing training data at testing sites, and at the advantage of significantly improved computation time and the potential for superior accuracy with more training data.

Dolphin communication research is in an active period of growth. Many researchers exchanges [17,24]. Nevertheless, such studies constitute a logical prerequisite to an 23 understanding of the communicative potential of whistles. 24 The scarcity of such studies can be explained in part by a methodological limitation 25 in the way in which dolphin sounds are recorded. In particular, no established method 26 exists for recording the whistles of an entire social group of dolphins so as to reliably 27 attribute the signals to specific dolphins. The general problem of sound attribution, 28 which is encountered in almost every area of communication research, is typically 29 approached in one of two ways: (1) by attaching transducers to all potential sound 30 sources, in which case the source identities of sounds can usually be obtained by 31 discarding all but the highest-amplitude sounds in each source-distinctive recorder, or 32 (2), by using a fixed array (or arrays) of transducers, a physics-based algorithm for 33 identifying the physical origin of each sound, and cameras that monitor the physical 34 locations of all potential sources for matching. 35 While notable progress has been made implementing attached transducers (or tags) 36 to identify the sources of dolphin whistles [25][26][27], shortfalls include the need to 37 manually tag every member of the group under consideration, the tendency of tags to 38 fall off, and the tags' inherent lack of convenient means for visualizing caller behavior. 39 Most significant to research with captive dolphins, the use of tags can conflict with best 40 husbandry practices (e.g., due to risk of skin irritation, of ingestion) and be forbidden, 41 as is the case at the National Aquarium. At such locations, less invasive means of sound 42 attribution are necessary. Unfortunately, a reliable implementation of the array/camera 43 approach to dolphin whistles has not been achieved, though it has been achieved for 44 more tractable dolphin clicks [28]. In the context of whistles in reverberant 45 environments, authors have noted the complications introduced by multipath effects - 46 resulting from the combination of sounds received from both the sound source and 47 acoustically reflective boundaries -to standard signal processing techniques. These 48 complications generally arise from the overlap of original and reflected sounds that 49 confound standard, whole-signal methods of obtaining time-of-flight differences. 50 Standard techniques have at best obtained modest results in relatively irregular, 51 low-reverberation environments where they have been evaluated [29][30][31][32]. In unpublished 52 work, we have achieved similar results. One method of improving a standard signal 53 processing tool for reverberant conditions, the cross-correlation, has been proposed 54 without rigorous demonstration and has not be reproduced [33]. Among all previous 55 methods we have identified those falling under the umbrella of Steered-Response Power 56 (SRP) most effective. In short, these methods rely on maximizing the sum of 57 cross-correlations between all pairs of hydrophone signals with respect to sets of time 58 shifts between signals/hydrophones, each set corresponding to a hypothetical source 59 location in the pool [34]; hypothetical source locations may correspond to equally-spaced 60 grid points in the suspect zone, the naive approach, or be iteratively chosen in a more 61 efficient fashion [35]. While the details of the particular SRP method employed are left 62 November 26, 2019 2/16 vague, Rebecca E. Thomas et al. [36] have rigorously demonstrated such a method with 63 reasonable success (used for about 40% recall of caller identity) [36]. 64 We propose the first machine-learning-based solution to the problem of localizing were obtained from the EP, when the seven resident dolphins were in the holding pools; 83 their natural sounds were present in recordings.

84
To ensure that the sound samples used for classification were not previously 85 distorted by multipath phenomena (i.e., were not pre-recorded), were obtained in 86 sufficient quantity at several precise, known locations inside the EP, and were 87 representative of the approximate "whistle space" for Tursiops truncatus, we chose to 88 use computer-generated whistle-like sounds that would be played over an underwater 89 Lubbell LL916H speaker. 90 We generated 128 unique sounds (with analysis done on 127) to fill the available 91 time. To be acoustically similar to actual T. truncatus whistles, these sounds were to be 92 "tonal" -describable as smooth functions in time-frequency space, excluding harmonics -93 and to be defined by parameters and parameter ranges, given in Table 1, representative 94 of those used and observed by field researchers to characterize dolphin whistles [37,38]. 95 To construct a waveform to be played, we began with an instantaneous frequency, f (t), 96 that described a goal time-frequency (or spectrographic) trace, for instance the trace represented a piecewise function that modulated the intensity of the signal at different 104 times (the beginning and end of a signal were gradually increased to full intensity and 105 decreased to zero, respectively, as functions of the "Power Onset/Decay Rate," and the 106 absolute beginning occurred at a peak or trough of the sinusoid f (t) according to 107 "Phase Start"). Alternatively, the phase derived for f (t) could be transformed with a 108 heuristic into a waveform corresponding to a slightly modified version of f (t), 109 specifically a quasi-sinusoid with "sharpened" peaks and approximate 110 whistle-harmonic-like traces higher in frequency than the fundamental. This heuristic is 111 , where m is a parameter that simultaneously affects the 112 "sharpness" of the peaks and the number of harmonics; we used a value of 0.8.

113
Waveforms were played in Matlab through a MOTU 8M audio interface at calibrated 114 volumes and a sampling rate of 192 kHz. An example of pre-speaker output is given in 115  The LL916H speaker was suspended by rope from a custom flotation device and moved 121 across the pool surface by four additional ropes extending from the device to research 122 assistants standing on ladders poolside. Importantly, the speaker was permitted to sway 123 from its center point by approximately 1/3 m (as much as 1 m) in arbitrary direction 124 during calibration. These assistants also used handheld Bosch 225 ft (68.58 m) Laser 125 Measure devices to determine the device's distance from their reference points (several 126 measurements were taken for each location), and through a least-squares trilateration 127 procedure [39] the device location could always be placed on a Cartesian coordinate 128 system common with the hydrophones. Each sound in a 128-sound run was played after 129 a 2-second delay as well as a 0.25-second, 2-kHz tone, that allowed for the creation of a 130 second set of time-stamps in order to compensate for clock drift during the automated 131 signal extraction.  hydrophones were obtained by making underwater tape-measure measurements as well 141 as above-water laser-rangefinder measurements; various calibrations were performed kHz by two networked MOTU 8M audio interfaces into the Audacity AUP sound 144 format, to avoid the size limitations of standard audio formats. The same two audio 145 interfaces were involved in playing the sounds. Standard passive system operation was 146 managed by Matlab scripts recording to contiguous WAV files; for consistency Matlab 147 was also used for most data management and handling. Data are available at quasi-sinusoids with the same parameters grouped together given their close similarity. 155 Each sound was initially digested into 1,333,877 continuous, numerical features (or 156 variables), together composing the so-called feature set.

157
The first 120 features were time-difference-of-arrivals (TDOA's) obtained using the 158 Generalized Cross-Correlation Phase Transform (GCC-PHAT) method [40], which in 159 currently unpublished work we found to be most successful among correlation-based 160 methods for obtaining whistle TDOA's. Briefly, a TDOA is the time difference between 161 the appearance of a signal (such as a whistle) in one sensor (such as a hydrophone) 162 versus another. Possession of a signal's TDOA's for all pairs of 4 sensors (with known 163 geometry) in 3-dimensional space is theoretically, but often not practically, sufficient to 164 calculate the exact source position of that signal from geometry [41]. While we establish 165 elsewhere that the 120 TDOA's (i.e., one TDOA for each distinct pair of our 16 166 hydrophones, excluding self-pairs) obtained from GCC-PHAT are practically not 167 sufficient to calculate the exact source position of our signals using a standard 168 geometric technique that accommodates more than four sensors, Spherical 169 Interpolation [42,43], we suspected that these TDOA's might still contain information 170 helpful to a machine learning model. whistle's two audio snippets. While each correlation series was initially 384,000 elements 175 long (192,000 samples/second x 2 seconds), we only kept the central 6601 elements from 176 each, corresponding to a time-shift range of approximately ±17 milliseconds; based on 177 geometry, the first incidence of any sound originating from inside the pool must have 178 arrived within 17 milliseconds between any two sensors (conservatively), meaning 179 cross-correlation elements corresponding to greater delays could be expected to be less 180 helpful to in-pool source location prediction.

181
Lastly, we included 27,126 x 16 discrete Fourier transform elements (one set for each 182 of 16 hydrophones, for frequencies from 0 Hz to 27,126 Hz). However, preliminary 183 analysis found these features to be unhelpful to classification. Thus, they were 184 discarded from the feature set, leaving 897,871 features.

185
Possessing the above feature set for each whistle, each "labeled" by the whistle 186 source location and coordinates, we constructed predictive models on the 90% of 187 whistles made available for model construction (or training). We began with multiclass 188 classification [45], training models to predict which of the 14 possible locations a novel 189 whistle originated from. Given our limited computational resources, our feature set 190 remained too large to accommodate most classifiers. A notable exception was the 191 Breiman random forest [46,47], which was suitable not only for classification -being a 192 powerful nonlinear multiclass classifier with built-in resistance to overfitting -but for classification accuracy), via the permuted variable delta error metric. The permuted 195 variable delta error roughly describes the increase in classification error when a 196 particular feature is effectively randomized, providing a measure of that feature's 197 importance in classification. We grew a Breiman random forest composed of CART 198 decision trees [48] on the training data; each tree was sequentially trained on a random 199 subset of˜75% of the training samples using a random˜√897, 871 feature subset (as 200 per standard practice). Out-of-bag (OOB) error, referring to the classification error on 201 samples not randomly chosen for the training subset, was used for validation. While an 202 introduction to machine learning is beyond the scope of this paper, these and 203 subsequent techniques are standard with reader-friendly documentation available at 204 such sources as MathWorks (https://www.mathworks.com) and sciikit-learn 205 (https://scikit-learn.org/stable/), to which we direct interested readers. 206 We subsequently used permuted variable delta error as a measure of feature 207 importance, both to examine the selected features for physical significance -recall that 208 cross-correlation element features correspond to pairs of sensors -and to obtain a 209 reduced feature set appropriate for training additional models. The reduced feature set 210 included the 6,788 features with nonzero permuted variable delta error. On the reduced 211 feature set, we considered a basic CART decision tree [45,48], a linear and quadratic 212 Support Vector Machine (SVM) [45,49], and linear discriminant analysis [50]. 213 We also considered Gaussian process regression (also termed kriging) [51, 52] -a 214 nontraditional, nonparametric method of regression that could accommodate our 215 under-constrained data. Whereas the purpose of our classification models was to predict 216 from which of 14 possible points a sound originated, the purpose of our regression 217 models was to predict the three Cartesian coordinates from which the sound originated. 218 Localization by regression was performed two ways. In the first way, all training sounds 219 were used to generate three models (one per dimension) for predicting the coordinates 220 of all test sounds. In the second way, training sounds from all but one grid point were 221 used to generate models for predicting the coordinates of test sounds from the excluded 222 point; the process was repeated for all grid points, the results aggregated. While we 223 were doubtful of our models' ability to precisely interpolate at distances of several 224 meters from 14 points, this test was envisioned to show that reduced-feature-set models 225 are capable of a degree of spatial interpolation.

226
For comparison, we employed a Steered-Response Power approach [34] to localizing 227 the sounds in the test set. As the details to the particular method used by Thomas et 228 al. [36] are unpublished, we followed a standard procedure: for a hypothetical sound 229 originating from each of every point of a 6" (15.25 cm) virtual gridding of the pool, we 230 calculated the theoretical differences in the time of arrival for our 16 hydrophones.

231
Shifting the 16 signals of a received whistle in our test set by each of every set of 232 differences, then cross-correlating the shifted signals and summing the results along 233 both axes, the predicted source location corresponded to the grid point that produced 234 the largest value thus computed. We calculated the speed of sound from the Del Grosso 235 equation [53]; the pool salinity was 31.5 ppt and temperature 26.04°C. As an aside, we 236 attempted to replace the standard cross-correlation in the prior calculation with the 237 Generalized Cross Correlation with Phase Transform [40], which constitutes the 238 SRP-PHAT technique for sound localization discussed in [34], but quickly found the 239 results to be inferior. 240 Lastly, we obtained a minimal, nearly sufficient feature set by training a single, 241 sparse-feature (or parsimonious) decision tree classifier on all features of all training 242 data. We then investigated these minimal features for physical significance, by mapping 243 features' importance (again, using a random forest's permuted variable delta error) back 244 to the sensor and array pairs from which they were derived.

246
The random forest classification model trained on the full feature set, as described above, 247 reached 100.0% OOB accuracy at a size of approximately 180 trees. We continued 248 training to 300 trees, and evaluated the resulting model on the test set: 100.0% accuracy 249 was achieved, with 6,788 features possessing permuted variable delta error greater than 250 0 (based on OOB evaluations). Note that, given the stochastic construction of the 251 random forest, these features did not represent a unique set or superset of sufficient 252 features for obtaining 100.0% test accuracy. When we considered which array pairs the 253 6,778 TDOA and cross-correlation features represented, we found that all pairs of the 254 four hydrophone arrays were represented with no significant preference. 255 We trained several more models on the reduced, 6,788-item feature set, including a 256 basic decision tree, a linear and quadratic SVM, and linear discriminant analysis, using 257 10-fold cross-validation.     Mean ranks of test sound localization error. The mean rank for each test protocol and axis of localization, computed ahead of a Kruskal-Wallis test (discussed in the text), are shown with 95% CI's. Comparing every pair of groups using post-hoc Dunn's tests with an overall significance threshold of 0.05, we confirmed that interval overlaps visualized here reflect an inability to reject the null hypothesis that the underlying samples are drawn from the same distribution; similarly, interval non-overlaps visualized here reflect rejection of the null hypothesis that the underlying samples are drawn from the same distribution. Note, "S GR" (Standard Gaussian regression) refers to Gaussian regression performed with training data from grid locations of test sounds made available for model building; "LO GR" (Leave-out Gaussian regression) refers to Gaussian regression performed with training data for grid locations of test sounds excluded from model building.
[97.73 -100.0]). Thus, we considered this feature set both sufficient and sparse enough 292 to meaningfully ask whether classification is making use of features derived from a 293 spatially mixed set of hydrophone and hydrophone array pairs, consistent with a 294 geometric approach to sound source localization. The permuted variable delta error was 295 summed across hydrophone and hydrophone array pairs, which is visualized in Fig 6. 296 Overall, we note that, directly or indirectly, features representing all pairs of 297 hydrophone arrays are utilized for classification.

299
We provided a proof of concept that sound source localization of semi-stationary 300 bottlenose whistle playbacks can be achieved implicitly as a classification task and 301 explicitly as a regression task in a standard, highly reverberant, half-cylindrical captive 302 dolphin enclosure. Moreover, for the same conditions we showed that, for the 303 localization of whistles originating near training-set sounds (within˜1/3 m, the speaker 304 sway range), Gaussian regression outperforms a standard Steered-Response Power (SRP) 305 approach to whistle localization. Localizing in the lateral directions, which is often Gaussian process models prompted to predict test sound coordinates at novel grid points 308 (potentially interpolating over several meters) did so with similar accuracy to SRP.

309
The data consisted of 16 independent recordings (from four four-hydrophone arrays) 310 of 127 unique bottlenose-dolphin-whistle-like sounds played at 14 positions in the 311 dolphin exhibit pool (EP) at the National Aquarium, semi-randomly divided into 312 "training" and "test" sets for model building and evaluation, respectively. First, we 313 showed that a random forest classifier with fewer than 200 trees can achieve 100%  Although it remains unclear to what extent sounds originating off-grid are classified 325 to the most logical (i.e., nearest) grid points, we note that our classifiers' success was 326 achieved despite the˜1/3 m drift of the speaker during play-time; this together with the 327 success of regression may indicate a degree of smoothness in the classifiers' 328 decision-making. Also, that a linear classifier, which by definition cannot support 329 nonlinear decision making, suffices for this task on features that are generally expected 330 to vary continuously in value across space (TDOA's, cross-correlations) is reassuring.

331
Nevertheless, this question does warrant further investigation, perhaps using 332 faster-moving sound sources -an investigation of this nature should also be performed 333 to better evaluate the method's ability to localize sounds produced by dolphins. 334 We more suitably addressed the question of off-grid prediction using Gaussian 335 process regression to predict the coordinates of the test sounds. This method was also 336 quite successful when trained on the full training data set, achieving test MAD of 0.6557 337 m (IQR = 0.3395 -1.5694) -less than the expected length of an adult common 338 bottlenose dolphin [54]. In order to better assess the regression models' capacity for 339 interpolation, we evaluated the regression models' performance on test sounds for which 340 no training sounds from the same grid points were used for model generation. While the 341 regressors' overall performance on novel points was not satisfactory, admitting error prediction error was significantly dominated (referring to Fig 5) by localization error in 347 the direction of pool depth. This is significant because sounds from only two distinct 348 pool depths were obtained, which is intuitively unsuitable for interpolation. We think it 349 is reasonable to suggest interpolation in this direction would improve with finer 350 sampling. Moreover, we note that MAD for interpolation in the other two directions 351 was less than average adult dolphin body length.

352
For comparison, we localized the test sounds using a standard SRP approach that 353 has met success elsewhere, particularly in Thomas et al. [36]. Referring to Fig 5,354 Gaussian regression significantly outperformed SRP when all training data was used.

355
When the regression models were deprived of training data from grid points of evaluated 356 test sounds, SRP performed better overall and along the "EP Wall" (i.e., depth) axis; 357 there was no significant performance difference along the other two component axes.

358
While this suggests that with the current training data Gaussian regression does not 359 outperform SRP in three dimensions when prompted to interpolate at longer distances, 360 it also suggests the interpolation models are capable of comparable accuracy in the 361 lateral directions; the same might be expected in the vertical direction were more than 362 two depth points available for interpolation. Moreover, the results suggest that 363 Gaussian regression can perform just as well as SRP for localizing sounds across the 364 pool surface (i.e., disregarding depth), which is often sufficient for distinguishing among 365 potential sound sources based on overhead imaging (as in Thomas et al.), even at the 366 disadvantage of needing to interpolate over distances of several meters.

367
Nevertheless, it remains unclear to what extent naturally produced dolphin whistles 368 can be localized. Such whistles are produced by sources that are faster-moving and 369 possessive of different anisotropic properties than our speaker. Evaluation would require 370 a large set of dolphin whistles generated at known locations in the pool, which we do 371 not possess at present, and cannot obtain due to removal of our equipment. However, 372 even were an evaluation of real dolphin whistles to fail due to the models' inability to 373 generalize in "whistle space," we note that in general captive dolphins' whistle 374 repertoires tend to be limited -groups seem to possess less than 100 unique types [21] -375 and that it would be realistic to train classification/regression models with whistles closely resembling group members' sounds, avoiding the need for the model to generalize 377 over all whistle space. 378 Lastly, we showed that an extremely sparse, 22-item feature set that lends itself to 379 relatively strong classification accuracy includes time-of-flight comparisons from all four 380 pairs of arrays. As sound amplitude information was removed in the process of feature 381 creation, this suggests that the classification and regression methods discussed here 382 implicitly use time-of-arrival information for classification from four maximally spaced 383 sensors, consistent with a naive analytic-geometric approach to sound source 384 localization. However, the inner logic of the models ultimately remains unknown. 385 Overall, we feel this study offers a strong argument that machine learning methods 386 are suitable to solving the problem of bottlenose whistle localization in highly 387 reverberant aquaria, where tag-based solutions to whistle attribution are not feasible. 388 We offer evidence to suggest that these methods might be capable of greater accuracy 389 than SRP methods given adequate training data, coming at smaller computational 390 expense -requiring evaluation of approximately 6,788 features per sound versus 391 performing multiple signal cross-correlations per sound. While we caution that these 392 methods still must be evaluated for real dolphin whistles (representing sources that are 393 faster-moving and possessing different anisotropy than our speaker), we opine that our 394 results are encouraging and warrant further research.