Decoding Speech and Music Stimuli from the Frequency Following Response

The ability to differentiate complex sounds is essential for communication. Here, we propose using a machine-learning approach, called classification, to objectively evaluate auditory perception. In this study, we recorded frequency following responses (FFRs) from 13 normal-hearing adult participants to six short music and speech stimuli sharing similar fundamental frequencies but varying in overall spectral and temporal characteristics. Each participant completed a perceptual identification test using the same stimuli. We used linear discriminant analysis to classify FFRs. Results showed statistically significant FFR classification accuracies using both the full response epoch in the time domain (72.3% accuracy, p < 0.001) and real and imaginary Fourier coefficients up to 1 kHz (74.6%, p < 0.001). We classified decomposed versions of the responses in order to examine which response features contributed to successful decoding. Classifier accuracies using Fourier magnitude and phase alone in the same frequency range were lower but still significant (58.2% and 41.3%, respectively, p < 0.001). Classification of overlapping 20-msec subsets of the FFR in the time domain similarly produced reduced but significant accuracies (42.3%–62.8%, p < 0.001). Participants' perceptual responses were most accurate overall (mean 90.6%, p < 0.001). Confusion matrices from FFR classifications and perceptual responses were converted to distance matrices and visualized as dendrograms. FFR classifications and perceptual responses demonstrated similar patterns of confusion across the stimuli. Our results demonstrate that classification can differentiate auditory stimuli from FFRs with high accuracy. Moreover, the reduced accuracies obtained when the FFR is decomposed in the time and frequency domains suggest that different response features contribute complementary information, similar to how the human auditory system is thought to rely on both timing and frequency information to accurately process sound. Taken together, these results suggest that FFR classification is a promising approach for objective assessment of auditory perception.

Responses vary in temporal and spectral composition; the strongest spectral peaks occur at F0 for every stimulus.

Each participant completed an FFR recording session and a perceptual test. The FFR session involved six recording blocks, with a single stimulus presented in each block.

Ordering of blocks was randomized for each participant. Stimuli were presented with a 70-msec interstimulus interval.

Confusion matrices contain classifier predictions aggregated across all cross-validation folds. We performed FFR classifications and visualized the results using publicly available MATLAB code.

Averaging sweeps into pseudo-trials reduces the number of observations available to the classifier (but likely improves their SNR). As classification of 500-sweep pseudo-trials was found to produce higher accuracy than classification of 100-sweep trials, we performed all subsequent classifications on 500-sweep representations of the data.
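For readers who wish to reproduce this step, a minimal sketch of pseudo-trial construction follows. This is an illustration under our own assumptions (NumPy rather than the study's MATLAB pipeline; random, non-overlapping grouping of sweeps), not the authors' actual code.

```python
import numpy as np

def make_pseudo_trials(sweeps, sweeps_per_trial=500, rng=None):
    """Average random, non-overlapping groups of sweeps into pseudo-trials.

    sweeps: array of shape (n_sweeps, n_samples) for one stimulus/participant.
    Returns an array of shape (n_sweeps // sweeps_per_trial, n_samples).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    order = rng.permutation(len(sweeps))
    n_trials = len(sweeps) // sweeps_per_trial   # leftover sweeps are discarded
    groups = order[: n_trials * sweeps_per_trial].reshape(n_trials, sweeps_per_trial)
    # Averaging many sweeps yields fewer observations, each with higher SNR.
    return sweeps[groups].mean(axis=1)
```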

We next classified the data using a leave-one-participant-out (LOO) cross-validation scheme. Here, we performed 13-fold cross-validation, where in each fold all observations from a single participant were withheld for testing, while the model was trained on the data from the remaining participants. We performed two such classifications; in the first, training and testing were performed on pseudo-trials computed within each participant.

In a clinical setting, the predictive power of classification becomes especially relevant for assessing responses from previously unseen patients. To explore the feasibility of this scenario, we iteratively trained the classifier on data from all but one participant and then tested on the data from that held-out participant. As participant-specific attributes of the test data cannot be taken into account during training, this is a more challenging task. However, it also more closely resembles a realistic clinical use case.
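A leave-one-participant-out scheme of this kind maps directly onto scikit-learn's LeaveOneGroupOut splitter. The sketch below uses placeholder data and is our illustration, not the study's implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((13 * 6 * 4, 50))     # placeholder FFR features
y = np.tile(np.repeat(np.arange(6), 4), 13)   # 6 stimulus labels per participant
groups = np.repeat(np.arange(13), 6 * 4)      # participant ID per pseudo-trial

# One fold per participant: train on 12 participants, test on the 13th.
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(f"mean held-out-participant accuracy: {scores.mean():.3f}")
```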

In addition to the time-domain analyses, for the present study we also explored whether FFRs could be classified in the frequency domain. For our first frequency-domain analysis, we input real and imaginary Fourier coefficients between 0 and 1,000 Hz to the classifier. As expected, the resulting classification accuracy was similar to that obtained with the time-domain response (74.6%, p < 0.001). As can be seen in the confusion matrix and dendrogram in Panel A of Figure 3, the structure of similarities was also similar to that of the time-domain classification, with the strongest similarity between ba/da, followed by di/piano. We next decomposed the frequency-domain representation of the responses into Fourier magnitudes and phases for frequencies up to 1,000 Hz, and classified each of these representations separately.
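The frequency-domain features described above can be derived from the time-domain responses roughly as follows; this is a hedged sketch in which the sampling rate, windowing, and any preprocessing are our assumptions.

```python
import numpy as np

def fourier_features(ffr, fs, fmax=1000.0):
    """Split FFRs into real+imaginary, magnitude, and phase features up to fmax Hz.

    ffr: array of shape (n_trials, n_samples); fs: sampling rate in Hz.
    """
    spec = np.fft.rfft(ffr, axis=-1)
    freqs = np.fft.rfftfreq(ffr.shape[-1], d=1.0 / fs)
    spec = spec[..., freqs <= fmax]            # keep bins from 0 to fmax Hz
    real_imag = np.concatenate([spec.real, spec.imag], axis=-1)
    return real_imag, np.abs(spec), np.angle(spec)
```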

Each FFR recording was accompanied by a separate perceptual identification test. Given that intact stimuli were used in the experiment and all participants had normal hearing, we expected perceptual accuracy to be near 100%. Indeed, perceptual performance was high, with a mean accuracy of 90.6% (p < 0.001). As overall perceptual accuracies were high for each stimulus (84.9%–98.1%, Figure 5), we observed correspondingly greater distances among all categories in the perceptual dendrogram.
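The confusion-to-dendrogram conversion described earlier can be sketched as below; the row normalization, symmetrization, and average linkage are our choices for illustration and may differ from the study's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def confusion_to_dendrogram(conf, labels):
    """Turn a confusion matrix (rows = true class) into a dendrogram structure."""
    p = conf / conf.sum(axis=1, keepdims=True)  # row-normalize to proportions
    sim = (p + p.T) / 2.0                       # symmetrize confusability
    dist = 1.0 - sim                            # frequent confusion -> small distance
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return dendrogram(z, labels=list(labels), no_plot=True)
```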

Perceptual accuracy of all but one participant exceeded 75% (Supplementary Figure S5). We did not observe a significant correlation between perceptual performance and leave-one-participant-out FFR classification performance when pseudo-trials were computed within-participant (rho = 0.04, p = 0.66).
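The reported rank correlation corresponds to a standard Spearman test across the 13 participants, for example as below; the per-participant accuracy vectors here are placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
perceptual_acc = rng.uniform(0.75, 1.0, size=13)   # placeholder per-participant values
classifier_acc = rng.uniform(0.20, 0.60, size=13)  # placeholder per-participant values
rho, p = spearmanr(perceptual_acc, classifier_acc)
print(f"rho = {rho:.2f}, p = {p:.2f}")
```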

In this study we have demonstrated that FFRs elicited by both speech and music stimuli can be successfully classified, and that the pattern of classification approximates that observed with a perceptual-identification task. Here, overall accuracy on the perceptual-identification task was 90.6%, while the overall classifier accuracy was 72.3%. Our classifier accuracy for these CV phonemes and musical instruments is similar to that observed with vowels alone (~70–80%; Sadeghian et al.). Rather than measuring the fidelity of individual response features, classification tests whether responses can be correctly assigned stimulus category labels. This analysis approach thus more closely emulates the process of sound identification that humans perform repeatedly across the lifespan.

Our results suggest that overall classifier performance is heavily driven by accurate labeling of responses to the musical instrument and di stimuli. For these stimuli, time-domain classwise accuracy ranged from 74% for di to 91% for tuba (Figure 2A). In contrast, responses to the ba and da phonemes classified at 55.4% and 50.8%, respectively. While these accuracies exceed the six-class chance level of 16.7%, they are notably lower, and the majority of misclassifications occur between these two categories. One plausible explanation is that the difference in classifier accuracy for these FFRs reflects how robustly the acoustic characteristics of the signal are represented in the neural response. For example, da and di differ in the vowel portion of the CV phoneme, while ba and da differ in the initial consonant.
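The classwise accuracies discussed above are simply the diagonal of the row-normalized confusion matrix; a small helper makes this concrete. The example matrix is invented purely for illustration and is not the study's data.

```python
import numpy as np

def classwise_accuracy(conf):
    """Fraction of trials in each true class (rows) labeled correctly."""
    return np.diag(conf) / conf.sum(axis=1)

# Invented 6-class example in which the first two classes (e.g., "ba" and "da")
# are mostly confused with each other, as in the pattern described above.
conf = np.array([[55, 30,  5,  4,  3,  3],
                 [32, 51,  6,  5,  3,  3],
                 [ 4,  5, 74,  9,  4,  4],
                 [ 3,  3,  8, 80,  3,  3],
                 [ 2,  2,  4,  3, 85,  4],
                 [ 1,  2,  3,  2,  1, 91]])
print(classwise_accuracy(conf).round(2))   # per-class accuracy, e.g. 0.55 for row 0
```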