ABSTRACT
Vocalization is an essential medium for social and sexual signaling in most birds and mammals. Consequently, the analysis of vocal behavior is of great interest to fields such as neuroscience and linguistics. A standard approach to analyzing vocalization involves segmenting the sound stream into discrete vocal elements, calculating a number of handpicked acoustic features, and then using the feature values for subsequent quantitative analysis. While this approach has proven powerful, it suffers from several crucial limitations: First, handpicked acoustic features may miss dimensions of variability that are important for communicative function. Second, many analyses assume that vocalizations fall into discrete categories, often without rigorous justification. Third, a syllable-level analysis requires a consistent definition of syllable boundaries, which is often difficult to maintain in practice and limits the sorts of structure one can find in the data. To address these shortcomings, we apply a data-driven approach based on the variational autoencoder (VAE), an unsupervised learning method, to the task of characterizing vocalizations in two model species: the laboratory mouse (Mus musculus) and the zebra finch (Taeniopygia guttata). We find that the VAE converges on a parsimonious representation of vocal behavior that outperforms handpicked acoustic features on a variety of common analysis tasks, including representing acoustic similarity and recovering a known effect of social context on birdsong. Additionally, we use our learned acoustic features to argue against the widespread view that mouse ultrasonic vocalizations form discrete syllable categories. Lastly, we present a novel “shotgun VAE” that can quantify moment-by-moment variability in vocalizations. In all, we show that data-derived acoustic features confirm and extend existing approaches while offering distinct advantages in several critical applications.
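For readers unfamiliar with the VAE, the following is a minimal sketch of the kind of model the abstract refers to: an encoder compresses a syllable spectrogram into a low-dimensional latent code, and a decoder reconstructs the spectrogram from that code. All details here are illustrative assumptions, not the authors' implementation: the 128×128 input size, the 32-dimensional latent space, the fully connected layers, and the Gaussian (MSE) reconstruction term are placeholders chosen for brevity.

```python
# Minimal VAE sketch in PyTorch (illustrative; not the paper's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramVAE(nn.Module):
    def __init__(self, latent_dim=32):  # latent_dim is an assumed value
        super().__init__()
        # Encoder: map a flattened 128x128 spectrogram to the mean and
        # log-variance of a Gaussian posterior over latent codes.
        self.enc = nn.Sequential(
            nn.Linear(128 * 128, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        # Decoder: map a latent code back to a spectrogram reconstruction.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 128 * 128), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x.flatten(start_dim=1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    # Negative ELBO: reconstruction error plus the KL divergence
    # between the approximate posterior and the standard normal prior.
    recon = F.mse_loss(x_hat, x.flatten(start_dim=1), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Usage on a batch of (hypothetical) syllable spectrograms scaled to [0, 1]:
model = SpectrogramVAE()
x = torch.rand(8, 1, 128, 128)
x_hat, mu, logvar = model(x)
loss = vae_loss(x, x_hat, mu, logvar)
```

After training, the posterior means `mu` serve as the learned acoustic features for each syllable, playing the role that handpicked features play in the standard pipeline.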