Abstract
Across taxa, the forms of vocal signals are shaped by their functions1–15. In humans, a salient context of vocal signaling is infant care, as human infants are altricial16,17. Humans often produce “parent-ese”, speech and song for infants that differ acoustically from ordinary speech and song18–35, in ways that are thought to support parent-infant communication and infant language learning36–39; modulate infant affect33,40–45; or credibly signal information to infants46. These theories predict a universal form-function link in infant-directed vocalizations, with consistent differentiation between infant-directed and adult-directed vocalizations across cultures. Some evidence supports this prediction23,27,28,32,47–50, but the limited generalizability of individual ethnographic reports and laboratory experiments51 and small stimulus sets52, along with intriguing reports of counterexamples53–60, leave the question open. Here, we show that infant-directed speech and song are robustly differentiable from their adult-directed counterparts, within voices and across cultures. We built a corpus of 1,615 recordings of infant- and adult-directed singing and speech produced by 410 people living in 21 urban, rural, and small-scale societies and played the recordings to 45,745 people recruited online from many countries. We asked them to guess whether each vocalization was, in fact, infant-directed. The patterns of inferences of these naïve listeners, supported by acoustic analyses and predictive modelling, demonstrate acoustic cues to infant-directedness that are cross-culturally robust. The cues to infant-directedness differ across language and music, however, informing hypotheses of the psychological functions and evolution of both.
Main
The forms of many animal signals are shaped by their functions, a link arising from production- and reception-related rules that help to maintain reliable signal detection within and across species1–6. Form-function links are widespread in vocal signals across taxa, from meerkats to fish3,7–10, producing acoustic regularities that allow cross-species intelligibility11–13,15. These regularities enable some species to eavesdrop on the vocalizations of others: superb fairywrens (Malurus cyaneus), for example, learn to flee predatory birds in response to alarm calls that they themselves do not produce14.
In humans, an important context for the effective transmission of vocal signals is between parents and infants, as human infants are particularly helpless16. To elicit care, infants use a distinctive alarm signal: they cry17. In response, adults produce infant-directed speech and song (sometimes referred to as “parent-ese”) with putatively stereotyped acoustics18–35.
These stereotyped acoustics are thought to be functional: supporting language acquisition36–39, modulating infant affect and temperament33,40,41, and signalling information to infants46. These theories all share a key prediction: like the vocal signals of other species, the forms of infant-directed vocalizations should be universally shaped by their functions, instantiated with clear regularities across cultures. Evidence for a universal form-function link is mixed, however, given the limited generalizability of individual ethnographic reports and laboratory studies51; small stimulus sets52; and a variety of counterexamples53,54,56–60.
In language, infant-directed speech is primarily characterized by higher and more variable pitch61 and more exaggerated and variable vowels23,62,63, in modern industrialized societies23,28,47,48,50,64,65 and a few small-scale societies49,66. Infants are themselves sensitive to these features, preferring them, even if spoken in unfamiliar languages67–69. But these acoustic features are less exaggerated in some cultures58,64,70 and apparently vary in relation to the age and sex of the infant64,71,72.
In music, infant-directed songs also have stereotyped acoustic features. Lullabies, for example, tend toward slower tempos, reduced accentuation, and simple repetitive melodic patterns31,32,35,73, supporting functional roles associated with infant care33,41,46 in industrialized34,74–76 and small-scale societies77,78. Infants are soothed by these acoustic features, whether produced in familiar44,45 or unfamiliar songs79, and both adults and children reliably associate the same features with a soothing function31,32,73. But cross-cultural studies of infant-directed song have primarily relied upon archival recordings from disparate sources29,31,32, an approach that poorly controls for differences in voices, behavioral contexts, recording equipment, and historical conventions.
The degree to which infant-directed vocalizations are acoustically stereotyped across cultures is therefore unclear. To address this, we created a corpus of infant-directed song, infant-directed speech, adult-directed song, and adult-directed speech from a diverse set of 21 human societies, totaling 1,615 field recordings of 410 individual voices (Fig. 1a, Table 1, and Methods; the corpus is open-access at https://doi.org/10.5281/zenodo.5525161). Participants were asked to provide all four vocalization types, enabling within-voice analyses.
Here, we report analyses of the corpus, using computational methods and a citizen-science experiment, to study three questions: (i) Is infant-directedness mutually intelligible across cultures? (ii) Are the acoustic cues to infant-directedness cross-culturally robust? (iii) Are human inferences about infant-directedness aligned to such acoustic cues?
Naïve listeners distinguish infant-directed from adult-directed vocalizations
We played excerpts from the vocalization corpus to 45,745 people in the “Who’s Listening?” game on https://themusiclab.org (after exclusions; see Methods). The participants resided in 184 countries and reported speaking 164 native languages. We asked them to judge, quickly, whether each vocalization was directed to a baby or to an adult (see Methods and Extended Data Fig. 1). We only included recordings that lacked confounding contextual/background cues (e.g., an audible infant; see Methods). Unless noted otherwise, all estimates reported here are generated by mixed-effects linear regression, adjusting for fieldsite (as a random effect), and with p-values generated via linear combination tests.
Corpus-wide, infant-directed speech was far more likely to be rated as infant-directed than was adult-directed speech from the same voice (Fig. 2a; ID speech = 51%, AD speech = 22%; χ2(1) = 25.3, p < .0001); and infant-directed song was far more likely to be rated as infant-directed than was adult-directed song from the same voice (Fig. 2a; ID song = 72%, AD song = 57%; χ2(1) = 13.58, p < .001). These results were robust to learning effects: they repeated when analyzing only each participant’s first exposure to a vocalization in the experiment, and listener accuracy increased by only 0.06% after each trial (Extended Data Fig. 2). They were also robust to post-hoc data trimming decisions, such as excluding recordings with confounding background noise and/or trials where the listener could likely understand the words in the vocalization (Extended Data Fig. 3).
There was, however, an overall bias toward “baby” responses for songs (67% of all responses were “baby”, but only 51% of songs were infant-directed) and toward “adult” responses for speech (64% “adult” responses vs. 56% actually adult-directed), which led adult-directed songs to be reliably misidentified as infant-directed. To quantify sensitivity to infant-directedness independently of this bias, we ran a d′ analysis at the level of each vocalist, i.e., analyzing participants’ ability to identify infant-directedness within each voice after correcting for response bias. Sensitivity was significantly higher than the chance level of 0 (speech: d′ = 1.05, 95% CI [0.64, 1.46]; song: d′ = 0.42, 95% CI [0.22, 0.62]; ps < .0001), implying that the naïve listeners reliably differentiated between infant- and adult-directed vocalizations in both speech and song, with ∼2.5 times higher sensitivity in speech.
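In signal-detection terms, each vocalist’s d′ is the z-transformed hit rate (responding “baby” to an infant-directed vocalization) minus the z-transformed false-alarm rate (responding “baby” to an adult-directed one). A minimal Python sketch, using a log-linear correction of our own choosing for extreme rates (the paper’s exact correction is not specified here):

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Adds 0.5 to each cell (a log-linear correction; an illustrative
    choice) so rates of exactly 0 or 1 do not yield infinite z-scores.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)

# One hypothetical vocalist: of 40 infant-directed clips, 30 drew "baby"
# responses; of 40 adult-directed clips, 12 drew "baby" responses.
sensitivity = d_prime(hits=30, misses=10, false_alarms=12, correct_rejections=28)
```

Because hit and false-alarm rates are computed within a single voice, a positive d′ reflects discrimination of infant-directedness per se rather than a global tendency to answer “baby” or “adult”.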
We also analyzed performance in the task within the subset of recordings drawn from each fieldsite. Cross-site variability was evident, especially in the size of effects (but less so in their direction); we caution that some fieldsites had small samples, making it impossible to know whether such effects represent true cross-cultural variability, sampling variability, or both. In 20 of 21 fieldsites, mean “baby” ratings were higher for infant-directed speech than adult-directed speech (Fig. 2b), and in 17 of 21 fieldsites, mean “baby” ratings were higher for infant-directed song than adult-directed song (Fig. 2b). Even in the fieldsites that failed to replicate the overall pattern in song, however, the mean “baby” rating for infant-directed song was above the chance level of 50%. Fieldsite-wise d′ scores are reported in Extended Data Table 1.
Listener sensitivity within each fieldsite was also correlated with a number of society-level characteristics: rank-order population size (speech: τ = 0.53; song: τ = 0.6), distance from fieldsite to nearest urban center (speech: r = -0.75; song: r = -0.49), and number of children per family (speech: r = -0.57; song: r = -0.8; all ps < .001). These predictors were highly correlated with one another (all r > 0.6), however, suggesting that they did not each contribute unique variance. There was no correlation with ratings of how frequently infant-directed vocalizations were used within each society (ps > .4).
Tests of cross-cultural variability among listeners also revealed strong similarity in the perception of infant-directedness. On trials where the vocalization being judged was in a language closely related to the listener’s native language (e.g., when the vocalization was in Spanish and the listener’s native language was English, which are both Indo-European languages), performance increased only modestly relative to trials where the language family did not match (e.g., when the vocalization was in Mentawai, an Austronesian language, and the listener’s native language was Mandarin, a Sino-Tibetan language); the effect was statistically significant but small (difference in d′ = 0.18, p = 0.01; Extended Data Fig. 4). Linguistic relatedness therefore accounted for only a small amount of variability in naïve listeners’ intuitions of infant-directedness. More generally, random effects of listener country, gender, and age on sensitivity were all small (each varying by < 1%), implying cross-demographic consistency in listener intuitions.
Acoustic correlates of infant-directedness across cultures
What enables such a diverse group of people to arrive at such similar conclusions about unfamiliar, foreign vocalizations, in languages that they do not understand? One possibility is that there exists a universal set of acoustic features driving listeners’ inferences concerning the intended targets of speech and song, which are reliably instantiated within and across societies, as suggested by functional accounts of infant-directed vocalization33,36–43,46.
To test this possibility, we studied 15 types of acoustic features in each recording (e.g., pitch, rhythm, timbre) via multiple variables (e.g., median, interquartile range); we processed these variables to reduce the influence of atypical observations (e.g., extreme values caused by loud wind, rain, and other background noises) and standardized them within voices to eliminate between-voice variability. This yielded a total of 99 variables (see Methods; a codebook is in Extended Data Table 2).
Following a preregistered exploratory-confirmatory design, we fitted a multi-level mixed-effects regression predicting each acoustic variable from the vocalization types, adjusting for voice and fieldsite as random effects, and using linear combinations to test for infant-directedness differences in song and speech separately. To reduce the risk of Type I error, we performed this analysis on a randomly selected half of the corpus (exploratory; weighted by fieldsite) and only report results that successfully replicated in the other half (confirmatory). We did not correct for multiple tests because the exploratory-confirmatory design restricts the tests to those with a directional prediction.
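The logic of the split can be sketched as follows. This is a simplified stand-in for the procedure: mean differences and a sign check replace the mixed-effects regressions and significance tests, and the record fields `fieldsite` and `infant_directed` are assumed for illustration:

```python
import random
from collections import defaultdict

def split_by_fieldsite(recordings, seed=1):
    """Split recordings into exploratory and confirmatory halves, stratified
    so that each fieldsite contributes equally to both halves. `recordings`
    is a list of dicts with a 'fieldsite' key (an assumed data layout)."""
    by_site = defaultdict(list)
    for rec in recordings:
        by_site[rec["fieldsite"]].append(rec)
    rng = random.Random(seed)
    exploratory, confirmatory = [], []
    for site_recs in by_site.values():
        rng.shuffle(site_recs)
        half = len(site_recs) // 2
        exploratory.extend(site_recs[:half])
        confirmatory.extend(site_recs[half:])
    return exploratory, confirmatory

def effect_direction(recs, feature):
    """Sign of the mean infant-directed minus adult-directed difference."""
    idv = [r[feature] for r in recs if r["infant_directed"]]
    adv = [r[feature] for r in recs if not r["infant_directed"]]
    diff = sum(idv) / len(idv) - sum(adv) / len(adv)
    return (diff > 0) - (diff < 0)

def replicated(recordings, feature):
    """A feature counts only if the exploratory-half effect recurs, in the
    same direction, in the confirmatory half (significance tests omitted)."""
    expl, conf = split_by_fieldsite(recordings)
    d1, d2 = effect_direction(expl, feature), effect_direction(conf, feature)
    return d1 != 0 and d1 == d2
```

Only features passing `replicated` would be reported, which is why no multiple-comparisons correction is applied: the confirmatory half tests a single directional prediction per feature.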
This procedure identified 16 acoustic features that distinguished infant-directed from adult-directed vocalizations, in song, speech, or both, in the context of vocalizing to a fussy infant (Fig. 3; statistics are in Extended Data Table 3). For example, across cultures and within voices, infant-directed speech had considerably higher pitch, greater pitch range, and more contrasting vowels than adult-directed speech. These results repeated consistently in each fieldsite: pitch, energy roll-off, and inharmonicity showed the same direction of difference in all 21 fieldsites; and other features, such as vowel contrasts and attack curve slopes, were consistent in the majority of them (see the doughnut plots in Fig. 3a). These patterns align with prior claims that pitch and vowel contrasts are robust features of infant-directed speech23,65, and substantiate them across many cultures.
The distinguishing features of infant-directed song were subtler, but nevertheless corroborate its purported soothing functions33,41,46: reduced loudness, intensity, and acoustic attack; reduced pitch range; and purer-sounding vocal qualities (reduced roughness and inharmonicity), which were mostly consistent across sites. The smaller effects in song, relative to speech, may result from the fact that while solo-voice speaking is fairly natural and representative of most adult-directed speech (i.e., people rarely speak at the same time), much of the world’s song occurs in social groups with multiple singers and accompanying instruments32,46,80. Asking participants to produce solo adult-directed song may have biased them toward choosing more soothing and intimate songs (e.g., ballads, love songs; see Extended Data Table 4), or toward less naturalistic renditions of songs that would normally be sung in less constrained social contexts. Further, the adult-directed songs were produced in the presence of an infant, which can in principle alter participants’ singing style35 (although this may comparably alter the adult-directed speech examples; see Methods for one test of this question). Thus the distinctiveness of infant-directed song (relative to adult-directed song) may be underestimated in these data.
Some acoustic correlates of infant-directedness showed very different trends across language and music. For example, whereas median pitch strongly differentiated infant-directed speech from adult-directed speech, it had no such effect in music; pitch variability had the opposite effect across language and music; and similar patterns were evident in first and second formants. Loudness-related features showed a similar pattern: intensity and attack slope were increased in infant-directed speech but decreased in infant-directed song, on average, relative to their adult-directed counterparts. That some basic acoustic features operate differently across infant-directed speech and song supports the possibility of differentiated functional roles18,33,34,45,46,79,81.
But some acoustic features were nevertheless common to both language and music; in particular, overall, infant-directedness was characterized by reduced roughness and inharmonicity, which may facilitate parent-infant signalling5,41 through better contrast with the sounds of screaming and crying17,82; and increased vowel contrasts, potentially to aid language acquisition36,37,39 or as a byproduct of socio-emotional signalling1,63.
Finally, we conducted an exploratory principal components analysis of the full 99 features (Fig. 3b; the analysis accounts for ∼40% of total variability in acoustic features). The results provide convergent evidence that the main forms of acoustic variation partition into orthogonal clusters distinguishing (PC1) speech from song overall; (PC2) infant-directedness in song; and (PC1 and PC3) infant-directedness in speech. Factor loadings are in Extended Data Table 5; they largely replicate the findings of the exploratory-confirmatory analyses. One further pattern highlighted by the principal components analysis is that infant-directedness makes speech more “songlike”, in terms of higher pitch and reduced roughness (PC3); but speech strongly differed from song overall in terms of the variability and rate of variability of pitch, intensity, and vowels, and infant-directedness further exaggerated these differences for speech (PC1).
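An exploratory PCA of this kind reduces to a singular value decomposition of the column-centred matrix of standardized acoustic variables. The sketch below is a generic implementation of that step, not the paper’s exact pipeline:

```python
import numpy as np

def pca(features, n_components=3):
    """Principal components via SVD of the column-centred feature matrix.

    `features` is an (n_recordings x n_features) array, e.g. the 99
    standardized acoustic variables per recording. Returns component
    scores, the proportion of variance explained per component, and the
    component loadings (one row per component)."""
    X = features - features.mean(axis=0)             # centre each variable
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # recordings in PC space
    explained = (s ** 2) / np.sum(s ** 2)            # variance ratio per PC
    return scores, explained[:n_components], Vt[:n_components]
```

Plotting the first component scores against the second and third, with points coloured by vocalization type, would reproduce the kind of clustering shown in Fig. 3b.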
Human intuitions of infant-directedness are modulated by vocalization acoustics
Last, we assessed whether these acoustic features alone are sufficient to replicate human performance in classifying infant-directedness. To do this, we trained two least absolute shrinkage and selection operator (LASSO) classifiers83 with fieldsite-wise leave-one-out cross-validation, separately for speech and song recordings. This approach32 gives a strong test of the cross-cultural consistency of acoustic correlates of infant-directedness, as the model’s classification accuracy is evaluated on held-out data from a fieldsite that it has not been trained on.
Both models performed significantly above the 50% chance level (Fig. 4a; speech: 77% correct, 95% CI [71%, 83%]; song: 65% correct, 95% CI [59%, 71%]). When accounting for response bias, model performance was highly similar to the aggregate guessing patterns of human listeners, as evaluated via a receiver operating characteristic analysis (Extended Data Fig. 6), for both speech (human AUC: 90.77, 95% CI [88.41, 93.14]; model AUC: 92.13, 95% CI [90.33, 93.93]) and song (human AUC: 75.52, 95% CI [71.7, 79.33]; model AUC: 77.37, 95% CI [74.14, 80.6]). Using this same bias-free metric, both models also performed similarly to humans at the level of each individual fieldsite (speech: r = 0.38, p = 0.04; song: r = 0.56, p = 0.004; see Fig. 4a and Extended Data Fig. 7). These results demonstrate that the measured acoustic correlates of infant-directedness operate reliably across the 21 societies studied, at least with sufficient consistency to replicate the overall level of human classification performance.
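The cross-validation scheme can be sketched as follows, with a trivial single-feature threshold classifier standing in for the LASSO and the `fieldsite`/`infant_directed`/`pitch` fields assumed for illustration; the essential property is that accuracy is always scored on a society that was absent from training:

```python
def leave_one_site_out(recordings, fit, predict):
    """Leave-one-fieldsite-out cross-validation: for each fieldsite, fit on
    all other sites and score accuracy on the held-out site, so the model
    is always evaluated on a society it has never seen. `fit`/`predict`
    stand in for the LASSO training and prediction steps."""
    sites = sorted({r["fieldsite"] for r in recordings})
    accuracy = {}
    for held_out in sites:
        train = [r for r in recordings if r["fieldsite"] != held_out]
        test = [r for r in recordings if r["fieldsite"] == held_out]
        model = fit(train)
        correct = [predict(model, r) == r["infant_directed"] for r in test]
        accuracy[held_out] = sum(correct) / len(correct)
    return accuracy

# Toy stand-in for the LASSO: call a recording infant-directed when its
# pitch exceeds the training set's grand mean (illustrative only).
def fit_threshold(train):
    return sum(r["pitch"] for r in train) / len(train)

def predict_threshold(threshold, rec):
    return rec["pitch"] > threshold
```

Averaging the per-site accuracies gives the overall cross-validated score; per-site values support comparisons like those in Fig. 4a.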
We then examined the precise relations between acoustic features and the experiment-wide proportions of infant-directedness ratings for each vocalization, in an approach similar to prior research73. The proportions are a more stringent target to predict than a binary classification (as in the first two LASSO models), in that they form a continuous measure of infant-directedness as heard by the naïve listeners. We trained two further LASSO models to predict the proportions, using the same cross-validation procedure. Both models explained considerable variation in human listeners’ intuitions (Fig. 4b; speech R2 = 0.56; song R2 = 0.21, ps < .0001), albeit more so in speech than in song.
We also measured the relations between the influence of each acoustic cue on human intuitions and the effect sizes of each variable in the corpus-wide acoustical analyses. If human inferences are attuned to some universal profile of acoustic correlates of infant-directedness, one might expect a close relationship between the strength of actual acoustic differences between vocalizations on a given feature and the relative influence of that feature on human intuitions. We compared the variable importance scores from the LASSO model predicting human inferences (visualized in the bar plots in Fig. 4a) to a measure of how acoustically salient each feature was (estimated as mean differences in the corpus; Fig. 3). We found a significant positive relationship for speech (r = 0.82, p < .001) but not for song (r = 0.32, p = 0.14), implying that human intuitions concerning infant-directed song were likely driven by more subjective features of the recordings, higher-level acoustic features that we did not measure, or both; this contrasts with intuitions concerning infant-directed speech, which were largely explicable from simple, objective acoustic features.
Discussion
We provide convergent evidence for cross-cultural regularities in the acoustic design of infant-directed speech and song. Naïve listeners reliably identified infant-directed vocalizations as infant-directed, despite the fact that the vocalizations were of largely unfamiliar cultural, geographic, and linguistic origin; acoustic analyses showed cross-culturally reliable acoustic differentiation of infant-directed and adult-directed vocalizations, in both speech and song; and these acoustic distinctions explained substantial variability in human intuitions concerning infant-directedness.
Thus, despite evident variability in language, music, and infant care practices worldwide, when people speak or sing to fussy infants, they modify the acoustic features of their vocalizations in similar and mutually intelligible ways across cultures. This implies that the forms of infant-directed vocalizations are shaped by their functions, in a fashion similar to the vocal signals of many non-human species.
By analyzing both speech and song recorded from the same voices, we discerned precise differences in the ways infant-directedness is instantiated in language and music. In response to the same prompt of addressing a “fussy infant”, infant-directedness in speech and song was instantiated with opposite trends in acoustic modification (relative to adult-directed speech and song): infant-directed speech was more intense and contrasting (e.g., more pitch variability, higher intensity) while infant-directed song was more subdued and soothing (e.g., less pitch variability, lower intensity). These acoustic dissociations suggest functional dissociations, with speech being more attention-grabbing, the better to distract from baby’s fussiness37,38; and song being more soothing, the better to lower baby’s arousal32,33,41–43,45,79. Speech and song are both capable of playful or soothing roles60, but each here tended toward one acoustic profile over the other, despite both types of vocalization being elicited in the same context: vocalizations used “when the baby is fussy”.
Many of the reported acoustic differences are consistent with the bioacoustics of vocal signalling in non-human animals1–15. For example, in both speech and song, infant-directedness was robustly associated with purer and less harsh vocal timbres, and greater formant-frequency dispersion. In non-human animals, these features have convergently evolved across taxa in the functional context of signalling friendliness or approachability in close contact calls1,3,63,84, in contrast to alarm calls or signals of aggression, which are associated with rough sounds that have less formant dispersal4,85–87. The use of these features in infant care may originate from signalling approachability to baby, but may have later acquired further functions more specific to the human context. For example, greater formant-frequency dispersion accentuates vowel contrasts, which could facilitate language acquisition36,63,88–90; and purer vocal timbre may facilitate communication by contrasting conspicuously with the acoustic context of infant cries5 (for readers unfamiliar with infants, this context is acoustically harsh17,82).
Higher pitch is also routinely a cue for animal vocal signalling of approachability and friendliness; accordingly, one of the largest and most robust results in our study was that infant-directedness raised the vocal pitch (f0) of speech to a songlike level. But infant-directedness had no effect on pitch within song. This curious asymmetry is consistent with the idea that pitched aspects of music may originate from elaborations to generic infant-directed vocalizations, where both use less harsh but more variable pitch patterns and more temporally variable and expansive vowel spaces to provide infants with ostensible “flashy” signals of attention and pro-social friendliness41,46,61,91,92. This does not mean that pitch alterations are absent from infant-directed song (indeed, in one study, mothers sang a song at higher pitch when producing a more playful rendition, and a lower pitch when producing a more soothing rendition44), but on average, both infant- and adult-directed song, along with infant-directed speech, tend to be higher in pitch than adult-directed speech.
We leave open at least two further questions. First, the results are suggestive of universality, because the corpus covers a swath of geographic locations (21 societies on 6 continents), languages (12 language families), and subsistence regimes (8 types) (see Table 1). But these do not constitute a representative sample of humans, so strong claims of universality are not justified; indeed, we found both cross-cultural consistency and variability (e.g., with the fieldsite in Wellington, New Zealand demonstrating main effects an order of magnitude larger than those at some other fieldsites). In addition to studying more representative samples of infant-directed vocalizations, future approaches may (i) use phylogenetic methods to examine whether people in societies that are distantly related nonetheless produce similar infant-directed vocalizations; (ii) test perceived infant-directedness in more diverse samples of listeners, to more accurately characterize cross-cultural variability in the perception of infant-directedness; and (iii) test listener intuitions among groups with reduced exposure to a given set of infant-directed vocalizations, such as very young infants or people from isolated, distantly related societies, as in related efforts27,67,93. Such research would benefit in particular from a focus on societies previously reported to have unusual vocalization practices, infant care practices, or both53,56–58; and would also clarify the extent to which convergent practices across cultures are due to cultural borrowing (in the many cases where societies are not fully isolated from the influence of global media).
Second, speech and song are used in multiple contexts with infants, of which “addressing a fussy infant” (the type of vocalization we elicited from participants) is just one18,34. One curious finding may bear on this question: naïve listeners displayed a bias toward “adult” guesses for speech and “baby” guesses for song, regardless of their actual targets. This suggests that listeners treated “adult” and “baby” as the default reference levels for speech and song, respectively, against which acoustic evidence was compared, a pattern consistent with theories that posit song as having a special connection to infant care in human psychology33,46.
Methods
Vocalization corpus
We built a corpus of 1,615 recordings of infant-directed song, infant-directed speech, adult-directed song, and adult-directed speech (all audio is available at https://doi.org/10.5281/zenodo.5525161). Participants (N = 411) living in 21 societies (Fig. 1a and Table 1) produced each of these vocalization types, with a median of 15 participants per society (range: 6-57). Of those participants for whom information was available, most were female (86%) and nearly all were parents or grandparents of the focal infant (95%).
Recordings were collected by principal investigators and/or staff at their field sites, all using the same data collection protocol. They translated instructions to the native language of the participants, following the standard research practices at each site. There was no procedure for screening out participants, but we encouraged our collaborators to collect data from parents rather than non-parents. Fieldsites were selected partly by convenience (i.e., via recruiting principal investigators at fieldsites with access to infants and caregivers) and partly to maximize cross-fieldsite diversity (see Table 1).
For infant-directed song and infant-directed speech, participants were asked to sing and speak to their infant as if they were fussy, where “fussy” could refer to anything from frowning or mild whimpering to a full tantrum. No fieldsite reported difficulty translating the English word “fussy”, suggesting that participants understood it. For adult-directed speech, participants spoke to the researcher about a topic of their choice (e.g., they described their daily routine). For adult-directed song, participants sang a song that was not intended for infants; they also stated what that song was intended for (e.g., “a celebration song”). The recording collection protocol is posted at https://github.com/themusiclab/infant-speech-song.
For most participants (90%) an infant was physically present during the recording (the infants were 48% female; age in months: M = 11.4; SD = 7.61; range 0.5-48). When an infant was not present, participants were asked to imagine that they were vocalizing to their own infant or grandchild, and simulated their infant-directed vocalizations. Prior research has shown that simulated infant-directed vocalizations are qualitatively similar to authentic ones, albeit less exaggerated, for both speech94 and song35. Consistent with this, a model of the naïve listener results adjusting for fieldsite showed a small decrease in “baby” guesses when an infant was not present (ID song: 7.1%, ID speech: 8.4%, AD song: 6.5%, AD speech: 4.3%, ps < .0001), and this effect was stronger for vocalizations that were infant-directed than adult-directed (χ2(1) = 5.67, p = 0.02). Both the naïve listener results and acoustic analyses were robust to whether these simulated infant-directed vocalizations were included or excluded, however.
In all cases, participants were free to determine the content of their vocalizations. This was intentional: imposing a specific content category on their vocalizations (e.g., “sing a lullaby”) would likely alter the acoustic features of their vocalizations, which are known to be influenced by experimental contexts95. Some participants produced adult-directed songs that shared features with the intended soothing nature of the infant-directed songs; data on the intended behavioral context of each adult-directed song are in Extended Data Table 4.
All recordings were made with Zoom H2n digital field recorders, using foam windscreens (where available). To ensure that participants were audible along with researchers (who stated information about the participant and environment before and after the vocalizations), recordings were made with a 360° dual x-y microphone pattern. This produced two uncompressed stereo audio files (WAV) per participant at 44.1 kHz; we only analyzed audio from the two-channel file on which the participant was loudest.
The principal investigator at each fieldsite also provided standardized background data on the behavior and cultural practices of the society (e.g., whether there was access to mobile-phones/TV/radio, and how commonly people used ID speech or song in their daily lives). Most items were based on variables included in the D-PLACE cross-cultural corpus96. Complete data are posted on the project GitHub repository.
The 21 societies varied widely in their characteristics, from cities with millions of residents (Beijing) to small-scale hunter-gatherer groups of as few as 35 people (Hadza). All of the small-scale societies studied had limited access to TV, radio, and the internet, mitigating the influence of exposure to the music and/or infant care practices of other societies. Four of the small-scale societies (Nyangatom, Toposa, Sápara/Achuar, and Mbendjele) were completely without access to these communication technologies.
The societies also varied in the prevalence of infant-directed speech and song in day-to-day life. The only site reported to lack infant-directed song in contemporary practice was the Quechuan/Aymaran site, although it was also noted that people from this site know infant-directed songs in Spanish and use other vocalizations to calm infants. Conversely, the Mbendjele BaYaka were noted to use infant-directed song, but rarely used infant-directed speech. In most sites, the frequency of infant-directed song and speech varied. For example, among the Tsimane, song is reportedly infrequent in the context of infant care; when it appears, however, it is specifically used to soothe and encourage infants to sleep.
Naïve listener experiment
We analyzed all data available at the time of writing this manuscript from the “Who’s Listening?” game at https://themusiclab.org/quizzes/ids, a continuously running jsPsych97 experiment distributed via Pushkin98. A total of 63,481 participants began the experiment, the first in January 2019 and the last in October 2021.
We played participants vocalizations from a subset of the corpus, excluding those that were less than 10 seconds in duration (n = 113) and those with confounding sounds that were not produced by the target voice in the first 5 seconds of the recording (e.g., a crying baby or laughing adult in the background; n = 364), as determined by two independent annotators who remained unaware of vocalization type and fieldsite (with disagreements resolved by discussion). We also excluded trials where the native language of the listener matched the language of the vocalization (N = 85,968 of 709,628 trials, or 12.1%), as this could enable listeners to infer whether a vocalization was infant-directed independently of the vocalization’s acoustic characteristics. Robustness checks confirmed that the data trimming decisions did not substantially alter the results (Extended Data Fig. 3). Irrespective of the recordings each participant was assigned, we also excluded participants who reported having previously participated in the same experiment (n = 3,514); participants who reported being younger than 12 years old (n = 1,340); and those who reported having a hearing impairment (n = 1,201).
This yielded a sample of 45,745 participants (gender: 20,664 female, 24,126 male, 922 other, 33 did not disclose; age: median 22 years, interquartile range 18-29). Participants self-reported living in 184 different countries (Fig. 1b) and speaking 164 different native languages; roughly half the participants were native English speakers from the United States.
Participants listened to at least 1 and at most 16 vocalizations drawn from the subset of the corpus (as they were free to leave the experiment before completing it) for a total of 388,985 ratings (Fig. 1b; infant-directed song: n = 109,994; infant-directed speech: n = 77,317; adult-directed song: n = 104,023; adult-directed speech: n = 97,651). The vocalizations were selected with weighted randomization, such that a set of 16 trials included 4 vocalizations in English and 12 in other languages; roughly half the corpus was English-language vocalizations, so this method ensured that participants heard a substantial number of vocalizations in other languages. This yielded over 46 ratings per vocalization (median = 447; interquartile range 151-496.75) and thousands of ratings for each society (median = 18,631; interquartile range: 12,100-21,393).
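The weighted randomization can be sketched as follows. This is a minimal illustration with hypothetical recording IDs and a made-up `draw_trial_set` helper, not the experiment's actual code (which is available in the project repository):

```python
import random

def draw_trial_set(english_ids, other_ids, n_english=4, n_other=12, seed=None):
    """Draw one 16-trial set: a fixed quota of English-language recordings
    plus recordings in other languages, shuffled into a random order."""
    rng = random.Random(seed)
    trials = rng.sample(english_ids, n_english) + rng.sample(other_ids, n_other)
    rng.shuffle(trials)  # randomize presentation order across languages
    return trials
```

Fixing per-language quotas, rather than sampling the corpus uniformly, is what guarantees that a corpus that is roughly half English still yields mostly non-English trials for each participant.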
We asked participants to classify each vocalization as either directed toward a baby or an adult (Extended Data Fig. 1), as quickly as possible, either by pressing a key corresponding to a drawing of an infant or adult face (when the participant used a desktop computer) or by tapping one of the faces (when the participant used a tablet or smartphone). The locations of the faces (left vs. right on a desktop; top vs. bottom on a tablet or smartphone) were randomized participant-wise. As soon as they made a choice, playback stopped. After each trial, we told participants whether or not they had answered correctly and how long, in seconds, they took to respond; at the end of the experiment, we gave participants a total score and percentile rank (relative to other participants).
In revising this manuscript, we discovered that a small subset of the corpus had been erroneously excluded from the main experiment. In most cases, these were recordings that had been edited too conservatively, leaving them too short to include in the experiment (but that could reasonably be edited to include longer sections of audio); in other cases, the original excerpting included confounding background noises that, with additional editing, were avoidable. To ensure maximal coverage of the fieldsites studied here, we re-excerpted the audio of 103 examples and collected supplemental naïve-listener data on these recordings via a Prolific experiment (N = 97; 54 male, 42 female, 1 other; mean age = 29.7 years). The Prolific experiment was identical to the citizen-science experiment, except that each participant was paid (at US$15/hr) rather than volunteering, and each participant rated 188 recordings instead of up to 16. In addition to the erroneously excluded recordings, we included in the Prolific experiment 85 additional recordings randomly selected from those in the citizen-science experiment, ensuring that each Prolific participant heard an exactly balanced set of vocalization types. The two cohorts' ratings of the recordings common to both experiments were highly correlated (r = 0.95, p < .0001), demonstrating that the cohorts had similar intuitions concerning infant-directedness in speech and song. As such, in the main text we report all ratings together, without distinguishing between the cohorts.
Acoustic feature extraction
We manually extracted the longest continuous, uninterrupted section of audio from each of the four recordings per participant (i.e., isolating the participant's vocalization from interruptions by other speakers, the infant, and so on), using Adobe Audition. We then used the silence-detection tool in Praat99, with minimum sounding intervals of 0.1 seconds and minimum silent intervals of 0.3 seconds, to remove all portions of the audio in which the participant was not vocalizing (i.e., the silences between phrases). The remaining sounding portions were concatenated in Python, producing denoised recordings, which we then checked manually to ensure minimal loss of content.
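As an illustration, the silence-removal step can be approximated with a simple frame-energy threshold. This is a rough sketch of the logic, not Praat's actual algorithm, and the threshold value is an assumption:

```python
import numpy as np

def _fill_short_runs(mask, target, min_frames):
    """Flip runs of `target` shorter than min_frames to the opposite value."""
    out, i, n = mask.copy(), 0, len(mask)
    while i < n:
        if mask[i] == target:
            j = i
            while j < n and mask[j] == target:
                j += 1
            if j - i < min_frames:
                out[i:j] = not target
            i = j
        else:
            i += 1
    return out

def remove_silences(x, sr, frame=0.025, min_sound=0.1, min_silent=0.3,
                    thresh_db=-25.0):
    """Rough analogue of the silence-detection step: frames whose RMS falls
    more than thresh_db below the loudest frame count as silent; silent
    stretches of at least min_silent seconds are removed, and sounding blips
    shorter than min_sound seconds are treated as silence."""
    hop = int(frame * sr)
    n = len(x) // hop
    rms = np.array([np.sqrt(np.mean(x[k*hop:(k+1)*hop] ** 2)) for k in range(n)])
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    sounding = db > thresh_db
    sounding = _fill_short_runs(sounding, True, int(min_sound / frame))
    sounding = _fill_short_runs(sounding, False, int(min_silent / frame))
    kept = [x[k*hop:(k+1)*hop] for k in range(n) if sounding[k]]
    return np.concatenate(kept) if kept else x[:0]
```

The two run-length passes mirror the role of Praat's minimum sounding and minimum silent interval parameters: brief pauses between words survive, while longer between-phrase silences are cut.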
We extracted and subsequently analyzed acoustic features using Praat99, MIRtoolbox100, temporal modulation spectra computed via discrete Fourier transforms (a measure of rhythmic variability)101, and normalized pairwise variability indices102. These features comprised measurements of pitch (e.g., F0, the fundamental frequency), timbre (e.g., roughness), and rhythm (e.g., tempo), each summarized over time, producing 99 variables in total. We standardized feature values within voices, eliminating between-voice variability. In the main acoustic analyses (Fig. 3a), we restricted the variable set to 26 summary statistics of medians and interquartile ranges, as these correlated highly with other summary statistics (e.g., maximum, range) but were less sensitive to extreme observations. The principal components analysis (Fig. 3b) used the full set of 99 variables.
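The within-voice standardization amounts to z-scoring each feature separately within each speaker, so that analyses reflect within-speaker contrasts (e.g., infant- vs. adult-directed) rather than between-speaker differences such as overall pitch. A minimal sketch (`standardize_within_voice` is a hypothetical helper, not from the project code):

```python
import numpy as np

def standardize_within_voice(values, voices):
    """Z-score one feature within each voice, removing between-voice
    variability (sample standard deviation, ddof=1)."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for v in set(voices):
        idx = [i for i, w in enumerate(voices) if w == v]
        x = values[idx]
        out[idx] = (x - x.mean()) / x.std(ddof=1)
    return out
```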
Praat
We extracted intensity, pitch, and first- and second-formant values from the denoised recordings every 0.03125 seconds. For male participants, the pitch floor was set at 75 Hz, the pitch ceiling at 300 Hz, and the maximum formant at 5000 Hz; for female participants, these values were 100 Hz, 600 Hz, and 5500 Hz, respectively. From these data, several summary values were calculated per recording: mean and maximum first and second formants, mean pitch, and minimum intensity. In addition to these summary statistics, we measured rates of change in intensity and pitch over time. For vowel measures, the first and second formants were used to calculate both the average vowel space used and the vowel change rate (measured as change in Euclidean formant space over time).
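The two vowel measures can be illustrated with simple proxies: treating the region spanned in F1-F2 space as a bounding box (a cruder proxy than, say, a convex hull) and the change rate as frame-to-frame Euclidean movement per second. Both the function name and the bounding-box simplification are our assumptions, not the paper's exact computation:

```python
import numpy as np

def vowel_measures(f1, f2, dt=0.03125):
    """Toy proxies for vowel-space measures from F1/F2 tracks sampled every
    dt seconds: the area spanned in formant space, and the average rate of
    movement through that space (Hz of Euclidean distance per second)."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    # Bounding-box area in F1-F2 space as a simple vowel-space proxy.
    area = (f1.max() - f1.min()) * (f2.max() - f2.min())
    # Frame-to-frame Euclidean steps, converted to a per-second rate.
    steps = np.hypot(np.diff(f1), np.diff(f2))
    change_rate = steps.mean() / dt
    return area, change_rate
```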
MIRtoolbox
All MIRtoolbox (v. 1.7.2) features were extracted with default parameters100. Because mirattackslope returns a list of all attack slopes detected, final analyses were conducted on summary features (e.g., mean, median). The same applies to mirroughness, which returns time-series data of roughness measures in 50-ms windows; we RMS-normalized the mean of mirroughness following ref. 103. MIRtoolbox features were computed on the denoised recordings, with the exception of mirtempo and mirpulseclarity, for which removing the silences between vocalizations would have altered the tempo.
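The roughness summary can be sketched as follows. The exact normalization in ref. 103 may differ; dividing the mean roughness by the recording's RMS amplitude is an assumption of this sketch, intended only to show why a level-dependent feature needs normalizing:

```python
import numpy as np

def summarize_roughness(roughness_series, audio):
    """Summarize a windowed roughness time series; normalize its mean by the
    recording's RMS amplitude so louder recordings are not scored as rougher
    merely because of level (assumed normalization)."""
    r = np.asarray(roughness_series, float)
    rms = np.sqrt(np.mean(np.asarray(audio, float) ** 2))
    return {"mean_rms_norm": r.mean() / rms, "median": float(np.median(r))}
```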
Rhythmic variability
For temporal modulation spectra we followed Ding's method104, which combines discrete Fourier transforms applied to contiguous six-second excerpts. To analyze the entirety of each recording, we padded each recording with silence so that its duration was an exact multiple of six seconds. The location of the peak (in Hz) and the variance of each temporal modulation spectrum were extracted from its RMS values.
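The analysis can be sketched as follows, operating on an amplitude envelope rather than raw audio; details of the published method104, such as how the envelope is extracted and smoothed, are omitted here:

```python
import numpy as np

def modulation_spectrum(envelope, sr, win_s=6.0):
    """Sketch of the temporal modulation analysis: pad the amplitude envelope
    with silence to a multiple of win_s seconds, take the DFT magnitude of
    each contiguous window, and combine windows by root-mean-square."""
    win = int(win_s * sr)
    pad = (-len(envelope)) % win                  # zero-pad to a multiple of win
    env = np.concatenate([np.asarray(envelope, float), np.zeros(pad)])
    chunks = env.reshape(-1, win)                 # contiguous six-second excerpts
    mags = np.abs(np.fft.rfft(chunks, axis=1))
    spec = np.sqrt(np.mean(mags ** 2, axis=0))    # RMS across excerpts
    freqs = np.fft.rfftfreq(win, d=1.0 / sr)
    peak_hz = freqs[1:][np.argmax(spec[1:])]      # ignore the DC bin
    return freqs, spec, peak_hz
```

A useful sanity check: an envelope modulated at a single rate should yield a spectral peak at exactly that rate.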
Normalized pairwise variability index
The nPVI captures the temporal variability of sequences of discrete events, which makes it especially useful for comparing speech and music101. We used an automated syllable- and phrase-detection algorithm to extract events102. We computed the nPVI in two ways: by averaging the nPVI across the phrases within a recording, and by treating the entire recording as a single phrase. Because intervening silences would influence both the temporal modulation and nPVI measures, these analyses used the recordings from before denoising.
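The nPVI itself is a short formula: the mean absolute difference between successive event durations, normalized by their local mean and scaled by 100. A direct implementation:

```python
def npvi(durations):
    """Normalized pairwise variability index over event durations (e.g.,
    syllable or note lengths). Perfectly isochronous sequences score 0."""
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)
```

Averaging this value over the phrases of a recording, versus calling it once on all events in the recording, corresponds to the two ways the measure was computed here.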
Outlier preprocessing
Because automated acoustic analyses are highly sensitive to extreme values (e.g., impossible values caused by non-vocal sounds, such as loud wind), we Winsorized all variables: values below the 5th percentile or above the 95th percentile were recoded as the values at those percentile boundaries. These data were used for all acoustic analyses. This decision had no impact on the interpretation of the results and is preferable to trimming extreme values105; pilot analyses using an alternative method, imputing extreme values with the mean observation for each feature within each fieldsite, yielded comparable results.
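The Winsorizing step is straightforward to express; a minimal sketch:

```python
import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    """Clamp values below the lower or above the upper percentile to those
    percentile boundaries, limiting the influence of extreme observations
    without discarding any data points."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)
```

Unlike trimming, every recording retains a value for every feature; only the magnitudes of the most extreme observations are reined in.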
End notes
Data, code, and materials availability
A fully reproducible manuscript; data, analysis code, and visualizations; other materials; and code for the naïve listener experiment are available at https://github.com/themusiclab/infant-speech-song. The audio corpus is available at https://doi.org/10.5281/zenodo.5525161. The preregistration for the auditory analyses is at https://osf.io/5r72u. Readers may participate in the naïve listener experiment by visiting https://themusiclab.org/quizzes/ids.
Author contributions
S.A.M. and M.M.K. conceived of the research, provided funding, and coordinated the recruitment of collaborators and creation of the corpus.
S.A.M. and M.M.K. designed the protocol for collecting vocalization recordings with input from D.A., who piloted it in the field.
L.G., A.G., G.J., C.T.R., M.B.N., A.M., L.K.C., S.E.T., J. Song, M.K., A.S., T.A.V., Q.D.A., J.A., P.M., A.S., C.D.P., G.D.S., S.K., M.S., S.A.C., J.Q.P., C.S., J. Stieglitz, C.M., R.R.S., and B.M.W collected the field recordings.
S.A.M., C.M.B., and J. Simson designed and implemented the online experiment.
C.J.M. and H.L-R. processed all recordings and designed the acoustic feature extraction with S.A.M. and M.M.K.; C.M.B. provided associated research assistance.
C.M. designed the fieldsite questionnaire with assistance from M.B. and C.J.M., who collected the data from the principal investigators.
C.B.H. and S.A.M. led analyses, with additional contributions from C.J.M., M.B., D.K., and M.M.K.
C.B.H. and S.A.M. designed the figures.
C.B.H. wrote computer code, with contributions from S.A.M., C.J.M., and M.B.
C.J.M., H.L-R., M.M.K., and S.A.M. wrote the initial manuscript.
C.B.H. and S.A.M. wrote the revision, with contributions from C.J.M. and M.B., and all authors approved it.
Ethics
Informed consent was obtained from all participants. Ethics approval for the naïve listener experiment was provided by the Committee on the Use of Human Subjects, Harvard University's Institutional Review Board (protocol #IRB17-1206). Ethics approval for the collection of recordings and their use in research was decentralized; each collaborating researcher arranged ethics approval with their local institution.
Additional information
The authors declare no competing interests.
Supplementary information is available for this paper.
Correspondence and requests for materials should be addressed to S.A.M.
Supplementary Information
Acknowledgments
This research was supported by the Harvard University Department of Psychology (M.M.K. and S.A.M.); the Harvard College Research Program (H.L-R.); the Harvard Data Science Initiative (S.A.M.); the National Institutes of Health Director’s Early Independence Award DP5OD024566 (S.A.M. and C.B.H.); the Academy of Finland Grant 298513 (J.A.); the Royal Society of New Zealand Te Apārangi Rutherford Discovery Fellowship RDF-UOA1101 (Q.D.A., T.A.V.); the Social Sciences and Humanities Research Council of Canada (L.K.C.); the Polish Ministry of Science and Higher Education grant N43/DBS/000068 (G.J.); the Fogarty International Center (P.M., A.S., C.D.P.); the National Heart, Lung, and Blood Institute, and the National Institute of Neurological Disorders and Stroke Award D43 TW010540 (P.M., A.S.); the National Institute of Allergy and Infectious Diseases Award R15-AI128714-01 (P.M.); the Max Planck Institute for Evolutionary Anthropology (C.T.R., C.M.); a British Academy Research Fellowship and Grant SRG-171409 (G.D.S.); the Institute for Advanced Study in Toulouse, under an Agence nationale de la recherche grant, Investissements d’Avenir ANR-17-EURE-0010 (L.G., J. Stieglitz); the Fondation Pierre Mercier pour la Science (C.S.); and the Natural Sciences and Engineering Research Council of Canada (S.E.T.). We thank the participants and their families for providing recordings; L. Sugiyama, for supporting pilot data collection; J. Du, E. Pillsworth, P. Wiessner, and J. Ziker, who collected or attempted to collect additional recordings; S. Atwood, A. Bergson, Z. Jurewicz, D. Li, L. Lopez, E. Radytė, and S. Ccari Cutipa for research assistance; and J. Kominsky, L. Powell, and L. Yurdum for feedback on the manuscript.
Footnotes
* Corrections were made to author spellings and affiliations.
↵* We note one important deviation from the preregistration: we originally planned post-hoc linear combinations to test hypothesized differences between (1) infant-directed and adult-directed vocalizations overall; (2) infant-directed song and adult-directed song; and (3) infant-directed song and infant-directed speech. We retain the second comparison in the main text, but no longer focus on (1) or (3) as the analysis approach is confounded by the fact that acoustic differences between speech and song overall far outstrip the acoustic correlates of infant-directedness. Instead, we adopted the simpler and more informative approach of post-hoc comparisons that are only within speech and within song. We also retained the exploratory-confirmatory design, as it mitigates the potential for inflated Type I errors. For transparency, we still report the preregistered post-hoc tests in Extended Data Fig. 5, but suggest that these comparisons be interpreted with caution.
References