Abstract
Neural activity in the auditory system synchronizes to sound rhythms, and brain–environment synchronization is thought to be fundamental to successful auditory perception. Sound rhythms are often operationalized in terms of the sound’s amplitude envelope. We hypothesized that – especially for music – the envelope might not best capture the complex spectro-temporal fluctuations that give rise to beat perception and synchronize neural activity. This study investigated 1) neural entrainment to different musical features, 2) tempo-dependence of neural entrainment, and 3) dependence of entrainment on familiarity, enjoyment, and ease of beat perception. In this electroencephalography study, 37 human participants listened to tempo-modulated music (1–4 Hz). Independent of whether the analysis approach was based on temporal response functions (TRFs) or reliable components analysis (RCA), the spectral flux of music – as opposed to the amplitude envelope – evoked the strongest neural entrainment. Moreover, music with slower beat rates, high familiarity, and easy-to-perceive beats elicited the strongest neural response. Based on the TRFs, we could decode not only the music stimulation tempo but also the perceived beat rate, even when the two differed. Our results demonstrate the importance of accurately characterizing musical acoustics in the context of studying neural entrainment, and reveal entrainment’s sensitivity to musical tempo, familiarity, and beat salience.
Introduction
Neural activity synchronizes to different types of rhythmic sounds, such as speech and music (Doelling and Poeppel, 2015, Nicolaou et al., 2017, Ding et al., 2017, Kösem et al., 2018), over a wide range of rates. Neural oscillations are involved in the regulation of (patho-)physiological activity and are important for gating input during sensory perception and temporal processing (Giraud and Poeppel, 2012, Henry and Herrmann, 2014). For this reason, bringing neural oscillations into temporal alignment with a rhythmic stimulus – neural entrainment – can influence perception in the auditory, visual, and somatosensory modalities (Henry and Obleser, 2012, Spaak et al., 2014, Gundlach et al., 2016). In the auditory domain, neural oscillations entrained at syllabic, prosodic, and semantic rates in speech seem to play an important role in speech perception and intelligibility (Doelling et al., 2014, Peelle et al., 2013, Kösem et al., 2018). The current study examined neural entrainment to music.
Music is highly rhythmic, and neural oscillations can be entrained by the beat, the most prominent isochronous pulse in music, to which listeners sway their bodies or tap their feet (Tierney and Kraus, 2015, Nozaradan et al., 2012, Large and Snyder, 2009, Doelling and Poeppel, 2015). Most studies that have examined cortical tracking of musical rhythm used simplified musical stimuli, such as MIDI melodies or click tracks (Kumagai et al., 2018, Nozaradan et al., 2012, Di Liberto et al., 2020, Nozaradan et al., 2011, Wollman et al., 2020) or monophonic melodies (Doelling and Poeppel, 2015); only a few studies have focused on naturalistic, polyphonic music (Tierney and Kraus, 2015, Madsen et al., 2019, Kaneshiro et al., 2020). Listeners show a strong preference for music at beat rates around 2 Hz (here, we use the term tempo to refer to the beat rate). The preference for 2 Hz coincides with the modal tempo of Western pop music (Moelants, 2002) and the most prominent frequency of natural adult body movements (MacDougall and Moore, 2005). Indeed, previous research showed that listeners perceive rhythmic sequences with beat rates around 2 Hz as especially salient when they are able to track the beat by moving their bodies (Zalta et al., 2020). Despite the perceptual and motor evidence, studies looking at tempo-dependence of neural entrainment are scarce (Doelling and Poeppel, 2015, Nicolaou et al., 2017), and we are not aware of any study using naturalistic polyphonic musical stimuli that are tempo-modulated. By examining entrainment across a relatively wide and finely spaced range of musical tempi (1–4 Hz, corresponding to the neural δ band), we aimed to test whether the preference for music with beat rates around 2 Hz is reflected in the strength of neural entrainment. In addition, a number of different musical, behavioral, and perceptual measures have been shown to modulate neural entrainment and influence music perception, including complexity, familiarity, repetition of the music, musical training of the listener, and attention to the stimulus (Kumagai et al., 2018, Madsen et al., 2019, Doelling and Poeppel, 2015). Thus, we investigated the effects of enjoyment, familiarity, and the ease of beat perception on neural entrainment.
Most studies assessing neural entrainment to music have examined entrainment to either the stimulus amplitude envelope, which quantifies intensity fluctuations over time (Doelling and Poeppel, 2015, Kaneshiro et al., 2020, Wollman et al., 2020), or “higher order” musical features such as surprise and expectation (Di Liberto et al., 2020). This mimics approaches used for studying neural tracking of speech, where neural activity has been shown to be entrained by the amplitude envelope (Peelle and Davis, 2012), which roughly corresponds to syllabic fluctuations (Doelling et al., 2014), as well as by “higher order” semantic information (Broderick et al., 2019). “Higher order” musical features are difficult to compute for naturalistic music, which is typically polyphonic and has complex spectro-temporal properties (Zatorre et al., 2002). However, amplitude-envelope entrainment is well documented: neural activity synchronizes to the amplitude fluctuations in music between 1 Hz and 8 Hz, and entrainment is especially strong for listeners with musical expertise (Doelling and Poeppel, 2015).
Because of the complex nature of natural polyphonic music, we hypothesized that amplitude envelope might not be the only or most dominant feature by which neural activity would be entrained (Müller, 2015). Thus, the current study investigated neural responses to different musical features that evolve over time and capture different aspects of the stimulus dynamics. Here, we use the term musical feature to refer to time-varying aspects of music that fluctuate on time scales corresponding roughly to the neural δ band, as opposed to elements of music such as key, harmony or syncopation. We examined amplitude envelope, the first derivative of the amplitude envelope (usually more sensitive to sound onsets than the amplitude envelope), beat times, and spectral flux, which describes spectral changes of the signal on a frame-to-frame basis by computing the difference between the spectral vectors of subsequent frames (Müller, 2015). One distinct advantage of spectral flux over the envelope or its derivative is that spectral flux is sensitive to rhythmic information that is communicated by changes in pitch even when they are not accompanied by changes in amplitude.
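For concreteness, the sketch below illustrates a spectral-flux (spectral novelty) computation of the kind described here. The analyses in this study were implemented in Matlab; this Python version, along with its frame length, hop size, and log-compression constant, is an illustrative assumption rather than the exact procedure used.

```python
import numpy as np
from scipy.signal import stft

def spectral_flux(x, sr, n_fft=2048, hop=441):
    """Spectral novelty: half-wave-rectified frame-to-frame change in the
    log-compressed magnitude spectrogram, summed across frequency bins."""
    f, t, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.log1p(100 * np.abs(Z))            # log compression (constant is illustrative)
    diff = np.diff(mag, axis=1)                # change between successive frames
    diff[diff < 0] = 0                         # keep spectral increases (onsets) only
    flux = diff.sum(axis=0)
    return t[1:], flux / (flux.max() + 1e-12)  # frame times and normalized flux
```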
The current study investigated neural entrainment to natural music by using two different analysis approaches: Reliable Components Analysis (RCA) (Kaneshiro et al., 2020) and temporal response functions (TRFs) (Di Liberto et al., 2020). RCA typically relies on stimulus–response correlation or stimulus–response coherence (Kaneshiro et al., 2020). These approaches have been criticized because of their potential susceptibility to autocorrelation, which is argued to be minimized in the TRF approach (Zuk et al., 2021). Thus, we tested the agreement between these two analysis approaches.
We aimed to answer four questions. 1) Does neural entrainment to natural music depend on tempo? 2) Which musical feature shows the strongest neural entrainment during natural music listening? 3) How compatible are RCA- and TRF-based methods with quantifying neural entrainment to natural music? 4) How do enjoyment, familiarity, and ease of beat perception affect neural entrainment? To answer these research questions, we recorded electroencephalography (EEG) data while participants listened to instrumental music presented at different tempi (1–4 Hz). Strongest neural entrainment was observed in response to the spectral flux of music, for tempi between 1–2 Hz, to familiar songs, and to songs with an easy-to-perceive beat. Moreover, a classifier trained on the neural responses to each musical feature predicted the metrical level at which listeners tapped the beat. This indicates that the brain responded to perceived tempo, even when it was different from the stimulus tempo.
Results
Scalp EEG activity of 37 human participants was measured while they listened to instrumental segments of natural music from different genres (Supplementary Table 1). Music segments were presented at thirteen parametrically varied tempi (1–4 Hz in 0.25-Hz steps; see Materials and Methods). We assessed neural entrainment to four different musical features: amplitude envelope, first derivative of the amplitude envelope, beat times, and spectral flux. Neural entrainment was quantified and compared using two different analysis pipelines: 1) RCA combined with time- and frequency-domain analyses, and 2) TRFs (Crosse et al., 2016, Kaneshiro et al., 2020). As different behavioral and perceptual measures have been shown to influence neural entrainment to music (Madsen et al., 2019, Cameron et al., 2019), we investigated the effects of enjoyment, familiarity, and the ease with which a beat was perceived (Fig. 1A). To be able to use a large variety of musical stimuli on the group level, and to decrease any effects that may have arisen from individual stimuli occurring at certain tempi but not others, participants were divided into four subgroups that listened to different pools of stimuli (for more details please see Materials and Methods). The subgroups’ stimulus pools overlapped, but the individual song stimuli were presented at different tempi for each subgroup.
(A) Schematic of the experimental procedure. Each trial consisted of the presentation of one music segment, during which participants were instructed to listen attentively without moving. After a 1-s silence, the last 5.5 s of the music segment was repeated while participants tapped their finger along with the beat. At the end of each trial, participants rated their enjoyment and familiarity of the music segment, as well as the ease with which they were able to tap to the beat (Translated English example in Figure: “How much did you like the song?” rated from “not at all” to “very much”). (B) Exemplary traces of the four musical features of one music segment. (C) Z-scored mean amplitude spectrum of all 4 musical features. (D) Mutual information (MI) for all possible feature combinations (green) compared to a surrogate distribution (yellow, three-way ANOVA, *pFDR<0.001, rest: pFDR<0.05). Boxplots indicate the median, the 25th and 75th percentiles. (E) MI scores between all possible feature combinations (*pFDR<0.001, rest: pFDR<0.05).
Musical features
We examined neural synchronization to the time courses of four different musical features (Fig. 1B). First, we quantified energy fluctuations over time as the gammatone-filtered amplitude envelope (we report analyses on the full-band envelope in Supplementary Figures 2 and 4). Second, we computed the half-wave-rectified first derivative of the amplitude envelope, which is typically considered to be sensitive to the presence of onsets in the stimulus (Bello et al., 2005). Third, a percussionist drummed along with the musical segments to define beat times, which were here treated in a binary manner. Fourth, a spectral novelty function, referred to as spectral flux (Müller, 2015), was computed to capture changes in frequency content (as opposed to amplitude fluctuations) over time. In contrast to the first derivative, the spectral flux is better able to identify note onsets that are characterized by changes in spectral content (pitch or timbre), even if the energy level remains the same. To ensure that each musical feature possessed acoustic cues to the stimulation-tempo manipulation, we computed a fast Fourier transform (FFT) on the musical-feature time courses separately for each stimulation-tempo condition; the mean amplitude spectra are plotted in Figure 1C. Overall, amplitude peaks were observed at the intended stimulation tempo and at the harmonic rates for all stimulus features.
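As an illustration of the first two features and of the FFT check, the following sketch uses a broadband Hilbert envelope as a stand-in for the gammatone-filterbank envelope (which requires an additional toolbox) and computes its half-wave-rectified first derivative as well as the amplitude spectrum of a feature time course; the sampling rates and the Python implementation are assumptions, not the original Matlab pipeline.

```python
import numpy as np
from scipy.signal import hilbert

def envelope_and_derivative(x, sr, feat_sr=128):
    """Broadband Hilbert envelope (stand-in for the gammatone-filterbank
    envelope) and its half-wave-rectified first derivative."""
    env = np.abs(hilbert(x))
    env = env[::int(sr / feat_sr)]        # crude downsampling for illustration
    der = np.diff(env, prepend=env[0])
    der[der < 0] = 0                      # half-wave rectification: emphasize onsets
    return env, der

def feature_spectrum(feat, feat_sr):
    """Amplitude spectrum of a musical-feature time course, used to check for
    peaks at the stimulation tempo and its harmonics (cf. Fig. 1C)."""
    amps = np.abs(np.fft.rfft(feat - feat.mean()))
    freqs = np.fft.rfftfreq(len(feat), d=1.0 / feat_sr)
    return freqs, amps
```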
In order to assess the degree to which the different musical features might have been redundant, we calculated mutual information (MI) for all possible pairwise feature combinations and compared MI values to surrogate distributions calculated separately for each feature pair (Fig. 1D, E). MI quantifies the amount of information gained about one random variable by observing a second variable (Cover and Thomas, 2005). MI values were analyzed using separate three-way ANOVAs (MI data vs. MI surrogate × Tempo × Subgroup) for each musical feature. Spectral flux shared significant information with all other musical features; significant MI (relative to surrogate) was found between amplitude envelope and spectral flux (F(1,102)=24.68, pFDR=1.01e-5, η2=0.18), derivative and spectral flux (F(1,102)=82.3, pFDR=1.92e-13, η2=0.45) and beat times and spectral flux (F(1,102)=23.05, pFDR=1.3e-5, η2=0.13). This demonstrates that spectral flux captures information from all three other musical features, and as such, we expected that spectral flux would be associated with strongest neural entrainment. Unsurprisingly, there was also significant shared information between the amplitude envelope and first derivative (F(1,102)=14.11, pFDR=4.67e-4, η2=0.09; other comparisons: Fenv-beat(1,102)=8.44, pFDR=0.006, η2=0.07; Fder-beat(1,102)=6.06, pFDR=0.016, η2=0.05).
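A simple way to estimate pairwise MI with a surrogate comparison is sketched below; the histogram binning and the circular-shift surrogates are illustrative choices and may differ from the estimator and surrogate procedure actually used in this study.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def binned_mi(a, b, bins=16):
    """Histogram-based MI (in nats) between two musical-feature time courses."""
    edges_a = np.quantile(a, np.linspace(0, 1, bins + 1)[1:-1])
    edges_b = np.quantile(b, np.linspace(0, 1, bins + 1)[1:-1])
    return mutual_info_score(np.digitize(a, edges_a), np.digitize(b, edges_b))

def mi_vs_surrogate(a, b, n_surr=200, seed=0):
    """True MI of an aligned feature pair versus circular-shift surrogates."""
    rng = np.random.default_rng(seed)
    mi_true = binned_mi(a, b)
    surr = np.array([binned_mi(a, np.roll(b, rng.integers(1, len(b))))
                     for _ in range(n_surr)])
    return mi_true, surr
```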
There was a main effect of Tempo on MI shared between the amplitude envelope and derivative (F(12,91)=4, pFDR=2e-4, η2=0.32) and the spectral flux and beat times (F(12,91)=5.48, pFDR=4.35e-6, η2=0.37) (Supplementary Fig. 1). This is likely due to the presence of slightly different songs in the different tempo conditions, as the effect of tempo on MI was unsystematic for both feature pairs (see Materials and Methods and Supplementary Table 1). MI for the remaining feature pairs did not differ significantly across tempi.
No significant differences in MI were observed between subgroups, despite the subgroups hearing slightly different pools of musical stimuli: (Fenv-der(3,100)=0.71, pFDR=0.94, η2=0.01; Fenv-beat(3,100)=2.63, pFDR=0.33, η2=0.07; Fenv-spec(3,100)=0.3, pFDR=0.94, η2=0.01; Fder-beat(3,100)=0.43, pFDR=0.94, η2=0.01; Fder-spec(3,100)=0.46, pFDR=0.94, η2=0.01; Fbeat-spec(3,100)=0.13, pFDR=0.94, η2=0.002).
Neural entrainment was strongest in response to slow music
Neural entrainment to music was investigated using two converging analysis pipelines based on (1) RCA followed by time- (stimulus-response correlation, SRCorr) and frequency- (stimulus-response coherence, SRCoh) domain analysis and (2) TRFs.
First, an RCA-based analysis approach was used to assess tempo effects on neural entrainment to music (Fig. 2, Supplementary Fig. 2). RCA involves estimating a spatial filter that maximizes correlation across data sets from multiple participants (for more details see Materials and Methods) (Kaneshiro et al., 2020, Parra et al., 2018). The resulting time course data from a single reliable component can then be assessed in terms of its correlation in the time domain (SRCorr) or coherence in the frequency domain (SRCoh) with different musical feature time courses. Our analyses focused on the first reliable component, which exhibited an auditory topography (Fig. 2A). SRCorrs were significantly tempo-dependent for all four musical features (repeated-measure ANOVAs with Greenhouse-Geisser correction where required: Fenv(12,429)=3.39, pGG=0.003, η2=0.1; Fder(12,429)=2.2, pGG=0.04, η2=0.06; Fbeat(12,429)=2.42, pGG=0.02, η2=0.07; Fspec(12,429)=5.67, pGG=4.88e-5, η2=0.15). Highest correlations were found at slower tempi (∼1-2 Hz). No significant differences were observed across subgroups (Fenv(3,30)=1.94, pFDR=0.25, η2=0.1; Fder(3,30)=1.53, pFDR=0.25, η2=0.11; Fbeat(3,30)=1.47, pFDR=0.25, η2=0.11; Fspec(3,30)=1.67, pFDR=0.25, η2=0.12). In the frequency domain, normalized SRCoh (Fig. 2E-H) showed clear peaks at the stimulation tempo and harmonics. Overall, SRCoh was stronger at the first harmonic of the stimulation tempo than at the stimulation tempo itself, regardless of the musical feature (Fig. 2I, paired-sample t-test, envelope: t(12)=-6.18, pFDR=0.0002, re=0.78; derivative: t(12)=-5.37, pFDR=0.0005, re=0.74; beat: t(12)=-3.29, pFDR=0.01, re=0.56; spectral flux: t(12)=-6.4, pFDR=0.0002, re=0.79). The stimuli themselves also mostly had their highest FFT amplitudes at the first harmonic (Fig. 2J, envelope: t(12)=-6.81, pFDR=5.23e-5, re=0.81; derivative: t(12)=-6.88, pFDR=5.23e-5, re=0.81; spectral flux: t(12)=-8.04, pFDR=2.98e-5, re=0.85), apart from the beat onsets (beat: t(12)=6.27, pFDR=8.56e-5, re=0.79). For evaluating tempo-dependent effects, we averaged SRCoh across the stimulation tempo and first harmonic and submitted the average SRCoh values to repeated-measure ANOVAs for each musical feature. SRCoh was highest for slow music, but this tempo dependence was only significant for the spectral flux (Fenv(12,429)=1.83, pGG=0.1, η2=0.07; Fder(12,429)=1.63, pGG=0.14, η2=0.05; Fbeat(12,429)=1.35, pGG=0.24, η2=0.04; Fspec(12,429)=3.18, pGG=0.01, η2=0.09). No significant differences for the SRCoh were observed across subgroups (Fenv(3,30)=1.72, pFDR=0.25, η2=0.13; Fder(3,30)=2.56, pFDR=0.17, η2=0.1; Fbeat(3,30)=2.43, pFDR=0.17, η2=0.16; Fspec(3,30)=0.7, pFDR=0.56, η2=0.06). Individual data examples of the SRCorr and SRCoh can be found in Supplementary Figure 3.
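Once the RC1 time course has been obtained by applying the spatial filter to each participant's EEG, SRCorr and SRCoh reduce to a correlation and a coherence between that component and a musical-feature time course. The sketch below illustrates this step, assuming the component and feature are already at the same sampling rate; the segment length and the simple averaging of coherence at the stimulation tempo and first harmonic are assumptions and omit the normalization applied in the full analysis.

```python
import numpy as np
from scipy.signal import coherence
from scipy.stats import pearsonr

def srcorr_srcoh(rc1, feature, fs, tempo_hz):
    """Stimulus-response correlation (time domain) and coherence (frequency
    domain) between the RC1 time course and one musical-feature time course;
    rc1 and feature are assumed to have equal length and sampling rate fs."""
    srcorr, _ = pearsonr(rc1, feature)
    freqs, coh = coherence(feature, rc1, fs=fs, nperseg=int(8 * fs))
    coh_tempo = coh[np.argmin(np.abs(freqs - tempo_hz))]         # at stimulation tempo
    coh_harmonic = coh[np.argmin(np.abs(freqs - 2 * tempo_hz))]  # at first harmonic
    return srcorr, (coh_tempo + coh_harmonic) / 2                # averaged, as for the tempo analysis
```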
(A) Projected topography of the first reliable component (RC1). (B) Average SRCorr across tempi for each musical feature (±SEM; shaded area). Highest correlations were found at slow tempi (repeated-measure ANOVA for evaluating tempo differences with Greenhouse-Geisser correction where applicable). The slopes of regression models were used to compare the tempo-specificity between musical features (repeated-measure ANOVA). (C) Mean SRCorr across musical features. Highest correlations were found in response to spectral flux. There were significant differences between all possible feature combinations except between the envelope and beat onsets (repeated-measure ANOVA, Tukey’s test, pFDR<0.001). Boxplots illustrate the median, 25th and 75th percentiles. (D) Same as (C) for the frequency-based SRCoh. All possible feature combinations were significantly different from each other apart from the envelope and beat onsets (pFDR<0.001). Coherence values were averaged over the stimulus tempo and first harmonic. Normalized SRCoh in response to the (E) amplitude envelope, (F) first derivative, (G) beat onsets and (H) spectral flux. Each panel depicts the stimulus-response coherence as a color plot (left) and the pooled SRCoh values at the stimulation tempo and first harmonic (right). (I) Mean differences of SRCoh values at the stimulation tempo and the first harmonic (negative values: higher SRCoh at harmonic, positive values: higher SRCoh at stimulation tempo, paired-sample t-test, *pFDR<0.05; **pFDR<0.001). (J) Same as (I) based on the FFT amplitudes of each musical feature.
Second, TRFs were calculated for each stimulation tempo. A TRF-based approach is a linear-system identification technique that serves as a filter describing the mapping of stimulus features onto the neural response (forward model) (Crosse et al., 2016). Using linear convolution and ridge regression to avoid overfitting, the TRF was computed based on mapping each musical feature to “training” EEG data. Using a leave-one-trial-out approach, the EEG response for the left-out trial was predicted based on the TRF and the stimulus feature of the same trial. The predicted EEG data were then correlated with the actual, unseen EEG data (we refer to this correlation value throughout as TRF correlation). We analyzed the two outputs of the TRF analysis: the filter at different time lags, which typically resembles evoked potentials, and the TRF correlations (Fig. 3, Supplementary Fig. 4). Again, strongest neural entrainment (here quantified as the Pearson correlation coefficient between the predicted and actual EEG data) was observed for slower music (Fig. 3A). Repeated-measure ANOVAs showed significant effects of Tempo for the envelope, beat onsets and spectral flux, with TRF correlations being strongest at slower tempi (∼1-2 Hz) (Fenv(12,429)=2.61, pGG=0.01, η2=0.08; Fder(12,429)=1.81, pGG=0.08, η2=0.05; Fbeat(12,429)=2.56, pGG=0.02, η2=0.07; Fspec(12,429)=9, pGG=5.72e-10, η2=0.22).
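The TRF analysis follows the forward-model logic of the mTRF toolbox (Crosse et al., 2016); the following minimal sketch shows that logic for a single feature and a single EEG channel, with the number of lags and the ridge parameter left as illustrative assumptions. In the leave-one-trial-out scheme, the model would be estimated on the concatenated training trials and evaluated on the held-out trial.

```python
import numpy as np

def lagged_design(stim, n_lags):
    """Design matrix with stimulus lags 0..n_lags-1 samples plus an intercept."""
    T = len(stim)
    X = np.zeros((T, n_lags))
    for k in range(n_lags):
        X[k:, k] = stim[:T - k]
    return np.hstack([np.ones((T, 1)), X])

def fit_trf(stim_train, eeg_train, n_lags, lam):
    """Forward TRF: ridge regression mapping one feature onto one EEG channel."""
    X = lagged_design(stim_train, n_lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(n_lags + 1), X.T @ eeg_train)

def trf_correlation(stim_test, eeg_test, weights, n_lags):
    """Predict the held-out EEG trial and correlate it with the actual EEG."""
    pred = lagged_design(stim_test, n_lags) @ weights
    return np.corrcoef(pred, eeg_test)[0, 1]
```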
(A) Mean TRF (±SEM) correlations as a function of stimulation tempo per stimulus feature (p-values next to the legend correspond to a repeated-measure ANOVA across tempi for every musical feature and the p-value below to the slope comparison of a linear regression model). TRF correlations were highest for spectral flux and combined musical features for slow tempi. (B) Violin plots of the TRF correlations across musical features. Boxplots illustrate the median, 25th and 75th percentiles (n=34). Significant pairwise musical feature comparisons were calculated using a repeated-measure ANOVA with follow-up Tukey’s test, *pFDR<0.001. (C) Top panel: TRF time lags (0-400 ms) of the amplitude envelope. Each line depicts one stimulation tempo (13 tempi between 1 Hz, blue and 4 Hz, green). Middle panel: Colormap of the normalized TRF weights of the envelope in the same time window across stimulation tempi. Lower panel: Topographies of the TRF correlations in response to the amplitude envelope. (D) Same as (C) for the first derivative, (E) beat onsets and (F) spectral flux. Cluster-based permutation testing was used to identify significant tempo-specific time windows (red dashed box, p<0.05). (G) Mean TRF weights in response to the spectral flux for time lags between 102-211 ms.
Spectral flux drives strongest neural entrainment
As natural music is a complex, multi-layered auditory stimulus, we sought to explore the neural response to different musical features and to identify the stimulus feature or features that would evoke strongest neural entrainment. Regardless of the dependent measure (RCA-SRCorr, RCA-SRCoh, TRF correlation), strongest neural entrainment was found in response to the spectral flux (Fig. 2C-D, 3B). In particular, significant differences (as quantified with a repeated-measure ANOVA followed by Tukey’s test) were observed between the spectral flux and all other musical features using the SRCorr (FSRCorr(3,132)=43.99, pGG= 1.85e-11, η2=0.58), SRCoh (FSRCoh(3,132)=30.75, pGG =2.33e-9, η2=0.49) and TRF correlations (FTRF(4,165)=30.25, pGG=5.36e-11, η2=0.49).
As the TRF approach offers the possibility of running a multivariate analysis, all musical features were combined and compared to the single-feature TRF correlations (Fig. 3B). Although the combined model yielded significantly higher TRF correlations than the amplitude envelope (repeated-measure ANOVA with follow-up Tukey’s test, pFDR=1.66e-8), first derivative (pFDR=1.66e-8) and beat onsets (pFDR=1.66e-8), the spectral flux alone still showed an advantage over the multi-feature TRF (pFDR=3.39e-4). Thus, taking all stimulus features together is not a better descriptor of the neural response than the spectral flux alone, indicating together with the MI results from Figure 1 that spectral flux is a more complete representation of the rhythmic structure of the music than the other musical features.
To test how strongly TRF correlations were modulated by tempo for each musical feature, a regression line was fitted to single-participant TRF correlations as a function of tempo, and the slopes were compared across musical features (Fig. 3A). Linear slopes were significantly higher for the spectral flux and the multivariate model compared to the remaining three musical features (repeated-measure ANOVA with follow-up Tukey’s test, envelope-spectral flux: pFDR=5.44e-6; envelope – all: pFDR=3.54e-5; derivative-spectral flux: pFDR=9.98e-7; derivative – all: pFDR=1.53e-5; beat-spectral flux: pFDR=4.54e-7; beat – all: pFDR=3.46e-6; spectral flux – all: pFDR=0.12). The results for SRCorr were qualitatively similar (envelope-spectral flux: pFDR=1.24e-4; derivative-spectral flux: pFDR=2.21e-5; beat-spectral flux: pFDR=9.31e-5; Fig. 2B).
We also examined the time courses of TRF weights (Fig. 3C–F) for time lags between 0 and 400 ms. Cluster-based permutation testing (1000 repetitions) was used to identify time windows in which TRF weights differed across tempi for each musical feature (see Materials and Methods for more details). Significant effects of tempo on TRF weights were observed for the spectral flux between 102–211 ms (p=0.01; Fig. 3 F-G). The tempo specificity was observable in the amplitudes of the TRF weights, which were largest for slower music (Fig. 3G). The TRFs for the amplitude envelope and first derivative demonstrated similar patterns to each other, with strong deflections in time windows consistent with a canonical auditory P1–N1–P2 complex, but did not differ significantly between stimulation tempi (Fig. 3C-D). Similarly, the full-band (Hilbert) amplitude envelope and the corresponding first derivative (Supplementary Fig. 4) displayed tempo-specific effects at time lags of 250–400 ms (envelope, p=0.01) and 281–400 ms (derivative, p=0.02). Visual inspection suggested that TRF differences for these musical features were related to latency, as opposed to amplitude (Supplementary Fig. 4E-F, I-J). Therefore, we identified the latencies of the TRF-weight time courses within the P3 time window and fit a piece-wise linear regression to those mean latency values per musical feature (Supplementary Fig. 4G, K). In particular, TRF latency in the P3 time window decreased over the stimulation tempo conditions from 1–2.5 Hz and from 2.75–4 Hz for both stimulus features, but this decrease was only significant for the envelope (T1-2.5Hz=-6.1, p=0.002, R2=0.86; T2.75-4Hz=-5.66, p=0.005, R2=0.86) and not the derivative (T1-2.5Hz=-1.08, p=0.33, R2=0.03; T2.75-4Hz=-2.2, p=0.09, R2=0.43).
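The two-segment latency analysis can be illustrated with an ordinary least-squares line on each tempo range; the 2.5-Hz split point follows the text, while the function and variable names are hypothetical.

```python
import numpy as np

def piecewise_latency_slopes(tempi, latencies, split=2.5):
    """Separate linear fits to mean TRF-peak latencies for tempi at or below
    versus above the split point; tempi and latencies are numpy arrays."""
    lo, hi = tempi <= split, tempi > split
    slope_lo = np.polyfit(tempi[lo], latencies[lo], 1)[0]  # slope, 1-2.5 Hz
    slope_hi = np.polyfit(tempi[hi], latencies[hi], 1)[0]  # slope, 2.75-4 Hz
    return slope_lo, slope_hi
```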
Results of TRF and SRCorr/SRCoh converge
So far, we demonstrated that both RCA- and TRF-based measures of neural entrainment lead to similar results at the group level, and reveal strongest neural entrainment to spectral flux and at slow tempi. Next, we wanted to quantify the relationship between the SRCorr/SRCoh and TRF correlations across individuals (Fig. 4, Supplementary Fig. 3). This could have implications for the interpretation of studies focusing only on one method. To test this relationship, we predicted TRF correlations from SRCorr or SRCoh values (fixed effect) in separate linear mixed-effects models with Participant and Tempo as random effects (grouping variables). Each musical feature was modeled independently. For all four musical features, SRCorr significantly predicted TRF correlations (tenv(440) =9.77, βenv=0.53, pFDR<1e-15, R2=0.51; tder(440) =8.09, βder=0.46, pFDR= 5.77e-14, R2=0.28; tbeat(440) =12.12, βbeat=0.67, pFDR<1e-15, R2=0.61; tspec(440) =12.49, βspec=0.56, pFDR= 1e-15, R2=0.76). The strongest correlations between neural entrainment measures were found for the beat onsets and spectral flux of music (Fig. 4C, D). In the frequency domain, we examined the SRCoh values at the stimulation tempo and first harmonic separately (Supplementary Fig. 5). SRCoh values at both the intended stimulation tempo and the first harmonic significantly predicted TRF correlations for all musical features. For all musical features, the first harmonic was a better predictor of TRF correlations than the intended stimulation tempo except for the beat onsets (intended tempo: tenv(440) =4.78, βenv=0.17, pFDR= 3.15e-6, R2=0.34; tder(440) =3.06, βder=0.1, pFDR= 0.002, R2=0.13; tbeat(440) =8.12, βbeat=0.28, pFDR= 1.95e-14, R2=0.5; tspec(440) =3.42, βspec=0.09, pFDR= 7.9e-4, R2=0.64; first harmonic: tenv(440) =6.17, βenv=0.09, pFDR= 3.07e-9, R2=0.33; tder(440) =4.98, βder=0.09, pFDR= 1.43e-6, R2=0.16; tbeat(440) =8.79, βbeat=0.2, pFDR<1e-15, R2=0.51; tspec(440) =6.87, βspec=0.09, pFDR= 5.82e-11, R2=0.64). Overall, these results suggest that, despite their differences, TRF and RCA–SRCorr/RCA-SRCoh pick up on similar features of the neural response, but may potentially strengthen each other’s explanatory power when used together.
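A sketch of such a model is shown below (statsmodels); the column names are assumptions, and the crossed random effects for Participant and Tempo are approximated with a participant grouping factor plus a tempo variance component, which may differ from the exact model specification used here.

```python
import statsmodels.formula.api as smf

def lmm_trf_from_srcorr(df):
    """TRF correlation predicted by SRCorr (fixed effect), with a random
    intercept per participant and a tempo variance component; df is assumed
    to have columns 'participant', 'tempo', 'srcorr', 'trf_corr'."""
    model = smf.mixedlm(
        "trf_corr ~ srcorr",
        data=df,
        groups=df["participant"],
        vc_formula={"tempo": "0 + C(tempo)"},
    )
    return model.fit(reml=True)
```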
(A) Linear mixed-effects models of the SRCorr (predictor variable) and TRF correlations (response variable) in response to the amplitude envelope. Each dot represents the mean correlation of one participant (n=34) at one stimulation tempo (n=13) (grouping variables; blue: 1 Hz, green: 4 Hz). Violin plots illustrate fixed-effects coefficients (β). (B)-(D) Same as (A) for the first derivative, beat onsets and spectral flux. For all musical features, the fixed effects were significant.
Familiar songs and songs with an easy-to-tap beat drive strongest neural entrainment
Next, we tested whether neural entrainment to music depended on 1) how much the song was enjoyed, 2) the familiarity of the song, and 3) how easy it was to tap the beat of the song; each of these characteristics was rated on a scale ranging between –100 and +100. We hypothesized that, in particular, difficulty perceiving and tapping to the beat would be associated with weaker neural entrainment. Ratings on all three dimensions are shown in Figure 5A. To evaluate the effects of tempo on individuals’ ratings, separate repeated-measure ANOVAs were conducted for each behavioral rating. All behavioral ratings were unaffected by tempo (enjoyment: F(12,429)=0.58, p=0.85, η2=0.02; familiarity: F(12,429)=1.44, pGG=0.18, η2=0.04; ease of beat tapping: F(12,429)=1.62, p=0.08, η2=0.05).
(A) Normalized (to the maximum value per rating/participant), averaged behavioral ratings of enjoyment, familiarity and ease of tapping to the beat (±SEM). No significant differences across tempo conditions were observed (repeated-measure ANOVA with Greenhouse-Geisser correction). (B) Mean TRF correlation topography across all ratings (based on the analysis of 15 trials with highest and lowest ratings per behavioral measure). (C) Violin plots of TRF correlations comparing low vs. highly enjoyed, low vs. highly familiar, and subjectively difficult vs. easy beat trials. Strongest TRF correlations were found in response to familiar music and music with an easy-to-perceive beat (paired-sample t-test, *pFDR<0.05). (D) Mean TRFs (±SEM) for time lags between 0-400 ms for more and less enjoyed songs. (E)-(F) Same as (D) for trials with low vs. high familiarity and difficult vs. easy beat ratings.
To assess the effects of familiarity, enjoyment, and beat-tapping ease on neural entrainment, TRFs in response to spectral flux were calculated for the 15 trials with the highest and the 15 trials with the lowest ratings per participant per behavioral rating condition (Fig. 5B-F). TRF correlations were not significantly different for less enjoyed compared to more enjoyed music (paired-sample t-test, t(33)=1.91, pFDR=0.06, re=0.36; Fig. 5C). In contrast, significantly higher TRF correlations were observed for familiar vs. unfamiliar songs (t(33)=-2.57, pFDR=0.03, re=0.46), and for songs with an easier-to-perceive beat (t(33)=-2.43, pFDR=0.03, re=0.44). These results were reflected in the TRFs at time lags between 0-400 ms (Fig. 5D-F).
Next, we wanted to entertain the possibility that musical training could modulate neural entrainment to music. Therefore, participants with less than 2 years of regular, daily music training were assigned to a “non-musician” group (n=17) and participants with over 6 years of regular music training were labelled as “musicians” (n=12). Although there is little agreement about the specific criterion that should be used to define musician and non-musician participants, this division had the advantages that it ignored participants with medium amounts of training and it roughly equally divided our sample. Subsequently, TRF correlations were compared between groups (Supplementary Fig. 6). Regardless of the stimulus feature, no significant differences were detected between participants with different levels of musical expertise (paired-sample t-test, envelope: pFDR=0.998; derivative: pFDR=0.998; beats: pFDR=0.833; spectral flux: pFDR=0.998). Moreover, the Goldsmiths Musical Sophistication Index (Gold-MSI) was used to quantify musical “sophistication” (referring not only to the years of musical training, but also, e.g., to musical engagement or self-reported perceptual abilities (Müllensiefen et al., 2014)), which we then correlated with neural entrainment. No significant correlations were observed between musical sophistication and TRF correlations (Pearson correlation, envelope: R=-0.24, pFDR=0.35; derivative: R=-0.11, pFDR=0.55; beats: R=-0.2, pFDR=0.35; spectral flux: R=-0.22, pFDR=0.35; Supplementary Fig. 6).
Brain responses to musical features predict perceived beat rate
In natural music, the beat can be perceived at multiple metrical levels. For that reason, it was possible that listeners did not perceive the beat at the tempo we intended (the stimulation tempo), but may have instead perceived the beat at double or half that rate. Thus, we wanted to explore whether our TRF-based measures of neural entrainment simply reflected the stimulus tempo that we presented, or whether they might be sensitive to perceived beat rate when that differed from the stimulation tempo, i.e., the intended beat rate. For this analysis, we made use of the tapping data that were collected in the final part of each trial, during which participants finger-tapped to the beat for 5.5 s. Trials with at least three consistent taps were assigned to a perceived tempo condition (1-4 Hz in steps of 0.25-Hz, see Materials and Methods for more details). In this study, we will use the term “stimulation tempo” to refer to the predominant beat frequency in each music segment, whereas we will use the term “tapped beat rate” when referring to the tapped frequency. The preferred tapped beat rate on the group level was ∼1.55 Hz (Supplementary Fig. 7C, mode of skewed Gaussian fitted to mean histograms of the relative number of trials per tapped beat rate).
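The assignment of trials to a perceived-tempo condition can be sketched as follows; the consistency criterion (coefficient of variation of inter-tap intervals) is an illustrative assumption, since the text only specifies a minimum of three consistent taps.

```python
import numpy as np

def tapped_rate(tap_times, grid=np.arange(1.0, 4.25, 0.25), max_cv=0.2):
    """Estimate the tapped beat rate from tap onset times (s) and snap it to
    the 1-4 Hz grid; returns None for trials without at least 3 consistent taps."""
    iti = np.diff(np.sort(tap_times))
    if len(iti) < 2:                          # fewer than three taps
        return None
    if np.std(iti) / np.mean(iti) > max_cv:   # inconsistent tapping (criterion is illustrative)
        return None
    rate = 1.0 / np.median(iti)
    return grid[np.argmin(np.abs(grid - rate))]
```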
We wanted to test if we could identify the stimulation tempo (chosen by us) or the tapped beat rate (rate the participant tapped to) based on the neural data, in particular when the stimulation tempo and the tapped beat rate were different. We used a support vector machine (SVM) classifier first to predict the stimulation tempo (Fig. 6A-B) and second to predict the perceived (tapped) rate based on the neural response to different musical features (Fig. 6C-D). For predicting the stimulation tempo, we identified two sets of 6 trials each (per participant), one set where the participants tapped the intended stimulation tempo and the other set where they tapped the same rate, but the intended stimulation tempo was twice as fast as what the participants tapped, i.e., participants tapped the subharmonic of the stimulation tempo. We were able to do this for 18 of our 34 participants. Next, TRFs were computed in response to each musical feature for each set of trials (tapped rate = intended stimulation tempo vs. same tapped rate = 2*stimulation tempo). The SVMs were computed using bootstrapping (100 repetitions) and a leave-one-out approach. The mean SVM prediction accuracies for each musical feature were compared to a surrogate distribution generated by randomly shuffling the tempo labels (tapped rate = intended stimulation tempo vs. same tapped rate = 2*stimulation tempo) when training the SVM classifier. We observed significantly higher prediction accuracies in comparison to the surrogate data for all musical features (paired-sample t-test, envelope: t(17)=141.56, pFDR=5.91e-27, re=0.999; derivative: t(17)=126.33, pFDR=2.05e-26, re=0.999; beat onsets: t(17)=107.09, pFDR=2.26e-25, re=0.999; spectral flux: t(17)=43.88, pFDR=6.17e-19, re=0.994; Fig. 6A). This shows that even if the perceived tempo of two musical pieces is the same, the intended (acoustic) stimulation tempo evokes varying levels of neural entrainment. For comparing the prediction accuracies across musical features, an accuracy index ((AccuracyData-AccuracySurr)/(AccuracyData+AccuracySurr)) was submitted to a repeated-measure ANOVA. No significant differences between musical features were observed (F(3,68)=0.93, p=0.43, η2=0.06; Fig. 6B).
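The classification step can be sketched with scikit-learn as below; the feature-matrix layout (one row of concatenated TRF weights per observation), the stratified bootstrap, and the linear kernel are assumptions for illustration, and the exact procedure in the paper may differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def svm_tempo_decoding(X, y, n_boot=100, seed=0):
    """Bootstrapped leave-one-out SVM accuracy on TRF-derived features
    (rows = observations, columns = concatenated TRF weights), a
    label-shuffled surrogate, and the accuracy index used in Fig. 6B/D."""
    rng = np.random.default_rng(seed)
    clf = SVC(kernel="linear")
    accs, surrs = [], []
    for _ in range(n_boot):
        # stratified bootstrap resample (keeps both classes present)
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c),
                                         size=np.sum(y == c), replace=True)
                              for c in np.unique(y)])
        accs.append(cross_val_score(clf, X[idx], y[idx], cv=LeaveOneOut()).mean())
        surrs.append(cross_val_score(clf, X[idx], rng.permutation(y[idx]),
                                     cv=LeaveOneOut()).mean())
    acc, surr = np.mean(accs), np.mean(surrs)
    return acc, surr, (acc - surr) / (acc + surr)
```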
(A) Violin plot of the mean accuracies of a support vector machine (SVM) classifier predicting the stimulation tempo (n=18; tapped rate = intended stimulation tempo vs. same tapped rate = 2*stimulation tempo). Based on the TRFs to all musical features, significant differences in prediction accuracies were computed in comparison to a surrogate (paired-sample t-test, *pFDR<0.001). (B) Comparison of SVM classifier accuracies ((AccuracyData-AccuracySurr)/(AccuracyData+AccuracySurr)) across musical features revealed no significant differences in predicting the stimulation tempo (repeated-measure ANOVA, p=0.43). (C)-(D) Same as (A)-(B), but here the SVM classifier predicted the tapped rate based on the TRFs (n=13; intended stimulation tempo = tapped rate vs. same stimulation tempo = 2*tapped rate) (paired-sample t-test, *pFDR<0.001). No differences were observed in SVM prediction accuracies across musical features (repeated-measure ANOVA, p=0.39).
Next, the neural responses to different musical features were used to predict the tapped beat rate for sets of trials with the same stimulation tempo (intended stimulation tempo = tapped rate vs. same stimulation tempo = 2*tapped rate). Analogous to the previously described analysis pipeline, 13 individual datasets from different tempo conditions were identified that met the criterion (this time from 9 participants contributing one dataset each and two participants contributing two datasets each, to increase the sample size). All SVM classifier prediction accuracies yielded significant differences in comparison to the surrogate data (paired-samples t-test, envelope: t(12)=46.81, pFDR=1.81e-14, re=0.995; derivative: t(12)=95.32, pFDR<4.75e-18, re=0.999; beat onsets: t(12)=27.64, pFDR=3.1e-12, re=0.985; spectral flux: t(12)=35.55, pFDR=2.09e-13, re=0.991; Fig. 6C), suggesting that entrained neural responses also possess unique signatures of the perceived beat rate, even when it is different from the stimulation tempo. No significant differences in predicting the tapped beat rate between musical features were observed (F(3,48)=1.04, p=0.39, η2=0.09; Fig. 6D).
Discussion
We investigated neural entrainment to naturalistic, polyphonic music presented at different tempi. The music stimuli varied along a number of dimensions in idiosyncratic ways, including the familiarity and enjoyment of the music, and the ease with which the beat was perceived. The current study demonstrates that neural entrainment is strongest to 1) music with beat rates between 1 and 2 Hz, 2) spectral flux of music, 3) familiar music and music with an easy-to-perceive beat. In addition, 4) brain responses to the music stimuli were informative regarding the listeners’ perceived metrical level of the beat, and 5) analysis approaches based on TRF and RCA revealed converging results.
Neural entrainment was strongest to music with beat rates in the 1–2 Hz range
Strongest neural entrainment was found in response to stimulation tempi between 1 and 2 Hz in terms of SRCorr (Fig. 2B), TRF correlations (Fig. 3A), and TRF weights (Fig. 3C-F). Moreover, we observed a behavioral preference to tap to the beat in this frequency range, as the group preference for music tapping was at 1.55 Hz (Supplementary Fig. 7C). Previous studies have shown a preference to listen to music with beat rates around 2 Hz (Bauer et al., 2015), which is moreover the modal beat rate in Western pop music (Moelants, 2002) and the rate at which the modulation spectrum of natural music peaks (Ding et al., 2017). Even in nonmusical contexts, spontaneous adult human locomotion is characterized by strong energy around 2 Hz (MacDougall and Moore, 2005). Moreover, when asked to rhythmically move their bodies at a comfortable rate, adults will spontaneously move at rates around 2 Hz (McAuley et al., 2006), regardless of whether they use their hands or feet (Rose et al., 2020). Thus, there is a tight link between preferred rates of human body movement and preferred rates for the music we make and listen to, and this link was moreover reflected in our neural data. This is perhaps not surprising, as musical rhythm perception activates motor areas of the brain, such as the basal ganglia and supplementary motor area (Grahn and Brett, 2007), and is further associated with increased auditory–motor functional connectivity (Chen et al., 2008). In turn, involving the motor system in rhythm perception tasks improves temporal acuity (Morillon et al., 2014), but only for beat rates in the 1–2 Hz range (Zalta et al., 2020).
In the frequency domain, SRCoh was strongest at the stimulation tempo and its harmonics (Fig. 2E-I). In fact, highest coherence was observed at the first harmonic and not at the stimulation tempo itself (Fig. 2I). This replicates previous work that also showed higher coherence (Kaneshiro et al., 2020) and spectral amplitude (Tierney and Kraus, 2015) at the first harmonic than at the musical beat rate. There are several potential reasons for this finding. One reason could be that the stimulation tempo that we defined for each musical stimulus was based on beat rate, but natural music can be subdivided into smaller units (e.g., notes) that can occur at faster time scales. A recent MEG study demonstrated inter-trial phase coherence for note rates up to 8 Hz (Doelling and Poeppel, 2015). Hence, the neural responses to the music stimuli in the current experiment likely tracked not only the beat rate, but also faster elements such as notes. In line with this hypothesis, FFTs conducted on the stimulus features themselves showed higher amplitudes at the first harmonic than the stimulation tempo for all musical features except the beat onsets (Fig. 2J). Moreover, there are other explanations for higher coherence at the first harmonic than at the beat rate. For example, the low-frequency beat-rate neural responses fall into a steeper part of the 1/f slope, and as such may simply suffer from worse signal-to-noise ratio than their harmonics.
Regardless of the reason, since frequency-domain analyses separate the neural response into individual frequency-specific peaks, it is easy to interpret neural tracking (SRCoh) or stimulus spectral amplitude at the beat rate and the note rate – or at the beat rate and its harmonics – as independent (Keitel et al., 2021). However, music is characterized by a nested, hierarchical rhythmic structure, and it is unlikely that neural tracking at different metrical levels goes on independently and in parallel. One potential advantage of TRF-based analyses is that they operate on relatively wide-band data compared to Fourier-based approaches, and as such are more likely to preserve nested neural activity and perhaps less likely to lead to over- or misinterpretation of frequency-specific effects.
Neural entrainment is driven by spectral flux
Neural entrainment was strongest in response to the spectral flux of music, regardless of whether the analysis was based on TRFs or RCA. Similar to speech-tracking studies, music-tracking studies typically use the amplitude envelope of the sound to characterize the stimulus rhythm (Vanden Bosch der Nederlanden et al., 2020, Kumagai et al., 2018, Doelling and Poeppel, 2015, Decruy et al., 2019, Reetzke et al., 2021). Although speech and music share features such as amplitude fluctuations over time and hierarchical grouping (Patel, 2003), there are differences in their spectro-temporal composition that make spectral information especially important for music perception. For example, while successful speech recognition requires 4-8 spectral channels, successful recognition of musical melodies requires at least 16 spectral channels (Shannon, 2005) – the flipside of this is that music is more difficult than speech to understand based only on amplitude-envelope information. Moreover, increasing the spectral complexity of a music stimulus enhances neural entrainment (Wollman et al., 2020). Critically, both temporal and spectral information influence the perceived accent structure in music (Pfordresher, 2003).
A recent study claimed that neuronal activity synchronizes less strongly to music than to speech (Zuk et al., 2021); notably, they focused specifically on the amplitude envelope to characterize the stimulus rhythms. We argue that the amplitude envelope – even when passed through a model of the peripheral auditory system – is a suboptimal measure to approximate the individual note onsets that convey rhythmic structure in music and to which neural activity can be entrained (Müller, 2015). Imagine listening to a melody played in a glissando fashion on a violin. There might never be a clear onset that would be represented by the amplitude envelope – all of the rhythmic structure is communicated by spectral changes. Thus, in this study we wanted to compare neural entrainment by the amplitude envelope to neural entrainment by spectral flux, which compares spectral content, i.e., power spectra, on a frame-to-frame basis, and which is arguably a more appropriate measure of rhythmic and metrical structure in music. Indeed, many automated tools for extracting the beat in music used in the music information retrieval (MIR) literature rely on spectral flux information (Oliveira et al., 2010). Also in the context of body movement, spectral flux has been associated with the type and temporal acuity of synchronization between the body and music at the beat rate (Burger et al., 2018) to a greater extent than other acoustic characterizations of musical rhythmic structure. Consistent with this, we found that spectral flux drove stronger entrainment than the amplitude envelope.
Using TRF analysis, we found that neural entrainment to spectral flux was not only stronger than entrainment to any other musical feature, but also stronger than entrainment to a multivariate predictor that combined all musical features. For this reason, we calculated the shared information (MI) between each pair of musical features, and found that spectral flux shared significant information with all other musical features (Fig. 1). Hence, spectral flux seems to capture information also contained in, for example, the amplitude envelope, but contains unique information about rhythmic structure that cannot be gleaned from the other acoustic features (Fig. 3). This finding has potentially important implications for direct comparisons of neural tracking of music and speech, or music and natural sounds (Zuk et al., 2021). We would caution that conclusions about differences in how neural activity entrains to different categories of sounds should rest on stimulus characterizations that are as fair as possible, rather than relying on the amplitude envelope as a one-size-fits-all summary of rhythmic structure.
Neural entrainment was strongest to familiar songs and songs with an easy beat
We found that the strength of neural entrainment depended on the familiarity of music and the ease with which a beat could be perceived (Fig. 5). This is in line with a previous study showing stronger neural entrainment to familiar music (Madsen et al., 2019). It is likely that songs a person knows – familiar songs – increase engagement. We note that we did not have a measure of engagement, though engagement has been shown to be a major driver of neural entrainment during film viewing (Dmochowski et al., 2014).
There was also higher neural entrainment to music with subjectively “easy-to-tap-to” beats. However, both neural entrainment and ease of beat tapping were highest for slow stimulation tempi; faster songs were associated with weaker entrainment and tended to be rated as more difficult to tap to. Thus, in the current study, it is not possible to separate the influences of stimulation tempo and beat salience on neural entrainment. Here, we chose music stimuli with salient, easy-to-perceive beats. However, a design including more “weakly metrical” or syncopated rhythms may have more success in doing so. Overall, we interpret our results as indicating that stronger neural entrainment is evoked in response to music that is more predictable: familiar music and with easy-to-track beat structure.
Musical training did not affect the degree of neural entrainment in response to tempo-modulated music (Supplementary Fig. 6). This contrasts with previous music research showing that musicians’ neural activity was entrained more strongly by music than non-musicians’ (Madsen et al., 2019, Doelling and Poeppel, 2015, Di Liberto et al., 2020). There are several possible reasons for this discrepancy. One is that our study recruited participants with varying levels of musical expertise and did not aim for a specific target group; our study was not intended to examine the role of musical training in neural entrainment. Furthermore, most studies that have done so have focused on classical music (Doelling and Poeppel, 2015, Madsen et al., 2019, Di Liberto et al., 2020), whereas we incorporated music stimuli with different instruments and from different genres (e.g., Rock, Pop, Techno, Western, Hip Hop or Jazz). We suspect that musicians are more likely to be familiar with classical music in particular, and as we have shown that familiarity with the individual piece increases neural entrainment, these studies may have inadvertently confounded musical training with familiarity.
Neural responses predicted tempo perception
One interesting yet difficult aspect of music, when it comes to studying entrainment, is that music has metrical structure; that is, there are several levels at which nested periodicities can be perceived. Here, we asked participants to tap along with short sections of each musical stimulus so that we could confirm that their perceived (tapped) beat rate matched our intended stimulation tempo. Although participants mostly tapped at the rate we intended, they sometimes tapped at half or double the intended stimulation tempo, especially when the stimulation tempo was particularly fast or slow, respectively. Here, we applied a classification approach to demonstrate that entrained neural responses to music can predict a) whether participants tapped at double-time or half-time to stimuli with the same stimulation tempo, or b) whether stimuli to which participants tapped identically belonged to the double-time or half-time stimulation-tempo condition. Importantly, neural activity was measured in response to auditory stimulation (without movement) and the perceived metrical level was based on the beat tapping rate established in a separate part of each trial after the listening portion was over. To our knowledge, this study constitutes the first to successfully identify the specific metrical level at which individuals perceived a beat in the absence of overt movement. Nonetheless, there are a few caveats to mention. First, we chose musical stimuli that all had a relatively easy-to-perceive beat. As a result, only 11 participants had enough trials with metrically ambiguous tapping behaviour to stimuli belonging to the same intended stimulation tempo condition for conducting TRF analysis. Moreover, we initially only included the beat-tapping section of each trial as a verification of the validity of our tempo manipulation. As such, we only collected tapping responses for 5.5 s per trial, and tapping behavior was quite difficult to analyze due to the short tapping epochs, which resulted in many tapping trials being discarded.
TRF- and RCA-based measures show converging results
In the present study, we used the TRF and RCA analysis approaches to quantify neural entrainment. Here, we have purposefully avoided the debate about whether these metrics measure entrainment “in the narrow sense” (Obleser and Kayser, 2019), meaning phase-locked and (mainly) unidirectional coupling between a rhythmic input and neural activity generated by a neural oscillator (Lakatos et al., 2019) or whether neural tracking reflects convolution with an evoked response (Zuk et al., 2021). Here, we prefer to remain agnostic, and refer rather to “entrainment in the broad sense” (Obleser and Kayser, 2019), that is neural tracking of music independent of the underlying physiological mechanism.
RCA and TRF approaches share their ability to characterize neural responses to single-trial, ongoing, naturalistic stimuli. As such, both techniques afford something that is challenging or impossible to accomplish with “classic” ERP analysis. However, we made use of the two techniques in parallel in order to leverage their unique advantages. RCA allows for frequency-domain analysis such as SRCoh, which can be useful for identifying neural tracking responses specifically at the beat rate, for example. Past music studies often used a “frequency-tagging” approach for this, which is based on averaging over trials in the time domain (so requires repetition of stimuli) rather than relating the neural response to the stimulus time course, and moreover operates in electrode as opposed to component space (Nozaradan et al., 2012, Nozaradan et al., 2011). TRFs, in contrast, take into account wider-band neural data, which may better capture the tracking of nested metrical structure as in music. Moreover, TRFs offer univariate and multivariate analysis approaches, which allowed us to show that adding other musical features to the model did not improve the correspondence to the neural data over and above spectral flux alone. Despite their differences, we found strong correspondence between the dependent variables from the two approaches. Specifically, TRF correlations were strongly correlated with SRCoh at the stimulation tempo, and this correlation was higher than for SRCoh at the first harmonic of the stimulation tempo for the amplitude envelope, derivative and beat onsets (Supplementary Fig. 5). Thus, despite being computed on a relatively broad range of frequencies, the TRF seems to correlate with frequency-specific measures at the stimulation tempo.
Conclusions
This study presented new insights into neural entrainment to natural music. We compared neural entrainment to different musical features and showed strongest neural responses to the spectral flux. This has important implications for research on neural entrainment to music, which has so far often quantified stimulus rhythm with what we would argue is a subpar acoustic feature – the amplitude envelope. Moreover, our findings demonstrate that neural entrainment is strongest for slower beat rates, and for predictable stimuli, namely familiar music with an easy-to-perceive beat.
Materials and Methods
Participants
Thirty-seven participants completed the study (26 female, 11 male, mean age = 25.7 years, SD = 4.33 years, age range = 19-36 years). Target sample size for this study was estimated using G*Power 3, assuming 80% power to detect a medium-sized effect. We estimated a target sample size of 24 (+4) for within-participant condition comparisons and 32 (+4) for correlations, and defaulted to the larger value since this experiment was designed to investigate both types of effects. The values in parentheses were padding to allow for discarding ∼15% of the recorded data. The datasets of three participants were discarded because of large artefacts in the EEG signal (see section EEG data preprocessing), technical problems, or failure to follow the experimental instructions. The behavioral and neural data of the remaining 34 participants were included in the analysis.
Prior to the EEG experiment, all participants filled out an online survey about their demographic and musical background using LimeSurvey (LimeSurvey GmbH, Hamburg, Germany, http://www.limesurvey.org). All participants self-identified as German speakers. Most participants self-reported normal hearing (7 participants reported occasional ringing in one or both ears). Thirty-four participants were right- and three were left-handed. Seventeen participants reported having no musical background (0-2 years of daily music training, here termed “non-musicians”) and 12 reported at least 6 years of musical training (“musicians”). Musical expertise was assessed using the Goldsmiths Musical Sophistication Index (Gold-MSI; Müllensiefen et al., 2014). Participants received financial compensation for participating (Online: 2.50 €, EEG: 7 € per 30 min). All participants signed the informed consent form before starting the experiment. The study was approved by the Ethics Council of the Max Planck Society in compliance with the Declaration of Helsinki.
Stimuli
The stimulus set started from 39 instrumental versions of musical pieces from different genres, including techno, rock, blues, and hip-hop. The musical pieces were available in *.wav format from the Qobuz Downloadstore (https://www.qobuz.com/de-de/shop). Each musical piece was segmented manually using Audacity (Version 2.3.3, Audacity Team, https://www.audacityteam.org) at musical phrase boundaries (e.g., between chorus and verse), leading to a pool of 93 musical segments with lengths varying between 14.4 and 38 s. We did not use the beat count from publicly available beat-tracking software, because it did not track beats reliably across genres. Due to the first Covid-19 lockdown, we assessed the original tempo of each musical segment using an online method. Eight tappers, including the authors, listened to and tapped along with each segment on their computer keyboard for a minimum of 17 taps; the tempo was recorded using an online BPM estimation tool (https://www.all8.com/tools/bpm.htm). In order to select stimuli with unambiguous, strong beats that are easy to tap to, we excluded 21 segments because of high variability in tapped metrical levels (i.e., if more than 2 tappers tapped differently from the others) or poor sound quality.
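For illustration, the tempo estimate obtained from tap times can be reproduced from raw tap timestamps. The following minimal Python sketch (function name and the jittered example are our own, not part of the original pipeline) converts tap times into a BPM estimate via the median inter-tap interval.

import numpy as np

def estimate_bpm(tap_times_s):
    """Estimate tempo in beats per minute from tap timestamps (in seconds)."""
    tap_times_s = np.sort(np.asarray(tap_times_s, dtype=float))
    intervals = np.diff(tap_times_s)            # inter-tap intervals in seconds
    median_interval = np.median(intervals)      # robust against single mistimed taps
    return 60.0 / median_interval               # convert period (s) to rate (BPM)

# Example: taps roughly every 0.5 s correspond to ~120 BPM (2 Hz)
print(estimate_bpm(np.arange(0, 10, 0.5) + np.random.normal(0, 0.01, 20)))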
The remaining 72 segments were then tempo-manipulated using a custom-written MAX patch (Max 8.1.0, Cycling ’74, San Francisco, CA, USA). Each segment was shifted to tempi between 1–4 Hz in steps of 0.25 Hz. Subsequently, the authors screened all of the tempo-shifted music and eliminated versions in which the tempo manipulation introduced acoustic distortions, made individual notes indistinguishable, or rendered the music excessively repetitive. Overall, 703 music stimuli with durations of 8.3–56.6 s remained. All stimuli had a sampling rate of 44,100 Hz, were converted from stereo to mono, linearly ramped with 500-ms fade-in and fade-out, and root-mean-square (RMS) normalized using Matlab (R2018a; The MathWorks, Natick, MA, USA). A full overview of the stimulus segments can be found in the Supplementary Material (Supplementary Table 1).
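The final conditioning steps (stereo-to-mono conversion, 500-ms linear ramps, RMS normalization) were performed in Matlab; the sketch below illustrates the same operations in Python. The target RMS value is an arbitrary placeholder.

import numpy as np

def condition_stimulus(audio, fs=44100, ramp_s=0.5, target_rms=0.1):
    """Convert to mono, apply 500-ms linear fade-in/out, and RMS-normalize."""
    audio = np.asarray(audio, dtype=float)
    mono = audio.mean(axis=1) if audio.ndim == 2 else audio      # stereo -> mono
    n_ramp = int(ramp_s * fs)
    ramp = np.linspace(0.0, 1.0, n_ramp)
    mono[:n_ramp] *= ramp                                        # linear fade-in
    mono[-n_ramp:] *= ramp[::-1]                                 # linear fade-out
    rms = np.sqrt(np.mean(mono ** 2))
    return mono * (target_rms / rms)                             # RMS normalization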
Each participant was assigned to one of four pseudo-randomly generated stimulus lists. Each list comprised 4–4.6 min of musical stimulation per tempo condition (Kaneshiro et al., 2020), resulting in 7–17 different musical segments per tempo and a total of 159–162 segments (trials) per participant. Each segment was presented only once per tempo but was allowed to occur up to three times at different tempi within one experimental session (the tempo difference between two presentations of the same segment was at least 0.5 Hz). The presentation order of the musical segments was randomly generated for each participant prior to the experiment. The music stimuli were played at 50 dB sensation level (SL), based on individual hearing thresholds determined using the method of limits (Leek, 2001).
Experimental design
After attaching the EEG electrodes and seating the participant in an acoustically and electrically shielded booth, the participant was asked to follow the instructions on the computer screen (BenQ Monitor XL2420Z, 144 Hz, 24”, 1920×1080, Windows 7 Pro (64-bit)). Auditory and visual stimulus presentation was controlled by custom-written scripts using Psychtoolbox (PTB-3; Brainard, 1997) in Matlab (R2017a; The MathWorks, Natick, MA, USA).
The overall experimental flow for each participant is shown in Figure 1A. First, each participant performed a self-paced spontaneous motor tempo task (SMT; Fraisse, 1982), a commonly used technique to assess an individual’s preferred tapping rate (Rimoldi, 1951, McAuley, 2010). To obtain the SMT, each participant tapped for 30 s (3 repetitions) at a comfortable rate with a finger on the table close to a contact microphone (Oyster S/P 1605, Schaller GmbH, Postbauer-Heng, Germany). Second, we estimated each participant’s hearing threshold using the method of limits. All sounds in this study were delivered by a Fireface soundcard (RME Fireface UCX Audiointerface, Audio AG, Haimhausen, Germany) via on-ear headphones (Beyerdynamics DT-770 Pro, Beyerdynamic GmbH & Co. KG, Heilbronn, Germany). After a short three-trial training, the main task was performed. The music stimuli in the main task were grouped into eight blocks with approximately 20 trials per block and the possibility to take a break in between.
Each trial comprised two parts: attentive listening (music stimulation without movement) and tapping (music stimulation + finger tapping; Fig. 1A). During attentive listening, one music stimulus was presented (8.3–56.6 s) while the participant looked at a fixation cross on the screen; the participant was instructed to mentally locate the beat without moving. Tapping began after a 1-s interval; the last 5.5 s of the previously presented musical segment were repeated, and participants were instructed to tap a finger to the beat of the musical segment (as indicated by the replacement of the fixation cross with a hand on the computer screen). Note that 5.5 s of tapping data is not sufficient to conduct standard analyses of sensorimotor synchronization; rather, our goal was to confirm that participants tapped at the intended beat rate based on our tempo manipulation. After each trial, participants used the computer mouse to rate the segment on enjoyment/pleasure, familiarity, and ease of tapping to the beat on visual analogue scales ranging from -100 to +100. At the end of the experiment, the participant performed the SMT task again (three repetitions).
EEG data acquisition
EEG data were acquired using BrainVision Recorder (v.1.21.0303, Brain Products GmbH, Gilching, Germany) and a Brain Products actiCap system with 32 active electrodes attached to an elastic cap based on the international 10-20 system (actiCAP 64Ch Standard-2 Layout Ch1-32, Brain Products GmbH, Gilching, Germany). The signal was referenced to the FCz electrode and grounded at the AFz position. Electrode impedances were kept below 10 kOhm. Brain activity was recorded at a sampling rate of 1000 Hz via a BrainAmp DC amplifier (BrainAmp ExG, Brain Products GmbH, Gilching, Germany). To ensure correct timing between the recorded EEG data and the auditory stimulation, a TTL trigger pulse was sent over a parallel port at the onset and offset of each musical segment, and the stimulus envelope was recorded on an additional channel using a StimTrak (StimTrak, Brain Products GmbH, Gilching, Germany).
Data Analysis
Behavioral data
Tapping data were processed offline with a custom-written Matlab script. To extract the taps, the *.wav files were imported and downsampled (from 44.1 kHz to 2205 Hz). The threshold for extracting taps was adjusted manually for each trial (SMT and music tapping), and trials with irregular tap intervals were rejected. The SMT results were not analyzed as part of this study and will not be discussed further. For the music tapping, only trials with at least three taps (two intervals) were included in further analysis. Five participants were excluded from the music tapping analysis because more than 40% of their trials were rejected due to irregular and inconsistent taps within a trial.
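As an illustration of the tap-extraction step, the following Python sketch detects tap onsets as threshold crossings in the rectified, downsampled contact-microphone signal. The threshold and refractory period shown are placeholders; in the actual analysis the threshold was set manually per trial.

import numpy as np

def extract_tap_times(wav, fs=44100, downsample_factor=20,
                      threshold=0.1, min_gap_s=0.2):
    """Detect tap onsets as upward threshold crossings with a refractory period."""
    x = np.abs(np.asarray(wav, dtype=float))[::downsample_factor]  # ~44.1 kHz -> 2205 Hz
    fs_ds = fs / downsample_factor
    above = x > threshold
    onsets = np.flatnonzero(above[1:] & ~above[:-1]) + 1           # upward crossings
    tap_times, last = [], -np.inf
    for idx in onsets:
        t = idx / fs_ds
        if t - last >= min_gap_s:                                  # ignore bounces within one tap
            tap_times.append(t)
            last = t
    return np.asarray(tap_times)

# Inter-tap intervals (for regularity checks and tapped-rate estimation):
# itis = np.diff(extract_tap_times(trial_wav))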
One of our goals was to test whether we could identify trials based on the neural data where the perceived tempo differed from the intended stimulation tempo (see Brain responses to musical features can predict the produced beat tapping rate). For this analysis, we identified two subsets of participants: those who tapped the same tempo in response to two sets of stimuli with different intended stimulation tempi, and those who tapped the intended stimulation tempo on some trials and a different tempo (the harmonic or subharmonic) on other trials. We identified 18 participants who tapped at the intended stimulation tempo for at least 6 trials and who, for at least 6 other trials, tapped that same rate when the stimulation tempo was twice as fast (i.e., they tapped at half the intended stimulation tempo). In contrast, we identified 11 participants who tapped at the intended stimulation tempo for at least 6 trials and at half or double the stimulation tempo for at least 6 other trials. TRFs were submitted to an SVM classifier (see section EEG – Temporal Response Function).
On each trial, participants were asked to rate the musical segments on enjoyment/pleasure, familiarity, and ease of tapping to the beat. The rating scores were normalized to the maximum absolute rating per participant and per category. For the group analysis, the mean and standard error of the mean (SEM) were calculated. To assess the effects of each subjective dimension on neural entrainment, the 15 trials with the highest and the 15 trials with the lowest ratings (regardless of tempo) per participant were further analyzed (see EEG – Temporal Response Function).
Audio Analysis
We assessed neural entrainment to four different musical features (Fig. 1B-C). Note that the term “musical feature” is used to describe time-varying features of music that operate on a similar time scale as neural entrainment, as opposed to classical musical elements such as syncopation or harmony: 1) Amplitude envelope – the gammatone-filtered amplitude envelope in the main manuscript and the absolute value of the full-band Hilbert envelope in the Supplementary Material; the gammatone filterbank consisted of 128 channels linearly spaced between 60–6000 Hz. 2) Half-wave-rectified first derivative of the amplitude envelope, which captures energy changes over time and is typically more sensitive to onsets (Daube et al., 2019, Di Liberto et al., 2020). 3) Binary-coded beat onsets (0 = no beat; 1 = beat); a professionally trained percussionist tapped with a wooden drumstick on a MIDI drum pad to the beat of each musical segment at the original tempo (3 takes per piece). After latency correction, the final beat times were taken as the average of the two takes with the smallest difference (Harrison and Müllensiefen, 2018). 4) Spectral novelty (“spectral flux”) (Müller, 2015), computed using a custom-written Python script (Python 3.6, Spyder 4.2.0) with the packages numpy and librosa; for each sound, the magnitude spectrograms of consecutive frames (frame length = 344 samples) were compared across frequencies. All stimulus features were z-scored and downsampled to 128 Hz for computing stimulus–brain synchrony. To account for slightly different numbers of samples between stimulus features, they were cut to matching sample sizes.
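To make the spectral-flux computation concrete, the following Python sketch (using librosa, as in the original script, but with illustrative parameter choices apart from the 344-sample hop) compares magnitude spectrograms of consecutive frames; it also shows the half-wave-rectified envelope derivative.

import numpy as np
import librosa

def spectral_flux(wav_path, hop=344):
    """Half-wave-rectified frame-to-frame change in the magnitude spectrogram."""
    y, sr = librosa.load(wav_path, sr=44100, mono=True)
    S = np.abs(librosa.stft(y, hop_length=hop))           # magnitude spectrogram
    diff = np.diff(S, axis=1)                              # change between consecutive frames
    flux = np.sum(np.maximum(diff, 0.0), axis=0)           # keep only increases in energy
    return (flux - flux.mean()) / flux.std()               # z-scored, ~128 Hz feature rate

def envelope_derivative(envelope):
    """Half-wave-rectified first derivative of an amplitude envelope."""
    return np.maximum(np.diff(envelope, prepend=envelope[0]), 0.0)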
To validate that each musical feature contained acoustic cues to our tempo manipulation, we conducted a discrete Fourier transform using a Hamming window on each musical segment (resulting frequency resolution of 0.0025 Hz), averaged and z-scored the amplitude spectra per tempo and per musical feature (Fig. 1C).
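A minimal Python sketch of this validation step, assuming a feature sampled at 128 Hz; the windowing and FFT details here are illustrative, and the frequency resolution in the actual analysis follows from the segment lengths used.

import numpy as np

def feature_spectrum(feature, fs=128.0):
    """Amplitude spectrum of a z-scored musical feature (Hamming-windowed DFT)."""
    x = (feature - feature.mean()) / feature.std()
    windowed = x * np.hamming(len(x))
    amp = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, amp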
To assess how much information the different musical features share, a mutual information (MI) score was computed between each pair of musical features (Fig. 1D). MI (in bits) is a time-sensitive measure that quantifies the reduction of uncertainty about one variable after observing a second variable (Cover and Thomas, 2005). MI was computed using quickMI from the Neuroscience Information Theory Toolbox with 4 bins, no delay, and a p-value cut-off of 0.001 (Timme and Lapish, 2018). For each stimulus feature, all trials were concatenated in the same order for each tempo condition and stimulation subgroup (Time x 13 Tempi x 4 Subgroups). MI values for pairs of musical features were compared to surrogate datasets in which one musical feature was time-reversed (Fig. 1D). To statistically assess the shared information between musical features, a three-way ANOVA was performed (first factor: data vs. surrogate; second factor: tempo; third factor: stimulation subgroup).
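The MI computation itself can be sketched as follows; this simplified Python version uses equal-width binning with 4 bins and omits the significance testing implemented in quickMI.

import numpy as np

def binned_mutual_information(x, y, n_bins=4):
    """Mutual information (bits) between two signals after equal-width binning."""
    joint, _, _ = np.histogram2d(x, y, bins=n_bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# Surrogate comparison (as in the analysis): time-reverse one feature
# mi_true = binned_mutual_information(flux, envelope)
# mi_surr = binned_mutual_information(flux, envelope[::-1])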
EEG data preprocessing
Unless stated otherwise, all EEG data were analyzed offline using custom-written Matlab code (R2019b; The MathWorks, Natick, MA, USA) combined with the Fieldtrip toolbox (Oostenveld et al., 2011). The continuous EEG data were bandpass filtered between 0.5–30 Hz (Butterworth filter), re-referenced to the average reference, downsampled to 500 Hz, and epoched from 1 s after stimulus onset (to remove onset responses to the start of the music stimulus) until the end of the initial musical segment presentation (the attentive listening part of the trial). Single trials and channels containing large artefacts were removed based on an initial visual inspection. Missing channels were interpolated based on neighbouring channels with a maximum distance of 3 (ft_prepare_neighbours). Subsequently, Independent Component Analysis (ICA) was applied to remove artefacts and eye movements semi-automatically. After transforming the data back from component to electrode space, electrodes that exceeded 4 standard deviations of the mean squared data for at least 10% of the recording time were excluded. If bad electrodes were identified, preprocessing for that recording was repeated after removing the identified electrodes (Kaneshiro et al., 2020). For the RCA analysis, if an electrode was identified for which 10% of the trial data exceeded a threshold of the mean + 2 standard deviations of the single-trial, single-electrode mean squared amplitude, the electrode data for the entire trial were replaced by NaNs. Next, noisy transients in the single-trial, single-electrode recordings were rejected: data points were replaced by NaNs when they exceeded a threshold of two standard deviations of the single-trial, single-electrode mean squared amplitude. This procedure was repeated four times to ensure that all artefacts were removed (Kaneshiro et al., 2020). For the TRF analysis, which cannot operate on NaNs, noisy transients were replaced by estimates from shape-preserving piecewise cubic spline interpolation, and single-trial bad electrodes were replaced by interpolation of neighbouring channels.
Next, the data were restructured to match the requirements of the RCA or TRF (see sections EEG – Temporal Response Function and EEG – Reliable Component Analysis), downsampled to 128 Hz and z-scored. If necessary, the neural data was cut to match the exact sample duration of the stimulus feature per trial. For the RCA analysis approach, the trials in each tempo condition were concatenated resulting in a time-by-electrode matrix (Time x 32 Electrodes; with Time varying across tempo condition). Subsequently the data of participants in the same subgroup were pooled together in a time-by-electrode-by-participant matrix (Time x 32 Electrodes x 9 or 10 Participants depending on the subgroup). In contrast to the RCA, for the TRF analysis, trials in the same stimulation condition were not concatenated in time, but grouped into cell arrays per participant according to the stimulus condition (Tempo x Trials x Electrodes x Time).
EEG – Reliable Component Analysis
To reduce data dimensionality and enhance the signal-to-noise ratio, we performed RCA (reliable components analysis, also known as correlated components analysis) (Dmochowski et al., 2012). RCA is designed to capture the maximum correlation between datasets of different participants by linearly combining electrodes into a component space. One important feature of this technique is that it maximizes the correlation between electrodes across participants (which differentiates it from the similar canonical correlation analysis) (Madsen et al., 2019). Using the rcaRun Matlab function (Dmochowski et al., 2012, Kaneshiro et al., 2020), the time-by-electrode matrix was transformed into a time-by-component matrix, with the maximum across-trial correlation in the first reliable component (RC1), followed by components with correlation values in descending order. For each RCA calculation, for each tempo condition and subgroup, the first three RCs were retained, together with forward-model projections for visualizing the scalp topographies. The subsequent analysis steps in the time and frequency domains were conducted on the maximally correlated RC1 component.
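The core computation behind correlated components analysis can be viewed as a generalized eigenvalue problem contrasting between-participant cross-covariance with pooled within-participant covariance. The Python sketch below is schematic, assumes a time-by-electrode-by-participant array, and omits the regularization and diagnostics implemented in rcaRun.

import numpy as np
from scipy.linalg import eigh

def correlated_components(data, n_comp=3):
    """data: array (time, electrodes, participants). Returns spatial filters (columns)."""
    T, E, P = data.shape
    X = data - data.mean(axis=0, keepdims=True)
    Rw = np.zeros((E, E))                       # pooled within-participant covariance
    Rb = np.zeros((E, E))                       # between-participant cross-covariance
    for i in range(P):
        Rw += X[:, :, i].T @ X[:, :, i] / T
        for j in range(P):
            if i != j:
                Rb += X[:, :, i].T @ X[:, :, j] / T
    # Components maximizing between- relative to within-participant covariance
    eigvals, eigvecs = eigh(Rb, Rw)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_comp]]           # first column corresponds to RC1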
To examine the correlation between the neural signal and the stimulus over time, the stimulus-response correlation (SRCorr) was calculated for every musical feature. This analysis procedure was adopted from Kaneshiro et al. (2020). In brief, every stimulus feature was concatenated in time across trials of the same tempo condition and subgroup to match the neural component-by-time matrix. The stimulus features were temporally filtered to account for the stimulus–brain time lag, and the filtered stimulus features and neural time courses were correlated. To create the temporal filter, every stimulus feature was transformed into a Toeplitz matrix, in which every column repeats the stimulus-feature time course shifted by one sample, up to a maximum shift of 1 s, plus an additional intercept column. The temporal filter was estimated using the Moore-Penrose pseudoinverse of the Toeplitz matrix and then used to calculate the SRCorr. To report the SRCorr, the mean (± SEM) correlation coefficient across tempo conditions was calculated for every stimulus feature. For comparing tempo specificity between musical features, a linear regression was fit to SRCorr values (and TRF correlations) as a function of tempo for every participant and every musical feature (using fitlm). We compared the resulting slopes across musical features with a one-way ANOVA.
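A simplified Python sketch of the SRCorr computation, assuming a single stimulus feature and a single neural component time course sampled at 128 Hz; variable names are illustrative.

import numpy as np

def src_corr(stimulus, neural, fs=128, max_lag_s=1.0):
    """Stimulus-response correlation after fitting a temporal filter to the stimulus."""
    n_lags = int(max_lag_s * fs)
    T = len(stimulus)
    # Each column repeats the stimulus shifted by one more sample, plus an intercept column
    design = np.ones((T, n_lags + 1))
    for lag in range(n_lags):
        design[lag:, lag] = stimulus[: T - lag]
        design[:lag, lag] = 0.0
    weights = np.linalg.pinv(design) @ neural          # Moore-Penrose pseudoinverse
    filtered = design @ weights                        # temporally filtered stimulus
    return np.corrcoef(filtered, neural)[0, 1]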
Stimulus-response coherence (SRCoh) quantifies the consistency of phase and amplitude relations between two signals in a specific frequency band and ranges from 0 (no coherence) to 1 (perfect coherence) (Srinivasan et al., 2007). Here, the magnitude-squared coherence between the different stimulus features and the neural data was computed using the function mscohere with a 5-s Hamming window and 50% overlap, resulting in a frequency range of 0–64 Hz with 0.125 Hz resolution. For visualizing the mean frequency response per musical feature, the coherence values at each stimulation tempo were normalized by dividing by the mean coherence across all other stimulation tempi per frequency bin (Fig. 2E-H) (van Bree et al., 2021). As strong coherence was found at the stimulation tempo and its first harmonic, the SRCoh values at these frequencies were compared between musical features.
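In Python, an analogous computation can be done with scipy.signal.coherence; the sketch below mirrors the 5-s Hamming window and 50% overlap and zero-pads the FFT to obtain 0.125-Hz resolution (the actual analysis used Matlab's mscohere).

import numpy as np
from scipy.signal import coherence

def src_coherence(stimulus, neural, fs=128):
    """Magnitude-squared coherence, 5-s Hamming windows, 50% overlap."""
    nperseg = 5 * fs                                     # 640-sample windows
    freqs, coh = coherence(stimulus, neural, fs=fs, window='hamming',
                           nperseg=nperseg, noverlap=nperseg // 2,
                           nfft=1024)                    # zero-padded -> 0.125 Hz resolution
    return freqs, coh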
EEG – Temporal Response Function
The TRF is a system-identification technique that computes a filter optimally describing the relationship between stimulus features and the brain response (Ding and Simon, 2012, Crosse et al., 2016). Via linear convolution, the filter delineates how the stimulus features map onto the neural response (forward model), using ridge regression to avoid overfitting. All TRF computations used the Matlab toolbox “The multivariate Temporal Response Function (mTRF) Toolbox” (Crosse et al., 2016). The TRF was calculated in a leave-one-out cross-validation manner across all trials per stimulation tempo; this procedure was repeated for each musical feature separately, and additionally for all musical features together in a multivariate model (using mTRFcrossval and mTRFtrain), with time lags of 0–400 ms (Di Liberto et al., 2020). Using mTRFpredict, the neural time course of the left-out trial was predicted based on the time course of the corresponding musical feature of that trial. The quality of the prediction was assessed by computing Pearson correlations between the predicted and actual EEG data separately for each electrode (TRF correlations). We averaged over the seven to eight electrodes with the highest TRF correlations that also corresponded to a canonical auditory topography. To quantify differences between TRFs, the mean TRF correlation across stimulation tempi and/or musical features was calculated per participant. The TRF weights across time lags were Fisher-z-scored (Fig. 3C-F) (Crosse et al., 2016).
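Schematically, a single-feature forward TRF reduces to a lagged ridge regression; the Python sketch below illustrates the idea with 0–400 ms lags and a Pearson-correlation score for a left-out trial. The analysis itself used the mTRF toolbox, and the ridge parameter here is a placeholder.

import numpy as np

def lagged_design(stim, fs=128, tmin=0.0, tmax=0.4):
    """Stack time-lagged copies of the stimulus (0-400 ms) into a design matrix."""
    lags = np.arange(int(tmin * fs), int(tmax * fs) + 1)
    T = len(stim)
    X = np.zeros((T, len(lags)))
    for k, lag in enumerate(lags):
        X[lag:, k] = stim[: T - lag]
    return X

def fit_trf(stim, eeg, fs=128, ridge=1.0):
    """Forward TRF: ridge-regression weights mapping a stimulus feature to one channel."""
    X = lagged_design(stim, fs)
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)

def predict_and_score(stim, eeg, weights, fs=128):
    """Predict the neural response of a left-out trial and return the TRF correlation."""
    pred = lagged_design(stim, fs) @ weights
    return np.corrcoef(pred, eeg)[0, 1]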
The assessment of TRF weights across time lags was accomplished using a clustering approach for each musical feature, comparing significant data clusters to clusters from a random distribution (Fig. 3C-F). To extract time windows in which the TRF weights differentiated the tempo conditions, a one-way ANOVA was performed at each time point. Clusters (consecutive time windows) were identified when the p-value fell below a significance level of 0.01, and the size and F-statistic of each cluster were retained. Next, the clusters were compared to a surrogate dataset that followed the same procedure, but with the tempo-condition labels randomly shuffled before entering the ANOVA. This step was repeated 1000 times (permutation testing). Finally, the significance of each cluster was evaluated as 1 minus the proportion of permutations in which the cluster's summed F-value exceeded the summed F-values of the surrogate clusters; a p-value below 0.05 was considered significant (Fig. 3G). This approach yielded significant regions for the full-band (Hilbert) envelope and the derivative (Supplementary Fig. 4). As these clusters differed in time rather than amplitude, a latency analysis was conducted: local minima or maxima around the grand-average minimum or maximum within the significant time-lag window were identified for every participant and tempo condition, and the latencies were retained. As there was no significant correlation between latencies and tempo conditions, the stimulation tempi were split, upon visual inspection, into two groups (1–2.5 Hz and 2.75–4 Hz). Subsequently, a piecewise linear regression was fitted to the data and the R2 and p-values were calculated (Supplementary Fig. 4G, K).
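The cluster-based permutation logic can be sketched as follows in Python; cluster definition and the label-shuffling scheme are simplified relative to the actual analysis, and variable names are illustrative.

import numpy as np
from scipy.stats import f_oneway

def cluster_stats(trf_by_tempo, alpha=0.01):
    """trf_by_tempo: list of (participants x time) arrays, one per tempo condition.
    Returns summed F-values of clusters of consecutive significant time points."""
    n_time = trf_by_tempo[0].shape[1]
    F = np.zeros(n_time)
    p = np.ones(n_time)
    for t in range(n_time):
        F[t], p[t] = f_oneway(*[cond[:, t] for cond in trf_by_tempo])
    sums, current = [], 0.0
    for t in range(n_time):
        if p[t] < alpha:
            current += F[t]
        elif current > 0:
            sums.append(current)
            current = 0.0
    if current > 0:
        sums.append(current)
    return sums

def permutation_null(trf_by_tempo, n_perm=1000, seed=0):
    """Null distribution of maximal cluster mass with shuffled tempo labels."""
    rng = np.random.default_rng(seed)
    stacked = np.concatenate(trf_by_tempo, axis=0)
    sizes = [cond.shape[0] for cond in trf_by_tempo]
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(stacked.shape[0])
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(stacked[perm[start:start + s]])
            start += s
        sums = cluster_stats(shuffled)
        null.append(max(sums) if sums else 0.0)
    return np.asarray(null)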
TRFs were evaluated based on participant ratings of enjoyment, familiarity, and ease of tapping to the beat. Two TRFs were calculated per participant, based on the 15 highest-rated and 15 lowest-rated trials for each measure (ignoring tempo condition and subgroup), and the TRF correlations and time lags were compared between the two groups of trials (Fig. 5). Significant differences between the groups were evaluated with paired-sample t-tests.
The effect of musical sophistication was analyzed by computing the Pearson correlation coefficients between the maximum TRF correlation across tempi per participant and the general musical sophistication (Gold-MSI) per participant (Supplementary Fig. 6).
A support vector machine (SVM) classifier tested whether TRFs captured information about the intended stimulation tempo, the perceived beat rate, or both (Fig. 6). As described previously (see Behavioral data), individual tempo conditions were identified in which participants tapped the same rate for two sets of trials that had different intended stimulation tempi, and conditions were identified in which participants tapped two different rates in response to the same intended stimulation tempo. TRF analysis was performed separately for those two groups of trials, and the z-scored TRF weights were fed into the SVM classifier. First, the SVM classifier was trained to predict the stimulation tempo based on the TRF weights for trials on which the stimulation tempo corresponded to the tapped rate versus trials on which the same tapped rate was twice the stimulation tempo (tapped rate = intended stimulation tempo vs. same tapped rate = 2*stimulation tempo; n=18). In comparison, we next identified participants who tapped for 6 trials at the intended tempo and for 6 trials at the harmonic of that intended tempo (intended stimulation tempo = tapped rate vs. same stimulation tempo = 2*tapped rate, n=13). The resulting TRFs were used to predict the tapped rate of the participants. Overall, the classifier was trained to find the optimal hyperplane separating the data (fitcsvm) and was validated with a leave-one-out cross-validation method (crossval). Classification error (quantified with kfoldLoss) was compared to a surrogate condition in which the labels of the classifier were randomly shuffled during the training step. The SVM was computed for 100 iterations of the surrogate data. An SVM-accuracy metric was quantified as:
leading to a matrix of 4 Musical Features x 13 or 18 Tempo conditions x 100 SVM repetitions.
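A simplified Python sketch of the classification step using scikit-learn (the analysis used Matlab's fitcsvm, crossval, and kfoldLoss): leave-one-out cross-validation on z-scored TRF weights, compared against label-shuffled surrogates.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

def svm_accuracy(trf_weights, labels, n_surrogate=100, seed=0):
    """Leave-one-out SVM accuracy on TRF weights vs. label-shuffled surrogates."""
    rng = np.random.default_rng(seed)
    clf = SVC(kernel='linear')
    acc = cross_val_score(clf, trf_weights, labels, cv=LeaveOneOut()).mean()
    surrogate = np.array([
        cross_val_score(clf, trf_weights, rng.permutation(labels), cv=LeaveOneOut()).mean()
        for _ in range(n_surrogate)
    ])
    return acc, surrogate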
EEG – Comparison of TRF and RCA measures
The relationship between the TRF correlations and the SRCorr was assessed using a linear mixed-effects model (using fitlme). Participant and tempo were modeled as random (grouping) effects, SRCorr as the fixed (predictor) effect, and the TRF correlation as the response variable. To examine the underlying model assumptions, the residuals of the linear mixed-effects model were plotted and checked for consistency. The best predictors of the random effects and the fixed-effects coefficients (beta) were computed for every musical feature and illustrated as violin plots (Fig. 4).
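An analogous model can be specified in Python with statsmodels (the analysis itself used Matlab's fitlme); the data frame below is synthetic and the column names are illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data frame: one row per participant x tempo condition
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "participant": np.repeat(np.arange(34), 13),
    "tempo": np.tile(np.arange(1.0, 4.25, 0.25), 34),
    "srcorr": rng.normal(0.05, 0.02, 34 * 13),
})
df["trf_corr"] = 0.5 * df["srcorr"] + rng.normal(0, 0.01, len(df))

model = smf.mixedlm("trf_corr ~ srcorr", df,
                    groups=df["participant"],              # random intercept per participant
                    vc_formula={"tempo": "0 + C(tempo)"})   # variance component for tempo
print(model.fit().summary())                               # fixed-effect coefficient (beta) for srcorr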
Statistical Analysis
For each analysis, we assessed overall differences between the subgroups using a one-way ANOVA. To test for significant differences across tempo conditions and musical features (TRF correlation, SRCorr, and SRCoh), repeated-measures ANOVAs were conducted, followed by Tukey's tests; Greenhouse-Geisser correction was applied when the assumption of sphericity was violated (as assessed with Mauchly's test). As effect-size measures, we report partial η2 for repeated-measures ANOVAs and r-equivalent for paired-sample t-tests (Rosenthal and Rubin, 2003). Where applicable, p-values were corrected using the False Discovery Rate (FDR).
Data Availability
The EEG data will be made publicly available on an Open Research Data Repository.
Author Contributions
KW: Conceptualization, Methodology, Software, Investigation, Formal Analysis, Visualization, Writing - Original draft preparation. OXW: Software, Writing - Reviewing and Editing. MJH: Conceptualization, Methodology, Software, Formal Analysis, Supervision, Writing - Original draft preparation, Writing - Reviewing and Editing, Funding acquisition.
Competing financial interests
There are no competing financial interests in relation to the work described in this manuscript.
Acknowledgments
This work was funded by the ERC Starting Grant awarded to Molly Henry (ERC-STG-804029 BRAINSYNC). We would like to thank the lab staff of the Max Planck Institute for Empirical Aesthetics for technical support during data acquisition and Lauren Fink for valuable input during data analysis and stimulus feature design.