The neural response to the temporal fine structure of continuous musical pieces is not affected by selective attention

Speech and music are spectro-temporally complex acoustic signals that are highly relevant for humans. Both contain a temporal fine structure that is encoded in the neural responses of subcortical and cortical processing centres. The subcortical response to the temporal fine structure of speech has recently been shown to be modulated by selective attention to one of two competing voices. Music similarly often consists of several simultaneous melodic lines, and a listener can selectively attend to a particular one at a time. However, the neural mechanisms that enable such selective attention remain largely enigmatic, not least since most investigations to date have focussed on short and simplified musical stimuli. Here we study the neural encoding of classical musical pieces in human volunteers, using scalp electroencephalography (EEG) recordings. We presented volunteers with continuous musical pieces composed of one or two instruments. In the latter case, the participants were asked to selectively attend to one of the two competing instruments and to perform a vibrato identification task. We used linear encoding and decoding models to relate the recorded EEG activity to the stimulus waveform. We show that we can measure neural responses to the temporal fine structure of melodic lines played by one single instrument, at the population level as well as for most individual subjects. The neural response peaks at a latency of 7.6 ms and is not measurable past 15 ms. When analysing the neural responses elicited by competing instruments, we find no evidence of attentional modulation. Our results show that, much like speech, the temporal fine structure of music is tracked by neural activity. In contrast to speech, however, this response appears unaffected by selective attention in the context of our experiment.


Introduction
Music is a fascinatingly complex acoustic stimulus. Listeners can follow multiple melodic lines played by different instruments by separating them on the basis of characteristics such as pitch and timbre (Cross et al., 2008). However, the neural mechanisms that group the sounds in music into distinct melodic lines, forming distinct auditory streams, and that allow attention to be directed to one of the lines remain largely unknown (Bregman, 1994). This is partly due to the difficulty of assessing the neural processing of real-world acoustic signals, which have a much richer structure than the simple pure tones and short simplified music patterns that have traditionally dominated research in auditory neuroscience. […]

Figure 1. The stimuli consisted of one single melodic line (SI) or of two melodic lines (CI). Each melodic line was played either by a guitar or by a piano. In the CI stimuli, each melodic line was played by a different instrument. Vibratos were inserted into the acoustic waveform of each melody (grey shading). In the CI condition, the subjects had to attend to one of the two instruments and identify the corresponding vibratos (green tick marks) while ignoring the other instrument and its vibratos (red crosses). The stimuli were presented in blocks composed of an SI stimulus followed by a CI stimulus, during which the subject was asked to attend to the instrument that they had heard before in the SI stimulus. The attended instrument was alternated between blocks, and each block was played twice such that the attended instrument differed in the two presentations. The volunteers' neural responses were recorded throughout the experiment through bipolar two-channel EEG recordings.
Methods

[…] The presentation order of the remaining blocks was pseudo-randomised across participants. In the second presentation of a given CI stimulus, a subject was asked to attend to the instrument they had ignored in the first presentation. Two consecutive blocks did not correspond to the same invention. Each participant therefore heard each CI stimulus twice, but attended to a different instrument in each presentation. Whether a participant was initially asked to attend to the guitar or to the piano was decided randomly. The participant's neural responses were measured through scalp electroencephalography (EEG) with a two-channel bipolar montage (head vertex minus mastoids).

We used encoding and decoding approaches (linear forward and backward models) to relate the acoustic stimuli to the recorded neural data. We specifically investigated the neural representation of the temporal fine structure by using the stimulus waveform as a feature. We first established that we could indeed record a significant neural response to this feature, by comparing the neural responses to a null distribution at the level of individual subjects as well as at the population level. We then studied the time course of the response in the region between 0 and 45 ms, using both forward and backward models. Finally, we investigated a putative attentional modulation of this neural response by contrasting the encoding of each instrument in the neural data when attended versus ignored. We used conservative filters to reduce distortions to the neural responses and their latencies, but verified that our results, and in particular the ones related to attention, did not change with stronger filtering (data not shown).

Code and data availability. The analysis presented in this manuscript was implemented using MATLAB (R2019b, The MathWorks Inc.) with the EEGLAB toolbox (Delorme & Makeig, 2004). […]

Stimuli. […] The notes of the guitar were lowered by one octave so that their fundamental frequencies fell below 500 Hz. They nonetheless remained somewhat higher than those of the piano notes (figure 2A, B). The music stimuli were synthesised from MIDI files to generate wav files. These were then processed in MATLAB to apply vibratos to ten segments in each melodic line. Each vibrato was constructed by introducing a sinusoidal warp at a modulation frequency of 8 Hz on the waveform of a single note. The onset and offset times of the notes were obtained from the MIDI files using the MIDI Toolbox for MATLAB (Eerola & Toiviainen, 2004). The notes were selected such that the onsets of any two vibratos in a given piece, whether played by the same or by different instruments, were separated by at least one second.

The waveforms s_CI(t) of the CI stimuli were constructed by normalising the guitar waveform s_g(t) and the piano waveform s_p(t) by their root-mean-square (RMS) values and mixing them: s_CI(t) = s_g(t)/RMS(s_g) + 1.25 · s_p(t)/RMS(s_p). The mixing parameter of 1.25 for the piano was chosen following a small pilot study to balance the difficulty of attending to either the guitar or the piano.

The duration of the seven Two-Part Inventions was, taken together, 11.2 minutes. In the SI conditions, only the first half of the corresponding invention was played.
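The mixing step can be illustrated with a minimal MATLAB sketch (the variable names and the final rescaling to avoid clipping are our own, not taken from the original code):

```matlab
% Mix the guitar and piano waveforms into a competing-instruments (CI)
% stimulus: normalise each waveform by its RMS value and weight the
% piano by the factor of 1.25 reported above.
% sGuitar, sPiano: column vectors of equal length (44.1 kHz audio).
rmsVal = @(x) sqrt(mean(x.^2));           % root-mean-square value
sCI = sGuitar / rmsVal(sGuitar) ...
    + 1.25 * sPiano / rmsVal(sPiano);     % weighted mixture
sCI = 0.95 * sCI / max(abs(sCI));         % rescale to avoid clipping (assumed step)
```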
Behavioural task. In the CI condition, the subjects were instructed to listen attentively to one instrument while ignoring the other. They were also asked to classify the vibratos they heard by pressing a key to indicate the ones that belonged to the attended instrument. A key press within two seconds of the onset of a vibrato in the attended or the ignored instrument was classified as a "true positive" (TP) or a "false positive" (FP), respectively. Key presses outside of these ranges were classified as "unprompted" and were not analysed further. Due to a technical error, behavioural data were not recorded for one subject, and only the results for the 16 remaining subjects were analysed. The sensitivity index d-prime was computed for each subject when attending to the guitar and to the piano, and it was compared between the two conditions at the population level using a two-tailed paired Wilcoxon signed rank test. Moreover, for each condition the TP rate (TPR) was compared to the FP rate (FPR), and the TPR and FPR were compared between conditions at the population level using two-tailed paired Wilcoxon signed rank tests with FDR correction for multiple comparisons (four tests).

Neural data acquisition and stimulus presentation. Scalp EEG was recorded through five passive Ag/AgCl electrodes (Multitrode, BrainProducts, Germany). Two electrodes were positioned at the cranial vertex (Cz), and two electrodes were placed on the left and right mastoid processes. A ground electrode was placed on the forehead. The impedance between each electrode and the skin was reduced below 5 kOhm using abrasive electrolyte gel (Abralyt HiCl, Easycap, Germany). One vertex electrode was paired with the left mastoid electrode, and they were connected to, respectively, the non-inverting and inverting ports of a bipolar amplifier (EP-PreAmp, BrainProducts, Germany). The remaining vertex and mastoid electrodes were similarly connected to a second identical amplifier. The output of each bipolar pre-amplifier was fed into an amplifier (actiCHamp, BrainProducts, Germany) and digitised at a sampling frequency of 5 kHz, thus yielding two electrophysiological data channels. The audio stimuli were simultaneously recorded at 5 kHz by the amplifier through an acoustic adapter (Acoustical Stimulator Adapter and StimTrak, BrainProducts, Germany). This channel, together with independent analogue triggers delivered through an LPT port, was used to temporally align the EEG data and the stimuli through cross-correlation. The stimuli were delivered diotically at a comfortable loudness level through insert tube earphones (ER-3C, Etymotic, USA) to minimise stimulation artifacts. These earphones introduced a 1 ms delay that was compensated for by shifting the neural data forward in time by 1 ms.

EEG data filtering. To analyse the neural responses to the temporal fine structure of the stimuli, the EEG data were high-pass filtered above 130 Hz (windowed-sinc filters, Kaiser window, one pass forward and compensated for delay; cut-off: 115 Hz, transition bandwidth: 30 Hz, order: 536). These filters rejected lower-frequency neural activity but reduced the temporal precision of the data, as evidenced by the auto-correlation function of the filtered EEG data (figure 2C). Notably, they were non-causal filters that spread responses in both temporal directions.
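A MATLAB sketch of this high-pass filtering step under the specifications given above (the Kaiser beta parameter is an assumption, as only the cut-off, transition bandwidth, and order are reported):

```matlab
% Windowed-sinc high-pass FIR filter (cut-off 115 Hz, order 536),
% applied in one forward pass and compensated for its group delay.
fs    = 5000;                 % EEG sampling rate (Hz)
order = 536;                  % filter order (even, as required for high-pass)
fc    = 115;                  % cut-off frequency (Hz)
beta  = 5;                    % Kaiser window beta (assumed value)
b = fir1(order, fc/(fs/2), 'high', kaiser(order+1, beta));
% One pass forward; a linear-phase FIR filter of order N delays the
% signal by N/2 samples, which is removed after zero-padding the end.
eegPadded   = [eeg; zeros(order/2, size(eeg, 2))];   % eeg: samples x channels
eegFiltered = filter(b, 1, eegPadded);
eegFiltered = eegFiltered(order/2 + 1 : end, :);
```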

Stimulus representation. Since the vibratos might lead to neural responses deviating from the ones elicited by the rest of the tracks, the parts of the stimulus waveforms that corresponded to them were replaced with zeros to create the stimulus representations (features) used in the encoding and decoding models. These waveforms were then low-pass filtered and resampled from 44.1 kHz to 5 kHz, the sampling frequency of the EEG data, using a linear-phase FIR anti-aliasing filter (windowed-sinc filter, Kaiser window, one pass forward and compensated for delay; cut-off: 2,250 Hz, transition bandwidth: 500 Hz, order: 14,126).

Encoding models. We used regularised linear forward models to derive the neural response to the stimulus waveform. In these convolutive encoding models, the measured EEG response is modelled as r(t) = (s * h)(t) + n(t), where s is the stimulus waveform, h the neural response or temporal response function (TRF), n is noise, and * denotes convolution. In practice, assuming a non-zero response only in a time interval (τ_min; τ_max), and with discrete data, the EEG activity r_i(t) at channel i is expressed as a weighted sum over the corresponding time lags τ: r_i(t) = Σ_τ h(τ) s(t − τ) + n_i(t). Given the bipolar montage we used, as well as the diotic stimulus presentation, we did not expect any difference between the two EEG channels and assumed the same neural response for both. The model was estimated for time lags spanning τ_min = −100 ms to τ_max = 45 ms. A population-averaged TRF was fitted using ridge regression coupled with a leave-one-subject-out and leave-one-data-part-out cross-validation. […] The model performance was evaluated by dividing the predicted neural response r̂ and the measured EEG activity from the testing data in each fold into 10-s long segments, and by computing Pearson's correlation coefficient between each pair of corresponding segments. The correlation coefficients thus obtained were then averaged over all cross-validation folds as well as over all EEG channels.

The performance was assessed for models corresponding to 25 normalised regularisation coefficients that were distributed uniformly on a logarithmic scale between 10^−6 and 10^6. The regularisation […] instruments.

To ascertain the relative contributions of the onsets and of the sustained parts of the notes to the neural response, we created a new representation of the stimuli in which the note onsets were suppressed. This was achieved by multiplying the original stimulus waveforms by a 60-ms window w centred on each note onset, with w(t) = 1 − h(t), where h represents a 60-ms Hann window. Forward models were then derived for the original stimuli and their onset-suppressed versions for the two SI conditions taken together, by pooling the data from both instruments. These two models were fitted and their significance was ascertained as described above, that is, by comparing the causal part to the null models, with FDR correction for multiple comparisons over time points and over the two models. In the cross-validation procedure, two data parts, one from each SI condition and corresponding to the same invention, were left out at each stage.

Decoding models. We also used backward models to reconstruct the stimulus waveform as a linear combination of the neural activity r_i on each channel i at different time lags τ: ŝ(t) = Σ_i Σ_τ g_i(τ) r_i(t + τ). The coefficients g_i were trained for each subject independently, using ridge regression with a leave-one-part-out cross-validation and a […] instruments.
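To illustrate how such a model can be fitted, the following MATLAB sketch estimates a forward TRF by ridge regression on a lagged design matrix (a minimal single-channel version with illustrative variable names and an illustrative regularisation coefficient; the actual analysis used cross-validation over subjects and data parts):

```matlab
% Estimate a temporal response function (TRF) by ridge regression.
% s: stimulus waveform, r: one EEG channel (column vectors at 5 kHz).
fs    = 5000;
lags  = round(-0.100*fs) : round(0.045*fs);   % -100 ms to +45 ms
nLags = numel(lags);
nSamp = numel(s);
X = zeros(nSamp, nLags);                      % lagged design matrix
for k = 1:nLags
    tau = lags(k);
    if tau >= 0
        X(tau+1:end, k) = s(1:end-tau);       % column holds s(t - tau)
    else
        X(1:end+tau, k) = s(1-tau:end);       % acausal (negative) lags
    end
end
lambda = 1e2;                                 % ridge coefficient (illustrative)
h = (X'*X + lambda*eye(nLags)) \ (X'*r);      % one TRF sample per time lag
rHat = X*h;                                   % predicted EEG response
cc = corrcoef(rHat, r);                       % evaluation: Pearson correlation
fprintf('prediction accuracy: r = %.3f\n', cc(1,2));
```

The backward (decoding) models follow the same recipe with the roles of stimulus and EEG exchanged, using the EEG channels at delays of 0 to 15 ms as predictors of the stimulus waveform.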
Significance was also derived at the population level, using the mean correlation coefficient of each subject from the null window of negative delays to create a null population-level distribution. To test the time windows in which a significant response could be detected, the mean reconstruction accuracies from three windows of interest (0 to 15 ms; 15 to 30 ms; 30 to 45 ms) were compared to this null distribution using one-tailed paired Wilcoxon signed rank tests with FDR correction for multiple comparisons over windows and instruments.

Since the guitar and piano waveforms formed pairs derived from the same inventions, and although their frequency contents were different, one may wonder whether one instrument could be predicted from the other, and in turn whether the neural responses to one instrument could be used to predict or decode the other. To address this question, we trained linear backward models that sought to reconstruct the waveform of one instrument from the neural data that was recorded while the other instrument from the same invention was played in the SI conditions (0 to 15 ms reconstruction window). The model performance was then compared to the null distribution described previously (obtained from a −15 to 0 ms reconstruction window) at the population level, using one-tailed paired Wilcoxon signed rank tests.

Competing conditions, attended and ignored instruments. In the CI conditions, we trained backward models to reconstruct the waveform of either the attended or the ignored instrument independently, using a window of temporal delays from τ_min = 0 ms to τ_max = 15 ms as detailed above. We then compared the neural encoding of each instrument, when attended and when ignored, at the population level, using two-tailed paired Wilcoxon signed rank tests with FDR correction for multiple comparisons over instruments.

We also used forward models reconstructing the neural activity as the sum of two neural responses, one to the attended instrument and one to the ignored one. In this instance, the EEG response is modelled as r(t) = (s_A * h_A)(t) + (s_I * h_I)(t) + n(t), where s_A and s_I are the attended and ignored stimulus waveforms, and h_A and h_I the corresponding TRFs. In a similar manner to the procedures described previously, population-averaged TRFs were fitted using ridge regression coupled with a leave-one-subject-out and leave-one-data-part-out cross-validation for time lags spanning τ_min = −100 ms to τ_max = 45 ms on the pooled data from the two CI conditions. To assess the presence of a putative attentional modulation in the obtained TRFs, the distribution of amplitudes across subjects was compared between the attended and ignored TRFs at each time point in the 0 ms to 15 ms region of interest (two-tailed paired Wilcoxon signed rank tests with FDR correction for multiple comparisons over time points).
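The population-level comparisons described above, paired Wilcoxon signed rank tests with FDR correction, could be sketched in MATLAB as follows (the Benjamini-Hochberg procedure is written out explicitly as an assumption, since the manuscript does not state which FDR routine was used):

```matlab
% Paired two-tailed Wilcoxon signed rank tests across subjects, followed
% by Benjamini-Hochberg FDR correction over the family of tests.
% attended, ignored: subjects x tests matrices (e.g. reconstruction accuracies).
nTests = size(attended, 2);
p = zeros(1, nTests);
for k = 1:nTests
    p(k) = signrank(attended(:, k), ignored(:, k));  % paired, two-tailed
end
q = 0.05;                                 % target false discovery rate
[pSorted, idx] = sort(p);
crit = (1:nTests) / nTests * q;           % BH critical values
kMax = find(pSorted <= crit, 1, 'last');  % largest rank passing the criterion
significant = false(1, nTests);
if ~isempty(kMax)
    significant(idx(1:kMax)) = true;      % hypotheses rejected at FDR level q
end
```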

Results
We asked volunteers to attend to continuous musical pieces consisting of either one single instrument (SI) or two competing instruments (CI) while we recorded their neural activity using EEG (figure 1). We first sought to analyse the neural response to the temporal fine structure of a single melodic line. To this end, we computed a linear forward model to derive neural responses to the stimulus waveform at the population level in the SI conditions (figure 3A). The temporal response functions that we obtained for the two instruments were qualitatively similar to each other. They displayed a major significant response at a latency of 7.6 ms, as well as a minor positive peak at 2.2 ms, with sidelobes reminiscent of the EEG auto-correlation function (figure 2C).

The neural response to the temporal fine structure may be related to the well-established frequency-following response (FFR). Because the latter is known to first exhibit a response to a stimulus onset and to then follow the sustained features, we explored the relative contributions of the note onsets and of their sustained oscillations to the neural response. We therefore trained a forward model with stimulus waveforms in which the note onsets were suppressed (figure 3B). The obtained temporal response functions had similar significant regions and resembled the temporal response functions to the original stimulus waveforms. Moreover, the causal parts of the two temporal response functions, those with positive delays, were highly correlated (r = 0.96).

As an alternative to the forward models, we then also used decoding models that reconstructed the stimulus waveforms from the EEG data. We computed these models for each subject in the SI condition. To ascertain the statistical significance of the reconstructions, we used a window from −15 ms to 0 ms to provide a null distribution of performance. Compared to this chance level, we found that a significant reconstruction accuracy could be obtained for most subjects when using time lags from 0 to 15 ms, for both the guitar and the piano (figure 4A). Indeed, significant reconstructions of the guitar waveforms were obtained in 11 out of 17 subjects (p ≤ 0.05), and in 10 subjects for the piano waveforms […]

Armed with the ability to measure neural responses to the temporal fine structure of the notes in a particular melody, we then investigated whether this response was affected by selective attention. To this end, we analysed the CI stimuli, in which the participants had to attend selectively to one instrument while ignoring the other. We monitored attention by asking the volunteers to classify vibratos inserted into the melodic line played by the target instrument. The participants exhibited varied performances on this task; however, they all performed, on average, better than a random observer, as shown by their receiver operating characteristics, when selectively attending to either of the two instruments (figure 5A). Accordingly, at the population level, the TPR was significantly larger than the FPR when attending to either instrument (p < 10^−3 for both guitar and piano). The sensitivity index d' was significantly larger when attending to the piano than when attending to the guitar (p = 1.5 · 10^−2), with average values of 2.0 and 1.5, respectively (figure 5B).
The TPR did not differ significantly between the two CI conditions (p = 0.88; figure 5C), but the FPR was significantly higher when the subjects were attending to the guitar compared to the piano (p = 1.79 · 10^−3; figure 5D).

In order to test for a putative attentional modulation of the encoding of the stimulus temporal fine structure, we first used backward models with a window from 0 to 15 ms to reconstruct each instrument's waveform when it was attended as well as when it was ignored. The reconstruction accuracies did not, however, exhibit a statistically significant difference between the attended and the ignored case (figure 6A; p = 0.49 for both guitar and piano).

We then computed a linear forward model that included two features, the attended and the ignored instrument. This model was trained using the pooled data from the two CI conditions. It allowed us to compare the amplitudes of the attended and ignored TRFs at each time lag from 0 to 15 ms. No significant difference between the amplitudes emerged at any temporal lag (figure 6B).

Discussion
We showed for the first time that neural responses to the temporal fine structure of continuous musical melodies can be obtained from EEG recordings using linear convolutive models. In particular, we demonstrated that the EEG recordings could in part be predicted from the acoustic waveform (forward model, figure 3). Vice versa, the temporal fine structure of the musical stimuli could be decoded from the corresponding EEG recordings (backward model, figure 4). Significant responses could be obtained in most individual subjects when they were exposed to about 5 minutes of a single melodic line.

The neural response at the population level revealed further information about its origin. Indeed, the significant parts of the response, as obtained from the forward models, emerged most strongly at a latency of 7.6 ms (figure 3A). The responses at the other latencies may have reflected our use of high-pass filters for the EEG data, which spread the response in both temporal directions (Widmann et al., 2015). The autocorrelation of the filtered EEG data exhibited sidelobes that are reminiscent of the structure of some of the peaks that we obtained in the neural responses (figure 2C).

The backward model likewise showed that only delays between 0 ms and 15 ms allowed for a significant reconstruction of the stimulus waveform. Together with the evidence from the forward model, these delays suggest a subcortical origin of the neural response, putatively in the inferior colliculus, although different subcortical structures may contribute as well (Bidelman, 2015, 2018). […] The scalp-recorded FFR may accordingly combine multiple subcortical and cortical sources (Coffey et al., 2019). While the neural response that we have described here is arguably of subcortical origin, our use of only two EEG channels may have obstructed the observation of later cortical sources with different dipole orientations.

Neural responses can occur to both transient (e.g. clicks, onsets) and sustained (e.g. temporal fine structure) features of complex stimuli. When investigating the frequency-following response (FFR), for instance, these two aspects can be segregated by time regions (Skoe & Kraus, 2010). However, the continuous nature of the stimuli that we used here did not allow for this type of analysis. Instead, we trained a forward model with stimulus waveforms in which the note onsets were suppressed, and compared it to a forward model trained on the intact waveforms (figure 3A, B). The two responses were strikingly similar, suggesting that they are primarily driven by the sustained periodic oscillations of individual notes rather than by their onsets. This may be expected, as these sustained oscillations constituted most of our music stimuli. In a click train, in contrast, transients dominate the temporal fine structure.

When the participants were presented with stimuli consisting of two competing instruments, they had to selectively attend to one of them and identify vibratos that were inserted in the melodic line of that instrument. We used this task as a marker of selective attention, comparable to the use of comprehension questions in the case of speech stimuli. We found that most subjects were able to identify the target vibratos whilst ignoring the distractors (figure 5). The sensitivity index d' was significantly larger when attending to the piano than to the guitar.
When attending to either instrument, the true-positive rate did not significantly differ, but the false-positive rate was lower when attending to the piano, indicating that this effect mediated the difference in d' values. We hypothesise that, since pianos cannot naturally produce vibratos, the participants may have had a bias leading to a higher propensity to attribute vibratos to the guitar. The two tasks were thus overall balanced, but attending to the piano may have been somewhat easier for the participants.

The task of attending to one of two melodic lines allowed us to investigate whether the neural response to the temporal fine structure of a particular melodic line was modulated by selective attention.
Following our results on the statistical methods for obtaining this neural response, we employed backward models to reconstruct the stimulus waveform from the EEG recording, using temporal delays between 0 ms and 15 ms. We did not, however, find any significant difference between the resulting reconstruction accuracies of a melodic line when it was being attended or ignored, for either instrument. […]

These differences may point to underlying differences between music and speech. First, the two melodic lines that we used in the present work may have been difficult to attend to selectively, since they originated from one musical piece, were contrapuntal, and often followed or responded to each other. The resulting interaction between the two melodic lines makes their juxtaposition rather different from that of two independent competing voices that do not interact but merely generate informational and acoustical masking. While two competing speakers may encourage selective attention to, and neural processing of, one of them, our two melodic lines may therefore rather encourage attention to, as well as neural processing of, the acoustic mixture.

Second, the subjects who participated in the competing-speaker experiments effectively had lifelong training in isolating one speaker from noise, due to the relevance of this task in daily life. As already hinted at above, we speculate that musical stimuli are instead generally perceived as a whole. […] The interaction of such cortical responses with the subcortical activity related to the temporal fine structure that we have uncovered here may further clarify the neural mechanisms that allow us to perceive complex musical stimuli in their entirety, while also allowing us to selectively focus on a particular instrument or melodic line.