Evoked Responses to Localized Sounds Suggest Linear Representation of Elevation in Human Auditory Cortex

The auditory system computes the position of a sound along each of the three spatial axes, azimuth, elevation, and distance, from very different acoustical cues. The extraction of sound azimuth from binaural cues (differences in arrival time and intensity between the ears) is well understood, as is the representation of these binaural cues in the auditory cortex of different species. Sound elevation is computed from monaural spectral cues arising from direction-dependent filtering by the pinnae, head, and upper body. The cortical representation of these cues in humans is still debated. We have shown that fMRI blood-oxygen-level-dependent activity in small parts of auditory cortex relates monotonically to perceived sound elevation and tracks listeners' internal adaptation to new spectral cues. Here we confirm the previously suggested cortical code with a different method that reflects neural activity rather than blood oxygenation (electroencephalography), show that elevation is represented relatively late in the cortex, with related activity peaking at about 400 ms after sound onset, and show that differences in sound elevation can be decoded from the electroencephalogram of listeners, particularly from those who can distinguish elevations well. We used an adaptation design to isolate elevation-specific brain responses from those to other features of the stimuli. These responses gradually increased with decreasing sound elevation, consistent with our previous fMRI findings and with a population rate code for sound elevation. The long latency as well as the topographical distribution of the elevation-specific brain response indicates the involvement of higher-level cognitive processes not present for binaural cue representation. The differences between brain responses to sounds at different elevations predicted the listeners' sound localization accuracy, suggesting that these responses reflect perceived elevation. This is, to our knowledge, the first study to demonstrate the cortical encoding of sound elevation in humans with high temporal resolution. Our results agree with previous findings from functional magnetic resonance imaging, providing strong support for the hypothesis that elevation is represented in a population rate code. This represents a critical advance in our understanding of spatial auditory processing along a dimension that is still poorly understood.


Introduction
headphones. The latter design has the advantage that the adapter is not affected by the directional ear filter and thus does not cause adaptation of elevation-sensitive neurons (Møller et al., 1995).

[…] (Møller et al., 1995). This allowed us to simultaneously present sounds from loudspeakers at different elevations in the array as well as non-spatial headphone sounds. We equalized the transfer functions of each loudspeaker by applying an inverse filter to the stimuli upon presentation. Two digital signal processors and six 8-channel amplifiers (models RX8.1 and SA8, Tucker-Davis Technologies, Alachua, FL, USA) drove the loudspeakers at a 50 kHz sampling rate. The processors' digital ports controlled the LEDs attached to each loudspeaker, obtained responses from a custom-built button box, and sent event triggers to the EEG. We used a 64-channel system (model BrainAmpMR, Brain Products, Gilching, Germany) to record EEG at a sampling rate of 500 Hz. The active silver/silver chloride electrodes were fixed on the subject's head with an elastic cap (Easycap, Germany) according to the international 10/20 system, with FCz as the reference electrode. Electrode impedance was kept below 2 kΩ. Two cameras (model Firefly S, Teledyne FLIR, OR, USA), positioned between loudspeakers, were used to monitor the subjects' head pose.

Software

We programmed the digital signal processors using Real-time Processor Visual Design Studio (Tucker-Davis Technologies, Alachua, FL, USA) and controlled the cameras using the Spinnaker SDK (Teledyne FLIR, OR, USA). The software of both devices provides an API that we integrated into a custom Python module that controlled the experimental apparatus (Bialas, 2022). We estimated subjects' head pose from the camera images. […] Stimuli were generated using the slab Python module for psychoacoustics (Schönwiesner and Bialas, 2021). We recorded EEG signals using Brain Vision Recorder (Brain Products, Gilching, Germany) and imported them into MNE-Python for analysis (Gramfort et al., 2013). A full list of the software environment can be found in the accompanying online repository.

[…] We interpolated bad channels using spherical splines. We then subtracted the average signal across all channels from each channel ("average reference"). Next, independent component analysis unmixed the signal, and an algorithm removed components corresponding to eye blinks based on their topographical correlation with a previously determined template (Viola et al., 2009; Plöchl, Ossandón, and König, 2012). Finally, an algorithm determined channel-specific rejection thresholds and repaired or removed epochs in which the thresholds were exceeded (Jas et al., 2017). Except for the initial selection of a reference for blink removal, the entire preprocessing pipeline was automated. Afterwards, the data were inspected by eye to assess the effect of preprocessing.
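For concreteness, the following is a minimal sketch of such a preprocessing pipeline in MNE-Python (Gramfort et al., 2013) with autoreject (Jas et al., 2017). It is not the authors' published code: the file name, the use of Fp1 as a blink proxy in place of the stored topographical template, and the epoch window are assumptions for illustration.

```python
import mne
from autoreject import AutoReject

# Load a raw Brain Vision recording (file name is hypothetical).
raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)
raw.set_montage("standard_1020")

# Interpolate bad channels with spherical splines, then re-reference
# to the average of all channels.
raw.interpolate_bads(reset_bads=True)
raw.set_eeg_reference("average")

# ICA-based blink removal. The paper matches component topographies to a
# previously determined blink template; here we approximate that step by
# correlating components with the frontal channel Fp1 (an assumption).
ica = mne.preprocessing.ICA(n_components=30, random_state=0)
ica.fit(raw.copy().filter(l_freq=1.0, h_freq=None))
blink_components, _ = ica.find_bads_eog(raw, ch_name="Fp1")
ica.exclude = blink_components
raw = ica.apply(raw)

# Epoch around probe onset and let autoreject derive channel-specific
# thresholds to repair or drop contaminated epochs.
events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=0.9, preload=True)
epochs = AutoReject(random_state=0).fit_transform(epochs)
```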

Subjects sat comfortably on a height-adjustable chair in an anechoic chamber. Target loudspeakers stood at a distance of 3.2 m at elevations of 37.5°, 12.5°, −12.5° and −37.5° with respect to the subject's interaural plane. Because perception of sound source elevation is slightly more accurate for lateral targets (Makous and Middlebrooks, 1990), all target speakers were positioned at an azimuth of 10° to the subject's right. Initially, we tested subjects' ability to localize sounds and familiarized them with the setup. To prevent subjects from explicitly learning the target speakers' positions, […] Subjects were instructed to keep their head and gaze aligned with the fixation cross at 0° azimuth and elevation. After completion of the test, we prepared the EEG electrodes. During recordings, each trial started with 600 ms of noise (adapter) played from the speakers at either 37.5° or −37.5°.

Then, a 150 ms burst of noise played from one of the other speakers, resulting in six different adapter-probe pairs. Adapter and probe had overlapping 5 ms on- and offset ramps so that the sound intensity remained constant during the transition. The adapter's initial position was chosen randomly and changed every 30 trials. The probe's location was chosen randomly, without direct repetitions of the same speaker. Every probe was followed by a 350 ms silent inter-stimulus interval. In five percent of all trials, the probe did not come from one of the target positions but from a random speaker within the frontal field. Subjects had to respond to these deviant trials by pressing a random button as fast as possible. If they managed to respond within one second after sound onset, the trial was considered a success. After one second, the trial was considered failed, and stimulation resumed. The experiment was divided into four blocks of 504 trials each and lasted 35 min in total. We instructed subjects to keep their head and gaze aligned with the fixation cross throughout the recording, but we did not check whether they complied.

The second experiment differed from the first in the duration of the adapter (1000 ms), probe (100 ms) and inter-stimulus interval (900 ms). Also, the adapter was played from headphones instead of one of the target speakers. In five percent of trials, subjects heard a tone after the inter-stimulus interval which prompted them to localize the probe they had just heard. Thus, they had to pay attention to each probe's location, since they did not know whether they would be asked to localize any given sound. The experiment was divided into six blocks of 240 trials each and lasted roughly 55 min in total.
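As an illustration, the following sketch generates one adapter-probe sequence of the first experiment with the slab module (Schönwiesner and Bialas, 2021). The noise type and the omitted speaker-routing step are assumptions; the durations, ramps, and randomization rules follow the description above.

```python
import random
import slab

slab.set_default_samplerate(50000)               # DSP sampling rate from the Methods

ADAPTER_ELEVATIONS = [37.5, -37.5]               # adapter speakers (degrees elevation)
TARGET_ELEVATIONS = [37.5, 12.5, -12.5, -37.5]   # probe speakers

# 600 ms adapter and 150 ms probe with 5 ms on- and offset ramps; in the
# experiment the ramps overlapped so intensity stayed constant at the transition.
adapter = slab.Sound.whitenoise(duration=0.600).ramp(when="both", duration=0.005)
probe = slab.Sound.whitenoise(duration=0.150).ramp(when="both", duration=0.005)

adapter_elevation = random.choice(ADAPTER_ELEVATIONS)  # re-drawn every 30 trials
previous_probe = None
for trial in range(30):
    # choose the probe speaker randomly, excluding the adapter position
    # and direct repetitions of the same speaker
    options = [e for e in TARGET_ELEVATIONS
               if e != adapter_elevation and e != previous_probe]
    probe_elevation = random.choice(options)
    previous_probe = probe_elevation
    # play the adapter from its speaker, cross-fade into the probe from its
    # speaker, then wait through the 350 ms silent inter-stimulus interval
    # (routing to the physical loudspeakers via the lab hardware is omitted)
```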

To quantify the accuracy of the subjects' perception of elevation, we used the elevation gain (EG), which is the slope of the regression between target and response elevation. Thus, the EG is 1 when responses match the targets perfectly and near 0 when responses are unrelated to the target elevation. […] In the latter, the EG for two subjects, which was slightly negative, was set to zero for subsequent analyses. […] If the permutation test returned more than one significant cluster for a subject, we ignored all but the largest. […] to examine how the ERP changed with elevation more closely.
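A minimal sketch of this computation; the data arrays below are made up for illustration, not the study's responses:

```python
import numpy as np

# Hypothetical localization data: presented elevations and a subject's responses.
target = np.array([37.5, 12.5, -12.5, -37.5] * 10)            # degrees
response = 0.8 * target + np.random.normal(0, 5, target.size)  # degrees

# Elevation gain = slope of the regression of response on target elevation.
eg, intercept = np.polyfit(target, response, deg=1)
eg = max(eg, 0.0)  # slightly negative EGs were set to zero, as described above
print(f"elevation gain: {eg:.2f}")
```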

The average ERP at Cz showed two deflections of opposite polarity, which increased in amplitude with the separation of adapter and probe (Fig. 1A). To quantify this trend, we computed the average absolute ERP amplitude in the time interval between 100 ms and 300 ms after probe onset for each adapter-probe combination and regressed it against the distance between adapter and probe. […]

The second experiment addressed the confound between adapter and probe position by using a non-spatial adapter. Consequently, the results offered clearer support for the population-rate code hypothesis. The permutation test, which compared responses to the different probes at 37.5°, 12.5°, −12.5° and −37.5° elevation, found significant clusters for most subjects between 200 ms and 800 ms, with a peak around 500 ms after probe onset (colored bar in Fig. 2A). The scalp distribution of F-scores averaged between 150 ms and 900 ms after probe onset (Fig. 2B) revealed that the difference between elevation-specific ERPs was largest at fronto-temporal electrodes, with a peak at FT10. Again, we chose this electrode for a closer inspection of elevation-specific changes in the ERP.
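The cluster-based permutation test could be run per subject roughly as follows, assuming the preprocessed `epochs` object from the pipeline above with probe-elevation labels; the condition names and the restriction to a single electrode are illustrative simplifications.

```python
import numpy as np
from mne.stats import permutation_cluster_test

conditions = ["37.5", "12.5", "-12.5", "-37.5"]   # hypothetical event labels
# one (n_epochs, n_times) array per elevation, here at electrode FT10 only
data = [epochs[c].get_data(picks="FT10")[:, 0, :] for c in conditions]

# F-test across the four elevations, clustered over time; tail=1 because
# the F-statistic is one-sided
F_obs, clusters, cluster_pv, _ = permutation_cluster_test(
    data, n_permutations=1000, tail=1, seed=0)

significant = [c for c, p in zip(clusters, cluster_pv) if p < 0.05]
# if more than one cluster is significant, keep only the largest (see Methods)
```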

The average ERP at FT10 exhibited a sustained deflection that was strongly modulated in amplitude by the probe's elevation (colored lines in Fig. 2A). The amplitude was largest when the probe was located at −37.5° and smallest when it was located at 37.5°. Linear regression found that the relationship between the probe's elevation and the average response amplitude in the time interval from 150 ms to 900 ms after onset was highly significant (β = −0.66, p < 0.001). […] Figure 3B being empty, meaning that there were no subjects whose brain responses were decodable but who failed at the task. There were, however, several subjects for whom decoding failed but who still performed the task accurately.

We also regressed the average decoding accuracy against the EG during the initial localization test to investigate whether the relationship between decoding and performance was specific to […] (Fig. 4B&C). These additional components were unaffected by elevation. Together, the three components accounted for 88 % of the variance in the average evoked response.
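The time-resolved decoding could be implemented along the following lines with MNE's decoding tools; the classifier choice and cross-validation settings are assumptions, and `epochs` is the labeled, preprocessed data from the earlier steps.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from mne.decoding import SlidingEstimator, cross_val_multiscore

X = epochs.get_data()          # shape: (n_epochs, n_channels, n_times)
y = epochs.events[:, 2]        # probe-elevation condition codes

# fit one linear classifier per time sample and cross-validate its accuracy
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
decoder = SlidingEstimator(clf, scoring="accuracy")
scores = cross_val_multiscore(decoder, X, y, cv=5).mean(axis=0)

peak = epochs.times[np.argmax(scores)]   # decoding peaked near 400 ms in the study
```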

Evoked responses encode sound elevation

Both experiments demonstrate that the cortical processing of sound elevation can be assessed with EEG. Using fMRI, we previously identified voxels in auditory cortex tuned to sound elevation (Trapeau and Schönwiesner, 2018), but the lower spatial resolution of EEG would not have allowed us to isolate the responses from these small patches of auditory cortex. Indeed, a previous EEG study attempted to decode sound elevation from EEG responses and found that decoding accuracy only just exceeded chance level, and only for some of the listeners (Bednar, Boland, and Lalor, 2017). The adaptation design helped to circumvent this issue by separating neural activity related to sound onset and elevation in time. In the second experiment, we showed that changes in sound elevation evoke distinct cortical responses that can be decoded accurately, and that decoding accuracy predicts individual localization performance. Decoding accuracy for all pair-wise elevation comparisons followed the same time course, with a single peak at around 400 ms after the elevation information became available. The time course, and the fact that accuracy was predictive of task performance, suggest that we decoded a physiological process distinctly related to elevation rather than processes associated with acoustical features that co-vary with, but are not causally related to, sound elevation. Elevation-specific activity was well captured by a single principal component, suggesting that the observed encoding of elevation reflects a distinct physiological process.

A monotonic population-rate code for elevation

We hypothesized that the auditory cortex represents sound elevation in a population response that decreases monotonically with increasing elevation. Both experiments presented here offer support […] and Vilis, 2005). The perceptual persistence of a visual stimulus is inversely related to its duration and intensity (Coltheart, 1980). Thus, perceptual persistence could provide a mechanism of evidence accumulation for perceptual decision making under difficult conditions. However, the present study was not designed to investigate the effects of task difficulty, and further research is required to answer these questions.

The cortical representation of sound direction

It is thought that the auditory cortex represents sound azimuth as the difference between the rates of activity in two opponent neural channels, each tuned to the contralateral hemifield, so that […] is not related to the absolute magnitude but rather to the change in the neural response. Thus, placing the tuning curve's peak outside the physiological range places the slope in the center, where azimuthal localization is most accurate (Harper and McAlpine, 2004). Similarly, the fact that sounds at low elevations cause larger responses than sounds at high elevations does not mean that they are localized more accurately. Instead, the steady slope of the population response suggests that accuracy remains constant across elevations, which agrees well with localization behavior (Wightman and Kistler, 1989b; Middlebrooks and Green, 1991). There is no a priori reason why the maximum of the tuning is at low rather than high elevations. While it is tempting to argue that this may be […]
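To make the opponent-channel logic concrete, here is a toy model (our illustration, not taken from the paper or from Harper and McAlpine, 2004): two hemifield-tuned channels whose sigmoid tuning saturates outside the physiological range, so that the steepest slope of their difference, and hence the finest discrimination, falls at the midline.

```python
import numpy as np

az = np.linspace(-90, 90, 181)               # sound azimuth in degrees

# two opponent channels, each tuned to one hemifield; the sigmoids saturate
# beyond +/-90 deg, i.e., the tuning peaks lie outside the physiological range
right_channel = 1 / (1 + np.exp(-az / 30))
left_channel = 1 / (1 + np.exp(az / 30))

code = right_channel - left_channel          # opponent-channel readout of azimuth
slope = np.gradient(code, az)                # steepest at 0 deg: best acuity at midline
print(f"slope at midline: {slope[90]:.4f}, at +/-90 deg: {slope[0]:.4f}")
```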