Summary
Perceptual constancy describes the ability to represent objects in the world across variation in sensory input, such as recognizing a person from different angles or a spoken word across talkers. This ability requires neural representations that are sensitive to some aspects of a stimulus (such as the spectral envelope of a sound) while tolerant to other variations (such as periodicity). In hearing, such representations have been observed in auditory cortex, but never in combination with behavioural testing, which is essential in order to link neural codes to perceptual constancy. By testing ferrets in a vowel discrimination task performed across multiple stimulus dimensions, and recording neuronal activity in auditory cortex, we directly correlate neural tolerance with perceptual constancy. Subjects reported vowel identity across variations in fundamental frequency, sound location, and sound level, but failed to consistently generalize across voicing, from voiced to whispered sounds. We decoded the responses of simultaneously recorded units in auditory cortex to identify units informative about vowel identity across each of these task-orthogonal variations in acoustic input. Significant proportions of units were vowel informative across each condition, although fewer units were informative about vowel identity across voicing. For about half of vowel informative units, information about vowel identity was conserved across multiple orthogonal variables. The time of best decoding was also used to identify the relative timing and temporal multiplexing of sound features. Our results show that neural tolerance can be observed within single units in auditory cortex in animals demonstrating perceptual constancy.
Introduction
Sensory systems must represent stimulus identity with tolerance for variation in physical input; in vision, an object must be recognized from multiple angles, lighting conditions and other identity-preserving transformations (DiCarlo and Cox 2007, DiCarlo, Zoccolan et al. 2012), while in hearing, sounds such as individual words or phonemes can be recognized across talkers, voice pitch, background noise and other acoustic transformations (Sharpee, Atencio et al. 2011).
The ability to recognize objects across variations in sensory input is known as perceptual constancy (or perceptual invariance) and is observed for both visual and auditory stimuli (Logothetis and Sheinberg 1996, Griffiths and Warren 2004, Bizley and Cohen 2013). For sounds, abstraction of tolerant representations is a key step in auditory object formation (Griffiths and Warren 2004), and humans and other animals can recognize features of sounds such as loudness across variation in location (Zahorik and Wightman 2001). Similarly, vocalizations including speech sounds can be recognized across talkers (Kojima and Kiritani 1989, Ohms, Gill et al. 2010), vocal tract length (Ghazanfar, Turesson et al. 2007, Schebesch, Lingner et al. 2010) and fundamental frequency (a determinant of perceived pitch) (Bizley, Walker et al. 2013, Town, Atilgan et al. 2015). Perceptual constancy is thus a general property of hearing across species and key in the analysis of auditory scenes (Bregman 1990, Bizley and Cohen 2013).
The importance of perceptual constancy has led to the study of its underlying neural mechanisms. Tolerant (or invariant) representations of a range of sound types, including animal vocalizations (Billimoria, Kraus et al. 2008, Meliza and Margoliash 2012, Carruthers, Laplagne et al. 2015) and synthetic stimuli such as pure tones (Sadagopan and Wang 2008) and pulse trains (Bendor and Wang 2007), emerge in and across auditory cortex. Likewise, tolerance to background noise increases from midbrain to cortex (Rabinowitz, Willmore et al. 2013), where auditory cortical neurons can represent complex sounds including speech across noisy environments (Schneider and Woolley 2013, Mesgarani, David et al. 2014). However, while there are many examples of neural tolerance or invariance to specific sound features, it is unclear to what extent these are directly relevant for behavior, as few examples of such phenomena have been shown in subjects actively demonstrating perceptual constancy.
To study the tolerance of neural representations during perceptual constancy, we recorded the activity of auditory cortical neurons in ferrets performing a vowel discrimination task in which synthesized vowels were varied along a number of identity-preserving, task-orthogonal transformations. Auditory cortical neurons represent multiple features of vowels including identity and dimensions such as fundamental frequency and virtual location that can be varied independently (Bizley, Walker et al. 2009, Walker, Bizley et al. 2011). Here we asked if, and how, neurons could represent vowel identity across fundamental frequency, real-world location, sound level and voicing, and how auditory encoding related to perceptual constancy during behavior.
Results
Perceptual constancy during vowel discrimination
Ferrets discriminated vowel identity in a two-alternative forced choice task (Fig. 1A) in which vowels were synthesized with varying fundamental frequency (F0) or voicing, or presented at varying levels or from different locations. Variation in these task-orthogonal variables produced a variety of different spectra while preserving the formants in the spectral envelope critical for vowel identification (Fig. 1B) (Peterson and Barney 1952, Town and Bizley 2013). On each trial, the animal nose-poked at a central port to trigger stimulus presentation after a variable delay. The stimulus consisted of two tokens of the same vowel, each lasting 250 ms and separated by an interval of 250 ms. Animals responded at either the left or right port based on vowel identity and were rewarded with water for correct responses, while errors resulted in a brief timeout (1-5 s).
Ferrets were able to discriminate vowels accurately across task-orthogonal dimensions: performance was consistent across variation in F0 (Fig. 1C) and sound source location (Fig. 1D) (no effect of either factor; logistic regression, p > 0.05, Table 1), and performance across all F0s and all sound locations was significantly better than chance for each subject (binomial test vs. 50%, p < 0.001, Table 2). For sound level, performance increased significantly with level in three of four ferrets (p < 0.01; Table 1), with discrimination saturating at approximately 90% correct (Fig. 1E). At all sound levels, performance was better than chance (Table 2). We also presented whispered vowels on 20% of trials as probe stimuli; however, discrimination of whispered vowels was significantly worse than for voiced vowels (Fig. 1F) and only two ferrets discriminated whispered stimuli better than chance (F1201: p < 0.001, and F1203: p < 0.001, Table 2). This suggests either that ferrets are poor at generalizing across voicing or that feedback, which was absent on probe trials, was required to maintain high levels of discrimination. Together these results show that ferrets can perceive a constant vowel identity across variations in acoustic input related to fundamental frequency, sound location and sound level.
(A) Task design in which ferrets discriminated vowel sounds. (B) Spectra for 13 examples of a single vowel [u] with varying F0, location, sound level and voicing. Spectra for sounds across location were generated in virtual acoustic space (Schnupp, Booth et al. 2003). (C-F) Behavioral performance when discriminating vowels across F0 (C), location (D), level (E) and voicing (F). Colors indicate individual subjects.
Results of logistic regressions comparing performance across task-orthogonal variables.
Comparison of observed vowel discrimination against chance performance (50%); data shown as fraction of trials correct and probability of observed performance (binomial test).
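As an illustration of the behavioral statistics reported above, the following MATLAB sketch (Statistics Toolbox) shows a binomial test of overall performance against 50% chance and a logistic regression testing for an effect of a task-orthogonal variable (here F0) on the probability of a correct response. The trial data and variable names are placeholders, not the original analysis code.

```matlab
% Hypothetical trial-level data: isCorrect (1 = correct) and the F0 presented
% on each trial. The real analyses used the ferrets' behavioral records.
isCorrect = rand(500, 1) > 0.25;                              % placeholder outcomes
F0 = randsample([149 200 263 330 459], 500, true)';           % placeholder F0s (Hz)

% (1) Two-sided binomial test of performance against chance (p = 0.5)
nCorrect = sum(isCorrect);
nTrials  = numel(isCorrect);
pUpper   = 1 - binocdf(nCorrect - 1, nTrials, 0.5);           % P(X >= nCorrect)
pLower   = binocdf(nCorrect, nTrials, 0.5);                   % P(X <= nCorrect)
pBinom   = min(1, 2 * min(pUpper, pLower));

% (2) Logistic regression: does F0 predict the probability of a correct response?
mdl = fitglm(F0, isCorrect, 'Distribution', 'binomial');
pF0 = mdl.Coefficients.pValue(2);                             % p-value for the F0 slope

fprintf('Binomial test vs 50%%: p = %.3g; effect of F0: p = %.3g\n', pBinom, pF0);
```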
Decoding vowel identity
We implanted moveable microelectrodes in left and right auditory cortex to record multiunit (n = 469 units) and LFP activity from sound-responsive sites during task performance (Fig. S1). For each unit, we measured responses to vowels across F0, sound location, level and voicing (Fig. 2A) and quantified the information available about vowel identity across each task-orthogonal variable. To measure information, we decoded stimulus parameters (e.g. vowel identity, F0, etc.) from the responses of individual units using spike distance based classification (Fig. S2A). The time window of the response that was decoded was variable, and we searched for the parameters (start time and window duration) that gave best performance (Fig. S2B). We initially decoded responses only from correct trials, as we reasoned these would provide the best demonstration of auditory cortical encoding. For each unit, we assessed whether decoding performance was significantly better than chance (permutation test, p < 0.05) and characterized both the number of significantly informative units and overall decoder performance across auditory cortex (Fig. 2B and 2C).
We first tested the hypothesis that neural units should be capable of encoding vowel identity across the same stimulus variation that the animals tolerated behaviorally. Consistent with this, when classified by statistical significance, the proportion of vowel informative units was highest across the dimensions over which behavioral performance was most constant: 154 / 366 units (42.1%) were informative about vowel identity across fundamental frequency, 50 / 122 (41.0%) across sound location and 80 / 197 (40.6%) across sound level, whereas only 63 / 207 (30.4%) of units were informative about vowel identity across voicing (Fig. 2C). Across all sound-responsive units (regardless of informative classification), we correctly decoded vowel identity across five fundamental frequencies on 66.6 ± 0.33% of trials (mean ± s.e.m., 366 units), across two sound locations on 79.5 ± 0.62% of trials (115 units), across five sound levels on 67.5 ± 0.46% of trials (197 units) and across two voicing conditions on 76.9 ± 0.55% of trials (207 units) (Fig. 2B). When we considered vowel decoding over only the two most extreme fundamental frequencies or sound levels tested (Fig. S3A and S3B), decoder performance was comparable to that across location or voicing (mean ± s.e.m. performance across F0 = 73.0 ± 0.40%; across sound level = 75.7 ± 0.86%). The pattern of raw decoder performance therefore reflected its sensitivity to the number of orthogonal levels (i.e. number of F0s or sound locations) over which vowels were varied; once this was taken into account, decoder performance was equivalent across all the stimulus dimensions tested.
(A) Raster and peri-stimulus time histograms (PSTHs) from an example unit in response to vowels with different fundamental frequency, sound location, sound level and voicing. Data are shown for the first vowel presented during the trial; PSTHs show mean ± s.e.m. firing rate. (B) Decoding performance across all units when reconstructing vowel identity and task-orthogonal values from single trial responses of individual units. (C) Number of units informative about vowel identity and / or task-orthogonal values when considering responses across all stimuli. (D) Population decoding of vowel identity and task-orthogonal values. Individual data points show populations with different member units; full line shows curve fit for a two-term power series model; dotted lines indicate the number of units required to reach 100% decoding performance. The top row shows vowel decoding across (left-right) F0, space, level and voicing; the bottom row shows decoding performance in the orthogonal dimension. (E) Number of vowel informative units and decoder performance at each task-orthogonal value.
The ability to generalize across a particular stimulus dimension (e.g. pitch) does not mean that changes along that dimension cannot also be detected. Accordingly, in addition to vowel identity, auditory cortical units were also informative about task-orthogonal variables across vowels. We found 78 / 366 units (21.3%) were informative about fundamental frequency (mean ± s.e.m. decoding performance = 32.9 ± 0.25%), 44 / 115 units (38.3%) about sound location (77.4 ± 0.74%), 49 / 197 units (24.9%) about sound level (34.3 ± 0.38%) and 97 / 207 units (46.9%) about voicing (77.3 ± 0.58%) (Fig. 2D). The proportions of units informative about F0 and sound level were smaller than for sound location or voicing, in part because of the greater number of decoded classes (five vs. two respectively). However, when we decoded F0 or sound level across only the most extreme values tested (Fig. S3A and S3B), the proportion of orthogonal informative units increased (F0: 100 / 350 or 28.6% of units informative; level: 33 / 65 or 50.8%). We also noted that the proportion of units informative about sound level but not F0 was associated with the interval between sound levels (Fig. S3C-D). For sound level we tested units across two stimulus sets, ranging either from 64.5 to 82.5 dB SPL in 4.5 dB steps, or from 45 to 75 dB SPL in 7.5 dB steps. While we found units informative about vowel identity and / or sound level for both stimulus sets, more units were informative about sound level (alone or together with vowel identity) when stimuli were varied from 45 to 75 dB SPL (Fig. S4A).
To link neural encoding more directly to perceptual constancy, we next asked how decoding performance compared with the animals' behavioural ability. Since our decoder considered only correct trials, 100% decoding performance would indicate that a unit's information content matched the subject's behavior. While some units came close to this level (>90%), no individual unit reached this criterion. Nevertheless, by decoding responses from small populations of units (Fig. 2D) it was possible to decode vowel identity with 100% performance across F0 (23 units), sound location (22 units), sound level (28 units) and voicing (24 units). Similarly, we could also decode task-orthogonal variables with high accuracy using comparable population sizes for sound location (20 units) and voicing (21 units), although more units were required to decode F0 (129 units) and sound level (56 units). Thus, across small populations of auditory cortical units, neural responses could be decoded with performance matching the animals' behavior and could represent multiple stimulus dimensions near-perfectly.
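The population analysis in Fig. 2D was described as a two-term power series fit, with the number of units needed to reach 100% decoding read off the fitted curve. A minimal MATLAB sketch of that step (Curve Fitting Toolbox), using placeholder data rather than the recorded populations, might look as follows.

```matlab
% Placeholder data: decoder performance (% correct) for randomly sampled
% populations of 1-50 units; real values came from the population decoder.
popSize = repmat((1:50)', 20, 1);
perf    = 105 - 40 ./ sqrt(popSize) + randn(size(popSize));

% Fit a two-term power series, perf = a*popSize^b + c, as in Fig. 2D
f = fit(popSize, perf, 'power2');

% Extrapolate and find the smallest population size reaching 100% performance
xq = (1:200)';
yq = f(xq);
nUnitsFor100 = find(yq >= 100, 1);    % may be empty if the fit saturates below 100%
```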
Finally, we asked how units encoded vowel identity across individual task-orthogonal values: if the responses of neural units underlie the animals' perceptual constancy, we would expect similar levels of decoder performance across the variations in acoustic structure over which the animals displayed invariance. We found that, as with behavior (Fig. 1C-F), the proportion of vowel informative units did not differ significantly with fundamental frequency (logistic regression, χ2 = 0.035, p = 0.852) or sound location (χ2 = 1.85, p = 0.174). Furthermore, we found no effect of either sound level (χ2 = 0.416, p = 0.519) or voicing (χ2 = 0.002, p = 0.964) on the proportion of vowel informative units, suggesting that neural encoding of vowel identity was robust across all dimensions tested.
Conserved information content
We recorded units with vowel informative responses across task-orthogonal dimensions and found that the proportion of vowel informative units was constant across different task-orthogonal values. This could emerge from the presence of "invariant" neural units that are tolerant to changes within and across task-orthogonal dimensions, or, alternatively, tolerance to task-orthogonal values might only emerge across populations of units. To test these hypotheses, we asked whether the same units were informative about vowel identity across all orthogonal dimensions. We considered this question in two stages: first, within a single task-orthogonal dimension (e.g. F0), did the same units encode vowel identity across the different values tested? Second, did the same units provide tolerant information across different task-orthogonal dimensions? We addressed the first question in two ways by (a) comparing vowel decoding performance across all task-orthogonal values, and (b) predicting a unit's classification as informative about vowel identity at a specific task-orthogonal value (e.g. F0 = 200 Hz) from that unit's classification at all other values tested (e.g. F0 = 149, 263, 330 and 459 Hz) using multivariate logistic regression modelling. For the logistic regressions, we reasoned that units with conserved function would be informative about vowel identity at multiple orthogonal values, so that a unit's classification for a subset of values should allow us to predict its classification on a held-out value. In contrast, if function was not conserved, the classification at each orthogonal value would vary randomly and we should not be able to predict classification better than chance for held-out data.
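A minimal MATLAB sketch of this held-out-value prediction is shown below: a logistic regression predicting whether a unit is vowel informative at one F0 from its classification at the remaining F0s, assessed with a deviance (likelihood-ratio) test against a constant-only model. The classification matrix here is a placeholder; the real classifications came from the permutation test on decoder performance.

```matlab
% Placeholder: units x F0-values logical matrix of vowel-informative classifications
nUnits = 366;  nF0 = 5;
isInformative = rand(nUnits, nF0) < 0.4;

target = 1;                                  % held-out F0 (column) to predict
others = setdiff(1:nF0, target);

% Multivariate logistic regression: classification at the held-out F0
% predicted from classification at all other F0s
mdl = fitglm(double(isInformative(:, others)), isInformative(:, target), ...
             'Distribution', 'binomial');

% Deviance test of the full model against a constant-only model;
% a small p-value indicates that classification is conserved across F0s
dt = devianceTest(mdl);
pConserved = dt.pValue(2);
```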
Vowel decoding at different F0s was positively correlated (Fig. 3A, mean ± s.e.m. R2 = 0.359 ± 0.0179), indicating that units informative about vowel identity at one F0 were also informative at other F0s. In support of this, unit classification (vowel informative / uninformative) at each F0 could be predicted from unit classification at all other F0s (Table 3, p ≤ 0.01), though the weighting of each predictor F0 varied (Fig. S4). Similarly, vowel decoding was also correlated at different sound levels (Fig. 3C and Fig. S5, mean ± s.e.m. R2 = 0.290 ± 0.02), although unit classification at individual sound levels was not predicted from classification at the other sound levels tested (p > 0.05). Across space, vowel decoding performance was not correlated for stimuli presented at left and right speakers (Fig. 3B; R2 = 0.0353) and unit classification at one location was not predictive of classification at the other location (χ2 = 0.072, p = 0.788). Vowel decoding performance was weakly correlated for voiced and voiceless stimuli (Fig. 3D, R2 = 0.063) and unit classification on voiced trials was not predictive of classification on voiceless trials (χ2 = 1.28, p = 0.258). Together this suggests that information about vowel identity at different F0s and, to a lesser extent, sound levels was conserved across units (i.e. the same units could signal vowel identity under different conditions within a stimulus dimension), whereas information about vowel identity at different locations or in different voicing conditions was provided by different units.
Logistic regression results for determining if a unit was vowel informative at a particular F0 value, predicted from unit classification at all other fundamental frequencies.
(A-D) Paired comparisons of decoding performance for vowel identity at different fundamental frequencies (A), sound locations (B), sound levels (C) or voicing conditions (D). (E) Number of units classified as vowel informative across multiple task-orthogonal dimensions. (F) As E but for units classified as being informative about multiple task-orthogonal dimensions.
We next asked if vowel decoding was conserved across stimulus dimensions by counting the number of units that were vowel informative across each of sound location, level, voicing and fundamental frequency. While not every unit was tested across every task-orthogonal dimension, we observed that the dimensions over which behavioral performance was constant (F0, sound location and level) were also those over which the highest proportions of units remained vowel informative (Fig. 3E). Across fundamental frequency and sound level, we recorded 56 / 155 units (36.1%) that were vowel informative across both dimensions. Similar proportions of conserved units were also identified across fundamental frequency and location (33 / 100; 33%), sound level and location (21 / 60 units; 35%) and the combination of F0, sound level and location (19 / 57; 33.3%). Notably fewer conserved units were observed for other combinations of task-orthogonal factors (<25%), all of which included voicing (across which animals generalized poorly). This suggests that the same auditory cortical units represent vowel identity across (as well as within) task-orthogonal variations that the animals also perceive as having a constant identity.
If the same neurons are informative about vowel identity across different task-orthogonal variables, it should be possible to predict a unit's classification (vowel informative / uninformative) in one condition (e.g. vowel discrimination across F0) based on its classification in other conditions (e.g. vowel discrimination across level). Using logistic regression models, we confirmed this was true for fundamental frequency and sound level, F0 and sound location, as well as F0 and voicing, and sound level and voicing (Table 4). Models in which unit classification across the target dimension (e.g. F0) was predicted by classification in more than one other dimension (e.g. sound level and location) were also significant (Table 4). Thus our statistical analysis confirmed that information about vowel identity was conserved across units when considering multiple task-orthogonal variables, consistent with the idea that tolerant information was a property of single units and not only of neural ensembles.
Logistic regression results for predicting (1) if a unit would be vowel informative across one task-orthogonal dimension (Response) based on unit classification across other task-orthogonal variables (Predictors), or (2) if a unit was informative about one task-orthogonal dimension based on its classification as informative about other task-orthogonal dimensions. For single predictor models, the association between response and predictor was bidirectional – i.e. if classification across space predicted classification across F0, then classification across F0 also predicted classification across space.
We also asked whether units that were informative about one task-orthogonal variable (e.g. sound location) were also informative about other task-orthogonal variables (e.g. voicing). While we did observe small populations of units that were informative about multiple dimensions (Fig. 3F) such populations were smaller than the populations that were vowel informative across dimensions (sign-rank test, p = 0.0029). Furthermore, we found no combination of task-orthogonal dimensions in which unit classification was predictive (Table 4). This suggests that while information about vowel identity is conserved across task orthogonal dimensions, at least during the performance of a vowel identification task, few units are informative about multiple task-orthogonal dimensions.
Temporal multiplexing of sound features
We next turned our attention to the timing of information about auditory features. Responses of auditory cortical neurons in anesthetized ferrets contain mutual information about vowel identity, F0 and virtual location in distinct time windows (temporal multiplexing) (Walker, Bizley et al. 2011). Here we asked whether the same was true of neural responses recorded in animals actively discriminating sounds, using decoding rather than information theoretic measures (which together provide complementary approaches to understanding neural encoding (Quian Quiroga and Panzeri 2009)). We also extended the study of temporal multiplexing to sound level and voicing.
To investigate temporal multiplexing, we compared the time windows after trial onset that gave best decoder performance (Fig. S2B). A time window was defined by its start time and duration, and we summarized these two parameters by calculating the midpoint of the window (i.e. start time + duration / 2). For each unit, we measured the center time for decoding of vowel identity across each task-orthogonal dimension, and for decoding task-orthogonal values across vowels. The cumulative distribution functions (CDFs) of center times across units were then compared when decoding vowel identity and task-orthogonal values (Fig. 4).
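A brief MATLAB sketch of this timing comparison, using placeholder window parameters: each unit's best decoding window is summarized by its midpoint, and midpoints for vowel versus task-orthogonal decoding are compared with a paired sign-rank test (dual-feature units) or a rank-sum test (single-feature units), with empirical CDFs computed for plotting as in Fig. 4.

```matlab
% Placeholder best-window parameters (s) for vowel and task-orthogonal decoding
startVowel = 0.05 + 0.1*rand(60,1);  durVowel = 0.05 + 0.2*rand(60,1);
startOrth  = 0.10 + 0.1*rand(60,1);  durOrth  = 0.05 + 0.2*rand(60,1);

midVowel = startVowel + durVowel/2;          % center time of best vowel window
midOrth  = startOrth  + durOrth/2;           % center time of best orthogonal window

% Paired comparison within units informative about both features
[pPaired, ~, statsPaired] = signrank(midVowel, midOrth, 'method', 'approximate');

% Unpaired comparison (e.g. between different single-feature units)
[pUnpaired, ~, statsUnpaired] = ranksum(midVowel, midOrth);

% Empirical CDFs of center times, as plotted in Fig. 4
[fVowel, xVowel] = ecdf(midVowel);
[fOrth,  xOrth ] = ecdf(midOrth);
```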
We found that information about vowel identity arose earlier than information about fundamental frequency (Fig. 4A) in units that were informative about both F0 and vowel identity (sign-rank test: z = -2.43, p = 0.0150) and units that were informative about F0 or vowel identity (rank-sum test: z = -2.31, p = 0.0206), but not in uninformative units (sign-rank test: z = -0.0734, p = 0.942). Similarly, information about sound location was best decoded earlier than vowel identity in units informative about both dimensions (Fig. 4B, z = 2.26, p = 0.0240) but not in units informative about either dimension alone (z = 1.29, p = 0.198) or uninformative units (z = 0.722, p = 0.470). Information about vowel identity was best decoded before sound level (Fig. 4C) in units that were informative about both features (z = -2.13, p = 0.0333) but not in units informative about individual features (z = -0.933, p = 0.351) or uninformative units (z = 0.471, p = 0.638). Information about voicing was best decoded earlier than vowel identity (Fig. 4D), but in contrast to the other comparisons, this was only significant in uninformative units (z = 2.40, p = 0.0164) and units informative about vowel identity or voicing (z = 2.79, p = 0.005), not in units informative about both features (z = 1.33, p = 0.184). Together these results suggest that units encoding multiple sound features use temporal multiplexing to represent these features at different times. Notably, multiplexing in units sensitive to several stimulus features was only observed when we varied F0, sound level or location - features across which animals also perceived a constant identity.
(A-D) Cumulative distributions showing center times (start time + duration / 2) for best performance when decoding vowel identity or task-orthogonal variables (A: F0, B: location, C: level and D: voicing). Units are shown separately by classification as informative about both vowel identity and task-orthogonal values (dual feature units), either vowel identity or task-orthogonal values (single feature units) or neither feature (uninformative). (E) CDFs for decoding vowel identity across each task-orthogonal variable. (F) CDFs for decoding task-orthogonal values across vowels.
Our comparison between vowel and task-orthogonal decoding revealed differences in the timing of best decoding. However, we also asked whether vowel identity was always best decoded with a fixed temporal profile or whether best decoding of vowel identity differed between task-orthogonal variables. For units that only encoded vowel identity, the timing of vowel information differed significantly across task-orthogonal variables (Fig. 4E) (Kruskal-Wallis one-way ANOVA: χ2 = 19.95, p = 1.74 x 10−4). Post-hoc comparisons confirmed that vowel identity was best decoded across voicing significantly later than across F0 (Tukey-Kramer correction for pairwise comparisons, p = 0.001), sound location (p = 0.0015) and sound level (p = 0.013). Thus for these units, information about vowel identity was not fixed in time and could vary depending on the orthogonal features of sounds. We found no differences in the timing of vowel identity information between sound level, location or F0, and no significant differences between task-orthogonal variables for units informative about both vowel identity and task-orthogonal values (χ2 = 5.76, p = 0.124) or uninformative units (χ2 = 0.677, p = 0.879).
We also compared the timing of information about task-orthogonal variables decoded across vowels. Here we found significant differences in the timing of task-orthogonal information for units that were informative only about task-orthogonal values (χ2 = 11.04, p = 0.0206) and units informative about both task-orthogonal values and vowel identity (χ2 = 9.77, p = 0.0115). Post-hoc comparisons for units informative only about task-orthogonal values showed that sound location was best decoded significantly before F0 (p = 0.0255) and sound level (p = 0.0283). For units informative about both task-orthogonal values and vowel identity, post-hoc comparisons showed that F0 was best decoded significantly later than information about space (p = 0.0199) and voicing (p = 0.0468). No further post-hoc comparisons were significant, and timing across uninformative units was not significantly different (χ2 = 7.033, p = 0.0708). These results indicate that task-orthogonal variables are encoded at different times in neural responses to sound and support the suggestion that temporal multiplexing is a general phenomenon that may be observed beyond the vowels studied here, for example in stimuli varying only in level or location (REFERENCES).
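The across-dimension comparisons above used a Kruskal-Wallis test followed by Tukey-Kramer-corrected pairwise comparisons; a MATLAB sketch of that procedure with placeholder center times is shown below.

```matlab
% Placeholder center times (s) grouped by task-orthogonal dimension
centerTimes = [0.10 + 0.1*rand(40,1); 0.10 + 0.1*rand(30,1); ...
               0.15 + 0.1*rand(35,1); 0.20 + 0.1*rand(25,1)];
group = [repmat({'F0'},      40, 1); repmat({'location'}, 30, 1); ...
         repmat({'level'},   35, 1); repmat({'voicing'},  25, 1)];

% Omnibus Kruskal-Wallis test across dimensions (display suppressed)
[pKW, ~, stats] = kruskalwallis(centerTimes, group, 'off');

% Tukey-Kramer corrected pairwise post-hoc comparisons;
% columns 1-2 of comp index the compared groups, column 6 holds the p-value
comp = multcompare(stats, 'CType', 'tukey-kramer', 'Display', 'off');
```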
Discussion
Here we demonstrated that auditory cortical neurons can reliably represent vowel sounds across a range of acoustic transformations that preserve sound identity in perception. By recording neurons in animals engaged in a task that demonstrates perceptual constancy, our results provide a direct comparison of neural encoding and behavior.
Brain and behavior
A central aim of this study was to directly test the tolerance (or invariance) of neural coding during perceptual constancy. We confirmed that vowel discrimination performance was indeed consistent across fundamental frequency, location and a range of sound levels during recording, although animals were unable to generalize performance to whispered stimuli (which may be because these were presented as probe trials). Across all task-orthogonal dimensions, we could decode vowel identity better than chance for many units and match behavioral performance by decoding over populations of units. When comparing behavior and neural decoding, it was notable that the proportion of units informative about vowel identity across voicing was lower (30.4%) than for any other task-orthogonal dimension (40-42% of units). This correlation suggests that processing by auditory cortical neurons reflects behaviourally relevant information and that such neurons are less able to represent vowels across voicing conditions, leading to impaired generalization of vowel identity in the task. This is further supported by the timing at which we best decoded vowel identity across voicing conditions, which was significantly later than for any other orthogonal variable in units that were informative about vowel identity (Fig. 4E).
Why did auditory cortical units fail to generalize across voicing as well as across other dimensions? One explanation may relate to the energy content of whispered and voiced stimuli; in contrast to voiced vowels, which concentrate energy at harmonic frequencies, whispered sounds contain energy distributed across frequency (Fig. 1B). Versnel and Shamma (1998) reported similar responses in primary auditory cortex (A1) to voiced and unvoiced (whispered) vowels presented with durations of 90-140 ms. We observed that for units such as that shown in Fig. 2A, the onset response to voiced and whispered vowels was indeed similar. However, during the sustained period of our stimuli (100-250 ms) responses differed, suggesting that the broadband versus harmonic spectra of unvoiced versus voiced vowels may be reflected in the sustained firing of the unit. This suggestion is consistent with the relatively late time of best decoding for voicing information compared with other orthogonal features (Fig. 4F).
Temporal multiplexing
We found that, for units informative about both vowel identity and task-orthogonal values, the two features were best decoded using response activity in consistently distinct time windows. Given that we decoded vowel identity and task-orthogonal variables independently and without a priori bias toward particular times in the neural response after stimulus onset, it is most likely that multiplexing reflects properties of the units rather than of the decoder. Our finding that best decoding of vowel identity and sound location arose earlier than for voicing or F0 is consistent with reports that perception of sound location and vowel identity are associated with the onset of sounds (Litovsky, Colburn et al. 1999, Stecker and Hafter 2002), whereas fundamental frequency requires time for listeners to estimate (Gray 1942, Mckeown and Patterson 1995, Walker, Bizley et al. 2011). These differences in timing may be intuitive given that multiple cycles of a vowel are required to estimate F0 from the amplitude waveform arriving at the ear. However, we also observed several previously undemonstrated differences in the timing of auditory information, including a rapid rise in decoded information about voicing and a longer time to best decoding for sound level than for vowel identity, sound location or voicing. The time-course of sound level decoding may reflect the temporal integration of the auditory system (Buus, Florentine et al. 1997, Glasberg and Moore 2002). Thus the neural organization of sound features in time mirrored perceptual phenomena that are thought to be driven by stimulus acoustics (e.g. vowel identity vs. F0) and by the dynamics of auditory scene analysis (e.g. the precedence effect and its importance in localizing sound sources, particularly in reverberant environments).
Multi-feature representations
The representation of multiple sound features, sometimes within the same units, supports previous reports of distributed coding in auditory cortex (Bizley, Walker et al. 2009) and mirrors the co-encoding of multiple stimulus features in the ventral visual stream (Hong, Yamins et al. 2016). As in vision, a core function of the auditory system is to recognize objects across identity-preserving transformations. A key function of visual cortex may be to extract invariant or tolerant representations of objects through hierarchical networks (Yamins and DiCarlo 2016), and the emergence of tolerant representations in primary and non-primary auditory cortex suggests similar principles may apply to the auditory system. However, primary auditory and visual cortices may not be equivalent: given the extensive subcortical processing of sounds in the auditory system (King and Nelken 2009), early auditory cortex may sit higher in the auditory invariance hierarchy than early visual cortices (V1 / V2) sit in the visual invariance hierarchy. Our results and other studies (Bizley, Walker et al. 2009) support this view, as we sampled neurons that represented vowel identity across orthogonal variables from both primary and non-primary cortices. However, we did not map the precise boundaries between cortical subfields in this study, and our results show only that tolerant representations exist early in auditory cortex rather than determining whether tolerance increases from primary to non-primary areas. As the emergence of tolerant representations has been mapped progressively through the visual system (Hong, Yamins et al. 2016), it will also be important to study the tolerance of auditory representations at stages of the ascending auditory pathway before cortex, in the medial geniculate nucleus and inferior colliculus of animals demonstrating perceptual constancy.
Author contributions
SMT and JKB designed the experiments and wrote the paper; SMT, KCW and JKB collected the data; SMT analysed the data.
Methods
Animals
Subjects were four pigmented female ferrets (1-5 years old) trained to discriminate vowels across fundamental frequency, sound level, voicing and location (Bizley, Walker et al. 2013, Town, Atilgan et al. 2015). Each ferret was chronically implanted with Warp-16 microdrives (Neuralynx, MT) housing sixteen independently moveable tungsten microelectrodes (WPI Inc., FL) positioned over primary and posterior fields of left or right auditory cortex (Fig. S1). Details of the surgical implantation procedures and histological confirmation of electrode position are described elsewhere (Bizley, Walker et al. 2013). A further six ferrets (also pigmented females) implanted with the same microdrives were used as naïve animals for passive recording. These animals were trained in a variety of psychophysical tasks that did not involve the vowel sounds presented here.
Subjects were water restricted prior to testing; on each day of testing, subjects received a minimum of 60ml/kg of water either during testing or supplemented as a wet mash made from water and ground high-protein pellets. Subjects were tested in morning and afternoon sessions on each day for up to five days in a week. Test sessions lasted between 10 and 50 minutes and ended when the animal lost interest in performing the task.
The weight and water consumption of all animals were measured throughout the experiment. Regular otoscopic examinations were made to ensure the cleanliness and health of the ferrets' ears. All experimental procedures were approved by a local ethical review committee and performed under license from the UK Home Office and in accordance with the Animals (Scientific Procedures) Act 1986.
Apparatus
Ferrets were trained to discriminate sounds in a customized pet cage (80 cm x 48 cm x 60 cm, length x width x height) within a sound-attenuating chamber (IAC) lined with sound-attenuating foam. The floor of the cage was made from plastic, with an additional plastic skirting into which three spouts (center, left and right) were inserted. Each spout contained an infra-red sensor (OB710, TT electronics, UK) that detected nose-pokes and an open-ended tube through which water could be delivered.
Sound stimuli were presented through two loud speakers (Visaton FRS 8) positioned on the left and right sides of the head at equal distance and approximate head height. These speakers produce a flat response (±2 dB) from 200Hz to 20 kHz, with an uncorrected 20 dB drop-off from 200 to 20 Hz when measured in an anechoic environment using a microphone positioned at a height and distance equivalent to that of the ferrets in the testing chamber. An LED was also mounted above the center spout and flashed (flash rate: 3 Hz) to indicate the availability of a trial. The LED was continually illuminated whenever the animal successfully made contact with the IR sensor within the center spout until a trial was initiated. The LED remained inactive during the trial to indicate the expectation of a peripheral response and was also inactive during a time-out following an incorrect response.
The behavioral task, data acquisition, and stimulus generation were all automated using custom software running on personal computers, which communicated with TDT real-time signal processors (RZ2 and RZ6, Tucker-Davis Technologies, Alachua, FL).
Task Design, Stimuli and Behavioral Testing
Ferrets discriminated vowel identity in a two-alternative forced choice task described elsewhere (Town, Atilgan et al. 2015). Briefly, on each trial the animal was required to approach the center spout and hold head position for a variable period (0-500 ms) before stimulus presentation. Each stimulus consisted of a 250 ms artificial vowel sound repeated once with an interval of 250 ms. Animals were required to maintain contact with the center spout until the end of the interval between repeats (i.e. 500-1000 ms after initial nose-poke) and could then respond at either the left or right spout. Correct responses were rewarded with water delivery, whereas incorrect responses led to a variable-length time-out (3-8 s). Incorrect responses were also followed by a correction trial on which animals were presented with the same stimulus with the same timing. Correction trials and trials on which the animal failed to respond within the trial window (60 s) were not analyzed.
Stimuli were artificial vowel sounds synthesized in MATLAB (Mathworks, USA) based on an algorithm adapted from Malcolm Slaney’s Auditory Toolbox (https://engineering.purdue.edu/∼malcolm/interval/1998-010/). The adapted algorithm simulates vowels by passing a sound source (either a click train, to mimic a glottal pulse train for voiced stimuli, or broadband noise for voiceless stimuli) through a biquad filter with appropriate numerators such that formants are introduced in parallel. Four formants (F1-4) were modelled: three subjects were trained to discriminate [u] (F1-4: 460, 1105, 2857, 4205 Hz) from [ε] (730, 2058, 2857, 4205 Hz) while one subject was trained to discriminate [a] (936, 1551, 2975, 4263 Hz) from [i] (437, 2761, 2975, 4263 Hz). Selection of formant frequencies was based on previously published data (Peterson and Barney 1952, Town, Atilgan et al. 2015) and synthesis produced sounds consistent with the intended phonetic identity. Formant bandwidths were kept constant at 80, 70, 160 and 300 Hz (F1-4 respectively) and all sounds were ramped on and off with 5 ms cosine ramps.
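A minimal MATLAB sketch of this synthesis scheme is given below: a click-train (voiced) or noise (whispered) source is passed through parallel second-order resonators, one per formant, and gated with 5 ms cosine ramps. The resonator design and gain normalization are standard textbook choices and are assumptions here; the original Auditory Toolbox implementation may differ in detail.

```matlab
fs  = 48e3;  dur = 0.25;  F0 = 200;           % sample rate (Hz), duration (s), pitch (Hz)
F   = [460 1105 2857 4205];                   % formant frequencies for [u] (Hz)
BW  = [80 70 160 300];                        % formant bandwidths (Hz)
t   = (0:round(dur*fs)-1)' / fs;

voiced = true;
if voiced
    src = zeros(size(t));
    src(1:round(fs/F0):end) = 1;              % click train mimicking glottal pulses
else
    src = randn(size(t));                     % broadband noise for whispered vowels
end

y = zeros(size(t));
for k = 1:numel(F)                            % formants introduced in parallel
    r  = exp(-pi * BW(k) / fs);               % pole radius set by bandwidth
    th = 2 * pi * F(k) / fs;                  % pole angle set by formant frequency
    b  = (1 - r) * sqrt(1 - 2*r*cos(2*th) + r^2);   % approximate unity gain at resonance
    a  = [1, -2*r*cos(th), r^2];
    y  = y + filter(b, a, src);               % second-order (biquad) resonator
end

n   = round(0.005 * fs);                      % 5 ms cosine on/off ramps
env = ones(size(y));
env(1:n)         = 0.5 * (1 - cos(pi * (0:n-1)' / n));
env(end-n+1:end) = flipud(env(1:n));
y = y .* env;
```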
To test perceptual constancy, we varied the rate of the pulse train to generate different fundamental frequencies and used broadband noise rather than a pulse train to generate voiceless vowels. For sound level, we simply attenuated signals in software prior to presentation. For sound location, we presented vowels from only the left or right speaker, whereas for all other tests sounds were presented from both speakers. Across variations in F0, voicing and space, we fixed sound level at 70 dB SPL. For tests across sound level and location, voiced vowels were generated with a 200 Hz fundamental frequency. Sound levels were calibrated using a Brüel & Kjær (Norcross, USA) sound level meter and free-field 1/2-inch microphone (4191) placed at the position of the animal's head during trial initiation.
Neural Recording
Neural activity in auditory cortex was recorded continuously throughout task performance. On each electrode, voltage traces were recorded using TDT System III hardware (RX8 and RZ2) and OpenEx software (Tucker-Davis Technologies, Alachua, FL) with a sample rate of 25 kHz. For extraction of action potentials, data were bandpass filtered between 300 and 5000 Hz and motion artefacts were removed using a decorrelation procedure applied to all voltage traces recorded from the same microdrive in a given session (Musial, Baker et al. 2002). For each channel within the array, we identified candidate events as those with amplitudes between -2.5 and -6 times the RMS value of the voltage trace and defined waveforms of events using a 32-sample window centered on threshold crossings.
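A simplified MATLAB sketch of the event-detection step is shown below: candidate spikes are negative deflections between -2.5 and -6 times the RMS of the band-passed trace, each captured as a 32-sample waveform around the threshold crossing. Variable names and the crossing logic are illustrative rather than the original code.

```matlab
fs    = 25e3;
trace = randn(fs * 10, 1);                        % placeholder band-passed trace (10 s)

rmsV    = sqrt(mean(trace.^2));
lowThr  = -2.5 * rmsV;                            % detection threshold
highThr = -6.0 * rmsV;                            % artifact rejection bound

% Downward crossings of the detection threshold
crossIdx = find(trace(1:end-1) > lowThr & trace(2:end) <= lowThr) + 1;

halfWin   = 16;                                   % 32-sample window around each crossing
waveforms = zeros(0, 2 * halfWin);
for i = crossIdx(:)'
    if i > halfWin && i + halfWin - 1 <= numel(trace)
        w = trace(i - halfWin : i + halfWin - 1);
        if min(w) >= highThr                      % keep events within -2.5x to -6x RMS
            waveforms(end+1, :) = w';             %#ok<AGROW>
        end
    end
end
```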
In the current study, waveform shapes were not sorted and data from multiple test sessions were combined across days. The activity for each unit thus represents the unsorted multi-unit activity of a small population of cells at the recording site. We identified sound-responsive units as those whose stimulus-evoked response within the 300 ms after onset of the first token differed significantly from spontaneous activity in the 300 ms before the animal made contact with the spout (t-test, p < 0.05).
Decoding procedure
We decoded vowel identity across trials using a simple spike-distance decoder with leave-one-out cross-validation (LOCV). For every trial in a given condition (see below), we calculated template responses as the mean PSTH response to each vowel from all trials except the held-out test trial, and used the Euclidean distance between the test trial response and the template responses to recover vowel identity (Fig. S2A). Where equal distances were observed between templates, we randomly assigned vowel identity. This procedure was repeated for all trials and decoding performance was measured as the percentage of trials on which vowel identity was correctly recovered.
Auditory cortical units showed a wide variety of response profiles that made it difficult to select a single fixed time window over which to generate PSTHs. To accommodate the heterogeneity of auditory cortical neurons and identify the time at which stimulus information arose, we repeated our decoding procedure using a series of time windows with varying start time (-0.5 to 1 s after stimulus onset, varied at 0.1 s intervals) and duration (10 to 500 ms, 10 ms intervals) (Fig S2B). Within this parameter space we then reported the parameters that gave best decoding performance. Where several parameters gave best performance we reported the time window with earliest start time and shortest duration.
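A minimal MATLAB sketch of this decoder is given below: leave-one-out cross-validation in which each held-out trial is assigned to the vowel whose mean PSTH template is nearest in Euclidean distance, with the decoding window (start time and duration) chosen to maximize performance. The 10 ms PSTH bin width and variable names are assumptions; ties here go to the first class, whereas the original assigned ties randomly.

```matlab
function [bestPerf, bestStart, bestDur] = decodeVowel(spikeTimes, vowelId)
% spikeTimes: nTrials x 1 cell array of spike times (s, re. stimulus onset)
% vowelId:    nTrials x 1 numeric vector of vowel labels
starts = -0.5:0.1:1;  durs = 0.01:0.01:0.5;  binWidth = 0.01;   % search grid (s)
bestPerf = -inf;  bestStart = NaN;  bestDur = NaN;
for s = starts
    for d = durs
        edges = s : binWidth : s + d;
        psth  = cellfun(@(st) histcounts(st, edges), spikeTimes, ...
                        'UniformOutput', false);
        psth  = cell2mat(psth(:));                     % trials x bins
        perf  = locvPerformance(psth, vowelId);
        if perf > bestPerf
            bestPerf = perf;  bestStart = s;  bestDur = d;
        end
    end
end
end

function perf = locvPerformance(psth, labels)
classes = unique(labels);  nTrials = size(psth, 1);  nCorrect = 0;
for i = 1:nTrials
    templates = zeros(numel(classes), size(psth, 2));
    for c = 1:numel(classes)
        idx = find(labels == classes(c));
        idx(idx == i) = [];                            % template excludes the test trial
        templates(c, :) = mean(psth(idx, :), 1);
    end
    dists = sqrt(sum((templates - psth(i, :)).^2, 2)); % Euclidean spike distances
    [~, nearest] = min(dists);
    nCorrect = nCorrect + (classes(nearest) == labels(i));
end
perf = 100 * nCorrect / nTrials;
end
```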
To assess the significance of decoding performance, we conducted a permutation test in which the decoding procedure (including temporal optimization) was repeated 100 times but with vowel identity randomly shuffled between trials to give a null distribution of decoder performance. The null distribution was then parameterized by fitting a Gaussian probability density function, for which we then calculated the probability of observing the real decoding performance. Here we only considered units for which there was a minimum of five trials from which to generate template responses for each stimulus condition. As such, trial number was the limiting factor in our choice of decoder as more complex approaches (e.g. linear discriminant analysis, mutual information etc.) required more trials for each stimulus condition than our study allowed.
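A matching MATLAB sketch of the permutation test follows, reusing the decodeVowel sketch above on shuffled labels and parameterizing the null distribution with a Gaussian fit; the spike-time and label placeholders are illustrative only.

```matlab
% Placeholder single-unit data (illustrative only)
nTrials    = 100;
spikeTimes = arrayfun(@(k) sort(rand(poissrnd(10), 1)), (1:nTrials)', ...
                      'UniformOutput', false);
vowelId    = randi(2, nTrials, 1);

observedPerf = decodeVowel(spikeTimes, vowelId);          % includes the window search

% Null distribution: repeat the full procedure with shuffled vowel labels
nPerm    = 100;
nullPerf = zeros(nPerm, 1);
for p = 1:nPerm
    shuffled    = vowelId(randperm(nTrials));
    nullPerf(p) = decodeVowel(spikeTimes, shuffled);
end

% Fit a Gaussian to the null and evaluate the observed performance against it
pd            = fitdist(nullPerf, 'Normal');
pVal          = 1 - cdf(pd, observedPerf);                % P(null >= observed)
isInformative = pVal < 0.05;
```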
Population Decoding
To decode vowel identity from the responses of populations of units on a given trial (y), we used a maximum likelihood estimate in which the likelihood of each stimulus class (e.g. vowel identity, F0, level etc.) was simply the weighted sum of individual unit estimates:

Where N is the number of units in the population, and w_i is a confidence weighting based on spike distances:

Where d_1 to d_n are the spike distances between the test trial response and the template responses (for n stimulus classes - e.g. vowel identity or F0), and d_1 represents the minimum distance.
We tested populations of up to 50 units, by which point decoder performance had typically saturated at 100% (with the exception of decoding F0 and sound level). Populations were constructed by random sampling without replacement from the possible cohort of units under investigation. For each population size for which the number of possible combinations of units exceeded the number of units available, we repeated sampling 100 times.
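The weighting equations are not reproduced above, so the MATLAB sketch below is only a loose illustration of a weighted-sum population decoder: each unit votes for its nearest template, with a confidence weight derived from how much smaller its best spike distance is than its average distance. This particular weight formula is an assumption, not the authors' equation, and the distance matrix is a placeholder.

```matlab
% Placeholder: units x classes matrix of spike distances between one test-trial
% response and each class template (smaller = better match)
nUnits = 30;  nClasses = 2;
unitDists = rand(nUnits, nClasses);

likelihood = zeros(1, nClasses);
for i = 1:nUnits
    [dMin, est] = min(unitDists(i, :));          % unit i's single-trial class estimate
    w = 1 - dMin / mean(unitDists(i, :));        % assumed confidence weight (0 if all distances equal)
    likelihood(est) = likelihood(est) + w;       % weighted sum of unit estimates
end
likelihood = likelihood / sum(likelihood);       % normalize across classes

[~, decodedClass] = max(likelihood);             % population maximum-likelihood estimate
```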
Acknowledgements
This work was funded by grants to JKB from the BBSRC (BB/H016813/1), the Royal Society (DH0900058) and the Wellcome Trust / Royal Society (WT098418MA).