Training neural networks to recognize speech increased their correspondence to the human auditory pathway but did not yield a shared hierarchy of acoustic features

The correspondence between the activity of artificial neurons in convolutional neural networks (CNNs) trained to recognize objects in images and neural activity recorded throughout the primate visual system has been well documented. Shallower layers of CNNs are typically more similar to early visual areas, and deeper layers tend to be more similar to later visual areas, providing evidence for a shared representational hierarchy. This phenomenon has not been thoroughly studied in the auditory domain. Here, we compared the representations of CNNs trained to recognize speech (triphone recognition) with 7-Tesla fMRI activity collected throughout the human auditory pathway, including subcortical and cortical regions, while participants listened to speech. We found no evidence for a shared representational hierarchy of acoustic speech features. Instead, all auditory regions of interest were most similar to a single layer of the CNNs: the first fully-connected layer. This layer sits at the boundary between the relatively task-general intermediate layers and the highly task-specific final layers. This suggests that alternative architectural designs and/or training objectives may be needed to achieve fine-grained layer-wise correspondence with the human auditory pathway.

Highlights
- Trained CNNs are more similar to auditory fMRI activity than untrained CNNs
- No evidence of a shared representational hierarchy for acoustic features
- All ROIs were most similar to the first fully-connected layer
- CNN performance on the speech recognition task is positively associated with fMRI similarity


Introduction
The use of deep neural networks (DNNs) as models of biological neural networks has been discussed as an opportunity for synergy between neuroscience and machine learning. Object recognition in the primate ventral stream has been described as a process of untangling that gradually increases the separability between object manifolds (DiCarlo and Cox, 2007). Similar language has been used to describe how DNNs accomplish recognition tasks (Bengio et al., 2013). Several studies have now reported that state-of-the-art (SOTA) machine learning systems, trained only to maximize their performance on a specific task, without any explicit goal to mimic neural activity, appear to learn representations that are similar to those found in the brains of animals engaged in a similar task (Kriegeskorte, 2015). For example, the output layer of AlexNet (Krizhevsky et al., 2012) has been found to be highly predictive of neural activity in later stages of the ventral visual stream. The most convincing demonstration that modern convnets learn representations that are meaningful to neurons in the primate visual system is work from Bashivan et al. (2019) showing that task-optimized DNNs can be used to control the activity of macaque V4 neurons. They found that stimuli synthesized to maximally activate specific units in the DNN also drove activity of matched sites in V4 well beyond their maximum firing rate in response to natural images.

Comparisons of DNNs to biological sensory pathways often come with claims of a shared representational hierarchy. Regions of interest (ROIs) along some pathway are mapped to layers of a DNN based on their similarity. Early layers in the network tend to be more similar to early ROIs in the pathway and late layers to late ROIs (Cichy et al., 2016; Güçlü and van Gerven, 2015). These results suggest that DNNs are not just learning representations that are similar to single regions, but rather that they constitute models of an entire processing pathway. In the auditory domain, Kell et al. (2018) compared a convnet trained on sound to fMRI activity in human auditory cortex. To assess the existence of a shared hierarchy, they looked only at voxels that showed a reliable response to sound and layers of their network which were predictive of voxel activity across auditory cortex.
They found that the layers most predictive of primary auditory cortex were intermediate layers, while the layers most predictive of secondary auditory cortex were deeper layers. From this, they conclude that the hierarchical distinction between primary and secondary auditory cortex is mirrored in their convnet (Kell et al., 2018). Güçlü et al. also reported evidence for a shared hierarchy in human auditory cortex, but they only analyzed the superior temporal gyrus (STG). They used representational similarity analysis (RSA) to compare fMRI activity to the representations learned in a DNN trained to predict tags from excerpts of musical audio. They found a gradient of complexity across STG, where anterior voxel clusters were more similar to early layers while posterior voxel clusters were more similar to late layers (Güçlü et al., 2016). While both of the above studies report evidence for a shared hierarchy between human auditory cortex and DNNs trained on sound, they report different spatial patterns of similarity gradients.
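As an illustration, RSA of the kind used by Güçlü et al. compares the pairwise-dissimilarity structure of two representations rather than the representations themselves. The sketch below uses synthetic data; all shapes and names are hypothetical, and published RSA studies often use Spearman rather than Pearson correlation for the final comparison:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy example: 30 stimuli represented in two systems that share an
# 8-dimensional latent structure (all shapes are hypothetical).
stimuli = rng.standard_normal((30, 8))
layer = stimuli @ rng.standard_normal((8, 100))  # DNN layer, 100 units
roi = stimuli @ rng.standard_normal((8, 40))     # brain ROI, 40 voxels

def rdm(X):
    # Representational dissimilarity matrix: 1 - Pearson r between the
    # response patterns of every pair of stimuli (rows of X).
    return 1.0 - np.corrcoef(X)

def rsa(A, B):
    # Correlate the upper triangles of the two RDMs.
    iu = np.triu_indices(A.shape[0], k=1)
    return np.corrcoef(rdm(A)[iu], rdm(B)[iu])[0, 1]

score = rsa(layer, roi)  # high when the two representations share geometry
```

Because both toy representations are projections of the same latent stimulus structure, their RDMs correlate strongly even though the systems have different numbers of units.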

Several different analysis tools are used to compare representations. The ultimate goal of these analyses is to quantify the similarity of two representations, but similarity is an ambiguous term that must be defined by the experimenter. In many of the aforementioned studies, an encoding analysis is performed, where firing rate or voxel activity is predicted by a regularized linear model of the neural network activity. According to this approach, a representation is similar to another to the extent that it can be linearly predicted from it.

Here, we use centered kernel alignment (CKA) to quantify the similarity between representations learned in convnets trained on speech and activity throughout the human auditory pathway during speech listening, as measured with 7-Tesla (7T) fMRI. The high spatial resolution of 7T fMRI allows us to simultaneously measure activity from auditory cortex as well as subcortical auditory regions, which are often omitted from auditory fMRI analyses due to their small size.

Since significant auditory processing occurs in brainstem and midbrain regions, this provides us with several distinct regions, with relatively well-characterized connectivity, against which to compare the convnet representations.
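The encoding approach described above (predicting voxel activity from network activity with a regularized linear model) can be sketched as follows. All data and shapes here are synthetic stand-ins; real encoding analyses evaluate the fit on held-out stimuli, while this sketch scores in-sample for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 stimuli, a 512-unit network layer, 50 voxels.
n_stim, n_units, n_vox = 200, 512, 50
layer_acts = rng.standard_normal((n_stim, n_units))
true_weights = 0.1 * rng.standard_normal((n_units, n_vox))
voxels = layer_acts @ true_weights + rng.standard_normal((n_stim, n_vox))

def ridge_predict(X, Y, lam=10.0):
    # Closed-form ridge regression: predict every voxel's response
    # as a regularized linear combination of the layer's units.
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return X @ W

pred = ridge_predict(layer_acts, voxels)

# Under the encoding view, similarity = how well the layer linearly
# predicts the voxels (mean correlation across voxels).
encoding_score = np.mean(
    [np.corrcoef(pred[:, v], voxels[:, v])[0, 1] for v in range(n_vox)]
)
```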

To the best of our knowledge, ours is the first study to compare DNN representations to activity from both cortical and subcortical regions of the human auditory pathway. A shared representational hierarchy would appear as a diagonal pattern in the layer-by-ROI similarity matrix, with early layers most similar to early regions and late layers most similar to later regions. While we found that our trained networks were more similar to the brain than an untrained network, we found no such diagonal pattern. Instead, we found that, on average, nearly all ROIs are most similar to the first fully-connected layer.

The following preprocessing description was prepared by fMRIPrep. The BOLD time series were resampled to cortical surfaces.

CKA is built on a kernel-based measure of the statistical independence between X and Y. When using a linear kernel, CKA is simply

\[ \mathrm{CKA}(X, Y) = \frac{\lVert Y^\top X \rVert_F^2}{\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F}, \]

which is equivalent to the RV coefficient (Robert and Escoufier, 1976). In addition to quantifying the similarity between each ROI and the layers of the trained networks, we also calculate the CKA similarity between each ROI and the layers of an untrained network. This untrained network has the same architecture as the trained models, but its parameters have been randomly initialized and never updated. If training has increased the correspondence to the brain, the CKA scores for a trained network should be greater than those of the untrained network. We capture the effect of training on similarity by calculating the difference of standardized CKA scores between a trained network of interest and the untrained network, which we refer to here as the neural similarity score for brevity. Within each subject, the CKA scores are standardized using the mean μ_s and standard deviation σ_s calculated over all models and ROI-layer pairs. The CKA scores of the untrained network are standardized using the same mean and standard deviation. The neural similarity score φ_m^s is a difference of z-scores which reflects the similarity achieved by model m in subject s relative to the untrained model.
Thus, a neural similarity score of 1 indicates that the similarity achieved by the trained model is 1 standard deviation greater than that achieved by the untrained model.
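A minimal sketch of linear CKA and the neural similarity score, using synthetic data. The standardization pool below is a toy stand-in for the real pool, which spans all models and ROI-layer pairs within a subject:

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA between representations X (stimuli x units) and
    # Y (stimuli x voxels); equivalent to the RV coefficient.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    den = np.linalg.norm(X.T @ X, ord='fro') * np.linalg.norm(Y.T @ Y, ord='fro')
    return num / den

rng = np.random.default_rng(2)
shared = rng.standard_normal((100, 5))          # shared latent structure
layer = shared @ rng.standard_normal((5, 64))   # "trained" layer
roi = shared @ rng.standard_normal((5, 20))     # ROI activity
untrained = rng.standard_normal((100, 64))      # "untrained" layer

cka_trained = linear_cka(layer, roi)
cka_untrained = linear_cka(untrained, roi)

# Neural similarity score: difference of z-scored CKA values
# (toy standardization pool of four scores).
pool = np.array([cka_trained, cka_untrained, 0.2, 0.3])
mu, sigma = pool.mean(), pool.std()
phi = (cka_trained - mu) / sigma - (cka_untrained - mu) / sigma
```

Note that the difference of z-scores reduces to the raw CKA difference divided by the pooled standard deviation, so a positive φ means the trained layer tracks the ROI better than the untrained one.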

We calculated the CKA similarity for each network, subject, and ROI-layer pair. The results of these analyses can be summarized in similarity matrices whose rows correspond to layers of a network and whose columns correspond to the auditory ROIs. Figure 1 shows the grand mean similarity matrices. The hypothesized diagonal pattern does not occur in the raw CKA similarity scores, for either the trained or the untrained networks (Figure 1a-b).

Instead, for all ROIs, the first fully connected layer (fc1) achieves the highest 406 raw CKA similarity and the highest neural similarity score. This pattern 407 does not occur in the similarity matrix for the untrained network, suggesting 408 that it was introduced by training and not by the architecture. We calculated the average neural similarity score matrix for each net-  We hypothesized that the differences between models observed in Figure 2 423 may be related to the models' accuracy on the phone classification task on 424 which they were trained. In Figure 3, we plot the peak neural similarity score  and then freeze trained on Language 2. Training generally increased the correspondence between brain and networks. Layer fc1 shows the highest neural similarity score and there is little evidence for shared hierarchy (no diagonal pattern). In some layers of certain networks, training did not affect or actually reduced the ROI-layer similarity (shown in white and blue). Layer fc2 yields greater neural similarity for the networks that were trained on two languages, which also performed better on the triphone recognition task. ROIs, followed by the second fully-connected layer, fc2.
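Identifying each ROI's most similar layer amounts to an argmax down each column of the layer-by-ROI similarity matrix. The toy sketch below illustrates this summary; the layer names, ROI labels, and scores are all hypothetical, constructed so that fc1 dominates every column, mimicking the pattern reported above:

```python
import numpy as np

rng = np.random.default_rng(3)
layers = ["conv1", "conv2", "conv3", "fc1", "fc2", "out"]  # hypothetical
rois = ["CN", "SOC", "IC", "MGN", "HG", "STG"]             # hypothetical

# Toy neural-similarity-score matrix (layers x ROIs), with fc1 given
# the largest score in every column for illustration.
scores = rng.uniform(0.0, 0.5, size=(len(layers), len(rois)))
scores[layers.index("fc1"), :] += 1.0

# Each ROI's most similar layer is the argmax down its column.
peak_layer = [layers[i] for i in scores.argmax(axis=0)]
# -> every ROI peaks at "fc1"
```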

This apparent discrepancy may be best explained by reference to the differing task specificity of the layers. The final layers of the networks learn representations that are more task-specific than any representations employed by the human brain, whose ultimate goal during speech listening is typically natural language understanding, not phoneme recognition. However, fc2 was also found to be relatively similar, but only for the models which were trained on two languages rather than one. These networks benefited from twice the amount of training data as the models trained on only one language and displayed superior generalization as a result. Our analysis revealed that these more generalizable, less language-specific penultimate representations were also more similar to activity in the auditory brain.

Alternative architectures, cost functions, training procedures, or measurement modalities may be required to achieve a layer-to-ROI correspondence with the human auditory pathway.