Neural Encoding for Human Visual Cortex with Deep Neural Networks Learning “What” and “Where”

Neural encoding, a crucial aspect to understand human brain information processing system, aims to establish a quantitative relationship between the stimuli and the evoked brain activities. In the field of visual neuroscience, with the ability to explain how neurons in primary visual cortex work, population receptive field (pRF) models have enjoyed high popularity and made reliable progress in recent years. However, existing models rely on either the inflexible prior assumptions about pRF or the clumsy parameter estimation methods, severely limiting the expressiveness and interpretability. In this paper, we propose a novel neural encoding framework by learning “what” and “where” with deep neural networks. The modeling approach involves two separate aspects: the spatial characteristic (“where”) and feature selection (“what”) of neuron populations in visual cortex. Specifically, we use the receptive field estimation and multiple features regression to learn these two aspects respectively, which are implemented simultaneously in a deep neural network. The two forms of regularizations: sparsity and smoothness, are also adopted in our modeling approach, so that the receptive field can be estimated automatically without prior assumptions about shapes. Furthermore, an attempt is made to extend the voxel-wise modeling approach to multi-voxel joint encoding models, and we show that it is conducive to rescuing voxels with poor signal-to-noise characteristics. Extensive empirical results demonstrate that the method developed herein provides an effective strategy to establish neural encoding for human visual cortex, with the weaker prior constraints but the higher encoding performance. Author summary Characterizing the quantitative relationship between the stimuli and the evoked brain activities usually involves learning the spatial characteristic (“where”) and feature selection (“what”) of neuron populations. As an effective strategy, we propose a novel end-to-end “what” and “where” architecture to perform neural encoding. The proposed modeling approach consists of receptive field estimation and multiple features regression, which learns “where” and “what” simultaneously in a deep neural network. Different from previous methods, we use the sparsity and smoothness regularizations in the deep neural network to guide the receptive field estimation, so that the receptive field for each voxel can be estimated automatically. Moreover, in consideration of computational similarities between adjacent voxels, we made an attempt to extend the proposed modeling approach to multi-voxel joint encoding models, improving the encoding performance of voxels with poor signal-to-noise characteristics. Empirical evaluations show that the proposed method outperforms other baselines to achieve the state-of-the-art performance.


Introduction 1
A great mystery in computational neuroscience is understanding how the brain effortlessly 2 performs information perception and processing given sensory input. Uncovering this internal 3 mechanism is of great scientific importance, not only for neuroscience researches, but also for 4 artificial intelligence researches. In the field of visual neuroscience, one common method for 5 insights into visual information processing is to establish neural encoding models with 6 functional magnetic resonance imaging(fMRI) [1,2]. This modeling approach (Fig 1) of neural 7 encoding links fMRI signals at the millimeter scale to neural response at the micron scale, 8 providing a non-invasive approach to revealing the nonlinear relationship between the external 9 stimuli and the evoked brain activities [3][4][5]. Numerous studies have built neural encoding models from diverse perspectives for human 11 visual cortex, including, but not limited to, Gabor wavelet pyramid model [4], luminance 12 contrast models [6][7][8][9][10], the motion energy model [11], semantic models [12], deepnet 13 models [13], and other machine learning methods [14,15]. The estimated characteristics of 14 visual cortex derived from these models can be roughly summarized in two aspects: "what" and 15 "where". "Where" characterizes the spatial characteristic of neuron populations in visual cortex, 16 i.e. the location and extent of pooling over visual features, whereas "what" characterizes the "What" focuses on feature selection and feature tuning functions. In the semantic model [12], 25 several object category features (e.g. the presence of an animal, or a car.) are encoded as the 26 vector of binary variables. Then, every object category was assigned a tuning parameter for each 27 voxel to construct the model. Similarly, different visual features are further studied in 28 subsequent models [4,11,[14][15][16]. 29 Recently, deep learning with neural networks [17][18][19] has been widely used to perform 30 feature learning from scratch with promising performance, which sparks interest in using deep 31 learning methods for understanding information processing in visual cortex [5,[20][21][22]. Based on 32 deep neural networks, [13] proposed a new approach to encoding visual features named 33 feature-weighted receptive field (fwRF) [13]. It starts with a natural image, obtains feature maps 34 in a pre-trained convolutional neural network, and computes a weighted sum within the spatial 35 extent of an 2D Gaussian receptive field. Finally, it regresses all the feature maps onto brain 36 activity simultaneously, which yielded the state-of-the-art prediction accuracy. However, while 37 previous work demonstrated promising results of processing in visual cortex, neural encoding 38 models still lack adequate examination and require plenty of effort to improve. 39 There are two main challenges getting in the way of development of effective models. On 40 one hand, conventional approaches [6,7] are endowed with inflexible prior assumptions on the 41 spatial characteristics of receptive fields, which limit the effectiveness of models to a large 42 extent. For example, in the population receptive model [6], it assumed that the pRF has an 43 isotropic Gaussian topography while the potentially suppressive surround is neglected. There 44 have been subsequent models [7,9,10,13] which have adopted the same principles with different 45 pRF topographies. In general, one assumption about receptive field structure puts one prior 46 constraint on the ability to extract the receptive field topography of the model. Specifically, 47 inaccurate assumptions about receptive field topography may lead to erroneous estimation of the 48 receptive field. Hence, it is meaningful to propose a new method that can extract receptive field 49 topography without inflexible prior assumptions. On the other hand, previous 50 approaches [6,7,9,10,13,15] to obtaining receptive fields are based on grid search, which set 51 search parameters according to experience. Accordingly, they are prone to pRF center 52 mislocalization and size miscalculation. In the population receptive models [6,7,13], the pRF 53 topography parameters were set to certain parameters, which can be obtained by minimizing the 54 residual between the observed fMRI signal and the predicted signal. In this case, according to 55 different shape parameters (i.e. center and radius), these models will inevitably generate large 56 quantities of candidate pRF. That is, the grid fitting requires searching over quite large 57 model-parameter spaces. Consequently, their encoding performances depend on the amount and 58 parameter interval of candidate receptive fileds, which are set artificially. Obviously, more often 59 than not, it is not optimal and requires lots of manual effort. It would therefore be significant to 60 obtain the receptive fileds automatically in a more reasonable way.

61
Existing methods are prone to suffer from one or both of these issues and yield 62 dissatisfaction. Attaching great importance to these bottlenecks, we proposed a novel "what" 63 and "where" neural encoding architecture via deep neural networks. The proposed method first 64 extract hierarchical features from the DNN driven for image recognition. Then, the original feature turning functions ("what") automatically, which is rich in interpretability.

77
• The estimation of receptive fields is endowed with weaker constraints. Instead of strong 78 prior assumptions on the shape of receptive fields, L1 regularization and Laplacian 79 smoothing are adopted in our modeling approach, which can be regarded as weak prior 80 assumptions about receptive fields.

81
• We made an attempt in the extension of the modeling approach. In consideration of the 82 computational similarities between voxels, the voxel-wise modeling approach is extended 83 to multi-voxel joint encoding models, suggesting a new approach to rescuing voxels with 84 poor signal-to-noise characteristics more effectively.

85
• Extensive empirical evaluations on the publicly available fMRI dataset demonstrate that 86 our modeling approach achieves superior performance compared with other neural 87 encoding models.

89
In the neural encoding dataset, we assume X = [x 1 , ..., Nonlinear feature extraction 105 The neural encoding models take visual stimuli as the input, and output the evoked brain 106 activities. Normally, it includs two sequential stpdf. The first step is a nonlinear feature 107 mapping, converting the visual input to its feature representations; the second step is a 108 voxel-wise linear mapping, projecting the feature representations onto activities at each 109 voxel [4,[23][24][25][26][27]. In the present study, The nonlinear feature mapping consists of two parts: 110 nonlinear feature extraction and nonlinear feature refinement. The nonlinear feature extraction is 111 introduced in this sub-section, while nonlinear feature refinement and voxel-wise linear mapping 112 are described in the next several sub-sections.

113
Recent studies [13,28] has demonstrated that Alexnet [18], a specific version of DNNs, is 114 capable of predicting voxel activities with statistical significance and high accuracies throughout 115 the visual cortex. In line with previous work [13], a deep neural network (a specific 116 implementation referred as the AlexNet) is adopted to extract nonlinear features in the present 117 study. Briefly, AlexNet has been pre-trained to achieve the best preformance in Large Scale

118
Visual Recognition Challenge 2012 [18]. It consists of eight layers of computational units: the 119 first five layers were convolutional layers, while the rest layers were fully connected. The image 120 input was fed into the first layer; the output from one layer served as the input to its next layer. 121 Each convolutional layer involves plenty of units and a set of filters that extracts filtered outputs 122 from different locations of the input. The resolution (square root of the number of pixels in each 123 feature map) and depth (number of feature maps) for each convolutional layer was (55, 96), (27,124 256), (13,384), (13,384), (13,256)  given image x i . In consideration of feature maps from multiple layers may be adopted in the  The feature refinement in the specific layer (the l-th layer). In the channel attention module, the original feature maps h l i are initially used to obtain the channel attention weights m l i . In the next phase, the original feature maps are element-wise multiplied by the channel attention weights to obtain the channel-wise weighted feature maps s l i . In the spatial RF module, the receptive field p is reshaped to the corresponding receptive field p l according to the size of channel-wise weighted feature maps s l i . The Hadamard product of p l and s l i finally produce the refined features r l i .

Channel attention module 152
It is acknowledged that attention plays an crucial role in human perception [29][30][31]. One 153 important property of the human visual system is that one does not attempt to process a whole 154 scene at once. Instead, humans exploit a sequence of partial glimpses and selectively focus on 155 salient features in order to capture visual structure better [32][33][34]. In the present work, we pay 156 more attention to the meaningful feature maps rather than considering each feature map equally. 157 Since a channel-wise feature map is a detector response map of the corresponding filter in vectorh, the channel attention module Φ(·) can be further formulated as follows: where W a ∈ R C×C and b a ∈ R 1×C are the transformation matrix and bias term respectively; 170 σ denotes the sigmoid function and ⊗ denotes element-wise product; m ∈ R 1×C denotes the 171 attention weight while s ∈ R S×C stands for attentioned feature maps. During element-wise 172 multiplication, the channel attention weight is broadcasted (copied) along the spatial dimension. 173 In this way, the original feature maps h are converted into attentioned feature maps s.

174
Spatial RF module 175 In visual areas, population activity at each voxel in the cortical sheet encodes visual features 176 within a spatially localized region of visual field [6][7][8][9]. In view of this, the attentioned feature 177 maps are pooled within a limited and contiguous area in this module, which is the spatial 178 receptive field (RF). In order to overcome the drawbacks brought by the inflexible prior 179 assumptions or the clumsy parameter estimation methods, the sparsity regularization and 180 smoothness regularization are adopted in this module. Under the guidance of both 181 regularizations, the proposed method yields explicit receptive field automatically and encodes 182 features within a contiguous region of the visual field. 183 Mathematically, the receptive field is denoted as p given each specific voxel, and randomly 184 initialized with the same spatial size of natural images (227 × 227). In consideration of different 185 sizes of feature maps, the receptive field is adapted to different convolutional layers by the 186 means of reshaping. Taking the first convolutional layer as an example, the corresponding 187 receptive field size is 55 × 55. In this way, there are five sizes of receptive fields converted from 188 the original receptive field, with one-to-one correspondence to the five convolutional layers. As 189 a direct way to combine attentioned feature maps with receptive fields, element-wise product is 190 more natural and general than specially designed operations. Herein, the receptive field is 191 broadcasted (copied) along the channel dimension in the beginning, matching with the 192 dimensions of attentioned feature maps. Given the feature maps s = [s 1 , s 2 , ..., s C ] ∈ R S×C , 193 the spatial RF module Ψ(·) can be formulated as follows: where s k ∈ R S×1 is the c-th channel of attentioned feature maps and p ∈ R S×1 is the 195 voxel-wise receptive field. is the Hadamard product of matrices. Hence, spatial RF module 196 Ψ(·) outputs the refined feature vector r ∈ R 1×C .

197
As a matter of fact, if there is no any other operation applied to the receptive field, it is just 198 an ordinary mask. Inspired by the physiological structural characteristics of receptive field, our 199 model adopted two forms of regularization: sparsity and smoothness. Specifically, for each 200 specific voxel, since we expect its receptive field to be highly sparse, the receptive field was 201 regularized by L1 penalty with strength λ s : To ensure the receptive field focuses on an localized area as effectively as possible, we use an L2 203 penalty on the Laplacian of the receptive field with the strength λ l : Here, || · || F and || · || 1 denote Frobenius norm and L 1 norm ("entriwise" norm) of a matrix 205 respectively, and is the convolutional operation. In this way, sparsity and smoothness 206 regularizations make the ordinary mask transform into the meaningful receptive field. In 207 contrast to Gaussian assumption, these two forms of regularization can be viewed as weak prior 208 assumptions about receptive field , which is more conducive to the expressiveness and flexibility 209 of neural encoding models. The original intention of neural encoding models is to account for the responses of different 212 visual processing stages and reveal the information processing mechanism of neurons in visual 213 cortex. Voxel-wise linear mapping sets up such a computational path to relate visual features to 214 the evoked response at each voxel, bridging the gap between feature selection property of neuron 215 populations in visual cortex and hierarchical feature maps in deep neural networks.

216
In some previous studies [21,28,35], the voxel-wise encoding models regress feature maps 217 in each layer onto brain activity independently, which is proved highly effective. However, it  The image data are converted into the feature maps (or attentioned feature maps), and the receptived field is randomly initialized and reshaped to same size as feature maps. In the next phase, the Hadamard product ( ) of the feature maps and the receptive field produce the refined features. Finally, the refined features simultaneously regressed onto voxel-wise activity by a transformation matrix. (b) The receptive field is regularized by the two forms of regularizations: sparsity and smoothness. These two regularizations are able to guide the evolution of receptive field.
fully-connected layer encoded semantic features. In the light of feature diversity, we use feature 222 maps from all layers to predict each voxel's response rather than assume an one-to-one 223 correspondence between a voxel and a DNN layer. It is a reasonable way that each feature map 224 is assigned an appropriate weight, which can be learned from training data directly by the 225 appropriate optimization algorithm.

226
For each specific image x i , its original multi-layer feature maps h i turn into the refined 227 multi-layer feature vector r i through the nonlinear feature refinement module. The predicted 228 response at the specific voxel is modeled as a linear combination of multi-layer features. Let us 229 denote the weight of multi-layer features as w. In order to predict the response of the specific 230 voxel to the natural image, the weight w is element-wise multiplied by the refined feature vector 231 r i . Formally: where r i ∈ R 1×K and w ∈ R 1×K , K donotes the number of total feature maps. There is tend 233 to be a voxel-wise bias b in practice, and we omit it for notational simplicity, as it does not play a 234 part in the validation accuracy. Finally, the mean-squared error can be formulated as below: where y i is the measured voxel-wise activity in response to image i,ŷ i is the predicted activity 236 of the model, B indicates the minibatch size.
where λ l and λ s are hyper-parameters. The data uesd in the present study are the public fMRI dataset vim-1 (Data are available at 255 https://crcns.org/data-sets/vc/vim-1.), which are described in detail in [4]. In summary, convolutional neural network and multivariate linear regression [13]. It starts with a 291 natural image, obtains feature maps in a pre-trained convolutional neural network, and 292 computes a weighted sum within the spatial extent of an 2D Gaussian receptive field.

293
Finally, it regresses all the feature maps onto brain activity simultaneously and outputs 294 accurate predictions, which outperforms other comparable encoding models to achieve the 295 state-of-the-art performance. However, fwRF only considers the fusion of diverse features 296 from DNNs and still neglects the drawbacks resulted from strong prior assumptions 297 (Gaussian assumption), like previous methods.

298
Voxel selection 299 Voxel selection is a crucial component to fMRI brain encoding, as plenty of voxels may not 300 respond to the visual stimuli in reality. A common approach is to choose those voxels to which 301 the model provided better predictability (encoding performance) during the training process.

302
The goodness of fit between model predictions and measured voxel activities is quantified by the 303 Pearson's correlation coefficient (PCC). For each voxel, the PCC is computed on the validation 304 set, and is finally an average of 5 runs with different data splits in our experiments. We select 305 voxels with positive PCC for further analyses, and the details of the selected data are 306 summarized in Table 1 (The details of the selected data from subject 2 are shown in S1 Table). 307 Figures in this study refer to data from subject 1 (except Fig 6). The AlexNet [18] architecture pre-trained on ImageNet dataset is exploited to initial both the 310 convolutional and fully-connected layers, and other parameters of the model are randomly Optimizer [36] with an initial learning rate of 0.00005 and early stopping is adopted.  To evaluate the encoding performance quantitatively, we use several standard similarity 319 metrics, including mean squared error (MSE), Pearson's correlation coefficient (PCC), and 320 coefficient of determination (COD, i.e. R 2 ). These metrics focus on different properties for the encoding performance. MSE is a common way to evaluate prediction performance in 322 machine learning, which focuses on the point-to-point prediction accuracy. Note that MSE is not 323 highly indicative of predictions, whereas PCC and FEV can take variable texture and goodness 324 of fit into account, which are more significant in neuroscience. We also performed the statistical 325 significance test (SST) of model prediction accuracy. For each voxel and each model type, the 326 Pearson's correlation coefficient between the model prediction and measured response above 327 0.27 is significant p < 0.001 relative to its null hypothesis distribution [4,13]. In the present 328 study, we use PCC to denotes the prediction accuracy, if there is no special instruction.  Previous neuroscience studies [21,24] have shown that the ventral and dorsal visual streams 343 are hierarchically organized, with early visual areas processing low-level visual features (such as 344 edges) and downstream visual areas processing increasingly complex visual features (such as 345 shapes). Does the hierarchical features of CNN have anything to do with the hierarchical visual 346 areas of brain? To answer this question, we analyzed the contributions of different CNN layers 347 to activity prediction in different brain regions-of-interest (ROI). As shown in Fig 5 (a) , the To assess the degree of consistency of encodability across subjects, we evaluated the feature 361 map-by-map or unit-by-unit similarity of the prediction accuracy between two subjects of Vim-1 362 dataset. To verify the capacity to estimate receptive fileds of our modeling approach, we intuitively 369 visualized the receptive field with increasing iterations across different ROIs. The results are 370 illustrated in Fig 7, which are the representative voxels from V1, V2, V3, V4, LO. On one hand, 371 it is easy to find that the receptive fields are smooth and localized for the particular voxels. The 372 receptive field shapes may not be regular for all voxels, whereas the main shapes can be clearly 373 distinguished. On the other hand, the preliminary outlines can be formed within 30 iterations   Fig 9. Comparison of prediction accuracy between the proposed method and the fwRF. Each of the six axes displays a comparison between the prediction accuracy of the two models in specific visual ROI (V3a and V3b are plotted in the same axe). In all six scatter plots, the ordinate and abscissa represent the prediction accuracy values of the proposed method and the fwRF. The blue dots indicate the voxels cannot be significantly encoded (under 0.27) by either of the two models. The red dots indicate the voxels that can be better predicted by the proposed method than the fwRF and vice versa for the cyan dots.  The sensitivity analysis is crucial to DNN-based models. Since weak prior assumptions psychophysical studies [38][39][40]. It inspired us to relate receptive fields of multiple voxels to 414 mimic this modulations and construct multi-voxel joint encoding models. Here, we use the 415 similarity loss to relate the adjecent voxels, as voxels in the same area tend to perform similar 416 computations. Taking two-voxel cooperative encoding as an example, for voxel j and voxel k, to 417 ensure their receptive fields as similar as possible, we use an L2 penalty on the difference 418 between receptive fields with the strength λ m : The joint objective function of the multi-voxel joint encoding model is formulated as follow: Here, T j and T k are the objective function of voxel j and voxel k, respectively, and L similarity 421 is the bond between two voxels. This bond can be extended to three or more voxels according to 422 appropriate definitions of voxel neighborhood. In the present study, we did a preliminary 423 verification of the multi-voxel joint encoding.

424
In consideration of computational similarities in the same visual area, we define a voxel 425 neighborhood that consists of three voxels according to their location index in the Vim-1 dataset. 426 Those voxels (up to three voxels) whose location index are adjacent are chosen to be jointly  The number of significant voxels in the multi-voxel joint encoding is 331 while that in the single-voxel encoding model is 178. Results in both early visual areas and higher visual areas demonstrate that the multi-voxel joint encoding is conducive to rescuing voxels with poor signal-to-noise characteristics of each voxel. We infer that the increased expressiveness is made possible by this space-feature 483 separability.

484
Limitations of the proposed model 485 While the proposed modeling approach exhibits good encoding performance, there is still an 486 imperfection worthy of consideration: the measurement of receptive field size. The proposed 487 modeling approach yields an explicit receptive fields with weak prior assumptions, which 488 provides pros and cons. It facilitates the expressiveness of model, whereas brings the difficulties 489 to the measurement of receptive field size. The receptive field size is different from classical 490 population receptive field (pRF) [6]. The differences underline the fact that the eatimation of 491 receptive field in our method depends entirely upon the feature maps. As the proposed modeling 492 approach does not impose an strong assumptions (regular shapes) on receptive fileds, it limits 493 the measurement of receptive field size to a certain esxtent. However, for irregular receptive 494 field, we can supply an alterbnative measurement method, which is the blob detection. In light 495 of different pixels of the receptive field, blob detection can detect an approximate regular shape, 496 such as the Gaussian shape (areas circled by red line in Fig 7 (e)).

497
The receptive field structure 498 The receptive field presented here is a mask with sparsity and smoothness regularizations, 499 and the salient characteristic is that its shapes can be irregular. While keeping the shape regular 500 has the advantage of reducing model parameters of the model, it also reduces the expressiveness. 501 In particular, the other appropriate operators instead of Laplacian operator can be adopted in our 502 method, exploring more appropriate receptive field structures. For example, the Laplacian of 503 Gaussian operator may allow the model to explicitly capture receptive fields with a "Mexican 504 hat" profile that enforce a suppressive band around an excitatory center. Furthermore, the 505 receptive field with any appropriate weak prior assumptions could be trained using the proposed 506 modeling approach presented here. There may be a more optimal assumption about receptive 507 field structure, which needs future studies to confirm.

508
Encoding in higher visual areas (V4 and LO) 509 It is worth noting that all the presented methods obtain good performance in early visual 510 areas (V1,V2 and V3) while little effect on higher visual areas (V4 and LO). Nevertheless, the 511 proposed modeling approach is still superior to other methods, which makes progress in higher 512 visual areas. However, it is meaningful to build an effective encoding model in higher visual 513 areas, as it may reveal the high-level visual information processing in the visual cortex. Deep 514 neural network has mapped the function of the human visual cortex [35] and revealed a gradient 515 in the complexity of neural representations across the ventral stream [21]. The convolutional 516 layer encoded location and orientation-selective features while the fully-connected layer 517 encoded semantic features. In order to make progress in higher visual areas, perhaps more 518 attention to the fully connected layer will make sense. The receptive fields easimated in our 519 model focus on the convolutional layer, whereas the fully-connected layer are neglected. It is 520 worth our while further improving the encoding performance of our proposed modeling 521 approach with appropriate operations on the fully-connected layer.