A controller-peripheral architecture and costly energy principle for learning

Complex behavior is supported by the coordination of multiple brain regions. How do brain regions coordinate absent a homunculus? We propose coordination is achieved by a controller-peripheral architecture in which peripherals (e.g., the ventral visual stream) aim to supply needed inputs to their controllers (e.g., the hippocampus and prefrontal cortex) while expending minimal resources. We developed a formal model within this framework to address how multiple brain regions coordinate to support rapid learning from a few example images. The model captured how higher-level activity in the controller shaped lower-level visual representations, affecting their precision and sparsity in a manner that paralleled brain measures. In particular, the peripheral encoded visual information to the extent needed to support the smooth operation of the controller. Alternative models optimized by gradient descent irrespective of architectural constraints could not account for human behavior or brain responses, and, typical of standard deep learning approaches, were unstable trial-by-trial learners. While previous work offered accounts of specific faculties, such as perception, attention, and learning, the controller-peripheral approach is a step toward addressing next generation questions concerning how multiple faculties coordinate.

as is common in machine learning (ML) systems (e.g., end-to-end optimization by gradient descent learning).

Figure 1: The controller-peripheral architecture provides a general framework for how different brain regions coordinate while performing a task. (A) Peripherals aim to supply their controllers with the information they require while expending minimal resources (i.e., the costly energy principle). Here, we illustrate a number of possible arrangements of controllers and peripherals. (ii) A single controller with multiple peripherals could offer an account of multi-modal integration, such as the convergence of visual and somatosensory signals in parietal cortex (Lewis and Van Essen, 2000) or semantic hubs in the anterior temporal lobe (Jackson et al., 2021). (iii) Conversely, multiple controllers with a single peripheral could model eye movements, in which multiple controllers related to visual search, obstacle avoidance, social cognition, etc., share this perceptual resource. Controllers and peripherals can also be arranged hierarchically, as in (v); this arrangement is consistent with hierarchical accounts of the ventral visual stream in object recognition. (B) We use the controller-peripheral architecture to develop a model that can learn concepts from a few visual examples. To simplify, we assume a single controller involving the hippocampus and ventromedial prefrontal cortex (vmPFC) and a single peripheral involving the ventral visual stream. The model captures how higher-level goals and outcomes shape activity throughout the ventral visual stream, which aims to provide its controller with needed information while minimizing resource expenditure (i.e., the costly energy principle).

Model Overview
One key aspect of the controller-peripheral account is that the peripheral aims to provide the controller with the information it needs while expending minimal resources (i.e., the costly-energy principle). The peripheral altered its operation in response to the controller by adjusting its attention weights. The principle is reflected by an l1 penalty on the sum of the peripheral attention weights. We consider the sparsity of the attention weights (i.e., the proportion that are zero) in the DNN module as an indicator of resource expenditure.

Figure 2: The controller-peripheral framework, consisting of a clustering module capturing HPC and PFC and a DNN module capturing the ventral visual stream, captures human category learning behavior in Shepard et al. (1961). (A) Six category learning tasks in which human participants learned to classify geometric shapes into one of two categories, where stimuli were made up of three binary-valued features (color, shape, and size). Typically, models operate over hand-coded three-dimensional vectors. Instead, we trained on actual images, in this case using the insect stimuli from Mack et al. (2016), a replication of these classic learning problems with image stimulus inputs. The peripheral part of the model, reflecting the ventral visual stream, extracts these three higher-level dimensions for the controller (Fig. 1B). (B) The model (right) captured the difficulty ordering of the six categorization problems described in Shepard et al. (1961) (left). Probability of error is plotted as a function of learning block for each problem type. (C) The controller exhibited the same attention strategies as SUSTAIN, solving Type I by attending to one dimension, Type II by attending to two dimensions, and Types III-VI by attending to all three dimensions.
The model successfully captured human learning performance (Fig. 2B). Despite the system taking images as inputs, the controller's clustering solutions

was higher for learning problems with fewer relevant dimensions, and these two factors interacted such that problems with fewer relevant dimensions (i.e., lower complexity) showed greater compression over learning (Fig. 3A). We evaluated whether a similar relationship exists between the controller's attention weights and vmPFC. Unlike previous models, the attention mechanism in the controller is part of a control system directing the DNN peripheral. We

2020), which revealed a vmPFC region that showed a significant interaction between learning block and problem complexity. Neural compression increased over learning blocks and was higher for learning problems with fewer relevant dimensions (each fMRI run consists of four learning blocks; see the original paper for more details). (B) Functional correspondence between the clustering module of the controller-peripheral system and vmPFC in the human brain. The clustering module deploys attention strategies (in terms of attention compression) that track the degree of neural compression in vmPFC over learning across category structures of varying complexity.

should be more precisely coded, whereas those not attended by the controller can also be ignored by the peripheral in accord with the costly energy principle. We observed this pattern in the outputs of the peripheral (Fig. 4A).

Figure 4: Performance of the DNN peripheral and its relation to LOC activity during learning. (A) Following the controller's needs and the costly energy principle, task-relevant features (shaded) are more precisely coded than task-irrelevant features (unshaded).
(B) The error rate for a classifier applied to LOC activity to discriminate (decode) between pairs of stimuli mirrored the precision of the peripheral's feature outputs, consistent with our claim that the peripheral's advanced layers correspond to LOC. (C) Following the costly energy principle, the fewer the relevant features for a learning problem (VI > II > I), the more zero-valued peripheral attention weights there are.

Peripheral activity aligns with neural representation in the ventral visual stream
In contrast to cognitive models like SUSTAIN that take handcrafted features as inputs, our peripheral model processes images to provide a stimulus coding for the controller. Previous work from Braunlich and Love (2019) found that more attended features (according to SUSTAIN) were better decoded from the LOC's BOLD response. Removing this featural assumption, we trained linear support vector classifiers to discriminate each pair of stimuli for each task based on LOC activity. We predicted that classifier error should track mean information loss in the peripheral's feature outputs. When the controller demands precise inputs, the peripheral should provide them and, accordingly, we predicted LOC activity would better discriminate between items. As predicted, decoding error (i.e., neural information loss) tracked peripheral information loss, which was greatest for Type I (one feature relevant), followed by Type II (two features

The peripheral's operation is modulated by the needs of the controller, which change over learning. Therefore, as the controller optimizes its high-level attention for the learning task, the peripheral's attention weights should adjust, which in turn should affect the precision of the peripheral's feature outputs.

As predicted, we found that as the controller's attention to a feature decreases, the information loss in the peripheral's feature output increases, and that as the

and, unlike humans, they lack the ability to continually and rapidly adapt to changing environments. One use of the controller-peripheral framework is to repurpose engineering models to better suit neuroscience and perhaps in turn offer insights to machine learning. Rather than directly importing or refining models from machine learning, perhaps the controller-peripheral architecture can shift the emphasis in neuroscience to considering how multiple models or modules interrelate to develop more encompassing theories of brain function.

was originally trained to map real-world images to one-hot vectors across 1,000 pre-defined categories. In our work, this DNN module is to be integrated with a controller (a clustering module), which requires the VGG-16 to be fine-tuned such that the DNN module outputs three-dimensional vectors whose dimensions encode psychological features of the stimulus, such as "0" for thin legs and "1" for thick legs on the first dimension (see full mapping in Appendix D, Table S.6). The fine-tuning process can be viewed as akin to familiarizing human participants with the experimental stimuli.

To fine-tune the DNN module appropriately, we preserved the layers of VGG-16 that are believed to correspond to regions along the ventral visual stream up to and including LOC (Xu and Vaziri-Pashkam, 2021). We then replaced the layers succeeding "block4_pool" with a three-unit fully-connected layer (301,056 connection weights; randomly initialized using the Glorot uniform distribution) whose

Peripheral attention layer. We inserted a goal-directed attention layer between the "block4_pool" layer and the fine-tuned output layer of the fine-tuned DNN. The peripheral attention layer was implemented in the same fashion as Luo et al. (2021). Additionally, we applied l1 regularization on the parameters (weights) of the attention layer; in accord with the costly-energy principle, this enforces task-driven sparsity over the attention weights. We defined attention modulation as the Hadamard product (filter-wise multiplication) between the preceding layer's activations and the attention weights. Formally, we denote the pre-attention activation for a given stimulus from a DNN layer as x_n, where x_n ∈ R^{H×W×F} (H and W are the spatial dimensions of the representation and F is the number of filters). We denote the corresponding attention weights as g ∈ R^F. The attention modulation is then defined as

x*_n = x_n ⊙ g,

where g is broadcast across the spatial dimensions and x*_n are the post-attention activations that will be passed onto the output layer.
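As a concrete illustration, the filter-wise modulation and the l1 cost can be sketched in NumPy (the array shapes, helper names, and penalty weight `lam` are illustrative assumptions; in the model this operates as a layer inside the fine-tuned VGG-16):

```python
import numpy as np

def attention_modulate(x, g):
    """Filter-wise attention modulation (Hadamard product along the filter
    axis): x has shape (H, W, F), g has shape (F,). Broadcasting multiplies
    every spatial location's activation vector by g."""
    return x * g

def l1_penalty(g, lam=0.01):
    """Costly-energy term: scaled l1 norm of the attention weights."""
    return lam * np.sum(np.abs(g))

# Toy example: the third filter is fully suppressed by a zero attention weight.
x = np.ones((2, 2, 3))           # pre-attention activation, H = W = 2, F = 3
g = np.array([1.0, 0.5, 0.0])    # attention weights, one per filter
x_star = attention_modulate(x, g)
```

A zero attention weight removes that filter's contribution everywhere, which is exactly what the sparsity measure (proportion of zero weights) counts.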

The hidden layer is initialized with no clusters, and new clusters are recruited based on the difficulty of the task. For a given stimulus (the output from the DNN), each cluster is activated according to its psychological similarity to the stimulus:

H^act_j = exp(−c (Σ_i α_i |h_ji − a^in_i|^r)^{q/r}),

where |h_ji − a^in_i|^r is the dimensional distance between the center of the cluster and the stimulus, α_i is the controller's attention weight on dimension i, and c is a specificity parameter. In this work, we set r = 2 and q = 1 (Euclidean distance).

Clusters compete to respond to input patterns and in turn inhibit one another through a normalization of their activations governed by an inverse temperature parameter t; when t is large, inhibition is weak. Contrary to SUSTAIN's winner-take-all (WTA) scheme, where only the activation of the most activated cluster is passed onto the output layer, we allow all clusters to contribute to the model's decision, subject to a normalization with a decision parameter u. Adjusting u changes how much the module's decision is dominated by the most active clusters, allowing the module to capture different processes the brain might implement.
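The distance-based cluster activation can be sketched as follows, assuming a SUSTAIN-style exponential similarity with a specificity parameter `c` and attention weights `alpha` (both assumptions; the text specifies only the dimensional distance and r = 2, q = 1):

```python
import numpy as np

def cluster_activation(stimulus, centers, alpha, c=1.0, r=2, q=1):
    """Distance-driven cluster activation, assuming the form
    act_j = exp(-c * (sum_i alpha_i * |h_ji - a_i|^r)^(q/r)).
    stimulus: (D,); centers: (J, D) cluster positions; alpha: (D,) attention."""
    dist = np.abs(centers - stimulus) ** r             # (J, D) dimensional distances
    weighted = (dist * alpha).sum(axis=1) ** (q / r)   # attention-weighted Minkowski distance
    return np.exp(-c * weighted)

# With attention on only the first dimension, only that dimension's mismatch matters.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
alpha = np.array([1.0, 0.0, 0.0])
acts = cluster_activation(np.array([1.0, 0.0, 0.0]), centers, alpha)
```

Here the second cluster matches the stimulus on the attended dimension and so is maximally activated, even though it mismatches on the two ignored dimensions.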

Every cluster has association weights connecting it to the output layer; the activation of output-layer unit k is

C^out_k = Σ_i w_{i,k} H^out_i,

where w_{i,k} is the association weight from cluster i to output unit k. Association weights are trainable parameters of the module and are initialized at zero. Output activations are further converted to a probability response, where the probability of a given stimulus belonging to category k is the magnitude

Each cluster has a measure of "support" (i.e., consistency) for the correct response, which is determined by the direction and relative magnitude of its association weights, as opposed to their absolute magnitude, which would advantage older clusters with established association weights.
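A minimal sketch of mapping cluster outputs to a category response; the softmax with decision parameter `u` is an assumed form of the probability rule (the paper's exact response rule may differ), and all names are illustrative:

```python
import numpy as np

def category_response(h_out, W, u=1.0):
    """Map cluster outputs to category probabilities.
    h_out: (J,) cluster outputs; W: (J, K) association weights;
    u: decision parameter scaling how deterministic the response is."""
    c_out = h_out @ W                       # output-unit activations C^out_k
    e = np.exp(u * (c_out - c_out.max()))   # numerically stable softmax
    return e / e.sum()
```

Larger `u` concentrates probability on the most activated output unit; smaller `u` yields a more graded response.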

The totalSupport (between −1 and 1 inclusive) of the current clustering, which determines whether a new cluster is recruited (one is recruited should totalSupport fall below some threshold parameter), is

totalSupport = Σ_i H^out_i · support_i / Σ_i H^out_i,

where H^out_i is the output of cluster i and support_i (between −1 and 1 inclusive) is the support from cluster i, defined as

support_i = (w_{i,correct} − w_{i,incorrect}) / (|w_{i,correct}| + |w_{i,incorrect}|),

where w_{i,correct} is the association weight from cluster i to the correct output unit and w_{i,incorrect} is the association weight to the incorrect output (i.e., response) unit.
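The recruitment check can be sketched as follows; the normalized per-cluster support and the default threshold value are assumptions consistent with the description above:

```python
import numpy as np

def total_support(h_out, w_correct, w_incorrect):
    """Activation-weighted average of per-cluster support, where each
    cluster's support reflects the balance of its association weights
    rather than their absolute size."""
    denom = np.abs(w_correct) + np.abs(w_incorrect)
    support = np.where(denom > 0,
                       (w_correct - w_incorrect) / np.maximum(denom, 1e-12),
                       0.0)
    return (h_out * support).sum() / h_out.sum()

def should_recruit(h_out, w_correct, w_incorrect, threshold=0.0):
    """Recruit a new cluster when the current clustering's support for
    the correct response falls below the threshold."""
    return total_support(h_out, w_correct, w_incorrect) < threshold
```

Because support is normalized per cluster, a young cluster with small but consistent weights can support a response as strongly as an old cluster with large weights.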

Loss optimization. After the cluster recruitment step, the parameters of the clustering module, namely the association weights, the attention weights, and all cluster positions, are updated via gradient descent to minimize the global categorization loss together with a regularization loss. The first half of the loss is the cross-entropy error −Σ_k y_k log p_k between a stimulus label y_n = (y_1, y_2, ..., y_k)^T and its prediction by the module, p_n = (p_1, p_2, ..., p_k)^T. The
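As a sketch, the objective combines cross-entropy with a regularization term (the exact regularized quantity and strength `lam` are assumptions; the paper applies l1 regularization to attention weights):

```python
import numpy as np

def controller_loss(y, p, attn, lam=0.01):
    """Global categorization loss plus a regularization term.
    y: one-hot label (K,); p: predicted probabilities (K,);
    attn: attention weights being regularized (assumed l1 form)."""
    cross_entropy = -np.sum(y * np.log(p + 1e-12))  # first half: categorization error
    return cross_entropy + lam * np.sum(np.abs(attn))
```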

Training procedures are identical across these models, but they differ in their optimization objectives. Model 2, which follows the costly-energy principle but not the controller-peripheral framework, is optimized with respect to the following loss: the only difference from Model 1 is that in Model 2, both the DNN module and the clustering module are updated to minimize the global categorization error (the first half) in addition to the l1 regularization loss on the peripheral attention weights (the second half). Model 3, which follows the controller-peripheral framework but not the costly-energy principle, differs from Model 1 only in that it does not have the regularization term on the peripheral attention weights. Model 4, which follows neither the controller-peripheral framework nor the costly-energy principle, is optimized with respect to the global categorization loss without the l1 regularization over the peripheral attention weights.

geometric shape stimuli used in the original study (Fig. 2A)

Intuitively, the compression score is formulated as a normalized entropy (bounded between 0 and 1 inclusive) indexing the dispersion of attention across stimulus dimensions. If task complexity is high, requiring attention to multiple features, attention weights will be less selective, which will lead to a low compression score. If task complexity is low, attention will be allocated to some features more than others, which will lead to a high compression score.

In the model, information loss can be directly computed using network activities, in that we have access to stimulus representations both before and after learning. In the brain, however, such a direct measure is not available.
Therefore, we used multivariate pattern analysis (MVPA) to determine how linearly separable (i.e., how confusable) the neural activity patterns in LOC are for every pair of stimuli in each task (decoding error; 1 − decoding accuracy). Intuitively, neural patterns of stimulus pairs that differ by a task-relevant dimension should be more easily separable (less confusable) than stimulus pairs that differ

were estimated using an event-specific univariate GLM approach (see Sec. 6.5.3 for details). We fitted the support vector classifiers using a three-fold cross-validation procedure (the first run was excluded from training because learning at the start could be unstable). We computed the average decoding error (1 − classifier accuracy) over stimulus pairs and participants across tasks.
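The decoding analysis can be sketched with scikit-learn (a sketch only: the array names and trial counts are assumptions; the actual analysis used event-specific GLM pattern estimates, with the first run excluded):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def pairwise_decoding_error(patterns_a, patterns_b, n_folds=3):
    """Decoding error (1 - cross-validated accuracy) for discriminating
    two stimuli from their neural activity patterns.
    patterns_*: arrays of shape (n_trials, n_voxels)."""
    X = np.vstack([patterns_a, patterns_b])
    y = np.array([0] * len(patterns_a) + [1] * len(patterns_b))
    acc = cross_val_score(LinearSVC(), X, y, cv=n_folds).mean()
    return 1.0 - acc
```

Averaging this quantity over stimulus pairs and participants gives the task-level decoding error compared against the peripheral's information loss.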

To test the prediction that information loss in both brain and model would

show correctly ordered learning curves. However, neither of them exhibits con-

and is optimized based on the costly-energy principle, we consider three alternative models which lack one or both elements. Models 2 and 4, which lack the controller-peripheral architecture, are optimized to minimize the standard global categorization error. Models 3 and 4, which lack the costly-energy principle, have no constraints on the perceptual attention weights of their peripheral module. (B-C) None of the three alternative models (Models 2-4) can account for the human learning behavior of Shepard et al. (1961) or show task-specific resource expenditure patterns like those found in Ahlheim and Love (2018). For Model 2, while the energy expenditure (in terms of the sparsity of the learned peripheral attention weights) across problem types is in the right order, it is not as pronounced as in Model 1, and Model 2 cannot capture human behavior with global error optimization. Models 3 and 4 show the correct difficulty ordering of the six problem types, but they do not demonstrate energy expenditure patterns that reflect task difficulty.

We reviewed participant behavior in the Type I problem, where only one feature is relevant, and found that when "mandible" was the relevant feature, response time was significantly longer and categorization accuracy was significantly lower than when either of the other dimensions was relevant.