Targeted V1 comodulation supports task-adaptive sensory decisions

Sensory-guided behavior requires reliable encoding of stimulus information in neural populations, and flexible, task-specific readout. The former has been studied extensively, but the latter remains poorly understood. We introduce a theory for adaptive sensory processing based on functionally-targeted stochastic modulation. We show that responses of neurons in area V1 of monkeys performing a visual discrimination task exhibit low-dimensional, rapidly fluctuating gain modulation, which is stronger in task-informative neurons and can be used to decode from neural activity after few training trials, consistent with observed behavior. In a simulated hierarchical neural network model, such labels are learned quickly and can be used to adapt downstream readout, even after several intervening processing stages. Consistently, we find the modulatory signal estimated in V1 is also present in the activity of simultaneously recorded MT units, and is again strongest in task-informative neurons. These results support the idea that co-modulation facilitates task-adaptive hierarchical information routing.


Encoding of local visual orientation in a V1 population
[...] on and off on screen and can change their orientation. One stimulus is selected as relevant, and the monkey must report changes in its orientation with a saccadic eye movement. B) The recorded population of V1 neurons has receptive field centers (gray) within the receptive field of a simultaneously recorded MT unit. Two of the three stimulus locations are within the MT unit's receptive field ("relevant", purple) and one is in the opposite hemifield ("control", black). C) Distribution of behavioral performance across blocks, quantified by the % hits. D) Behavioral performance as a function of time within a block, binned using 5 consecutive trials; the boxes mark the 25% and 75% quantiles, points indicate different blocks, and the red star indicates a significant difference in means (relative two-sided t-test, p = 0.015). E) The distribution of firing rates over all stimulus presentations, to each of the two task stimuli, for three example neurons with different $d'$ values. F) $|d'|$ distribution over all blocks of relevant tasks and all V1 neurons (shade). Lines indicate sub-distributions of neurons with significant informativeness (purple), and neurons in the control task (black). G) Relationship between the informativeness values in relevant and control tasks. A and B adapted from [18].
Neurons in V1 respond selectively to the local orientation of visual stimuli, and the selectivities of the full pop- [...]

[...] variability is neuron-specific, we also find weak, structured correlations between pairs of units, which suggest additional sources of shared noise not captured by the model (Suppl. Fig. S2).

The modulator coupling is dissociable from traditional attentional effects on mean firing rate (Suppl. S7), which have been suggested to improve encoding precision of particular attended stimuli [27], and it cannot be explained by neural adaptation, as the degree of adaptation was uncorrelated with the quality of the fit of the modulated-SR model (Suppl. S8). Finally, the modulator structure cannot be explained by the fact that the response measurements are in the form of multi-unit spike counts (Suppl. S9). Overall, our analysis reveals that V1 responses are modulated by a common fluctuating signal, and that the strength of this modulation in each unit reflects its task-informativeness.
From an encoding perspective, this seems counter-intuitive (Suppl. S10). Why would the brain inject noise specifically in the few neurons that matter most?

Figure 3: A) The average response of neurons of the three subpopulations to two task stimuli. There are 12 informative, 38 uninformative and 4950 inactive neurons. B) Effects of increasing modulator strength on encoding and decoding, respectively, for modulator coupling weights equal to informativeness. Encoding is measured by the SNR, while decoding precision is quantified as the variance of the decoding weights of the modulator-guided decoder. C) Performance of three different decoders in simulations of a discrimination task with 1000 model V1 neurons, 50 informative, with increasing relative modulator strength (mean and 95% confidence interval). D) Same comparison as in C, but with the modulator coupling weights corrupted by Gaussian noise, as shown in the right panel. E) Decoder performance comparison for simulated multiunits, obtained by summing the activity of random pairs of neurons.
The modulator fluctuates rapidly, allowing any task information it provides to be accessed quickly, potentially on the time scale of single trials. We hypothesize that the modulation serves to "label" the responses of the task-relevant V1 subpopulation, so that downstream circuits can easily identify and use these signals.

To analyze the decoding process, we simulated an encoding model that captures the essence of the response properties observed in the V1 data. For this, we use a variant of the modulated-SR model with static stimulus-dependent firing rates, and one shared, temporally-independent stochastic modulator $m_t$ (see Methods, and [9]). In this model, $k_{n,t}(s)$ is the spike count of neuron $n$ at time $t$ in response to stimulus $s$; the modulator $m_t$ is drawn independently from a Gaussian distribution with zero mean, and influences neuron $n$ with coupling weight $c_n$, which is set to be proportional to the neuron's task-informativeness. Finally, since the degree of modulation affects not only variability but also mean responses, we explicitly correct for the mean increase to isolate the effects of modulator-induced co-variability (see Methods).
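The encoding model described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions: the exponential form of the gain, the way the mean is corrected, and all function and parameter names (`simulate_counts`, `mod_sd`, `fano_factor`) are hypothetical choices made for illustration, not the form given in the Methods.

```python
import math
import random


def poisson(rate, rng):
    """Draw one Poisson sample with Knuth's method (adequate for small rates)."""
    limit = math.exp(-rate)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate_counts(rates, couplings, mod_sd, n_bins, rng):
    """Simulate spike counts driven by one shared Gaussian modulator.

    ASSUMPTION: the gain of neuron n is exp(c_n * m_t), divided by its
    expected value exp((c_n * mod_sd)**2 / 2) so that the modulator changes
    the variability of the counts but not their mean.
    Returns (modulator, counts) with counts[t][n].
    """
    modulator = [rng.gauss(0.0, mod_sd) for _ in range(n_bins)]
    counts = []
    for m in modulator:
        row = []
        for lam, c in zip(rates, couplings):
            gain = math.exp(c * m - 0.5 * (c * mod_sd) ** 2)
            row.append(poisson(lam * gain, rng))
        counts.append(row)
    return modulator, counts


def fano_factor(samples):
    """Variance-to-mean ratio; close to 1 for unmodulated Poisson counts."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
    return var / mean
```

Calling `simulate_counts([5.0, 5.0], [1.0, 0.0], 0.5, 5000, random.Random(0))` and comparing `fano_factor` for the two neurons should show supra-Poisson variability only in the neuron with non-zero coupling, while both keep a mean count near 5.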

Given this encoding model and a binary discrimination task, $s \in \{0, 1\}$, the ideal observer's optimal decoder compares a weighted sum of the neural responses with a modulator-specific decision threshold, $c(m_t)$ (see Methods). Here $a_n^{(\mathrm{opt})} = \log\lambda_n(1) - \log\lambda_n(0)$ denotes the optimal decoding weights. These are independent of the modulator and equivalent to those derived from an independent Poisson model. The decoding weights are non-zero only for the small subpopulation of informative neurons (Fig. 3A, purple), with their signs indicating preference between the two stimulus alternatives. Zero weights eliminate active but uninformative (Fig. 3A, black) or inactive (Fig. 3A [...]

The MG decoder does not rely on knowledge of the response properties of the encoding population, but it assumes access to the modulator (e.g., it is a broadcast signal). This has important implications for learning the decoder; [...]
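To make explicit why the optimal weights stated above do not depend on the modulator, the following derivation may help. It is a sketch under an assumption: counts are taken to be conditionally independent Poisson with rate $\lambda_n(s)\, g_n(m_t)$, where $g_n$ is a placeholder for a positive multiplicative gain whose exact form is not specified here.

```latex
\begin{aligned}
\log p(k_t \mid s, m_t)
  &= \sum_n \Big[ k_{n,t} \log\big(\lambda_n(s)\, g_n(m_t)\big)
     - \lambda_n(s)\, g_n(m_t) - \log k_{n,t}! \Big], \\
\log \frac{p(k_t \mid s = 1, m_t)}{p(k_t \mid s = 0, m_t)}
  &= \sum_n k_{n,t}\,
     \underbrace{\big[\log \lambda_n(1) - \log \lambda_n(0)\big]}_{a_n^{(\mathrm{opt})}}
     \;-\; \underbrace{\sum_n g_n(m_t)\,\big[\lambda_n(1) - \lambda_n(0)\big]}_{c(m_t)} .
\end{aligned}
```

The gain cancels inside the logarithm, so it drops out of the weights and survives only in the threshold term. With equal priors, the decoder reports $s = 1$ whenever $\sum_n a_n^{(\mathrm{opt})} k_{n,t} > c(m_t)$.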

We compared the performance of different decoders in a binary discrimination task, based on simulated responses of a large population of V1 neurons with a small fraction of informative neurons (5%, Fig. 3A; see also Suppl. S10D for variations in percentage of informative neurons). The statistically optimal decoder corresponds to the ideal observer's solution, and thus provides an upper bound on achievable performance; the SO decoder provides a lower bound.

The optimal decoder's accuracy deteriorates as the modulator increases in amplitude, corrupting the encoded signal [...] accuracy close to that of the ideal observer, a result that is robust to variations in population size (Suppl. S10).

In practice, the performance of the MG decoder could depend on how strongly correlated the modulator couplings, $c_n$, are with task-informativeness. To test the robustness of the MG decoder, we weakened the correlation between the modulator couplings, $c_n$, and task-informativeness by adding noise to $c_n$. We found that although performance decreases overall, the nonmonotonic dependence of the MG decoder performance on modulator strength is preserved (Fig. 3D). Given that our measurements mostly include multiunits, we also tested their impact on decoding and found that the results are qualitatively robust to such measurement noise (Fig. 3E). Interestingly, the optimal modulation amplitude generally shifts towards the range estimated from the data, suggesting that physiologically, the degree of modulation may be well-matched to the precision of the modulator targeting.

Figure 4: V1 modulator is task-specific and facilitates decoding. A) The distribution of relative modulator strengths across all relevant task blocks (purple) and all control task blocks (black). The star indicates a significant difference between the two distributions (U-test, p < 0.001). B) Same as A, but comparing across the two relevant tasks (p = 0.45). C) The distribution of correlation coefficients between modulator coupling (green) or residual response variance (blue) and the residual behavioral relevance of a unit's activity (correlation with behavior), obtained by regressing out informativeness and mean firing rate. D) Decoding from the recorded V1 population: performance of different decoders or logistic regression for an example block population with increasing number of training samples (mean ± SEM); star indicates significant differences between the optimal and the MG decoder. E) Performance with minimal training against minimal number of training samples (stimulus presentations) needed to reach above chance (50%) performance, for each block; stars indicate significant differences between the optimal and the MG decoder. F) Decoding weights estimated with maximum training (90% of all stimulus presentations) versus with minimal training (1%) for various decoders; same colors as D and E.

V1 modulator is task-specific and facilitates decoding
In our experimental context, the theory predicts that the co-variability of neural responses should change based on whether they are task-informative. Given that the recorded V1 population is informative in the relevant tasks but not the control task (Fig. 1G), we expect differences in overall modulator strength across tasks and in individual modulation strengths across neurons. Indeed, the overall strength of the estimated modulation significantly decreases in the control task, both in absolute terms and relative to stimulus-induced variations (Fig. 4A and Suppl. S12).

In comparison, the two relevant task conditions have indistinguishable statistics of overall modulation strength (Fig. 4B). Our theory explains this difference as a change in labeling, from the recorded subpopulation that is informative for the relevant tasks, to a different (unrecorded) subpopulation that is informative in the control task.

The comparison between the two relevant tasks is limited by the proximity of the two relevant stimulus locations, as only a few units are exclusively informative in one task (see Sec. 2.1). However, despite the reduced sample size, we find a significant correlation between the difference in informativeness in the two relevant tasks and the difference in coupling (Spearman correlation, r = 0.16 with p < 0.05), showing that units that are more informative in one of the two tasks also tend to have higher coupling in that task.

In our framework, decoding weights are approximated by estimating coupling strengths, and thus neurons with large coupling (which are therefore strongly modulated) should have a stronger influence on behavior. Despite V1's early position in the visual processing stream, we find this to be true in our data; 91% of blocks show significant correlations (Spearman r, α = 0.05) between modulator coupling and a unit's correlation with the monkey's behavior, computed as a $d'$ of neural responses with categories defined by the animal's choices rather than stimulus identity (see Methods).

Potential confounds in this analysis are not only overall firing rates, but also the informativeness of a unit, as the most informative neurons would be expected to have a stronger influence on behavior [34; 35]. Nonetheless, even after controlling for these confounds, it remains the case that units that are more modulated are the ones that are also more predictive of behavior (Fig. 4C). This relationship is not present for the residual response variance (Fig. 4C).

Furthermore, we do not find a relationship with behavioral correlation in other shared noise sources (Suppl. S13), which suggests that the shared modulator-induced fluctuations are particularly relevant for downstream processing.

The most direct prediction of the theory is the ability of the MG decoder to set appropriate decoding weights for the recorded V1 responses, and to do so rapidly, with limited data. To test these predictions, we decoded the stimulus identity from V1 responses using our heuristic MG decoder and compared its performance with that of the ideal observer for the estimated (modulated-SR) encoding model. When all available data is used for estimation, the MG decoder performance is close to that of the optimal decoder (∼80% correct), which suggests that the strength and targeting precision of the estimated modulator are sufficient to guide decoding.

The optimal decoder provides an upper bound on decodability assuming perfect knowledge of the V1 response properties, but it can still perform poorly when the model is estimated from limited data; in fact, its performance is at chance in the low-data regime (Fig. 4D). Similarly, learning decoding weights directly through logistic regression requires many training trials before performing above chance (Fig. 4D). In contrast, the modulator-guided (MG) [...] regression in the small training sample regime (comparing MG against either learned optimal or regression-based decoder significant; t-test, p < 0.0001, see Fig. 4D). We quantify this effect across all data and find that the MG decoder reaches above-chance performance significantly faster than the learned optimal decoder (t-test, p < 0.0001, Fig. 4E) and that the performance attained with minimal training is significantly higher relative to that of the learned optimal decoder (t-test, p = 0.01). The MG decoder also reaches above-chance performance significantly faster than a regression-based decoder (t-test, p < 0.001), and the learned optimal and regression-based decoders do not differ significantly (t-test, p > 0.05 for minimal training and performance). Our theory predicts that the advantage of the MG decoder lies in its ability to accurately estimate the decoding weights quickly. Indeed, we find a strong correlation between the MG decoding weights obtained with minimal training and those estimated from all available data, but this relationship does not hold for the learned optimal decoding weights or the regression weights (Fig. 4F).

Although significant, the difference in the number of trials required for above-chance performance may seem small. [...]

Visual information processing is hierarchical, and task-relevant information needs to propagate through several stages before reaching decision-making areas. Moreover, since receptive field sizes increase across stages of processing [36], localized task-specific information will diffuse in subsequent visual layers, making the task of identifying the subpopulation of relevant readout neurons even harder. Thus, the decoding problem identified in V1 persists, and likely worsens, in downstream areas. As a separate issue, while thus far we have assumed the correct modulator targeting to be already present in the circuit, the right degree of modulation for each neuron in a task needs to also be learned from experience. Can the modulator-guided readout still facilitate flexible and accurate task performance [...]

[...] the last layer are combined with gain terms $g_n$, which tune the readout of the decision circuit to the specific task (Fig. 5D). As for the MG decoder in Eq. 3, these gains are adaptively computed using the correlations between the individual neural responses and the modulator, which is again assumed to be available at the decision stage. We optimize the modulator coupling strengths to maximize behavioral performance on the task, using explicit trial feedback (via backpropagation). The general rationale is that if task-informative neurons can be modulator-labeled in the V1 stage, then this labeling will be inherited downstream by exactly those neurons that receive their signal. Thus their co-variability can guide decoding at the decision layer.

We assess the efficiency of the modulator-based solution by comparing it to two alternative models, both of which adapt based on experience within the task, but which differ in their parameter complexity. At one extreme, we consider a system that relearns the connection strengths between all layers de novo ("retraining"). This approach corresponds conceptually to the regression model in Fig. [...]

[...] alternative models, fine-tuning the network via the modulator substantially reduces the amount of task-training required to reach criterion performance (Fig. 5G).
The improvement in performance of the modulator solution over regression-based relearning corresponds qualitatively to what we found when decoding from the data in Fig. 4D. Nonetheless, one important distinction between this hierarchical model and the previous MG decoder is that the modulator affects both the mean and the variance of the V1-like encoding layer (see Methods). To disambiguate the effects of modulation on neural variability vs. mean responses, we introduce a third model, which is parametrized and trained in the same way, but deterministically boosts the gain of initial stage neurons [13], in the absence of stochastic modulation. We find that targeting of deterministic gain modulation can be learned faster than retraining all the connections, but it does not reach the same performance as the stochastic modulator given limited training. This suggests that the separation of stimulus information and task relevance into the mean and variance of neural activity, respectively, further enhances the identifiability of the stimuli at the decision stage.

When investigating the properties of the learned solution, we find that the learned couplings are highest for task-informative neurons (5% highest $|d'|$) in the primary encoding layer (Fig. 5H), as in the data (Fig. 2F-J). Although the modulator only affects the responses of these neurons directly, we find that informative neurons in the downstream processing layer are also preferentially correlated with the modulator (Fig. 5I). This suggests that task relevance propagates along the hierarchy in parallel to the stimulus information.

[...] considerations, we expect correlated activity in V1 to drive MT to some extent. What is specific to our theory is the prediction that the degree of inherited modulation should reflect the task informativeness of individual MT units. We find that responses of MT units that cover the two relevant stimulus locations (Fig. 1A) vary in their task-informativeness (Suppl. S15) and show different degrees of supra-Poisson variability (Fig. 6A) [...]

The second model additionally conditions on the modulator estimated from the simultaneously recorded V1 units ("SR+V1 modulation"; Fig. 6B). The SR model provided a good fit for all MT units (Suppl. Fig. S6A), which is [...]

[...] require vast amounts of task experience. A final distinction is that our approach does not rely on an explicit context cue: the task relevance of sensory features is communicated solely through task feedback. Overall, multiple mechanisms for task-specific readout are likely to coexist in the brain and be engaged in a context-dependent manner.

Our theory is agnostic to the source of the modulator and the circuit mechanisms underlying its task-specific [...]

Theoretical framework for decoding from a neural population

We simulated a binary discrimination task analogous to the experiment, which requires discriminating stimuli $s = 0$ from $s = 1$ on the basis of the activity of a population of $N$ neurons. Neural responses are modeled as Poisson draws with a stimulus-dependent firing rate, which is itself modulated by a time-varying noisy signal, $m_t$, shared across neurons [...], where $\lambda_n(s)$ is the stimulus response function of the neuron, and $t$ indexes time within a stimulus presentation. The [...] with weights $a_n^{(\mathrm{opt})} = \log\lambda_n(1) - \log\lambda_n(0)$, and time-varying threshold [...]

The modulator-guided heuristic decoder assumes access to the modulator $m_t$ and the neural responses $k_{n,t}$, and learns approximate decoding weights based on co-fluctuations of the two within a trial [...]. The sign of the decoding weight is separately estimated by comparing responses to the two stimuli (trial feedback; see also [9] and Suppl. S10).
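The modulator-guided weight estimate described above can be sketched in a few lines. This is a hedged illustration, not the decoder's actual equation: the helper names (`mg_decoding_weights`, `mg_decode`), the use of the absolute inner product as the weight magnitude, and the sign rule based on mean responses are all assumptions made for the example.

```python
def mg_decoding_weights(counts, modulator, labels):
    """Estimate modulator-guided readout weights, one per neuron.

    counts[t][n]: spike count of neuron n in time bin t
    modulator[t]: modulator value in bin t (assumed to be observable)
    labels[t]:    stimulus (0 or 1) shown in bin t; used only for the sign

    ASSUMPTIONS (illustrative, not taken from the source):
      magnitude = |sum_t m_t * k_{n,t}|  (co-fluctuation with the modulator)
      sign      = sign of (mean response to s=1) - (mean response to s=0)
    """
    n_neurons = len(counts[0])
    weights = []
    for n in range(n_neurons):
        inner = sum(m * row[n] for m, row in zip(modulator, counts))
        resp1 = [row[n] for row, s in zip(counts, labels) if s == 1]
        resp0 = [row[n] for row, s in zip(counts, labels) if s == 0]
        diff = sum(resp1) / len(resp1) - sum(resp0) / len(resp0)
        sign = (diff > 0) - (diff < 0)
        weights.append(sign * abs(inner))
    return weights


def mg_decode(count_vector, weights, threshold=0.0):
    """Report stimulus 1 when the weighted sum of counts exceeds the threshold."""
    decision_variable = sum(a * k for a, k in zip(weights, count_vector))
    return int(decision_variable > threshold)
```

Because the modulator has zero mean, the inner product of an uncoupled neuron averages toward zero over bins, so such a neuron receives a small weight without any knowledge of its tuning. Only the sign requires stimulus labels.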
Hierarchical information propagation with learned stochastic modulation

We use a 4-layer artificial neural network that maps an image stimulus with 3136 pixels into categories, corresponding to 10 digits or different orientations. The first encoding layer includes neurons with fixed Gabor receptive fields.

The modulator affects encoding neurons through coupling terms $c_n$, which modulate the neuron's responses [...], where $h^{(0)}_{n,t}$ is the activity of neuron $n$ in the encoding layer and $w_n$ are the weights from the input to this neuron. Neurons in the top layer include a multiplicative gain $g_n \geq 0$ [...], where $b^{(2)}_n$ is a neuron-specific bias, optimized together with the weights $w^{(2)}_n$ during pre-training. The gain $g_n$ is learned using the MG correlation rule [...], where $h^{(2)}_{n,t}(s)$ denotes the activity at time $t$ of neuron $n$ in the last processing layer, in response to stimulus $s$.

There are three stages of learning. 1) Pre-training optimizes all network weights to natural image statistics using a [...]

[...] across sessions. We analyze 21 to 109 trials per block, where the monkey either detected the target (hit) or failed to detect it (miss). We discard trials where the monkey did not finish the task in a hit or miss and trials where one of the distractors [...]

Informativeness of a unit
The informativeness of a unit is quantified by [...], where $\mu_0$ and $\sigma_0^2$, $\mu_1$ [...]

[...] with spike count measurements $k_n$. Parameters $\beta_n$ are obtained by maximizing the log-likelihood of the data, separately for each block: $L(\beta_n) = \sum_t -(\beta_n s_t)^\top k_{n,t} + \exp(\mathbf{1}^\top \beta_n s_t) + \alpha \beta_n^\top \beta_n$.
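For the informativeness measure introduced at the start of this subsection, a short sketch may be useful. It assumes the standard discriminability index, $d' = (\mu_1 - \mu_0) / \sqrt{(\sigma_0^2 + \sigma_1^2)/2}$, computed from a unit's spike counts under the two stimuli; the exact normalization used by the authors is not recoverable from the text here, and the function name `d_prime` is hypothetical.

```python
import math
import statistics


def d_prime(counts_s0, counts_s1):
    """Discriminability of two stimuli from one unit's spike counts.

    ASSUMPTION: standard definition, i.e. the difference of the means divided
    by the root of the average of the two sample variances. The sign shows
    which stimulus drives the unit more strongly; |d'| is the informativeness.
    """
    mu0 = statistics.mean(counts_s0)
    mu1 = statistics.mean(counts_s1)
    var0 = statistics.variance(counts_s0)  # sample variance (n - 1 denominator)
    var1 = statistics.variance(counts_s1)
    return (mu1 - mu0) / math.sqrt(0.5 * (var0 + var1))
```

For example, `d_prime([1, 2, 3], [3, 4, 5])` compares means of 2 and 4 with unit sample variances in both conditions. The same function could be applied with trials grouped by the animal's choice rather than by stimulus identity.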
The extended MT SR model includes the (normalized) V1 modulator as an additional predictive variable. [...]

For Fig. 2G we computed the rank of each modulator coupling in its own block-specific population and compared the distribution of significantly informative to uninformative units. In Fig. 2H-J we used partial correlations to test for a relationship between a unit's modulator coupling and task-informativeness in each block not explained by differences in overall firing rate. Specifically, we report the Spearman correlation between residual informativeness (after linearly regressing out firing rate) and modulator coupling. [...]

We train each decoder on data that includes a balanced number of stimulus 0/1 presentations at high and low contrast, varying the size of the training set from the minimum of 4 (one for each stimulus-contrast pair) to all available data. Decoder performance is tested on held-out data. The optimal decoder uses maximum likelihood estimates (as in theory, with a 200 ms decoding window), but based on estimated instead of ground truth parameters. It uses a constant threshold which is optimized on the training data. This is known to be suboptimal (Eq. 2), but is more robust to the noise in the data and therefore performs better in the limited data regime. The modulator-guided (MG) decoder estimates readout weights by taking the inner product between the unit's activity and the modulator values (Eq. 8, using 50 ms bins), with signs determined from trial-level feedback, and a constant threshold.
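The constant threshold mentioned above has to be chosen from the training data. The following is a minimal sketch of one way to do this, assuming the criterion is training accuracy on the decoder's decision variable; the criterion and the function name `best_constant_threshold` are assumptions, since the text does not say how the threshold is optimized.

```python
def best_constant_threshold(scores, labels):
    """Pick the constant threshold that maximizes training accuracy.

    scores[i]: decision variable (weighted sum of counts) on training trial i
    labels[i]: true stimulus (0 or 1) on that trial

    ASSUMPTION: candidate thresholds are the midpoints between neighbouring
    sorted scores, plus one value below and one above all scores; ties are
    resolved in favour of the lowest candidate.
    """
    ordered = sorted(set(scores))
    candidates = [ordered[0] - 1.0]
    candidates += [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    def accuracy(threshold):
        hits = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
        return hits / len(scores)

    return max(candidates, key=accuracy)
```

The same routine can serve both the learned optimal decoder and the MG decoder, since each reduces a trial to a single decision variable before thresholding.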