Abstract
Particular deep artificial neural networks (ANNs) are today’s most accurate models of the primate brain’s ventral visual stream. Here we report that, using a targeted ANN-driven image synthesis method, new luminous power patterns (i.e. images) can be applied to the primate retinae to predictably push the spiking activity of targeted V4 neural sites beyond naturally occurring levels. More importantly, this method, while not yet perfect, already achieves unprecedented independent control of the activity state of entire populations of V4 neural sites, even those with overlapping receptive fields. These results show how the knowledge embedded in today’s ANN models might be used to non-invasively set desired internal brain states at neuron-level resolution, and suggest that more accurate ANN models would produce even more accurate control.
Particular deep feed-forward artificial neural network models (ANNs) constitute today’s most accurate “understanding” of the initial ~200ms of processing in the primate ventral visual stream and the core object recognition behavior it supports (see (1) for the currently leading models). In particular, visually-evoked internal neural representations of these specific ANNs are remarkably similar to the visually-evoked neural representations in mid-level (area V4) and high-level (area IT) cortical stages of the ventral stream (2, 3), a finding that has been extended to neural representations in visual area V1 (4), to patterns of behavioral performance in core object recognition tasks (5, 6), and to both magnetoencephalography and fMRI measurements from the human ventral visual stream (7, 8). Notably, these prior findings of model-to-brain similarity were not curve fits to brain data – they were predictions evaluated using images not previously seen by the ANN models, showing that these models generalize in their ability to capture key functional properties of the ventral visual stream.
However, at least two important potential limitations of this claim have been raised. First, because the visual processing executed by the models is not simple to describe, and because the models have only been evaluated in terms of internal functional similarity to the brain (above), perhaps they are more like a copy of, rather than a useful “understanding” of, the ventral stream. Second, because the images used to assess similarity were sampled from the same distribution as that used to set the models’ internal parameters (photograph and rendered-object databases), it is unclear if these models would pass a stronger test of functional similarity – does that similarity generalize to entirely novel images? That is, perhaps their reported apparent functional similarity to the brain (3, 7, 9) substantially over-estimates their true functional similarity.
Here we conducted a set of non-human primate visual neurophysiology experiments to assess the first potential limitation by asking if the detailed knowledge that the models contain is useful for one potential application (neural activity control), and to assess the second potential limitation by asking if the functional similarity of the model to the brain generalizes to entirely novel images.
Specifically, we used one of the leading deep ANN ventral stream models (i.e. a specific model with a fully fixed set of parameters) to synthesize new patterns of luminous power (“controller images”) that, when applied to the retinae, were intended to control the neural firing activity of particular, experimenter-chosen neural sites in cortical visual area V4 of macaques in two settings. i) Neural “Stretch”: synthesize images that “stretch” the maximal firing rate of any single targeted neural site well beyond its naturally occurring maximal rate. ii) Neural Population State Control: synthesize images to independently control every neural site in a small recorded population (here, populations of 5-40 neural sites). We here tested that population control by aiming to use such model-designed retinal inputs to drive the V4 population into an experimenter-chosen “one hot” state in which one neural site is pushed to be highly active while all other nearby sites are simultaneously all “clamped” at their baseline activation level. We reasoned that successful experimenter control would demonstrate that at least one ANN model can be used to non-invasively control the brain – a practical test of useful, causal “understanding” (10, 11).
We used chronic, implanted microelectrode arrays to record the responses of 107 neural multi-unit and single-unit sites from visual area V4 in three awake, fixating rhesus macaques (nM = 52, nN = 33, nS = 22). We first determined the classical receptive field (cRF) of each site with briefly presented small squares (for details see Methods). We then tested each site using a set of 640 naturalistic images (always presented to cover the central 8° of the visual field, which overlapped with the estimated cRFs of all the recorded V4 sites), and using a set of 370 complex curvature stimuli previously determined to be good drivers of V4 neurons (12) (location tuned for the cRFs of the neural sites). Using each site’s visually evoked responses (see Methods) to 90% of the naturalistic images (n=576), we created a mapping from a single “V4” layer of a deep ANN model (13) (the Conv-3 layer, which we had established in prior work) to the neural responses. The predictive accuracy of this model-to-brain mapping has previously been used as a measure of the functional fidelity of the brain model to the brain (1, 3). Indeed, using the V4 responses to the held-out 10% of the naturalistic images as tests, we replicated and extended that prior work – we found that the neural predictor models correctly predicted 89% of the explainable (i.e. image-driven) variance in the V4 neural responses (median over the 107 sites, each site computed as the mean over two mapping/testing splits of the data; see Methods).
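The mapping step just described can be sketched as a regularized linear readout from the model’s “V4” layer features to each site’s mean responses, scored by held-out predictivity. The sketch below assumes a simple ridge regression and an uncorrected explained-variance score; the paper’s exact mapping and noise-correction procedure are described in its Methods.

```python
import numpy as np

def fit_linear_mapping(features, responses, l2=1.0):
    """Ridge-regression readout from ANN layer features (n_images x n_features)
    to a neural site's mean responses (n_images,)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias column
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(n_feat), X.T @ responses)

def predict(features, w):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ w

def explained_variance(pred, actual):
    """Fraction of variance in held-out responses captured by the predictor."""
    return 1.0 - np.var(actual - pred) / np.var(actual)

# Toy check on synthetic data: a linear ground truth should be recovered.
rng = np.random.default_rng(0)
F = rng.normal(size=(576, 50))              # "mapping" image features
w_true = rng.normal(size=50)
r = F @ w_true + 0.1 * rng.normal(size=576)  # noisy "neural" responses
w = fit_linear_mapping(F, r, l2=0.1)
F_test = rng.normal(size=(64, 50))           # held-out image features
r_test = F_test @ w_true + 0.1 * rng.normal(size=64)
ev = explained_variance(predict(F_test, w), r_test)
```

The essential property is that the readout is fit once on the mapping images and then frozen, so all later predictions (including for synthetic images) use zero free parameters.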
Besides generating a model-V4-to-brain-V4 similarity score (89%, above), this mapping procedure produces a potentially powerful tool – an image-computable predictor model of the visually-evoked firing rate of each of the V4 neural sites. If truly accurate, this predictor model is not simply a data fitting device and not just a similarity scoring method – instead it must implicitly capture a great deal of visual “knowledge” that may be difficult to express in human language, but is hypothesized (by the model) to be used by the brain to achieve successful visual behavior. To extract and deploy that knowledge, we used a model-driven image synthesis algorithm (see Figure-1 and Methods) to generate controller images that were customized for each neural site (i.e. according to its predictor model) so that each image should predictably and reproducibly control the firing rates of V4 neurons in a particular, experimenter-chosen way. That is, we aimed to test the hypothesis that experimenter-delivered application of a particular pattern of luminous power on the retinae will reliably and reproducibly cause V4 neurons to move to a particular, experimenter-specified activity state (and that removal of that pattern of luminous power will return those V4 neurons to their background firing rates).
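The synthesis procedure described above is, at its core, gradient ascent on the input pixels of a differentiable predictor. The toy sketch below illustrates that loop with a stand-in differentiable “predictor” (the matrices `A` and `w` are illustrative only, not the paper’s ANN); the actual method backpropagates through the deep ANN and its site-specific readout.

```python
import numpy as np

def synthesize(predict_grad, shape, steps=300, lr=0.05, seed=0):
    """Gradient-ascent image synthesis: start from random pixels and climb
    the predicted response, keeping pixels in a valid [0, 1] range.
    `predict_grad(x)` returns (predicted_response, d_response/d_x)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.4, 0.6, size=shape)        # near-gray random init
    for _ in range(steps):
        _, g = predict_grad(x)
        x = np.clip(x + lr * g, 0.0, 1.0)        # ascend, stay a valid image
    return x

# Stand-in differentiable "site predictor": response = w . tanh(A x).
rng = np.random.default_rng(1)
A = 0.1 * rng.normal(size=(16, 64))
w = rng.normal(size=16)

def predict_grad(x):
    h = np.tanh(A @ x)
    resp = w @ h
    grad = A.T @ (w * (1.0 - h ** 2))            # chain rule through tanh
    return resp, grad

x0 = np.full(64, 0.5)                            # uniform gray reference
img = synthesize(predict_grad, shape=(64,), steps=300, lr=0.05, seed=1)
```

In the Stretch setting the objective is a single site’s predicted rate; in the population setting the same loop is run on a population-level objective instead.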
While there are an extremely large number of possible neural activity states that an experimenter might ask a controller method to try to achieve, we restricted our experiments to the V4 spiking activity 70-170 ms after retinal power input (the time frame where the ANN models are presumed to be most accurate), and we have thus far tested two control settings: Stretch control and One-hot population control (described below). To test and quantify the goodness of control, we applied patterns of luminous power specified by the synthesized controller images to the retinae of the animal subjects while we recorded the responses of the same V4 neural sites (see Methods).
Each experimental manipulation of the pattern of luminous power on the retinae is colloquially referred to as “presentation of an image,” but we state the precise manipulation here – applied power that is under experimenter control and fully randomized with other applied luminous power patterns (other images) – to emphasize that this is logically identical to more direct energy application (e.g. optogenetic experiments), in that the goodness of experimental control is inferred from the correlation between power manipulation and the neural response in exactly the same way in both cases (see (11) for review). The only difference between the two approaches is the assumed set of mechanisms that intervene between the experimentally-controlled power and the controlled dependent variable (here, V4 spiking rate) – steps that the ANN model aims to approximate with stacked synaptic sums, threshold non-linearities, and normalization circuits. In both the control case presented here and the optogenetics control case, those intervening steps are not fully known, but are approximated by a model of some type. That is, neither experiment is “only correlational,” because causality is inferred from experimenter-delivered, experimenter-randomized application of power to the system.
Because each experiment was performed over separate days of recording (one day to build all the predictor models, one day to test control), only neural sites that maintained both high SNR and consistent rank order of responses to a standard set of 25 naturalistic images across the two experimental days were considered further (nM = 38, nN = 19, and nS = 19 for Stretch experiments; nM = 38 and nS = 19 for One-hot-population experiments; see Methods).
“Stretch” Control: Attempt to maximize the activity of individual V4 neural sites
We first defined each V4 site’s “naturally-observed maximal firing rate” as that which was found by testing its response to the best of the 640 naturalistic test images (cross-validated over repeated presentations, see Methods). We then generated synthetic controller images for which the synthesis algorithm was instructed to drive the targeted neural site’s firing rate as high as possible beyond that rate, regardless of the other V4 neural sites. For our first Stretch Control experiment, we restricted the synthesis algorithm to only operate on parts of the image that were within the classical receptive field (cRF) of each neural site. For each target neural site (nM = 21, nN = 19, and nS = 19), we ran the synthesis algorithm from five different random image initializations. For 79% of neural sites, the synthesis algorithm successfully found at least one image that it predicted to be at least 10% above the site’s naturally observed maximal firing rate (see Methods). However, in the interest of presenting an unbiased estimate of the stretch control goodness for randomly sampled V4 neural sites, we included all sites in our analyses, even those (~20%) that the control algorithm predicted that it could not “stretch.” Visual inspection suggests that the five stretch controller images generated by the algorithm for each neural site are perceptually more similar to each other than to those generated for different neural sites (see Figures 2 and S1), but we did not psychophysically quantify that similarity.
An example of the results of applying the Stretch Control images to the retinae of one monkey to target one of its V4 sites is shown in Figure 2-A, along with the ANN-model-predicted responses of this site for all tested images. A closer visual inspection of this neural site’s “best” natural and complex curvature images within the site’s cRF (Fig. 2, top) suggests that it might be especially sensitive to the presence of an angled convex curvature in the middle and a set of concentric circles at the bottom left side. This is consistent with extensive systematic work in V4 using such stimuli (12, 14), and it suggests that we had successfully located the cRF and tuned our stimulus presentation to maximize firing rate by the standards of such prior work. Interestingly, however, we found that all five synthetic stretch control images (red) drove the neural responses above the response to each and every tested naturalistic image (blue) and above the response to each and every complex curvature stimulus presented within the cRF (purple) (Fig. 2-A).
To quantify the goodness of this stretch control, we measured the neural response to the best of the five synthetic images (again, cross-validated over repeated presentations, see Methods) and compared it with the naturally-observed maximal firing rate (defined above). We found that the stretch controller images successfully drove 68% of the V4 neural sites (40 out of 59) statistically beyond their maximal naturally-observed firing rates (unpaired-samples t-test at the level of p < 0.01 between the distributions of highest firing rates for naturalistic and synthetic images; distributions generated from 50 random cross-validation samples, see Methods). Measured as an amplitude, we found that the stretch controller images typically produced a firing rate that was 39% higher than the maximal naturalistic firing rate (median over all tested sites; Figure 2, panels B and C).
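The cross-validated “best image” measurement used above guards against selection bias: choosing the best image and measuring its response on the same trials would overestimate the maximum. A minimal sketch of that idea (not necessarily the paper’s exact split procedure):

```python
import numpy as np

def cross_validated_max(trials, rng):
    """Unbiased 'best image' response: pick the best image on one half of the
    repeat presentations, then measure its response on the held-out half.
    `trials` is (n_repeats, n_images)."""
    n = trials.shape[0]
    perm = rng.permutation(n)
    a, b = perm[: n // 2], perm[n // 2:]
    best = trials[a].mean(axis=0).argmax()       # select on split A
    return trials[b, best].mean()                # measure on split B

# Toy check: image 1 is the true best driver (mean rate 3.0).
rng = np.random.default_rng(3)
signal = np.array([1.0, 3.0, 2.0, 0.5])
trials = signal + 0.3 * rng.normal(size=(40, 4))
est = cross_validated_max(trials, rng)
```

Because selection and measurement use disjoint trials, `est` estimates the true maximal rate rather than the noise-inflated sample maximum.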
Because our fixed set of naturalistic images was not optimized to maximally drive each V4 neural site, we considered the possibility that our stretch controller was simply rediscovering image pixel arrangements already known from prior systematic work to be good drivers of V4 neurons (12, 14). To test this possibility, we tested 19 of the V4 sites (nM = 11, nS = 8) by presenting – inside the cRF of each neural site – each of 370 complex curve shapes (14), a stimulus set previously shown to contain image features that are good at driving V4 neurons when placed within the cRF. Because we were also concerned that the fixed set of naturalistic images did not maximize the local image contrast within each V4 neuron’s cRF, we presented the complex curved shapes at a contrast matched to that of the synthetic stretch controller images (see supplementary Figure S4). Interestingly, we found that for each tested neural site, the synthetic controller images generated higher firing rates than the most-effective complex curve shape (Figure 2-D). Specifically, when we used the maximal response over all the complex curve shapes as the reference (again, cross-validated over repeated presentations), we found that the median stretch amplitude was even larger (187%) than when the maximal naturalistic image was used as the reference (73% for the same 19 sites). In sum, the ANN-driven stretch controller had discovered pixel arrangements that were better drivers of V4 neural sites than prior systematic attempts to do so.
“One-Hot-Population” Control: Attempt to only activate one of many V4 neural sites
Similar to prior single unit visual neurophysiology studies (15–17), the stretch control experiment attempted to optimize the response of each V4 neural site one at a time without regard to the rest of the neural population. But the ANN model potentially enables much richer forms of population control in which each neural site might be independently controlled. As a first test of this, we asked the synthesis algorithm to try to generate controller images with the goal of driving the response of only one “target” neural site high while simultaneously keeping the responses of all other recorded neural sites low (aka a “one-hot” population activity state; see Methods).
We attempted this one-hot-population control on neural populations in which all sites were simultaneously recorded (One-hot-population Experiment 1: n=38 in monkey-M; Experiment 2: n=19 in monkey-S). Specifically, we randomly chose a subset of neural sites as “target” sites (14 in monkey-M and 19 in monkey-S) and we asked the synthesis algorithm to generate five one-hot-population controller images for each of these sites (i.e. 33 tests in which each test is an attempt to maximize the activity of one site while suppressing the activity of all other measured sites from the same monkey). For these control tests, we allowed the controller algorithm to optimize pixels over the entire 8° diameter image (which included the cRFs of all the recorded neural sites, see Fig. 3), and we then applied the one-hot-population controller images to the monkey retinae to assess the goodness of control. The synthesis procedure predicted a softmax score of at least 0.5 for 77% of population experiments (as a reference, the maximum softmax score is 1 and is obtained when only the target neural site is active and all off-target neural sites are completely inactive; for an example near 0.3 see Fig. 3).
While the one-hot-population controller images did not achieve perfect one-hot-population control, we found that the controller images were typically able to achieve enhancements in the activity of the target site without generating much increase in off-target sites (relative to naturalistic images; see examples in Figure 3-A). To quantify the goodness of one-hot-population control in each of the 33 tests, we computed a one-hot-population score on the activity profile of each population (softmax score, see Methods), and we referenced that score to the one-hot-population control score that could be achieved using only the naturalistic images (i.e. without the benefit of the ANN model and synthesis algorithm). We took the ratio of those two scores as the measure of improved one-hot population control, and we found that the controller typically achieved an improvement of 57% (median over all 33 one-hot-population control tests; see Fig. 3-B and C) and that this improved control was statistically significant for 76% of the one-hot population control tests (25 out of 33 tests; unpaired-samples t-test at the level of p < 0.01).
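One plausible form of the softmax score referenced above (the exact definition is in the Methods) assigns a population activity state a value near 1 only when the target site dominates all off-target sites:

```python
import numpy as np

def one_hot_score(responses, target):
    """Softmax-style one-hot score: approaches 1 when only the target site is
    active and the off-target sites are near zero. Sketch only; the paper's
    exact normalization of responses is described in its Methods."""
    r = np.asarray(responses, dtype=float)
    e = np.exp(r - r.max())                  # subtract max for numerical stability
    return e[target] / e.sum()

pop_good = [3.0, 0.1, 0.0, 0.2, 0.1]   # target site 0 high, others near baseline
pop_bad  = [3.0, 2.5, 2.8, 2.6, 2.9]   # all sites co-active
s_good = one_hot_score(pop_good, target=0)
s_bad = one_hot_score(pop_bad, target=0)
```

Referencing this score to the best achievable with naturalistic images alone, as in the text, isolates the gain contributed by the ANN-driven synthesis.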
We considered the possibility that the improved population control resulted from non-overlapping cRFs, which would allow neural sites to be independently controlled simply by restricting image contrast energy to each site’s cRF. To test this possibility, we analyzed a sub-sample of the measured neural population in which all sites had strongly overlapping cRFs (see Fig. 3-D): a neural population of size 10 in monkey-M and of size 8 in monkey-S. In total we performed the experiment on 12 target neural sites in two monkeys (4 in monkey-M and 8 in monkey-S) and found that the amplitude of improved control was still 40%. Thus, a large portion of the improved control results from specific spatial arrangements of luminous power within the retinal input region shared by multiple V4 neural sites – arrangements that the ANN model has implicitly captured and predicted and that the synthesis algorithm has successfully recovered (Fig. 4).
As another test of one-hot-population control, we conducted an additional set of experiments in which we restricted the one-hot control synthesis algorithm to operate only on image pixels within the shared cRF of all neural sites in a sub-population with overlapping cRFs (Fig. 3-E). We compared this within-cRF synthetic one-hot population control with the within-cRF one-hot population control that could be achieved with the complex curved shapes (because the prior experiments with these stimuli were also designed to manipulate V4 responses using only pixels inside the cRF). We found that, for the same set of neural sites, the synthetic controller images produced a very large one-hot population control gain (median 112%, Fig. 3-E), and the control score was significantly higher than that of the best curvature stimulus for 86% of the neural sites (12 out of 14).
Does the functional fidelity of the ANN brain model generalize to novel images?
Besides testing non-invasive causal neural control, these experiments also aimed to ask if ANN models would pass a stronger test of functional similarity to the brain than prior work had shown (2, 3). Specifically, does that model-to-brain similarity generalize to entirely novel images? Because the controller images were synthesized de novo from random pixel arrangements, and because they were optimized to drive the firing rates of V4 neural sites both upwards (targets) and downwards (one-hot-population off-targets), we considered them a highly novel set of neural-modulating images, far removed from the naturalistic images. Indeed, visual inspection suggests the novelty of these images (Fig. 5). We thus used the V4 neural responses to all the tested synthetic images to ask if the ANN model “neural” responses matched the brain’s responses, using the same similarity measure as prior work (2, 3), but now with zero parameters to fit. That is, a good model-to-brain similarity score required that the ANN predictor model for each V4 neural site accurately predict the response of that neural site for each of many synthetic images, all of which are very different from those used to train the ANN (photographs) and also very different from the images used to map ANN “V4” sites to individual V4 neural sites (naturalistic images).
Consistent with the control results (above), we found that the ANN model accounted for 54% of the explainable variance for the set of synthetic images (median over 76 neural sites in three monkeys; Fig. S3). While this model-to-brain similarity score is lower than that obtained for naturalistic image responses (89%), it is still a substantial portion of the variance considering that all parameters were fixed when making these “out-of-domain” image predictions. We believe this is the strongest test of generalization of today’s ANN models of the ventral stream thus far, and it shows that the model’s internal neural representation is not only remarkably similar to the brain’s intermediate ventral stream representation (V4), but also still not a perfect model of that representation.
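“Percent explainable variance” in scores like those above is typically computed by dividing raw predictivity by a trial-to-trial reliability ceiling, since no model can predict non-repeatable noise. The sketch below shows one standard split-half recipe; the paper’s exact estimator is described in its Methods.

```python
import numpy as np

def explainable_variance_fraction(pred, trials):
    """Fraction of explainable (image-driven) variance captured by a predictor:
    raw prediction accuracy divided by a split-half reliability ceiling.
    `trials` is (n_repeats, n_images)."""
    mean_resp = trials.mean(axis=0)
    r_model_sq = np.corrcoef(pred, mean_resp)[0, 1] ** 2
    # Split-half reliability, Spearman-Brown corrected to the full trial count.
    half = trials.shape[0] // 2
    r_half = np.corrcoef(trials[:half].mean(axis=0),
                         trials[half:].mean(axis=0))[0, 1]
    r_ceiling_sq = (2 * r_half / (1 + r_half)) ** 2
    return r_model_sq / r_ceiling_sq

# Toy check: a predictor equal to the true image-driven signal should capture
# ~100% of the explainable variance despite trial-to-trial noise.
rng = np.random.default_rng(2)
true_signal = rng.normal(size=100)
trials = true_signal + 0.5 * rng.normal(size=(20, 100))
ev = explainable_variance_fraction(true_signal, trials)
```

This normalization is what allows the 89% (naturalistic) and 54% (synthetic) scores to be compared on the same footing.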
How do we interpret these results?
In sum, we here demonstrate that, using a deep ANN-driven controller method, we can push the firing rates of most V4 neural sites beyond naturally occurring levels and that V4 neural sites with overlapping receptive fields can be partly – but not yet perfectly – independently controlled. In both cases, we show that the goodness of this control is unprecedented in that it is superior to that which can be obtained without the ANN. Finally, we find that – with no parameter tuning at all – the ANN model generalizes quite well to predict V4 responses to synthetic images – images which are strikingly different than the real-world photographs used to tune the ANN synaptic connectivity and map the ANN’s “V4” to each V4 neural site. We believe that these results are the strongest test thus far of today’s deep ANN models of the ventral stream.
Beginning with the work of Hubel and Wiesel (18, 19), decades of visual neuroscience has closely equated an understanding of how the brain represents the external visual world with an understanding of what stimuli cause each neuron to respond the most. Indeed, textbooks and important recent results tell us that V1 neurons are tuned to oriented bars (19), V2 neurons are tuned to correlated combinations of V1 neurons found in natural images (20), V4 neurons are tuned to complex curvature shapes in both 2D and 3D (16, 21) and tuned to boundary information (12, 14), and IT neurons respond to complex object-like patterns (17) including faces (22, 23) and bodies as special cases (24).
While these efforts have been essential to building both a solid foundation and intuitions about the role of neurons in encoding visual information, our results here show how they can be further refined by current and future ANN models of the ventral stream. For instance, here we found that synthesis of only a few images led to higher neural response levels than was possible by searching a relatively large space of natural images (n=640) and complex curved stimuli (n=370) derived from those prior intuitions. This shows that even today’s ANN models – which are clearly not yet perfect (1, 6) – already give us a new ability to find manifolds of more optimal stimuli for each neural site at a much finer degree of granularity, and to discover such stimuli unconstrained by human intuition and difficult to fully describe in human spoken language (see examples in Fig. S1). This is likely to be especially important in mid and later stages of the visual hierarchy (e.g. in V4 and inferior temporal cortex), where the response complexity and larger receptive fields of neurons make manual search intractable.
In light of these results, what can we now say about the two important critiques of today’s ANN models raised at the outset of this study (understanding and generality)? In our view, the results strongly mitigate both of those critiques, but they do not eliminate them.
On understanding: the ability to use knowledge to gain improved control over things of interest in the world (as we have demonstrated here) is an important test of understanding. However we acknowledge that this is not the only possible view, and many other notions of “understanding” remain to be explored to see if and how these models add value.
On generality: because we found that even today’s ANN models show good generalization to entirely novel images, we believe these results close the door on critiques arguing that current ANN models are extremely narrow in the scope of images they can accurately cover. However, we note that while 54% of the explainable variance in the generalization test was successfully predicted, this is lower than the 89% of explainable variance found for images that are “closer” to (but not identical to) the mapping images. This not only confirms that these brain models are not yet perfect, but also suggests that a single metric of model similarity to each brain area is insufficient to characterize and distinguish among alternative models (e.g. (1)). Instead, multiple similarity tests at different generalization “distances” could be useful, as we can imagine future models that show less decline in successfully predicted variance as one moves from images “near” the training and mapping distributions (typically photographs and naturalistic images) to “far” (e.g. synthetic images like those used here, and others).
From an applications standpoint, the results presented here show how today’s ANN models of the ventral stream can already be used to achieve improved non-invasive population control (e.g. Fig. 4). However, the control results are clearly not yet perfect. For example, in the one-hot population control setting we were not able to fully suppress each and every one of the responses of the “off-target” neural sites while keeping the target neural site active (see examples in Figures 3, 4). Post-hoc analysis showed that we could partially anticipate which off-target sites would be most difficult to suppress – they were typically (and not surprisingly) the sites that had high patterns of response similarity with the target site (r = 0.49, p < 10^-4; correlation between response similarity with the target neural site over naturalistic images and the off-target activity level in the full-image one-hot population experiments; n=37 off-target sites). Such results raise very interesting scientific and applied questions of whether and when perfect independent control is possible at neuron-level resolution. Are our current limitations on control due to anatomical connectivity that restricts the potential population control, the imperfect accuracy of the current ANN models of the ventral stream, imperfect mapping of the model neurons to the individual neural sites in the brain, inadequacy of the controller image synthesis algorithm, or some combination of all of these and other factors?
Consider the synthesis algorithm: Intuitively, each particular neural site might be sensitive to many image features, but maybe only to a few that the other neural sites are not sensitive to.
This intuition is consistent with the observation that, using the current ANN model, it was more difficult for our synthesis algorithm to find good controller images in the One-hot-population setting than in the Stretch setting (the one-hot-population optimization typically took more than twice as many steps to find a synthetic image that is predicted to drive the target neural site response to the same level as in the Stretch setting), and visual inspection of the images suggests that the one-hot-population images have fewer identifiable “features” (Figure 5). As the size of the to-be-controlled neural population is increased, it would likely become increasingly difficult to achieve fully independent control, but this is an open experimental question.
Consider the current ANN models: Our data suggest that future improved ANN models are likely to enable even better control. For example, better ANN V4 population predictor models generally produced better one-hot population control of that V4 population (Fig. S5). One thing is already clear – improved ANN models of the ventral visual stream have led to control of high-level neural populations that was previously out of reach. With continuing improvement in the fidelity of ANN models of the ventral stream (1, 25, 26), the results presented here have only scratched the surface of what is possible with such implemented characterizations of the brain’s neural networks.
Acknowledgments
We thank Dr. A. Pasupathy for generously providing the complex curvature stimuli, and Kailyn Schmidt for technical support. We also thank Chris Stawarz and the MWorks consortium (https://mworks.github.io) for experimental software support. This research was supported by Intelligence Advanced Research Projects Agency (IARPA), the MIT-IBM Watson AI Lab, US National Eye Institute grants R01-EY014970 (J.J.D.), and Office of Naval Research MURI-114407 (J.J.D).