Attentionally modulated subjective images reconstructed from brain activity

Visual image reconstruction from brain activity produces images whose features are consistent with the neural representations in the visual cortex for arbitrary visual instances [1–3], presumably reflecting the person’s visual experience. Previous reconstruction studies have been concerned either with how faithfully stimulus images can be reconstructed or with whether mentally imagined contents can be reconstructed in the absence of external stimuli. However, many lines of vision research have demonstrated that even stimulus perception is shaped by both stimulus-induced and top-down processes. In particular, attention (or the lack of it) is known to profoundly affect visual experience [4–8] and brain activity [9–21]. Here, to investigate how top-down attention impacts the neural representation of visual images and their reconstructions, we use a state-of-the-art method (deep image reconstruction [3]) to reconstruct visual images from fMRI activity measured while subjects attend to one of two images superimposed with equally weighted contrasts. Deep image reconstruction exploits the hierarchical correspondence between the brain and a deep neural network (DNN) to translate (decode) brain activity into DNN features of multiple layers, and then creates images that are consistent with the decoded DNN features [3, 22, 23]. Using the deep image reconstruction model trained on fMRI responses to single natural images, we decode brain activity during the attention trials. Behavioral evaluations show that the reconstructions resemble the attended rather than the unattended images. The reconstructions can be modeled by superimposed images with contrasts biased toward the attended one, which are comparable to the appearance of the stimuli under attention measured in a separate session. Attentional modulations are found in a broad range of hierarchical visual representations and mirror the brain–DNN correspondence. Our results demonstrate that top-down attention counters stimulus-induced responses and modulates neural representations to render reconstructions in accordance with subjective appearance. The reconstructions appear to reflect the content of visual experience and volitional control, opening a new possibility of brain-based communication and creation.

Figure 1. Overview of image reconstruction from brain activity during attention
(A) Experimental design of attention trials. In each trial, two cue images and a superposition of the preceding two cue images (flashed at 2 Hz) were sequentially presented to subjects. During the attention period, subjects were asked to attend to one of the two superimposed images, indicated by a green fixation color during the cue periods, while ignoring the other. For confirmation, subjects pressed a button to indicate whether the first or the second image was attended. (B) Reconstruction procedure. Given a set of decoded features for all DNN layers as the target of optimization, the method [3] optimizes the pixel values of an input image so that the features computed from the input image become closer to the target features. A deep generator network (DGN) [25] was introduced to produce natural-looking images, with optimization performed in the input space of the DGN (see Methods: "Visual image reconstruction analysis").
Notably, even for identical stimulus images, the appearance of the reconstructions differed strikingly depending on the attention target. The quality of successful reconstructions from attention trials appeared comparable to that from single-image trials (Figure 2B).
We evaluated reconstruction quality by behavioral ratings with a pair-wise identification task. Human raters judged which of two candidates (attended and unattended images for attention trials; true [presented] and false images for single-image trials) was more similar to the reconstruction. Twenty raters performed the identification for each reconstruction with a specific candidate pair (e.g., "post" and "leopard" for a reconstruction with target "post"). The accuracy for each reconstruction can be defined as the ratio of correct identification among all raters. Here, however, we present results using the pooled accuracy for each image pair for attention trials, that is, the ratio of correct identification across all raters and all reconstructions (trials) with the same image pair. For each image pair, the accuracies were pooled over the two attention conditions (attend to one image or to the other) to cancel the potential effect of image saliency: if identification depended solely on the relative saliency of the two component images regardless of attention, the pooled identification ratios would cancel out to the chance level. For comparability, the accuracy for single-image trials was calculated by pooling the identification results over each pair of single-image stimuli.
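To make the pooling concrete, here is a minimal sketch in Python (the trial data structure and variable names are hypothetical, not the study's code):

```python
import numpy as np
from collections import defaultdict

def pooled_accuracy(trials):
    """Pool rater judgments for each unordered image pair.

    `trials`: list of dicts with an unordered image pair (frozenset) and
    per-rater correctness (1 = rater chose the attended/true image).
    Pooling runs over raters and over all trials sharing the same pair,
    regardless of which of the two images was attended.
    """
    pooled = defaultdict(list)
    for t in trials:
        pooled[t["pair"]].extend(t["rater_correct"])
    return {pair: float(np.mean(v)) for pair, v in pooled.items()}

# Toy usage: two trials of the same pair with opposite attention targets.
trials = [
    {"pair": frozenset({"post", "leopard"}), "rater_correct": [1] * 15 + [0] * 5},
    {"pair": frozenset({"post", "leopard"}), "rater_correct": [1] * 12 + [0] * 8},
]
print(pooled_accuracy(trials))  # -> {frozenset({'post', 'leopard'}): 0.675}
```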
For both attended image and single-image reconstructions, the mean identification accuracies were significantly higher than chance (Figure 2C; see Figure S2 for more examples; see Methods: "Evaluation of reconstruction quality").

Figure 2. Reconstructions from attention and single-image trials
(A) Reconstructions from attention trials. For each presented image, two reconstructions from the same subject are shown for trials with different attention targets. (B) Reconstructions from single-image trials. Images with black and gray frames indicate presented and reconstructed images, respectively (see Figure S1C for more examples). (C) Identification accuracy based on behavioral evaluations. Dots indicate mean accuracies of pair-wise identification evaluations averaged across samples for each paired comparison (chance level, 50%; see Methods: "Evaluation of reconstruction quality"). Black and red lines indicate the mean and the lower/upper bounds of the 95% C.I. across pairs. (D) Scatter plot of attended and single-image identification accuracies based on behavioral evaluations. Dots indicate mean accuracies averaged over samples for each paired comparison and subject. The red line indicates the best linear fit.
For each DNN layer, Pearson correlations were calculated between the decoded feature pattern from each attention trial and a set of DNN feature patterns of the superimposed images with different contrasts. The weighted contrast that yielded the highest correlation was taken to indicate the degree of attentional modulation. We could have used the DNN features calculated from the reconstructions instead of the decoded features, but the two were highly similar and yielded similar results in this analysis. Thus, the decoded features can be regarded as the stimulus features of the reconstructions.
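A minimal sketch of this weighted-contrast analysis, assuming a generic single-layer feature extractor (contrasts are expressed as fractions here; the function name and interface are illustrative):

```python
import numpy as np

def best_matching_contrast(decoded_feat, img_att, img_unatt, extract_features,
                           weights=np.arange(0.0, 1.05, 0.05)):
    """Find the attended-image contrast whose superposition features
    correlate best with a decoded feature pattern.

    extract_features: maps an image array to a flat feature vector for
    one DNN layer (any VGG19 layer extractor would do).
    """
    corrs = []
    for w in weights:
        mixed = w * img_att + (1.0 - w) * img_unatt  # weighted superposition
        corrs.append(np.corrcoef(decoded_feat, extract_features(mixed))[0, 1])
    corrs = np.asarray(corrs)
    return weights[int(np.argmax(corrs))], corrs  # peak contrast, full curve
```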
The decoded features for successful reconstructions of attended images generally showed peak correlations at greater contrasts of the attended images (Figure 3B, top; decoded from VC).
The estimated correlations often peaked at 100% (i.e., the attended image alone), indicating that representations regulated by top-down voluntary attention can override those driven by the external stimuli. Overall, the peaks of the correlations were shifted toward the attended images in most DNN layers except for some lower layers (Figure 3B, bottom; 62.6%, mean across layers; averaged across trials and five subjects). In an independent behavioral experiment, we measured the perceived contrasts of equally weighted stimuli under attention by having subjects match the stimulus contrasts after the attention period. The matched contrast (indicated by "visual appearance" in Figure 3B, bottom) was somewhat smaller than, but comparable to, the biases observed in the decoded features (57.0%, averaged across three subjects; see Methods: "Evaluation of visual appearance").
An additional analysis using five visual subareas (V1-V4 and the higher visual cortex [HVC]) showed similar results even in lower visual areas: the peak shift toward the attended image was observed in 403 out of 475 (= 5 subjects × 19 layers × 5 areas) conditions (84.8%; Figure 3C; see Figure S3 for the results of individual subjects). These results indicate that robust attentional modulations, as measured by the equivalence to biased stimulus contrasts, are found across visual areas and levels of hierarchical visual features.

Figure 3. Attentional modulation modeled by image contrast
(A) Evaluation procedure by weighted image contrasts. Correlations were calculated between decoded feature patterns and feature patterns computed from superpositions with weighted contrasts (in 5% steps, ranging from 0% vs. 100% to 100% vs. 0% for attended vs. unattended images; the presented stimuli were 50% vs. 50%). (B) Correlation coefficients between decoded feature patterns and feature patterns computed from weighted superpositions. Top panels show results from individual trials with presented and reconstructed images. The bottom panel shows the results averaged across all trials. Colored lines indicate correlations for individual DNN layers (a total of 19 layers; decoded from VC; averaged across five subjects; see Figure S3 for the results of individual subjects). Dots indicate the contrasts showing the highest correlations with decoded feature patterns. The dashed line indicates the mean contrast of visual appearance evaluated in an independent behavioral experiment (gray area, 95% C.I. across pairs; see Methods: "Evaluation of visual appearance"). (C) Correlation coefficients between decoded feature patterns from individual visual subareas and feature patterns computed from weighted superpositions. Conventions are the same as in (B).
Finally, we further investigated attentional modulations in terms of feature specificity in individual visual areas. Here, we performed a pair-wise identification analysis based on feature correlations, in which a decoded feature pattern was used to identify an image between two candidates by comparing the correlations with the image features. The identification of attended images was performed for all combinations of areas and layers, and the results were compared with single-image identification (Figure 4A). While V1-V3 show markedly superior performance in single-image identification, especially at lower-to-middle DNN layers, this superiority is diminished in attended image identification. For V1, attended image identification is generally poor across all DNN layers. Thus, V1-V3 appear to play a major role in representing stimuli, but less so in attentional modulation. The attended image identification performances of different brain areas show similar profiles, peaking at middle-to-higher DNN layers. The representations at these levels may be critical for attentional modulation.
A closer look reveals a hierarchical correspondence between brain areas and DNN layers. The attended image identification shows higher accuracies for lower-to-middle areas (V2 and V3) with features of lower-to-middle layers (conv2-5) and for higher areas (V4 and HVC) with features of higher layers (fc6-8; Figure 4B, left; see Figure S4 for the results of individual subjects). This accuracy pattern generally mirrors the tendency found in single-image identification performance (Figure 4B, right) except for V1. These results suggest that attentional modulation is also constrained by the hierarchical correspondence between brain areas and DNN layers observed for stimulus representation.

Discussion
In this study, we investigated how top-down attention modulates the neural representation of visual stimuli and their reconstructions using the deep image reconstruction approach. We found that the reconstructions from visual cortical activity during selective attention resembled the attended rather than the unattended images. While reconstruction quality varied across stimuli and trials, successful reconstructions stably reproduced distinctive features of the attended images (e.g., shapes and colors). When the reconstructions were modeled using superimposed images with biased contrasts, attentional biases were observed consistently across visual cortical areas and levels of hierarchical visual features, and were comparable to the appearance of equally weighted stimuli under attention. The identification analysis based on feature correlations revealed elevated attentional modulation at middle-to-higher DNN layers across visual cortical areas. Attentional modulation exhibited a hierarchical correspondence between visual areas (except V1) and DNN layers, as found in stimulus representation. Our analyses demonstrate that top-down attention can render reconstructions in accordance with subjective experience by modulating a broad range of hierarchical visual representations.
We have shown robust attention-biased reconstructions, especially for image pairs whose individual images were well reconstructed when presented alone (Figure 2D). However, there were substantial performance differences across subjects. We found that subjects with higher performance in single-image reconstructions (e.g., Subject 4) did not necessarily show better attended image reconstructions (Figure 2C) or stronger attentional modulations (Figures S3 and S4). The differences in attentional modulation across subjects may be attributable to individual differences in the ability to control attention. Exploring psychological and neuronal covariates of these differences is an important direction for future studies.
Reconstructions were explained by superimposed images with contrasts biased toward the attended ones, which were comparable to the appearance of the stimulus images under attention (Figure 3B). On average, the decoded features were most correlated with stimulus features with contrasts biased to around 60% vs. 40%, overriding the 50% vs. 50% contrasts in the stimuli. However, it should be noted that the peak biases were variable across DNN layers as well as visual areas, image pairs, and trials. Further, the visual features of contrast-biased stimulus images cannot account for the interaction of attentional modulations across layers. Thus, biased image contrast should be considered a rough approximation of attentional modulation in the visual system.

Previous studies have typically used experimenter-designed stimulus features or categories (e.g., [19,20]) to investigate the effects of attentional modulations on neural representations. In contrast, our approach is based on hierarchical DNN features that are discovered through training with a massive dataset of natural images. This allows us to examine attentional effects on hierarchically organized visual features that are difficult for an experimenter to design. Furthermore, image reconstruction from decoded features enables in-depth examinations of the extent and specificity of attentional effects. As this approach primarily relies on the validity of DNNs as computational models of neural representation [26,27], the use of more brain-like DNNs [28,29] may enhance its ability to reveal fine-grained contents of attentionally modulated visual experience.
A limitation of this study is the lack of explicit instructions to subjects about the strategy for directing attention to target images, which might partly explain the variations across subjects (Figure S4C). Higher visual areas tended to be more closely linked to attentional modulation (Figure 3C).

Methods

Training session. The training session consisted of runs of 55 trials each, comprising 50 trials with different images and 5 randomly interspersed repetition trials where the same image as in the previous trial was presented (7 min 58 s for each run).
Each trial was 8 s long with no rest period between trials. The color of the fixation spot changed from white to red for 0.5 s before each trial began, to indicate the onset of the trial. Additional 32- and 6-s rest periods were added to the beginning and end of each run, respectively. Subjects were requested to maintain steady fixation throughout each run and performed a one-back repetition detection task on the images, responding with a button press to each repeated image, to ensure that they maintained their attention on the presented images. In one set of the training session, a total of 1,200 images were each presented only once. This set was repeated five times (1,200 × 5 = 6,000 samples for training). The presentation order of the images was randomized across runs. This training session is identical to that conducted in the previous study [3] (referred to as the "training natural image session" of the "image presentation experiment"). The data for the last two subjects (Subjects 4 and 5) were newly collected, whereas the data for the first three subjects (Subjects 1-3) were adopted from the data published by the previous study [3] (https://openneuro.org/datasets/ds001506/versions/1.3.1).
Test session. The test session consisted of 16 separate runs. Each run comprised 55 trials: 10 single-image trials and 45 attention trials (7 min 58 s for each run). In each single-image trial, images were presented in the same manner as in the training session. In each attention trial, subjects were presented with a sequence consisting of two successive cue images (1 s each; 2 s in total) followed by a spatial superposition of the two cue images (6 s), and were asked to attend to one of the two superimposed images (indicated by a green fixation spot shown with one of the two cue images) while ignoring the other, such that the attended image was perceived more clearly. During the attention period, the subjects were also required to press one of two buttons held in their right hand to report whether they correctly recognized which of the first and second cue images should be attended (percentages of correct, error, and miss trials among a total of 720 attention trials: 99.4%, 0.6%, and 0% for Subject 1; 98.8%, 0.6%, and 0.7% for Subject 2; 97.4%, 0.8%, and 1.8% for Subject 3; 99.9%, 0%, and 0.1% for Subject 4; 93.5%, 3.5%, and 3.1% for Subject 5). In the test session, we used 10 out of the 50 natural images used in the previous study [3] (the "test natural image session" of the "image presentation experiment"; these images were not included in the stimuli of the training session). The 10 images were used to create a total of 45 combinations of superimposed images, and all 45 unique superimposed images, as well as the 10 unique single images, were presented in each run in randomized order (a total of 55 unique images per run). For each combination of two superimposed images, the number of trials in which each image was the target of attention was balanced between the two images within every two consecutive runs and across the entire session (a total of 8 trials for each).

DNN features. The numbers of units in individual layers of the VGG19 model were as follows: conv1_1 and conv1_2, 3211264; conv2_1 and conv2_2, 1605632; conv3_1, conv3_2, conv3_3, and conv3_4, 802816; conv4_1, conv4_2, conv4_3, and conv4_4, 401408; conv5_1, conv5_2, conv5_3, and conv5_4, 100352; fc6 and fc7, 4096; and fc8, 1000.
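These unit counts follow from the VGG19 architecture with 224 × 224 inputs and can be verified mechanically; a quick shape check using torchvision (used here for convenience; not the study's own pipeline):

```python
import torch
from torchvision.models import vgg19

# Architecture only; pre-trained weights are not needed to check shapes.
model = vgg19(weights=None).eval()
x = torch.zeros(1, 3, 224, 224)
for layer in model.features:
    x = layer(x)
    if isinstance(layer, torch.nn.Conv2d):
        # e.g., conv1_*: 64*224*224 = 3211264 units; conv5_*: 512*14*14 = 100352
        print(tuple(x.shape[1:]), "->", x[0].numel(), "units")
# The classifier layers give fc6/fc7 = 4096 units and fc8 = 1000 units.
```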

Feature decoding analysis
We used a set of linear regression models to construct multivoxel decoders that decode a DNN feature pattern for a single presented image from a pattern of fMRI voxel values obtained in the training session (training dataset; samples from 6,000 trials for each subject). The training dataset was used to train decoders to predict the values of individual units in feature patterns of all DNN layers (one decoder per DNN unit). Decoders were trained using fMRI patterns in either the entire visual cortex (VC) or individual visual subareas (V1-V4 and HVC), and the voxels whose signal amplitudes showed the highest absolute correlation coefficients with the feature values of a target DNN unit in the training data were provided to the decoder as inputs (with a maximum of 500 voxels).
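A minimal sketch of this per-unit decoding scheme (ordinary least squares via scikit-learn is used for illustration; the original implementation of the linear regression may differ):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_unit_decoder(X, y, n_voxels=500):
    """Train one decoder for one DNN unit.

    X: (n_samples, n_voxels_total) fMRI patterns from the training session.
    y: (n_samples,) feature values of the target DNN unit.
    Voxels are selected by absolute Pearson correlation with the unit.
    """
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    sel = np.argsort(np.abs(corr))[::-1][:n_voxels]  # top-correlated voxels
    model = LinearRegression().fit(X[:, sel], y)
    return model, sel

def decode_unit(model, sel, X_test):
    """Apply a trained unit decoder to test-session fMRI patterns."""
    return model.predict(X_test[:, sel])
```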
The trained decoders were then applied to the fMRI data obtained in the test session (test dataset) to produce decoded DNN feature patterns for each trial.

Visual image reconstruction analysis
We performed the image reconstruction analysis using the previously proposed method [3], which extended an earlier feature-inversion algorithm to combine features from multiple DNN layers and to use DNN features decoded from the brain instead of those computed from a reference image. To produce natural-looking images, the method further introduced a deep generator network (DGN) [25], pre-trained to generate natural images within the generative adversarial network (GAN) framework [43], and performed the optimization in the input space of the DGN.
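A conceptual sketch of this latent-space optimization in PyTorch; the generator and per-layer feature extractors stand in for the pre-trained DGN and the VGG19 pipeline, and the layer weighting, learning rate, and momentum are assumptions rather than the published settings:

```python
import torch

def reconstruct(generator, feature_nets, target_feats, n_iter=200,
                lr=1.0, momentum=0.9):
    """Optimize a DGN latent vector so that features of the generated image
    approach the decoded (target) features, summed over DNN layers.

    generator: module mapping a latent vector to an image (stand-in DGN).
    feature_nets: dict layer_name -> module mapping image to features.
    target_feats: dict layer_name -> decoded feature tensor.
    """
    z = torch.zeros(1, 4096, requires_grad=True)  # zero-valued initial latent
    opt = torch.optim.SGD([z], lr=lr, momentum=momentum)
    for _ in range(n_iter):
        opt.zero_grad()
        img = generator(z)
        # Sum of squared feature mismatches across all layers.
        loss = sum(((feature_nets[k](img) - target_feats[k]) ** 2).sum()
                   for k in target_feats)
        loss.backward()
        opt.step()
    return generator(z).detach()
```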
In this study, following the method developed by the previous study [3], we used decoded DNN features from multiple DNN layers (a total of 19 layers of the VGG19 model) and introduced the pre-trained DGN [44] (the model for fc7 available from https://github.com/dosovits/caffe-frchairs) to constrain reconstructed images to have natural image-like appearances. The optimization was performed using a gradient descent with momentum algorithm [45], starting from zero-valued vectors as the initial state in the latent space of the DGN (200 iterations; see [3] for details; code is available from https://github.com/KamitaniLab/DeepImageReconstruction).

Evaluation of visual appearance
Each trial of the behavioral experiment consisted of a cue period (2 s), an attention period (6 s), a white-noise period (0.1 s), and an evaluation period (no time constraint), in which the cue and attention periods were the same as those of an attention trial in the fMRI test session. During the white-noise period, we presented white-noise images (0.1 s, 60 Hz) at the same location as the images presented during the preceding cue and attention periods, to diminish any potential effects of afterimages. During the evaluation period, we presented a test image consisting of a mixture of the preceding two cue images, initialized with a random contrast for the weighted superposition. Subjects were required to adjust the stimulus contrast of the test image with button presses to match the visual appearance of the image perceived during the preceding attention period. After matching the contrast, subjects could start the next trial 2 s after pressing another button to proceed. The evaluation was performed for all 45 combinations of superimposed images and two attention conditions (a total of 90 conditions), which were evaluated in two separate runs in randomized order (~15 min for each run). Each subject evaluated all conditions twice, and the mean contrast averaged across all subjects, repetitions, and attention conditions was used as the score for a specific pair (e.g., "owl" and "post").

Identification analysis
In the identification analysis based on feature correlations, correlation coefficients were calculated between a pattern of decoded features and the patterns of image features calculated from two candidate images (one attended and one unattended image for attention trials; one true [presented] and one false image for single-image trials). For each reconstructed image from attention trials, the pair-wise identification was performed with the pair of attended and unattended images (one pair for each sample). For each reconstructed image from single-image trials, the identification was performed for all pairs between the one true (presented) image and the other nine false images used in the test session (nine pairs for each sample). The image with the higher correlation coefficient was selected as the predicted image. The accuracy for each sample was defined as the proportion of correct identifications. To eliminate potential biases due to baseline differences across units, the feature values of individual units underwent z-score normalization, using means and standard deviations of the feature values of individual units estimated from the training data, before the identification was performed.
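A minimal sketch of this correlation-based identification with unit-wise z-scoring (names are illustrative; whether the decoded pattern itself is also normalized is an implementation detail assumed here):

```python
import numpy as np

def identify(decoded, candidate_feats, mu, sd):
    """Return the index of the candidate whose z-scored feature pattern
    correlates most with the decoded pattern.

    mu, sd: per-unit means/SDs estimated from the training data.
    """
    z = lambda f: (f - mu) / (sd + 1e-12)  # unit-wise z-score normalization
    rs = [np.corrcoef(decoded, z(f))[0, 1] for f in candidate_feats]
    return int(np.argmax(rs))
```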

QUANTIFICATION AND STATISTICAL ANALYSIS
A one-sided Wilcoxon signed-rank test was used to test the significance of the identification accuracies based on behavioral evaluations (n = 90; Figure 2C) and to test the significance of the single-image identification accuracies based on decoded DNN features (n = 160; Figure 4B, right). The correlation between the identification accuracies of attention and single-image trials based on behavioral evaluations was tested by a t-test (n = 45; Figure 2D). A one-sided binomial test was used to test the significance of the attended image identification accuracies based on decoded DNN features (n = 716, 711, 701, 719, and 673 for Subjects 1-5, respectively; Figure 4B, left).
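For reference, the corresponding tests are available in SciPy; a sketch with toy numbers (not the study's data):

```python
import numpy as np
from scipy.stats import wilcoxon, binomtest, pearsonr

rng = np.random.default_rng(0)

acc = rng.uniform(0.4, 1.0, size=90)  # toy per-pair accuracies
print(wilcoxon(acc - 0.5, alternative="greater"))  # one-sided, vs. 50% chance

print(binomtest(k=450, n=716, p=0.5, alternative="greater"))  # toy trial counts

x, y = rng.uniform(0.5, 1.0, 45), rng.uniform(0.5, 1.0, 45)
print(pearsonr(x, y))  # Pearson correlation; p-value from a t-test
```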

Supplemental Information
Attentionally modulated subjective images reconstructed from brain activity

Figure S2. Examples of attended image reconstructions
(A) Examples of attended image reconstructions with high rating accuracies. Reconstructed images with relatively high rating accuracies (higher than 80%) are shown. Conventions are the same as in Figure 2A.
(B) Examples of attended image reconstructions with low rating accuracies.
Reconstructed images with relatively low rating accuracies (lower than 60%) are shown. The failures of attended image reconstructions were categorized as cluttered images, mixtures of the two superimposed images, or images more similar to the unattended images.
(C) Reconstructed images from trials without button responses.
Reconstructed images obtained from miss trials, in which subjects failed to press a button to indicate correct recognition of the target image, are shown. The reconstructions from these miss trials tended to resemble neither of the two superimposed images.
(D) Reconstructed images from trials with wrong button responses.
Reconstructed images obtained from error trials, in which subjects pressed the wrong button when indicating the target image, are shown. The reconstructions from these error trials sometimes produced images judged to be similar to the non-target (i.e., instructed-to-be-unattended) images.

Conventions are the same as in Figure 3C. The subjects whose reconstructions from attention trials were rated highly (e.g., Subjects 1-3; cf. Figure 2C) showed greater biases in decoded feature patterns, which deviated from the presented images (50%) toward the attended images (100%).

Colored lines beneath the data indicate the results of statistical tests (for attention, one-sided binomial test, p < 0.05; for single-image, one-sided Wilcoxon signed-rank test, p < 0.05). The results showed larger variability among subjects in the accuracies of attended image identification than in those of single-image identification, possibly due to individual differences in the ability to direct selective attention. Differences in the brain regions showing high attended image identification accuracies might be attributable to differences in the subjects' attention strategies, as we did not explicitly specify a strategy for directing attention.