A massive 7T fMRI dataset to bridge cognitive and computational neuroscience

Extensive sampling of neural activity during rich cognitive phenomena is critical for robust understanding of brain function. We present the Natural Scenes Dataset (NSD), in which high-resolution fMRI responses to tens of thousands of richly annotated natural scenes are measured while participants perform a continuous recognition task. To optimize data quality, we develop and apply novel estimation and denoising techniques. Simple visual inspections of the NSD data reveal clear representational transformations along the ventral visual pathway. Further exemplifying the inferential power of the dataset, we use NSD to build and train deep neural network models that predict brain activity more accurately than state-of-the-art models from computer vision. NSD also includes substantial resting-state and diffusion data, enabling network neuroscience perspectives to constrain and enhance models of perception and memory. Given its unprecedented scale, quality, and breadth, NSD opens new avenues of inquiry in cognitive and computational neuroscience.

Neuroscience has an insatiable appetite for data. Many ongoing efforts to extensively sample brain 3 activity (de Vries et al., 2020;Siegle et al., 2021;Stringer et al., 2019) and structure (Markram et al.,4 2015; Van Essen et al., 2013;Zheng et al., 2018) are motivated, in part, by the availability of new 5 computational methods that make analysis of massive datasets feasible (Jonathan Pillow and Sahani,6 2019). Equally as important is the growing desire to understand how the brain coordinates complex 7 sensory and motor behaviors and the realization that the neural networks supporting such behaviors span 8 multiple scales, from single neurons to local circuits to whole systems (Bassett and Sporns, 2017). 9 Understanding massive, complex networks will inevitably require commensurately massive amounts of 10 data.

12
The need for massive data is especially acute in visual neuroscience, a model system for understanding 13 brain function. The network that mediates our ability to flexibly and efficiently perceive the visual world 14 occupies approximately one-third of human cerebral cortex (Van Essen et al., 2001) and interconnects 15 brain areas with profoundly different functional properties (Grill-Spector and Malach, 2004). This network 16 both encodes visual stimuli and interfaces visual representations into a cognitive context, including 17 information about what one has already seen (Wheeler et al., 2000), might see (Breedlove et al., 2020), 18 or is selectively attending (Kay et al., 2015). Understanding vision thus means interrogating a high- 19 dimensional, context-dependent neural network. 20 21 Given these considerations, it is clear that extensive experimental data providing access to whole-brain 22 responses to complex stimuli are critical in the quest to understand the human visual system. The ideal 23 dataset should include naturalistic stimuli: the visual system is distributed widely across the brain, and 24 natural scenes, in addition to being ecologically relevant (Geisler, 2008), are effective activators of the 25 entire system (Huth et al., 2012). Moreover, the ideal dataset should be large: in order to take full 26 advantage of powerful data analysis and machine learning (ML) techniques that have recently become 27 available (Vu et al., 2018), we need considerably more data than is currently available. How much? 28 Modern ML methods used in computer vision to process natural scenes (e.g. deep convolutional neural 29 networks) require tens to hundreds of thousands of image samples for training (Krizhevsky, 2009;Lin et 30 al., 2014). A dataset that sampled brain activity at these scales would raise the exciting possibility of 31 exploiting these methods to develop better models of how the brain processes natural scenes (Güçlü and 32 van Gerven, 2015;Han et al., 2019;Khaligh-Razavi and Kriegeskorte, 2014;Seeliger et al., 2021;33 Stansbury et al., 2013;St-Yves and Naselaris, 2017;Yamins et al., 2014;Zhuang et al., 2021), and would 34 accelerate efforts to bridge computational and cognitive neuroscience (Naselaris et al., 2018). 35 36 In this paper, we present the first dataset that achieves sampling at this ambitious scale. The Natural 37 Scenes Dataset (NSD) consists of high-resolution (1.8 mm) whole-brain 7T fMRI of 8 carefully screened 38 human participants who each viewed 9,000-10,000 color natural scenes (22,000-30,000 trials) during 39 30-40 scan sessions distributed over the course of a year. Aggregated across participants, NSD includes 40 responses to 70,566 distinct natural scene images-this is more than an order of magnitude larger than 41 comparable datasets involving fMRI sampling of many images (Chang et al., 2019; Horikawa and extensive sampling with a practical time commitment, we used an aggressive rapid event-related design. 48 This drove the development of new analysis techniques that accurately compensate for the overlap of 49 hemodynamic responses across successive trials. To ensure participant engagement and control 50 cognitive state, we incorporated a continuous recognition task (Brady et al., 2008) in which participants 1 were instructed to indicate whether they have seen each presented image at any point in the past. In 2 addition to making the experiment tolerable (and even somewhat interesting) for participants, inclusion of 3 this task makes NSD, to our knowledge, the longest-term continuous recognition memory fMRI study in 4 history and, thus, a likely source of new insights into long-term memory formation and the cognitive 5 context of vision. Finally, to ensure the broad reach of the NSD dataset, we incorporated design input 6 from a large network of collaborators with diverse scientific interests (e.g., low-level vision, high-level 7 vision, memory, connectivity, neuroanatomy) and technical expertise (e.g., mapping, multivariate pattern 8 analysis, encoding models, representational similarity analysis, neural network modeling). This input 9 helped precipitate a carefully curated dataset with extensive auxiliary measures, thereby increasing the 10 likelihood that NSD will enjoy widespread application in cognitive and computational neuroscience. We obtained 73,000 color natural scenes from the richly annotated Microsoft Common Objects in Context 5 (COCO) image dataset (Lin et al., 2014), a dataset that is heavily used in the computer vision and 6 machine learning communities. Our experimental design specified that each of 8 subjects would view 7 10,000 distinct images and a special set of 1,000 images would be shared across subjects (8 subjects  8 9,000 unique images + 1,000 shared images = 73,000 images). This sampling strategy was chosen to 9 maximize the number of distinct images in NSD. Each image would be presented 3 times to a given 10 subject. While this is a low number, we reasoned that 3 trials would be sufficient to produce robust 11 responses given our use of ultra-high field (7T) fMRI. Furthermore, images would be presented using a 12 rapid event-related design consisting of 4-s trials ( Figure 1A). This was done to maximize statistical 13 power and to create an engaging experience for the subjects. In addition, the continuous nature of task 14 engagement helps avoid unwanted arousal-related fMRI signals (Roth et al., 2020). 15 16 The NSD experiment was split across 40 scan sessions for each subject ( Figure 1B). To control cognitive 17 state and encourage deep processing of the images, subjects were instructed to perform a continuous 18 recognition task in which they reported whether the current image had been presented at any previous 19 point in the experiment. We controlled the distributions of image presentations such that both short-term 20 and long-term repetitions were probed (Supplementary Figure 1A). Parameters were selected such that 21 even in the first scan session, images were not always new, and even in the last scan session, images 22 were not always old (Supplementary Figure 1B). participants viewed sequences of color natural scenes and judged whether each image had been 28 previously shown at any point in the past. The scenes, taken from Microsoft's COCO (Lin et al., 29 2014), are richly annotated with object information (as depicted). B, Run and session design. Each 30 run lasted 5 minutes and consisted of 62 or 63 stimulus trials with occasional interspersed blank 31 trials. Each scan session consisted of 12 runs (750 stimulus trials). C, Timeline of 7T fMRI scan 32 sessions. Each subject participated in an initial screening session (prffloc), 30-40 NSD core 33 sessions, and two final sessions (nsdsynthetic, nsdimagery). The first NSD core session 34 corresponds to day 0. D, Behavioral compliance. Results for individual subjects (thin colored lines) 35 and the median across subjects (thick black line) are shown. Easy trials are defined as trials that 36 involved the presentation of an image that had occurred earlier in the same scan session. 37 1 Neuroimaging data collection on carefully selected subjects 2 3 All fMRI data in NSD were collected at 7T using a whole-brain 1.8-mm 1.6-s gradient-echo EPI pulse 4 sequence. After verbally screening a number of potential participants with respect to basic eligibility 5 criteria, we recruited 14 subjects to participate in an initial 7T fMRI screening session which involved 6 population receptive field (pRF) (Benson et al., 2018) and category localizer (fLoc) (Stigliani et al., 2015) 7 experiments. Based on data from this scan session, we ranked the 14 subjects with respect to data 8 quality and invited the top 8 subjects to participate in the full NSD experiment (all subjects accepted). This 9 selection process was conducted to ensure the best possible data quality for NSD. Analyses conducted 10 after completion of the NSD experiment confirm that the ranking procedure successfully identified 11 subjects that yield high-quality data and that data quality would have suffered substantially had we 12 omitted the selection process ( Figure 2C). 13 14 Data were collected from the 8 NSD subjects over the course of a year ( Figure 1C). Subjects consistently 15 engaged with the task as indicated by the near perfect response rates ( Figure 1D, top). Moreover, the 16 subjects correctly recognized items that were repeated within a given scan session ("easy trials"; Figure   17 1D, bottom) at rates higher than a 50/50 guess rate (for details on recognition memory, see Figure 6B). 18 The full NSD dataset includes a variety of anatomical neuroimaging measures (including T1, T2, diffusion, 19 SWI, and TOF), functional neuroimaging measures (including the pRF and fLoc experiments, the NSD 20 experiment, resting-state data, and two additional experiments involving synthetic stimuli and visual 21 imagery), and behavioral measures (Figure 2A-B). In some fMRI sessions, physiological and eyetracking 22 data were also collected. With regards to the core NSD experiment, we completed the full set of 40 NSD 23 scan sessions for four of the subjects, but due to unforeseen summer absences and scheduled respectively. Resting-state data were collected before and after the NSD runs in a subset of the 5 NSD core sessions (totaling 100 or 180 minutes per subject). B, Available measures. Examples of 6 the actual data are depicted. C, Participant selection. Data quality from the initial screening session 7 was used to rank a set of 14 participants. On the right is an illustration of one measure contributing 8 to the ranking, specifically, variance explained in the fLoc experiment (one slice per participant; 9 identical color range). The inset compares the participant ranking against the b3 noise ceiling 10 calculated on the full NSD dataset (see Figure 5). A line fit to the 8 NSD subjects (green dots) is 11 extrapolated to predict noise ceilings for the subjects who were not selected for participation in 12 NSD (red circles). D, Metrics of data quality. Results for individual subjects (thin colored lines) and 13 the median across subjects (thick black line) are shown. The insets show detail on tSNR and head 14 motion for one sample run (see Supplementary Figures 3-4 for more information). 15 16 Stable high-resolution functional imaging across scan sessions 17 18 We believe visual inspection is the most effective way to assess many common aspects of fMRI pre- 19 processing (Kay et al., 2019). Accordingly, we generated a comprehensive set of movies that detail the 20 excellent quality of the pre-processed NSD dataset. These include movies that assess the co-registration 21 of the different imaging modalities (e.g. T1, T2, EPI; Supplementary Video 1); movies that assess the  Videos 8 and 9). The movies demonstrate 5 that the quality of the NSD data enables precision functional mapping (Gordon et al., 2017): activity 6 patterns are fine-scale and highly reliable within individual subjects and these patterns are distinct across 7 subjects. 8 9 In addition to visual inspection, quantitative data quality metrics were computed for each NSD scan 10 session. This was in fact done on a rolling basis as the data were acquired, allowing us to monitor data 11 quality and provide performance bonuses to the subjects. Inspecting the metrics, we see that tSNR is 12 stable across scan sessions for each subject (Figure 2D, left). One subject, subject 8, exhibits low tSNR 13 compared to the other subjects; this can be attributed to higher levels of head motion for this subject 14 (Satterthwaite et al., 2014) (Figure 2D, middle). We also observe that BOLD responses are stable across 15 scan sessions for each subject, though there is substantial variation in the strength of BOLD responses 16 across subjects (Figure 2D, right).

18
One feature we implemented in the pre-processing of the fMRI data was to interpolate the data on both a 19 fine temporal grid and a fine spatial grid. This upsampling strategy preserves fine-scale detail that is 20 present in the raw fMRI data due to the temporal jitter of the acquired fMRI volumes relative to the 21 experimental paradigm (Watanabe et al., 2013) and the spatial jitter of the acquired fMRI volumes relative 22 to the brain's anatomy (Kang et al., 2007;Kay et al., 2019). An illustration of the benefits of upsampling is 23 provided in Figure 3. This example highlights the existence of replicable, fine-scale detail in fMRI image 24 intensities as well as in BOLD responses extracted from the fMRI data. For an example coronal slice in Subject 1, we compare the non-upsampled 1.8-mm preparation of 30 the data (left), the upsampled 1-mm preparation (right), and a post-hoc upsampled version of the 1 1.8-mm results (middle). Two quantities are shown: mean signal intensity and variance explained 2 by an ON-OFF GLM model. B, Zoomed view of white rectangle marked in panel A. C, Profile view 3 of blue dotted horizontal line marked in panel B. Error bars in the bottom plot indicate ± 1 SEM 4 across sessions. D, Timecourse estimates for voxels marked by blue arrows in panel C. Each trace 5 corresponds to an estimate of the hemodynamic timecourse for a single voxel in one NSD scan 6 session. 7 8 Highly reliable diffusion data and derivatives 9 10 The diffusion data included with the NSD dataset complements the extensive fMRI measurements. We 11 pre-processed the raw diffusion data using the state-of-the-art DESiGNER pipeline methodology (Ades-12 Aron et al., 2018) as implemented on brainlife.io (Avesani et al., 2019). We find that the quality of the pre-13 processed diffusion data for each subject is high, as evidenced by the signal-to-noise ratio 14 (Supplementary Figure 6B). We then proceeded to perform diffusion signal modeling (Daducci et al., 15 2015;Jensen and Helpern, 2010;Pierpaoli et al., 1996;Tournier et al., 2007;Zhang et al., 2012), 16 anatomically-informed tractography (Smith et al., 2012), andprofilometry (Yeatman et al., 2012). White-17 matter microstructural properties are found to be highly reliable for each subject ( Figure 4A). Structural 18 connectivity matrices (Hagmann et al., 2008) derived from tractography results are also highly reliable, 19 both at the group level ( Figure 4B-C) as well as at the single-subject level ( Figure 4C, inset). The ready-20 to-use diffusion derivatives provided with NSD include a variety of macrostructural and microstructural 21 measures, white-matter tracts, and structural connectivity matrices. These derivatives can easily 22 integrated into machine learning workflows, and serve as launching points for scientific investigations 23 seeking to apply network neuroscience perspectives (Bassett and Sporns, 2017) to understanding brain 24 function in the NSD dataset. To increase the value of the NSD dataset to the broader community, we performed analysis of the data 3 from the pRF and fLoc experiments and manually defined regions of interest (ROIs) based on the results. 4 The defined ROIs include retinotopic visual areas based on the pRF results (V1, V2, V3, hV4), 5 eccentricity-based regions based on the pRF results (bands between 0°, 0.5°, 1°, 2°, 4°, and beyond), 6 and category-selective regions based on the fLoc results (face-, word-, place-, and body-selective We performed a general linear model (GLM) analysis of the data from the NSD experiment in order to 15 help streamline subsequent analyses of the data. The goal of the GLM was to obtain single-trial betas 16 representing the estimated response of each voxel to each trial conducted. Given the low signal-to-noise 17 ratio of fMRI and the overlap of the hemodynamic response from trial to trial, estimating accurate betas is 18 a challenging endeavor. We thus developed a novel GLM approach consisting of three components. First, 19 we used a library of hemodynamic response functions (HRFs) derived from an initial analysis of the 20 dataset as an efficient and well-regularized method for estimating voxel-specific HRFs (Figure 5A-C). 21 Second, we adapted the GLMdenoise technique (Charest et al., 2018;Kay et al., 2013a) to the single-trial 22 GLM framework, thereby enabling the use of data-driven nuisance regressors ( Figure 5D). Third, to 23 address the challenge posed by highly correlated single-trial regressors, we developed an efficient 24 implementation of ridge regression (Hoerl and Kennard, 1970;Rokem and Kay, 2020) and used this to 25 regularize and improve the accuracy of the betas ( Figure 5E). To assess the efficacy of these various 26 GLM techniques, we generated three versions of the betas, reflecting increasing sophistication 27 (Supplementary Figure 8). Beta version 1 (b1) is the result of simply using a canonical HRF for all 28 voxels. Beta version (b2) is the result of fitting an HRF to each voxel using the library-of-HRFs approach. 29 Beta version (b3) uses the library-of-HRFs approach like b2 but also adds the use of GLMdenoise and 30 ridge regression in an attempt to improve the accuracy of the betas.  To assess the quality of the different beta versions (b1, b2, b3), we calculated noise ceilings for individual 19 voxels. Surface maps of noise ceiling results reveal locations of reliable responses to the NSD stimuli: 1 high noise ceilings are present in occipital cortex and extend into temporal and parietal cortex ( Figure 5F 2 and Supplementary Video 10). Importantly, the maps reveal very large increases in noise ceilings from 3 b1 to b2 to b3, indicating that the additional GLM techniques incorporated into b2 and b3 improve 4 reliability of responses. Detailed quantifications show that these improvements are highly consistent 5 across voxels and subjects ( Figure 5G and Supplementary Figure 9).

7
The absolute magnitudes of the noise ceilings are noteworthy. For beta version b3, noise ceiling levels in 8 visual cortex are, on average, 36% (calculated by computing the median across the nsdgeneral ROI and 9 then averaging across subjects). This means that a typical visual cortex voxel in the NSD dataset has 10 associated with it a set of 10,000 responses (30,000 trials divided by 3 trials per image = 10,000 images) 11 and a large percentage, 36%, of the variance in these 10,000 values is a signal that is, in theory, 12 predictable. Expressed in terms of Pearson's correlation (r), this is equivalent to a prediction accuracy of r 13 = 0.60. 14 15 To put these numbers into further perspective, we propose the concept of 'equivalent trials' which allows 16 comparison of different datasets that vary in signal-to-noise ratio and trial distribution. The next largest 17 data collection effort that is similar in nature to NSD is BOLD5000 (Chang et al., 2019). Using the same 18 GLM analysis methods on both NSD and BOLD5000, we find that the signal-to-noise ratio per trial is 19 approximately 0.260 for NSD and 0.187 for BOLD5000. Combining these values with the number of trials 20 conducted in each dataset, we estimate that the total size of the NSD dataset is 213,000 trials  (0.260) 2 21 = 14,399 equivalent trials, whereas the total size of BOLD5000 is 18,870 trials  (0.187) 2 = 660 equivalent 22 trials. Thus, using the metric of equivalent trials, NSD can be viewed as 14,399/660 = ~22 times as large 23 as the BOLD5000 dataset. This is a massive increase in power. Note that even if we do not take into 24 account the higher SNR per trial in the NSD dataset, NSD still has substantially more subjects ( Having demonstrated the quality of the NSD data, we now turn to example analyses that illustrate the rich 31 scientific insights that can be derived from the data. As a simple starting example, we fit a voxelwise pRF 32 model that uses local contrast in the NSD images to account for the NSD betas. This simple model is 33 expected to recover spatial tuning in early visual cortex where responses co-vary with stimulus energy 34 (Albrecht and Hamilton, 1982;Boynton et al., 1999). Indeed, in all eight subjects, high-quality maps of 35 angle and eccentricity estimates are obtained in early visual cortex, and these estimates extend all the 36 way to the fovea (Supplementary Figure 10). These results provide a check of the validity of the NSD 37 betas, and provide evidence that subjects were able to successfully maintain central fixation during the 38 NSD experiment.

40
Reliable and long-term recognition memory effects 41 42 The use of a continuous recognition task establishes NSD as one of the largest datasets relevant to 43 human memory. Despite the challenging nature of the task, we find that subjects were able to 44 successfully discriminate old images from new images (average d' across subjects: 1.28, maximum: 1.47, 45 minimum: 0.94). Further, recognition memory remained above chance even at long timescales between 46 repetitions ( Figure 6A). Specifically, for each session, we calculated a measure of recognition accuracy 47 accounting for guessing (adjusted hit rate: hit rate minus false alarm rate) and binned this measure by the 48 time since last exposure (considering only those trials involving a previously shown image). At the group 49 level, subjects exhibit performance levels greater than chance (adjusted hit rate > 0) in all time bins. At 50 the level of individuals, all subjects show a positive adjusted hit rate in the longest time bin for which data 1 are available for every subject (when binning on a log scale; 7 out of 8 subjects when binning on a linear 2 scale). These results indicate that from its behavioral component alone, NSD is powered to address 3 questions concerning human memory spanning short (seconds) to relatively long (months) timescales. But what about neural effects? To assess whether recognition effects are present in the fMRI 20 measurements, we performed two-sample t-tests contrasting NSD betas observed for hits with NSD betas 21 observed for correct rejections (the so-called 'old/new effect'; (Wagner et al., 2005)). We find highly  (Spaniol et al., 2009), and persist at the group level ( Figure 6B, bottom). The scale and statistical power 25 afforded by the NSD dataset also provides novel insight. Whereas old/new effects are typically studied 1 using group-level analyses, the quality of the NSD dataset reveals highly statistically significant results at 2 the level of individual subjects. Indeed, when pooling trials across all NSD scan sessions, several 3 subjects exhibit statistically significant activity differentiating hits and correct rejections in nearly the entire 4 cerebral cortex (see results for a representative subject in Figure 6B, top). Reminiscent of past datasets 5 employing extensive sampling of individuals (Gonzalez-Castillo et al., 2012), the current results suggest 6 that the extent of cortex engaged by basic memory processes is much more widespread than previously 7 appreciated. visual areas V1, V2, and V3. In anterior ventral temporal cortex (aVTC), the animacy progression is 20 present to some extent, but a different, more clustered representation emerges that presumably reflects 21 more complex categorical and semantic clusters. Indeed, zooming in on small sections of the t-SNE 22 embedding for aVTC reveals that these clusters contain images with relatively homogeneous semantic 23 content ( Figure 7C): the blue cluster is dominated by images of round edible objects, while the gray 24 cluster is dominated by images of people interacting with objects. Overall, these results demonstrate the 25 rich inferential power of the NSD dataset and indicate that the dataset can be readily exploited to 26 characterize how visual representations arise in the human brain. RSA analysis. B, t-SNE embedding for each ROI in an example subject (subject 1). Each dot 5 represents a distinct image (total 10,000). Using category labels from the COCO image dataset, we 6 color each dot according to whether the associated image contains particular combinations of 7 people, animals, and inanimates. C, t-SNE embedding for anterior ventral temporal cortex with 8 actual images depicted. Insets highlight an inanimate cluster (blue inset) and a cluster of people 9 with inanimate objects (gray inset). 10 11 A brain-optimized neural network model of the visual system 12 13 One of the main motivations for NSD was to amass sufficient sampling of brain activity to be able to drive 14 data-hungry machine learning techniques. As an intriguing test case, we specifically investigated whether 15 we could successfully use the scale of NSD to train, from scratch, a deep convolutional neural network 16 (CNN) to accurately predict brain activity (Seeliger et al., 2021). Adopting the framework of encoding 17 models (Naselaris et al., 2011), we took NSD betas from visual areas V1-hV4, divided these data into a 18 training set and validation set, and evaluated how accurately different computational models predict brain 19 responses in the validation set based on the presented image. The primary encoding model of interest is 20 based on a new network we refer to as 'GNet', a brain-optimized CNN whose parameters are trained 21 using image-response pairings observed in the training set. For comparison, we also evaluated an Varying the amount of training data provided to the models, we find that the performance of the GNet-5 based encoding model is relatively poor when only small amounts of training data are available ( Figure   6 8A, orange arrows). This is expected since the feature extractors in GNet (see Methods) are not pre-7 trained and thus require data for tuning. However, when large amounts of training data are available, the 8 GNet model exhibits an impressive increase in performance, achieving approximate parity with the 9 AlexNet-based encoding model ( Figure 8A, blue arrows). Interestingly, by implementing a variant of 10 GNet that is able to exploit data from multiple subjects, we demonstrate that the GNet model is able to 11 outperform the AlexNet model (two-tailed paired t-test across subjects, p = 0.013), albeit modestly 12 ( Figure 8A, red arrows). For additional insight into model performance, we compared voxel-wise 13 performance levels to noise ceiling estimates ( Figure 8B). Across voxels, prediction accuracy is tightly 14 correlated with the noise ceiling, suggesting that voxel-wise differences in prediction accuracy simply 15 reflect differences in signal-to-noise ratio. In addition, performance levels are close to, but do not reach, 16 the noise ceiling. 17 18 19 20 Figure 8. Prediction of brain activity using a brain-optimized neural network. We used 21 encoding models (Naselaris et al., 2011) to predict voxel activity in V1-hV4. NSD betas were 22 divided into a training set and validation set, and the accuracy of different encoding models was 23 quantified as the voxel-wise correlation between model predictions and responses observed in the 24 validation set. A, Performance as a function of amount of training data used. Models include an 25 encoding model based on AlexNet which is a task-optimized neural network (blue); encoding 1 models based on GNet which is a brain-optimized neural network trained using data from single 2 subjects (orange) or data from multiple subjects (red); and a V1-like control model based on Gabor 3 filters (purple). Error bars reflect one standard deviation across results obtained from different 4 bootstrap samples of the data. B, Detailed view of the performance of the multi-subject GNet model 5 for a representative subject. C, Surface maps depicting spatial distribution of validation accuracy for 6 the multi-subject GNet model. 7 8 The demonstration that an encoding model based on a brain-optimized CNN (GNet) outperforms an 9 encoding model based on a task-optimized CNN (AlexNet) is significant, as it indicates NSD is large 10 enough to successfully train a complex neural network architecture. Had the NSD dataset been smaller in 11 scale or lower in quality, qualitatively different patterns of model performance would have been obtained 12 (in Figure 8A, compare orange arrows reflecting a few thousand trials to red arrows reflecting tens of 13 thousands of trials). The availability of NSD thus opens the door to using brain activity to directly guide In the last several years, there have been a number of large-scale neuroimaging datasets that have been 3 made publicly available for re-use (Aliko et al., 2020;Gordon et al., 2017;Nastase et al., 2020;Taylor et 4 al., 2017;Van Essen et al., 2013). Several distinguishing aspects of the present work set NSD apart from 5 past datasets. One is the unprecedented scale of the dataset. NSD shares the motivation of recent 'deep' 6 (or 'precision') neuroimaging efforts (Bellec and Boyle, 2019;Filevich et al., 2017;Gordon et al., 2017;7 Pinho et al., 2018;Poldrack et al., 2015;Seeliger et al., 2019) seeking to amass large amounts of data 8 from individual subjects, as opposed to modest amounts of data on a large number of subjects. In this 9 context, NSD is, to our knowledge, the most extensive fMRI data collection effort that has been performed 10 to date. This can be gauged not only in terms of the number of hours of fMRI data acquisition per subject 11 (30-40 hours of data for each of 8 subjects on the core NSD experiment) and the high spatial resolution 12 of the acquired data (1.8 mm), but also the wealth of additional measures beyond the core experiment, 13 including substantial amounts of resting-state and diffusion data, physiological data, and functional 14 localizers. The availability of extensive measures provides the opportunity to build complete models of 15 how individual brains support vision and memory (Naselaris et al., 2021).

17
A second aspect is the unusually high quality of the data. Although quality of neuroimaging data is more 18 complex to assess than quantity, assessment of data quality is essential since MRI data have relatively 19 low sensitivity and are prone to errors and artifacts. In particular, when acquiring massive datasets, there 20 is a risk of accumulating unknown sources of noise and artifact. The work presented in this paper (and in 21 the accompanying files in the data release) guards against this possibility by crafting a customized and 22 highly optimized approach to pre-processing the NSD data and providing comprehensive documentation 23 of the high data quality (see also Supplementary Discussion). Several factors likely contributed to the 24 high data quality: these include the use of ultra-high magnetic field strength (7T) which enhances BOLD 25 contrast-to-noise ratio; the screening, training, and incentivization of participants; the detailed inspection 26 and supervision of data processing; and the large network of collaborators who helped guide the design 27 and trajectory of the dataset.

29
A third aspect of the present work lies in the novel analysis techniques developed for improved GLM 30 analysis of fMRI time-series data. These include (i) an efficient and robust method to estimate voxel-31 specific HRFs, (ii) adaptation of the GLMdenoise technique (Kay et al., 2013a) to a single-trial GLM 32 framework, and (iii) development of ridge regression as an effective method for regularizing single-trial 33 response estimates. These three techniques have been integrated into a toolbox that can be applied to 34 other neuroimaging datasets, and are the subject of a forthcoming paper. An important lesson stemming 35 from our results is that well-executed data collection is important but not the only factor to consider: data 36 preparation methods exert a major influence on the quality of a dataset and hence its scientific value. One 37 can view improvements in data quality as equivalent to increases in data quantity, in the sense that 38 analysis methods that reduce unwanted variability (noise) can be interpreted as increasing the effective 39 amount of data collected (Kay et al., 2013a). Thus, by improving data quality, the methods introduced 40 with NSD are contributing to the massive scale of the dataset.

42
The NSD dataset has many potential applications in cognitive and computational neuroscience. Given its 43 extensive sampling of natural scenes (70,566 distinct images aggregated across 8 subjects) and high 44 signal-to-noise ratio, the dataset will be useful for investigating a variety of phenomena in low-, mid-, and 45 high-level vision. In addition, the memory component of the NSD experiment provides a unique 46 opportunity to study the neural mechanisms of both short-and long-term memory (ranging from seconds 47 to many months), as well as potential interactions between vision and memory. From a methodological 48 perspective, the repeated scanning of individuals using a consistent experimental manipulation (up to 40 49 scan sessions of the NSD experiment per subject) provides a unique opportunity for development and 50 evaluation of neuroimaging pipelines. Finally, perhaps the most exciting use of NSD is as a common 1 dataset to bridge the disciplines of cognitive science, neuroscience, and artificial intelligence (Naselaris et 2 al., 2018). As we have shown in the context of deep neural network modeling (see Figure 8), there are 3 sufficient data in NSD to successfully drive the training of neural network models with thousands of free 4 parameters. This demonstration exemplifies how NSD-with its large amounts of carefully curated fMRI 5 data collected during a rich cognitive paradigm-enables data-driven approaches towards understanding 6 the complexities of information processing in the brain. Information on the NSD dataset is available at http://naturalscenesdataset.org. The dataset will be made 28 publicly available upon manuscript publication. The data are hosted in the cloud, allowing researchers to 29 exploit high-performance cloud computing to efficiently analyze the dataset. We provide both raw data in 30 BIDS format (Gorgolewski et al., 2016) and prepared data files, along with extensive technical 31 documentation in the NSD Data Manual. To ensure strict validation for an upcoming Algonauts prediction 32 challenge (Cichy et al., 2019), the initial public release will withhold the last three NSD scan sessions 33 from each participant (about 8.4% of the NSD data). We also provide an archive of code used in this 34 paper, as well as utility functions for working with the prepared data. Finally, example scripts 35 demonstrating scientific analyses of the NSD data are available; these scripts may be useful for teaching 36 purposes. The NSD study was advertised to the University of Minnesota community. We sought to recruit right-5 handed individuals (18-65 years old) with no known cognitive deficits nor color blindness and with normal 6 or corrected-to-normal vision. Those who were interested in participating were contacted for a phone 7 interview to explain the nature of the study and to screen them for eligibility. We discussed the long-term 8 nature of the study, the time commitment it would involve, and the feasibility of traveling to the scanner on 9 a regular basis. We paid attention to the communicativeness of potential participants and their general 10 attitude towards study participation. Selecting participants whom we were confident would provide high-11 quality data was more important to us than obtaining a random sample of the general population. Based 12 on the phone interviews, we invited 14 people who we felt were strong candidates to participate in an 13 initial 7T fMRI screening session. Of these, 8 were selected to participate in the full NSD experiment.
14 15 Subjects 16   17 Eight subjects (two males, six females; age range 19-32) participated in the NSD dataset (subj01-18 subj08). There were six additional subjects (four males, two females; age range 20-53) who participated 19 in the initial 7T fMRI screening session but not in the remainder of data collection. Subjects were naïve 20 and were not involved in the design nor planning of the NSD dataset. All subjects had normal or 21 corrected-to-normal visual acuity. Informed written consent was obtained from all subjects, and the termed 'prffloc', referring to the population receptive field (pRF) and functional localizer (fLoc) experiments 30 conducted in that session. The 7T sessions also included, for each subject, 30-40 sessions in which the 31 main NSD experiment was conducted ('nsd01-nsd40'). These sessions are collectively termed the 'NSD 32 core'. In some of these sessions, resting-state data were acquired before and after the NSD experiment. 33 Finally, the 7T sessions also included two sessions conducted after completion of the NSD core; these 34 sessions, termed 'nsdsynthetic' and 'nsdimagery', involved measuring responses to synthetic stimuli and  Figure   1 2B. Below we summarize the scanning protocols (full protocol printouts are available online).

3
At 3T, we collected a number of anatomical measures (T1, T2, diffusion, angiogram). The motivation for 4 collecting data at 3T was to ensure acquisition of T1 volumes with good gray/white-matter contrast and 5 homogeneity, which is difficult to achieve at ultra-high field (Polimeni et al., 2018). To increase contrast-6 to-noise ratio and enable the ability to assess reliability, we acquired several repetitions of T1-and T2-7 weighted volumes. For each subject, we collected between 6-10 scans of a whole-brain T1-weighted 8 MPRAGE sequence (0.8-mm isotropic resolution, TR 2400 ms, TE 2.22 ms, TI 1000 ms, flip angle 8°, 9 bandwidth 220 Hz/pixel, no partial Fourier, in-plane acceleration factor (iPAT) 2, TA 6.6 min/scan) and 2-10 3 scans of a whole-brain T2-weighted SPACE sequence (0.8-mm isotropic resolution, TR 3200 ms, TE 11 563 ms, bandwidth 744 Hz/pixel, no partial Fourier, in-plane acceleration factor (iPAT) 2, TA 6.0 12 min/scan). In addition to T1 and T2 data, we also acquired 4 high angular resolution diffusion-weighted In addition to the EPI scans, the 7T sessions also included dual-echo fieldmaps for post-hoc correction of 40 EPI spatial distortion (same overall slice slab as the EPI data, 2.2 mm  2.2 mm  3.6 mm resolution, TR We determined the maximum square extent visible in both eyes given the constraints of the RF coil to be 31 8.4°  8.4° (714 pixels  714 pixels). Thus, stimuli from the various experiments (e.g., pRF, fLoc, NSD) 32 were adjusted to fill 8.4° of visual angle (details provided below). At the beginning of each scan session, 33 we made an effort to position the monitor in the same location relative to the scanner and to position the 34 subject's head and RF coil in the same location relative to the scanner. We also used a calibration square 35 (8.4° in size) to determine any incidental horizontal or vertical offsets needed in that session in order for 36 the participant to see the entire square in each eye, unobstructed. Given these efforts, we believe that 37 consistent and high-quality visual stimulation was achieved across scan sessions. Nonetheless, we 38 caution that due to limitations in positioning and/or potential drift over the course of a scan session, some 39 slight occlusion of the corners of the 8.4°  8.4° square extent may have occurred some of the time.

41
A Mac Pro computer controlled stimulus presentation using code based on Psychophysics Toolbox 3.0.14 42 (Brainard, 1997;Pelli, 1997). Behavioral responses were recorded using a button box (Current Designs,43 Philadelphia, PA). In some scan sessions (nsd21-nsd30, the same sessions in which the primary set of 44 resting-state data were acquired), physiological data were collected using a pulse oximeter and a 45 respiratory belt (stock Siemens equipment). Care was taken to secure the oximeter with tape to the left 46 index finger of the subject and to secure the respiratory belt snugly to the subject's torso. Visual 47 inspections of the physiological data (available online) suggest that the data are of excellent quality and 48 that more than 90% of the pulse data and more than 95% of the respiratory data are usable. These 49 physiological data are valuable for identifying potential contaminants of resting-state fMRI signals (Lynch 50 et al., 2020). Physiological data were carefully synchronized with the fMRI data and cropped, but are not 1 further described in this paper.

3
In several scan sessions (see Supplementary Figure 2 for details), eyetracking was performed using an 4 EyeLink 1000 system (SR Research, Mississauga, Ontario, Canada) combined with a custom infrared 5 illuminator mounted on the RF coil. Eyetracking was performed for the left eye, and eyetracking data were 6 obtained at 2000 Hz using the Pupil-CR centroid mode. We caution that the eyetracking data are variable 7 in quality, as achieving sufficient pupil contrast was often difficult. For complementary information, we also 8 captured video recordings of the eyetracker computer display (see Figure 2B) using a cell phone secured 9 to a mount. Eyetracking data and video recordings were carefully synchronized with the fMRI data and 10 cropped accordingly (analysis_eyetracking.m), but are not further described in this paper. Note that it may Participants were scanned roughly once a week, with attempts to keep a regular weekly scan time. At the 17 beginning of each session (starting at approximately nsd07), participants were asked to rate on a five-18 point scale how well they slept the night before, their mood, how hungry they were, and their stress level. 19 We also asked whether they had had caffeine in the past three hours. At the end of each scan session, 20 participants were asked to rate how comfortable they were during the session and to provide any general 21 feedback they had about the session. These various measures, as well as any technical issues that arose 22 during the session, were logged onto a spreadsheet (available online). During data collection, we monitored aspects of data quality including overall image quality, head motion, 31 quality of physiological data, and behavioral performance. Between functional runs, we checked in with 32 the subject to assess their energy level, enthusiasm, and compliance. If we noticed any substantial drops 33 in response rate, we politely notified the subject and offered short breaks before continuing.

35
To promote subject engagement and retention, participants were given the opportunity to earn monetary 36 bonuses that gradually increased in size over the course of the NSD study. These bonuses were 37 contingent on achieving certain performance levels on data quality metrics such as head motion and 38 response rate. Information regarding performance was supplied to participants in the form of a continually 39 updated "leaderboard" figure. We found that this figure greatly helped to motivate participants. In the NSD experiment, participants performed a long-term continuous recognition task while viewing a 46 large number of color natural scenes. We chose this recognition task because it engages and challenges 47 the observer and is unbiased with respect to the specific content of the images (unlike other tasks such 48 as animacy judgment). In addition, it infuses the experiment with a rich memory dimension that is likely of 49 interest to memory researchers. A total of 73,000 distinct images were prepared. We intended that the 8 50 NSD subjects would each view 10,000 distinct images presented 3 times each over the course of 40 scan 1 sessions. We designated a special set of 1,000 images as shared images that would be seen by all 2 subjects ('shared1000'); all other images would be mutually exclusive across subjects. The distribution of 3 the 3 presentations of each image was tightly controlled, and subjects were naïve as to both the number 4 and distribution of the presentations.

6
Images were presented using a 3-s ON / 1-s OFF trial structure (Figure 1A). In informal piloting, we found 7 that this pacing made the recognition task feasible and not overly taxing. In addition, we reasoned that the 8 relatively long stimulus duration would increase neural activity and that the rapidity of the design would 9 allow more trials to be collected and thereby increase overall experimental power. Finally, we speculated 10 that the 3/1 trial structure would yield a pleasant experience for participants, at least compared to slow 11 event-related designs where most experimental time is spent viewing a blank screen.

13
Image preparation 14 15 The NSD stimuli are prepared as a single brick of RGB images with dimensionality 425 pixels  425 pixels 16  3 RGB channels  73,000 images and unsigned 8-bit integer format. sky, land, wall, road) in addition to "things" (e.g., car, skateboard, hat) were available.

25
We selected only images whose smaller dimension (height or width) was at least 425 pixels. Where Cropping an image can change the way the viewer interprets it. We refer to this effect of cropping as 32 "semantic loss". In order to be able to take full advantage of the rich annotations available for the COCO 33 images, we attempted to minimized semantic loss when cropping images. For landscape-oriented images, 34 we selected between a center, left, or right crop. For portrait-oriented images, we selected between a In addition to screening to minimize semantic loss, we implemented a screening procedure to remove 40 duplicate images. Some of the COCO images are extremely similar to each other, differing only by a post-41 processing operation (i.e., grayscaling or sharpening) or by a few frames in a motion-capture sequence. 42 To remove these near-duplicates, we downsampled all images to 40  40 and then computed the 43 correlation of grayscale pixel intensities between all image pairs. We manually inspected the image pairs 44 with the 500 highest correlation values. Of these, 38 image pairs were observed to be near duplicates. 45 We randomly selected another image from the COCO dataset to replace one image in each near-46 duplicate pair. Finally, we screened captions for all images for indications of violent or salacious content. 47 No images were deemed too offensive to include in the experiment.

49
The distribution of "thing" categories across the final images selected for NSD was nearly identical to 1 distribution in the full COCO dataset. As a result, the "person" category was over-represented; however, 2 with a few exceptions, all 80 COCO object categories are displayed in at least 100 images to each 3 subject. Note that images tend to depict more than one category, so that a given object category 4 frequently appeared in the same image with other categories. For each subject's images, at least 90% of 5 the images contain 2 or more of the 80 COCO categories. We determined the ordering of the 10,000 images  3 trials = 30,000 trials in advance and kept the 10 ordering fixed across subjects. The idea is that these 10,000 images are actually treated as slots into 11 which different NSD images are inserted. We designated the first 1,000 slots as corresponding to the 12 special shared1000 images; the remaining 9,000 slots were filled with unique images for each subject. 13 Note that because the trial ordering and repetition structure are identical across subjects, the difficulty of 14 the recognition task is comparable across subjects (up to the fact that some images might be more 15 difficult to remember than others).

17
We controlled the distribution of image presentations in order to prevent the recognition task from 18 becoming too difficult (and risking loss of subject morale). In the procedure, we conceptualized the task of 19 determining the trial ordering as equivalent to placing image presentations on a circle that would 20 eventually be cut and unraveled. To determine presentation times, we created a circular probability 21 distribution by mixing a von Mises distribution and a uniform distribution (Supplementary Figure 1A). 22 Using random draws from the resulting distribution (positioning the distribution at a random location on 23 the circle for each image), we determined 3 presentation times for each of the 10,000 images. After 24 completing the placement of all 30,000 trials, we then cut the circle, unraveled it into a linear sequence of 25 image presentations, and divided this sequence into 40 consecutive segments corresponding to the 40 26 NSD scan sessions (750 trials per session).

28
We chose specific parameters for the probability distribution: we used a von Mises distribution with 29 concentration parameter of 3 6 and a mixing ratio of 60% and 40% for the von Mises and uniform 30 distributions, respectively. This choice of parameters yields appealing properties. First, the distribution is 31 relatively narrow (see Supplementary Figure 1A) and therefore ensures that there will be many trials 32 involving an image that has been presented in the recent past (thus, making the trials easy), while still 33 allowing the probing of more distant memory events. Second, there is minimal "burn-in" time at the 34 beginning of the experiment: even in the first scan session, there is still a substantial number of trials

39
To provide a sense of the overall experimental design, we computed basic statistics on each NSD scan 40 session. For a typical session, the total number of distinct images shown once, twice, and all three times 41 within that session is 437, 106, and 34, respectively (these numbers reflect the mean across scan 42 sessions, rounding to the nearest integer).

44
Trial and run design 45 46 Each trial lasted 4 s, and consisted of the presentation of an image for 3 s, followed by a 1-s gap. A total 47 of 75 trials were conducted in a run; thus, each run lasted 300 s. The first 3 trials (12 s) and last 4 trials 48 (16 s) were blank trials. The remaining 68 trials were divided into 63 stimulus trials and 5 blank trials. The 49 blank trials were randomly positioned in each run such that the minimum and maximum continuous 50 number of stimulus trials was 9 trials (36 s) and 14 trials (56 s), respectively (see Figure 1B). For even-51 numbered runs, the 63 rd stimulus trial was designated to be a blank trial. A total of 12 NSD runs were 1 collected in one NSD session, yielding a total of (63 + 62)  6 = 750 stimulus trials. Moreover, this design 2 was repeated for all 40 NSD sessions: 750 stimulus trials  40 sessions = 30,000 stimulus trials. The 3 temporal ordering of stimulus and blank trials was generated once and kept fixed across subjects.

5
Stimulus presentation and task 6 7 Since the BOLDscreen is calibrated to behave as a linear display device, we used a squaring luminance 8 response when presenting the NSD experiment in order to simulate the typical viewing of digital images. 9 At time of presentation, the prepared NSD images (425 pixels  425 pixels) were resized to fill 8.4°  8.4° 10 (714 pixels  714 pixels) using linear interpolation. Throughout each run (including blank trials), a small 11 semi-transparent red fixation dot (0.2°  0.2°; 50% opacity) was present at the center of the stimuli 12 ( Figure 1A). Stimuli were shown against a gray background with RGB value (127,127,127).

14
Subjects were instructed to fixate the central dot and to press button 1 using the index finger of their right 15 hand if the presented image was new, i.e. the image had never been presented before, or button 2 using 16 the middle finger of their right hand if the presented image was old, i.e. the image is identical to one that 17 had been presented before, either in the current scan session or any previous scan session. Subjects 18 were additionally instructed to continue to fixate and wait for the next image in the event of blank trials.

20
Before the start of the NSD experiment, we showed the subjects a version of the experiment involving 21 cartoon images in order for them to become familiarized with the feel and timing of the task. During the 22 NSD experiment, minimal feedback was provided to the subjects regarding their performance on the 23 recognition task. Participants were blind to the precise details of the NSD experiment (e.g., total number 24 of images, total number of presentations per image). Participants were informed only about their 25 response rate (fraction of trials on which they successfully made a response) and a vague "performance 26 metric" which, unbeknownst to them, quantified their percent correct for easy trials (trials that involved the 27 presentation of an image that had occurred earlier in the same scan session). A postmortem held after 28 the completion of the NSD experiment revealed the nature of the design (details below).

30
Details on experiment timing 31 32 Stimulus presentation was locked to the refresh rate of the BOLDscreen monitor. Empirical 33 measurements confirmed that the monitor refresh rate was nearly exactly 120 Hz: duration of runs were 34 highly reliable, ranging from 299.95-299.98 s. To compensate for the slight offset from 300 s, the fMRI 35 data were pre-processed to achieve a sampling rate of 0.999878 s (high-resolution preparation) or 36 0.999878 s  (4/3) = 1.333171 s (standard-resolution preparation). For brevity, we refer to these numbers 37 as 1.000 s and 1.333 s. Experimental runs were started by a trigger issued by the MR scanner. Due to 38 input polling and monitor refresh, there was slight variability in the delay between trigger detection and the 39 presentation of the first stimulus frame, ranging from 3-22 ms. We did not attempt to compensate for this Due to constraints on subject availability (including unplanned out-of-town absences in the summer of 45 2019) and due to constraints on scanner availability (the 7T scanner was decommissioned in November 46 2019), we did not complete the full NSD experiment for every participant. Fortunately, we were able to 47 collect a sizable amount of data: 40,40,32,30,40,32,40, and 30 NSD sessions for subj01-subj08, 48 respectively. In these collected data, each subject viewed 9,209-10,000 distinct images and participated 49 in 22,500-30,000 trials. Aggregated across subjects, the total number of distinct images shown was 1 70,566, and the total number of trials was 213,000. After completion of the final memory test (details below), participants filled out a post-NSD questionnaire. 6 This questionnaire probed topics such as strategies used for performing the NSD task and estimates for 7 the number of images viewed and the number of image repetitions. After filling out this questionnaire, the 8 design of the NSD experiment was then revealed to the participants. For consistency with the NSD experiment, stimuli were resized to fill a circular region with diameter 8.4°. 22 Each run lasted 300 s (exact empirical timings were highly accurate and ranged between 299.95-300.00 For consistency with the NSD experiment, stimuli were resized to fill a square region filling 8.4°  8.4° of 42 visual extent. Each run lasted 300 s (exact empirical timings were highly accurate and ranged between 43 300.000-300.002 s). Throughout stimulus presentation, a small red fixation dot was present at the center 44 of the stimuli. Subjects were instructed to maintain fixation on the dot and to press a button whenever 45 they noticed an image in which only the background was present ("oddball" task). A total of 6 runs were 46 collected in the first 7T fMRI session (prffloc).

48
Resting-state experiment 49 50 Stimuli consisted of a white fixation cross (0.5°  0.5°) on a gray background (see Figure 2A). Each 1 resting-state run lasted 300 s. In the second resting-state run held within a given scan session, the 2 fixation cross turned red after 12 s had elapsed and remained red for 4 s before returning to white.

4
Resting-state data were acquired in several NSD core scan sessions: nsd21-nsd38 for subj01 and 5 subj05, and nsd21-nsd30 for all other subjects. Thus, a total of 100 or 180 minutes of resting-state data 6 were acquired for each subject; this large amount of data is appealing as it enables stable estimates of 7 resting-state correlations (Laumann et al., 2015). In each session, one resting-state run was acquired at 8 the beginning of the session (prior to the NSD runs) and another resting-state run was acquired at the 9 end of the session (after the NSD runs). 10 11 In the first resting-state run, subjects were instructed to stay awake and fixate the cross but otherwise 12 rest. In the second resting-state run, subjects were additionally instructed to inhale deeply when the 13 fixation cross turned red. This instructed breath was designed to aid analysis of the physiological data 14 collected concomitantly with the resting-state data. Prior to each resting-state run, subjects were asked to 15 report their current sleepiness level using the Stanford Sleepiness Scale (Shahid et al., 2012) (1-7 where 16 1 is most active and 7 is most sleepy). After each resting-state run, subjects were asked to report their 17 sleepiness level during the run that had just completed.

19
After the last scan session involving resting-state data, participants filled out a post-resting-state 20 questionnaire. This questionnaire queried what the participants were doing during the resting-state runs 21 and whether they thought about the images from the NSD experiment.

25
After completion of the NSD experiment, we conducted an additional 7T fMRI scan session in which 26 responses were measured to a variety of carefully controlled synthetic (non-naturalistic) stimuli while the 27 subject performed either a fixation task or a one-back task. These data will be described and released in 28 a forthcoming manuscript.

30
Visual imagery experiment (nsdimagery) 31 32 After completion of the nsdsynthetic experiment, we conducted an additional 7T fMRI scan session in 33 which responses were measured while participants engaged in visual imagery and other cognitive tasks. 34 These data will be described and released in a forthcoming manuscript. We also conducted a final memory test in which we collected various memory-related measures 48 regarding the images shown to the subjects during the NSD experiment (nsdmemory). These data will be 49 described and released in a forthcoming manuscript.

51
Finally, using the web-based Meadows platform (http://meadows-research.com), we conducted an 1 assessment of how the NSD subjects perceive and interpret the NSD images (nsdmeadows). First, we 2 selected a small set of images that maximally span semantic space. This was done by isolating the 515 3 images (out of the shared 1,000 NSD images) that were viewed all 3 times by all 8 NSD subjects during 4 the NSD experiment; computing shifted inverse frequency sentence embeddings for the sentence 5 captions provided by the COCO dataset (Arora et al., 2017); and using a greedy approach to determine 6 the subset of 100 images that maximize the average distance between each image's embedding and its 7 closest neighbor. We then asked participants to perform a Multiple Arrangements Task (Kriegeskorte and   8 Mur, 2012) in which they arrange using a drag-and-drop interface the 100 images within a white circular 9 arena according to the similarity of their content. Using an adaptive procedure, subsequent arrangements 10 were conducted using subsets of the images in order to maximize information gain. This was done until 11 45 minutes had elapsed. Using a similar interface on Meadows, participants then provided valence and 12 arousal ratings for the 100 images as well as 3 additional images pulled from the 515 images. Ratings 13 were performed separately for valence and arousal, and were accomplished by freely arranging using a 14 drag-and-drop interface the images (delivered in small batches) along a one-dimensional axis ranging 15 from low to high. This assessment took about 15 minutes. 16 17 Overview of data analysis 18 19 We designed custom analysis strategies to maximize the quality of derived measures from the NSD data. The analysis of the NSD data can be divided into three components: (i) pre-processing of the anatomical, 27 diffusion, and functional data, (ii) time-series analysis of the fMRI data to estimate trial-wise betas, and (iii) 28 further analyses of the trial-wise betas to answer specific scientific questions. The first two components 29 produce the so-called 'prepared data' that are generally useful to the community, whereas the third 30 component refers to analyses performed for the purposes of this paper (pRF estimation, univariate 31 memory analysis, representational similarity analysis, end-to-end neural-network training).

33
The pre-processing approach that we designed for the NSD dataset prioritizes accuracy and preservation 34 of information (e.g. avoiding spatial smoothing). We avoid "baking in" unnecessary assumptions (e.g. 35 aggressively removing signal fluctuations without careful assessment of validity), and care is taken to 36 manually inspect each pre-processing step to ensure satisfactory results. While we believe our pre-37 processing is general and likely suitable for most downstream uses of the data, the raw data are also 38 available for those who wish to explore other pre-processing approaches such as fmriprep (Esteban et   39 al., 2019). We note several aspects of the NSD dataset that may render the dataset challenging from a 40 pre-processing standpoint: the relatively high spatial resolution of the fMRI data (1.8 mm) places higher 41 demands on spatial accuracy, the ultra-high field strength (7T) used for the fMRI data yields higher levels 42 of EPI spatial distortion compared to lower field strengths, and the emphasis on many repeated scans of 43 individuals heightens the importance of achieving consistent imaging results across scan sessions. T1-and T2-weighted volumes were corrected for gradient nonlinearities using a custom Python script 1 (https://github.com/Washington-University/gradunwarp) and the proprietary Siemens gradient coefficient 2 file retrieved from the scanner. The multiple T1 and T2 volumes acquired for a given subject were then co-3 registered (preprocess_nsd_structuralalignment.m). This was accomplished by first co-registering the T1 4 volumes to each other (rigid-body transformation; correlation metric; the first T1 volume serving as the 5 target) and then by co-registering the T2 volumes to the T1 volumes (rigid-body transformation; mutual 6 information metric; the prepared T1 data serving as the target). In the estimation of registration 7 parameters, a manually defined 3D ellipse was used to focus the cost metric on brain tissue. Individual 8 volumes were manually inspected and rejected if substantial image artifacts were visible (only the 2nd 9 and 4th T1 volumes for subject 8 were rejected). The final T1 and T2 data were created by performing 10 cubic interpolation at a resolution of 0.5 mm. Results for the multiple acquired volumes were averaged 11 (within modality) to increase contrast-to-noise ratio. Finally, the 0.5-mm volumes were resampled to 12 create alternative 0.8-mm and 1.0-mm versions. We co-registered the high-resolution T2 volume to the prepared 0.5-mm T2 volume (external_mtl.m). 35 Given that these volumes were acquired on different scanners, we evaluated several strategies for 36 achieving accurate co-registration. We obtained the best results by performing a simple linear co-37 registration (affine transformation; correlation metric) in combination with a rectangular box that focused 38 the cost metric on regions of interest in the medial temporal lobe. The estimated registration was 39 subsequently used to map labels defined on the high-resolution T2 volume to the subject-native 40 anatomical space.

42
De-identification 43 44 We mapped a liberal brain mask defined in MNI space to the subject-native anatomical space, and then 45 used the result to mask and thus de-identify the anatomical volumes 46 (preprocess_nsd_applybrainmask.m).

48
FreeSurfer processing 49 50 The prepared 0.8-mm T1 volume was processed using FreeSurfer version 6.0.0 with the -hires option 1 (analysis_freesurfer.m). Manual edits of tissue segmentation (labeling voxels as gray matter, white 2 matter, or cerebrospinal fluid) were performed for each of the eight subjects to optimize the accuracy of 3 the cortical surface representations generated by FreeSurfer. The prepared 0.8-mm T2 volume was used 4 to inform manual segmentation decisions, but was not explicitly used in the FreeSurfer processing. We 5 also manually marked surface imperfections that remained even after manual edits; these are labeled in 6 the surface inspections (Supplementary Video 2) and are largely confined to a few difficult regions 7 located in the inferior aspects of the temporal and frontal lobes (see Figure 5F).

9
Several additional FreeSurfer processing steps were performed. Using mris_expand, we generated 10 cortical surfaces positioned at 25%, 50%, and 75% of the distance between the pial surface and the 11 boundary between gray and white matter. These surfaces are useful for creating surface representations 12 of the fMRI data. Multiple surfaces at different gray-matter depths were created given the relatively high 13 spatial resolution of the fMRI data (1.8-mm acquisition). We also generated several flattened surface The raw data from the four diffusion scans (99 AP, 99 PA, 100 AP, 100 PA) were corrected for gradient 34 nonlinearities, concatenated, and then pre-processed following a published protocol (Ades-Aron et al., 35 2018). Specifically, diffusion volumes were denoised and cleaned with respect to Gibbs ringing using 36 MRTrix3 before being corrected for susceptibility, motion, and eddy distortions using FSL's topup and 37 eddy functions (brainlife.app.287). Following these corrections, the diffusion volumes were bias-corrected 38 and had background noise removed using MRTrix3. Finally, the diffusion volumes were co-registered to 39 the 0.8-mm T1-weighted anatomical volume using FSL's epi_reg (rigid-body transformation, boundary-40 based registration), and then resliced to 0.8-mm isotropic voxels. The diffusion data were organized into 41 two runs: data from the 99 AP and 99 PA scans constitute 'Run 1' and data from the 100 AP and 100 PA 42 scans constitute 'Run 2' (brainlife.app.371). To assess data quality, we calculated signal-to-noise ratio in 43 the corpus callosum using workflow provided by Dipy (Garyfallidis et al., 2014) (brainlife.app.120).

45
Following pre-processing, brain masks were generated using Dipy's median_otsu (brainlife.app.70). This  (brainlife.app.9, brainlife.app.365). The NODDI model was fit twice for each run: once for white-1 matter tract microstructure using an intrinsic free diffusivity parameter (d∥) of 1.7  10 -3 mm 2 /s, and once 2 for cortical microstructure using d∥ = 1.1  10 -3 mm 2 /s, following previously described procedures We implemented a pre-processing approach that aimed to preserve as much spatial and temporal detail 26 as possible. In short, the fMRI data were pre-processed by performing one temporal resampling to correct 27 for slice time differences and one spatial resampling to correct for head motion within and across scan 28 sessions, EPI distortion, and gradient nonlinearities. This produced volumetric fMRI time-series data in 29 subject-native space for each NSD subject. The functional data were pre-processed independently of the 30 anatomical data; this was done intentionally in order to avoid dependence of the pre-processed functional 31 data on choices such as how to co-register the functional and anatomical data and how to map the 32 functional data to cortical surface representations. Also, to minimize the risk of inaccurate or unwanted 33 assumptions, we did not include any temporal filtering (e.g. detrending, confound regression, censoring). 34 Pre-processing results were carefully visually inspected to ensure quality control. There were a few 35 anomalous cases, such as acquisition being split across two different scan sessions; special 36 modifications were made to the pre-processing to accurately compensate for these occurrences (see 37 online notes for details). 2. Fieldmap preparation. The multiple fieldmaps acquired in the scan session (3.6-mm slices) were 3 upsampled using nearest-neighbor interpolation to match the slice resolution of the fMRI data 4 (1.8-mm slices). The fieldmaps were then phase-unwrapped using the FSL utility prelude and 5 regularized by performing 3D local linear regression using an Epanechnikov kernel with radius 5 6 mm. Values in the magnitude component of the fieldmaps were used to regularize the fieldmaps 7 and the regression in order to improve robustness of the field estimates. Finally, the fieldmaps 8 were linearly interpolated over time, producing an estimate of the field for each fMRI volume 9 acquired. 10 3. Spatial undistortion. The temporally resampled volumes from Step 1 were undistorted based on 11 the field estimates from Step 2 using the standard unwarping method (Jezzard, 2012). 12 Undistorted volumes were generated using cubic interpolation. Given the results of the first-stage pre-processing, we computed the mean fMRI volume and corrected 28 this volume for gradient nonlinearities (preprocess_nsd_epigradunwarp.m). We then co-registered this 29 gradient-corrected volume to the gradient-corrected volume from the first NSD scan session (affine 30 transformation, correlation metric). Thus, the first NSD scan session determined the target space for 31 preparing fMRI data from different scan sessions (preprocess_nsd_epialignment.m).

33
Second-stage pre-processing 34 35 We repeated the pre-processing steps (Steps 1-5 above) but with the final spatial resampling step 36 incorporating the effects of the gradient nonlinearity correction and the session registration 37 (preprocess_nsd_secondstage.m). In this way, a single cubic interpolation is used to compensate for the 38 effects of head motion, spatial distortion, gradient nonlinearities, and session registration. For this final 39 interpolation step, we used either a 1.8-mm grid (standard-resolution preparation) or a 1.0-mm grid (high-40 resolution preparation). The latter approach intentionally upsamples the data in order to exploit the 41 benefits of small head displacements and preserve as much spatial detail as possible (Kang et al., 2007;42 Kay et al., 2019). To minimize storage requirements, the interpolations were performed within a 3D box 43 that was just large enough to cover the brain of each subject.

45
To facilitate assessment of T2* effects, we created a bias-corrected version of the mean EPI volume 46 (analysis_biascorrection.m). For each preparation, we took the mean EPI volume and fit a 5th-degree 3D 47 polynomial, considering only voxels labeled as cortical or cerebellar gray matter in the FreeSurfer aseg 48 file. The fitted volume ('coilbias') was then divided from the mean EPI volume, producing the bias-49 corrected volume ('bc').

51
Final outputs 1 2 The final result of pre-processing was volumetric fMRI time-series data in subject-native space. Two 3 versions were generated: the standard-resolution version was prepared at a spatial resolution of 1.8 mm 4 and a temporal resolution of 1.333 s, whereas the high-resolution version was prepared at a spatial 5 resolution of 1 mm and a temporal resolution of 1.000 s.

19
• Mapping to and from fsaverage. We calculated the indexing information that maps subject-native 20 surfaces to and from fsaverage using nearest-neighbor interpolation in the spherical space 21 defined by FreeSurfer. Visual inspections confirm the quality of the folding-based alignment 22 achieved by FreeSurfer (Supplementary Video 3). 23 • Mapping to and from MNI. Using FSL's utility fnirt, we co-registered the subject-native prepared 24 1.0-mm T1 volume to the MNI152 T1 1-mm template provided by FSL 25 (preprocess_nsd_MNIandbrainmask.m). Visual inspections confirm the quality of the registration 26 (Supplementary Video 4).

27
• Converting surface data to volumetric format. We implemented a method that, for a given target 28 anatomical volume with resolution R mm, allows each surface vertex to contribute a triangular 29 (linear) kernel of size +/-R mm and then calculates a weighted average of data values at the 30 position of each voxel in the volume (cvnmapsurfacetovolume_helper.m). 31 Based on the results of the above analyses, we calculated coordinate transformations that indicate how to 32 map between the functional spaces, the anatomical spaces, the subject-native surfaces, fsaverage, and 33 MNI space (preprocess_nsd_calculatetransformations.m). Finally, we created a lightweight utility 34 (nsd_mapdata.{m,py}) that uses the coordinate transformations to map user-supplied data from one 35 space to another space under a given interpolation scheme (nearest-neighbor, linear, cubic, winner-take-36 all). For example, data in subject-native functional space can be mapped to MNI space using a single 37 interpolation of the functional data. We used this utility to perform a number of mundane but useful Several data quality metrics were calculated (export_runmetrics.m) and summarized in Figures 1D and   45 2D. Temporal signal-to-noise ratio (tSNR) was computed from raw fMRI volumes (no pre-processing) by 46 first detrending the time-series data from each voxel (quadratic polynomial fit) and then dividing the mean 47 signal intensity by the standard deviation of signal intensity values (autoqc_fmri.m). We calculated the 48 median tSNR across voxels within a simple brain mask (mean volume thresholded at 1/10th of the 99th 49 percentile of values) and then computed the median across runs. Head motion was quantified by 50 calculating framewise displacement (FD) (Power et al., 2012) based on motion parameter estimates (1.8-1 mm preparation). We calculated the mean FD across volumes in a run and then computed the median 2 across runs. BOLD response was quantified by calculating the percentage of variance explained by a 3 simple ON-OFF GLM model (1.8-mm preparation). We calculated the median variance explained across 4 voxels within the nsdgeneral ROI and then computed the median across runs. Response rate was 5 quantified by calculating the percentage of trials for which the subject pressed a button and then 6 computing the mean across runs. Percent correct for easy trials was quantified by calculating the 7 percentage of trials for which the subject gave the correct response, considering only those trials for 8 which the presented image had been previously presented in the same scan session.

10
To identify EPI signal dropout regions (export_signaldropout.m), we divided the T2 volume (resampled to 11 match the EPI data) by the mean EPI volume (1-mm preparation). The resulting volume is useful as it 12 indicates which voxels have high signal intensity in the T2 but are corrupted by signal dropout in the EPI. 13 We mapped the volume to the cortical surface (cubic interpolation; mean across depth), transformed the 14 result to fsaverage, and then used a data-driven threshold to mark atypically high values. Vertices marked 15 in at least four of the eight subjects are indicated in Figure 5F. To visualize surface imperfections, we 16 took the voxels that were marked in the 0.8-mm anatomical space (during the inspection of surface 17 quality), smoothed this binary volume with a 3D Gaussian with full-width-half-max of 2 mm, mapped the 18 result to the cortical surface (cubic interpolation; max across depth), and then transformed the result to 19 fsaverage. Vertices exceeding 0.01 in at least one of the eight subjects are indicated in Figure 5F. The MTL ROI collection consists of ROIs related to the hippocampus and surrounding brain regions in the 35 medial temporal lobe (external_mtl.m). Manual labeling of these ROIs was performed by an expert (W. 36 Guo) based on the high-resolution T2 volume obtained for each subject, following a published protocol 37 (Berron et al., 2017). Labels were defined on the raw high-resolution volume, and were mapped to 0.5-38 mm anatomical space using the previously determined affine transformation. Note that the resulting labels 39 have some amount of jaggedness due to the anisotropy of the voxels in the high-resolution T2 volume. 40 Alternative versions of the labels were created in the same way as described for the thalamus labels. the fovea (0° eccentricity) but were restricted to the extent of cortex stimulated by the pRF experiment. 49 The pRF results were also used to define prf-eccrois, a collection of ROIs consisting of concentric regions 50 with increasing eccentricity coverage (0.5°, 1°, 2°, 4°, >4°). Labeled regions were confined to the same 1 cortical extent labeled in prf-visualrois.

3
Results of the fLoc experiment were used to define several collections of category-selective ROIs 4 (analysis_drawrois*.m). These ROIs were manually drawn on cortical surfaces by experts (K. Kay,A. 5 White, A. Bratch) based on a combination of anatomical location and stimulus selectivity t-values 6 obtained from the fLoc experiment, following general procedures used in prior studies (Gomez et al., 7 2018;Kay and Yeatman, 2017;Stigliani et al., 2015). For each ROI collection (floc-bodies, floc-faces, 8 floc-places, floc-words), several ROIs exhibiting preference for the associated category were defined 9 (e.g., floc-faces was based on t-values for the contrast of faces > non-faces). ROIs were defined by 10 drawing a polygon around a given patch of cortex and then restricting the ROI to vertices within the 11 polygon that satisfy t > 0. This liberal criterion was used to provide maximum flexibility (the user can 12 simply restrict the ROI further to a specific desired threshold). The pre-processed fMRI data from the pRF experiment were analyzed using the Compressive Spatial Six quality measures (pRF BOLD, fLoc BOLD, pRF behavior, fLoc behavior, raw motion, detrended 27 motion) were computed for each of the 14 subjects who participated in the screening session. BOLD 28 quality was quantified as the percentage of voxels for which variance explained by modeling the fMRI 29 time-series data (either pRF model fitting or GLM model fitting) exceeded 20%. Behavior quality was 30 quantified as described above. Motion was quantified by calculating the median voxel displacement 31 relative to the reference volume used for motion correction, computing the median of this quantity across 32 volumes, and then computing the mean across runs. This motion quantification was performed using raw 33 motion parameter estimates (thereby providing a measure of global head displacement over the course of 34 the session) as well as using motion parameter estimates that are linearly detrended within each run 35 (thereby providing a measure of within-run head instability). Each of the six measures was linearly scaled 36 to span the range 1-5 where 1 corresponds to the worst performance and 5 corresponds to the best 37 performance observed across subjects. Finally, the normalized measures were averaged to produce an 38 overall ranking for each subject, as depicted in Figure 2C. We performed a GLM analysis of the pre-processed time-series data from the NSD experiment. To 8 maximize flexibility for subsequent analyses, the GLM approach was designed to provide estimates of 9 BOLD response amplitudes ('betas') for single trials. Due to low signal-to-noise ratio, single-trial 10 estimation in fMRI is challenging. We therefore developed several analysis components in order to 11 optimize the quality of single-trial betas. These components are packaged into a tool called GLMsingle, 12 and is the subject of a forthcoming manuscript where additional details and discussion can be found. To generate a library of HRFs that accurately capture empirically occurring timecourse variation, we 30 performed an initial analysis of data from the first NSD core session (nsd01). The first step was to create 31 a comprehensive summary of observed timecourses (hrf_derivecanonicalpcs.m). The time-series data 32 from each subject's nsd01 session was fit using a finite impulse response model (0-30 s) where all of the 33 stimulus trials are treated as instances of a single experimental condition (this simplification is necessary 34 to make estimation feasible). We identified voxels for which model variance explained (R 2 ) was greater 35 than 10%, and from these voxels randomly drew 20,000 voxels (with replacement). Pooling across 36 subjects, timecourse estimates from the resulting 160,000 voxels were subjected to singular value 37 decomposition to determine the top 3 principal components (shown in Figure 5B, inset). To fine-tune 38 timecourse estimates, we re-fit the time-series data from the nsd01 session using these 3 principal 39 components as the basis (as opposed to the finite impulse response basis). Finally, adopting the 40 visualization approach of the Temporal Decomposition Method (Kay et al., 2020), we projected voxel 41 timecourse estimates onto the unit sphere (using the same voxel selection criterion of R 2 > 10%), and 42 constructed a 2D histogram for each subject (shown in Figure 5A).

44
The second step was to define a set of timecourses that span the observed timecourse variation 45 (hrf_constructmanifold.m). To do this, we converted the 2D histograms to units of relative frequency and 46 then averaged the histograms across subjects. Inspecting the group-average histogram (shown in Figure   47 5B), we manually clicked a sequence of points on the unit sphere that follow the data density as closely 48 as possible. We then parameterized the path traced by these points (a simple 1D manifold) by positioning 49 regularly spaced points where successive points are separated by six angular degrees ( Figure 5B, cyan 50 dots). The timecourses corresponding to the resulting set of 20 points were cubic interpolated to a 51 sampling rate of 0.1 s and normalized to peak at 1 (Figure 5C). Finally, we fit each timecourse using a 1 double-gamma function as implemented in SPM's spm_hrf.m (hrf_fitspmhrftomanifold.m). This yielded a 2 library of 20 canonical HRFs that may be useful for application to other experimental datasets 3 (getcanonicalhrflibrary.m).

5
Cross-validation framework for single-trial GLM 6 7 The GLMdenoise and ridge regression analysis components of GLMsingle both require tuning of 8 hyperparameters. To determine the optimal setting of hyperparameters, we use a cross-validation 9 approach in which out-of-sample predictions are made for single-trial beta estimates, as opposed to time-10 series data. This simplifies and reduces the computational requirements of the cross-validation 11 procedure. Note that because of cross-validation, although GLMsingle produces estimates of responses 12 to single trials, it does require the existence of and information regarding repeated trials, i.e., trials for 13 which the stimulus is the same.
14 15 The first step of the cross-validation procedure is to analyze all of the available data using no   (Rokem and Kay, 2020) to estimate the single-trial 51 betas, systematically evaluating different shrinkage fractions. Cross-validation is used to select 1 the optimal shrinkage fraction for each voxel. To mitigate bias on the overall scale of betas, we 2 apply a post-hoc scaling and offset on betas obtained for a given voxel in order to match, in a 3 least-squares sense, the unregularized betas obtained for that voxel.

5
Application of GLMsingle to the NSD data 6 7 We used GLMsingle to analyze the time-series data from each NSD scan session (glm_nsd.m). Major 8 algorithmic parameters included the following: we evaluated up to 10 nuisance regressors; we evaluated 9 shrinkage fractions from 0.05 to 0.90 in increments of 0.05 and from 0.91 to 1 in increments of 0.01 10 (representing a finer grain for voxels with the best signal-to-noise ratio); we performed 6-fold cross-11 validation (consecutive pairs of runs) for Steps 4 and 5; and we used an ON-OFF R 2 threshold of 5% in 12 Step 4. As an optional resource, we fit the time-series data from the resting-state experiment using methods that 32 parallel those used for the NSD experiment (glm_nsdresting.m). For each scan session involving resting-33 state, we took the two resting-state runs (first and last run acquired) and analyzed the data using the 34 design matrix of the neighboring NSD runs and the same voxel-wise HRFs determined from analyzing the 35 NSD runs in that scan session (this is analogous to beta version b2). Although there is no reason to think 36 that spontaneous resting-state activity conforms to the 4-s trial structure of the NSD experiment, these 37 resting-state betas may be useful as a direct comparison for the NSD betas. where RESP indicates the NSD beta observed on a given trial, signal is the mean signal across different 1 images, signal is the standard deviation of the signal across different images, and noise is the standard 2 deviation of the noise. Note that the first Gaussian distribution characterizes true signal variability, 3 whereas the second Gaussian characterizes variability due to noise. Also, note that this framework treats 4 response variability unrelated to the stimulus as "noise", but such variability may in fact reflect "signal" 5 from the perspective of functional connectivity (Biswal et al., 1995).
To compute the noise ceiling, we first take the trial-wise NSD betas for each voxel and z-score these 8 betas within each scan session. This simple normalization compensates for nonstationarities that may 9 exist across sessions. We then calculate the variance of the betas across the three presentations of each 10 image (using the unbiased estimator that normalizes by n-1 where n is the sample size), average this 11 variance across images (thus, stabilizing the variance estimates), and then compute the square-root of 12 the result. This produces an estimate of the noise standard deviation: where n indicates the number of trials that are averaged together (see online notes for additional details). 31 We note that there is a direct relationship between the metric of split-half reliability and the noise ceiling: if 32 a voxel has two sets of responses that reflect the same image presentations, then the correlation 33 between the two sets of responses multiplied by 100 is equal to the noise ceiling for single-trial responses 34 expressed in percent variance explained.

37
Using the above methods, we calculated noise ceilings for each of the beta versions and for each of 38 various spatial preparations (1.8-mm, 1-mm, fsaverage, nativesurface). For simplicity, noise ceiling 39 estimates were calculated using betas associated with images with all three presentations available. The 40 noise ceiling results shown in Figure 5F-G were computed assuming n = 3, reflecting the scenario in 41 which trial-wise betas are averaged across three trials for each image. To provide a common basis for comparing different datasets, we define the number of equivalent trials 6 present in a dataset as N  ncsnr 2 where N indicates the number of trials conducted and ncsnr is the 7 noise ceiling signal-to-noise ratio (as defined earlier). The assumptions here are that (i) every trial has 8 equal value, irrespective of whether it is used to measure brain responses to an image that has already 9 been shown or a new image (e.g., two trials for one image is equivalent to one trial for two distinct 10 The median ncsnr, averaged across subjects, was 0.260 for NSD (averaged across the first four NSD 22 subjects), and 0.187 for BOLD5000 (averaged across the four subjects in BOLD5000). This indicates 23 that, despite the longer time duration allocated per trial in BOLD5000 (10 s) compared to NSD (4 s), the 24 quality of a single-trial beta in NSD is higher than that in BOLD5000. Specifically, one NSD trial is 25 approximately equivalent to (0.260) 2 /(0.187) 2 = 1.93 BOLD5000 trials. This increase in quality is likely 26 due, in part, to the screening of subjects and the ultra-high magnetic field strength (7T) used in NSD.

27
Note that the ncsnr metric quantifies the SNR per trial and is expected to be unbiased with respect to the 28 number of repeated trials used to calculate it. Thus, although the exact number of trials per image is 29 different in the NSD and BOLD5000 datasets, the ncsnr values can still be directly compared. To prepare stimuli for pRF estimation, the NSD images were converted to grayscale, resized to 800 pixels 40  800 pixels (cubic interpolation), and squared to mimic the luminance response of the display. For this analysis (results shown in Figure 6B), we used version 3 of the NSD betas (b3)  For each subject, we extracted betas for vertices within each ROI (concatenating across hemispheres). 28 We then computed Pearson's correlation between beta patterns across all possible pairs of images. This  'GNet'-a brain-optimized network that is trained to directly predict brain activity in the NSD dataset. The To facilitate direct comparison, all models are designed to have comparable coupling models. For GNet, 10 both the feature extractor and coupling model are trained jointly using brain data; for AlexNet and the 11 Gabor models, the feature extractors are fixed and only the coupling model is trained using brain data. Constraints were placed on the weights ( )-termed 'spatial pooling fields'-that couple the feature 36 maps to voxel activity. For the AlexNet-and Gabor-based encoding models, the spatial pooling field for 37 each voxel was a 2D isotropic Gaussian that was applied to all feature maps (see Supplementary Figure   38 12B, middle). We find that this constrained model of spatial pooling typically yields better prediction Given the demanding memory requirements of training large-scale neural networks to jointly predict tens 3 of thousands of voxels, we selected the four NSD subjects with the highest noise ceilings (see Figure   4 5G). For the selected subjects (1, 2, 5, 6), NSD betas were extracted from visual areas V1-hV4. These 5 betas were separated into those evoked by the shared1000 images and those that were not; the former 6 were designated as the validation set, while the latter were designated as the training set. For example, 7 for subject 1, there were 9,000 images  3 trials = 27,000 samples in the training set, and 1,000 images  single instantiation of GNet was created for all four subjects, and data from all subjects were used to train 28 the GNet-based encoding models. In this scheme, all subjects share a common feature extractor, but 29 each subject has independently adjusted coupling models and feature weights.

36
Note that the NSD subjects view largely non-overlapping sets of images. Thus, when training GNet on 37 data from multiple subjects, we used a modified procedure for selecting batches of training data. For each 38 iteration of training, we first extracted a batch of training samples from one subject's data and calculated 39 the gradient with respect to the loss function. Coupling model parameters for that subject and feature 40 extractor parameters were then updated and the process was repeated until all batches from all subjects 41 were used. This corresponded to one training epoch.

Supplementary Discussion
Our examination of the NSD data did not reveal any severe problems or artifacts that could not be compensated for in data pre-processing. Nonetheless, there are several known limitations of the data that should be considered. EPI pulse sequences invariably suffer from signal dropout and spatial distortion in locations with magnetic field inhomogeneties. The NSD data exhibit signal dropout in typical locations such as near the ear canals and the frontal sinuses (see Figure 5F), and our approach for distortion correction may have some imperfections (see Supplementary Video 6). While we stressed the importance of central fixation to the NSD subjects, some inadvertent eye movements are likely present in the data. Methods for retrospectively estimating eye position from imaging data (Frey et al., 2020;Son et al., 2020) might prove useful. In a few fMRI scan sessions (6 out of 308; 1.9%), the subject exited the scanner and re-entered either on the same day or a different day to complete the session. We compensated for these occurrences in the pre-processing of the data, but they nonetheless contribute some variability. Due to hardware errors, behavioral responses are missing for a few of the NSD runs (2 out of 3,408; 0.06%). Finally, a few fMRI scan sessions (4 out of 308; 1.3%) had slightly incomplete brain coverage due to subject motion. A comprehensive summary of data anomalies is available in the online materials.

Supplementary Figures
Supplementary Figure 1. Design of the NSD experiment. A, Image presentations. Each of 10,000 distinct images was placed 3 times on a circle according to a probability distribution created by mixing a relatively narrow von Mises distribution and a uniform distribution. The resulting image sequence was divided into 40 equally-sized segments for the 40 NSD scan sessions. B, Basic statistics of image repetitions. We define new trial as a trial involving an image never shown before, old trial as a trial that is not a new trial, and easy trial as an old trial for which the presented image had been shown previously in the same scan session.
Supplementary Figure 2. Overview of data collection. This table summarizes the overall NSD data collection effort. Structural and diffusion MRI data were collected at 3T. Functional MRI data were collected at 7T. The breakdown of the 7T fMRI scan sessions is indicated: for example, subject 2 participated in 1 (prffloc) + 40 (nsd01-nsd40) + 1 (nsdsynthetic) + 1 (nsdimagery) = 43 7T fMRI scan sessions. Additional behavioral data were acquired outside of the scanner (nsdpostbehavior, nsdmemory, nsdmeadows). Note that scan sessions were occasionally split across multiple magnet entries (see aquamarine and yellow cells). For simplicity, we treat these cases as if they represent single scan sessions.
Supplementary Figure 3. Details on the quantification of tSNR. This figure shows example tSNR results (nsd20 scan session, first NSD run). The middle slice in each of three orthogonal views (axial, sagittal, coronal) is displayed. To compute tSNR, the raw fMRI volumes from a given run are obtained, and the mean across volumes is computed (Mean EPI). A brain mask is computed by identifying voxels whose intensity is at least 10% of the 99th percentile of intensities in the mean volume (Mask). tSNR is calculated by quadratically detrending the time-series of each voxel (preserving the mean) and then computing the mean divided by the standard deviation of the time-series values (tSNR). A summary tSNR value is determined by calculating the median tSNR across voxels within the brain mask. This corresponds to the summary metric shown in Figure 2D, left (the inset shows results from subject 2).
Supplementary Figure 4. Details on the quantification of head motion. This figure shows example motion parameter estimates obtained during data pre-processing (nsd20 scan session, first NSD run, 1.8-mm data preparation). The rotation parameters, originally in radian units, are multiplied by 50 in order to allow interpretation in terms of millimeters of displacement for a circle of diameter 100 mm (Power et al., 2012). Motion parameters are relative to the reference volume which is the first volume in each scan session. Framewise displacement (FD), calculated as the sum of the absolute differences of the motion parameters for successive pairs of volumes, is also plotted. The mean FD across volumes is indicated in the plot titles, and corresponds to the summary metric shown in Figure 2D, middle (the inset shows results from subject 5).
Supplementary Figure 5. Quantification of functional imaging stability. We took the mean fMRI volume (1-mm preparation) in each scan session, bias-corrected the volume by dividing by a fitted 3D polynomial (autoqc_nsd_grand.m), and then computed pairwise correlation across sessions. Dotted white lines mark increments of five NSD scan sessions. Inspection of similarity of the mean EPI volume across sessions reveals a few minor anomalies. We investigated these cases further and determined the following: the nsd36 scan session in subject 7 involved a poor scanner shim, which was largely but not fully corrected by the fieldmap-based processing; the nsd01 scan session in subject 8 involved an unusually large amount of head motion, which resulted in some residual spatial distortion; and the nsd12-nsd16 scan sessions in subject 8 involved a temporary sinus infection near frontal cortex that manifested as bright signal intensities in the EPI volumes outside the brain but otherwise did not cause any data problems. For visual inspection of these effects, see Supplementary Videos 6-7.
Supplementary Figure 6. Diffusion processing for investigation of white-matter connectivity. A, Schematic of the diffusion pre-processing pipeline. Diffusion volumes were corrected for noise, Gibbs ringing, susceptibility, motion, eddy currents, and bias fields before being co-registered to the T1 anatomy. Following pre-processing, the data were organized into two runs (corresponding to the 99-direction and 100-direction scans, respectively). B, Signal-to-noise ratio, computed in the corpus callosum. C, White-matter tract segmentation from an example subject (subject 7). White-matter tracts are organized based on typical anatomical and functional definitions into associative (left), projection (middle), and callosal (right) tracts and overlaid on the T1 anatomy. D, Reliability of MD, ODI, NDI, ISOVF, Mean Kurtosis (MK), Axial Kurtosis (AK), and fiber count, length, and volume. Each dot indicates results averaged along a single tract. Pearson's correlation (r) and root-mean-squared error (RMSE) for each measure are indicated in the inset. E, Macrostructural and microstructural properties observed for different tracts. Error bars indicate ± 1 SEM across subjects.  Figure 5F-G. Each subplot is a 2D histogram comparing noise ceilings for two different beta versions. Improvements in noise ceilings are consistent across voxels and subjects.
Supplementary Figure 10. Angle and eccentricity estimates from the NSD data. Here we show results from the analysis of the pRF experiment and results from an analogous analysis performed on trial-averaged NSD betas (see Methods for details). Each panel shows an occipital view of FreeSurfer's sphere surface, and white lines indicate borders of visual areas V1-hV4 (defined based on results of the pRF experiment). Angle and eccentricity estimates are plotted using the same colormaps as in Benson et al. (Benson et al., 2018). We also plot the amount of time-series variance explained in the pRF data (variance relative to the mean signal level) and the amount of variance explained in the NSD betas (variance relative to 0% BOLD signal change). Clear retinotopic maps in early visual cortex are visible in the NSD results, including robust angle estimates even in foveal regions. In addition, there is high consistency of retinotopic estimates across the pRF and NSD datasets. There is some discrepancy in absolute eccentricity estimates at peripheral locations; this is likely due to technical differences in how modeling procedures behave for voxels near the stimulus edge.
Supplementary Figure 11. Recognition memory effects in the NSD data. Same format as Figure 6B, but showing results for all individual subjects. Note that positive values indicate BOLD responses are greater for hits than for correct rejections, whereas negative values indicate BOLD responses are greater for correct rejections than for hits. The observed decrease in the magnitudes of the t-values (e.g. from nsd05 to nsd20) likely reflects a decrease in the subjects' recognition accuracy over the course of the NSD experiment.
Supplementary Figure 12. Design of AlexNet-and GNet-based encoding models. A, Illustration of an encoding model that predicts brain activity in a given voxel ( ) in response to images ( ). Images are passed to nonlinear feature extractors, (trapezoids), that output feature maps (grey cuboids). Feature maps are grouped, passed through an element-wise nonlinearity, (⋅), and then multiplied pixel-wise by a spatial pooling field ( 1 , … , where superscripts index distinct groups of feature maps) that determines the region of visual space that drives voxel activity. The weighted pixel values in each feature map are then summed, reducing each feature map to a scalar value. These scalar values are concatenated across all feature maps, forming a single feature vector that is passed through another element-wise nonlinearity (left black rectangle) and then weighted by a set of feature weights, (right black rectangle), to yield predicted voxel activity. Note that for each type of encoding model (e.g., AlexNet-based encoding model, GNet-based encoding model), the feature extractors are identical for all voxels, but the spatial pooling fields and feature weights are optimized and may vary across voxels. For the AlexNet-based encoding model, the feature extractors were pre-specified, the spatial pooling fields were optimized via line search, and the feature weights were optimized via ridge regression. For the GNet-based encoding model, stochastic gradient descent with early stopping was used to optimize the parameters of the feature extractors , the spatial pooling fields 1 , … , , and the feature weights . B, Illustration of spatial pooling fields. For the AlexNet model, a single isotropic 2D Gaussian pooling field (middle) selected from a set of candidates (right) was applied to all feature maps. For the GNet model, an independent, flexible pooling field (left) was applied to each group of feature maps. Applying flexible pooling fields to AlexNet leads to lower prediction accuracy overall (results not shown), so we present the version that uses isotropic 2D Gaussian fields. C, Comparative architecture of AlexNet and GNet. AlexNet and GNet are both deep convolutional neural networks, but differ in the types and sequencing of layers (rows of the table). The first three layers are the same for both networks and correspond to the first three layers of an AlexNet trained to classify objects in the ImageNet dataset. For both networks, these shared "prefiltering" layers are followed by sequences of convolutional layers (rows labeled "conv"; values indicate feature depth and convolutional filter resolution; "str" = filter stride, "pad" = convolutional padding), max-pooling layers ("maxpool"), batchnormalization and weight-dropout layers ("batchnorm + dropout"), adaptive averaging layers ("adaptive avg"), and fullyconnected layers ("fully con."; value indicates number of units). Feature maps in the convolutional or fully connected layers (indicated by red arrows; resolution of the feature maps in parentheses) are used as predictors of brain activity in the context of an encoding model (see panel A).

Supplementary Video 6. Inspection of mean EPI across scan sessions (volume visualization).
Video available online (https://osf.io/ydf9j/). Each frame shows the mean EPI volume from a single scan session (1-mm data preparation). Note that session 0 corresponds to the prffloc scan session and the last two scan sessions from each subject correspond to the nsdsynthetic and nsdimagery scan sessions. This video is useful for assessing overall image quality and the stability of functional imaging across scan sessions. For quantitative analysis, see Supplementary Figure 5.

Supplementary Video 7. Inspection of mean EPI across scan sessions (surface visualization).
Video available online (https://osf.io/ytjk4/). This is similar in spirit to Supplementary Video 6, except that the mean EPI volumes have been projected onto each subject's cortical surface and then transferred to the fsaverage surface.

Supplementary Video 8. Inspection of BOLD signal strength across scan sessions (volume visualization).
Video available online (https://osf.io/kwxta/). Each frame shows the amount of variance explained by the ON-OFF GLM model (1-mm data preparation; fixed color range). This video is useful for assessing the overall strength and stability of BOLD responses in the NSD dataset.

Supplementary Video 9. Inspection of BOLD signal strength across scan sessions (surface visualization).
Video available online (https://osf.io/gu9wx/). This is similar in spirit to Supplementary Video 8, except that the variance explained volumes have been projected onto each subject's cortical surface and then transferred to the fsaverage surface.