Abstract
The visual system analyzes image properties across multiple spatial and temporal scales. Population receptive field (“pRF”) models have successfully characterized spatial representations across the human visual pathways. Here, we studied temporal representations, measuring fMRI and electrocorticographic (“ECoG”) responses in posterior, lateral, ventral, and dorsal visual areas to briefly viewed contrast patterns. We built a temporal pRF model employing linear summation and time-varying divisive normalization. Our model accurately predicts the fMRI amplitude and ECoG broadband time-course, accounting for two phenomena – accumulation of stimulus information over time (summation), and response reduction with prolonged or repeated exposure (adaptation). We find systematic differences in these properties: summation periods are increasingly long and adaptation more pronounced in higher compared to earlier visual areas. We propose that several features of temporal responses – adaptation, summation, and the timescale of temporal dynamics – can be understood as resulting from a small number of canonical neuronal computations.
1. Introduction
A successful visual system extracts meaning from stimuli that vary across space and time. This requires integrating and segregating features at multiple scales. The classic visual perception example that requires flexible spatial pooling is object recognition: recognizing an object requires grouping of features across space1, but grouping over too large a region results in a jumbled, or ‘crowded’, percept that interferes with recognition2. Integrating over time is also fundamental to perception, and depends on task and stimuli3, 4. For example, interpreting scenes across eye movements, blinks, occlusion events, or extended actions requires integrating and segregating features over multiple temporal scales.
The properties of spatial pooling have been widely studied, and two trends emerge along the cortical visual hierarchy. First, receptive field size is relatively small in V1 and increases in extrastriate areas, as measured in electrophysiology5 and fMRI6-8. Second, spatial summation becomes more subadditive. In extrastriate areas such as MT9 and inferotemporal cortex10, responses to two stimuli presented simultaneously are much lower than the sum of the responses to the stimuli presented separately. In a systematic study across the visual pathways in human, spatial summation was shown to become increasingly sub-additive from V1 to extrastriate areas7; this may be a correlate of achieving representations of objects and scenes that are increasingly tolerant to changes in stimulus size and location11.
Here, we investigated how neuronal populations at different stages of the visual system respond to stimuli extended over time, just as previous studies characterized how neurons respond to stimuli extended over space. Prior measurements have shown that temporal summation in individual neurons is subadditive in V112, and that for stimuli of several seconds or more, the blood oxygenation level dependent (BOLD) fMRI signal in V1 is close to, but systematically less than, the linear prediction13, 14. Here, we characterized responses to brief stimuli, at the time scale of neuronal dynamics in the visual system (tens to hundreds of ms), in many visual areas, measured with both fMRI and electrocorticography (ECoG). fMRI measurements have the advantage of being non-invasive and recording from many visual areas in parallel. The fMRI measurements also have limits for interpreting the neural response. First, subadditivities in the fMRI response can arise from the stimulus-to-neuronal transform or neuronal-to-BOLD transform. Second, the slow response does not enable us to characterize the detailed time course of the neuronal response. The ECoG measurements complement fMRI by providing much greater temporal resolution and by not compounding nonlinearities in the neuronal response with nonlinearities in the hemodynamics.
To quantify and understand how temporal information is encoded across visual cortex, we built temporal population receptive field (“pRF”) models which predict the fMRI and ECoG responses to arbitrary stimulus time courses, and we examined the model parameters in visual areas spanning V1 to IPS. Together, the temporal pRF models reveal a systematic hierarchy of increasingly large temporal windows and increasingly large deviations from linear summation, paralleling the hierarchy of spatial receptive fields.
2. Results
We present two variants of a temporal pRF model. The first was fit to fMRI data, and captures subadditivities of the BOLD amplitude for stimuli with different temporal profiles (sections 2.1-2.4). Fitting the ECoG responses required expanding the model to account for temporal dynamics at the ms scale (section 2.5). Finally, we test how accurately the expanded, dynamic model predicts the fMRI responses (section 2.6).
2.1 Measuring temporal summation in visual cortex
In each trial of the fMRI experiment, participants viewed either one or two pulses of a static spatial contrast pattern. Each pattern was an independently generated band-pass noise image (24° diameter), used in prior studies of spatial encoding7, 15, except that for the two-pulse stimuli, the two spatial patterns were identical. Each trial used one of thirteen distinct time courses (Figure 1A). The durations of the one-pulse stimuli and the ISIs of the two-pulse stimuli were the same: 0, 17, 33, 67, 134, 267, 533ms, and each pulse in the 2-pulse stimuli was 134ms. The 0-ms one-pulse stimulus was a blank (mean luminance), and the two-pulse stimulus with 0 ISI was identical to the one-pulse stimulus of twice the length (267ms). Four participants were scanned, and data were binned into nine bilateral, eccentricity-restricted (2-10°) visual areas defined from a separate retinotopy scan.
The fMRI data were analyzed in two stages. First, we extracted the amplitude (β-weight) for each stimulus condition using a variation of the general linear model, “GLM denoise”16, a technique that improves the signal-to-noise ratio by including noise regressors in the GLM. Second, we fitted the temporal pRF model to the GLM β-weights, averaged across voxels within ROIs.
2.2 Temporal summation in visual cortex is subadditive
We tested the linearity of the fMRI BOLD signal in each visual area. To do so, we assume a time-invariant linear system such that the BOLD amplitude (GLM β-weight) is proportional to the total stimulus duration within the trial1. For example, the linear prediction is that a stimulus of duration 2t produces twice the amplitude of a stimulus of duration t, and the same amplitude as a two-pulse stimulus with total duration 2t (Figure 2A). This prediction is not borne out by the data. The response to a stimulus of length 2t is about 75% of the linear prediction in V1 and 50% in TO (Figure 2B, left panel). This failure of linearity is found in all visual areas measured, with temporal summation ratios below 0.8 for all ROIs, and a tendency toward lower ratios in later areas (Figure 2C).
A further failure of linearity occurs for trials with two pulses and variable ISI: the response is larger when the ISI is longer, especially in V1, whereas the linear prediction is that the amplitudes are the same, and double the response to the one-pulse stimulus (Figure 2B, right). When the ISI is long, the response in V1 is close to the linear prediction made from the one-pulse stimulus. In TO, even with a long ISI the response is well below the linear prediction. This pattern, whereby the response to a second stimulus is reduced for short ISIs, and larger for longer ISIs, is often called adaptation and recovery17, 18. The recovery time is longer in TO than in V1.
2.3 The temporal subadditivity is captured by a compressive temporal summation model (CTS)
We modeled the temporal subadditivity with a compressive temporal summation model (“CTS”), analogous to the compressive spatial summation model (CSS) used to predict fMRI responses to spatial patterns7, 19. The model predicts the neuronal response by convolving the stimulus time course with a temporal impulse response function, and then passing the output through a power-law static nonlinearity (Figure 3). The model is linear if the exponent equals 1 and subadditive if less than 1. Finally, we summed the time-varying neuronal prediction to derive a single value, which, when scaled, represents the predicted BOLD amplitude.
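The three steps described above (convolution, power-law nonlinearity, summation over time) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gamma-shaped impulse response t·exp(-t/τ1), the 1-ms time step, and the parameter values are hypothetical choices, not the fitted model.

```python
import numpy as np

DT = 0.001  # simulation time step in seconds (assumed)

def impulse_response(tau1, length_s=1.0):
    """Gamma-shaped IRF t*exp(-t/tau1); peaks at t = tau1 (assumed form)."""
    t = np.arange(0.0, length_s, DT)
    h = t * np.exp(-t / tau1)
    return h / h.sum()  # unit sum, so a sustained input settles at 1

def one_pulse(duration_ms, total_ms=3000):
    """Binary stimulus time course sampled at 1 ms, on from t = 0."""
    s = np.zeros(total_ms)
    s[:duration_ms] = 1.0
    return s

def cts_amplitude(stimulus, tau1=0.05, eps=0.2, gain=1.0):
    """Predicted BOLD amplitude: convolve, apply the power law, then sum."""
    linear = np.convolve(stimulus, impulse_response(tau1))[: len(stimulus)]
    neural = np.maximum(linear, 0.0) ** eps  # static power-law nonlinearity
    return gain * neural.sum() * DT

# Doubling the duration doubles the output only when the exponent is 1.
ratio_linear = cts_amplitude(one_pulse(200), eps=1.0) / cts_amplitude(one_pulse(100), eps=1.0)
ratio_cts = cts_amplitude(one_pulse(200), eps=0.2) / cts_amplitude(one_pulse(100), eps=0.2)
```

With an exponent of 1 the ratio is 2, as a linear system requires; with an exponent below 1 the ratio falls between 1 and 2, which is the subadditivity the model is meant to capture.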
We compared the CTS model (fitted exponent) to a linear model (exponent fixed at 1) by measuring cross-validated accuracy. The CTS model is more accurate than the linear model for all areas (Figure 4A). The linear model substantially underpredicts responses to short durations and overpredicts responses to long durations, whereas the CTS model does not. Further, the predictions of the linear model do not depend on ISI, whereas the CTS model correctly predicts that the response amplitude increases with longer ISI. The cross-validated predictions of the CTS model capture more than 90% of the variance of the left-out data for all 9 ROIs. This represents an improvement of 8-17% compared to the linear model. The improvement is more pronounced in later than early areas (LO/TO/IPS vs. V1-V3).
2.4 The CTS model fits capture systematic differences between areas
The CTS model is parameterized by τ1, ε, and a gain factor, g. τ1 is the latency to peak in the temporal impulse response function, and therefore is related to temporal summation window length; ε is the exponent, and represents how compressive the temporal summation is. The exponent ε is less than 1 for all ROIs, and is smaller in later (hV4-IPS) than in earlier areas (V1-V3), consistent with the pattern found for spatial summation7 (Figure 4B; see Figure S1 for individual subject fits). The same pattern was also found in a second experiment using identical temporal conditions but different spatial patterns, including noise stimuli and face images (Figure S2). A consequence of more compressive temporal summation is that the response amplitude varies less with minor changes in stimulus duration, just as greater compression of spatial summation predicts more tolerance to changes in size and position7.
From the current fMRI data set, we did not observe systematic variation in τ1. Our interpretation is that we do not have enough power to accurately fit τ1 due in part to the coarse temporal resolution of fMRI. (See Figure S8A for CTS parameter recovery.) Because fitting a parameter that is not well-constrained by the data can affect the fit to other parameters, we re-fit the CTS model with τ1 fixed at 0.05, 0.1, or 0.2 s; in each case, ε is below 1 for all ROIs, and lower in later areas than early areas, just as observed in the full model fit.
To further examine the differences in temporal processing between ROIs, we summarized the CTS model in terms of two metrics that have more directly interpretable units: Rdouble and TISI (Figure 4B). Rdouble is the ratio of the CTS-predicted BOLD response to a 200-ms stimulus to the predicted response to a 100-ms stimulus; a linear system would give a ratio of 2. Lower Rdouble means more compressive temporal summation. Later visual areas have lower Rdouble than earlier ones. TISI is the minimal duration separating two 100-ms pulses such that the response to the paired stimuli is close to the linear prediction from the single stimulus. Similar to previous measurements at longer time scales20, 21, the recovery time is longer for later than earlier visual areas.
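The two summary metrics can be computed from any CTS-style prediction. The sketch below is a hypothetical illustration, not the analysis code: it assumes a gamma-shaped impulse response, takes Rdouble as the 200-ms response divided by the 100-ms response, and treats "close to the linear prediction" as reaching 95% of twice the single-pulse response. That 95% criterion and the 10-ms search step are assumptions, since the text does not give them.

```python
import numpy as np

DT = 0.001  # 1-ms time step (assumed)

def gamma_irf(tau1, n_samples=1000):
    """Gamma-shaped IRF t*exp(-t/tau1), unit sum (assumed form)."""
    t = np.arange(n_samples) * DT
    h = t * np.exp(-t / tau1)
    return h / h.sum()

def pulses(dur_ms, isi_ms=None, total_ms=3000):
    """One pulse, or two identical pulses separated by isi_ms."""
    s = np.zeros(total_ms)
    s[:dur_ms] = 1.0
    if isi_ms is not None:
        start = dur_ms + isi_ms
        s[start:start + dur_ms] = 1.0
    return s

def cts(stim, tau1, eps):
    """Summed output of convolution followed by a power law."""
    linear = np.convolve(stim, gamma_irf(tau1))[: len(stim)]
    return np.sum(np.maximum(linear, 0.0) ** eps) * DT

def r_double(tau1, eps):
    """Response to 200 ms divided by response to 100 ms (2 if linear)."""
    return cts(pulses(200), tau1, eps) / cts(pulses(100), tau1, eps)

def t_isi(tau1, eps, criterion=0.95, step_ms=10, max_ms=1000):
    """Shortest ISI (ms) at which the paired response reaches the criterion."""
    target = criterion * 2.0 * cts(pulses(100), tau1, eps)
    for isi in range(0, max_ms + 1, step_ms):
        if cts(pulses(100, isi), tau1, eps) >= target:
            return isi
    return None  # criterion not reached within max_ms
```

Under these assumptions a linear model (exponent 1) gives an Rdouble of 2 and a TISI of 0, while a compressive exponent lowers Rdouble and lengthens TISI.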
In a separate analysis, we asked whether model parameters differed as a function of eccentricity, as suggested by differential temporal sensitivity in V1 between fovea and periphery22. We did not find reliable differences for parafovea (2-5 deg) versus periphery (5-10 deg) (Figure S3). This may be due to the limited range of eccentricities: Horiguchi et al22 found the biggest difference in temporal sensitivity between fovea and the far periphery (20-60 deg), whereas we only tested out to 10 deg.
2.5 Temporal dynamics of normalization
There are at least two potential sources of subadditivity contributing to the BOLD response: subadditivity of the neuronal response with respect to the stimulus time course, and subadditivity of the fMRI amplitude with respect to the neuronal response. To evaluate additivity of the neuronal response in isolation, and to characterize the neuronal response at a finer temporal scale, we re-analyzed data from a published ECoG experiment22 (Figure 1B). We analyzed data from 45 electrodes in visual cortex (Figure S4, ECoG subject 1; Figure S7, ECoG subject 2). In each trial, a static texture (22°-diameter) was presented for 500ms followed by a 500-ms blank. We analyzed trials with noise patterns of 1/f^n amplitude spectra, with n = 0, 1, or 2 (white, pink, or brown noise). We summarized the ECoG signal as the time-varying envelope of the broadband response (60-200 Hz), averaged across stimulus class, trials, and electrodes within visual areas, as the broadband response is a correlate of the multiunit spiking activity23. Because there were fewer electrodes in anterior ROIs than in V1-V3, we grouped the anterior electrodes into lateral, ventral, and dorsal regions.
Across all visual areas, the time course of the ECoG broadband signal consisted of a large transient power increase, followed by a lower sustained response (e.g., Figure 5A, left). This transient/sustained pattern is similar to that observed for electrophysiological spiking data (e.g., 12, 24, 25). The CTS model predictions fail to capture the sharp onset transient (Figure 5A, middle panels). To account for the temporal pattern of the ECoG response, we implemented a dynamic variation of the CTS model, “dCTS”.
The dCTS model, like the CTS model, is linear-nonlinear. But in contrast to CTS, in which the nonlinearity is applied uniformly in time as a power law, the dCTS non-linearity was implemented as a divisive normalization, with the normalization signal low pass-filtered (Figure 5B). The low-pass filtering causes the response reduction to lag the linear response, producing an onset transient. This feedforward model with delayed normalization approximates a feedback normalization proposed by Heeger26. The numerator contains the linear (un-normalized) response parameterized by τ1. The denominator contains the sum of a semi-saturation constant (σ) and the low-passed linear response (parameterized by τ2). All three terms are raised to the power n. Following stimulus onset, the response increases rapidly due to the exponent n, and then reduces due to normalization (controlled by σ and n). The time constant τ2 controls the time scale of normalization. Because we are modeling the population response summed via the ECoG electrode, we treat the normalization pool (denominator) and the response pool (numerator) as the same, as previously assumed in spatial models of the fMRI signal7.
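The structure described above, a linear response divided by a delayed (low-pass filtered) copy of itself, can be sketched as follows. This is a minimal illustration under assumptions: the gamma-shaped impulse response, the exponential low-pass filter, and the parameter values are hypothetical and are not the fitted values reported for any ROI.

```python
import numpy as np

DT = 0.001  # 1-ms time step (assumed)

def dcts_response(stim, tau1=0.1, tau2=0.1, n=2.0, sigma=0.1):
    """Delayed divisive normalization: L^n / (sigma^n + lowpass(L)^n)."""
    t = np.arange(1000) * DT
    h1 = t * np.exp(-t / tau1)      # impulse response (assumed gamma shape)
    h1 /= h1.sum()
    h2 = np.exp(-t / tau2)          # low-pass filter (assumed exponential)
    h2 /= h2.sum()
    linear = np.convolve(stim, h1)[: len(stim)]    # numerator drive
    pooled = np.convolve(linear, h2)[: len(stim)]  # lagged normalization signal
    return np.abs(linear) ** n / (sigma ** n + np.abs(pooled) ** n)

# 500 ms of contrast followed by 500 ms of blank, as in the ECoG trials.
stim = np.concatenate([np.ones(500), np.zeros(500)])
resp = dcts_response(stim)
peak_ms = int(np.argmax(resp))
```

Because the normalization signal lags the numerator, the simulated response overshoots shortly after onset and then settles to a lower sustained level while the stimulus is still on.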
The dCTS model, fitted to the ECoG broadband time series, captures the main features of the temporal dynamics in all ROIs - an initial transient followed by a sustained response (Figure 6A) – explaining 93% to 99% of the variance in the time courses. In some electrodes, especially those with peripheral receptive fields (Figure S4), there is a small positive deflection 100-200ms after stimulus offset. This is consistent with the finding that peripheral V1 has a relatively greater sensitivity to visual transients27. This feature of the data is not captured by our model. A variant of the model, in which the linear impulse response function is biphasic, predicts the offset transient (Figure S5). Because the offset response is not evident for most electrodes, we use the monophasic response function for primary analyses.
Although the time-courses in all ROIs follow a transient-sustained pattern, they differ in detail. These differences are reflected in model parameters (Figure 6B). This is clearest for the time-scale of the impulse response function, τ1, which generally increases along the visual hierarchy, from ~90ms (V1) to ~150ms in later areas. The parameters n, σ, and τ2, do not follow as clear a pattern. However, the relationship between a single model parameter and the predicted response depends on the other parameters. For example, the level of the sustained response increases with n and decreases with σ.
To clarify the effect of the fitted parameters on the resultant time series, we derived two summary metrics for each model fit (Figure 6C): For a sustained stimulus, the model predictions were summarized by the time to peak (Tpeak) and the asymptotic response amplitude (Rasymptote). A longer Tpeak indicates a longer temporal summation window, and increases slightly from V1 to V3, and substantially in more anterior areas. A smaller Rasymptote corresponds to a lower sustained response, indicative of more normalization. Rasymptote is highest in V1, and decreases substantially in extrastriate areas.
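As a rough illustration of how such summary metrics can be read off a model prediction, the sketch below simulates a delayed-normalization response to a long sustained stimulus and extracts the time to peak and the late response level. The filter shapes and parameter values are assumptions, and expressing the asymptote as a fraction of the peak is a choice made here for illustration; the text does not specify the normalization.

```python
import numpy as np

DT = 0.001  # 1-ms time step (assumed)

def dcts(stim, tau1, tau2=0.1, n=2.0, sigma=0.1):
    """Compact delayed-normalization model (assumed filter shapes)."""
    t = np.arange(2000) * DT
    h1 = t * np.exp(-t / tau1)
    h1 /= h1.sum()
    h2 = np.exp(-t / tau2)
    h2 /= h2.sum()
    linear = np.convolve(stim, h1)[: len(stim)]
    pooled = np.convolve(linear, h2)[: len(stim)]
    return linear ** n / (sigma ** n + pooled ** n)

def summary_metrics(tau1, duration_s=3.0, **params):
    """Return (time to peak in s, late response as a fraction of the peak)."""
    stim = np.ones(int(round(duration_s / DT)))  # sustained stimulus
    resp = dcts(stim, tau1, **params)
    t_peak = np.argmax(resp) * DT
    r_asymptote = resp[-1] / resp.max()
    return t_peak, r_asymptote

t_fast, r_fast = summary_metrics(tau1=0.05)
t_slow, r_slow = summary_metrics(tau1=0.15)
```

Holding the other parameters fixed, the longer impulse-response time constant yields a later peak in this simulation, in line with the interpretation of Tpeak as an index of the summation window.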
In a separate analysis, we assessed the effect of our signal processing pipeline on the parameter estimates. Because the broadband envelope is derived from a modulating signal, its temporal resolution is limited by the period of the oscillations. Simulations show that this has a small but measurable effect on parameter estimates of the dCTS model, with no change in the general pattern of results (Figure S6).
2.6 Integration of fMRI and ECoG
The fMRI and ECoG data sets were fit with different variants of the CTS model. The two variants were chosen for practical reasons – the slow time scale of the fMRI response limits our ability to resolve the dynamics of the nonlinearity, and the static non-linearity used to fit the fMRI data is a poor fit to the ECoG time course. Here we asked how accurately the dCTS model, fit to ECoG data, predicts the fMRI responses. In each ROI, the dCTS parameters derived from ECoG data were used to generate time-course predictions for the 13 distinct temporal stimuli used in the fMRI experiment. We converted these time courses to predicted BOLD amplitudes assuming one of two fMRI transforms: either linear, which was shown to be a reasonable approximation for relatively long ISIs (a few to many seconds) 13, 14, 28, or a square root transform, as recently proposed29 (Figure 7). Because the dCTS model parameters were derived from the ECoG data alone, there were no free parameters other than a gain factor. Although the models were solved with different participants, different stimuli, and a different instrument, they nonetheless accurately fit the BOLD data, with r2 ranging from 67% to 94% for the linear fMRI transform, and 80% to 96% for the square root transform. For every ROI, the square root transform was slightly more accurate than the linear transform. The most accurate fits for both transforms are for V1-V3.
3. Discussion
3.1 Summation and adaptation in visual cortex
We report subadditive temporal summation throughout human visual cortex. Across 9 areas, responses to long stimuli were less than the linear prediction from briefer stimuli, with more pronounced subadditivities in areas anterior to V1-V3. We captured this effect in a new temporal receptive field model, with a static non-linearity to explain the fMRI amplitude and a dynamic non-linearity to explain the ECoG time course. The dynamic implementation is more general, as it accurately predicts responses in both modalities. Nonetheless, the simpler instantiation of the model (CTS) is adequate to make highly accurate predictions for the fMRI data (cross-validated R2 ~ 90%); an adequate model can be useful and is commonly employed in science and engineering, even when the model is known to fail for certain conditions (which all models do) 8, 30.
The two variants of the model, CTS and dCTS, account for two phenomena: first, areas accumulate information over time (summation, modeled as temporal convolution), and second, response levels reduce from prolonged or repeated exposures (adaptation, modeled with an exponent or divisive normalization). Both phenomena, and the corresponding model parameters, vary systematically across the visual hierarchy: the summation window lengthens and the effect of adaptation grows more pronounced in later compared to earlier visual areas.
3.2 Subadditivities in fMRI
We observed temporal subadditivities for fMRI and ECoG and therefore these effects cannot be solely due to hemodynamic nonlinearities. For the fMRI model fits, we assumed a linear transformation from the neural to BOLD response, as proposed previously13, 14. A recent alternative proposal is a square root transformation29. We compared fMRI predictions from ECoG models using linear and square root transforms, and found both fit well, with slightly better fits for the square root transform. There are numerous differences between the ECoG and fMRI experiments so we do not consider this a compelling reason to reject the linear assumption. If we do assume the square root transform as the last stage of the CTS model (conversion to fMRI), the CTS model parameters would differ, with exponents between 0.2 and 0.5, rather than 0.1 and 0.25, still consistent with significant temporal subadditivities across visual cortex. Thus, the fMRI results, as well as the ECoG results, provide strong evidence for temporal nonlinearities in the neural response.
3.3 Subadditivities in Temporal Summation
Prior literature has characterized temporal subadditivities in several ways. For example, the fMRI response to a long presentation of a reversing contrast pattern is less than the prediction from a short presentation13; the fMRI response to contrast patterns is larger for long ISIs than short ISIs31; the response of V1 neurons to a steady flash is not predicted by its temporal frequency tuning and decreases over time12; the response of a neuron to a repeated stimulus is less than the response to the first stimulus17, 24. Our model accounts for effects such as these with a small number of components – temporal summation (convolution) and a normalization that depends on response history. By formulating a quantitative, forward model, we can then ask whether a phenomenon is unexpected, requiring additional explanation, or is already predicted by the model. For example, repetition suppression and fMRI adaptation at a long time-scale (several seconds32, 33) might not be predicted by our model, and hence may be distinct from the short-term adaptation effects we observe.
A phenomenon as ubiquitous as subadditive temporal summation (adaptation) is likely to be a critical part of neural information processing34. For example, adaptation may serve to prioritize new information or act as a gain control35. An interesting consequence of subadditive temporal summation is that responses to stimuli of different durations are more similar to one another than they would be if summation were linear. This may be thought of as a form of duration tolerance or timing tolerance, analogous to size and position tolerance in spatial encoding, which are increasingly prominent in higher visual areas7.
3.4 Multiple Scales of Temporal Dynamics
Our finding that temporal windows lengthen across the visual hierarchy is consistent with prior work measuring temporal dynamics at a larger scale. For example, temporal receptive window length was studied by measuring response reliability to scrambled movie segments36, 37: In visual cortex, responses depended on information accumulated over ~1s, whereas in anterior temporal, parietal and frontal areas the time scale ranged from ~12-36s. Similarly, in event related fMRI, the influence of prior trials was modeled with an exponential decay, with longer time constants in later areas: Boynton et al 13 reported a time constant of ~1s in V1 for contrast reversing checkerboards, and Mattar et al21, using static face images, reported short time constants in V1 (~0.6s) and much longer constants in face areas (~5s). In macaque, the timescale of fluctuations in spike counts was longer for areas higher in the hierarchy compared to sensory areas38.
Analyzing visual information at multiple temporal scales has benefits. First, accumulating information in the past is necessary for predicting the future, and a hierarchy of temporal windows may thus be useful for predictions over different time-scales39. Second, signal-to-noise ratios are optimized when the temporal scale of the analysis is matched to the temporal scale of the event of interest (i.e., a “matched filter”); different visual areas extract information about different image properties, which in turn are likely to have different temporal (or spatiotemporal) distributions in natural viewing. Conversely, the timescale of cortical areas may set the time-scale of integration for behavior. For example, words, faces, and global motion patterns are integrated over periods 5-10 times longer than textures and local motion patterns40, 41; modeling the time-scale of cortical areas critical for these tasks may help explain these large behavioral effects.
3.5 Models of Temporal Dynamics
Several models have been proposed to account for temporal dynamics (Figure S9). For example, psychophysical temporal sensitivity42-44 and fMRI responses in V127 and extrastriate cortex45 can be accounted for by a model with two temporal frequency channels, sustained and transient. This model also captures some features of the ECoG broadband response, but does not match the time series in detail for our 500-ms stimuli (Figure S9). For example, it does not predict a gradual decline in signal amplitude following the peak response. The dCTS model has a different form, which was motivated by the aim of capturing two important phenomena governing temporal dynamics: the timescale of summation and the degree of subadditivity. The model components accounting for these phenomena are grounded in canonical computations used to model visual cortex: linear filtering, exponentiation, and normalization46-48. The two temporal channels model contains filtering and exponentiation but not normalization. A potential way to assess a specific role for normalization would be an experiment with two stimuli superimposed spatially but with different temporal frequencies: The two-channel model would predict summation, but normalization would predict subadditivity of fMRI responses or frequency tagged MEG or EEG responses. On the other hand, a model with two temporal channels may be useful for capturing differential time courses to stimuli that preferentially drive magno vs parvo pathways, or for differences in foveal vs peripheral sensitivity27, 45; hence the two types of models are complementary.
The dCTS model we propose is input-referred8, i.e. a computational description of the output specified in terms of the visual stimulus, rather than a model of how the dynamics arise. Hypotheses about circuit mechanisms giving rise to temporal dynamics in cortex have been proposed49, 50; these dynamical systems models predict differences in time scales across cortical hierarchies, in agreement with empirical results, though they don’t account for the specific shape of neural temporal responses (e.g., compare Figure 3A in [50] to Figure S9). Another way to account for the different time scales across visual areas would be a cascade model, in which the dCTS is a canonical computation, with the output of one stage used as the input to the next stage, with the same model parameters used in each stage. Such a cascade model can account for some of the properties in later visual areas, such as more subadditive temporal summation.
3.6 Generalization and future directions
The dCTS model we fit accurately predicts responses across multiple visual field maps using two different types of measures and many stimulus temporal profiles. An important test of a model is whether it can make informative predictions for conditions it was not designed to account for. The fact that the dCTS model, fit only to ECoG data from 500-ms stimuli, predicts the fMRI responses for many different temporal patterns is an example of successful quantitative generalization. As a test of qualitative generalization to conditions that differ even further from those the model was designed for, the dCTS model predicts different time course shapes as a function of stimulus contrast, similar to multi-unit activity (MUA) observed in human visual cortex41 (Figure 8A). One reason that our model, developed to account only for temporal patterns, generalizes to contrast is that the model is comprised of elements fundamental in sensory processing (filtering and normalization). Finally, the dCTS model predictions for temporally white noise stimuli have autocorrelation functions that decline with temporal lag, with slower declines for later visual areas, consistent with network models of macaque cortex50.
However, just as with spatial pRF models, it is likely that our model will fail for certain tasks or stimuli8. For example, sustained attention to the stimulus41, presence of a surround40, non-separable spatiotemporal patterns (motion), and stimulus history of many seconds or more20, can all affect the time course of the response, phenomena not captured by our current model. However, a model with these limits is still quite useful: By formulating a forward model of responses to large-field contrast stimuli during passive viewing, we provide a quantitative benchmark that can be used to assess how other factors influence response dynamics, and a platform upon which to extend the model to new stimulus or task features. An important goal for future work is to develop a space-time model that simultaneously accounts for nonlinearities in spatial7 and temporal summation.
ONLINE METHODS
4.1 fMRI procedure
Participants
Data from four experienced fMRI participants (2 males, age range 21-48, mean age 31) were collected at the Center for Brain Imaging (CBI) at NYU. All participants had normal or corrected-to-normal visual acuity. The experimental protocol was approved by the University Committee on Activities Involving Human Subjects, and informed written consent was obtained from all participants prior to the study. Each subject participated in one 1.5-hour session for the main experiment, and an additional 1-hour session for visual field map identification and high-resolution anatomical volumes.
Visual Stimuli
Stimuli
In each trial, we used an independently generated large field (24° diameter) band-pass noise pattern (centered at 3 cycles per degree). The pattern was chosen because it was previously shown to be effective in eliciting responses in most visual areas1. (See ref [1] for details on stimulus construction). In each trial of the supplementary fMRI experiment, participants viewed either an independently generated pink noise (1/f amplitude spectrum, random phase) large field image (24° diameter, 768 x 768 pixels), or a face image embedded in the pink noise. Stimulus generation, presentation and response recording were coded using Psychophysics Toolbox2, 3 and vistadisp (https://github.com/vistalab/vistadisp). We used a MacBook Air computer to control stimulus presentation and record responses from the participants (button presses) during the experiment.
Display
Stimuli were displayed via an LCD projector (Eiki LC_XG250; resolution: 1024 x 768 pixels; refresh rate: 60 Hz) onto a back-projection screen in the bore of the magnet. Participants, at a viewing distance of ~58 cm, viewed the screen (field of view, horizontal: ~32°, vertical: ~24°) through an angled mirror. The images were confined to a circular region with a radius of 12°. The display was calibrated and gamma corrected using a linearized lookup table.
Fixation task
To stabilize attention level across scans and across subjects during the main experiment, all participants were instructed to do a one-back digit task at the center of fixation throughout the experiment. The digit (0.24° x 0.24°) was presented at the center of a neutral gray disk (0.47° diameter). Within a scan, each digit (randomly selected from 0 to 9) was on for 0.5 second, then off for 0.167 second before the next digit appeared at the same location. Participants were asked to press a button when a digit repeated. Digit repetitions occurred on about 2-3% of presentations, with no more than two identical digits presented successively. To reduce visual adaptation, all digits alternated between black and white, and on average participants pressed a button every 30 seconds. During the retinotopy task, the fixation alternated pseudo-randomly between red and green (switches on average every 3s), and the subject pressed a button to indicate color changes.
Experimental Design
We used a randomized event-related experimental design to prevent subjects from anticipating the stimulus conditions. An event is a stimulus presented according to one of thirteen distinct time courses (< 800 ms in total), either a single pulse with variable duration or a double pulse with fixed duration and variable inter-stimulus interval (ISI). Durations and ISIs were multiples of the monitor dwell time (1/60 s). Each pulse in the double-pulse stimuli lasted 134 ms. The 0-ms stimulus was a blank (zero contrast, mean luminance, and hence identical to the preceding and subsequent blank screen between stimulus events). Each participant completed seven scans, and within a scan, each temporal event was repeated four times. A temporal event started with the onset of a pattern image, and the inter-trial interval (stimulus plus subsequent blank) was always 4.5 seconds. For experiments with two pulses, the two noise patterns were identical. The design was identical for the supplementary fMRI experiment, except that each time course was repeated three times per scan, and each participant completed 12 scans.
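As a minimal sketch of how such time courses can be laid out in frames of 1/60 s, the following builds one-pulse and two-pulse contrast sequences. The function names, the 48-frame trial length, and the example ISI are illustrative assumptions and not the stimulus code used in the experiment; only the 8-frame pulse (the ~134 ms pulse of the double-pulse stimuli) follows from the text.

```python
FRAME_S = 1 / 60  # monitor dwell time in seconds


def single_pulse(n_frames_on, n_frames_total=48):
    """Contrast time course (1 = pattern on, 0 = blank) for a one-pulse trial."""
    return [1] * n_frames_on + [0] * (n_frames_total - n_frames_on)


def double_pulse(n_frames_isi, n_frames_pulse=8, n_frames_total=48):
    """Two identical pulses of n_frames_pulse frames separated by a blank ISI."""
    on = [1] * n_frames_pulse
    course = on + [0] * n_frames_isi + on
    return course + [0] * (n_frames_total - len(course))


# Hypothetical example: two 8-frame pulses separated by a 4-frame ISI.
trial = double_pulse(n_frames_isi=4)
print(trial)
print(sum(trial) * FRAME_S)  # total time the pattern is on screen, in seconds
```

A blank (0-ms) event is simply `single_pulse(0)`, which is all zeros.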
MRI Data Acquisition
All fMRI data were acquired at the NYU Center for Brain Imaging (CBI) using a Siemens Allegra 3T head-only scanner with a Nova Medical phased array, 8-channel receive surface coil (NMSC072). For each participant, we collected functional images (1500 ms TR, 30 ms TE, and 72-degree flip angle). Voxels were 2.5 mm isotropic, with 24 slices. The slice prescription covered most of the occipital lobe, and the posterior part of both the temporal and parietal lobes. Images were corrected for B0 field inhomogeneity using CBI algorithms during offline image reconstruction.
In a separate session, we acquired two to three T1-weighted whole brain anatomical scans (MPRAGE sequence; 1 mm³). Additionally, a T1-weighted “inplane” image was collected with the same slice prescription as the functional scans to aid alignment of the functional images to the high-resolution T1-weighted anatomical images. This scan had an inplane resolution of 1.25 x 1.25 mm and a slice thickness of 2.5 mm.
Data Preprocessing and Analysis
Data preprocessing
We co-registered and segmented the T1-weighted whole brain anatomical images into gray and white matter voxels using FreeSurfer’s auto-segmentation algorithm (surfer.nmr.mgh.harvard.edu). Using custom software, vistasoft (https://github.com/vistalab/vistasoft), the functional data were slice-time corrected by resampling the time series in each slice to the center of each 1.5 s volume. Data were then motion-corrected by co-registering all volumes of all 7 scans to the first volume of the first scan. The first 8 frames (12 seconds) of each scan were discarded from analysis to allow the longitudinal magnetization and the hemodynamic response to stabilize.
GLM analysis
We used a variant of the GLM procedure, GLM denoise4, a technique that improves signal-to-noise ratios by entering noise regressors into the GLM analysis. Noise regressors were selected by performing principal component analysis on voxels whose activities were unrelated to the task. The optimal number of noise regressors was selected based on the improvement in cross-validated R². The input to GLM denoise was the pre-processed EPI data and a design matrix for each scan (13 distinct temporal profiles x number of time points per scan), and the output was β-weights for each temporal profile for each voxel, bootstrapped 100 times across scans. For analysis, we normalized all 13 β-weights per voxel by the vector length and selected a subset of voxels (see Voxel selection). We then averaged the β-weights for a given temporal condition from the first bootstrap across voxels within each ROI and across all subjects to get a mean; this gives one estimate of the mean response per ROI for a given condition. This was repeated for each condition, and then repeated for each of the 100 bootstraps, yielding a matrix of 100 x 13 for each ROI (bootstraps by temporal condition).
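The normalization and averaging steps for a single bootstrap can be sketched as follows. This is an illustration only: the function names and the toy data are assumptions, and "vector length" is taken to mean the Euclidean norm of a voxel's weights across conditions.

```python
import math


def normalize_by_vector_length(betas):
    """Divide one voxel's beta weights by their Euclidean norm."""
    norm = math.sqrt(sum(b * b for b in betas))
    return [b / norm for b in betas]


def roi_mean(voxel_betas):
    """Average the normalized betas across voxels, condition by condition."""
    normalized = [normalize_by_vector_length(v) for v in voxel_betas]
    n_vox = len(normalized)
    return [sum(col) / n_vox for col in zip(*normalized)]


# Toy data for one bootstrap: 2 voxels x 3 conditions (the experiment has 13).
one_bootstrap = [[0.0, 3.0, 4.0], [0.0, 6.0, 8.0]]
print(roi_mean(one_bootstrap))
```

Repeating `roi_mean` over the 100 bootstraps would give the 100 x 13 matrix per ROI described above. Because each voxel is scaled to unit length first, voxels with large response amplitudes do not dominate the ROI average.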
ROI identification
We fitted a linear pRF model5 to each subject’s retinotopy data (average of two scans). We made an initial guess of ROI locations by first projecting the maximum likelihood probabilistic atlas from Wang et al6 onto the cortical surface. Then we visualized eccentricity and polar angle maps derived from the pRF model fits and modified ROI boundaries based on visual inspection. For each participant, we defined nine bilateral ROIs (V1, V2, V3, hV4, VO-1/2, LO-1/2, TO-1/2, IPS-0/1).
Voxel selection
All analyses were restricted to voxels that satisfied the following three criteria. First, voxels must be located within 2-10° (eccentricity) based on the pRF model. Second, voxels must have positive bootstrapped β-weights (averaged across bootstraps) for all non-blank temporal conditions. Third, voxels must have > 3% GLM R². Voxels that satisfied all criteria were pooled across subjects, and the group average (bootstrapped) β-weights were analyzed and plotted.
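The three criteria amount to a simple filter, sketched here under assumed data structures. The dictionary keys and the toy voxels are hypothetical, and whether the eccentricity bounds are inclusive is also an assumption.

```python
def select_voxels(voxels, ecc_range=(2.0, 10.0), min_r2=3.0):
    """Return the voxels that pass the eccentricity, positivity, and R2 criteria.

    Each voxel is a dict with hypothetical keys:
      'ecc'   - pRF eccentricity in degrees
      'betas' - bootstrap-averaged beta weights for the non-blank conditions
      'r2'    - GLM variance explained, in percent
    """
    lo, hi = ecc_range
    return [
        v for v in voxels
        if lo <= v["ecc"] <= hi
        and all(b > 0 for b in v["betas"])
        and v["r2"] > min_r2
    ]


toy = [
    {"ecc": 5.0, "betas": [0.2, 0.4], "r2": 12.0},   # passes all three
    {"ecc": 1.0, "betas": [0.2, 0.4], "r2": 12.0},   # too foveal
    {"ecc": 5.0, "betas": [-0.1, 0.4], "r2": 12.0},  # a negative beta
    {"ecc": 5.0, "betas": [0.2, 0.4], "r2": 1.0},    # low R2
]
print(len(select_voxels(toy)))
```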
4.2 ECoG Procedure
We re-analyzed previously published ECoG data7.
Preprocessing
The data were pre-processed as in the original paper. In brief, electrodes that had large artifacts or epileptic activity, as identified by the neurologist, were excluded from analysis. From the remaining electrodes, we re-referenced the time series to the common average, and then downsampled the data from the recorded sampling frequency (3052/1528 Hz for Subject 1/Subject 2) to 1,000 Hz.
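Common-average re-referencing can be sketched as below, assuming the data are held as one list of samples per retained electrode (a hypothetical layout; the downsampling step is not shown).

```python
def common_average_reference(data):
    """Subtract the across-electrode mean from every electrode at each sample.

    data: list of electrodes, each a list of voltage samples of equal length.
    """
    n_elec = len(data)
    mean_trace = [sum(samples) / n_elec for samples in zip(*data)]
    return [[x - m for x, m in zip(elec, mean_trace)] for elec in data]


# Toy example: 3 electrodes x 2 samples.
raw = [[1.0, 4.0], [2.0, 5.0], [3.0, 9.0]]
print(common_average_reference(raw))
```

After re-referencing, the electrodes sum to zero at every time point, which removes signal components shared by all electrodes.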
Trial structure
At the beginning of each 1-second trial, a large-field (22°) noise image was randomly selected from one of 8 image classes. Several of these image classes were chosen for studying gamma oscillations in the original paper, which was not the purpose of this study. For this study, we analyzed data from 3 of the 8 image classes, those that were most similar to the noise stimuli in the fMRI experiment: white, pink, and brown noise (amplitude spectra proportional to 1/f^0, 1/f^1, and 1/f^2, respectively). Each image was presented for 500 ms followed by a 500 ms blank. We analyzed data in 1200 ms epochs, beginning 200 ms prior to stimulus onset and ending 500 ms after stimulus offset.
Broadband envelope
We computed the time-varying broadband envelope in several steps, as follows. First, we band-pass filtered the time series in 12 adjacent 10-Hz bins from 80 Hz to 200 Hz (80-90 Hz, 90-100 Hz, etc.) using a Butterworth filter (passband ripples < 3 dB, stopband attenuation 60 dB). For each filtered time series, we computed the envelope as the magnitude of the analytic signal (obtained via the Hilbert transform). We then normalized the envelope of each bin by dividing by the variance, so that each envelope had a variance of 1. We normalized the variance to compensate for the fact that the power in field potentials declines with frequency. We then summed the 12 envelopes to derive a single, time-varying broadband envelope. Finally, we defined the baseline as the average value of the envelope in the 200 ms prior to stimulus onset and subtracted this baseline value from the time series at all points.
Broadband units
Because of the normalization of the 12 bands, the broadband power is the sum of 12 z-scores. So, for example, a stimulus-driven power increase of 12 means an average increase in power of 1 z-score in each of the 10-Hz frequency bands.
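The last three steps (per-band normalization, summation, baseline subtraction) and the resulting units can be illustrated with a small sketch. It assumes the band envelopes have already been extracted (filtering and the Hilbert transform are not shown), and it assumes that unit variance is obtained by scaling each envelope by its standard deviation; the function names and toy data are illustrative.

```python
import statistics


def to_unit_variance(envelope):
    """Scale one band envelope so that its (population) variance is 1."""
    sd = statistics.pstdev(envelope)
    return [x / sd for x in envelope]


def broadband(envelopes, n_baseline):
    """Sum unit-variance band envelopes and subtract the pre-stimulus mean.

    envelopes: one list of samples per 10-Hz band (already converted to envelopes).
    n_baseline: number of pre-stimulus samples at the start of the epoch.
    """
    scaled = [to_unit_variance(env) for env in envelopes]
    summed = [sum(samples) for samples in zip(*scaled)]
    baseline = sum(summed[:n_baseline]) / n_baseline
    return [x - baseline for x in summed]


# Toy example with 2 bands (the analysis uses 12) and 2 baseline samples.
bands = [[1.0, 1.0, 3.0, 3.0], [2.0, 2.0, 10.0, 10.0]]
print(broadband(bands, n_baseline=2))
```

In this toy case each band rises by two of its own standard deviations after the baseline period, so the summed trace rises by 4, in line with reading the broadband value as a sum of per-band z-scores.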
Electrode selection
We selected all electrodes located in identifiable visual areas based on separate retinotopy scans, and whose stimulus-triggered broadband response, averaged across trials, reached at least a power of 3 (see broadband units, above).
4.3 Temporal pRF Models
We used three variants of a temporal pRF model, one linear and two non-linear, to predict neuronal summation measured using fMRI and ECoG. All model forms take the time course of a spatially uniform contrast pattern as input (Tinput), and produce a predicted neuronal response time course as output. To predict the fMRI data (BOLD), we summed the predicted time course within a trial (< 1 s) to yield one number per temporal condition. These numbers were compared to the fMRI β-weights for model fitting (see below). For ECoG data, the predicted time course was compared directly to the broadband time series for model fitting.
Models
Linear model
The linear model prediction is computed by convolving a neuronal impulse response function (IRF) with the stimulus time course (Tinput), and scaling by a gain factor (g):
$$r_{\text{linear}}(t) = g\,[\mathrm{IRF} * T_{\text{input}}](t)$$
The time course is then summed for the fMRI predictions (plus an error term, e):
$$\mathrm{BOLD} = g \sum_{t} [\mathrm{IRF} * T_{\text{input}}](t) + e$$
For the IRF, we assumed a gamma function parameterized by its time to peak, τ1.
Because the IRF was assumed to have unit area, the specific shape of the IRF has no effect on the predictions, and the prediction reduces to:
$$\mathrm{BOLD} = g \sum_{t} T_{\text{input}}(t) + e$$
and the only value solved for is the gain factor. We did not fit the linear model to ECoG data because the linearly predicted time courses clearly differ from the broadband traces.
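The claim that a unit-area IRF drops out of the summed linear prediction can be checked numerically. The sketch below is illustrative only: the IRF shape t·exp(−t/τ1) is an assumed gamma form that peaks at τ1, and the function names, the 1 ms time step, and the 1 s IRF length are likewise assumptions.

```python
import math


def gamma_irf(tau1, dt=0.001, duration=1.0):
    """Unit-area impulse response; the shape t * exp(-t / tau1) is an assumption."""
    n = int(round(duration / dt))
    h = [(i * dt) * math.exp(-(i * dt) / tau1) for i in range(n)]
    total = sum(h)
    return [v / total for v in h]


def convolve(signal, kernel):
    """Full discrete convolution (length len(signal) + len(kernel) - 1)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        if s == 0:
            continue
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def linear_bold(stimulus, tau1, gain=1.0):
    """Summed linear response for one trial (error term omitted)."""
    return gain * sum(convolve(stimulus, gamma_irf(tau1)))


stimulus = [1] * 100  # a 100 ms pulse sampled at 1 ms
print(linear_bold(stimulus, tau1=0.05), linear_bold(stimulus, tau1=0.2))
```

Both calls return (up to rounding) the summed stimulus, 100, regardless of τ1, which is why the gain is the only quantity left to estimate.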
Compressive summation model (CTS)
To compute the CTS predicted neuronal response, we first computed the linear response by convolving an IRF (gamma function with variable time to peak τ1) with an input stimulus time course. Then an exponent ε is applied point-wise to the predicted linear output:
$$r_{\text{CTS}}(t) = g\,\big([\mathrm{IRF} * T_{\text{input}}](t)\big)^{\varepsilon}$$
To fit the CTS model to the fMRI data, we again summed the predicted response time series:
$$\mathrm{BOLD} = g \sum_{t} \big([\mathrm{IRF} * T_{\text{input}}](t)\big)^{\varepsilon} + e$$
and solved for τ1, ε, and g. We did not fit the CTS model to the ECoG broadband traces because the CTS-predicted neuronal response differs qualitatively from the measurements.
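A compact sketch of the summed CTS prediction follows. As before, the IRF shape t·exp(−t/τ1), the 1 ms time step, the parameter values, and the function name are assumptions made for illustration.

```python
import math


def cts_bold(stimulus, tau1, eps, gain=1.0, dt=0.001, irf_len=1.0):
    """Summed CTS response: convolve with a unit-area IRF, raise to eps, sum."""
    n = int(round(irf_len / dt))
    irf = [(i * dt) * math.exp(-(i * dt) / tau1) for i in range(n)]  # assumed shape
    area = sum(irf)
    linear = [0.0] * (len(stimulus) + n - 1)
    for i, s in enumerate(stimulus):
        if s:
            for j, h in enumerate(irf):
                linear[i + j] += s * h / area
    return gain * sum(v ** eps for v in linear)


short = [1] * 100  # 100 ms pulse
long_ = [1] * 200  # 200 ms pulse
ratio = cts_bold(long_, tau1=0.05, eps=0.5) / cts_bold(short, tau1=0.05, eps=0.5)
print(ratio)  # below 2 whenever eps < 1
```

With ε < 1, doubling the stimulus duration less than doubles the summed response, which is the subadditive temporal summation the model is meant to capture. With ε = 1 the ratio is 2 and the model reduces to the linear case.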
Dynamic compressive temporal summation (dCTS)
This variant of the CTS model implemented the compressive nonlinearity with a divisive normalization rather than a compressive power law. The numerator contains the linear neuronal response (same computation as the linear part in CTS). The denominator is the sum of two terms, a semi-saturation constant (σ) and an exponentially filtered (low-pass) linear response. The rate of the exponential decay is determined by a parameter τ2. All three terms (one in the numerator, two in the denominator) are raised to the power n, assumed to be greater than 1.
We fit the 4 parameters as well as a gain factor, g, to the ECoG broadband time series. To predict the fMRI response from the dCTS model (Figure 7), we used the parameters fitted from ECoG data for each ROI and generated a neuronal time course for each of the 13 distinct temporal profiles from the fMRI experiment. Then we either summed each predicted time course (linear assumption) or point-wise square-rooted the time course and then summed, and finally scaled the sum by a gain factor.
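The verbal description of the dCTS computation can be turned into a short sketch. Several details here are assumptions rather than the fitted implementation: the IRF shape t·exp(−t/τ1), a unit-area exponential low-pass filter implemented recursively, the 1 ms time step, and the example parameter values.

```python
import math


def dcts_response(stimulus, tau1, tau2, n, sigma, gain=1.0, dt=0.001, irf_len=1.0):
    """Divisively normalized response time course (same length as the stimulus)."""
    m = int(round(irf_len / dt))
    irf = [(i * dt) * math.exp(-(i * dt) / tau1) for i in range(m)]  # assumed shape
    area = sum(irf)
    # Linear stage: stimulus convolved with the unit-area IRF.
    linear = [0.0] * len(stimulus)
    for i, s in enumerate(stimulus):
        if s:
            for j in range(min(m, len(stimulus) - i)):
                linear[i + j] += s * irf[j] / area
    # Low-pass stage: exponential filter with time constant tau2 (unit area).
    a = math.exp(-dt / tau2)
    pooled, state = [], 0.0
    for v in linear:
        state = a * state + (1 - a) * v
        pooled.append(state)
    # Normalization: numerator and both denominator terms raised to the power n.
    return [gain * v ** n / (sigma ** n + p ** n) for v, p in zip(linear, pooled)]


def dcts_bold(response, gain=1.0, sqrt=False):
    """Summed response, optionally square-rooted point-wise before summing."""
    return gain * sum(math.sqrt(v) if sqrt else v for v in response)


stimulus = [1] * 500 + [0] * 500  # 500 ms on, 500 ms off, sampled at 1 ms
resp = dcts_response(stimulus, tau1=0.05, tau2=0.1, n=2, sigma=0.1)
print(max(resp), resp[499], dcts_bold(resp), dcts_bold(resp, sqrt=True))
```

Because the normalization pool lags the linear response, the output shows an onset transient that decays to a lower sustained level while the stimulus is still on.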
Parameter estimation
CTS model for fMRI
Models were fit in two steps, one to obtain seed parameters, and one to fit parameters.
In the first step, we obtained seed values for τ1 and ε for each ROI. To do so, we generated 1000 seeds by randomly selecting τ1 from [0.01, 1] and ε from [0, 1]. These were then used to make 1000 sets of model predictions for the 13 temporal stimuli. For each ROI, the 1000 sets of model predictions were compared to the 13 β-weights. Using linear regression, we then derived the gain factor, g, and the variance explained for each of the 1000 sets of predictions. The model parameters τ1, ε, g were averaged from all models with variance explained greater than 95%. This gave us seeds for the three parameters for each ROI.
We then did a search fit using Matlab’s fminsearch, 100 times per ROI, using the 100 sets of bootstrapped β-weights, and the seeds as derived above. The search finds the parameters which minimize the squared error between predicted and measured β-weights. This gave us 100 estimates of each model parameter for each ROI, which we summarized by the median and 50% confidence interval.
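The seeding stage can be sketched generically as follows. The sketch assumes that the gain is obtained by least squares without an intercept and that variance explained is computed relative to the mean of the data; the function names and the toy power-law model standing in for the CTS prediction are hypothetical.

```python
import random


def fit_gain(pred, data):
    """Least-squares gain for data ~ g * pred (no intercept; an assumption)."""
    return sum(p * d for p, d in zip(pred, data)) / sum(p * p for p in pred)


def variance_explained(pred, data):
    """Percent variance explained, relative to the mean of the data."""
    mean = sum(data) / len(data)
    sse = sum((d - p) ** 2 for p, d in zip(pred, data))
    sst = sum((d - mean) ** 2 for d in data)
    return 100 * (1 - sse / sst)


def seed_search(predict, sample, data, n_seeds=1000, threshold=95.0, rng=None):
    """Average all randomly drawn parameter sets whose fit exceeds the threshold."""
    rng = rng or random.Random(0)
    kept = []
    for _ in range(n_seeds):
        params = sample(rng)
        pred = predict(params)
        g = fit_gain(pred, data)
        if variance_explained([g * p for p in pred], data) > threshold:
            kept.append(params + [g])
    if not kept:
        return None
    return [sum(col) / len(kept) for col in zip(*kept)]


# Toy stand-in for the model: prediction = duration ** eps, true eps = 0.3, g = 2.
durations = [1, 2, 4, 8, 16, 32]
data = [2 * d ** 0.3 for d in durations]
seed = seed_search(lambda p: [d ** p[0] for d in durations],
                   lambda rng: [rng.uniform(0, 1)], data)
print(seed)  # [seed for eps, seed for g]
```

The seeds would then initialize a simplex search. In Python, the Nelder-Mead method of `scipy.optimize.minimize` plays a role comparable to fminsearch, although that substitution is a suggestion and not part of the analysis described here.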
Linear model for fMRI
The linear model does not require a search or seeds. Instead, we fit the 100 bootstrapped data sets per ROI by linear regression, giving us 100 estimates of the gain factor, g, per ROI.
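For the linear model, the per-bootstrap fit and its summary can be sketched in a few lines. The regression without an intercept, the reading of the 50% confidence interval as the 25th to 75th percentile of the bootstrap estimates, and the toy numbers are all assumptions.

```python
import statistics


def fit_gain(summed_stimulus, betas):
    """Least-squares gain g for betas ~ g * summed_stimulus (no intercept)."""
    num = sum(s * b for s, b in zip(summed_stimulus, betas))
    den = sum(s * s for s in summed_stimulus)
    return num / den


def median_and_50ci(estimates):
    """Median with the 25th and 75th percentiles as a 50% interval."""
    q25, q50, q75 = statistics.quantiles(estimates, n=4, method="inclusive")
    return q50, (q25, q75)


# Hypothetical summed stimulus durations (in frames) and three toy bootstraps.
durations = [0, 1, 2, 4, 8]
bootstraps = [[0.0, 0.9, 2.1, 4.0, 8.1],
              [0.0, 1.1, 1.9, 4.2, 7.8],
              [0.0, 1.0, 2.0, 3.9, 8.0]]
gains = [fit_gain(durations, b) for b in bootstraps]
print(median_and_50ci(gains))
```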
dCTS model for ECoG
We again used a two-stage approach to fitting the dCTS model, first to obtain seeds and then to estimate parameters. For each ROI, we averaged the broadband envelope across electrodes and trials, yielding one time course per ROI. We then generated 1000 model predictions by randomly selecting each parameter: τ1 from [0.01, 1], τ2 from [0.01, 1], n from [0.5, 5], and σ from [0.01, 0.5]. Using linear regression on the ECoG data, we derived the gain factor, g, and the variance explained for each of the 1000 predicted time series. For each ROI, the sets of parameters that generated reasonably accurate model predictions ( > 80% variance explained) were averaged and served as the seed for the search fit.
For the search fit, we did 100 bootstraps per ROI over the electrodes in that ROI. For each of the 100 bootstrapped time courses per ROI, we used fminsearch to find the parameters that minimized the squared error between the predicted and observed time series. In addition to the four parameters above, we included a nuisance shift parameter, which delays the onset of the response. In principle, this delay is important, since the signal from the stimulus reaches cortex with a delay that varies across visual field maps and could be as high as 50-150 ms. However, the impulse response function includes a slow ramp, and the broadband envelope extraction contains a small amount of blur. Hence in practice, the shifts were quite small (< 10 ms), and not informative about the latency of the neuronal response.
Model accuracy
fMRI experiment
For the fMRI experiment, we compared model accuracy of the CTS and the linear model. Because the models have different numbers of free parameters, it is important to obtain an unbiased estimate of model accuracy, which we did by leave-one-out cross-validation. For each ROI, and for each of the 100 bootstrapped sets of β-weights, we fit 13 linear models and 13 CTS models by leaving out each of the 13 temporal stimuli. For each bootstrap, we thus obtained 13 left-out predictions, which were compared to the 13 β-weights by coefficient of determination, R²:
This yielded 100 R² values per ROI, and we summarized model accuracy as the median and 50% confidence interval derived from these values.
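The leave-one-out procedure for one bootstrap can be sketched generically. The `fit` and `predict` callables are placeholders for whichever model is being evaluated; the one-parameter gain model and the toy data used in the example are illustrative assumptions.

```python
def leave_one_out(x, y, fit, predict):
    """Predict each data point from a model fit to all the other points."""
    held_out = []
    for i in range(len(y)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        params = fit(train_x, train_y)
        held_out.append(predict(params, x[i]))
    return held_out


def fit_gain(x, y):
    """Least-squares gain for y ~ g * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Toy example with 4 conditions (the experiment has 13).
x = [1.0, 2.0, 4.0, 8.0]
y = [1.1, 1.9, 4.2, 7.8]
print(leave_one_out(x, y, fit_gain, lambda g, xi: g * xi))
```

Each prediction comes from parameters that never saw the corresponding data point, so a model with more free parameters gains no automatic advantage.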
For the dCTS model fit to the ECoG data, there was only one temporal condition, and no model comparison, so we did not cross-validate the model fits. Instead, we summarized model accuracy as the variance explained, r², the square of the Pearson correlation coefficient r.
Note that the coefficient of determination, R², is bounded by (−∞, 1], as the residuals between model and data can be larger than the data. In contrast, r² is bounded by [0, 1].
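The difference between the two measures can be seen in a small example. The sketch assumes the common definition of R² with a mean-subtracted denominator; the exact normalization of the denominator is an assumption, and the function names are illustrative.

```python
def coefficient_of_determination(pred, data):
    """R2 = 1 - SS_residual / SS_total (mean-subtracted total; an assumption)."""
    mean = sum(data) / len(data)
    ss_res = sum((d - p) ** 2 for p, d in zip(pred, data))
    ss_tot = sum((d - mean) ** 2 for d in data)
    return 1 - ss_res / ss_tot


def pearson_r_squared(pred, data):
    """Square of the Pearson correlation between prediction and data."""
    mp = sum(pred) / len(pred)
    md = sum(data) / len(data)
    cov = sum((p - mp) * (d - md) for p, d in zip(pred, data))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_d = sum((d - md) ** 2 for d in data)
    return cov ** 2 / (var_p * var_d)


# A prediction with the right shape but the wrong scale.
data = [1.0, 2.0, 3.0]
pred = [10.0, 20.0, 30.0]
print(coefficient_of_determination(pred, data), pearson_r_squared(pred, data))
```

Here r² is 1 because the prediction is perfectly correlated with the data, whereas R² is strongly negative because the residuals are much larger than the variation in the data. R² therefore penalizes errors in scale and offset, which r² ignores.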
Public Data Sets and Software Code
To ensure that our computational methods are reproducible, all data and all software will be made publicly available via an Open Science Framework site, https://osf.io/v843t/. The software repository will include scripts of the form trf_MakeFigure2 to reproduce Figure 2, etc., as in prior publications8.
Acknowledgements
We thank Dora Hermes for helpful discussion and for helping us analyze ECoG data from prior work. We also thank Josef Parvizi and the Stanford Human Intracranial Cognitive Electrophysiology Program for helping us with ECoG data acquisition for a prior paper, which was re-analyzed for this paper. We thank David Heeger, Brian Wandell, and Mike Landy for comments on an earlier draft of this manuscript. We also thank Bosco Tjan, David Heeger, XJ Wang, Denis Pelli, Rachel Denison, and Geoff Aguirre for helpful discussions and feedback as we developed our models and analyses. The research was supported by NIH grants R00-EY022116 and R01-MH111417 (J.W.).
Footnotes
1 Because the stimulus events are short (≤800 ms), and the hemodynamic response function (hRF) is low-pass (on the order of seconds), the convolution of the stimulus time course with a neural impulse response function, followed by the convolution of this output with an hRF, is approximately the same as summing the stimulus time course (to create a scaled impulse), followed by convolution of the impulse with the hRF.