The response time paradox in functional magnetic resonance imaging analyses

The functional MRI (fMRI) signal is a proxy for an unobservable neuronal signal, and differences in fMRI signals on cognitive tasks are generally interpreted as reflecting differences in the intensity of local neuronal activity. However, changes in either intensity or duration of neuronal activity can yield identical differences in fMRI signals. When conditions differ in response times (RTs), it is thus impossible to determine whether condition differences in fMRI signals are due to differences in the intensity of neuronal activity or to potentially spurious differences in the duration of neuronal activity. The most common fMRI analysis approach ignores RTs, making it difficult to interpret condition differences that could be driven by RTs and/or intensity. Because differences in response time are one of the most important signals of interest for cognitive psychology, nearly every task of interest for fMRI exhibits RT differences across conditions of interest. This results in a paradox, wherein the signal of interest for the psychologist is a potential confound for the fMRI researcher. We review this longstanding problem, and demonstrate that the failure to address RTs in the fMRI time series model can also lead to spurious correlations at the group level related to RTs or other variables of interest, potentially impacting the interpretation of brain-behavior correlations. We propose a simple approach that remedies this problem by including RT in the fMRI time series model. This model separates condition differences from RT differences, retaining power for detection of unconfounded condition differences while also allowing the identification of RT-related activation. We conclude by highlighting the need for further theoretical development regarding the interpretation of fMRI signals and their relationship to response times.

trials in the dMFC, supporting the idea that the commonly found effect was driven by RTs. The next step to fully understand this RT-driven effect is the difficult step of determining whether this is a time on task effect or an actual difference in the amplitude of neuronal signaling that correlates with RT and whether these differences align with the underlying theory of conflict in the Stroop task. This was the focus of follow-up work by Yeung et al. (2011) who proposed that the RT-based effects were a result of differential engagement of specific cognitive processes and not simply due to time

The statistical models used for fMRI data generally involve the convolution of a vector representing trial or stimulus onsets with a canonical hemodynamic response function (as shown in Figure 1) to create regressors for use in linear model activation in regions where neural activity does not scale with RTs but rather the true duration is an unknown constant across all trials. Second, it does not allow a separate identification of condition differences versus RT effects; instead, it only performs well if the restricted interaction model is correct.

To address these issues, we created a generalized model of RT that can identify RT effects separately from the task effect (corrected for RT); this is shown as "Constant Duration, RT" in Figure 2 and hereafter as ConstDurRT. This model includes a boxcar function with constant duration for each of the task conditions, along with a single regressor that models the parametric modulation of the response in relation to RT for each trial. Because all RTs are modeled within a single regressor, any differences in RT between conditions will be removed by this regressor, leaving the condition difference effects to be interpreted as unconfounded estimates of activation in relation to the experimental manipulation.
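
The ConstDurRT design described above can be sketched in Python as follows. This is a minimal illustration, not the authors' shared code: the double-gamma HRF parameters, the function names (`canonical_hrf`, `convolved_regressor`, `const_dur_rt_design`), the 1 s boxcar duration, and the choice to encode RT as the height of a constant-duration boxcar are all assumptions made for the example. RT is entered in raw seconds, without mean centering.

```python
import math
import numpy as np

def canonical_hrf(dt, length=32.0):
    """Double-gamma HRF sampled every dt seconds (illustrative parameters)."""
    t = np.arange(0.0, length, dt)
    peak = t**5 * np.exp(-t) / math.gamma(6)           # positive lobe
    undershoot = t**15 * np.exp(-t) / math.gamma(16)   # late undershoot
    return peak - undershoot / 6.0

def convolved_regressor(onsets, durations, heights, n_scans, tr, dt=0.01):
    """Sum of boxcars (onset, duration, height) convolved with the HRF,
    then sampled once per TR."""
    n_fine = int(round(n_scans * tr / dt))
    t = np.arange(n_fine) * dt
    neural = np.zeros(n_fine)
    for onset, dur, height in zip(onsets, durations, heights):
        neural[(t >= onset) & (t < onset + dur)] += height
    bold = np.convolve(neural, canonical_hrf(dt))[:n_fine] * dt
    step = int(round(tr / dt))
    return bold[::step][:n_scans]

def const_dur_rt_design(onsets, conditions, rts, n_scans, tr, duration=1.0):
    """One constant-duration column per condition plus one RT column."""
    onsets, conditions, rts = map(np.asarray, (onsets, conditions, rts))
    columns = []
    for cond in np.unique(conditions):
        sel = conditions == cond
        n_sel = int(sel.sum())
        columns.append(convolved_regressor(
            onsets[sel], np.full(n_sel, duration), np.ones(n_sel), n_scans, tr))
    # Single RT regressor over all trials: constant duration, height = raw RT.
    columns.append(convolved_regressor(
        onsets, np.full(len(onsets), duration), rts, n_scans, tr))
    return np.column_stack(columns)

# Hypothetical four-trial example (onsets and RTs in seconds).
X = const_dur_rt_design(
    onsets=[2.0, 10.0, 18.0, 26.0],
    conditions=["congruent", "incongruent", "congruent", "incongruent"],
    rts=[0.5, 0.9, 0.6, 1.0],
    n_scans=50, tr=1.0)
```

The columns of `X` are ordered as the sorted condition labels followed by the RT regressor. A contrast between the two condition columns then compares conditions with the shared RT column in the model.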

This model can be extended to a full interaction model if an interaction is suspected. This extension is described further in the Discussion section ("Condition by RT interaction models"), along with concerns about using an interaction model to study condition differences when no significant interaction is present. We focus on the simpler non-interaction scenario in the simulation studies to illustrate the results of interest; these results are nontrivial to extend to the interaction model setting.

Notably we have not mean centered RT or subtracted any value from RT on each trial. also be analyzed, although that is not the focus of the simulation analyses and will be the focus of the real data analysis.

Further details regarding the modeling approach can be found in Methods; code and data for all analyses are shared

Error rates and power

We first assessed the false positive rate for each of the models on each of the simulated data sets for the condition comparison contrast (Figures 3, S1). In all cases the ConstDurRT model appropriately controlled Type I error, but error rates were inflated when model assumptions regarding the relationship between RT and neural activity were violated by the data. Specifically, ConstDurNoRT had highly inflated error rates when activation did scale with RT, and RTDur had inflated error rates when the signal did not scale with RT. Thus, the most commonly used model for task fMRI analysis, ConstDurNoRT, suffers from substantial inflation of false positives in the face of RT differences between conditions, because it inaccurately attributes the confounding RT signal to differences in the intensity of the underlying neuronal signal. The larger Type I error rates observed with the Stroop-based RT reflect that the standard deviation of the RT, relative to the mean, was lower in this setting and so RT-based differences are easier to detect. The error rates for blocked designs (solid lines) are slightly higher, likely due to the fact that blocked designs have a higher signal to noise ratio, making it easier to detect RT differences in the data. Results with a longer ISI, between 3-6s, are similar (Results

The foregoing analyses, along with the previous work by Grinband, focused on confounding of RT between-trials, which impacts average condition effects. Here we introduce a new problem of a between-subject RT confound. The within-volving group comparisons or associations. For example, the incongruent versus congruent BOLD contrast estimate may correlate with the differences in average RTs for incongruent and congruent conditions. This is of particular interest given the increasing focus on analyses of brain-behavior correlations in fMRI literature (e.g., Dubois and Adolphs (2016)).

Correlations between condition differences in brain activation and the corresponding differences in RTs are driven simply by a linear relationship between the activation estimate and RT when the data and model assumptions are in conflict. In the case where the signal scales with RT and the ConstDurNoRT model is used (duration = 1s), the relationship between the estimated activation, $\hat{\beta}$, and the true activation, $\beta$, is approximately $\hat{\beta} = \beta \times RT$ for a single trial. In this illustration, the true activation, $\beta$, is assumed to be the same for both conditions and does not vary over subjects. Figure 5 shows this linear relationship holds within the range of RTs one would expect to observe in most data sets (i.e., < 2s). Moving from the activation estimate of a single trial to the BOLD activation across multiple trials, the relationship becomes $\hat{\beta} = \beta \times \overline{RT}$, where $\overline{RT}$ is the average RT across trials. Last, for two conditions, 1 and 2, the relationship for the contrast of conditions is

$$\hat{\beta}_1 - \hat{\beta}_2 = \beta \times (\overline{RT}_1 - \overline{RT}_2). \qquad (1)$$

From this it directly follows that in a group level analysis there is an expected linear relationship between the estimated condition difference and the difference in RTs; specifically, the between-subject slope would be $\beta$, the true, common activation of the two conditions that does not vary across subjects. As is the case with all linear trends (equivalently, correlations) this relationship does not require a non-zero RT difference on average, but is driven by between-subject RT variability. Therefore the RT difference is an important confound regardless of whether it is significantly different from 0 on average.
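
The approximate single-trial relationship can be checked numerically. The sketch below is an illustration under assumed settings (a double-gamma HRF with hypothetical parameters, a noise-free signal, a single trial, and a true activation of 1): it generates a response whose neuronal duration equals the RT and fits a 1 s constant-duration regressor by least squares. The function names are introduced here for the example only.

```python
import math
import numpy as np

DT = 0.01  # seconds per sample on the fine time grid

def canonical_hrf(length=32.0):
    """Double-gamma HRF (illustrative parameters)."""
    t = np.arange(0.0, length, DT)
    return (t**5 * np.exp(-t) / math.gamma(6)
            - t**15 * np.exp(-t) / (6.0 * math.gamma(16)))

def bold_for_duration(duration, total=40.0):
    """Noise-free response to one unit-height boxcar starting at t = 0."""
    n = int(round(total / DT))
    neural = np.zeros(n)
    neural[: int(round(duration / DT))] = 1.0
    return np.convolve(neural, canonical_hrf())[:n] * DT

def const_dur_estimate(rt, beta=1.0):
    """Least-squares fit of a 1 s regressor to data whose true duration is rt."""
    y = beta * bold_for_duration(rt)   # signal scales with RT
    x = bold_for_duration(1.0)         # ConstDurNoRT-style regressor
    return float(x @ y / (x @ x))

rts = np.array([0.25, 0.5, 1.0, 1.5, 2.0])
estimates = np.array([const_dur_estimate(rt) for rt in rts])
for rt, est in zip(rts, estimates):
    print(f"RT = {rt:.2f} s, estimated activation = {est:.3f}")
```

Under these assumptions the estimate rises with RT and stays close to `beta * rt` over this range, which is the pattern the text attributes to Figure 5.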

The simulation results in Figure 6 show the relationship across all models and data types considered in this work. The ConstDurNoRT model produces correlations between contrast differences and RT differences when the signal scales with RT, as does the RTDur model when the signal does not scale with RT, while the ConstDurRT model does not induce correlations for either signal type. In other words, although there is no true relationship between average subject RT difference and the fMRI contrast estimate, the ways in which these models cannot capture RT (ConstDurNoRT) or introduce RT information (RTDur) cause an RT effect at the group level, which may interfere with the correct interpretation of group level results (e.g., if a group level variable of interest is related to RT). Notably, the data were simulated such that the variance in RT did not change with RT, whereas in real data the variance of RT often increases with its mean.
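
The group-level consequence of a linear dependence of the contrast estimate on the RT difference can be illustrated with a toy simulation. This sketch is not the simulation used for Figure 6: it assumes that each subject's contrast estimate equals a common activation times that subject's RT difference plus Gaussian estimation noise, and the sample size, standard deviations, and function name are hypothetical values chosen for the example.

```python
import numpy as np

def between_subject_fit(mean_rt_diff, n_subjects=100, beta=1.0,
                        rt_diff_sd=0.1, noise_sd=0.05, seed=0):
    """Slope and correlation of contrast estimates against RT differences."""
    rng = np.random.default_rng(seed)
    # Per-subject difference in average RT between the two conditions (s).
    rt_diff = rng.normal(mean_rt_diff, rt_diff_sd, size=n_subjects)
    # Contrast estimate: common activation times RT difference, plus noise.
    contrast = beta * rt_diff + rng.normal(0.0, noise_sd, size=n_subjects)
    slope, _intercept = np.polyfit(rt_diff, contrast, deg=1)
    r = np.corrcoef(rt_diff, contrast)[0, 1]
    return slope, r

for mean_diff in (0.0, 0.2):
    slope, r = between_subject_fit(mean_diff)
    print(f"mean RT difference = {mean_diff:.1f} s: slope = {slope:.2f}, r = {r:.2f}")
```

Because the correlation depends on the between-subject spread of the RT difference rather than its mean, the two settings give similar slopes and correlations in this toy setup.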

The implication is the correlation estimated by our simulations is conservative. Even so, it is within the ballpark of the expected true correlations between brain and behavior measures (Marek et al., 2022).

Importantly in this scenario, the linear relationship at the group level is simply $\beta$, the common activation for both conditions and all subjects. If the activation differs between conditions or across subjects the confound will potentially be more complex in the between-subject analysis and can even introduce new artifactual associations into a group analysis. A simple extension to illustrate how false associations can be introduced is to only relax the assumption that $\beta$ is the same across subjects, but preserve the assumption that $\beta$ is the same for both conditions. For example, assume age is equally related to both conditions through the relationship $\beta_i = \gamma_0 + \gamma_1 \mathrm{age}_i + \epsilon_i$ and note that this does not introduce an association of age with the true activation difference. If the signal scales with RTs and the ConstDurNoRT model is used to estimate the condition difference, not only is an RT effect introduced to the group level analysis, but an artifactual age effect is also introduced. This can be seen in the following derivation that extends the earlier defined

but the model estimates the BOLD activation to be $1 \times RT$ for RTs < 2s.
This result is concerning because there is not a true relationship of the condition difference with either the RT difference or age, but the use of ConstDurNoRT when the data scale with RT introduces an age association to the group level analysis through an interaction with the RT difference. The implication is that if the RT difference is nonzero, on average, an age association will be present in the group level when there is not a relationship between age and the true activation difference. Adding RT difference as a confound regressor to the group model will not remedy these issues and should be avoided as it may inflate the significance of the false association with age. This calls into question the interpretation of between-subject analyses where the signal may scale with RT and ConstDurNoRT is used.
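
One way to see how the age association arises is to combine the contrast relationship for ConstDurNoRT with a subject-specific activation that depends on age. The following is a sketch under stated assumptions, not a reproduction of the authors' derivation: the symbols $\gamma_0$, $\gamma_1$, $\epsilon_i$, $\Delta\overline{RT}_i$, and $\delta$ are notation introduced here, and $\Delta\overline{RT}_i$ and $\epsilon_i$ are assumed to be independent of age, with $E[\epsilon_i] = 0$.

```latex
\begin{align*}
\beta_i &= \gamma_0 + \gamma_1\,\mathrm{age}_i + \epsilon_i, \\
\hat{\beta}_{i,1} - \hat{\beta}_{i,2}
  &\approx \beta_i \left(\overline{RT}_{i,1} - \overline{RT}_{i,2}\right)
   = \beta_i\,\Delta\overline{RT}_i \\
  &= \gamma_0\,\Delta\overline{RT}_i
   + \gamma_1\,\mathrm{age}_i\,\Delta\overline{RT}_i
   + \epsilon_i\,\Delta\overline{RT}_i, \\
E\!\left[\hat{\beta}_{i,1} - \hat{\beta}_{i,2} \,\middle|\, \mathrm{age}_i\right]
  &\approx \left(\gamma_0 + \gamma_1\,\mathrm{age}_i\right)\delta,
  \qquad \delta = E\!\left[\Delta\overline{RT}_i\right].
\end{align*}
```

Under these assumptions the expected slope of the estimated contrast on age is $\gamma_1 \delta$, which is nonzero whenever the average RT difference $\delta$ is nonzero, even though the true activation difference between conditions is zero for every subject.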

The relationship between RT and potential variables of interest will be different and more complex if, say, the RT

Figure 6. Correlation between the contrast difference and difference in average condition RT across subjects as a function of the average difference in RT between conditions. Since the correlation is driven by between-subject variability in the difference in RT, there is no requirement that RTs differ between tasks and the correlation is constant regardless of the RT difference.
Widespread RT activation is not specific to task, revisited

The models for our real data analyses included separate regressors for each condition as well as a single RT regressor to control for RT effects in our contrasts between conditions, similar to the ConstDurRT model. A total of 7 tasks, with sample sizes ranging from 86 to 94, were analyzed. The cognitive processes involved in these tasks include attention, temporal discounting, proactive control, reactive control, response inhibition, resisting distraction, and set shifting. Brief descriptions are given in Table 1

The problem of potential response time confounds for fMRI activation estimates has been discussed for more than a decade, but there has been little resulting change in how the community approaches the analysis and interpretation of fMRI contrasts when RT differences between task conditions are present. There are three takeaways from the present RTDur model assumes the signal must scale with RT, hence both models fail to control error rates when these model assumptions are violated (Figure 3).

We have also uncovered that subject-specific differences in average RT represent an important group-level con-

The paradox that RT is the effect of interest in fMRI studies

Our recommendation here is to focus separately on RT-based effects and condition differences, adjusted for RT. There can be resistance to this idea, since RT is the measure of interest in behavioral studies and removing RT effects from condition differences is argued to be "throwing the baby out with the bathwater". This argument is paradoxical since if RT effects are the effects of interest, then why not study the RT effects directly? Studying unadjusted condition differences does not directly reflect RT effects and may reflect differences that are completely unrelated to RTs. In fact the condition difference effect, when unadjusted for RTs, may be driven by any of the underlying models shown in

Overall, our recommendation is to focus on RT-based effects and condition differences adjusted for RTs, separating the two effects so they can be more accurately interpreted, ultimately improving our understanding of the brain.

presented in a row.

Simulated data that scaled with RT were created with the convolved RT duration regressors (RTDur) and data that did not scale with RT used the constant duration regressors (ConstDurNoRT). The BOLD activation sizes for the $i$th subject for each condition, $\beta_{i,1}$ and $\beta_{i,2}$, were sampled from a Gaussian distribution, $N(\beta_j, \sigma_b^2)$, where $\beta_j$ is the true activation magnitude for condition $j$ and $\sigma_b^2$ is the between-subject variance. The time series data for the $i$th subject, of length $T$, was created according to

$$Y_i = \beta_{i,1} X_1 + \beta_{i,2} X_2 + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma_w^2 I_T),$$

where $X_1$ and $X_2$ are either the Model 1 or 2 regressors ($T \times 1$) and $\sigma_w^2$ is the within-subject variance.
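
The data-generating step can be sketched as follows. This is a minimal illustration that assumes a linear model in which each subject's time series is the sum of two condition regressors, weighted by subject-specific activations drawn around condition-level means, plus independent Gaussian noise. The function name, the parameter values, and the placeholder regressors are hypothetical; in an actual simulation `x1` and `x2` would be the HRF-convolved regressors.

```python
import numpy as np

def simulate_subject(x1, x2, betas, sigma_b, sigma_w, rng):
    """Return one subject's simulated time series and sampled activations.

    betas   : (mean activation for condition 1, mean for condition 2)
    sigma_b : between-subject standard deviation
    sigma_w : within-subject (time series noise) standard deviation
    """
    b1 = rng.normal(betas[0], sigma_b)   # subject-specific activation, cond 1
    b2 = rng.normal(betas[1], sigma_b)   # subject-specific activation, cond 2
    noise = rng.normal(0.0, sigma_w, size=x1.shape[0])
    return b1 * x1 + b2 * x2 + noise, (b1, b2)

rng = np.random.default_rng(2023)
n_timepoints = 200
# Placeholder regressors standing in for HRF-convolved design columns.
x1 = rng.random(n_timepoints)
x2 = rng.random(n_timepoints)

y, (b1, b2) = simulate_subject(
    x1, x2, betas=(1.0, 1.0), sigma_b=0.5, sigma_w=1.0, rng=rng)

# First-level ordinary least squares estimates for this subject.
design = np.column_stack([x1, x2])
b_hat = np.linalg.lstsq(design, y, rcond=None)[0]
```

Repeating this for each subject and carrying `b_hat[0] - b_hat[1]` to a 1-sample t-test gives the kind of group-level inference described in the text.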

In an effort to choose realistic values for $\beta_1$, $\beta_2$, $\sigma_b^2$ and $\sigma_w^2$, we considered the first level effect size (converting the true $\beta$ to a correlation), second level effect size for a 1-sample t-test (Cohen's d) as well as the ratio of the total mixed effects variance to the within-subject variance. Following the definitions of parameters as given in the model above, the total mixed effects variance for a first level contrast of parameter estimates is

Our within-subject effect size for condition versus baseline was between 0.07-0.08 (correlation), ratio of total vari-

Figure S1. Type I error as RT difference between conditions increases. This illustrates that results are similar to when the ISI ranged between 2-4s (result in main manuscript). The Forced Choice Task RT distribution was used in the top panels, while the Stroop RT distribution was used in the bottom panels, both with an ISI between 3-6s. The inference of interest was the 1-sample t-test of the condition effect with 100 subjects. 2500 simulations were used to calculate the error rate.
Power differences when studying condition differences after testing for interaction

Here we study the power for a condition effect after testing for a potential condition by RT interaction. The interaction model contained two condition regressors and two RT regressors, split by condition. RTs were centered by the theoretical RT based on the distribution used to simulate RTs. Although the slopes of the interaction model were not found to significantly differ, their magnitudes will not be exactly equal and this introduces variance into the condition difference estimate from this model. This is reflected in the reduced power compared to the ConstDurRT model (Figure S2).

sented with a cue made up of dots. This cue can be a valid cue, referred to as A (e.g., ":"), or an invalid cue, referred to as B (e.g., ".."). Next a probe is presented, also made up of a simple dot formation. This probe can be valid (X) or invalid (Y). Participants are instructed to respond to valid probe and cue combinations (targets, i.e., AX combinations) with a key press (e.g., "x") and all others (non-targets) with a different key press (e.g., "m").

The Delay-Discounting Task (DDT) is a measure of temporal discounting, the tendency for people to prefer smaller, immediate monetary rewards over larger, delayed rewards. Participants complete a series of 27 questions that each require choosing between a smaller, immediate reward (e.g., $25 today) versus a larger, later reward (e.g., $35 in 25 days). The 27 items are divided into three groups according to the size of the larger amount (small, medium, or large). Modeling techniques are used to fit the function that relates time to discounting. The main dependent measure of interest is the steepness of the discounting curve such that a more steeply declining curve represents a tendency to devalue rewards as they become more temporally remote.

The cued task-switching task indexes the control processes involved in reconfiguring the cognitive system to support a new stimulus-response mapping. In this task, subjects are presented with a task cue followed by a colored number (between 1-4 or 6-9). The cue indicates whether to respond based on parity (odd/even), magnitude (greater/less than 5), or color (orange/blue). Trials can present the same cue and task, or can switch the cue or the task. Responses are slower and less accurate when the cue or task differs across trials (i.e., a switch) compared to when the current cue or task remains the same (i.e., a repeat).

The Stop-Signal Task is designed to measure motor response inhibition, one aspect of cognitive control. On each trial of this task participants are instructed to make a speeded response to an imperative "go" stimulus except on a subset of trials when an additional "stop signal" occurs, in which case participants are instructed that they should make no

The motor selective stop-signal task measures the ability to engage response inhibition selectively to specific responses. In this task, cues are presented to elicit motor responses (e.g., right hand responses, left hand responses). A stop-signal is presented on some trials, and subjects must stop if certain responses are required on that trial (e.g., right hand responses) but not others (e.g., left hand responses) if a signal occurs. In contrast to a simple stop-signal task in which all actions are stopped when a stop-signal is presented, this task aims to be more like stopping in "the real world" in that certain motor actions must be stopped (e.g., stop pressing the accelerator at a red light) but others should proceed (e.g., steering the car and/or conversing with a passenger). Commonly, stop-signal reaction time (SSRT), the main dependent measure for response inhibition in stopping tasks, is prolonged in the motor selective stopping task when compared to the more canonical simple stopping task. This prolongation of SSRT is taken as evidence of the cost of engaging inhibition that is selective to specific effectors or responses.

The Stroop task is a seminal measure of cognitive control. Successful performance of the task requires the ability to overcome automatic tendencies to respond in accordance with current goals. On each trial of the task, a color word (e.g., "red", "blue") is presented in one of multiple ink colors (e.g., blue, red). Participants are instructed to respond based upon the ink color of the word, not the identity of the word itself. When the color and the word are congruent (e.g., "red" in red ink), the natural tendency to read the word facilitates performance, resulting in fast and accurate responding. When the color and the word are incongruent (e.g., "red" in blue ink), the strong, natural tendency to read must be overcome to respond to the ink color. The main dependent measure in the Stroop task is the "Stroop Effect", which is the degree of slowing and the reduction in accuracy for incongruent relative to congruent trials.

Exclusion information by task for real data analysis

Table S1. Exclusion information for Attention Network task.