Abstract
Test-retest reliability indexes the repeatability, or consistency, of a measurement across time. High reliability is critical for any scientific study, and especially for the study of individual differences. Evidence of poor reliability of commonly used behavioral and functional neuroimaging tasks is mounting. These reports have called into question the adequacy of even the most common, well-characterized cognitive tasks with robust population-level effects for measuring individual differences. Here, we lay out a hierarchical framework that estimates reliability as a correlation divorced from trial-level variability, and we show how reliability tends to be underestimated under the conventional intraclass correlation framework. In addition, we examine how reliability estimation diverges between the two modeling frameworks and assess how different factors (e.g., trial and subject sample sizes, relative magnitude of cross-trial variability) affect reliability estimates across frameworks. This work highlights that a large number of trials (e.g., greater than 100) may be required to achieve reasonably precise reliability estimates. We reference the tools TRR and 3dLMEr for the community to apply trial-level models to behavioral and neuroimaging data.
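The attenuation the abstract describes can be illustrated with a minimal simulation (not from the paper; all parameter values here are hypothetical). Each subject has a perfectly stable trait, but each session's score is a mean over noisy trials, so a correlation of session means, as in a conventional ICC-style analysis, underestimates the true reliability of 1 unless the trial count is large relative to the cross-trial variability:

```python
import numpy as np

rng = np.random.default_rng(0)

n_subj = 200       # hypothetical subject count
sigma_subj = 1.0   # cross-subject SD of the true (stable) effect
sigma_trial = 3.0  # cross-trial SD, often much larger than sigma_subj

def observed_reliability(n_trials):
    """Correlate two sessions' trial-averaged scores for the same subjects.

    The underlying trait is identical across sessions (true reliability = 1),
    so any shortfall below 1 reflects attenuation by trial-level noise.
    """
    true = rng.normal(0.0, sigma_subj, n_subj)
    s1 = true + rng.normal(0.0, sigma_trial, (n_trials, n_subj)).mean(axis=0)
    s2 = true + rng.normal(0.0, sigma_trial, (n_trials, n_subj)).mean(axis=0)
    return np.corrcoef(s1, s2)[0, 1]

# Expected attenuation for T trials:
#   r ~= 1 / (1 + sigma_trial**2 / (T * sigma_subj**2))
# so T = 20 gives ~0.69 while T = 500 gives ~0.98 here.
for n_trials in (20, 100, 500):
    print(n_trials, round(observed_reliability(n_trials), 2))
```

A hierarchical (trial-level) model instead estimates the subject-level correlation directly, separating cross-trial variance from cross-subject variance rather than folding both into the session means.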
Competing Interest Statement
The authors have declared no competing interest.