Cross-validation failure: Small sample sizes lead to large error bars
Introduction
In the past 15 years, machine-learning methods have pushed forward many brain-imaging problems: decoding the neural support of cognition (Haynes and Rees, 2006), information mapping (Kriegeskorte et al., 2006), prediction of behavioral or clinical individual differences (Smith et al., 2015), rich encoding models (Nishimoto et al., 2011), principled reverse inferences (Poldrack et al., 2009), etc. Replacing in-sample statistical testing with prediction gives more power to fit rich models and complex data (Norman et al., 2006, Varoquaux and Thirion, 2014).
The validity of these models is established by their ability to generalize: to make accurate predictions about some properties of new data. They must therefore be tested on data independent from the data used to fit them. Technically, this test is done via cross-validation: the available data are split in two parts, a train set used to fit the model and a test set used to evaluate it (Pereira et al., 2009, Varoquaux et al., 2017).
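For concreteness, here is a minimal sketch of such a cross-validated evaluation with scikit-learn; the random data, the linear SVM, and the split sizes are placeholder assumptions for illustration, not the protocol of any particular study.

```python
# Minimal cross-validation sketch (placeholder data, not from the paper):
# each fold holds out part of the data as a test set and fits on the rest.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # e.g. 40 samples x 1000 voxels
y = np.repeat([0, 1], 20)         # two balanced classes

cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(LinearSVC(), X, y, cv=cv)   # accuracy on each test fold
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f} across folds)")
```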
Cross-validation is thus central to statistical control of the numerous neuroimaging techniques relying on machine learning: decoding, MVPA (multi-voxel pattern analysis), searchlight, computer-aided diagnosis, etc. Varoquaux et al. (2017) conducted a review of cross-validation techniques with an empirical study on neuroimaging data. These experiments revealed that cross-validation measures prediction accuracy with errors typically around ±10%. Such large error bars are worrying.
Here, I show with very simple analyses that the observed errors of cross-validation are inherent to small numbers of samples. I argue that they provide loopholes that are exploited in the neuroimaging literature, probably unwittingly. The problems are particularly severe for methods development and for inter-subject diagnostic studies. Conversely, cognitive-neuroscience studies are less affected, as they often reach larger sample sizes by using multiple trials per subject and multiple subjects. These issues could undermine the potential of machine-learning methods in neuroimaging and the credibility of related publications. I give recommendations on best practices and explore cost-effective avenues to ensure reliable cross-validation results in neuroimaging.
The effects that I describe are related to the “power failure” of Button et al. (2013): lack of statistical power. In the specific case of testing predictive models, the shortcomings of small samples are more stringent and inherent, as they are not offset by large effect sizes. My goal here is to raise awareness that studies based on predictive modeling require larger sample sizes than standard statistical approaches.
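To see why effect size cannot rescue small test sets, consider a back-of-the-envelope calculation (my illustration, not the paper's): the measured accuracy is a proportion of correct predictions over n test observations, so its sampling noise is binomial and shrinks only as 1/sqrt(n).

```python
# Binomial sampling noise on a measured accuracy. Illustrative assumption:
# the classifier's true accuracy is 75%. Half-width of the 95% interval of
# the measured accuracy, for several test-set sizes n.
from scipy import stats

true_accuracy = 0.75
for n in (30, 100, 300, 1000):
    lower, upper = stats.binom.interval(0.95, n, true_accuracy)
    half_width = (upper - lower) / (2 * n)
    print(f"n = {n:4d}: measured accuracy varies by about +/- {100 * half_width:.0f}%")
```

Even a strong effect (a true accuracy far from chance) does not make this interval much narrower; only more observations do.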
Section snippets
Distribution of errors in cross-validation
Cross-validation strives to measure the generalization power of a model: how well it will predict on new data. To simplify the discussion, I will focus on balanced classification, predicting two categories of samples; prediction accuracy can then be measured as a percentage, with chance at 50%. The cross-validation error is the discrepancy between the prediction accuracy measured by cross-validation and the expected accuracy on new data.
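The following simulation sketches the kind of experiment this section describes; the data-generating process and parameters are my own illustrative assumptions. It repeatedly draws a small two-class dataset, measures accuracy by cross-validation, and compares it with the accuracy of the same model on a large independent sample drawn from the same distribution.

```python
# Distribution of cross-validation errors on small samples (toy simulation,
# illustrative parameters).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, separation = 60, 50, 0.35

def draw(n):
    """Two balanced Gaussian classes whose means differ by `separation`."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features)) + separation * y[:, None]
    return X, y

errors = []
for _ in range(100):
    X, y = draw(n_samples)
    cv_accuracy = cross_val_score(LinearSVC(), X, y, cv=StratifiedKFold(5)).mean()
    # "True" accuracy: the model fit on all n_samples, tested on fresh data
    X_new, y_new = draw(10_000)
    true_accuracy = LinearSVC().fit(X, y).score(X_new, y_new)
    errors.append(cv_accuracy - true_accuracy)

errors = np.asarray(errors)
print(f"cross-validation error: mean {errors.mean():+.1%}, "
      f"5th-95th percentile {np.percentile(errors, 5):+.1%} to "
      f"{np.percentile(errors, 95):+.1%}")
```

On datasets of a few dozen samples, the spread of this error typically reaches several percentage points or more, which is the phenomenon the empirical study on brain images documents.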
Previous results: cross-validation on brain images. Varoquaux
An open door to overfit and confirmation bias
The large error bars are worrying, whether for the development of predictive-modeling methods or for their use to study the brain and the mind. Indeed, a large variance of results, combined with publication incentives, weakens scientific progress (Ioannidis, 2005).
With conventional statistical hypothesis testing, the danger of vibration effects is well recognized: arbitrary degrees of freedom in the analysis explore the variance of the results and, as a consequence, control on false positives is
Conclusion: improving predictive neuroimaging
With predictive models, even more than with standard statistics, small sample sizes undermine accurate tests. The problem is inherent to the discriminant nature of the test, which measures only a success or failure per observation. Estimates of variance across cross-validation folds give a false sense of security, as they strongly underestimate the error on the prediction accuracy: folds are far from independent. Rather, to avoid the illusion of biomarkers that do not generalize or overly-optimistic
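As a companion to the point about fold-to-fold variance, here is a sketch under the same toy assumptions as above (my illustration, not the paper's code): it compares the naive error bar computed across folds, which treats folds as independent, with the actual error of the cross-validated accuracy relative to performance on fresh data. Because the folds share training data, the naive bar tends to come out smaller than the actual error.

```python
# Fold-based error bars vs. actual error of the cross-validated accuracy
# (toy simulation, illustrative parameters).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, separation, n_folds = 60, 50, 0.35, 5

def draw(n):
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features)) + separation * y[:, None]
    return X, y

naive_bars, actual_errors = [], []
for _ in range(100):
    X, y = draw(n_samples)
    scores = cross_val_score(LinearSVC(), X, y, cv=StratifiedKFold(n_folds))
    naive_bars.append(scores.std() / np.sqrt(n_folds))   # assumes independent folds
    X_new, y_new = draw(10_000)
    true_accuracy = LinearSVC().fit(X, y).score(X_new, y_new)
    actual_errors.append(scores.mean() - true_accuracy)

print(f"naive fold-based standard error:   {np.mean(naive_bars):.1%}")
print(f"actual RMS error of the CV result: "
      f"{np.sqrt(np.mean(np.square(actual_errors))):.1%}")
```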
Acknowledgments
Computing resources were provided by the NiConnect project (ANR-11-BINF-0004_NiConnect). I am grateful to Aaron Schurger, Steve Smith, and Russell Poldrack for feedback on the manuscript. I would also like to thank Alexandra Elbakyan for help with the literature review, as well as Colin Brown and Choong-Wan Woo for sharing data of their review papers.
References (59)
- Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage (2017).
- Single subject prediction of brain disorders in neuroimaging: promises and pitfalls. NeuroImage (2017).
- The secret lives of experiments: methods reporting in the fMRI literature. NeuroImage (2012).
- Statistical power and prediction accuracy in multisite resting-state fMRI connectivity. NeuroImage (2017).
- A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron (2011).
- A supervised clustering approach for fMRI-based inference of brain states. Pattern Recognit. (2012).
- Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. (2011).
- Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci. (2006).
- Information mapping with pattern classifiers: a comparative study. NeuroImage (2011).
- Machine learning classifiers and fMRI: a tutorial overview. NeuroImage (2009).
- The file drawer problem and tolerance for null results. Psychol. Bull.
- Divide and conquer: a defense of functional localizers. NeuroImage
- Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): random permutations and cluster size control. NeuroImage
- The WU-Minn Human Connectome Project: an overview. NeuroImage
- Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage
- From estimating activation locality to predicting disorder: a review of pattern recognition for neuroimaging-based psychiatric diagnostics. Neurosci. Biobehav. Rev.
- Individualized Gaussian process-based prediction and detection of local and global gray matter abnormalities in elderly subjects. NeuroImage
- A survey of cross-validation procedures for model selection. Stat. Surv.
- No unbiased estimator of the variance of k-fold cross-validation. J. Mach. Learn. Res.
- Toward discovery science of human brain function. Proc. Natl. Acad. Sci.
- Is cross-validation valid for small-sample microarray classification? Bioinformatics
- Machine Learning on Human Connectome Data from MRI
- Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci.
- Pooling fMRI data: meta-analysis, mega-analysis and multi-center studies. Front. Neuroinformatics
- Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
- The Autism Brain Imaging Data Exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psychiatry
- The reusable holdout: preserving validity in adaptive data analysis. Science
- NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Front. Neuroinform.
- The Elements of Statistical Learning