Cross-validation failure: Small sample sizes lead to large error bars
Introduction
In the past 15 years, machine-learning methods have pushed forward many brain-imaging problems: decoding the neural support of cognition (Haynes and Rees, 2006), information mapping (Kriegeskorte et al., 2006), prediction of behavioral or clinical individual differences (Smith et al., 2015), rich encoding models (Nishimoto et al., 2011), principled reverse inferences (Poldrack et al., 2009), etc. Replacing in-sample statistical testing with prediction gives more power to fit rich models and complex data (Norman et al., 2006, Varoquaux and Thirion, 2014).
The validity of these models is established by their ability to generalize: to make accurate predictions about some properties of new data. They must therefore be tested on data independent from the data used to fit them. Technically, this test is done via cross-validation: the available data are split in two parts, a train set used to fit the model and a test set used to evaluate it (Pereira et al., 2009, Varoquaux et al., 2017).
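For concreteness, here is a minimal sketch of such a cross-validated evaluation with scikit-learn; the random data, the linear SVM, and the split sizes are placeholder assumptions for illustration, not the protocol of any particular study.

```python
# Minimal cross-validation sketch (placeholder data, not from the paper):
# each fold holds out part of the data as a test set and fits on the rest.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1000))   # e.g. 40 samples x 1000 voxels
y = np.repeat([0, 1], 20)         # two balanced classes

cv = StratifiedKFold(n_splits=10)
scores = cross_val_score(LinearSVC(), X, y, cv=cv)   # accuracy on each test fold
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f} across folds)")
```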
Cross-validation is thus central to statistical control of the numerous neuroimaging techniques relying on machine learning: decoding, MVPA (multi-voxel pattern analysis), searchlight, computer-aided diagnosis, etc. Varoquaux et al. (2017) conducted a review of cross-validation techniques with an empirical study on neuroimaging data. These experiments revealed that cross-validation measures prediction accuracy with errors typically around ±10%. Such large error bars are worrying.
Here, I show with very simple analyses that the observed errors of cross-validation are inherent to small numbers of samples. I argue that they provide loopholes that are exploited in the neuroimaging literature, probably unwittingly. The problems are particularly severe for methods development and for inter-subject diagnostic studies. Conversely, cognitive-neuroscience studies are less affected, as they often reach larger sample sizes by using multiple trials per subject and multiple subjects. These issues could undermine the potential of machine-learning methods in neuroimaging and the credibility of related publications. I give recommendations on best practices and explore cost-effective avenues to ensure reliable cross-validation results in neuroimaging.
The effects that I describe are related to the “power failure” of Button et al. (2013): lack of statistical power. In the specific case of testing predictive models, the shortcomings of small samples are more stringent and inherent, as they are not offset by large effect sizes. My goal here is to raise awareness that studies based on predictive modeling require larger sample sizes than standard statistical approaches.
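To see why effect size cannot rescue small test sets, consider a back-of-the-envelope calculation (my illustration, not the paper's): the measured accuracy is a proportion of correct predictions over n test observations, so its sampling noise is binomial and shrinks only as 1/sqrt(n).

```python
# Binomial sampling noise on a measured accuracy. Illustrative assumption:
# the classifier's true accuracy is 75%. Half-width of the 95% interval of
# the measured accuracy, for several test-set sizes n.
from scipy import stats

true_accuracy = 0.75
for n in (30, 100, 300, 1000):
    lower, upper = stats.binom.interval(0.95, n, true_accuracy)
    half_width = (upper - lower) / (2 * n)
    print(f"n = {n:4d}: measured accuracy varies by about +/- {100 * half_width:.0f}%")
```

Even a strong effect (a true accuracy far from chance) does not make this interval much narrower; only more observations do.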
Section snippets
Distribution of errors in cross-validation
Cross-validation strives to measure the generalization power of a model: how well it will predict on new data. To simplify the discussion, I will focus on balanced classification, predicting two categories of samples; prediction accuracy can then be measured as a percentage, with chance at 50%. The cross-validation error is the discrepancy between the prediction accuracy measured by cross-validation and the expected accuracy on new data.
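The following simulation sketches the kind of experiment this section describes; the data-generating process and parameters are my own illustrative assumptions. It repeatedly draws a small two-class dataset, measures accuracy by cross-validation, and compares it with the accuracy of the same model on a large independent sample drawn from the same distribution.

```python
# Distribution of cross-validation errors on small samples (toy simulation,
# illustrative parameters).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, separation = 60, 50, 0.35

def draw(n):
    """Two balanced Gaussian classes whose means differ by `separation`."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features)) + separation * y[:, None]
    return X, y

errors = []
for _ in range(100):
    X, y = draw(n_samples)
    cv_accuracy = cross_val_score(LinearSVC(), X, y, cv=StratifiedKFold(5)).mean()
    # "True" accuracy: the model fit on all n_samples, tested on fresh data
    X_new, y_new = draw(10_000)
    true_accuracy = LinearSVC().fit(X, y).score(X_new, y_new)
    errors.append(cv_accuracy - true_accuracy)

errors = np.asarray(errors)
print(f"cross-validation error: mean {errors.mean():+.1%}, "
      f"5th-95th percentile {np.percentile(errors, 5):+.1%} to "
      f"{np.percentile(errors, 95):+.1%}")
```

On datasets of a few dozen samples, the spread of this error typically reaches several percentage points or more, which is the phenomenon the empirical study on brain images documents.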
Previous results: cross-validation on brain images. Varoquaux
An open door to overfit and confirmation bias
The large error bars are worrying, whether for the development of predictive-modeling methods or for their use to study the brain and the mind. Indeed, a large variance of results, combined with publication incentives, weakens scientific progress (Ioannidis, 2005).
With conventional statistical hypothesis testing, the danger of vibration effects is well recognized: arbitrary degrees of freedom in the analysis explore the variance of the results and, as a consequence, control on false positives is
Conclusion: improving predictive neuroimaging
With predictive models, even more than with standard statistics, small sample sizes undermine accurate tests. The problem is inherent to the discriminant nature of the test, which measures only a success or failure per observation. Estimates of variance across cross-validation folds give a false sense of security, as they strongly underestimate the error on the prediction accuracy: folds are far from independent. Rather, to avoid the illusion of biomarkers that do not generalize or overly-optimistic
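As a companion to the point about fold-to-fold variance, here is a sketch under the same toy assumptions as above (my illustration, not the paper's code): it compares the naive error bar computed across folds, which treats folds as independent, with the actual error of the cross-validated accuracy relative to performance on fresh data. Because the folds share training data, the naive bar tends to come out smaller than the actual error.

```python
# Fold-based error bars vs. actual error of the cross-validated accuracy
# (toy simulation, illustrative parameters).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, separation, n_folds = 60, 50, 0.35, 5

def draw(n):
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features)) + separation * y[:, None]
    return X, y

naive_bars, actual_errors = [], []
for _ in range(100):
    X, y = draw(n_samples)
    scores = cross_val_score(LinearSVC(), X, y, cv=StratifiedKFold(n_folds))
    naive_bars.append(scores.std() / np.sqrt(n_folds))   # assumes independent folds
    X_new, y_new = draw(10_000)
    true_accuracy = LinearSVC().fit(X, y).score(X_new, y_new)
    actual_errors.append(scores.mean() - true_accuracy)

print(f"naive fold-based standard error:   {np.mean(naive_bars):.1%}")
print(f"actual RMS error of the CV result: "
      f"{np.sqrt(np.mean(np.square(actual_errors))):.1%}")
```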
Acknowledgments
Computing resources were provided by the NiConnect project (ANR-11-BINF-0004_NiConnect). I am grateful to Aaron Schurger, Steve Smith, and Russell Poldrack for feedback on the manuscript. I would also like to thank Alexandra Elbakyan for help with the literature review, as well as Colin Brown and Choong-Wan Woo for sharing data of their review papers.
References (59)
- Deriving reproducible biomarkers from multi-site resting-state data: an autism-based example. NeuroImage (2017).
- Single subject prediction of brain disorders in neuroimaging: promises and pitfalls. NeuroImage (2017).
- The secret lives of experiments: methods reporting in the fMRI literature. NeuroImage (2012).
- Statistical power and prediction accuracy in multisite resting-state fMRI connectivity. NeuroImage (2017).
- A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron (2011).
- A supervised clustering approach for fMRI-based inference of brain states. Pattern Recognit. (2012).
- Reconstructing visual experiences from brain activity evoked by natural movies. Curr. Biol. (2011).
- Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci. (2006).
- Information mapping with pattern classifiers: a comparative study. NeuroImage (2011).
- Machine learning classifiers and fMRI: a tutorial overview. NeuroImage (2009).
- The file drawer problem and tolerance for null results. Psychol. Bull.
- Divide and conquer: a defense of functional localizers. NeuroImage
- Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): random permutations and cluster size control. NeuroImage
- The WU-Minn Human Connectome Project: an overview. NeuroImage
- Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage
- From estimating activation locality to predicting disorder: a review of pattern recognition for neuroimaging-based psychiatric diagnostics. Neurosci. Biobehav. Rev.
- Individualized Gaussian process-based prediction and detection of local and global gray matter abnormalities in elderly subjects. NeuroImage
- A survey of cross-validation procedures for model selection. Stat. Surv.
- No unbiased estimator of the variance of k-fold cross-validation. J. Mach. Learn. Res.
- Toward discovery science of human brain function. Proc. Natl. Acad. Sci.
- Is cross-validation valid for small-sample microarray classification? Bioinformatics
- Machine Learning on Human Connectome Data from MRI
- Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci.
- Pooling fMRI data: meta-analysis, mega-analysis and multi-center studies. Front. Neuroinformatics
- Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
- The Autism Brain Imaging Data Exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Mol. Psychiatry
- The reusable holdout: preserving validity in adaptive data analysis. Science
- NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain. Front. Neuroinform.
- The Elements of Statistical Learning