Big Data Reproducibility: Applications in Brain Imaging and Genomics

Eric W. Bridgeford; Shangsi Wang; Zhi Yang; Zeyi Wang; Ting Xu; Cameron Craddock; Jayanta Dey; Gregory Kiar; William Gray-Roncal; Carlo Coulantoni; Christopher Douville; Carey E. Priebe; Brian Caffo; Michael Milham; Xi-Nian Zuo; Consortium for Reliability and Reproducibility; Joshua T. Vogelstein

doi:10.1101/802629

Abstract

Reproducibility, the ability to replicate analytical findings, is a prerequisite for both scientific discovery and clinical utility. Troublingly, we are in the midst of a reproducibility crisis, in which many investigations fail to replicate. Although many believe that these failings are due to misunderstanding or misapplication of statistical inference (e.g., p-values or the dichotomization of “statistically significant”), we believe the shortcomings arise much earlier in the data science workflow, at the level of measurement, including data acquisition and reconstruction. A key to reproducibility is that multiple measurements of the same item (e.g., experimental sample or clinical participant) are similar to one another, while being dissimilar from measurements of other items. The intra-class correlation coefficient (ICC) quantifies reproducibility in this way, but only for univariate (one-dimensional) data, and it relies on Gaussian assumptions for validity. In contrast, big data is multivariate (high-dimensional), non-Gaussian, and often non-Euclidean (including text, images, speech, and networks), rendering ICC inadequate. We propose a novel statistic, discriminability, which quantifies the degree to which repeated measurements of the same item are similar to one another relative to measurements of other items, without restricting the data to be univariate, Gaussian, or even Euclidean. We then introduce the possibility of optimizing experimental design by increasing discriminability. We prove that optimizing discriminability yields an improved ability to use the data for subsequent inference tasks, without specifying the inference task a priori. We then apply this approach to three different datasets: a brain imaging dataset built by the “Consortium for Reliability and Reproducibility”, which consists of 28 disparate magnetic resonance imaging datasets, and two genomics datasets. Discriminability is the only statistic whose optimization improves performance on all subsequent inference tasks for each dataset, even though those tasks were not considered in the optimization. We therefore suggest that designing experiments and analyses to optimize discriminability may be a crucial step in solving the reproducibility crisis.
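
To make the idea of "repeated measurements of the same item are closer to one another than to measurements of other items" concrete, the following is a minimal sketch of a sample statistic in that spirit. It is not taken from the abstract, which does not state a formula: the function name `discriminability`, the use of Euclidean distance, and the half-credit rule for ties are all assumptions made for illustration. The sketch reports the fraction of comparisons in which a within-item distance is smaller than a distance to a measurement of a different item.

```python
import numpy as np


def discriminability(X, labels):
    """Illustrative sample statistic (an assumption, not the paper's stated formula).

    X      : (n, d) array, one row per measurement.
    labels : length-n array giving the item each measurement belongs to.

    Returns the fraction of comparisons in which the distance between two
    measurements of the same item is smaller than the distance from the
    first of them to a measurement of a different item.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances (assumed metric; any distance could be used).
    diffs = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=-1))

    wins, total = 0.0, 0
    for i in range(len(labels)):
        same = labels == labels[i]
        other = ~same
        same[i] = False  # do not compare a measurement with itself
        for j in np.flatnonzero(same):
            # Count strict wins fully and ties as one half (assumed convention).
            wins += np.sum(D[i, j] < D[i, other])
            wins += 0.5 * np.sum(D[i, j] == D[i, other])
            total += int(other.sum())

    if total == 0:
        raise ValueError("need repeated measurements of at least two items")
    return wins / total


# Two items, each measured twice; the items are well separated.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
score = discriminability(X, labels)  # 1.0: every within-item distance is smallest
print(score)
```

Because the sketch only needs a matrix of pairwise distances, the same computation applies to multivariate or non-Euclidean data once a suitable distance is substituted. A value near 1 indicates that repeated measurements identify their item; a value near 0.5 is what unrelated labels would tend to produce.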

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

Added genomics experiments, renamed several features of the discriminability framework (one- and two-sample testing to GOF and comparison tests), added simulation cases, added several comparison algorithms (FPI, HSIC, DISCO).
https://github.com/neurodata/r-mgc
https://github.com/neurodata/hyppo

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.