RT Journal Article
SR Electronic
T1 Optimal Experimental Design for Big Data: Applications in Brain Imaging
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 802629
DO 10.1101/802629
A1 Eric W. Bridgeford
A1 Shangsi Wang
A1 Zhi Yang
A1 Zeyi Wang
A1 Ting Xu
A1 Cameron Craddock
A1 Gregory Kiar
A1 William Gray-Roncal
A1 Carey E. Priebe
A1 Brian Caffo
A1 Michael Milham
A1 Xi-Nian Zuo
A1 Consortium for Reliability and Reproducibility
A1 Joshua T. Vogelstein
YR 2019
UL http://biorxiv.org/content/early/2019/10/13/802629.abstract
AB The cost of data collection and processing is becoming prohibitively expensive for many research groups across disciplines, a problem exacerbated by the dependence on ever larger sample sizes to obtain reliable inferences for increasingly subtle questions. And yet, as more data are available and open access, more researchers desire to analyze them for different questions, often including previously unforeseen questions. To further increase sample sizes, existing datasets are often amalgamated. These reference datasets—datasets that serve to answer many disparate questions for different individuals—are increasingly common and important. Reference pipelines efficiently and flexibly analyze all the datasets. How can one optimally design these reference datasets and pipelines to yield derivative data that are simultaneously useful for many different tasks? We propose an approach to experimental design that leverages multiple measurements for each distinct item (for example, an individual). The key insight is that each measurement of an item should be more similar to other measurements of that item than to measurements of any other item. In other words, we seek to optimally discriminate one item from another. We formalize the notion of discriminability, and introduce both a non-parametric and a parametric statistic to quantify the discriminability of potentially multivariate or non-Euclidean datasets.
With this notion, one can make optimal decisions—either with regard to acquisition or analysis of data—by maximizing discriminability. Crucially, this optimization can be performed in the absence of any task-specific (or supervised) information. We show that optimizing decisions with respect to discriminability yields improved performance on subsequent inference tasks. We apply this strategy to a brain imaging dataset built by the “Consortium for Reliability and Reproducibility”, which consists of 24 disparate magnetic resonance imaging datasets, each with up to hundreds of individuals that were imaged multiple times. We show that by optimizing pipelines with respect to discriminability, we improve performance on multiple subsequent inference tasks, even though discriminability does not consider those tasks whatsoever.
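The abstract describes a non-parametric discriminability statistic: the probability that two measurements of the same item are closer to each other than either is to a measurement of any other item. A minimal sketch of such a statistic follows; the function name, the choice of Euclidean distance, and the exact tie/normalization conventions are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def discriminability(X, ids):
    """Sketch of a non-parametric discriminability statistic.

    Estimates the probability that a within-item distance (between two
    measurements of the same item) is smaller than the distance from one
    of those measurements to a measurement of a different item.

    X   : (n, d) array of n measurements with d features
    ids : length-n array of item labels (e.g., subject IDs)

    NOTE: this is an illustrative implementation using Euclidean
    distance; it is not the paper's reference code.
    """
    X = np.asarray(X, dtype=float)
    ids = np.asarray(ids)
    n = len(ids)
    # pairwise Euclidean distance matrix between all measurements
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    hits, total = 0, 0
    for i in range(n):
        for j in range(n):
            if i != j and ids[i] == ids[j]:
                # compare the within-item distance D[i, j] against the
                # distances from measurement i to every other item's measurements
                others = ids != ids[i]
                hits += np.sum(D[i, j] < D[i, others])
                total += np.sum(others)
    return hits / total
```

A perfectly discriminable dataset (all within-item distances smaller than all between-item distances) yields 1.0, and chance-level data falls toward 0.5; pipelines can then be ranked by this score without any task labels.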