RT Journal Article
SR Electronic
T1 Empirically-Derived Synthetic Populations to Overcome Small Sample Sizes
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 441238
DO 10.1101/441238
A1 Erin E. Fowler
A1 Anders Berglund
A1 Thomas A. Sellers
A1 Steven Eschrich
A1 John Heine
YR 2018
UL http://biorxiv.org/content/early/2018/10/11/441238.abstract
AB Overfitting is a problem often encountered when developing multivariate predictive models with limited data. The objective of this work is to present a method to generate, from a sparse seed dataset, a synthetic population (SP) with a similar multivariate structure to aid in model building. We used a multivariate kernel density estimation approach with an unconstrained bandwidth to generate SPs with data at the individual level. A matched case-control study (n=180 pairs) was used as the seed dataset. Cases and controls were considered as two subpopulations and analyzed separately. We included four continuous measures and one categorical variable for each subject. Bandwidth matrices were determined with differential evolution (DE) optimization based on covariance comparisons. The seed dataset was compared with datasets selected randomly from the SPs under the hypothesis that their structure should be similar. To evaluate similarity, we compared PCA score distributions and residuals summarized with the distance to the model in X-space (DModX). SPs were generated for both case and control groups. The probability of selecting seed replicas when constructing synthetic sample datasets randomly was minute. Within each group, both PCA scores and residuals were similar across seed and synthetic samples; covariance comparisons also indicated that the structure was similar. The feasibility of a new SP generation methodology was demonstrated. This approach produced synthetic data at the patient level indistinguishable from the seed data.
The methodology coupled kernel density estimation with DE optimization and deployed novel similarity metrics derived from PCA. Synthetic samples may help mitigate overfitting in initial model-building exercises. Developing this approach into a research tool for model building will require additional evaluation with increased dimensionality.