Abstract
Structural biologists have fit increasingly complex model types to protein X-ray crystallographic data, motivated by crystals that diffract to higher resolution, greater computational power, and a growing appreciation for protein dynamics. A more complex model will generally fit the experimental data better, but it also has greater capacity to overfit to experimental noise. While refinement progress is normally monitored for a given model type with a fixed number of parameters, comparatively little attention has been paid to selecting among distinct model types whose numbers of parameters differ. Using metrics from the statistical field of model comparison, we develop a framework for statistically rigorous inference of model complexity. From analysis of simulated data, we find that the resulting information criteria are less likely than the crystallographic cross-validation criterion Rfree to prefer an erroneously complex model type, and are less sensitive to noise. Moreover, these information criteria suggest caution in using complex model types and in inferring protein conformational heterogeneity from experimental scattering data.