Abstract
The nascent field of genomic AI is rapidly expanding with new models, benchmarks, and findings. As the field diversifies, there is a growing need for a common set of measurement tools and perspectives to standardize model evaluation. Here, we present a statistically grounded framework for performance evaluation, visualization, and interpretation, using the prominent sequence-based deep learning models Enformer and Borzoi as case studies. The Enformer model has been used for applications ranging from understanding regulatory mechanisms to variant effect prediction, but what makes it better or worse than its predecessors? Does its follow-up, Borzoi, offer improved performance and more informative embeddings, as well as finer resolution? Our goal is to propose a general blueprint for answering such questions and evaluating new models. We start by contrasting the few-shot performance of Enformer and Borzoi with that of preceding models on the GUANinE benchmark, which emphasizes complex genome interpretation tasks. We then examine Enformer and Borzoi intermediate embeddings in model-subjective principal component space, where we identify limiting aspects that affect model generalization. Finally, we present an interpretable decomposition of Enformer and Borzoi, which allows for global model interpretability and partial ‘backtracking’ to explanatory causal features. Through this case study, we illustrate a new protocol, Enformation Theory, for analyzing and interpreting deep learning models in genomics.
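To make the embedding analysis concrete, the sketch below illustrates the general idea of projecting intermediate embeddings into a model-subjective principal component space. This is a minimal illustration, not the authors' pipeline: the `embeddings` matrix, its dimensions, and the number of components are placeholder assumptions standing in for pooled activations extracted from a model such as Enformer or Borzoi.

```python
# Minimal sketch (not the authors' code): projecting intermediate embeddings
# into principal component space and inspecting how variance concentrates.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in: an (n_sequences, embedding_dim) matrix of pooled
# intermediate activations; real usage would extract these from the model.
embeddings = rng.normal(size=(1000, 1536))

# Fit PCA on the embeddings to obtain a "model-subjective" coordinate system.
pca = PCA(n_components=10)
coords = pca.fit_transform(embeddings)

# A few dominant components can indicate low-dimensional structure in the
# representation, one possible source of limited downstream generalization.
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
print("first two PC coordinates of the first two sequences:", coords[:2, :2])
```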
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Added new results for the Borzoi model, as well as figures and analyses related to both models, and additional supplementary information.