Abstract
Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data are ideal for building such predictive models, they are considerably scarcer than baseline (non-perturbed) cellular data. To address this limitation, several foundational cell models have been developed using large-scale single-cell gene expression data. After pre-training, these models are fine-tuned for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge.
In this study, we benchmarked a recently published foundational model, scGPT, against simpler baseline models. Surprisingly, we found that even the simplest baseline - taking the mean of the training examples - outperformed scGPT. Furthermore, machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we found that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models.
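To make the baseline concrete: a minimal sketch of a mean-of-training-examples predictor, assuming (hypothetically) that each training perturbation is summarized as a gene-wise mean expression vector; the exact preprocessing used in the study may differ.

```python
import numpy as np

def mean_baseline(train_profiles: np.ndarray) -> np.ndarray:
    """Predict the post-perturbation expression profile of any held-out
    perturbation as the gene-wise mean over all training perturbations.

    train_profiles: (n_perturbations, n_genes) array, one mean
    post-perturbation expression vector per training perturbation.
    Returns a single (n_genes,) prediction, identical for every
    held-out perturbation.
    """
    return train_profiles.mean(axis=0)

# Toy example: 3 training perturbations, 4 genes (hypothetical values).
train = np.array([
    [1.0, 0.0, 2.0, 1.0],
    [3.0, 1.0, 0.0, 1.0],
    [2.0, 2.0, 1.0, 1.0],
])
prediction = mean_baseline(train)  # -> [2.0, 1.0, 1.0, 1.0]
```

Because this predictor ignores the identity of the perturbation entirely, any model it outperforms is effectively failing to capture perturbation-specific signal.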
Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models.
Competing Interest Statement
All authors are full-time employees of Turbine Ltd.; KSz is also a founder.