Abstract
Large-scale multi-subject single-cell datasets have become common. For differential gene expression (DGE) analysis of these datasets, the usual practice is to choose one of two approaches: pseudobulk methods or cell-wise methods. However, multi-subject single-cell studies have highly heterogeneous designs. Some compare and test case/control samples, whereas others apply perturbations within samples. In this work, we report that the two categories of study design must be treated differently in DGE analysis to prevent severely inflated errors. In studies with case/control labels, pseudobulk methods work best, and cell-wise methods produce severe inflation of type I errors. In studies with within-sample perturbations, cell-wise methods work best, and pseudobulk methods produce severe inflation of type II errors. We provide mathematical proofs to support this argument. Surprisingly, many existing studies, including published ones, choose an inappropriate DGE method. The most common temptation is to use cell-wise methods for case/control studies, which only makes p-values look falsely significant. Our analyses and proofs warn that choosing an appropriate DGE method is not an option but a requirement.
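As an informal illustration of the case/control half of this claim, the following sketch simulates a single gene with no true group effect and compares how often each approach rejects the null at the 0.05 level. It is not the analysis used in this work: the Gaussian model, the subject-level and cell-level standard deviations, the sample sizes, and the function names are all assumptions chosen for illustration, and the two critical values are hardcoded for the default sizes (10 subjects per group, 50 cells per subject).

```python
import random

# Two-sided 0.05 critical values of Student's t, hardcoded for the
# default sizes below (assumption: 10 subjects/group, 50 cells/subject).
CRIT_CELLWISE = 1.962    # df = 2 * 10 * 50 - 2 = 998
CRIT_PSEUDOBULK = 2.101  # df = 2 * 10 - 2 = 18


def mean(xs):
    return sum(xs) / len(xs)


def t_stat(a, b):
    """Pooled-variance two-sample t statistic."""
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled = (ss_a + ss_b) / (len(a) + len(b) - 2)
    return (ma - mb) / (pooled * (1 / len(a) + 1 / len(b))) ** 0.5


def simulate_group(rng, n_subj, n_cells, subj_sd, cell_sd):
    """Return one list of cell values per subject (no group effect)."""
    subjects = []
    for _ in range(n_subj):
        subject_effect = rng.gauss(0.0, subj_sd)
        subjects.append(
            [subject_effect + rng.gauss(0.0, cell_sd) for _ in range(n_cells)]
        )
    return subjects


def false_positive_rates(n_sims=200, n_subj=10, n_cells=50,
                         subj_sd=1.0, cell_sd=1.0, seed=0):
    """Fraction of null simulations rejected by each approach."""
    rng = random.Random(seed)
    cell_hits = pb_hits = 0
    for _ in range(n_sims):
        cases = simulate_group(rng, n_subj, n_cells, subj_sd, cell_sd)
        ctrls = simulate_group(rng, n_subj, n_cells, subj_sd, cell_sd)
        # Cell-wise: every cell is treated as an independent observation.
        t_cell = t_stat([x for s in cases for x in s],
                        [x for s in ctrls for x in s])
        # Pseudobulk: one aggregated value (here, the mean) per subject.
        t_pb = t_stat([mean(s) for s in cases], [mean(s) for s in ctrls])
        cell_hits += abs(t_cell) > CRIT_CELLWISE
        pb_hits += abs(t_pb) > CRIT_PSEUDOBULK
    return cell_hits / n_sims, pb_hits / n_sims


if __name__ == "__main__":
    cell_rate, pb_rate = false_positive_rates()
    print(f"cell-wise false-positive rate:  {cell_rate:.2f}")
    print(f"pseudobulk false-positive rate: {pb_rate:.2f}")
```

Under these assumptions, the cell-wise rate should come out far above the nominal 0.05, because cells from the same subject share a subject-level effect and are not independent, while the pseudobulk rate should stay close to 0.05. The sketch covers only the null case/control scenario; it does not address within-sample perturbation designs.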
Competing Interest Statement
Buhm Han is the CTO of Genealogy Inc.
Footnotes
Main text update