Abstract
Large-scale, multi-subject single-cell datasets have become common. However, their study designs are highly heterogeneous, which has led to confusion about how to perform differential gene expression (DGE) analysis. In this work, we show that pseudobulk analysis produces substantial type 2 error when the group label being compared varies within a subject. When the comparison label is constant within a subject, type 1 error is inflated if pseudoreplication is not accounted for. As a general principle, we show that the appropriate method depends on the study design. We propose solutions to the inflated error rates and provide practical guidelines for researchers.
Competing Interest Statement
Buhm Han is the CTO of Genealogy Inc.
Footnotes
The manuscript has undergone major changes. We now identify study design as the major factor in choosing an appropriate differential gene expression analysis.