Abstract
Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. Such heterogeneity can arise from both technical and biological factors, thus making decomposing sources of variation extremely difficult. We here describe a computationally efficient model that uses prior pathway annotation to guide inference of the biological drivers underpinning the heterogeneity. Moreover, we jointly update and improve gene set annotation and infer factors explaining variability that fall outside the existing annotation. We validate our method using simulations, which demonstrate both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Moreover, through applications to real data we show that our model can robustly decompose scRNA-seq datasets into interpretable components and facilitate the identification of novel sub-populations.