Abstract
Single-cell transcriptomics enables the study of cellular heterogeneity, but current unsupervised strategies make it challenging to associate individual cells with sample conditions. We propose scMILD, a weakly supervised learning framework based on Multiple Instance Learning that leverages sample-level labels to identify condition-associated cell subpopulations. scMILD employs a dual-branch architecture to perform sample-level classification and cell-level representation learning simultaneously. In controlled simulation studies with CRISPR-perturbed cells, we verified that the model reliably identifies condition-associated cells. Evaluated on diverse single-cell RNA-seq datasets, including lupus, COVID-19, and ulcerative colitis, scMILD consistently outperformed state-of-the-art models and identified condition-specific cell subpopulations consistent with the findings of the original studies. These results demonstrate scMILD's potential for exploring the cellular heterogeneity underlying various biological conditions and its applicability across disease contexts.
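To make the Multiple Instance Learning setting concrete, the following is a minimal sketch of generic attention-based MIL pooling, in which one sample is treated as a bag of cells and only the bag carries a label. It is not scMILD's implementation: the abstract does not specify the architecture, and the function name `attention_mil_forward`, the parameters `V`, `w`, `clf_w`, `clf_b`, and the random input data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def attention_mil_forward(cells, V, w, clf_w, clf_b):
    """Forward pass of a generic attention-based MIL classifier (illustrative).

    cells : (n_cells, n_features) array, the bag of cells for one sample.
    V, w  : hypothetical attention parameters.
    clf_w, clf_b : hypothetical sample-level logistic classifier parameters.
    Returns the predicted sample-level probability and per-cell attention weights.
    """
    hidden = np.tanh(cells @ V)        # (n_cells, n_hidden)
    scores = hidden @ w                # one unnormalized score per cell
    attn = softmax(scores)             # per-cell weights that sum to 1
    bag_embedding = attn @ cells       # attention-weighted average of the cells
    logit = bag_embedding @ clf_w + clf_b
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, attn


# Random stand-in data: one sample with 50 cells and 8 features per cell.
rng = np.random.default_rng(0)
n_cells, n_features, n_hidden = 50, 8, 4
cells = rng.normal(size=(n_cells, n_features))
V = rng.normal(size=(n_features, n_hidden))
w = rng.normal(size=n_hidden)
clf_w = rng.normal(size=n_features)
clf_b = 0.0

prob, attn = attention_mil_forward(cells, V, w, clf_w, clf_b)
top_cells = np.argsort(attn)[::-1][:5]  # indices of the five highest-weighted cells
```

In a sketch like this, only the sample-level probability would be trained against the sample label, while the per-cell attention weights offer one possible way to rank cells by their contribution to the prediction.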
Key Messages
scMILD: A novel weakly supervised framework for single-cell transcriptomics
Dual-branch architecture enables sample classification and cell subpopulation identification
Outperforms state-of-the-art models across diverse single-cell RNA-seq datasets
Identifies biologically relevant condition-associated cell subpopulations
Bridges the gap between sample-level phenotypes and cellular heterogeneity
Competing Interest Statement
The authors have declared no competing interest.