Abstract
Characterization of subpopulations is a key challenge in the emerging field of single-cell RNA-seq (scRNA-seq). In scRNA-seq data, gene expression has been used as a feature to detect subpopulations; however, this data type is subject to a significant amount of noise. Unconventionally, we propose using information from filtered, effective and expressed nucleotide variations (eeSNVs) in scRNA-seq data as improved predictive features for subpopulation identification. We developed a linear modeling framework called SSrGE (Sparse SNV inference to reflect Gene Expression) to detect eeSNVs that are associated with gene expression profiles. In all the datasets tested, these eeSNVs retrieve cell subpopulations more accurately than gene expression does. Moreover, bipartite graphs of cells and eeSNVs provide a better visual representation of the different cell subpopulations than other methods that use gene expression data. We ranked genes according to their eeSNVs in the cancer scRNA-seq data and found that genes in the antigen processing and presentation pathway have top-ranked eeSNVs. These include HLA-A, HLA-B, HLA-C and HLA-DRA in the human leukocyte antigen (HLA) complex, and B2M in the major histocompatibility complex (MHC). Moreover, cancer-relevant genes that were previously validated experimentally, such as KRAS and SPARC, are also highly ranked for eeSNVs. In summary, we emphasize that SNV features hidden in scRNA-seq data have merit both for subpopulation identification and for linking genotype to phenotype. The computational method is freely available at https://github.com/lanagarmire/SSrGE.
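
The idea of a sparse linear model that relates SNVs to the expression of a gene can be illustrated with a minimal sketch. It assumes that sparsity comes from an L1 (lasso) penalty fitted by cyclic coordinate descent, which this abstract does not state. The function names, the toy genotype matrix, the simulated expression values and the penalty `alpha` are all hypothetical and are not taken from the SSrGE package. In this sketch, SNVs that keep a nonzero weight play the role of eeSNVs for the gene.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso shrinkage step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def centre(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def lasso_coordinate_descent(columns, y, alpha, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1.

    columns: list of centred feature columns (one per SNV), each of length n.
    y: centred response (expression of one gene across n cells).
    """
    n = len(y)
    weights = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw, and w starts at zero
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            scale = sum(x * x for x in col) / n
            if scale == 0.0:
                continue  # SNV is constant across cells, so it is uninformative
            # Covariance of SNV j with the residual that excludes SNV j itself.
            rho = sum(x * (r + x * weights[j]) for x, r in zip(col, residual)) / n
            new_weight = soft_threshold(rho, alpha) / scale
            delta = new_weight - weights[j]
            residual = [r - x * delta for x, r in zip(col, residual)]
            weights[j] = new_weight
    return weights


# Hypothetical toy data: cell i carries SNV j when bit j of i is set.
n_cells, n_snvs = 64, 6
snv_columns = [[float((i >> j) & 1) for i in range(n_cells)] for j in range(n_snvs)]

# Hypothetical expression of one gene, driven only by SNVs 0 and 3,
# plus a small deterministic jitter standing in for noise.
expression = [
    2.0 * snv_columns[0][i] - 1.5 * snv_columns[3][i] + 0.05 * ((7 * i) % 5 - 2)
    for i in range(n_cells)
]

weights = lasso_coordinate_descent(
    [centre(col) for col in snv_columns], centre(expression), alpha=0.1
)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(selected)  # prints [0, 3] for this toy data
```

A larger `alpha` removes more SNVs from the model, while a smaller one keeps more of them; in practice the value would be chosen by cross-validation rather than fixed by hand as it is here.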