Feature extraction approach in single-cell gene expression profiling for cell-type marker identification

Background Recent advances in single-cell gene expression profiling technology have revolutionized the understanding of molecular processes underlying developmental cell and tissue differentiation, enabling the discovery of novel cell-types and molecular markers that characterize developmental trajectories. Common approaches for identifying marker genes are based on pairwise statistical testing for differential gene expression between cell-types in heterogeneous cell populations, which is challenging due to unequal sample sizes and variance between groups resulting in little statistical power and inflated type I errors. Results We developed an alternative feature extraction method, Marker gene Identification for Cell-type Identity (MICTI) that encodes the cell-type specific expression information to each gene in every single-cell. This approach identifies features (genes) that are cell-type specific for a given cell-type in heterogeneous cell population. To validate this approach, we used (i) simulated single cell RNA-seq data, (ii) human pancreatic islet single-cell RNA-seq data and (iii) a simulated mixture of human single-cell RNA-seq data related to immune cells, particularly B cells, CD4+ memory cells, CD8+ memory cells, dendritic cells, fibroblast cells, and lymphoblast cells. For all cases, we were able to identify established cell-type-specific markers. Conclusions Our approach represents a highly efficient and fast method as an alternative to differential expression analysis for molecular marker identification in heterogeneous single-cell RNA-seq data.

6 inefficient in terms of execution time and redundant use, especially when considering several cell-116 types in the heterogeneous cell population. 117 Alternative approaches to the problem make use of feature extraction, where information extraction 118 methods are used to mine higher-level information from the datasets resulting in non-redundant 119 information that can be used for high accuracy cell-type identification. Feature extraction is widely 120 used for dimensionality reduction such as principal component analysis (PCA) (Breu, Guggenbichler, 121 & Wollmann, 2008), independent component analysis (ICA) (Kachenoura,Albera,Senhadji,& 122 Comon, 2008) , non-negative matrix factorization (NMF) (Eggert & Krner, 2004), linear discriminate 123 analysis (LDA) (M. Li & Yuan, 2005) and autoencoders (Wang, Huang, Wang, & Wang, 2014). For 124 document classification and sentiment analysis, term frequency-inverse document frequency (TF-125 IDF) (Ramos, 2003) and latent Dirichlet allocation (LDA) (Blei et al., 2003) are widely used feature 126 extraction methods. 127 Here we adopt the TF-IDF method to develop MICTI (Fig. 1), a feature extraction method for Marker 128 gene Identification for Cell-type Identity in single-cell RNA-seq data. MICTI can be used as an 129 alternative method with differential gene expression for cell-type specific marker identification in 130 single-cell RNA-seq data. The aim of MICTI is to rapidly identify key cell-type (or cluster) specific 131 genes for the analysis of heterogeneous single-cell RNA-seq. Making use of the sparsity of the 132 normalized expression data, the cell-type (or cluster) specific information is encoded for each gene 133 in every cell. This allows MICTI to avoid variance based gene filtering steps typically used in the 134 preprocessing of single-cell RNA-seq analysis.

Cell type 4 features
Mean    cell-type E is relatively distinct as compared to the other cell-types. Fig. 2c shows a heatmap for the 174 markers identified by MICTI with p-value less than 0.01 and Z-score greater than 0. The cell-type 1 175 marker genes distinctly have high expression in cell-type 1 cells and low expression in the other cell-176 types. Markers for cell-type A and B share common marker genes. The same is true for cell-type C 177

244
The gene expression heatmap (Fig. 4b) for MICTI identified marker genes show that the expression 245 pattern of the cell-type specific marker genes are specific to the corresponding cell-types. We can 246 also notice that cells that are closely related functionally (B cells and dendritic cells or CD4+ and 247 CD8+ cells) share common cell-type marker genes. 248 266 23 genes were found to be cell-type specific genes for both CD4+ and CD8+ memory cells (Fig. 5).  ABT1  APEX1  ATP9B  BEX2  CABIN1  CCL20  CD40LG  CTLA4  DUSP6  HAX1  HIGD2A  HSPA5  MAF  MAGEH1  MAP3K14  MEA1  METTL18  MRFAP1L1  MRPL17  MRPL51  NFKBIA  NIP7  NOP10  PATL2  PMAIP1  POLE3  POP7  PPP1R11  RGS16  RNF113A  SAT1  TMED4  'TMEM208  TRIAP1  TSPYL1  TSPYL2 TUBB4B UTP3 AIF1  CARD16  CD1C  CD33  CLEC4A  CLEC7A  CLEC9A  CPVL  CRIP1  CST3  CSTA  FCER1A  FCER1G  FTH1  HLA-DMA  HLA-DMB  HLA-DQB1  IGSF6 LGALS2 LST1 LYZ MNDA MS4A6A RNASE6 S100A10 S100A4 S100A8 S100A9 TNFSF13B TYROBP MS4A1  CCR7  RPL36A  TNFRSF17  H2AFZ  PLEK  CCNB1  PLAC8  ARHGDIB  BLNK  PTTG1  HMSD  CTSC  MSRB1  ATP5G1  LRMP  HMGB2  SELL  MEF2C  NUSAP1  NDUFA9  PSMB9 CKS2 GPX4  AXL  CAV1  CD63  CFL2  COL1A1  COL1A2  CTGF  CTSK  CTTN  CYR61  DCN  DKK1  FTL  GPX1  GPX8  HMGN2  ID3  IFITM3  ILK  LUM  MFAP4  MMP1  MMP3  MT1E  MTRNR2L1  MTRNR2L10  MTRNR2L3  MTRNR2L4  MTRNR2L8  NQO1  NXN  PCOLCE  RPS26  S100A16  SPARC  TAGLN  THBS1  THY1  TIGD1  TIMP1  TNFRSF11B  TPM2 TUBA1A VIM  In addition, pairwise statistical testing, using for example BPSC and DEseq2 (Fig. 6), for each of the 317 cell-types against the other cell-types in the population take much execution time. Moreover, there is 318 no a gold standard method for differential expression analysis in the context of single-cell RNA-seq 319 data (Dal Molin, Baruzzo, & Di Camillo, 2017). Our feature extraction based method, which differs 320 from pairwise statistical testing methods in using cell-type specific mean feature-encoded expression 321 values for marker gene significance p-value calculation, can efficiently identify key cell-type specific 322 genes from heterogeneous cell population expression data. It also avoids gene filtering based-on 323 expression variance, a common procedure before proceeding to the downstream clustering and 324 differential analysis that might result in loss of important information, in most of single-cell RNA-325 seq analysis pipelines. It also has significantly less execution time when it is compared with the 326 statistical testing methods (Fig. 6). 327 MICTI does not depend on balanced numbers of cells in each of clusters. In the case studies 1, 2 and 328 3, the number of cells varied in each of the heterogeneous cell categories (Fig. 3a and Fig. 4a), but 329 MICTI identified key marker genes in each cluster or cell-types. We can also observe that the MICTI 330 identified cell-type marker genes for functionally related cell-types, such as B-cells and Dendritic 331 cells or CD4+ and CD8+ memory cells, share communality (Fig. 4b). be used as an alternative to statistical differential expression analysis for marker gene identification. 341

[CD4+ cells ] and [CD8+ cells]:
It not only avoids repeated statistical testing for marker gene identification in the heterogeneous cell 342 population, but it also keeps gene expression information otherwise lost in filtering genes based on 343 their variance across the cells as it is common practice in most single-cell RNA-seq pipelines. 344

345
MICTI is a feature extraction-based method for identification of cell-type specific marker genes in 346 single-cell RNA-seq data. We describe the typical workflow of MICTI in detail. Following grouping 347 of cells into their respective cell-type categories, in our case we did it manually by extracting cell 348 metadata information, the next step was to investigate the marker genes that gave rise to the identity 349 of the cell in the heterogeneous cell population. This was done by looking at the gene expression level 350 20 of a given gene in each cell weighted by the scarcity of expression factor for the given gene across 351 the cell-types. The scarcity of expression factor for a given gene was calculated by dividing the total 352 number of cells by the number of cells that express the given gene in the logarithmic scale. We 353 considered a gene as "expressed" if the normalized UMI count or TPM value was greater than the 354 user-defined threshold value as the binary Yes/No decision. We call this step as "feature-encoding" 355 for subsequent feature extraction as (equations 1-5). Fig. 1 shows the workflow of this feature 356 extraction approach for identifying cell-type specific genes or markers. 357 positive extreme Z-score valued genes were cluster-specific genes, whereas the negative extreme Z-403 score valued genes were the expressed genes across all clusters. In order to identify the statistically 404 significant cluster specific marker genes, we calculated the Bonferroni corrected p-values for each of 405 the clusters. We used 0.01 as a cut-off p-value, where the significant genes with p-values of less than 406 0.01 and Z-score of greater than 0 were cluster markers. 407

Gene list enrichment analysis 408
In order to validate cell-type specific marker genes, we used gene over-representation analysis for 409 pathways and gene ontology (GO) terms. For the over representation analysis, we used 410