sc-REnF: An entropy-guided robust feature selection for clustering of single-cell RNA-seq data

Many single-cell typing methods require a pure clustering of cells, which is susceptible to technical noise and heavily dependent on the high-quality informative genes selected in the preliminary steps of downstream analysis. Techniques for gene selection in single-cell RNA sequencing (scRNA-seq) data are seemingly simple, which causes problems with respect to the resolution of (sub-)type detection and marker selection, and ultimately impacts cell annotation. We introduce sc-REnF, a novel and robust entropy-based feature (gene) selection method, which leverages the landmark advantages that 'Renyi' and 'Tsallis' entropy achieved in their original applications, for single-cell clustering. The resulting gene selection is robust and less sensitive to the technical noise present in the data, producing a pure clustering of cells and, beyond that, classifying independent and unknown samples with utmost accuracy. The corresponding software is available at: https://github.com/Snehalikalall/sc-REnF


Introduction
In recent times, technological advances have made it possible to study RNA-seq data at single-cell resolution 1 . Single-cell RNA sequencing (scRNA-seq) is a powerful tool for capturing gene expression snapshots in individual cells. Cell type detection is one of the fundamental steps in the downstream analysis of scRNA-seq data 2 . A widely used approach is to cluster the cells into different groups and determine the identity of cells within the individual groups/clusters 3,4 . This provides an unsupervised way of annotating the different cell types present in a large population of cells in scRNA-seq data [5][6][7][8] .
Starting from raw counts, scRNA-seq data analysis typically goes through the following steps before clustering: i) normalization (total-count and log-normalization), ii) feature selection, and iii) dimensionality reduction 9,10 . While normalization adjusts for the differences between the samples of individual cells and log normalization reduces the skewness of the data, feature selection seeks to identify the most relevant (top) features (genes) from the large feature space. The top genes are either i) highly variable genes, selected by measuring the coefficient of variation 11,12 , or ii) highly expressed genes, with expression levels higher than the average across all cells 9 .
The performance of downstream analysis, mainly the clustering process, is heavily dependent on the quality of the selected top features/genes. The typical characteristics of good features/genes are: i) they should encode useful information about the biology of the system, ii) they should not include features that contain random noise, and iii) they should preserve the useful biological structure while reducing the size of the data, so as to reduce the computational cost of later steps.
The conventional approach of gene selection based on high variability in the very first stage of the downstream analysis is seemingly simple. However, there are two major caveats: i) the observed variability of genes depends on the pseudo-count, which is arbitrary and can introduce biases in the data; ii) PCA-based dimensionality reduction implicitly depends on Euclidean geometry, which may not be appropriate for highly sparse and skewed scRNA-seq data. To take care of (i), adding a small pseudo-count to all normalized counts prior to log transformation is common practice in this pipeline. It is required because CPM (counts per million) normalization cannot change the dominating zeros in the scRNA-seq data, and without the pseudo-count the log transformation is not possible. For (ii), the application of PCA, despite its fast and memory-efficient behavior, is questionable because of the high sparsity and the discrete and skewed nature of the datasets.
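The pseudo-count issue in (i) can be made concrete with a minimal sketch, a generic illustration rather than the paper's own pipeline: CPM normalization leaves zero counts at zero, so a pseudo-count must be added before the log transformation. The function name and toy matrix below are hypothetical.

```python
import numpy as np

def cpm_log_normalize(counts, pseudocount=1.0):
    """CPM-normalize a cells x genes count matrix, then log2-transform.

    CPM rescales each cell to one million total counts but leaves the
    dominating zeros at zero, so a pseudo-count is added before the log.
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=1, keepdims=True)  # total counts per cell
    cpm = counts / library_sizes * 1e6                 # counts per million
    return np.log2(cpm + pseudocount)

# Toy example: 2 cells x 3 genes
X = np.array([[0, 5, 5],
              [2, 0, 8]])
Xn = cpm_log_normalize(X)
```

Note that the zero entries map to log2(0 + 1) = 0, so the choice of pseudo-count directly sets the baseline of the transformed data, which is exactly why it can bias variability estimates.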
In this paper, we address these two challenges, both in themselves and in combination. We first present a method that finds the most informative features/genes from scRNA-seq datasets based on generalized, wide-spectrum entropy measures (Rényi and Tsallis entropy). Although information-entropy-based feature selection is a long-standing and highly developed subject in the feature selection domain, applications of entropy in the single-cell domain remain largely unexplored. There are, of course, some existing works (e.g., 13,14 ), but these are not focused on informative gene selection in single-cell data; instead, Liu et al. 13 aim to purify the cluster annotation step, whereas Tischendorf et al. 14 focus on quantifying the differentiation potential of cells. Here, we demonstrate that employing Renyi and Tsallis entropy in the gene filtering process introduces major advantages both in terms of clustering accuracy and in terms of a biologically meaningful interpretation (a.k.a. marker selection) of the results.
The latter point is established because we are able to reveal marker genes for cell types for which markers had not been determined in earlier studies. Note that biological marker selection is usually a crucial step in the downstream analysis and depends on the purity of the cell clusters annotated in the previous stage. Here, we present the most informative gene selection for pure clustering, which ultimately leads to a good annotation of the clusters.
Beyond predicting the cluster annotation of single cells with utmost accuracy and providing a biologically meaningful interpretation based on scRNA-seq data alone, our method also enables clustering and annotation of a completely independent dataset of the same tissue. For this, we split the data into a train-test ratio of 8:2 and demonstrate that the features selected on the training set are equally effective for clustering the test samples. We also conduct a comprehensive simulation study to establish the proposed method on eight contaminated Gaussian mixture datasets. The results show that our method not only selects genes with utmost accuracy but also provides a robust selection with tuned parameters.

Summary of contributions:
In this work, we provide the following novelties: (1) We provide the first entropy-based gene selection approach for clustering single-cell data. We utilize Renyi and Tsallis entropy, which have a major advantage over the Shannon method through their controlling parameter (q), which makes them less sensitive (robust) to the different kinds of noise present in the data.
(2) Our approach is the first to explicitly address how to learn feature relevancy and redundancy using Renyi and Tsallis entropy in single-cell expression data. We formulate an objective function that minimizes the conditional entropy between the selected features and maximizes the conditional entropy between the class label and the feature.

(3) Our framework (with tuned parameters) is able to cluster unseen scRNA-seq expression data with utmost accuracy. Clustering hitherto unclassified data is crucial for the annotation of cell types, and we show our framework to be effective in this case. We demonstrate that the features selected from the training set remain equally valid for completely unseen test data of the same tissue.
(4) We derive a new risk function for Renyi (R_q) and Tsallis (R_q^T) entropy (see Methods). We theoretically prove that the iterative selection of features eventually minimizes the Renyi (R_q) and Tsallis (R_q^T) risk, which strengthens the robustness of our objective function.
(5) Our method is less sensitive (robust) to the different kinds of noise present in the data. This is due to the Renyi and Tsallis entropies, whose controlling parameter gives them an advantage over the Shannon method.
(6) It is difficult to achieve good clustering results on single-cell RNA-seq data with small sample sizes. Our method also provides good results for small-sample, large-feature single-cell data. Thus, our method can serve as a generalized framework for the downstream analysis stage of the single-cell pipeline.

A short description of feature selection and related works
Selection of relevant features remains a long-standing and important problem in the machine learning domain because of the ever-increasing size of datasets. Examples of such high-dimensional data include genomic data, text data, image data, etc., where feature selection plays an important role 15 . The principal motivation of feature selection is to reduce the dimension of large datasets in order to decrease computational cost and enhance the accuracy of the algorithms 16 applied to the data. Several machine learning and data mining algorithms 17 are widely used to extract or select meaningful and informative features in different domains such as text categorization 18 , image retrieval, intrusion detection 19 , genomic analysis 20 , and so on. Feature selection methods are commonly grouped into filter, wrapper, and embedded models. A filter model scores features using intrinsic properties of the data, independent of any learning algorithm. A wrapper model requires a learning algorithm to judge the goodness of a subset of features; it searches for non-redundant features that improve the performance of the learning algorithm, and hence incurs more computational cost than filter methods. Finally, the embedded model combines the advantages of both the wrapper and filter methods to select a subset of features.
Many interesting feature selection approaches have been proposed in earlier studies, including recursive feature elimination 23 , sequential feature selection algorithms 24 , genetic-algorithm-based feature selection 25 , and mutual information or entropy-based feature selection 26 .
Several entropy-based filter methods have been proposed in the literature 27 . Most of these methods use Shannon's entropy. Renyi and Tsallis entropy 28 are another option that needs to be investigated in this context. Recently, advances in the fields of security and privacy have reinvigorated interest in Renyi and Tsallis entropy 29 , and thus they are ripe for application to biological data as well.
Different works on feature selection based on entropy measures exist. In 30 , an algorithm called MIFS was proposed to automatically estimate the effective number of features using mutual information (Shannon entropy). Jiang et al. 31 proposed a model of relative decision entropy for feature selection, which extends Shannon's information entropy in rough sets. A study by Lopes et al. 32 characterized the sensitivity of entropy methods applied to the well-defined problem of gene regulatory networks. In 33 , a new information-theoretical method for feature selection using Renyi min-entropy was proposed. Another entropy-based feature selection method was proposed in 27 for text categorization.

Results
In the following, we will first describe the workflow of our analysis pipeline and the basic ideas that support it.
First, we utilize a basic framework for preprocessing the scRNA-seq data. We use recent filtering techniques to filter out genes and cells from the data and use the result as input to the proposed method.
We then carry out a simulation study on contaminated Gaussian mixture data that shows that our method can select genes with the utmost accuracy. The study also demonstrates a robust selection using the tuned parameter q.
Subsequently, in our real experiments, we select genes in five single-cell RNA sequencing datasets.
Corroborated by our simulations, a sufficient number of informative genes are selected, which is potentially actionable for single-cell data analysis at a lower computational cost.
Finally, our study matches the selected genes against known marker genes in the literature, documenting the plausibility of our predictions.

Workflow
Figure 1 describes the workflow of our analysis pipeline. All important steps are discussed in the paragraphs of this subsection. The data matrix M of expressed genes and good-quality cells is normalized using a linear model and normality-based normalizing transformation method (Linnorm) 34 . The resulting matrix is then log2-transformed after adding one as a pseudo-count. Thus, the preprocessed matrix M_{cl×ge} is derived and used as input to the entropy-based feature selection model.
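As a rough illustration of this preprocessing stage, the sketch below filters low-quality cells and rarely detected genes and then applies the log2(x + 1) transform. The function name and thresholds (`min_genes`, `min_cells`) are hypothetical, and the actual pipeline uses Linnorm (an R package) for normalization, which is not reproduced here.

```python
import numpy as np

def preprocess(counts, min_genes=2, min_cells=1):
    """Filter a cells x genes count matrix, then log2(x + 1)-transform it.

    Cells expressing fewer than `min_genes` genes and genes detected in
    fewer than `min_cells` cells are removed (illustrative thresholds).
    """
    counts = np.asarray(counts, dtype=float)
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    counts = counts[keep_cells]
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]
    return np.log2(counts + 1.0)  # pseudo-count of one, as in the workflow

# Toy matrix: 3 cells x 4 genes; the last gene is never detected
M = np.array([[4, 0, 1, 0],
              [2, 3, 0, 0],
              [0, 0, 1, 0]])
M_pre = preprocess(M)
```

In this toy case the third cell (only one detected gene) and the fourth gene (detected in no cell) are dropped before the transform.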

Dimensionality reduction and clustering
Here we use principal component analysis (PCA) to reduce the dimension of the data before clustering. We adopt the conventional process used in the Scanpy 35 package for dimensionality reduction and clustering. We pick the top 15 PCs and create a neighborhood graph of cells using the PCA representation. It is assumed that the top PCs are likely to represent the biological signals, while the later PCs mostly capture the noise in the data. The dominant factors of heterogeneity are likely to be captured by the top PCs. The neighborhood graph of cells is then clustered directly with the Leiden graph-clustering method 36 to group cells into different clusters.
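The PCA-then-neighborhood-graph step can be sketched in plain NumPy as follows. This is a minimal stand-in, not the Scanpy implementation: Leiden clustering itself requires the igraph/leidenalg libraries (in practice run via Scanpy) and is omitted; the function names and toy data are hypothetical.

```python
import numpy as np

def top_pcs(X, n_pcs=15):
    """Project cells onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_pcs, Vt.shape[0])
    return Xc @ Vt[:k].T

def knn_graph(P, k=15):
    """Symmetric boolean adjacency of a k-nearest-neighbour graph."""
    D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # exclude self-neighbours
    nn = np.argsort(D, axis=1)[:, :k]
    A = np.zeros(D.shape, dtype=bool)
    rows = np.repeat(np.arange(P.shape[0]), k)
    A[rows, nn.ravel()] = True
    return A | A.T                               # symmetrize

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))   # 40 cells, 6 stand-in features
P = top_pcs(X, n_pcs=15)       # only 6 components exist, so 6 are kept
A = knn_graph(P, k=5)
```

The resulting adjacency matrix `A` is what a graph-based clustering algorithm such as Leiden would partition into cell clusters.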

Marker identification
We compute a ranking of the highly differential genes within each cluster. Here we identify DE genes in each cluster using the Wilcoxon rank-sum test, although other statistical tests may be utilized instead. We select the top ten DE genes from each cluster based on their p-values.
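The cluster-versus-rest ranking can be sketched with SciPy's rank-sum test. The function name `rank_de_genes` and the toy data are hypothetical; this illustrates the general procedure rather than the paper's exact implementation.

```python
import numpy as np
from scipy.stats import ranksums

def rank_de_genes(X, labels, cluster, top=10):
    """Rank genes by Wilcoxon rank-sum p-value: cells in `cluster` vs. the rest."""
    in_c = labels == cluster
    pvals = np.array([ranksums(X[in_c, g], X[~in_c, g]).pvalue
                      for g in range(X.shape[1])])
    order = np.argsort(pvals)          # smallest p-value first
    return order[:top], pvals[order[:top]]

# Toy data: gene 0 is strongly up-regulated in cluster 1
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[10:, 0] += 5.0
labels = np.array([0] * 10 + [1] * 10)
top_genes, top_p = rank_de_genes(X, labels, 1, top=2)
```

In practice one would also correct the p-values for multiple testing before declaring markers.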
Clustering of unknown samples
For cells of unknown type, our method is able to cluster with the selected genes, yielding a clustering that is as fine-grained as is justified by the available labeled data. Clustering unknown samples is crucial in scRNA-seq data analysis and can be addressed in a supervised or unsupervised way. In the supervised setting, a model trained on reference data is used to predict the cell types of new samples. In our case (which is supervised), we observed that the genes selected on the reference data can be used to cluster unknown samples of the same tissue. The clusters may then be annotated by matching the DE genes with canonical markers. This provides the key element that makes our approach work in practice.

Validation of selected features/genes
We validate our selected features/genes in several ways. First, a comprehensive simulation study is conducted to ensure the robustness of the proposed method. Clustering results on synthetic data are evaluated using the Adjusted Rand Index (ARI) score. To validate the proposed risk measure, a stability test has also been conducted. Second, the clustering results on scRNA-seq data are evaluated using ARI to ensure a proper and accurate partitioning of cells. Third, given a good partitioning, we compute the DE genes within each cluster, which are then treated as marker genes of specific cell types. Finally, clustering of unknown samples is validated through a test set built from labeled scRNA-seq data. t-SNE 2 visualization is used in all cases to visualize the data with original and predicted annotations.

Clustering Synthetic Data: Validation
Clustering performance on the synthetic contaminated Gaussian mixture datasets (see Methods) is evaluated using the Adjusted Rand Index (ARI). The ARI score quantifies the agreement between predicted cluster labels and the known groups, ranging from 0 (when the clustering prediction is random) to 1 (when the clustering is perfectly coherent with the known labels) 37 . k-means clustering is utilized to cluster the synthetic Gaussian mixture datasets. Renyi and Tsallis entropy with twelve q values ranging from 0.1 to 7 are used to select features from the synthetic datasets (both for the overlapping and non-overlapping cases; see Methods for details). We make a note of the q values for which the two methods perform well, as can be observed from the figure.
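The evaluation loop (k-means on a contaminated Gaussian mixture, scored by ARI) can be sketched with scikit-learn. The mixture parameters below are made up for illustration and are not the paper's simulation settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
# Two well-separated Gaussian components plus a few contaminating outliers
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
    rng.normal(2.5, 8.0, size=(5, 2)),   # heavy-tailed contamination
])
y = np.array([0] * 50 + [1] * 50 + [0] * 5)  # outliers nominally in group 0
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y, pred)
```

With strong separation the ARI stays high even under mild contamination; sweeping q and repeating this scoring is how the robust range of q is identified.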

Clustering scRNA-seq Data: Validation
After determining the range of q for the Renyi and Tsallis methods, we applied sc-REnF to four scRNA-seq datasets. Clustering is done for the sake of validation: the genes selected by sc-REnF are used for the downstream analysis (PCA and clustering), and the ARI score is computed after clustering. Table 1 illustrates the clustering performance on the four scRNA-seq datasets using the Min Renyi, Renyi, Tsallis, and Shannon methods. It can be observed that sc-REnF responds well for the Renyi and Tsallis measures compared to the other two methods. The Renyi-based method achieves the highest ARI score on the Pollen dataset with a q value of 0.7. Similarly, the Tsallis method yields the highest ARI score on the Darmanis dataset with a q value of 0.3.
After gene selection, we perform dimensionality reduction using traditional PCA and create a neighborhood graph from the PC components, which is then clustered by the Leiden clustering method. Figure 3 shows the results of applying sc-REnF to the Darmanis data. To explore how well sc-REnF selects genes from the Darmanis data, we group the cells into clusters and match them with the original labels.

Clustering unknown sample: Validation through test data
Clustering unknown samples is crucial for the scRNA-seq analysis pipeline. Here we address this by performing cross-validation with a train-test split of the data in the ratio 7:3. First, the training dataset is used by sc-REnF to select informative genes; the top 50 genes are selected for clustering and validation. The performance of sc-REnF is then computed on the test data using the genes selected in the earlier step: we perform clustering on the test data with the selected genes and report the ARI value. The experiment is repeated 20 times with a random train-test split (7:3 ratio) in each case. Table 2 shows the median and standard deviation of the ARI scores for the Renyi and Tsallis measures.
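The repeated split-select-cluster-score protocol can be sketched as follows. Here variance-based gene selection and k-means stand in for sc-REnF's entropy criterion and Leiden clustering respectively, so this is an assumption-laden illustration of the protocol, not the paper's method; the synthetic data are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def split_and_score(X, y, n_genes=50, n_splits=20, seed=0):
    """Median/std of ARI on held-out cells, clustering only genes chosen
    on the train split (variance-based selection stands in for sc-REnF)."""
    scores = []
    for s in range(n_splits):
        Xtr, Xte, ytr, yte = train_test_split(
            X, y, test_size=0.3, random_state=seed + s, stratify=y)
        genes = np.argsort(Xtr.var(axis=0))[::-1][:n_genes]  # top-variance genes
        k = len(np.unique(y))
        pred = KMeans(n_clusters=k, n_init=10,
                      random_state=0).fit_predict(Xte[:, genes])
        scores.append(adjusted_rand_score(yte, pred))
    return float(np.median(scores)), float(np.std(scores))

# Synthetic example: the first 10 of 100 genes separate two cell groups
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))
X[:50, :10] += 4.0
y = np.array([0] * 50 + [1] * 50)
med, sd = split_and_score(X, y, n_genes=10, n_splits=5)
```

The key design point is that gene selection sees only the training split, so the held-out ARI measures how well the selected genes transfer to unseen cells.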

Marker Gene Selection
The clustering results are further utilized for marker gene selection for different cell types. Genes with markedly higher expression in cluster-3 and cluster-7 suggest novel markers for astrocytes and fetal replicating cells (see Figure 4, panel-C for the cell annotation of cluster-3 and cluster-7). For the CBMC data, the genes 'S100A9', 'NKG7', and 'LST1' have higher expression in cluster-0, cluster-3, and cluster-5, respectively, suggesting novel markers for Naive CD4 T, NK, and CD14+ Mono cells (see Figure 6, panel-C for the cell annotation of cluster-0, cluster-3, and cluster-5).

Stability checking of sc-REnF
Here we explore an additional advantage of sc-REnF over the other measures by validating the stability of its performance.

Discussion
Clustering of cells in scRNA-seq data is an essential step for cell type discovery in a large population of cells. Owing to the large feature/gene set of scRNA-seq data, the selection of the most variable genes is crucial in the preprocessing step and has an immense effect on the later stages of the downstream analysis.
The proposed method sc-REnF addresses this issue by using an entropy-based (Renyi, Tsallis) feature selection method for identifying informative genes in the preprocessing step. sc-REnF has the advantage over conventional statistical approaches that it can consider cell-to-cell dependency based on the generalized, wide-spectrum entropy measures Renyi and Tsallis. We demonstrated that sc-REnF with the Renyi and Tsallis methods introduces major advantages both in terms of clustering accuracy and in terms of marker gene detection in the downstream analysis of scRNA-seq data. Although the primary objective of sc-REnF is variable gene selection in the preprocessing step of scRNA-seq data analysis, we extend the process towards the later stages of downstream analysis. We employ a clustering technique to group the cells using the selected genes. A precise clustering of cells also demonstrates the efficacy of our method for selecting the most variable genes in the first stage. This facilitates the selection of novel marker genes within each cluster. We pinpoint several markers that show a high expression level within a particular cluster, some of which are also found in previously published cell marker databases.
Clustering of unknown samples based on reference data is a crucial problem for the identification of cell types in scRNA-seq cell classification. We addressed this problem by clustering the unknown samples using the genes selected on the reference data. We demonstrate the advantage of the genes selected by sc-REnF in clustering an unknown test sample. We observed good ARI scores in the clustering of test samples, suggesting that the genes selected from the reference data are also effective in producing an accurate clustering of a completely unknown test sample.
The execution time of sc-REnF is directly proportional to the number of selected features and can be expensive when one needs to select a large number of features. However, this can be easily tackled with the ever-increasing computing power of advanced servers. Additionally, a regularization parameter has not been considered in our proposed approach, which may sometimes make the algorithm susceptible to overfitting unless carefully employed.
Taken together, the proposed method sc-REnF not only performs well at informative gene selection in the preprocessing step but also has the ability to support the classification of unknown cells.

Synthetic Gaussian Mixture Data
We generated eight synthetic Gaussian mixture datasets (overlapping and non-overlapping) having k =

The process is repeated for all eight Gaussian mixture datasets.
Single Cell RNA sequence datasets
The following single-cell RNA sequencing datasets are used for evaluation of the proposed method.
• CBMC: Cord blood mononuclear cells (CBMC) were profiled by CITE-seq, which measures both cellular protein and mRNA expression in the same cell using oligonucleotide-labeled antibodies.
The dataset consists of the expression levels of 2,000 mRNAs and 13 proteins, individually measured in 8,000 cord blood mononuclear cells (CBMCs). The dataset is available on the GEO website under accession GSE100866.
The brief description of the dataset is given in Table 7.

Shannon Entropy
For a random variable X, the most popular Shannon entropy is defined as

H(X) = -\sum_{x} p(x) \log p(x).

For more than one variable, one can suitably construct the joint or conditional entropy measures. For example, the joint Shannon entropy of three random variables X, Y, and Z is

H(X, Y, Z) = -\sum_{x,y,z} p(x, y, z) \log p(x, y, z),

and the conditional Shannon entropy of X given the two random variables Y and Z is

H(X \mid Y, Z) = -\sum_{x,y,z} p(x, y, z) \log p(x \mid y, z).

Note that it quantifies the average residual entropy of X when the values of Y and Z are known.
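These definitions can be checked numerically with a short sketch (illustrative helper functions, not part of the sc-REnF software); the chain rule H(X | Y, Z) = H(X, Y, Z) − H(Y, Z) is used for the conditional form.

```python
import numpy as np

def shannon_entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), over outcomes with nonzero probability."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(pxyz):
    """H(X | Y, Z) = H(X, Y, Z) - H(Y, Z) for a joint table p[x, y, z]."""
    pxyz = np.asarray(pxyz, dtype=float)
    pyz = pxyz.sum(axis=0)                 # marginalize out X
    return shannon_entropy(pxyz.ravel()) - shannon_entropy(pyz.ravel())
```

For a uniform distribution over four outcomes the entropy is exactly 2 bits, and for X independent of (Y, Z) the conditional entropy equals H(X).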

Renyi entropy
The Renyi entropy was first introduced by Alfred Renyi 44 in the context of information science. The Renyi entropy of a random variable X is defined in terms of a non-negative real number q, with q ≠ 1, as

H_q(X) = \frac{1}{1-q} \log \sum_{x} p(x)^q.

Interestingly, this Renyi entropy reduces to the Shannon entropy when q → 1. It can also be extended to three random variables X, Y, and Z, so that their joint Renyi entropy is given by

H_q(X, Y, Z) = \frac{1}{1-q} \log \sum_{x,y,z} p(x, y, z)^q.

Accordingly, a conditional Renyi entropy of X given Y and Z can be defined. The Rényi entropy is a decreasing function of q. It can be shown that the conditional Renyi entropy closely corresponds to a risk function, i.e., the expected error when we try to estimate the value of X once we know the values of Y and Z. We refer to the associated risk as the Renyi risk function.
Renyi min-entropy: An important special case of the Renyi entropy family is the Renyi min-entropy, corresponding to the case q → ∞; it is also arguably the most traditional way of measuring the unpredictability of a set of outcomes. The Renyi min-entropy of X has the form

H_\infty(X) = -\log \max_{x} p(x),

and accordingly the conditional Renyi min-entropy of X given Y and Z is given by

H_\infty(X \mid Y, Z) = -\log \sum_{y,z} \max_{x} p(x, y, z).
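A small numerical sketch of the unconditional quantities (illustrative helpers, not the paper's code) confirms the key properties: for a uniform distribution H_q equals log2 of the support size for every q, the q → 1 limit recovers Shannon entropy, and H_q is non-increasing in q.

```python
import numpy as np

def renyi_entropy(p, q):
    """H_q(X) = log2(sum_x p(x)^q) / (1 - q), for q > 0, q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log2(np.sum(p ** q)) / (1.0 - q)

def renyi_min_entropy(p):
    """H_inf(X) = -log2 max_x p(x): the q -> infinity limit."""
    return -np.log2(np.max(np.asarray(p, dtype=float)))
```

For p = (0.5, 0.25, 0.25), Renyi entropy at q close to 1 approaches the Shannon value of 1.5 bits.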

Tsallis entropy
Tsallis entropy is another generalization of Shannon entropy, developed in the context of statistical mechanics, which yields the q-normal distribution as an equilibrium probability distribution 45 .
Mathematically, the Tsallis joint entropy of three random variables X, Y, and Z is defined in terms of a tuning parameter q > 0 (q ≠ 1) as

S_q(X, Y, Z) = \frac{1}{q-1} \left( 1 - \sum_{x,y,z} p(x, y, z)^q \right).

Accordingly, we define the conditional Tsallis entropy of the random variable X given the values of the random variables Y and Z. Next, we define a Tsallis risk function, i.e., the expected error when we try to estimate the value of X once we know the values of Y and Z, and refer to it as the Tsallis risk function. It is defined in terms of the joint escort probability distribution 46 , given by

P_e(j, k) = \frac{p(j, k)^q}{\sum_{j,k} p(j, k)^q}.

In the feature selection procedure, each candidate feature is evaluated together with the feature subset S selected at the previous step. The dependency measure is evaluated using an appropriate entropy measure E, as described below: we use the Shannon, Renyi, and Tsallis entropies as the specific choices for E.
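The Tsallis entropy and the escort distribution can be sketched numerically (illustrative helpers, not the sc-REnF implementation). Note that in the q → 1 limit the Tsallis entropy recovers the Shannon entropy in natural-log units, and the escort distribution of a uniform distribution is again uniform.

```python
import numpy as np

def tsallis_entropy(p, q):
    """S_q(X) = (1 - sum_x p(x)^q) / (q - 1), for q > 0, q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def escort_distribution(p, q):
    """Escort distribution P_e proportional to p^q (renormalized)."""
    p = np.asarray(p, dtype=float)
    pq = p ** q
    return pq / pq.sum()
```

Raising probabilities to the power q and renormalizing de-emphasizes rare events for q > 1, which is one source of the noise robustness that the tuning parameter provides.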
Feature relevance: Feature f_i is more relevant to the class label C than feature f_j in the context of the already selected subset S when E(C | f_i, S) > E(C | f_j, S), where E(·|·,·) is a (bivariate) conditional entropy function.

Feature redundancy:
If feature f_j shares more similar information with feature f_i than feature f_{j+1} does, then feature f_j is redundant with respect to feature f_i given the information about the class label C; this is characterized by a smaller conditional entropy E(f_j | f_i, C), where E(·|·,·) is an appropriate conditional entropy.
Objective function: To select each feature, we minimize the conditional entropy between f_i ∈ (F − S) and f_s ∈ S (to reduce the redundancy between them) and maximize the conditional entropy between the class label C and f_i ∈ (F − S), where f_s ∈ S is an already selected feature.
The selected feature subset S and the feature f_i ∈ (F − S) are inductively defined for the Renyi entropy

as below. Our proposed algorithms for Renyi and Tsallis entropy are optimal in the sense of minimizing the corresponding risk functions, defined in Equations 6 and 11, respectively; this is stated by the following proposition. Theorem 0.1. At every step, the selected feature f_{i+1} minimizes the Renyi and Tsallis risk of classification among the candidate features not yet in the selected feature subset S.
Proof. In order to prove Theorem 0.1, we start from the objective function in Equation 15.
According to our objective function, the dependency measure is evaluated using an appropriate entropy measure E; here we use the Renyi and Tsallis entropies as the specific choices for E.
Let u, v, and v' represent generic value tuples of f_s ∈ S (selected features), f_{i+1} (the feature to be selected at the (i+1)-th step), and f ∈ (F − S) (non-selected features), respectively. Substituting this generic representation into Equation 17, the minimization of the Renyi risk function can be proved.
Multiplying Equation 20 by the constant k = q/(1 − q) and then applying the definition of the Renyi risk function (Equation 6 of the main paper), it can be shown that the selected feature minimizes the Renyi risk. Thereafter, substituting the generic representation into Equation 17, the minimization of the Tsallis risk function is proved analogously.

Simulation Parameter Settings in this Study
Simulation parameters for all four methods are summarized here. sc-REnF requires the number of features (genes) to be selected as an input parameter. We select the q value 0.3 for Tsallis and 0.7 for Renyi entropy from the synthetic data simulation study discussed in the Results section. The following two entropy-based methods are compared with the proposed method sc-REnF: 1) Shannon entropy, and 2) Min Renyi entropy. Single-cell consensus clustering (SC3) 3 is employed to validate the informative genes.
SC3 clustering is a tool for the unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by uniformly integrating different clustering solutions through a consensus approach.

Validation Metrics used in this Study
The efficacy of the proposed method is compared with two competitive methods on seven single-cell RNA-seq datasets. To verify the proposed method (sc-REnF), the performance metrics are: 1) marker gene selection; 2) the Adjusted Rand Index (ARI), which measures the similarity between two data clusterings; and 3) stability, using the non-parametric Kruskal-Wallis test.
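The stability comparison can be sketched with SciPy's Kruskal-Wallis test on repeated-run ARI scores. The score distributions below are hypothetical numbers invented for illustration, not results from the paper.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical ARI scores from 20 repeated runs of three methods
rng = np.random.default_rng(0)
ari_method_a = rng.normal(0.80, 0.02, size=20)
ari_method_b = rng.normal(0.79, 0.02, size=20)
ari_method_c = rng.normal(0.60, 0.02, size=20)
stat, pval = kruskal(ari_method_a, ari_method_b, ari_method_c)
# A small p-value indicates the methods' score distributions differ
```

Being rank-based, the Kruskal-Wallis test makes no normality assumption on the ARI scores, which suits the small number of repetitions typically available.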