Generating realistic cell samples for gene selection in scRNA-seq data: A novel generative framework

High-dimensional, small-sample-size (HDSS) scRNA-seq data poses a challenge to the gene selection task in single-cell analysis. Conventional gene selection techniques are unstable and less reliable due to the small number of available samples, which in turn affects cell clustering and annotation. Here, we present an improved version of the generative adversarial network (GAN), called LSH-GAN, that addresses this issue by producing new realistic samples and combining them with the original scRNA-seq data. We update the training procedure of the generator of the GAN using locality sensitive hashing, which speeds up sample generation and thus makes it feasible to apply gene selection procedures to high-dimensional scRNA-seq data. Experimental results show a significant improvement in the performance of benchmark feature (gene) selection techniques on generated samples of one synthetic and four HDSS scRNA-seq datasets. A comprehensive simulation study confirms the applicability of the model to feature (gene) selection in HDSS scRNA-seq data.

Availability: The corresponding software is available at https://github.com/Snehalikalall/LSH-GAN


Introduction
Identifying essential features is a persistent problem in machine learning, generally known as the feature selection problem [1]. Recently, the emergence of high-dimensional biological data such as single-cell RNA sequencing (scRNA-seq) data has posed a significant challenge to machine learning researchers [2,3]. Handling high-dimension, small-sample-size (HDSS) data is difficult for feature selection (FS), which in turn poses a problem for classification techniques: it reduces classification accuracy and increases the risk of overfitting. A few outliers can drastically affect FS techniques, and the selected feature sets may not be adequate to discriminate the classes [4]. Moreover, high dimensionality increases the computational time beyond acceptability. So, feature selection is essential to reduce the dimensionality of the data for further processing.

HDSS data is prevalent in the single-cell domain due to the budgetary constraints of single-cell experiments. The general pipeline of scRNA-seq downstream analysis starts with preprocessing and normalization, and each stage has an effect on the next stage of analysis. The gene selection step identifies the most relevant features/genes from the normalized/preprocessed data and has an immense impact on cell clustering. The general procedures for selecting relevant genes, which are primarily based on high variation (highly variable genes) [10,11] or significantly high expression (highly expressed genes) [5], suffer from a small-sample effect. General FS techniques also fail to provide a stable and predictive feature set in this case due to the ultra-large number of features (genes). One way to address this issue is a robust and stable technique that does not overfit the data. A few recent attempts [12-14] embed statistical and information-theoretic approaches.
Although these methods yield stable features, they do not perform well on small-sample scRNA-seq data.

In this paper, we propose a generative model to address the problem of feature (gene) selection in HDSS scRNA-seq data. Note that if the sample size is sufficiently large, the selected feature sets have a high probability of containing the most relevant and discriminating features. We use a generative adversarial model to generate more samples and thereby balance feature and sample size. The Generative Adversarial Network (GAN) [15] has already been shown to be a powerful technique for learning and generating complex distributions [16,17]. However, the training procedure of GAN is difficult and unstable, because the generator and the discriminator are trained simultaneously in a game that requires a Nash equilibrium to complete the procedure. Gradient descent sometimes reaches this equilibrium and sometimes does not, which results in a costly, time-consuming training procedure. Our main contribution here is a modification of the generator input that results in a fast training procedure. We create a subsample of the original data based on the locality sensitive hashing (LSH) technique and augment it with the noise distribution; this is given as input to the generator. A related approach, the conditional GAN, operates by conditioning the conventional model on additional data sources (such as class labels or data from different modalities) to dictate the data generation. In our model, we direct our attention to additional sample generation from HDSS data.

However, the required number of generated samples grows with the number of features, and generating them may not be feasible for conventional generative models.

Augmenting a subsample of the real data distribution (p_data(x)) with the prior noise (p_z(z)) makes the training procedure of our model faster than that of the conventional GAN. We theoretically prove that the global minimum value of the virtual training criterion of the generator is less than that of the traditional GAN (< −log 4).

Summary of contributions: Here, we provide the following novelties:

- The proposed model is the first to address the problem of gene selection in HDSS scRNA-seq data using a generative model.

- LSH-GAN can generate realistic samples faster than the traditional GAN. This makes LSH-GAN more feasible for the feature (gene) selection problem of scRNA-seq data.

- We derive a new way of training the generator that combines subsamples of the original data with pure noise and takes this as input.

The generative adversarial network (GAN) was introduced in [15] as a way to train a generative model. A GAN consists of two blocks: a generative model (G) that learns the data distribution (p(x)), and a discriminative model (D) that estimates the probability that a sample came from the training data (X) rather than from the generator (G).

These two models can be non-linear mapping functions, such as two neural networks. To learn the generator distribution p_g over data x, the generator G builds a differentiable mapping function from a prior noise distribution p_z(z) to the data space, G(z; θ_g). The discriminator function D(x; θ_d) returns a single scalar representing the probability that x comes from the real data rather than from the generator distribution p_g.

The goal of the generator is to fool the discriminator, which tries to distinguish between true and generated data. Training of D ensures that the discriminator can properly distinguish samples coming from the training set and from the generator. G and D are trained simultaneously: G to minimize log(1 − D(G(z))) and D to maximize log(D(x)). This forms a two-player min-max game with value function

min_G max_D V(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].

In the following, we describe the workflow of our analysis pipeline. First, locality sensitive hashing (LSH) is applied to the real data, and a k-nearest neighborhood graph (k-nn graph) is constructed using k = 5 for each data point.

This step computes the Euclidean distances between a query point and its candidate neighbors. Sampling is carried out in a 'greedy' fashion: each data point is traversed sequentially, and its five nearest neighbors are flagged and never visited again. Thus, after one traversal a subset of samples is obtained, which is further down-sampled by performing the same step iteratively.
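The greedy traversal above can be sketched as follows. This is a minimal NumPy version, with exact Euclidean k-nearest-neighbour lists standing in for the LSH candidate lists; the function name `greedy_knn_subsample` is ours.

```python
import numpy as np

def greedy_knn_subsample(X, k=5, t=1):
    """Greedy down-sampling sketch: visit points in order; keep a point if it
    has not been flagged, then flag its k nearest neighbours so they are never
    visited again. Repeat t times on the surviving subset.
    (Exact Euclidean k-nn stands in for the LSH candidate lists here.)"""
    idx = np.arange(len(X))
    for _ in range(t):
        # pairwise distances among the surviving points
        D = np.linalg.norm(X[idx, None, :] - X[None, idx, :], axis=-1)
        np.fill_diagonal(D, np.inf)            # a point is not its own neighbour
        nn = np.argsort(D, axis=1)[:, :k]      # k nearest neighbours per point
        visited = np.zeros(len(idx), dtype=bool)
        keep = []
        for i in range(len(idx)):
            if not visited[i]:
                keep.append(i)                 # select this point
                visited[nn[i]] = True          # flag its k neighbours
        idx = idx[keep]
    return idx

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
sub = greedy_knn_subsample(X, k=5, t=1)
```

Since each selected point removes at most k neighbours, one traversal keeps at least n/(k+1) of the n points.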

Generator of LSH-GAN. The generator function (G) is modified by augmenting its input data. Instead of giving pure noise (p_z(z)) as input, we augment it with a subsample of the real data distribution (p_data(x)); the sampling of the input data is done in the LSH step. The generator (G) thus builds a mapping function from z̃ to the data space (x) as G(z̃; θ_g), defined as G(.) : z̃ → x. By modifying the generator in this way, we claim that it can increase the probability of generating realistic samples in less time.
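Building the modified generator input z̃ can be sketched as follows. Concatenating the LSH subsample with Gaussian noise along the feature axis is our reading of the augmentation; the function name `generator_input` and the dimensions are illustrative, not from the paper.

```python
import numpy as np

def generator_input(x_sub, noise_dim, rng):
    """Build the augmented generator input z~: pair each row of the LSH
    subsample of the real data with a draw of Gaussian prior noise p_z(z),
    so every generator input carries both noise and real-sample information.
    (Feature-axis concatenation is one plausible reading of the paper.)"""
    noise = rng.normal(size=(x_sub.shape[0], noise_dim))   # p_z(z)
    return np.concatenate([noise, x_sub], axis=1)          # z~ fed to G

rng = np.random.default_rng(1)
x_sub = rng.normal(size=(8, 20))        # subsample of real data p_data(x)
z_tilde = generator_input(x_sub, noise_dim=16, rng=rng)
```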

Discriminator of LSH-GAN. The discriminator (D) takes both the real data p_data(x) and the generated data coming from the generator (G(z̃), with probability density p_z̃(z̃)) and returns a scalar value D(x) that represents the probability that the data x comes from the real data. So, the value function can be written as

L(G, D) = E_{x∼p_data(x)}[log D(x)] + E_{z̃∼p_z̃(z̃)}[log(1 − D(G(z̃)))].

D and G form a two-player minimax game with value function L(G, D). We train D to maximize the probability of correctly identifying real and generated data, and simultaneously train G to minimize log(1 − D(G(z̃))), where G(z̃) represents the data generated by the generator taking the noise (p_z) and the sampled data p_{x_s}(x_s) as input.
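A Monte-Carlo estimate of this value function from discriminator outputs on real and generated batches can be sketched as follows (the function name is ours). A perfectly confused discriminator, D ≡ 0.5, gives the familiar value −log 4.

```python
import numpy as np

def value_function(d_real, d_fake):
    """Monte-Carlo estimate of
    L(G, D) = E_x[log D(x)] + E_z~[log(1 - D(G(z~)))],
    given discriminator outputs on real and generated batches."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At D = 0.5 everywhere, L = log(1/2) + log(1/2) = -log 4
v = value_function([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```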

Feature/gene selection using LSH-GAN. The generated cell samples are utilized for the gene selection task. We employ three feature selection methods (CV² index, PCA loading, and Fano factor), all widely used for gene selection in scRNA-seq data. The single-cell clustering method SC3 is used to validate the genes selected from the generated samples.
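Two of the dispersion-based scores named above can be sketched as follows on a cells-by-genes matrix. This is a minimal NumPy version; the paper's exact implementations (and its PCA-loading scoring) may differ in normalization.

```python
import numpy as np

def fano_factor(X):
    """Per-gene Fano factor (variance / mean) on a cells x genes matrix."""
    mu = X.mean(axis=0)
    return X.var(axis=0) / mu

def cv2_index(X):
    """Per-gene squared coefficient of variation (variance / mean^2)."""
    mu = X.mean(axis=0)
    return X.var(axis=0) / mu**2

def top_genes(score, n):
    """Indices of the n highest-scoring genes."""
    return np.argsort(score)[::-1][:n]

# Toy count matrix: 200 cells x 50 genes with increasing mean expression
rng = np.random.default_rng(2)
X = rng.poisson(lam=np.linspace(1, 10, 50), size=(200, 50)).astype(float)
scores = fano_factor(X)
sel = top_genes(scores, n=10)
```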

The whole algorithm and the sampling procedure are described in the 'LSH-GAN algorithm'.
1: for number of training iterations do
2:   x_s = LSH-SAMPLING(x, k, t)
3:   augment p_{x_s}(x_s) with the prior noise p_z(z) and give this (p_z̃(z̃)) to the generator G
4:   give the real data p_data(x) and the generated data p_g(x) to the discriminator D
5:   update the discriminator D  {the adaptive momentum gradient descent rule is used in our experiment}
6:   update the generator G
7: end for
8: procedure LSH-SAMPLING(x, k, t)
9:   execute locality sensitive hashing (LSH) on x and prepare a k-nearest-neighbour list for each data point
10:  for number of iterations of sub-sampling t do
11:    visit each data point sequentially in the order it appears in the data
12:    if the data point has not been visited earlier, select it and discard all its k neighbours from its nearest-neighbour list
13:  end for
14: end procedure

Following the above section, a sub-sample of the real data, p_{x_s}(x_s), is augmented with the prior noise distribution p_z(z). Due to this additional information given to the generator, we assume that the probability D(G(z̃)) increases by a factor ζ.
Proof. Equation 2 can be written as:

L(G, D) = ∫_x p_data(x) log(D(x)) dx + ∫_x p_g(x) log(1 − D(x) − ζ) dx.

We know that the function y = a log(x) + b log(1 − (x + ζ)) attains its maximum at x = a(1 − ζ)/(a + b), for any (a, b) ∈ R² \ {0, 0} and ζ ∈ (0, 1). So, the optimal value of D for a fixed generator G is:

D*_G(x) = (1 − ζ) p_data(x) / (p_data(x) + p_g(x)).

The training objective for the discriminator D is to maximize the log-likelihood of the conditional probability P(Y = y | x), where Y signifies whether x comes from the real data distribution (y = 1) or from the generator (y = 0).

April 23, 2021 5/14

Now Equation 2 can be written as:

C(G) = ∫_x p_data(x) log[(1 − ζ) p_data(x) / (p_data(x) + p_g(x))] dx + ∫_x p_g(x) log[(1 − ζ) p_g(x) / (p_data(x) + p_g(x))] dx.

Theorem. At p_g(x) = p_data(x) (the global minimum criterion of the value function L(G, D)), the value of C(G) is less than −log 4.

Proof. From Equation 6 we get

C(G) = −log 4 + 2 log(1 − ζ) + 2 · JSD(p_data(x) || p_g(x)),

where JSD(p_data(x) || p_g(x)) represents the Jensen-Shannon divergence between the two distributions p_data and p_g. Now, if the two distributions are equal, the Jensen-Shannon divergence (JSD) is zero. Thus, for the global minimum criterion of the value function (p_g = p_data), Equation 7 reduces to

C(G) = −log 4 + 2 log(1 − ζ) < −log 4, since ζ ∈ (0, 1).

This completes the proof.
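The two claims in this derivation can be checked numerically: the maximizer of y = a log(x) + b log(1 − (x + ζ)) sits at x* = a(1 − ζ)/(a + b), and the resulting minimum −log 4 + 2 log(1 − ζ) lies below −log 4 for any ζ ∈ (0, 1). The constants below are arbitrary test values.

```python
import numpy as np

# Check the maximizer of y(x) = a*log(x) + b*log(1 - (x + zeta))
a, b, zeta = 0.7, 0.3, 0.2
xs = np.linspace(1e-6, 1 - zeta - 1e-6, 200001)   # feasible domain of D(x)
y = a * np.log(xs) + b * np.log(1 - (xs + zeta))
x_star = a * (1 - zeta) / (a + b)                 # claimed analytic maximizer
numeric_argmax = xs[np.argmax(y)]

# Value of C(G) at the global minimum p_g = p_data
c_min = -np.log(4) + 2 * np.log(1 - zeta)
```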

For experimental validation, we first validate our proposed model on synthetic data.

The aim is to assess the performance of LSH-GAN in generating realistic samples. Next,

we validate whether the generated samples can be used in the feature selection task.

The raw count matrix M ∈ R^{c×g}, where c and g represent the number of cells and genes, respectively, is normalized using the Linnorm [30] Bioconductor package of R. We select cells having more than a thousand expressed genes (non-zero values) and choose genes having a read count greater than 5 in at least 10% of the cells. We compute the Wasserstein distance between the real data distribution (p_data) and the generated data distribution (p_g) to estimate the quality of the generated data.

3.5 Gene selection in HDSS scRNA-seq data using LSH-GAN

We trained the LSH-GAN model on four small-sample scRNA-seq datasets (see Table 1).
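The filtering thresholds and the Wasserstein check described above can be sketched as follows. This is a minimal NumPy version: the function names are ours, and the 1-D empirical Wasserstein-1 distance (for equal-size samples) stands in for whichever implementation the paper used.

```python
import numpy as np

def filter_cells_genes(M, min_genes=1000, min_count=5, min_frac=0.10):
    """Filtering sketch: keep cells with > min_genes expressed (non-zero)
    genes, then keep genes with read count > min_count in at least min_frac
    of the remaining cells (thresholds as stated in the paper)."""
    cells = (M > 0).sum(axis=1) > min_genes
    M = M[cells]
    genes = (M > min_count).mean(axis=0) >= min_frac
    return M[:, genes]

def wasserstein_1d(u, v):
    """Empirical 1-D Wasserstein-1 distance between equal-size samples:
    mean absolute difference of the sorted values."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)))

# Toy check with relaxed thresholds so the small example survives filtering
M = np.array([[0, 6, 6, 0],
              [6, 6, 0, 0],
              [6, 6, 6, 6]])
F = filter_cells_genes(M, min_genes=1, min_count=5, min_frac=0.5)
w = wasserstein_1d(np.array([0.0, 1.0, 2.0]), np.array([1.0, 2.0, 3.0]))
```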

Here, a sub-sample of the real data distribution is augmented with prior noise and used as the input to the generator network. The data generated using LSH-GAN (with k = 5) is validated by computing the Wasserstein metric between the real and generated data distributions for different epochs (see Figure 3). We then select the optimal number of epochs (e_opt) and the optimal sample size (S_opt) for generating samples from the scRNA-seq data. The aim is to know whether the features/genes selected from the generated combined data can lead to a pure clustering of cells. Table 3 shows the comparison of the ARI values resulting from the cell clustering. It is evident from the table that the features/genes selected from the generated combined data of the LSH-GAN model (with e_opt and S_opt) produce better clustering results than those of the traditional GAN model; here, the two models are trained with the same number of epochs.

We provide detailed clustering results on the four datasets using the genes selected from the generated samples. For this, we adopted the widely used single-cell clustering method SC3 [29]. This demonstrates the quality of the cell samples generated by the LSH-GAN model.
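The Adjusted Rand Index (ARI) used to compare clusterings above can be sketched with the standard pair-counting formula. This minimal NumPy version is a stand-in for library implementations such as scikit-learn's `adjusted_rand_score`; the labels below are toy values, not from the paper.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index via the pair-counting contingency table:
    ARI = (Index - Expected) / (Max - Expected)."""
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    table = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                       for k in clusters] for c in classes])
    sum_comb = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))  # true classes
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))  # clusters
    n_pairs = comb(len(labels_true), 2)
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    return (sum_comb - expected) / (max_index - expected)

# A relabelled but identical partition scores a perfect ARI of 1
true_labels = np.array([0, 0, 1, 1])
pred_labels = np.array([1, 1, 0, 0])
ari = adjusted_rand_index(true_labels, pred_labels)
```

ARI corrects the raw Rand index for chance agreement, so random label assignments score near zero and can even go negative.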

One limitation of our method is that, for feature selection, we found hardly any linear relationship between the clustering results and the sample size of the generated scRNA-seq data. The correct sample size should be selected by trying a range of values between 0.25p and 1.5p, where p is the feature size. Different parameters related to single-cell clustering (the SC3 method) and feature selection (e.g., the FS method used, the number of selected features, etc.) may also play a critical role in the clustering performance. However, we found that the clustering results are always better for generated data with a sample size greater than 1p (p is the feature size). This observation suggests that for feature selection in HDSS data, whenever we produce more samples than features, we end up with better clustering. The feasibility of generating such samples is justified by the faster training procedure of the LSH-GAN model.

Taken together, the proposed model can generate good-quality cell samples from HDSS scRNA-seq data in fewer iterations than the traditional GAN model. The results show that LSH-GAN not only excels at cell sample generation for scRNA-seq data but also accelerates gene selection and cell clustering in the downstream analysis. We believe that LSH-GAN may become an important tool for computational biologists to explore realistic cell samples of HDSS scRNA-seq data and their applications in downstream analysis.