Optimization and redevelopment of single-cell data analysis workflow based on deep generative models

The present single-cell RNA sequencing(scRNA-seq) analysis pipelines require a combination of appropriate normalization, dimension reduction, clustering, and specific-gene analysis algorithms, but the rationale for the choice of these algorithms is relatively subjective because of the lack of ground truth assessment conclusions. As the number of captured single-cells increases, the number of different types of noise cells also increases, which can strongly affect the analysis efficiency. For scRNA-seq, a technology that generates data through multi-process operations, the deep generative model should be a good choice for this type of data analysis, allowing simultaneous estimation of multiple unobservable parameters assumed in the data generation process. Hence, in our study, we sequenced a pool of pre-labeled single cells to obtain a batch of scRNA-seq data with main and fine labels, which was then used to evaluate the clustering and specific-gene analysis methods. Afterward, we applied two deep generative models to infer the probabilities of pseudo and impurity cells. And by stepwise removing the inferred noise cells, the clustering performance and the consistency of different specific-gene analysis methods are both greatly improved. After that, we applied Deep-LDA (a latent Dirichlet allocation-based deep generative model) to scRNA-seq data analysis. And this model takes the count matrix as input, and makes the classification and specific gene optimization process mutually dependent, which has more practical sense and simplifies the analysis workflow. At last, we successfully implemented the model with transferred knowledge to make single-cell annotation and verified its superior performance.


29
ScRNA-seq has grown to be the primary approach for research on histo-differentiation 1 , brain structure 2 , and 30 tumor immunology 3, 4, 5, 6 . However, in practice, the outcomes generated from this data type are frequently unstable 7 . 31 The major reason is that the obtained data which was sequenced from a huge number of single cells, is insufficient 32 to cover the whole transcriptome feature of each cell, leading to a high 0-inflation ratio and technical noise 8 . 33 Moreover, the transcripts of different cell types have poor overlap, causing a more complicated distribution in 34 observed data. 35 Many algorithms have been developed to reduce the noise within and between the cells. BASiCS 9 and Scran 10 36 estimate a specific scale factor for normalization. MNN algorithm 11 has been employed to integrate data from 37 multiple batches. DCA 12 , scVI 13 , and ZINB-WaVE 14 infer noise-free expression values with generative models. And 38 SCTransform 15 corrects the dependence between gene expression values and sequencing depths by means of 39 regularized negative binomial distribution regression. The gains from these methods are remarkable for the current 40 clustering and specific gene analysis. But for these methods reshaping observed expression value, there is a high 41 possibility of eliminating some useful information when increasing the positive signal in their assumption. 42 The commonly used single-cell clustering procedures work on the cell nearest neighbor network by KNN and 43 SNN, and then utilize the community-detection algorithms 16 , such as louvain 17 and leiden 18 , to iteratively optimize 44 cell clusters. And the existed noise cells which distributed widely separated from the target cells in latent space, such 45 as pseudo cells that capture environment RNA 19 and bits of impurity cells, can largely affect cell neighbor 46 relationship estimation and clustering iteration. Thus, removing these noise cells will theoretically improve the 47 the clustering scores for all procedures (Fig 2b; Fig S3a-f). And the overall scores of louvain-, leiden-and 104 densitypeak 34 -based clustering were outstanding and comparable, which we then suggested as the first echelon 105 methods. While DDRTree 35 -and Consensus 36 -based clustering were seriously affected by noise and the overall 106 scores were lower. In addition, with noise cell removal, the consistency between the distribution of annotated cluster 107 and corresponding real cell type was improved on main cell labels (Fig 2c left, Fig S4), while poorly improved on 108 fine cell labels which present more irregular shape within the main label distributed area, such as Naive CD4+ and 109 Memory CD8+ T cells (Fig 2c right, Fig S4). We deduced that the traditional clustering algorithm based on Euclidean 110 distance is difficult to fit irregularly distributed cells accurately. The consistency between different specific-gene analysis procedures 121 We employed 5 specific-gene analysis procedures to reanalyze the specific genes for the 10 types of immune cells 122 (Methods). The intersection of obtained specific-gene lists showed that t.test-, limma 37 -, and wilcox-based 123 procedures had high consistency (Fig 3a part 2-3), while FC.rank (Fold Change) and LRT-based (Likelihood Ratio 124 Test) procedures had low consistency with all other procedures (Fig 3a part 4-6). And the gene lists from the 3 high 125 consistency methods showed more gene intersection with the prior collected specific-gene lists (Fig 3a part 1; Fig S5; Table S8), which we then nominated as primary methods. Moreover, with the noise cell removal, the overall 127 consistency of the 5 methods was improved (Fig. 3b), which was mainly demonstrated by significantly reduced non-128 intersection part (Fig. 3b left 1) and significantly increased co-intersection part (Fig 3b right 1 Table S9). We applied the model on phase I data, and finally obtained 141 24 classes merged from the initial set 50 classes according to the class-specific genes intersection (Fig S6a; Methods). 142 The annotated class distributions were in high concordance with the real cell distribution (Fig 4c). And the inferred 143 specific genes showed high specificity for different classes (Fig 4d; Fig S6b). Then, we re-analyzed class-specific 144 genes by the 5 procedures tested before, and intersected the outcomes with Deep-LDA specific gene lists. The 145 intersection result indicated the specific genes of Deep-LDA had a high degree of consistency with that of the 3 146 primary methods nominated before (Fig 4d), which suggested Deep-LDA could also be a primary choice for specific 147 gene analysis. 148 Deep-LDA model was applied on the 3-phase data, whose clustering results had a high consistency with the real 160 distribution at all phases (Fig 5a; Fig S7a). From the annotated memory CD8+ T cell distribution, we found that the 161 distribution shape drawn from this model was more similar with the real distribution shape, and did not form a blocky 162 distribution like other clustering procedures (Fig 5a right, 5b), which suggested Deep-LDA has a higher nonlinear 163 fitting ability. Then, we applied the 13-clustering metrics to compare the Deep-LDA model with the other 5 clustering 164 methods. And the external evaluation score lines of the Deep-LDA model were located above or near the upper lines 165 of other methods' borders, and exceeded more especially on the phase I part, which indicated Deep-LDA had an 166 outstanding performance and should be one of the first echelon methods (Fig 5c; Fig S7b). In addition, the linear 167 fitted slope of the evaluation score lines revealed that Deep-LDA had lower noise sensitivity than other methods ( Fig  168   5c; Fig S7b). The poor internal clustering scores of Deep-LDA indicated that the outcome of the model was not 169 optimized according to the uniform dimensionality reduction space which was the space for internal clustering 170 metrics calculation, but was optimized according to the inferred feature space of different classes. 171 The encoder in Deep-LDA is for topic assignment, which serves as the cell type classifier. And the decoder is to 182 generate the gene probability distribution, whose parameter was optimized to be class-specific gene frequency 183 (Methods). Thus, we can preload the gene frequencies from independent reference datasets into the decoder, and 184 then freeze the decoder in the parameter inference process, which can generate a single-cell classifier by transferred 185 knowledge. In this project, we used the centroids of the 10 types of cells and an independent immune cell reference 186 dataset 31, 38 as transferred knowledge (Fig 6a, b left), and then applied the optimized models to perform single-cell 187 annotation (Methods). From the distribution on the UMAP plot, the annotated cell of both inferred results matched 188 well with real related-cell distribution (Fig 6a, b right). After that, we compared the two results with the outcomes 189 from cellID 39 and singleR 40 annotation tools with the 13-clustering metrics (Fig S8a, b; Methods), which verified 190 the better performance of Deep-LDA transfer learning (Fig 6c; Fig S8c). 191 Clustering is the crucial and essential step in the current scRNA-seq data analysis workflow 20 . In this paper, 200 we mainly worked on clustering-related evaluation, optimization and redevelopment. Firstly, we applied scRNA 201 sequencing on a pool of pre-labeled immune cells, by which a batch of data with main and fine labels was generated 202 as ground truth for clustering evaluation. Then, we eliminated two types of noise cells inferred by deep generative 203 models, which improved clustering performance and the consistency of several specific-gene analysis methods. After 204 that, by applying the Deep-LDA model for scRNA-seq data analysis, which couples the classification process with 205 class-specific genes analysis process, that is, the two processes are mutually dependent, we successfully simplified 206 the analysis workflow, shortened the analysis time, and at last verified its superior clustering performance. 207 The prior knowledge of cell-specific genes is the basis for clustering annotation 20 . However, the recommended 208 reliable specific genes of the current databases were not ranked high and even not significant in our specific lists 209 generated from the pooled sequencing data, and particularly worse in that of the fine-label clusters. More than that, 210 the resolution that the scRNA-seq can achieve was determined by transcriptome features 41 , and it should be much 211 higher than the clusters classified by fine labels, which were defined by cell membrane proteins in our data. So, the 212 current quality and quantity of the specific genes in prior knowledge were insufficient for accurate single-cell 213 annotation 42 . In addition, the quality of the obtained specific gene from most of the current scRNA-seq articles is 214 unstable, because of the unstable clustering quality. Therefore, the more non-randomly captured or pre-labeled single-cells are sequenced, due to the ground truth existence, the more cell-specific gene with higher reliability will 216 be contributed to current databases. 217 The community-detection algorithm is the primary choice in the current clustering analysis, which was 218 confirmed by the superior performance of louvain-and leiden-based methods in our project. However, an earlier 219 applied method, the density peak clustering, which only grouped cells by the cell density on the two-dimension plot, 220 can basically achieve the same superior performance. The reason is that the community-detected clusters also tend 221 to form distinguishable distribution on UMAP/tSNE, which is the basis for density peak clustering. By specifically 222 surveying the cell distribution on UMAP/tSNE, we found that for the more subdivided cell types, their shapes will 223 be more irregular, and the boundaries between them will be more blurred. We deduced that this undistinguishable 224 distribution will limit the capabilities of both community detection and density peak clustering. The major reason is 225 that the feature space for classes with the different subdivided degrees is treated uniformly, which should have 226 different granularity between inter-main label clusters and inter-fine label clusters. Therefore, we recommend that 227 the initial clusters obtained should be merged into major clusters by similarity index. And then the high variance 228 genes within the major cluster can be re-screened, that is, using finer-grained features of inner-cluster cells, to make 229 a new round of clustering for fine clusters. 230 The positive contribution of inferred noise cells removal was confirmed in our study. In addition, unlike 231 reshaping expression values for normalization, the excess quality control of noise cells, such as removing the cells 232 in differentiated states, would also have a positive contribution. This deduction should be understood as that we 233 should first map out the class relationships among the more discriminative cells in the rough-grained latent space, 234 and then analyze the relationships of the cells of the less discriminative with these established broad classes. The 235 impurity cell inference model in our project can learn the class affiliation probability. In the result part, the cells with 236 higher impurity probability are always distributed in the junction between two clusters (Fig. S2b). Besides that, the 237 deep generative model is an effective tool for estimating various types of noise cells, which can make a flexible 238 assumption about how the noise cells are involved in the data generation, and then infers the noise probability of 239 each cell by the observation data 27 . In addition to the model inference which has to make an assumption, we found 240 that noise cells had distinguishable distribution patterns on UMAP/tSNE (Fig 2a), so manually selecting abnormally 241 distributed cells should also be considered a good way. 242 In the Deep-LDA model, the training parameter of specific gene frequency gradually showed different 243 granularity inter-and inner-major clusters, which illustrated that the model is suitable for making classification in 244 the data space of main and fine clusters. And the initial set parameter of the class number describes the resolution 245 for classification. The generative architecture of Deep-LDA in our project was the classical LDA architecture of 246 topic modeling and was not re-designed according to the characteristic of scRNA-seq data, such as incorporating the 247 parameter for controlling the 0-inflation ratio. The neural network structure of the variational model was 248 implemented by fully connected layers. In practice, we had tested CNN, RNN and slide layers in model construction, 249 but they did not show better results. Therefore, by incorporating a more suitable network for learning the 250 transcriptome data characteristics (functional groups and network structure, etc.) and incorporating more regulatory 251 parameters in the generative process, the model will have greater potential in the future.  (Table S1). The purification kit, surface markers of 265 cells, and heoretical purity are shown in Table S2. The purification process was conducted strictly according to the 266 manufacturer's instructions. The purified and enriched cells were separated for subsequent single-cell sequencing 267 and bulk sequencing (Figure 1a). 268 269 Pooled sample scRNA-seq. Each type of enriched and purified cell was mixed with a human universal antibody 270 conjugated with Sample Tag (a specific oligonucleotide sequence) for sample identification, which has common 5' 271 and 3' ends and the specific `Sample Tag sequence`, whose combination is as follows: 272 GTTGTCAAGATGCTACCGTTCAGAG | Sample Tag sequence | AA…...25…...AA 273 The relationship between cell type and `Sample Tag sequence` is shown in Table S4. After the mixture, 10 types of 274 labeled cells constituted a pooled sample for single-cell sequencing 43 . Single cells were captured by microwell plates 275 and were lysed, and mRNA was hybridized with primer sequences on Cell Capture Beads. The BD Rhapsody TM 276 Targeted mRNA and Abseq Amplification Kit were used for library preparation. The libraries were sequenced on 277 Illumina's sequencer. Information about Tag and WTA files was found in Table S5. 278 279 Bulk sequencing. 10 types of enriched and purified cells were respectively sampled (each sample has more than 280 100 cells) for bulk sequencing and repeated 3 times. A total of 30 samples were sequenced. The samples were lysed, 281 reverse transcribed into cDNA, and amplified for library preparation. SMART-Seq® HT Kit was used for cDNA 282 amplification. The quality of the cDNA library was evaluated, and 6 samples were excluded. Finally, 24 samples 283 (Table S6) were sequenced on Illumina's sequencer. 284 285 Specific genes collection. We collected the specific genes from the Cellmarker database 44 , PanglaoDB database 45 , 286 and SC2Disease database 46 . The selected entries for each database are shown in Table S7. The relevant specific genes 287 of immune cell 47 , T cell 48 , CD4+ T cell, CD8+ T cell 49 , and Naive T cell 50 were obtained from references. The 288 specific genes of Naive T cell were obtained from the intersection of Naive CD4+ T cell and Naive CD8+ T cell. 289 The specific genes of B cell were obtained from the intersection of Naive B cell and Memory B cell. The specific 290 genes appearing in two or more cell types were removed. Finally, a total of 460 specific genes for 16 cell types were 291 collected (Table S8). 292 293 Data processing. Fastq files of the two types of sequencing were quality-checked with FASTQC, and adapters were 294 removed by TrimGalore. For single-cell sequencing data, Cell barcode and UMI sequence were estimated and 295 extracted by UMItools in the R1 files, and were then inserted into the R2 files. The R2 files were processed by STAR 296 for single-end alignment. As for bulk-seq sequencing data, the alignment was paired-end. 297 The reference genome was GRCh38(www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39). The annotation 298 was Genecode v.29(www.gencodegenes.org/human/release_29.html). Read count was adjusted by the correspondent 299 UMIs. The true relationship between cell barcode and Sample Tag Sequence was estimated via Seurat's 300 Demultiplexing HTOs function (Demultiplexing with hashtag oligos) in Tag files 43 . 301 Then, the count matrix was log-transformed. 2000 genes with the maximum variance were selected and scaled. 302 The top 30 principal components (PCs) were obtained by PCA dimensional reduction. And the cell distribution was 303 visualized by UMAP and tSNE based on those PCs. For the 3-phase data with a stepwise noise cells removal, the 304 data processing was the same. 305

306
The composition of bulk-seq data. The 2000 genes from the data processing were used as gene signatures for 307 deconvolution. And then, the centroids of 10 types of immune cells were calculated to be the Signature gene file. 308 Then, we used CIBERSORT software to decompose the bulk-seq data by the signature gene file and to verify the 309 consistency between scRNA-seq data and bulk-seq data. Where the meaning of parameters unexplained was identical with those described above. 359 Davies-Bouldin index was to calculate the averaged inner-cluster distance of 2 clusters, which was divided by 360 the distance of the centroids of 2 clusters, and then found the maximum. The score tended to be lower when the 361 inner-cluster distance was smaller, and the intra-cluster distance was larger. The formula was the following: 362 DB 1 ∈ : ;

363
Where the meaning of parameters unexplained was identical with those described above. The value was calculated 364 by metrics davies_bouldin_score of sklearn. 365 366

Inter-cluster Evaluation Index 367
Purity was calculated as the following formula: 368

384
Where the meaning of parameters unexplained was identical with those described above. β was the hyperparameter 385 and was set to 1. 386 Fowlkes-Mallows index was the geometric mean of pair-wise precision and recall. 387