A universal deep neural network for in-depth cleaning of single-cell RNA-Seq data

Single cell RNA sequencing (scRNA-Seq) is widely used in biomedical research and has generated an enormous volume and diversity of data. The raw data contain multiple types of noise and technical artifacts, which need thorough cleaning. Existing denoising and imputation methods largely focus on a single type of noise (i.e., dropouts) and make strong distribution assumptions, which greatly limits their performance and application. Here we design and develop the AutoClass model, integrating two deep neural network components, an autoencoder and a classifier, so as to maximize both noise removal and signal retention. AutoClass is distribution agnostic: it makes no assumption about specific data distributions and hence can effectively clean a wide range of noise and artifacts. AutoClass outperforms state-of-the-art methods in multiple types of scRNA-Seq data analyses, including data recovery, differential expression analysis, clustering analysis, and batch effect removal. Importantly, AutoClass is robust to key hyperparameter settings, including bottleneck layer size, pre-clustering number and classifier weight. We have made AutoClass open source at: https://github.com/datapplab/AutoClass.


Introduction
scRNA-Seq has been widely adopted in biological and medical research 1-5 as an ultra-high resolution and ultra-high throughput transcriptome profiling technology. Enormous amounts of data have been generated, providing great opportunities as well as challenges in data analytics.

First of all, scRNA-Seq data come with multiple types of noise and quality issues. Some are issues associated with gene expression profiling in general, including RNA amplification bias, uneven library size, and sequencing and mapping errors. Others are specific to single-cell assays. For example, extremely small sample quantities and low RNA capture rates result in a large number of false zero expression values, or dropouts 6. Individual cells vary in differentiation or cell cycle stage 7, health condition or stochastic transcription activity; these are biological differences, but ones irrelevant to most studies. In addition, substantial batch effects are frequently observed 8 due to inconsistencies across sample batches and experiments. Most of these noises and variances are not dropout, and they may follow Gaussian, Poisson or more complex distributions depending on the source of the variance. All of these variances need to be corrected and cleaned so that biologically relevant differences can be reconstructed and analyzed accurately.

Multiple statistical methods have been developed to impute and denoise scRNA-Seq data. Most of these methods rely on distribution assumptions about the scRNA-Seq data matrix. For example, the deep count autoencoder (DCA) 9 assumes a negative binomial distribution with or without zero inflation, SAVER 10 assumes a negative binomial distribution, and scImpute 11 uses a mixture of Gaussian and Gamma models. Currently, there is no consensus on the distribution of scRNA-Seq data. Methods with inaccurate distribution assumptions 12 may not denoise properly, but rather introduce new complexities and artifacts. Importantly, these methods largely focus on dropouts and ignore other types of noise and variance, which hinders accurate analysis and interpretation of the data.

To address these issues, we developed AutoClass, a neural network-based method. AutoClass integrates two neural network components: an autoencoder and a classifier (Figure 1a and Methods). The autoencoder itself consists of two parts: an encoder and a decoder. The encoder reduces data dimension and compresses the input data by decreasing hidden layer size (number of neurons). The decoder, conversely, expands data dimension and reconstructs the original input data from the compressed representation by increasing hidden layer size. Note the encoder and decoder are symmetric in both architecture and function. The data is most compressed at the so-called bottleneck layer between the encoder and the decoder. The autoencoder itself, as an unsupervised data reduction method, is not sufficient to separate signal from noise (Figure 1b). To ensure the encoding process filters out noise and retains signal, we add a classifier branch from the bottleneck layer (Figure 1a and Methods). When cell classes are unknown, virtual class labels are generated by pre-clustering. Therefore, AutoClass is a composite deep neural network with both unsupervised (autoencoder) and supervised (classifier) learning components. AutoClass does not presume any type of data distribution, and hence has the potential to correct a wide range of noise and non-signal variances. In addition, it can model non-linear relationships between genes via non-linear activation functions. In this study, we extensively evaluated AutoClass against existing methods using multiple simulated and real datasets. We demonstrate that AutoClass better reconstructs scRNA-Seq data and enhances downstream analyses in multiple aspects. In addition, AutoClass is robust to hyperparameter settings, and the default settings apply well across various datasets and conditions.
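To make the architecture concrete, below is a minimal sketch of such a composite autoencoder-plus-classifier network in tf.keras. The layer sizes, activations and optimizer here are illustrative assumptions, and the (1 − w)/w loss weighting reflects the classifier weight w described in Methods; see the GitHub repository for the authors' actual implementation.

```python
# Minimal sketch of an AutoClass-style network: an autoencoder with a
# classifier branch from the bottleneck layer. Layer sizes and activations
# are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoclass(n_genes, n_classes, bottleneck=32, dropout_rate=0.1, w=0.9):
    x_in = layers.Input(shape=(n_genes,), name="expression")

    # Encoder: progressively compresses the input.
    h = layers.Dense(256, activation="relu")(x_in)
    h = layers.Dense(128, activation="relu")(h)

    # Bottleneck, with neuron dropout to prevent overfitting.
    b = layers.Dense(bottleneck, activation="relu", name="bottleneck")(h)
    b = layers.Dropout(dropout_rate)(b)

    # Decoder: symmetric to the encoder, reconstructs the input.
    h = layers.Dense(128, activation="relu")(b)
    h = layers.Dense(256, activation="relu")(h)
    x_out = layers.Dense(n_genes, activation="softplus", name="reconstruction")(h)

    # Classifier branch from the bottleneck; targets are (virtual) class labels.
    y_out = layers.Dense(n_classes, activation="softmax", name="classifier")(b)

    model = Model(inputs=x_in, outputs=[x_out, y_out])
    model.compile(optimizer="adam",
                  loss={"reconstruction": "mse",
                        "classifier": "categorical_crossentropy"},
                  loss_weights={"reconstruction": 1.0 - w, "classifier": w})
    return model

# Example: model = build_autoclass(n_genes=1000, n_classes=6)
```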

Validation of the classifier component
The unique part of AutoClass is the classifier branch from the bottleneck layer. Since the encoding process loses information in the input data, the classifier branch is added to make sure that relevant information, or signal, is sufficiently retained. To show that the classifier is needed, we simulated a scRNA-Seq dataset (Dataset 1; see Methods and Supplementary Table 2) using Splatter 13 with 1,000 genes and 500 cells in 6 groups, with and without dropout. We applied both AutoClass and a regular autoencoder without the classifier to the data with dropout; the results are illustrated in two-dimensional t-SNE plots (see Methods) in Figure 1b. AutoClass, but not the regular autoencoder, was able to recover the cell type pattern, indicating that the classifier component is necessary for reconstructing scRNA-Seq data.
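For reference, the kind of t-SNE comparison shown in Figure 1b can be produced with scikit-learn along these lines; `imputed_autoencoder`, `imputed_autoclass` and `labels` are hypothetical stand-ins for each model's output matrix (cells by genes) and the true group labels.

```python
# Sketch of a side-by-side t-SNE comparison of two imputed matrices.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(ax, data, labels, title):
    # Embed cells (rows) into two dimensions for visual inspection.
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, random_state=0).fit_transform(data)
    for grp in sorted(set(labels)):
        mask = labels == grp
        ax.scatter(emb[mask, 0], emb[mask, 1], s=5, label=f"group {grp}")
    ax.set_title(title)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
plot_tsne(axes[0], imputed_autoencoder, labels, "Autoencoder only")
plot_tsne(axes[1], imputed_autoclass, labels, "AutoClass")
axes[1].legend(markerscale=3, fontsize=8)
plt.tight_layout()
plt.show()
```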

Gene expression data recovery
We evaluated expression value recovery on simulated scRNA-Seq data with different noise types or distributions. We generated an scRNA-Seq dataset using Splatter with 500 cells and 1,000 genes in five cell groups, with dropout (raw data, Dataset 2) and without dropout (true data). From the same true data, we also generated 5 additional raw datasets by adding noise of different types. […] We also measured data recovery quality using other metrics. The mean squared error (MSE) between the true data and the imputed/denoised data was also computed for Datasets 3-7 (the 5 noise types other than dropout) (Figure 2e). Among the 5 tested methods, AutoClass consistently achieved the lowest MSE. […]

Differential expression analysis
We simulated scRNA-Seq Dataset 8 using Splatter with 1,000 genes and 500 cells in two cell groups; here the ground truth of truly differentially expressed genes is known. We applied a two-sample t-test to the true data, the raw data, and the data imputed by the different methods. The median t-statistic for the truly differentially expressed genes dropped from 5.79 in the true data to 2.11 in the raw data, and increased back to 5.86 upon imputation by AutoClass, almost the same as in the true data and higher than with all control methods (Table 1). […]

Clustering analysis
We compared K-means clustering results on the 200 most highly variable genes. The ground truth, i.e., the actual number of cell types, was used as the number of clusters. Clustering results were evaluated by four different metrics: adjusted Rand index (ARI) 20, Jaccard index (JI) 21, normalized mutual information (NMI) 22 and purity score (PS) 23. All of them range from 0 to 1, with 1 indicating a perfect match to the true groups. AutoClass is the only method that improved all four metrics over the raw data. In addition, AutoClass achieved the best clustering results for 3 out of 4 datasets […].

The AutoClass model
AutoClass takes the normalized expression matrix $\tilde{x}$ as input; the model components and total loss are:

$b = \mathrm{Encoder}(\tilde{x})$ (1)
$\hat{x} = \mathrm{Decoder}(b)$ (2)
$y = \mathrm{Classifier}(b)$ (3)
$L = (1 - w)\,L_r(\tilde{x}, \hat{x}) + w\,L_c(\hat{y}, y)$ (4)

where $b$, $\hat{x}$, $y$ and $L$ are the bottleneck representation, the output of the decoder (hence of the autoencoder), the output of the classifier, and the total loss, respectively; $\hat{y}$ is the pre-clustering cell type labels for $k$ clusters, $L_r$ is the reconstruction loss, $L_c$ is the classification loss, and $w$ is the classifier weight. $\tilde{x}$, the input of AutoClass, has been normalized over library size and log2-transformed with pseudo-count 1:

$\tilde{x}_{ij} = \log_2(X_{ij}/s_i + 1)$ (5)

where $X$ is the raw count matrix and the size factor $s_i$ for cell $i$ is equal to its library size divided by the median library size across cells. Library size is defined as the total number of counts per cell. The final imputed data is the average prediction of the autoencoder over the different cluster numbers $k$ used in pre-clustering:

$X_{\mathrm{imputed}} = E(\hat{x} \mid k)$ (6)

For all datasets in this manuscript, we used 3 consecutive cluster numbers […]. AutoClass is stable with respect to the classifier weight over the range 0.1-0.9 (Supplementary Figure 7). We found that in general the classification loss is far smaller than the reconstruction loss (Supplementary Figure 8); to better balance the two losses, we set the default value to $w = 0.9$. This default value was used for all the datasets in this paper.

In addition, overfitting is a common problem in neural network models. Dropout of neurons 26 in the bottleneck layer is used in AutoClass to prevent overfitting. Interestingly, a relatively high dropout rate in AutoClass also helps to correct batch effects. In the batch effect removal analyses, we set the dropout rate to 0.5 in AutoClass and, to be fair, also in DCA; even so, DCA was unable to remove the batch effect (Figure 5, Supplementary Figures 3-4). The default dropout rate of 0.1 in AutoClass was used in all the other datasets and analyses in this paper.
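To make the preprocessing and loss concrete, here is a minimal numpy sketch of Equations (4) and (5); the helper names are ours, and the exact form of the released code may differ (see the GitHub repository).

```python
# Sketch of library-size normalization (Eq. 5) and the weighted total loss (Eq. 4).
import numpy as np

def normalize(X):
    """Library-size normalization followed by log2 with pseudo-count 1 (Eq. 5)."""
    lib = X.sum(axis=1)                      # library size: total counts per cell
    s = lib / np.median(lib)                 # size factor per cell
    return np.log2(X / s[:, None] + 1.0)

def total_loss(x_tilde, x_hat, y_true_onehot, y_prob, w=0.9):
    """Weighted sum of reconstruction (MSE) and classification (cross-entropy)
    losses (Eq. 4); w = 0.9 is the default classifier weight."""
    l_r = np.mean((x_tilde - x_hat) ** 2)
    l_c = -np.mean(np.sum(y_true_onehot * np.log(y_prob + 1e-12), axis=1))
    return (1.0 - w) * l_r + w * l_c
```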
AutoClass hyperparameter settings for all the datasets can be found in Supplementary Tables 1-3.
Analysis details
Noise types other than dropout: Datasets 3-7 and Dataset 9 were generated by manually adding noise to the true data of Dataset 2 and Dataset 8, respectively. The noise was first generated with the Python numpy.random package under different noise distributions (details in Supplementary Table 3) and then centered (so that the noise mean is 0). The noise was then added to the true data, all values were rounded to integers, and negative values were set to 0, since scRNA-Seq raw counts are non-negative integers.
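As an illustration, here is a minimal numpy sketch of this noise-addition procedure, using Gaussian noise as one example distribution; the actual per-dataset distributions and parameters are those given in Supplementary Table 3.

```python
# Sketch of the noise-addition procedure: generate, center, add, round, clip.
import numpy as np

rng = np.random.default_rng(0)

def add_noise(true_counts, scale=2.0):
    noise = rng.normal(loc=0.0, scale=scale, size=true_counts.shape)
    noise -= noise.mean()                          # center: noise mean is 0
    noisy = np.rint(true_counts + noise)           # round to integers
    return np.clip(noisy, 0, None).astype(int)     # negative values set to 0
```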

Highly variable genes: The highly variable genes in each dataset are ranked by the ratio of gene-wise variance to mean, both computed from non-zero values (see the sketch at the end of this section).

Real scRNA-Seq datasets
We collected and analyzed multiple real scRNA-Seq datasets from published studies. […]

Data availability
All simulated datasets can be generated using the parameters specified in the Simulated scRNA-Seq datasets subsection, and all the real datasets are publicly available as described in the Real scRNA-Seq datasets subsection. In addition, multiple simulated and real datasets are provided in the GitHub repository above as demo datasets, ready for analysis.
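As an illustration of the highly-variable-gene ranking described above, here is a minimal numpy sketch; `counts` is a hypothetical cells-by-genes matrix, and the zero-handling details are assumptions rather than the authors' exact implementation.

```python
# Sketch: rank genes by the variance-to-mean ratio over non-zero values only.
import numpy as np

def rank_hvg(counts):
    masked = np.where(counts > 0, counts, np.nan)   # ignore zeros per gene
    mean = np.nanmean(masked, axis=0)               # gene-wise mean of non-zero values
    var = np.nanvar(masked, axis=0)                 # gene-wise variance of non-zero values
    return np.argsort(var / mean)[::-1]             # gene indices, most variable first

# Example: the 200 most highly variable genes, as used in the clustering analysis.
# top200 = rank_hvg(counts)[:200]
```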