Abstract
The analysis of single-cell RNA-seq (scRNA-seq) data is complicated and biased by excess zero or near-zero counts, the so-called dropouts, which arise from the low amounts of mRNA sequenced within individual cells. We introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. We show that scImpute is an effective tool to enhance the clustering of cell populations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics in time-series scRNA-seq experiments.
1 Introduction
Bulk cell RNA-sequencing (RNA-seq) technology has been widely used for transcriptome profiling to study transcriptional structures, splicing patterns, and expression levels of transcripts and genes [1]. However, it is important to account for cell-specific transcriptome landscapes in order to address biological questions such as cell heterogeneity and gene expression stochasticity. Single-cell RNA sequencing (scRNA-seq) technologies are now emerging as a powerful tool to capture transcriptome-wide cell-to-cell variability [2, 3, 4]. scRNA-seq enables the quantification of intra-population heterogeneity at a much higher resolution, potentially revealing dynamics in heterogeneous cell populations and complex tissues [5]. An important characteristic of scRNA-seq data is the “dropout” phenomenon, where a gene is observed at a moderate expression level in one cell but is not detected in another cell [6]. These events usually occur because of the low amounts of mRNA in individual cells, so a truly expressed transcript may not be detected by sequencing in some samples. The dropout phenomenon is also protocol-dependent. On the Fluidigm C1 platform, the number of cells that can be analyzed with one chip is usually no more than a few hundred, with around 1–2 million reads per cell. In contrast, protocols based on droplet microfluidics can profile more than 10,000 cells in parallel, but with only 100–200k reads per cell [7]. As a consequence, the dropout rate is much higher with droplet microfluidics technologies.
Methods for analyzing scRNA-seq data have been developed from different perspectives, such as clustering, cell type identification, and dimension reduction [8, 9, 10, 11, 12]. Some of these methods address the dropout events by implicit imputation while others do not. CIDR is the first clustering method that incorporates imputation of dropout values, but the imputed expression of a particular gene in a cell changes each time the cell is paired with a different cell [10]. Seurat is a computational strategy for spatial reconstruction of cells from single-cell gene expression data [11]. It includes an imputation step that imputes the expression of landmark genes based on highly variable, or so-called structured, genes. ZIFA is a dimensionality reduction model specifically designed for zero-inflated single-cell gene expression analysis [12]. The model is built upon an empirical observation: the dropout rate of a gene depends on its mean expression level in the population.
Since most downstream analyses on scRNA-seq data, such as differential expression analysis, identification of cell-type-specific genes, and all the analyses mentioned earlier, rely on the accuracy of gene expression measurements, it is important to correct the false zero expression due to dropout events by model-based imputation methods. To our knowledge, MAGIC is the only available method for explicit and genome-wide imputation of single-cell gene expression profiles [13]. A key step in this method is to create a Markov transition matrix. In the imputation of a single cell, the weights of the other cells are determined through this matrix. However, imputing all counts, including those not affected by dropout, would introduce new bias into the data and possibly eliminate meaningful biological variation. It is also inappropriate to treat all zero counts as missing values, since some of them may reflect true biological non-expression. Therefore, we propose a new imputation method for scRNA-seq data, scImpute, to simultaneously determine which values are affected by dropout events and perform imputation only on dropout entries. To achieve this goal, scImpute first learns each gene’s dropout probability in each cell by fitting a mixture model. Next, scImpute imputes the (highly probable) dropout values in a cell by borrowing information on the same gene from other similar cells, which are selected based on the genes not severely affected by dropout events.
2 Results
2.1 scImpute improves the identification of cell populations
We first use a simulation study to illustrate scImpute’s efficacy and robustness in imputing gene expressions for better identification of multiple cell types (without true class labels supplied). We suppose there are two cell conditions c1 and c2, each with 50 cells, and 120 among 10,000 genes are truly differentially expressed (DE) (details in Methods). scImpute is able to unveil the cell identities masked by dropout events (Figure 1a), and to recover the missing gene expression to a good extent so that the imputed matrix resembles the complete matrix (Figure 1b). Imputed data by scImpute lead to accurate identification of DE genes (Figure 1c). MAGIC also improves the grouping of cells, but it introduces artificial signals that largely alter the data (Figure 1b) and thus the PCA result (Figure 1a). We further investigate a case with three cell types c1, c2, and c3, each with 50 cells, where 180 among 10,000 genes are differentially expressed. Again, scImpute leads to expression profiles that mostly resemble the complete data (Supplementary Figure S1). In addition, we assess how the prevalence of dropout values influences the performance of scImpute. The differential expression analysis based on the imputed data has increased accuracy as the dropout proportion in the raw data decreases, and scImpute is able to achieve > 80.0% AUC when the proportion of zero counts in the raw data is less than 75.0% (Supplementary Figure S3). The only tuning parameter in scImpute is a threshold t, and the imputation is only applied to the genes with dropout probabilities larger than t in a cell (see Methods). A sensitivity analysis suggests that scImpute is robust to varying threshold values (Supplementary Figure S2).
We also use two real data sets to illustrate scImpute’s capability in identifying cell types or subpopulations. The first one is a smaller data set of mouse preimplantation embryos [14]. It contains RNA-seq profiles of 268 single cells from 10 developmental stages (see Methods). In the raw read count matrix, 70.0% of the entries are 0. To illustrate the dropout phenomenon, we plot the log10 read counts of two 16-cell stage cells as an example in Figure 1d. Even though the two cells come from the same stage, many expressed genes have zero counts in only one cell. This problem is alleviated in the data imputed by scImpute, and the Pearson correlation between the two cells increases from 0.53 to 0.89 (Figure 1d). MAGIC achieves an even higher correlation (0.99) but also introduces very large counts that do not exist in the raw data.
Although it is possible to differentiate the major developmental stages from the raw data, the data imputed by scImpute yield more compact clusters (Figure 1e). MAGIC gives a very clean pattern of developmental stages, but it has a high risk of removing biologically meaningful variation, given that many cells of the same stage have almost identical scores in the first two PCs. We then compare the clustering results of the k-means algorithm (k = 6, 8, or 10) on the first two PCs. The results are evaluated by four different measures: adjusted Rand index [15], Jaccard index [16], normalized mutual information (NMI) [17], and purity (see Supplementary Note 1). The four measures are all between 0 and 1, with 1 indicating a perfect match between the clustering result and the truth. All four measures indicate that scImpute leads to the best agreement with the true labels (Figure 1f). This comparison suggests that scImpute could improve clustering among cell subpopulations by imputing dropout values in scRNA-seq data.
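To make two of the evaluation measures concrete, the following is a minimal Python sketch of purity and a pair-counting Jaccard index. The exact definitions used in this work are given in Supplementary Note 1, which is not reproduced here, so the formulas below are the common textbook versions and should be read as an assumption; the function names `purity` and `pair_jaccard` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(truth, pred):
    """Fraction of cells that carry the majority true label of their cluster."""
    total = 0
    for cluster in set(pred):
        members = [t for t, p in zip(truth, pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(truth)


def pair_jaccard(truth, pred):
    """Pair-counting Jaccard index (assumed definition): pairs grouped
    together in both partitions divided by pairs grouped together in
    at least one partition."""
    both = either = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        both += same_truth and same_pred
        either += same_truth or same_pred
    return both / either if either else 1.0


# Toy example: six cells, two true stages, two predicted clusters.
truth = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]
toy_purity = purity(truth, pred)
toy_jaccard = pair_jaccard(truth, pred)
```

Both functions return 1 only when the predicted clusters match the true labels exactly, which is consistent with the range described above.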
We also apply scImpute to a larger data set generated by the high-throughput droplet-based system [18]. It contains scRNA-seq data of 2,885 293T cells and 3,258 Jurkat cells. After we remove the genes that have zero expression in all the cells, the proportion of zero expression is 84.6% among the Jurkat cells (Figure S8) and 83.5% among the 293T cells. Among the Jurkat cells, the median of pairwise Pearson correlations is 0.61 in the raw data, and increases to 0.98 and 0.97 in the data imputed by scImpute and MAGIC, respectively. Among the 293T cells, the median of pairwise Pearson correlations is 0.59 in the raw data, and increases to 0.98 and 0.97 in the data imputed by scImpute and MAGIC, respectively. The increased correlations among cells within the same cell line (Figure 1g-h) suggest that the imputation strengthens the similarities of cells. Among all the pairs of Jurkat and 293T cells, the median Pearson correlation is 0.55 in the raw data. scImpute decreases the median to 0.12, while MAGIC increases the median to 0.75. This result shows that scImpute is able to reveal transcriptomic differences between cell types, while MAGIC makes different cell types more similar. Therefore, scImpute is the better method to differentiate cell types, since it increases both within-cell-type similarities and between-cell-type differences.
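The within-cell-line and between-cell-line summaries reported above can be computed along the following lines. This is a minimal NumPy sketch, assuming a genes-by-cells expression matrix and lists of column indices for each cell line; the helper name `median_pairwise_correlation` and the toy matrix are hypothetical and are not taken from the data set.

```python
import numpy as np


def median_pairwise_correlation(X, cols_a, cols_b=None):
    """Median Pearson correlation between cells (columns of X).

    With only cols_a, every unordered pair of distinct cells in that
    group is used; with cols_b as well, every (a, b) pair across the
    two groups is used.
    """
    corr = np.corrcoef(np.asarray(X, dtype=float).T)  # cells x cells
    if cols_b is None:
        sub = corr[np.ix_(cols_a, cols_a)]
        upper = np.triu_indices(len(cols_a), k=1)  # skip the diagonal
        return float(np.median(sub[upper]))
    return float(np.median(corr[np.ix_(cols_a, cols_b)]))


# Toy matrix: 4 genes x 4 cells; cells 0-1 and cells 2-3 form two groups.
X = np.array([
    [1.0, 2.0, 4.0, 8.0],
    [2.0, 4.0, 3.0, 6.0],
    [3.0, 6.0, 2.0, 4.0],
    [4.0, 8.0, 1.0, 2.0],
])
within = median_pairwise_correlation(X, [0, 1])
between = median_pairwise_correlation(X, [0, 1], [2, 3])
```

In this toy case the two cells within a group are perfectly correlated and the two groups are perfectly anti-correlated, an exaggerated version of the pattern a good imputation should sharpen.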
2.2 scImpute assists differential gene expression analysis on scRNA-seq data
An effective imputation method should lead to a better agreement between scRNA-seq and bulk RNA-seq data of the same biological condition, since bulk data have higher signal-to-noise ratios than scRNA-seq data. To evaluate whether the DE genes identified from single-cell and bulk data have larger overlap after imputation, we utilize a real data set with both single-cell and bulk RNA-seq experiments on human embryonic stem cells (ESC) and definitive endoderm cells (DEC) [19]. This data set includes 6 samples of bulk RNA-seq (4 in H1 ESC and 2 in DEC) and 350 samples of scRNA-seq (212 in H1 ESC and 138 in DEC). The percentage of zero gene expression is 14.8% in the bulk data and 49.1% in the single-cell data.
We apply both scImpute and MAGIC to impute the gene expression values, and then compare the results of differential expression analysis with the results on raw data. Since the ground truth is unknown, we take the DE genes detected from bulk data as the standard to be compared against. Figure 2a illustrates the expression profile of the top 200 DE genes identified in the bulk data. We compare the DE gene lists obtained from the bulk and single-cell data to calculate precision rates, recall rates, and F scores (Figure 2b, Supplementary Note 2). The imputed data lead to DE gene lists that are more similar to the DE genes from the bulk data (around 20% higher F score), implying that differential analysis on imputed data has better accuracy than on raw data. scImpute has a higher precision rate, while MAGIC has a higher recall rate; scImpute is therefore preferred when users prioritize the precision of the identified DE genes. We also assess the consistency of the DE genes identified from the raw and imputed data with those from the bulk data by comparing the overlap of top DE genes (Figure 2c). From this perspective, scImpute leads to the best agreement with the bulk data.
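As an illustration of how such a comparison can be scored, here is a small Python sketch that treats the bulk DE gene list as the reference. The precise definitions used here are in Supplementary Note 2, which is not reproduced, so the standard precision, recall, and balanced F score (harmonic mean) are assumed; the function name `de_agreement` and the gene identifiers are made up for the example.

```python
def de_agreement(bulk_genes, single_cell_genes):
    """Precision, recall, and F score of a single-cell DE gene list,
    taking the bulk DE gene list as the reference standard."""
    bulk = set(bulk_genes)
    single = set(single_cell_genes)
    true_pos = len(bulk & single)
    precision = true_pos / len(single) if single else 0.0
    recall = true_pos / len(bulk) if bulk else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical gene identifiers, for illustration only.
bulk_de = ["g1", "g2", "g3", "g4"]
single_cell_de = ["g2", "g3", "g5"]
precision, recall, f_score = de_agreement(bulk_de, single_cell_de)
```

A method that reports few but mostly correct DE genes scores high on precision and low on recall, which is the trade-off between scImpute and MAGIC described above.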
2.3 scImpute assists pattern recognition in time course scRNA-seq data
Chu et al. [19] also generated bulk and single-cell time-course RNA-seq data profiled at 0, 12, 24, 36, 72, and 96 h of differentiation during DEC emergence (Supplementary Table S2). We utilize this data set to show that scImpute can help recover DE signals that are difficult to identify in the raw time-course data, and reduce false discoveries resulting from dropouts. We first apply scImpute to the raw scRNA-seq data, and then use the R package scPattern [20] to identify genes with differential expression across the time points. scPattern identifies 3,247 genes from the raw data and 5,861 from the imputed data, and 2,587 genes are shared between the two sets. We infer that for the 660 genes that are only identified from the raw data, dropout events have introduced false down-regulation signals. The four example genes in Supplementary Figure S4a are only identified in the raw data, but neither the imputed data nor the bulk data provide strong evidence of differential expression. In these examples, the differential patterns are mainly due to varying dropout rates at different time points. As for the genes that are only identified in the imputed data, we suspect that dropout events of some genes make it difficult to separate biological signals from technical noise, whereas after imputation the underlying expression patterns are more easily detected by statistical testing. The four genes in Supplementary Figure S4b illustrate how the data imputed by scImpute present up- or down-regulation patterns that are consistent with the bulk data. In a genome-wide comparison, the imputed data have significantly higher Pearson correlations with the bulk data (Supplementary Figure S5).
3 Discussion
We propose a statistical method, scImpute, to address the dropout events prevalent in scRNA-seq data. scImpute focuses on imputing the missing expression values of dropout genes, while retaining the expression levels of genes that are largely unaffected by dropout events. Hence, scImpute can reduce technical variation resulting from scRNA-seq and better represent cell-to-cell biological variation, while also avoiding the introduction of excessive bias during its imputation process. An attractive advantage of scImpute is that it can be incorporated into any existing pipeline or downstream analysis of scRNA-seq data. scImpute takes the raw read count matrix as input and outputs an imputed count matrix of the same dimensions, so it can be seamlessly combined with other computational tools without data reformatting or transformation. Another important feature of scImpute is that it does not require manual tuning of parameters. The only parameter in the method is a threshold t on dropout probabilities. We show in a sensitivity analysis that scImpute is robust to the threshold value (Supplementary Figure S2), and a default threshold value of 0.5 is sufficient for most scRNA-seq data. Moreover, cell type information is not necessary for the scImpute method. When cell type information is available, separate imputation on each cell type is expected to produce more accurate results. But as illustrated by the simulation and real data results, scImpute is able to infer cell-type-specific expression even when the true labels are not supplied.
scImpute scales up well as the number of cells increases, and the computational efficiency can be largely improved if a filtering step on genes can be performed based on biological knowledge. Aside from computational complexity, another future direction is to further improve imputation accuracy when dropout rates in the raw data are severely high, as with the droplet-based technologies. The imputation task becomes more difficult as the proportion of missing values increases. More complicated models that account for gene similarities may yield more accurate imputation results, but the prevalence of dropout events may require additional prior knowledge on similar genes to assist modeling.
Methods
The scImpute model
The input of our method is a count matrix XC with rows representing genes and columns representing cells, and our eventual goal is to construct an imputed count matrix with the same dimensions. We start by normalizing the count matrix by the library size of each sample (cell) so that all samples have one million reads. Denoting the normalized matrix by XN, we then construct a matrix X by applying a log10 transformation with a pseudo count of 1.01:

$$X_{ij} = \log_{10}\left(X^{N}_{ij} + 1.01\right), \quad i = 1, 2, \ldots, I, \quad j = 1, 2, \ldots, J,$$

where I is the total number of genes and J is the total number of cells. The pseudo count is added to avoid infinite values in parameter estimation in a later step.
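The normalization and transformation described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the scImpute implementation; the function name `normalize_and_log` and the toy count matrix are hypothetical.

```python
import numpy as np


def normalize_and_log(counts, scale=1e6, pseudo=1.01):
    """Scale each cell (column) to `scale` total reads, then take log10
    after adding the pseudo count. `counts` is genes x cells."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=0)  # total reads per cell
    if np.any(library_size == 0):
        raise ValueError("every cell needs at least one read")
    normalized = counts / library_size * scale
    return np.log10(normalized + pseudo)


# Toy raw count matrix: 3 genes x 2 cells.
raw_counts = np.array([
    [0, 10],
    [5, 0],
    [5, 30],
])
X = normalize_and_log(raw_counts)
```

Because the pseudo count is larger than 1, a zero count maps to log10(1.01), a small positive number, so every entry of X is strictly positive.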
Once we obtain the transformed gene expression matrix X, the first step is to infer which genes are affected by the dropout events in which cells. Instead of treating all zero values as dropout events, we construct a statistical model to systematically determine whether a zero value comes from a dropout event or not. With the existence of dropout events, most genes have a bimodal expression pattern that can be described by a mixture model of two components (Supplementary Figure S6). The first component is a Gamma distribution used to account for the dropouts, while the second component is a Normal distribution to represent the actual gene expression levels. Thus for each gene i, its expression can be considered as a random variable with density function

$$f_{X_i}(x) = \lambda_i \, \mathrm{Gamma}(x; \alpha_i, \beta_i) + (1 - \lambda_i) \, \mathrm{Normal}(x; \mu_i, \sigma_i),$$

where λi is gene i’s dropout rate, αi, βi are the shape and rate parameters of the Gamma distribution, and μi, σi are the mean and standard deviation of the Normal distribution. The intuition behind this mixture model is that if a gene has high expression and low variation in most cells, then a zero count is more likely to be a dropout value; on the other hand, if a gene has constantly low expression or medium expression with high variation, then a zero count may reflect real biological variability. An advantage of this model is that it does not assume an empirical relationship between dropout rates and genes’ mean expression, as is done in [6], allowing more flexibility in the model estimation. The parameters in the mixture model can be estimated by the Expectation-Maximization (EM) algorithm, and we denote their estimates as $\hat{\lambda}_i$, $\hat{\alpha}_i$, $\hat{\beta}_i$, $\hat{\mu}_i$, and $\hat{\sigma}_i$. It follows that the dropout probability of gene i in cell j can be estimated as

$$d_{ij} = \frac{\hat{\lambda}_i \, \mathrm{Gamma}(X_{ij}; \hat{\alpha}_i, \hat{\beta}_i)}{\hat{\lambda}_i \, \mathrm{Gamma}(X_{ij}; \hat{\alpha}_i, \hat{\beta}_i) + (1 - \hat{\lambda}_i) \, \mathrm{Normal}(X_{ij}; \hat{\mu}_i, \hat{\sigma}_i)}.$$
Therefore, each gene i has an overall dropout rate $\hat{\lambda}_i$, which does not depend on individual cells, and also dropout probabilities dij, j = 1, 2, …, J, which may vary among different cells.
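To illustrate how a dropout probability follows from fitted mixture parameters, the sketch below evaluates the posterior weight of the Gamma (dropout) component for a single transformed expression value. Two things are assumptions here: that dij is this posterior weight, and the parameter values themselves, which are invented for the example and are not EM estimates from any data set. The EM fitting step is not shown, and all function names are hypothetical.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a Gamma(shape, rate) distribution at x > 0."""
    log_density = (shape * math.log(rate) - math.lgamma(shape)
                   + (shape - 1.0) * math.log(x) - rate * x)
    return math.exp(log_density)


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def dropout_probability(x, lam, shape, rate, mean, sd):
    """Posterior weight of the Gamma (dropout) component at value x,
    assumed here to be the dropout probability d_ij."""
    dropout_part = lam * gamma_pdf(x, shape, rate)
    expressed_part = (1.0 - lam) * normal_pdf(x, mean, sd)
    return dropout_part / (dropout_part + expressed_part)


# Hypothetical parameter estimates for one gene (illustration only).
params = dict(lam=0.4, shape=1.0, rate=10.0, mean=2.0, sd=0.5)
d_zero_count = dropout_probability(math.log10(1.01), **params)  # zero count
d_expressed = dropout_probability(2.0, **params)  # near the Normal mean
```

With these made-up parameters, a zero count receives a dropout probability close to 1, while a value near the Normal mean receives one close to 0, matching the intuition that zeros in an otherwise highly expressed gene are likely dropouts.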
Now we impute the gene expressions cell by cell. For each cell j, we select a gene set Aj in need of imputation based on the genes’ dropout probabilities in cell j: Aj = {i : dij ≥ t}, where t is a threshold on dropout probabilities. Similarly, we also have a gene set Bj = {i : dij < t} whose genes have accurate expression measurements with high confidence and do not need imputation. We learn cells’ similarities through gene set Bj, and impute the expression of genes in set Aj by borrowing information from their expression in other similar cells. To do this, we construct a weighted Lasso model [21] where θ > 0 is the regularization parameter and wBj is the weight vector of gene set Bj. Note that the response XBj,j is a vector representing the Bj rows in the j-th column of X, and the design matrix XBj,−j is a sub-matrix of X with dimensions |Bj| × (J − 1). The weight of gene i is determined by its dropout rate: . The indices of cells selected by the Lasso are denoted as . The strengths of this model are twofold. First, the Lasso ensures sparsity in variable selection, so only cells that are informative for imputation in cell j are selected. Second, incorporating weights on genes gives priority to genes with high expression, as these genes are more informative in learning cells’ relationships. Finally, the estimated coefficients from the Ordinary Least Squares (OLS) linear regression are used to impute the expression of gene set Aj in cell j:
We construct a separate Lasso model for each cell to impute the expression of genes with high dropout probabilities. This method simultaneously determines the values that need imputation, and does not introduce bias to the high expression values of accurately measured genes. Supplementary Figure S7 gives some real data examples of the distribution of zero count proportions, showing that it is reasonable to divide genes into sets A and B. Supplementary Figure S8 illustrates the imputation step in scImpute with a toy example.
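The per-cell imputation step can be sketched as follows. This is a rough NumPy illustration under several stated assumptions, not the scImpute implementation: the gene weights `w` are passed in by the caller because their exact form is not reproduced in this text; the weighted Lasso objective is taken to be 0.5 Σ w_i (residual_i)² + θ Σ |β_k| and is solved with a plain coordinate-descent loop; and the OLS refit is unweighted and fitted on the genes in Bj using only the cells selected by the Lasso. The names `lasso_cd` and `impute_cell` are hypothetical.

```python
import numpy as np


def lasso_cd(X, y, w, theta, n_iter=200):
    """Coordinate descent for the (assumed) weighted Lasso objective
    0.5 * sum_i w_i * (y_i - X_i @ b)**2 + theta * sum_k |b_k|."""
    Xw = X * np.sqrt(w)[:, None]  # fold the gene weights into the rows
    yw = y * np.sqrt(w)
    b = np.zeros(X.shape[1])
    col_sq = (Xw ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(X.shape[1]):
            if col_sq[k] == 0:
                continue
            partial_resid = yw - Xw @ b + Xw[:, k] * b[k]
            rho = Xw[:, k] @ partial_resid
            # soft-thresholding update for coordinate k
            b[k] = np.sign(rho) * max(abs(rho) - theta, 0.0) / col_sq[k]
    return b


def impute_cell(X, d, j, w, t=0.5, theta=1.0):
    """Impute column j of X (genes x cells) given dropout probabilities d."""
    A = d[:, j] >= t  # genes in need of imputation in cell j
    B = ~A            # genes treated as accurately measured
    others = np.delete(np.arange(X.shape[1]), j)
    b = lasso_cd(X[B][:, others], X[B, j], w[B], theta)
    selected = others[b != 0]  # cells chosen by the Lasso
    out = X[:, j].copy()
    if selected.size and A.any():
        # unweighted OLS refit on set B, then prediction for set A
        coef, *_ = np.linalg.lstsq(X[B][:, selected], X[B, j], rcond=None)
        out[A] = X[A][:, selected] @ coef
    return out


# Toy data: 6 genes x 3 cells; gene 0 is a suspected dropout in cell 0.
X = np.array([
    [0.0, 2.0, 0.5],
    [1.0, 1.0, 3.0],
    [2.0, 2.0, 1.0],
    [3.0, 3.0, 0.2],
    [1.5, 1.5, 2.5],
    [2.5, 2.5, 0.1],
])
d = np.zeros_like(X)
d[0, 0] = 0.9
imputed = impute_cell(X, d, j=0, w=np.ones(6), t=0.5, theta=0.1)
```

In the toy data, cell 1 matches cell 0 on every gene in B, so the Lasso keeps only cell 1 and the refit copies its value for gene 0 into cell 0, while the entries in B are left untouched.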
Generation and analysis of simulated data
In the simulation study with two cell types, we suppose there are two cell conditions c1 and c2, each with 50 cells, and there are 10,000 genes in total. In this gene population, only 120 genes are truly differentially expressed, with one half having higher expression in c1 and the other half having higher expression in c2. We directly generate genes’ log10-transformed read counts to construct the simulated data set. First, mean expressions of the 10,000 genes are randomly drawn from a Normal distribution with mean 1.8 and standard deviation 0.5. Similarly, standard deviations of gene expressions are randomly drawn from a Normal distribution with mean 0.6 and standard deviation 0.1. These parameters are estimated from the real data set of mouse embryo cells. Second, we randomly draw 60 genes and shift their mean expression in condition c1 by multiplying it by an integer randomly sampled from {2, 3, …, 10}; we also draw another 60 genes and shift their mean expression in condition c2 in the same way. Next, the expression of each gene in the 100 cells is simulated from Normal distributions defined by the parameters obtained in the first two steps. We refer to the resulting gene expression data as the complete data. Finally, we suppose the dropout rate of each gene follows a double exponential function exp(−0.1 × (mean expression)²), as assumed in [12]. Zero values are then introduced into the simulated data for each gene based on a Bernoulli distribution defined by its dropout rate, resulting in a gene expression matrix with excess zeros and in need of imputation. We refer to the gene expression data after introducing zero values as the raw data. Note that the generation of gene expression values does not directly follow the mixture model used in scImpute, so that we can investigate the efficacy and robustness of scImpute in a fair way using this simulation. (Data with three cell types are simulated by a similar procedure.)
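The simulation procedure above can be sketched with NumPy as follows. Several details are assumptions made for this sketch rather than statements about the original simulation: negative simulated values are truncated to zero, each gene's dropout rate is computed from its mean over all 100 cells of the complete data, the drawn standard deviations are clipped to stay positive, and the random seed is arbitrary. The function name `simulate_two_conditions` is hypothetical.

```python
import numpy as np


def simulate_two_conditions(seed=0, n_genes=10_000, n_cells=50, n_de=60):
    """Simulate complete and raw log10 expression for two conditions."""
    rng = np.random.default_rng(seed)

    # Step 1: gene-level means and standard deviations.
    mu = rng.normal(1.8, 0.5, n_genes)
    sd = np.clip(rng.normal(0.6, 0.1, n_genes), 1e-3, None)  # guard (assumption)

    # Step 2: shift the means of 60 DE genes in c1 and another 60 in c2.
    de = rng.choice(n_genes, size=2 * n_de, replace=False)
    mu_c1, mu_c2 = mu.copy(), mu.copy()
    mu_c1[de[:n_de]] *= rng.integers(2, 11, n_de)  # integers in {2, ..., 10}
    mu_c2[de[n_de:]] *= rng.integers(2, 11, n_de)

    # Step 3: complete data, one column per cell.
    c1 = rng.normal(mu_c1[:, None], sd[:, None], (n_genes, n_cells))
    c2 = rng.normal(mu_c2[:, None], sd[:, None], (n_genes, n_cells))
    complete = np.maximum(np.hstack([c1, c2]), 0.0)  # truncation (assumption)

    # Step 4: gene-wise dropout rate and Bernoulli zeroing.
    rate = np.exp(-0.1 * complete.mean(axis=1) ** 2)
    dropped = rng.random(complete.shape) < rate[:, None]
    raw = np.where(dropped, 0.0, complete)
    return complete, raw, de


complete, raw, de_genes = simulate_two_conditions()
zero_fraction_complete = float((complete == 0).mean())
zero_fraction_raw = float((raw == 0).mean())
```

Comparing `zero_fraction_raw` with `zero_fraction_complete` shows how strongly the dropout step inflates the number of zeros relative to the complete data.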
Based on the parameters estimated from real data, we simulate a 10,000 × 100 gene expression matrix (raw data) with 71.8% zero values, while the complete data only have 1.3% zero values. Even though the two cell types are clearly distinguishable when we apply principal component analysis (PCA) to the complete data, they become indistinguishable when dropout events exist (Figure 1a). However, when we apply scImpute and MAGIC to impute the dropout values, the relationships between the 100 cells are unveiled. Both methods are able to distinguish condition c1 from c2, but MAGIC introduces artificial signals that largely alter the data and thus the PCA result (Figure 1a). In the gene expression matrices, we observe that the dropout events largely mask the expression dynamics of the 120 DE genes. scImpute is able to recover the missing gene expression to a good extent and make the imputed matrix resemble the complete matrix, while MAGIC largely distorts the expression matrix and makes it deviate even further from the complete data than the raw data do (Figure 1b). This difference between scImpute and MAGIC can be more easily observed by directly comparing the accuracy of DE gene identification (Figure 1c). The DE genes are found by the t-test with a significance level of 0.01 or 0.05 on the simulated gene expression. The 120 true DE genes cannot be detected at all from the raw data, but scImpute is able to recover the differential pattern.
Data availability
The scRNA-seq data of mouse preimplantation embryos [14] are publicly available through GEO Series accession number GSE45719. The data set contains RNA-seq profiles of 268 single cells from zygote (4), early 2-cell stage (8), middle 2-cell stage (12), late 2-cell stage (10), 4-cell stage (14), 8-cell stage (37), 16-cell stage (50), early blastocyst (43), middle blastocyst (60), and late blastocyst (30) stages.
The scRNA-seq data of Jurkat and 293T cells [18] are available at https://www.10xgenomics.com/single-cell/.
The RNA-seq data on human embryonic cells [19] are publicly available under the GEO Series accession number GSE75748.