## ABSTRACT

High-throughput single-cell RNA-Seq (scRNA-Seq) methods can efficiently generate expression profiles for thousands of cells, and promise to enable the comprehensive molecular characterization of all cell types and states present in heterogeneous tissues. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy and only capture a fraction of transcripts present in the cell. Here, we propose an algorithm to smooth scRNA-Seq data, with the goal of significantly improving the signal-to-noise ratio of each profile, while largely preserving biological expression heterogeneity. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on variance-stabilized and partially smoothed expression profiles, and then aggregating their transcript counts. On data from human pancreatic islet tissue and peripheral blood mononuclear cells, we show that smoothing greatly facilitates the identification of clusters of cells and co-expressed genes. Using simulated datasets that closely mimic real expression data, we show that our algorithm drastically improves upon the accuracy of other smoothing methods. Our work implies that there exists a quantitative relationship between the number of cells profiled and the potential accuracy with which individual cell types or states can be characterized, and helps unlock the full potential of scRNA-Seq to elucidate molecular processes in healthy and disease tissues. Reference implementations of our algorithm can be found at https://github.com/yanailab/knn-smoothing.

## INTRODUCTION

Over the past decade, single-cell expression profiling by sequencing (scRNA-Seq) technology has advanced rapidly. After the transcriptomic profiling of a single cell (Tang et al. 2009), protocols were developed that incorporated cell-specific barcodes to enable the efficient profiling of tens or hundreds of cells in parallel (Islam, Kjällquist, et al. 2011; Hashimshony, Wagner, et al. 2012). scRNA-Seq methods were then improved by the incorporation of unique molecular identifiers (UMIs) that allow the identification and counting of individual transcripts (e.g., Islam, Zeisel, et al. 2014; Hashimshony, Senderovich, et al. 2016). More recently, single-cell protocols were combined with microfluidic technology (Klein et al. 2015; Macosko et al. 2015; Zheng et al. 2017), combinatorial barcoding (Cao et al. 2017; Rosenberg et al. 2017), or nanowell plates (Gierahn et al. 2017). These high-throughput scRNA-Seq methods allow the cost-efficient profiling of tens of thousands of cells in a single experiment.

Due to the typically very low amounts of starting material, and the inefficiencies of the various chemical reactions involved in library preparation, scRNA-Seq data is inherently noisy (Ziegenhain et al. 2017). This has motivated the development of many specialized statistical models, for example for determining differential expression (Kharchenko, Silberstein, and Scadden 2014), performing factor analysis (Pierson and Yau 2015), pathway analysis (Fan et al. 2016), or more general modeling of scRNA-Seq data (Risso et al. 2017). In addition, methods have been proposed to impute missing values (W. V. Li and J. J. Li 2017) and to perform smoothing (Dijk et al. 2017). Finally, many authors of scRNA-Seq studies have relied on ad-hoc approaches for mitigating noise, for example by clustering and averaging cells belonging to each cluster (Shekhar et al. 2016; Baron et al. 2016).

Fundamental to any statistical treatment are the assumptions that are made about the data. For methods aimed at analyzing scRNA-Seq data, assumptions about the noise characteristics determine which approach can be considered the most appropriate. All aforementioned approaches have assumed an overabundance of zero values, compared to what would be expected if the data followed a Poisson or negative binomial distribution. However, in the absence of true expression differences, the analysis by Ziegenhain et al. (2017) has suggested that across scRNA-Seq protocols, there is little evidence of excess-Poisson variability when expression is quantified by counting unique UMI sequences (“UMI filtering”) instead of raw reads (see Figure 5B in Ziegenhain et al. (2017)). This is consistent with reports describing individual UMI-based scRNA-Seq protocols, which have demonstrated that in the absence of true expression differences, the mean-variance relationship of genes or spike-ins closely follows that of Poisson-distributed data (Grün, Kester, and Oudenaarden 2014; Klein et al. 2015; Zheng et al. 2017).

In this work, we propose a smoothing algorithm that makes direct use of the observation that after normalization to account for efficiency noise (Grün, Kester, and Oudenaarden 2014), the technical noise associated with UMI counts from high-throughput scRNA-Seq protocols is entirely consistent with Poisson statistics. Instead of developing a parametric model, we propose an algorithm that smoothes scRNA-Seq data by aggregating gene-specific UMI counts from the *k* nearest neighbors of each cell. To accurately determine these neighbors, we propose to use an appropriate variance-stabilizing transformation, and to proceed in a step-wise fashion using partially smoothed profiles. Conveniently, the noise associated with the smoothed expression values is again Poisson-distributed, which simplifies their variance-stabilization and downstream analysis. We demonstrate the improved signal-to-noise ratio of scRNA-Seq data processed with our algorithm on real-world examples, and perform simulation studies to compare its accuracy to that of two other recently proposed methods for smoothing (or imputing) scRNA-Seq data (Dijk et al. 2017; W. V. Li and J. J. Li 2017).

## RESULTS

### The normalized UMI counts of replicate scRNA-Seq profiles are Poisson-distributed

To validate the Poisson-distributed nature of high-throughput scRNA-Seq data in the absence of true expression differences, we obtained data from control experiments conducted on three platforms: inDrop (Klein et al. 2015), Drop-Seq (Macosko et al. 2015), and 10x Genomics (Zheng et al. 2017). In these experiments, droplets containing identical RNA pools were analyzed. Assuming that the number of transcripts in each droplet was sufficiently large, there are no true expression differences among droplets, and all of the observed differences among droplets can be attributed to technical noise arising from library preparation and sequencing. As expected from published results (cf. Figure 5A in Klein et al. (2015), Supplementary Figure 2f in Zheng et al. (2017)), data from both the inDrop platform and the 10x Genomics platform followed the Poisson distribution (see Figure 1a,c; see Methods), with the exception of highly expressed genes. This deviation is likely due to global droplet-to-droplet differences in capture efficiency, previously referred to as “efficiency noise” (Grün, Kester, and Oudenaarden 2014).

For the Drop-Seq data, Macosko et al. (2015) did not discuss the mean-variance relationship, but we observed a pattern consistent with inDrop and 10x Genomics data (see Figure 1b). Interestingly, the y axis intercept of the Drop-Seq CV-mean relationship was clearly above 0, suggesting that transcript counts followed a scaled Poisson distribution (see Methods). A possible explanation could be that the computational pipeline used to derive the Drop-Seq UMI counts generated artificially inflated transcript counts, but we did not explore this hypothesis further.

To test whether the larger-than-expected variance of highly expressed genes can indeed be explained by efficiency noise, we normalized the expression profiles in each dataset to the median UMI count across profiles (Model I in Grün, Kester, and Oudenaarden (2014); see Methods). This resulted in an almost perfectly linear CV-mean relationship (see Figure 1d-f), suggesting that efficiency noise is indeed the dominating source of variation for very highly expressed genes.
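The effect of efficiency noise on the CV-mean relationship can be made explicit with a short derivation. This is only a sketch, under the illustrative assumption that the UMI count $x$ of a gene is Poisson-distributed with mean $\lambda s$, where $s$ is a cell-specific efficiency factor with mean 1 and coefficient of variation $c$. The symbols $s$ and $c$ are introduced here for illustration and are not part of the original notation. By the law of total variance:

$$
\mathrm{Var}(x) = \mathrm{E}[\mathrm{Var}(x \mid s)] + \mathrm{Var}(\mathrm{E}[x \mid s]) = \lambda + \lambda^{2} c^{2},
\qquad
\mathrm{CV}^{2}(x) = \frac{\mathrm{Var}(x)}{\lambda^{2}} = \frac{1}{\lambda} + c^{2}.
$$

For lowly expressed genes the Poisson term $1/\lambda$ dominates, so $\log \mathrm{CV}$ falls linearly with $\log \lambda$ with slope $-1/2$. For very highly expressed genes the CV levels off near $c$. Removing $s$ by normalization leaves only the Poisson term, which would be consistent with the linear relationship observed after normalization.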

Finally, we directly compared the frequency of UMI counts of zero for each gene to that predicted by Poisson statistics, and found that for the inDrop and 10x Genomics data, the observed values matched the theoretical prediction almost perfectly (see Figure 1g,i). For the Drop-Seq data, the frequency of zeros was slightly shifted upwards across the entire expression range (see Figure 1h), which may be due to artificially inflated UMI counts (see Methods).
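The comparison described above can be sketched for a single gene as follows. Under Poisson noise with rate λ, the probability of observing zero UMIs is exp(−λ). The function name `zero_fraction_vs_poisson` and the toy counts are illustrative assumptions, not part of the reference implementation.

```python
import math

def zero_fraction_vs_poisson(counts):
    """Return (observed, expected) fraction of zeros for one gene.

    `counts` holds the (normalized) UMI counts of the gene across replicate
    profiles. The expected value assumes Poisson noise, with the sample mean
    used as the estimate of the Poisson rate.
    """
    mean = sum(counts) / len(counts)
    observed = sum(1 for c in counts if c == 0) / len(counts)
    expected = math.exp(-mean)  # P(X = 0) for X ~ Poisson(mean)
    return observed, expected

# Toy example (hypothetical counts for one gene in four replicate profiles).
observed, expected = zero_fraction_vs_poisson([0, 0, 1, 3])
print(f"observed: {observed:.3f}, Poisson prediction: {expected:.3f}")
```

An observed fraction that is consistently larger than the prediction across many genes would point towards zero-inflation.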

In summary, we found that for all three high-throughput scRNA-Seq platforms examined, Poisson-distributed noise, in combination with the efficiency noise observed for very highly expressed genes, described virtually all of the observed technical variance, and that there was no evidence of substantial zero-inflation. We note that the recent publication describing the Quartz-Seq2 single-cell platform also reports a Poisson noise relationship (see Figure 2e in Sasagawa et al. (2017)), bringing the total number of high-throughput scRNA-Seq protocols with reported Poisson noise characteristics to four.

### Aggregation of *n* replicate profiles results in Poisson-distributed values with the signal-to-noise ratio increased by a factor of $\sqrt{n}$

Since the sum of independent Poisson-distributed variables is again Poisson-distributed, we reasoned that the aggregation of normalized expression values from *n* independent measurements of the same RNA pool would result in Poisson-distributed values, with the signal-to-noise ratio increased by a factor of $\sqrt{n}$ (see Methods). Similarly, we predicted that averaging instead of aggregating (summing) would result in a scaled Poisson distribution with the same increased signal-to-noise ratio. We tested this idea on the inDrop pure RNA dataset previously shown in Figure 1a, which consisted of 935 expression profiles. Averaging randomly selected, non-overlapping sets of 16 profiles resulted in 58 new expression profiles, with genes exhibiting an almost exact four-fold increase in their signal-to-noise ratios, i.e., a four-fold reduction of their coefficients of variation, as expected (see Figure 2a). As an example, the UMI count distribution of the *GAPDH* gene before and after averaging is shown in Figure 2b, and can be seen to closely match the theoretically predicted Poisson and scaled Poisson distributions, respectively. In summary, the results showed that independently of gene expression level, aggregating expression values from replicate profiles led to more accurate expression estimates that again exhibited Poisson-distributed noise profiles.
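The $\sqrt{n}$ relationship can be checked with a small simulation. The following is a minimal sketch using only the Python standard library. The Poisson rate, the number of groups, and the random seed are arbitrary illustrative choices, and the simulated values stand in for the normalized UMI counts of a single gene. It does not reproduce the analysis of the inDrop data.

```python
import math
import random
from statistics import fmean, pstdev

def poisson(rng, lam):
    """Draw one Poisson(lam) variate (Knuth's method; adequate for small lam)."""
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def cv(values):
    """Coefficient of variation (standard deviation divided by mean)."""
    return pstdev(values) / fmean(values)

rng = random.Random(0)
lam, n, groups = 5.0, 16, 1000  # illustrative settings

# Simulate replicate measurements of one gene, then average disjoint sets of n.
raw = [poisson(rng, lam) for _ in range(n * groups)]
averaged = [fmean(raw[g * n:(g + 1) * n]) for g in range(groups)]

snr_gain = cv(raw) / cv(averaged)
print(f"CV raw: {cv(raw):.3f}, CV averaged: {cv(averaged):.3f}, gain: {snr_gain:.2f}")
```

With *n* = 16, the ratio of the two coefficients of variation should be close to four, in line with the four-fold reduction reported above.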

### The Freeman-Tukey transform effectively stabilizes the technical variance of high-throughput scRNA-Seq data

Based on the aforementioned results, we conceived an algorithm to smooth single-cell RNA-seq data, with the following outline:

For each cell *C*:

1. Determine the *k* nearest neighbors of *C*.
2. Calculate a smoothed expression profile for *C* by combining its UMI counts with those of the *k* nearest neighbors, on a gene-by-gene basis.
3. (Optional) Divide *C*’s new expression profile by *k* + 1, to retain the scale of the original data.

The main challenge in implementing this algorithm is to devise an appropriate approach for determining the *k* nearest neighbors of each cell, and to choose an appropriate *k*. We defer the question of how to choose *k* to the Discussion, and focus here on the problem of determining the *k* nearest neighbors.

Due to the Poisson-distributed nature of scRNA-Seq data, the technical variance (noise) associated with each gene is directly proportional to its expression level. This type of extreme heteroskedasticity poses a problem when attempting to calculate cell-cell similarities, because the noise of highly expressed genes can drown out the true expression differences of more lowly expressed genes, therefore strongly biasing the analysis towards the most highly expressed genes. One strategy to address this issue is the application of an appropriate variance-stabilizing transformation, designed to render the technical variance independent of the gene expression level (Love, Huber, and Anders 2014). For bulk RNA-Seq data, a log-TPM (or log-RPKM) transform is commonly used for this purpose, even though lowly expressed genes will still exhibit unduly large variances under this transformation (Love, Huber, and Anders 2014). Based on our results, we reasoned that for scRNA-Seq data, the *Freeman-Tukey transform* (FTT), $y = \sqrt{x} + \sqrt{x+1}$, would be a more appropriate choice, as it is designed to stabilize the variance of Poisson-distributed variables (Freeman and Tukey 1950).

To compare the abilities of the FTT and the log-TPM (transcripts per million) transform to stabilize the technical variance of scRNA-Seq data, we applied both transformations to the inDrop pure RNA dataset, and found that the FTT produced significantly better results (see Figure 3): With the log transform, genes with low-intermediate expression, which we considered to be those with expression values between the 60th and 80th percentile rank (of all protein-coding genes, not only genes expressed by K562 cells), had between three- and ten-fold higher levels of variance than the 10% most highly expressed genes (see Figure 3b). In contrast, with the FTT, the difference was no larger than two-fold, and the variances of lowly expressed genes were biased downwards, not upwards (see Figure 3c). Moreover, we found that the FTT also stabilized the variance of the aggregated profiles (see Figure 3d-f), which was expected, given our earlier observation that the aggregated UMI counts are again Poisson-distributed. In particular, a greater share of genes now had variances close to 1. This closely mirrored theoretical results, according to which the variance of Poisson-distributed variables with mean λ ≥ 1 should be within 6% of the asymptotic value of 1 after FTT (Freeman and Tukey 1950). In summary, our analysis showed that distance calculations performed on Freeman-Tukey transformed (FT-transformed) UMI counts would give similar weight to genes with intermediate and high expression. Expression differences from lowly expressed genes would tend to be suppressed, but this suppression would become less severe for aggregated expression profiles.
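The variance-stabilizing behavior of the FTT can also be examined numerically, without any data. The sketch below computes the variance of the Freeman-Tukey transformed value of a Poisson-distributed variable by summing over the Poisson probability mass function. The function names and the truncation point of the sum are illustrative choices.

```python
import math

def ftt(x):
    """Freeman-Tukey transform: sqrt(x) + sqrt(x + 1)."""
    return math.sqrt(x) + math.sqrt(x + 1)

def ftt_variance(lam):
    """Variance of ftt(X) for X ~ Poisson(lam), summed over the (truncated) pmf."""
    upper = int(lam + 20 * math.sqrt(lam) + 20)  # generous truncation point
    pmf = math.exp(-lam)                         # P(X = 0)
    m1 = m2 = 0.0
    for x in range(upper + 1):
        y = ftt(x)
        m1 += pmf * y
        m2 += pmf * y * y
        pmf *= lam / (x + 1)                     # P(X = x + 1) from P(X = x)
    return m2 - m1 * m1

for lam in (0.1, 1, 2, 10, 100):
    print(lam, round(ftt_variance(lam), 3))
```

For means well below 1 the computed variance falls clearly below 1, which matches the downward bias of lowly expressed genes noted above, while for larger means it stays close to 1.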

### A k-nearest neighbor algorithm for smoothing scRNA-Seq data

The previously discussed ideas suggested that a simple way to determine the *k* nearest neighbors for all cells would be to normalize their expression profiles, apply the FTT, and then find the *k* closest cells for each cell based on the Euclidean metric. However, we reasoned that this simple approach could be improved upon, because the noisiness of the data itself can interfere with the accurate determination of the *k* nearest neighbors. We therefore instead decided to adopt a step-wise approach, whereby initially, each profile is only minimally smoothed (using *k*_{1} = 1). In the second step, a larger set of nearest neighbors (e.g., *k*_{2} = 3) is identified for each cell based on those minimally smoothed profiles, and the raw data is then smoothed using these larger sets of neighbors. Additional steps using increasing *k*_{i} are performed until the desired degree of smoothing is reached (i.e., *k*_{i} = *k*). By choosing the *i*’th step to use *k*_{i} = min{2^{i} − 1, *k*}, each step doubles the number of aggregated profiles and thus theoretically improves the signal-to-noise ratio of each individual expression measurement by a factor of $\sqrt{2}$ (except for the last step, for which the improvement can be smaller), and only a small number of steps are required even for large choices of *k* (e.g., six steps for *k* = 63). The resulting “kNN-smoothing” algorithm is formalized in Algorithm 1 (see https://github.com/yanailab/knn-smoothing for reference implementations in Python, R, and Matlab). Using simulation studies, we found that in contrast to a simple “one-step” algorithm, the step-wise approach resulted in a significantly more accurate selection of neighbors, especially for large *k* (see below).
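The schedule of neighborhood sizes described above can be written down in a few lines. This is a minimal sketch, and the function name `step_sizes` is an illustrative choice rather than part of the reference implementations.

```python
import math

def step_sizes(k):
    """Return the number of neighbors used in each step, k_i = min(2^i - 1, k)."""
    steps = math.ceil(math.log2(k + 1))
    return [min(2 ** i - 1, k) for i in range(1, steps + 1)]

print(step_sizes(63))  # [1, 3, 7, 15, 31, 63]
print(step_sizes(20))  # [1, 3, 7, 15, 20]
```

When *k* + 1 is not a power of two, as in the second call, the final step adds fewer than twice as many profiles, which is why its improvement can be smaller.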

### Application of kNN-smoothing to scRNA-Seq data of human pancreatic islets improves clustering results and recovers specific expression patterns for marker genes

To test whether kNN-smoothing would improve the ability to distinguish between different cell types in a scRNA-Seq experiment, we applied the algorithm (with *k*=15) to a single-cell expression dataset obtained from human pancreatic islet tissue, containing at least 14 distinct cell populations (Baron et al. 2016) (PANCREAS dataset). We first performed principal component analyses (PCA; see Methods) and observed several improvements after smoothing (see Figure 4a): First, cell type clusters appeared significantly more compact in principal component space, indicating that the smoothed expression profiles were more similar than unsmoothed profiles for cells of the same type, but more different for cells from distinct types. Second, a single cluster of cells that contained alpha cells as well as other cells separated into two highly distinct clusters after smoothing. Notably, all alpha cells were still contained within a single cluster after smoothing. This suggested that smoothing helped reveal important differences that were not previously captured by the first two principal components. Third, the proportion of cells of each type that could be identified using simple marker gene expression thresholds increased slightly, suggesting that the expression values of individual marker genes were less noisy in the smoothed data. Finally, a much greater share of total variation was explained by the first two principal components (PCs) for the smoothed data than for the unsmoothed data (40.3% vs 20.8%), which would be consistent with a greater share of variation originating from true biological differences rather than technical noise.

We next performed hierarchical clustering on the smoothed data after filtering for the 1,000 most variable genes (see Methods). When we visualized the results as an expression heatmap (Eisen et al. 1998), several gene and cell clusters were readily discernible (see Figure 4b). A direct comparison between the smoothed and unsmoothed data showed that smoothing produced significantly less noisy expression patterns while preserving expression differences between relatively similar cell populations (see Figure 4c). To assess whether cell clusters delineated different cell types, we examined the expression patterns of known marker genes for nine cell types present in the data (Baron et al. 2016), and found that the hierarchical clustering of the smoothed expression profiles accurately grouped cells by their cell type (see Figure 4d, top panel). Moreover, compared to the unsmoothed data, the expression patterns of these marker genes appeared significantly less noisy (see Figure 4d, bottom panel). Finally, we repeated the entire analysis on the unsmoothed data, and found that it was considerably more difficult to discern clusters of genes and cells (see Figure S1a), and that judging by the expression patterns of the marker genes, not all cell types were clustered together appropriately (see Figure S1b). In summary, our analyses showed that kNN-smoothing with *k*=15 significantly improved the results obtained with PCA as well as hierarchical clustering, and that it recovered stable and cell type-specific expression patterns for all of the marker genes examined.

### Algorithm 1: K-nearest neighbor smoothing for UMI-filtered scRNA-Seq data

**Input**:

*p*, the number of genes.

*n*, the number of cells.

*X*, a *p* × *n* matrix containing the UMI counts for all genes and cells.

*k*, the number of neighbors to use for smoothing.

**Output**:

*S*, a *p* × *n* matrix containing the smoothed (aggregated) UMI counts.

1: **procedure** KNN-SMOOTH(*p*, *n, X, k*)

2: *S* = COPY(*X*)

3: *steps* = ⌈log_{2}(*k* + 1)⌉

4: **for** *t* = 1 **to** *steps* **do**

5: *M* = MEDIAN-NORMALIZE(*S*) // a new *p* × *n* matrix

6: *F* = FREEMAN-TUKEY-TRANSFORM(*M*) // a new *p* × *n* matrix

7: *D* = PAIRWISE-DISTANCE(*F*) // a new *n* × *n* matrix

8: *A* = ARGSORT-ROWS(*D*) // a new *n* × *n* matrix

9: *k_step* = MIN({2^{t} − 1, *k*})

10: **for** *j* = 1 **to** *n* **do** // reset all entries of *S* to zero

11: **for** *i* = 1 **to** *p* **do**

12: *S*_{ij} = 0

13: **end for**

14: **end for**

15: **for** *j* = 1 **to** *n* **do** // go over all cells

16: **for** *v* = 1 **to** *k_step* + 1 **do** // go over all nearest neighbors (including self)

17: *u* = *A*_{jv}

18: **for** *i* = 1 **to** *p* **do** // aggregate original UMI counts for each gene

19: *S*_{ij} = *S*_{ij} + *X*_{iu}

20: **end for**

21: **end for**

22: **end for**

23: **end for**

24: **return** *S*

25: **end procedure**

Notes: For a two-dimensional matrix *X*, *X*_{ij} refers to the element in the *i*’th row and *j*’th column of *X*. COPY(*X*) returns an independent memory copy of *X* (not a reference). MEDIAN-NORMALIZE(*X*) returns a new matrix of the same dimension as *X*, in which the values in each column have been scaled by a constant so that the column sum equals the median column sum of *X*. FREEMAN-TUKEY-TRANSFORM(*X*) returns a new matrix of the same shape as *X*, in which all values have been Freeman-Tukey transformed ($y = \sqrt{x} + \sqrt{x+1}$). PAIRWISE-DISTANCE(*X*) computes the pair-wise distance matrix *D* from *X*, so that *D*_{ij} is the Euclidean distance between the *i*’th column and the *j*’th column of *X*. For a matrix *D* with *n* columns, ARGSORT-ROWS(*D*) returns a matrix of indices *A* that sort *D* in a row-wise manner, i.e., $D_{jA_{j1}} \le D_{jA_{j2}} \le \dots \le D_{jA_{jn}}$ for all *j*.
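For readers who prefer code to pseudocode, Algorithm 1 can be sketched in plain Python as follows. This is not the reference implementation (see the repository linked above), but a compact illustration that favors readability over speed. It assumes that *X* is given as a list of gene rows, that 1 ≤ *k* < *n*, and that every cell contains at least one UMI. The sort key that forces each cell to rank first among its own neighbors is an added safeguard against ties and is not part of the pseudocode.

```python
import math
from statistics import median

def ftt(x):
    """Freeman-Tukey transform of a single value."""
    return math.sqrt(x) + math.sqrt(x + 1)

def knn_smooth(X, k):
    """Smooth a genes x cells UMI matrix X (list of rows) using k neighbors.

    Returns aggregated (summed) counts; divide by k + 1 to obtain averages.
    """
    p, n = len(X), len(X[0])
    S = [row[:] for row in X]  # independent copy of the raw counts
    steps = math.ceil(math.log2(k + 1))
    for t in range(1, steps + 1):
        # Median-normalize the current (partially smoothed) profiles, then
        # apply the FTT; F holds one transformed vector per cell.
        sizes = [sum(S[i][j] for i in range(p)) for j in range(n)]
        target = median(sizes)
        F = [[ftt(S[i][j] * target / sizes[j]) for i in range(p)]
             for j in range(n)]
        k_step = min(2 ** t - 1, k)
        new_S = [[0] * n for _ in range(p)]
        for j in range(n):
            # Rank all cells by Euclidean distance to cell j (cell j first).
            order = sorted(range(n),
                           key=lambda u: (u != j, math.dist(F[j], F[u])))
            for u in order[:k_step + 1]:
                for i in range(p):
                    new_S[i][j] += X[i][u]  # always aggregate the raw counts
        S = new_S
    return S

# Toy example: two genes, four cells forming two clearly separated pairs.
X = [[10, 12, 0, 1],
     [0, 1, 11, 9]]
print(knn_smooth(X, 1))  # [[22, 22, 1, 1], [1, 1, 20, 20]]
```

Note that neighbors are determined from the partially smoothed matrix of the previous step, whereas the aggregation always uses the raw counts, so the output remains a sum of original UMI counts.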

### Application of kNN-smoothing to scRNA-Seq data of human peripheral blood mononuclear cells recovers robust expression profiles for diverse immune cell populations

As a second test of our algorithm, we applied kNN-smoothing to a dataset containing scRNA-Seq data for 4,340 peripheral blood mononuclear cells (PBMCs), obtained using the 10x Genomics “Chromium” protocol (the PBMC dataset; see Methods). PBMCs can easily be obtained from peripheral blood, have been studied extensively, and contain a diverse set of immune cell types (Kleiveland 2015), thus enjoying popularity as a point of reference for scRNA-Seq studies (e.g., Zheng et al. 2017; Gierahn et al. 2017). The identification and characterization of immune cell types in peripheral blood using scRNA-Seq is also an active area of investigation (e.g., Villani et al. 2017). Since the PBMC dataset contained significantly more cells than the PANCREAS dataset, and the expression profiles exhibited significantly higher complexity (i.e., expression levels were less concentrated on a few highly expressed genes; data not shown), we chose to apply more aggressive smoothing using *k*=127. We compared the results of PCA applied before and after smoothing, and found that, again, smoothing significantly improved the compactness of cell type clusters in principal component space, and strongly increased the fraction of variance explained by the first two PCs, this time from 16.6% to 70.4%. Moreover, using expression thresholds for individual marker genes (see below), we were able to assign one of four major cell type identities (T cells, CD14 monocytes, B cells, and dendritic cells) to 93% of all cells in the smoothed data. However, in the unsmoothed data, the technical noise was so strong that only 40% of the cells could be assigned an identity using the same expression thresholds (see Figure 5a).

Next, we performed hierarchical clustering after filtering for the 1,000 most variable genes, visualized the results as a heatmap, and obtained several easily distinguishable clusters of cells and genes, providing an overview of the heterogeneity in the data (see Figure 5b). Repeating the same clustering procedure on the unsmoothed data produced much less coherent clusters (see Figure S2). We compared the smoothed and unsmoothed data within a small region of the heatmap in a side-by-side comparison and observed that smoothing dramatically reduced the apparent noise levels, while largely preserving differences between similar sets of cells (see Figure 5c). Finally, we compiled a list of marker genes for the major cell types found in PBMC samples, including T cells, monocytes, B cells, NK cells, and dendritic cells (see Methods). In comparing the expression patterns of these genes across cells ordered according to the hierarchical clustering results, we found that smoothing resulted in vastly more stable expression patterns, while the expression of each marker gene appeared to remain confined to a specific subset of cells. A comparison with the full heatmap suggested that within most cell types, there existed significant population substructure. For example, several distinct clusters of cells were apparent among the set of T cells expressing *CD3D* and *CD3E*, which likely distinguish specific subsets such as CD4+ and CD8+ T cells, or naive and memory T cells. However, a more detailed analysis of the individual immune cell subsets was beyond the scope of this work. In summary, the application of aggressive smoothing (with *k*=127) to PBMC data led to significant improvements in the ability to cluster cells by their cell type, and produced stable and cell type-specific expression patterns for marker genes, thus demonstrating the applicability of kNN-smoothing to data generated using 10x Genomics’ high-throughput scRNA-Seq solution.

### Comparison with other smoothing methods on simulated datasets shows strongly improved performance of kNN-smoothing

To quantitatively compare the accuracy of kNN-smoothing with that of other smoothing methods, we devised an approach for simulating scRNA-Seq datasets containing a mixture of cell types. Our idea was to base each simulation on a real scRNA-Seq dataset, in order to make the simulated data as similar to real scRNA-Seq expression data as possible, both biologically and technically. To ensure biological similarity, we simulated clusters with expression profiles obtained from the real data, based on hierarchical clustering results. To ensure technical fidelity, we simulated Poisson-distributed sampling noise, modeled on top of efficiency noise, the distribution of which was again obtained from the real data (see Methods for details). We generated two datasets, SIM-PANCREAS (based on the PANCREAS dataset) and SIM-PBMC (based on the PBMC dataset). A visual comparison based on clustered heatmaps illustrated the similarity between real and simulated scRNA-Seq data (see Figures S3 and S4). We then applied kNN-smoothing, MAGIC (Dijk et al. 2017), and scImpute (W. V. Li and J. J. Li 2017) to the two datasets, and quantified the similarity of the results to the true cluster profiles from which the cell expression profiles were generated.

We tested different parameter settings for each method, and observed that as expected, the choice of *k* had a large effect on the accuracy of the results obtained with kNN-smoothing (see Figure 6). However, for all values of *k* ≥ 15 that we tested (up to *k*=511), kNN-smoothing outperformed MAGIC and scImpute on both datasets by a large margin, independently of the way in which we quantified accuracy. We first quantified the relative accuracy of each cell’s expression profile by calculating its Pearson correlation coefficient (PCC) with the true cluster expression profile, on log_{2}-transformed data. For kNN-smoothing with *k*=15, the median PCC across all cells in the SIM-PANCREAS dataset was approx. 0.93. For *k*=63, it was approx. 0.98. In contrast, the best values obtained by MAGIC and scImpute across all parameter settings were approx. 0.85 and 0.87, respectively (see Figure 6a). These differences were even more pronounced for the SIM-PBMC dataset (see Figure 6c), and when we quantified absolute accuracies by root-mean squared error (RMSE) on log-transformed data (see Figure 6b,d). We then quantified accuracies, using both PCC and RMSE, on square root-transformed data instead of log_{2}-transformed data. This resulted in slightly smaller absolute differences, but we again observed that kNN-smoothing clearly outperformed the other methods for *k* ≥ 15 (see Figure S5).
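The two accuracy measures used above can be sketched for a single cell as follows. The pseudocount of 1 in the log transform and the function names are illustrative assumptions, since the exact transform is specified in the Methods and not here. The sketch also assumes that the smoothed profile has already been brought to the same scale as the true cluster profile (for example, by dividing aggregated counts by *k* + 1).

```python
import math

def log2p1(profile):
    """Log-transform a profile with a pseudocount of 1 (an assumption)."""
    return [math.log2(v + 1) for v in profile]

def pearson(a, b):
    """Pearson correlation coefficient of two equally long vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rmse(a, b):
    """Root-mean-squared error between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def accuracy(smoothed, truth):
    """Return (PCC, RMSE) of one smoothed profile against the true profile."""
    s, t = log2p1(smoothed), log2p1(truth)
    return pearson(s, t), rmse(s, t)

# Toy example with hypothetical expression values for four genes.
pcc, err = accuracy([0.5, 3.2, 6.8, 0.0], [1.0, 3.0, 7.0, 0.0])
print(f"PCC: {pcc:.3f}, RMSE: {err:.3f}")
```

The PCC is insensitive to a global scaling of the profile before the log transform only approximately, whereas the RMSE penalizes any scale mismatch directly, which is why the two measures capture relative and absolute accuracy, respectively.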

Our evaluation of kNN-smoothing on simulated data also showed that up to a certain point, choosing larger values of *k* produced increasingly accurate expression profiles. In fact, the median PCC for *k*=511 was very close to 1 in the SIM-PBMC dataset (see Figure 6c). However, the best median PCC for the SIM-PANCREAS dataset was obtained for *k*=255, and a significant fraction of cells exhibited much lower accuracies for *k*=255 and *k*=511 compared to *k*=127 (see Figure 6a). This apparent “over-smoothing” was not surprising, since a significant fraction of cells in the SIM-PANCREAS dataset belonged to clusters that were represented by less than 256 cells. Therefore, some of the 255 neighbors selected for these cells had to belong to other clusters, and using their expression values for smoothing resulted in less accurate expression profiles. To confirm that cluster size determined whether or not cells benefitted from smoothing with very large *k*, we examined the average accuracies of cells from the three largest and smallest clusters for different *k*. In both datasets, we observed that as predicted, accuracies started to drop off whenever *k* was chosen larger than the cluster size (see Figure 6e,f).

To obtain a more detailed view of the results of kNN-smoothing, MAGIC, and scImpute, we selected a representative cell from the largest cluster in the PANCREAS dataset (*n*=662), and examined the correlation of the smoothed profiles with the true cluster profile using scatter plots. For kNN-smoothing, we examined the results for *k*=15 and *k*=511, whereas for MAGIC and scImpute, we picked the parameter settings that achieved the best median PCC across all cells. The correlations for this particular cell mirrored the overall results (see Figure 6g-j), which showed that kNN-smoothing with either setting of *k* produced more highly correlated profiles than either of the two other methods. However, whereas the PCC for both MAGIC and scImpute was 0.88, the values reported by MAGIC were merely noisy and non-linear, while the scImpute results also exhibited some obvious smoothing artifacts (see Figure 6j).

Finally, we observed that for *k*=3, the median PCC of kNN-smoothing was sometimes lower than that for *k*=1. We believe this surprising result is related to size biases by the algorithm in the selection of neighbors (cells) to be used for smoothing (further discussed below). In conclusion, our evaluation of different smoothing methods on two simulated datasets showed that kNN-smoothing outperformed the other methods by a large margin for most choices of *k*, and in some cases recovered cell expression profiles with near-perfect accuracy.

### Other variants of kNN-smoothing are less accurate and exhibit stronger size selection bias in simulated datasets

In the design of our smoothing algorithm, we made several decisions based on theoretical considerations, as well as our intuitions. We therefore aimed to examine whether the performance of the resulting algorithm retrospectively validated these decisions. First, we aimed to compare the kNN-smoothing algorithm to a variant in which neighbors are identified in a single step, as opposed to a step-wise approach. Second, we aimed to test whether the choice of calculating cell-cell distances on median-normalized and FT-transformed data performed better than using a more commonly employed log-TPM transform. We refer to these two variants as the “one-step” variant and the “log-TPM” variant, respectively.

To test the accuracy of the different variants of the smoothing algorithm, we again relied on our simulated datasets (see above), and determined, for a range of different *k*, the fraction of cells with incorrect neighbors for each variant. We found that the log-TPM variant performed very poorly in both datasets, resulting in approximately 80% and 20%, respectively, of cells having an incorrect neighbor even for *k* = 1 in SIM-PANCREAS and SIM-PBMC (see Figure 7a,b). The “one-step” variant performed generally worse than the step-wise variant, with the exception of *k* = 15 and *k* = 31 in the SIM-PBMC dataset.
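The neighbor-accuracy measure used above can be sketched as follows. The sketch assumes that a cell counts as having incorrect neighbors if at least one of its selected neighbors comes from a different simulated cluster. This definition, the function name, and the toy inputs are illustrative assumptions.

```python
def fraction_with_incorrect_neighbors(neighbors, labels):
    """Fraction of cells with at least one neighbor from another cluster.

    neighbors[j] lists the indices of the neighbors selected for cell j
    (excluding j itself); labels[j] is the true cluster of cell j.
    """
    bad = sum(
        1
        for cell, nbrs in enumerate(neighbors)
        if any(labels[u] != labels[cell] for u in nbrs)
    )
    return bad / len(neighbors)

# Toy example: four cells in two clusters, one neighbor each (k = 1).
labels = ["a", "a", "b", "b"]
neighbors = [[1], [0], [3], [0]]  # cell 3 was matched to a cell from cluster "a"
print(fraction_with_incorrect_neighbors(neighbors, labels))  # 0.25
```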

Over the course of our simulation experiments, we noticed that the average “sizes” (total UMI counts) of the smoothed “cells” (expression profiles) sometimes deviated significantly from the true UMI count of each cluster, which could only be explained by a size bias in the way in which neighbors were selected for each cell (the sizes of cells belonging to the same cluster varied due to our simulation of efficiency noise; see Methods). To examine whether kNN-smoothing and the two variants exhibited different size biases, we compared the distribution of smoothed profile sizes for a range of different *k*, focusing only on cells from the largest cluster in each dataset (see Figure 7c,d). We found that the algorithms exhibited strikingly different behaviors. Most notably, the one-step variant exhibited a strong systematic bias towards selecting “large” cells as neighbors (i.e., cells with a large total UMI count), resulting in smoothed cells that on average contained a much larger UMI count than the cluster profile that was used as the basis for the simulation of these cells. Since the first step of kNN-smoothing is identical to that of one-step smoothing with *k*=1, it shared this bias for large cells in its first step. Astonishingly, the opposite was true for neighbors selected in its second step (*k* = 3), when smoothed cells exhibited smaller-than-average sizes. However, by the fourth step (*k* = 15), the average sizes were very close to the true cluster values in both datasets. The log-TPM variant exhibited similar behavior, but the distribution of sizes was generally much more spread out. Based on theoretical considerations, we think that it is undesirable for an algorithm to exhibit an overly strong size bias, as it will make very uneven use of the information available (see Discussion). 
We therefore believe that the near-convergence of the average cell size to the true cluster UMI count, as achieved by the kNN-smoothing algorithm for *k* ≥ 15, represents a desirable property that again makes kNN-smoothing preferable to the algorithm variants examined. In summary, our evaluation of the effects of our initial design decisions validated those decisions, as they resulted in an algorithm that provides more accurate results, and makes more even use of information from cells that differ in their total UMI counts (e.g., due to efficiency noise).

### A Python implementation of kNN-smoothing processes datasets containing thousands of cells within a few minutes

For an analysis method to be of practical use, it not only needs to provide accurate results, but it must also finish in a reasonable amount of time. We therefore measured the runtime of our Python implementation of kNN-smoothing on Chromium PBMC data containing 21,425 expressed genes, using subsampling to test datasets with sizes ranging from *n*=2,000 to *n*=8,000 cells, on a laptop with an Intel^{®} Core™ i7-6600U processor and 20 GiB of memory (see Methods). We found that the runtime ranged from a few seconds to just over 14 minutes (for *k*=511 and *n*=8,000), and that runtime increased linearly with *k* (see Figure 8a). The two phases of the algorithm have different time complexities with respect to *n*: The identification of neighbors has a complexity of *𝒪* (*n*^{2}) (as it requires the calculation of distances between all pairs of cells), whereas the smoothing part has a complexity of *𝒪* (*n*) (as it simply requires the aggregation of UMI counts for all cells). Accordingly, we observed that as the size of the dataset increased, the first phase (identification of neighbors) consumed an increasingly large fraction of the total runtime (data not shown).

We also calculated the memory footprint of our Python implementation, which requires three copies of the expression matrix (original, smoothed, smoothed and transformed) and two *n*-by-*n* arrays (the distance matrix and a sorted indexing array) to be held in memory. We assumed that each expression measurement would be represented in memory by an 8-byte floating point value. From the results in Figure 8b, it appears that for datasets containing approx. 20,000 protein-coding genes, the largest datasets that can be analyzed (without memory swapping) contain approx. 5k, 10k, and 20k cells, for computers with 4 GiB, 8 GiB, and 16 GiB of memory, respectively. Overall, these results demonstrate that kNN-smoothing can be run on most laptops and PCs for datasets containing several thousand cells, in a time-span of minutes or even seconds.
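The memory estimate described above can be reproduced with a short calculation. This is a rough sketch: it assumes that the entries of the sorted indexing array also occupy 8 bytes each, and it ignores interpreter and temporary-array overhead; the function name is hypothetical.

```python
def knn_smoothing_memory_gib(n_genes, n_cells, bytes_per_value=8):
    """Approximate memory footprint in GiB for the arrays listed above."""
    # Three genes-by-cells matrices: original, smoothed, smoothed + transformed.
    expression_values = 3 * n_genes * n_cells
    # Two cells-by-cells arrays: distance matrix and sorted indexing array
    # (assumed here to use 8 bytes per entry as well).
    pairwise_values = 2 * n_cells * n_cells
    return (expression_values + pairwise_values) * bytes_per_value / 2**30

for n_cells in (5_000, 10_000, 20_000):
    print(n_cells, round(knn_smoothing_memory_gib(20_000, n_cells), 1))
```

Under these assumptions, 20,000 genes with 5,000, 10,000, and 20,000 cells require roughly 2.6, 6.0, and 14.9 GiB, which fits within 4, 8, and 16 GiB of memory, respectively.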

## DISCUSSION

### Importance of smoothing for the analysis of scRNA-Seq data

In this work, we have proposed *k-nearest neighbor smoothing* (kNN-smoothing), a novel algorithm for smoothing high-throughput scRNA-Seq data, aimed at significantly improving the signal-to-noise ratios of the gene expression values for each cell by aggregating information from similar cells (“neighbors”). It might appear that by smoothing single-cell data, one is compromising on important information pertaining to the individuality of each cell. We note that while cell-to-cell variation within a given cell type is of clear importance, in most applications one is querying for cell populations that are each represented by an appreciable number of cells. Thus, given the routine profiling of thousands or even tens of thousands of cells, and the inherent noisiness of the data under study, our smoothing algorithm offers a clear advantage in terms of the identification of those populations.

We designed the kNN-smoothing algorithm based on the observation that data from multiple high-throughput scRNA-Seq protocols (including inDrop, Drop-seq, and 10x Genomics’ Chromium) share common technical noise characteristics. Specifically, after the application of “median-normalization” to account for efficiency noise, the gene expression values in technical replicates are approximately Poisson-distributed. We believe that this is a direct consequence of the fact that all of these protocols only capture a small fraction of transcripts of each cell, employ 3’- or 5’-end counting (“tagging”), and avoid overcounting of amplified transcripts by UMI-filtering. Therefore, we predict that the Poisson noise characteristic applies to all such scRNA-Seq protocols that use UMI filtering, but not to other scRNA-Seq protocols. This idea clearly warrants a more detailed investigation, which is beyond the scope of this paper. Whatever the origins of the noise characteristics described here, the fact that they are shared between the aforementioned protocols implies that our proposed algorithm is in principle applicable to any dataset generated using those protocols.

We have demonstrated the application of kNN-smoothing to data generated using the inDrop (Klein et al. 2015) and Chromium (Zheng et al. 2017) protocols, and shown that in both cases, the algorithm was able to recover cell type-specific expression patterns for previously described marker genes. Moreover, the achieved noise reduction made it straightforward to apply hierarchical clustering (Eisen et al. 1998), a powerful method for exploratory analysis of gene expression data that performs poorly on unsmoothed scRNA-seq data. It also resulted in principal components capturing much larger fractions of total variance, and led to a significantly improved separation of individual cell populations along the first two principal components. This implies that kNN-smoothing has the potential to improve the performance of many advanced analysis methods that rely on PCA or other dimensionality reduction techniques, including methods for systematic exploratory analysis (e.g., Wagner 2015) and trajectory inference (e.g., Cao et al. 2017). Importantly, kNN-smoothing works by aggregating information across cells, rather than across genes. Therefore, it avoids the introduction of artificial gene-gene dependencies, which are highly problematic when downstream analyses involve methods whose null models assume independence between genes, such as GO enrichment analysis (Subramanian et al. 2005; Eden et al. 2009). At the same time, kNN-smoothing clearly introduces dependencies between cells. Naturally, the extent to which this is the case depends on the magnitude of *k*.

Recently, researchers and funding bodies have proposed the generation of “cell atlases”, systematic efforts aimed at providing exhaustive molecular descriptions of all cell types and states present in human tissues under healthy as well as disease conditions such as cancer (Regev et al. 2017; *National Cancer Institute* 2017; *The Chan Zuckerberg Initiative* 2018). As scRNA-Seq is generally seen as an important experimental methodology for the realization of these projects, kNN-smoothing could represent a valuable analysis tool for the identification of novel cell types and states, as well as for the characterization of their expression profiles.

### How to choose *k*?

The results obtained when applying kNN-smoothing to a particular dataset strongly depend on the choice of *k*. Choosing *k* very small might not adequately reduce noise. On the other hand, choosing *k* too large incurs the risk of smoothing over biologically relevant expression heterogeneity. Moreover, large *k* can also lead to artifactual expression profiles that consist of averages of profiles belonging to different cell populations. Our method provides no guarantee that a smoothed expression profile accurately reflects an existing cell population. During the exploratory phase of data analysis, we therefore recommend testing different choices of *k*. When a signal of interest has been identified (such as a gene-gene correlation, a cluster of cells, an expression signature, etc.), it can be determined what minimum value of *k* is required in order to obtain this signal. When this value is large, adequate controls should be performed to ensure that the observed signal is not a smoothing artifact.

An appropriate choice of *k* also depends on the particular application: When analyzing cells undergoing a highly dynamic process (e.g., differentiation), large values of *k* might result in an overly coarse picture of the transcriptomic changes. In contrast, when aiming to distinguish distinct cell types, larger choices of *k* can help identify robust expression profiles for each type.

### Comparison with previously reported methods

Our algorithm combines a previously proposed normalization method (Grün, Kester, and Oudenaarden 2014) with a standard variance-stabilizing transformation (VST) for Poisson-distributed data (Freeman and Tukey 1950). We are not aware of prior work suggesting the use of a VST in the context of smoothing scRNA-Seq data. Instead, most work has focused on parametric modeling (see Introduction). While these approaches can certainly be effective, our work suggests that they are not strictly necessary to effectively address the issue of noise in scRNA-Seq data. Moreover, sophisticated models often require complex inference procedures, which can be difficult to implement correctly and efficiently. In contrast, our method requires only a few lines of code, while still being based on statistical theory, and our Python implementation runs in a matter of seconds or minutes on datasets containing a few thousand cells.

Simple aggregation or averaging of scRNA-Seq expression profiles has been previously employed in specific contexts, for example for library size normalization (Lun, Bach, and Marioni 2016). Recently, La Manno et al. (2017) employed a simple version of k-nearest neighbor smoothing (“pooling”) as part of a method designed to estimate the time derivative of mRNA abundance based on unspliced RNA sequences. The authors defined the most similar cells based on log-transformed data (for read counts from the SMART-Seq2 protocol), or PCA-transformed data (for UMI counts from inDrop and 10x Genomics protocols). However, they did not provide any justification for their choices of similarity metrics, a discussion of the statistical properties of the data before and after smoothing, or a quantification of the gain in expression accuracies achieved. Moreover, neither of these studies aimed to develop a general-purpose method to improve the signal-to-noise ratio of scRNA-Seq data, or employed a step-wise approach for defining the nearest neighbors, as we have done here. Our work can be compared to other recently proposed methods that aim to specifically address the issue of technical noise in scRNA-Seq data: Dijk et al. (2017) aimed to apply the idea of manifold learning using diffusion maps to scRNA-Seq data (see Supplementary Text for a demonstration of kNN-smoothing on one of the datasets analyzed in their study), and W. V. Li and J. J. Li (2017) developed an algorithm that borrows information among similar cells in order to “impute” the expression values of genes that in many cells exhibit UMI counts of exactly zero (“missing values”). Aside from the clear methodological differences between these two methods and kNN-smoothing, it is noteworthy that the respective study authors also made completely different assumptions about the noise characteristics of scRNA-Seq data. For their simulation studies, neither Dijk et al. (2017) nor W. V. Li and J. J. Li (2017) generated Poisson-distributed expression data. Dijk et al. (2017) started from bulk microarray expression data, which was then “downsampled using an exponential distribution” to obtain specific proportions of zero values, while W. V. Li and J. J. Li (2017) defined gene-specific “dropout rate[s]”, and set individual expression values to zero using Bernoulli trials with those rates. Based on the results presented in this work, we believe that neither of these approaches faithfully reproduces the noise characteristics of UMI-filtered scRNA-Seq data.

### Use of simulation studies to quantify the accuracy of scRNA-Seq smoothing methods

As scRNA-Seq is currently the only technology that can be used to interrogate complete transcriptomes of single cells in a highly parallelized fashion, there exist no “gold standard” datasets to benchmark scRNA-Seq smoothing algorithms (i.e., datasets that contain a heterogeneous mixture of cells whose true single-cell expression profiles have been determined using an orthogonal method). Therefore, one must resort to simulation studies in order to quantitatively assess the accuracies of smoothing methods. Here, we established a new method for using real scRNA-Seq datasets to simulate UMI-filtered scRNA-Seq data that consist of a mixture of cell types (clusters). The simulated data exhibit Poisson-distributed sampling noise, modeled on top of efficiency noise, for which we used the observed distribution of total UMI counts per cell in the real data. (This might result in an overestimate of efficiency noise, as some differences in total UMI counts could also reflect biological differences in total mRNA abundance and/or cell size.) Our methodology is based on the understanding of the sources and characteristics of technical noise in UMI-filtered scRNA-Seq data as described in this work, and a visual comparison between the real and the synthetic datasets led us to conclude that it can also reproduce the majority of the biological heterogeneity observed in the real dataset. For the analyses reported here, we decided to limit the simulations to *K* =10 clusters, but the procedure is compatible with any integer choice of *K* for 1 ≤ *K* ≤ *n* (where *n* is the number of cells in the real data), and the use of hierarchical clustering ensures consistency between datasets generated using similar choices of *K* (e.g., for *K* = 11, one of the clusters present in the *K* = 10 dataset would be split into two distinct clusters, while all other clusters remain identical).

Based on the simulated data, we were able to show that with *k* ≥ 7, kNN-smoothing produced much more accurate results for both simulated datasets, when compared to MAGIC (Dijk et al. 2017) and scImpute (W. V. Li and J. J. Li 2017). This was true for all MAGIC and scImpute parameter settings tested, independently of whether we quantified accuracy using relative (PCC) or absolute (RMSE) measures, and independently of whether we used log_{2}-transformed or square root-transformed expression values in these calculations. In some cases, kNN-smoothing was able to recover the true expression profile with near-perfect accuracy, which we never observed for either of the two other methods. Our results therefore suggest that kNN-smoothing generally outperforms MAGIC and scImpute on UMI-filtered scRNA-Seq data containing highly heterogeneous cell populations.

A limitation of our approach to simulating scRNA-Seq data is that it ignores certain biological sources of heterogeneity: For example, cells from the same cell type might be in different cell cycle phases, and these differences would be lost (averaged out) as part of the simulation procedure. More generally, our current approach is unable to simulate datasets that contain a mixture of cells from different stages of a continuous dynamic process (such as cell differentiation), and procedures that can simulate UMI-filtered scRNA-Seq data for those types of experiments need to be established in order to quantitatively evaluate the performance of smoothing methods in such scenarios.

### Implications for study design

Based on the work described here, it is tempting to speculate that in theory, there is no limit as to how accurately the average expression profile of individual cell populations and sub-populations can be determined using scRNA-Seq. Our analysis suggests that the signal-to-noise ratio can always be improved by aggregating more profiles from “biologically identical” cells. In practice, however, the number of cells that can be analyzed is limited by the protocol used, the cost of the experiment, the number of cells available, and/or the rarity of the population of interest. Furthermore, the accuracy with which “biologically identical” cells can be identified based on their noisy profile depends on several factors, including the granularity required (e.g., can cells in different cell cycle stages be considered identical for the purpose of the analysis?), and the precise measure of similarity adopted. When the transcriptomic differences between cell populations of interest become too small to allow a reliable identification of neighbors, it is not clear how to perform smoothing and extract the true biological signal. In this work, we have determined similarity on the basis of the expression of all genes, but restricting this calculation to a subset of genes or employing different distance metrics could be more appropriate in certain settings.

More generally, the quadratic relationship between “cell coverage” (loosely defined as the average number of profiles obtained for each cell population) and potential quantification accuracy brings into focus the question of what constitutes an optimal number of sequencing reads per cell. While a quantitative treatment of this issue is beyond the scope of this work, it is clear that in many situations, it would be more beneficial to sequence additional cells, rather than increase the read coverage per cell. The precise optimum likely depends on numerous factors, and is difficult to determine without an examination of all the experimental, statistical, and computational factors involved in scRNA-Seq studies. However, since sequencing often represents the single most expensive part of the experiment, this question clearly warrants further investigation.

### Future directions

In this work, we have used multiple datasets to demonstrate that PCA and hierarchical clustering, two basic techniques for analyzing gene expression data, benefit strongly from kNN-smoothing. In future work, we hope to explore the effect of smoothing for additional types of analyses, including differential expression analysis, gene set enrichment analysis, or exploratory analysis using prior knowledge (Wagner 2015). We anticipate that our kNN-smoothing algorithm will benefit all of these approaches, and generally enable the more effective analysis of scRNA-Seq data in a wide variety of settings. It should again be noted, however, that smoothed expression profiles of cells are no longer statistically independent, so smoothing should not be used naively in combination with statistical tests for differential expression.

The use of a global *k* could limit the effectiveness of our algorithm in cases where different cell populations are present at very different abundances. As an extreme example, if one population constitutes 5% of all cells, and another 95%, *k* should not be chosen larger than 5% of the total number of profiles, in order to avoid artifacts. However, the expression profile of the population present at 95% could benefit from larger choices of *k*. It would therefore seem useful to automatically adjust *k* for each cell. This is the approach chosen by Dijk et al. (2017), who use the distance of a cell to its *ka*’th neighbor as an important parameter in the calculation of the smoothed profile. However, a complication associated with this approach is that different expression profiles would exhibit distinct technical noise levels, since they would be the result of aggregating or averaging over different numbers of cells. Another way to address this issue would be to cluster cells by type before performing more aggressive smoothing.

High-throughput scRNA-Seq technology is widely believed to hold enormous potential for the analysis of heterogeneous tissues and dynamic cellular processes in health and disease. However, the inherent noisiness of the data means that greater computational efforts are required in order to realize this potential. Fortunately, data from different protocols exhibit very similar statistical properties, presumably due to their shared reliance on 5’- or 3’-end counting and UMI filtering. These properties should directly inform the design of effective algorithms for smoothing and analysis of scRNA-Seq data. We have described a generally applicable, easy-to-implement approach for improving the signal-to-noise ratio of single-cell expression profiles, which promises to significantly expand the realm of possibilities for downstream analyses of scRNA-Seq data.

## METHODS

### Download and processing of inDrop pure RNA replicate data

Raw sequencing data were downloaded from SRA (experiment accession SRX863258). In this experiment by Klein et al. (2015), droplets containing pure RNA extracted from K562 cells were processed using the inDrop protocol. The downloaded data were processed using a custom pipeline. Briefly, SRA data were converted to the FASTQ format using fastq-dump. Next, the “W1” adapter sequence of the inDrop RT primer was located in the barcode mate sequence (the first mate of the paired-end sequencing), by comparing the 22-mer sequences starting at positions 9-12 in the read with the known W1 sequence, allowing at most two mismatches. Reads for which the W1 sequence could not be located in this way were discarded. The start position of the W1 sequence was then used to infer the length of the first part of the inDrop cell barcode in each read, which can range from 8-11 bp, as well as the start position of the second part of the inDrop cell barcode, which always consists of 8 bp. Cell barcode sequences were mapped to the known list of 384 barcode sequences for each read, allowing at most one mismatch. The resulting barcode combination was used to identify the cell from which the read originated. Finally, the UMI sequence was extracted, and only reads with low-confidence base calls for the six bases comprising the UMI sequence (minimum PHRED score less than 20) were discarded. The mRNA mate sequences (the second mate of the paired-end sequencing) were mapped to the human genome, release GRCh38, using STAR 2.5.3a with parameter “--outSAMmultNmax 1” and default parameters otherwise. Testing the overlap of mapped reads with exons of protein-coding genes and UMI-filtering were performed using custom Python scripts. Droplets (barcodes) were filtered for having a total UMI count of at least 10,000, resulting in a dataset containing UMI counts for 19,865 protein-coding genes across 935 droplets.
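The W1 search described above can be sketched as follows. This is an illustration rather than the custom pipeline itself: the function names are hypothetical, the W1 sequence is passed in as a parameter instead of being hard-coded, and the layout after W1 (an 8 bp second barcode part followed directly by the 6 bp UMI) is an assumption made for this example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def locate_w1(read, w1, max_mismatches=2):
    """Return the 0-based start of W1 in the barcode read, or None.

    W1 is searched at 1-based positions 9-12, i.e. directly after a
    first barcode part of 8-11 bp.
    """
    for start in range(8, 12):  # 0-based offsets 8..11
        window = read[start:start + len(w1)]
        if len(window) == len(w1) and hamming(window, w1) <= max_mismatches:
            return start
    return None

def split_barcode_read(read, w1, umi_length=6):
    """Split a barcode read into (barcode part 1, barcode part 2, UMI)."""
    start = locate_w1(read, w1)
    if start is None:
        return None  # such reads are discarded
    second_start = start + len(w1)
    barcode1 = read[:start]                       # 8-11 bp
    barcode2 = read[second_start:second_start + 8]  # always 8 bp
    # Assumed layout: the UMI directly follows the second barcode part.
    umi = read[second_start + 8:second_start + 8 + umi_length]
    return barcode1, barcode2, umi

# Placeholder 22-mer for illustration only, NOT the real W1 sequence.
W1_PLACEHOLDER = "AACCGGTTACGTTGCAATCGGA"
read = "TTTTTTTTT" + W1_PLACEHOLDER + "CCCCCCCC" + "GATTAC"
parts = split_barcode_read(read, W1_PLACEHOLDER)
```

The subsequent steps (matching each barcode part against the list of 384 known sequences with at most one mismatch, and filtering on UMI base quality) would follow the same pattern and are omitted here.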

### Download of 10x Genomics ERCC spike-in expression data

UMI counts for ERCC spike-in RNA processed using the 10x Genomics scRNA-Seq protocol (Zheng et al. 2017) were downloaded from the 10x Genomics website. The dataset consisted of UMI counts for 92 spike-ins across 1,015 droplets.

### Download of Drop-Seq ERCC spike-in expression data

UMI counts for ERCC spike-in RNA processed using the Drop-seq protocol (Macosko et al. 2015) were downloaded from GEO accession number GSM1629193. The dataset consisted of UMI counts for 80 spike-ins across 84 droplets.

### Prediction of scRNA-Seq noise characteristics based on Poisson statistics

In this paper, we initially focus on the technical variation observed in scRNA-Seq data for droplets containing identical pools of pure mRNA. Let *u*_{ij} be the observed UMI count for the *i*’th gene (or ERCC spike-in) in the *j*’th droplet, for *i* = 1,…, *p* and *j* = 1,…, *n*. Similarly, let *U*_{ij} be a random variable representing the UMI count for the *i*’th gene in the *j*’th cell. We assume that *U*_{ij} is Poisson-distributed with mean λ_{ij} = *m*_{i}*e*_{j}, where *m*_{i} is the number of mRNA molecules present for the *i*’th gene, and *e*_{j} corresponds to the capture efficiency of the scRNA-Seq protocol for the *j*’th droplet (both *m*_{i} and *e*_{j} are unknown). We further assume that *U*_{i1},…, *U*_{in} are independent, for all *i*. For the sake of simplicity, we assume that the read coverage (the number of reads sequenced per cell) is infinite, so that there are no cases in which a transcript is not observed due to limited read coverage. In practice, limited read coverage will not invalidate the Poisson assumption, but result in lower “effective” capture efficiencies.

If all *e*_{j} were identical (say, equal to *e*^{global}), then *U*_{i1},…, *U*_{in} would be i.i.d. Poisson(λ_{i}), with λ_{i} = *m*_{i}*e*^{global}. Grün, Kester, and Oudenaarden (2014) have proposed to normalize the expression profile of each cell to the median total UMI count across cells (Model I in Grün et al.), in order to counteract the differences in capture efficiency (“efficiency noise”). Median-normalization consists of calculating the total UMI count per profile (cell or droplet), *t*_{j} = Σ_{i=1}^{p} *u*_{ij}, calculating the median *t*^{med} = median{*t*_{1},…, *t*_{n}}, and then multiplying each *u*_{ij} by the factor *t*^{med}/*t*_{j}.

Based on the results by Grün et al., we hypothesized that median-normalized data would be approximately Poisson-distributed, as long as the differences in capture efficiency were not too extreme. Therefore, we let *U*^{norm}_{i1},…, *U*^{norm}_{in} represent the UMI counts for the *i*’th gene after median-normalization, and assume them to be i.i.d. Poisson(λ_{i}).

For Poisson-distributed variables, the variance is always equal to the expectation (defined by λ). Let *N*_{i} ~ Poisson(λ_{i}). For the coefficient of variation (CV) of *N*_{i}, we have:

$$CV(N_i) = \frac{\sqrt{\mathrm{Var}(N_i)}}{E(N_i)} = \frac{\sqrt{\lambda_i}}{\lambda_i} = \lambda_i^{-1/2} = E(N_i)^{-1/2}$$

Taking the logarithm on both sides gives:

$$\log CV(N_i) = -0.5 \, \log E(N_i)$$

Therefore, the relationship between log *E*(*N*_{i}) and log *CV*(*N*_{i}) is linear with a slope of −0.5. This is indicated by the gray lines in Figure 1a-f.
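The predicted slope of −0.5 can be checked numerically on simulated Poisson counts. The following sketch is not part of the original analysis; the range of means, the number of simulated droplets, and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-gene means spanning four orders of magnitude.
lambdas = np.logspace(-1, 3, 30)

# Simulate 20,000 droplets; column i holds Poisson(lambdas[i]) counts.
counts = rng.poisson(lambdas, size=(20_000, lambdas.size))

mean = counts.mean(axis=0)
cv = counts.std(axis=0, ddof=1) / mean

# Fit a straight line to log CV as a function of log mean.
slope, intercept = np.polyfit(np.log10(mean), np.log10(cv), deg=1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

For Poisson-distributed data, the fitted slope should be close to −0.5 and the intercept close to 0, up to sampling error.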

The probability of observing a count of zero for *N*_{i} is given by the Poisson PMF:

$$P(N_i = 0) = \frac{\lambda_i^{0} \, e^{-\lambda_i}}{0!} = e^{-\lambda_i}$$

The resulting values *P*(*N*_{i} = 0) = *e*^{−λ_{i}} are shown as the orange lines in Figure 1g-i.
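As a quick illustration of this zero-count prediction, the fraction of zeros in simulated Poisson counts can be compared with the exponential of the negative mean. The means, sample size, and seed below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

lambdas = np.array([0.1, 0.5, 1.0, 2.0, 5.0])  # hypothetical mean UMI counts
counts = rng.poisson(lambdas, size=(50_000, lambdas.size))

observed_zero_fraction = (counts == 0).mean(axis=0)
expected_zero_fraction = np.exp(-lambdas)  # Poisson PMF evaluated at zero

for lam, obs, exp in zip(lambdas, observed_zero_fraction, expected_zero_fraction):
    print(f"lambda = {lam:>4}: observed {obs:.3f}, expected {exp:.3f}")
```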

If a computational pipeline used to determine UMI counts reports systematically inflated values, then the median-normalized UMI counts for the *i*’th gene can be approximately represented by a scaled Poisson variable *N*′_{i} = *c* *N*_{i}, where *c* is the inflation factor. *N*′_{i} then has mean *c*λ_{i} and variance *c*^{2}λ_{i}, so for *CV*(*N*′_{i}), we have:

$$CV(N'_i) = \frac{\sqrt{c^2 \lambda_i}}{c \lambda_i} = \lambda_i^{-1/2} = \sqrt{c} \; E(N'_i)^{-1/2}$$

Taking the log on both sides gives:

$$\log CV(N'_i) = 0.5 \log c - 0.5 \log E(N'_i)$$

Therefore, the relationship between log *E*(*N*′_{i}) and log *CV*(*N*′_{i}) will still be linear, but with a y-axis intercept of 0.5 log *c* instead of 0, which is consistent with Figure 3b,e.
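The shifted intercept can be illustrated by scaling simulated Poisson counts by a constant and repeating the line fit. This is only a sketch: the inflation factor of 4 is a hypothetical value, and multiplying every count by exactly the same constant is a simplification of what a real pipeline would do.

```python
import numpy as np

rng = np.random.default_rng(2)

c = 4.0  # hypothetical inflation factor
lambdas = np.logspace(-1, 3, 30)

# Scaled Poisson counts: every count is inflated by the same factor c.
inflated = c * rng.poisson(lambdas, size=(20_000, lambdas.size))

mean = inflated.mean(axis=0)
cv = inflated.std(axis=0, ddof=1) / mean

slope, intercept = np.polyfit(np.log10(mean), np.log10(cv), deg=1)
expected_intercept = 0.5 * np.log10(c)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, "
      f"expected intercept = {expected_intercept:.3f}")
```

The slope remains close to −0.5, while the intercept moves from 0 to approximately 0.5 log *c* (about 0.30 for *c* = 4 with base-10 logarithms).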

### Prediction of the effect of aggregating scRNA-Seq expression profiles from technical replicates

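We again assume that for droplets containing identical pools of pure mRNA, the median-normalized UMI counts *U*^{norm}_{i1},…, *U*^{norm}_{in} are i.i.d. Poisson(λ_{i}). Let *A*_{i} = Σ_{j=1}^{k} *U*^{norm}_{ij} be the UMI count aggregated over *k* droplets, and *N*_{i} ~ Poisson(λ_{i}). It is clear that *A*_{i} ~ Poisson(*k*λ_{i}), so that:

$$CV(A_i) = \frac{\sqrt{k \lambda_i}}{k \lambda_i} = \frac{1}{\sqrt{k}} \, CV(N_i)$$

Similarly, for averaged UMI counts *S*_{i} = *A*_{i}/*k*:

$$CV(S_i) = \frac{\sqrt{k \lambda_i} / k}{\lambda_i} = \frac{1}{\sqrt{k}} \, CV(N_i)$$

This effect is demonstrated in Figure 2.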
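The expected noise reduction can also be illustrated by simulation: for i.i.d. Poisson counts, summing or averaging over *k* replicates should lower the CV by a factor of √*k*. The values of λ and *k*, the number of trials, and the seed are arbitrary assumptions in this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

lam, k, n_trials = 2.0, 16, 50_000  # hypothetical mean, replicates, repetitions

# Each row holds the counts of one gene in k technical replicates.
replicates = rng.poisson(lam, size=(n_trials, k))

def cv(x):
    """Sample coefficient of variation."""
    return x.std(ddof=1) / x.mean()

cv_single = cv(replicates[:, 0])        # one replicate
cv_sum = cv(replicates.sum(axis=1))     # aggregated counts
cv_mean = cv(replicates.mean(axis=1))   # averaged counts

print(f"single: {cv_single:.3f}, sum: {cv_sum:.3f}, mean: {cv_mean:.3f}, "
      f"single / sqrt(k): {cv_single / np.sqrt(k):.3f}")
```

Aggregated and averaged counts have the same CV, because dividing by a constant rescales the mean and the standard deviation equally.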

### Smoothing of scRNA-Seq expression profiles from biological samples based on Poisson statistics

In real data, genes can exhibit differential expression across cells. Therefore, we define λ_{ij} = *m*_{ij}*e*_{j}, where *m*_{ij} is the number of mRNA molecules present for the *i*’th gene in the *j*’th cell, and *e*_{j} is the capture efficiency of the scRNA-Seq protocol for the *j*’th cell. Let *U*_{ij} be a random variable representing the UMI count for the *i*’th gene in the *j*’th cell. We again assume that *U*_{ij} is Poisson-distributed with mean λ_{ij}, and that *U*_{i1},…, *U*_{in} are independent, for all *i*. Let *C*_{j} be the set of *k* nearest neighbors of the *j*’th cell, as determined in Algorithm 1. Let *C*′_{j} = *C*_{j} ∪ {*j*}. We then define the aggregated expression level *A*_{ij} = Σ_{l ∈ C′_{j}} *U*_{il}, and note that *A*_{ij} ~ Poisson(Σ_{l ∈ C′_{j}} λ_{il}). From the aforementioned discussion, it follows that if the *k* neighbors have transcriptomes that are sufficiently similar to that of the *j*’th cell, and if the efficiency noise is not too strong, then Σ_{l ∈ C′_{j}} λ_{il} ≈ (*k* + 1) λ_{ij}. Similarly, we can calculate the averaged expression level *S*_{ij} = *A*_{ij}/(*k* + 1). Then *S*_{ij} is a Poisson variable with mean Σ_{l ∈ C′_{j}} λ_{il}, scaled by a factor of 1/(*k* + 1), and therefore has the same CV as *A*_{ij}. The point here is that even if the *U*_{ij} are not identically distributed (due to expression differences and/or efficiency noise), simple aggregation or averaging will always result in Poisson-distributed (or scaled Poisson-distributed) smoothed values. The same is not true for weighted sums or averages. Let {*w*_{j0}, *w*_{j1},…, *w*_{jk}} represent weights (all positive) for the *j*’th cell and its *k* neighbors, and let *W*_{ij} be the correspondingly weighted sum of their UMI counts. Then the weighted sum *W*_{ij} is neither a Poisson nor a scaled Poisson variable, unless all weights are identical.
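The aggregation step itself can be sketched in a few lines. The example below assumes that the neighbors of each cell have already been identified (the step-wise neighbor search of Algorithm 1 is not reproduced here), and the function name and array layout are hypothetical.

```python
import numpy as np

def aggregate_neighbors(U, neighbors):
    """Sum raw UMI counts over each cell and its k neighbors.

    U:         (p, n) array of UMI counts (genes x cells).
    neighbors: (n, k) integer array; row j lists the k neighbors of
               cell j, not including j itself (assumed layout).
    Returns the aggregated counts A (p, n) and the averages A / (k + 1).
    """
    U = np.asarray(U)
    neighbors = np.asarray(neighbors)
    n, k = neighbors.shape
    # Prepend each cell's own index so that k + 1 profiles are summed.
    members = np.hstack([np.arange(n)[:, np.newaxis], neighbors])
    # U[:, members] has shape (p, n, k + 1); sum over the last axis.
    A = U[:, members].sum(axis=2)
    S = A / (k + 1)
    return A, S

# Toy example: 2 genes, 3 cells, k = 1.
U = np.array([[1, 2, 3],
              [0, 1, 0]])
neighbors = np.array([[1], [2], [0]])
A, S = aggregate_neighbors(U, neighbors)
```

Because every profile enters the sum with the same weight, the aggregated values remain sums of independent Poisson variables, which is the property emphasized above.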

### Download and processing of inDrop pancreatic islet data

Raw sequencing data were downloaded from SRA (experiment accession SRX1935938). In this experiment by Baron et al. (2016), inDrop was applied to pancreatic islet tissue from a human donor. Data were processed using the same pipeline used for the inDrop pure RNA data, and only profiles with a total UMI count of at least 1,000 were retained, resulting in a dataset containing UMI counts for 19,865 protein-coding genes across 2,109 cells. We refer to this dataset as the PANCREAS dataset.

### Download and processing of 10x Genomics Chromium (v2) peripheral blood mononuclear cell (PBMC) data

We downloaded the UMI-filtered expression matrix of the dataset titled “4k PBMCs from a Healthy Donor” from the 10x Genomics website (www.10xgenomics.com). The data were processed by 10x Genomics using the “Cell Ranger” software, version 2.1.0. A QC report of the dataset is available on the 10x Genomics website. The downloaded expression matrix contained 33,694 genes and 4,340 cells. We removed 13,921 genes that had no expression in the entire dataset, and then another 8 genes with duplicate gene names (keeping only the first instance of each gene). The final dataset contained 19,765 genes. We refer to this dataset as the PBMC dataset.

### Download and processing of mouse myeloid progenitor data

UMI counts were downloaded from GEO, accession number GSE72857. The 19 clusters for cells are available at MAGIC’s (Dijk et al. 2017) code repository: https://github.com/pkathail/magic/issues/34. 27,297 cells with cluster labels were used for performing k-nearest neighbor smoothing (see Algorithm 1), and smoothed values were normalized to TPM (UMI-filtered transcripts per million). For visualization as a heatmap in Figure S6a-b, the z-score of every gene across cells was calculated. For scatter plots in Figure S6c-e, the expression of each gene was log_{2} (TPM + 1).

### Analysis of scRNA-Seq data using principal component analysis (PCA) and hierarchical clustering

Both PCA and hierarchical clustering were performed on median-normalized and Freeman-Tukey transformed (FT-transformed) data. The procedure that we refer to as “median-normalization” is equivalent to “Model I” in Grün, Kester, and Oudenaarden (2014). It involves first calculating the median total UMI count across all cells in the dataset, and then scaling the expression profile of each cell so that its total UMI count equals this median value. More formally, for a dataset containing *p* genes and *n* cells, let *U*_{j} = (*u*_{1j},…, *u*_{pj})^{T} represent the expression profile (gene UMI counts) of the *j*’th cell (either unsmoothed, or after kNN-smoothing without dividing by *k* + 1). Let *t*_{j} = Σ_{*i*=1}^{*p*} *u*_{ij} represent the total UMI count of the *j*’th cell. Then let *t*^{med} = median {*t*_{1},…, *t*_{n}} be the median total UMI count. Median-normalization then consists of calculating scaled expression profiles *U*′_{j} = (*t*^{med}/*t*_{j}) *U*_{j}.

The Freeman-Tukey transform is a variance-stabilizing transformation for Poisson-distributed data proposed by Freeman and Tukey (1950). It is defined as *y* = √*x* + √(*x* + 1). We apply this transformation to the normalized UMI counts to ensure that, independently of gene expression level, the absolute level of technical noise is comparable between genes. Specifically, we calculate the transformed UMI counts as *y*_{ij} = √(*u*′_{ij}) + √(*u*′_{ij} + 1), where *u*′_{ij} is the median-normalized UMI count of the *i*’th gene in the *j*’th cell.
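Median-normalization followed by the Freeman-Tukey transform can be sketched in a few lines of NumPy. This is a minimal sketch under the assumption of a genes-by-cells count matrix in which every cell has a non-zero total; the function names are illustrative and not taken from the reference implementation.

```python
import numpy as np


def median_normalize(counts):
    """Scale each cell (column) so that its total equals the median total UMI count."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)  # total UMI count per cell
    return counts * (np.median(totals) / totals)


def freeman_tukey(x):
    """Freeman-Tukey transform: sqrt(x) + sqrt(x + 1)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(x) + np.sqrt(x + 1.0)


# Toy example: three cells with total counts 4, 8 and 12 (median 8).
counts = np.array([[1, 2, 3],
                   [3, 6, 9]])
transformed = freeman_tukey(median_normalize(counts))
```

After normalization, all three toy cells have a total count of 8. A count of zero maps to a transformed value of 1.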

PCA was performed on median-normalized and FT-transformed data, retaining all genes in the PANCREAS and PMBC datasets, respectively, using the sklearn.decomposition.PCA class from *scikit-learn* v0.19.1. Hierarchical clustering was also performed on median-normalized and FT-transformed data, but after filtering for the 1,000 most variable genes, using the scipy.cluster.hierarchy.linkage function from *scipy* v1.0.0. More specifically, we calculated the variance for each gene in median-normalized and FT-transformed data, and retained the 1,000 genes with the largest variance. For clustering genes, we used correlation distance, and for clustering cells, we used Euclidean distance. In both cases, we used average linkage. To visualize the clustered data as a heatmap, we re-ordered the genes and cells according to the results of the hierarchical clustering, and standardized the expression values of each gene by subtracting the mean and dividing by its sample standard deviation.
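The clustering and heatmap preparation can be sketched as follows, using the scipy.cluster.hierarchy.linkage function named above. This is an illustrative sketch, not the authors’ code: the function name, the genes-by-cells layout, and the use of leaves_list to obtain the heatmap ordering are assumptions. The input is assumed to be already median-normalized and FT-transformed, and the retained genes are assumed to have non-zero variance (otherwise the correlation distance is undefined).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list


def cluster_for_heatmap(transformed, n_genes=1000):
    """Select the most variable genes, cluster genes and cells, and standardize genes.

    transformed: genes x cells array (median-normalized and FT-transformed).
    Returns (gene indices in heatmap order, cell indices in heatmap order, z-scores).
    """
    transformed = np.asarray(transformed, dtype=float)

    # Keep the n_genes genes with the largest variance across cells.
    variances = transformed.var(axis=1, ddof=1)
    top = np.sort(np.argsort(variances)[::-1][:n_genes])
    data = transformed[top]

    # Average linkage: correlation distance for genes, Euclidean distance for cells.
    gene_order = leaves_list(linkage(data, method="average", metric="correlation"))
    cell_order = leaves_list(linkage(data.T, method="average", metric="euclidean"))
    ordered = data[gene_order][:, cell_order]

    # Standardize each gene using its mean and sample standard deviation.
    mean = ordered.mean(axis=1, keepdims=True)
    std = ordered.std(axis=1, ddof=1, keepdims=True)
    return top[gene_order], cell_order, (ordered - mean) / std
```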

### Selection of cell type-specific marker genes

For cell types in the PANCREAS dataset, we selected the same genes used by Baron et al. (2016). For the PMBC dataset, we manually selected genes based on well-known markers, a previously published analysis of scRNA-Seq PBMC data (Zheng et al. 2017), and literature searches. In particular, for monocytes, we followed known protein surface markers and selected *CD33*, a myeloid lineage marker, *CD14*, specifically expressed in monocytes, and *CD16*, expressed on a subset of monocytes, as well as certain NK cells and T cells (Naeim et al. 2013). To mark dendritic cells, we selected *FCER1A* and *CLEC9A*, both previously shown to be specifically expressed in those cells (Villani et al. 2017). For T cells, we used *CD3D* and *CD3E*, the protein products of which form a dimer of the T cell receptor complex, and are pan T cell markers (Naeim et al. 2013). We also included *CD8A* and *CD8B*, encoding two isoforms of the CD8 T cell co-receptor present on cytotoxic T cells. For NK cells, we included *NCAM1* (CD56), *NCR1* (CD335), and *KLRD1* (CD94), all of which are expressed on NK cells at the protein level (Naeim et al. 2013). Finally, for B cells, we included *CD19*, *MS4A1* (CD20), and *CD79A*, all well-known B cell markers (Naeim et al. 2013).

### Simulation of scRNA-Seq data

The SIM-PANCREAS dataset was simulated based on the PANCREAS dataset using the following approach: First, we used smoothing and hierarchical clustering to group the cells in the PANCREAS dataset into ten clusters. To do so, we applied kNN-smoothing with *k* = 31. Then, the smoothed dataset was median-normalized, and the normalized values were Freeman-Tukey transformed. Next, the dataset was filtered for the 2,000 most variable genes, and hierarchical (agglomerative) clustering was performed on the cells, using average linkage and the Euclidean distance metric. The resulting tree was cut at the appropriate height to produce ten clusters. We chose hierarchical clustering over other clustering methods because it simplifies the visualization of clustering results, and because it can ensure a certain degree of consistency between simulated datasets that only differ in terms of the number of clusters simulated.

After assigning all cells to one of ten clusters, we calculated the cluster expression profiles by averaging the expression profiles of all cells assigned to that cluster, using the original (unsmoothed) UMI counts. For each cell in PANCREAS, we then simulated a corresponding expression profile for inclusion in the SIM-PANCREAS dataset, by looking up the cluster it was assigned to, scaling the cluster expression profile to match the observed number of transcripts for that cell, and then drawing the expression value for each gene from a Poisson distribution with the corresponding λ parameter.

To formalize this procedure, let *p* be the number of genes in the PANCREAS dataset, and let *u*_{j} = (*u*_{1j},…, *u*_{pj})^{T} represent the expression profile (gene UMI counts) of the *j*’th cell (before smoothing). Let *z*_{j} ∈ {1,…, 10} represent the cluster assignment of the *j*’th cell (obtained using hierarchical clustering, as described above). For the simulation, we then define a corresponding set of 10 clusters. Let *e*_{c} = (*e*_{1c},…, *e*_{pc})^{T} represent the true expression profile of the *c*’th cluster, which we define using *e*_{ic} = (1/*n*_{c}) Σ_{*j*: *z*_{j}=*c*} *u*_{ij}, where *n*_{c} is the number of cells assigned to the *c*’th cluster. Let *t*_{j} = Σ_{*i*=1}^{*p*} *u*_{ij} represent the total UMI count of the *j*’th cell. Let *a*_{c} = (1/*n*_{c}) Σ_{*j*: *z*_{j}=*c*} *t*_{j} represent the average total UMI count for cells in the *c*’th cluster. We use this information to simulate a dataset with *n* cells. Let *u*^{sim}_{j} = (*u*^{sim}_{1j},…, *u*^{sim}_{pj})^{T} represent the expression profile (gene UMI counts) of the *j*’th cell in the simulated dataset. We obtain each *u*^{sim}_{ij} by sampling from a Poisson distribution with mean parameter λ_{ij}, where λ_{ij} = (*t*_{j}/*a*_{*z*_{j}}) · *e*_{i*z*_{j}}.
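The simulation step can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used to generate SIM-PANCREAS: the function name, the genes-by-cells layout, and the fixed random seed are illustrative, and the cluster labels are taken as given. Because the cluster profile sums to the average total count of its cells, the Poisson means of each simulated cell sum to that cell’s observed total UMI count.

```python
import numpy as np


def simulate_from_clusters(counts, labels, seed=0):
    """Simulate UMI counts from cluster-average profiles with Poisson noise.

    counts: genes x cells array of unsmoothed UMI counts.
    labels: cluster assignment of each cell.
    Returns (simulated counts, matrix of Poisson means).
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    totals = counts.sum(axis=0)  # total UMI count of each cell
    lam = np.empty_like(counts)
    for c in np.unique(labels):
        members = labels == c
        profile = counts[:, members].mean(axis=1)  # cluster expression profile
        avg_total = totals[members].mean()         # average total count in the cluster
        # Scale the cluster profile to each member cell's observed total count.
        lam[:, members] = np.outer(profile, totals[members] / avg_total)

    return rng.poisson(lam), lam
```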

The SIM-PBMC dataset was simulated based on the PMBC dataset using a completely analogous procedure.

### Comparison of the accuracies of kNN-smoothing, MAGIC, and scImpute on simulated data

We downloaded MAGIC (commit 4d5efb4) from GitHub, and installed the Python package included. We also installed the scImpute R package (v0.0.4; commit dda0441) from GitHub, using the command install_github("Vivianstats/scImpute"). We then applied both methods, as well as kNN-smoothing, to the SIM-PANCREAS dataset (testing different parameter choices; see below). For each cell in the dataset, we looked up the identity of the cluster that was used as the basis for the simulation of that cell’s expression profile. The expression profile of that cluster represented the ground truth that the smoothed expression profile should ideally be identical to. To quantify the similarity between the smoothed and the ground truth expression profile, we first applied a log_{2}-transformation to both profiles, adding a pseudocount of 1: *f*(*x*) = log_{2}(*x* + 1). We then calculated the Pearson correlation coefficient (PCC) between the smoothed and ground truth expression profiles, as well as the root-mean-square error (RMSE) between those profiles. We visualized the results using boxplots in which each value represents the PCC or RMSE of a single profile (cell) after smoothing. We also calculated PCC and RMSE for values transformed using a square root transformation instead of a log-transformation: *f*(*x*) = √*x*, and visualized the results as a boxplot. Finally, we repeated the entire procedure for the SIM-PBMC dataset.
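The two accuracy measures can be computed for a single cell as sketched below. The function name is illustrative, and the sketch assumes that the smoothed and ground truth profiles have already been brought to a common scale; how this was done is not specified here.

```python
import numpy as np


def profile_accuracy(smoothed, truth, transform="log2"):
    """Return (PCC, RMSE) between a smoothed profile and its ground truth profile."""
    smoothed = np.asarray(smoothed, dtype=float)
    truth = np.asarray(truth, dtype=float)

    if transform == "log2":
        a, b = np.log2(smoothed + 1.0), np.log2(truth + 1.0)
    elif transform == "sqrt":
        a, b = np.sqrt(smoothed), np.sqrt(truth)
    else:
        raise ValueError("transform must be 'log2' or 'sqrt'")

    pcc = np.corrcoef(a, b)[0, 1]  # undefined (NaN) if either profile is constant
    rmse = np.sqrt(np.mean((a - b) ** 2))
    return pcc, rmse
```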

For MAGIC, we varied the *t* parameter between 1 and 9, while setting the other parameters to the values recommended in the tutorial provided by the authors of this method: n_pca_components=20, k=30, ka=10. We reasoned that of all parameters, *t* has by far the strongest effect on the smoothing results, as it is the power to which the Markov affinity matrix is raised. *t* can also be interpreted as the length of a random walk, and larger values of *t* therefore lead to much stronger smoothing (Dijk et al. 2017). For scImpute, we decided to vary both *t* and *K*. In this paper, we refer to *t* as *d*, in order to avoid confusion with MAGIC’s *t* parameter. *d* is the dropout probability threshold that determines the set of genes which will have their expression values imputed. *K* is the number of clusters that determines the sets of candidate neighbors, used to build statistical models to estimate dropout probabilities for each gene (W. V. Li and J. J. Li 2017).

We applied MAGIC using its Python interface (function SCData.run_magic), in accordance with the tutorial. We noticed that MAGIC dropped all genes that had no expression in any cell in the simulated datasets, and therefore took care to add these genes back (with zero values) to the smoothed matrix, in order to ensure an unbiased comparison with the other methods (additional or missing zero values change the value of the PCC). We applied scImpute using its R interface (function scimpute). It is noteworthy that while the runtime of MAGIC was comparable to that of kNN-smoothing (usually finishing within seconds or minutes), scImpute routinely took several hours to finish, even when using 4 CPU cores (ncores=4).
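Adding dropped genes back to a smoothed matrix can be sketched as follows. The function name and the genes-by-cells layout are assumptions made for illustration, and the sketch is independent of MAGIC’s actual output format.

```python
import numpy as np


def restore_dropped_genes(smoothed, kept_genes, all_genes):
    """Re-insert genes that were dropped during smoothing as all-zero rows.

    smoothed: genes x cells array containing only the genes in `kept_genes`.
    all_genes: the full gene list, in the order of the original dataset.
    """
    smoothed = np.asarray(smoothed, dtype=float)
    index = {gene: i for i, gene in enumerate(all_genes)}

    full = np.zeros((len(all_genes), smoothed.shape[1]), dtype=float)
    rows = [index[gene] for gene in kept_genes]
    full[rows] = smoothed  # dropped genes keep their zero values
    return full
```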

### Measuring the runtime of the kNN-smoothing Python implementation

To measure the runtime of our kNN-smoothing Python implementation, we downloaded the UMI-filtered gene expression matrix of the dataset titled “8k PBMCs from a Healthy Donor” from the 10x Genomics website. After filtering for genes with expression and removing duplicates (analogous to our processing of the PMBC dataset), we obtained a dataset containing 21,425 genes and 8,381 cells. To test the runtime of kNN-smoothing, we randomly sampled *n*=2,000, *n*=4,000 and *n*=8,000 cells (without replacement) and measured the runtime (wall time) of the algorithm for different settings of *k*. For each combination of *n* and *k*, we repeated this procedure three times. All tests were performed using Python v3.5.4 on Ubuntu^{®} 17.10.
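A benchmark of this kind can be sketched as follows. This is an illustrative sketch only: time_smoothing is a hypothetical helper, and smooth_fn stands in for any smoothing function that accepts a genes-by-cells matrix and a value of *k*; neither name is taken from the reference implementation.

```python
import time

import numpy as np


def time_smoothing(smooth_fn, matrix, n, k, repeats=3, seed=0):
    """Measure the wall time of smooth_fn on random subsets of n cells.

    smooth_fn: callable taking (genes x cells matrix, k); a stand-in for the smoother.
    Returns one wall-time measurement (in seconds) per repeat.
    """
    matrix = np.asarray(matrix)
    rng = np.random.default_rng(seed)

    runtimes = []
    for _ in range(repeats):
        # Sample n cells without replacement.
        cells = rng.choice(matrix.shape[1], size=n, replace=False)
        subset = matrix[:, cells]

        start = time.perf_counter()
        smooth_fn(subset, k)
        runtimes.append(time.perf_counter() - start)
    return runtimes
```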

## ACKNOWLEDGMENTS

We would like to thank Bo Xia, Maayan Baron, and Dr. Gustavo Franca for helpful discussions.