Abstract
Single-cell RNA-Sequencing (scRNA-Seq) has become the most widely used high-throughput method for transcription profiling of individual cells. Systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies. Surprisingly, these issues have received minimal attention in published studies based on scRNA-Seq technology. We examined data from all fifteen published studies including at least 200 samples and found that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we found that the proportion of genes reported as expressed explains a substantial part of observed variability and that this quantity varies systematically across experimental batches. Furthermore, we found that experimental designs that confound outcomes of interest with batch effects are common. Finally, we propose a simple experimental design that can ameliorate the effect that these systematic errors have on downstream results.
Single-cell RNA-Sequencing (scRNA-Seq) has become the primary tool for profiling the transcriptomes of hundreds or even thousands of individual cells in parallel. Our experience with high-throughput genomic data in general is that well-thought-out data processing pipelines are essential to produce meaningful downstream results1-3. We expect the same to be true for scRNA-Seq data. Here we show that while some tools developed for analyzing bulk RNA-Seq, such as the mapping and alignment software, can be used for scRNA-Seq data, other processing steps, such as normalization, quality control and quantification, require new methods to account for the additional variability that is specific to this technology.
One of the most challenging sources of unwanted variability and systematic error in high-throughput data is what is commonly referred to as batch effects. Given the way that scRNA-Seq experiments are conducted, there is much room for concern regarding batch effects4. Specifically, batch effects occur when cells from one biological group or condition are cultured, captured and sequenced separately from cells in a second condition. Although batch information is not always included in the experimental annotations that are publicly available, one can extract surrogate variables from the raw sequencing (FASTQ) files5, namely the sequencing instrument used, the run number from the instrument and the flow cell lane. Although the sequencing itself is unlikely to be a major source of unwanted variability, it serves as a surrogate for other experimental procedures that very likely do have an effect, such as starting material, PCR amplification reagents/conditions, and cell cycle stage of the cells6-8. Here we refer to the differences induced by different groupings of these sources of variability as batch effects.
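As a rough illustration of how such surrogate variables could be read from a FASTQ file, the following Python sketch parses the sequence identifier line of each record. It assumes the colon-separated Illumina layout `@instrument:run:flowcell:lane:tile:x:y`; headers written by older pipelines, or rewritten by a public archive, may not follow this layout. The function names and the example record are hypothetical.

```python
from collections import Counter


def parse_illumina_header(header):
    """Split a FASTQ sequence identifier into surrogate batch variables.

    Assumption: the header follows '@instrument:run:flowcell:lane:tile:x:y',
    optionally followed by a space and a comment field.
    """
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    fields = header[1:].split(" ", 1)[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    instrument, run, flowcell, lane = fields[:4]
    return {"instrument": instrument, "run": run,
            "flowcell": flowcell, "lane": lane}


def batch_labels(fastq_lines):
    """Count reads per (instrument, run, flowcell, lane) combination."""
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 0:  # the header is the first line of each 4-line record
            info = parse_illumina_header(line.strip())
            key = (info["instrument"], info["run"],
                   info["flowcell"], info["lane"])
            counts[key] += 1
    return counts


# Hypothetical record, for illustration only.
example = [
    "@HWI-ST1234:42:C1ABCACXX:3:1101:1234:2045 1:N:0:ATCACG",
    "ACGT",
    "+",
    "IIII",
]
labels = batch_labels(example)
```

Cells whose reads share the same combination of instrument, run and lane can then be treated as one surrogate batch when reconstructing a study design.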
In a completely confounded study, it is not possible to determine if the biological condition or batch effects are driving the observed variation. In contrast, incorporating biological replicates in the experimental design and processing the replicates across multiple batches permits observed variation to be attributed to biology or batch effects (Figure 1). To demonstrate the widespread problem of systematic bias, batch effects, and confounded experimental designs in scRNA-Seq studies, we surveyed several published data sets. We discuss the consequences of failing to consider the presence of this unwanted technical variability, and consider new strategies to minimize its impact on scRNA-Seq data.
Figure 1. The top section depicts a completely confounded study design in which individual cells from three biological groups (represented by shapes) are processed in three separate batches (represented by colors). In this case, we cannot determine if biology or batch effects drive the observed variation. The bottom section depicts a balanced study design consisting of multiple replicates (rep) split and processed across multiple batches. The use of multiple replicates allows observed variation to be attributed to biology (cells cluster by shape) or batch effects (cells cluster by color).
Batch and outcomes of interest are confounded in published scRNA-Seq experiments
We examined all publicly available scRNA-Seq data sets including at least 200 samples to investigate the extent to which biological variation is confounded with batch effects. These data sets were created using six different scRNA-Seq protocols for sequencing9-15 and five studies include the use of unique molecular identifiers16 (UMIs) for counting cDNA molecules. For each study, we downloaded the processed data available on GEO17 and reconstructed the study design from the sequence identifiers provided in the FASTQ files. We used the standardized Pearson contingency coefficient to quantify the association between processing batch and outcome of interest and found values ranging from 82.1% to 100% (perfect confounding) in eight studies (Table 1). Note that with this level of confounding it is nearly impossible to separate batch effects from biological variation. We could not assess the experimental design in the other seven studies because no biological outcomes of interest were defined, the purpose of the study was the discovery of novel cell types, or the paper described a new technology. We note that in one of these studies for the discovery of novel cell types, the predicted cell type was provided and we found that the confounding percentage between the reported cell type and batch was 80.5%18.
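The confounding measure can be sketched in a few lines of Python. This sketch assumes the common standardization of Pearson's contingency coefficient, in which C = sqrt(chi2 / (chi2 + n)) is divided by its maximum attainable value sqrt((k - 1) / k), with k the smaller of the number of biological groups and the number of batches. The function name and the toy labels are illustrative, not taken from the studies surveyed.

```python
import math
from collections import Counter


def standardized_contingency_coefficient(groups, batches):
    """Standardized Pearson contingency coefficient between two labelings.

    Returns a value between 0 (no association) and 1 (complete confounding).
    Assumption: C is standardized by sqrt((k - 1) / k), k = min(rows, cols).
    """
    if len(groups) != len(batches):
        raise ValueError("groups and batches must have the same length")
    n = len(groups)
    joint = Counter(zip(groups, batches))
    row_totals = Counter(groups)
    col_totals = Counter(batches)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        raise ValueError("need at least two groups and two batches")

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for g, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((g, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    c = math.sqrt(chi2 / (chi2 + n))
    c_max = math.sqrt((k - 1) / k)
    return c / c_max


# Toy designs: one batch per group (confounded) versus every group in every batch.
group_labels = ["A"] * 9 + ["B"] * 9 + ["C"] * 9
confounded_batches = [1] * 9 + [2] * 9 + [3] * 9
balanced_batches = [1, 2, 3] * 9

confounded = standardized_contingency_coefficient(group_labels, confounded_batches)
balanced = standardized_contingency_coefficient(group_labels, balanced_batches)
```

Multiplying the result by 100 gives a percentage on the same scale as the confounding values reported here.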
Table 1. Column 1 shows the publications. Column 2 shows the organism. Column 3 shows the single-cell technology used for sequencing. Column 4 shows the number of cells (samples) included in the study. Column 5 shows the number of genes included in the data uploaded to the public repository. Column 6 indicates the units in which the values were reported. Column 7 shows the level of confounding between biological condition and batch effect quantified using the standardized Pearson contingency coefficient as a measure of association. The percentage ranges from 0% (no confounding) to 100% (completely confounded). Column 8 shows the Pearson correlation between the first principal component of the log transformed data and the proportion of detected genes. Column 9 provides the citation for the study.
+ The main purpose of this study was to investigate monoallelic gene expression in mouse embryos, but here we consider the different developmental stages (oocyte to blastocyst) as the biological condition as an example.
* Differences between the biological conditions in these studies were pronounced; therefore we binned the conditions into separate groups for this analysis, each with its own corresponding correlation coefficient.
** The available processed data was previously filtered by the authors and excluded the majority of non-detected genes, which partially explains the lower correlation (0.54 compared to 0.65–0.93).
Proportion of detected genes is a major source of technical cell-to-cell noise
Most, if not all, published studies using scRNA-Seq rely explicitly or implicitly on computing distances between the cell expression profiles. Principal component analysis is used explicitly to quantify biological or molecular distance19 or implicitly to approximate distance between individual cells. We used the processed expression data available on GEO, applied principal components analysis on the log (base 2) transformed values (adding 1 to avoid logs of 0), and computed the proportion of detected genes from the same data set with the exception of one study19. In this exception, the processed expression data available on GEO excluded most non-detected genes and the values for each gene were centered by removing the average. For this case, we computed the proportion of detected genes from the raw data. We found wide variation in the proportion of detected genes across cells in all studies: from 1% detected to 65%. Furthermore, we found strong correlation between the first principal component and the proportion of detected genes within each data set (Figure 2, Figure S1).
Figure 2. The principal components were computed using the processed data publicly available on GEO. The proportion of detected genes was calculated using the same processed data for all studies except for Patel et al. (2014). In this case, because most non-detected genes were excluded from the publicly available processed data, we computed the proportion of detected genes from the raw data.
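The computation described above (log base 2 transform after adding 1, principal components analysis, and correlation of the first principal component with the proportion of detected genes) can be sketched with NumPy as follows. The data are simulated under an assumed model in which each cell has its own detection rate; the simulation settings are arbitrary and are not meant to reproduce any of the surveyed data sets.

```python
import numpy as np


def pc1_vs_detection(expr):
    """Correlate PC1 of log2(x + 1) data with the proportion of detected genes.

    expr: cells x genes array of non-negative expression values.
    """
    expr = np.asarray(expr, dtype=float)
    detected = (expr > 0).mean(axis=1)        # proportion of detected genes per cell
    logged = np.log2(expr + 1.0)              # add 1 to avoid logs of 0
    centered = logged - logged.mean(axis=0)   # center each gene before PCA
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1 = u[:, 0] * s[0]                      # scores on the first principal component
    r = np.corrcoef(pc1, detected)[0, 1]
    return pc1, detected, r


# Simulated cells (assumption): each cell detects genes at its own rate.
rng = np.random.default_rng(0)
n_cells, n_genes = 60, 300
rates = rng.uniform(0.05, 0.9, size=n_cells)
is_detected = rng.random((n_cells, n_genes)) < rates[:, None]
levels = rng.gamma(shape=2.0, scale=50.0, size=(n_cells, n_genes))
expr = np.where(is_detected, levels, 0.0)

pc1, detected, r = pc1_vs_detection(expr)
print(f"|cor(PC1, proportion detected)| = {abs(r):.2f}")
```

The sign of a principal component is arbitrary, so only the absolute value of the correlation is meaningful.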
To determine if the variability in the proportion of detected genes was biologically or technically driven, we compared the variability across biological groups to the variability across batches. For the cases in which the experimental design permitted this comparison (Tables S1-S6), we found that batch explained more variability than biological group (Figures S2-S7). We note that for three of the studies20-22 we split the defined biological groups into subgroups because the differences between biological groups were pronounced, and we found similar results. In the two studies for which batch was completely confounded with biological group (Tables S7 and S8) we also observed variability across batch (Figure S8). However, in these cases it is impossible to separate variability due to biology from variability due to batch. In seven of the studies, there were no biological outcomes of interest defined or the purpose of the study was the discovery of novel cell types (Figures S9-S15).
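One simple way to make this kind of comparison concrete is the fraction of variance in the proportion of detected genes that is explained by a grouping factor, that is, the R-squared of a one-way analysis of variance. The statistic behind the comparison is not specified in the text, so the Python sketch below is an assumption, and the toy values are invented for illustration.

```python
from collections import defaultdict
from statistics import fmean


def variance_explained(values, labels):
    """Fraction of total variance explained by a factor (one-way ANOVA R^2)."""
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    grand_mean = fmean(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        raise ValueError("values have no variability to explain")
    by_level = defaultdict(list)
    for value, label in zip(values, labels):
        by_level[label].append(value)
    between_ss = sum(len(vs) * (fmean(vs) - grand_mean) ** 2
                     for vs in by_level.values())
    return between_ss / total_ss


# Toy data: two biological groups, each split across the same two batches.
proportion_detected = [0.10, 0.12, 0.30, 0.32, 0.11, 0.13, 0.31, 0.33]
batch = [1, 1, 2, 2, 1, 1, 2, 2]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

r2_batch = variance_explained(proportion_detected, batch)
r2_group = variance_explained(proportion_detected, group)
```

This comparison is only interpretable when each biological group appears in more than one batch. In a completely confounded design the two values coincide and cannot be separated.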
Batch effects lead to differences in detection rates, which lead to apparent differences between biological groups
To illustrate potential downstream effects, we examined 430 single cells from five biological groups of interest for one of the studies in which at least one biological condition was split across two batches (Table S1). We confirmed that cells cluster by biological group (Figure 3A) as reported in the original publication. However, four of the five biological groups were confounded with batch (Table S1) and batches can also explain the clustering (Figure 3B). The one group that was not confounded with batch confirms the high level of variability explained by processing cells in different batches (Figure 3C). As expected from the previous results, different batches lead to different proportions of detected genes (Figure 3D), which we have shown to be correlated with the first principal component and which may be driving the observed biological variation across groups.
Figure 3. Illustration with public data19 of how batch effects lead to differences in detection rates, which lead to apparent differences between biological groups. (A) Using principal components analysis, scRNA-Seq samples cluster by biological group, but the observed biological variation across groups is confounded with (B) technical variation from processing the cells in different batches. (C) Within one group (Group 5), the cells cluster by batch. (D) Furthermore, individual batches of cells have different proportions of detected genes, which may be driving the observed biological variation across groups.
Detection rate has indirect effects on reported gene expression measurements
Using the processed scRNA-Seq data available on GEO, we computed the median of the non-zero measurements for each cell. For each study, we noticed a strong non-linear relationship between the median expression and the proportion of detected genes (Figure 4, Figure S16). We also found that the entire distribution of the non-zero measurements changed with the proportion of detected genes (Figure S17). The overall level of expression changing with the proportion of detected genes could be biologically driven, but there is a reasonable explanation of how it can be a technical artifact, which we explain using statistical notation. Let $X_{ij}$ be the unobserved expression level for sample $i$ and gene $j$. Let us consider only expressed genes ($X_{ij} > 0$). In the sequencing experiment, each expressed gene has some probability of being detected and amplified, which means we observe a quantity proportional to $X_{ij} Z_{ij}$, where $Z_{ij} = 1$ if the gene was expressed at a high enough level to be detected and amplified and $Z_{ij} = 0$ otherwise. This implies that the expected amount of RNA we will obtain is a quantity proportional to
$$\sum_j E(X_{ij} Z_{ij}) = \sum_j E(X_{ij} \mid Z_{ij} = 1) \, \Pr(Z_{ij} = 1).$$
Figure 4. Non-linear relationship between the median gene expression and the proportion of detected genes using processed scRNA-Seq data available on GEO. Failure to account for differences in the proportion of detected genes between cells inflates the gene expression estimates of cells with a low proportion of detected genes. The blue curves were obtained by fitting a locally weighted scatter plot smoother (loess) with a degree of 1 and span of 0.75 for all figures. Because the range of the proportion of detected genes varied from study to study, the range of the x-axis differs across plots.
Note: We could not include Patel et al. (2014) because of the row-standardization applied by the authors in the processed data available on GEO.

Because we know that technical variation affects the probability of detection $\Pr(Z_{ij} = 1) = p_i$, which varies between samples, this quantity should be calculated for each sample $i$. We assume that, within a homogeneous population for example, the expected level of a gene that is detected, $E(X_{ij} \mid Z_{ij} = 1)$, is similar across cells. Then the total amount of RNA is proportional to the detection rate of the expressed genes, $p_i$. Because experimentally we amplify to have roughly the same total amount of RNA for each sample, this implies that we amplify in a way that is equivalent to multiplying by $K / p_i$, with $K$ the total we aspire to reach. Thus the median expression of the expressed genes will be proportional to $1 / p_i$, which is consistent with the data (Figure 4, Figure S16). To confirm this, we implemented an adjustment based on these derivations. Specifically, we used the five studies that count cDNA molecules based on UMIs and followed a standard normalization procedure15, which scales the raw UMI counts $Y_{ij}$ for the $j$th gene (or transcript) in the $i$th cell by the total number of UMIs $N_i$ in the sample. Using the normalized UMI data, we multiplied the values for the $i$th cell by an adjustment factor motivated by the derivations above. This adjustment removed much of the dependence between the median expression values and the proportion of detected genes (Figure S18). We found similar results using normalized read counts from scRNA-Seq data sets based on TPM, RPKM and FPKM (results not shown).
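The reasoning in this derivation can be checked with a small simulation. The Python sketch below assumes that every cell draws expression levels from the same distribution, that each gene is detected independently with a cell-specific probability, and that each cell is then scaled to a common total K. As a stand-in for the adjustment factor, it multiplies each cell by its own observed proportion of detected genes. That choice is an assumption suggested by the inverse relationship derived above and is not necessarily the exact factor used for Figure S18. All names and parameter values are illustrative.

```python
import random
from statistics import median


def simulate_cell(p_detect, n_genes, total_k, rng):
    """Simulate one cell: detect each gene with probability p_detect,
    then scale the detected values so that they sum to total_k."""
    raw = [rng.lognormvariate(3.0, 1.0) if rng.random() < p_detect else 0.0
           for _ in range(n_genes)]
    total = sum(raw)
    if total == 0:
        raise ValueError("no genes detected in this simulated cell")
    return [x * total_k / total for x in raw]


rng = random.Random(1)
detection_rates = [0.1, 0.2, 0.4, 0.8]
raw_medians, adjusted_medians = [], []

for p in detection_rates:
    cell = simulate_cell(p, n_genes=20000, total_k=1e6, rng=rng)
    nonzero = [x for x in cell if x > 0]
    p_hat = len(nonzero) / len(cell)          # observed proportion of detected genes
    raw_medians.append(median(nonzero))       # expected to scale like 1 / p_i
    adjusted_medians.append(median(nonzero) * p_hat)  # assumed adjustment

for p, raw, adj in zip(detection_rates, raw_medians, adjusted_medians):
    print(f"p = {p:.1f}  median = {raw:8.2f}  adjusted median = {adj:6.2f}")
```

In this simulation the unadjusted medians fall as the detection rate rises, while the adjusted medians stay roughly constant across cells.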
Experimental design solutions
Batch effects are not limited to scRNA-Seq data and have been shown to be a widespread problem in high-throughput experiments3, 5, but there are currently no published general statistical solutions to the problem of batch effects in scRNA-Seq data. For the specific application of differential expression, a proposed solution is to account for differences in the proportion of detected genes by explicitly including it as a covariate in a linear regression model23. However, given the current levels of confounding and experimental designs, this approach will not be able to distinguish biological from technical effects. For example, some studies have demonstrated that cells with different biological phenotypes can express a different number of genes24. Moreover, because of the constraints that the experimental protocol imposes on how cells are captured and sequenced in batches, standard balanced experimental designs are not possible4, 25, 26. An experimental design solution is to use biological replicates, namely independently repeating the experiment multiple times for each biological condition (Figure 1). This approach allows for multiple batches of cells to be randomized across sequencing runs, flow cells and lanes, as in bulk RNA-Seq. With this design we can then model and adjust for batch effects due to systematic experimental bias. Other considerations for the design of a scRNA-Seq experiment, such as the number of cells that should be generated, can be determined using a power analysis, which will depend on multiple factors including the goal of the study, the library complexity or sequencing depth required to detect most expressed genes, the protocol used to quantify the expression values such as using unique molecular identifiers16, and the failure rate of single-cell libraries due to degraded RNA or low amplification efficiency.
A more detailed discussion of how these factors affect the experimental design has been recently published4, 26.
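As a minimal sketch of the replicate-based design, the Python code below assigns the biological replicates of each condition to sequencing batches so that every batch contains cells from every condition. It assumes the number of replicates per condition is a multiple of the number of batches, and treats the replicate as the unit that is randomized. The function name, the condition labels and the batch numbering are hypothetical.

```python
import random
from collections import defaultdict


def balanced_design(conditions, n_replicates, n_batches, seed=0):
    """Assign each (condition, replicate) to a batch so that every batch
    receives replicates from every condition.

    Assumption: n_replicates is a multiple of n_batches.
    """
    if n_replicates % n_batches != 0:
        raise ValueError("n_replicates must be a multiple of n_batches")
    rng = random.Random(seed)
    design = []
    for condition in conditions:
        # Randomize, per condition, which replicate goes to which batch.
        batch_order = list(range(1, n_batches + 1))
        rng.shuffle(batch_order)
        for replicate in range(1, n_replicates + 1):
            batch = batch_order[(replicate - 1) % n_batches]
            design.append((condition, replicate, batch))
    return design


design = balanced_design(["A", "B", "C"], n_replicates=3, n_batches=3)

# Collect the conditions present in each batch to check the balance.
conditions_per_batch = defaultdict(set)
for condition, replicate, batch in design:
    conditions_per_batch[batch].add(condition)

for row in design:
    print(row)
```

Because every batch contains every condition, batch and biological condition are not confounded in such a design, and batch can be included as a term when modeling the data.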
Discussion
Batch effects and unwanted technical cell-to-cell noise remain a challenge in the analysis of scRNA-Seq data. The challenge is more complex than in previous sequencing experiments since experimental batches lead to different detection rates, which in turn lead to different transcription level estimates. In addition, detection of a gene or transcript in scRNA-Seq experiments is heavily dependent on the experimental protocol, which leads to systematic differences in the proportion of detected genes between batches of cells. The development of statistical methods that account for these systematic biases will therefore be essential in the analysis of scRNA-Seq data. Incorporating biological replicates in the experimental study design provides a way to reduce confounding between biological condition and batch effects and will permit modeling of the technical variability that relates to processing the cells in different batches.
Competing Interests
The authors declare no competing interests.
Acknowledgements
We thank Bradley Bernstein who provided insightful comments that greatly improved the manuscript. This research was supported by NIH R01 grants GM083084, RR021967/GM103552 and HG005220.
Footnotes
Emails: Stephanie C. Hicks, shicks{at}jimmy.harvard.edu, Mingxiang Teng, mxteng{at}jimmy.harvard.edu, Rafael A. Irizarry, rafa{at}jimmy.harvard.edu