Systematic comparative analysis of single cell RNA-sequencing methods

A multitude of single-cell RNA sequencing methods have been developed in recent years, with dramatic advances in scale and power, and enabling major discoveries and large scale cell mapping efforts. However, these methods have not been systematically and comprehensively benchmarked. Here, we directly compare seven methods for single cell and/or single nucleus profiling from three types of samples – cell lines, peripheral blood mononuclear cells and brain tissue – generating 36 libraries in six separate experiments in a single center. To analyze these datasets, we developed and applied scumi, a flexible computational pipeline that can be used for any scRNA-seq method. We evaluated the methods for both basic performance and for their ability to recover known biological information in the samples. Our study will help guide experiments with the methods in this study as well as serve as a benchmark for future studies and for computational algorithm development.

Single-cell RNA sequencing (scRNA-seq) has emerged as a central tool for identifying and characterizing cell types, states, lineages, and circuitry 1-3 . The rapid growth in the scale and robustness of lab protocols and associated computational tools has opened the way to substantial scientific discoveries and to an international initiative, the Human Cell Atlas (HCA), to build comprehensive reference maps of all human cells 4 .
ScRNA-seq methods differ in how they tag transcripts for their cell-of-origin and generate libraries for sequencing. Low-throughput, plate-based methods 5, 6 sort a cell into a well of a multi-well plate; high-throughput, bead-based methods distribute a cell suspension into tiny droplets [7][8][9] or wells 10,11 containing reagents and barcoded beads to produce a single droplet or well with one cell and one bead that is used to mark all the cDNA generated from that cell; and scalable, combinatorial indexing methods reverse transcribe and barcode mRNAs in situ inside each cell or nucleus, without physically isolating single cells [12][13][14] (Supplementary Fig. 1).
ScRNA-seq remains a rapidly evolving field 15 with continued development of new methods and improvement of existing ones. There is thus an urgent need to provide comparison and benchmarking information to help guide users make informed choices based on each method's capabilities and limitations, compare newly proposed methods to existing ones, and identify shared weaknesses as targets for experimental improvement. Prior comparisons of scRNA-seq methods [16][17][18][19][20][21] , though useful, have several shortcomings. Many are outdated given the fast-paced field, or are incomplete, inapplicable (e.g., not actually performed with single cells), or insufficiently controlled (e.g., performed using different samples for comparisons); others limit their assessment to basic technical factors, but do not assess the key benchmark of ability to recover meaningful biological information, such as population heterogeneity and structure. In particular, comparisons often focused on cultured cell lines, even though most scRNA-seq studies focus in practice on tissues and primary cells. An updated comparison of scRNA-seq methods is needed to help scRNA-seq method developers better understand the advantages and disadvantages of different methods, researchers to choose appropriate experimental methods to answer their biological questions, and computational method developers to create new data and software packages for data processing.
Here, we systematically and directly compared seven methods (Fig. 1, Supplementary Fig. 1), including two low-throughput plate-based methods (Smart-seq2 5 and CEL-Seq2 6 ) and five highthroughput methods (10x Chromium 9 , Drop-seq 8 , Seq-Well 10 , inDrops 7 , and sci-RNA-seq 12 ), producing expression profiles from ~92,000 cells overall. We analyzed at either the single cell or single nucleus level three sample types -a mixture of human and mouse cell lines, human peripheral blood mononuclear cells (PBMCs), and mouse cortex, each sample with two replicates -to generate a total of 36 different scRNA-seq libraries. For mouse cortex we tested four single nucleus RNA-seq methods 9,12,22,23 , in what is, to the best of our knowledge, the first such comparison. For each sample type, we characterized performance with basic metrics, and for PBMC and cortex libraries, we further examined how well the methods capture biological information. This is a critical part of most scRNA-seq studies, but has not been evaluated in other comparison studies that used only relatively homogeneous cell lines 16,20 . Our study provides both immediate guidance on each method's relative performance, and an experimental and computational benchmarking framework to assess future techniques. For the low-throughput methods, Smart-seq2 and CEL-Seq2 performed similarly, though the latter may be affected more by contaminating reads from other cells. Among the high-throughput methods, 10x Chromium was the top performer, both for single cell and single nucleus data.

A comparison of scRNA-seq methods
We selected seven scRNA-seq methods for comparison and tested each with up to three sample types: a mixture of mouse and human cell lines, frozen human PBMCs, and mouse cortex nuclei ( Fig. 1, Supplementary Figs. 1 and 2). We chose to profile a cell line mixture with 50% human HEK293 and 50% mouse NIH3T3 cells ("mixture") because (1) these cells are a common test 8,9,12,24 for samples with relatively high amounts of RNA per cell and (2) multiplets can be detected when cell barcodes have a substantial fraction of reads from both species. In mixture experiments, we grew HEK293 and NIH3T3 cell lines separately, mixed together equal number of cells for each cell line, and processed aliquots with seven scRNA-seq methods in parallel. We profiled frozen aliquots of human PBMCs because (1) they are a heterogeneous mixture of cells, particularly with respect to their amount of RNA per cell, yet they do not require dissociation (a separate technical challenge), (2) their cell types and associated expression patterns are well-studied. (We do not include data for sci-RNA-seq with PBMCs because we detected very few genes (e.g., <10 genes) per cell in several experiments (data not shown).) We wanted to extend our study of single cells to single nuclei as such samples have distinct properties including lower RNA input amounts. We profiled the mouse cortex because brain tissue is a major example of a tissue type commonly analyzed, by necessity, through single nucleus RNA-seq. We thus used four methods, which had been previously applied to nuclei 9,12,22,25 , to test with nuclei from mouse cerebral cortex. Each sample type was tested in two experiments (Mixture1 and Mixture2, PBMC1 and PBMC2, Cortex1 and Cortex2) run on different days to assess reproducibility.
In each comparison experiment, we started with one sample with processing of aliquots starting at the same time for each method. The only exceptions were for Seq-Well in PBMC1, in which we thawed an identical PBMC aliquot a second time to obtain a Seq-Well dataset with sufficient cells profiled for PBMCs, and for 10x Chromium in PBMC1, in which we thawed an identical aliquot to directly compare version 2 (v2; designated as "B") with version 3 (v3), which was not available when our original comparisons were done. In each experiment, we aimed to collect data from ~384 cells for the low-throughput methods and ~3,000 cells for the high-throughput methods. In each experiment, we also used an aliquot of cells to generate a bulk RNA-seq library as a control.
We sequenced all libraries together in an attempt to avoid batch effects due to varying sequence quality among Illumina flowcell lanes, with the following exceptions. The inDrops libraries were sequenced separately because they have an opposite read structure from those generated from the other methods with the cDNA in read 1 needing a long read and indexing information in read 2 needing a short read (Online Methods). We performed additional sequencing for some libraries in an attempt to sequence similar numbers of reads per cell for each low or high throughput method (Online Methods). We aimed for 50,000 to 100,000 reads per cell for high-throughput methods and 750,000 to 1,000,000 reads per cell for low-throughput methods.

scumi computational pipeline allows unified analysis across any scRNA-seq method
Because each method had its own standard computational pipeline, we developed a new "universal" pipeline, to permit direct comparison of all the experimental methods, and remove processing differences introduced by these existing pipelines (Supplementary Fig. 2). First, we developed the scumi software package (single-cell RNA-sequencing with UMI, Supplementary   Fig. 2a), which starts from FASTQ files as input and generates gene-cell expression count matrices for downstream analyses. We used scumi to analyze and compare the data from all the methods tested.
Second, we addressed the major pre-processing challenge of filtering out low-quality cells prior to downstream analysis (Supplementary Fig. 2b). This is particularly important when comparing methods, to ensure that our approach is fair to all methods and less subjective. In particular, when selecting the top cell barcodes with the largest number of reads or UMIs assigned to them, the challenge is to decide what threshold to choose for excluding lower quality cells or barcodes likely reflecting ambient RNA rather than real cells (or nuclei). For the mixture experiments we removed cells based on their complexity (UMIs or reads per cell). For the more complex PBMCs and cortex samples, consisting of different cells with different characteristics, such a simple approach could bias against the recovery of cell types with relatively small amounts of RNA. Instead, we first looked at more cell barcodes than we expected to truly recover from experiments, did an initial clustering, looked for differential gene expression to find cluster specific marker genes, and removed cells in clusters likely to be low quality (Supplementary Fig. 2b, Online Methods).
Third, prior to calculating metrics that potentially show improvements with greater sequencing depths, such as the number of genes per cell or ability to detect known cell types, we sampled the same number of reads per cell for all the methods of the same type, either low-or high-throughput, in a given experiment (Online Methods, Supplementary Fig. 2c). This leads to better relative performance for methods that have a higher fraction of informative reads, i.e. aligning to genes and present in cells used for analysis. Note that because for most experiments we sequenced the poly(T) sequences that follow the cell barcode and UMI sequences in all methods except SMART-seq2, we tracked and removed reads without poly(T) at the expected positions.
Finally, we assessed the methods by several key metrics spanning (1) the structure and alignment of reads to the nuclear and mitochondrial genomes; (2) sensitivity in capturing RNA molecules; (3) extent of multiplets (assessed in mixture experiments); (4) their technical precision/reproducibility with respect to expression estimates; and (5) the ability to recover meaningful biological distinctions in cell types (for the PBMC and cortex experiments). We describe each of these assessments next.

Read structure and alignment reveal efficiency differences among methods
First, we characterized the methods by the distribution of reads from each library with respect to their structure and alignment with the genome (Supplementary Fig. 3). These metrics inform about the "efficiency" of methods in generating useful reads for downstream analysis. In the mixture experiments, we considered only uniquely mapped reads to minimize the effects of multimapped reads on calculating cell multiplet rates and other metrics.
The methods varied in the fraction of reads without poly(T) at the expected positions. Methods that do not use beads to capture mRNA (CEL-Seq2 and sci-RNA-seq) generally had poly(T) detected at the expected positions (Supplementary Fig. 3, Supplementary Tables 1,2). The 10x Chromium data also had a high fraction of reads with poly(T). By contrast, Drop-seq, Seq-Well, and inDrops showed high percentages of reads without poly(T). Interestingly, although Drop-seq and Seq-Well used the same batch of beads, the percentages of reads without poly(T) from the two methods were different, with 6.5% and 7.5% for Drop-seq, but 10 Supplementary Fig. 3b). These results suggest that some aspect of the methods independent of the beads contributed to the proportion of with reads without poly(T). In the nuclei experiments, both 10x Chromium (v2) and DroNc-Seq had a higher fraction of reads lacking poly(T) with nuclei than in the other experiments (Supplementary Tables 1-3 We next considered the distributions of reads across these categories: exonic, intronic, intergenic, overlapping different genes (ambiguous), multi-mapped, and unmapped. Exonic reads are typically the only reads used in scRNA-seq studies of cells, whereas intronic reads are used for studies with nuclei. In the mixture experiment, both replicates of Smart-seq2 and one replicate of inDrops had the highest fraction of exonic reads (51.0%, 53.7%, and 56.9%, respectively), with sci-RNA-seq performing worst (28.7% and 29.4%, Supplementary Fig. 3a). Overall, the PBMC samples had a lower fraction of reads aligned to exons than the mixture samples ( Supplementary   Fig. 3a,b), with one replicate of inDrops having the highest fraction of exonic reads (46%), and Seq-Well having the lowest (20%, Supplementary Fig. 3b).
To determine the extent to which existing annotation limits recovery of reads aligning to genes 26 , we used the PBMC1 and PBMC2 bulk RNA-seq to create a matched transcriptome and new annotation (Online Methods). The new transcriptome annotations led to very few (<2%) additional reads aligning (Supplementary Table 4).
While the relative performance of each method was generally similar between the cortex nuclei and the other experiments, there was a higher ratio of intron-aligning reads to exon-aligning reads in nuclei than in cells (Supplementary Fig. 3) as expected because nuclei contain a higher proportion of unspliced transcripts than whole cells 27 . We assessed whether reads aligned in the sense or antisense orientation for each method, except Smart-seq2, which is not strand-specific.
We did not observe this level of antisense reads in bulk cortex nuclei nor in PBMC scRNA-seq datasets (data not shown). We excluded these antisense reads from our subsequent analysis.

Fraction of mitochondrial reads varies, but is at or below level in bulk RNA-seq
We determined the fraction of reads aligning to mitochondrial genes, as these can be associated with lower quality scRNA-seq samples 28 on the one hand, but can also be useful for inferring lineage relations between cells based on mtDNA mutations 29 on the other hand.
In the mixture experiments, CEL-Seq2 had the highest frequency of such reads at 8.7% ( Supplementary Fig. 4a,b), but this corresponds to the levels of mitochondrial transcripts in bulk RNA-seq for the cell line samples: 12.6% and 12.3% for NIH3T3 cells, and 8.5% and 9.1% for HEK293 cells, suggesting this may be an accurate reflection of the cells' transcriptome. In PBMCs, inDrops and CEL-Seq2 had the highest mitochondrial ratios at ~11% (Supplementary Fig. 4b).
This may reflect the fact that both methods use in vitro transcription amplification rather than PCR.
In addition, 10x Chromium (v3) had a high mitochondrial ratio at 11.2% (Supplementary Fig.   4b). Bulk RNA-seq of PBMC1 and PBMC2 had 12.6% and 12.3% mitochondrial transcripts, respectively, again suggesting this may be an accurate reflection of the transcriptome. In cortex nuclei, all methods had low mitochondrial ratios (Supplementary Fig. 4c), possibly because the isolation of nuclei successfully removed most mitochondria from the samples. This is consistent with the low (1.9% and 2.2%) fraction of mitochondrial transcripts in the matched bulk RNA-seq samples.

Similar relative ranking of method sensitivity across experiments
As scRNA-seq methods start with limited RNA inputs, a key quality metric is the sensitivity, or the ability to capture RNA molecules. We assessed the sensitivity of each method by measuring the number of detected UMIs or genes per cell in datasets sampled to the same number of reads per cell (Online Methods; Supplementary Table 5). The only exception was Seq-Well PBMC1 with ~46,000 reads per cell compared to ~69,000 reads per cell for the other high-throughput methods in PBMC1 (Supplementary Table 5). For the mixture experiments, we report the results for mouse and human cells separately as the number of UMIs and genes per cell in the two cell types differs, such that differences in the ratio of human to mouse cells among the libraries (Supplementary Table 1) could skew the results, but the overall ranking of the methods is the same for both human and mouse cells (Fig. 2a,b, Supplementary Fig. 5a).
Overall, low-throughput methods Smart-seq2 and CEL-Seq2 had the highest sensitivities, whereas among high-throughput methods, 10x Chromium detected the most UMIs and genes per cell. In the mixture experiments, inDrops had the lowest sensitivity and Seq-Well detected fewer genes per cell compared to 10x Chromium (v2) and sci-RNA-seq, but more genes per cell compared to Drop-seq and inDrops. The relative ranking of the methods was generally consistent when comparing the median number of detected UMIs per cell (Fig. 2a), detected genes per cell ( Fig.   2b), or mean detected reads per cell (Supplementary Fig. 5a). Similarly, in PMBCs lowthroughput methods detected more UMIs and genes per cell than the high-throughput methods. (Fig. 3, Supplementary Fig. 5b), with similar performance of Smart-seq2 (2406 and 2632 median number of genes detected) and CEL-Seq2 (2717 and 2545; Fig. 3b). Among the high-throughput methods, 10x Chromium (v3) had the largest median number of UMIs (4494) and genes per cell UMIs; 513 and 372 genes) had the lowest (Fig. 3). In cortex nuclei, Smart-seq2 was the only low-throughput method tested and we sequenced to a slightly higher depth than for the other samples (Supplementary Tables 1-3) and used all the reads. As expected, Smart-seq2 detected more genes per cell than the high-throughput methods. (Fig. 4, Supplementary Fig. 5c).
Although it is not possible to directly compare these metrics to the mixture and PBMC experiments as each set of cells differ in many ways, including the amount of RNA per cell or nucleus, it is possible to compare the relative ranking of different methods. Although sci-RNA-seq had better sensitivity than Drop-seq with cell lines (Fig. 2), the two methods performed similarly in Cortex1 and DroNc-seq was more sensitive in Cortex2 (Fig. 4).
To explore how sensitivity varied with sequencing depth, we sampled fewer reads per cell from each method in the PBMC datasets. For each dataset, the relative ranking of the methods with respect to the median number of genes per cell detected remained the same at all sequencing depths ( Supplementary Fig. 6). In addition, it appears that the number of genes detected may not have saturated at these sequencing depths, except for Seq-Well PBMC2.

Mixture experiments enable detection of multiplets and reads from other cells
In the mixture experiment, we were able to assess the frequency of multiplets, two or more cells being sequenced together and assigned one cell barcode because we started with a mixture of human and mouse cells. The observed multiplet rates were less than 3.5% for all seven tested methods (Fig. 2c), except for the first inDrops experiment, which also had a high fraction of reads without poly(T) (Supplementary Table 1). The multiplet rate depends on the number of cells used in each experiment 9 and the ratio of mouse to human cells, but it was not possible to sequence exactly the same number of cells nor the same ratio of mouse and human cells with each method in our experiments (Supplemental Table 1). The multiplet rates of low-throughput methods (Smart-seq2 and CEL-Seq2) were the lowest (<1%) of all the methods, as expected as FACS was used to place a single cell in each well of a plate (Fig. 2c).
We also examined how the estimated multiplet rate varied with the number of detected UMIs per cell. Generally, the multiplet rates were high in cells with the largest number of UMIs (Fig. 2c), as expected because multiplets are expected to have more RNA input. While most cells with intermediate quantities of UMIs were not multiplets, cells with lowest number of UMIs in some cases had higher rates suggesting that these cells might be low-quality or have more contributions from cell-free ambient RNA (Fig. 2c). In sci-RNA-seq experiment 2, the rate of multiplets decreased more gradually than for other methods for unknown reasons (Fig. 2c).
We also used the mixture experiments to ask whether the genes detected in a cell were actually from that cell instead of "contamination" from other cells. As sequencing depth increased, more genes were detected from the "wrong" species ( Supplementary Fig. 7a,b), as reflected by the slope of a regression line along the cell barcodes adjacent to each axis (Online Methods), such that the best performing methods have the lowest slope. For the low-throughput methods, Smart-seq2 performed much better than CEL-Seq2 -perhaps related to the pooling of cells or their barcoding during library construction in CEL-Seq2. Among the high-throughput methods, inDrops had the lowest slope and Seq-Well had the highest slope.

Reproducibility, technical precision, and accuracy in gene expression quantification
We compared the reproducibility of expression quantifications from replicates, among the different methods. In the mixture experiments, we calculated the Pearson correlation coefficients between pseudo-bulk data of each of the scRNA-seq mixture human and mouse datasets (Online Methods). As expected, replicates were almost always more correlated with each other than with pseudo-bulks from different methods (Supplementary Fig. 8). The pseudo-bulks from sci-RNAseq, although correlated well with each other from the two replicates, were not as correlated with the pseudo-bulks from other methods. The pseudo-bulks from the other methods were highly correlated with each other (Supplementary Fig. 8), suggesting that all methods have high accuracies in quantifying the gene expression, consistent with previous comparison studies that demonstrated scRNA-seq methods had high accuracies based on measured ERCC spike-ins and their annotated molarities 16,18 . Notably, pairs of methods with similar molecular biology -CEL-Seq2 with inDrops and Drop-seq with Seq-Well -were more highly correlated with each other (Supplementary Fig. 8).
To assess technical precision in the mixture experiment, which consisted of two homogenous cell lines grown in controlled conditions in culture, we also compared the variation in scRNA-seq data, which we expect to be primarily driven in this case by technical variation. Previous studies have demonstrated that the technical variation generally follows Poisson distributions 16,30,31 . CEL-Seq2, inDrops, and Drop-seq consistently had relatively low extra Poisson CVs ( Supplementary   Fig. 9). Consistent with previous findings, Smart-seq2 data had the highest extra Poisson CV, most likely because no UMIs were used (Supplementary Fig. 9).
To examine how well the scRNA-seq methods reflect gene expression captured in bulk experiments, we compared them in the PBMC datasets. For each library, we combined each of the single cell datasets, sampled to the same number of reads per cell, into a pseudo-bulk dataset and compared each of them to the matched bulk (Online Methods). Most libraries showed a strong correlation with the bulk with Smart-seq2, 10x Chromium, and inDrops having the highest correlations (Supplementary Fig. 10).

Individual methods vary in their ability to distinguish and recover cell types
A key consideration for a scientist in choosing a scRNA-Seq experimental method is its ability to uncover the underlying biology of interest. Among the many biological features studied by scRNA-seq, one of the most prominent use cases is the identification and recovery of distinct cell types by clustering of scRNA-Seq profiles. Both the PBMC and mouse cortex datasets consist of diverse cell types, and were chosen to allow us to compare methods in the context of this use case.
To this end, we processed the data with the goal of a fair and optimal assessment of each method.
Not only did we sample the same number of reads per cell for each low-and high-throughput method in each experiment as above for the sensitivity metrics, we also performed another round For each PBMC or mouse cortex dataset, we clustered the cells or nuclei based on their gene expression profiles to assess how well they detected the known cell types and their associated transcriptional profiles. For clustering, the Louvain algorithm's performance largely depends on the parameter settings. For each dataset, we searched a range of parameters to select the optimal clustering to recover each of the expected cell types (Online Methods). For each cluster, we assigned a cell type identity based on known marker genes (Online Methods). Next, to quantify the quality of the clusters at separating cell types, we scored the expression of each cell for each cell type signature generated from known marker genes and calculated the area under the curve (AUC) for each cluster to measure how well the cells in a cluster score for each cell type (Online Methods). The AUC summarizes the performance of the gene signature scores in separating a cluster of cells from the rest of the cells, with AUC =1 for all cell types as the ideal outcome.
For PMBCs, methods varied in the ability to distinguish cell types, as well as in the proportion of cell types recovered and in some cases to recover certain cell types altogether. As expected, methods had more difficulty in distinguishing transcriptionally related cell types, such as CD4 + T cells, cytotoxic T cells, and natural killer (NK) cells ( Fig. 5a,b, Supplementary Fig. 11). From the t-SNE plots for PBMC2, we observed that 10x Chromium and inDrops performed well ( Fig.   5a, Supplementary Fig. 11b). Given that all the libraries for each experiment were generated from the same sample, we compared the fraction of cells assigned to each cell type within an experiment to see whether they were consistent (Fig. 5b). Generally, most methods were able to successfully recover the abundant cell types in PBMCs, i.e., CD4 + T helper cells, cytotoxic T cells, NK cells, B cells, and CD14 + monocytes, but varied in the relative abundance of cell types.
Methods also varied in whether cell types were detected, particularly for the rarer cell types, such as megakaryocytes (Fig. 5b). For the low-throughput methods, we did not profile a sufficient number of cells to recover the rarer cell types (Fig. 5b). In PBMC1 among the high-throughput methods, 10x Chromium (v2) showed the best quality for both the number of cell types identified and the average AUCs across cell types, followed by Drop-seq and 10x Chromium (v3), with Seq-Well and inDrops not identifying two cell types (Fig. 5c). In PBMC2, 10x Chromium (v2) and inDrops performed well -identifying all the cell types (Fig. 5c). For Seq-Well PBMC2, the poor performance was strongly influenced by the low number of cells recovered in the experiment (Fig.  5c). The two low-throughput methods performed similarly for the AUC measurements and did not have sufficient cells to capture and identify clusters for the rarer cell types (Fig. 5c).
Similar to PBMCs, the mouse cortex also has well-defined cell types, including excitatory and inhibitory neurons, astrocytes, oligodendrocytes, oligodendrocyte progenitor cells (OPCs), microglia, endothelial cells, and pericytes 32 . In both experiments for all the methods, apart from for sci-RNA-seq, we identified all these cell types, except pericytes, a rare cell type only found in DroNc-seq in Cortex1 ( Fig. 6; Supplementary Fig. 12). In the sci-RNA-seq datasets, we also could not find OPCs and microglia (Fig. 6). In the AUC analysis, Smart-seq2, 10x Chromium (v2), and DroNc-seq all had high AUCs, though their relative ability to detect the expected cells varied by cell type (Fig. 6c). Notably, even the small number of cells in the Smart-seq2 datasets (295 and 349) sufficed to find these cell types, in contrast to the PBMC datasets (Fig. 5). There were some clusters of cells in the sci-RNA-seq datasets for which we could not confidently assign cell types (7% and 4% of cells; Fig. 6a,b). These cells were not used in calculating the AUCs.

Pooled data analysis across methods enhances biological signal and consistency
Two general reasons may underlie the failure to detect certain cell types in the PBMC and cortex experiments: (1) libraries did not contain cDNAs from these cell types due to experimental issues; or (2) data quality from these cells may not have been insufficient to identify them, given this depth of sequencing and number of cells. To distinguish between these possibilities, we combined for each PBMC experiment all the sampled data together using Harmony 33 , re-clustered the cells ( Supplementary Fig. 13a), and repeated our analysis. Following this analysis, all cell-types were detected in each library, supporting the second possibility and showing the power of accruing data across methods (Supplementary Fig. 13b-d). Moreover, we were able to determine in which cell type these missing cell types were originally assigned (Supplementary Fig. 13c,d). Although most of the combined and individual cell type assignments agree, some cell types seemed to be harder to distinguish. For example, in several libraries such as Smart-seq2 and CEL-Seq2, the undetected dendritic cells were grouped with the CD14 + or CD16 + monocytes ( Supplementary   Fig. 13c,d). Overall, 10x Chromium (v2) was the most consistent between the combined and individual level clustering, followed closely by 10x Chromium (v3), and others having fairly high but variable levels of consistency.
To check the cell types assigned by the combined analysis, we examined the cells assigned to cell types missing in our original analysis of each library separately. In most cases (20/25), we found that these cells could be assigned the same identity using our original AUC method (Online Methods), with some exceptions for rare cell types with only ~1-2% of cells in a cluster (Supplementary Table 6). Thus, the failure to identify all the relevant cell types was due, as least in part, to data quality issues such as reads that could not be used in the analysis (Supplemental Table 2), with the possible exception of two rarer cell types in our datasets, megakaryocytes and plasmacytoid dendritic cells, which may not have been present in some datasets. Another possible explanation is that low quality cells were included that prevented identification of distinct cell types -this points to the difficulty in finding an optimal filtering threshold as well.
The combined analysis of mouse cortex nuclei (Supplementary Fig. 14a) allowed us to identify all the cell types in each dataset, except for pericytes, which clustered with endothelial cells (Supplementary Fig. 14b,c). In most cases, the combined and individual assignments agree ( Fig.   6, Supplementary Fig. 14c). The 10x Chromium (v2) was the most consistent between the joint and individual level clustering, while sci-RNA-seq was the least (Supplementary Fig. 14c). In particular, some cell types were harder to distinguish-for example, oligodendrocytes and oligodendrocyte precursors, and microglia and excitatory neurons. All the cell clusters assigned new cell types by Harmony were confirmed when cells from individual libraries were checked with our AUC method (Online Methods).

Comparison of scumi with standard computational pipelines
Although we developed and used the scumi computational pipeline in this study to analyze each method's datasets in as similar a manner as possible, we also processed each of the datasets with its original method-specific pipeline for comparative purposes. There are significant differences in how the individual pipelines work. For example, the Drop-seq pipeline 8 only considers primary alignments and discards secondary ones. For the reads that can be equally mapped to multiple positions, the primary alignment is typically randomly chosen from these alignments. On the other hand, Cell Ranger 9 (for 10x data) only counts those multi-mapped reads if all the alignments of a read overlap a single gene.
Generally, there was good agreement between the basic metrics generated by the method-specific and scumi pipelines in terms of the median number of UMIs and genes per cell for the PBMC datasets (Supplementary Table 2). In particular, the relative ranking of all the methods for median genes per cell is the same between the scumi and method-specific computational pipelines (Supplementary Table 2). One of the most challenging steps in the analysis is choosing the correct number of cells in a dataset and most of these method-specific pipelines do not include this as a formal part of their pipeline or it does not work that well in practice, so that we followed standard practices and manually chose the number of cells based on the number of genes per cell in most cases (Online Methods). In addition, we examined the Pearson correlation of the gene expression for the two pipelines for each PBMC dataset. We found very high correlations for all datasets for both UMIs and genes (Supplementary Fig. 15). The genes per cell output agrees more closely than that of the UMIs per cell (Supplementary Table 2), perhaps because of the differences in how the method-specific pipelines process the data.

DISCUSSION
In this study, we systematically benchmarked seven methods across three major categories: platebased, bead-based, and combinatorial index-based methods. We selected representative methods that are more widely used and for which we had the expertise and resources to prepare the libraries.
Our results with two replicates each for three different sample types -mixture of cell lines, human PBMCs and mouse cortex -were generally consistent in their ranking of the methods for sensitivity (Figs. 2-4), reproducibility (Supplementary Fig. 8 All of the methods were able to generate useful data, but overall we found that 10x Chromium had the strongest consistent performance. In our limited testing of 10 Chromium (v3), which was introduced recently, we note that it had higher sensitivity (Fig. 3), but we did not detect improved cell type identification (Fig. 5) and had a higher fraction of reads aligned to mitochondrial genes ( Supplementary Fig. 4). sci-RNA-seq, which has shown great scaling to larger numbers of single cells 13 , may require optimization for use with some samples, such as PBMCs. We note that we used the original version with two rounds of indexing 12 . Moreover, its performance with cortex nuclei was not ideal as it could not assign an identity to some cells and did not detect all the cell types present (Fig. 6, Supplementary Fig. 14). For the low-throughput methods, Smart-seq2 and CEL-Seq2 performed similarly without a consistent pattern for which was better (Figs. 2-5). For studies that require the highest sensitivity, these two methods are clearly better than the highthroughput methods (Figs. 2-4). Smart-seq2 has inherent advantages for genetic variant detection and studying RNA splicing isoforms because its sequencing is not limited to the 3' end of genesalong with the disadvantage of lacking UMIs. Note however that in CEL-Seq2 we cannot rule out the issue of contaminating reads from other cells (Supplementary Fig. 7) 34 .
Looking beyond performance, we compared the time and reagent costs for each method as performed in this study (Supplementary Table 7). Drop-seq, inDrops, and Seq-Well had the lowest costs -noting that they are not fully packaged commercial products at this time -and Smart-seq2 was the most expensive, primarily because there is no pooling during library preparation. Many of the methods, particularly sci-RNA-seq, would be more cost effective with larger numbers of single cells or nuclei 13 . The 10x Chromium method required the least time and Smart-seq2, CEL-Seq2, and inDrops took the most time. We did not utilize automation, but it could decrease hands-on time and affect cost.
Analyzing single nuclei rather than single cells is an important strategy, which addresses tissues that cannot be readily dissociated into a single cell suspension (such as brain, skeletal muscle or adipose) and frozen samples, as well minimizes the alteration of gene expression which may be caused by dissociation 35,36 . As in previous studies 37, 38 , we found that single nucleus RNA-seq generally performed well for sensitivity (Fig. 4) and classification of cell types (Fig. 6). Even with the inclusion of intron-aligning reads in our analysis, a higher fraction reads for 10x Chromium, and to a lesser extent for DroNc-seq, could not be analyzed because of the absence of a poly(T) sequence or aligning in an antisense orientation (Supplementary Fig. 3).
To systematically analyze and compare data generated from different scRNA-seq protocols and avoid introducing potential biases from method-specific computational pipelines, we developed the scumi software package. Although other scRNA-Seq processing pipelines 8,9,18,[39][40][41][42][43][44][45][46] are available, some are specifically designed to process data from one or a few protocols 8,9,[43][44][45][46] and others can process data from different protocols, but are not scalable to large datasets 18,40 or are harder to use due to dependency and maintenance issues 42 . scumi is flexible to process reads from different protocols, works with different aligners, is scalable to billions of reads in a single run, has functions for detecting and correcting cell-barcodes, collapsing and filtering UMIs, and efficiently sampling reads to compare different methods, and, most importantly, is relatively easy to use. It takes FASTQ files as inputs and generates gene-cell expression count matrices. Except for removing the reads lacking poly(T) sequences, scumi keeps all other reads in BAM files that can be used for visualization or to pinpoint potential problems in the data. A significant challenge in scRNA-seq analysis, particularly when comparing across methods, is to decide which cells to exclude as low-quality. For samples such as PBMCs and cortex, heterogeneity in the RNA content of the different cell types makes a simple threshold of UMIs or genes per cell inappropriate as it introduces biases against cells with less RNA. Our pipeline provides multiple, customizable solutions to the problem of how to set threshold to filter out low-quality cells, including a mixture model or filtering of initially clustered datasets. Although scumi is not optimized for data generated from any one specific platform, we showed that it produced results that had high concordance with those from method-specific pipelines (Supplementary Table 2, Supplementary Fig. 15). We foresee that scumi could be useful for analyzing scRNA-seq data from new platforms and for future benchmarking studies of scRNA-seq methods.
Our study, the scumi pipeline, data and approaches will be a resource for future research in many fields where scRNA-seq methods are applied, and provides important guidance. First, using a coherently and reproducibly collected set of data, spanning three sample types, it provides direct guidance on key methods by a rich set of parameters and considerations -from technical to biological. It spans key and popular methods, including the first comparison of single nucleus RNA-seq methods. Second, the results presented here for each method could be used to further optimize and improve the existing scRNA-seq methods. Third, our use of representative and easily accessible sample types -common cell lines, human PBMCs, and mouse cortex from the C57BL/6 strain-should allow future studies, particularly those introducing new or improved methods, to make direct comparisons to this benchmark study. Indeed, all data sets were collected in a manner that allows open sharing, including of the human PBMC data. Finally, we expect that our datasets will be valuable for computational method developers to benchmark algorithms, e.g., batch correction algorithms, and build pipelines for efforts such as the Human Cell Atlas, the BRAIN with ad libitum food and water. We obtained whole cortex samples by euthanizing 1 month old C57BL/6 male mice by cervical dislocation. We then dissected the whole cortex out of the first hemisphere by a midsagittal cut and a cut made between the cerebellum and the hemisphere. The hippocampus, olfactory bulb, and all basal ganglia were dissected out, and the cortex was placed in 500 µl RNAlater (Thermo Fisher Scientific) in a 1.5 ml tube, followed by incubation overnight at 4 ℃ before storage at -80 ℃. Then second half of the cortex was processed similarly and placed in a different tube. We took only 2 min to harvest the first half of the cortex after the cervical dislocation step and another minute for the second half of the cortex.

Single cell or nucleus experimental design
We performed two experiments with each single cell method for the mixed cell lines and PBMCs, except as noted. To generate data for Seq-Well, we performed a second PBMC1 experiment on a different day with an aliquot identical to the one used in the main PBMC1 experiment. Similarly, we performed a third PBMC1 experiment with 10x Chromium (v2) and (v3) on a different day. In addition, we performed two experiments with four methods for the mouse whole cortex nuclei. In all cases, each lab method was started at the same time by different researchers, so that the results would be directly comparable without any confounding due to the time cells or nuclei waited to start the experiment.

Cell sorting
We passed each cell suspension through a 35 µm filter (Falcon) and added 20 µl of 1 µM TO-PRO®-3 Iodide (Thermo Fisher Scientific) per ml to stain dead cells. We used a MoFlo Astrios EQ cell sorter (Beckman Coulter) and set fluorescence activated cell sorting (FACS) gating on forward scatter plot, side scatter plot and on fluorescent channels to pick TO-PRO®-3 negative (live cells) while excluding debris and doublets. We used a 100 µm nozzle to sort single cells into 96-well plates containing 5 µl TCL buffer (Qiagen) with 1% beta-mercaptoethanol for Smart-seq2 and 384-well plates containing 0.6 µl 1% NP40 (Thermo Fisher Scientific) for CEL-Seq2.

Nuclei isolation and sorting
We isolated single nuclei as previously described 23 with the following modifications. We processed the left and right cortical hemispheres of one mouse separately through Dounce homogenization. We then combined the nuclei suspensions during the washing step. We aliquoted We used FACS settings similar to those for single cells described above except for picking Violet positive (for nuclei) to sort 10,000 nuclei into 20 μl 10x RB buffer for 10x Chromium (10x Genomics). We also sorted single nuclei into 96-well plates as described above for Smart-seq2.

Smart-seq2
We prepared RNA-seq libraries from single cells and single nuclei sorted into 96-well plates using a modified Smart-seq2 method 48  incubated the RT reaction at 50 ℃ for 90 min followed by 85 ℃ for 5 min. We then added 15 µl PCR master mix containing 12.5 µl KAPA HiFi HotStart ReadyMix (KAPA Biosystems)) and 0.5 µl 10 µM PCR primer (5'-AAGCAGTGGTATCAACGCAGAGT). We amplified for 21 PCR cycles with conditions as described 48 . We purified PCR products twice with 0.7x AMPure XP SPRI beads (Beckman Coulter Genomics) and eluted in 20 µl EB buffer (Qiagen). We quantified the cDNAs with the Quant-iT™ PicoGreen® dsDNA Assay Kit (Thermo Fisher Scientific) using an EnVision 2104 Multi-label Reader (Perkin Elmer). We normalized the cDNA concentrations and used 0.075 ng of each in a quarter volume of a NexteraXT (Illumina) library construction reaction. We pooled equal volumes from each well and purified with 0.6x AMPure XP SPRI beads.

CEL-Seq2
We prepared CEL-Seq2 libraries from single cells sorted into 384-well plates following the

Drop-seq
We performed Drop-seq experiments with single cells according to the published paper 8

DroNc-seq
We performed the DroNc-seq experiment with single nuclei according to the published paper 23 as described for Drop-seq (see above) with the following additional details. 1) We used a DroNc-seq device that generated droplets 75 µm in diameter. 2) We size selected beads <40 μm in diameter using a strainer (PluriSelect, #43-50040-03). 3) We used 300 nuclei/µl and 350 beads/µl. 4) We flowed cell and bead suspensions at 1.5 ml/hour, while flowing the oil at 16 ml/hour. 5) We collected the emulsion into a 50 ml Falcon tube for 22 min for each collection and did 2 collections for each experiment. We incubated the collections for up to 45 min at room temperature before breaking the emulsion.

Seq-Well
We prepared the Seq-Well libraries following the published paper 10  For the single nucleus experiment, we followed the published paper 12 with similar details as for the mixed cell experiment with these additional changes. 1) We eliminated the methanol fixation step. 2) We used a final concentration of 300 nuclei/μl and aliquoted 600 nuclei/well across one 384-well plate. 3) We pooled the nuclei after RT and filtered through a 35 µm strainer (Falcon) prior to staining and sorting. 4) We gel-purified the final libraries after PCR to efficiently remove the large quantity of gDNA fragments following a procedure similar to a previously published paper 49 . Briefly, we ran the samples on a 10% Criterion TBE gel (Bio-Rad) at 100 V for 65-70 min prior to staining with 1/1000x SYBR Green (Thermo Fisher Scientific) for 10-20 min. We cut out the region from 300 to 600 bp from the gel, collected it into a 1.5 ml Eppendorf tube, crushed it with a disposable pestle (Kontes/Kimble Chase, #749520-0090), added 400 μl of 0.3M NaCl to the crushed gel pieces and incubated with rotation at room temperature for at least 4 h to overnight.
We used a Spin-X cellulose acetate filter (Corning, #8163) to remove the gel particles by spinning for 3 min at 15,000 x g. We ethanol precipitated the flow through by adding 1 μl glycogen (20 μg/μl; Roche, #10901393001) and 2.5 volumes 100% ice cold ethanol followed by incubation at -20 °C for at least 30 min and then 20 min of centrifugation at 15,000 x g and 4 °C. We washed the pellets with 1 ml 70% ethanol and air dried before resuspending the libraries in EB buffer (Qiagen).

inDrops
We prepared inDrops libraries based on a previously described protocol 7 where XXXXXX is the indexing barcodes for multiplexing libraries.

RNA-Seq from bulk samples
We pelleted the cultured cells, PBMCs, and isolated nuclei leftover from Tube A after taking out the aliquot for DroNc-seq (see Nuclei isolation and sorting section) by spinning at 500

Read demultiplexing
As methods using a single index read (e.g., 10x Chromium) or dual index reads (e.g., Smart-seq2) were pooled together for sequencing, we adopted a two-round demultiplexing strategy. We first demultiplexed the reads using both index reads to extract reads from Smart-seq2 and sci-RNAseq. In the next round, we further demultiplexed the reads using a single index read. In the case of barcode collisions (i.e., a single index library index read matches that of a dual index library), we extracted the single index library reads from the unmatched reads after the first demultiplexing step. We only kept reads with one or no mismatches in the index reads. and UMIs can be found in Supplementary Table 11.

Annotating each cDNA read with its cell barcode and UMI
The scumi pipeline also corrects for sequence errors in the indices used by bead-based methods.
We have implemented code similar to the standard Drop-seq pipeline that overcomes the problems observed for some batches of Drop-seq beads, in which up to 20% of the cell barcodes have errors in the last base (base 12), mostly because the beads (encoding the cell barcodes) only synthesized 11 (or fewer) bases 8 . In such cases, base 12 of the cell barcode is actually the first base of the UMI sequence, and the last base of the UMI sequence is from the poly(T) sequence. This bead synthesizing error can be detected by calculating the frequencies of T bases in the UMI sequences.
The scumi pipeline first detects possible erroneous cell barcodes and then merges these cell barcodes that are the same in their first 11 bases but differ in the 12th base. If more than one base of the UMI sequences (with the same cell barcode) had a high-frequency of T bases (more than 80%), these cell barcodes were removed from further analyses.

Mapping reads to a reference genome
We aligned the merged FASTQ files (each cDNA read with its cell barcode and UMI annotations) to a reference genome using STAR 51

Annotating each alignment with a gene tag
We use featureCounts 53 from the Subread package, v1.6.2, to add a gene tag to each alignment. To count reads overlapping with introns for single nucleus RNA-seq data, we used a two-step approach to first count the reads overlapping with exons. In the second step, the reads not overlapping with exons were recounted if they overlapped with introns. We only included reads aligning in the sense orientation with the genome annotation, except for Smart-seq2, which does not generate strand-specific data.

Counting transcripts of each gene in each cell
For the UMI-based methods, we used scumi to generate a cell x gene UMI count matrix. We included a multi-mapped read if all its alignments overlapped with a single gene, similar to the Cell Ranger pipeline 9 . We collapsed UMIs in reads from the same gene from the same cell based on a Hamming distance of one. To prevent over-collapsing UMIs 42 , we did not collapse two UMIs -in the same gene in the same cell -if they each had more than five reads support. For Smart-seq2, we used a similar procedure to generate the count matrix used for the sensitivity and technical precision metrics, except we created a cell x gene read count matrix. For clustering Smart-seq2 data and downstream analysis, we used RSEM 54 v1.3.0 to generate a cell x gene transcripts per million (TPM) matrix, which was used instead of the UMI count matrix. We generated the RSEM reference using the FASTA and GTF files used for creating the STAR and HISAT2 references (see Mapping reads to a reference genome). When generating the RSEM reference for cortex data, we modified the GTF to include one unspliced transcript per gene that included all introns and exons in that gene. This allowed us to count reads that mapped to introns. For all the high-throughput PBMC datasets, we extracted two times the number of expected cell barcodes for each method, choosing the cells with the most reads. We removed cells with a high fraction of reads aligning to mitochondrial genes (names starting with 'mt-' for mouse and 'MT-' for human) -greater than 75th percentile + 3 * IQR of the mitochondrial ratios across the top returned cell barcodes, where IQR stands for interquartile range. For each cell, its UMI counts were divided by the total number of UMIs from that cell and then scaled by multiplying 10,000 to get transcripts per 10,000 (TP10K). We then added 1 to these TP10K and log transformed by the natural log. We then performed PCA using all genes, did clustering analysis (Louvain clustering 56,57 of the k-nearest neighbor (k-NN) graph built from the first 50 principal components of each single-cell dataset with parameter k=30 and a resolution parameter used for Louvain optimization of 1.0, implemented in the Seurat package 58 v2.3.4 (see Parameter selection for clustering analysis section for more details), followed by differential gene expression analysis with the FindAllMarkers command in Seurat to find cluster specific (up-regulated) marker genes. To filter out clusters of cells likely derived from low quality cells or empty droplets, we removed clusters with insufficient markers genes, as follows. First, we identified marker genes for each cluster as genes expressed in ≥25% of cells in that cluster and with FDR <0.01 (significantly highly expressed in the cluster compared to the cells not in that cluster). Second, we excluded ribosomal protein coding genes, MALAT1, and genes starting with MTRNR, as they could be erroneously identified as marker genes after normalization and scaling because for cells with a small number of UMIs, the UMIs of highly expressed genes will be weighted more than those from cells with a large number of UMIs based on the scaling formula xj/sumj xj, where xj is the expression of gene j of a cell. Third, we only kept the clusters in which > 70% of the top 15 marker genes (or 10 out of all markers genes in a cluster that had <15 marker genes) were not mitochondrial protein coding genes as high expression of mitochondrial genes can indicate stressed cells or empty droplets. This process was repeated twice.

Selecting the number of cells
We used a modified strategy for Smart-seq2 and CEL-Seq2 because there were fewer cells to cluster, which potentially could have led to the low-quality cells not forming distinct clusters. The assumptions for these cell selections were that (1) there were enough low quality cells to form distinct clusters and (2) the clustering algorithm did not split high-quality cells of the same cell type into many distinct clusters because this could have led to some sub-clusters having too few marker genes. We therefore set k=5 (the number of neighbors in building the k-NN graph) to detect small clusters with a low solution parameter of 0.5 to prevent splitting cells of the same type into many clusters. We also only used the top 25 principal components as we did not expect to identify as many cell types from a smaller number of cells.
For each cortex dataset, we observed that the total numbers of UMIs from individual cell barcodes returned by scumi followed a bimodal distribution. We therefore first used a mixture of two Student's t distribution models to filter nuclei with few UMIs (with probability ≥0.5 likelihood for the mixture component with relatively few UMIs compared to the cell barcodes from the other mixture component) and then applied the filtering approach used for PBMC data. We used the top 25 principal components and set the number of nearest neighbors k to 10 in clustering analyses (the choice of parameters was to help recover rarer cell types with lower numbers of UMI per cell, such as the oligodendrocyte precursors). To prevent splitting up big clusters due to the small k, we lowered the resolution to 0.8. For these nuclei libraries, the mitochondrial ratios were very low compared to those from cells, so that it was less likely to see low quality nuclei with mitochondrial protein coding genes as the top marker genes. Therefore, in addition to using mitochondrial protein coding genes to remove poor quality nuclei, we removed cluster-specific marker genes that were also expressed in >70% of the nuclei from the other clusters in a given library. We also used the top 20 marker genes in each cluster for filtering. For Smart-seq2 data, we only used the mixture model for cell selection as the remaining clusters could be assigned known mouse cortex cell types and applying the cluster-based filtering removed many cells that could be easily identified as known cell types.

Multiplet Detection
In the mixture experiments, we considered cells multiplets if they had >15% of UMIs from mouse and >15% of UMIs from human. For Smart-seq2, this calculation was instead based on the percent reads, though with the same cutoffs. We report the observed multiplet rate, though the actual multiplet rate was likely to be about twice this frequency due to the presence of undetected multiplets that are all from cells of the same species (assuming about equal numbers of human and mouse cells).

Detection of both human and mouse genes expressed in single cells
In the Mixture experiments, for each human cell, we computed the percentage of detected mouse genes in that cell. Similarly, we repeated this analysis to detect human genes from mouse cells. To show the relationship between sequencing depth and the number of detected genes from the other species, we drew scatter plots for the detected human cells and mouse cells where the x-axis is the number of detected human genes and the y-axis is the number of detected mouse genes. Generally, the number of detected genes y from the other species increases linearly with the number of detected genes x from the called species. Therefore, we used robust linear regression (Huber Mestimator, implemented in the MASS R package) to fit y = ax + b, for human cells and mouse cells, independently. The fitted line and its slope (a) were plotted on the scatter plots.

Pseudo-bulk expression profile comparisons
We

Bulk RNA-seq processing and analysis
Expression in bulk RNA-seq data was quantified with RSEM 54 . We extracted TPM values and divided by 100 to transform the data into TP10K. For PBMCs, the log of the resulting TP10K was then compared to the pseudo-bulk data from each of the single cell methods (see Pseudo-bulk expression profile comparisons section) using Pearson correlation, with the exception of Smart-seq2. For Smart-seq2, we instead combined all FASTQ files into one joint FASTQ, and quantified TPM using RSEM using the same pipeline as for the bulk data. The percent mitochondrial transcripts for bulk data was calculated by taking the sum of TPM values over all mitochondrial genes (defined above) and dividing through by 1,000,000.
In addition to expression, de novo transcript construction was performed on the bulk RNA-seq data in the hope of correcting mistakes in the standard annotation. To achieve this reannotation, bulk RNA-seq data was mapped to the genome with HISAT2 52 v 2.0.5, using the -dta flag and otherwise default parameters. De novo transcript construction was then performed with StringTie 59 v0.1.18 with the parameters -v --rf -c 10 and -G, using the GTF from the scumi pipeline.
In order to quantify the improvement in performance by using the new annotation, we extracted reads from the annotated BAM file output by scumi that did not overlap any exons in the standard annotation (using the equally sampled data). We used featureCounts 53 to count the number of these reads that overlapped an exon in the reannotated GTF output, to quantify how many more reads per cell we recovered by using the reannotated reference.

Technical precision
We computed the extra Poisson variations that include both biological variations (assumed to be small and the same across methods in our data) and technical variations that cannot be explained by simple Poisson models 16  High extra Poisson CV indicates high technical noise. We used only genes expressed in more than 5% of the cells.

Sampling reads
In order to correct for differences in sequencing depth between methods, we used seqtk v1.0 (https://github.com/lh3/seqtk) to sample the sequencing data, so that for each method we could analyze nearly the same average number of reads per cell. Low-and high-throughput methods were sampled independently. For a given experiment, we first decided on the average number reads per cell to sample. We usually set this equal to the lowest average number of reads per cell for a method in that experiment, except for PBMC1 Seq-Well, which had a lower average number of reads per cell, so that we chose the library with the next lowest average number of reads per cell. For each library in a given experiment, we then derived a sampling ratio by dividing this sampling target by the original average number of reads per cell in that library. We sampled the FASTQ file for each library with the 'seqtk sample' command, using the sampling ratio calculated above and the random seed set to 100 (after combining FASTQ files from different sequencing runs). Although we aimed to have the exact same average number of reads per cell for each library, there were some small deviations from this in practice because the number of cells we identified in each library was not always the same before and after sampling, as well as due to the random nature of the sampling. Details of the sampling for each experiment are in Supplementary Table   5.

Effects of sequencing depth on gene detection
To estimate the sensitivity of each method at different sequencing depths, we sampled datasets to a given numbers of reads per cell. To efficiently sample without running scumi starting over using

Automatically assigning cell types to clusters
We followed common practices for scRNA-seq data clustering. Specifically, cells were divided into non-overlapping clusters by using the Louvain community detection algorithm 56,57 . For each cell from a dataset, we computed its k-nearest neighbors in that dataset, and then built a directed k-NN graph using all the cells from that dataset. This directed k-NN graph was further converted to an undirected weighed graph by using shared neighbors. The Louvain algorithm was used to partition the undirected weighted k-NN graph into non-overlapping clusters.
We used marker genes for each cell type to compute a score for each cell and automatically assign cell types to clusters. Both human PBMC and mouse cortex have well-annotated cell types and marker genes for each cell type 32,60,61 . We generated lists of marker genes for each tissue with manual curation (Supplementary Tables 12 and 13). The score of cell i for cell type m is a normalized version of the percentage of total counts from marker genes from cell type m.
Assuming that there were Nm marker genes for cell type m, we considered these Nm genes combined Based on the scores, we assigned cell types to clusters using the area under the receiver operating characteristic curves (AUCs). Stated another way, for a given cluster and a given cell type c, a cell i in that cluster is a true positive if the score si,c is above a given threshold and a false negative otherwise. On the other hand, a cell not in that cluster is a false positive if it has a score above the threshold and a true negative otherwise. A receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate at different score thresholds. The AUC is 1.0 for perfectly assigning a cell type to a cluster (we can find a threshed score perfectly separating a cluster from the rest), and around 0.5 for randomly assigning a score to a cell. Specifically, for each cluster, the cell type with the maximum AUC was assigned to that cluster. As the same type of cells can be split into several clusters, after initial assignment of cell types to clusters, we recomputed the AUC of a cluster for a cell type by excluding other clusters of cells that were assigned to that cell type. This process was repeated until no changes in the cluster assignment.
We then calculated the AUC for a cell type by merging the cluster of cells that were assigned to that cell type.

Parameter selection for clustering analysis
We considered the following parameters for the

Scoring each dataset to compare scRNA-seq methods
We used the AUCs to quantify the quality of each recovered cell type. After clustering analysis and assigning cell types to clusters, we also calculated the AUC of each cell type (from its cell type marker gene scores in separating that cluster of cells from the rest).

Standard, method-specific computational pipelines
We processed each of the PBMC datasets through the standard pipeline for that scRNA-seq method and compared the output statistics to those from scumi. If there were multiple sequencing runs for a given library, we combined the FASTQ files before further analysis. Additionally, files were renamed to match the naming schemes required for a given pipeline. Finally, we chose a cutoff by visual inspection of the histogram of genes per.
We used the inDrops pipeline from the initial publication 7 (available at https://github.com/indrops/indrops). We first constructed references according to the inDrops README using RSEM 54 (v1.3) with the GTF and FASTA files used to construct our scumi references. We trimmed the reads to be the same length in all runs (read 1 50 bases, read 2 19 bases, index reads 8 bases). We then ran the pipeline using the default YAML configuration, modified to work with Bowtie 62 (v1.1.1). We then extracted the gene by cell UMI count matrix.
For cell selection, we chose the nGene cutoff used by visual inspection of the histogram of nGenes.
For CEL-Seq2 data, we used the pipeline from the original publication 6 (available at https://github.com/yanailab/CEL-Seq-pipeline). We created a Bowtie 2 reference using the bowtie2-build command in Bowtie 2 63 (v2.2.1), using the same FASTA file as for the other pipelines. We processed the reads with the CEL-Seq2 pipeline with the default configuration file, modified to include our updated reference. We extracted and combined the gene by cell UMI count matrix. For cell selection, we chose the nGene cutoff used by visual inspection of the histogram of nGenes.
For Smart-seq2, we used the same pipeline as in the scumi results -HISAT2 52 to map the FASTQ file, featureCounts 53 to annotate the BAM file, followed by counting reads that overlap the exons of a unique gene -except for cell selection we chose the nGene cutoff by visual inspection of the histogram of nGenes.

Clustering combined datasets
For the PBMC and cortex datasets, to find cell types that were missed in analyzing each dataset independently, we took the count data for each method (after sampling the same number of reads and cells as describe above) in each experiment, and merged them into one count matrix. For Smart-seq2 we used reads counts, while for the other methods we used UMI counts. We loaded the results into R using Seurat, normalized to log counts per 10,000, and scaled the data. We identified variable genes separately for each library using the 'FindVariableGenes' function (with         Table 5 for the numbers of cells used. We could not confidently assign cell types to some clusters of cells from sci-RNA-seq and these cells were not used in calculating the AUCs.

Supplementary Figures
Supplementary Figure 1. Description of scRNA-seq methods evaluated.
Salient details for seven protocols tested in this paper.