Abstract
The recent rapid spread of single cell RNA sequencing (scRNA-seq) methods has created a large variety of experimental and computational pipelines for which best practices have not been established, yet. Here, we use simulations based on five scRNA-seq library protocols in combination with nine realistic differential expression (DE) setups to systematically evaluate three mapping, four imputation, seven normalisation and four differential expression testing approaches resulting in ∼ 3,000 pipelines, allowing us to also assess interactions among pipeline steps. We find that choices of normalisation and library preparation protocols have the biggest impact on scRNA-seq analyses. Specifically, we find that library preparation determines the ability to detect symmetric expression differences, while normalisation dominates pipeline performance in asymmetric DE-setups. Finally, we illustrate the importance of informed choices by showing that a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the sample size.
Introduction
Many experimental protocols and computational analysis approaches exist for single cell RNA sequencing (scRNA-seq). Furthermore, scRNA-seq analyses can have different goals including differential expression (DE) analysis, clustering of cells, classification of cells and trajectory reconstruction 1. All these goals have the first analysis steps in common in that they require expression counts or normalised counts. Here, we focus on these important first choices made in any scRNA-seq study, using DE-inference as performance read-out. Benchmarking studies exist only separately for each analysis step, which are library preparation protocols 2,3, alignment 4,5, annotations 6, count matrix preprocessing 7,8 and normalisation 9. However, the impact of the combined choices of the separate analysis steps on overall pipeline performance has not been quantified. In order to achieve a fair and unbiased comparison of computational pipelines, simulations of realistic data sets are necessary. This is because the ground truth of real data is unknown and alternatives, such as concordance analyses are bound to favour similar and not necessarily better methods.
To this end, we integrated popular methods for each analysis step into our simulation framework powsimR 10. As the basis for simulations, powsimR uses raw count matrices to describe the mean-variance relationship of gene expression measures. This includes the variance introduced during the experiment itself as well as extra variance due to the first to computational steps of expression quantification. Adding differential expression then provides us with detailed performance measures based on how faithfully DE-genes can be recovered.
One main assumption in traditional DE-analysis is that differences in expression are symmetric. This implies that either a small fraction of genes is DE while the expression of the majority of genes remains constant or similar numbers of genes are up- and down-regulated so that the mean total mRNA content does differ between groups 11. This assumption is no longer true when diverse cell types are considered. For example, Zeisel et al. 12 found up to 60% DE genes and differing amounts of total mRNA levels between cell types. This issue of asymmetry is conceptually one of the characteristics that distinguishes single cell from bulk RNA-seq and has not been addressed so far. Therefore, we simulate varying numbers of DE-genes in conjunction with small to large differences in mRNA content including the entire spectrum of possible DE-settings.
Realistic simulations in conjunction with a wide array of scRNA-seq methods, allow us not only to quantify the performance of individual pipeline steps, but also to quantify interdependencies among the steps. Moreover, the relative importance of the various steps to the overall pipeline can be estimated. Hence, our analysis provides sound recommendations regarding the construction of an optimal computational scRNA-seq pipeline for the data at hand.
Results
The starting point for our comprehensive pipeline comparison is a representative selection of scRNA-seq library preparation protocols (Figure 1A). Here, we included one full-length method (Smart-seq213) and four UMI methods 14,15,2,16. The UMI strategies encompass two plate-based (SCRB-seq, CEL-seq2) and the most common non-commercial and commercial droplet-based protocols (Drop-seq, 10X Chromium). CEL-seq2 differs from SCRB-seq in that it relies on linear amplification by in vitro transcription, while SCRB-seq relies on PCR amplification using the same strategy as 10X Chromium (see Ziegenhain et al. 17, 2 for a detailed discussion). We then combine the library preparation protocols with three mapping approaches 18,19,20 and three annotation schemes 21,22,23 resulting in 45 distinct raw count matrices (Online Methods). We simulated 27 distinct DE-setups per matrix, each with 20 replicates, resulting in a total of 19,980 simulated data sets (Figure 1B).
Genome-mapping quantifies more genes with high accuracy
We first investigated how expression quantification is affected by different alignment methods using our selection of scRNA-seq experiments. For each of the three following strategies we picked one the most popular methods (Supplementary Figure S2): 1. alignment of reads to the genome using splice-aware alignment (STAR 18), 2. alignment to the transcriptome (BWA19) and 3. pseudo-alignment of reads guided by a transcriptome (kallisto 24).We then combined these with three annotation schemes including two curated schemes (RefSeq 21 and Vega 23) and the more inclusive GENCODE 22 (Supplementary Table S2).
First, we assessed the performance by the number of reads or UMIs that were aligned and assigned to genes (Figure 2A and Supplementary Figure S3). Alignment rates of reads are comparable across all scRNA-seq protocols. Assignment rates on the other hand show some interaction between mapper and protocol. Notably, kallisto appears to have more problems assigning reads from the 3’ UMI protocols and does fine with the full length protocol Smart-seq2 (Supplementary Figure S5 and S6). Generally, STAR in combination with GENCODE aligned (82-86%) and assigned (37-63%) the most reads. BWA assigned a slightly lower fraction of reads (22-44%), but - suspiciously - these were distributed across more UMIs. As reads with the same UMI are more likely to originate from the same mRNA molecule and thus the same gene, the average number of genes with which one UMI sequence is associated, can be seen as a measure of false mapping. Indeed, we find that the same UMI is associated with more genes when mapped by BWA than when mapped by STAR (Figure 2B). This indicates a high false mapping rate, that probably inflates the number of genes that are detected by BWA (Figure 2C and Supplementary Figure S4). In contrast, the final UMI count matrix obtained with kallisto is more sparse, assigning the smallest number of reads and detecting 20-25% fewer genes than STAR (Figure 2A,C). Genes that are specifically missed by kallisto have on average more exons and are associated with more transcripts. It is also the higher number transcripts annotated in Gencode as compared to RefSeq that allows kallisto to perform better with RefSeq than with Gencode annotation (Supplementary Table S2 and Supplementary Figures S5 and S6).
This said, it remains to be seen what impact the differences in read or UMI counts obtained through the different alignment strategies and annotations have on the power to detect DE-genes.
As already indicated from the low fraction of assigned reads, kallisto has the lowest mean expression and the highest gene dropout rates (Figure 2D and Supplementary Figure S7) and, as expected from a high fraction of falsely mapped reads, BWA has the largest variance. To estimate the impact that these statistics have on the power to detect DE-genes, we use the mean-variance relationship to simulate data sets with DE-genes (Figure 2D,E). As previously reported 2, UMI protocols have a noticeably higher power than Smart-seq2 (Figure 2F). Moreover for Smart-seq2, we find that kallisto performs slightly better than STAR, while for UMI-methods STAR performs better (Figure 2F and Supplementary Figure S9).
In summary, using BWA to map to the transcriptome introduces noise, thus considerably reducing the power to detect DE-genes as compared to genome alignment using STAR or the pseudo-alignment strategy kallisto, but given the lower mapping rate of kallisto STAR with GENCODE is generally preferable.
Many asymmetric expression changes pose a problem without spike-in data
The next step in any RNA-seq analysis is the normalisation of the count matrix. The main idea here is that the resulting size factors correct for differing sequencing depths. In order to improve normalisation, spike-ins as an added standard can help, but are not feasible for all scRNA-seq library preparations. Another avenue to improve normalisation would be to deal with sparsity by imputing missing data prior to normalisation as discussed in the next chapter (Figure 1C). To begin with, we compare how much the estimated size factors deviate from the truth. As long as there is only a small proportion of DE-genes or if the differences are symmetric, estimated size factors are not too far from the simulated ones and there are no large differences among methods (Figure 3A and Supplementary Figure S12). However with increasing asymmetry, size factors deviate more and more and the single cell methods scran 25 and SCnorm 26 perform markedly better than the bulk methods TMM 27, MR 28 and Positive Counts as well as the single cell method Linnorm 29. Census 30 is an outlier in that it has a constant deviation of 0.1, which is due to filling in 1 when library sizes could not be calculated.
To determine the effect of these deviations on downstream analyses, we evaluated the performance of differential expression inference using different normalisation methods (Figure 3B and Supplementary Figure S15). Firstly, the differences in the TPR across normalisation methods are only minor, only Linnorm performed consistently worse (Supplementary Figure S13). In contrast, the ability to control the FDR heavily depends on the normalisation method (Supplementary Figure S14). For small numbers of DE-genes or symmetrically distributed changes, the FDR is well controlled for all methods except Linnorm. However, with an increasing number and asymmetry of DE-genes, only SCnorm and scran keep FDR control, provided that cells are grouped or clustered prior to normalisation. In our most extreme scenario with 60% DE-genes and complete asymmetry, all methods except Census loose FDR control. SCnorm, scran, Positive Counts and MR regain FDR control with spike-ins for 60% completely asymmetric DE-genes (Supplementary Figure S14). Given similar TPR of the methods, this FDR control determines the pAUC (3B,C).
Since in real data it is usually unknown what proportion of genes is DE and whether cells contain differing levels of mRNA, we recommend a method that is robust under all tested scenarios. Thus, for most experimental setups scran is a good choice, only for Smart-seq2 data without spike-ins, Census might be a better choice.
Imputation has little impact on pipeline performance
If the main reason why normalisation methods perform worse for scRNA-seq than for bulk data is the sparsity of the count matrix, reducing this sparsity by either more stringent filtering or imputation of missing values should remedy the problem 31. Here, we test the impact of frequency filtering and three imputation approaches (DrImpute 32, scone 33, SAVER 34) on normalisation performance. Note, that we use the imputation or filtering only to obtain size factor estimates, that are then used together with the raw count matrix for DE-testing.
We find that simple frequency filtering has no effect on normalisation results (Figure 3D). Performance as measured by pAUC is identical to using raw counts. In contrast, imputation can have an effect on performance and there are large differences among methods. Imputation with DrImpute and scone rarely increased the pAUC and occasionally as in the case of SCRB-seq with MR normalisation, the pAUC even decreased by 100% and 76%, respectively due to worse FDR control relative to using raw counts (Supplementary Figure 18). In contrast, these two imputation methods achieved an appreciable increase in pAUC together with scran normalisation, ∼ 28%, 4% and 9% for 10X Genomics, SCRB-seq and Smart-seq2 data, respectively. SAVER on the other hand never made things worse, irrespective of data set and normalisation method but was able to rescue FDR control for MR normalisation of UMI data, even in a completely asymmetric DE-pattern.
These observations suggest that data sets with a high gene dropout rate might benefit more from imputation than data sets with a relatively low gene dropout rate (Supplementary Figure S16-18). In order to further investigate the effect of imputation on sparse data, we downsampled the Smart-seq2 and SCRB-seq data, which were originally based on 1 million reads/cell, to make them more comparable to the 10X-HGMM data with on average of 60,000 reads/cell. A radical downsampling to 10% of the original sequencing depth decreases the number of detected genes for SCRB-seq by only 1%, suggesting that the original RNA-seq library was sequenced to saturation. In contrast, the Smart-seq2 data were much less saturated at 1 million reads/cell: Down-sampling reduced the number of detected genes by 34%. However, the relative effect of imputation on performance remains small. This is probably due to the fact that the main effect of downsampling is a reduction in the detected genes, which also cannot be imputed. Thus, if a good normalisation method is used to begin with (e.g. scran with clustering), the improvement by imputation remains relatively small.
Good normalisation removes the need for specialised single cell DE-tools
The final step in our pipeline analysis is the detection of DE-genes. Recently, Soneson and Robinson 31 benchmarked 36 DE approaches and found that edgeR 27, MAST35, limma-trend 36 and even the T-Test performed well. Moreover, they found that for edgeR, it is important to incorporate an estimate of the dropout rate per cell. Therefore, we combine edgeR here with zingeR 37.
Both edgeR-zingeR and limma-trend in combination with a good normalisation reach similar pAUCs as using the simulated size factors (Figure 4). However, in the case of edgeR-zingeR this performance is achieved by a higher TPR paid while loosing FDR control (see Supplementary Figure S20), even in the case of symmetric DE-settings (Supplementary Figure S22-S24).
Nevertheless, we find that DE-analysis performance strongly depends on the normalisation method and on the library preparation method. In combination with the simulated size factors or scran normalisation, even a T-Test performs well. Conversely, in combination with MR or SCnorm, the T-Test has an increased FDR (Supplementary Figure S20). SCnorms bad performance with a T-Test was surprising given SCnorms good performance with limma-trend (Figure 3B). One explanation could be the relatively large deviation of SCnorm derived size factors (Figure 3A and Supplementary Figure S12) which inflate the expression estimates.
Furthermore, we find that MAST performs consistently worse than the other DE-tools when applied to UMI-based data, but -except in combination with SCnorm- it is doing fine with Smart-seq2 data. Interestingly, Census normalisation in combination with edgeR-zingeR outperformed limma-trend with Smart-seq2 (Supplementary Figure S25). In concordance with Soneson and Robinson 31, we found that limma-trend, a DE-tool developed for bulk RNA-seq data showed the most robust performance. Moreover, library preparation and normalisation appeared to have a stronger effect on pipeline performance than the choice of DE-tool.
Normalisation is overall the most influential step
Because we tested a nearly exhaustive number of ∼3,000 possible scRNA-seq pipelines, starting with the choice of library preparation protocol and ending with DE-testing, we can estimate the contribution of each separate step to pipeline performance for our different DE-settings (Figure 1 B). We used a beta regression model to explain the variance in pipeline performance with the choices made at the seven pipeline steps 1) library preparation protocol, 2) spike-in usage, 3) alignment method, 4) annotation scheme, 5) preprocessing of counts, 6) normalisation and 7) DE-tool as explanatory variables. We used the difference in pseudo-R2 between the full model including all seven pipeline steps and leave-one-out reduced models to measure the contribution of each separate step to overall performance.
All pipeline choices together (the full model) explain ∼ 50% and ∼ 60% of the variance in performance, for 20% and 60% DE-genes, respectively (Figure 5A). Choices of preprocessing the count matrix contribute very little (∼R2 <= 1%). The same is true for annotation (∼R2 <= 2%) and aligner choices (∼R2 <= 5%). For aligner and annotation, it is important to note that these are upper bounds, because our simulations do not include differences in gene detection rates (Figure 2C).
Surprisingly, the choice of DE-tool only matters for symmetric DE-setups , and the choice of library preparation protocol has an even bigger impact on performance for symmetric DE-setups and additionally for 5% asymmetric changes . Normalisation choices have overall a large impact in all DE-settings (∼R2 = 12 - 38%), where the importance increases with increasing levels of DE-genes and increasing asymmetry. Spike-ins are only necessary if many asymmetric changes are expected and have little or no impact if only 5% of the genes are DE or the changes are symmetric (Figure 5A). Moreover, for completely asymmetric DE-patterns, the regression model did not converge without normalisation and spike-ins, because their absence or presence alone pushed the MCCs to the extremes.
For the best performing pipeline SCRB-seq + STAR + GENCODE + SAVER imputation + scran with clustering + limma-trend, using 384 cells per group instead of 96 improves performance only by 6.5-8%. Sample size is more important if a naive pipeline is used. For SCRB-seq + BWA + GENCODE + no count matrix preprocessing + MR + T-Test the performance gain by increasing sample size is 10-12% and even worse, for many asymmetric DE-genes, lower sample sizes occasionally appear to perform better (Figure 5B and Supplementary Figure S21). Next, we tested our pipeline on publicly available 10X Genomics data set containing the expression profiles of approx. 1000 human peripheral mononuclear blood cells (PBMC) 16. First, we classified the cells using SingleR 38 into the celltypes available in the Blueprint Epigenomics Reference 39 distinguishing Monocytes, NK-cells, CD8+T-cells, CD4+T-cells and B-cells (Figure 5C,D). We applied the previously defined good (STAR + gencode + SAVER imputation + scran with clustering + limma-trend) and naive (BWA + gencode + no preprocessing + MR + T-Test) pipeline to identify DE-genes between the cell types. Cross-referencing the identified DE-genes with known differences in marker gene expression 39, we find that the good pipeline always identifies a higher fraction of the marker genes as DE than the naive pipeline (Figure 5E). Comparing NK-cells and CD8+ T-cells, the good pipeline identifies 148 known markers as DE, while the naive pipeline finds only 54. The diminished separation between those two cell-types using the naive pipeline is already visible in the UMAP (Figure 5D).
In summary, we identify normalisation and library preparation as the most influential choices and the observation that differences in computational steps alone can significantly lower the required sample size nicely illustrates the importance of bioinformatic choices.
Discussion
Here we evaluate the performance of complete computational pipelines for the analysis of scRNA-seq data under realistic conditions with large numbers of DE-genes and differences in total mRNA contents between groups (Figure 1). Furthermore, our simulations allow us not only to investigate the influence of choices made at each pipeline step separately, but also to estimate the relative importance and interactions between different steps of an entire scRNA-seq analysis pipeline. We implemented all assessed computational methods and more in powsimR, so that users can easily evaluate pipeline performance given their own data and expected DE-settings.
Beginning with the creation of the raw count matrix, we find that transcriptome mapping with BWA 19 appears to recover the largest number of genes. However, many of these are probably due to falsely mapped reads, also increase expression variance which ultimately results in a lower sensitivity (Figure 2C-F). In contrast, the pseudo-alignment method kallisto 24 appears to assign reads precisely, but looses a lot of reads or UMIs with 3’ UMI-data. One possible explanation is that the 3’end of a gene alone often ends up in chimeric equivalence classes, thus leading to a lower gene detection rate and mean expression. Finally, a genome mapping approach using the splice-aware aligner STAR 18 in conjunction with GENCODE annotation recovers the most genes with the highest accuracy (Figure 5F).
Concerning the preprocessing of the count matrix, we found in concordance with Andrews and Hemberg 40 that in particular for sparse data such as 10X, SAVER34 imputation before normalisation improves performance, while filtering genes has no effect with our data sets and combinations of normalisation and DE-testing methods.
The choice that had the largest impact on performance throughout all tested DE-settings is the choice of normalisation method. Only for symmetric changes, the choice of library preparation protocol had a slightly larger impact than normalisation. In line with Evans et al. 11, we found that normalisation performance of bulk methods and also some of the single cell methods declined with asymmetry (Figure 3B). In particular, for 60% completely asymmetric DE-genes only Census retained FDR control. Unfortunately, Census is not recommended for the use with UMI-counts. Thus, for UMI-counts and 60% completely asymmetric changes, only the use of spike-ins could restore test performance. In the debate about the usefulness of spike-ins 41,17, we land on the pro side: Our simulations clearly show that spike-ins are useful in DE-testing settings with asymmetric changes which is likely to be a common phenomenon in scRNA-seq data. Due to good performance across DE-settings and its speed (Supplementary Figure S22) we would recommend scran with prior clustering as the best choice for normalisation (Figure 5F).
The choice in DE-testing method, our final pipeline step had relatively little impact on overall pipeline performance. A good normalisation prior to DE-testing alleviates the need for more complex and thus vulnerable methods, such as for example MASTs hurdle model which implicitly assumes that the CPM values were generated from zero inflated negative binomial count distribution. Indeed, in Vieth et al. 10 we showed that also scRNA-seq data fit a negative binomial distribution rather well and that the previously reported zero-inflation in scRNA-seq data is mainly due to amplification noise which is removed in UMI-data. Hence, it is not surprising that in concordance with Soneson and Robinson 31, we find that relatively straight forward DE-testing methods adapted from bulk RNA-seq perform well with scRNA-seq data.
Finally, we want to remark that paying attention to the details in a computational pipeline and in particular to normalisation pays off. The effect of using a good pipeline as compared to a naively compiled one has a similar or even greater effect on the potential to detect a biological signal in scRNA-seq data as an increase in cell numbers from 96 to 384 cells per group (Figure 5B).
Online Methods
Single Cell RNA-seq Data Sets
The starting point for our comprehensive pipeline comparison is the scRNA-seq library preparation (Figure 1 A). In our comparison, we included the gene expression profiles of mouse embryonic stem cells (mESC) as published in Ziegenhain et al. 2 (Supplementary Figure S1). We selected four data sets for our comparison: Smart-seq213 a well-based full-length scRNA-seq protocol, CEL-seq215 a well-based 3’ UMI-protocol using linear amplification, SCRB-seq a well-based 3’ UMI-protocol with PCR amplification 42,2 and Drop-seq 14 a droplet-based 3’ UMI-protocol. In addition, 92 poly-adenylated synthetic RNA transcripts of known concentration designed by the External RNA Control Consortium (ERCCs) 43 were spiked in for all methods except Drop-seq. All raw cDNA sequencing reads were cut to a length of 45 bases and downsampled to one million cDNA reads per cell (Supplementary Table S1 and Supplementary Figure S1).
Finally, we added a 10X Chromium data set sequencing mouse NIH3T3 cells 16, yielding ∼ 400 good cells with on average ∼ 60,000 reads/cell and another 10X data set analysing ∼ 1,000 human peripheral blood mononuclear cells (PBMCs).
These choices of library preparation protocols cover the diversity in current protocols without imposing partiality due to biological differences and technical handling.
Gene Expression Quantification
For genome mapping and quantification of the UMI-data with a splice-aware aligner, we used the zUMIs 44 (v.0.0.3) pipeline with STAR 18 (v.2.5.3a) and the mouse genome (Mus musculus.GRm38) together with annotation files (gtf) for GENCODE (vM15), Vega (VEGA68) and RefSeq (Release 85) (Supplementary Table S2). zUMIs is a fast and flexible pipeline for processing scRNA-seq data where cell barcode or UMI reads with low sequence quality reads are filtered out prior to UMI collapsing by sequence identity which yields identical count results as UMI-tools 45,44. For Smart-Seq2 we use the same pipeline settings as in zUMIs, simply omitting the UMI collapsing step (Supplementary Table S3).
For transcriptome alignment, we downloaded transcriptome fasta files corresponding to the annotations listed above. We used BWA 19 (v0.7.12) to align the scRNA-seq reads to these transcriptomes. We only removed reads that aligned equally well to transcripts of different genes as truly multi-mapped. The remaining reads were tallied up per gene. For UMI data, the reads were collapsed per gene by identity, similar to the strategy recommended in zUMIs.
For kallisto 24 (v0.43.1), a transcriptome-guided pseudo-alignment method, we followed the recommended quantification procedure to yield abundance estimates per equivalence class. To be comparable with other alignment methods, the counts per equivalence class were collapsed per gene. The counts of equivalence classes representing multiple genes were filtered out. For SCRB-seq, CEL-seq2, Drop-seq and 10X Genomics libraries, we chose the UMI-aware quantification option. The ERCC spike-in sequences were appended to the genome or transcriptome sequences for quantification.
Simulations
We used powsimR to estimate, simulate and evaluate single cell RNA-seq experiments 10. PowsimR has been independently validated for benchmarking DE-approaches 31 and consistently reproduces the mean-variance relationship and dropout rates of genes of scRNA-seq data (see also Supplementary Figure 28). The gene expression quantification using three different aligners in combination with three annotations per library preparation protocol produced 45 count matrices. These count matrices are the basis for our estimation in powsimR. Genes needed at least one read or UMI count in at least one cell to be considered in the estimation for simulation parameters. Since we 10 and others 46,47 have found previously, we assume that UMI counts follow a negative binomial distribution and only Smart-seq2 needs the inclusion of zero-inflation. To simulate spike-in data, we added an implementation of the simulation framework for pure technical variation of spike-ins described in Kim et al. 48 to powsimR. The parameters required for these simulations were estimated from 92 ERCC spike-ins in the SCRB-seq, CEL-seq2 and Smart-seq2 data, respectively 2. To evaluate the effect of differing sequencing depths, we added a new module to powsimR that estimates the degree of PCR amplification for UMI data. This allows the user to downsample a read count matrix by binomial thinning as implemented in edgeR thinCounts() 27 and then to reconstruct the corresponding UMI count matrix base on the estimated PCR amplification rates.
For a detailed evaluation of the pipelines, we simulated two groups of cells for pairwise comparisons with the following three sample size setups: 96 vs. 96, 384 vs. 384 or 50 vs. 200 cells (Figure 1B). For simplicity, we kept the number of genes that we simulated constant at 10,000. To introduce slight variation in expression capture, we draw a different size factor for each cell from a narrow normal distribution (X ∼ N (µ = 1, σ = 0.1)) (Figure 1B). This distribution fits the considered data sets well, irrespective of the applied library preparation method. Furthermore, the two groups of cells can have 5%, 20% or 60% differentially expressed genes. To capture the asymmetry of observed expression differences, we considered three setups of DE-patterns: symmetric (50% up- and 50% down-regulated), asymmetric (75% up- and 25% down-regulated) or completely asymmetric (100% up-regulated). The magnitude of expression change is drawn from a narrow gamma distribution (X ∼ Γ(α = 1, β = 2)) defining the log2 fold change, which is then added to the sampled mean expression. The combination of these parameters results in a total of 27 DE-setups that were then applied to the parameter estimates from 37 different count matrices to simulate 20 replicates for each setting, producing a total of 19,980 simulated data sets.
These data sets were then analysed by a nearly exhaustive number of combinations of four imputation strategies (scone, SAVER, DrImpute),gene dropout filtering (remove genes with more than 80% zero expression values) together with seven normalisation approaches (TMM, MR, Linnorm, Census, Linnorm, scran, SCnorm) with or without spike-ins, depending on library preparation protocol and method (Figure 1C). Normalisation factors were then derived as described in Soneson and Robinson 31 and used in conjunction with the raw count matrices for DE-testing using four representative approaches (T-Test, limma-trend, edgeR-zingeR, MAST). The resulting p-values were corrected for multiple testing with Benjamini-Hochberg FDR and we applied a threshold level of 10% to define positive test results. All these steps were seamlessly implemented into powsimR (github: https://github.com/bvieth/powsimR). In total we analysed 2,979 different RNA-seq pipelines.
Evaluation metrics
To evaluate the normalisation results, we determined the root mean squared error (RMSE) of a robust linear model using the difference between estimated and simulated library size factors as response variable in rlm() implemented in R-package MASS 49 (v.7.3-51.1) (Supplementary Figure S10) 9.
All other measures are based on the final results of an entire scRNA-seq pipeline ending with DE-testing. Knowing the identity of the genes that were simulated to show differing expression levels and the results of the DE-testing, we used a number of metrics related to the confusion matrix tabulating the number of true positives, false positives, true negatives and false negatives. We define the power to detect differential expression with the TPR . The false discovery rate is defined as . We combine these two measures in a TPR versus FDR curve to quantify the trade-off between true and false discoveries in a genome-wide multiple testing setup as advocated by 50. We then summarise these curves by their partial area under curve (pAUC) of TPR versus observed FDR that still ensures FDR control at the nominal level of 10% (Supplementary Figure S11). This way of calculating the AUC is ideal for data with relatively high true negative rates as the partial integration does not punish methods that are over-conservative, i.e. that stay way below the nominal FDR.
To summarise the whole confusion matrix in one representative value we chose the Matthews Correlation Coefficient because it is a balanced measure ensuring a reliable comparison of method performance across all DE-settings 50,51. As for the pAUC, we calculated the maximal value of MCC where the cutoff still ensured FDR control at the nominal level of 10%.
To quantify the relative contribution of each step in the analysis pipeline, we used the MCC as a response variable in a beta regression model implemented in R-package betareg (v.3.1-1) 52 with each individual pipeline step. Because the MCC assumes the extremes of 0 and 1 in some DE-settings, we applied the recommended transformation, namely , where n is the sample size 53. The contribution is then given by the difference between the full model pseudo - R2 containing all covariates versus a model leaving one step out at a time. This is then scaled to the total variance explained to give relative ∼R2 percentages.
Data Availability
The scRNA-seq data used in this manuscript are all publicly available, and they are summarised in Supplementary Table S1. The SCRB-seq, Smart-seq2, Drop-seq, CEL-seq2 data are available at the Gene Expression Omnibus (GEO) under accession code GSE75790. The HGMM and PBMC data sets are available at 10x Genomics’s official website (https://support.10xgenomics.com/single-cell-gene-expression/datasets).
Code Availability
The software and code used are summarised in Supplementary Table S3 and S4. A compendium containing processing scripts and detailed instructions to reproduce the analysis for this manuscript is available from the following GitHub repository (https://github.com/bvieth/scRNA-seq-pipelines).
Author Contributions
B.V. and I.H. conceived the study. B.V. prepared and analysed the scRNA-seq data. B.V implemented and conducted the simulation and evaluation framework. S.P.and C.Z. helped in data processing and power simulations. W.E. and I.H. supervised the work and provided guidance in data analysis. B.V., I.H., and W.E. wrote the manuscript. All authors read and approved the final manuscript.
Acknowledgments
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) through LMUexcellent, SFB1243 (Subproject A14/A15) and DFG grant HE 7669/1-1. C.Z. is recipient of an EMBO long-term fellowship (ALTF 673-2017).
Footnotes
↵+ hellmann{at}bio.lmu.de
In this revised manuscript, we added the analysis of a real dataset and show that pipeline choices indeed have an effect on identification and characterization of cell-types in scRNA-seq datasets. Furthermore, we investigate the detection biases that lead to the observed differences in the genes found by the different mappers and annotations. Finally, we added a downsampling function to our simulator powsimR, that now allows us to evaluate different sequencing depths and thus improves the comparability between the 10X Chromium and the other library preparation methods.