Abstract
Normalisation of single cell RNA sequencing (scRNA-seq) data is a prerequisite to their interpretation. The marked technical variability and high amounts of missing observations typical of scRNA-seq datasets make this task particularly challenging. Here, we introduce bayNorm, a novel Bayesian approach for scaling and inference of scRNA-seq counts. The method’s likelihood function follows a binomial model of mRNA capture, while priors are estimated from expression values across cells using an empirical Bayes approach. We demonstrate using publicly-available scRNA-seq datasets and simulated expression data that bayNorm allows robust imputation of missing values, generating realistic transcript distributions that match single molecule FISH measurements. Moreover, by using priors informed by dataset structures, bayNorm improves the accuracy and sensitivity of differential expression analysis and reduces batch effects compared to other existing methods. Altogether, bayNorm provides an efficient, integrated solution for global scaling normalisation, imputation and true count recovery of gene expression measurements from scRNA-seq data.
Introduction
scRNA-seq is a method of choice for profiling gene expression heterogeneity genome-wide across tissues in health and disease 1, 2. Because it relies on the detection of minute amounts of biological material, namely the RNA content of one single cell, scRNA-seq is characterised by unique and strong technical biases. These arise mainly because scRNA-seq library preparation protocols recover only a small fraction of the total RNA molecules present in each cell. As a result, scRNA-seq data are usually very sparse, with many genes showing missing values (i.e. zero values, also called dropouts). The fraction of all transcripts recovered from a cell is called capture efficiency and varies from cell to cell, resulting in strong technical variability in transcript expression levels and dropout rates. Moreover, capture efficiencies tend to vary between experimental batches, resulting in confounding “batch effects”. Correcting for these biases in order to recover scRNA-seq counts that accurately reflect the original numbers of transcripts present in a cell remains a major challenge in the field 3–5.
A common approach to scRNA-seq normalisation is the use of cell-specific global scaling factors. These methods are based on principles developed for normalisation of bulk RNA-seq experiments and assume that gene specific biases are small 3. Typically, read counts per cell are divided by a cell specific scaling factor estimated either from spike-in controls6, or directly from the transcriptome data using methods developed initially for bulk RNA-seq7–9 or specifically for scRNA-seq 10,11. A recent method called SCnorm extended the global scaling approach by introducing different scaling factors for different expression groups12.
Importantly, scaling methods do not correct for cell-to-cell variations in dropout rates, as genes with zero counts remain zero after division by a scaling factor. Several approaches have been designed to tackle this problem. A series of methods use zero-inflated distribution functions to explicitly model the dropout characteristics13–15. Alternatively, other studies have proposed to infer dropouts based on expression values pooled across cells or genes16–19. For instance, scImpute pools expression values across similar cell subpopulations in each dataset and imputes dropouts using a Gamma-Normal mixture model and population specific thresholds18. Similarly, the MAGIC package is based on pooling gene expression values across cells using a network-based similarity metric 19. Conversely, the Saver approach pools expression values across genes within each cell using a Gamma-Poisson Bayesian model 17. The Gamma-Poisson model is also used in two other packages called Splatter and scVI for simulating and normalising scRNA-seq data respectively 20, 21. scVI belongs to a new class of approaches which implement deep learning variational autoencoder or autoencoder methods 16,20,22–24. For instance, DCA, an autoencoder method, utilises a zero-inflated negative binomial noise model16. Experimental batch-to-batch variations are another common source of technical variability in scRNA-seq data. The origin of batch effects is not fully understood but results at least in part from differences in average capture efficiencies across experiments 25. Several methods have been developed to specifically remove batch effects in scRNA-seq data26–28.
The methods discussed above treat normalisation, imputation, and batch effect correction as separate tasks. Moreover, they rely on strong assumptions such as the zero-inflation model. Here we provide a detailed account of a novel integrated approach called bayNorm, which performs all of these processing steps simultaneously using minimal assumptions. We compared its performance with a series of available packages, focusing on true count recovery, differential expression analysis and batch effect correction.
The bayNorm rationale
bayNorm is a Bayesian implementation of global scaling normalisation that simultaneously imputes missing values in scRNA-seq data. bayNorm generates for each gene (i) in each cell (j) a posterior distribution of original expression counts (x0ij), given the observed scRNA-seq read count for that gene (xij) (Fig. 1a). Using the Bayes rule we have:

$$\Pr\left(x^{0}_{ij} \mid x_{ij}, \beta_j\right) = \frac{\Pr\left(x_{ij} \mid x^{0}_{ij}, \beta_j\right)\,\Pr\left(x^{0}_{ij}\right)}{\Pr\left(x_{ij}\right)}$$

where Pr(x0ij | xij, βj) is the posterior distribution of true gene expression counts of a given gene in a given cell, and Pr(xij | x0ij, βj) is a likelihood function that depends on the cell specific capture efficiency (βj). Cell specific capture efficiencies can be estimated using spike-in controls or directly from the data using scaling factors provided by different methods3 and normalised to the dataset’s mean capture efficiency < β > (see Methods). Pr(x0ij) is a gene specific prior expression distribution and Pr(xij) is the marginal likelihood. The outputs of bayNorm are either samples (3D array) or point estimates (2D array) from the posterior distributions (Fig. S1).
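To make the Bayes computation above concrete, the posterior over original counts can be evaluated numerically on a grid of candidate true counts. The following Python sketch is illustrative only (bayNorm itself is an R package), and the gene parameters (μ, ϕ), capture efficiency β and observed count are invented for the example:

```python
import numpy as np
from scipy import stats

def posterior_true_counts(x, beta, mu, phi, max_count=1500):
    """Posterior over the original count x0, given the observed count x,
    the capture efficiency beta and a negative binomial prior NB(mu, phi)."""
    x0 = np.arange(x, max_count)               # original count x0 >= observed x
    # scipy's NB takes (size, success prob); success prob = phi / (phi + mu)
    prior = stats.nbinom.pmf(x0, phi, phi / (phi + mu))
    likelihood = stats.binom.pmf(x, x0, beta)  # binomial capture model
    post = prior * likelihood
    return x0, post / post.sum()               # normalise (Bayes rule)

x0, post = posterior_true_counts(x=2, beta=0.05, mu=20.0, phi=4.0)
mean_est = (x0 * post).sum()    # point estimate: posterior mean
mode_est = x0[np.argmax(post)]  # point estimate: posterior mode
```

Draws from `post` would populate the 3D-array output, while point estimates such as `mean_est` or `mode_est` correspond to the 2D-array output.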
The binomial model is an appropriate choice for the bayNorm likelihood function
The bayNorm likelihood function is at the core of the approach and describes the empirical distribution of the raw experimental scRNA-seq counts. The binomial model describes the random sampling of a fraction of a cell transcriptome with constant probability. This is a simple model of transcript capture in scRNA-seq29 and we therefore hypothesised that it would be a good choice for the bayNorm likelihood function. For the prior Pr(x0ij), we assume a negative binomial model, which describes the bursty distribution of mRNAs in simple models of gene expression 30, 31. Gene specific prior parameters are estimated using an empirical Bayes approach by pooling gene expression values across multiple cells of the dataset (see Methods for details).
To validate our choice of binomial likelihood model and prior estimates, we generated simulated scRNA-seq data based on these assumptions and investigated how closely they captured the statistics of several published scRNA-seq datasets (Fig. 1 b-e, Fig. S2–7) 12, 29, 32, 33. The simulations assumed mRNA counts per cell that followed negative binomial distributions and used gene specific priors obtained with bayNorm (Fig. 1, ‘Binomial_bayNorm’), or sampled from estimates obtained with a modified version of the Splatter package (Fig. 1, ‘Binomial_Splatter’, Supplementary Notes 1)21. These were compared with simulations generated with the original Splatter package, which is based on the Gamma-Poisson distribution21. We note that in Splatter, scaling factors are applied multiplicatively to the Gamma distribution’s mean. In bayNorm, however, the cell specific capture efficiencies, which act as scaling factors, are set as the probability parameter of the binomial model. We found that the binomial model captures the variance-mean relationship of experimental scRNA-seq data well (Fig. 1b).
Another important feature of scRNA-seq data is their large amount of missing values, or dropouts, and several models have been proposed to explain this phenomenon 14, 15, 25, 34, 35. We therefore investigated how well the binomial model would capture dropout rates in experimental data. Our simulated dataset generated using the ‘Binomial_bayNorm’ function reproduced accurately the dependence of dropout fractions on gene expression means, performing better than Splatter (Fig. 1c-e). Moreover, a parameter free approximation based on the binomial model predicted the dropout fraction to depend on an exponential of the negative mean expression (exp(−< β >μ), see Methods). This function produced a very close fit to the experimental data, providing additional support for our choice of the binomial model (Fig. 1c). Notably, the Binomial_bayNorm simulation protocol using inferred gene-specific priors together with cell specific parameters (βj) was the only one that recovered the distribution of dropout rates per gene observed in experimental data (Fig. 1d). Finally, the results presented on Fig. 1b-e could be replicated consistently using several additional experimental scRNA-seq datasets (Fig. S2–7).
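As a sanity check on this reasoning, one can simulate binomial capture of negative binomial true counts and compare the observed dropout fraction with the parameter-free prediction exp(−< β >μ). This Python sketch uses invented parameters (a large dispersion ϕ, i.e. near-Poisson counts, where the approximation is tightest):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, phi, n_cells = 0.1, 100.0, 50_000  # invented capture efficiency / dispersion
results = {}
for mu in (1.0, 5.0, 20.0):
    # NB(mu, phi) true counts via the Gamma-Poisson mixture representation
    lam = rng.gamma(shape=phi, scale=mu / phi, size=n_cells)
    true_counts = rng.poisson(lam)
    observed = rng.binomial(true_counts, beta)   # binomial capture
    empirical_dropout = np.mean(observed == 0)
    predicted_dropout = np.exp(-beta * mu)       # parameter-free approximation
    results[mu] = (empirical_dropout, predicted_dropout)
```

As expected, the empirical dropout fraction tracks the exponential prediction and decreases with mean expression.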
The datasets discussed so far include sequencing counts corrected for PCR amplification biases using unique molecular identifiers (UMIs)36. Some popular protocols, however, do not include UMIs, and are therefore likely to be less well described by the binomial distribution due to technical variability arising from PCR amplification bias. Accordingly, their dependence of dropout fractions on the mean expression has been reported to be more complex than in UMI-based datasets 35. We investigated this issue further and found that a simple scaling of non-UMI raw data by a constant factor produced a reasonable match to the binomial model (Fig. S9; see Methods). This scaling factor can be interpreted as the average number of times original mRNA molecules were sequenced after PCR amplification. This indicates that, provided appropriate scaling, non-UMI datasets are also compatible with the bayNorm model. Importantly, as bayNorm recovered dropout rates successfully in both UMI-based and non-UMI protocols without the need for specific assumptions, we conclude that invoking zero-inflation models is not required to describe scRNA-seq data. Consistent with this, the differences in mean expression levels of lowly expressed genes observed between bulk and scRNA-seq data, which were suggested to be indicative of zero-inflation, were recovered by our simulated data using the binomial model only (Fig. S10)37.
We note that the ability of simulation protocols to recover the statistics of experimental data depended intimately on the value of cell-specific capture efficiencies (βj). We used different ways to estimate βj (spike-ins, Scran scaling factors, trimmed means, or housekeeping genes; Supplementary Note) together with different < β > in the Binomial_Splatter simulation protocol. We found that changes in βj values affected recovery of the distribution of dropout rates per cell (Fig. S8). In particular, we found that the use of spike-in controls or of housekeeping reference gene expression levels did not improve estimates of capture efficiencies (Fig. S8c-f). Altogether, this analysis demonstrates that accurate statistics of experimental scRNA-seq data can be consistently retrieved using the binomial model and empirical Bayes estimation of gene expression parameters implemented in bayNorm, along with accurate estimates of cell-specific capture efficiencies.
bayNorm enables recovery of true gene expression distributions from scRNA-seq data
Single-cell RNA-seq provides a unique opportunity to study stochastic cell-to-cell variability in gene expression at a near genome-wide scale. However, doing this requires normalisation approaches able to retrieve transcript levels from scRNA-seq data that quantitatively match in vivo mRNA numbers 32. With this in mind, we evaluated bayNorm’s performance in reconstructing true gene expression levels from a series of experimental scRNA-seq datasets that contained matched single molecule fluorescence in situ hybridisation (smFISH) measurements for a series of genes. We used global mean capture efficiencies < β > estimated directly from smFISH together with gene specific priors informed by the sequencing data (Fig S11). After bayNorm normalisation, scRNA-seq counts accurately reproduced the count distributions obtained by smFISH for several mRNAs (Fig 2a-b). We then compared bayNorm’s performance with a series of published normalisation methods (Supplementary note 4, Fig 2). All methods captured mean smFISH counts across different genes well (Fig. 2c-d, Fig S11). However, noise in gene expression (coefficient of variation, CV) and expression dispersion (Gini coefficient) measured by smFISH were better captured by bayNorm than by normalisation by scaling or by several recent normalisation and imputation methods (Fig. 2e-f, Fig. 2g-h) 12,16–19. bayNorm’s good performance was also confirmed in a series of simulation studies (Fig S12, Supplementary note 1). In summary, bayNorm, combined with gene specific priors inferred directly from the scRNA-seq data, retrieves gene expression variability matching closely smFISH data.
bayNorm enables accurate and sensitive differential expression analysis
Differential gene expression (DE) analysis in scRNA-seq studies is challenging as several factors including variability in capture efficiencies, dropout rates, sequencing depth, and experimental batch effects can introduce significant, yet spurious, differential expression signal. Normalisation and imputation approaches have, therefore, a significant impact on the sensitivity and accuracy of DE analysis protocols. Two features of the bayNorm approach have the potential to improve the performance of DE analysis. Firstly, bayNorm posterior distributions of original counts maintain the uncertainty resulting from small capture efficiencies and could therefore reduce false positive DE discovery rates38. Secondly, the use of priors specific to each group of cells compared in the DE analysis could increase true positive discovery rates. With this in mind, we have assessed bayNorm performance in DE analysis using several experimental scRNA-seq datasets and compared it to other existing methods. To identify DE genes we used MAST13, which performs well in terms of false positive rates, precision and recall 39. MAST was first applied to individual samples from the bayNorm posterior distribution (3D array, Fig. S1). Differentially expressed genes were then called based on the median of Benjamini-Hochberg adjusted P-values of the individual samples28.
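The per-sample DE calling scheme described here (test each posterior sample, adjust P-values within each sample, then take the gene-wise median) can be sketched as follows. MAST is an R package, so this Python illustration substitutes a Wilcoxon rank-sum test purely as a stand-in for the per-sample test; the data are synthetic:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / (np.arange(n) + 1)
    adj = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    out = np.empty(n)
    out[order] = np.minimum(adj, 1.0)
    return out

def de_median_adjusted_p(samples_a, samples_b):
    """samples_*: (n_posterior_samples, n_genes, n_cells) arrays of posterior
    draws for two cell groups. Tests each gene in each posterior sample
    (Wilcoxon rank-sum here, as a stand-in for MAST), BH-adjusts within each
    sample, then reports the median adjusted p-value per gene."""
    n_s, n_g = samples_a.shape[:2]
    adj = np.empty((n_s, n_g))
    for s in range(n_s):
        pvals = [mannwhitneyu(samples_a[s, g], samples_b[s, g]).pvalue
                 for g in range(n_g)]
        adj[s] = bh_adjust(pvals)
    return np.median(adj, axis=0)

rng = np.random.default_rng(2)
a = rng.poisson(10, size=(5, 2, 40))        # 5 posterior samples, 2 genes, 40 cells
b = rng.poisson(10, size=(5, 2, 40))
b[:, 0, :] = rng.poisson(30, size=(5, 40))  # gene 0 is truly regulated
med_p = de_median_adjusted_p(a, b)
```

Here gene 0 receives a small median adjusted P-value while the unregulated gene 1 does not.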
As mentioned above, differences in capture efficiencies between cells are a source of technical variability that could affect DE analysis. To test bayNorm’s ability to correct for this bias, we selected the 100 cells with the highest and lowest capture efficiencies based on total counts in a recent UMI-based scRNA-seq study 29. We then applied bayNorm to the 200 cells using global prior estimation based on the combination of the two groups (see Methods). In this design, the two groups of cells differ based only on their capture efficiencies, and significant differential expression is therefore not expected. Fig. 3a shows the number of genes called differentially expressed as a function of increasing average expression levels using a series of normalisation and imputation methods12. bayNorm normalised data show almost no differentially expressed genes, outperforming all the other methods. Moreover, log2 gene expression ratios between cells of the two groups were consistently close to zero, confirming bayNorm’s ability to correct for biases inherent to different capture efficiencies in UMI-based datasets (Fig. 3b).
Sequencing depth is another parameter affecting DE analysis, especially because it impacts the dropout rates of lowly expressed genes. Moreover, differences in sequencing depth are likely to affect levels of capture efficiencies, especially for non-UMI datasets where PCR biases are not accounted for. To assess bayNorm’s ability to correct for this source of bias, we used a benchmark dataset published by Bacher and colleagues12 that consists of non-UMI based scRNA-seq data for two groups of cells isolated from a single culture and sequenced to a depth of either 1 million or 4 million reads per cell. bayNorm and other imputation methods performed well in this setting (Fig. S13). However, a global scaling approach on its own led to poor results, unless performed independently on groups of genes with similar mRNA expression levels as in SCnorm. Finally, bayNorm corrected robustly for variability in sequencing depth when applied to a series of simulated datasets (Fig. S14–15)12.
We have shown that bayNorm is efficient at removing spurious differential expression from scRNA-seq data caused by variability in capture efficiencies and sequencing depth. We next explored bayNorm performance in supporting sensitive and robust detection of genes truly regulated between samples. To do this, we used two experimental scRNA-seq datasets40, 41 and lists of benchmark DE genes derived from matched bulk RNA-seq data39, 42. To maximise sensitivity, we used priors specific to each group of cells in the comparison (we call this design “local priors”). With the first dataset, bayNorm normalised data generated AUC values as high as other normalisation methods, demonstrating that the approach supports sensitive DE detection (Fig. 3c). Analysis of the second dataset (UMI-based)41 confirmed this observation, with bayNorm performing better than all other methods (Fig. 3d). Importantly, bayNorm performance did not depend on the number of cells in each group, except for groups with very low numbers of cells (Fig. 3d, Fig. S16). Finally, using a series of simulated datasets, we explored situations where the compared groups have different mean capture efficiencies and found that bayNorm supported robust DE detection in all cases (Fig. S17).
Three important parameters should be considered before bayNorm normalisation: i) the choice of priors, ii) the choice of average capture efficiencies < β >, iii) the choice of bayNorm output format. Prior parameters can be either estimated for all cells across groups (global) or within each group (local). Since priors are gene specific, applying bayNorm across homogeneous cells (i.e. using a global prior) mitigates technical variations (Fig S18a-b). On the other hand, using priors estimated “locally” within each group amplifies differences in signals between heterogeneous groups of cells, increasing sensitivity (Fig S18c-d). Average capture efficiencies < β > are specific to each scRNA-seq protocol and reflect their overall sensitivity. This value represents the ratio of the average number of mRNA molecules sequenced per cell to the total number of mRNA molecules present in an average cell. It is not always easy to determine as quantitative calibration methods such as smFISH are not widely used, and approaches based on spike-in controls have important shortcomings3. We investigated the impact of inaccurate estimation of < β > on biases in DE detection. Critically, we found that DE results based on bayNorm normalised data are not affected significantly by a two-fold change of < β > (Fig. S20–S21). Finally, bayNorm output consists of either samples from its posterior distributions (3D array) or the modes of these distributions as point estimates (2D arrays). For DE analysis using MAST, 3D arrays reduce false positive rates but 2D arrays perform slightly better in terms of AUC (Fig S18c-d). Fig S19 shows DE results for two other non-parametric methods: ROTS43 and the Wilcoxon test39. Both approaches perform equally well with 3D arrays but show variable results when applied to 2D arrays, with the Wilcoxon test performing less well.
In summary, our analysis demonstrates that in addition to correcting for technical biases, bayNorm also supports robust and accurate DE analysis of a wide range of experimental and simulated scRNA-seq datasets.
bayNorm correction of experimental batch effects
scRNA-seq protocols are subject to significant experimental batch effects33. In cases where the study design does not take this problem into account, by distributing cases and controls across batches for instance, batch effects can lead to artefactual differences in gene expression of single cells, resulting in inaccurate biological conclusions. bayNorm can mitigate batch effects in two ways. First, as described above, bayNorm efficiently corrects for differences in capture efficiencies, which are a pervasive source of batch-to-batch variability37. Second, the use of bayNorm data-informed priors is an efficient way to mitigate batch variation, by estimating prior parameters across different batches but within the same biological condition. To investigate bayNorm’s performance for batch effect correction, we used data from the Tung study 33 where scRNA-seq data were obtained in triplicates for three induced pluripotent stem cell lines (iPSC) derived from three individuals. Sequencing libraries were prepared in three experimental batches, each containing one repeat of each line33. We first used priors calculated within each individual, but across batches (bayNorm local (individual)). This strategy maintains differences between individuals while minimising batch effects, as illustrated by PCA analysis (Fig. 4a-b, Fig S22). To assess the normalisation performance quantitatively, we extracted the number of genes differentially expressed between each pair of batches within the same individual (Fig S23). We defined the ratio of the number of DE genes (adjusted PMAST < 0.05) to the total number of genes (13,058) to be the false positive rate (FPR). In theory, batch effects should be the main source of differential expression between these samples33. In parallel, we tested whether bayNorm also maintained differences between individuals using the same settings.
To do this, we defined DE genes between the iPSC lines NA19101 and NA19239 and compared them to a benchmark list of 498 DE genes42. Efficient batch effect correction is expected to minimise the FPR while maximising the Area Under the Curve (AUC) of DE detection between individuals. We found that using bayNorm with “within individual” local priors (estimated across different batches within the same line) outperformed other methods in terms of correcting batch effects while maintaining meaningful biological information. As expected, using bayNorm with global priors (estimated across batches and individuals, bayNorm global) preserves a low FPR, but reduces AUC significantly. Finally, using bayNorm with “within batch” local priors (bayNorm local (batch)) results in higher false positive rates, which is also expected.
Overall, we have shown that the flexibility of prior selection afforded by bayNorm’s Bayesian approach enables robust correction of batch effects, while maintaining sensitive detection of differentially expressed genes.
Conclusions
We introduced bayNorm, a versatile Bayesian approach for implementing global scaling that simultaneously provides imputation of missing values and true count recovery of scRNA-seq data. We showed that using a binomial model of mRNA capture as likelihood and an empirical Bayes approach to estimating gene expression priors across cells results in simulated data almost identical to experimental scRNA-seq measurements. Importantly, this suggests that zero-inflated models are not required to explain the frequency of dropouts observed in scRNA-seq. Although designed initially for UMI-containing scRNA-seq protocols, a simple scaling factor makes bayNorm applicable to non-UMI data as well. This flexibility will allow using this approach with most present and future scRNA-seq datasets. We showed, using datasets that combine smFISH and scRNA-seq, that bayNorm accurately recovers true gene expression across a wide range of expression levels. This approach could therefore be particularly useful for quantitative analysis of more difficult scRNA-seq datasets, such as those generated from small quiescent cells or microbes, for instance. In fact, we have recently used bayNorm successfully in the first scRNA-seq study of fission yeast44. One of the most powerful features of bayNorm is its use of gene expression priors calculated directly from gene expression values across cells. We showed that grouping cells according to experimental design or phenotypic features significantly increased the robustness and sensitivity of differential expression analysis. This allows almost complete removal of sequencing depth and capture efficiency biases, and reduces batch effects. Critically, this approach preserved accurate and sensitive detection of benchmark DE genes.
Accurate estimation of cell capture efficiencies (or scaling factors) is central to most scRNA-seq normalisation methods, including bayNorm. Interestingly, we observed that the choice of cell specific capture efficiencies affects how closely simulated data recover the statistics of real data. We therefore propose that comparison of dropout rates per cell in simulated and experimental datasets could be used as a tool to inform the appropriate choice of global scaling factors and mean capture efficiency estimates. The option to tailor bayNorm priors based on phenotypic information about cell subpopulations will be a powerful asset for discovery of gene expression programmes associated with specific phenotypic features of single cells such as cell size44. Finally, the concepts and mathematical framework behind bayNorm will be useful if combined with other emerging theoretical approaches such as deep learning, for instance 16, 20, 22–24. Overall, bayNorm provides a simple and integrated solution to remove the technical biases typical of scRNA-seq approaches, while enabling robust and accurate detection of cell-specific changes in gene expression. bayNorm has been made freely available as an R package (see Methods).
Methods
1 The Bayesian model used in bayNorm
A scRNAseq dataset is typically represented in a matrix of dimension P × Q, where P denotes the total number of genes observed and Q denotes the total number of cells studied. The element xij (i ∈ {1, 2, …, P} and j ∈ {1, 2, …, Q}) in the matrix represents the number of transcripts reported for the ith gene in the jth cell. This is equal to the total number of sequencing reads mapping to that gene in that cell for a non-UMI protocol. For UMI based protocols this is equal to the number of individual UMIs mapping to each gene[5, 6]. The matrix can include data from different groups or batches of cells, representing different biological conditions. This can be represented as a vector of labels for the cell groups or conditions (Cj).
A common approach for normalizing scRNAseq data is based on the use of a global scaling factor (sj), ignoring any gene specific biases (for a recent review see[7]). The normalized data is obtained by dividing the raw data for each cell j by its global scaling factor sj:

$$\tilde{x}_{ij} = \frac{x_{ij}}{s_j}$$
In bayNorm, we implement global scaling using a Bayesian approach. We assume that, given the original number of transcripts x0ij in the cell, the number of transcripts observed (xij) follows a Binomial model with probability βj [1], which we refer to as capture efficiency; it represents the probability of an original transcript in the cell being observed. In addition, we assume that the original number, or true count, x0ij of the ith gene in the jth cell follows a Negative Binomial distribution with mean μi and size (or dispersion) parameter ϕi, such that:

$$\Pr\left(x^{0}_{ij}=n\right)=\binom{n+\phi_i-1}{n}\left(\frac{\phi_i}{\phi_i+\mu_i}\right)^{\phi_i}\left(\frac{\mu_i}{\phi_i+\mu_i}\right)^{n}$$

So, overall we have the following model:

$$x_{ij}\mid x^{0}_{ij},\beta_j \sim \mathrm{Binomial}\left(x^{0}_{ij},\beta_j\right), \qquad x^{0}_{ij}\sim \mathrm{NB}\left(\mu_i,\phi_i\right)$$
Using the Bayes rule, we have the following posterior distribution of the original number of mRNAs for each gene in each cell:

$$\Pr\left(x^{0}_{ij}\mid x_{ij},\beta_j,\mu_i,\phi_i\right)=\frac{\Pr\left(x_{ij}\mid x^{0}_{ij},\beta_j\right)\,\Pr\left(x^{0}_{ij}\mid\mu_i,\phi_i\right)}{\Pr\left(x_{ij}\mid\mu_i,\phi_i,\beta_j\right)}$$
The prior parameters μ and ϕ of each gene were estimated using an empirical Bayesian method as discussed in detail in Section 4 below.
The marginal distribution for gene i in cell j is

$$\Pr\left(x_{ij}\mid\mu_i,\phi_i,\beta_j\right)=\mathrm{NB}\left(\mu_i\beta_j,\phi_i\right),$$

which follows by marginalising the model in Eq. (4) over the unobserved original counts. Hence we have that the number of transcripts reported for the ith gene in the jth cell has a Negative Binomial distribution with mean μiβj and size ϕi.
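This marginal result is straightforward to check by simulation: binomial thinning of NB(μ, ϕ) counts with probability βj should yield counts with mean μβj, variance μβj + (μβj)²/ϕ and the corresponding NB dropout probability. A Python sketch with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, phi, beta, n = 50.0, 3.0, 0.2, 200_000
# True counts: NB(mu, phi) via the Gamma-Poisson mixture representation
lam = rng.gamma(shape=phi, scale=mu / phi, size=n)
x0 = rng.poisson(lam)
x = rng.binomial(x0, beta)                    # binomial capture
# Predicted marginal: NB with mean mu*beta and size phi
pred_mean = mu * beta
pred_var = pred_mean + pred_mean**2 / phi     # NB variance: m + m^2/phi
pred_p0 = (phi / (phi + mu * beta)) ** phi    # NB dropout probability
```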
It can also be shown that the posterior distribution of x0ij is a shifted Negative Binomial distribution. To sample from the posterior distribution, we note that the original count can be expressed as x0ij = xij + ζij, where ζij is the lost count satisfying

$$\zeta_{ij}\mid x_{ij}\sim \mathrm{NB}\left(\phi_i+x_{ij},\;\frac{\phi_i+\beta_j\mu_i}{\phi_i+\mu_i}\right),$$

parameterised by size and success probability. The posterior mean and variance then evaluate to

$$\mathbb{E}\left[x^{0}_{ij}\mid x_{ij}\right]=x_{ij}+\frac{(\phi_i+x_{ij})(1-\beta_j)\mu_i}{\phi_i+\beta_j\mu_i}, \qquad \mathrm{Var}\left[x^{0}_{ij}\mid x_{ij}\right]=\frac{(\phi_i+x_{ij})(1-\beta_j)\mu_i(\phi_i+\mu_i)}{(\phi_i+\beta_j\mu_i)^2}$$
Note that when ϕi is small, the mean of the posterior tends to xij/βj, the global scaling estimate. After estimating the posterior distribution for each gene in each cell, we can either sample a certain number of draws from it (3D array output, see Supplementary Figure S1) or extract the mean or mode of the posterior [8] as a point estimate (2D array output, see Supplementary Figure S1).
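The shifted Negative Binomial structure of the posterior follows from the conjugacy of binomial thinning with the NB prior: the lost count ζij has size ϕi + xij and success probability (ϕi + βjμi)/(ϕi + μi). The Python sketch below (with invented parameter values) draws posterior samples this way and checks the resulting mean against a direct grid evaluation of the posterior:

```python
import numpy as np
from scipy import stats

x, beta, mu, phi = 3, 0.1, 30.0, 2.0
# Direct grid posterior: NB(mu, phi) prior times Binomial(x0, beta) likelihood
x0_grid = np.arange(x, 2000)
post = (stats.nbinom.pmf(x0_grid, phi, phi / (phi + mu))
        * stats.binom.pmf(x, x0_grid, beta))
post /= post.sum()
grid_mean = (x0_grid * post).sum()
# Shifted NB sampling: x0 = x + zeta
p_succ = (phi + beta * mu) / (phi + mu)
zeta = stats.nbinom.rvs(phi + x, p_succ, size=200_000, random_state=0)
sample_mean = (x + zeta).mean()
# Closed-form posterior mean
closed_mean = x + (phi + x) * (1 - beta) * mu / (phi + beta * mu)
```

All three estimates of the posterior mean agree, supporting the shifted-NB sampling shortcut.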
2 Binomial distribution and dropout probability
The binomial model of capture in scRNA-seq predicts the dropout rate for a particular gene i in a given cell j: Pr(xij = 0 | x0ij) = (1 − βj)^x0ij. Across a group of non-homogeneous cells, we may approximate this expression by

$$\Pr\left(x_{ij}=0\right)\approx\left(1-\langle\beta\rangle\right)^{\mu_i}.$$

For small ⟨β⟩, this expression tends to exp(−⟨β⟩μi). In dropout vs mean expression (dropout-mean) plots (Figure 1c, Sup Figures S2c, S3c, S4c, S5c, S6c and S7c), the line “exp(−⟨β⟩μ)” follows the lower limit of the trend. We note that a Poisson model of RNA-seq that is used by several authors also predicts dropout rates to be Pr(x = 0) = λ⁰/0! · exp(−λ) = exp(−λ), where λ = ⟨β⟩μ [9, 2].
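A quick numerical check of this small-⟨β⟩ limit: the binomial prediction (1 − ⟨β⟩)^μ and the Poisson prediction exp(−⟨β⟩μ) agree closely, with the gap shrinking as ⟨β⟩ decreases. The parameter values below are arbitrary illustrations:

```python
import numpy as np

def dropout_binomial(mu, beta):
    """Binomial-model approximation of the dropout probability."""
    return (1 - beta) ** mu

def dropout_poisson(mu, beta):
    """Poisson-model prediction exp(-beta * mu), the small-beta limit."""
    return np.exp(-beta * mu)

gap_large_beta = abs(dropout_binomial(10, 0.10) - dropout_poisson(10, 0.10))
gap_small_beta = abs(dropout_binomial(10, 0.01) - dropout_poisson(10, 0.01))
```

The binomial prediction always lies slightly below the Poisson one, consistent with the exponential line tracing the lower limit of the dropout-mean trend.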
To further show that the Binomial distribution can capture the relationship between dropout rates and mean expression, we simulated data based on real experimental data[1, 10, 11] by adapting simulation protocols proposed in the R package Splatter[2]. The details of the simulation procedure can be found in Supplementary note 1. The resulting dropout-mean plot of data simulated with the Binomial model is very close to that of the real scRNA-seq data for UMI-based protocols. As shown in the Supplementary Figures S2c, S3c, S4c, S5c and S6c, the dropout-mean trend of UMI data is close to the asymptotic line “exp(−⟨β⟩μ)” (Binomial_Splatter and Binomial_bayNorm simulated data perform similarly to each other and to the real experimental data).
3 Estimation of capture efficiencies
Cell specific capture efficiency βj and global scaling factor (sj) are closely related. We can transform scaling factors estimated by different methods (see below) into βj values with the following formula:

$$\beta_j=\frac{s_j}{\langle s\rangle}\,\langle\beta\rangle \qquad (13)$$

where ⟨s⟩ is the mean scaling factor across cells and ⟨β⟩, a scalar, is an estimate of the global mean capture efficiency across all cells, which ranges between 0 and 1.
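Under this reading of the conversion (scaling factors normalised to their mean, then multiplied by ⟨β⟩; the exact form of Eq. 13 is our reconstruction), the transformation is a one-liner:

```python
import numpy as np

def scaling_to_beta(s, mean_beta):
    """Convert global scaling factors s_j into capture efficiencies beta_j,
    normalising so that the betas average to mean_beta (cf. Eq. 13)."""
    s = np.asarray(s, dtype=float)
    return s / s.mean() * mean_beta

beta_j = scaling_to_beta([0.5, 1.0, 1.5, 2.0], mean_beta=0.1)
```

The resulting βj inherit the relative cell-to-cell variation of the scaling factors while averaging to ⟨β⟩.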
There are two different methods for estimating ⟨β⟩ and βj:
If spike-ins or smFISH data are available, they can be used to estimate capture efficiencies. We can either divide the total number of observed spike-ins in each cell by the total number of input spike-ins, or we can fit a linear regression[1] to estimate the cell specific βj. If smFISH data is available, we can fit a linear regression between the mean expression of raw data (response variable) and the mean expression of the smFISH data (explanatory variable). The coefficient of the explanatory variable can be used as ⟨β⟩ [12].
The raw data itself can be directly used for estimation of cell specific global scaling factors (sj). Then equation 13 and an estimate of ⟨β⟩ can be used to estimate βj. There are different methods available for estimation of global scaling factors. Some were developed for bulk RNA-seq data[13, 14] and some are specific to scRNA-seq data[15, 16]. The value of ⟨β⟩ depends on the protocol used and can be batch dependent. For example, for droplet-based protocols, it is about 0.06[1] or 0.12[17]. ⟨β⟩ can also be estimated using spike-ins or smFISH data as explained above.
We finally note that the estimates of capture efficiency discussed above assume cells have similar original transcript content. Therefore, bayNorm outputs estimates of original transcript counts for a typical cell, which are corrected for variation in cell size and transcript content. This is usually desirable for downstream analyses such as DE detection. However, if one is interested in absolute original counts and has additional information such as cell size or total transcript content per cell, the capture efficiencies can be appropriately rescaled for this purpose.
4 Estimation of prior parameters
4.1 Maximisation of marginal distribution
Using an empirical Bayes approach, one can estimate the prior parameters by maximising the marginal likelihood of the observed counts across cells [18]. Let Mi denote the marginal likelihood function for the ith gene across cells. Assuming independence between cells, the log-marginal distribution for the ith gene is

log Mi(μi, ϕi) = Σj log Pr(xij | μi, ϕi, βj),  (14)

where Pr(xij|μi, ϕi, βj) is the Negative Binomial in Eq. (5). Maximising Eq. (14) yields the pair (μi, ϕi).
The above optimisation needs to be done for each of the P genes. For convenience, we refer to the ϕ and/or μ estimated by maximising the marginal distribution as BB estimates, because bayNorm uses the spectral projected gradient method (spg) from the R package named “BB”. Optimising the marginal distribution with respect to both μ and ϕ (2D optimisation) is computationally intensive. If we had a good estimate of μ, we could optimise the marginal distribution with respect to ϕ alone, which would be much more efficient.
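The 1D optimisation over ϕ given a fixed μ can be sketched as follows. This is an illustration in Python using `scipy` rather than the R package “BB” actually used by bayNorm; the function names, bounds, and simulated data are assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import nbinom

def neg_log_marginal(phi, x, mu, beta):
    """Negative log-marginal likelihood of one gene's counts x_j across cells,
    where x_j ~ NB(mean = mu * beta_j, size = phi), cf. Eq. (14)."""
    mean = mu * beta
    p = phi / (phi + mean)  # scipy's success-probability parametrisation
    return -nbinom.logpmf(x, phi, p).sum()

def fit_phi(x, mu, beta):
    """1D optimisation over phi given a fixed estimate of mu
    (cheaper than the 2D optimisation over both mu and phi)."""
    res = minimize_scalar(lambda ph: neg_log_marginal(ph, x, mu, beta),
                          bounds=(1e-3, 1e3), method="bounded")
    return res.x

# Simulated example: 500 cells, known mu and phi, variable capture efficiencies
rng = np.random.default_rng(0)
beta = rng.uniform(0.05, 0.15, size=500)
mu_true, phi_true = 20.0, 2.0
x = rng.negative_binomial(phi_true, phi_true / (phi_true + mu_true * beta))
phi_hat = fit_phi(x, mu_true, beta)
```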
4.2 Method of Moments
A heuristic way of estimating μi and ϕi is through a variant of the Method of Moments. The first step is a simple normalisation of the raw data, to scale expression values given the cell-specific capture efficiencies (βj). The simple normalised count is calculated as follows:

x̃ij = (xij / Tj) × (1/n) Σj′ (Tj′ / βj′), where Tj = Σi xij is the total count of cell j,

i.e. the numerator of the scaling factor of xij is obtained by taking the average of the scaled total counts across cells.
Based on the simple normalised data, we can estimate the prior parameters μ and ϕ of the Negative Binomial distribution using Method of Moments Estimation (MME), which simply equates the theoretical and empirical moments. This estimation method is fast, and simulations suggest that it provides good estimates of μ; the drawback is that the estimates of ϕ show a systematic bias (see Supplementary Figure S24 a-b).
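The simple normalisation and the MME step can be sketched as follows. This is a minimal Python illustration (the package itself is in R); the function names and the toy matrix are assumptions:

```python
import numpy as np

def simple_normalise(X, beta):
    """x~_ij = (x_ij / T_j) * mean over cells of (T_j'/beta_j'),
    where T_j = sum_i x_ij is the total count of cell j."""
    T = X.sum(axis=0)
    return X / T * (T / beta).mean()

def mme_prior(x_norm):
    """Method-of-moments NB estimates for one gene: match mean and variance,
    var = mu + mu^2/phi  =>  phi = mu^2 / (var - mu)."""
    mu = x_norm.mean()
    var = x_norm.var(ddof=1)
    phi = mu**2 / (var - mu) if var > mu else np.inf  # undefined if under-dispersed
    return mu, phi

# Two genes (rows) in two cells (columns) with capture efficiencies 0.1 and 0.2
X = np.array([[1., 2.], [3., 6.]])
X_norm = simple_normalise(X, np.array([0.1, 0.2]))
```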
4.3 The combined method
Based on simulation studies (Supplementary Figure S24), the most robust and efficient estimation of μ and ϕ can be obtained using the following combined approach, which is the default setting in bayNorm:
Based on the simple normalised data, we apply MME to each gene to obtain MME-estimated μ and ϕ.
Although the BB-estimated ϕ is much closer to the true ϕ, many estimates lie at the upper boundary of the search space (Supplementary Figure S24 c-d). We therefore adjust the MME-estimated ϕ by a factor estimated by fitting a linear regression between the MME-estimated ϕ and the BB-estimated ϕ, which works best in our simulations (Supplementary Figure S24 c-d). This adjusted MME-estimated ϕ, together with the MME-estimated μ and the estimates of βj, can then be used to approximate the posterior distribution for each gene in each cell.
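The adjustment step can be sketched as follows. This is an illustration only: the exact regression used by bayNorm (intercept, subset of genes, possible transformations) is not specified here, so regression through the origin is a simplifying assumption:

```python
import numpy as np

def adjust_mme_phi(phi_mme_all, phi_mme_sub, phi_bb_sub):
    """Rescale all MME dispersion estimates by the slope of a regression of
    BB estimates on MME estimates over a subset of genes (through-the-origin
    regression is a simplifying assumption of this sketch)."""
    slope = (phi_mme_sub * phi_bb_sub).sum() / (phi_mme_sub ** 2).sum()
    return slope * phi_mme_all
```

For example, if the BB estimates are consistently twice the MME estimates on the subset, all MME estimates get doubled.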
Cells are grouped together for prior estimation based on cell-specific attributes (Cj). Prior estimation can be done over all cells irrespective of the experimental condition; we refer to this procedure as “global”. Alternatively, if there are multiple groups of cells in the dataset and we have reason to believe that each group could behave differently, we can estimate the prior parameters μ and ϕ within each group (i.e. within groups sharing the same Cj value); we refer to this procedure as “local”. Estimating prior parameters across all cells with the “global” procedure allows potential batch effects to be removed. Normalising multiple groups with the “local” procedure amplifies inter-group differences while mitigating intra-group variability, which is suitable for DE detection.
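The global/local distinction can be sketched as follows. This is a schematic Python illustration in which a per-gene mean stands in for the full (μ, ϕ) estimation; the function name and toy data are assumptions:

```python
import numpy as np

def group_priors(X, groups):
    """'Local' prior estimation: one per-gene prior per group of cells.
    'Global' estimation corresponds to a single group containing all cells.
    (The per-gene mean stands in for the full (mu, phi) estimation.)"""
    groups = np.asarray(groups)
    return {g: X[:, groups == g].mean(axis=1) for g in np.unique(groups)}

X = np.array([[1., 3., 10., 12.],
              [2., 2.,  2.,  2.]])
local = group_priors(X, ["A", "A", "B", "B"])  # one prior per condition
global_ = group_priors(X, ["all"] * 4)         # one prior over all cells
```

Note how gene 1, which differs between the two groups, gets distinct local priors, while gene 2 does not.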
5 Code availability
The R package bayNorm is available at https://github.com/WT215/bayNorm.
The code for producing the figures in the paper is provided at https://github.com/WT215/bayNorm_papercode.
In the Bacher study, the code for running MAST and log fold change calculation was kindly provided by Rhonda Bacher, the author of SCnorm[19].
In the Torre study, the code for transforming counts per million normalized data to UMI data was kindly provided by Mo Huang, the author of SAVER[9].
Acknowledgements
We are grateful to Dan Hebenstreit for critical reading of the manuscript. The benchmark lists used in the Islam and Tung studies were kindly provided by Maria K. Jaakkola and Chengzhong Ye respectively. We would like to thank Rhonda Bacher for providing R code for running MAST and producing figures 3a and 3b. We would like to thank Mo Huang for the code for preprocessing data from the Torre Study. We would also like to thank Lennart Kester for providing smFISH data used in the Grün study. This research was supported by the UK Medical Research Council, and a Leverhulme Research Project Grant (RPG-2014-408). WT is supported by a Roth Scholarship from the Department of Mathematics at Imperial College. PT acknowledges a fellowship from The Royal Commission for the Exhibition of 1851. The authors used the computing resources of the UK Medical Bioinformatics partnership (UK MED-BIO; aggregation, integration, visualisation and analysis of large, complex data), which is supported by the UK Medical Research Council (grant no. MR/L01632X/1) and the Imperial College Research Computing Service (DOI: 10.14469/hpc/2232) for access to their HPC facilities (CX1 cluster).