ROSeq: A rank based approach to modeling gene expression in single cells

Summary Single cell transcriptomics provides a window into cell-to-cell variability in complex tissues. Modeling single cell expression is challenging due to high noise levels and technical bias. In the past years, considerable efforts have been made to devise suitable parametric models for single cell expression data. We use Discrete Generalized Beta Distribution (DGBD) to model read counts corresponding to a gene as a function of rank. Use of DGBD yields better overall fit across genes compared to the widely used mixture model comprising Poisson and Negative Binomial density functions. Further, we use Wald’s test to probe into differential expression across cell sub-types. We package our implementation as a standalone software called ROSeq. When applied on real data-sets, ROSeq performed competitively compared to the state of the art methods including MAST, SCDE and ROTS. Software Availability The Windows, macOS and Linux - compatible softwares are available for download at https://malaalam.github.io/ROSeq Contact abhianik@gmail.com debarka@iiitd.ac.in Supplementary information Supplementary data is available at Bioinformatics online.


Introduction
In the past few years, single cell RNA-Sequencing (scRNA-seq) has dramatically accelerated characterization of molecular heterogeneity in healthy and diseased tissue samples (Editorial, 2014).The declining cost of library preparation and sequencing have fostered the adaptation of singlecell transcriptomics as a routine assay in studies arising from diverse domains including stem cell research, oncology, and developmental biology.The field of single-cell transcriptomics suffers severely from various data quality issues, mainly due to the lack of starting RNA material.High levels of noise and technical bias pose significant challenges to singlecell gene expression modeling, which in turn hinders arrival at statistically apt conclusions about cell-type specific gene expression patterns.
A number of parametric and nonparametric methods have already been proposed for modeling expression data and finding differentially expressed genes (DEGs).SCDE (Kharchenko et al., 2014), MAST (Finak et al., 2015) and ROTS (Elo et al., 2008) are notable among these.SCDE and MAST model gene expression using well-known probability density functions and mixed models involving some of those.ROTS, on the other hand, uses a modified, data-adaptive t-type statistic to measure the difference in expression levels.We conjectured that considering ranks instead of absolute expression estimates would make a model less susceptible to the noise and the technical bias, as commonly observed in single-cell data.To realize the same, we employed Discrete Generalized Beta Distribution (DGBD) (Martinez-Mekler et al., 2009) to model the distribution of expression ranks (illustrated in the method section) instead of the raw count.The consideration of rank-ordering distribution was inspired by the seminal work by Martinez-Mekler and colleagues, where they demonstrated the universal applicability of the same in linking frequency estimates and their ranks.We developed ROSeq: a Wald-type test to determine differential expression from scRNA-seq data.
k is a scalar with a default value of 0.05, and σ is the standard deviation of the pulled expression estimates across the cell-groups.Each of these bins corresponds to a rank.Therefore, for each group, cell frequency for each bin maps to a rank.These frequencies are normalized group-wise by dividing by the total cell count within a concerned group.ROSeq uses DGBD to model the distribution of ranks as follows.
Here, yr denotes the fraction of cells in each cell-group corresponding to the rank r and N denotes the total number of bins.a and b are the shape parameters.Finally, A is the normalizing constant ensuring that the sum of the normalized frequencies equals one (see Supplementary Material).For each gene, for a given group, ROSeq estimates the values of a and b by maximizing the Log-Likelihood corresponding to probability mass function depicted in Equation 1.
Let us assume that the input cell subpopulations G1 and G2 consist of m and n cells respectively.For any gene gi, ROSeq first estimates the shape-parameter values (a1 = a1, b1 = b1) and (a2 = a2, b2 = b2) respectively.Under the DGBD model, the desired testing for differential gene expressions is equivalent to the test of the null hypothesis H0 : a1 = a2, b1 = b2 against the omnibus alternative.
Further, ROSeq uses the (asymptotically) optimum two-sample Wald test based on the MLE of the parameters and its asymptotic variances given by the inverse of the Fisher information matrix.We can estimate the asymptotic variance matrices for the MLEs ( a1, b1) and ( a2, b2) as V1 = I( a1, b1) −1 and V2 = I( a2, b2) −1 respectively.The Wald test statistic T for testing H0 can be written as follows: where w = n m+n .The test statistic T follows a central Chi-square distribution χ 2 2 with two degrees of freedom.A gene is considered differentially expressed (rejection of H0) at 95% level of significance, if the observed value of the test statistic T exceeds the 95% quantile of the χ 2 2 distribution.

Applications on Real Datasets
For comparative benchmarking, we used two scRNA-seq datasets from past studies that also performed bulk RNA-seq on the same samples.DEGs called on bulk RNA-seq data were used as the benchmark for single cell differential expression analyses.ROSeq performed competitively as compared to the existing best practice methods including SCDE, MAST and ROTS.
In the first dataset (obtained from Tung et al., 2017), read count data corresponding to 288 single cells each was available from the three human induced pluripotent stem cell lines (NA19098, NA19101 and NA19239).(For computing the results using the SCDE method, 96 single cells were selected randomly without replacement from each subpopulation -this was done in order to stay within the memory limits of a typical personal workstation).The second data-set is from Trapnell et al., 2014.RNA was extracted from primary human myoblasts before and after differentiation (77 and 79 cells respectively) and sequenced in order to investigate for differential expression between these two classes.
Bulk RNA-seq data corresponding to these classes was present in the form of three replicates each, for both the datasets.Differential expression was performed on this read count data in a pair-wise fashion using DESeq (see Anders and Huber, 2010), in order to establish the ground truth : genes with an adjusted p-value less than 0.05 and an absolute log 2 fold change greater than two were considered to be differentially expressed.Similarly, genes with an adjusted p-value greater than 0.1 were NOT considered to be differentially expressed.

Conclusions
ROSeq produces the largest area under the curve in the comparisons corresponding to the Tung data set (see Figure 1 (b) for the differential expression analysis between individuals NA19098 and NA19101) and is the second-best for the Trapnell data set (see Supplementary Material), which indicates that it performs competitively when compared to other methods.ROSeq also provides a good fit to the actual read count data, as is seen in Figure 1(c) where the coefficient of determination R 2 resulting from fitting a DGBD is high and always above 0.7 in magnitude.
In conclusion, we have developed a software ROSeq that addresses the problem of determining differential expression between two subpopulations.We use a novel approach of employing a DGBD to model read counts corresponding to a gene as a function of ranks.Since read count data arising from any source could be ranked, the usage of ROSeq is in principle extensible to other assays.Parameter tuning for actively adjusting the bin size, procedures for eliminating outliers and dealing with technical dropout noise shall be investigated in the subsequent works.Technical noise includes non-linear biases which creep in due to the million-fold amplification of the genetic material, and the tendency of the sequencer to miss the read counts for certain genes in a cell-library, because the quantity is lower than a certain observable threshold (this phenomenon is referred to as drop-outs).
Different scRNA-seq methods proceed differently, with modeling the read counts corresponding to a gene in a sub-population of cells and handling the above mentioned sources of noise.For example, the SCDE package developed by Kharchenko et al., 2014  In order to transform the DGBD into a probability mass function, several bins are constructed between the minimum and the maximum value of the read counts corresponding to a gene.Each bin has a width of k × σ, where k = 0.05 and σ is the standard deviation of the available read count data for each gene.The best fitting parameters a = a 0 and b = b 0 are found for each sub-population corresponding to a gene by maximizing the Log Likelihood expression as mentioned in Equation 2. Subsequently, in order to evaluate for differential expression between the two sub-populations of single cells, the two-sample Wald Test shall be used to obtain a statistic that varies as a χ 2 2 distribution.
In the next section, mathematical details of the proposed methodology which form the base for determining the statistic shall be highlighted.Subsequently, the performance of the rank-based method, compiled and packaged as a [Windows, macOS, Linux] compatible software called ROSeq (short for 'Rank-Ordered Distributions for scRNA-Seq Read Counts') shall be compared with some scRNA-Seq techniques commonly used for evaluating differential expression, namely SCDE (Kharchenko et al., 2014 andRitchie et al., 2010), MAST (Finak et al., 2015), ROTS (Elo et al., 2008)

Extended Theory
The Discrete Generalized Beta Distribution (DGBD) expresses the normalized frequency y r or the probability mass function for each bin as a function of the rank r of the bin using two parameters a and b.Let N be the total number of bins for a given gene and sub-population.Then the DGBD formulation is expressed as: where A is the normalizing constant ensuring that the sum of the normalized frequencies equals one and is given by: For a given data set, we need to properly estimate the parameters a and b so that the fitted DGBD with these estimated parameters are closest to the true data (in a suitable probabilistic sense).
The best-fitting parameters a = a 0 and b = b 0 are determined by maximizing the Log-Likelihood corresponding to the model given by Equation 1.The log-likelihood function, logL, for the DGBD expression is specified in the equation 2.
The resulting estimates ( a 0 , b 0 ) correspond to the DGBD under which the observed data is most likely to be generated and is commonly known as the maximum likelihood estimator (MLE).They are most efficient (least standard error) and enjoys several optimum properties in a large sample (Casella and Berger, 2002).
Next, let us consider the two sub-populations 1 & 2 with respective number of bins m and n and the problem of testing if the genes in these two sub-populations are differentially expressed in a desired statistical significance level.Let the DGBD parameters corresponding to these sub-populations are denoted by (a 1 , b 1 ) and (a 2 , b 2 ), respectively, and their MLEs based on the available data are given by ( a 1 , b 1 ) and ( a 2 , b 2 ).Note that, under the DGBD model, the desired testing for differential gene expressions is equivalent to the test for the null hypothesis H 0 : a 1 = a 2 , b 1 = b 2 against the omnibus alternative.
For this purpose, here we will use the (asymptotically) optimum two-sample Wald test based on the MLE of the parameters and its asymptotic variances given by the inverse of the Fisher information  2, the estimated Fisher information matrix can be obtained as I ( a 0 , b 0 ), where Note that, for our DGBD model, we have: So in order to evaluate the above mentioned double derivatives, the first order derivative ∂log A ∂a and ∂log A ∂b are determined as follows: Re-writing Equation 4in a more succinct form in the Equation 5below, we get: where, Evaluating the partial derivatives of u 1 , v 1 , u 2 and v 2 with respect to a and b, in the Equation 6: Now, using above formulas, we can easily derive the estimated Fisher Information Matrix for both the sub-populations and hence obtain the asymptotic variance matrices of the MLEs ( a 1 , b 1 ) and ( a 2 , b 2 ) as given by , respectively.Then, the two-sample Wald test statistic for testing H 0 is given by: Prior to analyzing the read count data for differential expression, filtering was performed by eliminating any genes with five or less non-zero read counts.This filtering procedure is similar to Hemberg, 2017's filtering procedure, specified on the online scRNA-Seq course.Median Normalization was implemented next for use by methods such as Wilcoxon Rank-Sum and ROSeq (other packages such as MAST, ROTS and SCDE have an inbuilt normalization step in their implementation), in order to equalize coverage across the two sub-populations or classes of cells under investigation.

Tung Data-set
A receiver operating characteristic curve (ROC) for the three pairwise comparisons is plotted in Figures 1a, 1b

Installation
ROSeq can be installed as a stand-alone application on a [Windows, macOS or Linux] Operating System.The structure of the available files for installation looks as in Figure 4.The following steps are needed to ensure a successful setup.
• Double click on the icon, post the installation process (an icon is placed on the desktop by default).
There are three panels in the software -the Notes panel for making notes about the current session, Status panel which updates the user if the action went through successfully and the Pre-Processing panel which lets the user upload read count data matrix for the desired analysis of differential expression.
If the radio button for Normalized Read Count data file upload is selected, then the path to the respective file needs to be indicated by clicking on the Open File button.In this matrix of read counts, each row is indicative of read counts corresponding to one gene while each column is representative of read counts sequenced from one single cell.The uiimport function is used to read the number of genes, but that number is changeable in case the user wishes to investigate only a subset of genes.
One of the requirements for the read count matrix is that the two groups of single cells are contiguous.A user input which is desired is the starting and ending column index for each group/subpopulation.
In case a raw read count matrix is uploaded by the user, the minimum number of non-zero read counts for available gene is specified by the user, as a means to filter out low quality genes.Also, median normalization is used to equalize coverage across all the available single cells.

Fig. 1 :
Fig.1: (a) An explanatory schematic to ROSeq: for each gene, single cells are assigned to bins (represented as squares in the figure) based on the magnitude of the corresponding, normalized read counts.These bins are subsequently ranked with respect to the number of cells allotted, with rank one being assigned to the bin comprising of the most number of cells, and so on.Eventually, a rank distribution is made for both groups and compared -using the Wald test -to determine if the gene is differentially expressed.(b) ROC curve for evaluation of differential expression between individuals NA19098 and NA19101 -ROSeq performs the best in terms of area under the curve (c) Histogram showing the coefficient of determination R 2 from modeling single cell read count data using a Rank-Ordered Distribution on Subpopulation One (NA19098) -Most genes are modeled with a fit higher than 0.7 (d) An example of the model fit on a gene, randomly picked from Subpopulation One(NA 19098)

Figure 1 :Figure 3 :
Figure 1: (a) ROC curve for evaluation of differential expression between cells from individuals NA19098 and NA19101 (b) ROC curve for evaluation of differential expression between cells from individuals NA19101 and NA19239 (c) ROC curve for evaluation of differential expression between cells from individuals NA19239 and NA19098 (d) ROC curve for evaluation of differential expression between primary human myoblasts before and (24 hours) after differentiation

Figure 4 :
Figure 4: Structure of the ROSeq Directory Svensson et al., 2018ncing (scRNA-seq) has been widely popular since the year 2013(Nature  Methods Editorial, 2014).The dropping cost of sequencing, the arrival of new protocols and the possibility to survey the diversity of cell types has aided this phenomenon and has led to a greater number of cells being sequenced in experiments over the years (For example,Svensson et al., 2018have shown an exponential increase in the cell numbers reported in publications over time).Due to a higher variability in scRNA-Seq data (see the work ofKharchenko et al., 2014), methods built around bulk cell data are insufficiently able to handle single cell read count data and newer methods are needed for its analysis.
Miao and Zhang, 2017)f single cell read count data include the presence of biological and technical noise.Since the starting quantity of mRNA prior to amplification is less, the noise attains a comparable value to the actual gene expression levels, and should not be ignored.As a result, many of the methods (for example,Miao and Zhang, 2017)have focused on modeling the read counts resulting from successful amplification and from noise, individually.Biological noise arises from the stochastic nature of gene expression.Munsky et al., 2012 suggested that genetically identical cells in identical environments display variable phenotypes.Others such as Zenklusen et al., 2008 have shown that expression levels vary substantially among cells.Zopf et al., 2013 argue that a substantial portion of the stochastic variability observed in single cell gene expression experiments may be caused by global changes in transcription due to cell cycling.
models a single cell as a probabilistic mixture of a negative binomial to capture successful amplification and a low magnitude poisson distribution to represent the technical dropouts and transcriptionally silent genes.In contrast to SCDE, the MAST model developed byFinak et al.,2015 fits a hurdle model across the read counts for a given sub population and solves for the Cellular Detection Rate (CDR), which acts as a proxy for factors (drop outs, cell volume) that influence gene expression.Mekler et al., 2009 have shown the ability of DGBDs to provide a two parameter (a and b) functional form for rank-ordered distributions, which gives good fits to data from life sciences among other fields of application, and surpasses other two-parameter models.
In this work, we use a Discrete Generalized Beta Distribution (DGBD) in order to provide a nonlinear, polynomial fit across the read counts for a given sub population.PreviouslyZipf, 1949 hasshown the existence of a linear relationship between the logarithm of frequency and the logarithm of rank, for the occurrence of words in texts.DGBD is an extension of the original, one parameter Zipf's , and Wilcoxon Rank-Sum.
For the log-likelihood function of the DGBD model given in Equation review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made available under