Abstract
RNA-sequencing (RNA-Seq) has become a preferred option to quantify gene expression, because it is more accurate and reliable than microarrays. In RNA-Seq experiments, the expression level of a gene is measured by the count of short reads that are mapped to the gene region. Although some normal-based statistical methods may also be applied to log-transformed read counts, they are not ideal for directly modeling RNA-Seq data. Two discrete distributions, Poisson distribution and negative binomial distribution, have been commonly used in the literature to model RNA-Seq data, where the latter is a natural extension of the former with allowance of overdispersion. Due to the technical difficulty in modeling correlated counts, most existing classifiers based on discrete distributions assume that genes are independent of each other. However, as we show in this paper, the independence assumption may cause non-ignorable bias in estimating the discriminant score, making the classification inaccurate. To this end, we drop the independence assumption and explicitly model the dependence between genes using Gaussian copula. We apply a Bayesian approach to estimate covariance matrix and the overdispersion parameter in negative binomial distribution. Both synthetic data and real data are used to demonstrate the advantages of our model.
1 Introduction
RNA-sequencing (RNA-Seq) is a revolutionary tool for the study of transcriptomes (Mardis [2008]; Wang et al. [2009]). Compared to hybridization-based microarrays, RNA-Seq eliminates the need for species-specific sequence information and provides more reliable measurements for gene expression (Marioni et al. [2008]). The huge number of reads produced by RNA-Seq experiment enables researchers to better detect novel transcripts and quantify the gene expression in ultra-high resolution. Essentially, RNA-Seq consists of three distinct phases: (1) RNA is isolated from tissue and segmented to an average length of 200 base pairs; (2) RNA segments are reverse transcribed to cDNAs; (3) The cDNAs are mapped to reference transcriptome or genome. An RNA-Seq experiment usually produces tens of millions of short reads between 25 and 300 base pairs in length. The number of reads mapped to each transcript provides a digital measure of transcript abundance.
Poisson distribution and negative binomial distribution are commonly used distributions to model RNA-Seq data. Based on these two distributions, numerous methods have been proposed to detect the differentially expressed genes, including but not limited to edgeR (Robinson & Smyth [2008]), DESeq2 (Love et al. [2014]), baySeq (Hardcastle & Kelly [2014]), BBSeq (Zhou et al. [2011]), and SAMseq (Li & Tibshirani [2013]). Despite the significant advances in differential expression analysis, the progress on classification of RNA-Seq data is relatively recent. Witten (Witten [2011]) developed a Poisson linear discriminant analysis (PLDA) by assuming that the data follow a Poisson distribution. However, in the presence of overdispersion, the Poisson assumption might not be appropriate. Dong et al. (Dong et al. [2016]) further extended the Poisson classifier to a negative binomial classifier (NBLDA), and explored how the dispersion affects the classifications. Other classifiers developed for RNA-Seq data include logistic regression model and partial least square method (Tan et al. [2014]). Due to the difficulty of modeling correlated counts, most existing classifiers assume that all genes are independent of each other. However, as pointed out by Dong et al. (Dong et al. [2016]), this assumption is very restrictive and may not be realistic in practice. The objective of the paper is to numerically assess the effect of independence assumption on classification of RNA-Seq data, and to develop a new classifier incorporating the dependence between genes using continuous latent variables and Gaussian copula. A Metropolis-Hasting algorithm in combination with Gibbs sampler (Lee [2014], Liu & Daniels [2006]) is adopted to estimate the covariance matrix and overdispersion parameters in our model. Our new classifier explicitly models two important aspects of RNA-Seq data: overdispersion of read counts and correlation between genes, therefore provides accurate parameter estimate and sample classification.
Copula is an important tool in modeling the dependence between random variables of any type. It is especially useful for modeling multiple discrete variables whose joint distribution can be extremely complicated. To begin with, we provide a short review of copula function and Gaussian copula. Consider a vector of random variables (X1, X2,…, Xp), the copula function of (X1, X2,…, Xp), C: [0,1]p → [0,1], is defined as the cumulative distribution function (cdf) of (F(X1),F(X2),…,F(Xp)):
By definition, a copula function is a multivariate distribution function where the marginal of each random variable is uniform. Sklar’s Theorem guarantees that any multivariate distribution can be expressed with univariate marginals and a copula function which links the marginals. In practice, we can completely separate the choice of marginals and the choice of copula. Popular copulas include, but not limited to Gaussian copula, Student’s t copula, Clayton’s copula and Frank’s copula (Nelson [1999]). Clayton’s copula and Frank’s copula both belong to the bivariate Archimedean copula family. For more than two dimensions, Gaussian copula is convenient to model the complex correlation structure (both positive and negative correlations). The Gaussian copula is based on multivariate normal distribution: where Φ represents the cdf of standard normal distribution, Φp(…|Ω) represents the cdf of p-dimension normal distribution with correlation matrix Ω. To connect discrete marginals and continuous copula function, we introduce a latent variable and treat the observed count as the discretized value of the continuous latent variable (in spirit, it is same as multivariate Probit model).
The remainder of this paper is structured as follows. In Section 2, we formally describe the statistical framework for classification of RNA-Seq data and introduce a Bayesian approach to estimate unknown parameters. Numerical studies are conducted to compare different models and classifiers in Section 3. In Section 4, we apply the proposed method to two real data sets including the cervical cancer data and HapMap data. We discuss and conclude this paper in Sections 5 and 6.
2 Methods
In this section, we propose a new classifier for RNA-Seq data based on copula function. We assume that the data follow a complex multivariate distribution with negative binomial marginals. The correlation between genes can be described by a Gaussian copula. A general Bayesian framework for estimating parameters in each class is discussed.
2.1 Negative binomial distributions for marginal model
First, we consider only one class. Let x = (x1…,xn)T be the n × p data matrix, where xij denotes the observed number of reads mapped to gene i in sample j, i = 1,2,…, p and j = 1,2,…,n. We consider the following negative binomial distribution Fij, i = 1,2,…,p, j = 1,…,n for marginals (Dong et al. [2016]): where μij = E (Xij), sj is the size factor to scale read counts for the jth sample due to different sequencing depth, λi is the total number of reads for gene i, and δi is the overdispersion parameter for gene i, i.e., V(Xij) = μij + μij2δi. The estimates of λi and sj in (1) are straightforward:
For overdispersion parameter δi the moment estimate and shrinkage estimate (Yu et al. [2013]) are commonly used estimates. However, both methods suffer from instability, e.g., the moment method sometimes gives a negative value. In this paper, we treat δi as unknown parameter, which is to be estimated jointly with other parameters in a Bayesian framework.
2.2 Gaussian copula for dependence between genes
In general, the analysis of correlated counts might be difficult because of the lack of suitable discrete multivariate distribution that can model complex correlation structures. To surmount this difficulty, we model the correlation via Gaussian copula, so that the correlation between read counts can be created through the correlation of the continuous latent variables. Let Zj = (Z1j,…,Zpj)T be the Gaussian latent variables (with unit variance) of Xj, and Zj ~ Np(0, Ω), where Ω represents the covariance or correlation matrix. The observed counts xj are the discretized values of zj by quantile matching. The relation between Xij and Zij can be interpreted as follows: where Fij() represents the cumulative distribution function of variable Xij, xij takes nonnegative integer values 0,1,2,… and Fij(–1 |δi) = 0 by definition of cdf.
The Gaussian copula of latent variables has the following simple form:
Based on (2) and (3), the likelihood function can be obtained immediately: where δ = (δ1,…, δp), δp denotes the multivariate normal density of dimension p, with mean vector 0 and unit marginal variance. The endpoints defining integration region Rij = (Lij, Uij] are specified as a function of the parameter δi
2.3 Bayesian estimation of parameters
Finding the maximizer of (4) is impractical due to the complexity and non-convexity of the likelihood function. Here, we consider the Bayesian approach proposed by Lee (Lee [2014]; Liu & Daniels [2006]) for parameter estimate. The posterior distribution of the parameters and latent variables can be written as follow: where the priors are specified as f(δi) = IG(α0, β0), and f(Ω) = IWp(Ψ0, ν0). We use IW( α0, β0) to denote the inverse-gamma distribution with shape parameter α0 and rate parameter β0, and use IW(Ψ0, ν0) to denote the inverse-Wishart distribution with scale matrix Ψ0 and degree of freedom Ψ0. The Gibbs sampling can be implemented based on the following conditional distributions: where z–i,j = (z1j,…,z(i–1)j,z(i+1)j,…,zpi),z–i, = {z–i,j, j=1,…,n}, δ–i=(δ1,…, δi–1, δi+1,…, δp), z = (z1,…,zn)T. Conditioning on {z–i,j, δi, Ω }, latent variable Zij follows a truncated normal distribution, where the mean and variance depend on {z–i,j, Ω}, and the endpoints depend on {xij, δi}. We use a routine Metropolis-Hasting algorithm to sample δi and use a parameter-expanded reparameterization and Metropolis-Hasting algorithm (PXMH, Lee [2014]; Liu & Daniels [2006]) to sample Ω. Details about these algorithms are provided in the Appendix.
2.4 Classification
We consider the classification problem when the RNA-Seq were conducted over multiple classes, i.e., K ≥ 2. Let yj ∈ {1,2,…,K} denotes the class label of sample j, i.e., yj = k ⇔ j ∈ Ck. The marginal distribution of Xij in class k can be formulated in a similar way to (1): where dik and δik are gene- and class-specific parameters among the K classes. The overdispersion parameter δik can be estimated in the PXMH algorithm and dik can be estimated by
Based on the trained models for all the K classes, the class label for new observation x* can be predicted. By Bayes’ rule: where fk is the probability density function for class k. The prior probability πk can be estimated by . We assign the new observation x* to class k that maximizes the following discriminant score (posterior probability):
3 Simulation study
We conducted two simulation studies with K = 2 to benchmark our new classifier. In Simulation I, we evaluated the performance of the Bayesian approach in estimating the discriminant score P(y* = 1|x*) under different correlation settings. In Simulation II, we compared the performance of six classifiers under different settings of dispersion and correlation strength.
3.1 Simulation I
Given the correlation matrices Ω, k = 1,2, we generated data in two steps:
Step 1: Simulate latent variables zjk ∼ Np(0,Ωk), j = 1,…,nk,k = 1,2, where Ω is the correlation matrix or covariance matrix, n1 = n2 = 50 and p = 50
Step 2: Transform zjk to xjk, j = 1,…,n using (2), where μi1 = 20, μi2 = μi1 + Δi, Δi ∼ Unif (–15,15) and δi1 = δi2 ∼ Unif (1,10) for i = 1,…, p
We compared the copula-based model with Dong et al.'s independence model under two settings of autoregressive correlation structure Ω1(i, j) = Ω2(i, j) = exp(–a|i – j|): (1)a = 0.5; (2)a = 1.5. For each class, we trained the model using half of the samples (25 samples randomly chosen in each class) and then calculated the discriminant score P(y* = 1|x*) for each of the rest samples. For both models, we estimated μik by . For independence model, we estimated the overdispersion parameter δik using R package sSeq, which implements the shrinkage method by Yu et al. The score P(y* = 1 |x*) was then estimated under independence assumption. For the copula-based model, we jointly estimated δik and Ωik using PXMH algorithm with the following priors: δik ∼ IG(0.5,0.5) and Σ ∼ IW(5,I50). A chain of 15,000 iterations was generated and the last 10,000 samples were kept for calculating the posterior mean.
Figures 1 shows the estimation bias, i.e., P̂(y* = 1 |x*) – P(y* = 1 |x*), by two models under two settings. In both settings, the copula-based method shows its superiority over the independence model, and the improvement is more significant in the presence of stronger correlation. Since moderate and strong coexpression between genes were commonly seen in real data, the ignorance of such information may lead to lower prediction accuracy.
3.2 Simulation II
In the second simulation, we compared our copula-based classifier with the other five classifiers including Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), k-nearest neighbors (KNN), partial least square method (PLS) and logistic regression method. We implemented PLDA using R package PoiClaClu and implemented NBLDA using the R source code provided by the Dong et al. (https://github.com/yangchadam/NBLDA). In NBLDA, the overdispersion parameters were estimated by R package sSeq. For KNN, we chose parameter K = 1,3,5. To implement PLS, we used the function ’spls()’ provided in R package spls. For our new classifier, same priors were used as in Simulation I. A chain of 15,000 iterations was generated and the last 10,000 samples are retained for estimation.
The data are generated under four different settings (for all settings, μi1 = 20, μi2 = μi1 + Δi, Δi ∼ Unif (–15,15)):
Setting 1 (weaker correlation, smaller dispersion): Ωk(i, j) = exp(–1.5|i – j |), δik ∼ Unif(1,5), k = 1,2
Setting 2 (weaker correlation, larger dispersion): Ωk(i, j) = exp(–1.5|i – j |), δik ∼ Unif(5,20), k = 1,2
Setting 3 (stronger correlation, smaller dispersion): Ωk(i, j) = exp(–0.5|i – j |), δik ∼ Unif(1,5), k = 1,2
Setting 4 (stronger correlation, larger dispersion): Ωk(i, j) = exp(–0.5|i – j |), δik ∼ Unif(5,20), k = 1,2
The comparison results under different sample sizes (n = 10,30,60,100) are shown in Figures 2 and 3. Due to the independence assumption, other five classifiers including PLDA and NBLDA failed to model the dependence structure between genes, therefore the estimated probabilities were biased (see Simulation I). It is observed that the copula-based model performs consistently better than the other classifiers in terms of classification accuracy, especially in setting 4 with stronger correlation and larger dispersion. When the correlation between genes are very weak, our model has similar performance with Dong et al.’s independence model.
4 Real data analysis
In this experiment, we considered two real data sets including the cervical cancer data (Witten et al. [2010], available in Gene Expression Omnibus (GEO) with access number GSE20592) and the HapMap data (Montgomery et al. [2010]; Pickrell et al. [2010], available at ftp://ftp.ncbi.nlm.nih.gov/hapmap). The cervical cancer data contains 58 samples (29 tumor samples and 29 normal controls), and 714 microRNAs which were differentially expressed in cancer group and normal group. The HapMap data contains 129 samples (60 CEU samples and 69 YRI samples) and a total number of 52,580 genes. For both data sets, we removed genes with less than 10 reads across all samples. Three different classifiers including Poisson classifier, negative binomial classifier and the copula-based classifier were compared in terms of the misclassification rate. In our classifier, same priors were used as in the simulation studies.
We noted that the real data sets often contain large portion of irrelevant and redundant genes. A gene screening could greatly reduce the computing time and improve the classification accuracy. We conducted gene selections using R package edgeR (available in Bioconductor, www.bioconductor.org), as suggested by Dong et al. The algorithm implemented in edgeR is based on negative binomial model and takes overdispersion into account, therefore it is suitable for our problem. This method first estimates the overdispersion parameter for each gene by maximizing the combination of gene-specific conditional likelihood and the overall conditional likelihood, and then constructs an exact test using negative binomial distribution.
For Cervical cancer data, 40 samples were randomly assigned to the training set and the rest 18 samples were assigned to the testing set. A total of 20, 50, 100, 300 genes were selected, respectively. For HapMap data, the samples were randomly split into training set and testing set, with 70 samples and 59 samples, respectively. A total of 50, 100, 300, 500 genes were selected, respectively. Three different classifiers were then trained by the training data and applied to the testing data. The whole procedure were repeated for 100 times and the average misclassification rate were recorded.
The comparison results are shown in Figure 4. For both data sets, the copula-based model is more accurate than the other two classifiers. Figure 5 displays the distribution of correlation coefficients between every pair of genes in the cervical cancer data (log-transformed). It can be seen that the vast majority of the correlations are positive, and half of them are fairly strong (above 0.5), indicating that the independence assumption in PLDA and NBLDA is violated.
5 Discussion
In this paper, we have proposed a new classifier for RNA-Seq data. Different from other classifiers, it incorporates the dependence between genes in the supervised classification problem. To the best of our knowledge, this is the first work that applies copula model to the classification of count data. Numerical comparisons show that our new model achieves better estimate of discriminant scores than existing methods, therefore results in more accurate sample classification. The improvement is more significant in the presence of stronger correlation between genes. In addition to RNA-Seq data, this classifier can be generally applied to other digital gene expression data to improve the classification accuracy.
The copula-based classifier introduced in this paper assumes that the reads count of each gene follows a negative binomial distribution. This assumption has been widely used in practice since the negative binomial distribution is flexible to model and quantify the overdispersion in RNA-seq data. However, the negative binomial assumption still might be violated in some real data sets. In the copula-based method, one could choose alternative marginal models (e.g., Poisson mixture model) which better fits the data, and the Bayesian estimation introduced in this paper still can be applied to estimate the covariance matrix since the estimation of copula only depends on the cumulative distribution functions of the marginals. It is also noteworthy that the current PXMH algorithm for parameter estimation is time-consuming when the number of selected genes is large. For example, in the analysis of HapMap data, a single run (out of 100 runs in total) with 500 selected genes takes about 2.5-3 minutes with C++ implementation. In the future study, we would like to explore other optimizations such as EM-type method for computational efficiency. For example, rather than joint estimating {δ, Ω} in Bayesian framework, we may first estimate δ using stable method based on read counts, and then estimate Ω by EM update where latent variables Z can be treated as missing values. The EM-type method may greatly reduce the computing time. To model the correlation between genes, we use Gaussian copula as it is convenient for multivariate problem. Another possible future work is to compare different latent variables and copula functions, e.g., Student’s t copula (Nelson [1999]) and Gaussian mixture copula (Zhang & Shi [2016]), in a model comparison framework (Mai & Zhang [2016], Zhang et al. [2014],
6 Conclusion
RNA-sequencing experiment quantify gene expression by the count of short reads mapped to the gene region. When biological replicates are available, the negative binomial distribution allowing overdispersion is better suited for modeling RNA-Seq data than Poisson distribution. Recently, Dong et al. (Dong et al. [2016]) developed a classifier based on negative binomial distribution, which outperforms previous methods including the Poisson classifier and K-nearest neighbors classifier. However, due to the difficulty of modeling the dependence in discrete data, most existing classifiers assume that all the genes are independent of each other. In this paper, we systematically investigate the effect of independence assumption on discriminant score calculation and classification. In addition, we developed a copula-based classifier for RNA-Seq data that incorporates the dependence structure between genes, while maintaining the negative binomial marginals. Our numerical comparisons and real data analysis demonstrate that the new classifier performs better than existing methods including Dong et al.’s negative binomial classifier.
Competing Interests
The author has declared that no competing interests exist.
Acknowledgement
Support has been provided in part by the Arkansas Biosciences Institute, the major research component of the Arkansas Tobacco Settlement Proceeds Act of 2000.
Appendix
Sampling zij conditioning on {z–i,j, δi, Ω}
We sample zij from univariate normal distribution ϕ(Zij|μij,ωij2) truncated between Lij· and Uij: where Ω i,–i = {Ωhl, h = i, l ≠ i}, Ω−i,i = {Ωhl,h ≠ i, l = i}, Ω−i,−i = {Ωhl, h ≠ i, l ≠ i} and z−i,j = (z1 j, …, z(i−1)j, z(i+1)j, …, zpj).
Sampling δi conditioning on {z–i,·, δ–i, Ω}
We sample δi based on the following density function:
Where .
Suppose δi is the current value and δi* is the generated value from the proposal distribution. The Metropolis-Hasting probability of moving from δi to δi* is: where the proposal fT(δi|δ̂5,ν) is a t distribution which dominates the normal tails, δ̂i and ν represents location and degree of freedom, respectively.
Sampling Ω conditioning on {z, δ}
Sampling of Ω can be problematic due to the constraint Ωii= 1,i = 1,…,p. Here we sample Ω using parameter-expanded reparameterization and Metropolis-Hasting algorithm (PXMH, Lee [2014]; Liu & Daniels [2006]). The conditional density can be written as follows:
The PXMH algorithm first simulates a covariance matrix Σ and then transforms it to a correlation matrix Ω. For convenience, define , then Ω = D−1ΣD−1. Since Ω has p fewer parameters than Σ, an additional constraint is imposed:
The PXMH Algorithm consists of three steps:
PX step: Sample Σ ∼ IWp(ν,Ψ), where ν = ν0 + n and
MH step: Move to the new value Σ* with probability:
Transform covariance matrix Σ to correlation matrix Ω, Ω* = D*−1 Σ* D*−1
Abbreviations
- PLDA
- : Poisson linear discriminant analysis;
- NBLDA
- : Negative binomial linear discriminant analysis;
- KNN
- : K-nearest neighbors;
- PXMH
- : Parameter-expanded reparameterization and Metropolis-Hasting