RT Journal Article SR Electronic T1 Classification of RNA-Seq Data via Gaussian Copulas JF bioRxiv FD Cold Spring Harbor Laboratory SP 116046 DO 10.1101/116046 A1 Qingyang Zhang YR 2017 UL http://biorxiv.org/content/early/2017/03/11/116046.abstract AB RNA-sequencing (RNA-Seq) has become a preferred option to quantify gene expression, because it is more accurate and reliable than microarrays. In RNA-Seq experiments, the expression level of a gene is measured by the count of short reads that are mapped to the gene region. Although some normal-based statistical methods may also be applied to log-transformed read counts, they are not ideal for directly modeling RNA-Seq data. Two discrete distributions, Poisson distribution and negative binomial distribution, have been commonly used in the literature to model RNA-Seq data, where the latter is a natural extension of the former with allowance of overdispersion. Due to the technical difficulty in modeling correlated counts, most existing classifiers based on discrete distributions assume that genes are independent of each other. However, as we show in this paper, the independence assumption may cause non-ignorable bias in estimating the discriminant score, making the classification inaccurate. To this end, we drop the independence assumption and explicitly model the dependence between genes using Gaussian copula. We apply a Bayesian approach to estimate covariance matrix and the overdispersion parameter in negative binomial distribution. Both synthetic data and real data are used to demonstrate the advantages of our model.PLDA: Poisson linear discriminant analysis;NBLDA: Negative binomial linear discriminant analysis;KNN: K-nearest neighbors;PXMH: Parameter-expanded reparameterization and Metropolis-Hasting