Abstract
Supervised machine learning methods have been increasingly used in biomedical research and in clinical practice. In transcriptomic applications, RNA-seq data have become dominant and have gradually replaced traditional microarrays owing to reduced background noise and increased digital precision. Most existing machine learning methods, however, are designed for the continuous intensities of microarrays and are not suitable for RNA-seq count data. In this paper, we develop a negative binomial model in a generalized linear model framework with double regularization for gene and covariate sparsity to accommodate three key elements: adequate modeling of count data with overdispersion, gene selection, and adjustment for covariate effects. The proposed method is evaluated in simulations and in two real applications, using cervical tumor miRNA-seq data and schizophrenia post-mortem brain tissue RNA-seq data, to demonstrate its superior performance in prediction accuracy and feature selection.
In the past two decades, microarray and RNA sequencing (RNA-seq) have been routine procedures for studying the transcriptomes of organisms in modern biomedical studies. In recent years, RNA-seq [5, 20] has become a popular experimental approach for generating a comprehensive catalog of protein-coding genes and non-coding RNAs [13], and it has largely replaced microarray technology due to its low background noise and increased precision. The most important difference between RNA-seq and microarray technology is that RNA-seq outputs millions of sequencing reads rather than the continuous fluorescent intensities of microarray data. Unlike microarrays, RNA-seq can detect novel transcripts, gene fusions, single nucleotide variants, and indels (insertions/deletions). It can also detect a higher percentage of differentially expressed genes than microarrays, especially for genes with low expression [24].
In machine learning, classification methods are used to construct a prediction model from a training dataset with known class labels so that future independent samples can be classified with high accuracy. For example, labels in clinical research can be case/control status, disease subtypes, drug response or prognostic outcome. Many popular machine learning methods have been widely applied to microarray studies, such as linear discriminant analysis [9], support vector machines [3] and random forest [7]. However, given the discrete nature of RNA-seq data, many powerful tools for microarrays that assume continuous input or Gaussian distributions may be inappropriate. A common practice is to transform RNA-seq data into continuous values such as FPKM or TPM [6], possibly followed by an additional log-transformation. However, such transformation can lead to loss of information from the original data [14, 18], producing less accurate inference; in particular, the transformation often produces a greater loss of information for genes with lower counts [15]. To accommodate the discrete data in RNA-seq, the Poisson distribution and the negative binomial distribution are two common distributions expected to better fit the data generation process and data characteristics. Witten [22] proposed a sparse Poisson linear discriminant analysis (sPLDA) based on a Poisson assumption for the count data. However, the Poisson distribution assumes equal mean and variance, which is often not true: in real RNA-seq data the variance is often larger than the mean, leading to the need for an overdispersion parameter. Witten [22] addressed this problem by proposing a power transformation of the data to eliminate overdispersion. However, as we will see later, the power transformation performs well when the overdispersion is small but poorly when the overdispersion becomes large. Hence, direct modeling with a negative binomial rather than a Poisson distribution is more appropriate.
To this end, Dong et al. [8] proposed negative binomial linear discriminant analysis (denoted NBLDAPE) by adding a dispersion parameter. They, however, borrowed the point estimates from sPLDA [22] and did not pursue principled inference such as maximum likelihood, consequently producing worse performance than the method we propose below.
Since the number of genes is often much larger than the number of samples in transcriptomic studies (a standard “small-n-large-p” problem), feature selection is critical for better prediction accuracy and model interpretation. Witten [22] proposed a somewhat ad hoc soft-thresholding operator, similar to the univariate lasso estimator in regression, for gene selection in sPLDA, but that approach is not applicable to the NBLDAPE model due to the added dispersion parameter. In the NBLDAPE model proposed by [8], feature selection was not discussed, except that the “edgeR” package was used to reduce the number of genes in the input data; such two-step filtering is well known to perform worse than methods with embedded feature selection. In fact, Zararsiz et al. [23] compared sPLDA and NBLDAPE and showed that the power-transformed sPLDA generally performed better than NBLDAPE in their simulations, with the worse performance of NBLDAPE mainly attributable to its lack of feature selection. Finally, another critical factor in transcriptomic modeling is adjustment for covariates such as gender, race and age, since it is well known that many genes are systematically impacted by these factors. For example, Peters et al. [16] identified 1,497 genes differentially expressed with age in a whole-blood gene expression meta-analysis of 14,983 individuals. A classification model allowing for covariate adjustment is expected to provide better accuracy and deeper biological insight.
To account for all the aforementioned factors, we propose a sparse negative binomial model (snbClass) for classification analysis with covariate selection and adjustment. The method is based on a generalized linear model (GLM) with a first regularization term for feature sparsity. The GLM framework also allows straightforward covariate adjustment and a second regularization term on covariates, facilitating further covariate selection. Such covariate adjustment is not possible in the existing sPLDA or NBLDAPE methods. The paper is structured as follows. In Section 1.1, we briefly describe the two existing methods, sPLDA [22] and NBLDAPE [8], followed by our proposed methods sNBLDAGLM and sNBLDAGLM.sC in Section 1.2. Sections 1.3 and 1.4 discuss parameter estimation and model selection for the proposed methods. Benchmarks for evaluation are described in Section 1.5. Section 2 presents simulation studies and Section 3 shows two real applications to cervical tumor miRNA data and schizophrenia RNA-seq data. Conclusions and discussion are included in Section 4.
1 Existing and proposed methods
In this section, we first describe two existing methods for classification analysis of RNA-seq count data and then propose our new method. To unify notation, denote by X the count data matrix with element Xij the sequence count for the j-th gene and the i-th sample (i = 1, 2, …, n and j = 1, 2, …, p). In addition, xi = (Xi1, …, Xip)T denotes the i-th row of X, corresponding to the feature measurements of observation i. Also, define Xi· = ∑j Xij, X·j = ∑i Xij and X‥ = ∑i,j Xij. Moreover, in the classification setting where each observation belongs to one of K classes, we let disjoint sets Ck ⊂ {1, …, n} contain the indices of observations in class k; that is, class label yi = k if and only if i ∈ Ck. Furthermore, we denote XCkj = ∑i∈Ck Xij, the total count of gene j in class k.
1.1 Two existing methods for classification of RNA-seq data
1.1.1 Sparse Poisson linear discrimination analysis (sPLDA)
Witten [22] introduced a log-linear Poisson model with feature selection, which results in a simple diagonal linear discriminant analysis suitable for count data (referred to as “sPLDA” hereafter). Under the assumption of gene independence, the model is based on the following formulation:

Xij | yi = k ~ Poisson(si gj dkj),

where si is the normalizing factor (a.k.a. size factor) for sample i and gj is the grand mean for the j-th gene, allowing for variation in both samples and genes. For a given gene j, the parameters d1j, …, dKj allow the j-th gene to be differentially expressed between the classes if any dkj ≠ 1 (1 ≤ k ≤ K).
RNA-seq data often exhibit overdispersion, with variances larger than means, whereas an important constraint of the Poisson model is that mean and variance are equal. To overcome this, Witten [22] proposed a power transformation of the count data, X′ij = Xij^u, with u ∈ (0, 1] chosen such that the transformed data approximately satisfy the Poisson assumption. Simulations in the original paper show that this correction performs well in the presence of weak to moderate overdispersion.
Suppose x* = (X*1, …, X*p)T is a future sample to be predicted. The discriminant score for assigning x* to class k is

δk(x*) = ∑j X*j log(d̂kj) − ŝ* ∑j ĝj d̂kj + log π̂k,

where π̂k is the estimated prior probability of belonging to the k-th class, computed as the fraction of training samples in class k, and ŝ* is the estimated normalization factor for the new sample x*, whose class label is unknown. The classifier assigns x* to the class y* with the largest discriminant score. The paper also implemented a somewhat ad hoc soft-thresholding operator for feature selection in the classifier, motivated by univariate lasso regularization in regression:

d̂kj = 1 + S(d̃kj − 1, v),

where d̃kj is the unshrunken estimate of dkj, β is a hyperparameter pre-determined in the estimation of dkj, v is the tuning parameter chosen by cross-validation and S(x, a) = sign(x)(|x| − a)+ is the soft-thresholding operator. d̂kj = 1 means gene j is not differentially expressed across the classes and thus is not selected by the classifier.
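For concreteness, the scoring rule and soft-thresholding operator above can be sketched in a few lines. This is a Python illustration (the accompanying package is in R); the function names are hypothetical and the estimated baseline means, fold changes and priors are taken as given.

```python
import numpy as np

def soft_threshold(x, a):
    """S(x, a) = sign(x) * (|x| - a)_+, the soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def splda_discriminant(x_new, s_new, g_hat, d_hat, priors):
    """Poisson LDA scores for one new sample, one score per class.

    x_new : (p,) counts of the new sample
    s_new : size factor of the new sample
    g_hat : (p,) estimated per-gene baseline means
    d_hat : (K, p) estimated class fold changes (d_hat[k, j] == 1 means
            gene j does not discriminate class k)
    priors: (K,) estimated class prior probabilities
    """
    # score_k = sum_j x*_j log(d_kj) - s* sum_j g_j d_kj + log(pi_k)
    return np.log(d_hat) @ x_new - s_new * (d_hat @ g_hat) + np.log(priors)
```

The predicted label is the argmax of the returned scores; when every fold change is shrunk to one, the rule reduces to comparing class priors.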
1.1.2 Negative binomial linear discrimination analysis (NBLDAPE)
Dong et al. [8] extended sPLDA to a negative binomial model to explicitly accommodate the overdispersion in RNA-seq data:

Xij | yi = k ~ NB(μij, ϕj), μij = si gj dkj for i ∈ Ck.
Under this formulation, E(Xij) = μij and Var(Xij) = μij + ϕj μij², where ϕj is the dispersion parameter for the j-th gene. Similar to sPLDA, a new observation x* is assigned to the class maximizing the corresponding negative binomial discriminant score, which replaces the Poisson density with the NB density; ĝj and d̂kj are the same as defined previously. We note that the point estimates of ĝj and d̂kj are borrowed directly from Witten’s sPLDA model without theoretical justification, and that the soft-thresholding of sPLDA cannot easily be incorporated into the procedure due to the increased complexity introduced by ϕj.
In the literature, several popular procedures have been used for estimating the size factor, including the simple sum of counts, the median ratio [1] and the quantile method [4]. Witten [22] and Dong et al. [8] showed that performance is comparable among the three methods. Here, we use the quantile method for all methods for a fair comparison. In the quantile method, the normalization factor for sample i (1 ≤ i ≤ n) is estimated as ŝi = qi/∑i′ qi′ (or an equivalently rescaled version, which is what we adopt in this paper), where qi is the 75th quantile of the sequence counts of all genes for the i-th sample. For a new sample x*, the normalizing factor ŝ* is estimated analogously from q*, the 75th count quantile of x*, with the qi (1 ≤ i ≤ n) coming from the training data. Note that the vectors of normalization factors and dispersions, denoted s and ϕ respectively, are pre-estimated in all negative binomial models in this paper before inference. The dispersions ϕ are estimated by the weighted likelihood empirical Bayes method in the edgeR package [17], taking class labels into account. We denote the method proposed by [8] as “NBLDAPE” to emphasize the ad hoc “point estimation” procedure inherited from sPLDA [22].
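The quantile normalization step can be sketched as follows (a Python illustration; the rescaling convention of making the factors average to one is an assumption here, since any common constant cancels across classes):

```python
import numpy as np

def quantile_size_factors(X, q=0.75):
    """Per-sample size factors from the q-th count quantile of each sample.

    X: (n, p) count matrix (samples x genes). Factors are rescaled to
    average 1 (an illustrative convention).
    """
    qi = np.quantile(X, q, axis=1)      # 75th count quantile per sample
    return qi / qi.mean()

def size_factor_new(x_new, qi_train, q=0.75):
    """Size factor for a future sample x*, anchored to the training quantiles."""
    return np.quantile(x_new, q) / qi_train.mean()
```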
1.2 Proposed method: sparse negative binomial classifier via generalized linear model
We first consider a model without covariates in Section 1.2.1 and then extend it to incorporate covariates in Section 1.2.2.
1.2.1 Sparse negative binomial classifier without covariate adjustment (sNBLDAGLM)
Similar to NBLDAPE, we specify the following negative binomial model in a generalized linear model (GLM) setting:

Xij | yi = k ~ NB(μijk, ϕj), log(μijk) = log(si) + βjk,

where si is the normalization factor of the i-th sample, βjk is the log-scale mean count of the k-th class for the j-th gene and ϕj is the dispersion parameter of the j-th gene. Under the assumption of independence between genes, the corresponding log-likelihood can be written as

l(Θ; X) = ∑i ∑k I(yi = k) ∑j log f(Xij; βjk, ϕj),

where Θ = {(βk, ϕ); k = 1, …, K}, βk = (β1k, …, βpk), ϕ = (ϕ1, …, ϕp), I(yi = k) is the indicator function taking value 1 if yi = k and 0 otherwise, and f(Xij; βjk, ϕj) is the density function of NB(si exp(βjk), ϕj). Now suppose we have a new observation x* whose class label we intend to predict. By Bayes’ theorem, we can derive the discriminant score as

δk(x*) = ∑j log f(X*j; βjk, ϕj) + log π̂k, (1)

where f(X*j; βjk, ϕj) is the density of NB(ŝ* exp(βjk), ϕj) and π̂k is the estimated class prior. Here, x* is assigned to the class k for which the discriminant score is maximized. Note that the form of the discriminant score in the current model is identical to that proposed in [8], except that we reparametrize μijk = si gj dkj to log(μijk) = log(si) + βjk. The major difference lies in parameter estimation: [8] directly borrows the point estimation of μijk from the Poisson model in [22], while we derive the MLE of Equation (2) (see below) using the iteratively reweighted least squares (IRLS) method shown in the next subsection.
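A minimal sketch of the NB log-density and the discriminant score above, assuming the parameterization Var(X) = μ + ϕμ² (function names are hypothetical):

```python
import math
import numpy as np

def nb_logpmf(x, mu, phi):
    """log NB density with mean mu and dispersion phi (Var = mu + phi*mu^2)."""
    r = 1.0 / phi                        # NB size parameter
    pi = phi * mu / (1.0 + phi * mu)     # "success" probability per count
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(1.0 - pi) + x * math.log(pi))

def nb_discriminant(x_new, s_new, beta, phi, priors):
    """delta_k(x*) = sum_j log f(x*_j; s* exp(beta_jk), phi_j) + log pi_k.

    beta: (K, p) class log-means; phi: (p,) dispersions; priors: (K,).
    """
    K, p = beta.shape
    scores = np.zeros(K)
    for k in range(K):
        mu = s_new * np.exp(beta[k])     # per-gene class means for the new sample
        scores[k] = sum(nb_logpmf(x_new[j], mu[j], phi[j]) for j in range(p))
        scores[k] += math.log(priors[k])
    return scores                        # predict the argmax
```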
In order to incorporate variable (gene) selection, we add a penalty term λ ∑j ∑k |βjk − β̄j|, where β̄j = (1/K) ∑k βjk is the average of the βjk’s over the K classes for the j-th gene. Hence, the following penalized log-likelihood is maximized to obtain the estimate of β with pre-estimated ϕ:

max β  l(β; X, ϕ) − λ ∑j ∑k |βjk − β̄j|. (2)

Here, λ is a tuning parameter controlling the sparsity of the variable selection. The form of the discriminant score for prediction is the same as in Equation (1).
1.2.2 Sparse negative binomial classifier with covariate adjustment (sNBLDAGLM.C and sNBLDAGLM.sC)
In real applications, information on multiple clinical variables is often available and some of them may be associated with subsets of genes. Commonly encountered clinical variables include age, gender, race, etc. Failure to adjust for covariates can greatly reduce prediction accuracy and replicability. In our GLM framework, covariate adjustment can be straightforwardly incorporated in the linear regression term:

log(μijk) = log(si) + βjk + ∑q αqj Ziq, (3)

where zq = (Z1q, …, Znq) contains the values of the q-th covariate over the n samples and the parameter αqj corresponds to the coefficient of the q-th covariate in the j-th gene. Under the assumption of gene independence and adding penalty terms for both genes and covariates, the problem can be presented as maximization of the following penalized log-likelihood with double regularization:

max β,α  l(β, α; X, Z, ϕ) − λ1 ∑j ∑k |βjk − β̄j| − λ2 ∑j ∑q |αqj|, (4)

where β is the collection of all βjk parameters and α is the collection of all αqj parameters; λ1 and λ2 are tuning parameters controlling the levels of sparsity of variable selection in genes and covariates, respectively.
Similarly, for a new sample x* with clinical covariate vector z* = (Z*1, …, Z*Q) under this framework, we can derive the following discriminant score:

δk(x*, z*) = ∑j log f(X*j; βjk, αj, ϕj) + log π̂k,

where f(X*j; βjk, αj, ϕj) is the density of NB(ŝ* exp(βjk + ∑q αqj Z*q), ϕj).
As before, x* is assigned to the class with the highest discriminant score. We note that when λ2 = 0, Equation (4) performs covariate adjustment using all covariates for all genes without regularization of the covariate parameters αqj. We denote this method “sNBLDAGLM.C”. In this case, when the number of covariates Q becomes large, the performance of parameter estimation and prediction accuracy is expected to decline. With a proper choice of λ2 in Equation (4), the method can adequately select a subset of covariates for each gene to improve performance. We refer to this method as “sNBLDAGLM.sC” in this paper, where “sC” means sparsity on covariates. This is the method we recommend for general applications when clinical covariates are available, and it will be referred to as “snbClass” in the R package and future applications. When clinical covariates do not exist, the method naturally reduces to “sNBLDAGLM”.
1.3 Estimation in sNBLDAGLM and sNBLDAGLM.sC
1.3.1 Estimation of sNBLDAGLM
Maximizing the penalized log-likelihood in Equation (2) is equivalent to minimizing the following penalized weighted least squares function:

∑i ∑k I(yi = k) ∑j wijk (τijk − log(si) − βjk)² / 2 + λ ∑j ∑k |βjk − β̄j|, (5)

where wijk = μijk/(1 + ϕj μijk) and τijk = log(si) + βjk + (xij − μijk)/μijk.
Given the estimates β̂jk(t) at the t-th step, the updates at the (t+1)-th step are:
(1) Calculate μ̂ijk(t) = si exp(β̂jk(t)).
(2) Update the working responses τ̂ijk(t) = log(si) + β̂jk(t) + (xij − μ̂ijk(t))/μ̂ijk(t) and the weights ŵijk(t) = μ̂ijk(t)/(1 + ϕ̂j μ̂ijk(t)).
(3) Solve the penalized weighted least squares problem in Equation (5) for β̂jk(t+1).
(4) Update β̄j(t+1) = (1/K) ∑k β̂jk(t+1).
This is repeated until convergence of β̂jk. The update of β̂jk(t+1) in Step (3) is given by

β̂jk(t+1) = β̄j(t) + sign(β̃jk(t+1) − β̄j(t)) [ |β̃jk(t+1) − β̄j(t)| − λ/∑i∈Ck ŵijk(t) ](+),

where [u](+) is the soft-thresholding function taking the value u when u is positive and 0 otherwise, and β̃jk(t+1) = ∑i∈Ck ŵijk(t) (τ̂ijk(t) − log(si)) / ∑i∈Ck ŵijk(t) is the estimate of βjk under no penalization.
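The IRLS loop for a single gene can be sketched as follows. This is an illustration rather than the packaged implementation: the working weight w = μ/(1 + ϕμ) is the standard NB GLM weight under Var = μ + ϕμ², and the threshold scaling is simplified (the function names are hypothetical).

```python
import numpy as np

def soft(u, a):
    """Soft-thresholding: sign(u) * (|u| - a)_+."""
    return np.sign(u) * np.maximum(np.abs(u) - a, 0.0)

def irls_fit_gene(x, s, y, phi, lam, K, n_iter=100, tol=1e-8):
    """IRLS estimate of (beta_j1, ..., beta_jK) for a single gene j.

    x: (n,) counts for gene j; s: (n,) size factors; y: (n,) labels in 0..K-1;
    phi: pre-estimated scalar dispersion; lam: sparsity tuning parameter.
    Each pass forms working responses and weights, does a weighted
    least-squares solve per class, then shrinks beta_jk toward the
    across-class mean (simplified threshold scaling, an assumption).
    """
    beta = np.array([np.log((x[y == k].mean() + 0.5) / s[y == k].mean())
                     for k in range(K)])                  # crude initial values
    for _ in range(n_iter):
        beta_old = beta.copy()
        for k in range(K):
            sk, xk = s[y == k], x[y == k]
            mu = sk * np.exp(beta[k])
            w = mu / (1.0 + phi * mu)                     # NB GLM working weights
            tau = np.log(sk) + beta[k] + (xk - mu) / mu   # working responses
            beta[k] = np.sum(w * (tau - np.log(sk))) / np.sum(w)
        bbar = beta.mean()
        beta = bbar + soft(beta - bbar, lam)              # fuse classes when lam is large
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

With lam = 0 and unit size factors, the fixed point of the unpenalized update is the log of the class sample mean, the MLE of the NB mean with fixed dispersion.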
1.3.2 Estimation of sNBLDAGLM.sC
Similar to sNBLDAGLM, maximizing the penalized log-likelihood in Equation (4) can be represented as minimizing the penalized weighted least squares function in Equation (7):

∑i ∑k I(yi = k) ∑j wijk (τijk − log(si) − βjk − ∑q αqj Ziq)² / 2 + λ1 ∑j ∑k |βjk − β̄j| + λ2 ∑j ∑q |αqj|, (7)

where wijk = μijk/(1 + ϕj μijk) and τijk = log(si) + βjk + ∑q αqj Ziq + (xij − μijk)/μijk. The estimates of the βjk and αqj are obtained by the following IRLS algorithm. Given the estimates at the t-th step:
(1) Calculate μ̂ijk(t) = si exp(β̂jk(t) + ∑q α̂qj(t) Ziq).
(2) Update the working responses τ̂ijk(t) = log(si) + β̂jk(t) + ∑q α̂qj(t) Ziq + (xij − μ̂ijk(t))/μ̂ijk(t) and the weights ŵijk(t) = μ̂ijk(t)/(1 + ϕ̂j μ̂ijk(t)).
(3) Solve the penalized weighted least squares problem for β̂jk(t+1), holding α fixed.
(4) Solve the penalized weighted least squares problem for α̂qj(t+1), holding β fixed.
(5) Update β̄j(t+1) = (1/K) ∑k β̂jk(t+1).
The steps are repeated until convergence of the parameters βjk, α1j, …, αQj. The penalized estimates in Steps (3) and (4) take the same soft-thresholding form as in Section 1.3.1: each βjk is shrunk toward β̄j with threshold λ1 and each αqj is shrunk toward 0 with threshold λ2, in both cases scaled by the corresponding sum of weights.
1.4 Selection of tuning parameters in regularization
Both the sNBLDAGLM and sNBLDAGLM.sC methods involve selection of the regularization parameters λ or (λ1, λ2). We apply V-fold cross-validation to determine the tuning parameters [19]. For each candidate tuning parameter, we divide the dataset into V equal folds, splitting the samples of the K classes across the V folds as evenly as possible. In each iteration, one fold is set aside as the test set and the remaining (V − 1) folds are used as the training set. The classifier is built on the training set and then validated on the test set to evaluate accuracy. This procedure is repeated until every fold has served as the test set, and the averaged accuracy is calculated. The tuning parameter with the highest averaged accuracy is chosen for the final model. We apply 10-fold (V = 10) cross-validation in all simulations and real applications in this paper. We note that nested cross-validation is used in the real applications for a fair accuracy evaluation: the outer loop of 10-fold cross-validation is used to estimate accuracy, and within each outer iteration the 9 training folds undergo an inner loop of 10-fold cross-validation to determine λ or (λ1, λ2).
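The stratified V-fold split and accuracy-based grid search described above can be sketched as follows (helper names are hypothetical; `fit` and `predict` stand in for any classifier in this paper):

```python
import numpy as np

def stratified_folds(y, V=10, seed=0):
    """Split sample indices into V folds, spreading each class as evenly as possible."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(V)]
    for k in np.unique(y):
        # deal the shuffled indices of class k round-robin across folds
        for pos, i in enumerate(rng.permutation(np.where(y == k)[0])):
            folds[pos % V].append(int(i))
    return [np.array(sorted(f), dtype=int) for f in folds]

def cv_choose(fit, predict, X, y, grid, V=10, seed=0):
    """Return the tuning value in `grid` with the highest V-fold mean accuracy."""
    folds = stratified_folds(y, V, seed)
    acc = np.zeros(len(grid))
    for v in range(V):
        test = folds[v]
        train = np.setdiff1d(np.arange(len(y)), test)
        for m, lam in enumerate(grid):
            model = fit(X[train], y[train], lam)
            acc[m] += np.mean(predict(model, X[test]) == y[test]) / V
    return grid[int(np.argmax(acc))]
```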
1.5 Benchmarks for evaluation
Performance of the different methods is judged by two major criteria: prediction accuracy and feature selection accuracy. For prediction performance, simple averaged accuracy is used when true class labels are known: accuracy = (1/n) ∑i I(ŷi = yi). For feature selection performance, we derive the area under the curve (AUC) [2] of the receiver operating characteristic (ROC) curve. We also evaluate the performance of sNBLDAGLM, sNBLDAGLM.sC and sPLDA in terms of estimating the true parameters βjk when gene expression is affected by covariates, using the root mean squared error over the B simulated datasets, RMSE = ( ∑b ∑j,k (β̂jk(b) − βjk)² / (BpK) )^(1/2).
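Both benchmarks can be computed directly; the AUC below uses the rank (Mann-Whitney) form of the ROC area, which is equivalent to integrating the empirical ROC curve (function names are hypothetical):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def auc(score, truth):
    """AUC as the probability that a truly informative gene receives a higher
    selection score than a null gene; ties count one half."""
    score = np.asarray(score, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    pos, neg = score[truth], score[~truth]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return float(wins) / (len(pos) * len(neg))
```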
2 Simulations
In this section, we devise two simulation schemes to compare sPLDA and NBLDAPE with our proposed models sNBLDAGLM and sNBLDAGLM.sC under different settings. In Simulation 1, there is no covariate effect on the expression levels of the genes; we compare sPLDA, NBLDAPE and sNBLDAGLM over different levels of signal strength under three levels of dispersion in the data. In Simulation 2, we develop a scheme in which two covariates can affect the expression levels of a certain proportion of the genes; here we compare sPLDA, NBLDAPE, sNBLDAGLM and sNBLDAGLM.sC in the presence of covariate effects.
In order to mimic real data, we use a real RNA-seq dataset downloaded from the Gene Expression Omnibus (GEO accession GSE47474) to retrieve key parameters for the simulations. The dataset includes 72 samples, 36 from HIV-1 transgenic and 36 from control rat strains [12]. We compute the mean count of each gene over all samples to obtain an empirical distribution of mean counts, which is used to set baseline expression levels in all simulations. Each simulation is repeated 100 times and the average result is reported.
2.1 Simulation settings
Simulation 1: Without covariate effect
In this simulation, we sample the count data by xij | Ci = k ~ NB(si bj exp(δjk Δj), ϕj) for each gene j (1 ≤ j ≤ 1000) and sample i (1 ≤ i ≤ 120) in class k (1 ≤ k ≤ 3), where the number of informative features is 300. The notation and settings are given below:
The library size factor si is sampled from Unif(0.75,1.25) for each sample i.
bj is the baseline which is sampled from the empirical distribution of the mean expression described previously.
δjk represents the pattern of gene j in class k, with δjk ∈ {−1, 0, 1}: 1 indicates an up-regulated trend in this class relative to the other classes, −1 a down-regulated trend, and 0 no difference.
There exist three gene patterns for the 300 informative genes: (δj1, δj2, δj3) = (1, 0, −1), (0, 1, 1) and (−1, −1, 0). For non-informative genes, the pattern is (0, 0, 0).
Sample the main effect size parameter Δj for each gene j from a truncated normal distribution TN(ζ, 0.1², ζ/2, ∞), where ζ is the mean and values smaller than ζ/2 are truncated.
ϕj ~ TN (ν, 0.1, 0, ∞) and ν is chosen as 1, 5 and 10.
100 of the samples are used as the training set and the remaining 1,000 samples are used as the test set.
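The Simulation 1 recipe above can be sketched as follows. Since the empirical baseline distribution from GSE47474 is not reproduced here, a log-normal stand-in is used for bj (an assumption), and ϕ is taken as the dispersion in Var = μ + ϕμ²; the function name is hypothetical.

```python
import numpy as np

def simulate_sim1(n=100, p=1000, p_info=300, K=3, zeta=0.25, nu=1.0, seed=0):
    """One Simulation-1 dataset: x_ij | C_i = k ~ NB(s_i b_j exp(delta_jk Delta_j), phi_j)."""
    rng = np.random.default_rng(seed)

    def tnorm(mean, sd, lo, size):                       # truncated normal by rejection
        out = rng.normal(mean, sd, size)
        while np.any(out < lo):
            bad = out < lo
            out[bad] = rng.normal(mean, sd, bad.sum())
        return out

    s = rng.uniform(0.75, 1.25, n)                       # library size factors
    b = rng.lognormal(3.0, 1.0, p)                       # baseline means (stand-in)
    patterns = np.array([[1, 0, -1], [0, 1, 1], [-1, -1, 0]])
    delta = np.zeros((p, K))
    delta[:p_info] = patterns[np.arange(p_info) % 3]     # informative-gene patterns
    Delta = tnorm(zeta, 0.1, zeta / 2, p)                # effect sizes
    phi = tnorm(nu, 0.1, 1e-6, p)                        # dispersions (> 0)
    y = rng.integers(0, K, n)                            # class labels
    mu = s[:, None] * b[None, :] * np.exp(delta[:, y].T * Delta[None, :])
    r = 1.0 / phi                                        # NB size parameter
    X = rng.negative_binomial(r[None, :], r[None, :] / (r[None, :] + mu))
    return X, y
```

Simulation 2 extends the exponent with the covariate term ∑q γqj ϵqj zqi in the same way.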
Simulation 2: Incorporating covariate effect
We sample the count data by xij | Ci = k ~ NB(si bj exp(δjk Δj + ∑q γqj ϵqj zqi), ϕj) for each gene j (1 ≤ j ≤ 1000) and sample i (1 ≤ i ≤ 120) in class k (1 ≤ k ≤ 3) with two covariates (z1 and z2; Q = 2), where the number of informative features is 300. The notation of the parameters is as follows:
We generate a binary covariate (e.g. gender) for each sample i as z1i ~ Ber(0.5) and a continuous covariate (e.g. age) as z2i ~ Gamma(5, 10).
ϕj ~ TN (ν, 0.1, 0, ∞) where ν ∈ {10, 1}
γqj ∈ {0, 1} indicates whether gene j is affected by covariate q. There exist four patterns (γ1j, γ2j) = (1, 1), (1, 0), (0, 1) and (0, 0), with probabilities ρ/3, ρ/3, ρ/3 and 1 − ρ, respectively. When ρ = 0, no genes are impacted by covariates. We choose the proportion of covariate-impacted genes ρ to be 0.125, 0.25 and 0.5.
Sample the main effect size parameter Δj for each gene j from a truncated normal distribution TN(0.25, 0.1², 0.125, ∞).
The effect size parameter ϵqj of covariate q in gene j is drawn as ϵqj = κ × TN(η, 0.1², η/2, ∞), where the random sign κ takes value 1 with probability 0.5 and −1 otherwise. We use different values of η ∈ {0.1, 0.3, 0.5, 0.7} for different levels of signal strength.
Other parameters are set the same as in Simulation 1, except that ζ is set to 0.25.
100 of the samples are used as the training set and the remaining 1,000 samples are used as the test set.
2.2 Simulation results
Results of Simulation 1 are summarized in Figure 1. In Figure 1(a), the average prediction accuracies of the three models sPLDA, NBLDAPE and sNBLDAGLM are compared over three different levels of dispersion, ν ∈ {1, 5, 10}; the larger the value of ν, the smaller the level of dispersion in the simulated datasets. For all levels of ζ and ν, sNBLDAGLM outperformed the other two methods. As expected, NBLDAPE was superior to sPLDA when ν was small (large overdispersion), but their performances became comparable when ν was large, confirming that the power transformation corrects dispersion in sPLDA well only for small overdispersion. Figure 1(b) shows the variable selection results by AUC: sNBLDAGLM clearly outperformed sPLDA in all cases, while NBLDAPE cannot perform variable selection and is therefore not applicable in this plot. The new method was also compared with three popular classification methods, support vector machines (SVM), random forest (RF) and classification and regression trees (CART), in Supplementary Figure S3. These methods performed worse because they ignore the count nature of the data and require transformation to continuous inputs.
Figure 2 shows the results of Simulation 2 for sPLDA, NBLDAPE, sNBLDAGLM (no covariate adjustment) and sNBLDAGLM.sC (with covariate adjustment and regularization) when varying the percentage of genes impacted by covariates, ρ = 0.125, 0.25 and 0.5. Figure 2(a) shows the averaged prediction accuracy for varying η and ν = 1 or 10. When ν = 1 (high dispersion), sNBLDAGLM.sC outperformed all other methods as the impact of covariates on gene expression (η) increased. The prediction accuracy of sNBLDAGLM.sC remained high with increasing η owing to its capacity to adjust for covariate effects, while the accuracy of the other three methods dropped, although sNBLDAGLM still outperformed sPLDA and NBLDAPE. When ν = 10, a similar pattern was observed; the margin between sNBLDAGLM.sC and sNBLDAGLM became much smaller, but sNBLDAGLM.sC was still the best performer. Supplementary Figure S4 includes comparisons with SVM, RF and CART, all of which performed much worse than sNBLDAGLM.sC.
Variable selection performance of sPLDA, sNBLDAGLM and sNBLDAGLM.sC is shown in Figure 2(b). Similarly, we observed stable and high performance of sNBLDAGLM.sC with increasing η, while the performance of sNBLDAGLM dropped as η increased due to the lack of covariate adjustment; sPLDA performed the worst in all cases. It is intriguing that the variable selection gap between sNBLDAGLM.sC and sNBLDAGLM was larger for ν = 10 than for ν = 1, which is contrary to the prediction accuracy in Figure 2(a). An evaluation of the parameter estimates of sPLDA, sNBLDAGLM and sNBLDAGLM.sC in terms of RMSE is given in Supplementary Figure S1, where sNBLDAGLM.sC performed the best. To examine the advantage of covariate regularization, we compared sNBLDAGLM.C (i.e. λ2 = 0 in Equation (4); all covariates are used) with sNBLDAGLM.sC (with covariate regularization) in Supplementary Figure S2. The result shows a clear improvement from covariate regularization in prediction accuracy but less in feature selection.
3 Real applications
3.1 Cervical tumor miRNA data
This miRNA-seq dataset measures the expression levels of miRNAs in tumor and non-tumor cervical tissues from human samples [21]. The data contain information on 714 microRNAs for 29 control samples (with no tumor) and 29 tumor samples. No clinical information (covariates) is available for adjustment. This dataset was used in both the sPLDA and NBLDAPE papers and is therefore a good benchmark for our new method. [8] found that NBLDAPE performed better than sPLDA in prediction accuracy because of the high dispersion estimates in this dataset. In Figure 3, we compare the prediction accuracy (y-axis) of sPLDA and sNBLDAGLM based on 10-fold cross-validation for different numbers of selected genes (x-axis). Since there is no variable selection in NBLDAPE, we only perform cross-validation with all miRNAs for that method (shown as “X” in the figure). sNBLDAGLM generally outperforms the other two methods across different numbers of selected genes. Specifically, it achieves 95% prediction accuracy with as few as 37 genes, while NBLDAPE and sPLDA generally achieve 91% accuracy. The result shows a clear improvement of sNBLDAGLM in both prediction accuracy and variable selection.
3.2 Schizophrenia RNA-seq dataset
The schizophrenia RNA-seq dataset (http://www.synapse.org/CMC) was obtained from the CommonMind Consortium [11], using post-mortem human dorsolateral prefrontal cortex tissues from 258 schizophrenia patients and 279 controls. Here we restrict our analysis to subjects with age below 50 and post-mortem interval (PMI; the time elapsed from death until the tissues are frozen) less than 30 hours, producing 150 subjects, of whom 100 are controls and 50 have schizophrenia. Five clinical variables are available: age of death, gender, PMI, pH level and ethnicity (Caucasian or African American). We first ran a differential expression (DE) analysis on each covariate and found higher percentages of DE genes for age of death, ethnicity, PMI and pH. However, since pH had missing values, we considered only the other three clinical variables in the sNBLDAGLM.sC model. We performed routine data preprocessing and filtering, keeping genes with expression greater than 0 in at least 70% of samples and mean count across samples greater than 10, which produced a count data matrix of 16,989 genes for machine learning. As in the simulations and the previous application, 10-fold cross-validation was performed to evaluate sPLDA, NBLDAPE, sNBLDAGLM and sNBLDAGLM.sC. Even though three of the four methods have embedded feature selection, feature selection is usually difficult under ultra-high dimensionality (e.g. 16,989 gene features in our case). We therefore performed a pre-screening by DE analysis in each training set to narrow the input down to 250, 500, 750, 1000, 1500, 2000 or 5000 genes before applying the four machine learning methods. This procedure is similar to the sure independence screening idea in [10] and can usually improve prediction performance.
Figure 4 shows the 10-fold cross-validation accuracy of the four methods for different gene sizes after the DE pre-selection. For the three methods with embedded feature selection (sPLDA, sNBLDAGLM and sNBLDAGLM.sC), varied tuning parameters for feature selection were applied and the best prediction accuracy is reported in Figure 4. The results clearly demonstrate the better prediction performance of sNBLDAGLM.sC, especially when the pre-screening reduces the input gene size to 250-1000. When a large number of genes is input to the sNBLDAGLM.sC algorithm (e.g. 2000 or 5000 genes after pre-screening), its performance drops close to that of sNBLDAGLM and the advantage of covariate adjustment diminishes; nevertheless, our proposed GLM approach generally outperforms sPLDA and NBLDAPE. As a result, we recommend pre-screening ultra-high-dimensional data, such as regular RNA-seq datasets, down to 250-5000 features before applying sNBLDAGLM.sC. The results also show inferior performance of sNBLDAGLM.C, demonstrating the necessity of covariate regularization. The accuracy of the methods discussed in this paper, compared with other methods appropriate for continuous data, is summarized in Supplementary Figure S5.
4 Conclusion and Discussion
In this paper, we proposed a sparse negative binomial classifier based on a GLM framework, with and without covariate adjustment. The method incorporates three key elements of machine learning modeling for RNA-seq: adequate modeling of count data, feature selection and adjustment for covariate effects. Among existing methods, sPLDA does not model overdispersion properly, NBLDAPE does not embed regularization for feature selection, and neither method can adjust for covariate effects on gene expression.
Our new approach assumes a negative binomial model to allow overdispersion, adopts a GLM to allow covariate adjustment, and uses double regularization for feature and covariate selection. Extensive simulations and two real applications showed the superior performance of the proposed approach in terms of prediction accuracy and feature selection. In particular, the new methods achieved higher prediction accuracy with a smaller number of selected genes or miRNAs in the two real applications.
One major limitation of all methods in this paper is the assumption of independence between genes. Due to the complex form of a multivariate negative binomial model and the potentially heavy computational cost, this is not addressed here but will be a future direction.