ABSTRACT
Background Protein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low abundant mRNAs.
Results Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.
Conclusion Overall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in a realistic clinical settings such as in cancer research. About half of the methods showed severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/
Introduction
Proteins and messenger RNAs (mRNA) have been the primary target of transcriptome studies. However, RNA sequencing technology has revealed that the human genome is pervasively transcribed, resulting in thousands of novel non-coding RNA genes. Hence, attention is expanding to one of the most poorly understood, yet most common RNA species: long non-coding RNAs (lncRNAs)1,2. These lncRNAs form a large and diverse class of transcribed RNA molecules, constituting up to 70% of the transcriptome with a defined length of ≥ 200 nucleotides. While they do not encode for proteins, lncRNAs are strong regulators of gene expression3. The discovery and study of lncRNAs is of major relevance to human health and disease because they represent an extensive, largely unexplored, and functional component of the genome3–7. In contrast to mRNAs, lncRNAs are generally expressed in low amounts, typically an order of magnitude lower than mRNA expression levels2,5,8–10. Furthermore, several studies demonstrated that lncRNA expression levels are very noisy, which is a characteristic shared with low count data from massively parallel RNA sequencing10–13.
Following the advent of RNA-sequencing (RNA-seq) technologies, several statistical tools for differential gene expression (DGE) analysis have been introduced. However, low and noisy read counts, such as those coming from lncRNAs, are potentially challenging the tools14,15. For example, it is commonly observed that low count genes or transcripts show large variability of the fold-change estimates and thus exhibit inherently noisier inferential behavior. The majority of the methods suggest removal of low expressed genes before the start of the data analysis, but this procedure essentially blocks the researchers from studying lncRNAs. In our study, no such severe filtering was applied, leaving almost all lncRNAs in the dataset. To our knowledge, there is no statistical method particularly developed for the analysis of lncRNA-seq data and therefore transcriptome studies make use of statistical methods that assume sufficient expression levels. In this paper, we evaluated and compared the performance of many popular statistical methods (see Table 1) developed for testing DGE of RNA-seq data (hereafter referred to as “DE tools” or “DE methods”), with special emphasis on lncRNAs and low abundant mRNAs. All tools considered in this study are popular (in terms of number of citation), available as R software packages16, and use gene or transcript level read counts as input. Our conclusions are based on 6 RNA-seq datasets and many realistic simulations, representing various typical gene expression experiments.
Previous comparative studies of DE tools15,17–22 focused on mRNA and some of these concluded that DE tools show inferior performance for genes or transcripts with low counts. We extend previous studies by including lncRNAs and low expressed mRNAs separately. Our results are based on diverse types of RNA-seq datasets that vary with respect to their biological and technical features, such as species (human, mouse, rat), experimental design (control versus treatment, diseased versus non-diseased, and tissues comparison), and level of biological variability. We assess the degree of concordance among results returned from the DE tools, and we study important statistical properties of DE tools, such as their ability to control the false discovery rate and their sensitivity for the detection of differential expression. The latter are empirically investigated using a nonparametric resampling-based simulation procedure. The simulation method essentially resamples data from a real RNA-seq dataset to create realistic gene expression scenarios. Consequently, our results reflect the genuine behavior of the DE tools under study, in contrast to simulation studies based on parametric assumptions. By starting from a variety of real and representative RNA-seq datasets, the scope of our findings is wide. To our knowledge, our study is the largest empirical evaluation study conducted so far, both in terms of the number of real datasets used and performance metrics evaluated as in the number of DE tools included (Figure A2 of Additional File 1). Note that the evaluation of a method that relies on a parametric assumption (e.g. edgeR, DESeq and DESeq2 assume a negative binomial distribution) by means of simulated counts using the same distribution as used in18, will give too optimistic results. Moreover, these results do not reflect a realistic setting because the distributional assumption cannot be expected to hold in general23.
Our study consists of three paths: first we evaluated various normalization procedures, second we compared the performance of DE tools across various publicly available RNA-seq datasets, and third we used simulation procedures to evaluate and compare the performance of the tools under a variety of gene expression experiment scenarios (Figure 1), like sample size and fraction of DE genes. We applied most DE tools without modifying their default settings, except for few tools for which minor modifications were made to allow them to work for low abundant genes such as lncRNAs (see Methods section for details).
Results and discussion
RNA-seq datasets
Six publicly available benchmark RNA-seq datasets were used for the concordance analysis. Three of them were used as source datasets for generating non-parametric simulated data. The description of the datasets can be found in the Methods section; a summary is presented in Table 2.
The degree of homogeneity among samples, as measured by Pearson’s correlation coefficient, was lowest for the Zhang dataset followed by GTEx (the estimated biological coefficients of variation shown in Figure A1-panel A of Additional File 1 also support this fact). As expected, the other datasets had replicates that are more homogeneous because they were obtained from inbred animals or cultured cell lines, in contrast to the GTEx or Zhang datasets containing tissues for different human individuals. For the Zhang and NGP Nutlin datasets, lncRNAs showed relatively higher heterogeneity across samples than mRNAs. In addition, lncRNAs showed on average lower expression than mRNAs (Figure A1-panel B).
Comparison of normalization methods
Comparing DE tools requires careful attention to the normalization methods. Previous studies17,20,42,43 have pointed out that the normalization procedure can affect the result of DE analysis. We therefore first explored the performance of 5 normalization methods that are used in conjunction with the DE methods evaluated in this study. The normalization methods were compared using the metrics from Dillies et al.42, such as their capability to reduce variability that attributes to technical sources, their capability of eliminating bias due to library size differences, and their effect on DGE analysis.
Box plots of log transformed counts showed that for all 6 datasets all normalization methods succeeded in aligning the sample specific distributions and hence no library size effects were noticeable after normalization (Figures A3-A8 of Additional File 1). Furthermore, the condition-specific genewise coefficient of variation (CV), which is a proxy for intra-group biological variability, was reduced for all datasets by all normalization procedures (Figures A9-A14). Nearly equal levels of biological variability across methods were observed, even with quantile normalization that was found to result in high CV in other studies42,44. The overlap of DE genes with different normalization methods was high (Figure 2). Ignoring quantile normalization (QN), a minimum of 96% similarity was observed. QN-based DE analysis gives deviating results, particularly for designs with small numbers of replicates (< 5); the lowest proportion of similarity was 68%. Overall, the results suggest that all normalization methods perform almost equally, except QN. Nevertheless, for the concordance analysis of the DE tools (see next section) we include a limma method that uses QN (named limmaQN) to further investigate its effect on other relevant performance metrics of DE tools.
Concordance Analysis
Fourteen DE tools were run on 6 RNA-seq datasets, and (dis)similarities among the results were examined. The concordance analysis focused on four quantitative and one qualitative metric: (1) number of genes identified as significantly differentially expressed (SDE); (2) the degree of agreement on gene ranking; (3) similarity of fold-change estimates; (4) handling of genes with special characteristics (lncRNAs, genes with low counts, genes with outliers), and (5) computation time. The results for individual datasets are presented in Additional File 1 Sections 3.3-3.8.
Results show that the 14 DE tools show substantial variability in the numbers of SDE genes. The marginal summary across all datasets indicates that DESeq, baySeq, and limmaQN detected the smallest number of SDE genes, whereas QuasiSeq and SAMSeq returned the largest numbers (Figure 3, Figure A15). The variability among DE tools with respect to the number of SDE genes seems to be related to the biological variability in the dataset. For the Zhang and GTEx RNA-seq datasets, characterized by the largest intra-group biological variability, the numbers of SDE genes were quite different among the DE tools. In contrast, the numbers of SDE genes from the NGP Nutlin and CRC AZA datasets, all displaying low biological variability, were relatively similar among tools. lncRNAs and low abundant genes in general were underrepresented among the SDE genes (Additional File 1 Section 3.3-3.8). For example, 25% of the SDE genes were lncRNAs, whereas the data contains 40% lncRNAs.
Many DE tools detected the same set of SDE genes (Figure 3, Figure A15). On average, limmaQN, DESeq, baySeq, and SAMSeq showed the smallest concordance with all other tested DE tools. It was also observed that the overlap of SDE is smaller for lncRNAs than for mRNAs (Figure A16). In the Zhang dataset, there is less than 70% and 60% SDE overlap across all DE tools for mRNAs and lncRNAs, respectively.
Accurate gene ranking is an essential step for downstream analysis such as gene set enrichment analysis (GSEA)45. In the current study, genes are ranked taking into account both the significance and magnitude of differential expression, using the n-score46. Summarized results across datasets (Figure 3, Figure A15) indicate that all DE tools strongly agree, except for baySeq, SAMSeq, and limmaQN. Apart from baySeq, this is somewhat in contrast to the findings in Soneson et al.18. This might be due to the difference in the score used to rank genes, as only P-values were used to rank genes in Soneson et al.18. Except for limmaQN, gene ranking agreement among all DE tools was nearly the same for lncRNAs and mRNAs (Figure A16). A slightly lower agreement for lncRNAs was observed when the most variable dataset (Zhang) is used. limmaQN consistently exhibited the lowest level of agreement with all other DE tools with respect to the concordance metrics discussed so far, and this is likely due to the quantile normalization it uses (see higher).
For moderately to highly expressed mRNA genes (the 75% highest expressed genes) the log-fold-change (LFC) estimates from all DE tools were strongly correlated, with Pearson correlation coefficients ranging from 0.95 to 0.99 (Additional File 1, Sections 3.3-3.8). For the 25% lowest abundant genes, the estimates show weaker correlations (Pearson correlation less than 0.8 in most cases). Similarly, the correlations for lncRNAs were lower than for mRNAs (Figure A16).
In Section 3.2.1 of Additional File 1, we also examined the handling of genes with unusual counts across samples. It turned out that only the limma tools, DESeq2, and baySeq showed concordant decisions with edgeR robust and SAMSeq (particularly developed to better cope with outliers), suggesting that they are neither sensitive nor conservatives tools.
The computation time to run DGE analysis presented in Section 3.2.2 of Additional File 1 shows that baySeq and DESeq require the longest time, whereas limma tools and PoissonSeq run fast. For RNA-seq data with 10 replicates per group and 19,150 mRNAs, the slowest tools, baySeq and DESeq, were approximately 8,000 and 2,000 times slower than the fastest tool, limmaQN, respectively.
In an attempt to come to an overall conclusion, the results were combined in a hierarchical clustering analysis of the DE tools resulting in 3 clusters (Figure 3). DESeq, baySeq, and limmaQN cluster together, generally showing the least number of SDE genes, lower overlap, and lower gene ranking agreement with all other tested DE tools. The second cluster includes QuasiSeq, edgeR exact, edgeR GLM, and edgeR QL, showing the highest concordance with respect to calling SDE, gene ranking, and LFC estimates. This cluster also shares the same modeling scheme (negative binomial models) and normalization method (TMM). The third cluster includes DESeq2, edgeR robust, limmaVoom(+QW), limmaVst, PoissonSeq, and SAMSeq, showing high concordance to all other DE tools like the second group except that they showed slightly less similar LFC estimates. Moderating the LFC estimate for low expressed and/or outliers genes (DESeq2 and edgeR robust) or transformation of counts to log scale (limmaVoom (+QW) and limmaVst) may explain slightly different LFC estimates.
Simulation results
The non-parametric SimSeq47 procedure was applied to realistically simulate RNA-seq expression data. The simulation technique involves sub-sampling of replicates from a real RNA-seq dataset with a sufficiently large number of replicates. In this way, the underlying characteristics of the source dataset are preserved, including the count distributions and variability. Three series of simulations were performed, each starting from a different RNA-seq source dataset: Zhang, NGP Nutlin, and GTEx data. The degree of homogeneity among the replicates in these datasets varies, reflecting different levels of intra-group biological variability (Table 2 and Figure A1). The Zhang and NGP Nutlin datasets include annotated lncRNAs along with mRNAs, whereas the GTEx RNA-seq dataset contains only annotated mRNA genes. Therefore, simulated counts for mRNA and lncRNA are sampled from mRNA and lncRNA counts of the source dataset, respectively.
Gene expressions were simulated under a wide range of scenarios that may affect the performance of DE tools: different numbers of replicates ranging from 2 to 40, different proportions of true DE genes (0 to 30%), two gene biotypes (mRNA and lncRNA), and different levels of intra-group biological variability (as present in the three source datasets). From the simulation results, the actual false discovery rate (FDR), true positive rate (TPR), and false positive rate (FPR) were computed for 13 DE tools. The comparison between the two gene biotypes was done in two ways: simulating lncRNA data only or simulating lncRNA and mRNA jointly, but analyzing separately.
False discovery rate
FDR refers to the average proportion of incorrect discoveries among SDE genes. It refers to DE methods applied at a user-defined nominal level of the FDR, which is here set to 5%. A good DE tool has actual FDR close to the nominal level, and has high TPR. Our results for simulations starting from the Zhang dataset (Figure 4) indicate that the FDR is not controlled by all DE tools. Only PoissonSeq, limmaVoom, limmaQN, limmaVst, SAMSeq, and DESeq gave acceptable actual FDR levels. The other methods showed severe FDR inflation, particularly when only a small fraction of genes is DE (Figure 4 and Figures A54 and A55 of Additional File 1). FDR levels may even exceed 50%, which means that more than half of the SDE called genes may be false discoveries. For most DE tools, the FDR control was better with increasing number of replicates. The gene biotype was also influential, with considerably poorer performance for lncRNAs. When starting from the NGP Nutlin source dataset, the results were better (Figure 5 and Figure A63), with good FDR control for all DE tools, even with small sample sizes. Only for simulations with 5% of true DE genes, the FDR control was lost (Figure A63). The difference in FDR control between the Zhang and NGP Nutlin datasets is likely explained by their intra-group variability (see Table 2, Figure A1). In line with this, GTEx has intermediate homogeneity and the FDR control in simulations is somewhere in between the extreme Zhang and NGP Nutlin datasets (Figure A65). In Section 4.5.1 of Additional File 1, we separately demonstrated the actual FDR and TPR of DE tools for low and high abundant mRNAs based on a simulation starting from the GTEx data. In general, the tools showed relatively worse performance for low abundant mRNAs (high FDR and low TPR) as seen for lncRNAs.
True positive rate
The TPR, which is also known as sensitivity, is the average proportion of SDE genes among the true DE genes. The TPR should be sufficiently large, otherwise one cannot expect to find many of the true DE genes. Our simulation results indicated that the TPR increases with increasing number of replicates (Figure 4, Figure A58, and Figure A63). PoissonSeq and DESeq showed substantial smaller TPR than the other tools, even in simulations with more than 20 replicates per group. Because of the trade-off between FDR and TPR, a large TPR is expected for DE tools with a large actual FDR. This was observed for edgeR tools, QuasiSeq, and DESeq2. The tools that showed better capability of controlling FDR (limma tools and SAMSeq), also had TPR nearly equal to the TPR from liberal tools. The results also showed that a slightly higher TPR was obtained when more DE genes are present (Figures A56 and A57), and that the sensitivity for detecting DE mRNA was larger than for lncRNA. Similar conclusions can be drawn from the simulations using the more homogeneous datasets; NGP Nutlin and GTEx simulations result in higher TPR (Figure 5 and Figures A63 and A65 in Additional File 1).
Finally, analyzing simulated counts using the default filtration setting of DE tools (particularly DESeq2 and PoissonSeq) resulted in slightly improved performance at a cost of losing a considerable proportion of lncRNAs (Figure A64 of Additional File 1).
False positive rate
The FPR refers to the probability of calling a gene SDE in a scenario with no DE genes at all. Results are shown in Figure A53 of Additional File 1. They demonstrate that all DE tools resulted in a FPR of less than 1%. The results were similar for both gene biotypes (mRNAs and lncRNAs), except for a slightly higher FPR for lncRNAs than for mRNAs. The FPR was generally larger for methods relying on the negative binomial distribution. This finding is in line with conclusions from a previous comparative study17, in which it was concluded that the number of false predictions of differential expression from DE tools (majority of these DE tools are also the part of our study) is sufficiently low even for genes with low counts (gene expressed in the lower 25%).
Simulation of lncRNA expression data only
Results presented up to this point came from simulating, normalizing and analyzing lncRNAs and mRNAs together. Of note, joint analysis of the two gene biotypes may affect results. For example, estimates of gene specific dispersion parameters for negative binomial models are often done by sharing information across all genes using empirical Bays strategy26,32,37, and hence the results for lncRNAs depend on mRNA read counts and vice versa. In addition, adjusted P-values aimed at controlling FDR are calculated taking into account the total number of genes included in the analysis48. Therefore, we also evaluated the performance of the DE tools with only lncRNA data, using the same simulation procedures. Our conclusions remain the same. The results are shown in Figure A62 of Additional File 1. The FDR control is generally worse when analyzing lncRNA separately, particularly for small replicate sizes. Only a small reduction in TPR is observed.
Web application
All simulation results can be consulted and visualized with a web application: http://statapps.ugent.be/tools/AppDGE/.
Conclusions
Our study evaluated the performance of widely used statistical tools for testing differential gene expression (DGE) in RNA-seq data, with separate analysis of mRNA and lncRNA. Several gene expression studies indicated that the expression of the majority of lncRNAs is characterized by low abundance2,10,12,13, high noise11, and tissue-specific expression10. These characteristics are very challenging for DE tools, and may potentially negatively affect tool performance14,15.
Concordance analysis of the DE tools across 6 diverse RNA-seq datasets revealed that the DE tools lack strong consensus on identifying a set of significantly differentially expressed (SDE) genes, gene ranking, fold-change estimates, and handling low count genes. For datasets with a small number of replicates and/or heterogeneous replicates, the disagreement is even worse. Lower concordance was observed for lncRNAs than for mRNAs. In particular, limmaQN, baySeq, and DESeq (also PoissonSeq and SAMSeq to a smaller extent) showed in general lower concordance with other DE tools and also exhibited data dependent performance. In contrast, edgeR, DESeq2, limmaVoom(+QW), and limmaVst showed better agreement with other DE tools and consistent characteristics across datasets.
Results of our large non-parametric simulation study revealed that there are substantial differences among methods with respect to FDR control and sensitivity. The DE tool performance is strongly affected by the sample size, biological variability, and biotype of the transcripts. FDR control at the nominal level is good for all methods for datasets with small biological variability, even with only 5 biological replicates per condition. On the other hand, for datasets that are more variable, the FDR control is only guaranteed for larger sample sizes, and only with the following methods: PoissonSeq, limma tools, SAMSeq and DESeq. All other DE methods result in actual FDR levels far above the nominal level, up to an FDR exceeding 50% for lncRNAs even when studying 40 replicates per condition. Differences among tested tools in terms of sensitivity are not very large, except for PoissonSeq and DESeq, and limmaQN to a smaller extent, which showed the lowest sensitivities among all methods. For highly variable data, a maximum sensitivity of 50% for mRNAs is obtained with 40 replicates per condition. For homogeneous samples, this level of sensitivity is already reached with 4 and 5 replicates per condition.
In the light of promoting reproducible science, it is essential to select a DE tool that succeeds in controlling the FDR level under a large range of conditions. Among these DE tools, one can select one with a high sensitivity. If a DE tool has an actual FDR far larger than the nominal level, many of the discoveries claimed will be false discoveries. If one is prepared to allow for a large proportion of false discoveries, it is generally better to still use a DE tool with good FDR control, but to apply the method at a larger nominal FDR level. In this way, the researcher still controls the error rate. This reasoning implies that the selection of a DE tool may never rely purely on the sensitivity. High sensitivities may be expected from methods with a large actual FDR (and hence poor FDR control). These high sensitivities are illusive in the light of the large proportion of false discoveries.
Combining all results, we conclude that limma with variance stabilizing transformation, limma voom (with and without quality weighting), and SAMSeq control the FDR reasonably well, while not sacrificing sensitivity. These methods end up in the same cluster based on the concordance analysis.
Methods
RNA-seq datasets
A total of 6 datasets are used in this study. All of them are publicly accessible. The Zhang dataset, containing 498 neuroblastoma tumors, was retrieved from Zhang et al.41 (GEO accession number GSE49711). In short, unstranded polyA+ RNA sequencing was performed on the HiSeq 2000 instrument (Illumina). Paired-end reads with a length of 100 nucleotides were obtained. To quantify the full transcriptome, raw fastq files were processed with Kallisto49 v0.42.4 (index build with GRCh38-Ensembl v85). The pseudo-allignment tool Kallisto was chosen above other quantification methods as it is performing equally good but faster. For this study, a subset of 172 patients with high-risk disease were selected, forming two groups: the MYCN amplified (n=91) and MYCN non-amplified (n=81) tumors. Randomly selected 20 samples from each group were used to study concordance analysis of the DE tools, whereas all samples were considered for the simulation study.
Second, the NGP Nutlin-3 dataset (GEO accession number GSE104756) contains 10 biological replicates of NGP neuroblastoma cells grown in 6-well culture flasks treated with either nutlin-3 (8μM racemic mixture) or vehicle (ethanol). RNA was extracted using the RNeasy mini kit (Qiagen), quantified using Nanodrop-100, and quality controlled using FragmentAnalyzer (High Sensitivity RNA Analysis Kit, DNF-472-0500, Advanced Analytical). 100 ng of total RNA was used for TruSeq stranded mRNA library prep (Illumina), followed by paired-end sequencing (2x75 nucleotides) on a NextSeq 500 instrument (Illumina). On average, 20 million read pairs per sample were generated. Raw fastq files were processed with Kallisto2 v0.42.4 (index build with GRCh38-Ensembl v85).
From the Gene Expression Omnibus (GEO) repository of analysis-ready RNA-seq dataset50, we accessed two datasets: Hammer (GEO accession number GSE20895) and Bottomly mRNA-seq datasets (GEO accession number GSE26024). The Hammer mRNA-seq dataset contains the results of a study on the L4 dorsal root ganglion (DRG) of rats with chronic neuropathic pain induced by spinal nerve ligation (SNL) of the neighboring (L5) spinal nerve (two controls and two with L5-SNL induced chronic neuropathic pain)38. The study includes RNA-seq samples of each treatment at two different time points, and in our study, we only used samples from the first time point. The Bottomly mRNA-seq data is an expression set obtained from the brain striatum tissues of two mice strains (it included 10 biological replicates of the C57BL/6J (B6) strain and 11 biological replicates of the DBA/2J (D2) strain)39. Details about library preparation and features of these datasets can be obtained from the original papers cited above.
We also used a subset of RNA-seq dataset from the GTEx (Genotype-Tissue Expression) project40, which provides a large database of human gene expression data with more than 3000 RNA-seq samples from 54 different tissues. For our study, we selected all biological replicates of hippocampus (n1=28) and hypothalamus (n2=30) mRNA-seq counts. All information about this dataset can be obtained from the GTEx project database. Randomly selected 20 samples from each tissue were used for concordance analysis of the DE tools and all samples were considered for the third simulation study.
The colorectal cancer (CRC AZA) RNA-seq dataset contains 3 biological replicates of HCT-116 cells grown in 6-well culture flasks treated with either Azacitidine (1μM) or vehicle (DMSO). RNA was extracted using the RNeasy mini kit (Qiagen), quantified using Picogreen, and quality controlled using Bioanalyser FragmentAnalyzer. 500 ng of total RNA was used for TruSeq stranded mRNA library prep (Illumina), followed by paired-end sequencing (2x100 nucleotides) on a HiSeq instrument (Illumina). On average, 40 million read pairs per sample were generated. Raw fastq files were processed with Bowtie2/Cufflinks (index build with GRCh37-Ensembl v75). The count data can be accessed in Additional File 2.
DE tools and normalization methods
Five normalization methods are evaluated in this study: quantile normalization (QN)31 implemented in limma (limmaQN), Trimmed Mean of M-values (TMM)25 implemented in edgeR, limma (limmaVoom and limmaVoom+QW), baySeq, and QuasiSeq, Medians of Ratios (MR)30 implemented in DESeq, DESeq2, and limma (limmaVst), the goodness-of-fit statistics36 approach in PoissonSeq, and the re-sampling technique23 implemented in SAMSeq. To examine the effect of normalization on DGE analysis, we applied moderated t-test of the limma package in combination to each of the five normalization methods. Afterwards, the extent of dissimilarity of results is used as an indicator of normalization effect on DGE analysis.
The fourteen DE tools that are considered in this paper are summarized in Table 1. All methods are available as R software packages16. Most methods were applied using their default settings, except for DESeq2 and PoissonSeq, which required certain settings to be changed to avoid filtering out too many low abundant genes (such as lncRNA). In particular, in DESeq2 the independent filtering was disabled in the results function, and in PoissonSeq the filter cutoffs for the sum and average counts of genes across samples were set to 1 and 0.01, respectively. All DE tools were applied at the nominal 5% FDR level, unless mentioned otherwise. All computations were performed in R version 3.3.216. baySeq was not included in the simulation study due to its slow computation time. The reader can also find the R scripts in the Additional Files 2.
Concordance analysis
The number of SDE genes detected for each dataset are compared among 14 DE tools. For the Zhang and NGP Nutlin datasets, we explored the proportion of mRNAs and lncRNAs within the set of SDE genes. For the other datasets, we explored the proportion of low and high expressed mRNAs among the set of SDE mRNAs. In particular, we divided the genes into 4 equally large groups based on the quartiles of their average normalized read counts across samples (DESeq normalization). To study the concordance in SDE calling between any two DE tools, we calculated the extent of overlap as the proportion of SDE genes commonly detected by the two tools. To rank genes in the order of significance of differential expression, taking into account the biological importance of the significance we calculated the n-score46, where ϕi is the absolute value of the estimated LFC for the ith gene and pi is the raw P-value, except for SAMSeq, for which pi is replaced by , where wi is the Wilcoxon statistic23. For baySeq, pi is the estimated posterior probability of differential expression (estimated Bayesian False Discovery Rate, BFDR)?. Afterwards, we used Spearman’s rank correlation to evaluate the degree of agreement among DE tools with respect to gene ranking.
Non-parametric simulation methods
SimSeq47 simulates a matrix of RNA-seq read counts by sub-sampling samples from a source RNA-seq dataset. It creates a set of DE genes based on prior information from analyzing the source dataset (which has larger number of replicates than the simulated expression data). Details about the method are given in Benidt et al.47 and in Section 4 of Additional File 1. Results from DGE analysis of the source RNA-seq dataset is available in Table A4 of Additional File 1. Three sets of simulations were performed, each starting from a different source dataset: the Zhang, NGP Nutlin, and GTEx datasets. Gene expressions are simulated containing a given proportion of true DE genes (ranging from 0 to 30%) from a total of 10,000 genes. Each gene in every simulated count matrix has a total of at least one expression in each group. Also, counts are simulated for varying numbers of replicates per group (2 to 40, 2 to 5, and 2 to 14 for simulations that start from the Zhang, NGP Nutlin, and GTEx datasets, respectively). For joint simulation of mRNAs and lncRNAs (for Zhang and NGP Nutilin), each simulated count matrix includes a number of lncRNAs and mRNAs in the set of true DE genes and non-DE genes, proportional to their numbers in the source dataset (for the Zhang data 70% mRNA and for NGP Nutlin data 65% mRNA). We ran 100 independent simulations for each scenario. On each simulated dataset, we applied the various DE tools and calculated P-values, adjusted P-values, LFC, and other relevant statistics. These statistics were subsequently used for computing the comparative performance metrics.
For a given fraction of true DE genes, the SimSeq method generates data for which it is known which genes are truly DE. To each simulated dataset all DE tools are applied and genes are called SDE if their FDR adjusted P-values are smaller than the nominal FDR level (5%, unless mentioned otherwise). For a single simulated dataset the observed false discovery proportion (FDP) is calculated as FDP = FP/(FP + TP), where FP and TP are the numbers of incorrect (false) and correct (true) rejection of the null hypothesis, respectively. The actual False Discovery Rate (FDR) is then computed as the average of the FDPs over all simulations. When the FDR is computed for a scenario with no DE genes at all, the FDR is generally known as the false positive rate (FPR). For this scenario, the SimSeq procedure simply samples replicates from only one group of the source dataset and randomly assigns the samples to one of two arbitrary mock groups. The actual True Positive Rate (TPR), which is also called sensitivity, is defined as the average of the true positive proportions (TPP) across independent simulations, TPP = TP/(TP + FN), where FN is the number of incorrect decisions to fail rejecting the null hypothesis.
The quality of simulated data was examined using the quality metrics proposed by Soneson and Robinson51 and implemented by their countsimQC R package (version 0.5.2). The metrics evaluate the average expressions of genes, variability, mean-variance relationship, correlations among replicates, correlations among genes, and fractions of zero counts. The quality assessments were positive in all aspects, and the reports generated by the countsimQC R package can be found in Additional File 2.
Web application
A web application is developed, using the R Shiny package (version 1.0.5)52. This user interface allows the user to consult and visualize all detailed results from this paper. It can be reached at http://statapps.ugent.be/tools/AppDGE/.
Additional Files
Additional results (tables, figures, and discussion) are compiled in a separate file with name Additional File 1. Additional File 2 includes R source codes used in this study, reports for quality assessment of simulated data, and CRC AZA count dataset. It is accessible through a github repository with name “Additional-File-2” (https://github.com/CenterForStatistics-UGent/Additional-File-2.git). We created a web tool that assists researches to choose the best DE tools for a given RNA-seq study based on our simulation based comparison of 13 DE tools. It can be accessed with a link http://statapps.ugent.be/tools/AppDGE/.
Author’s contributions
O.T., J.V., and P.M. conceived and supervised the study, A.A. analyzed and reported the results, K.P. prepared the initial report, A. A. made the final draft and the web took, C.E. prepared the data. All authors reviewed and approved the final version of the manuscript.
Funding
This study was sponsored by a UGent Special Research Fund Concerted Research Actions (GOA grant number BOF16-GOA-023).
Acknowledgments
We are grateful to Janssen Pharmaceutica (Belgium) for providing the cell line CRC AZA RNA-seq data. We are also grateful to Matthias Fischer for providing the raw Zhang data, Steve Lefever for creating Figure 1 (the general workflow of this study).