RNA-seq 2G: online analysis of differential gene expression with comprehensive options of statistical methods

RNA-seq has become the most prevalent technology for measuring genome-wide gene expression, but the best practices for processing and analysing RNA-seq data are still an open question. Many statistical methods have been developed to identify genes differentially expressed between sample groups from RNA-seq data. These methods differ by their data distribution assumptions, choice of statistical test, and computational resource requirements. Over 25 methods of differential expression detection were validated and made available through a user-friendly web portal, RNA-seq 2G. All methods are suitable for analysing differential gene expression between two groups of samples. They commonly use a read count matrix derived from RNA-seq data as input and statistically compare groups for each gene. The web portal uses a Shiny app front-end and is hosted by a cloud-based server provided by Amazon Web Service. The comparison of methods showed that the data distribution assumption is the major determinant of differences between methods. Most methods are more likely to find that longer genes are differentially expressed, which substantially impacts downstream gene set-level analysis. Combining results from multiple methods can potentially diminish this bias. RNA-seq 2G makes the analysis of RNA-seq data more accessible and efficient, and is freely available at http://rnaseq2g.awsomics.org.


INTRODUCTION
RNA-seq is by far the most advanced technology for investigating gene expression (1). It uses massively parallel sequencing to obtain millions of short reads from each cDNA library (2). After the sequencing reads are aligned to a reference genome and transcriptome, the number of reads mapped to each gene is used as a digital measurement of gene expression. Applications of RNA-seq data include, but are not limited to, alternative splicing, gene fusion, and allele-specific expression (3).
However, its most common application is the detection of genes that show differential expression (DE) between sample groups.
Many statistical methods have been applied to DE analysis of RNA-seq data. Some were developed specifically for RNA-seq data, such as edgeR (4) and DESeq (5), and others were adapted from methods previously used for microarray data, such as limma (6) and SAMSeq (7). Methods using the original matrix of gene-level read counts as input often assume that counts have negative binomial distribution across samples while methods using log-transformed data are most commonly based on a normal distribution. Other methods assume Poisson-like, nonparametric, or multivariate distribution. Finally, metaseqR (8) uses a meta-analysis-like process to combine results of multiple methods. Numerous studies have compared different DE methods in terms of their technical similarities and precision of detecting DE genes (9). However, no detailed guidelines are available for data analysts to objectively select an optimal method for a specific data set analysed for a specific purpose.
Two follow-up strategies are commonly utilized once DE genes are identified from an RNA-seq data set. The first is to pick a small number of top DE genes and experimentally validate them using methods like quantitative PCR. The validated DE genes can be further investigated as biomarkers, disease classifiers, or regulators of downstream events. The second strategy is to select a larger number of DE genes and run an analysis to relate these genes to known biological processes, which is often called gene set enrichment or over-representation analysis. If the method selected for DE analysis has a systemic bias towards certain types of genes, the bias can be amplified by the gene set-level analysis. An example of such bias is that on average, more sequencing reads will be mapped to longer genes, which gives them a higher statistical power to be detected as DE genes.
We developed a user-friendly web portal, RNA-seq 2G, for users to choose any of over 25 DE methods and simultaneously run them online. The results of multiple methods can be visualized, compared, and combined online, or downloaded for other follow-up analyses. All DE methods were re-implemented as R functions to support the offline analysis of large or confidential data sets.
Expression data from human tissues from the GTEx project (10) was used in this report to demonstrate our preliminary findings from comparing these methods.
These DE methods were also made available through the RNA-seq 2G web portal. It takes seconds to hours for the web portal to finish running the DE methods on a typical RNA-seq data set (10-20 samples).
Correspondingly, we grouped the methods into four speed groups. About half of them are in the "Fast" group, where each takes less than 30 seconds to finish. The "Medium" methods usually take 0.5 to 5 minutes, and the "Slow" methods take up to 30 minutes. The "Slower" methods require even more time, so users will need to submit an analysis and retrieve its results later through an analysis ID. Other attributes of the DE methods are listed in Table 1. For example, some methods use the original read counts as input while others use log-transformed data. Additionally, about one-third of the methods cannot perform paired tests and some methods do not have their own normalization procedure, making them reliant on the normalization options available via RNA-seq 2G. Another main difference between the methods is their data distribution assumption, which is most commonly a normal or negative binomial distribution. whether the method uses read count and log-transformed data as input; Norm: whether the method has its own normalization procedure; Distribution: the data distribution assumed by the method; Test: the statistical test used by the method; and R Function: the core R function {package} of the method). Also available at the web portal are a detailed user manual and step-by-step instructions for submitting an analysis. Sample data and results that can be loaded for demonstration, or downloaded in various file formats. No user registration is required and an email address is optional and will only be used for notifying users of the completion of their analysis.

Comparison of DE methods
DE methods were compared using a subset of the GTEx data (10), including samples of hippocampus and hypothalamus, two brain subdomains. The rand index of the two groups equals to 0.86, suggesting that they have somewhat similar transcriptomes. To reflect the typical sample size of RNA-seq data sets, six samples were randomly selected from each group for DE analysis. This was repeated 10 times to ensure that the results are consistent. Principal component analysis using p value rankings completely split DE methods based on their assumption about normal and negative binomial data distributions (Figure 1), although methods based on the same distribution still have plenty of disparity. For example, while both edgeR and

Gene-level analysis
DESeq are based on the negative binomial distribution, they had less than 85% overlap between their top-500 DE genes. Nonparametric methods have even less similarity to each other since they are not based on any data distribution and use different statistical tests, from rank sum (Wilcoxon) to rank product (RankProd).
We took a closer look of two genes with high-level expression in both hippocampus and hypothalamus ( Figure 2). SLC6A11 encodes a neurotransmitter transporter. Its expression levels were perfectly split between the two groups and about 5-fold higher in hypothalamus on average. As

Gene set-level analysis
We postulated that gene length is a source of bias during DE analysis. To perform a comparison of short vs. long genes, we selected two groups of protein coding genes from those having at least medium expression levels in both groups. The "short" group includes the 25% of genes having the shortest length (mean = 1.3kb), while the "long" group includes top 25% of genes having the longest length (mean = 5.7kb). On average, a long gene had a 50% higher read count than a short gene, but the short genes had higher expression than the long genes according to length-adjusted read counts.
All DE methods except RankProd favoured longer genes. For example, both DESeq2 and edgeR were 20% more likely to give significant p values to a long gene than a short one. At the gene set-level, the systemic bias of DE analysis is amplified. For example, genes of voltage-gated sodium channel activity (GO:0005248) have an average length over 5kb. It is indeed a favorite of gene set-level analysis according to our previous experiences. To run a simulation, we randomly selected genes from the "long" and "short" groups to form a collection of "pseudo" gene sets. Gene set enrichment analysis was applied to those gene sets and top DE genes ranked by each method. Gene sets of long genes were much more likely to obtain significant results than those of short genes according to most DE methods, such as tweeDEseq (162-folds), DESeq2 (17-

DISCUSSION
Using simulated or experimental data sets, previous studies have compared DE methods of RNAseq data by focusing on their sensitivity and overlapping of DE genes (26)(27)(28)(29)(30). Their results have not conclusively pinpointed a method that out-performed the others under all circumstances. We believe that the best practice of analyzing RNA-seq will only originate from the collective experience of data analysts and scientists after they apply various methods to various data sets, and closely examine the results. RNA-seq 2G will substantially accelerate this process by providing an objective comparison platform.
To develop RNA-seq 2G, we evaluated ~40 methods for two-group DE analysis. Over 25 of them were validated and made available through a user-friendly web portal. It provides at least twice as many choices of DE methods as any previous studies. The cloud-based system in combination with the Shiny server minimized the effort involved in system administration and user interface design. We believe it is a practical paradigm for developing more web-based applications.
The number of DE genes found from the same data set can differ by 2-3 orders when the same p value cutoff was used, reflecting the broad disparity of sensitivity between methods. Therefore, it is more meaningful to compare the methods in terms of their relative ranking of genes. Given that so many different DE methods are available, it is also suggestive that hypothesis testing of individual genes should be applied with caution, in consideration of the objectivity of method selection. The RNA-seq 2G web portal will help users make informed decision about which method to pick via visualizing and comparing global patterns of results from multiple methods.
Methods making the same assumption about data distribution tend to obtain more similar results ( Figure 1). Increasing sample size will improve the similarity of methods, but will not make distinctive methods converge. Scrutinizing individual genes explained the intrinsic disparity between methods as they evaluate different combinations of effect size and consistence of group difference differently ( Figure 2). Data analysts should then select DE methods accordingly, based on which type of DE genes is more biologically meaningful in the systems they are investigating.
Unlike previous studies, we evaluated the impact of DE methods on downstream analysis at the gene set-level. Gene length was confirmed as a source of systemic bias during DE analysis. The bias was further amplified during gene set analysis to a degree that many DE methods favored gene sets of long genes by over 10 times. Such a bias should be carefully dealt with during integrative analysis of multiple data sets. Interestingly, combining results from multiple methods by metaseqR seemingly reduced the bias. The validation of such a strategy requires further investigation. Baseline expression level, GC content, and sequence mapability of genes are potentially other sources of systemic bias.
These factors are related and distinct, and should be evaluated both individually and collectively in future studies.
In summary, the RNA-seq 2G web portal has by far the most comprehensive collection of DE methods, and will make the analysis of RNA-seq data more accessible and efficient. Our future studies will further compare DE methods in terms of their performance on different cell types and sample sizes, as well as their robustness to background noise and outlier samples. The impact of systemic bias on gene set analysis and strategies to remove such bias will also be further evaluated.