RNA-seq preprocessing and sample size considerations for gene network inference

Background Gene network inference (GNI) methods have the potential to reveal functional relationships between different genes and their products. Most GNI algorithms have been developed for microarray gene expression datasets and their application to RNA-seq data is relatively recent. As the characteristics of RNA-seq data are different from microarray data, it is an unanswered question what preprocessing methods for RNA-seq data should be applied prior to GNI to attain optimal performance, or what the required sample size for RNA-seq data is to obtain reliable GNI estimates. Results We ran 9144 analysis of 7 different RNA-seq datasets to evaluate 300 different preprocessing combinations that include data transformations, normalizations and association estimators. We found that there was no single best performing preprocessing combination but that there were several good ones. The performance varied widely over various datasets, which emphasized the importance of choosing an appropriate preprocessing configuration before GNI. Two preprocessing combinations appeared promising in general: First, Log-2 TPM (transcript per million) with Variance-stabilizing transformation (VST) and Pearson Correlation Coefficient (PCC) association estimator. Second, raw RNA-seq count data with PCC. Along with these two, we also identified 18 other good preprocessing combinations. Any of these algorithms might perform best in different datasets. Therefore, the GNI performances of these approaches should be measured on any new dataset to select the best performing one for it. In terms of the required biological sample size of RNA-seq data, we found that between 30 to 85 samples were required to generate reliable GNI estimates. Conclusions This study provides practical recommendations on default choices for data preprocessing prior to GNI analysis of RNA-seq data to obtain optimal performance results.

triplets in our analysis. In this case ARACNE resulted small number of predicted interactions similar to C3NET. If more interactions wanted to predict with ARACNE, the DPI parameter may be increased knowing the fact that performance decreases.
C3NET: uses RLENET as the first step the in the second step it inly infers the maximum association scored gene pair for each gene. C3NET aims to infer the conservative causal core of gene networks for the sake of highest accuracy instead of inferring the whole network.

Association estimators:
Pearson correlation coefficient (PCC) calculates linear relationship between two random variables, which therefore can be classified as a parametric or linear and correlation estimator. It does not capture nonlinear relationships.
Pearson Based Gaussian (PBG), also called Parametric Gaussian Estimator, estimates mutual information values based on Pearson's correlation with the assumption that the joint distribution is normal. In theory, it can capture linear and nonlinear relationships as a mutual information value.

Spearman Correlation Coefficient (SCC)
is similar to PCC but the data is ranked before calculating the coefficient.
PCC, PBG and SCC showed very similar GNI performances, which is not surprising considering their similarity.
B-spline is based on spline functions and a nonparametric estimator as there are not any assumptions about the data. The spline order k determines the number of bins each of the data points is assigned to. For example, a spline order k = 3 means each data point is assigned to three bins in probability calculations of the mutual information value.

Chao-Shen estimator combines two different approaches, Horvitz-Thompson estimator and
Good-Turing correction of ML estimator.

Normalizations:
The trimmed mean of M-values (TMM) normalization is used as default in the edgR R package that is popular for DE analysis of RNA-seq datasets. It computes a normalization factor for each gene that is applied to library size [1].
Relative Log Expression (RLE) implemented using edgeR package and it is the equivalent of the normalization in the DESeq R package that is also popular in RNA-seq DE analysis [1].

Variance Stabilizing Transformation (VST)
implemented with the popular DESeq2 R package. It transforms data on the log2 scale, which has been normalized with library size. VST generates approximately homoscedastic data that means having constant variance independent of varying mean values. Since VST need integer inputs, before inputting the values to VST function, we multiply the values by 10000 and round them when we take the Log2 of raw scale values.

Quantile normalization (QN) is mostly used in microarray gene expression datasets.
QN aims to make the distribution of gene expressions for each array in a set of arrays the same.

Statistical metrics to evaluate the performance of GNI methods
The most widely employed statistical measures for the performance assessment of GNI algorithms are precision, recall (sensitivity), specificity, accuracy and F-score [3]. They are all According to the hypothesis of using the literature, we accept the predictions in the literature as TP and if not in the literature as FP. We do not make any other assumptions and do not measure anything about the non-predicted edges. Because we know that the literature is incomplete and contains the interactions of all the cell conditions. Therefore, any metric that contains negative ratio is not suitable to asses based on the literature. We are also aware that TP and FP results are not the absolute correct measurements but they relatively give an estimate on the performance.
The only metric that does not include negative ratio is precision and used as the main metric for performance evaluation of the analysis. Precision tells us how accurate is the result based on only the predicted set of interactions. It does not give an absolute accuracy measurement but provide a relative one, which appears to be the most suitable metric to use with the literature information.
Along with the precision, we also measured the p-value based on hyper-geometric test that measures statistically how significant is the number of overlap with the literature. However, as demonstrated from the results, using this measure alone may be misleading. Because it favors the GNI algorithm that predicts largest number of interactions, which is RELNET (RN) that is already well known to provide worst GNI performance, which was also confirmed in this study. In order to detect the overlapping of the predicted network with the literature the you can run the following: overlaps <-ganet.ComLinks(predictedNetwork, alldatabases). Make sure that the first predicted network matrix has at least two columns and its first and second columns have gene pairs of each interaction. For further usage of ganet for statistical overlapping analysis using hyper-geometric or Fisher's exact test, the package help files have detailed information. Below, we present a very simple usage of the ganet package for performance evaluation.
Tutorial to run ganet R package: # Following code can be tried with a simple copy paste action in an R session.
# install ganet directly from web within R session install.packages("https://github.com/altayg/ganet/raw/master/ganet_2.2.tar.gz", type="source", repos=NULL) #to load to the current R session to use library(ganet) # you can use your own predicted network as to column matrix or for trial run the following # for the example of the predicted network. data(ganet.ex.net) # load the unique genes list as a vector for the example predicted network. In the case of a # real application, this gene list must correspond to the genes when initial analysis started # from the row dataset. data(ganet.ex.genes) #this returns a vector that contains all the performance metrics such as precision and p-val.