Flaver: mining transcription factors in genome-wide transcriptome profiling data using weighted rank correlation statistics

Background Mining key transcription factors (TFs) in genome-wide transcriptome profiling data has been an active research area for many years and it has been partially solved by mathematically modelling the ranking orders of genes in the target gene-set for the TF of interest in the gene-list ranked by expression values, called gene-set enrichment analysis (GSEA). However, in some application scenarios the gene-set itself also has a rank attribute, such as the putative target gene-set predicted by the Grit software and other alternatives like FIMO and Pscan. New algorithms must be developed to analyze these data properly. Methodology/Principal Findings By implementing the weighted Kendall’s tau statistic, we proposed a method for genome-wide transcriptome profiling data mining that can identify the key TFs orchestrating a profile. Theoretical properties of the proposed method were established, and its advantages over the GSEA approach were demonstrated when analyzing the RNA-Atlas datasets. The results showed that the top-rated TFs by our method always have experimentally supported evidences in the literatures. Benchmarking using gene ontology (GO) annotations in the AmiGO database indicated that the geometry performance (SQR_P) of our method is higher than GSEA in more than 14% of the cases. Significance The developed method is suitable for analyzing the significance of overrepresentation of ranked gene-sets in a ranked gene-list. A software implementing the method, called “Flaver”, was developed and is publicly available at http://www.thua45.cn/flaver under an academic free license. Author Summary Identification of the regulation roles of TFs in the transcriptome is fundamental in understanding various biological processes. Improve the performance of the prediction tools is important because accurate TF-mining in transcriptome data can finely improve the efficiency of wet-lab experiments. Also, genome wide TF-mining can provide new target genes for transcriptome regulation analysis in system biology perspective. This study developed a new TF-mining tool based on weighted rank correlation statistical method. The tool has better performance in analyzing ranked gene-set and ranked gene-list than its competitor, the GSEA tool. It can help the researchers in identification of the most important TFs in transcriptional data.


29
A typical computational issue is deciding, giving a biological process, see if a TF performed a valid regulation role by 30 coordinating the transcription factor binding site (TFBS) instances of the genes in the genome [1]. The requirements for completing 31 this task are: (i) perform transcriptome profiling for a particular biological process of interest; (ii) identify the target genes of the 32 TFs; (iii) discover the key TFs statically using data collected in requirements (i) and (ii). High-throughput RNA-Seq techniques 33 are increasingly affordable and produce massive amounts of gene differential expression data and solved requirement (i) readily 34 [2,3]. Numerous state of art TFBS prediction tools such as Grit [1], FIMO [4] and Pscan [5] were available, as well as the success

37
Smirnov test variant, the GSEA tool [11], have partially solved the requirement (iii). All of these software require gene-sets and a gene-list as input sources. The input for the 1 st generation enrichment analysis tools are as simple as gene vectors for both gene-39 sets and gene-list [7-10]. The input for the 2 ed generation enrichment analysis tools are the gene-sets which are vectors of genes 40 similar to the 1 st generation tool, and a gene-list ranked by expression values, which is different [11]. However, most of the gene-41 set, obtained from either the TFBS prediction tools like Grit [1], FIMO [4], and Pscan [5], or the Chip-Seq experiment [6], have a 42 rank attribute. Each gene in these gene-sets has assigned and ranked by a p-value or a score by sophistical algorithms which 43 indicates the true or false prediction possibilities. These gene-sets were always very long, containing thousands of entities, and the

46
To overcome the shortcomings of currently available software as described above, a novel TF-mining software "Flaver" was 47 developed. This tool addressed the question of finding the significant associations between ranked gene-sets and ranked gene-list 48 by implementing the weighted Kendall's tau statistic and adopted new ways of constructing the weighting functions. Its application 49 to the human genome has yielded fruitful results, demonstrating its desirability as a TF-mining tool.

66
The weighted Kendall's tau statistics

67
The method developed in this study emphasizes items having high rankings and deemphasizes those having low rankings.

68
The relationships between the gene-set and the gene-list, in terms of agreement in the top ranks, in gene-set enrichment analysis 69 can be measured by a weighted rank correlation. Let S i and L i , i = 1, …, n be the ranks of the gene-set and gene-list, respectively.

70
Further, let (i, R i ), i = 1, …, n, be paired rankings, where R i is a rank entity of L whose corresponding S has rank i among S j, j =

73
Where sgn(x) = -1, 0 or 1, if x <, = or > 0. The v i represented the weighting function which is bounded to [1, n] and range in

77
The weighting function using in the testing is v i = ((n -i + 1)/n) PP , where i ranges from 1 to n and PP ranges from 0.03 to 78 50, which consisted a total of 13 functions ( Fig. 1A shown eight representative weighting functions).

79
Development of the Flaver software

80
Utilizing the weighted Kendall's tau statistics, we developed a software, called "Flaver", for identification of key TFs for a 81 giving transcriptome profile. The tool takes ranked gene-set (specified by the -s option) and ranked gene-list (-i option) as its input.

86
The source code has been deposited into GitHub and is available under academic free license.

87
Performance assessment methods for the Flaver software

88
The TFs annotated to GO terms associated to a specific tissue in the AmiGO databases [16,17], referred to as TFs-gold, were

103
Flaver run took 3 days and the assessment results were presented in Table 1. There was a total of 23 transcriptome profiles in RNA-

104
54T of which either Flaver or GSEA, or both identified TP. The result showed that the Sen of Flaver for 9 out of 23 profiles is 105 higher than GSEA (39%), the Sen of GSEA for 12 out of 23 profiles is higher than Flaver (52%). The Spe of Flaver for 12 out of 106 23 profiles is higher than GSEA (52%), the Spe of GSEA for 10 out of 23 profiles is higher than Flaver (43%). The Pre of Flaver 107 for 11 out of 23 profiles is higher than GSEA (48%), the Pre of GSEA for 6 out of 23 profiles is higher than Flaver (26%). The

108
Acc of Flaver for 11 out of 23 profiles is higher than GSEA (48%), the Acc of GSEA for 12 out of 23 profiles is higher than Flaver 109 (52%). The Err of Flaver for 12 out of 23 profiles is higher than GSEA (52%), the Err of GSEA for 11 out of 23 profiles is higher than Flaver (48%). The F1 score of Flaver for 12 out of 23 profiles is higher than GSEA (52%), the F1 score of GSEA for 9 out of 111 23 profiles is higher than Flaver (39%). The SQR_P of Flaver for 13 out of 23 profiles is higher than GSEA (57%), the SQR_P of 112 GSEA for 10 out of 23 profiles is higher than Flaver (43%). As assessed by the SQR_P parameter, the performance of Flaver 113 outperformed GSEA in more than 14% of the tissues.

115
A comparison of the numbers of TFs identified by Flaver and GSEA for the RNA-54T datasets (FDR < 0.01) showed that 116 each tool produced dramatically different prediction results (Fig 1C). There was only an average of 13.2% TFs in common for both

145
Flaver and GSEA were evaluated based on the parameters of sensitivity (Sen), specificity (Spe), precision (Pre), accuracy (Acc), error rate (Err), F1 score, and geometry performance (SQR_P). Bold 146 number indicated a higher value. A letter "Y" indicated Flaver outperformed GSEA while letter "N" indicated underperformed when evaluated by the SQR_P parameter. String "NA" means not available.

162
The TBX3 is mutated in human ulnar-mammary syndrome and specifically expressed in hepatoblasts, isolated from the developing

174
Comparison of the numbers of GO terms associated with the Flaver and GSEA TFs predictions showed that GSEA identified 175 more GO terms in 14 out of 26 tissues (Fig. 2F). Further investigation of these GO terms showed that the numbers of GO terms 176 directly associated with the function of these tissues for the TFs predicted by Flaver were higher than GSEA, 9 (60%) out the 15 177 tissues were higher, 3 (20%) were equal, and the rest 3 (20%) were lower (Fig. 2E). Comparing of the numbers of GO terms 178 associated with the function of skeletal muscle tissue, which was the top performed tissue by Flaver, with GSEA results showed 179 that a total of 10 GO terms were in the Flaver results, and GSEA only identified 5 GO terms in the top 10 GO terms, respectively 180 ( Fig. 2A and 2B). Comparing of the numbers of GO terms associated with the function of hypothalamus tissue, which was the top 181 performed tissue by GSEA, with Flaver results showed that a total of 5 GO terms were in GSEA predictions, and Flaver only 182 identified 2 GO terms in the top 10 GO terms, respectively ( Fig. 2D and 2C). Although, there were differences between the 183 performance for Flaver and GSEA among different tissues, the overall performance at tissue level as assessed by the numbers of 184 GO terms directly associated with the function of the tissues indicated that Flaver out performed GSEA in 40% of the cases. The

185
GO terms associated with the TFs identified by Flaver were fewer than GSEA, however, the overall numbers of the identified GO 186 terms that directly associated with the function of the tissues were higher than GSEA, indicated that Flaver is more accurate in TFs 187 mining for transcriptome profiling data.

189
Although diverse transcriptome profiling technologies produce raw data with different structure characteristics, one 190 commonality for these techniques is that all the data produced can be transformed into a platform-independent ranking format. A

224
In this study, we reimplemented the weighted Kendall's tau statistic in C ++ programing language and a serial of weight 225 functions have been developed. After adapted the algorithm to transcriptome data analysis, we developed software allowing for 226 identification of key TFs that their target genes are sharing similar ranking orders between the gene-sets and gene-list analyzed.

227
The statistical inference on the key transcription factors is based on comparing the ranked gene-sets and ranked gene-list by an 228 informative top-down algorithm based on weighted Kendall's rank correlation coefficient. The algorithm make sense naturally 229 since the higher-ranking genes in the gene-set tend to be truly TF targets and these genes should be emphasized, on the other hand, 230 the lower-ranking genes in the gene-set tend to be false positives and these genes should be deemphasized.