GeneFriends 2021: Updated co-expression databases and tools for human and mouse genes and transcripts

Gene co-expression analysis has emerged as a powerful method to provide insights into gene function and regulation. The rapid growth of publicly available RNA-sequencing (RNA-seq) data has created opportunities for researchers to employ this abundant data to help decipher the complexity and biology of genomes. Co-expression networks have proven effective for inferring relationship between the genes, for gene prioritization and for assigning function to poorly annotated genes based on their co-expressed partners. To facilitate such analyses we created previously an online co-expression tool for humans and mice entitled GeneFriends. To continue providing a valuable tool to the scientific community, we have updated the GeneFriends database. Here, we present the latest version of GeneFriends, which includes updated gene and transcript co-expression networks based on RNA-seq data from 46,475 human and 34,322 mouse samples. GeneFriends is freely available at http://www.genefriends.org/


INTRODUCTION
The advent of RNA sequencing (RNA-seq) technology has revolutionized biological research (Emrich et al. 2007;Lister et al. 2008). With RNA-seq we are able to understand the complexity of transcriptome, which has enabled us to connect the information on our genome with its functional protein expression (Ozsolak and Milos 2011). Moreover, gene coexpression networks provide the potential to identify the gene modules (highly connected sub-networks) that could serve as points for therapeutic interventions (Chen et al. 2008;Cheng et al. 2020). There are many methods available to cluster the genes in a gene coexpression network (see the review; Sipko et al. 2018). One of the widely used networkbased approach to predict gene functions is the Guilt by association (GBA) method, GBA works on the principle that genes which tend to co-express with each other are functionally related (Oliver 2000;Molet et al. 2013).
With an increase of more than 2 million RNA-seq samples in SRA/GEO between 2015 and 2020, the number and power of co-expression databases has also consequently increased. (Franz et al. 2018;Wong et al. 2018;Obayashi et al. 2019). To facilitate and promote the usage of co-expression networks, we previously created an online microarray and RNA-seq based co-expression database, entitled GeneFriends (van Dam et al. 2012;vanDam et al. 2015) for human and mouse genes and for human transcripts. GeneFriends has proven successful for gene prioritization and associating function to poorly annotated genes.
Studies employing GeneFriends have focused on diverse topics such as estimating tumorigenic index for cancer initiation and progression , genetic analysis for neurological conditions in humans and mice (Ashbrook et al. 2015), genomics of human metabolic disease (Timmons et al. 2018), development of neuronal subtypes (Memic et al. 2016), genome evolution (Keane et al. 2015), genetics of ageing and complex diseases (Fernandes et al. 2016;Marttila et al. 2020) and cell senescence (Avelar et al. 2020). Therefore, to keep our tool at the forefront of publicly available co-expression databases we have updated the RNA-seq based GeneFriends co-expression database for both human and mouse data. We believe our latest updated version of GeneFriends will be useful for a diverse and large number of researchers to understand the complexity, functions and regulation of the human and mouse genomes. GeneFriends is freely available at http://www.genefriends.org.

GENEFRIENDS UPDATED CO-EXPRESSION DATABASE
In addition to updating the previous GeneFriends co-expression database for human genes and transcripts , we have now added an RNA-seq based co-expression database for mouse genes and transcripts ( Figure 1).

Human and Mouse co-expression gene and transcript data
The new human and mouse co-expression databases were constructed from 46,475 and 34,322 RNA-seq samples, respectively. The updated GeneFriends database contains coexpression data for 44,896 human genes and 31,236 mouse genes. The transcript coexpression data comprises of 145,455 human transcripts and 66,327 mouse transcripts. The biotype of genes and transcripts for both human and mouse data is given in Table 1. One of the unique features of GeneFriends co-expression database are its co-expression maps for non-coding genes like Long non-coding RNA (lncRNA) and microRNA (miRNA) which can be useful in providing the insights for regulatory mechanism of gene expression at both transcriptional and post-transcriptional level. The updated GeneFriends database have coexpression data for nearly 16,450 human and 6,436 mouse non-coding genes.
We have also compared the top 5% of ten randomly selected human genes and their co-expression partners, which are present in both previous version (van  and new updated version of GeneFriends (Table 2). The percentage of the average overlap between the ten genes was 30.5% with standard deviation of 4.97%. This difference between the two versions could be due to the difference in number of samples. The previous version was constructed from only 4133 RNA-seq samples as compared to the updated version which is based on 46475 samples. However, when we compared the functional enrichment of the top 5% co-expressed partners for some of these genes, the overlap was stronger suggesting that although the overlap between the co-expressed partners was low but overall they were associated with similar functional categories (Supplementary Data).

GENEFRIENDS GENE AND TRANSCRIPT DATA COMPARISON
To explore the differences between the gene and transcript co-expression maps in human and mouse co-expression database, we compared the median of Pearson correlation values for each gene/transcript with respect to its co-expression partners across the GeneFriends database. For transcripts, median of different transcripts of the same gene was calculated for doing comparison. A total of 34920 human and 25459 mouse genes and its transcripts were analysed. 78% of human and 70% of mouse genes had more than one transcript. While comparing the co-expression maps of human genes and transcripts, the overall co-expression values of genes were significantly higher than the co-expression values of transcripts ( observed for the mouse co-expression database, where mouse genes had higher correlation coefficients than transcripts ( Figure 2B), although the range of correlation coefficients were not as widely distributed as in humans ( Figure 2B). These results indicated that different transcripts arising from the same gene are often expressed under different conditions and are most likely to play different roles in different processes or sometimes these transcripts may even be non-functional (Li et al. 2014).

PATHWAY ANALYSIS IN GENEFRIENDS
Since the primary purpose of the co-expression database is to determine the function of the co-expressed genes, we investigated the KEGG pathway genes to assess the consistency of the co-expression data with pathway annotations. We compared the number of enriched KEGG pathway genes between top and bottom 5% of co-expressed genes in human annotations present in bottom 20 were associated with immune system and infection ( Figure   3A). After this, we selected the top 20 genes from the GeneFriends database with maximum number of KEGG pathway annotations, and checked which pathways are most enriched in these top 20 genes ( Figure 3B). Here also we observed that the pathways related to metabolism and cell signalling were among the top enriched KEGG pathways annotations.
All these observations from KEGG pathway analysis indicated that genes that are enriched in KEGG pathway often tend to co-express together, underscoring that genes that are coexpressed tend to work cooperatively in the same biological pathways.

VALIDATION OF GENEFRIENDS DATA
To assess the quality of the GeneFriends co-expression database we compared the top and bottom 5% of the genes that are present in some widely used databases. Genes from databases such as GenAge (Tacutu et al. 2018), CellAge (Avelar et al. 2020), T2D-AMP Knowledge Portal and TRRUST (Han et al. 2018) and their co-expressed partners were analysed to ascertain whether or not the genes that are linked to some diseases or processes tend to co-express together. GenAge is a curated database of genes related to ageing (Tacutu et al. 2018). We analysed co-expression data of 298 GenAge genes. The top 5% of GenAge genes present in GeneFriends database had significantly higher number of GenAge genes as their co-expressed partners as compared to the bottom 5% [Median(IQR): Top = 29(21-32); Bottom = 11(9-14)]. Similar trend was observed for 272 CellAge database (a curated database of cell senescence genes) and their co-expressed partners, where top 5% had significantly higher number of CellAge genes co-expressed in comparison to bottom 5% [Median(IQR): Top = 29(23-33); Bottom = 9(6-15)]. We were also interested to see how often genes that are related to some diseases may co-express with each other. To investigate this we analysed 132 type 2 diabetes (T2D) effecter genes from T2D-AMP database (https://t2d.hugeamp.org/gene/effectorGeneTable). We observed that T2D effector genes coexpress with each other as top 5% had significantly higher number of T2D genes with respect to the bottom 5% [Median(IQR): Top = 9(5-15); Bottom = 4(3-8)].
To further validate our observations we also tested transcription factor and their targets from TRRUST database version 2 (Han et al. 2018). TRRUST database is a manually curated database of human and mouse transcriptional regulatory networks. As genes that coexpress with each other may also help in co-regulating each other, hence we postulated that transcription targets should co-express with their respective transcription factors. We GeneFriends co-expression database is successfully able to identify the genes that are coexpressed and co-regulated together.

COMPARISON OF HUMAN AND MOUSE CO-EXPRESSION NETWORKS
We analysed human and mouse co-expression networks from an updated GeneFriends coexpression database to decipher the evolutionary differences and similarities between human and mouse co-expression maps. We compared 24,434 genes that have a homolog in both human and mouse gene co-expression database. In our co-expression database, 14,911 genes were one-to-one orthologs, while the remaining mouse and human homologs had a one-tomany or many-to-many relationship. To understand the impact of duplication events on the divergence of humans and mice, we compared the dN/dS ratios of homologous genes with different types of homology (Fig. 5A). The one-to-one orthologs had the lowest dN/dS ratio as compared to the many to many, which had the highest dN/dS ratio. Next, we compared 14911 one to one orthologs among the top 5% of co-expressed genes. The dN/dS ratio values were divided into four groups to check how the increase/decrease in these values may relate to overlapping between two co-expression networks ( Figure 5B). We observed that the group with the lowest dN/dS values had the highest number of overlapped co-expressed genes. This supported the hypothesis that non-synonymous substitutions influence the conservation of coexpression connectivity (Monaco et al. 2015). Therefore, the more the number of nonsynonymous substitutions, the more conserved is a co-expression network.

CONCLUSIONS
Large-scale gene co-expression networks have proven effective for analysing and discovering new gene functions and associations (Liesecke et al. 2019). With the large amounts of publicly available RNA-seq expression data, many co-expression databases are now using RNA-seq based data to generate co-expression networks. GeneFriends is unique in many ways, however, as compared to other available co-expression databases. The features that makes GeneFriends exceptional are its transcript based co-expression maps and inclusion of co-expression networks for non-coding genes. The GeneFriends database encompasses coexpression networks for about 16,000 and 6,000 non-coding genes for humans and mice, respectively. These transcripts and non-coding gene data based co-expression networks are crucial to understand the regulation of gene expression pattern, alternative splicing and dynamic regulation of transcripts in genome. Furthermore, we validated GeneFriends, and especially the results using curated transcription factor-transcriptional target database showed that genes that are co-expressed with each other also tend to co-regulate each other. Overall, in our latest version of human and mouse co-expression networks we hope to make GeneFriends more unique and valuable to the scientific community.

Generation of co-expression database
Human RNA-seq read counts for 46475 samples were downloaded from the recount2 database (Collado-Torres et al. 2017). Human gene expression data was downloaded with recount Bioconductor package (Collado-Torres et al. 2017) and transcript data was downloaded from recountNNLS R package (Fu et al. 2018). Mouse RNA-seq based read counts were obtained for 34322 samples from ARCHS 4 database with rhdf5 Bioconductor R package (Lachmann et al. 2018). The human samples were aligned against the GRCh38 human reference genome, and mouse samples against the GRCm38 mouse reference genome.
The reads were then normalized by dividing the expression per gene/transcript to the combined expression of all genes/transcripts per sample. Cancer-based studies were excluded to avoid any bias in the co-expression database moreover; cancer-related samples do not generalize well with overall co-expression networks.
To create co-expression maps we used weighted Pearson correlation method . This was followed by constructing mutual rank maps by employing the same approach used in COXPRESdb (Obayashi et al. 2019). We used guilt by association method to create co-expression networks. The new GeneFriends database contains coexpression data for 44,896 human genes and 31,236 mouse genes. The transcript coexpression data comprises of 145,455 human transcripts and 66,327 mouse transcripts. The genes that were not expressed in at least 20% of the samples were excluded from the database. The biotype of genes and transcripts for both human and mouse data was identified using biomaRt.

Functional and Pathway Analysis
We used WebGestalt (Liao et al. 2019) to do the Overrepresentation Enrichment Analysis for each of the gene ontology categories (Biological Process. Cellular Component and Molecular Function). The significance level was determined at FDR<0.05 and the multiple test adjustment was done using the Benjamini-Hochberg method. We verified our enrichment results by repeating the analysis using DAVID's annotation clustering (Huang et al. 2009).
p-value and FDR < 0.05 were considered significant. We also used ClusterProfiler Version 3.14.3 (Yu et al. 2012) to visualize the GO terms (FDR<0.05) obtained from DAVID. For KEGG annotation analysis, genes lists with their enriched KEGG pathway annotations were obtained from the Molecular Signature Database Version 6.2 (Subramanian et al. 2005;Liberzon et al. 2015). The box plot and heat map for KEGG pathway analysis were created using R.

Evolution based Analysis
To identify any differences in the evolutionary conservation of genes present in human and mouse co-expression networks we performed dN/dS analysis. The dN/dS values were obtained from biomaRt release 96.

Statistical Analysis
Mann-Whitney U tests was used to test the significance between the correlation coefficients among top 5% and bottom 5% co-expressed partners of genes and to compare the distribution of dN/dS scores between the human and mouse co-expression database. The median and Inter quartile ranges (IQR) were calculated by R package.     (One to one, one to many and many to many). The Mann-Whitney test showed significant difference between all three comparisons (One to one vs one to many, one to many vs many to many and one to one vs many to many). (B) Comparison of the dN/dS values between the top 5% of human and mouse co-expression gene networks.