Understanding functional consequences of type 2 diabetes risk loci using the universal data integration and visualization R package CONQUER

Background Numerous large genome-wide association studies (GWASs) have been performed to understand the genetic factors of numerous traits, including type 2 diabetes. Many identified risk loci are located in non-coding and intergenic regions, which complicates the understanding how genes and their downstream pathways are influenced. An integrative data approach is required to understand the mechanism and consequences of identified risk loci. Results Here, we developed the R-package CONQUER. Data for SNPs of interest (build GRCh38/hg38) were acquired from static- and dynamic repositories, such as, GTExPortal, Epigenomics Project, 4D genome database and genome browsers such as ENSEMBL. CONQUER modularizes SNPs based on the underlying co-expression data and associates them with biological pathways in specific tissues. CONQUER was used to analyze 403 previously identified type 2 diabetes risk loci. In all tissues, the majority of SNPs (mean = 13.50, SD = 11.70) were linked to metabolism. A tissue-shared effect was found for four type 2 diabetes-associated SNPs (rs601945, rs1061810, rs13737, rs4932265) that were associated with differential expression of HLA-DQA2, HSD17B12, MAN2C1 and AP3S2 respectively. Seven SNPs were identified that influenced the expression of seven ribosomal proteins in multiple tissues. Finally, one SNP (rs601945) was found to influence multiple HLA genes in all twelve tissues investigated. Conclusion We present an universal R-package that aggregates and visualizes data in order to better understand functional consequences of GWAS loci. Using CONQUER, we showed that type 2 diabetes risk loci have many tissue-shared effects on multiple pathways including metabolism, the ribosome and HLA pathway.


BACKGROUND
In the past decades, numerous genome-wide association studies (GWAS) have been performed to understand the genetic contribution of traits. While GWASs have provided valuable insight into putative mechanistic pathways, the way the identified risk loci exert their effect on traits remain largely unclear. For type 2 diabetes (T2D), several large meta-GWASs have been performed to understand the genetic drivers of T2D [1][2][3]. In general, GWAS associated loci are not limited to coding regions but are frequently found in intergenic regions [4]. As such, inferring how risk loci influence genes and their downstream pathways remains often unclear, especially for loci in non-coding regions.
To increase the understanding of those variants, an integrative approach is required where the effects of variants are investigated at a multitude of molecular levels.
In recent years, the number of rich open source biological data sets and repositories has tremendously increased, including GTExPortal [5], Epigenomics Project [6], 4D genome database [7] and genome browsers such as ENSEMBL [8]. Extracting, combining and analyzing relevant biological information from these public datasets is complicated and time-consuming. Platforms that integrate such data exist [9,10], but are often online, miss intuitive user experience or contain outdated data or genome builds. To provide researchers with a easy to use interface with the latest data to comprehend the effects of variants, we developed an R-package named CONQUER ('COmprehend fuNctional conseQUencEs R'). Given a single SNP or multiple SNPs, CONQUER allows the user to efficiently extract relevant biological information from various repositories/databases and represents the information through insightful and interactive visualizations. Additionally, CONQUER links SNPs with biological pathways trough enrichment of the associated genes. Here, we use CONQUER to investigate the 403 risk loci associated with T2D in more detail. With CONQUER we identified T2D risk loci that influence the expression of genes in diabetes-relevant tissues.

CONQUER: an universal R-package for GWAS loci
CONQUER is a universal tool that retrieves and visualizes a multitude of public data associated with any SNP of interest. The package can be used both for single and multiple SNPs. In both cases, CONQUER collects data about a SNP from various public databases and stores the data locally per SNP in a file. Data is collected on multiple levels, including expression-, methylation, metabolomics-and protein QTLs, chromosomal interactions, histone modifications and GWAS catalogue (see methods).
All tissues included in GTEx can be investigated with CONQUER. When multiple SNPs are investigated, CONQUER will find shared pathways across the SNPs investigated (see methods). Results are integrated in an interactive offline web interface for the analysis of multiple SNPs (https://github.com/roderickslieker/CONQUER). Altogether, CONQUER has two separate views 1) where in-depth analyses of single SNPs can performed and 2) where multiple SNPs and their aggregated consequences can be investigated and linked to biological pathways. Here, CONQUER was used to analyze 403 T2D-associated SNPs.

Type 2 diabetes-associated eQTLs are tissue-shared
From the most recent meta-GWAS, 403 T2D-associated SNPs were obtained [1]. Out of those, 17 SNPs were associated with in total 23 unique pQTLs (22 trans, 1 cis). Of those, nine were associated with the immune system (REACTOME, P=0.01), including IL17RC, ICAM1, SAA1, ULBP1, C3, DSG1, CFI, IL18RAP, MBL2. Two trans pQTLs were involved in cholesterol metabolism, LPA and ANGPTL3. Of note, the LPA protein was the single cis signal and associated with rs474513 (P=8.27·10 -37 ). Although this variant is an eQTL in 17 tissues for SLC22A2, it was an eQTLs for LPA in the liver (P=1.29·10 -5 ). SLC22A2 encodes the organic cation transporter 2 gene (OCT2) which is involved in the uptake of the glucoselowering drug metformin in the kidneys [11] and LPA encodes the lipoprotein A protein which is thought to be atherogenic [12].
Next, SNPs were investigated in gene expression data of tissues relevant in the etiology of diabetes (subcutaneous and visceral fat, sigmoid-and transverse colon, liver, skeletal muscle, pancreas, pituitary, terminal ileum of the small intestine, stomach, thyroid and whole blood) from healthy individuals. Of the included tissues, sample sizes range from N = 187 (terminal ileum) to N = 803 (skeletal muscle). Characteristics are shown in Table 1. The percentage males was relatively equal across tissues (63.1% -72.1%, Table 1) with the majority middle-aged (50-69 years, Table 1). For the eQTL -eGene analysis, CONQUER retrieved 348 SNPs. Fifty-five SNPs were excluded because they were not a significant eQTL (33 SNPs) or the variant ID was not present (23 SNPs). Sample sizes are correlated with the number of significant eQTLs (R 2 =0.91). We take this into account when evaluating the results by applying a liberal threshold (P≤0.001) and by assessing the normalized effect sizes (NES) across tissues. Using the 348 SNPs, cis-and trans genes were calculated with the GTEx API. After applying a threshold (P ≤ 0.05), sets of co-expressed genes were determined after which all included genes (eGenes and co-expressed genes) were clustered. Out of the 348 SNPs, 214 SNPs were significant. This resulted in 6664 calculated eQTL -eGene pairs across tissues (Fig.1a).
The second most enriched process was genetic information processing. The ribosome pathway was enriched in eleven tissues (Fig. 3a). The modules in the various tissues that were enriched for the ribosome pathway all had varying numbers of associated eQTLs (n = 1-5) and eGenes ( Fig. 3b, Fig. 3c). However, all modules shared a common eQTL -eGene pair, namely, rs12719778 and RPL8. The highest NES of rs12719778 on RPL8 was observed in whole blood (NES = -0.10, P = 8.46·10 -13 , Fig. 3d). In nine of the eleven tissues in which the ribosome pathway was enriched, the modules also contained rs12920022 and RPL13, which had the strongest NES in skeletal muscle (NES = -0.26 , P = 3.36·10 -26 , Fig. 3e).

In-depth analysis of rs601945 shows an association with primarily HLA genes
Among the enriched pathways, multiple pathways were immune-related (i.e. Th 17 cell differentiation, Th 1 and Th 2 cell differentiation). All immune-related pathways that were enriched were linked with rs601945 and HLA-DQA2. As such, we explored the observed effect of rs601945 on the HLA gene HLA-DQA2 in more detail (Fig. 1b). Rs601945 is located in an intergenic region and is in LD (R 2 ≥ 0.  18·10 -16 , Fig. 4e). As rs601945 influences primarily HLA genes that are involved in many biological pathways, rs601945 was linked to seven pathways in multiple tissues. That is cell adhesion molecules pathway (Fig. 4f) in all twelve tissues, to five pathways, including phagosome and various immune pathways in eleven tissues (Fig. 4g, Fig. 4h) and linked to one pathway, intestinal immune network for IgA production in ten tissues (Fig. 4h).

DISCUSSION
In this study, we developed an R-package that aided us in understanding the functional consequences of T2D-associated SNPs. The R-package, called CONQUER collects up-to-date data, directed by SNPs of interest from a multitude of databases and repositories and analyses and visualizes the data. In contrast to previous studies that had similar approaches [12,13], we have developed open-source software that is available as an R-package where only the SNPs and tissues of interest have to be specified and that can be used with minimal programming experience.
With CONQUER we developed a tool that is universal and versatile as it can be used for various diseases and phenotypes where SNPs are of interest. Because we included data from multiple sources of various molecular levels, it provides researchers with a broad range of information that aids them in understanding their phenotype of interest. Additionally, CONQUER expands the search space of consequential effects by including co-expressed genes of eGenes which might reveal up-and downstream consequences. With increasing amounts of data, the complexity also increases. To maintain clear overview of the data, we implemented two separate views 1) where in-depth analyses of single SNPs can performed and 2) where multiple SNPs and their aggregated consequences can be investigated and linked to biological pathways. CONQUER is dependent on the availability of the Application Programming Interfaces (API) to access databases (GTEx, Ensembl and LDlink). This is a strength, as the latest versions of these databases will always be accessed without changing the programming structure of CONQUER. However, if API access itself is changed or the databases are discontinued, then, updates to CONQUER are required. In contrast, access to the static data sources (e.g. meQTLs, miQTLs, pQTLs, chromatin states and chromatin interactions) is more secure as we maintain the source package (conquer.db). In addition, we will regularly update conquer.db as new studies will become available. All eQTLs are calculated with the GTEx API. Within GTEx sample sizes vary substantially per tissues, as a consequence, the number of significant eQTLs is correlated with the number of samples (R 2 =0.91). In the current study we take this into account by applying a liberal threshold and by assessing the NES across tissues. However, specific signatures of tissues with low sample counts might go unnoticed.
CONQUER was used to investigate 403 diabetes-associated SNPs in more detail. T2D is a metabolic disorder, in accordance, most SNPs linked to metabolic pathways. The metabolic pathways as curated by KEGG [11] consists of 1489 genes and is an encompassing term for all pathways that are involved in metabolism. Our results show that SNPs that are directly linked to metabolism do not influence a single metabolic process but are scattered among various metabolic pathways (e.g. oxidative phosphorylation, fatty acid degradation, fructose and mannose metabolism and glycine serine and threonine metabolism). Due to this dispersion of SNPs between numerous pathways it remains difficult to assign groups of SNPs to specific processes in specific tissues. This together with the variety of pathways to which SNPs are mapped shows that T2D has a lot of different points of engagement through which it can originate and progress, which is accordance with heterogeneous nature of T2D [14]. We also linked SNPs to different pathways classified as genetic information processing. As such, proteasome, RNA transport, spliceosome and protein processing in endoplasmic reticulum were pathways to which various SNPs were mapped. Additionally, seven SNPs were mapped to the ribosome pathway. The link between T2D SNPs and the ribosome pathway was observed in eleven tissues. Seven ribosomal genes with predominately negative effect sizes were associated with seven T2D GWAS hits. Although the association between ribosomal content and T2D has extensively been studied [15][16][17], genetic susceptibility to T2D has previously not been linked to a decreased expression of ribosomal genes. Moreover, the hormone insulin and ribosomal content are tightly connected. Insulin stimulates the synthesis of ribosomal proteins in various tissues [18,19] and a loss of ribosomal proteins is associated with an inhibition of AKT phosphorylation activity and the insulin pathway [20]. Rs601945 was highlighted as it influences many HLA genes that are involved in multiple pathways. Rs601945 was associated with the HLA region. The HLA region has previously been associated with T2D [3,21], however, our results reveal that the effects are wide-spread as its association with altered expression of various HLA genes was observed in all investigated tissues.
Interestingly, while the HLA region represents the highest risk for T1D [22], our results are pointing to a connection between HLA-DQA2 and T2D. In addition, our pQTL analyses also highlighted immune response pathways. Our data support that T2D has an immunometabolic component involving, like T1D, members of both innate and adaptive immune response. Altogether, CONQUER revealed three biological main processes that could explain, in part, the association between SNPs and T2D susceptibility. In addition our results show that T2D SNPs influence metabolism through various pathways, that the ribosome pathway is influenced in multiple tissues through different combinations of SNPs and that rs601945 has wide-spread effects as it influences many genes that are involved in multiple immune related pathways.
CONQUER was also used to analyze single SNP effects. Both AP3S2 and HSD17B12 have previously been found in relation to T2D, but in limited number of tissues.
AP3S2 in human pancreatic islets [31] and HSD17B12 in adipose, liver and muscle tissue and, whole blood [3], which are relevant for the treatment of T2D [32]. However, the genetic consequences are not limited to these tissues as our results show. As such, rs1061810 was found to be associated with altered expression of HSD17B12. The effect of rs1061810 on HSD17B12 has previously been described in adipose, liver and muscle tissue and, whole blood [3]. However, our results showed that the influence of rs1061810 on HSD17B12 is not only present in these tissues but in all twelve tissues that we investigated. rs11037579 had lower expression of HSD17B12 in all twelve tissues that we investigated, including adipose tissue. This result corroborates the finding that HSD17B12 expression is downregulated in the adipose tissue of insulin-resistant subjects [23] and plays a role in adipogenesis [24]. The HSD17B12 gene codes a bifunctional enzyme involved in the biosynthesis of estradiol and the elongation of very long chain fatty acids. One of the strongest observed effects was between rs4932265 and AP3S2. AP3S2 is a subunit of the AP-3 complex which is involved in budding of vesicles from the Golgi membrane [25]. AP3S2 has been linked to T2D in six different GWASs [21,[26][27][28][29][30] [31,32]. MAN2C1 binds with PTEN and thereby inhibits its lipid phosphatase activity [31]. PTEN inhibits activation of PI3K-AKT signaling pathway [31,33], a pathway known to be involved in T2D development [33]. As we observe a negative effect size for MAN2C1, it is suggested that in the twelve investigated tissues, for individuals carrying the risk allele of rs13737 the PI3K-AKT signaling pathway could be inhibited in part, by a reduced expression of MAN2C1 through PTEN. This could explain the association of rs13737 with T2D susceptibility. In six tissues we linked rs576123, located in intronic region of the ABO gene to metabolic pathways. While, ABO is at the basis of the ABO blood group system as it indirectly encodes for blood group antigens [34], a recent study has observed an impaired insulin secretion within O blood type subjects. In this study, SNPs located within the first intron have been connected to a reduced activity of the glycosyltransferases encoded by the ABO gene and specific targeting of the ABO gene by shRNA has led to a reduced glucose stimulated insulin secretion [35]. The effect of reduced ABO expression in the other tissues needs to be established.

CONCLUSION
The R-package CONQUER allows efficient integration of multiple datasets. With data on various levels, visualized in a tidy manner, we were able to uncover potential consequences of T2D associated risk loci. As such, SNPs could be linked through various biological mechanisms to insulin resistance and insulin secretion and comprehend the increased T2D risk. Our findings highlight the importance of an integrative approach where risk loci for T2D are not only seen as individual risk factors but also as a network of risk factors. With CONQUER we developed software that does this, uses the latest available data and is easy to use.

CONQUER
CONQUER was developed in R version 3.6.1 and is available from Git D3.js code was integrated in R making use of the htmlwidgets R-package [36] and all tools were integrated into the R package CONQUER.d3. Interactive heatmaps were made using plotly [37]. The interactive circos plot was made with the R-package BioCircos [38]. Interactive tables were generated with the DT package [39].

Data acquisition
The data acquired for CONQUER are based on the human genome reference build GRCh38/hg38. The data is both collected from static sources and Application Programming Interfaces (APIs). The static sources are available in a separate R data package called conquer.db. CONQUER loads this data package when needed. As conquer.db is a separate package it is easily updated with the latest datasets without altering the programming structure of CONQUER. Static data include chromatin interactions, chromatin state segmentations, expression data, transcription factor binding sites, pQTLs, miQTLs.
The chromatin interactions were obtained from the 4D genome database [7]. To have data from multiple tissues (N=31), only IM-PET data was included in CONQUER. Originally this data was based on the human genome reference build GRCh19/hg19. UCSC LiftOver tool [40] was used to lift over the data to GRCh38/hg38. Chromatin state segmentations were obtained from the Roadmap Epigenomics Project for all cell types available (N=127, 15-state model) [6]. Normalized (TPM, Transcript per Million) expression data of all available tissues (N=54) was obtained from GTEx v8 [5]. Missing expression values were imputed with k-nearest neighbor and default parameters of the impute.knn function from the R-package impute [41]. Data of pQTLs [44][45][46][47] , meQTLs [48], miQTLs [49][50][51][52] were acquired from their corresponding references.
The remaining data (linkage disequilibrium, gene information, eQTLs) are obtained from APIs and are collected once the user has given the command. Elementary information about the SNP of interest is acquired from the Ensembl API [8]. The linkage disequilibrium (LD) structure originates from the LDlink API [51]. For both the Ensembl API and LDlink API the population can be specified, by default the population is set on Utah Residents with Northern and Western European Ancestry (CEU) from the 1000 Genomes Project phase 3 [52]. The eQTLs and eGenes corresponding to the SNP of interest are computed making use of GTEx API. By default, GTEx has an eQTL mapping window of one Mb upstream and downstream of the transcription start site of a gene. In CONQUER, we expanded the search space by including genes that have chromosomal interaction with the LD region (R 2 ≥ 0.80) of the leading SNP. CONQUER automatically sends a request to the GTEx API to calculate eQTLs for every available tissue utilizing GTEx v8 [5]. Lastly, phenotype associations are acquired from the GWAScatalog [53]. For every queried SNP, CONQUER generates an RData object containing all previously described data and stores it in a directory the user has provided.

Statistical analyses
First, CONQUER was used to retrieve all the available data of the 403 risk loci associated with T2D [1].
Next, the modularization and pathway enrichment were performed by CONQUER on twelve T2D relevant tissues ( Table 1). It should be noted that GTEx reports normalized effect sizes (NES) as effect of the alternative allele relative to the reference allele. However, a variant associated with T2D can be either the reference allele or alternative allele. Therefore, we investigated the effect size of the risk allele of T2D relative to the other allele based on an additive model. Also, due to the variability of sample sizes of GTEx across tissues and their association with P-values of eQTLs, we use both normalized effect size and P-values for the interpretation of the results. Figures were directly from CONQUER or additionally made using ggplot2.

Modularization and pathway enrichment
With multiple SNPs, CONQUER can modularize SNPs and associate them with biological pathways in tissues of interest (Fig. 5). All tissues in GTEX can be included. Based on the GTEx data, eQTLs and their associating eGenes are selected (P-value ≤ 0.05). For these eGenes, co-expressed genes are identified by performing correlation analyses with imputed GTEx expression data in the corresponding tissues.
Co-expression between genes is assumed when rho ≥ 0.90. Next, the eGenes and their co-expressed genes are hierarchical clustered [54,55] based the distance between genes (1 -rho). The number of modules within the clustered data is optimized by maximizing the globalSEmax of the gap statistic [56] using the cluster R package [57]. Modules of co-expressed genes and eGenes are then tested for pathway enrichment based on KEGG pathways. For each pathway odds ratios and accompanying Pvalues are calculated with Fisher's exact test [58]. If a module does not contain an eQTL or is not enriched for a pathway, it is omitted from the analysis.