Integrative cross tissue analysis of gene expression identifies novel type 2 diabetes genes

To understand the mechanistic underpinnings of type 2 diabetes (T2D) loci mapped through GWAS, we performed a tissue-specific gene association study in a cohort of over 100K individuals (ncases ≈ 26K,ncontrols ≈ 84K) across 44 human tissues using MetaXcan, a summary statistics extension of PrediXcan. We found that 90 genes significantly (FDR < 0.05) associated with T2D, of which 24 are previously reported T2D genes, 29 are novel in established T2D loci, and 37 are novel genes in novel loci. Of these, 13 reported genes, 15 novel genes in known loci, and 6 genes in novel loci replicated (FDRrep < 0.05) in an independent study (ncases ≈ 10K, ncontrols ≈ 62K). We also found enrichment of significant associations in expected tissues such as liver, pancreas, adipose, and muscle but also in tibial nerve, fibroblasts, and breast. Finally, we found that monogenic diabetes genes are enriched in T2D genes from our analysis suggesting that moderate alterations in monogenic (severe) diabetes genes may promote milder and later onset type 2 diabetes.


Introduction 26
Type 2 diabetes (T2D) is a complex disease characterized by impaired glucose homeostasis resulting from 27 dysfunction in insulin-secreting pancreatic islets and decreased insulin sensitivity in peripheral tissues 28 [1]. In addition to environmental factors such as a sedentary lifestyle and poor diet, genetic susceptibility 29 is an important contributor to the development of T2D [2]. Genome-wide association studies (GWAS) 30 have uncovered more than 100 loci that significantly associate with either T2D or glucose-related traits 31 [3, 4, 2]. However, the majority of single nucleotide polymorphisms (SNPs) significantly associated with 32 T2D reside in intronic and intergenic regions rather than protein-encoding regions [5,6]. The results from 33 GWAS suggest an important role for genetic variation that regulates gene expression rather than altering 34 codon sequence [7] and have motivated e↵orts to map the regulatory landscape of the genome [8,9,10]. 35 Indeed, sets of trait-associated SNPs are enriched for variants that associate with gene expression (i.e. 36 expression quantitative trait loci or eQTLs) [11] and that occupy DNAse hypersensitivity sites (DHS) [ Table 1. Significant association between predicted expression and T2D. Bonferroni corrected for all gene tissue pairs tested. When multiple tissues were significant, the top tissue result is shown.

112
To glean insight into relevant biological pathways, we performed a gene set enrichment analysis on the set 113 of FDR  0.05 significant genes and found top Gene Ontology Biological Process (GO:BP) pathways to 114 involve the insulin-secretory pancreatic -cell (e.g. negative regulation of type B pancreatic cell apoptotic 115 process, Supplemental Figure S4). This was also the case when we restricted this analysis to the set of 116 reported T2D genes (Supplemental Table S5). However, we found fatty acid homeostasis to be a top 117 pathway enriched among the set of novel T2D genes implicated in our MetaXcan analysis, underscoring 118 a genetic contribution from variants regulating gene expression in insulin-responsive peripheral tissues 119 (Supplemental Table S6).

120
Enrichment of genes reported for related traits 121 We also explored shared etiology with other complex diseases by comparing our set of MetaXcan-122 implicated T2D genes with sets of genes implicated by GWAS listed in the NHGRI-EBI online catalogue.

131
HKDC1 is a novel T2D gene implicated by our MetaXcan analysis that has been previously implicated 132 in pregnancy-related glycemic traits and is the driver of the observed enrichment for this phenotype 133 (p = 0.044) (Supplemental Table S8). were previously reported genes, 15 were novel genes in known loci, and 6 were novel genes in novel loci.

144
Moreover, the direction of e↵ect was consistent across replicated associations ( Figure 2).

157
Enrichment in diabetes relevant tissues 158 We sought to investigate the role of di↵erent tissues in the pathogenesis of T2D by looking at the 159 enrichment of significant associations in each tissue. We used the average significance (represented by 160 the squared Z-score averaged across all genes) as a measure of enrichment but recognized the need to 161 account for di↵erential power to detect associations given the di↵erent sample sizes used in the training of 162 di↵erent tissue models. The enrichment increases as sample size increases (Spearman's rank correlation 163 ⇢ = 0.887, p = 4.69 ⇥ 10 16 ) ( Figure 3). As expected, we found liver, pancreas, subcutaneous adipose 164 tissue, and skeletal muscle ranked higher than other tissues with similar sample size.

165
However, when we examined individual genes, many of established T2D genes (e.g. TCF7L2, WFS1,  B. Significance of top T2D-associated genes across tissues. Z-scores of the association between predicted expression levels and T2D case control status across 44 human tissues is shown. Except for RCCD1, CWF19L1, and AP3S2, genes are associated in only a few tissues indicating context specificity. The size of the circles represent the magnitude of the Z-score. Blue color represents positive association, i.e. increase in expression level associated with increase in T2D risk. Red color represents negative association. The intensity of the color represents the performance of the prediction models (correlation squared between predicted and observed expression levels cross-validated in the training samples). Therefore larger circles indicate more significant associations whereas darker colors indicate higher prediction confidence. Missing circles mean that the association was not performed because of missing model (no good prediction model) or because the prediction SNPs were absent in the GWAS.  . Green circles correspond to a subset of the monogenic gene list, considered to cause diabetes as the main phenotype. We see that monogenic genes are enriched in significant genes (away from the gray line) and that the subset where the phenotype is diabetes is even more enriched (further away from the gray identity line). OMIM (gray +) and ClinVar (red +) list yield similar enrichments. p-values below 10 10 have been capped to 10 10 for better visualization.
MetaXcan provides a principled way to prioritize e↵ector genes in known trait-associated loci. To imple-198 ment this, we defined 68 non-overlapping windows comprising known T2D-associated SNPs (see details 199 in Methods) which we refer to as T2D-loci. We profiled these loci according to the strength, number, 200 and proximity of predicted gene expression associations. We used two thresholds for the multiple test 201 correction: one very stringent that accounts for the total number of tissue/gene pairs (genome-wide 202 p = 0.05/204, 981 = 2.4 ⇥ 10 7 ) and another one more appropriate for a locus-specific analysis that 203 accounts for the total number of tests within the locus (locus-wide p threshold varies by locus). 204 We found that 33 loci had at least one locus-wide significant association (Supplementary Table S9).  At the window comprising T2D-associated SNPs at the JAZF1 locus, we observed multiple tissue-level 212 associations for JAZF1, the reported T2D gene in this region. Decreased expression of JAZF1 in multiple 213 tissues (inlcuding skeletal muscle, adipose, and pancreas) was associated with T2D (Table S1 and Figure   214 5A). In addition to JAZF1, we find that increased expression of HOXA11, an upstream transcription 215 factor-encoding gene, is associated with T2D at the locus-wide level ( Figure 5A).

216
To gain further insight into these associations, we examined the e↵ect of the SNPs that make up the 217 prediction models on the phenotype and on the expression of the corresponding gene. We find that many 218 of the SNPs in the prediction models for JAZF1 fall within eQTL association peaks in these tissues and 219 are themselves significantly associated with T2D ( Figure 5B-C). Moreover, the disease-promoting alleles 220 for these SNPs are associated with decreased expression of JAZF1 ( Figure 5B-C). However, the lead SNP 221 in the JAZF1 prediction models (rs1635852) is also present in the model for HOXA11 expression where 222 the disease-promoting allele associates with increased gene expression ( Figure 5B-D).
The reported gene, CDKAL1, showed no significant association whereas predicted gene expression of 225 nearby genes ID4 and SOX4 associated with T2D ( Figure 5E). Although there were multiple eQTL 226 association peaks evident among the set of SNPs in the prediction models for these genes, there was only 227 one GWAS peak in this region. Moreover, the disease-promoting alleles of the model SNPs within the 228 shared peak associated with a decrease in gene expression 5F-G.

229
AP3S2, PRC1, and VPS33B locus 230 We observed the most tissue-level gene associations at a region spanning three reported T2D genes:

231
AP3S2, PRC1, and VPS33B ( Figure 6A). Although each of these putative T2D genes were supported 232 by our MetaXcan analysis, the most significant associations at this region corresponded to a novel T2D 233 gene, RCCD1, with increased expression of this gene associated with T2D( Figure 6A, Table S1, and

235
The only other gene in this interval supported by at least one genome-wide significant association was 236 the reported T2D gene AP3S2 where increased expression was associated with disease status. Moreover, 237 increased expression of AP3S2 in 29 tissue models associated with T2D at the window-level threshold 238 ( Figure 6A and Supplementary Table S9).

239
The variants underlying the top associations for RCCD1 and AP3S2 are independent from each 240 other as the SNPs constituting the respective predictive models are not in linkage disequilibrium with 241 each other ( Figure 6F). The genetically predicted gene expression values for RCCD1 in brain cortex 242 and AP3S2 in small intestine are also uncorrelated ( Figure 6B). However, the genetically predicted gene  Figure 6A-B). The predictive models underlying these associations share three 246 SNPs in common (rs2290202, rs2285937, and rs3743445) that are associated with increased expression of 247 RCCD1 in brain cortex and decreased expression of PRC1 in pancreas (Figure 6D-E,G). Therefore, the    We performed a large-scale, in silico study of predicted gene expression across a comprehensive set of 267 human tissues to prioritize genes that alter the risk of T2D through regulation of gene expression levels. 268 We corroborated many of the known T2D genes, which supports the role of regulation of gene expression 269 levels as a key mediating mechanism, but also found novel genes both in known loci as well as in completely

305
In our study, we applied MetaXcan to explicitly integrate regulatory genetic information to improve 306 disease gene mapping and overcome key limitations of GWAS and di↵erential gene expression studies [21]. 3) reference panel [37] with ShapeIt2 [38].
where w l,g represents the prediction model weight for SNP l on gene g, l is the standard deviation for 392 SNP l,ˆ g is the standard deviation of predicted expression for gene g,ˆ l is the regression coe cient for 393 the regression of expression on the allelic dosage of SNP l.

394
In our MetaXcan analyses of T2D, we use regression coe cients (ˆ l ) from results from the trans-ethnic 395 meta-analysis of GWAS and the GWAS on T2D from the GERA study. Values for w l,g were generated 396 as described above and available from the PredictDB website (http://predictdb.org).ˆ 2 g is estimated 397 as: (2) X g . We use SNP information from a 1000 Genome Project reference panel (European ancestry) to the 400 compute the variances and covariances of the SNPs used to predict gene expression.

401
Locus analysis of T2D-associated regions 402 We first identified a set of 111 reported T2D genes based on their being the most proximal to T2D- analyses of the trans-ethnic and GERA studies where the Z-score (Z) was given by: where Z i = 1 (P i /2) ⇤ sign( i ), P i is the p-value for study i, w i = p (N i ), N i refers to the sample size 420 for study i, i is the direction of e↵ect in study i, and the overall P-value is given by: