Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance

Genome-wide association studies (GWAS) have implicated 58 loci in coronary artery disease (CAD). However, the biological basis for these associations, the relevant genes, and causative variants often remain uncertain. Since the vast majority of GWAS loci reside outside coding regions, most exert regulatory functions. Here we explore the complexity of each of these loci, using tissue specific RNA sequencing data from GTEx to identify genes that exhibit altered expression patterns in the context of GWAS-significant loci, expanding the list of candidate genes from the 75 currently annotated by GWAS to 245, with almost half of these transcripts being non-coding. Tissue specific allelic expression imbalance data, also from GTEx, allows us to uncover GWAS variants that mark functional variation in a locus, e.g., rs7528419 residing in the SORT1 locus, in liver specifically, and rs72689147 in the GUYC1A1 locus, across a variety of tissues. We consider the GWAS variant rs1412444 in the LIPA locus in more detail as an example, probing tissue and transcript specific effects of genetic variation in the region. By evaluating linkage disequilibrium (LD) between tissue specific eQTLs, we reveal evidence for multiple functional variants within loci. We identify 3 variants (rs1412444, rs1051338, rs2250781) that when considered together, each improve the ability to account for LIPA gene expression, suggesting multiple interacting factors. These results refine the assignment of 58 GWAS loci to likely causative variants in a handful of cases and for the remainder help to re-prioritize associated genes and RNA isoforms, suggesting that ncRNAs maybe a relevant transcript in almost half of CAD GWAS results. Our findings support a multi-factorial system where a single variant can influence multiple genes and each genes is regulated by multiple variants.


Association Testing 150
Additive logistic models to account for LIPA expression and MI using different 151 combinations of variants were compared using ANOVA with a likelihood ratio test (LRT) 152 implemented in R. Gender and age were included as covariates in models explaining LIPA gene 153 expression, while sex, age, hypercholesterolemia, smoking, and number of diseased vessels 154 were included as covariates in models explaining MI. Differences in LIPA expression between 155 those with or without a history of MI were calculated using the wilcoxin.test function in R.
156 Bonferroni multiple hypothesis correction was implemented.

Allelic Expression Imbalance (AEI)
158 Allelic RNA expression imbalance (AEI) was assessed using data from GTEx 159 (phe000039.v1.GTEx_v8_ASE.expression-matrixfmt-ase.c1). Candidate variants were 160 subsetted from each individual file, and the deviation of the "REF_RATIO" from the 161 "NULL_RATIO" was plotted for each variant in a given tissue type. Tissue types with 5 or more 162 samples were considered. 169 To assign GWAS hits to target genes, we determine for each of the GWAS SNPs whether it 170 appears as an eQTL or sQTL reported by GTEx, searching all available tissues. Recognizing 171 that often multiple SNPs exist over a genomic region as significant GWAS hits, we consider 172 each one individually in assigning candidate genes and separately assess concordance. We opt 173 not to use COLOC and other existing tools that search for overlapping signal between GWAS 174 variants and QTLs because they make assumptions about the genetic model that are not in line 175 with the multi-factorial system we test here (23); namely, these methods assume a single 176 causative variant or that each variant acts independently. Instead, although we recognize it 177 limits the overlap we are able to detect and biases our sample to variants that are ideal markers 178 (i.e. frequent), we search for exact matches between GWAS and QTL marker variants. In

227
This discordance between variants in the same locus may represent a weakness of our 228 approach that looks for overlap with an individual SNP rather than considering 'colocalization' 229 more broadly. However, it occurs almost exclusively among SNPs that are in relatively poor LD 230 (frequently R 2 < 0.2) and would not be expected to serve as good markers for one another nor 231 be part of the same haplotype. Alternatively, this discordance may represent multiple functional 232 variants in the locus with different target genes. Regardless, our results emphasize the 233 complexity in elucidating functional variants from GWAS results. 240 locus 57 -rs11830157 (KSR2), and locus 8 -rs6544713 (ABCG8; however ABCG5 also 241 annotated by GWAS is not supported by QTL or position) (Table 1)    288 effects, these are typically smaller in size; allelic expression imbalance can serve to distinguish 289 between cis and trans effects, but robust RNA allelic ratios are often not available in GTEx.

290
Notably, ncRNAs are candidate genes for 33 of the 58 loci expanded from 6 loci prior to 291 re-prioritization. For no loci are all candidate genes non-coding. The greatest percent of non-292 coding candidate genes (80%) is observed for locus 20 -rs12190287 and rs12202017 (TCF21).
293 The protein-coding gene TCF21 annotated by GWAS is supported by eQTLs in 24 different 294 tissues while 4 additional non-coding genes are introduced based on eQTL and sQTL analysis.
295 The candidate ncRNAs are located within a 500KB region, with TARID being antisense to 296 TCF21, and like many non-coding genes generally have lower expression levels.  The genomic loci for each these 245 candidate genes often harbor multiple protein-347 coding and non-coding transcripts arranged on both the sense and antisense strands (S2 File).
348 They express an average of 9 transcripts per gene and a maximum of 189 (TEX41-locus 10 -349 rs2252641, rs17678683), with 47% of all transcripts being non-coding (Fig 1B). More than half 350 of the gene loci (161) also contain one or more antisense genes (i.e., located on the opposite 351 strand and overlapping).

352
Using GTEx data, we find RNA expression is associated with one or more genetic 353 variant in the locus with a median of 4 eQTLs (max 32230) among 50% of the 245 candidate 354 genes in 48 different tissues (Fig 1B).

355
With a median of 26 publications and a maximum of 27,497 (APOE), only a handful of 356 these 245 candidate genes have been well studied to date (Fig 1B). Twenty percent (51)

363
Each implicated locus displays an astoundingly complex architecture with multiple
394 Overall low expression of CELSR2 in liver tissue, however means that these ratios are for the 395 most part based on low coverage (median total count 13). Despite the relative consistency from 396 sample to sample, large allelic ratios derived from relatively low counts, as observed here, 397 raises suspicion for systemic sources of bias, e.g. preferential amplification of one allele. To 398 evaluate this further, we considered allelic ratios at nearby SNPs in strong LD (R 2 > 0.9) and 399 weak LD (R 2 < 0.1). As these SNPs are co-located, systemic sources of bias should affect all 400 SNPs in the locus while 'true' biological AEI would be expected only for those variants in strong 401 LD with a functional SNP. We observe AEI for those SNPs in strong LD with the GWAS marker, 402 but not for those in the same region in weak LD, a pattern that is suggestive of 'true' biological 403 AEI and a functional cis-acting variant. 433 Assuming one functional variant in the locus, the beta for each eQTL should correlate with its R 2 434 relative to the highest scoring SNP (33).

435
This approach reveals that the observed eQTLs for a gene often represent more than 436 one regulatory variant, with the exception of FLT1 in Tibial Nerve -represented by only one 437 cluster of variants marked by the GWAS SNP (Fig 3). This result is critical to the correct 438 interpretation of GWAS that would otherwise rely on a single variant rather than considering the 439 combined effect of more than one causative variant.  Testing these additional variants with MI instead of LIPA expression did not yield 473 significant associations (Table 2). However, LIPA expression itself is not associated with MI 474 except when rs1412444 is homozygous minor, which may explain the discrepancy. In looking 475 separately at the associations between the GWAS hit and LIPA expression and the GWAS hit 476 and MI, we find that rs1412444 is associated with increased risk of MI and increased expression 477 of LIPA, but counterintuitively those with two minor alleles and MI exhibit lower rather than 478 higher expression, a pattern that also holds in GTEx although it is only statistically significant in 479 CATHGEN (Fig 4, S4 Figure).

497
To consider how eQTLs for a given gene compare across different tissues, we cluster 498 genome-wide significant eQTLs reported by GTEx for LIPA in a heatmap organized by their 499 pairwise LD (R 2 ), using a colored bar at the top of the heatmap to denote tissue type (Fig. 5A).
500 eQTL SNPs generally cluster by tissue, suggesting distinct regulatory variants in different 501 tissues. However, there are two LD blocks that contain eQTLs in more than half of tissues 502 indicative of genetic variation that acts across different tissue types. Variants detected by 503 GWAS for LIPA appear as a significant eQTLs in a subset of tissues (Table 1), some of which fit 504 with our understanding of CAD pathology (heart, aorta, adipose), others suggest as yet 505 unexplained biological consequences (spleen, pancreas).

517
To consider transcript-specific eQTLs more specifically, for 1mb upstream and 518 downstream of LIPA, we calculated the association between each SNP and expression of both 519 overall mRNA and individual isofoms. The resulting local Manhattan plot displays the 520 association p-values for LIPA transcripts in blood ( Fig 4B)