Shared effect modeling reveals that a fraction of autoimmune disease associations are consistent with eQTLs in three immune cell types

The majority of autoimmune disease risk effects identified by genome-wide association studies (GWAS) localize to open chromatin with gene regulatory activity. GWAS loci are also enriched for expression quantitative trait loci (eQTLs), suggesting that most disease risk variants exert their pathological effects by altering gene expression. However, because causal variants are difficult to identify and cis-eQTLs occur frequently, it remains challenging to translate this bulk observation into specific instances of a disease risk variant driving changes to gene regulation. Here, we use a novel joint likelihood framework with higher resolution than previous methods to identify loci where disease risk and an eQTL are driven by a single, shared genetic effect as opposed to distinct effects in close proximity. We find that 22 approximately 25% of autoimmune disease loci harbor an eQTL driven by the same genetic effect, but the majority of loci do not. Thus, we uncover a fraction of gene regulatory changes as strong mechanistic hypotheses for disease risk.

independent effects in the same region 17,18 . Several methods have been developed to identify 1 pathogenic genes within GWAS loci relying on eQTL co-localization [19][20][21][22] . However, these are limited in 2 resolution to detect situations where causal variants for the disease trait and the eQTL are distinct but 3 in linkage disequilibrium.

4
Here, we present an approach to test if a GWAS risk association and an eQTL are driven by the same 5 underlying genetic effect, accounting for the LD between causal variants. Using data from ImmunoChip 6 studies of seven AID comprising >180,000 samples in total (Table S2) 10 When associations to two traits in a locus -here, a disease trait and an eQTL -are driven by the same 11 underlying causal variant, the joint evidence of association should be maximized at the markers in 12 tightest LD with the (potentially unobserved) causal variant 18,25 . Here, we directly evaluate the joint 13 likelihood that both trait associations are due to the same underlying causal variant ( Figure S1), unlike 14 previous approaches that look for similarities in the shape of the association curve over multiple 15 markers 19,20,26,27 . We expect that when the underlying causal effect is shared, joint likelihood is 16 maximized when we model the same causal variant in both traits; conversely, when the underlying 17 causal variants are different, we expect maximum joint likelihood when we model their closest proxies.

18
We empirically derive the null distribution of the joint likelihood ratio statistic by comparing disease 19 associations to permuted eQTL data. The resulting P value is asymptotically conservative against the 20 null of distinct causal variants, as the likelihoods of two competing models will be further contrasted with 21 increasing sample and effect sizes (Supplementary Notes). We are thus able to directly evaluate 22 whether associations in the same genomic locus for two traits, observed in different cohorts of 23 individuals, are due to the same underlying causal variant.

24
To assess the performance of our method, we benchmarked it against coloc 19 , a well-calibrated 25 Bayesian framework that considers spatial similarities in association data across windows of markers in 26 a locus. We simulated pairs of case-control cohorts with either the same or distinct causal variants 27 driving association in each. Though both methods had excellent performance in simulations where the 28 two distinct causal variants are in complete linkage equilibrium (AUC=0.99 for both when r 2 < 0.5), 29 compared to coloc we found that our method maintained higher specificity even as the linkage 30 disequilibrium between distinct causal variants became high (AUC = 0.92 compared to 0.70 for coloc 31 when 0.7 < r 2 < 0.8; Figure S3, Tables S3 and S4). In practice, our resolution becomes limited at very 32 high LD levels (r 2 >0.8), and we are unable to reliably distinguish between two causal variants in very 33 high LD and a single causal variant associated to both traits. Thus, within these limits, we can 34 accurately detect cases of shared genetic effects between two traits.

35
We first identified densely genotyped ImmunoChip loci showing strong association to each disease 36 from publicly available summary data (immunobase.org; Table 1). These include associations to both 37 Crohn disease and ulcerative colitis, collectively designated inflammatory bowel disease, and to each 38 disease alone 28 . Due to the extensive LD and complex natural selection present in the Major

39
Histocompatibility Locus, we excluded this region from consideration. We next identified genes in a 40 1Mb window centered on the most associated variant in each locus. Consistent with previous 41 observations that eQTLs are frequently found in GWAS loci, we found that all loci but 11 had at least 3 diseases, with the proportion varying from 2/40 (5%) for type 1 diabetes loci to 4/10 (40%) for ulcerative 1 colitis loci (false discovery rate < 5%; Tables 1 and 2). Of these 57 shared effects, 43 pass even the 2 more stringent family-wise multiple testing correction (Bonferroni corrected P < 0.05). Thus, our 3 analysis reveals that in the majority of AID loci, variants causally involved in disease phenotypes do not 4 overlap variants responsible for eQTL signals. Overall, we find that only a small minority of tested 5 disease-eQTL pairs in the same locus show any evidence of a shared association, whereas >75% 6 show evidence of being driven by distinct genetic variants in the same locus ( Figure 1).

7
We sought to explain this lack of overlap between disease associations and eQTLs, despite their 8 frequent co-occurrence in the same loci. In particular, although our method showed good performance 9 in simulated data ( Figure S4), we remained concerned that this lack of overlap may be due to low 10 statistical power in the eQTL data, which come from cohorts of limited sample size. However, we find 11 that even amongst the most strongly supported eQTLs (p < 10 -5 ), <25% show evidence of shared 12 effects with disease associations. Conversely, we find strong evidence for distinct effects for the 13 majority of disease-eQTL pairs, with only a subset of comparisons being ambiguous, suggesting that 14 our method is adequately powered to detect shared effects where they exist (Figures 1a and S11-13).

15
To assess whether power affects the total number of loci, rather than eQTL, that can be resolved, we 16 looked more deeply at our significance threshold settings. We find that more liberal thresholds do not 17 increase the number of true positive results after adjusting for false positive rate, indicating that most 18 loci do not contain any gene with an eQTL consistent with the disease association (Figures 1b and 19 S14). Cumulatively, our results demonstrate that only a minority of AID risk effects drive eQTLs in the 20 three cell populations we tested, which are drawn from diverse lineages of the immune system.

21
We next focused on the subset of 57 disease/eQTL pairs in 41 loci where we could detect strong 22 evidence of a shared effect (Table 2). We find that 51/57 (89%) of effects are restricted to one cell 23 population, indicating that tissue-specific eQTLs are important components of the molecular 24 underpinnings of disease ( Figures S5 and S6). The remaining six effects are detected in multiple cell 25 populations; for example, the multiple sclerosis association at rs10783847 on chromosome 12 is 26 consistent with eQTLs for the transcript of methyltransferase-like 21B (METTL21B) in both CD4 + T cells 27 and CD14 + monocytes, but not with eQTLs for the remaining 38 genes in the immediate locus ( Figure   28 2). Although METTL21B is expressed in LCLs, there is no evidence of an eQTL in this tissue within 29 1Mb from rs10783847. Similarly, for the multiple sclerosis association at rs1966115 on chromosome 8 30 and eQTLs for ZC2HC1A, and for the inflammatory bowel disease association at rs55770741 on 31 chromosome 5 and eQTLs for ERAP2, we detect a shared effect in all three cell populations. In several 32 cases we find tissue-specific shared effects despite strong eQTLs for the same gene in other tissues:

33
for TUFM and inflammatory bowel disease risk at rs12448902 on chromosome 16, we find shared 34 effects in CD4 + and CD14 + but not LCLs, where we see a TUFM eQTL at p = 0.01 (joint likelihood P = 35 0.97). For ZFP90 and ulcerative colitis risk at rs889561 on chromosome 16, we also find shared effects 36 in CD4 + and CD14 + but not LCLs, where we observe a ZFP90 eQTL at p = 0.005 that has a low 37 likelihood of shared effect with GWAS (joint likelihood P = 0.95). Instead, we find evidence of sharing 38 between disease risk and an eQTL for NFAT5 in LCLs. Thus, despite the presence of eQTLs for a gene 39 in multiple tissues, not all these effects are consistent with disease associations suggesting that 40 disease-relevant eQTLs are tissue specific.

42
Among our findings are cases where an eQTL is consistent with associations to multiple diseases. For 43 example, the ankyrin repeat domain 55 (ANKRD55) transcript encoded on chromosome 5 has an eQTL 44 in CD4 + T cells that is shared with proximal associations to multiple sclerosis, Crohn disease and 45 rheumatoid arthritis (Figure 3, all observations are significant after Bonferroni correction). We also find weaker evidence for shared effects between all three diseases and an eQTL for interleukin 6 signal transducer (IL6ST) in CD4 + T cells, which passes the false discovery rate threshold but not the more consistent with associations to both celiac disease and multiple sclerosis ( Figure S8), a CD14 + eQTL 1 for RGS1 on chromosome 1 is consistent with associations to both celiac disease and multiple sclerosis 2 ( Figure S9), and a CD14 + eQTL for UBE2L3 on chromosome 22 is consistent with associations to both 3 celiac disease and inflammatory bowel disease ( Figure S10). In all cases, these are the only genome-4 wide significant disease associations reported in these loci. As we consider each disease association 5 independently, these results indicate that the same underlying risk variants drive risk to multiple 6 diseases in these loci by altering gene expression, consistent with observations of shared effects 7 across diseases 7 .

9
Overall, our results suggest that some autoimmune and inflammatory disease loci are consistent with 10 eQTLs acting in specific immune cell subpopulations, which form strong mechanistic hypotheses for the 11 molecular mechanisms driving disease risk. However, these only account for a small fraction of eQTLs 12 present in disease risk loci; this suggests that abundant caution must be exercised before inferring 13 pathological relevance for an observed eQTL simply due to proximity to a disease association. Strong

33
Simulated dataset. We selected two loci with previously known associations to base our positive and 34 negative simulations: CD58 on chromosome 1, rs667309 associated to MS risk 32 , and ATG16L1 on 35 chromosome 2, rs2241880 associated to IBD 33

44
The pairs of simulated cohorts were used either as positive controls if they simulate the same causal 5 treated as the primary trait (i.e. a disease GWAS) and the other as a secondary trait (i.e. eQTL). Due to 1 the nature of the coalescent forward simulation model, 7.7% of simulated cohorts showed peak 2 association to a SNP in only moderate LD with the specified causal variant (r 2 < 0.8). We kept these 3 cohorts as secondary traits only, to better capture the vagaries of resolution limits inherent in the small 4 sample size of eQTL studies. In total, we generated 1,629 cohort pairs for positive controls simulating 5 the same causal variant and 6,106, 530, and 160 cohort pairs for negative controls for which distinct 6 causal variants are separated by r 2 of 0 -0.5, 0.5 -0.7, and 0.7 -0.8, respectively.

19
As our method works best on dense genotype data, we restricted our analyses to the 188 loci 20 genotyped at high density on ImmunoChip. We excluded the Major Histocompatibility Complex (MHC) 21 locus, due to the complex landscape of selection and resulting complex LD patterns. For each disease,

22
we sought the largest published genetic mapping study and identified genome-wide significant 23 associations reported in the 188 ImmunoChip loci. We note that these reports may contain additional 24 samples, so the associations may not be genome-wide significant in the ImmunoChip studies alone.

25
We also excluded any secondary associations after conditioning on initial results, as these are 26 inconsistently reported across diseases. If multiple independent associations are reported within the 27 same ImmunoChip region for any disease, we divide the region at the mid-point between the reported 28 markers and select lead SNPs in each sub-interval separately.
29 30 eQTL dataset. We examined eQTLs in lymphoblastoid cell lines 23 (LCLs) and primary CD4 + T cells and 31 CD14 + monocytes 24 obtained from healthy donors (Table S2) where * is the most associated SNP for primary trait, ! ( ) and ! ( ) are the likelihood of SNP i being

12
We calculated the likelihood of causal association by approximating the local LD structure with pairwise Thus, given association statistics for primary and secondary traits, = ! , ! , … , ! ! 27 and = ! , ! , … , ! ! , the test statistic simplifies to: The p-value of joint likelihood is estimated by permuting phenotypes of secondary traits as under the 31 trivial null hypothesis that that there is no casual variant for secondary trait in the locus (" ! "). With 32 respect to the more likely null that distinct causal variants underlie association signals of two traits 33 (" ! "), we can show that asymptotically as the non-centrality of causal variant increases, p-values Thus, with large enough sample or effect sizes, joint likelihood test against ! will also reject ! in favor 2 of alternative hypothesis of shared causal variant (" ! "). Further, to evaluate whether this property 3 holds for practical non-centrality values, we examined the negative controls simulating ! in ATG16L1 4 and CD58 loci, specifically, if !"#$ was highly shifted toward 1.0 ( Figure S2) and larger than empirically 5 estimated false positive rates as expected (Table S4).

7
For both simulated and real GWAS data, we applied JLIM to SNPs with data for both primary and 8 secondary traits, present in the reference LD panels, and within 100kb of the most associated marker to 9 disease ("lead SNP"). In ImmunoChip data, the analysis windows were further confined by the 10 boundaries of the fine-mapping intervals. We compared each lead SNP to eQTL data for all genes with 11 a transcription start sites (TSS) up to 1Mb from the lead SNP, and an eQTL association p < 0.05 for at 12 least one SNP in the analysis window. To minimize computational burden, we did not consider SNPs

17
We corrected for multiple tests using false discovery rate (FDR) levels and Bonferroni correction. The

27
Bayesian coloc. We ran Bayesian coloc 19 using default parameter settings on both simulated and real 28 data as described for JLIM. For simulated data, we used colocalization prior p 12 values of 10 -5 or 10 -6 , 29 which are default values for higher sensitivity and higher specificity, respectively. The beta and variance 30 of beta were used for all SNPs in the analysis window in case/control mode. We calculated accuracy as 31 the area under the receiver operator curve (ROC; Figure S3).

33
As ImmunoChip data is only available as summary statistics, we used the minor allele frequencies from 34 non-Finnish Europeans from the 1000 Genomes Project, and quantitative beta and variance of beta 35 calculated on eQTL association data, and a colocalization prior p 12 = 10 -6 . We did not consider the type 36 1 diabetes data, where case/control sample size is limited after excluding affected sib pair data.

38
Estimating the number of disease GWAS loci with consistent eQTL effects. We expect JLIM p-39 values to follow a bimodal distribution with modes close to zero and one when the data support a model 40 of shared or distinct causal effects, respectively. Conversely, under the null model of no cis-eQTL 41 association, we expect a uniform p-value distribution. We can thus estimate the proportion disease-42 eQTL pairs belonging to the null ! , same ! and distinct ! causal variant models from the observed 43 p-value distribution 40 (Figures S11-13). To assess if the strength of the eQTL association influences the likelihood of identifying a shared causal variant, we calculate these proportions for subsets of trait pairs To estimate the number of disease GWAS loci that can be explained by consistent effect of same  (Figures 1 and S14). Specifically, at each JLIM p-8 value cutoff ! , we successively calculated ! : 9 10

19
As the true null is mixture of two nulls, ! and ! , the false positive rate of JLIM against true null 20 ≥ ! ∪ ! can be bounded by using the following decomposition:                     Table 1. Only a minority of disease associations share causal variants with eQTLs across three immune cell subpopulations. We identified 261 disease associations in ImmunoChip regions with at least one eQTL within 100kb of the most associated SNP. Only 41/261 (16%) of these associations show evidence of a shared effect with an eQTL in that region. Thus, while eQTLs are abundant in disease-associated loci, they do not appear to be driven by the same causal variant as the disease association. a We only consider associations reported at genome-wide significant levels and overlapping genomic regions densely genotyped on ImmunoChip, excluding conditional peaks and MHC loci (see Methods). b eQTLs are selected if there is nominal association (eQTL p < 0.05) to at least one SNP within 100kb of the most associated SNP to disease, and a transcription start site of the gene within 1Mb of that SNP. c Number of loci where disease association is consistent with a shared effect for at least one eQTL (FDR < 5%).  10  10  9  10  10  2  1  3  4  T1D  47  39  40  36  40  2  0  0  2  RA  34  34  34  34  34  2  0  1  3  CEL  34  34  34  34  34  3  2  0  5  Overall  272  258  259  255  261  25  16  11  41   Table 2. Forty one loci harbor eQTLs driven by the same variants as an association to at least one of seven diseases. We find 57 instances of shared disease-eQTL effects in 41 loci (joint likelihood of shared association FDR < 5%). a Variant with the minimum association p value to disease in the ImmunoChip summary statistics. b Minimum eQTL p value for any SNP within 100kb of the lead SNP. Dashes (-) indicate genes that are either not detected or with minimum eQTL P > 0.05 in that cell type. c Highlighted in bold are disease-eQTL pairs with false discovery rate < 5%. Asterisk (*) marks eQTL genes passing Bonferroni correction.  We find strong evidence that approximately 75% of eQTLs are driven by distinct causal variants (orange) to 261 disease risk associations across 155 ImmunoChip regions. The strength of eQTL association does not influence the proportion of shared effects (green) we are able to detect, suggesting this lack of overlap is not due to lack of power. We find no compelling evidence for either shared or distinct associations for a small proportion of disease-eQTL pairs (gray). (b) The median number of loci with at least one shared effect eQTL in any tissue (blue line) at more liberal significance thresholds remains constant after false positive adjustment, further supporting this conclusion. The shaded area represents the lower and upper expectation bounds for disease-eQTL pairs driven by the same causal variant. Only 31-57% of multiple sclerosis associations and 37-57% of inflammatory bowel disease associations are consistent with eQTL effects. Equivalent data for the other diseases are presented in Figure S14. . This association is consistent with eQTLs for METTL21B in CD4 + T cells (middle panel) and CD14 + monocytes (lower panel, both shaded by LD to rs10783847), but not to eQTL data for any other genes in the region (upper gene track: black boxes denote 38 genes with eQTL data available in addition to METTL21B (red); gray denotes genes which are not reliably detected in our data or do not have eQTL p < 0.05 in the region). (b) Joint likelihood p-values for 39 candidate genes analyzed for this MS association peak in three cell types. Those with FDR < 5% are shown in red. (c) Association p-values for MS risk (x-axis) and eQTLs (y-axis) are strongly correlated for both CD4 + T cells (middle panel) and CD14 + monocytes (lower panel). (d) Similarly, eQTL association Z statistics scale linearly with LD (r, x axis) to rs10783847, consistent with a model of a single causal variant driving both disease association and eQTL.