Abstract
Genetic risk for common autoimmune diseases is influenced by hundreds of small-effect, mostly non-coding variants, enriched in regulatory regions active in adaptive-immune cell types. DNaseI hypersensitivity sites (DHSs) are a genomic mark for regulatory DNA. Here, we generated a single DHS annotation from fifteen deeply sequenced DNase-seq experiments in adaptive-immune as well as non-immune cell types. Using this annotation we quantified accessibility across cell types in a matrix format amenable to statistical analysis, deduced the subset of DHSs unique to adaptive-immune cell types, and grouped DHSs by cell-type accessibility profiles. Measuring enrichment with cell-type-specific TF binding sites as well as proximal gene expression and function, we show that accessibility profiles grouped DHSs into coherent regulatory functions. Using the adaptive-immune-specific DHSs as input (0.37% of genome), we associated DHSs to six autoimmune diseases with GWAS data. Associated loci showed higher replication rates when compared to loci identified by GWAS or by considering all DHSs, allowing the additional discovery of 327 loci (FDR<0.005) below the typical GWAS significance threshold, 52 of which are novel and replicating discoveries. Finally, we integrated DHS associations from six autoimmune diseases, using a network model (bird’s-eye view) and a regulatory Manhattan plot schema (per locus). Taken together, we described and validated a strategy to leverage finely resolved regulatory priors, enhancing the discovery, interpretability, and resolution of genetic associations, and providing actionable insights for follow up work.
Introduction
Most common autoimmune diseases, which affect over 4% of the world’s population1,2, have a substantial polygenic heritable component3. Genome-wide association studies (GWAS) have been successful at linking hundreds of genomic loci to autoimmune diseases4, but understanding the molecular mechanisms influencing disease-risk remains challenging5. DNaseI hypersensitive site sequencing (DNase-Seq) is a high-throughput technology for genome-wide detection of DNaseI hypersensitive sites (DHSs) in a given cell type6,7, and DHSs are an excellent mark for regulatory DNA where transcription factors bind8,9. Recently, it has been shown that the majority of autoimmune disease risk alleles reside in DHSs active in adaptive immune cell types, such as T and B cells10-13. As recently suggested11,14, this presents the opportunity to focus association studies on regulatory DNA of such trait-relevant cell types. However, a large proportion of the regulatory DNA of a given cell type is shared with non-related cell types8. Since genes involved in adaptive-immune-specific functions (e.g. T cell receptor signaling15) are likely regulated by adaptive-immune-specific regulatory DNA, we suggest further focusing genotype-phenotype studies of autoimmune diseases on regulatory DNA specific to adaptive immune cell types.
Regulatory marks other than DHSs, such as certain histone marks, are also available for many cell types16,17. However, compared with DHSs, these other marks typically fall short on two key factors we relied on: resolution (about 200-300 base pairs [bp] for a typical DHS, compared to thousands of bp for a typical histone-mark peak) and sequencing depth (over 200 million reads for each DNase-Seq dataset we used, compared to 10-30 million reads for a typical histone-mark ChIP-Seq experiment). For these two main reasons, here we used DHSs and not other regulatory marks.
This work focused on developing a framework for analyzing genetic data in light of the growing functional data. Perhaps the largest difference between our approach and many other reports is that we have used the functional data, instead of the genetic data, to dictate which regions of the genome would be evaluated (Fig. 1a,b). This resulted in, 1) a reduced requirement for multiple-testing correction and thereby increased statistical power; 2) tested units became context-specific, short regulatory regions thought to influence the studied trait, instead of all SNPs—the large majority of which have no functional impact; and 3) improved discovery and interpretation of genetic associations from current autoimmune-disease GWAS, afforded by trait-relevant, finely-resolved regulatory priors.
Finally, since more results were generated than are possible to report in detail, we integrated results from six autoimmune diseases in a human-accessible format, in the hope of encouraging further investigations by others. This was achieved through a network model (integrating over all results) and a regulatory Manhattan plot schema (integrating results per locus). See Fig. 1c for an overview of our approach.
Results
Generating a unified DHS annotation from multiple samples and quantifying DHS accessibility
Currently, there is no unified genome-wide annotation for DHSs as there is for example for genes. Instead, each sample comes with its own unique DHS annotation. Moreover, there are currently no standards for quantifying accessibility as there are for quantifying gene expression, e.g. RPKM18. This makes working with DHSs from multiple DNase-seq experiments challenging. Here, we generated a single DHS annotation from fifteen available DNase-seq samples16,17,19: seven adaptive immune cell types (B, CD8+ T, Naïve CD4+ T, Th1, Th2, Th17, Treg), an innate immune cell (monocytes), and seven non-immune cell types (human embryonic stem cell, hereafter hESC, fetal brain, astrocyte, myoblast, fibroblast, epithelial, and hematopoietic stem cells, hereafter hematoSC). In total, we annotated 348,527 DHSs covering 3.4% of the human genome from DHSs that were significantly accessible in at least one of the above cell types (Methods). We then quantified accessibility per DHS and cell-type in number of cleavages per kilobase per million (CPKMs). We visualized the resulting accessibility matrix as a heatmap (Fig. S1). We saw that over half of all accessible DNA in adaptive-immune cell types is ubiquitously accessible across many other cell types, and that such ubiquitous DHSs tended to be more accessible when compared to adaptive-immune-specific DHSs. If one considered trait-relevant cell types alone, or considered cell types independently from one another, the ubiquitous DHSs might have overshadowed the less accessible but more interesting adaptive-immune-specific DHSs. This demonstrates the insights that can be gained by creating one DHS annotation from trait-relevant and trait-irrelevant cell types, and continuously quantifying accessibility in a single matrix.
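As an illustration of the CPKM unit described above, the following minimal sketch normalizes raw cleavage counts by DHS length and by the total number of mapped cleavages in the sample, by analogy with RPKM. The exact counting and normalization procedure is given in Methods and is not reproduced here; the function names, argument names, and the dictionary layout are assumptions made for this example only.

```python
def cpkm(cleavages_in_dhs, dhs_length_bp, total_cleavages):
    """Cleavages per kilobase of DHS per million mapped cleavages.

    Assumed normalization (by analogy with RPKM); the paper's Methods
    define the actual procedure.
    """
    if dhs_length_bp <= 0 or total_cleavages <= 0:
        raise ValueError("DHS length and total cleavages must be positive")
    per_kilobase = cleavages_in_dhs / (dhs_length_bp / 1_000)
    return per_kilobase / (total_cleavages / 1_000_000)


def cpkm_matrix(counts, lengths_bp, totals):
    """Build a DHS-by-cell-type CPKM matrix as nested dictionaries.

    counts[dhs][cell_type] -> raw cleavage count (hypothetical layout)
    lengths_bp[dhs]        -> DHS length in base pairs
    totals[cell_type]      -> total mapped cleavages in that sample
    """
    return {
        dhs: {
            cell_type: cpkm(n, lengths_bp[dhs], totals[cell_type])
            for cell_type, n in row.items()
        }
        for dhs, row in counts.items()
    }
```

Under this assumed definition, a 250 bp DHS with 50 cleavages in a sample of 200 million mapped cleavages has a CPKM of 1, which makes values comparable across DHSs of different lengths and samples of different depths.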
Grouping DHSs by accessibility profiles
Next, we hypothesized that grouping of DHSs by the subset of cell-types they were accessible in, would also group DHSs by distinct regulatory functions. Using the CPKM matrix, we scored DHSs for how well they matched cell-type-specific profiles, or profiles specific to predefined subsets of cell types (Figure 2a). Out of nineteen accessibility profiles we assayed, nine were unique to adaptive-immune cell types (purple), and were designed to group DHSs in broad strokes by salient immune functions: B cells specific DHSs (antibody production20), CD8+ T cell specific DHSs (cell-mediated killing21), Naïve CD4+ T cell specific DHSs (maintenance of non-activated T cells22), CD4+ Th1 specific DHSs (cytokine-induced cell-mediated immunity22), CD4+ Th2 specific DHSs (cytokine-induced antibody production22), CD4+ Th17 specific DHSs (cytokine-induced mucosal immunity22), CD4+ Treg specific DHSs (cytokine-induced anti-inflammatory response and tolerance to self antigens22); and for subsets of cell types: DHSs specific to all six T lineages (T cell maintenance), and DHSs specific to all four activated CD4+ T cells, namely: Th1, Th2, Th17 and Treg (CD4+ T cell activation). As examples, we present genome-browser views of two uniformly selected DHSs per accessibility profile (Fig. 2b, Fig. S2). The accessibility annotation, CPKM matrix, and match-scores per accessibility profile are available in Table S1.
We also compared the grouping of DHSs into accessibility profiles based on the CPKM matrix, with the alternative of grouping DHSs into the same profiles using a binary matrix in which DHSs were annotated as accessible (1) or not accessible (0), the common practice for DHS data8,11,16,19,23,24. With respect to resolving accessibility profiles among the related T lineages, the CPKM matrix increased the number of identified DHSs per profile (sensitivity; Fig. S3), proportion of correctly classified DHSs per profile (specificity; Fig. S3), and overall accessibility at identified DHSs per profile (signal-to-noise ratio; Fig. S4; Methods).
Underlying DHSs of different accessibility profiles are distinct combinations of DNA sequence motifs
The expected molecular basis for DNA accessibility differences between cell types is a corresponding difference in TF occupancy9,25. Therefore, to assess whether grouping of DHSs into accessibility profiles was cell type dependent and not strongly confounded by other factors (e.g. genetic or environmental differences between sample donors), we de novo identified enriched DNA sequence-motifs in the top 600 DHSs matching each accessibility profile (Methods). Consistent with unique regulatory functions for DHSs belonging to different accessibility profiles, but not with confounders, distinct combinations of sequence motifs were enriched in each profile, often matching binding sites of TFs known to be important in the accessible cell types (Fig. 3a). For example, binding sites for the Th2 and Th17 master regulators, GATA322 and RORC26, were solely discovered in DHSs specific to Th2 and Th17 cells, respectively. However, we did not recover the binding motifs for the Th1 and Treg master regulators, TBX21 and FOXP3, respectively, suggesting that lower resolution was achieved for DHSs specific to these two cell types. Additionally, we found that ubiquitous DHSs were solely enriched for the CCCTC-binding factor (CTCF), a constitutively expressed DNA-binding protein involved in organizing chromatin into topological domains27, suggesting that these DHSs were marking generic chromatin-to-chromatin contact points found across all cell types.
Accessibility profiles grouped DHSs by coherent regulatory functions
We next determined if DHSs of different accessibility profiles marked regulatory DNA for genes with distinct cellular functions. To this end, we associated DHSs with their nearest transcription-start-site (TSS) gene, as a proxy for regulated genes, and evaluated each profile for gene ontology (GO) term enrichment (Methods)28,29. For each accessibility profile, the enriched GO terms were mostly in line with known functions in accessible cell types (Fig. 3b and Fig. S5). Using the BioGPS30 and GTEx31 gene-expression compendia, we also evaluated if genes assigned to DHSs of a given accessibility profile were differentially expressed in tissues relevant to accessible cell types (Fig. 3c, Fig. S6, S7). This was indeed the case. For example, in bronchial epithelial cells, genes assigned to epithelial-specific DHSs were most differentially expressed. Additionally, we noted that upregulation of gene expression was much more common than downregulation, indicating that most DHSs marked enhancers.
Taken together, the profile-specific enrichment in TF binding sites, GO terms, and differential gene expression, show that grouping DHSs by accessibility profiles, resulted in grouping of regulatory DNA by coherent, cell-type-dependent regulatory functions.
GWAS signal stratified by DHS cell-type accessibility profiles
In previous steps, we grouped DHSs into accessibility profiles. We first verified that, as expected, adaptive-immune-specific DHSs were selectively enriched with risk alleles for autoimmune diseases (Fig. S8). We further resolved which subsets of adaptive-immune-specific DHSs were most relevant to each disease, using a rank-based approach and a permutation-based statistic (Methods; Fig. S9). No single accessibility profile dominated across all autoimmune diseases (Fig. S10a). However, within each disease more strongly enriched cell-type profiles emerged, e.g. DHSs specific to CD8+ T cells, B cells, and Treg cells were most enriched for systemic lupus erythematosus (SLE). Perhaps of more interest was that Alzheimer’s disease and schizophrenia, two non-autoimmune controls, clustered together with the autoimmune diseases (Fig. S10b). Specifically, we found that schizophrenia was enriched with DHSs specific to CD4+ T cells, and Alzheimer’s was enriched with DHSs specific to monocytes, B cells, and CD8+ T cells, revealing involvement of different immune processes, and providing further resolution to recent results suggesting immune involvement in susceptibility to Alzheimer’s32,33 and schizophrenia34,35.
A context-specific, regulatory-wide association study (csRWAS)
Next, we performed an association study between the adaptive-immune-specific DHSs and autoimmune diseases. Specifically, from each GWAS we assigned a single proxy SNP to each DHS, and used the P-value of that SNP as the DHS association P-value (Methods, table S2). Choosing the largest GWAS (by sample size) for each of six autoimmune diseases, we analyzed rheumatoid arthritis36 (RA, n=80k), ulcerative colitis37 (UC, n=26k), Crohn’s disease38 (CD, n=21k), multiple sclerosis39 (MS, n=15k), systemic lupus erythematosus40 (SLE, n=11k), and type 1 diabetes41 (T1D, n=5k). The genome-wide significance typically employed by GWAS (p=5x10^-8) is not appropriate for csRWAS, as all the adaptive-immune-specific DHSs combined constitute ~0.37% of the human genome. Leveraging the groupings of DHSs into accessibility profiles, we estimated a null P-value distribution from proxy SNPs assigned to non-immune DHSs (Methods), as we did not expect these to be specifically associated with autoimmune diseases. However, to the extent that non-immune DHSs were tagged by risk alleles, this null is conservative. This allowed matching a P-value to a desired FDR threshold. For example, at an FDR of 0.001, P-values ranged from 2x10^-10 for RA to 3.1x10^-4 for MS, and at an FDR of 0.005, from 7x10^-5 for RA to 2.8x10^-3 for MS. Employing an FDR<0.005 as a significance threshold, we identified between 243 and 839 DHSs (94 to 165 independent loci) per disease. To aid in analysis of these many associated DHSs, we first filtered them based on the GWAS genome-wide significance threshold, as associations above this threshold would have likely already been reported by the corresponding GWAS. Specifically, each associated DHS at FDR<0.005 was assigned to one of four groups, in the following order of precedence:
1. Genome-wide significant (GW), if the associated DHS itself was identified above genome-wide significance (p<5x10^-8).
2. Locus GW significance (lGW), if any GWAS SNP within 0.1cM or 100kb of the associated DHS was identified above genome-wide significance.
3. True, if any GWAS SNP within 0.1cM or 100kb of the associated DHS, in the GWAS catalog4 for the same disease, was identified above genome-wide significance.
4. Novel, otherwise.
That means that DHSs in the Novel and True bins represent associations found below genome-wide significance (with the latter reaching genome-wide significance in other studies). Similarly, after grouping DHSs into independent loci, loci were assigned into these four groups, as determined by the highest-precedence DHS found in each locus.
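The step of matching a P-value to a desired FDR using a null estimated from non-immune DHSs can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure in Methods: it assumes the estimated FDR at a threshold is the expected number of null calls (null fraction at or below the threshold, scaled to the number of tested DHSs) divided by the number of tested DHSs called. The function names and the toy P-values are hypothetical.

```python
from bisect import bisect_right


def empirical_fdr(threshold, test_pvalues, null_pvalues):
    """Estimated FDR when calling test P-values <= threshold significant.

    test_pvalues: proxy-SNP P-values for adaptive-immune-specific DHSs
    null_pvalues: proxy-SNP P-values for non-immune DHSs (assumed null)
    """
    test_sorted = sorted(test_pvalues)
    null_sorted = sorted(null_pvalues)
    n_called = bisect_right(test_sorted, threshold)
    if n_called == 0:
        return 0.0
    # Fraction of the null at or below the threshold, scaled to the
    # number of tests, gives the expected number of false calls.
    null_fraction = bisect_right(null_sorted, threshold) / len(null_sorted)
    expected_false = null_fraction * len(test_sorted)
    return min(1.0, expected_false / n_called)


def threshold_for_fdr(target_fdr, test_pvalues, null_pvalues):
    """Largest observed test P-value with estimated FDR <= target, or None."""
    best = None
    for p in sorted(set(test_pvalues)):
        if empirical_fdr(p, test_pvalues, null_pvalues) <= target_fdr:
            best = p
    return best


# Toy data, for illustration only.
test = [1e-6, 1e-5, 1e-4, 0.2, 0.5, 0.9]
null = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
cutoff = threshold_for_fdr(0.05, test, null)
```

Because the threshold is read off each disease's own P-value distribution, a well-powered GWAS and an underpowered one can map the same FDR to very different P-value cutoffs, as seen for RA and MS above.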
We initially expected all GW and lGW loci to be reported in the GWAS catalog; after all, they had a GWAS SNP with p<5x10^-8. In practice, however, we found that this was not the case. For example, out of 644 DHSs assigned to the GW bin (FDR<0.005), 89 were not reported as part of any locus in the catalog. We examined these un-cataloged DHSs manually (Methods) and found that besides a single genome-wide-significant SNP supporting them, typically only 0-4 other SNPs in the region were below a nominal significance of p<1x10^-3. Therefore, these loci were likely flagged as poor in the original GWAS and therefore not reported. Here, we flagged such poorly associated DHSs if fewer than four SNPs with p<1x10^-3 were found around the associated DHS (Methods). Finally, we determined which DHSs and loci replicated across diseases, as this can provide further proof that an association is genuine, and detect key loci and regulatory DNAs associated to multiple autoimmune diseases (Methods).
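The four-group assignment and the poor-support flag described above can be sketched as a short decision procedure. The sketch assumes that the SNPs within 0.1cM or 100kb of each DHS, and the catalog lookup for the same disease, have already been computed upstream; how the supporting SNPs are counted (for example, whether the lead SNP is included) is an assumption here, and all names are hypothetical.

```python
GW_THRESHOLD = 5e-8        # GWAS genome-wide significance
NOMINAL_P = 1e-3           # nominal significance for supporting SNPs
MIN_SUPPORTING_SNPS = 4    # fewer than this flags the DHS as poor
PRECEDENCE = ["GW", "lGW", "True", "Novel"]


def classify_dhs(dhs_p, nearby_snp_ps, catalog_hit_same_disease):
    """Assign an associated DHS to one of four groups, in order of precedence.

    dhs_p: P-value of the proxy SNP assigned to the DHS
    nearby_snp_ps: P-values of GWAS SNPs within 0.1cM or 100kb of the DHS
    catalog_hit_same_disease: True if a genome-wide significant catalog SNP
        for the same disease lies within that window
    """
    if dhs_p < GW_THRESHOLD:
        return "GW"
    if any(p < GW_THRESHOLD for p in nearby_snp_ps):
        return "lGW"
    if catalog_hit_same_disease:
        return "True"
    return "Novel"


def is_poorly_supported(nearby_snp_ps):
    """Flag a DHS when fewer than four nearby SNPs reach p < 1e-3."""
    supporting = sum(p < NOMINAL_P for p in nearby_snp_ps)
    return supporting < MIN_SUPPORTING_SNPS


def classify_locus(dhs_groups):
    """A locus takes the highest-precedence group among its DHSs."""
    return min(dhs_groups, key=PRECEDENCE.index)
```

Checking the groups in a fixed order means each DHS receives exactly one label, and a locus inherits the label of its highest-precedence DHS.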
We summarized the results for all the non-poor loci, stratified by the above groupings of: GW, lGW, True, and Novel in Table 1 (see table S3 and S4 for details per DHS and per locus, respectively). Out of 529 loci associated to one or more of the autoimmune diseases (FDR<0.005), 322 (60.9%) had further support by either replicating here across diseases (27.2%) or reaching genome-wide significance in the GWAS catalog in the same diseases (55%). Importantly, out of the 529 loci, 327 were discovered below genome-wide significance (True or Novel) in at least one of the diseases a locus was associated with, and 153 (46.8%) of these had replication or catalog support. This shows that using informed regulatory priors allows the identification of many genuine associations below GWAS genome-wide significance; albeit higher significance does lead to improved validation rates (Table 1). It still remains to be evaluated how GWAS compares to csRWAS with respect to replication rates at equal significance thresholds and we will return to this question to conclude the results section. Next, we describe how polymorphisms that disrupt potential TF binding sites inside DHSs were identified, and how we prioritize among them to suggest follow up SNPs for each associated locus.
Identifying polymorphisms in DHSs that disrupt predicted TF binding sites
Genetic variants that modify TF binding sites in DHSs are a major source of human gene expression variation42. Here, we scanned DHSs for sequences matching one of the enriched TF binding sites found in adaptive-immune-specific accessibility profiles (Fig. 3a), and identified polymorphisms disrupting such binding sites as possible sources of disease risk alleles (Methods). Since many such polymorphic TF binding sites were found, we scored binding sites using a three-letter grade to allow prioritizing among them (Methods). The first letter grade measured how well the predicted binding-site matched to the cognate TF binding site and cell-type context (+A being the best). The second and third letters measured the SNP MAF in Europeans and Asians (A being common, B intermediate, and C rare), respectively (table S5). We will see examples of these grades in action below.
We next describe how we integrated the regulatory and genetic information in a network model.
A hierarchical integrative network model
Through disease-associated DHSs our approach connected diseases, SNPs, loci, DHSs, polymorphic TF-binding-sites, DHS accessibility profiles, and genes by most proximal TSS. Using the accessibility profiles as a functional guide, we constructed a hierarchical network model with five levels. The network for associated DHSs at an FDR<0.005 is visualized in Figure 4 and is available for navigation in Cytoscape43 (supplementary file 1). Although gene assignment to DHSs by nearest TSS is prone to false negatives and false positives, particularly in gene-dense regions, this simple procedure clearly matched many DHSs to correct regulated genes, as supported by the GO term enrichments and differential gene expression analyses, and as shown next by KEGG44 pathway enrichment. We first examined the 528 genes in the top level of the network. We asked if the associated genes were enriched with functional pathways, above what was expected for genes most proximal to adaptive-immune-specific DHSs (Methods). The top three enriched KEGG pathways revealed canonical T-cell signaling pathways (hypergeometric test): JAK/STAT signaling pathway45 (fold-change[observed/expected]=5.56, p=2.86x10^-12), Cytokine-cytokine receptor interaction45 (fold-change=4.08, p=3.43x10^-11), and T-cell receptor signaling46 (fold-change=4.17, p=1.65x10^-6). No unexpected pathway enrichments were found. We further examined whether genes found in two or more autoimmune diseases (96 such genes) were more enriched for one of those three pathways. Cytokine-cytokine receptor interaction45 and T-cell receptor signaling showed similar fold enrichments (fold-changes of 4.7 and 4.48, respectively); however, the JAK/STAT signaling pathway almost doubled its fold enrichment (fold-change of 10.04, p=9.56x10^-7) (Fig. 4b). The JAK/STAT genes along with their DHS-derived cellular-contexts were: 1. Janus-kinase 2 (JAK2, Activated CD4+ T set47), 2. Suppressor of cytokine signaling 1 (SOCS1, Activated CD4+ T set22), 3. Signal transducer and activator of transcription 1 (STAT1, T set48), 4. Protein tyrosine phosphatase non-receptor type 2 (PTPN2, T set49), 5. Interleukin 23 receptor (IL23R, T set or Th1750), 6. Interleukin 12 receptor beta 2 (IL12RB2, Th151), 7. Leukemia inhibitory factor (LIF, Th152), 8. Sprouty-related EVH1 domain containing 2 (SPRED2, Treg), 9. Signal transducer and activator of transcription 3 (STAT3, Treg22,53), and 10. Interleukin 21 (IL21, Th1754), most having reported functions in the predicted cell-type context (references provided). Five of these genes form protein-protein complexes through interactions with JAK2, namely: IL12RB2, STAT3, STAT1 and SOCS122,55-57. This suggests that, in addition to the highly autoimmune-relevant HLA region (which we excluded from all of our analyses), JAK/STAT signaling is the second most prevalent dysregulated pathway in autoimmunity.
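The fold-change and hypergeometric P-value reported for each pathway can be illustrated with a minimal sketch. Following the text, the background is assumed to be the set of genes most proximal to adaptive-immune-specific DHSs; the gene counts behind the reported values are not given in this section, so the numbers below are toy values chosen only to make the arithmetic easy to follow, and the function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric.

    M: number of background genes
    n: background genes annotated to the pathway
    N: associated genes drawn from the background
    k: associated genes annotated to the pathway
    """
    total = comb(M, N)
    upper = min(n, N)
    # comb() returns 0 when the draw is impossible, so no extra guards needed.
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / total


def fold_change(k, M, n, N):
    """Observed pathway genes divided by the number expected by chance."""
    expected = N * n / M
    return k / expected


# Toy example: 10 background genes, 4 in the pathway, 3 associated genes,
# 2 of which are in the pathway.
p_value = hypergeom_upper_tail(2, 10, 4, 3)
fc = fold_change(2, 10, 4, 3)
```

Restricting the background to genes near adaptive-immune-specific DHSs, rather than all genes, keeps the test from rewarding pathways that are simply common among immune genes.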
Note that there is a bias in which cellular contexts we can find. For example, IL21 was associated with a Th17 context, but is also important for T follicular helper cells58, for which we had no accessibility data. Moreover, for T cells we have six lineages, allowing fine resolution into cellular contexts, whereas for B cells we have only one sample, a composite of B cells at different developmental stages, in which case cellular contexts can only be identified as an aggregate.
Next, we provide two examples for how the network can be used to extract actionable insights for two genes of interest, IRF5 and BLK (Fig. 4c,d). For both genes, we show that stretches of B-cell-specific DHSs underlie the association, with at least one of these DHSs harboring a ‘+AAA’ polymorphic TF binding site, thereby suggesting candidate causal regulatory SNPs, molecular mechanisms of action, and cellular contexts of associations. In support, both genes were previously described as important for B cell development and function59,60, and super enhancers in B cells were previously reported in the same regions10.
Regulatory Manhattan plots (RMPs)
A regulatory Manhattan plot (RMP) highlights DHS tagging SNPs above a set significance threshold, across related traits, and visualizes their matched accessibility profile. The profiles in turn help determine the likely cellular context of the locus. For example, we show the RMPs for the IRF5 and BLK example loci discussed above (Fig. 5). Note that the signal over the IRF5 TSS region is not in LD with the signal above the TNPO3 gene body (Fig. S11). As additional examples, we present thirteen associations in three loci that replicated across diseases, IL2RA, PTPN22, and IKZF1, including associations below GWAS genome-wide significance (True and Novel) (Fig. 6). For PTPN22 and IL2RA, csRWAS resolved associations to a single dominant accessibility profile: PTPN22 with T set, and IL2RA with Treg. Clear evidence that a locus is not associated is also important, so we provide RMPs for all loci across all diseases, including non-associated ones (supplementary files 2 and 3). Such RMPs reveal that csRWAS often could not associate loci with some diseases, not because an association signal was not present, but because no DHS proximal SNPs were available. This highlights the importance of high-quality dense imputations for csRWAS, and means that we could only provide a lower bound for the true cross-replication rates across these six autoimmune diseases.
In support of the two novel IKZF1 associations reported above with MS and RA, csRWAS linked three additional members of the Ikaros hematopoietic transcription factor family61, namely, IKZF2, IKZF3, and IKZF4, to autoimmune diseases (Fig. S12), further resolving the cellular context of the RA association with IKZF4 to Treg, and the RA-UC-CD-SLE-T1D association with IKZF3 to a T-set or CD8+ T cell accessibility profile. Note that IKZF3 was associated with five of the six autoimmune diseases, suggesting that dysregulation of IKZF3 is a prevalent cause of autoimmunity, and highlighting this TF for possible drug intervention. Additional RMPs for eight loci including the gene-regions for IL12RB2, TCF7, CTLA4, CD28, CCR6, and ETS1 (Fig. S13-S14), reveal more cellular contexts and identify four novel cross-replicating associations, two for IL12RB2 with RA and SLE, one for ETS1 (upstream) with SLE, and one for TCF7 with RA, as well as one non-replicating association for ETS1 (gene body) with UC. We also note that several genome-wide significant loci identified here were not within 0.1cM of a catalog lead-SNP for the respective diseases, and were not flagged by us as poor. Specifically, two cross-replicating loci were found for RA near ZFP36L1 and FAM213B (Fig. S15), two loci for CD, the first cross-replicating with UC near BRD7 and the second not replicating near TTC33 (Fig. S16), and one locus for MS near SOCS1 that cross-replicated with T1D and CD (Fig. S17). To summarize, we highlight the more promising novel and replicating loci in Table 2 and Figures S18-S27, providing an array of new insights for six common autoimmune diseases. Furthermore, in order to assist in prioritizing polymorphic TF binding sites for follow up work, each RMP is accompanied by a table (in the same .pdf file) providing additional information similar to that found in the network, but in a tabular format and on the locus scale (supplementary files 2 and 3, see example in Fig. S28).
We next contrast the loci replication rates of GWAS and csRWAS to conclude the results section.
High rate of validation for genetic associations identified by csRWAS
Often, an initial GWAS suffers from what is known as the “winner’s curse”, where associations found are likely stronger in that GWAS sample than in the general population that is assayed62. This problem is less significant as the GWAS sample size increases. Therefore, the gold standard for validation of any genetic study is replication in an independent sample, which is recommended to be larger to correctly identify false positives63. Following these GWAS guidelines, we examined the rate at which loci discovered in a smaller GWAS replicated in a larger and independent GWAS for the same autoimmune disease. We compared the replication rates attained by GWAS (considering all SNPs), RWAS (considering all DHS tagging SNPs), and csRWAS (considering all adaptive-immune-specific DHS tagging SNPs) (Methods). We show that the replication rates were overall higher for csRWAS when compared to RWAS or GWAS (Fig. 7a,b). The difference in replication rate was more pronounced for the smaller CD discovery data (n=4,664), when compared to the larger RA discovery data (n=22,515), consistent with prior information having larger impact on underpowered GWAS. Since emerging whole-genome sequencing-based GWAS commonly have such lower sample sizes64–70, we propose that csRWAS is well suited for their analysis as well, as it can leverage the improved SNP resolution offered by sequencing data, while avoiding the pitfalls of greatly increasing the number of variants considered.
Considering the reciprocal of the results above, we show that to achieve a desired replication rate, csRWAS required a less stringent significance threshold when compared to GWAS (Fig. 7c,d). For example, if the desired replication rate was 0.5 for RA, csRWAS required loci to reach p<7.4x10^-5, while GWAS required loci to reach p<5.6x10^-6. This means that at the intermediate P-values (in which the significance threshold was reached for csRWAS but not for GWAS) additional loci could be discovered from csRWAS, without loss of accuracy. We quantified the number of such discoveries unique to csRWAS for a range of replication rates (Fig. 7e,f). Keeping with the same example as before, for RA at a 0.5 replication rate, csRWAS discovered 7 loci in addition to those found by GWAS, accounting for ~58% of all twelve csRWAS discoveries at that replication rate. The results for CD were more striking, likely because of the higher relevance of priors for underpowered GWAS. These results remained consistent when varying parameters (Methods, Fig. S29-S30).
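The relationship between a significance threshold and the replication rate it yields can be sketched as below. This is an illustrative simplification, not the procedure in Methods: it assumes each discovery locus carries a discovery P-value and a yes/no replication call from the larger independent GWAS, and it scans observed P-values for the least stringent threshold that still meets a desired replication rate. The function names and the toy loci are hypothetical.

```python
def replication_rate(loci, threshold):
    """Fraction of loci with discovery P-value <= threshold that replicated.

    loci: list of (discovery_p, replicated) pairs, replicated being a bool.
    Returns None when no locus passes the threshold.
    """
    called = [replicated for p, replicated in loci if p <= threshold]
    if not called:
        return None
    return sum(called) / len(called)


def least_stringent_threshold(loci, desired_rate):
    """Largest observed P-value whose replication rate meets desired_rate."""
    best = None
    for p in sorted({p for p, _ in loci}):
        if replication_rate(loci, p) >= desired_rate:
            best = p
    return best


# Toy loci, for illustration only.
toy_loci = [
    (1e-8, True), (1e-6, True), (1e-5, False),
    (1e-4, True), (1e-3, False), (1e-2, False),
]
cutoff_at_70_percent = least_stringent_threshold(toy_loci, 0.7)
```

Running the same scan separately on GWAS, RWAS, and csRWAS locus sets would give one threshold per method at each desired replication rate, which is the comparison summarized in Fig. 7c,d.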
Discussion
We have demonstrated that genome-wide association testing can be made considerably more powerful and precise by making use of chromatin accessibility and other functional data, and by focusing on genomic regions relevant to the trait or disease of interest. We illustrate the method by reanalyzing existing genetic data from six autoimmune diseases, where the approach led to additional discoveries and increased interpretability. Note that even if associated DHSs did not harbor any causal allele, the regulatory information they provide can still be used to highlight likely cellular contexts, and therefore help determine candidate genes (e.g. genes with function or high expression in suggested cell types), and guide further experiments to prove causal relationships. We propose that when accessibility data is available for trait-relevant cell types, csRWAS would be a valuable complementary analysis to GWAS.
An important technical advancement made here is in defining a single annotation from multiple DNase-seq samples, and generating a continuous accessibility matrix from multiple DNase-Seq data. This simplifies analysis of large collections of DNase-seq experiments, and can be generalized to other peak-like functional data. This also suggests that a unified DHS annotation from all DNase-seq samples can be defined, facilitating comparative studies of accessibility across a larger accessibility matrix (analogous to GTEx31 for gene expression).
As used here, csRWAS had limitations. Some were data related, e.g. not having DNase-seq data for every trait-relevant cell type, or not having biological replicates to estimate the contribution of donor-to-donor variation. As a result of the latter, for example, some of the differences we attributed to cell type specificity may instead be due to donor-specific genetic or environmental differences. However, the multitude of tests we performed to validate that DHSs grouped by accessibility profiles also grouped regulatory DNA by coherent regulatory functions suggests that DHS accessibility was driven primarily by cell-type differences. Other limitations were analysis related, e.g. the TF binding sites we tested here likely comprise only a fraction of the known and unknown binding motifs important for adaptive-immune cell types. In that respect, the modular design of the analysis makes it possible to incorporate additional DHS related data in the future. Also, although a useful first approximation, the assignment of the most proximal gene’s TSS to DHSs as a likely target was error prone. As such, when examining a locus of interest we recommend carefully reviewing the RMP and information table for that locus, and applying domain knowledge before making a decision on candidate genes for follow up.
Extensions to this work are also possible. For example, the measured trait, instead of being disease status, could be gene expression in relevant cell types71. The large number of studied “traits” generally further diminishes power of such expression quantitative trait loci (eQTL) mapping. Hence, focusing on regulatory DNA specific to trait-relevant cell types is promising for greatly improving both statistical power and functional insights of eQTL studies.
To conclude, we identified 1975 DHSs (FDR<0.005) associated to one or more of six autoimmune diseases, grouped into 529 independent loci across the genome. The original GWAS of these six diseases, employing standard genome-wide significance thresholds and not integrating non-genetic data, would have missed 327 of these loci, although 153 of those (46.8%) readily replicated here or in the GWAS catalog. Also, as a result of the functional prior, this study identified and replicated 52 novel loci. But perhaps more importantly, csRWAS provides actionable insights for many associations, e.g. suggests polymorphic TF binding sites only in accessible DNA unique to trait-relevant cell types, and proposes cellular contexts for follow up experiments.
Taken together we presented compelling evidence that trait-tailored approaches to functional and genetic data, can provide a structured path to better understanding how genotypes are translated to phenotypes.
Figures
All figures appear below but are also available as two downloadable .pdf files from the following private links.
Main figures, private links:
https://drive.google.com/open?id=OB_nf7cPOLTBSSVpuajRxQlZySVk
Supplemental figures, private link:
https://drive.google.com/open?id=OB_nf7cPOLTBSQUtfZOpUV3hQLW8
Acknowledgements
We thank Leonardo Arbiza, Kaixiong Ye, and Cris Van Hout for helpful comments and discussions about this work. This work was supported in part by NIH grant R01-HG006849 (A.K.) and by an award from the Ellison Medical Foundation (A.K.). This work would not have been possible without the sharing of data through public resources. We thank the Encyclopedia of DNA Elements (ENCODE) consortium and Roadmap Epigenomics consortium and particularly the work carried out by the Stamatoyannopoulos lab for collecting the DNaseI-seq data used in this manuscript. We thank the following consortia for sharing GWAS summary statistics: Rheumatoid Arthritis Consortium International for Immunochip (RACI), International Inflammatory Bowel Disease Genetics Consortium (IIBDGC), The Coronary Artery Disease Genetics Consortium (C4D), International Consortium for Blood Pressure (ICBP), DIAbetes Genetics Replication And Meta-analysis (DIAGRAM), The Genetic Investigation of ANthropometric Traits (GIANT), International Genomics of Alzheimer’s Project (IGAP), Psychiatric Genomics Consortium (PGC), and the Euro-Canadian systemic lupus erythematosus consortium (EC-SLE). This study makes use of data generated by the Wellcome Trust Case Control Consortium (WTCCC). A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the WTCCC project was provided by the Wellcome Trust under award 076113. Finally, we would like to thank all the participants and staff involved in the collection of the genotype-phenotype data used in this manuscript.