Abstract
Gene regulation is known to play a fundamental role in human disease, but mechanisms of regulation vary greatly across genes. Here, we explore the contributions to disease of two types of genes: genes whose regulation is driven by enhancer regions as opposed to promoter regions (Enhancer-driven) and genes that regulate many other genes in trans (Master-regulator). We link these genes to SNPs using a comprehensive set of SNP-to-gene (S2G) strategies and apply stratified LD score regression to the resulting SNP annotations to draw three main conclusions about 11 autoimmune diseases and blood cell traits (average Ncase=13K across 6 autoimmune diseases, average N =443K across 5 blood cell traits). First, several characterizations of Enhancer-driven genes defined in blood using functional genomics data (e.g. ATAC-seq, RNA-seq, PC-HiC) are conditionally informative for autoimmune disease heritability, after conditioning on a broad set of regulatory annotations from the baseline-LD model. Second, Master-regulator genes defined using trans-eQTL in blood are also conditionally informative for autoimmune disease heritability. Third, integrating Enhancer-driven and Master-regulator gene sets with protein-protein interaction (PPI) network information magnified their disease signal. The resulting PPI-enhancer gene score produced >2x stronger conditional signal (maximum standardized SNP annotation effect size (τ*) = 2.0 (s.e. 0.3) vs. 0.91 (s.e. 0.21)), and >2x stronger gene-level enrichment for approved autoimmune disease drug targets (5.3x vs. 2.1x), as compared to the recently proposed Enhancer Domain Score (EDS). In each case, using functionally informed S2G strategies to link genes to SNPs that may regulate them produced much stronger disease signals (4.1x-13x larger τ* values) than conventional window-based S2G strategies. 
We conclude that our characterizations of Enhancer-driven and Master-regulator genes identify gene sets that are important for autoimmune disease, and that combining those gene sets with functionally informed S2G strategies enables us to identify SNP annotations in which disease heritability is concentrated.
1 Introduction
Disease risk variants associated with complex traits and diseases predominantly lie in non-coding regulatory regions of genes, motivating the need to assess the relative importance of genes for disease through the lens of gene regulation1–6. Several recent studies have performed disease-specific gene-level prioritization by integrating GWAS summary statistics data with functional genomics data, including gene expression and gene networks7–14. Here, we investigate the contribution to autoimmune disease of gene sets reflecting two specific aspects of gene regulation in blood: genes with strong evidence of enhancer-driven regulation (Enhancer-driven) and genes that regulate many other genes (Master-regulator); previous studies suggested that both of these characterizations are important for understanding human disease9, 15–24. For example, several common non-coding variants associated with Hirschsprung disease have been identified in intronic enhancer elements of the RET gene and have been shown to synergistically regulate its expression25, 26; in addition, NLRC5 acts as a master regulator of MHC class I genes in immune response27. Our two main goals are to characterize which types of genes are important for autoimmune disease, and to construct SNP annotations derived from those genes that are informative for disease heritability, conditional on all other annotations.
A major challenge in gene-level analyses of disease is to link genes to SNPs that may regulate them, a prerequisite to integrative analyses of GWAS summary statistics. Previous studies have often employed window-based strategies such as 100kb8, 9, 11, linking each gene to all SNPs within 100kb; however, this approach lacks specificity. Here, we incorporated functionally informed SNP-to-gene (S2G) linking strategies that capture both distal and proximal components of gene regulation. We evaluated the resulting SNP annotations by applying stratified LD score regression28 (S-LDSC) conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model29, 30, meta-analyzing the results across 11 autoimmune diseases and blood cell traits; we focused on autoimmune diseases and blood cell traits because the functional data underlying the gene scores and S2G strategies that we analyze is primarily measured in blood. We also assessed gene-level enrichment for disease-related gene sets, including approved drug targets for autoimmune disease10.
Results
Overview of methods
We define an annotation as an assignment of a numeric value to each SNP with minor allele count ≥5 in a 1000 Genomes Project European reference panel31, as in our previous work28; we primarily focus on annotations with values between 0 and 1. We define a gene score as an assignment of a numeric value between 0 and 1 to each gene; gene scores predict the relevance of each gene to disease. We primarily focus on binary gene sets defined by the top 10% of genes; we made this choice to be consistent with ref.9, and to ensure that all resulting SNP annotations (gene scores x S2G strategies; see below) were of reasonable size (0.2% of SNPs or larger). We consider 11 gene scores prioritizing enhancer-driven genes, master-regulator genes, and genes with high network connectivity to enhancer-driven or master-regulator genes (Table 1, Supplementary Figure S1); these gene scores were only mildly correlated (average r= 0.08, Supplementary Figure S2). We considered enhancer-driven and master-regulator genes because previous studies suggested that both of these characterizations are important for understanding human disease9, 15–24.
We define a SNP-to-gene (S2G) linking strategy as an assignment of 0, 1 or more linked genes to each SNP. We consider 10 S2G strategies capturing both distal and proximal gene regulation (see Methods, Figure 1A and Table 2); these S2G strategies aim to link SNPs to genes that they regulate. For each gene score X and S2G strategy Y, we define a corresponding combined annotation X × Y by assigning to each SNP the maximum gene score among genes linked to that SNP (or 0 for SNPs with no linked genes); this generalizes the standard approach of constructing annotations from gene scores using window-based strategies8, 9. For example, EDS-binary × ABC annotates SNPs linked by Activity-By-Contact enhancer-gene links32, 33 to any gene from the EDS-binary gene set, whereas EDS-binary × 100kb annotates all SNPs within 100kb of any gene from the EDS-binary gene set. We considered combined annotations based on S2G strategies related to gene regulation because SNPs that regulate functionally important genes may be important for disease. For each S2G strategy, we also define a corresponding binary S2G annotation defined by SNPs linked to the set of all genes. We have publicly released all gene scores, S2G links, and annotations analyzed in this study (see URLs). We have also included annotations for 93 million Haplotype Reference Consortium (HRC) SNPs34 (MAF > 0.1% in imputed UK Biobank data35) and 170 million TOPMed SNPs36 (Freeze 3A).
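The construction of a combined annotation X × Y described above can be illustrated with a minimal sketch. The function name, the dictionary-based inputs, and the SNP and gene identifiers are hypothetical and chosen only for illustration; they are not taken from the released annotation files.

```python
from typing import Dict, Iterable, List


def combined_annotation(
    gene_score: Dict[str, float],
    s2g_links: Dict[str, List[str]],
    snps: Iterable[str],
) -> Dict[str, float]:
    """Assign each SNP the maximum gene score among its linked genes.

    SNPs with no linked genes (or whose linked genes have no score)
    receive a value of 0, as in the definition of X x Y in the text.
    """
    annot = {}
    for snp in snps:
        linked_genes = s2g_links.get(snp, [])
        annot[snp] = max(
            (gene_score.get(gene, 0.0) for gene in linked_genes),
            default=0.0,
        )
    return annot


# Toy inputs (hypothetical identifiers): a binary gene score and S2G links.
gene_score = {"GENE_A": 1.0, "GENE_B": 0.0, "GENE_C": 1.0}
s2g_links = {"rs1": ["GENE_A", "GENE_B"], "rs2": ["GENE_B"]}
annot = combined_annotation(gene_score, s2g_links, ["rs1", "rs2", "rs3"])
```

In this toy example, rs1 is annotated because one of its linked genes is in the gene set, rs2 is linked only to a gene outside the set, and rs3 has no linked genes.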
We assessed the informativeness of the resulting annotations for disease heritability by applying stratified LD score regression (S-LDSC)28 to 11 independent blood-related traits (6 autoimmune diseases and 5 blood cell traits; average Ncase=13K for autoimmune diseases and N =443K for blood cell traits, Supplementary Table S1) and meta-analyzing S-LDSC results across traits; we also assessed results meta-analyzed across autoimmune diseases or blood cell traits only, as well as results for individual diseases/traits. We conditioned on 86 coding, conserved, regulatory and LD-related annotations from the baseline-LD model (v2.1)29, 30 (see URLs). S-LDSC uses two metrics to evaluate informativeness for disease heritability: enrichment and standardized effect size (τ*). Enrichment is defined as the proportion of heritability explained by SNPs in an annotation divided by the proportion of SNPs in the annotation28, and generalizes to annotations with values between 0 and 137. Standardized effect size (τ*) is defined as the proportionate change in per-SNP heritability associated with a 1 standard deviation increase in the value of the annotation, conditional on other annotations included in the model29; unlike enrichment, τ* quantifies effects that are conditionally informative, i.e. unique to the focal annotation conditional on other annotations included in the model. In our “marginal” analyses, we estimated τ* for each focal annotation conditional on the baseline-LD annotations. In our “joint” analyses, we merged baseline-LD annotations with focal annotations that were marginally significant after Bonferroni correction and performed forward stepwise elimination to iteratively remove focal annotations that had conditionally non-significant τ* values after Bonferroni correction, as in ref.11, 29, 37–42. 
We did not consider other feature selection methods, as previous research determined that a LASSO-based feature selection method is computationally expensive and did not perform better in predicting off-chromosome χ2 association statistics (R. Cui and H. Finucane, personal correspondence). The difference between marginal τ* and joint τ* is that marginal τ* assesses informativeness for disease conditional only on baseline-LD model annotations, whereas joint τ* assesses informativeness for disease conditional on baseline-LD model annotations as well as other annotations in the joint model.
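The two metrics defined above can be written out as a short numerical sketch. This is not the S-LDSC implementation: it assumes that the heritability attributed to the annotation, the total SNP-heritability, and the per-annotation coefficient τ have already been estimated, and it assumes the commonly used form τ* = τ · sd(annotation) / (h2 / M), where M is the number of SNPs over which heritability is computed. All function and argument names are hypothetical.

```python
import numpy as np


def enrichment(h2_annot: float, h2_total: float, annot_values: np.ndarray) -> float:
    """Proportion of heritability divided by proportion of SNPs.

    For a binary annotation, the mean annotation value is the proportion
    of SNPs in the annotation; the same expression is used here for
    annotations with values between 0 and 1 (an assumption of this sketch).
    """
    prop_h2 = h2_annot / h2_total
    prop_snps = float(np.mean(annot_values))
    return prop_h2 / prop_snps


def tau_star(tau: float, h2_total: float, annot_values: np.ndarray) -> float:
    """Proportionate change in per-SNP heritability per 1 s.d. of the annotation."""
    n_snps = len(annot_values)
    per_snp_h2 = h2_total / n_snps
    return tau * float(np.std(annot_values)) / per_snp_h2


# Toy example: 4 SNPs, 1 of which is in the annotation.
toy_annot = np.array([1.0, 0.0, 0.0, 0.0])
toy_enrichment = enrichment(h2_annot=0.2, h2_total=0.5, annot_values=toy_annot)
toy_tau_star = tau_star(tau=0.01, h2_total=0.5, annot_values=toy_annot)
```

Because τ* is scaled by the standard deviation of the annotation and by the average per-SNP heritability, it is comparable across annotations of different sizes and across traits with different heritabilities.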
As a preliminary assessment of the potential of the 10 S2G strategies, we considered the 10 S2G annotations defined by SNPs linked to the set of all genes. The S2G annotations were only weakly positively correlated (average r = 0.09; Supplementary Figure S3). We analyzed the 10 S2G annotations via a marginal analysis, running S-LDSC28 conditional on the baseline-LD model and meta-analyzing the results across the 11 blood-related traits. In the marginal analysis, all 10 S2G annotations were significantly enriched for disease heritability, with larger enrichments for smaller annotations (Figure 2A and Supplementary Table S2); values of standardized enrichment (defined as enrichment scaled by the standard deviation of the annotation11) were more similar across annotations (Supplementary Figure S4 and Supplementary Table S3). 7 S2G annotations attained conditionally significant τ* values after Bonferroni correction (p < 0.05/10) (Figure 2B and Supplementary Table S2). In the joint analysis, 3 of these 7 S2G annotations were jointly significant: TSS (joint τ* = 0.97), Roadmap (joint τ* = 0.84) and Activity-by-Contact (ABC) (joint τ* = 0.44) (Figure 2B and Supplementary Table S4). This suggests that these 3 S2G annotations are highly informative for disease. Subsequent analyses were conditioned on the baseline-LD+ model defined by 86 baseline-LD model annotations plus all S2G annotations (except Coding, TSS and Promoter, which were already part of the baseline-LD model), to ensure that conditionally significant τ* values for (gene scores x S2G strategies) annotations are specific to the gene scores and cannot be explained by (all genes x S2G strategies) annotations. Accordingly, we confirmed that (random genes x S2G strategies) annotations did not produce conditionally significant τ* values for any S2G strategy (Supplementary Table S5).
We validated the gene scores implicated in our study by investigating whether they were enriched in 5 “gold-standard” disease-related gene sets: 195 approved drug target genes for autoimmune diseases10, 43; 550 Mendelian genes related to immune dysregulation44, 390 Mendelian genes related to blood disorders45, 146 “Bone Marrow/Immune” genes defined by the Developmental Disorders Database/Genotype-Phenotype Database (DDD/G2P)46, and 2200 (top 10%) high-pLI genes47 (Figure 3C and Supplementary Table S6). (We note that the high-pLI genes should not be viewed as a strict gold standard as not all of these genes are disease-related, but 30% of these genes have an established human disease phenotype47.)
Subsequent subsections are organized in the following order: description of gene scores for that subsection; marginal analyses using S-LDSC; joint analyses using S-LDSC; and validation of the gene scores implicated in our study using “gold-standard” disease-related gene sets.
Enhancer-driven genes are conditionally informative for autoimmune disease heritability
We assessed the disease informativeness of 7 gene scores prioritizing Enhancer-driven genes in blood. We defined these gene scores based on distal enhancer-gene connections, tissue-specific expression, or tissue-specific eQTL, all of which can characterize enhancer-driven regulation (Figure 1B, Table 1 and Methods). Some of these gene scores were derived from the same functional data that we used to define S2G strategies (e.g. ABC32, 33 and ATAC-seq48; see URLs). We included two published gene scores: the (binarized) blood-specific enhancer domain score (EDS)24 and specifically expressed genes in GTEx whole blood9 (SEG-GTEx). We use the term “enhancer-driven” to broadly describe gene scores with high predicted functionality under a diverse set of metrics, notwithstanding the fact that all genes require the activation of enhancers and their promoters. 4 of our enhancer-driven gene scores (ABC-G, ATAC-distal, EDS-binary, PC-HiC) were explicitly defined based on distal enhancer-gene connections. Using the established EDS-binary (derived from the published Enhancer Domain Score (EDS)24) as a point of reference, we determined that the other 3 gene scores had an average excess overlap of 1.7x with the EDS-binary score (P-values per gene score: 2e-08 to 6e-06; Supplementary Table S7), confirming that they prioritize enhancer-driven genes. 3 of our enhancer-driven scores (eQTL-CTS, Expecto-MVP, SEG-GTEx) were not explicitly defined based on distal enhancer-gene connections. We determined that these 3 gene scores had an average excess overlap of 1.5x with the EDS-binary score (P-values per gene score: 4e-07 to 1e-04; Supplementary Table S7), confirming that they prioritize enhancer-driven genes; notably, the excess overlap of 1.5x was almost as large as the excess overlap of 1.7x for gene scores defined based on distal enhancer-gene connections.
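As an illustration of the excess overlap statistic used above, the following sketch assumes one plausible definition: the fraction of reference-set genes that fall in the gene score, divided by the fraction of all genes that fall in the gene score. The exact definition is given in Methods and may differ; the function name and toy gene identifiers are hypothetical.

```python
from typing import Iterable


def excess_overlap(
    gene_set: Iterable[str],
    reference_set: Iterable[str],
    all_genes: Iterable[str],
) -> float:
    """Ratio of (fraction of reference genes in gene_set) to
    (fraction of all genes in gene_set); 1.0 means no excess overlap."""
    universe = set(all_genes)
    genes = set(gene_set) & universe
    reference = set(reference_set) & universe
    if not genes or not reference:
        raise ValueError("gene_set and reference_set must overlap the gene universe")
    frac_in_reference = len(genes & reference) / len(reference)
    frac_overall = len(genes) / len(universe)
    return frac_in_reference / frac_overall


# Toy example (hypothetical gene names): 20 genes, a 20% gene set,
# and a reference set of 5 genes, 2 of which are in the gene set.
all_genes = [f"g{i}" for i in range(20)]
gene_set = ["g0", "g1", "g2", "g3"]
reference_set = ["g0", "g1", "g10", "g11", "g12"]
toy_excess = excess_overlap(gene_set, reference_set, all_genes)
```

Here 40% of the reference genes are in the gene set, compared with 20% of all genes, giving an excess overlap of 2.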
We combined the 7 Enhancer-driven gene scores with the 10 S2G strategies (Table 2) to define 70 annotations. In our marginal analysis using S-LDSC conditional on the baseline-LD+ model (meta-analyzing S-LDSC results across 11 autoimmune diseases and blood cell traits), all 70 enhancer-driven annotations were significantly enriched for disease heritability, with larger enrichments for smaller annotations (Supplementary Figure S5 and Supplementary Table S8); values of standardized enrichment were more similar across annotations (Supplementary Figure S6 and Supplementary Table S9). 37 of the 70 enhancer-driven annotations attained conditionally significant τ* values after Bonferroni correction (p < 0.05/110) (Figure 3A and Supplementary Table S8). We observed the strongest conditional signal for ATAC-distal × ABC (τ* = 1.0 ± 0.2). ATAC-distal is defined by the proportion of mouse gene expression variability across blood cell types that is explained by distal ATAC-seq peaks in mouse48; the mouse genes are mapped to orthologous human genes. 4 of the 7 gene scores (ABC-G, ATAC-distal, EDS-binary and SEG-GTEx) produced strong conditional signals across many S2G strategies; however, none of them attained Bonferroni-significant τ* for all 10 S2G strategies (Figure 3A). Among the S2G strategies, average conditional signals were strongest for the ABC strategy (average τ* = 0.59) and TSS strategy (average τ* = 0.52), which greatly outperformed the window-based S2G strategies (average τ* = 0.04-0.07), emphasizing the high added value of S2G strategies incorporating functional data (especially the ABC and TSS strategies).
We compared meta-analyses of S-LDSC results across 6 autoimmune diseases vs. 5 blood cell traits (Figure 3B, Supplementary Figure S7, Supplementary Table S1, Supplementary Table S10, and Supplementary Table S11). Results were broadly concordant (r = 0.57 between τ* estimates), with slightly stronger signals for autoimmune diseases (slope = 1.3). We also compared meta-analyses of results across 2 granulocyte-related blood cell traits vs. 3 red blood cell or platelet-related blood cell traits (Supplementary Figure S8, Supplementary Table S12, and Supplementary Table S13). Results were broadly concordant (r = 0.65, slope = 1.1). We also examined S-LDSC results for individual diseases/traits and applied a test for heterogeneity49 (Supplementary Figure S9, Supplementary Figure S10, Supplementary Table S14, Supplementary Table S15). Results were generally underpowered (FDR<5% for 16 of 770 annotation-trait pairs), with limited evidence of heterogeneity across diseases/traits (FDR<5% for 11 of 70 annotations).
We jointly analyzed the 37 enhancer-driven annotations that were Bonferroni-significant in our marginal analysis (Figure 3A and Supplementary Table S8) by performing forward stepwise elimination to iteratively remove annotations that had conditionally non-significant τ* values after Bonferroni correction. Of these, 6 annotations were jointly significant in the resulting Enhancer-driven joint model (Supplementary Figure S11 and Supplementary Table S16), corresponding to 4 Enhancer-driven gene scores: ABC-G, ATAC-distal, EDS-binary and SEG-GTEx.
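The iterative structure of this joint analysis can be sketched as follows. This is a schematic only, under two stated assumptions: that the joint model is refit after every removal, and that one annotation (the least significant) is removed per iteration. The `fit` callable is a hypothetical stand-in for running S-LDSC on the retained annotations together with the baseline annotations and returning a p-value for each annotation's τ*; it is not part of any released software.

```python
from typing import Callable, Dict, List


def stepwise_elimination(
    annotations: List[str],
    fit: Callable[[List[str]], Dict[str, float]],
    alpha: float,
    n_tests: int,
) -> List[str]:
    """Iteratively drop the least significant annotation until every
    retained annotation passes the Bonferroni threshold alpha / n_tests."""
    threshold = alpha / n_tests
    kept = list(annotations)
    while kept:
        p_values = fit(kept)  # refit the joint model on the retained set
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] <= threshold:
            break  # all retained annotations are jointly significant
        kept.remove(worst)
    return kept


def toy_fit(kept: List[str]) -> Dict[str, float]:
    """Hypothetical p-values: C becomes significant only once the
    correlated annotation B has been removed from the model."""
    p_values = {"A": 1e-6, "B": 0.2, "C": 0.03 if "B" in kept else 1e-5}
    return {name: p_values[name] for name in kept}


toy_joint_model = stepwise_elimination(["A", "B", "C"], toy_fit, alpha=0.05, n_tests=3)
```

The toy example shows why refitting matters: the significance of one annotation can change once a correlated annotation is dropped from the joint model.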
We assessed the enrichment of the 7 Enhancer-driven gene scores (Table 1) in 5 “gold-standard” disease-related gene sets (drug target genes10, 43, Mendelian genes (Freund)44, Mendelian genes (Vuckovic)45, immune genes46, and high-pLI genes47) (Figure 3C and Supplementary Table S6). 6 of the 7 gene scores were significantly enriched (after Bonferroni correction; p < 0.05/55) in the drug target genes, all 7 were significantly enriched in both Mendelian gene sets, 3 of 7 were significantly enriched in the immune genes, and 5 of 7 were significantly enriched in the high-pLI genes. The largest enrichment was observed for SEG-GTEx genes in the drug target genes (2.4x, s.e. 0.1) and Mendelian genes (Freund) (2.4x, s.e. 0.1). These findings validate the high importance to disease of enhancer-driven genes.
We performed 5 secondary analyses. First, for each of the 6 annotations from the Enhancer-driven joint model (Supplementary Figure S11), we assessed their functional enrichment for fine-mapped SNPs for blood-related traits from two previous studies50, 51. We observed large and significant enrichments for all 6 annotations (Supplementary Table S17), consistent with the S-LDSC results. Second, for each of the 7 Enhancer-driven gene scores, we performed pathway enrichment analyses to assess their enrichment in pathways from the ConsensusPathDB database52; all 7 gene scores were significantly enriched in immune-related and signaling pathways (Supplementary Table S18). Third, we explored other approaches to combining information across genes that are linked to a SNP using S2G strategies, by using either the mean across genes or the sum across genes of the gene scores linked to a SNP, instead of the maximum across genes. We determined that results for either the mean or the sum were very similar to the results for the maximum, with no significant difference in standardized effect sizes of the resulting SNP annotations (Supplementary Table S8, Supplementary Table S19, Supplementary Table S20). Fourth, we repeated our analyses of the 5 enhancer-driven gene scores for which the top 10% (of genes) threshold was applied, using top 5% or top 20% thresholds instead (Supplementary Table S21 and Supplementary Table S22). We observed very similar results, with largely non-significant differences in standardized effect sizes. Fifth, we confirmed that our forward stepwise elimination procedure produced identical results when applied to all 70 enhancer-driven annotations, instead of just the 37 enhancer-driven annotations that were Bonferroni-significant in our marginal analysis.
We conclude that 4 of the 7 characterizations of Enhancer-driven genes are conditionally informative for autoimmune diseases and blood-related traits when using functionally informed S2G strategies.
Genes with high network connectivity to Enhancer-driven genes are even more informative
We assessed the disease informativeness of a gene score prioritizing genes with high connectivity to Enhancer-driven genes in a protein-protein interaction (PPI) network (PPI-enhancer). We hypothesized that (i) genes that are connected to Enhancer-driven genes in biological networks are likely to be important, and that (ii) combining potentially noisy metrics defining enhancer-driven genes would increase statistical signal. We used the STRING PPI network53 to quantify the network connectivity of each gene with respect to each of the 4 jointly informative Enhancer-driven gene scores from Supplementary Figure S11 (ABC-G, ATAC-distal, EDS-binary and SEG-GTEx) (Figure 1D). Network connectivity scores were computed using a random walk with restart algorithm10, 54 (see Methods). We defined the PPI-enhancer gene score based on genes in the top 10% of average network connectivity across the 4 Enhancer-driven gene scores (Table 1). The PPI-enhancer gene score was only moderately positively correlated with the 4 underlying Enhancer-driven gene scores (average r = 0.28; Supplementary Figure S2).
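A random walk with restart on a network can be sketched generically as below. This is a minimal illustration of the technique rather than the procedure described in Methods: the restart probability of 0.5, the column normalization of the adjacency matrix, the convergence tolerance, and the 4-gene toy network are all assumptions made for this example.

```python
import numpy as np


def random_walk_with_restart(
    adjacency: np.ndarray,
    seeds: np.ndarray,
    restart: float = 0.5,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> np.ndarray:
    """Return the stationary visiting probability of each node for a walker
    that, at each step, restarts at the seed nodes with probability `restart`."""
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # avoid division by zero for isolated nodes
    transition = adjacency / col_sums  # column-stochastic transition matrix
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()  # restart distribution over seed genes
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (transition @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy network: a path of 4 genes (0-1-2-3), with gene 0 as the only seed.
toy_adjacency = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
)
toy_seeds = np.array([1, 0, 0, 0])
connectivity = random_walk_with_restart(toy_adjacency, toy_seeds)
```

In this toy network the connectivity score decreases with distance from the seed gene. A score analogous to PPI-enhancer could then be formed by averaging such connectivity vectors across several seed gene sets and taking the top 10% of genes.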
We combined the PPI-enhancer gene score with the 10 S2G strategies (Table 2) to define 10 annotations. In our marginal analysis using S-LDSC (meta-analyzing S-LDSC results across 11 autoimmune diseases and blood cell traits), all 10 PPI-enhancer annotations were significantly enriched for disease heritability, with larger enrichments for smaller annotations (Supplementary Figure S5 and Supplementary Table S23); values of standardized enrichment were more similar across annotations (Supplementary Figure S6 and Supplementary Table S24). All 10 PPI-enhancer annotations attained conditionally significant τ* values after Bonferroni correction (p < 0.05/110) (Figure 3A and Supplementary Table S23). Notably, the maximum τ* (2.0 (s.e. 0.3) for PPI-enhancer x ABC) was >2x larger than the maximum τ* for the recently proposed Enhancer Domain Score24 (EDS) (0.91 (s.e. 0.21) for EDS-binary x ABC). All 10 PPI-enhancer annotations remained significant when conditioned on the Enhancer-driven joint model from Supplementary Figure S11 (Supplementary Table S25). In a comparison of meta-analyses of S-LDSC results across 5 blood cell traits vs. 6 autoimmune diseases, results were broadly concordant (r = 0.93 between τ* estimates), but with much stronger signals for autoimmune diseases (slope=2.2) (Figure 3B, Supplementary Figure S7, Supplementary Table S10, and Supplementary Table S11). In a comparison of meta-analyses across 2 granulocyte-related blood cell traits vs. 3 red blood cell or platelet-related blood cell traits, results were broadly concordant (r = 0.83), but with much stronger signals for granulocyte-related blood cell traits (slope = 2.1) (Supplementary Figure S8 and Supplementary Tables S12, S13). 
In analyses of individual traits, 62 of 110 PPI-enhancer annotation-trait pairs were significant (FDR<5%) (Supplementary Figure S9, Supplementary Figure S10, Supplementary Table S14), with evidence of heterogeneity across diseases/traits (FDR<5% for 8 of 10 PPI-enhancer annotations) (Supplementary Table S15).
We jointly analyzed the 6 Enhancer-driven annotations from the Enhancer-driven joint model (Supplementary Figure S11) and the 10 marginally significant PPI-enhancer annotations conditional on the enhancer-driven joint model in Supplementary Table S25. Of these, 3 Enhancer-driven and 4 PPI-enhancer annotations were jointly significant in the resulting PPI-enhancer-driven joint model (Figure 3D and Supplementary Table S26). The joint signal was strongest for PPI-enhancer × ABC (τ* = 1.2±0.21), highlighting the informativeness of the ABC S2G strategy. 3 of the 7 annotations attained τ* > 0.5; annotations with τ* > 0.5 are unusual, and considered to be important39.
We assessed the enrichment of the PPI-enhancer gene score in the 5 “gold-standard” disease-related gene sets (drug target genes10, 43, Mendelian genes (Freund)44, Mendelian genes (Vuckovic)45, immune genes46, and high-pLI genes47) (Figure 3C and Supplementary Table S6). The PPI-enhancer gene score showed significant enrichment in all 5 gene sets, with a higher magnitude of enrichment than any of the 7 Enhancer-driven gene scores. In particular, the PPI-enhancer gene score was 5.3x (s.e. 0.1) enriched in drug target genes and 4.6x (s.e. 0.1) enriched in Mendelian genes (Freund), a ≥2x stronger enrichment in each case than the EDS-binary gene score24 (2.1x (s.e. 0.1) and 2.3x (s.e. 0.1)). These findings validate the high importance to disease of the PPI-enhancer gene score.
We sought to assess whether the PPI-enhancer disease signal derives from (i) the information in the PPI network or (ii) the improved signal-to-noise of combining different Enhancer-driven gene scores (see above). To assess this, we constructed an optimally weighted linear combination of the 4 enhancer-driven scores from Figure 3D, without using PPI network information (Weighted-enhancer; see Methods). We repeated the above analyses using Weighted-enhancer instead of PPI-enhancer. We determined that marginal τ* values were considerably lower for Weighted-enhancer vs. PPI-enhancer annotations (0.65x, Bootstrap p: 3.4e-09 for 1,000 re-samples per annotation, Supplementary Table S27 vs. Supplementary Table S23). (In addition, none of the Weighted-enhancer annotations were significant conditional on the PPI-enhancer-driven joint model from Figure 3D; see Supplementary Table S28). This confirms that the additional PPI-enhancer signal derives from the information in the PPI network. To verify that the PPI-enhancer disease signal is driven not just by the PPI network but also by the input gene scores, we defined a new gene score analogous to PPI-enhancer but using 4 randomly generated binary gene sets of size 10% as input (PPI-control). We determined that marginal τ* values were much lower for PPI-control vs. PPI-enhancer annotations (0.52x, Bootstrap p-value: 4.4e-16 over 1,000 re-samples per annotation, Supplementary Table S29 vs. Supplementary Table S23). (In addition, none of the PPI-control annotations were significant conditional on the PPI-enhancer-driven joint model from Figure 3D; see Supplementary Table S30). This confirms that the PPI-enhancer disease signal is driven not just by the PPI network but also by the input gene scores.
We performed 4 secondary analyses. First, we defined a new gene score (RegNet-Enhancer) using the regulatory network from ref.55 instead of the STRING PPI network, and repeated the above analyses. We determined that the STRING PPI network and the RegNet regulatory network are similarly informative (Supplementary Table S31 and Supplementary Table S32); we elected to use the STRING PPI network in our main analyses because the RegNet regulatory network uses GTEx expression data, which is also used by the SEG-GTEx gene score, complicating interpretation of the results. Second, for each of the 4 jointly significant PPI-enhancer annotations from Figure 3D, we assessed their functional enrichment for fine-mapped SNPs for blood-related traits from two previous studies50, 51. We observed large and significant enrichments for all 4 annotations (Supplementary Table S17), consistent with the S-LDSC results (and with the similar analysis of Enhancer-driven annotations described above). Third, we performed a pathway enrichment analysis to assess the enrichment of the PPI-enhancer gene score in pathways from the ConsensusPathDB database52; this gene score was enriched in immune-related pathways (Supplementary Table S18). Fourth, we confirmed that our forward stepwise elimination procedure produced identical results when applied to all 80 enhancer-driven and PPI-enhancer annotations, instead of just the 6 Enhancer-driven annotations from the Enhancer-driven joint model (Supplementary Figure S11) and the 10 PPI-enhancer annotations.
We conclude that genes with high network connectivity to Enhancer-driven genes are conditionally informative for autoimmune diseases and blood-related traits when using functionally informed S2G strategies.
Master-regulator genes are conditionally informative for autoimmune disease heritability
We assessed the disease informativeness of two gene scores prioritizing Master-regulator genes in blood. We defined these gene scores using whole blood eQTL data from the eQTLGen consortium56 (Trans-master) and a published list of known transcription factors in humans57 (TF) (Figure 1C, Table 1 and Methods). We note that TF genes do not necessarily act as master regulators, but can be viewed as candidate master regulators. Using 97 known master-regulator genes from 18 master-regulator families58–62 as a point of reference, we determined that Trans-master and TF genes had 3.5x and 5.4x excess overlaps with the 97 master-regulator genes (P-values: 5.6e-72 and 2.2e-160; Supplementary Table S33 and Supplementary Table S34), confirming that they prioritize master-regulator genes.
In detail, Trans-master is a binary gene score defined by genes that significantly regulate 3 or more other genes in trans via SNPs that are significant cis-eQTLs of the focal gene (10% of genes); the median value of the number of genes trans-regulated by a Trans-master gene is 14. Notably, trans-eQTL data from the eQTLGen consortium56 was only available for 10,317 previously disease-associated SNPs. It is possible that genes with significant cis-eQTL that are disease-associated SNPs may be enriched for disease heritability irrespective of trans signals. To account for this gene-level bias, we conditioned all analyses of Trans-master annotations on both (i) 10 annotations based on a gene score defined by genes with at least 1 disease-associated cis-eQTL, combined with each of the 10 S2G strategies, and (ii) 10 annotations based on a gene score defined by genes with at least 3 unlinked disease-associated cis-eQTL, combined with each of the 10 S2G strategies; we chose the number 3 to maximize the correlation between this gene score and the Trans-master gene score (r = 0.32). Thus, our primary analyses were conditioned on 93 baseline-LD+ and 20 additional annotations (113 baseline-LD+cis model annotations); additional secondary analyses are described below. We did not consider a SNP annotation defined by trans-eQTLs, because the trans-eQTLs in eQTLGen data were restricted to disease-associated SNPs, which would bias our results.
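The definition of the Trans-master gene score can be illustrated with a small sketch. It assumes that significant cis-eQTL and trans-eQTL associations are supplied as (SNP, gene) pairs with significance thresholds already applied, and it uses hypothetical identifiers; it does not reproduce the eQTLGen data format or the conditioning on disease-associated cis-eQTL described above.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def trans_master_genes(
    cis_eqtls: Iterable[Tuple[str, str]],
    trans_eqtls: Iterable[Tuple[str, str]],
    min_targets: int = 3,
) -> Set[str]:
    """Genes whose cis-eQTL SNPs are trans-eQTLs for at least
    `min_targets` distinct other genes."""
    # Map each SNP to the set of genes it regulates in trans.
    snp_to_trans_genes = defaultdict(set)
    for snp, gene in trans_eqtls:
        snp_to_trans_genes[snp].add(gene)

    # For each focal gene, collect trans-regulated genes across its cis-eQTL SNPs.
    trans_targets = defaultdict(set)
    for snp, focal_gene in cis_eqtls:
        trans_targets[focal_gene] |= snp_to_trans_genes.get(snp, set()) - {focal_gene}

    return {gene for gene, targets in trans_targets.items() if len(targets) >= min_targets}


# Toy example (hypothetical SNP and gene identifiers).
cis_pairs = [("rs1", "G1"), ("rs2", "G1"), ("rs3", "G2")]
trans_pairs = [("rs1", "T1"), ("rs1", "T2"), ("rs2", "T3"), ("rs3", "T1")]
masters = trans_master_genes(cis_pairs, trans_pairs)
```

In the toy example, G1 regulates 3 distinct genes in trans via its two cis-eQTL SNPs and is therefore flagged, whereas G2 regulates only 1.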
We combined the Trans-master gene score with the 10 S2G strategies (Table 2) to define 10 annotations. In our marginal analysis using S-LDSC conditional on the baseline-LD+cis model, all 10 Trans-master annotations were strongly and significantly enriched for disease heritability, with larger enrichments for smaller annotations (Supplementary Figure S5 and Supplementary Table S35); values of standardized enrichment were more similar across annotations (Supplementary Figure S6 and Supplementary Table S36). All 10 Trans-master annotations attained conditionally significant τ* values after Bonferroni correction (p < 0.05/110) (Figure 4A and Supplementary Table S35). We observed the strongest conditional signals for Trans-master × TSS (τ* = 1.6, vs. τ* = 0.37-0.39 for Master-regulator × window-based S2G strategies). We observed similar (slightly more significant) results when conditioning on baseline-LD+ annotations only (Supplementary Table S37).
As noted above, trans-eQTL data from the eQTLGen consortium56 was only available for 10,317 previously disease-associated SNPs, and we thus defined and conditioned on baseline-LD+cis model annotations to account for gene-level bias. We verified that conditioning on annotations derived from gene scores defined by other minimum numbers of cis-eQTL and/or unlinked cis-eQTL produced similar results (Supplementary Table S38, Supplementary Table S39, Supplementary Table S40, Supplementary Table S41, Supplementary Table S42). To verify that our results were not impacted by SNP-level bias, we adjusted each of the 10 Trans-master annotations by removing all disease-associated trans-eQTL SNPs in the eQTLGen data from the annotation, as well as any linked SNPs (Methods). We verified that these adjusted annotations produced similar results (Supplementary Table S43).
TF is a binary gene score defined by a published list of 1,639 known transcription factors in humans57. We combined TF with the 10 S2G strategies (Table 2) to define 10 annotations. In our marginal analysis conditional on the baseline-LD+cis model, all 10 TF annotations were significantly enriched for heritability, but with smaller enrichments than the Trans-master annotations (Supplementary Table S35); see Supplementary Table S36 for standardized enrichments. 9 TF annotations attained significant τ* values after Bonferroni correction (Figure 4A and Supplementary Table S35) (the same 9 annotations were also significant conditional on the baseline-LD+ model; Supplementary Table S37). Across all S2G strategies, τ* values of Trans-master annotations were larger than those of TF annotations (Supplementary Table S35).
We compared meta-analyses of S-LDSC results across 6 autoimmune diseases vs. 5 blood cell traits (Figure 4B, Supplementary Figure S7, Supplementary Table S1, Supplementary Table S44, Supplementary Table S45). Results were broadly concordant (r = 0.56 between τ* estimates), with slightly stronger signals for blood cell traits (slope = 0.57). We also compared meta-analyses of results across 2 granulocyte-related blood cell traits vs. 3 red blood cell or platelet-related blood cell traits (Supplementary Figure S8, Supplementary Table S46, and Supplementary Table S47). Results were broadly concordant (r = 0.94, slope = 1.12). We also examined S-LDSC results for individual diseases/traits and applied a test for heterogeneity49 (Supplementary Figure S12, Supplementary Figure S13, Supplementary Table S14, Supplementary Table S15). We observed several annotation-trait pairs with disease signal (FDR<5% for 96 of 220 annotation-trait pairs), with evidence of heterogeneity across diseases/traits (FDR<5% for 10 of 20 annotations).
We jointly analyzed the 10 Trans-master and 9 TF annotations that were Bonferroni-significant in our marginal analysis (Figure 4A and Supplementary Table S35) by performing forward stepwise elimination to iteratively remove annotations that had conditionally non-significant τ* values after Bonferroni correction. Of these, 3 Trans-master annotations and 2 TF annotations were jointly significant in the resulting Master-regulator joint model (Supplementary Figure S14 and Supplementary Table S48). The joint signal was strongest for Trans-master × Roadmap (τ* = 0.81, s.e. = 0.13), emphasizing the high added value of the Roadmap S2G strategy.
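The forward stepwise elimination procedure can be illustrated with a minimal sketch. It assumes a hypothetical callback, `fit_joint_model`, that refits the joint S-LDSC model on the annotations currently retained and returns a p-value for the τ* of each one; the rule of dropping the single least significant annotation per round is our assumption for illustration and may differ from the procedure actually used.

```python
from typing import Callable, Dict, List


def stepwise_elimination(
    annotations: List[str],
    fit_joint_model: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
    n_tests: int = 110,
) -> List[str]:
    """Iteratively drop annotations whose conditional tau* is not significant.

    fit_joint_model is a hypothetical callback (not part of any S-LDSC
    release): it refits the joint model on the retained annotations and
    returns a p-value for tau* of each annotation.
    """
    threshold = alpha / n_tests  # Bonferroni threshold, e.g. 0.05/110
    kept = list(annotations)
    while kept:
        p_values = fit_joint_model(kept)
        # Assumption: remove the single least significant annotation per round.
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] < threshold:
            break  # every remaining annotation is jointly significant
        kept.remove(worst)
    return kept
```

Refitting after every removal matters because the τ* of a retained annotation can become significant once a correlated annotation has been dropped.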
We assessed the enrichment of the Trans-master and TF gene scores in the 5 “gold standard” disease-related gene sets (drug target genes10, 43, Mendelian genes (Freund)44, Mendelian genes (Vuckovic)45, immune genes46, and high-pLI genes47) (Figure 4C and Supplementary Table S6). The Trans-master gene score showed higher enrichment in all 5 gene sets compared to the TF gene score. The enrichments for master-regulator genes were lower (1.4x, s.e. 0.07) for drug target genes in comparison to some Enhancer-driven genes and the PPI-enhancer gene score (Figure 3C); this can be attributed to the fact that master-regulator genes may tend to disrupt genes across several pathways, rendering them unsuitable as drug targets.
We performed 7 secondary analyses. First, for comparison purposes, we defined a binary gene score (Trans-regulated) based on genes with at least one significant trans-eQTL. We combined Trans-regulated genes with the 10 S2G strategies to define 10 annotations. In our marginal analysis using S-LDSC conditional on the baseline-LD+cis model, none of the Trans-regulated annotations attained conditionally significant τ* values after Bonferroni correction (p < 0.05/110) (Supplementary Table S49). (In contrast, 3 of the annotations were significant when conditioning only on the baseline-LD+ model (Supplementary Table S50).) Second, a potential complexity is that trans-eQTL in whole blood may be inherently enriched for blood cell trait-associated SNPs (since SNPs that regulate the abundance of a specific blood cell type would result in trans-eQTL effects on genes that are specifically expressed in that cell type56), potentially limiting the generalizability of our results to non-blood cell traits. To ensure that our results were robust to this complexity, we verified that analyses restricted to the 6 autoimmune diseases (Supplementary Table S1) produced similar results (Supplementary Table S51). Third, for each of the 5 annotations from the Master-regulator joint model (Supplementary Figure S14), we assessed their functional enrichment for fine-mapped SNPs for blood-related traits from two previous studies50, 51. We observed large and significant enrichments for all 5 annotations (Supplementary Table S17), consistent with the S-LDSC results (and with similar analyses described above). Fourth, we performed pathway enrichment analyses to assess the enrichment of the Trans-master and TF gene scores in pathways from the ConsensusPathDB database52. The Trans-master gene score was significantly enriched in immune-related pathways (Supplementary Table S18).
Fifth, we explored other approaches to combining information across genes that are linked to a SNP using S2G strategies, by using either the mean across genes or the sum across genes of the gene scores linked to a SNP, instead of the maximum across genes. We determined that results for either the mean or the sum were very similar to the results for the maximum, with no significant difference in standardized effect sizes of the resulting SNP annotations (Supplementary Table S35, Supplementary Table S19 and Supplementary Table S20). Sixth, we repeated our analyses of the Trans-master gene score, defined in our primary analyses based on 2,215 genes that trans-regulate ≥ 3 genes, using either 3,717 genes that trans-regulate ≥ 1 gene (most of which trans-regulate multiple genes) or 1,170 genes that trans-regulate ≥ 10 genes (Supplementary Table S52). We observed very similar results, with largely non-significant differences in standardized effect sizes. Seventh, we confirmed that our forward stepwise elimination procedure produced identical results when applied to all 20 master-regulator annotations, instead of just the 19 master-regulator annotations that were Bonferroni-significant in our marginal analysis.
We conclude that master-regulator genes are conditionally informative for autoimmune diseases and blood-related traits when using functionally informed S2G strategies.
Genes with high network connectivity to Master-regulator genes are even more informative
We assessed the disease informativeness of a gene score prioritizing genes with high connectivity to Master-regulator genes in the STRING PPI network53 (PPI-master, analogous to PPI-enhancer; see Methods and Table 1). The PPI-master gene score was positively correlated with the 2 underlying master-regulator gene scores (average r = 0.43) and modestly correlated with PPI-enhancer (r=0.22) (Supplementary Figure S2). In addition, it had an excess overlap of 7.2x with the 97 known master-regulator genes58–62 (P = 2e-214; Supplementary Table S33 and Supplementary Table S34).
We combined the PPI-master gene score with the 10 S2G strategies (Table 2) to define 10 annotations. In our marginal analysis using S-LDSC conditional on the baseline-LD+cis model, all 10 PPI-master annotations were significantly enriched for disease heritability, with larger enrichments for smaller annotations (Figure 4A and Supplementary Table S53); values of standardized enrichment were more similar across annotations (Supplementary Figure S6 and Supplementary Table S54). All 10 PPI-master annotations attained conditionally significant τ* values after Bonferroni correction (p < 0.05/110) (Figure 4B and Supplementary Table S53) (as expected, results were similar when conditioning only on the baseline-LD+ model; Supplementary Table S55). We observed the strongest conditional signals for PPI-master combined with TSS (τ*=1.7, s.e. 0.16), Coding (τ*=1.7, s.e. 0.14) and ABC (τ*=1.6, s.e. 0.17) S2G strategies, again emphasizing the high added value of S2G strategies incorporating functional data (Supplementary Table S53). 9 of the 10 PPI-master annotations remained significant when conditioning on the Master-regulator joint model from Supplementary Figure S14 (Supplementary Table S56). In a comparison of meta-analyses of S-LDSC results across 5 blood cell traits vs. 6 autoimmune diseases, results were broadly concordant (r = 0.81 between τ* estimates, slope = 0.93) (Figure 4B, Supplementary Figure S7, Supplementary Table S44, and Supplementary Table S45). In a comparison of meta-analyses across 2 granulocyte-related blood cell traits vs. 3 red blood cell or platelet-related blood cell traits, results were broadly concordant, but with slightly stronger signals for granulocyte-related traits (r = 0.92, slope = 1.3) (Supplementary Figure S8 and Supplementary Tables S46, S47).
In the analyses of individual traits, 101 of 110 PPI-master annotation-trait pairs were significant (FDR<5%) (Supplementary Figure S12, Supplementary Figure S13, Supplementary Table S14), with evidence of heterogeneity across diseases/traits (FDR<5% for 6 of 10 PPI-master annotations) (Supplementary Table S15).
We jointly analyzed the 5 Master-regulator annotations from the Master-regulator joint model (Supplementary Figure S14 and Supplementary Table S48) and the 9 PPI-master annotations significant conditional on the Master-regulator joint model in Supplementary Table S56. Of these, 2 Trans-master and 3 PPI-master annotations were jointly significant in the resulting PPI-master-regulator joint model (Figure 4D and Supplementary Table S57). The joint signal was strongest for PPI-master × Roadmap (τ* = 0.94, s.e. 0.14), and 4 of the 5 annotations attained τ* > 0.5.
We assessed the enrichment of the PPI-master gene score in the 5 “gold standard” disease-related gene sets (drug target genes10, 43, Mendelian genes (Freund)44, Mendelian genes (Vuckovic)45, immune genes46, and high-pLI genes47) (Figure 4C and Supplementary Table S6). The PPI-master gene score showed significant enrichment in all 5 gene sets, with higher magnitude of enrichment compared to either of the master-regulator gene scores. In particular, the PPI-master gene score was 2.7x (s.e. 0.1) enriched in drug target genes and 3.4x (s.e. 0.1) enriched in Mendelian genes (Freund).
We performed 3 secondary analyses. First, for each of the 3 jointly significant PPI-master annotations from Figure 4D, we assessed their functional enrichment for fine-mapped SNPs for blood-related traits from two previous studies50, 51. We observed large and significant enrichments for all 3 annotations (Supplementary Table S17), consistent with the S-LDSC results (and with similar analyses described above). Second, we performed a pathway enrichment analysis to assess the enrichment of the PPI-master gene score in pathways from the ConsensusPathDB database52 and report the top enriched pathways (Supplementary Table S18). Third, we confirmed that our forward stepwise elimination procedure produced identical results when applied to all 30 master-regulator and PPI-master annotations, instead of just the 5 master-regulator annotations from the Master-regulator joint model (Supplementary Figure S14) and the 9 PPI-master annotations that were Bonferroni-significant in our marginal analysis.
We conclude that genes with high network connectivity to Master-regulator genes are conditionally informative for autoimmune diseases and blood-related traits when using functionally informed S2G strategies.
Combined joint model
We constructed a combined joint model containing annotations from the above analyses that were jointly significant, contributing information conditional on all other annotations. We merged the baseline-LD+cis model with annotations from the PPI-enhancer (Figure 3D) and PPI-master (Figure 4D) joint models, and performed forward stepwise elimination to iteratively remove annotations that had conditionally non-significant τ* values after Bonferroni correction (p < 0.05/110). The combined joint model contained 8 new annotations, including 2 Enhancer-driven, 2 PPI-enhancer, 2 Trans-master and 2 PPI-master annotations (Figure 5 and Supplementary Table S58). The joint signals were strongest for PPI-enhancer × ABC (τ* = 0.99, s.e. 0.23) and PPI-master × Roadmap (τ* = 0.91, s.e. 0.12), highlighting the importance of two distal S2G strategies, ABC and Roadmap; 5 of the 8 new annotations attained τ* > 0.5. We defined a new metric quantifying the conditional informativeness of a heritability model (combined τ*, generalizing the combined τ* metric of ref.63 to more than two annotations; see Methods). As expected, the combined joint model attained a larger combined τ* (2.5, s.e. 0.24) than the PPI-enhancer (1.5, s.e. 0.15) or PPI-master (1.9, s.e. 0.14) joint models (Supplementary Figure S15, Supplementary Table S59, Supplementary Table S60, Supplementary Table S61).
We evaluated the combined joint model of Figure 5 (and other models) by computing loglSS64 (an approximate likelihood metric) relative to a model with no functional annotations (ΔloglSS), averaged across a subset of 6 blood-related traits (1 autoimmune disease and 5 blood cell traits) from the UK Biobank (Supplementary Table S1). The combined joint model attained a +12.3% larger ΔloglSS than the baseline-LD model (Supplementary Table S62); most of the improvement derived from the 7 S2G annotations (Figure 2) and the 8 Enhancer-driven and Master-regulator annotations (Figure 5). The combined joint model also attained a 27.2% larger ΔloglSS than the baseline-LD model in a separate analysis of 24 non-blood-related traits from the UK Biobank (Supplementary Table S62; traits listed in Supplementary Table S63), implying that the value of the annotations introduced in this paper is not restricted to autoimmune diseases and blood-related traits. However, the non-blood-related traits had considerably lower absolute ΔloglSS compared to the blood-related traits. Accordingly, in a broader analysis of 36 non-blood-related traits from UK Biobank and non-UK Biobank sources (Supplementary Table S64), meta-analyzed τ* values were considerably lower for non-blood-related traits than for blood-related traits (Supplementary Figure S16, Supplementary Table S65 and Supplementary Table S66).
We investigated the biology of individual loci by examining 1,198 SNPs that were previously confidently fine-mapped (posterior inclusion probability (PIP) > 0.90) for 1 autoimmune disease and 5 blood cell traits from the UK Biobank. As noted above, fine-mapped SNPs from ref.51 were highly enriched for all 8 annotations from the combined joint model (Supplementary Table S17); accordingly, focusing on the 4 highly enriched regulatory annotations from Figure 5 (enrichment 18; 1.5% of SNPs in total), 194 of the 1,198 SNPs belonged to one or more of these 4 annotations. A list of these 194 SNPs is provided in Supplementary Table S67. We highlight 3 notable examples. First, rs231779, a fine-mapped SNP (PIP=0.91) for “All Auto Immune Traits” (Supplementary Table S1), was linked by the ABC S2G strategy to CTLA4, a high-scoring gene for the PPI-enhancer gene score (ranked 109) (Supplementary Figure S17A). CTLA4 acts as an immune checkpoint for activation of T cells and is a key target gene for cancer immunotherapy65–67. Second, rs6908626, a fine-mapped SNP (PIP=0.99) for “All Auto Immune Traits” (Supplementary Table S1), was linked by the Roadmap S2G strategy to BACH2, a high-scoring gene for the Trans-master gene score (ranked 311) (Supplementary Figure S17B). BACH2 is a known master-regulator TF that functions in innate and adaptive lineages to control immune responses68, 69, has been shown to control autoimmunity in mouse knockout studies70, and has been implicated in several autoimmune and allergic diseases including lupus, type 1 diabetes and asthma71–73. Third, rs113473633, a fine-mapped SNP (PIP = 0.99 and PIP = 0.99) for white blood cell (WBC) count and eosinophil count (Supplementary Table S1), was linked by the Roadmap S2G strategy to NFKB1, a high-scoring gene for the Trans-master and PPI-master gene scores (ranked 409 and 111) (Supplementary Figure S17C for WBC count, Supplementary Figure S17D for eosinophil count).
NFKB1 is a major transcription factor involved in immune response74, is critical for development and proliferation of lymphocytes75, 76, and has previously been implicated in blood cell traits45. In each of these examples, we nominate both the causal gene and the SNP-gene link.
We performed 6 secondary analyses. First, we investigated whether the 8 annotations of the combined joint model still contributed unique information after including the pLI gene score47, which has previously been shown to be conditionally informative for disease heritability37, 41, 77. We confirmed that all 8 annotations from Figure 5 remained jointly significant (Supplementary Figure S18 and Supplementary Table S68). Second, we considered integrating PPI network information via a single gene score (PPI-all) instead of two separate gene scores (PPI-enhancer and PPI-master). We determined that the combined joint model derived from PPI-all attained a similar combined τ* (2.5, s.e. 0.22; Supplementary Table S59; see Supplementary Table S69 for individual τ* values) as the combined joint model derived from PPI-enhancer and PPI-master (2.5, s.e. 0.24; Supplementary Table S59), but we believe it is less interpretable. Third, we constructed a less restrictive combined joint model by conditioning on the baseline-LD+ model instead of the baseline-LD+cis model. The less restrictive combined joint model included 1 additional annotation, SEG-GTEx × Coding (Supplementary Table S70). This implies that the combined joint model is largely invariant to conditioning on the baseline-LD+ or baseline-LD+cis model. Fourth, we analyzed binarized versions of all 11 gene scores (Table 1) using MAGMA78, an alternative gene set analysis method. 9 of the 11 gene scores produced significant signals (Supplementary Table S71), vs. 11 marginally significant gene scores (Figure 3 and Figure 4) and 5 gene scores included in the combined joint model of Figure 5 in the S-LDSC analysis. However, MAGMA does not allow for conditioning on the baseline-LD model, does not allow for joint analysis of multiple gene scores to assess joint significance, and does not allow for incorporation of S2G strategies.
Fifth, we confirmed that our forward stepwise elimination procedure produced identical results when applied to all 110 enhancer-driven, master-regulator, PPI-enhancer and PPI-master annotations, instead of just the 12 annotations from the PPI-enhancer (Figure 3D) and PPI-master (Figure 4D) joint models. Sixth, we assessed the model fit of the final joint model by correlating the residuals from stratified LD score regression with the independent variables in the regression (annotation-specific LD scores) for each of the 11 blood-related traits (Supplementary Figure S19). We observed an average squared correlation of 0.02 across annotation-specific LD scores and traits, suggesting good model fit.
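The model-fit check in the sixth analysis can be sketched as follows. This is a minimal illustration, assuming the regression residuals and the annotation-specific LD scores are already available as plain lists of equal length; the function names and the input layout are hypothetical and are not taken from the S-LDSC software.

```python
from typing import Dict, Sequence


def squared_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov * cov / (var_x * var_y)


def mean_squared_correlation(
    residuals: Sequence[float], ld_scores: Dict[str, Sequence[float]]
) -> float:
    """Average r^2 between residuals and each annotation-specific LD score."""
    values = [squared_correlation(residuals, col) for col in ld_scores.values()]
    return sum(values) / len(values)
```

In this sketch, a value near 0 means the residuals carry little remaining linear signal from any annotation-specific LD score.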
We conclude that both Enhancer-driven genes and Master-regulator genes, as well as genes with high network connectivity to those genes, are jointly informative for autoimmune diseases and blood-related traits when using functionally informed S2G strategies.
Discussion
We have assessed the contribution to autoimmune disease of Enhancer-driven genes and Master-regulator genes, incorporating PPI network information and 10 functionally informed S2G strategies. We determined that our characterizations of Enhancer-driven and Master-regulator genes, informed by PPI networks, identify gene sets that are important for autoimmune disease, and that combining those gene sets with functionally informed S2G strategies enables us to identify SNP annotations in which disease heritability is concentrated. The conditional informativeness of SNP annotations derived from these characterizations implies that these gene sets and SNP annotations provide information about disease that is different from other available sources of information. The gene scores analyzed in this study were enriched for 5 gold-standard disease-related gene sets, including autoimmune disease drug targets and Mendelian genes related to immune dysregulation, providing further evidence that they expand and enhance our understanding of which genes impact autoimmune disease. In particular, our PPI-enhancer gene score produced stronger signals than the recently proposed Enhancer Domain Score24 (EDS). Our primary results were meta-analyzed across 11 autoimmune diseases and blood-related traits; we determined that results of meta-analyses across 6 autoimmune diseases and meta-analyses across 5 blood cell traits were quite similar, and that analyses of individual diseases/traits were generally underpowered.
Our work has several downstream implications. First, the PPI-enhancer gene score, which attained a particularly strong enrichment for approved autoimmune disease drug targets, will aid prioritization of drug targets that share similar characteristics as previously discovered drugs, analogous to pLI47 and LOEUF79, 80. Second, our gene scores and S2G strategies help nominate specific gene-linked regions for functional validation using CRISPRi experiments. In particular, our results implicate ABC and Roadmap strategies as highly informative distal S2G strategies and TSS as a highly informative proximal S2G strategy, motivating the use of these specific S2G strategies in functional validation experiments. Third, our framework for disease heritability analysis incorporating S2G strategies (instead of conventional window-based approaches) will be broadly applicable to gene sets derived from other types of data, as in our more recent work81. Fourth, at the level of genes, our findings have immediate potential for improving gene-level probabilistic fine-mapping of transcriptome-wide association studies82 and gene-based association statistics83. Fifth, at the level of SNPs, SNPs in regions linked to genes representing disease informative gene scores have immediate potential for improving functionally informed fine-mapping51, 84–86 (including experimental followup87), polygenic localization51, and polygenic risk prediction88, 89.
Our work has several limitations, representing important directions for future research. First, we restricted our analyses to Enhancer-driven and Master-regulator genes in blood, focusing on autoimmune diseases and blood-related traits; this choice was primarily motivated by the better representation of blood cell types in functional genomics assays and trans-eQTL studies. However, it will be valuable to extend our analyses to other tissues and traits as more functional data becomes available. Second, the trans-eQTL data from eQTLGen consortium56 is restricted to 10,317 previously disease-associated SNPs; we modified our analyses to account for this bias. However, it would be valuable to extend our analyses to genome-wide trans-eQTL data at large sample sizes, if that data becomes available. Third, we investigated the 10 S2G strategies separately, instead of constructing a single optimal combined strategy. A comprehensive evaluation of S2G strategies, and a method to combine them, will be provided elsewhere (S. Gazal, unpublished data). Fourth, the forward stepwise elimination procedure that we use to identify jointly significant annotations29 is a heuristic procedure whose choice of prioritized annotations may be close to arbitrary in the case of highly correlated annotations; however, the correlations between the gene scores, S2G strategies, and annotations that we analyzed were modest. Fifth, the potential of the gene scores implicated in this study to aid prioritization of future drug targets—based on observed gene-level enrichments for approved autoimmune disease drug targets—is subject to the limitation that novel drug targets that do not adhere to existing patterns may be missed; encouragingly, we also identify gene-level enrichments of the gene scores implicated in this study for 4 other “gold-standard” disease-related gene sets. 
Despite all these limitations, our findings expand and enhance our understanding of which gene-level characterizations of enhancer-driven and master-regulatory architecture and their corresponding gene-linked regions impact autoimmune diseases.
Methods
Genomic annotations and the baseline-LD model
We define an annotation as an assignment of a numeric value to each SNP in a predefined reference panel (e.g., 1000 Genomes Project31; see URLs). Binary annotations can have value 0 or 1 only. Continuous-valued annotations can have any real value; our focus is on continuous-valued annotations with values between 0 and 1. Annotations that correspond to known or predicted function are referred to as functional annotations. The baseline-LD model (v.2.1) contains 86 functional annotations (see URLs). These annotations include binary coding, conserved, and regulatory annotations (e.g., promoter, enhancer, histone marks, TFBS) and continuous-valued linkage disequilibrium (LD)-related annotations.
Gene Scores
We define a gene score as an assignment of a numeric value between 0 and 1 to each gene; we primarily focus on binary gene sets defined by the top 10% of genes. We analyze a total of 11 gene scores (Table 1): 7 Enhancer-driven gene scores, 2 Master-regulator gene scores and 2 PPI-based gene scores (PPI-master, PPI-enhancer) that aggregate information across Enhancer-driven and Master-regulator gene scores. We scored 22,020 genes on chromosomes 1-22 from ref.7 (see URLs). When selecting the top 10% of genes for a given score, we rounded the number of genes to 2,200. We used the top 10% of genes in our primary analyses to be consistent with previous work9, which also defined gene scores using the top 10% of genes for a given metric, and to ensure that all SNP annotations (gene scores × S2G strategies) analyzed were of reasonable size (0.2% of SNPs or larger).
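As a minimal sketch of this binarization, assume a hypothetical dictionary mapping each gene to a raw metric. Rounding the set size to the nearest 100 (so that 10% of 22,020 genes gives 2,200) and breaking ties alphabetically are our assumptions for illustration.

```python
from typing import Dict


def n_top_genes(n_genes: int, fraction: float = 0.10, round_to: int = 100) -> int:
    """Size of the top set, rounded to the nearest multiple of round_to."""
    return int(round(n_genes * fraction / round_to)) * round_to


def binary_gene_score(raw_scores: Dict[str, float], n_top: int) -> Dict[str, int]:
    """Assign 1 to the n_top highest-scoring genes and 0 to all others."""
    # Sort by decreasing score; ties are broken by gene name (an assumption).
    ranked = sorted(raw_scores, key=lambda gene: (-raw_scores[gene], gene))
    top = set(ranked[:n_top])
    return {gene: 1 if gene in top else 0 for gene in raw_scores}
```

For example, `binary_gene_score(raw_scores, n_top_genes(len(raw_scores)))` would give a binary gene score of the kind described above.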
The 7 Enhancer-driven gene scores are as follows:
ABC-G: A binary gene score denoting genes that are in top 10% of the number of 'intergenic' and 'genic' Activity-by-Contact (ABC) enhancer to gene links in blood cell types, with average HiC score fraction > 0.015 (ref.32; see URLs).
ATAC-distal: A probabilistic gene score denoting the proportion of gene expression variance in 86 immune cell types in mouse that is explained by the patterns of chromatin covariance of distal enhancer OCRs (open chromatin regions) to the gene, compared to chromatin covariance of OCRs that are near the TSS of the gene and unexplained variances (see Figure 2 of ref.48). The genes were mapped to their human orthologs using Ensembl biomaRt90.
EDS-binary: A binary gene score denoting genes that are in top 10% of the blood-specific Activity-based Enhancer Domain Score (EDS)24 that reflects the number of conserved bases in enhancers that are linked to genes in blood related cell types as per the Roadmap Epigenomics Project91, 92 (see URLs).
eQTL-CTS: A probabilistic gene score denoting the proportion of immune cell-type-specific eQTLs (with FDR adjusted p-value < 0.05 in one or two cell-types) across 15 different immune cell-types from the DICEdb project93 (see URLs).
Expecto-MVP: A binary gene score denoting genes that are in top 10% in terms of the magnitude of variation potential (MVP) in GTEx Whole Blood, which is the sum of the absolute values of all directional mutation effects within 1kb of the TSS, as evaluated by the Expecto method7 (see URLs).
PC-HiC-distal: A binary gene score denoting genes that are in top 10% in terms of the total number of Promoter-capture HiC connections across 17 primary blood cell-types.
SEG-GTEx: A binary gene score denoting genes that are in top 10% in terms of the SEG t-statistic9 score in GTEx Whole Blood.
The 2 Master-regulator gene scores are as follows:
Trans-master: A binary gene score denoting genes with significant trait-associated cis-eQTLs in blood that also act as significant trans-eQTLs for at least 3 other genes based on data from eQTLGen Consortium56. We used the threshold of trans-regulating ≥3 genes in our primary analyses because this results in a gene score spanning ≈10% of genes, analogous to other gene scores.
TF: A binary gene score denoting genes that act as human transcription factors57.
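The Trans-master definition can be sketched as a simple counting step. The input format is an assumption: a list of hypothetical (cis gene, trans gene) pairs, where each pair records that a significant cis-eQTL SNP of the first gene is also a significant trans-eQTL for the second gene.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def trans_master_genes(
    cis_trans_pairs: Iterable[Tuple[str, str]], min_targets: int = 3
) -> Set[str]:
    """Genes that trans-regulate at least min_targets distinct other genes."""
    targets = defaultdict(set)
    for cis_gene, trans_gene in cis_trans_pairs:
        if cis_gene != trans_gene:  # count only other genes
            targets[cis_gene].add(trans_gene)
    return {gene for gene, hit in targets.items() if len(hit) >= min_targets}
```

Setting `min_targets` to 1 or 10 corresponds to the alternative thresholds considered in the secondary analyses.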
The 2 PPI-based gene scores are as follows:
PPI-enhancer: A binary gene score denoting genes in top 10% in terms of closeness centrality measure to the disease informative enhancer-regulated gene scores. To obtain the closeness centrality metric, we first ran a Random Walk with Restart (RWR) algorithm54 on the STRING protein-protein interaction (PPI) network53, 94 (see URLs) with seed nodes defined by genes in top 10% of the 4 enhancer-regulated gene scores with jointly significant disease informativeness (ABC-G, ATAC-distal, EDS-binary and SEG-GTEx). The closeness centrality score was defined as the average network connectivity of the protein products from each gene based on the RWR method.
PPI-master: A binary gene score denoting genes in top 10% in terms of closeness centrality measure to the 2 disease informative master-regulator gene scores (Trans-master and TF). The algorithm was the same as that used for PPI-enhancer.
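A generic Random Walk with Restart can be sketched as below. This is an illustration of the standard algorithm rather than the exact scoring used for PPI-enhancer and PPI-master: the unweighted undirected edges, the restart probability of 0.5, the convergence tolerance and the function name are all assumptions.

```python
from typing import Dict, Iterable, Tuple


def random_walk_with_restart(
    edges: Iterable[Tuple[str, str]],
    seeds: Iterable[str],
    restart: float = 0.5,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> Dict[str, float]:
    """Stationary visiting probabilities of a walk restarting at seed nodes."""
    neighbors = {}
    for u, v in edges:  # build an undirected, unweighted adjacency structure
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)
    seed_set = {s for s in seeds if s in neighbors}
    if not seed_set:
        raise ValueError("no seed gene is present in the network")
    p0 = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart`, jump back to the seed distribution.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            # Otherwise, move to a uniformly chosen neighbor of u.
            share = (1.0 - restart) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

Genes with higher stationary probability are better connected to the seed genes in the network; a binary gene score could then be taken from the top 10% of these values.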
S2G strategies
We define a SNP-to-gene (S2G) linking strategy as an assignment of 0, 1 or more linked genes to each SNP with minor allele count ≥ 5 in a 1000 Genomes Project European reference panel31. We explored 10 SNP-to-gene linking strategies, including both distal and proximal strategies (Table 2). The proximal strategies included gene body ± 5kb; gene body ± 100kb; predicted TSS (by Segway95, 96); coding SNPs; and promoter SNPs (as defined by UCSC97, 98). The distal strategies included regions predicted to be distally linked to the gene by Activity-by-Contact (ABC) score32, 33 > 0.015 as suggested in ref.33 (see below); regions predicted to be enhancer-gene links based on Roadmap Epigenomics data (Roadmap)91, 92, 99; regions in ATAC-seq peaks that are highly correlated (> 50% as recommended in ref.48) to expression of a gene in mouse immune cell-types (ATAC)48; regions distally connected through promoter-capture Hi-C links (PC-HiC)100; and SNPs with fine-mapped causal posterior probability (CPP)37 > 0.001 in GTEx whole blood (we chose this threshold to ensure that the SNP annotations generated after combining the gene scores with the eQTL S2G strategy were of reasonable size (0.2% of SNPs or larger) for all gene scores analyzed).
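Combining a gene score with an S2G strategy to obtain a SNP annotation can be sketched as follows, using the maximum gene score across the genes linked to each SNP (the mean or sum are the alternatives considered in our secondary analyses). The dictionary-based inputs are hypothetical, and assigning 0 to SNPs with no linked gene and to genes absent from the gene score is an assumption.

```python
from typing import Callable, Dict, List, Sequence


def snp_annotation(
    gene_score: Dict[str, float],
    snp_to_genes: Dict[str, List[str]],
    combine: Callable[[Sequence[float]], float] = max,
) -> Dict[str, float]:
    """Annotation value per SNP from the scores of its S2G-linked genes."""
    annotation = {}
    for snp, genes in snp_to_genes.items():
        # Assumption: genes missing from the gene score contribute 0.
        values = [gene_score.get(gene, 0.0) for gene in genes]
        # Assumption: SNPs with no linked gene receive annotation value 0.
        annotation[snp] = combine(values) if values else 0.0
    return annotation
```

Passing `combine=sum` or a mean function reproduces the alternative aggregation rules within the same sketch.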
Activity-by-Contact model predictions
We used the Activity-by-Contact (ABC) model (https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction) to predict enhancer-gene connections in each cell type, based on measurements of chromatin accessibility (ATAC-seq or DNase-seq) and histone modifications (H3K27ac ChIP-seq), as previously described32, 33. In a given cell type, the ABC model reports an “ABC score” for each element-gene pair, where the element is within 5 Mb of the TSS of the gene.
For each cell type, we:
Called peaks on the chromatin accessibility data using MACS2 with a lenient p-value cutoff of 0.1.
Counted chromatin accessibility reads in each peak and retained the top 150,000 peaks with the most read counts. We then resized each of these peaks to be 500bp centered on the peak summit. To this list we added 500bp regions centered on all gene TSS’s and removed any peaks overlapping blacklisted regions101, 102 (https://sites.google.com/site/anshulkundaje/projects/blacklists). Any resulting overlapping peaks were merged. We call the resulting peak set candidate elements.
Calculated element Activity as the geometric mean of quantile normalized chromatin accessibility and H3K27ac ChIP-seq counts in each candidate element region.
Calculated element-promoter Contact using the average Hi-C signal across 10 human Hi-C datasets as described below.
Computed the ABC Score for each element-gene pair as the product of Activity and Contact, normalized by the sum of the products of Activity and Contact over all elements within 5 Mb of that gene.
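The Activity and ABC Score steps above can be sketched as follows. This is an illustration under stated assumptions, not the ABC pipeline itself: the inputs are taken to be already quantile normalized, the normalizer is assumed to be the sum of Activity × Contact over all candidate elements for the gene (including the focal element), and the function names and toy values are hypothetical.

```python
import math
from typing import List


def activity(accessibility: float, h3k27ac: float) -> float:
    """Geometric mean of (already normalized) accessibility and H3K27ac signal."""
    return math.sqrt(accessibility * h3k27ac)


def abc_scores(activities: List[float], contacts: List[float]) -> List[float]:
    """ABC score of each candidate element for one gene.

    Each score is Activity * Contact divided by the sum of Activity * Contact
    over all candidate elements in range of the gene (assumed normalizer).
    """
    products = [a * c for a, c in zip(activities, contacts)]
    total = sum(products)
    if total <= 0:
        return [0.0] * len(products)
    return [p / total for p in products]


# Three toy elements near one gene: activities are 6, 1 and 4.
acts = [activity(4.0, 9.0), activity(1.0, 1.0), activity(16.0, 1.0)]
scores = abc_scores(acts, contacts=[0.5, 1.0, 0.25])
print(scores)  # products are 3, 1, 1, so the scores are 0.6, 0.2, 0.2
```

Because the scores for a gene sum to 1 under this normalization, a threshold such as 0.015 can be read as a minimum share of the total regulatory input predicted for that gene.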
To generate a genome-wide averaged Hi-C dataset, we downloaded KR normalized Hi-C matrices for 10 human cell types (GM12878, NHEK, HMEC, RPE1, THP1, IMR90, HUVEC, HCT116, K562, KBM7). The resulting averaged Hi-C matrix (5 kb resolution) is available32, 103 here: ftp://ftp.broadinstitute.org/outgoing/lincRNA/average_hic/average_hic.v2.191020.tar.gz. For each cell type we performed the following steps.
Transformed the Hi-C matrix for each chromosome to be doubly stochastic.
We then replaced the entries on the diagonal of the Hi-C matrix with the maximum of its four neighboring bins.
We then replaced all entries of the Hi-C matrix with a value of NaN, or corresponding to Knight–Ruiz matrix balancing (KR) normalization factors < 0.25, with the expected contact under the power-law distribution in the cell type.
We then scaled the Hi-C signal for each cell type using the power-law distribution in that cell type as previously described.
We then computed the “average” Hi-C matrix as the arithmetic mean of the 10 cell-type specific Hi-C matrices.
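Two of the steps above, replacing the diagonal and averaging across cell types, can be sketched on small dense matrices. This is a simplified illustration only: it omits the doubly stochastic transformation, the NaN and KR-factor replacement and the power-law scaling, it assumes that out-of-range neighbors at the matrix corners are simply skipped, and the function names and toy matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def replace_diagonal(m: Matrix) -> Matrix:
    """Replace each diagonal entry with the maximum of its four neighboring bins."""
    n = len(m)
    out = [row[:] for row in m]
    for i in range(n):
        neighbours = [
            m[r][c]
            for r, c in ((i - 1, i), (i + 1, i), (i, i - 1), (i, i + 1))
            if 0 <= r < n and 0 <= c < n  # assumption: skip bins outside the matrix
        ]
        if neighbours:
            out[i][i] = max(neighbours)
    return out


def average_matrices(mats: List[Matrix]) -> Matrix:
    """Entry-wise arithmetic mean of equally sized square matrices."""
    n = len(mats[0])
    return [
        [sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
        for i in range(n)
    ]


cell_type_a = [[9.0, 1.0, 0.0], [1.0, 9.0, 2.0], [0.0, 2.0, 9.0]]
cell_type_b = [[5.0, 3.0, 1.0], [3.0, 5.0, 3.0], [1.0, 3.0, 5.0]]
average = average_matrices(
    [replace_diagonal(cell_type_a), replace_diagonal(cell_type_b)]
)
print(average)
```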
In each cell type, we assign enhancers only to genes whose promoters are “active” (i.e., where the gene is expressed and that promoter drives its expression). We defined active promoters as those in the top 60% of Activity (geometric mean of chromatin accessibility and H3K27ac ChIP-seq counts). We used the following set of TSSs (one per gene symbol) for ABC predictions: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/blob/v0.2.1/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed. We note that this approach does not account for cases where genes have multiple TSSs either in the same cell type or in different cell types.
For intersecting ABC predictions with variants, we took the predictions from the ABC Model and applied the following additional processing steps: (i) We considered all distal element-gene connections with an ABC score ≥ 0.015, and all distal or proximal promoter-gene connections with an ABC score ≥ 0.1. (ii) We shrank the 500-bp regions by 150 bp on either side, resulting in a 200-bp region centered on the summit of the accessibility peak. This is because, while the larger region is important for counting reads in H3K27ac ChIP-seq, which occur on flanking nucleosomes, most of the DNA sequences important for enhancer function are likely located in the central nucleosome-free region. (iii) We included enhancer-gene connections spanning up to 2 Mb.
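The three processing steps can be sketched as a single filter over element-gene connections. The record layout, field names and function name below are hypothetical, and the thresholds are treated as inclusive lower bounds, which is an assumption.

```python
from typing import List, NamedTuple


class Connection(NamedTuple):
    start: int            # start of the 500-bp candidate element
    end: int              # end of the 500-bp candidate element
    gene: str
    abc_score: float
    is_promoter: bool     # True for promoter-gene connections
    distance_to_tss: int  # signed distance from element to the gene TSS, in bp


def process_connections(
    connections: List[Connection],
    distal_min: float = 0.015,
    promoter_min: float = 0.1,
    trim: int = 150,
    max_distance: int = 2_000_000,
) -> List[Connection]:
    """Apply the score thresholds (i), region shrinking (ii) and distance cap (iii)."""
    kept = []
    for conn in connections:
        threshold = promoter_min if conn.is_promoter else distal_min
        if conn.abc_score < threshold:
            continue  # (i) below the ABC score threshold
        if abs(conn.distance_to_tss) > max_distance:
            continue  # (iii) connection spans more than 2 Mb
        # (ii) trim 150 bp from each side: 500 bp becomes 200 bp
        kept.append(conn._replace(start=conn.start + trim, end=conn.end - trim))
    return kept


toy = [
    Connection(1_000, 1_500, "GENE_A", 0.02, False, 50_000),     # kept
    Connection(2_000, 2_500, "GENE_A", 0.05, True, 0),           # promoter, score too low
    Connection(3_000, 3_500, "GENE_B", 0.30, False, 3_000_000),  # too far from the TSS
]
processed = process_connections(toy)
print(processed)
```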
Stratified LD score regression
Stratified LD score regression (S-LDSC) is a method that assesses the contribution of a genomic annotation to disease and complex trait heritability28, 29. S-LDSC assumes that the per-SNP heritability, or variance of effect size (of standardized genotype on trait), of each SNP is equal to the linear contribution of each annotation:
$$\mathrm{Var}(\beta_j) = \sum_c a_{cj} \tau_c,$$
where $a_{cj}$ is the value of annotation c for SNP j, where $a_{cj}$ may be binary (0/1), continuous or probabilistic, and $\tau_c$ is the contribution of annotation c to per-SNP heritability conditioned on other annotations. S-LDSC estimates the $\tau_c$ for each annotation using the following equation:
$$E\left[\chi^2_j\right] = N \sum_c l(j, c)\, \tau_c + 1,$$
where $l(j, c) = \sum_k a_{ck} r^2_{jk}$ is the stratified LD score of SNP j with respect to annotation c and $r_{jk}$ is the genotypic correlation between SNPs j and k computed using data from 1000 Genomes Project31 (see URLs); N is the GWAS sample size.
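To make the stratified LD score concrete, the sketch below computes it for a toy three-SNP region and evaluates the expected chi-square statistic of one SNP. It assumes the standard S-LDSC definitions, with the stratified LD score of SNP j for annotation c equal to the sum over SNPs k of a_ck times the squared correlation r_jk, and an intercept of 1. All numerical values and function names are hypothetical.

```python
from typing import List


def stratified_ld_scores(r: List[List[float]], annot: List[List[float]]) -> List[List[float]]:
    """Return l[j][c] = sum_k annot[c][k] * r[j][k]**2 for every SNP j and annotation c."""
    n_snps = len(r)
    return [
        [sum(annot[c][k] * r[j][k] ** 2 for k in range(n_snps)) for c in range(len(annot))]
        for j in range(n_snps)
    ]


def expected_chisq(ld_scores_j: List[float], tau: List[float], n_samples: int) -> float:
    """Expected chi-square of one SNP: N * sum_c l(j, c) * tau_c + 1."""
    return n_samples * sum(l * t for l, t in zip(ld_scores_j, tau)) + 1.0


# Toy correlation matrix for three SNPs and two annotations:
# annotation 0 contains every SNP, annotation 1 contains only the middle SNP.
r = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
annot = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

ld = stratified_ld_scores(r, annot)
chisq_middle = expected_chisq(ld[1], tau=[1e-6, 4e-6], n_samples=100_000)
print(ld)
print(chisq_middle)
```

In practice S-LDSC regresses observed chi-square statistics on these LD scores to estimate the τ values; the sketch only evaluates the forward direction of that relationship.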
We assess the informativeness of an annotation c using two metrics. The first metric is enrichment (E), defined as follows (for binary and probabilistic annotations only):
$$E_c = \frac{h^2_g(c) / h^2_g}{\sum_j a_{cj} / M},$$
where $h^2_g(c)$ is the heritability explained by the SNPs in annotation c, weighted by the annotation values.
The second metric is standardized effect size (τ*), defined as follows (for binary, probabilistic, and continuous-valued annotations):
$$\tau^*_c = \frac{\tau_c \, sd_c}{h^2_g / M},$$
where $sd_c$ is the standard deviation of annotation c, $h^2_g$ is the total SNP heritability and M is the total number of SNPs on which this heritability is computed (equal to 5,961,159 in our analyses). $\tau^*_c$ represents the proportionate change in per-SNP heritability associated with a 1 standard deviation increase in the value of the annotation.
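A small worked example may help in reading the two metrics. It assumes the standard S-LDSC definitions: enrichment is the proportion of heritability in the annotation divided by the proportion of SNPs in the annotation, and τ* is τ multiplied by the standard deviation of the annotation and divided by the average per-SNP heritability. The ten-SNP data set, the use of the population standard deviation and the function names are hypothetical choices for illustration.

```python
import statistics
from typing import List


def enrichment(per_snp_h2: List[float], annot: List[float]) -> float:
    """(share of heritability in the annotation) / (share of SNPs in the annotation)."""
    total_h2 = sum(per_snp_h2)
    h2_annot = sum(a * h for a, h in zip(annot, per_snp_h2))
    return (h2_annot / total_h2) / (sum(annot) / len(annot))


def standardized_effect(tau_c: float, annot: List[float], total_h2: float) -> float:
    """tau* = tau_c * sd_c / (h2 / M), with sd_c taken as the population standard deviation."""
    sd_c = statistics.pstdev(annot)
    return tau_c * sd_c / (total_h2 / len(annot))


# Ten SNPs, two of them in a binary annotation. Per-SNP heritability is
# 0.005 everywhere plus tau_c = 0.025 inside the annotation.
annot = [1.0, 1.0] + [0.0] * 8
tau_c = 0.025
per_snp_h2 = [0.005 + tau_c * a for a in annot]

e_c = enrichment(per_snp_h2, annot)
tau_star = standardized_effect(tau_c, annot, total_h2=sum(per_snp_h2))
print(e_c, tau_star)  # approximately 3.0 and 1.0
```

Here 20% of SNPs explain 60% of heritability, giving a 3-fold enrichment, and a 1 standard deviation increase in the annotation raises per-SNP heritability by an amount equal to the average per-SNP heritability.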
Combined τ*
We defined a new metric quantifying the conditional informativeness of a heritability model (combined τ*), generalizing the combined τ* metric of ref.63 to more than two annotations. In detail, given a joint model defined by M annotations (conditional on a published set of annotations such as the baseline-LD model), we define
$$\tau^*_{comb} = \sqrt{\sum_{m=1}^{M} \tau^{*2}_m + \sum_{m \neq l} r_{ml}\, \tau^*_m \tau^*_l}.$$
Here $r_{ml}$ is the pairwise correlation of the annotations m and l, and $r_{ml} \tau^*_m \tau^*_l$ is expected to be positive since two positively correlated annotations typically have the same direction of effect (resp. two negatively correlated annotations typically have opposite directions of effect). We calculate standard errors for $\tau^*_{comb}$ using a genomic block-jackknife with 200 blocks.
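The combined τ* can be computed directly from the per-annotation τ* values and the annotation correlation matrix. The sketch below assumes the form in which the squared τ* values are summed with the cross terms r_ml τ*_m τ*_l over all ordered pairs m ≠ l, followed by a square root; the function name and toy inputs are hypothetical, and the block-jackknife standard error is not implemented.

```python
import math
from typing import List


def combined_tau_star(tau_star: List[float], corr: List[List[float]]) -> float:
    """sqrt(sum_m tau*_m^2 + sum_{m != l} r_ml * tau*_m * tau*_l)."""
    n = len(tau_star)
    total = sum(t * t for t in tau_star)
    total += sum(
        corr[m][l] * tau_star[m] * tau_star[l]
        for m in range(n)
        for l in range(n)
        if m != l  # ordered pairs, so each unordered pair contributes twice
    )
    return math.sqrt(total)


# Two annotations with tau* of 1 and 2.
uncorrelated = combined_tau_star([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
correlated = combined_tau_star([1.0, 2.0], [[1.0, 0.5], [0.5, 1.0]])
print(uncorrelated, correlated)  # sqrt(5) and sqrt(7)
```

With two annotations this reduces to the square root of τ*_1² + τ*_2² + 2 r_12 τ*_1 τ*_2, so positively correlated annotations with the same direction of effect give a larger combined value than uncorrelated ones.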
Evaluating heritability model fit using SumHer loglSS
Given a heritability model (e.g. the baseline-LD model or the combined joint model of Figure 5), we define the ΔloglSS of that heritability model as the loglSS of that heritability model minus the loglSS of a model with no functional annotations (baseline-LD-nofunct; 17 LD and MAF annotations from the baseline-LD model29), where loglSS64 is an approximate likelihood metric that has been shown to be consistent with the exact likelihood from restricted maximum likelihood (REML). We compute p-values for ΔloglSS using the asymptotic distribution of the Likelihood Ratio Test (LRT) statistic: −2 loglSS follows a χ2 distribution with degrees of freedom equal to the number of annotations in the focal model, so that −2ΔloglSS follows a χ2 distribution with degrees of freedom equal to the difference in number of annotations between the focal model and the baseline-LD-nofunct model. We used UK10K as the LD reference panel and analyzed 4,631,901 well-imputed Haplotype Reference Consortium (HRC)34 SNPs with MAF ≥ and INFO ≥ 0.99 in the reference panel. We removed SNPs in the MHC region, SNPs explaining > 1% of phenotypic variance and SNPs in LD with these SNPs.
We computed ΔloglSS for 8 heritability models:
baseline-LD model: annotations from the baseline-LD model29 (86 annotations).
baseline-LD+ model: baseline-LD model plus 7 new S2G annotations not included in the baseline-LD model (93 annotations).
baseline-LD+Enhancer model: baseline-LD+ model plus 6 jointly significant S2G annotations corresponding to Enhancer-driven gene scores from Supplementary Figure S11 (99 annotations).
baseline-LD+PPI-enhancer model: baseline-LD+ model plus 7 jointly significant S2G annotations corresponding to Enhancer-driven and PPI-enhancer gene scores from Figure 3D (100 annotations).
baseline-LD+cis model: baseline-LD+ plus 20 S2G annotations used to correct for confounding in evaluation of Trans-master gene score (see Results) (113 annotations).
baseline-LD+Master model: baseline-LD+cis plus 4 jointly significant Master-regulator S2G annotations from Supplementary Figure S14 (117 annotations).
baseline-LD+PPI-master model: baseline-LD+cis plus 4 jointly significant Master-regulator and PPI-master S2G annotations from Figure 4D (117 annotations).
baseline-LD+PPI-master model: baseline-LD+cis plus 8 jointly significant Enhancer-driven, Master-regulator, PPI-enhancer and PPI-master S2G annotations from the final joint model in Figure 5 (121 annotations).
Data Availability
All summary statistics used in this paper are publicly available (see URLs). This work used summary statistics from the UK Biobank study (http://www.ukbiobank.ac.uk/), which are available online (see URLs). All gene scores, S2G links and SNP annotations analyzed in this study are publicly available here: https://data.broadinstitute.org/alkesgroup/LDSCORE/Dey_Enhancer_MasterReg. Supplementary Tables S14 and S67 are provided as Excel files at the above link. We have also included annotations for 93 million Haplotype Reference Consortium (HRC) SNPs and 170 million TOPMed SNPs (Freeze 3A).
Code Availability
The code used to generate SNP annotations from gene sets and to perform PPI-informed integration of gene sets is available on GitHub: https://github.com/kkdey/GSSG.
URLs
Gene scores, S2G links, annotations: https://data.broadinstitute.org/alkesgroup/LDSCORE/Dey_Enhancer_MasterReg
GitHub code repository and data: https://github.com/kkdey/GSSG
Activity-by-Contact (ABC) S2G links: https://www.engreitzlab.org/resources
1000 Genomes Project Phase 3 data: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502
UK Biobank summary statistics: https://data.broadinstitute.org/alkesgroup/UKBB/
baseline-LD model annotations: https://data.broadinstitute.org/alkesgroup/LDSCORE/
BOLT-LMM software: https://data.broadinstitute.org/alkesgroup/BOLT-LMM
S-LDSC software: https://github.com/bulik/ldsc
Acknowledgments
We thank Ran Cui, Hilary Finucane, Sebastian Pott, John Platig, Xinchen Wang and Soumya Raychaudhuri for helpful discussions. This research was funded by NIH grants U01 HG009379, R01 MH101244, K99HG010160, R37 MH107649, R01 MH115676 and R01 MH109978. S.S.Kim was supported by NIH award F31HG010818. This research was conducted using the UK Biobank Resource under application 16549.
Footnotes
We have updated the manuscript to include individual trait-level analyses and separate meta-analyses for autoimmune diseases and blood cell traits, and to highlight how this approach can nominate causal genes and SNP-gene links for specific fine-mapped SNPs. Figures 2 and 3 have been revised.
https://alkesgroup.broadinstitute.org/LDSCORE/Dey_Enhancer_MasterReg