Integrative approaches to improve the informativeness of deep learning models for human complex diseases

Kushal K. Dey; Samuel S. Kim; Steven Gazal; Joseph Nasser; Jesse M. Engreitz; Alkes L. Price

doi:10.1101/2020.09.08.288563

Abstract

Deep learning models have achieved great success in predicting genome-wide regulatory effects from DNA sequence, but recent work has reported that SNP annotations derived from these predictions contribute limited unique information for human complex disease. Here, we explore three integrative approaches to improve the disease informativeness of allelic-effect annotations (predicted difference between reference and variant alleles) constructed using several previously trained deep learning models: DeepSEA, Basenji and DeepBind (and a related machine learning model, deltaSVM). First, we employ gradient boosting to learn optimal combinations of deep learning annotations, using fine-mapped SNPs and matched control SNPs (on held-out chromosomes) for training. Second, we improve the specificity of these annotations by restricting them to SNPs implicated by (proximal and distal) SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies — generalizing the previously proposed ExPecto approach, which incorporates deep learning annotations based on distance to TSS. We evaluated these approaches using stratified LD score regression, using functional data in blood and focusing on 11 autoimmune diseases and blood-related traits (average N =306K). We determined that the three approaches produced SNP annotations that were uniquely informative for these diseases/traits, despite the fact that linear combinations of the underlying DeepSEA, Basenji, DeepBind and deltaSVM blood annotations were not uniquely informative for these diseases/traits. Our results highlight the benefits of integrating SNP annotations produced by deep learning models with other types of data, including data linking SNPs to genes.

Introduction

Deep learning models^1–8 (and related machine learning models^9–11) have shown considerable promise in predicting regulatory marks from DNA sequence, motivated by the well-documented role of non-coding variation in complex disease^12–18. However, we recently showed that existing deep learning models provide limited unique information about complex disease when conditioned on a broad set of coding, conserved, regulatory and LD-related annotations¹⁹. Thus, further ideas are required in order for deep learning models to achieve their full potential in contributing to our understanding of complex disease.

Here, we explore three approaches for integrating different types of functional data to improve the disease informativeness of allelic-effect SNP annotations (predicted difference between reference and variant alleles) constructed using several previously trained deep learning models: DeepSEA⁴, Basenji⁵ and DeepBind¹; for comparison purposes, we also consider a related machine learning model, deltaSVM⁹. First, we employ gradient boosting²⁰ to learn optimal combinations of deep learning annotations, integrating these annotations with fine-mapped SNPs on held-out chromosomes from previous studies^21–23. Second, we improve the specificity of deep learning/machine learning annotations by restricting them to SNPs linked to genes; we consider a broad set of proximal and distal SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs involved in gene regulation^{19, 24–32}. Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies, generalizing the previously proposed ExPecto approach⁴, which incorporates deep learning annotations based on distance to TSS. We consider either SNPs linked to all genes, or SNPs linked to genes in biologically important gene sets^{19, 33}. We assessed the informativeness of the resulting annotations for disease heritability by applying stratified LD score regression (S-LDSC)¹⁶ to 11 autoimmune diseases and blood-related traits (average N =306K), conditional on a broad set of coding, conserved, regulatory and LD-related annotations from the baseline-LD model^{34, 35}.

Results

Overview of Methods

We define an annotation as an assignment of a numeric value to each SNP with minor allele count ≥5 in a 1000 Genomes Project European reference panel³⁶, as in our previous work¹⁶; we primarily focus on annotations with values between 0 and 1. Our annotations are derived from allelic-effect deep learning (or machine learning) annotations (predicted difference between reference and variant alleles of sequence-based predictions of functional annotations) from several recently developed models: DeepSEA⁴, Basenji⁵, DeepBind¹ and deltaSVM⁹. DeepSEA employs a multi-class classification model to predict transcription factor and chromatin features by analyzing 1kb of human reference sequence around each SNP. Basenji employs a Poisson likelihood model to predict chromatin and CAGE profiles by analyzing 130kb of human reference sequence around each SNP using dilated convolutional layers. DeepBind fits a deep convolutional neural net model to sequences of varying length (14-101bp) to predict binding motifs of transcription factors and RNA-binding proteins. deltaSVM applies a gapped k-mer support vector machine (gkm-SVM^{10, 11}) based classification method to sequences of length 10bp to predict profiles for transcription factors and chromatin features.

Our previous work¹⁹ focused on unsigned (absolute) allelic-effect annotations for DNase and three histone marks, H3K27ac, H3K4me1 and H3K4me3 (associated with active enhancers and promoters). Here, we integrate signed allelic-effect annotations for all features with other types of data - fine-mapped SNPs, SNPs linked to genes, and gene expression - to generate more disease-informative unsigned annotations. We have publicly released all new annotations analyzed in this study, along with open-source software for constructing the new annotations (see URLs).

First, we employ gradient boosting to integrate deep learning (or machine learning) annotations with fine-mapped SNPs (on held-out chromosomes) for blood-related traits from previous studies^21–23 to generate boosted annotations representing an optimal combination of annotations. We use the XGBoost gradient boosting model²⁰, and we train the gradient boosting model on even (respectively odd) chromosomes in order to construct annotations on odd (respectively even) chromosomes that are not used for training (to avoid overfitting); all parameter settings follow our previous work on AnnotBoost³⁷, which has different goals. As input features, we use all of the deep learning (or machine learning) annotations from the pre-trained DeepSEA, Basenji, DeepBind and deltaSVM models, respectively. For comparison purposes, we also consider a simpler logistic regression model. Second, we improve the specificity of these annotations by restricting them to SNPs linked to genes using 10 (proximal and distal) SNP-to-gene (S2G) strategies^{24–32, 38} (Table 1). Third, we predict gene expression (and derive allelic-effect annotations) from deep learning annotations at SNPs implicated by S2G linking strategies, generalizing the previously proposed ExPecto approach⁴, which incorporates deep learning annotations based on distance to TSS.
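The even/odd held-out scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the released software: the function name `parity_folds` and the `(snp_id, chromosome)` layout are hypothetical, and the classifier itself is omitted.

```python
def parity_folds(snps):
    """Split SNPs into even- and odd-chromosome groups.

    `snps` is a list of (snp_id, chromosome) pairs with integer autosome
    numbers (a hypothetical layout, used here for illustration only).
    Returns two (train, predict) folds: a model fit on `train` is only
    ever used to score the SNPs in `predict`.
    """
    even = [s for s in snps if s[1] % 2 == 0]
    odd = [s for s in snps if s[1] % 2 == 1]
    return [(even, odd), (odd, even)]


snps = [("rs1", 1), ("rs2", 2), ("rs3", 3), ("rs4", 22)]
folds = parity_folds(snps)
for train, predict in folds:
    train_chroms = {chrom for _, chrom in train}
    predict_chroms = {chrom for _, chrom in predict}
    # No chromosome appears on both sides of a fold.
    assert train_chroms.isdisjoint(predict_chroms)
```

Every SNP is scored exactly once, by the model that did not see its chromosome during training.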

Table 1. List of 10 S2G strategies:

For each S2G strategy, we provide a brief description, indicate whether the S2G strategy prioritizes distal or proximal SNPs relative to the gene, and report its size (% of SNPs linked to genes). S2G strategies are listed in order of increasing size. Further details are provided in the Methods section.

We assessed the informativeness of the resulting annotations for disease heritability by applying stratified LD score regression (S-LDSC)¹⁶ to 11 independent blood-related diseases and traits (5 autoimmune diseases and 6 blood cell traits; average N =306K, Table S1) and meta-analyzing S-LDSC results across traits; we restricted our analyses to blood-related traits due to our focus on functional data in blood. We conservatively conditioned all analyses on a “baseline-LD-deep model” defined by 86 coding, conserved, regulatory and LD-related annotations from the baseline-LD model (v2.1)^{34, 35} and 14 additional jointly significant annotations from ref.¹⁹: 1 non-tissue-specific allelic-effect Basenji annotation, 3 Roadmap annotations, 5 ChromHMM annotations, and 5 other annotations (100 annotations total) (Table S2 and Table S3).

We used two metrics to evaluate the informativeness of individual annotations for disease heritability: enrichment and standardized effect size (τ*). Enrichment is defined as the proportion of heritability explained by SNPs in an annotation divided by the proportion of SNPs in the annotation¹⁶, and generalizes to annotations with values between 0 and 1²⁷. Standardized effect size (τ*) is defined as the proportionate change in per-SNP heritability associated with a 1 standard deviation increase in the value of the annotation, conditional on other annotations included in the model³⁴. Unlike enrichment, τ* quantifies effects that are unique to the focal annotation; thus, we use τ* as our primary metric. In our “marginal” analyses, we estimated τ* for each focal annotation conditional on the baseline-LD-deep annotations. In our “joint” analyses, we merged baseline-LD-deep annotations with focal annotations that were marginally significant after Bonferroni correction and performed forward stepwise elimination to iteratively remove focal annotations that had conditionally non-significant τ* values after Bonferroni correction, as in ref.³⁴. Finally, in addition to the S-LDSC metrics enrichment and τ* (which evaluate individual annotations), we independently evaluated the combined joint model arising from our analyses using logl_SS³⁹, an approximate likelihood metric that evaluates a heritability model defined by a set of functional annotations, without running S-LDSC.
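For reference, the two S-LDSC metrics can be written out explicitly, following the definitions in refs. 16 and 34. The notation is introduced here only for illustration: a_c(j) is the value of annotation c at SNP j, τ_c is the per-SNP heritability contribution of annotation c, β_j is the per-normalized-genotype effect of SNP j, M is the number of SNPs over which the heritability h_g² is computed, and sd_c is the standard deviation of annotation c.

```latex
\begin{align}
\mathrm{Var}(\beta_j) &= \sum_{c} a_c(j)\,\tau_c \\
\mathrm{Enrichment}(c) &= \frac{h_g^2(c)\,/\,h_g^2}{\sum_{j} a_c(j)\,/\,M},
\qquad h_g^2(c) = \sum_{j} a_c(j)\,\mathrm{Var}(\beta_j) \\
\tau_c^{*} &= \frac{M\,\mathrm{sd}_c}{h_g^2}\,\tau_c
\end{align}
```

For a binary annotation, the denominator of the enrichment reduces to the proportion of SNPs in the annotation; τ*_c rescales τ_c by the average per-SNP heritability h_g²/M and by sd_c, which makes it comparable across annotations and traits.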

DeepBoost annotations restricted to SNPs implicated by functionally informed S2G linking strategies are uniquely informative for autoimmune disease heritability

We developed a gradient boosting approach, DeepBoost, to learn optimal combinations of deep learning annotations (sequence-based predictions of functional annotations), using fine-mapped SNPs on held-out chromosomes for blood-related diseases/traits^21–23 and matched control SNPs for training (Figure S1 and Methods). The input deep learning/machine learning annotations consisted of either 2,002 DeepSEA allelic-effect annotations⁴, or 4,229 Basenji allelic-effect annotations⁵, or 927 DeepBind allelic-effect annotations¹ (based on TF and RBP motifs), or 1,329 deltaSVM allelic-effect annotations⁹ (trained on DHS and TF, which were reported to be the most informative features in recent work⁴⁰). The DeepSEA, Basenji and deltaSVM models were based on tissue/cell-type-specific features spanning 127 tissues and cell types from Roadmap⁴¹; the DeepBind model was trained on non-tissue-specific features. We defined published allelic-effect annotations for each of these models as the maximum of the absolute allelic effects across relevant blood cell type features (or across all features for DeepBind, which is non-tissue-specific) (Methods).
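The definition of a published allelic-effect annotation (maximum absolute allelic effect across the relevant features) can be illustrated with a short sketch. The function name, feature names and values below are hypothetical and serve only as an example.

```python
def published_annotation(allelic_effects, relevant_features):
    """Maximum absolute allelic effect across the relevant features.

    `allelic_effects` maps a feature name to the signed allelic effect
    (reference vs. variant allele) predicted for one SNP.
    """
    return max(abs(allelic_effects[f]) for f in relevant_features)


# Hypothetical signed allelic effects for a single SNP.
effects = {"DNase_blood": -0.30, "H3K27ac_blood": 0.10, "DNase_liver": 0.90}
blood_features = ["DNase_blood", "H3K27ac_blood"]

value = published_annotation(effects, blood_features)
print(value)  # 0.3
```

The large effect in the non-blood feature is ignored because only blood features are passed in; for a non-tissue-specific model such as DeepBind, all features would be passed.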

The fine-mapped SNPs consisted of 8,741 fine-mapped autoimmune disease SNPs²¹ with causal probability > 0.0275, the threshold used by ref.²¹ (in secondary analyses, we also considered other sets of fine-mapped SNPs^{22, 23}). DeepBoost uses decision trees to distinguish fine-mapped SNPs from matched control SNPs (with similar MAF and LD structure and local GC content) using an optimal combination of deep learning annotations; the DeepBoost model is trained using the XGBoost gradient boosting software²⁰ (see URLs). DeepBoost attained an AUROC of up to 0.67 in distinguishing fine-mapped SNPs from control SNPs (highest = 0.67 for Basenji, second highest = 0.62 for DeepSEA), an encouraging result given the fundamental difficulty of this task (Table S4). The boosted allelic-effect annotations derived from DeepBoost (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted, deltaSVMΔ-boosted; we use Δ to denote allelic-effect annotations) were only mildly correlated with published allelic-effect annotations as defined above (average r=0.16; Methods) (Figure S2). We also observed mild correlations between boosted annotations produced by the 4 models (average r=0.12; maximum of 0.32 between DeepSEAΔ-boosted and BasenjiΔ-boosted) (Figure S2). We determined that using logistic regression instead of XGBoost attained only slightly lower AUROC for each of the 4 models (average AUROC=0.612 for gradient boosting vs. 0.595 for logistic regression, difference=0.017; similar difference in secondary analyses of other fine-mapped SNP sets) (Table S4, Table S5). Given that a random classifier obtains an AUROC of 0.5, this can be viewed as a +17.9% improvement for gradient boosting (0.112/0.095 = 1.179). We believe this improvement is sufficient to justify the choice of gradient boosting in preference to logistic regression in our primary analyses; we also consider annotations constructed using logistic regression in our secondary analyses.
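The relative improvement quoted above is a ratio of AUROC gains over a random classifier. A minimal sketch of the arithmetic (the function name `gain_over_chance` is hypothetical):

```python
def gain_over_chance(auroc_a, auroc_b, chance=0.5):
    """Ratio of the AUROC gains of two classifiers over a random classifier."""
    return (auroc_a - chance) / (auroc_b - chance)


# Average AUROC: 0.612 for gradient boosting vs. 0.595 for logistic regression.
ratio = gain_over_chance(0.612, 0.595)
print(f"{(ratio - 1) * 100:.1f}%")  # 17.9%
```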

We broadly investigated which features of the DeepSEA, Basenji, DeepBind and deltaSVM models contributed the most to corresponding boosted annotations by applying Shapley Additive Explanation (SHAP)⁴², a widely used tool for pinpointing biological features underlying machine learning models^{37, 43, 44}. For each model analyzed, we aggregated SHAP values across SNPs and primarily focused on the top 20 features, following ref.³⁷ (see Data Availability for visualization of top 100 features); we caution that aggregating values across SNPs does not account for the wide variation in SHAP values in our analyses, and that the large number of features makes it difficult to delineate which features the observed improvements derive from. For DeepSEAΔ-boosted, top features included TF features in GM12878 and K562, two immune-related cell lines (Figure S3); for BasenjiΔ-boosted, top features included activating histone marks (H3K27ac and H3K4me3) in immune cell types (Figure S4); for DeepBindΔ-boosted, top features included the TF features TBP, HOXA13 and SP1 (Figure S5); for deltaSVMΔ-boosted, top features largely consisted of TF features in the immune cell type K562 (Figure S6). We also investigated which features of the DeepSEA, Basenji, DeepBind and deltaSVM models contributed the most to corresponding annotations constructed using logistic regression (instead of gradient boosting). We observed partial overlap with the top features from gradient boosting, including immune cell type features for DeepSEA and Basenji and the HOXA13 TF feature for DeepBind (Table S6); HOXA13 regulates genes associated with immune response, gap junction/cell adhesion, and pregnancy⁴⁵. Finally, we investigated the impact of using only features from 27 blood cell types as input to our gradient boosting method (524 DeepSEA features or 479 Basenji features or 91 deltaSVM features; not applicable to DeepBind, which is non-tissue-specific). We determined that this attained only slightly lower AUROC than using all features (Table S7).

We assessed the informativeness for disease heritability of allelic-effect annotations constructed using DeepSEA, Basenji, DeepBind and deltaSVM. In our marginal analysis of disease heritability (across 11 autoimmune diseases and blood-related traits) using S-LDSC conditional on the baseline-LD-deep model, 2 of 4 published annotations (DeepSEAΔ-published, BasenjiΔ-published) and 1 of 4 boosted annotations (BasenjiΔ-boosted) were significantly enriched for heritability (after Bonferroni correction for 174 annotations tested; see Methods), with larger enrichments for the boosted annotations (Figure 1A, left panel, Figure 1C, left panel and Table S8); values of standardized enrichment (defined as enrichment scaled by the standard deviation of the annotation) are reported in Figure S7 and Table S9. However, none of these annotations attained Bonferroni-significant τ* values (although the BasenjiΔ-boosted annotation was FDR-significant) (Figure 1B, left panel, Figure 1D, left panel and Table S8). We constructed analogs of the DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted annotations using three other sets of fine-mapped SNPs: 4,312 fine-mapped inflammatory bowel disease SNPs²², 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits^{23, 46}, or the union of all 14,482 fine-mapped SNPs. The resulting annotations produced less disease signal than those constructed using the 8,741 fine-mapped autoimmune disease SNPs²¹ (Table S8).

Figure 1. Disease informativeness of published and boosted allelic-effect deep learning annotations restricted to SNPs implicated by functionally informed S2G strategies:

(A, Left panel) Heritability enrichment of published and boosted annotations based on the DeepSEA and Basenji models, conditional on the baseline-LD-deep model. Dashed horizontal line denotes no enrichment. (B, Left panel) Standardized effect size (τ*) of published and boosted DeepSEA and Basenji annotations, conditional on the baseline-LD-deep model. (A, Right panel) Heritability enrichment of published-restricted and boosted-restricted DeepSEA and Basenji annotations, conditional on the baseline-LD-deep-S2G model. Dashed horizontal line denotes no enrichment, solid horizontal lines denote enrichments of underlying S2G annotations. (B, Right panel) Standardized effect size (τ*) of published-restricted and boosted-restricted DeepSEA and Basenji annotations, conditional on the baseline-LD-deep-S2G model. (C, Left panel) Heritability enrichment of published and boosted annotations based on the DeepBind and deltaSVM models, conditional on the baseline-LD-deep model. Dashed horizontal line denotes no enrichment. (D, Left panel) Standardized effect size (τ*) of published and boosted DeepBind and deltaSVM annotations, conditional on the baseline-LD-deep model. (C, Right panel) Heritability enrichment of published-restricted and boosted-restricted DeepBind and deltaSVM annotations, conditional on the baseline-LD-deep-S2G model. Dashed horizontal line denotes no enrichment, solid horizontal lines denote enrichments of underlying S2G annotations. (D, Right panel) Standardized effect size (τ*) of published-restricted and boosted-restricted DeepBind and deltaSVM annotations, conditional on the baseline-LD-deep-S2G model. Results are meta-analyzed across 11 blood-related traits. The percentage under each bar denotes the size of the annotation (defined as average annotation value; equal to proportion of SNPs for binary annotations). ** denotes P < 0.05/174. Error bars denote 95% confidence intervals. Numerical results, including results for all 10 S2G strategies analyzed, are reported in Table S8 and Table S11.

We sought to improve the specificity of these annotations by restricting them to SNPs implicated by SNP-to-gene (S2G) linking strategies, e.g. prioritizing SNPs that may play a role in gene regulation; we define an S2G strategy as an assignment of 0, 1 or more linked genes to each SNP. We considered 10 S2G strategies capturing both proximal and distal gene regulation in blood, as in our previous work³⁸ (see Methods and Table 1), and constructed 10 corresponding binary S2G annotations defined by SNPs linked to the set of all genes; the S2G annotations were only mildly positively correlated with each other (average r = 0.09; Figure S8). We defined restricted allelic-effect annotations as a simple product of allelic-effect annotations and S2G annotations. Due to correlations between allelic-effect annotations and S2G annotations (average r = 0.16; Figure S9), the size of a restricted allelic-effect annotation (defined as average annotation value; equal to proportion of SNPs for binary annotations) was generally larger than the product of the respective sizes of the underlying allelic-effect and S2G annotations (as would be expected if the two constituent annotations were independent); for example, the BasenjiΔ-boosted allelic-effect annotation has size 15% and the ABC S2G annotation has size 1.4%, but the BasenjiΔ-boosted × ABC boosted-restricted allelic-effect annotation has size 0.63%, which is larger than 15% × 1.4% = 0.21%. We evaluated 80 restricted allelic-effect annotations (8 allelic-effect annotations (4 published + 4 boosted) × 10 S2G annotations). We analyzed the restricted allelic-effect annotations conditional on a “baseline-LD-deep-S2G model” defined by 100 baseline-LD-deep annotations and 7 new S2G annotations from Table 1 that were not already included in the baseline-LD model (107 annotations total) (Table S2 and Table S10), to ensure that heritability enrichments that are entirely due to S2G annotations would not produce conditional signals.
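The construction of a restricted annotation, and the reason its size can exceed the product of the constituent sizes, can be illustrated with a toy example. The function names and the eight SNP values below are hypothetical; the toy data are chosen so that large allelic effects coincide with linked SNPs.

```python
def restrict(allelic, s2g):
    """Per-SNP product of an allelic-effect annotation and a binary S2G annotation."""
    return [a * g for a, g in zip(allelic, s2g)]


def size(annotation):
    """Annotation size: the average annotation value across SNPs."""
    return sum(annotation) / len(annotation)


# Toy data for 8 SNPs (hypothetical values).
allelic = [0.75, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
s2g = [1, 1, 0, 0, 0, 0, 0, 0]

restricted = restrict(allelic, s2g)
independent_size = size(allelic) * size(s2g)

print(size(restricted), independent_size)
```

Because the two annotations are positively correlated in this toy example, the restricted annotation is larger than would be expected if they were independent, mirroring the BasenjiΔ-boosted × ABC example above.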

In our marginal analysis of disease heritability using S-LDSC conditional on the baseline-LD-deep-S2G model, 48 of 80 annotations were significantly enriched for heritability (after Bonferroni correction for 174 annotations tested; see Methods), with larger enrichments for smaller annotations (Figure 1A right panel, Figure 1C right panel and Table S11); values of standardized enrichment were more similar across annotations (Table S12). Although published and boosted allelic-effect annotations were of similar size, the enrichments for boosted-restricted annotations across S2G strategies were higher on average (1.3x for DeepSEA, 1.5x for Basenji, 1.1x for DeepBind, 1.4x for deltaSVM) than the enrichments for published-restricted annotations. 3 of the boosted-restricted annotations (DeepSEAΔ-boosted × eQTL, BasenjiΔ-boosted × ABC and BasenjiΔ-boosted × TSS; no DeepBind or deltaSVM annotations) attained Bonferroni-significant τ* values (Figure 1B right panel, Figure 1D right panel and Table S11). (In comparison, when we conditioned only on the baseline-LD-deep model, 20 of the 80 annotations attained Bonferroni-significant τ* values (Table S13).)

We jointly analyzed the 3 annotations that were significant in the marginal analyses of Figure 1B, right panel, by performing forward stepwise elimination to iteratively remove annotations that had conditionally non-significant τ* values after Bonferroni correction. All 3 annotations were jointly significant in the resulting joint model, with joint effect sizes very similar to the conditional effect sizes from Figure 1B, right panel (Figure S10 and Table S14). All three annotations had joint τ* > 0.5; annotations with τ* > 0.5 are unusual, and considered to be important⁴⁷.

We investigated whether the boosted-restricted annotations would detect gene set-specific signals by further restricting them to SNPs linked to two biologically important gene sets: genes intolerant to loss-of-function (LoF) mutations³³ (pLI) and genes with high PPI network connectivity to enhancer-driven genes in blood³⁸ (PPI-enhancer). We defined gene set-specific boosted-restricted annotations by replacing the S2G annotations (containing SNPs linked to all genes) with annotations containing SNPs linked to genes in the input gene set (Methods); we primarily focused on boosted-restricted annotations (instead of published-restricted annotations) because these were the restricted annotations that produced significant conditional signals in Figure 1B, right panel. We evaluated 80 gene set-specific boosted-restricted annotations (2 gene sets (pLI, PPI-enhancer) × 4 boosted allelic-effect annotations (BasenjiΔ-boosted, DeepSEAΔ-boosted, DeepBindΔ-boosted, deltaSVMΔ-boosted) × 10 S2G strategies). We analyzed the gene set-specific boosted-restricted annotations conditional on a “baseline-LD-deep-S2G-geneset” model defined by 107 baseline-LD-deep-S2G annotations and 8 jointly significant gene set-specific S2G annotations (Table S15 and Table S2), to ensure that heritability enrichments that are entirely due to the gene set-specific S2G annotations would not produce conditional signals.

In our marginal analysis of disease heritability using S-LDSC conditional on the baseline-LD-deep-S2G-geneset model, 41 of the 80 gene set-specific boosted-restricted annotations were significantly enriched for heritability (after Bonferroni correction for 174 annotations tested; see Methods), with larger enrichment for smaller annotations (Figure 2A and Table S16); values of standardized enrichment were more similar across annotations (Figure S11 and Table S17). 13 of the 80 annotations (3 DeepSEAΔ-boosted (PPI-enhancer), 3 BasenjiΔ-boosted (PPI-enhancer), 2 DeepBindΔ-boosted (PPI-enhancer), 3 deltaSVMΔ-boosted (PPI-enhancer), 1 DeepBindΔ-boosted (pLI) and 1 deltaSVMΔ-boosted (pLI)) attained conditionally Bonferroni-significant τ* values (Figure 2B and Table S16). We jointly analyzed these 13 annotations by performing forward stepwise elimination. The resulting joint model contained 2 jointly significant annotations, BasenjiΔ-boosted (PPI-enhancer) × ABC and BasenjiΔ-boosted (PPI-enhancer) × Roadmap (Figure 2C and Table S18); both annotations had joint τ* > 0.5. Both annotations remained jointly significant, with very similar τ* values, when further conditioned on the 3 jointly significant boosted-restricted annotations from Figure 1B, right panel (including the underlying BasenjiΔ-boosted × ABC annotation and the underlying BasenjiΔ-boosted × Roadmap annotation) (Table S18).

Figure 2. Disease informativeness of gene set-specific boosted-restricted annotations:

(A) Heritability enrichment of gene set-specific boosted-restricted annotations based on the DeepSEA, Basenji, DeepBind and deltaSVM models, conditional on the baseline-LD-deep-S2G-geneset model. (B) Standardized effect size (τ*) of gene set-specific boosted-restricted DeepSEA, Basenji, DeepBind and deltaSVM annotations, conditional on the baseline-LD-deep-S2G-geneset model. (C) Standardized effect size (τ*) of the two jointly significant annotations, conditional on the baseline-LD-deep-S2G-geneset model plus both annotations. Results are meta-analyzed across 11 blood-related traits. τ* values less than 0 are displayed as 0 for visualization purposes. In panel C, the percentage under each bar denotes the size of the annotation (defined as average annotation value; equal to proportion of SNPs for binary annotations). ** denotes P < 0.05/174. Error bars denote 95% confidence intervals. In panel B, the black box in each row denotes the S2G strategy with highest τ*. Numerical results, including results for all 10 S2G strategies analyzed, are reported in Table S16 and Table S18.

We performed 3 secondary analyses. First, we repeated the analysis of restricted annotations using local GC-content (proportion of G and C nucleotides in a 1000bp window around each SNP) in addition to the S2G strategies, conditioning on the baseline-LD-deep-S2G model and the unweighted local GC-content annotation. The τ* values for all 3 jointly significant restricted annotations from Figure 1B, right panel were nearly unchanged and remained Bonferroni-significant (Table S19); this implies that the unique disease signal in our restricted annotations cannot be explained by local GC-content. Second, we assessed the informativeness for disease heritability of allelic-effect annotations constructed using logistic regression (instead of gradient boosting). We determined that these annotations were less informative for disease heritability; in particular, only 1 of 3 annotations from Figure 1B, right panel (and no other annotations) attained conditionally Bonferroni-significant τ* values (Table S20 and Table S21). Third, we repeated the analysis of gene set-specific restricted annotations using published-restricted annotations instead of boosted-restricted annotations. Marginal results were comparable to Figure 2B (14 annotations with Bonferroni-significant τ* values; Table S22), but none of the gene set-specific published-restricted annotations were significant conditional on the 2 jointly significant gene set-specific boosted-restricted annotations from Figure 2C (Table S23).

We conclude that boosted deep learning allelic-effect annotations restricted to SNPs implicated by functionally informed S2G linking strategies are uniquely informative for autoimmune diseases and blood-related traits. All annotations that were uniquely informative for disease in our joint analyses were based on the DeepSEA and Basenji models, and we thus restrict our remaining analyses to these two deep learning models.

Sequence-based deep learning predictions of gene expression informed by S2G linking strategies are uniquely informative for autoimmune disease heritability

We developed a new approach, Imperio, to predict gene expression from DNA sequence by using S2G strategies to prioritize deep learning annotations (sequence-based deep learning predictions of functional annotations) as features (Figure S12 and Methods). Imperio generalizes the ExPecto approach⁴, which prioritizes deep learning annotations as features based on distance to TSS. Specifically, Imperio uses regularized linear regression to fit optimal combinations of features predicting gene expression across 22,020 genes based on 2,002 DeepSEA or 4,229 Basenji deep learning annotations restricted to relatively common SNPs (MAF > 1%) linked to the target gene by 5 S2G strategies that are suitably large in size and generalizable to tissues beyond blood (5kb, 100kb, TSS, ABC and Roadmap Enhancer; Table 1) (2,002 × 5 or 4,229 × 5 features); the feature weights are independent of the target gene but dependent on the deep learning annotation and the S2G strategy (see Methods). In contrast, ExPecto fits optimal combinations of features based on 2,002 DeepSEA annotations restricted to 10 different functions of distance to TSS (using exponential decay), for a total of 2,002 × 10 features. We restricted our Imperio analyses to the DeepSEA and Basenji models, as all annotations that were uniquely informative for disease in our above joint analyses were based on these models. We focused on predicting gene expression in blood, due to the larger amount of data currently available for ABC and Roadmap Enhancer in blood cell types (however, our approach is generalizable to other tissues). We evaluated the accuracy of Imperio in predicting gene expression across genes on chromosome 8, which was withheld from Imperio training data (analogous to ref.⁴). We determined that Imperio attained predictive accuracy similar to that of ExPecto (Spearman correlation ρ = 0.76 (Basenji) and ρ = 0.72 (DeepSEA) with log RPKM expression, vs. ρ = 0.79 for ExPecto; Figure S13). The expression predictions were highly correlated between the Imperio and ExPecto models (average ρ = 0.82) (Figure S14), but the resulting allelic-effect annotations were less correlated, such that Imperio may contribute unique information (see below). The top significant features driving the Imperio model fit included Transcription Factor (TF) features for DeepSEA and CAGE features for Basenji (Table S24). When we compared the 5 Imperio models utilizing a single S2G strategy, TSS outperformed the other S2G strategies, but the resulting model fit (Spearman correlation ρ = 0.69 (Basenji) and ρ = 0.71 (DeepSEA) with log RPKM expression) was substantially worse than the model fit of the Imperio model utilizing all 5 S2G strategies (Table S25).

We used the Imperio allelic effects (signed predicted difference in expression between reference and variant alleles) to predict GTEx blood gene expression across individuals for each gene (see Methods). For each gene, we compared the Imperio prediction r² to the total cis-SNP heritability of that gene, which represents an upper bound on the prediction r² that can be attained using DNA sequence (because Imperio uses a (constrained) linear model to compute predictions; see Methods). Averaging across all 22,020 genes, Imperio predictions captured up to 82% of the total cis-SNP heritability on average (82% for Imperio-Basenji and 79% for Imperio-DeepSEA, vs. 75% for ExPecto; this analysis was not considered in ref.⁴) (Table S26). The Imperio prediction r² closely tracked cis-SNP heritability (ρ = 0.83 for Imperio-DeepSEA, ρ = 0.84 for Imperio-Basenji across genes, vs. ρ = 0.81 for ExPecto) (Figure S15). Because disease heritability pertains to variation across individuals, the higher accuracy of Imperio in predicting gene expression variation across individuals may be expected to lead to annotations that are more informative for disease heritability.

We used the gene expression predictions from Imperio (DeepSEA and Basenji models) and ExPecto⁴ (DeepSEA model) to construct expression allelic-effect annotations (absolute value of the predicted difference in expression between reference and variant alleles) by summing allelic effects across genes linked by S2G strategies to the annotated SNP (see Methods). The Imperio training data excluded chromosome 8 (analogous to ref.⁴; see above), but did not exclude the target chromosomes on which allelic-effect annotations were constructed. However, this does not constitute overfitting, because the Imperio model was trained using reference sequence only. The Imperio-DeepSEA and Imperio-Basenji annotations were moderately correlated with each other (r = 0.54) and with ExPecto-DeepSEA (average r = 0.48) (Figure S16), such that each may contribute unique information. Furthermore, Imperio-DeepSEA and Imperio-Basenji annotations showed only mild correlation (average r = 0.11) with the boosted-restricted allelic-effect annotations from the previous section (Table S27). We analyzed the Imperio-DeepSEA, Imperio-Basenji and ExPecto-DeepSEA allelic-effect annotations conditional on the baseline-LD-deep-S2G-geneset model (see above; Table S2 and Table S15), for consistency with analyses of gene set-specific allelic-effect annotations (see below).

In our marginal analysis of disease heritability using S-LDSC conditional on the baseline-LD-deep-S2G-geneset model, all 3 annotations were significantly enriched for disease heritability (after Bonferroni correction for 174 annotations tested; see Methods), with larger enrichments for smaller annotations (Figure 3A and Table S28); values of standardized enrichment were more similar across annotations (Table S29). One annotation, Imperio-Basenji, attained a Bonferroni-significant τ* value (Figure 3B and Table S28); the τ* value was very close to 0.5. This implies that Imperio-Basenji provides unique information about autoimmune diseases and blood-related traits. We note that the improvement of Imperio-Basenji vs. ExPecto-DeepSEA derives both from the use of S2G strategies in Imperio (Imperio-DeepSEA vs. ExPecto-DeepSEA) and the use of the Basenji model (Imperio-Basenji vs. Imperio-DeepSEA); however, statistical uncertainty precludes a precise quantification of the relative importance of these two factors.

Figure 3. Disease informativeness of allelic-effect annotations based on predictions of gene expression from DNA sequence using S2G linking strategies to prioritize deep learning annotations as features:

(A) Heritability enrichment of Imperio allelic-effect annotations, conditional on the baseline-LD-deep-S2G-geneset model. Dashed horizontal line denotes no enrichment. (B) Standardized effect size (τ*) of Imperio allelic-effect annotations, conditional on the baseline-LD-deep-S2G-geneset model. (C) Heritability enrichment of gene set-specific Imperio allelic-effect annotations, conditional on the baseline-LD-deep-S2G-geneset model. Dashed horizontal line denotes no enrichment. (D) Standardized effect size (τ*) of gene set-specific Imperio allelic-effect annotations, conditional on the baseline-LD-deep-S2G-geneset model. Results are meta-analyzed across 11 blood-related traits. The percentage under each bar denotes the size of the annotation (defined as average annotation value; equal to proportion of SNPs for binary annotations). ** denotes P < 0.05/174. Error bars denote 95% confidence intervals. Numerical results, including results for both pLI and PPI-enhancer gene sets, are reported in Table S28, Table S30 and Table S32.

We investigated whether the Imperio approach would detect gene-set specific signals by restricting Imperio to two biologically important gene sets: pLI³³ and PPI-enhancer³⁸ (see above). We defined gene set-specific allelic-effect annotations by restricting both the fitting of feature weights and the gene expression predictions to genes in the input gene set. Pairwise correlations between the 4 gene set-specific allelic-effect annotations ([Imperio-DeepSEA or Imperio-Basenji] x [pLI or PPI-enhancer]) (and the 3 non-gene set-specific allelic-effect annotations) are reported in Figure S16. We analyzed the gene set-specific allelic-effect annotations conditional on the baseline-LD-deep-S2G-geneset model (see above; Table S2 and Table S15).

In our marginal analysis of disease heritability using S-LDSC conditional on the baseline-LD-deep-S2G-geneset model, all 4 annotations were significantly enriched for disease heritability (after Bonferroni correction for 174 annotations tested; see Methods), with larger enrichments for smaller annotations (Figure 3C and Table S30); values of standardized enrichment were more similar across annotations (Table S31). Two annotations, Imperio-DeepSEA (PPI-enhancer) and Imperio-Basenji (PPI-enhancer), attained Bonferroni-significant τ* values (Figure 3D and Table S30). In a joint analysis of both annotations, only Imperio-DeepSEA (PPI-enhancer) remained significant (Figure 3D and Table S32); the τ* value was larger than 0.5. Imperio-DeepSEA (PPI-enhancer) remained significant (with τ* > 0.5) when further conditioned on the Imperio-Basenji annotation from Figure 3B (Table S33).

We performed 5 secondary analyses. First, we fit an Imperio+ExPecto model using both Imperio (DeepSEA or Basenji) and ExPecto features. The Imperio+ExPecto allelic-effect annotations did not produce a significant disease signal conditional on the baseline-LD-deep-S2G-geneset model plus the Imperio-Basenji annotation from Figure 3B (Table S34). Second, we investigated a partially restricted gene set-specific Imperio approach by restricting either (a) the fitting of feature weights or (b) the gene expression predictions (but not both) to genes in the input gene set. None of the partially restricted gene set-specific annotations produced a significant disease signal conditional on the baseline-LD-deep-S2G-geneset model plus the two significant annotations from Figure 3B,D (Table S35 and Table S36). Third, we assessed whether the disease informativeness of Imperio could be explained by annotations defined by the number of genes linked to each SNP by each S2G strategy (see Methods). However, none of these annotations produced a significant disease signal conditional on the baseline-LD-deep-S2G-geneset model, either for all genes (Table S37) or when restricted to PPI-enhancer genes (Table S38). Fourth, we modified Imperio by constructing allelic-effect annotations using the maximum across genes proximal to the annotated SNPs, instead of the sum (see Methods). None of the modified annotations produced a significant disease signal conditional on the baseline-LD-deep-S2G-geneset model plus the two significant annotations from Figure 3B,D (Table S39). Fifth, we compared the Imperio annotations to MaxCPP-blood (Maximum across genes of fine-mapped eQTL Causal Posterior Probability) annotation²⁷ constructed using GTEx whole blood gene expression data⁴⁸. 
The MaxCPP-blood annotation was only weakly correlated with Imperio annotations (average r = 0.09) and did not produce a significant disease signal conditioned on the baseline-LD-deep-S2G-geneset model (Table S40), consistent with the fact that a related MaxCPP annotation based on a meta-analysis across tissues²⁷ is already included in the baseline-LD model.

We conclude that allelic-effect annotations based on predictions of gene expression from DNA sequence using S2G linking strategies to prioritize deep learning annotations as features are uniquely informative for autoimmune diseases and blood-related traits.

Combined joint model

We constructed a combined joint model containing annotations from the above analyses that were jointly significant, contributing unique information conditional on all other annotations. In detail, we merged the baseline-LD-deep-S2G-geneset model with 3 DeepBoost boosted-restricted allelic-effect annotations from Figure 1, 2 gene-set specific DeepBoost annotations from Figure 2, 1 Imperio gene expression prediction allelic-effect annotation from Figure 3B, and 1 gene-set specific Imperio annotation from Figure 3D, and performed forward stepwise elimination to iteratively remove annotations that had conditionally non-significant τ* values after Bonferroni correction. The resulting combined joint model contained 3 new annotations, including 1 DeepBoost annotation (BasenjiΔ-boosted × TSS) and the 2 Imperio annotations (Imperio-Basenji and Imperio-DeepSEA (PPI-enhancer)) (Figure 4 and Table S41). Two of these annotations attained τ* > 0.5: BasenjiΔ-boosted × TSS (1.1 ± 0.29) and Imperio-DeepSEA (PPI-enhancer) (0.67 ± 0.15); as noted above, annotations with τ* > 0.5 are unusual, and considered to be important⁴⁷. The combined τ*^{19, 49} of the 3 annotations was high (1.7 ± 0.3).
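The elimination step described above can be sketched as follows. This is a minimal illustration rather than the S-LDSC implementation: `fit_pvalues` is a hypothetical callback standing in for a joint S-LDSC fit that returns a τ* p-value for each candidate annotation, removing one annotation per iteration is an assumption, and the default threshold of 0.05/174 assumes the same Bonferroni correction as in the marginal analyses.

```python
from typing import Callable, Dict, List


def stepwise_eliminate(
    candidates: List[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    threshold: float = 0.05 / 174,
) -> List[str]:
    """Iteratively drop the least significant annotation until all pass."""
    kept = list(candidates)
    while kept:
        pvals = fit_pvalues(kept)          # refit the joint model on what is left
        worst = max(kept, key=lambda a: pvals[a])
        if pvals[worst] <= threshold:      # every remaining annotation is significant
            break
        kept.remove(worst)                 # remove one annotation, then refit
    return kept


def toy_fit(annotations: List[str]) -> Dict[str, float]:
    """Toy stand-in for S-LDSC: 'B' becomes significant only once 'C' is removed."""
    base = {"A": 1e-8, "B": 1e-2, "C": 0.4}
    pvals = {a: base[a] for a in annotations}
    if "C" not in annotations and "B" in pvals:
        pvals["B"] = 1e-6
    return pvals


print(stepwise_eliminate(["A", "B", "C"], toy_fit))
```

The toy fit illustrates why refitting after each removal matters: conditional significance of one annotation can change once a correlated annotation has been dropped.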

Figure 4. Combined joint model:

(A) Heritability enrichment of 3 jointly significant annotations, conditional on the baseline-LD-deep-S2G-geneset model. (B) Standardized effect size (τ*) conditional on the baseline-LD-deep-S2G-geneset model plus the 3 jointly significant annotations. Results are meta-analyzed across 11 blood-related traits. The percentage under each bar denotes the size of the annotation (defined as average annotation value; equal to proportion of SNPs for binary annotations). Error bars denote 95% confidence intervals. Numerical results are reported in Table S41.

We independently evaluated the combined joint model of Figure 4 (and other models) by computing logl_SS³⁹ (an approximate likelihood metric that evaluates a heritability model defined by a set of functional annotations) relative to a model with no functional annotations (Δlogl_SS), averaged across a subset of 6 blood-related traits (1 autoimmune disease and 5 blood cell traits) from the UK Biobank⁴⁶ (Table S1). The combined joint model attained a +20.3% larger Δlogl_SS than the baseline-LD model (Table S42); +2.5% of the improvement derived from the 3 new annotations from Figure 4. The combined joint model also attained a +14.2% larger Δlogl_SS than the baseline-LD model (+2.2% of the improvement derived from the 3 new annotations from Figure 4) in a separate analysis of 24 non-blood-related traits from the UK Biobank (see Table S43 for list of traits) that had lower absolute logl_SS values (Table S42), implying that the value of the annotations introduced in this paper is not restricted to autoimmune diseases and blood-related traits.

We conclude that two types of allelic-effect annotations informed by S2G strategies—DeepBoost boosted-restricted annotations and Imperio gene expression prediction annotations—are jointly informative for autoimmune diseases and blood-related traits.

Discussion

We have evaluated the contribution to autoimmune disease of SNP annotations constructed by integrating 4 sequence-based models (3 deep learning approaches: DeepSEA, Basenji and DeepBind; and 1 machine learning approach: deltaSVM) with different types of functional data, including fine-mapped SNPs, SNP-to-gene linking strategies, gene expression data, and biologically important gene sets, using our DeepBoost and Imperio frameworks. We determined that boosted deep allelic-effect annotations restricted to SNPs implicated by functionally informed S2G linking strategies are uniquely informative for disease. We also determined that allelic-effect annotations based on prediction of gene expression from DNA sequence that were informed by S2G linking strategies are uniquely informative for disease, outperforming allelic-effect annotations from ExPecto⁴. We further determined that both DeepBoost and Imperio allelic-effect annotations were jointly informative for disease, resulting in an improved heritability model. All annotations that were uniquely informative for disease in our joint analyses using DeepBoost were based on the DeepSEA and Basenji models (and we thus restricted our Imperio analyses to these models). However, the DeepBind and deltaSVM models have performed well under other metrics: deltaSVM performed as well as or better than DeepSEA in analyses of MPRA data⁴⁰, and DeepSEA, DeepBind and deltaSVM performed similarly well in analyses of allele-specific transcription factor binding⁵⁰.

Our work has several downstream implications. First, the DeepBoost and Imperio frameworks can be applied to other models beyond DeepSEA, Basenji, DeepBind and deltaSVM, and we anticipate that future deep learning models will benefit from these frameworks. Second, the accuracy of the Imperio framework in capturing cis-SNP heritability in blood suggests that it may be valuable to integrate Imperio gene expression predictions in other settings, such as transcriptome-wide association studies (TWAS)^{51–53} or mediated expression score regression (MESC)⁵⁴. Third, our findings have immediate potential for improving functionally informed fine-mapping^{23, 55–57} (including experimental follow-up⁵⁸), polygenic localization²³, and polygenic risk prediction^{59, 60}.

Our work has several limitations, representing important directions for future research. First, we focused our analyses on functional data in blood, and on blood-related diseases/traits; this choice was motivated by (i) the better representation of some S2G strategies, such as ABC and Roadmap Enhancer, in blood cell types than in other tissues, and (ii) the particularly large functional enrichments observed in autoimmune diseases and blood-related traits^{16, 19, 27, 34}. However, it will be of interest to apply the DeepBoost and Imperio frameworks to other tissues and corresponding diseases/traits, once richer functional data becomes available. Second, we investigated the 10 S2G strategies separately, instead of constructing a single optimal combined strategy. A comprehensive evaluation of S2G strategies, and a method to combine them, will be provided elsewhere⁶¹. Third, our S-LDSC analyses are inherently focused on common variants, but deep learning models have also shown promise in prioritizing rare pathogenic variants^{4, 8, 62}. The value of deep learning models for prioritizing rare pathogenic variants has been questioned in a recent analysis focusing on Human Gene Mutation Database (HGMD) variants⁶³, meriting further investigation. Fourth, we focused here on deep learning models trained using human data, but models trained using data from other species may also be informative for human disease^{26, 64}. Fifth, the forward stepwise elimination procedure that we use to identify jointly significant annotations³⁴ is a heuristic procedure whose choice of prioritized annotations may be close to arbitrary in the case of highly correlated annotations. Nonetheless, our framework does impose rigorous criteria for conditional informativeness. 
Sixth, the large number of features (up to 4,229 features; Basenji model) makes it difficult to delineate which features the observed improvements derive from; this limitation is not unique to our work, as previous studies using deep learning models included a similarly large number of features^{1, 2, 4, 5}.

Despite all these limitations, our findings improve the informativeness of deep learning models for autoimmune diseases and blood-related traits, and enhance our understanding of the sequence-mediated regulatory processes impacting these diseases/traits.

Methods

Genomic annotations and the baseline-LD model

We define a functional annotation as an assignment of a numeric value to each SNP with minor allele count ≥ 5 in a predefined reference panel (e.g., 1000 Genomes Project³⁶; see URLs). Annotations can be either binary or continuous-valued (Methods). Our focus is on continuous-valued annotations (with values between 0 and 1) that are obtained by integrating deep learning models with functional data, including fine-mapped SNPs, SNP-to-gene linking strategies, gene expression data, and biologically important gene sets. Annotations that correspond to known or predicted function are referred to as functional annotations. The baseline-LD model (v.2.1) contains 86 functional annotations (see URLs). These annotations include binary coding, conserved, and regulatory annotations (e.g., promoter, enhancer, histone marks, TFBS) and continuous-valued linkage disequilibrium (LD)-related annotations.

DeepSEA, Basenji, DeepBind and deltaSVM functional annotations

Deep learning/machine learning annotations were derived using three pre-trained Convolutional Neural Net (CNN) models: Basenji⁵, DeepSEA^{2, 4} (architecture from ref.⁴) and DeepBind¹; and a Support Vector Machine (SVM) based machine learning model: deltaSVM⁹ (see URLs). Basenji is a Poisson likelihood model trained on original count data from 4,229 cell-type specific histone mark, chromatin accessibility and FANTOM5 CAGE^{65, 66} annotations. Basenji uses dilated convolutional layers that allow scanning a much larger contiguous sequence around a variant (~130kb). DeepSEA is a classification-based model trained on binary peak call data from 2,002 cell-type specific TFBS, histone mark and chromatin accessibility annotations from the ENCODE⁶⁷ and Roadmap Epigenomics⁴¹ projects with a sequence length of 1kb. DeepBind is a convolutional neural net model trained on 927 non-tissue-specific features based on 538 distinct transcription factors and 194 distinct RNA binding proteins. We restricted the deltaSVM model to 1,329 pre-trained sequence-based gapped k-mer support vector machine (gkm-SVM^{10, 11}) features comprising 699 ENCODE3 TFs, 317 DHS promoters and 313 DHS enhancers from Roadmap⁹ (see URLs), as these features were previously shown to be most informative in ref.⁴⁰. For each SNP with minor allele count ≥ 5 in 1000 Genomes, we applied the pre-trained DeepSEA and Basenji models to the surrounding DNA sequence to compute both the prediction (at reference allele) and the predicted difference in probability between the reference and the alternate alleles. We call these the variant-level annotations and allelic-effect annotations respectively; this naming convention has been used previously¹⁹. The allelic-effect annotations are more interesting from a biological perspective as they are specific to a sequence-based predictive model like these deep learning models.
We define a “published” allelic-effect annotation for each model by aggregating allelic effects across features. We defined DeepSEAΔ-published and BasenjiΔ-published annotations as the maximum absolute allelic effect across DNase, H3K27ac, H3K4me1 and H3K4me3 epigenomic marks in 27 blood cell types from Roadmap Epigenomics data^{19, 41}. Similarly, we defined deltaSVMΔ-published as the maximum absolute effect across all features in the 27 blood cell types. Since DeepBind is non-tissue-specific, we defined DeepBindΔ-published as the maximum absolute effect across all features considered.
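As a minimal sketch of this aggregation, the function below takes, for each SNP, the signed allelic effects across a chosen set of features and returns the maximum absolute value. The SNP identifiers and effect values are hypothetical, and the choice of which features enter the list (for example, the four marks in the 27 blood cell types) is left to the caller.

```python
from typing import Dict, List


def published_annotation(allelic_effects: Dict[str, List[float]]) -> Dict[str, float]:
    """Max absolute allelic effect across features, per SNP.

    allelic_effects maps a SNP id to its signed allelic effects, one value per
    feature (e.g. one per epigenomic mark and cell type).
    """
    return {snp: max(abs(v) for v in effects)
            for snp, effects in allelic_effects.items()}


toy = {
    "rs1": [0.02, -0.30, 0.10],   # hypothetical SNP ids and effect values
    "rs2": [-0.01, 0.00, 0.05],
}
print(published_annotation(toy))
```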

Boosted deep learning annotations using DeepBoost

DeepSEAΔ-published and BasenjiΔ-published represent a simple maximum of allelic-effect annotations across tissues and chromatin features. Here we introduce a gradient boosting approach to combine allelic-effect annotations across tissues and chromatin features. In detail, we train a classification model using decision trees, where each node in a tree splits SNPs into 2 classes (fine-mapped and control) using deep learning allelic-effect annotations from DeepSEA and Basenji models. The features in this classification model comprise either allelic-effect annotations for 2,002 DeepSEA features or allelic-effect annotations for 4,229 Basenji features. We choose the control SNPs from non-fine-mapped SNPs matched for MAF, LD, local GC-content and the number of repeats distribution. MAF is based on the same reference panel (European samples from 1000 Genomes Phase 3³⁶), and LD is estimated by applying S-LDSC on the all-SNPs annotation (‘base’). The number of control SNPs was chosen to equal the number of fine-mapped SNPs. We used fine-mapped SNP data related to blood traits from three sources^{21–23}.

We used the Extreme gradient boosting (XGBoost) method implemented in the XGBoost software^{20, 68} with the following model parameters: the number of estimators (200, 250, 300), depth of the tree (25, 30, 35), learning rate (0.05), gamma (minimum loss reduction required before additional partitioning on a leaf node; 10), minimum child weight (6, 8, 10), and subsample (0.6, 0.8, 1); we tuned these hyper-parameters using a randomized search with five-fold cross-validation. Two important parameters to avoid over-fitting are gamma and learning rate; we chose these values consistent with previous studies⁶⁹, as in our previous work on the AnnotBoost framework³⁷.
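The search space implied by these values can be enumerated with a short sketch. The keys below follow the parameter names of XGBoost's scikit-learn interface; the number of sampled configurations (`n_iter=10`) and the seed are assumptions, and the model fitting and five-fold cross-validation that would score each configuration are omitted.

```python
import itertools
import random

# Candidate values as listed in the text; learning rate and gamma are fixed.
search_space = {
    "n_estimators": [200, 250, 300],
    "max_depth": [25, 30, 35],
    "learning_rate": [0.05],
    "gamma": [10],
    "min_child_weight": [6, 8, 10],
    "subsample": [0.6, 0.8, 1.0],
}


def sample_configurations(space, n_iter, seed=0):
    """Draw n_iter distinct configurations uniformly from the full grid."""
    keys = list(space)
    grid = [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]
    rng = random.Random(seed)
    return rng.sample(grid, min(n_iter, len(grid)))


configs = sample_configurations(search_space, n_iter=10)
n_total = len(list(itertools.product(*search_space.values())))
print(n_total, "possible configurations;", len(configs), "sampled")
```

Because only four of the six parameters vary, the full grid is small, so a randomized search covers a sizeable fraction of it even with few draws.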

The gradient boosting predictor is based on T additive estimators (T = 200, 250, 300), and it minimizes the loss objective function ℒ^t at iteration t:

$$\mathcal{L}^{t} = \sum_{i} l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \gamma(f_t) \qquad (1)$$

where y_i is the class label (fine-mapped or control) of SNP i, x_i is its vector of allelic-effect features, ŷ_i^{(t−1)} is the prediction after t − 1 iterations, l is the logistic loss, f_t is an independent tree structure and γ(f_t) is the complexity parameter. The final prediction from the gradient boosting model therefore is

$$\hat{y}_i = \sum_{t=1}^{T} f_t(x_i) \qquad (2)$$

In order to avoid winner’s curse and overfitting, we use fine-mapped SNPs on odd (respectively even) chromosomes as training data to make predictions for even (respectively odd) chromosomes, as in our previous work on AnnotBoost³⁷; thus, boosted annotations on a given chromosome are not informed by fine-mapped SNPs on that chromosome. We report the average AUROC of odd and even chromosome classifiers. The boosted annotations produced as output of the classifier are probabilistic in nature because of the logistic loss. We generate 4 boosted annotations, DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted, for each of 4 sets of fine-mapped SNPs, comprising 8,741 fine-mapped autoimmune disease SNPs²¹, 4,312 fine-mapped inflammatory bowel disease SNPs²², 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits^{23, 46}, or the union of these 14,482 fine-mapped SNPs.
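The odd/even chromosome scheme can be sketched as below. The sketch only shows how SNPs are partitioned and which chromosomes may inform predictions on a given target chromosome; the classifier itself is omitted, the SNP identifiers are hypothetical, and restricting to autosomes 1 to 22 is an assumption.

```python
from typing import Dict, List, Tuple

AUTOSOMES = list(range(1, 23))


def training_chromosomes(target_chrom: int) -> List[int]:
    """Chromosomes whose fine-mapped SNPs may inform predictions on target_chrom."""
    parity = target_chrom % 2
    return [c for c in AUTOSOMES if c % 2 != parity]


def split_by_parity(snps: List[Tuple[str, int]]) -> Dict[str, List[str]]:
    """Group (snp_id, chromosome) pairs into odd and even chromosome sets."""
    groups: Dict[str, List[str]] = {"odd": [], "even": []}
    for snp_id, chrom in snps:
        groups["odd" if chrom % 2 else "even"].append(snp_id)
    return groups


toy_snps = [("rs1", 1), ("rs2", 2), ("rs3", 8), ("rs4", 21)]  # hypothetical SNPs
print(split_by_parity(toy_snps))
print(training_chromosomes(8))
```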

Boosted-restricted deep learning annotations using S2G strategies

We define a SNP-to-gene (S2G) linking strategy as an assignment of 0, 1 or more linked genes to each SNP with minor allele count ≥ 5 in a 1000 Genomes Project European reference panel³⁶. We intersect the 8 allelic-effect annotations from the previous subsections (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published, DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted) with 10 S2G strategies used in ref.³⁸ to generate 80 restricted allelic-effect annotations.
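A minimal sketch of the restriction step follows, under the assumption that intersecting a continuous-valued annotation with an S2G strategy keeps the annotation value at SNPs linked to at least one gene and sets it to 0 elsewhere. The short strategy labels, SNP identifiers and gene names are placeholders used only for illustration.

```python
import itertools

models = ["DeepSEA", "Basenji", "DeepBind", "deltaSVM"]
kinds = ["published", "boosted"]
# Short labels for the 10 S2G strategies are assumptions used only for naming.
s2g = ["5kb", "100kb", "TSS", "Coding", "Promoter",
       "ABC", "Roadmap", "ATAC", "PC-HiC", "eQTL"]

names = [f"{m}Δ-{k} × {s}" for m, k, s in itertools.product(models, kinds, s2g)]


def restrict(annotation, linked_genes):
    """Keep a SNP's annotation value only if the S2G strategy links it to >= 1 gene."""
    return {snp: (value if linked_genes.get(snp) else 0.0)
            for snp, value in annotation.items()}


toy_annotation = {"rs1": 0.8, "rs2": 0.3, "rs3": 0.5}          # hypothetical values
toy_links = {"rs1": ["GENE_A"], "rs2": [], "rs3": ["GENE_B", "GENE_C"]}
print(len(names), restrict(toy_annotation, toy_links))
```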

We explored 10 SNP-to-gene linking strategies in blood (Table 1). The proximal strategies included gene body ± 5kb; gene body ± 100kb; predicted TSS (by Segway^{30, 31}); coding SNPs; and promoter SNPs (as defined by UCSC^{70, 71}). The distal strategies included regions predicted to be distally linked to the gene by Activity-by-Contact (ABC) score^{24, 25} > 0.015 as suggested in ref.²⁴ (see below); regions predicted to be enhancer-gene links based on Roadmap Epigenomics data (Roadmap)^{28, 32, 41}; regions in ATAC-seq peaks that are highly correlated (> 50% as recommended in ref.²⁶) to expression of a gene in mouse immune cell-types (ATAC)²⁶; regions distally connected through promoter-capture Hi-C links (PC-HiC)²⁹; and SNPs with fine-mapped causal posterior probability (CPP)²⁷ > 0.001 (we chose this threshold to ensure that the SNP annotations generated after combining the gene scores with the eQTL S2G strategy were of reasonable size (0.2% of SNPs or larger) for all gene scores analyzed) in GTEx whole blood. (This is the threshold used throughout the analyses in our parallel study providing a comprehensive evaluation of S2G strategies⁶¹, which was initiated prior to the current study and has different goals.)

The boosted-restricted allelic-effect annotations were further restricted to SNPs linked to genes in two biologically important gene sets: pLI³³ and PPI-enhancer³⁸.

PPI-enhancer: A binary gene score denoting genes in the top 10% in terms of closeness centrality measure to the disease informative enhancer-regulated gene scores as defined in ref.³⁸. To get the closeness centrality metric, we first perform a Random Walk with Restart (RWR) algorithm⁷² on the STRING protein-protein interaction (PPI) network^{73, 74} (see URLs) with seed nodes defined by genes in the top 10% of the 4 enhancer-regulated gene scores defined in ref.³⁸ with jointly significant disease informativeness (ABC-G^{24, 25}, ATAC-distal²⁶, EDS-binary⁷⁵ and SEG-GTEx⁷⁶). The closeness centrality score was defined as the average network connectivity of the protein products from each gene based on the RWR method.
pLI : A probabilistic gene score with each gene graded by the probability of intolerance to loss of function mutations³³.
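The Random Walk with Restart step used for the PPI-enhancer score can be illustrated with a generic power-iteration sketch on a toy unweighted network. This is not the pipeline of ref.³⁸: the restart probability, the number of iterations, the uniform edge weights and the gene names are all assumptions, and the subsequent conversion of RWR output into a closeness centrality score is not shown.

```python
from typing import Dict, List


def random_walk_with_restart(
    adjacency: Dict[str, List[str]],
    seeds: List[str],
    restart: float = 0.5,      # hypothetical restart probability
    n_iter: int = 100,
) -> Dict[str, float]:
    """Power iteration for p <- (1 - restart) * W p + restart * p0."""
    nodes = list(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(n_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * p[n] / len(adjacency[n])
            for neighbour in adjacency[n]:
                nxt[neighbour] += share   # spread mass evenly over neighbours
        p = nxt
    return p


# Toy undirected network; gene names are placeholders, not STRING identifiers.
toy_ppi = {
    "G1": ["G2", "G3"],
    "G2": ["G1", "G3"],
    "G3": ["G1", "G2", "G4"],
    "G4": ["G3"],
}
scores = random_walk_with_restart(toy_ppi, seeds=["G1"])
print(sorted(scores, key=scores.get, reverse=True))
```

Genes closer to the seed set in the network receive more probability mass, which is the property a top-10% cut on the resulting score relies on.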

We generate an additional 80 annotations by combining the 2 gene scores (pLI, PPI-enhancer) with 40 restricted boosted allelic-effect annotations for DeepSEA, Basenji, DeepBind and deltaSVM models and 10 S2G strategies.

Imperio deep learning annotations using gene expression predictions informed by S2G strategies

We propose a new framework, Imperio, for predicting gene expression from DNA sequence by using S2G strategies to prioritize deep learning annotations (sequence-based deep learning predictions of functional annotations) as features. This approach is analogous to the recent ExPecto framework⁴, but focuses on sequences around common variants linked to genes, either proximally or distally via enhancers, as in the Roadmap and ABC distal S2G strategies. We selected these two distal S2G strategies because they outperformed other distal strategies in blood in our previous work³⁸. We integrate both DeepSEA and Basenji models with S2G strategies to predict gene expression. We consider a reduced set of 5 classes of S2G strategies: 5kb, 100kb, TSS, ABC and Roadmap Enhancer. We fit an elastic net regularized linear regression model to log RPKM expression data for gene g, Y_g:

$$Y_g = \sum_{d} \sum_{f} \beta_{fd} \sum_{s \in S_{gd}} p_{sf} + \epsilon_g \qquad (3)$$

where f represents the chromatin mark features for the deep learning model (2,002 for the DeepSEA model and 4,229 for the Basenji model), s represents SNPs that are at least 1Kb apart, ensuring relatively weaker correlation in their variant-effect or allelic-effect annotations, d represents a SNP-to-gene linking strategy, and S_gd represents the set of all SNPs linked to gene g by the S2G strategy d. The 5 types of S_gd are:

d = 5kb: SNPs in a window of 5kb around gene g.
d = 100kb: SNPs in a window of 100kb around gene g.
d = TSS: SNPs in a window of ±5kb around the TSS of gene g.
d = ABC: SNPs in regions linked to gene g by aggregation of Hi-C and enhancer marks data in 56 blood cell-types with an Activity-by-Contact (ABC) score > 0.03.
d = Roadmap Enhancer: SNPs in Roadmap Enhancers linked to gene g in 27 blood cell-types.

β_fd represents the model coefficient capturing the effect of each chromatin feature f and each S2G strategy d on gene expression. p_sf represents the variant-level prediction for chromatin feature f around SNP s. ϵ_g represents white noise in the regression model. The model in Equation 3 is fitted using the Extreme gradient boosting (XGBoost) method. Following the training procedure in ExPecto, all genes except those on chromosome 8 (chr8) were used for training. The predictive performance of this approach is assessed on the held-out chromosome chr8.
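The construction of the regression inputs can be sketched as follows, under the assumption that the feature for a (strategy d, chromatin feature f) pair is the sum of variant-level predictions p_sf over the SNPs linked to the gene by d, which gives one coefficient β_fd per pair. The model fit is omitted; the strategy labels, SNP identifiers, gene names and the use of 2 features instead of 2,002 or 4,229 are illustrative only.

```python
from typing import Dict, List

STRATEGIES = ["5kb", "100kb", "TSS", "ABC", "Roadmap"]


def gene_features(
    linked_snps: Dict[str, List[str]],        # strategy -> SNPs linked to the gene
    predictions: Dict[str, List[float]],      # SNP -> variant-level predictions p_sf
    n_features: int,
) -> List[float]:
    """Concatenate, over strategies d, the per-feature sums of p_sf over linked SNPs."""
    row: List[float] = []
    for d in STRATEGIES:
        sums = [0.0] * n_features
        for snp in linked_snps.get(d, []):
            for f, value in enumerate(predictions[snp]):
                sums[f] += value
        row.extend(sums)
    return row


def split_train_test(gene_chrom: Dict[str, str], holdout: str = "chr8"):
    """Hold out all genes on one chromosome for evaluation."""
    train = [g for g, c in gene_chrom.items() if c != holdout]
    test = [g for g, c in gene_chrom.items() if c == holdout]
    return train, test


toy_predictions = {"rs1": [0.25, 0.5], "rs2": [0.5, 0.25]}   # 2 features for brevity
toy_links = {"5kb": ["rs1"], "100kb": ["rs1", "rs2"], "TSS": []}
x = gene_features(toy_links, toy_predictions, n_features=2)
print(len(x), x)
print(split_train_test({"GENE_A": "chr1", "GENE_B": "chr8"}))
```

With 5 strategies, the row length is 5 times the number of chromatin features, matching the 2,002 × 5 or 4,229 × 5 feature counts stated in the Results.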

We define the signed Imperio effect of each SNP as the sequence mediated effect on expression of a variant s and S2G strategy d. 𝒥_sd is the per-allele estimated change in expression caused by SNP s for any gene it is linked to through S2G strategy d. 𝒥_sd is treated as the atom for any Imperio based annotations we investigate.

The total absolute change in expression of gene g caused by SNP s and strategy d is given as follows. The total sequence mediated absolute predicted change by SNP s and S2G strategy d across all genes g is given by where N_sd is the number of genes linked to SNP s by S2G strategy d. We adjust for the minor allele frequency (MAF) p_s for each SNP s to adjust for per-allele effect sizes, as per ref⁷⁷. These Δ_s scores were normalized to convert them to a probabilistic scale.

For a supplementary analysis, we also consider annotations that do not include the information of the number of genes linked to a SNP (ξ_s). We analyze 3 annotations, 2 Imperio annotations, Imperio-DeepSEA and Imperio-Basenji, and ExPecto-DeepSEA.

Predicting gene expression across individuals using Imperio

We use the Imperio effect of each SNP s in S2G strategy d, 𝒥_sd from Equation 4 (for either DeepSEA or the Basenji model) to define a gene specific Imperio score for each individual n and S2G strategy d as follows.

where G_ns represents the number of risk alleles for individual n and the commonly varying SNP s.

Next we perform a regression on the normalized gene expression log RPKM data for individual n and gene g, Y_ng with predictors given by and . B denotes the covariates that are adjusted for in the model. We consider a total of 68 covariates including 5 principal components across samples, platform, gender and PCR amplification. In cases where there is only one SNP s in and only one of these predictors is used. This model provides an insight on relative contribution of different S2G strategies in explaining the inter-individual gene expression variation. The inter-individual Imperio model in Equation 12 is a linear model in risk alleles (G_ns), similar to a gene expression cis-heritability model but with constrained parameters; thus, the cis-heritability represents an upper bound on the prediction r² from the Imperio inter-individual prediction model.

We compute , the proportion of variance explained by the predictor variables and for all S2G strategies d and for gene g.

Gene set-specific Imperio deep learning annotations

The Imperio model coefficients β_fd in the previous section are fitted across all genes. However, different genes may have distinct sequence-mediated regulatory characteristics. Additionally, not all genes in blood are equally important. Therefore, we propose a gene-set specific Imperio model, where we perform the training of the model in Equation 3 over all genes g in a particular gene set 𝒢. We consider two gene sets, pLI³³ and PPI-enhancer³⁸ (see above).

The sequence-mediated expression effect of a variant s corresponding to gene set 𝒢 is given by where are the estimated model coefficients of β_fd in Equation 3 fitted for genes in gene set 𝒢.

We analyze 4 annotations, combining Imperio models for DeepSEA and Basenji models with the PPI-enhancer and pLI gene sets.

We further define intermediate Imperio annotations by restricting either (a) the fitting of feature weights or (b) the gene expression predictions (but not both) to genes in the input gene set.

We define Imperio-sub-1 annotations as those generated by using all genes for fitting the model and gene sets for computing the expression allelic effects. We define Imperio-sub-2 annotations as those generated by using genes in a gene set for fitting the model and all genes for computing the expression allelic effects.

Activity-by-Contact S2G strategy

The Activity-by-Contact (ABC)^{24, 25} (https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction) S2G strategy is determined by a predictive model for enhancer-gene connections in each cell type, based on measurements of chromatin accessibility (ATAC-seq or DNase-seq) and histone modifications (H3K27ac ChIP-seq), as previously described^{24, 25}. We provide a brief summary of this approach, following ref.^{24, 61} (which contains further details). In a given cell type, the ABC model reports an “ABC score” for each element-gene pair, where the element is within 5 Mb of the TSS of the gene.

For each cell type, we:

Called peaks on the chromatin accessibility data using MACS2 with a lenient p-value cutoff of 0.1.
Counted chromatin accessibility reads in each peak and retained the top 150,000 peaks with the most read counts. We then resized each of these peaks to be 500bp centered on the peak summit. To this list we added 500bp regions centered on all gene TSS’s and removed any peaks overlapping blacklisted regions^{78, 79} (https://sites.google.com/site/anshulkundaje/projects/blacklists). Any resulting overlapping peaks were merged. We call the resulting peak set candidate elements.
Calculated element Activity as the geometric mean of quantile normalized chromatin accessibility and H3K27ac ChIP-seq counts in each candidate element region.
Calculated element-promoter Contact using the average Hi-C signal across 10 human Hi-C datasets as described below.
Computed the ABC Score for each element-gene pair as the product of Activity and Contact, normalized by the sum of the products of Activity and Contact over all candidate elements within 5 Mb of that gene.
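The scoring step can be illustrated with a minimal Python sketch. It assumes that Activity is the geometric mean of the two (already quantile-normalized) signals and that the normalization is the sum of Activity × Contact over all candidate elements supplied for a gene. The function names and the toy values are hypothetical and are not taken from the ABC codebase.

```python
import math
from collections import defaultdict


def element_activity(accessibility, h3k27ac):
    """Geometric mean of the accessibility and H3K27ac signals of one element."""
    return math.sqrt(accessibility * h3k27ac)


def abc_scores(pairs):
    """Compute ABC scores for (element, gene, activity, contact) tuples.

    `pairs` is assumed to already contain every candidate element within the
    distance window of each gene. Returns {(element, gene): score}.
    """
    pairs = list(pairs)
    totals = defaultdict(float)
    for _, gene, activity, contact in pairs:
        totals[gene] += activity * contact
    scores = {}
    for element, gene, activity, contact in pairs:
        denom = totals[gene]
        scores[(element, gene)] = activity * contact / denom if denom > 0 else 0.0
    return scores


# Toy example (hypothetical values): three candidate elements for one gene.
example = [
    ("e1", "GENE_A", element_activity(8.0, 2.0), 0.50),  # activity 4.0
    ("e2", "GENE_A", element_activity(4.0, 1.0), 0.25),  # activity 2.0
    ("e3", "GENE_A", element_activity(1.0, 1.0), 0.50),  # activity 1.0
]
scores = abc_scores(example)
```

With this normalization, the scores of all candidate elements for a given gene sum to 1, so each score can be read as the share of the regulatory input to that gene attributed to one element.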

To generate a genome-wide averaged Hi-C dataset, we downloaded KR normalized Hi-C matrices for 10 human cell types (GM12878, NHEK, HMEC, RPE1, THP1, IMR90, HUVEC, HCT116, K562, KBM7). The averaged Hi-C matrix (5 Kb resolution) is available here: ftp://ftp.broadinstitute.org/outgoing/lincRNA/average_hic/average_hic.v2.191020.tar.gz^{25, 80}. For each cell type we performed the following steps.

Transformed the Hi-C matrix for each chromosome to be doubly stochastic.
We then replaced the entries on the diagonal of the Hi-C matrix with the maximum of its four neighboring bins.
We then replaced all entries of the Hi-C matrix with a value of NaN or corresponding to Knight–Ruiz matrix balancing (KR) normalization factors < 0.25 with the expected contact under the power-law distribution in the cell type.
We then scaled the Hi-C signal for each cell type using the power-law distribution in that cell type as previously described.
We then computed the “average” Hi-C matrix as the arithmetic mean of the 10 cell-type specific Hi-C matrices.
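Three of the steps above (diagonal replacement, filling missing entries from a power law, and arithmetic averaging) can be sketched in plain Python on small toy matrices. This is an illustration under stated assumptions, not the pipeline used in the paper: the doubly stochastic transformation, the KR-factor filter and the per-cell-type rescaling are omitted, and the power-law parameters `scale` and `gamma` are hypothetical placeholders.

```python
import math


def fix_diagonal(m):
    """Replace each diagonal entry with the maximum of its neighbouring bins."""
    n = len(m)
    out = [row[:] for row in m]
    for i in range(n):
        neighbours = []
        if i > 0:
            neighbours += [m[i - 1][i], m[i][i - 1]]
        if i < n - 1:
            neighbours += [m[i + 1][i], m[i][i + 1]]
        valid = [v for v in neighbours if not math.isnan(v)]
        if valid:
            out[i][i] = max(valid)
    return out


def fill_missing(m, scale, gamma):
    """Replace NaN entries with a power-law expectation scale * distance**(-gamma)."""
    n = len(m)
    out = [row[:] for row in m]
    for i in range(n):
        for j in range(n):
            if math.isnan(out[i][j]):
                distance = max(abs(i - j), 1)  # distance in bins, at least 1
                out[i][j] = scale * distance ** (-gamma)
    return out


def average_matrices(matrices):
    """Entry-wise arithmetic mean of equally sized square matrices."""
    n = len(matrices[0])
    k = len(matrices)
    return [[sum(m[i][j] for m in matrices) / k for j in range(n)] for i in range(n)]


nan = float("nan")
cell_a = [[9.0, 4.0, nan], [4.0, 9.0, 2.0], [nan, 2.0, 9.0]]  # toy values
cell_b = [[8.0, 2.0, 1.0], [2.0, 8.0, 3.0], [1.0, 3.0, 8.0]]  # toy values
processed = [fill_missing(fix_diagonal(m), scale=2.0, gamma=1.0) for m in (cell_a, cell_b)]
average_hic = average_matrices(processed)
```

Because the matrices are symmetric, the four neighbours of a diagonal bin reduce to two distinct values; the sketch keeps all four to mirror the description.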

In each cell type, we assign enhancers only to genes whose promoters are “active” (i.e., where the gene is expressed and that promoter drives its expression). We defined active promoters as those in the top 60% of Activity (geometric mean of chromatin accessibility and H3K27ac ChIP-seq counts). We used the following set of TSSs (one per gene symbol) for ABC predictions: https://github.com/broadinstitute/ABC-Enhancer-Gene-Prediction/blob/v0.2.1/reference/RefSeqCurated.170308.bed.CollapsedGeneBounds.bed. We note that this approach does not account for cases where genes have multiple TSSs either in the same cell type or in different cell types.

For intersecting ABC predictions with variants, we took the predictions from the ABC Model and applied the following additional processing steps: (i) We considered all distal element-gene connections with an ABC score ≥ 0.015, and all distal or proximal promoter-gene connections with an ABC score ≥ 0.1. (ii) We shrank the 500-bp regions by 150 bp on either side, resulting in a 200-bp region centered on the summit of the accessibility peak. This is because, while the larger region is important for counting reads in H3K27ac ChIP-seq, which occur on flanking nucleosomes, most of the DNA sequences important for enhancer function are likely located in the central nucleosome-free region. (iii) We included enhancer-gene connections spanning up to 2 Mb.
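The three processing steps can be sketched as a single filter in Python. The record layout (a dictionary with `start`, `end`, `tss`, `abc_score` and `is_promoter` keys) and the use of the element midpoint to measure distance to the TSS are assumptions made for illustration; only the thresholds come from the text.

```python
def process_prediction(pred, distal_min=0.015, promoter_min=0.1,
                       shrink=150, max_distance=2_000_000):
    """Apply the score threshold, distance cap and region trimming to one record.

    Returns a trimmed copy of `pred`, or None if the connection is filtered out.
    """
    # (i) class-specific ABC score threshold
    threshold = promoter_min if pred["is_promoter"] else distal_min
    if pred["abc_score"] < threshold:
        return None
    # (iii) keep connections spanning at most max_distance (midpoint to TSS, assumed)
    midpoint = (pred["start"] + pred["end"]) // 2
    if abs(midpoint - pred["tss"]) > max_distance:
        return None
    # (ii) trim `shrink` bp from each side of the element
    trimmed = dict(pred)
    trimmed["start"] = pred["start"] + shrink
    trimmed["end"] = pred["end"] - shrink
    return trimmed


# Hypothetical records: one kept, one below threshold, one too distant.
predictions = [
    {"start": 1000, "end": 1500, "tss": 50_000, "abc_score": 0.02, "is_promoter": False},
    {"start": 1000, "end": 1500, "tss": 50_000, "abc_score": 0.05, "is_promoter": True},
    {"start": 1000, "end": 1500, "tss": 3_000_000, "abc_score": 0.50, "is_promoter": False},
]
kept = [p for p in map(process_prediction, predictions) if p is not None]
```

Variants would then be intersected with the trimmed 200-bp intervals in `kept`.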

Number of new annotations analyzed

For the Bonferroni correction, we corrected for 174 new annotations analyzed in our primary analyses (8 + 80 + 80 + 2 + 4 = 174). This choice is appropriate given the large number of potential secondary analyses, and is consistent with previous work¹⁹:

8 genome-wide allelic-effect annotations: 4 published (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published and deltaSVMΔ-published) and 4 boosted (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted) annotations constructed using fine-mapped SNPs from ref.²¹ (vs. matched control SNPs). [Figure 1]
80 restricted deep learning allelic-effect annotations corresponding to 4 published annotations (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published and deltaSVMΔ-published) and 4 boosted annotations (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted), restricted using 10 S2G strategies from Table 1. [Figure 1]
80 gene set-specific restricted deep learning allelic-effect annotations, integrating DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted annotations, restricted using 10 S2G strategies from Table 1 with SNPs linked to genes specific to 2 gene scores (pLI and PPI-enhancer). [Figure 2]
2 Imperio annotations (Imperio-DeepSEA, Imperio-Basenji) (we also analyzed 1 ExPecto-DeepSEA annotation from ref.⁴, but this is not a new annotation). [Figure 3]
4 gene-set specific Imperio annotations combining Imperio-DeepSEA and Imperio-Basenji models with genes from 2 gene sets (pLI and PPI-enhancer). [Figure 3]
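The count of 174 and the resulting per-annotation significance threshold can be checked with a few lines of Python. The family-wise level of 0.05 is an assumption (it is not stated in this section), and the dictionary keys are illustrative labels.

```python
# Number of new annotations in each category, following the list above.
counts = {
    "genome-wide (4 published + 4 boosted)": 4 + 4,
    "restricted (8 annotations x 10 S2G strategies)": 8 * 10,
    "gene set-specific restricted (4 boosted x 10 S2G x 2 gene scores)": 4 * 10 * 2,
    "Imperio (DeepSEA, Basenji)": 2,
    "gene set-specific Imperio (2 models x 2 gene sets)": 2 * 2,
}
n_annotations = sum(counts.values())

alpha = 0.05  # assumed family-wise error rate
bonferroni_threshold = alpha / n_annotations
```

Under this assumption the per-annotation p-value threshold is 0.05/174, roughly 2.9 × 10⁻⁴.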

Stratified LD score regression

Stratified LD score regression (S-LDSC) is a method that assesses the contribution of a genomic annotation to disease and complex trait heritability^{16, 34}. S-LDSC assumes that the per-SNP heritability or variance of effect size (of standardized genotype on trait) of each SNP is equal to the linear contribution of each annotation:

$$\mathrm{Var}(\beta_j) = \sum_c a_{cj} \tau_c,$$

where a_cj is the value of annotation c for SNP j, where a_cj may be binary (0/1), continuous or probabilistic, and τ_c is the contribution of annotation c to per-SNP heritability conditioned on other annotations. S-LDSC estimates the τ_c for each annotation using the following equation:

$$E[\chi^2_j] = N \sum_c \tau_c \, \ell(j, c) + 1,$$

where $\ell(j, c) = \sum_k a_{ck} r_{jk}^2$ is the stratified LD score of SNP j with respect to annotation c and r_jk is the genotypic correlation between SNPs j and k computed using data from 1000 Genomes Project³⁶ (see URLs); N is the GWAS sample size.
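As a worked illustration, the following Python sketch computes stratified LD scores and the corresponding expected χ² statistics for a toy set of three SNPs. It assumes the standard S-LDSC form in which the stratified LD score of SNP j for annotation c is Σ_k a_ck r_jk² and the expected χ² is N Σ_c τ_c ℓ(j, c) + 1. All numbers are made up, and the function names are not part of the S-LDSC software.

```python
def stratified_ld_scores(r, annot):
    """Return l[j][c] = sum_k annot[c][k] * r[j][k] ** 2.

    r     : M x M matrix of genotypic correlations between SNPs
    annot : C x M matrix of annotation values a_ck
    """
    m = len(r)
    return [[sum(a_c[k] * r[j][k] ** 2 for k in range(m)) for a_c in annot]
            for j in range(m)]


def expected_chi2(ld_scores_j, tau, n):
    """Expected chi-square statistic of one SNP given its stratified LD scores."""
    return n * sum(t * l for t, l in zip(tau, ld_scores_j)) + 1.0


# Toy data: 3 SNPs, 2 annotations (an all-ones base annotation and a binary one).
r = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.5],
     [0.0, 0.5, 1.0]]
annot = [[1.0, 1.0, 1.0],
         [1.0, 0.0, 0.0]]
tau = [1e-6, 2e-6]      # hypothetical per-annotation contributions
n_samples = 100_000     # hypothetical GWAS sample size

ld_scores = stratified_ld_scores(r, annot)
chi2_expected = [expected_chi2(l_j, tau, n_samples) for l_j in ld_scores]
```

In practice the τ_c are estimated by regressing observed χ² statistics on the stratified LD scores; the sketch only evaluates the forward model.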

We assess the informativeness of an annotation c using two metrics. The first metric is enrichment (E), defined as follows (for binary and probabilistic annotations only):

$$E_c = \frac{h^2_g(c) / h^2_g}{\sum_j a_{cj} / M},$$

where h²_g(c) is the heritability explained by the SNPs in annotation c, weighted by the annotation values, h²_g is the total SNP heritability and M is the total number of SNPs on which this heritability is computed (equal to 5,961,159 in our analyses).

The second metric is standardized effect size (τ*), defined as follows (for binary, probabilistic, and continuous-valued annotations):

$$\tau^*_c = \frac{\tau_c \, \mathrm{sd}_c}{h^2_g / M},$$

where sd_c is the standard deviation of annotation c. τ*_c represents the proportionate change in per-SNP heritability associated with a 1 standard deviation increase in the value of the annotation.
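Both metrics reduce to simple arithmetic once the S-LDSC estimates are available. The sketch below assumes that enrichment is the proportion of heritability in the annotation divided by the (weighted) proportion of SNPs in the annotation, and that τ* is τ_c multiplied by the standard deviation of the annotation and divided by the average per-SNP heritability. The population standard deviation is used here as an assumption, and all inputs are toy values.

```python
import statistics


def enrichment(h2_annot, h2_total, annot_values):
    """Proportion of heritability divided by proportion of SNPs in the annotation."""
    m = len(annot_values)
    return (h2_annot / h2_total) / (sum(annot_values) / m)


def standardized_tau(tau_c, annot_values, h2_total):
    """tau_c scaled by the annotation s.d. and the average per-SNP heritability."""
    m = len(annot_values)
    sd_c = statistics.pstdev(annot_values)  # population s.d. (assumption)
    return tau_c * sd_c / (h2_total / m)


# Toy example: 10 SNPs, 2 of which are in a binary annotation.
values = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
e_c = enrichment(h2_annot=0.1, h2_total=0.4, annot_values=values)
tau_star = standardized_tau(tau_c=0.01, annot_values=values, h2_total=0.4)
```

Here 20% of SNPs explain 25% of heritability, giving an enrichment of 1.25, and the toy τ_c corresponds to a τ* of 0.1.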

Combined τ*

We use the combined τ* metric of ref.¹⁹, which quantifies the conditional informativeness of a heritability model and generalizes the combined τ* metric of ref.⁴⁹ to more than two annotations. In detail, given a joint model defined by M annotations (conditional on a published set of annotations such as the baseline-LD model), we define

$$\tau^*_{comb} = \sqrt{\sum_{m=1}^{M} \tau^{*2}_m + \sum_{m \neq l} r_{ml} \, \tau^*_m \tau^*_l}.$$

Here r_ml is the pairwise correlation of the annotations m and l, and r_ml τ*_m τ*_l is expected to be positive since two positively correlated annotations typically have the same direction of effect (resp. two negatively correlated annotations typically have opposite directions of effect). We calculate standard errors for τ*_comb using a genomic block-jackknife with 200 blocks.
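A minimal Python sketch of the combined metric follows. It assumes that the combined τ* is the square root of the sum of squared τ* values plus the cross terms r_ml τ*_m τ*_l over all pairs m ≠ l, and it uses the standard delete-one-block jackknife formula for the standard error. Function names and input values are illustrative.

```python
import math


def combined_tau_star(tau_star, corr):
    """sqrt( sum_m tau_m^2 + sum_{m != l} r_ml * tau_m * tau_l )."""
    total = 0.0
    for m, t_m in enumerate(tau_star):
        for l, t_l in enumerate(tau_star):
            r_ml = 1.0 if m == l else corr[m][l]
            total += r_ml * t_m * t_l
    return math.sqrt(total)


def jackknife_se(leave_one_block_out):
    """Standard error from delete-one-block jackknife estimates."""
    b = len(leave_one_block_out)
    mean = sum(leave_one_block_out) / b
    return math.sqrt((b - 1) / b * sum((x - mean) ** 2 for x in leave_one_block_out))


# Two annotations with tau* of 0.3 and 0.4 (toy values).
uncorrelated = combined_tau_star([0.3, 0.4], [[1.0, 0.0], [0.0, 1.0]])
correlated = combined_tau_star([0.3, 0.4], [[1.0, 0.5], [0.5, 1.0]])
```

With uncorrelated annotations the metric is the Euclidean norm of the τ* vector; positive correlation between annotations with the same direction of effect increases it.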

Evaluating heritability model fit using SumHer logl_SS

Given a heritability model (e.g. the baseline-LD model or the combined joint model of Figure 4), we define the Δlogl_SS of that heritability model as the logl_SS of that heritability model minus the logl_SS of a model with no functional annotations (baseline-LD-nofunct; 17 LD and MAF annotations from the baseline-LD model³⁴), where logl_SS³⁹ is an approximate likelihood metric that has been shown to be consistent with the exact likelihood from restricted maximum likelihood (REML). We compute p-values for Δlogl_SS using the asymptotic distribution of the Likelihood Ratio Test (LRT) statistic: −2 logl_SS follows a χ² distribution with degrees of freedom equal to the number of annotations in the focal model, so that −2Δlogl_SS follows a χ² distribution with degrees of freedom equal to the difference in number of annotations between the focal model and the baseline-LD-nofunct model. We used UK10K as the LD reference panel and analyzed 4,631,901 HRC (Haplotype Reference Consortium⁸¹) well-imputed SNPs with MAF ≥ 0.01 and INFO ≥ 0.99 in the reference panel. We removed SNPs in the MHC region, SNPs explaining > 1% of phenotypic variance and SNPs in LD with these SNPs.
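The p-value computation can be sketched with the Python standard library alone. The sketch takes the test statistic to be twice the magnitude of Δlogl_SS (this sign convention is an assumption) and evaluates the χ² upper tail through the recurrence for the regularized incomplete gamma function. The input values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from Q(1, h) = exp(-h) for even df, or Q(1/2, h) = erfc(sqrt(h)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return q


def lrt_p_value(delta_logl, n_annot_focal, n_annot_null):
    """P-value for a gain in log likelihood between nested models."""
    statistic = 2.0 * abs(delta_logl)  # sign convention assumed
    return chi2_sf(statistic, n_annot_focal - n_annot_null)


# Hypothetical example: a focal model with 2 more annotations gains 5 log-likelihood units.
p_value = lrt_p_value(delta_logl=5.0, n_annot_focal=19, n_annot_null=17)
```

For 2 degrees of freedom the tail probability has the closed form exp(−x/2), which gives a quick sanity check on the routine.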

Data Availability

All DeepBoost and Imperio annotations are available at https://alkesgroup.broadinstitute.org/LDSCORE/DeepLearning/Dey_DeepBoost_Imperio/. The deep learning allelic-effect SNP-level annotations for DeepSEA, Basenji, DeepBind and deltaSVM models are available at https://alkesgroup.broadinstitute.org/LDSCORE/DeepLearning/. This work used summary statistics from the UK Biobank study (http://www.ukbiobank.ac.uk/). The summary statistics for UK Biobank are available online (https://data.broadinstitute.org/alkesgroup/UKBB/). The 1000 Genomes Project Phase 3 data are available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502. The baseline-LD annotations are available at https://data.broadinstitute.org/alkesgroup/LDSCORE/. The SHAP visualizations of the top 100 features for each model are at https://alkesgroup.broadinstitute.org/LDSCORE/DeepLearning/Dey_DeepBoost_Imperio/ExtDataFigures.

Code Availability

The code for generating DeepBoost and Imperio annotations is available in the GitHub repository https://github.com/kkdey/Imperio. This work primarily uses the S-LDSC software (https://github.com/bulik/ldsc). We used publicly available software for DeepSEA (https://github.com/FunctionLab/ExPecto), Basenji (https://github.com/calico/basenji), DeepBind (http://tools.genes.toronto.edu/deepbind/) and deltaSVM (https://www.beerlab.org/deltasvm/) to generate annotations for these respective models.

Supplementary Tables

Table S1. List of all blood-related traits:

List of 11 blood-related traits (6 autoimmune diseases and 5 blood cell traits) analyzed in this paper.

Table S2. List of baseline models used in this paper:

We report the 6 baseline models or joint models discussed in this paper, along with number of annotations and a brief description.

Table S3. Additional annotations of baseline-LD-deep model:

List of 14 jointly significant annotations from ref.¹⁹ added to the baseline-LD model to create the baseline-LD-deep model. They include 1 non-tissue-specific Basenji allelic-effect annotation, 3 Roadmap annotations, 5 ChromHMM annotations and 5 other annotations.

Table S4. AUROC attained by DeepBoost.

We report the AUROC for a gradient boosting model distinguishing fine-mapped SNPs from matched control SNPs using allelic-effect annotations from the DeepSEA, Basenji, DeepBind and deltaSVM models as features. We consider four sets of fine-mapped SNPs - 8,741 fine-mapped autoimmune disease SNPs²¹ (Farh), 4,312 fine-mapped inflammatory bowel disease SNPs²² (Huang), 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits^{23, 46} (Weissbrod), or union of all 14,482 fine-mapped SNPs (Union).

Table S5. AUROC attained by logistic classification instead of XGBoost.

We report the AUROC for a logistic regression model distinguishing fine-mapped SNPs from matched control SNPs using allelic-effect annotations from the DeepSEA, Basenji, DeepBind and deltaSVM models as features. We consider four sets of fine-mapped SNPs - 8,741 fine-mapped autoimmune disease SNPs²¹ (Farh), 4,312 fine-mapped inflammatory bowel disease SNPs²² (Huang), 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits^{23, 46} (Weissbrod), or union of all 14,482 fine-mapped SNPs (Union).

Table S6. Top features underlying the logistic regression model.

We report the top 5 features for the logistic regression model (instead of the gradient boosting model) fitted on 8,741 fine-mapped autoimmune disease SNPs²¹ (Farh) corresponding to allelic-effect annotations from four different models: DeepSEA, Basenji, DeepBind and deltaSVM.

Table S7. AUROC attained by DeepBoost using 27 blood cell types only.

We report the AUROC for a gradient boosting model distinguishing fine-mapped SNPs from matched control SNPs using blood-specific allelic-effect annotations from the DeepSEA, Basenji and deltaSVM models as features. (The DeepBind model was not considered, as its features are non-tissue-specific.) We consider four sets of fine-mapped SNPs - 8,741 fine-mapped autoimmune disease SNPs²¹ (Farh), 4,312 fine-mapped inflammatory bowel disease SNPs²² (Huang), 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits^{23, 46} (Weissbrod), or union of all 14,482 fine-mapped SNPs (Union).

Table S8. S-LDSC results for published and boosted allelic-effect annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of 4 published allelic-effect annotations for 4 sequence-based models, DeepSEA, Basenji, DeepBind and deltaSVM (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published) and 16 boosted allelic-effect annotations for the same 4 deep learning models and 4 sets of fine-mapped SNPs for blood-related traits - 8,741 fine-mapped autoimmune disease SNPs²¹ (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted, deltaSVMΔ-boosted), 4,312 fine-mapped inflammatory bowel disease SNPs²² (DeepSEAΔ-boosted-Huang, BasenjiΔ-boosted-Huang, DeepBindΔ-boosted-Huang, deltaSVMΔ-boosted-Huang), 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits²³ (DeepSEAΔ-boosted-Weissbrod, BasenjiΔ-boosted-Weissbrod, DeepBindΔ-boosted-Weissbrod, deltaSVMΔ-boosted-Weissbrod), or union of these fine-mapped SNPs (DeepSEAΔ-boosted-Union, BasenjiΔ-boosted-Union, DeepBindΔ-boosted-Union, deltaSVMΔ-boosted-Union). Results are conditioned on 100 baseline-LD-deep annotations. Reports are meta-analyzed across 11 blood and autoimmune traits.

Table S9. Standardized enrichment of SNP annotations for published and boosted allelic-effect annotations:

Standardized enrichment of 4 published allelic-effect annotations for 4 sequence-based models, DeepSEA, Basenji, DeepBind and deltaSVM (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published) and 16 boosted allelic-effect deep-learning annotations for the same 4 deep learning models and 4 sets of fine-mapped SNPs for blood-related traits - 8,741 fine-mapped autoimmune disease SNPs²¹ (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted, deltaSVMΔ-boosted), 4,312 fine-mapped inflammatory bowel disease SNPs²² (DeepSEAΔ-boosted-Huang, BasenjiΔ-boosted-Huang, DeepBindΔ-boosted-Huang, deltaSVMΔ-boosted-Huang), 1,429 functionally fine-mapped SNPs for 14 blood-related UK Biobank traits²³ (DeepSEAΔ-boosted-Weissbrod, BasenjiΔ-boosted-Weissbrod, DeepBindΔ-boosted-Weissbrod, deltaSVMΔ-boosted-Weissbrod), or union of these fine-mapped SNPs (DeepSEAΔ-boosted-Union, BasenjiΔ-boosted-Union, DeepBindΔ-boosted-Union, deltaSVMΔ-boosted-Union). Results are conditioned on 100 baseline-LD-deep annotations. Reports are meta-analyzed across 11 blood and autoimmune traits.

Table S10. Additional annotations of baseline-LD-deep-S2G model:

List of 7 annotations corresponding to 7 S2G strategies linked to all genes from ref.³⁸ added to the baseline-LD-deep model to create the baseline-LD-deep-S2G model.

Table S11. S-LDSC results for published-restricted and boosted-restricted allelic-effect annotations restricted using S2G strategies, conditional on the baseline-LD-deep-S2G model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of SNP annotations corresponding to each of DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published, DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted annotations restricted using 10 S2G strategies conditional on 107 baseline-LD-deep-S2G annotations (100 baseline-LD-deep and 7 additional annotations from Table S10). Reports are meta-analyzed across 11 blood-related traits.

Table S12. Standardized enrichment of published-restricted and boosted-restricted allelic-effect annotations restricted using S2G strategies, conditional on the baseline-LD-deep-S2G model annotations:

Standardized enrichment of restricted SNP annotations corresponding to each of DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published, DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted annotations restricted using 10 S2G strategies conditional on 107 baseline-LD-deep-S2G annotations (100 baseline-LD-deep and 7 additional annotations from Table S10). Reports are meta-analyzed across 11 blood-related traits.

Table S13. S-LDSC results for published-restricted and boosted-restricted allelic-effect annotations restricted using S2G strategies, conditional on the baseline-LD-deep model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of 80 restricted SNP annotations corresponding to DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published, DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted annotations restricted using 10 S2G strategies. Results are conditional on 100 baseline-LD-deep annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S14. S-LDSC results for joint model of published-restricted and boosted-restricted deep learning allelic-effect annotations restricted using S2G strategies, conditional on the baseline-LD-deep-S2G model annotations.

Standardized Effect sizes (τ*) and Enrichment (E) of the significant SNP annotations in a joint model comprising the marginally significant published-restricted and boosted-restricted SNP annotations corresponding to published and boosted deep learning allelic-effect annotations combined with S2G strategies. Results are conditional on 107 baseline-LD-deep-S2G model annotations (100 baseline-LD-deep and 7 additional annotations from Table S10). Results are meta-analyzed across 11 blood-related traits.

Table S15. Additional annotations of baseline-LD-deep-S2G-geneset model:

List of 8 jointly significant gene set-specific S2G annotations from ref.³⁸ added to the baseline-LD-deep-S2G model to create the baseline-LD-deep-S2G-geneset model. They include 7 annotations from the Enhancer-driven+PPI-enhancer joint model in ref.³⁸ and 1 jointly significant pLI S2G annotation.

Table S16. S-LDSC results for gene set-specific boosted-restricted annotations, conditional on baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of 80 restricted SNP annotations corresponding to 4 allelic-effect annotations (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted), 2 gene scores (PPI-enhancer and pLI) and 10 S2G strategies, conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S17. Standardized enrichment of gene set-specific boosted-restricted annotations, conditional on baseline-LD-deep-S2G-geneset model annotations:

Standardized enrichment of 80 restricted SNP annotations corresponding to 4 allelic-effect annotations (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted), 2 gene scores (PPI-enhancer and pLI) and 10 S2G strategies, conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S18. S-LDSC results for joint model of gene set-specific boosted-restricted annotations, conditional on baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of the Bonferroni significant boosted-restricted S2G annotations linked to PPI-enhancer genes that were marginally significant in Figure 2. The results were conditioned either on 115 baseline-LD-deep-S2G-geneset annotations, or baseline-LD-deep-S2G-geneset plus 3 annotations from Figure 1, or baseline-LD-deep-S2G-geneset plus 3 annotations from Figure 1 plus BasenjiΔ-boosted×Roadmap. Reports are meta-analyzed across 11 blood-related traits.

Table S19. S-LDSC results for boosted-restricted deep learning allelic-effect annotations restricted using S2G strategies, conditional on the baseline-LD-deep-S2G model annotations plus the local GC-content annotation and annotations restricted using the local GC-content annotation:

Standardized Effect sizes (τ*) and Enrichment (E) of restricted SNP annotations corresponding to each of DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted annotations restricted using the local GC-content and 10 S2G strategies conditional on 100 baseline-LD-deep annotations and unrestricted S2G annotations and S2G annotations restricted using local GC-content annotation. Reports are meta-analyzed across 11 blood-related traits.

Table S20. S-LDSC results for annotations generated using the logistic regression model:

Standardized Effect sizes (τ*) and Enrichment (E) of SNP annotations generated by training fine-mapped SNPs on features from DeepSEA, Basenji, DeepBind and deltaSVM approaches using the logistic regression model (instead of the gradient boosting model). Results are conditional on 100 baseline-LD-deep annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S21. S-LDSC results for boosted-restricted deep learning allelic-effect annotations generated using the logistic regression model:

Standardized Effect sizes (τ*) and Enrichment (E) of restricted SNP annotations (across 10 S2G strategies) corresponding to annotations generated by training fine-mapped SNPs on features from DeepSEA, Basenji, DeepBind and deltaSVM approaches using the logistic regression model (instead of the gradient boosting model). Results are conditional on 107 baseline-LD-deep-S2G annotations (100 baseline-LD-deep and 7 additional annotations from Table S10). Results are meta-analyzed across 11 blood-related traits.

Table S22. S-LDSC results for gene set-specific published-restricted annotations, conditional on baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of 80 published-restricted SNP annotations corresponding to the 4 models (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published, deltaSVMΔ-published) for which we observed significant enrichment signal for the published allelic-effect annotations, 2 gene scores (PPI-enhancer and pLI) and 10 S2G strategies, conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S23. S-LDSC results for gene set-specific published-restricted annotations, conditional on baseline-LD-deep-S2G-geneset model annotations plus the 2 jointly significant gene set-specific boosted-restricted annotations from Figure 2 Panel C:

Standardized Effect sizes (τ*) and Enrichment (E) of 80 restricted SNP annotations corresponding to 4 published allelic-effect annotations (DeepSEAΔ-published, BasenjiΔ-published, DeepBindΔ-published and deltaSVMΔ-published) for which we observed significant enrichment signal for the published allelic-effect annotations, 2 gene scores (PPI-enhancer and pLI) and 10 S2G strategies, conditional on 115 baseline-LD-deep-S2G-geneset annotations and 2 jointly significant annotations from Figure 2 Panel C. Reports are meta-analyzed across 11 blood-related traits.

Table S24. Top significant features of Imperio-DeepSEA and Imperio-Basenji models:

We report the top 10 chromatin marks (if significant) based on their magnitude of effect size. We consider 4 different models - the Imperio model fitted and evaluated on all genes for DeepSEA (Imperio-DeepSEA) and Basenji (Imperio-Basenji) and the Imperio model fitted and evaluated on PPI-enhancer genes for DeepSEA (Imperio-DeepSEA (PPI-enhancer)) and Basenji (Imperio-Basenji (PPI-enhancer)).

Table S25. Comparison of 5 Imperio models utilizing a single S2G strategy with respect to using all 5 S2G strategies:

We perform the Imperio prediction model for DeepSEA and Basenji features corresponding to only one of the 5 S2G strategies, and compare the resulting model fit with that of the full Imperio model corresponding to all 5 S2G strategies. We use two measures of model fit - the r² metric and the correlation of predicted expression (corr.pred) with original expression on the genes of a holdout chromosome (chr8).

Table S26. Proportion of cis-heritability captured by Imperio and ExPecto predictions of gene expression across individuals:

Results are averaged across all 22,020 genes for the 2 Imperio models (Imperio-DeepSEA and Imperio-Basenji) and the ExPecto-DeepSEA model.

Table S27. Correlation of boosted-restricted deep learning allelic-effect annotations restricted using S2G strategies with Imperio annotations:

We report the correlation of boosted-restricted deep learning allelic-effect annotations restricted using S2G strategies with Imperio annotations for DeepSEA and Basenji deep learning models.

Table S28. S-LDSC results for Imperio and ExPecto annotations, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of Imperio annotations for DeepSEA (Imperio-DeepSEA) and Basenji (Imperio-Basenji) along with a similarly defined ExPecto (ExPecto-DeepSEA) annotation. Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S29. Standardized enrichment results for Imperio and ExPecto annotations conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Enrichment of Imperio annotations for DeepSEA (Imperio-DeepSEA) and Basenji (Imperio-Basenji) along with a similarly defined ExPecto (ExPecto-DeepSEA) annotation. Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S30. S-LDSC results for gene-set specific Imperio annotations, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of gene set-specific Imperio annotations corresponding to 2 deep learning models (DeepSEA and Basenji) and 2 gene sets (pLI and PPI-enhancer). Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S31. Standardized enrichment results for gene-set specific Imperio annotations conditional on the baseline-LD-deep-S2G-geneset model:

Standardized Enrichment of gene set-specific Imperio annotations corresponding to 2 deep learning models (DeepSEA and Basenji) and 2 gene sets (pLI and PPI-enhancer). Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S32. S-LDSC results for joint model of gene set-specific Imperio annotations conditional on the baseline-LD-deep-S2G-geneset model annotations:

Joint Standardized Effect sizes (τ*) and Enrichment (E) of the 2 marginally significant gene-set specific Imperio annotations, Imperio-DeepSEA (PPI-enhancer) and Imperio-Basenji (PPI-enhancer). Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S33. S-LDSC results for gene-set specific Imperio annotations, conditional on baseline-LD-deep-S2G-geneset model annotations plus 1 Imperio-Basenji annotation from Figure 3B:

Standardized Effect sizes (τ*) and Enrichment (E) of gene-set specific Imperio annotations corresponding to 2 deep learning models (DeepSEA and Basenji) and 2 gene sets (pLI and PPI-enhancer). Results were conditional on 115 baseline-LD-deep-S2G-geneset model annotations and 1 Imperio-Basenji annotation from Figure 3B. Reports are meta-analyzed across 11 blood-related traits.

Table S34. S-LDSC results for Imperio+ExPecto annotations, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of the Δ_s SNP-level annotations computed using a combination of both ExPecto⁴ and Imperio features for DeepSEA and Basenji models. Results were conditional either on 115 baseline-LD-deep-S2G-geneset annotations or baseline-LD-deep-S2G-geneset plus 1 Imperio-Basenji annotation from Figure 3B. Reports are meta-analyzed across 11 blood-related traits.

Table S35. S-LDSC results for partially restricted gene set-specific Imperio annotations defined by restricting only the fitting of feature weights, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of the intermediate Imperio (Imperio-int-1) annotations computed by using all genes for fitting the model and gene sets for computing the expression allelic effects for 2 deep learning models (DeepSEA and Basenji) and 2 gene sets (pLI and PPI-enhancer). Results were conditional either on 115 baseline-LD-deep-S2G-geneset annotations or baseline-LD-deep-S2G-geneset plus 2 significant Imperio annotations from Figure 3. Reports are meta-analyzed across 11 blood-related traits.

Table S36. S-LDSC results for partially restricted gene set-specific Imperio annotations defined by restricting only the gene expression predictions, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of the intermediate Imperio (Imperio-int-2) annotations computed by using genes in a gene set for fitting the model and all genes for computing the expression allelic effects for 2 deep learning models (DeepSEA and Basenji) and 2 gene sets (pLI and PPI-enhancer). Results were conditional either on 115 baseline-LD-deep-S2G-geneset annotations or baseline-LD-deep-S2G-geneset plus 2 significant Imperio annotations from Figure 3. Reports are meta-analyzed across 11 blood-related traits.

Table S37. S-LDSC results for annotations defined by the total number of genes linked to each SNP by each S2G strategy, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of the number of genes (N_sd) linked to each SNP s by the S2G strategy d. The number of genes was thresholded at 5 and annotations were standardized to probabilistic scale. Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S38. S-LDSC results for annotations defined by the number of PPI-enhancer genes linked to each SNP by each S2G strategy, conditional on the baseline-LD-deep-S2G-geneset model annotations:

Standardized Effect sizes (τ*) and Enrichment (E) of the number of PPI-enhancer genes linked to each SNP s by the S2G strategy d. The number of genes was thresholded at 5 and annotations were standardized to probabilistic scale. Results were conditional on 115 baseline-LD-deep-S2G-geneset annotations. Reports are meta-analyzed across 11 blood-related traits.

Table S39. S-LDSC results for Imperio annotations defined using the maximum across genes proximal to the annotated SNPs (instead of the sum), conditional on the baseline-LD-deep-S2G-geneset model annotations plus the two significant annotations from Figure 3B,D:

Standardized Effect sizes (τ*) and Enrichment (E) of the SNP-level annotations computed using the maximum across genes proximal to the annotated SNPs (instead of the sum), conditional on the baseline-LD-deep-S2G-geneset model annotations plus the two significant annotations from Figure 3B,D. Reports are meta-analyzed across 11 blood-related traits.

Table S40. S-LDSC results for Whole blood MaxCPP annotations conditional on different baseline models:

Standardized effect sizes (τ*) and enrichment (E) of Whole blood MaxCPP (MaxCPP) annotations. Results were conditional on either 115 baseline-LD-deep-geneset, or 107 baseline-LD-deep-S2G, or 100 baseline-LD-deep annotations. Results are meta-analyzed across 11 blood-related traits.


Table S41. S-LDSC results for combined joint model:

Standardized effect sizes (τ*) and enrichment (E) in a joint model comprising significant SNP annotations from Figure 1, Figure 2 and Figure 3. Only results for the 3 jointly Bonferroni-significant annotations are reported. The results were conditioned on 115 baseline-LD-deep-S2G-geneset annotations. Results are meta-analyzed across 11 blood-related traits.


Table S42. Δlogl_SS results for the combined joint model and other heritability models.

We report Δlogl_SS derived from the logl_SS metric, proposed in ref.³⁹, for the different heritability models studied in this paper: baseline-LD, baseline-LD-deep, baseline-LD-deep-S2G, baseline-LD-S2G-geneset and the combined joint model in Figure 4 (Table S2). We compute Δlogl_SS as the difference in logl_SS for each model with respect to a baselineLD-no-funct model with 17 annotations that include no functional annotations^{37, 39}. We also report the percentage increase in Δlogl_SS for each model over the baseline-LD model. We do not report AIC because the numbers of annotations do not differ enough across models to alter conclusions based on logl_SS alone. We report three summary Δlogl_SS results: one averaged across 30 UK Biobank traits³⁷ (All), one averaged across 6 blood-related traits from UK Biobank (Blood) and one averaged across the other 24 non-blood-related traits from UK Biobank (Non-blood) (Table S43).
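The arithmetic described above can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary layout and the numbers are hypothetical placeholders (not values from Table S42), and "percentage increase" is assumed to mean 100 × (Δ_model − Δ_baseline-LD) / Δ_baseline-LD.

```python
def delta_logl_ss(logl_ss, reference="baselineLD-no-funct", baseline="baseline-LD"):
    """Compute Δlogl_SS per model and its percentage increase over the baseline.

    logl_ss: dict mapping model name -> logl_SS value (hypothetical layout).
    reference: model whose logl_SS is subtracted from every other model.
    baseline: model against which the percentage increase is expressed.
    """
    ref_value = logl_ss[reference]
    # Δlogl_SS: difference of each model's logl_SS from the reference model.
    delta = {m: v - ref_value for m, v in logl_ss.items() if m != reference}
    base_delta = delta[baseline]
    # Percentage increase in Δlogl_SS relative to the baseline model (assumed definition).
    pct_increase = {m: 100.0 * (d - base_delta) / base_delta for m, d in delta.items()}
    return delta, pct_increase


# Placeholder logl_SS values, invented purely to show the arithmetic.
placeholder = {
    "baselineLD-no-funct": 100.0,
    "baseline-LD": 150.0,
    "combined joint model": 160.0,
}
delta, pct = delta_logl_ss(placeholder)
```

In this toy input the baseline-LD model has Δlogl_SS = 50 and the combined joint model has Δlogl_SS = 60, which is a 20% increase over baseline-LD.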


Table S43. List of UK Biobank traits used for logl_SS calculations.

The list consists of 6 blood-related traits and 24 non-blood-related traits.

Supplementary Figures

Figure S1. Illustration of the DeepBoost model:

(A) An overview of a sequence-based genomic deep learning model, such as DeepSEA or Basenji, that trains on sequence images for a region and the chromatin features around that region using a deep Convolutional Neural Net (CNN) model. (B) Illustration of how the allelic-effect annotation for a particular feature f is computed at a SNP s. The number of features f is 2,002 for the DeepSEA model, 4,229 for the Basenji model, 927 for the DeepBind model and 1,329 for the deltaSVM model used. The length of the vertical arrow at the SNP site denotes the magnitude of the allelic effect and its direction represents the sign (up and down for positive and negative allelic effect, respectively). (C) Illustration of the DeepBoost classification model, in which we distinguish the positive set of fine-mapped SNPs from the negative set of matched controls using the allelic-effect features.

Figure S2. Correlations between published and boosted allelic-effect annotations.

Correlation matrix of boosted and published allelic-effect annotations for DeepSEA, Basenji, DeepBind and deltaSVM models. We observed mildly positive correlations between published and boosted annotations for the same model (average r=0.16), and we observed weakly positive correlations across all annotations (r=0.10).
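A correlation matrix of this kind can be sketched as follows. This is a minimal illustration assuming Pearson correlation computed across SNPs; the function names and the toy annotation vectors are hypothetical and do not reproduce the values reported in the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(annotations):
    """Pairwise Pearson correlations between named annotation vectors.

    annotations: dict mapping annotation name -> per-SNP values, with all
    vectors listed over the same SNPs in the same order (hypothetical layout).
    """
    names = list(annotations)
    return {a: {b: pearson(annotations[a], annotations[b]) for b in names} for a in names}


# Toy per-SNP values, for illustration only.
toy = {
    "published": [0.0, 1.0, 2.0, 3.0],
    "boosted": [0.0, 2.0, 4.0, 6.0],
    "reversed": [3.0, 2.0, 1.0, 0.0],
}
corr = correlation_matrix(toy)
```

Each diagonal entry equals 1, and the matrix is symmetric because the Pearson correlation does not depend on the order of its two arguments.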

Figure S3. Feature importance of boosted annotations for DeepBoost using the DeepSEA model.

We applied SHAP⁴² to assess which deep learning features were most important for the prediction of boosted annotations using the DeepSEA model (Methods). We report the top 20 features with signed SHAP scores, ordered from top to bottom based on importance as in ref.³⁷.

Figure S4. Feature importance of boosted annotations for DeepBoost using the Basenji model.

We applied SHAP⁴² to assess which deep learning features were most important for the prediction of boosted annotations using the Basenji model (Methods). We report the top 20 features with signed SHAP scores, ordered from top to bottom based on importance as in ref.³⁷.

Figure S5. Feature importance of boosted annotations for DeepBoost using the DeepBind model.

We applied SHAP⁴² to assess which deep learning features were most important for the prediction of boosted annotations using the DeepBind method (Methods). We report the top 20 features with signed SHAP scores, ordered from top to bottom based on importance as in ref.³⁷.

Figure S6. Feature importance of boosted annotations for DeepBoost using the deltaSVM model.

We applied SHAP⁴² to assess which deep learning features were most important for the prediction of boosted annotations using the deltaSVM method (Methods). We report the top 20 features with signed SHAP scores, ordered from top to bottom based on importance as in ref.³⁷. TF E3 559: eGFP-ZNF507 ChIP-seq on human K562 genetically modified using stable transfection; TF E3 290: lung embryo (67 days); TF E3 540: ARID1B ChIP-seq on human K562; TF E3 110: NT2/D1; TF E3 765: EGR1 ChIP-seq on human K562; TF E3 372: HAIB ChIP TAF1 in MCF-7; TF E3 781: LEF1 ChIP-seq on human K562; TF E3 954: MNT ChIP-seq on human MCF-7; TF E3 802: CBFA2T3 ChIP-seq on human K562; TF E3 704: L3MBTL2 ChIP-seq on human K562; TF E3 15: NFE2L2 ChIP-seq on human A549; TF E3 181: eGFP-ZNF394 ChIP-seq on human HEK293 genetically modified using site-specific recombination originated from HEK293; TF E3 389: HNRNPLL ChIP-seq on human HepG2; TF E3 829: eGFP-HINFP ChIP-seq on human K562 genetically modified using stable transfection; TF E3 330: CEBPB ChIP-seq protocol v042211.1 on human K562; TF E3 226: eGFP-ZSCAN4 ChIP-seq on human HEK293 genetically modified using site-specific recombination originated from HEK293; TF E3 752: HNRNPL ChIP-seq on human K562; TF E3 872: CUX1 ChIP-seq on human MCF-7.

Figure S7. Standardized enrichment of SNP annotations for published and boosted deep learning allelic-effect annotations.

Barplot representing standardized enrichment metric, as proposed in ref.⁸⁶, for (A) 4 published DeepSEA, Basenji, DeepBind and deltaSVM allelic-effect annotations and (B) 16 boosted annotations for DeepSEA, Basenji, DeepBind and deltaSVM models, using 3 sets of fine-mapped SNPs and their union. All results are conditional on the baseline-LD-deep model annotations.

Figure S8. Correlation between S2G annotations.

Correlation matrix of S2G annotations derived from all 10 SNP-to-gene (S2G) linking strategies (Table 1), as defined by the sets of SNPs linked to all genes.

Figure S9. Correlation between boosted allelic-effect annotations and S2G annotations.

Correlation matrix of 4 boosted allelic effect annotations, DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted, and 10 S2G annotations. The correlations range from weakly positive to moderately positive.

Figure S10. S-LDSC results for joint model of published-restricted and boosted-restricted deep learning allelic-effect annotations restricted using S2G strategies, conditional on the baseline-LD-deep-S2G model annotations.

Standardized effect size (τ*) conditional on baseline-LD-deep-S2G and other significant restricted S2G annotations (right column, shading) compared to the effect size from Figure 1 Panel B right panel (left column, white). Results are meta-analyzed across 11 blood-related traits. ** denotes P < 0.05/174. Error bars denote 95% confidence intervals. Numerical results are reported in Table S14.

Figure S11. Standardized enrichment of gene set-specific boosted-restricted annotations, conditional on baseline-LD-deep-S2G-geneset model annotations.

Standardized enrichment metric, as proposed in ref.⁸⁶, for 80 SNP annotations corresponding to 2 gene scores (PPI-enhancer³⁸, pLI³³) with 10 S2G annotations prioritized by 4 boosted allelic-effect annotations (DeepSEAΔ-boosted, BasenjiΔ-boosted, DeepBindΔ-boosted and deltaSVMΔ-boosted). Results are only shown for those allelic-effect models and S2G strategies that show Bonferroni significance. ** denotes P < 0.05/174. Error bars denote 95% confidence intervals. Numerical results are reported in Table S17.

Figure S12. Illustration of the Imperio model:

A schematic representation of the different S2G strategies used in the Imperio model: (A) 100kb, (B) 5kb, (C) TSS and (D) ABC or Roadmap. (E) Illustration of how the deep learning variant-level or allelic-effect annotations are combined with these S2G strategies to generate the features that are used as predictors in a regression model with GTEx Whole blood expression (log CPM) as the response.

Figure S13. Accuracy of Imperio in predicting gene expression across genes on chromosome 8.

For both Imperio-DeepSEA and Imperio-Basenji, we plot predicted expression vs. observed log RPKM expression, for 990 genes on chromosome 8.

Figure S14. Correlation in predicted expression between Imperio and ExPecto models.

Correlation in predicted expression for 990 chr8 genes used as held-out test set for the ExPecto method⁴ and the two Imperio models corresponding to DeepSEA and Basenji deep learning models.

Figure S15. Comparison of Imperio prediction r² for predictions of gene expression across individuals vs. cis-heritability.

We plot the prediction r² for the inter-individual model comprising the Imperio predicted expression effects (see Methods) against the per-gene cis-heritability in Whole blood, as estimated from transcriptome-wide association studies⁵², for two deep learning models: DeepSEA (Panel A) and Basenji (Panel B).

Figure S16. Correlation between all-genes ExPecto, all-genes Imperio, and gene set-specific Imperio annotations.

Correlation matrix of Whole blood MaxCPP²⁷, Whole blood ExPecto²⁷, and 6 Imperio annotations corresponding to 2 deep learning models (DeepSEA and Basenji) and three sets of genes (all genes, pLI genes and PPI-enhancer genes). The correlations range from weakly positive to moderately high positive values.

Acknowledgments

We thank David Kelley, Steven Gazal, Alexander Gusev and Armin Schoech for helpful discussions. This research was funded by NIH grants U01 HG009379, R01 MH101244, R37 MH107649, R01 MH115676 and R01 MH109978. S.S.Kim was supported by NIH award F31HG010818. This research was conducted using the UK Biobank Resource under application 16549.

Footnotes

Following reviewer response, we have expanded the set of models from 2 deep learning models (DeepSEA and Basenji) to 4 deep learning/machine learning-based sequence models (DeepSEA, Basenji, DeepBind, deltaSVM). We have also updated the text to clarify the comparisons across methods and the features underlying the performance of these methods in greater detail.
https://github.com/kkdey/Imperio
https://alkesgroup.broadinstitute.org/LDSCORE/DeepLearning/Dey_DeepBoost_Imperio/

References

1. B. Alipanahi, A. Delong, M.T. Weirauch, and B.J. Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature biotechnology, 33:831–838, 2015.
2. J. Zhou and O.G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12:931–934, 2015.
3. D.R. Kelley, J. Snoek, and J.L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research, 26:990–999, 2016.
4. J. Zhou, C.L. Theesfeld, K. Yao, K.M. Chen, A.K. Wong, and O.G. Troyanskaya. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature genetics, 50:1171–1179, 2018.
5. D.R. Kelley, Y.A. Reshef, M. Bileschi, D. Belanger, C.Y. McLean, and J. Snoek. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome research, 28:739–750, 2018.
6. J. Zou et al. A primer on deep learning in genomics. Nature genetics, 51:12–18, 2019.
7. G. Eraslan et al. Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20:389–403, 2019.
8. J. Zhou et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nature genetics, 51(6):973–980, 2019.
9. D. Lee, D.U. Gorkin, M. Baker, B.J. Strober, A.L. Asoni, A.S. McCallion, and M.A. Beer. A method to predict the impact of regulatory variants from DNA sequence. Nature genetics, 47:955–961, 2015.
10. M. Ghandi, D. Lee, M. Mohammad-Noori, and M.A. Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10:e1003711, 2014.
11. M. Ghandi, M. Mohammad-Noori, N. Gharegani, D. Lee, L. Garraway, and M.A. Beer. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics, 32:2205–2207, 2016.
12. L.A. Hindorff et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, 2009.
13. M.T. Maurano et al. Systematic localization of common disease-associated variation in regulatory DNA. Science, 337(6099):1190–1195, 2012.
14. G. Trynka et al. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature genetics, 45(2):124–130, 2013.
15. J.K. Pickrell. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4):559–573, 2014.
16. H.K. Finucane, B. Bulik-Sullivan, A. Gusev, G. Trynka, Y. Reshef, P.R. Loh, V. Anttila, H. Xu, C. Zang, K. Farh, and S. Ripke. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature genetics, 47:1228–1235, 2015.
17. A.L. Price, C.C. Spencer, and P. Donnelly. Progress and promise in understanding the genetic basis of common diseases. Proceedings of the Royal Society B: Biological Sciences, 282(1821):20151684, 2015.
18. P.M. Visscher et al. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
19. K.K. Dey et al. Evaluating the informativeness of deep learning annotations for human complex diseases. Nature communications, 11:1–9, 2020.
20. T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM:785–794, 2016.
21. K.K.H. Farh et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature, 518:337–343, 2015.
22. H. Huang et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature, 547:173–178, 2017.
23. O. Weissbrod et al. Functionally-informed fine-mapping and polygenic localization of complex trait heritability. Nature Genetics, 52(12):1355–1363, 2020.
24. J. Nasser et al. Genome-wide enhancer maps link risk variants to disease genes. Nature, 593(7858):238–243, 2021.
25. C.P. Fulco et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature Genetics, 51:1664–1669, 2019.
26. H. Yoshida et al. The cis-Regulatory Atlas of the Mouse Immune System. Cell, 176:897–912, 2019.
27. F. Hormozdiari et al. Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nature genetics, 50(7):1041–1047, 2018.
28. Y. Liu et al. Evidence of a recombination rate valley in human regulatory domains. Genome Biology, 18:193, 2017.
29. B.M. Javierre et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell, 167:1369–1384, 2016.
30. M.M. Hoffman, J. Ernst, S.P. Wilder, A. Kundaje, R.S. Harris, and M. Libbrecht. Integrative annotation of chromatin elements from ENCODE data. Nucleic acids research, 41:827–841, 2012.
31. M.M. Hoffman, O.J. Buske, J. Wang, Z. Weng, J.A. Bilmes, and W.S. Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods, 9:473–476, 2012.
32. J. Ernst et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43–49, 2011.
33. M. Lek, K.J. Karczewski, E.V. Minikel, K.E. Samocha, E. Banks, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536:285–291, 2016.
34. S. Gazal et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nature genetics, 49(10):1421–1427, 2017.
35. S. Gazal, C. Marquez-Luna, H.K. Finucane, and A.L. Price. Reconciling S-LDSC and LDAK models and functional enrichment estimates. Nature genetics, 51(8):1202–1204, 2019.
36. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.
37. S.S. Kim et al. Improving the informativeness of Mendelian disease pathogenicity scores for common disease. bioRxiv, 2020.
38. K.K. Dey et al. Unique contribution of enhancer-driven and master-regulator genes to autoimmune disease revealed using functionally informed SNP-to-gene strategies. bioRxiv, 2020.
39. D. Speed, J. Holmes, and D.J. Balding. Evaluating and improving heritability models using summary statistics. Nature Genetics, 52:458–462, 2020.
40. D. Shigaki et al. Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay. Human mutation, 40(9):1280–1291, 2019.
41. A. Kundaje et al. Integrative analysis of 111 reference human epigenomes. Nature, 518:317–330, 2015.
42. S.M. Lundberg and S.I. Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
43. D.F. Read et al. Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features. PLoS computational biology, 15(9):e1007329, 2019.
44. M. Yap et al. Verifying explainability of a deep learning tissue classifier trained on RNA-seq data. Scientific reports, 11(1):1–12, 2021.
45. L. Liu et al. HoxA13 regulates phenotype regionalization of human pregnant myometrium. The Journal of Clinical Endocrinology and Metabolism, 100(12):E1512–E1522, 2015.
46. C. Bycroft et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203–209, 2018.
47. F.I. Hormozdiari et al. Functional disease architectures reveal unique biological role of transposable elements. Nature Communications, 10(1):4054, 2019.
48. GTEx Consortium. Genetic effects on gene expression across human tissues. Nature, 550(7675):204–213, 2017.
49. B. van de Geijn, H. Finucane, S. Gazal, F. Hormozdiari, T. Amariuta, and X. Liu. Annotations capturing cell-type-specific TF binding explain a large fraction of disease heritability. Human Molecular Genetics, 29:1057–1067, 2020.
50. O. Wagih et al. Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors. bioRxiv, 11(1):253427, 2018.
51. E.R. Gamazon et al. A gene-based association method for mapping traits using reference transcriptome data. Nature genetics, 47:1091–1098, 2015.
52. A. Gusev et al. Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48:245–252, 2016.
53. M. Wainberg et al. Opportunities and challenges for transcriptome-wide association studies. Nature genetics, 51:592–599, 2019.
54. D.W. Yao et al. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nature Genetics, 52:626–633, 2020.
55. G. Kichaev et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS genetics, 10(10):e1004722, 2014.
56. W. Chen et al. Incorporating functional annotations for fine-mapping causal variants in a Bayesian framework using summary statistics. Genetics, 204(3):933–958, 2016.
57. G. Kichaev et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics, 33(2):248–255, 2017.
58. J.P. Ray et al. Prioritizing disease and trait causal variants at the TNFAIP3 locus using functional and genomic features. Nature communications, 11(1):1–13, 2020.
59. Y. Hu et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS computational biology, 13(6):e1005589, 2017.
60. C. Márquez-Luna et al. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genetic epidemiology, 41(8):811–823, 2017.
61. S. Gazal et al. Combining SNP-to-gene linking strategies to pinpoint disease genes and assess disease omnigenicity. medRxiv, page 2021.08.02.21261488, 2021.
62. K. Jaganathan et al. Predicting splicing from primary sequence with deep learning. Cell, 176(3):535–548, 2019.
63. L. Liu et al. Biological relevance of computationally predicted pathogenicity of noncoding variants. Nature Communications, 10:330, 2019.
64. D.R. Kelley. Cross-species regulatory sequence activity prediction. PLOS Computational Biology, 16(7):e1008050, 2020.
65. M. Lizio et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome biology, 16(1):22, 2015.
66. M. Lizio et al. Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals. Nucleic acids research, 45(D1):D737, 2017.
67. J. Ernst and M. Kellis. ChromHMM: automating chromatin-state discovery and characterization. Nature methods, 9:215–216, 2012.
68. J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
69. B. Caron, Y. Luo, and A. Rausell. NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans. Genome biology, 20:32, 2019.
70. W.J. Kent et al. The human genome browser at UCSC. Genome research, 12(6):996–1006, 2002.
71. D. Karolchik, A.S. Hinrichs, T.S. Furey, K.M. Roskin, C.W. Sugnet, et al. The UCSC Table Browser data retrieval tool. Nucleic acids research, 32:D493–D496, 2004.
72. H. Tong, C. Faloutsos, and J.Y. Pan. Random walk with restart: fast solutions and applications. Knowledge and Information Systems, 14:327–346, 2008.
73. D. Szklarczyk et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research, 43:D447–D452, 2014.
74. D. Szklarczyk et al. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research, 45(Database issue):D362–D368, 2017.
75. X. Wang and D.B. Goldstein. Enhancer Domains Predict Gene Pathogenicity and Inform Gene Discovery in Complex Disease. The American Journal of Human Genetics, 106:215–233, 2020.
76. H.K. Finucane, Y.A. Reshef, V. Anttila, K. Slowikowski, A. Gusev, A. Byrnes, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature genetics, 50:621–629, 2018.
77. A.P. Schoech et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nature communications, 10(1):1–10, 2019.
78. H.M. Amemiya, A. Kundaje, and A.P. Boyle. The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1):1–5, 2019.
79. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57–74, 2012.
80. Jan-Renier AJ Moonen et al. KLF4 Recruits SWI/SNF to Increase Chromatin Accessibility and Reprogram the Endothelial Enhancer Landscape under Laminar Shear Stress. bioRxiv, 2020.
81. S. McCarthy, S. Das, W. Kretzschmar, O. Delaneau, A.R. Wood, A. Teumer, H.M. Kang, C. Fuchsberger, P. Danecek, K. Sharp, and Y. Luo. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics, 48:1279–1283, 2016.
82. L. Jostins et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature, 491:119–124, 2012.
83. Y. Okada et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature, 506:376–381, 2014.
84. J. Bentham et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nature genetics, 47(12):1457–1464, 2015.
85. P.C. Dubois et al. Multiple common variants for celiac disease influencing immune gene expression. Nature genetics, 42(4):295–302, 2010.
86. S.S. Kim et al. Genes with high network connectivity are enriched for disease heritability. The American Journal of Human Genetics, 104:896–913, 2019.


Posted August 13, 2021.


Subject Area

Genetics


[1] 1.↵
B. Alipanahi, A. Delong, M.T. Weirauch, and B.J. Frey. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature biotechnology, 33:831–838, 2015.
OpenUrl CrossRef PubMed

[2] 2.↵
J. Zhou and O.G. Troyanskaya. Predicting effects of noncoding variants with deep learning–based sequence model. Nature methods, 12:931–934, 2015.
OpenUrl

[3] 3.
D.R. Kelley, J. Snoek, and J.L. Rinn. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome research, 26:990–999, 2016.
OpenUrl Abstract/FREE Full Text

[4] 4.↵
J. Zhou, C.L. Theesfeld, K. Yao, K.M. Chen, A.K. Wong, and O.G. Troyanskaya. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nature genetics, 50:1171–1179, 2018.
OpenUrl CrossRef

[5] 5.↵
D.R. Kelley, Y.A. Reshef, M. Bileschi, D. Belanger, C.Y. McLean, and J. Snoek. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome research, 28:739–750, 2018.
OpenUrl Abstract/FREE Full Text

[6] 6.
J. Zou et al. A primer on deep learning in genomics. Nature genetics, 51:12–18, 2019.
OpenUrl CrossRef

[7] 7.
G. Eraslan et al. Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20:389–403, 2019.
OpenUrl CrossRef

[8] 8.↵
J. Zhou et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nature genetics, 51(6):973–980, 2019.
OpenUrl CrossRef PubMed

[9] 9.↵
D. Lee, D.U. Gorkin, M. Baker, B.J. Strober, A.L. Asoni, A.S. McCallion, and M.A. Beer. A method to predict the impact of regulatory variants from DNA sequence. Nature genetics, 47:955–961, 2015.
OpenUrl CrossRef PubMed

[10] 10.↵
M. Ghandi, D. Lee, M. Mohammad-Noori, and M.A. Beer. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS computational biology, 10:e1003711, 2014.
OpenUrl

[11] 11.↵
M. Ghandi, M. Mohammad-Noori, N. Gharegani, D. Lee, L. Garraway, and M.A. Beer. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics, 32:2205–2207, 2016.
OpenUrl CrossRef PubMed

[12] 12.↵
L.A. Hindorff et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences, 106(23):9362–9367, 2009.
OpenUrl Abstract/FREE Full Text

[13] 13.
M.T. Maurano et al. Systematic localization of common disease-associated variation in regulatory dna. Science, 337(6099):1190–1195, 2012.
OpenUrl Abstract/FREE Full Text

[14] 14.
G. Trynka et al. Chromatin marks identify critical cell types for fine mapping complex trait variants. Nature genetics, 45(2):124–130, 2013.
OpenUrl CrossRef PubMed

[15] 15.
J.K. Pickrell. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4):559–573, 2014.
OpenUrl CrossRef PubMed

[16] 16.↵
H.K. Finucane, B. Bulik-Sullivan, A. Gusev, G. Trynka, Y. Reshef, P.R. Loh, V. Anttila, H. Xu, C. Zang, K. Farh, and S. Ripke. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature genetics, 47:1228–1235, 2015.
OpenUrl CrossRef PubMed

[17] 17.
A.L. Price, C.C. Spencer, and P. Donnelly. Progress and promise in understanding the genetic basis of common diseases. Proceedings of the Royal Society B: Biological Sciences, 282(1821):20151684, 2015.
OpenUrl CrossRef PubMed

[18] 18.↵
P.M. Visscher et al. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
OpenUrl CrossRef PubMed

[19] 19.↵
K.K. Dey et al. Evaluating the informativeness of deep learning annotations for human complex diseases. Nature communications, 11:1–9, 2020.
OpenUrl

[20] 20.↵
T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, ACM:785–794, 2016.

[21] 21.↵
K.K.H. Farh et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature, 518:337–343, 2015.
OpenUrl CrossRef PubMed

[22] 22.↵
H. Huang et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature, 547:173–178, 2017.
OpenUrl CrossRef PubMed

[23] 23.↵
O. Weissbrod et al. Functionally-informed fine-mapping and polygenic localization of complex trait heritability. Nature Genetics, 52(12):p.1355–1363, 2020.
OpenUrl

[24] 24.↵
J. Nasser et al. Genome-wide enhancer maps link risk variants to disease genes. Nature, 593(7858):238–243, 2021.
OpenUrl

[25] 25.↵
C.P. Fulco et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nature Genetics, 51:1664–1669, 2019.
OpenUrl CrossRef

[26] 26.↵
H. Yoshida et al. The cis-Regulatory Atlas of the Mouse Immune System. Cell, 176:897–912, 2019.
OpenUrl CrossRef

[27]
F. Hormozdiari et al. Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nature genetics, 50(7):1041–1047, 2018.

[28]
Y. Liu et al. Evidence of a recombination rate valley in human regulatory domains. Genome Biology, 18:193, 2017.

[29]
B.M. Javierre et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell, 167:1369–1384, 2016.

[30]
M.M. Hoffman, J. Ernst, S.P. Wilder, A. Kundaje, R.S. Harris, and M. Libbrecht. Integrative annotation of chromatin elements from ENCODE data. Nucleic acids research, 41:827–841, 2012.

[31]
M.M. Hoffman, O.J. Buske, J. Wang, Z. Weng, J.A. Bilmes, and W.S. Noble. Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nature methods, 9:473–476, 2012.

[32]
J. Ernst et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473(7345):43–49, 2011.

[33]
M. Lek, K.J. Karczewski, E.V. Minikel, K.E. Samocha, E. Banks, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536:285–291, 2016.

[34]
S. Gazal et al. Linkage disequilibrium–dependent architecture of human complex traits shows action of negative selection. Nature genetics, 49(10):1421–1427, 2017.

[35]
S. Gazal, C. Marquez-Luna, H.K. Finucane, and A.L. Price. Reconciling S-LDSC and LDAK models and functional enrichment estimates. Nature genetics, 51(8):1202–1204, 2019.

[36]
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68–74, 2015.

[37]
S.S. Kim et al. Improving the informativeness of Mendelian disease pathogenicity scores for common disease. bioRxiv, 2020.

[38]
K.K. Dey et al. Unique contribution of enhancer-driven and master-regulator genes to autoimmune disease revealed using functionally informed SNP-to-gene strategies. bioRxiv, 2020.

[39]
D. Speed, J. Holmes, and D.J. Balding. Evaluating and improving heritability models using summary statistics. Nature Genetics, 52:458–462, 2020.

[40]
D. Shigaki et al. Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay. Human mutation, 40(9):1280–1291, 2019.

[41]
A. Kundaje et al. Integrative analysis of 111 reference human epigenomes. Nature, 518:317–330, 2015.

[42]
S.M. Lundberg and S.I. Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.

[43]
D.F. Read et al. Predicting gene expression in the human malaria parasite Plasmodium falciparum using histone modification, nucleosome positioning, and 3D localization features. PLoS computational biology, 15(9):e1007329, 2019.

[44]
M. Yap et al. Verifying explainability of a deep learning tissue classifier trained on RNA-seq data. Scientific reports, 11(1):1–12, 2021.

[45]
L. Liu et al. HoxA13 regulates phenotype regionalization of human pregnant myometrium. The Journal of Clinical Endocrinology and Metabolism, 100(12):E1512–E1522, 2015.

[46]
C. Bycroft et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203–209, 2018.

[47]
F.I. Hormozdiari et al. Functional disease architectures reveal unique biological role of transposable elements. Nature Communications, 10(1):4054, 2019.

[48]
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature, 550(7675):204–213, 2017.

[49]
B. van de Geijn, H. Finucane, S. Gazal, F. Hormozdiari, T. Amariuta, and X. Liu. Annotations capturing cell-type-specific TF binding explain a large fraction of disease heritability. Human Molecular Genetics, 29:1057–1067, 2020.

[50]
O. Wagih et al. Allele-specific transcription factor binding as a benchmark for assessing variant impact predictors. bioRxiv, page 253427, 2018.

[51]
E.R. Gamazon et al. A gene-based association method for mapping traits using reference transcriptome data. Nature genetics, 47:1091–1098, 2015.

[52]
A. Gusev et al. Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48:245–252, 2016.

[53]
M. Wainberg et al. Opportunities and challenges for transcriptome-wide association studies. Nature genetics, 51:592–599, 2019.

[54]
D.W. Yao et al. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nature Genetics, 52:626–633, 2020.

[55]
G. Kichaev et al. Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS genetics, 10(10):e1004722, 2014.

[56]
W. Chen et al. Incorporating functional annotations for fine-mapping causal variants in a Bayesian framework using summary statistics. Genetics, 204(3):933–958, 2016.

[57]
G. Kichaev et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics, 33(2):248–255, 2017.

[58]
J.P. Ray et al. Prioritizing disease and trait causal variants at the TNFAIP3 locus using functional and genomic features. Nature communications, 11(1):1–13, 2020.

[59]
Y. Hu et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLoS computational biology, 13(6):e1005589, 2017.

[60]
C. Márquez-Luna et al. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genetic epidemiology, 41(8):811–823, 2017.

[61]
S. Gazal et al. Combining SNP-to-gene linking strategies to pinpoint disease genes and assess disease omnigenicity. medRxiv, page 2021.08.02.21261488, 2021.

[62]
K. Jaganathan et al. Predicting splicing from primary sequence with deep learning. Cell, 176(3):535–548, 2019.

[63]
L. Liu et al. Biological relevance of computationally predicted pathogenicity of noncoding variants. Nature Communications, 10:330, 2019.

[64]
D.R. Kelley. Cross-species regulatory sequence activity prediction. PLOS Computational Biology, 16(7):e1008050, 2020.

[65]
M. Lizio et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome biology, 16(1):22, 2015.

[66]
M. Lizio et al. Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals. Nucleic acids research, 45(D1):D737, 2017.

[67]
J. Ernst and M. Kellis. ChromHMM: automating chromatin-state discovery and characterization. Nature methods, 9:215–216, 2012.

[68]
J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

[69]
B. Caron, Y. Luo, and A. Rausell. NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans. Genome biology, 20:32, 2019.

[70]
W.J. Kent et al. The human genome browser at UCSC. Genome research, 12(6):996–1006, 2002.

[71]
D. Karolchik, A.S. Hinrichs, T.S. Furey, K.M. Roskin, C.W. Sugnet, et al. The UCSC Table Browser data retrieval tool. Nucleic acids research, 32:D493–D496, 2004.

[72]
H. Tong, C. Faloutsos, and J.Y. Pan. Random walk with restart: fast solutions and applications. Knowledge and Information Systems, 14:327–346, 2008.

[73]
D. Szklarczyk et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research, 43:D447–D452, 2014.

[74]
D. Szklarczyk et al. The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic acids research, 45(Database issue):D362–D368, 2017.

[75]
X. Wang and D.B. Goldstein. Enhancer Domains Predict Gene Pathogenicity and Inform Gene Discovery in Complex Disease. The American Journal of Human Genetics, 106:215–233, 2020.

[76]
H.K. Finucane, Y.A. Reshef, V. Anttila, K. Slowikowski, A. Gusev, A. Byrnes, et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nature genetics, 50:621–629, 2018.

[77]
A.P. Schoech et al. Quantification of frequency-dependent genetic architectures in 25 UK Biobank traits reveals action of negative selection. Nature communications, 10(1):1–10, 2019.

[78]
H.M. Amemiya, A. Kundaje, and A.P. Boyle. The ENCODE blacklist: identification of problematic regions of the genome. Scientific reports, 9(1):1–5, 2019.

[79]
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489:57–74, 2012.

[80]
J.R.A.J. Moonen et al. KLF4 Recruits SWI/SNF to Increase Chromatin Accessibility and Reprogram the Endothelial Enhancer Landscape under Laminar Shear Stress. bioRxiv, 2020.

[81]
S. McCarthy, S. Das, W. Kretzschmar, O. Delaneau, A.R. Wood, A. Teumer, H.M. Kang, C. Fuchsberger, P. Danecek, K. Sharp, and Y. Luo. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics, 48:1279–1283, 2016.

[82]
L. Jostins et al. Host-microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature, 491:119–124, 2012.

[83]
Y. Okada et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature, 506:376–381, 2014.

[84]
J. Bentham et al. Genetic association analyses implicate aberrant regulation of innate and adaptive immunity genes in the pathogenesis of systemic lupus erythematosus. Nature genetics, 47(12):1457–1464, 2015.

[85]
P.C. Dubois et al. Multiple common variants for celiac disease influencing immune gene expression. Nature genetics, 42(4):295–302, 2010.

[86]
S.S. Kim et al. Genes with high network connectivity are enriched for disease heritability. The American Journal of Human Genetics, 104:896–913, 2019.
