Leveraging supervised learning for functionally-informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs

We present the expression modifier score (EMS), a predicted probability that a variant has a cis-regulatory effect on gene expression, trained on fine-mapped eQTLs and leveraging 6,121 features including epigenetic marks and sequence-based neural network predictions. We validate EMS and use it as a prior for statistical fine-mapping of eQTLs, identifying an additional 20,913 putatively causal eQTLs. Incorporating EMS into colocalization analysis identifies 310 additional candidate genes for UK Biobank phenotypes.

Understanding the effects of non-coding variants on gene expression is of fundamental importance 1 . Experimental methods to assess whether an individual variant modifies expression of a nearby gene in native chromatin context via direct perturbation are lowthroughput 2 . Existing computational predictors [3][4][5][6] , which are higher-throughput, thus lack large gold standard sets of regulatory variants for training and validation. Here, we leverage a novel set of 14,807 putative expression-modifying variant-gene pairs in humans, and use 6,121 features to directly train a predictor of whether a variant modifies expression.
To define the set of putative expression-modifying variant-gene pairs, we analyzed results of recent fine-mapping of eQTLs from GTEx 7,8 , including the 14,807 variant-gene pairs with posterior inclusion probability (PIP) greater than 0.9 according to two methods 9,10 across 49 tissues (Fig. S1, S2). The size of our data set allowed us to quantify the enrichment of putative causal variants for several functional annotations, including deep learning-derived variant effect scores from Basenji 6 and distance to canonical transcription starting site (TSS), with high precision (Fig. S3, S4, S5).
Next, we built a random forest classifier of whether a given variant is a putative causal eQTL for a given gene using 807 binary functional annotations 11-13 , 5,313 Basenji features corresponding to functional activity predictors 6,14 , and distance to TSS. We then scaled the output score of the random forest classifier to reflect the probability of observing a positively labeled sample in a random draw from all the variant-gene pairs (Fig. S6, Methods), and named this scaled score the expression modifier score (EMS). We performed the above process for 49 tissues in GTEx v8 individually. For whole blood, the Basenji scores together had 55.0% of the feature importance for EMS, and distance to TSS had feature importance of 43.1%. The binary functional annotations together had less than 2% of importance (Fig.  S7). Results were similar for other tissues (Supplementary File 1).
EMS achieved higher prediction accuracy than other genomic scores 4,15-18 for putative causal eQTLs on a held-out chromosome (top bin enrichment for held-out putative causal eQTLs 18.3x vs. 15.1x for distance to TSS, the second best, Fisher's exact test p=3.33 ⋅ 10 !" , Fig.  1a; AUPRC=0.884 vs. 0.856 when using distance to TSS, the second best, Fig. S8; Methods). EMS was among the top-performing methods in prioritizing experimentally suggested regulatory variants from reporter assay experiments 19,20 , despite not varying distance to TSS, the most informative feature (Fig. 1b-c, Fig. S9, Methods). Finally, EMS prioritized putative causal non-coding variants for hematopoietic traits in the UK Biobank (UKBB) dataset 21 with performance comparable to other scores (17.6x for EMS vs 17.1x for DeepSEA, the second best; Fig. 1d), although there are known differences between the genetic architectures of cis-gene expression and complex traits 22 . The results were consistent when we performed the same set of analyses in different datasets: hematopoietic traits in BioBank Japan 23 (BBJ) and lymphoblastoid cell line (LCL) eQTL in Geuvadis 24,25 (Fig. S10).
Since EMS is in units of estimated probability, one natural way to utilize EMS for better prioritization of putative causal eQTLs is to use it as a prior for statistical fine-mapping. We developed a simple algorithm for approximate functionally-informed fine-mapping and applied it with EMS as a prior to obtain a functionally-informed posterior, denoted PIPEMS (Methods). We found that PIPEMS identified more putative causal eQTLs than the original PIP calculated with a uniform prior, denoted PIPunif. Specifically, 95.4% of variants with PIPunif> 0.9 also had PIPEMS> 0.9 (2,152 out of 2,255), while only 33.8% of variants with PIPEMS> 0.9 had PIPunif> 0.9 (1,125 out of 3,277; Fig. 2a). Similarly, credible sets mostly decreased in size (Fig. 2b,  Supplementary File 2).
We evaluated the quality of PIPEMS by comparing it with PIPunif and a publicly available eQTL fine mapping result that uses distance to TSS as a prior 7,25 (denoted PIPDAP-G) in two ways (Other methods for functionally-informed fine-mapping 26,27 would be computationally intensive for a data set this size; the recently introduced PolyFun 28 is designed for complex traits.). First, PIPEMS had the highest enrichment level of reporter assay QTLs 25 (raQTLs) in the PIP>0.9 bin (16.8x vs 12.9x in PIPunif and 11.4x in PIPDAP-G, Fisher's exact test p=1.65 ⋅ 10 !# between PIPEMS and PIPDAP-G; Fig. 2c). Second, complex trait causal non-coding variants were comparably enriched in PIP>0.9 bins (Fig. S11). These results suggest that PIPEMS is a valid measure for identifying putative causal cis-regulatory variants.
We next compared the usage of PIPEMS to PIPunif for complex trait gene prioritization, as in Weeks et al 29 . We calculated PIPEMS for 49 GTEx tissues (Fig. S12, S13), resulting in a total of additional 20,913 eQTLs with PIPEMS>0.9 ( Fig. S14; Supplementary File 3). We then colocalized the eQTL signals with 95 UKBB phenotypes. Using the gold standard gene set described in ref [29], PIPEMS achieved higher precision and higher recall than PIPunif ( Table  1, Methods). Overall, PIPEMS elucidated 310 candidate genes for UKBB phenotypes that were not identified with PIPunif (Supplementary File 4). On the other hand, PIPDAP-G showed lower precision than PIPEMS and PIPunif but higher recall (Table 1) suggesting the value of future studies in investigating different priors in eQTL fine-mapping and the trade-off between precision and recall.
An example of PIPEMS resolving a credible set that is ambiguous with PIPunif is shown in Fig.  2d. Here, four variants upstream of CITED4 are in perfect LD in GTEx, giving PIPunif = 0.25 for all four (Fig. S15). In UKBB, the four variants are also in high LD, with PIP for neutrophil count between 0.133 and 0.181 for all four. Thus, standard colocalization analysis does not identify CITED4 as a neutrophil count-related gene (CLPP less than 4.53 ⋅ 10 !# for all variants; Methods). However, one of the four variants, rs35893233, creates a binding motif of SPI1, a transcription factor known to be involved in myeloid differentiation 30 , and presents epigenetic activity in myeloid-related cell types, such as showing the highest basenji score for cap analysis gene expression (CAGE) activity in acute myeloid leukemia (AML). This variant has >25x greater EMS than the other three variants (1.73 ⋅ 10 !$ vs 6.11 ⋅ 10 !% , 1.00 ⋅ 10 !% and 8.62 ⋅ 10 !& , respectively), enabling PIPEMS to narrow down the credible set to the single variant (PIPEMS = 0.956 for rs35893233). Integrating EMS into the co-localization analysis thus allows identification of CITED4 as a neutrophil count-related gene (CLPP=0.173). Additional examples are described in Fig. S16.
Limitations of our approach include (1) limited power to call putative causal variants in high LD regions or with low minor allele frequency, (2) the simplifications of thresholding and ignoring effect size and direction, and (3) the lack of a comprehensive set of features. EMS for all variants in GTEx v8 are publicly available for 49 tissues. Our study provides a powerful resource for deciphering the mechanisms of non-coding variation.

Methods:
The Expression Modifier Score (EMS) Fine-mapping of GTEx v8 data is described in Ulirsch et. al 8 and is summarized in the Supplementary Methods. We constructed a binary classification task by labeling the variantgene pairs with PIP>0.9 for both of the two fine-mapping methods (SuSiE 9 and FINEMAP 10 ) as positive, and the ones with PIP<0.0001 for both methods as negative. Each variant-gene pair was annotated with 6,121 features (distance to TSS annotated in the GTEx v8 dataset, 12 non-cell type specific binary features from the LDSC baseline model 13 , 795 cell type specific binary features from the Roadmap Epigenomics Consortium 12 , where variants falling in narrow peak are annotated as 1, and others are 0, and 5,313 deep-learning derived cell type-specific features generated by the Basenji model 6,14 ; Supplementary File 5). The 152 most predictive features were selected based on different prediction accuracy metrics such as F1 measure and mean decrease of impurity (MDI) for each feature (Supplementary Methods). A combination of random search followed by grid search was performed to tune the hyperparameter for a random forest classifier that maximizes the AUROC of the binary prediction in the held-out dataset (Supplementary File 6). Finally, for each prediction score bin, we calculated the fraction of positively labeled samples and scaled the output score, to derive the EMS. Further details are described in the Supplementary Methods.

Performance evaluation of EMS
To evaluate the performance of EMS, for each chromosome, we trained EMS using all the other chromosomes to avoid overfitting. CADD v1.4 and GERP scores were annotated using the hail annotation database (https://hail.is). ncER scores were downloaded from https://github.com/TelentiLab/ncER_datasets. In order to annotate the DeepSEA v1.0 and Fathmm v2.3 non-coding scores, we mapped hg38 coordinates to hg19 using the hail liftover function, removed variants that do not satisfy 1 to 1 matching, and followed their web instructions (https://humanbase.readthedocs.io/en/latest/deepsea.html, and http://fathmm.bio compute.org.uk) to score the variants. Insertions and deletions were excluded for Fathmm scores. For DeepSEA, we calculated the e-values from the individual features, following ref [4]. We computed the area under the receiver operating characteristic curve and the precision recall curve (Fig. S8) as well as enrichments of different variant-gene pairs or variants as described in the next sections (Fig. 1).

Computation of enrichment
Enrichment of a specific set of variant-gene pairs (e.g. putative causal eQTLs in the GTEx whole blood dataset) in a score bin is defined as the probability of drawing a variant-gene pair in the set given that the variant-gene is in the score bin, divided by the overall probability of drawing a variant-gene pair in the set. The error bar denotes the standard error of the numerator, divided by the denominator (we assumed the standard error of the denominator is small enough, since the total number of variant-gene pairs is typically large; >100,000,000 for all the variant-gene pairs in GTEx). When testing binary functional features as in Fig. S3, S4, the score is the individual functional feature, and the set is defined by the specific PIP bin. Enrichment analysis of eQTL, complex trait, and reporter assay data Saturation mutagenesis data 19 was downloaded from the MPRA data access portal (http://mpra.gs.washington.edu). An MPRA hit was defined as having a Bonferroni-significant association p-value (lower than 0.05 divided by the total number of variant-cell type pairs) for at least one cell type, regardless of the effect size and direction. The raQTL data 20 was downloaded from https://osf.io/w5bzq/wiki/home/. EMS was re-scaled to have a constant distance to TSS (200 bp, roughly representing the scale of typical distance to TSS in plasmids 5 ), which is expected to significantly decrease the performance of EMS compared to in native genome. Similarly, when comparing EMS with other scores for enrichments of MPRA hits or raQTLs, distance to TSS was not used for the comparison.
Fine-mapping of UKBB traits is described in Ulirsch et al 8 . To focus on non-coding regulatory effects, we annotated the variants in VEP v85 and filtered out coding and splice variants for the UKBB dataset. For each (non-coding) variant, we calculated the maximum PIP over all the hematopoietic traits, as well as the maximum Whole-Blood EMS over all the genes in the cis window of the variant, since a variant can have different regulatory effect on different genes, for different phenotypes. A variant was defined as putative hematopoietic trait-causal if it has SuSiE PIP higher than 0.9 in any of the hematopoietic traits. In UKBB and raQTL dataset, we focused on the variants that exist in the GTEx v8 dataset to reduce the calculation complexity.
For all four datasets, the variants (or variant-gene pairs in GTEx) other than putative causal ones were randomly downsampled to achieve a total number of variants to be exactly 100,000, to reduce the computational burden while keeping enough number of variants to observe statistical significance. GTEx enrichment, MPRA hits enrichment, raQTL enrichment and UKBB enrichment are thus defined as the enrichment of putative causal eQTLs, MPRA hits, raQTLs and putative hematopoietic-trait causal variants in the downsampled dataset respectively.
Approximate functionally-informed fine-mapping using EMS In the Sum of Single Effects (SuSiE) model, for a given gene, the vector of true SNP effects on that gene is modeled as a sum of vectors with only one non-zero element each: where and ' are vectors of length and is the number of variants in the locus. Intuitively, each ' corresponds to the contribution of one causal variant. One output of SuSiE is a set ofvectors * , . . . , ( , with ( ( ) equal to the posterior probability that ' ( ) ≠ 0; i.e., that the -th causal variant is the variant . Credible sets are computed for each from ' , and credible sets that are not "pure" --i.e., that contain a pair of variants with absolute correlation less than 0.5 -are pruned out. The ' are also used to compute PIPs.
Our algorithm for approximate functionally-informed fine-mapping takes the approach of reweighting the posterior probability calculated using the uniform prior, analogous to ref [31], and proceeds as follows. For each gene and each tissue, we start with * , . . . , ( computed by SuSiE using the uniform prior. For each , if ' corresponds to a pure credible set, we re-weight each element of ' by the EMS of the corresponding variant, and we normalize so that the sum is equal to 1, obtaining 8 ' . In other words, letting * ,..., , denote the EMSs for the variants, we define 8 ' ( ) for the variant to be if ' corresponds to a pure credible set; otherwise, we set 8 ' = ' . We then use the updated 8 * , . . ., 8 ( to compute updated PIPs and credible sets as in the original SuSiE method. See Supplementary Methods for further details. Performance evaluation of PIPEMS and application to gene prioritization PIP using distance to TSS as a prior (PIPDAP-G) was downloaded from the GTEx portal (https://gtexportal.org/). The raQTL data was downloaded from https://osf.io/w5bzq/wiki/home/, and the negative variants were randomly downsampled to a total of 100,000 variants. The number of putative causal eQTLs is defined as the number of variant-gene-tissue pairs with PIPEMS>0.9. For complex trait causal non-coding variant prioritization, a threshold of PIP>0.1 was chosen to account for low sample size. We defined a gene prioritization task using 49 tissues in GTEx and 95 complex traits in UKBB using the following steps (further details are described in Weeks et al. 26 ): Across all traits, we identified 1 Mb regions centered at unresolved credible sets (no coding variant with PIP>0.1) that additionally contained at least one "gold standard gene" (protein-coding variant with PIP>0.5) for the same trait. There were 2,897 such regions and 1,161 gold standard genes. Our intuition is that the gene with the fine-mapped protein-coding variant is most likely to be the primary causal signal, and that a nearby non-coding signal is more likely to act through this gene (i.e. via regulation) than through a different gene.
For each gene-region pair, we defined the co-localization posterior probability (CLPP) for the gene to be the maximum of the product of the eQTL PIP and trait PIP, across all tissues and all variants in the unresolved credible set. A gene is prioritized if it has CLPP > 0.1 and it has the maximum CLPP in its region. We compute the precision as the number of correctly prioritized genes (where the prioritized gene is also the gene with the primary, protein-coding signal) divided by the total number of prioritized genes. We compute recall as the number of correctly prioritized genes divided by the total number of gold standard genes. The total number of candidate genes is defined as the number of gene-trait pairs presenting CLPP>0.1 in at least one tissue and variant.