Abstract
Identifying causal variants from genome-wide association studies (GWASs) is challenging due to widespread linkage disequilibrium (LD). Functional annotations of the genome may help prioritize variants that are biologically relevant and thus improve fine-mapping of GWAS results. However, classical fine-mapping methods have a high computational cost, particularly when the underlying genetic architecture and LD patterns are complex. Here, we propose a novel approach, SparsePro, to efficiently conduct functionally informed statistical fine-mapping. Our method enjoys two major innovations: First, by creating a sparse low-dimensional projection of the high-dimensional genotype data, we enable a linear search of causal variants instead of a combinatorial search of causal configurations used in most existing methods; Second, we adopt a probabilistic framework with a highly efficient variational expectation-maximization algorithm to integrate statistical associations and functional priors. We evaluate SparsePro through extensive simulations using resources from the UK Biobank. Compared to state-of-the-art methods, SparsePro achieved more accurate and well-calibrated posterior inference with greatly reduced computation time. We demonstrate the utility of SparsePro by investigating the genetic architecture of five functional biomarkers of vital organs. We show that, compared to other methods, the causal variants identified by SparsePro are highly enriched for expression quantitative trait loci and explain a larger proportion of trait heritability. We also identify potential causal variants contributing to the genetically encoded coordination mechanisms between vital organs, and pinpoint target genes with potential pleiotropic effects. In summary, we have developed an efficient genome-wide fine-mapping method with the ability to integrate functional annotations. Our method may have wide utility in understanding the genetics of complex traits as well as in increasing the yield of functional follow-up studies of GWASs. SparsePro software is available on GitHub at https://github.com/zhwm/SparsePro.
1 Introduction
Establishment of large biobanks and advances in genotyping and sequencing technologies have enabled large-scale genome-wide association studies (GWASs) [1–3]. Although GWASs have revealed extensive associations between genetic variants and traits of interest, understanding the genetic architecture underlying these genetic associations remains challenging [4–6], mainly because GWASs typically rely on univariate regression models, which are not able to distinguish the causal variants from other variants in linkage disequilibrium (LD) [5, 7, 8].
Several statistical fine-mapping approaches have been proposed for identifying causal variants in GWASs while considering the underlying LD patterns. For instance, BIMBAM [9], CAVIAR [10] and CAVIARBF [11] estimate the posterior inclusion probabilities (PIPs) in a pre-defined locus by evaluating multivariate Gaussian likelihood enumerating all possible configurations. FINEMAP [12] accelerates the inference with a shotgun stochastic search focusing on the most likely subset of causal configurations. However, the number of causal configurations required to evaluate can grow combinatorially as the number of causal variants increases, thus tremendously increasing the computational cost if multiple causal variants exist. SuSiE [13] introduces an iterative Bayesian stepwise selection algorithm for variable selection, which can also be applied to statistical fine-mapping with greatly improved computational efficiency.
Furthermore, it has been recognized that functional annotations of the genome may help prioritize variants that are biologically relevant, thus improving fine-mapping of GWAS results [8]. For example, PAINTOR [14] and RiVIERA [15] empirically estimate the impacts of functional annotations from statistical evidence, which improves the accuracy of fine-mapping but has a high computational cost, especially when multiple causal SNPs exist in the same locus. Poly-Fun [16] adopts stratified LD score regression [17] to effectively partition total trait heritability into annotation-dependent heritability estimates, and uses these estimates of annotation-tagged heritability to specify functional priors for fine-mapping methods.
In this work, we present a unified probabilistic framework called Sparse Projections to Causal Effects (SparsePro) for statistical fine-mapping with the capacity to incorporate functional annotations. Accompanied with an efficient variational expectation-maximization inference algorithm [18], SparsePro achieves superior accuracy in identifying causal variants as well as computational efficiency compared to the state-of-the-art approaches in both simulation studies and real data analyses. We further demonstrate the utility of SparsePro in genome-wide fine-mapping of functional biomarkers for five vital organs in human.
2 Materials and Methods
2.1 SparsePro method overview
To fine-map causal SNPs, our method takes two lines of evidence (Figure 1). First, from estimated marginal associations between genetic variants and a complex trait of interest, accompanied by matched LD information, we can group correlated genetic variants together and assess their effects jointly. Then, we infer the contribution of each SNP towards each group of causal effect separately to obtain posterior inclusion probabilities (PIPs). Second, optionally, if we have knowledge about any functional annotations which may be enriched for the causal SNPs, we can estimate the relative enrichment of these annotations, and re-prioritize SNPs according to the enrichments of these annotations. As outputs, our model yields functionally informed PIP for each SNP and the enrichment estimates of candidate functional annotations.
2.2 Our contributions in the context of the existing methods
Our work is related to two existing methods, SuSiE [13] and PolyFun [16]. Inspired by the “sum of single effects” model in SuSiE, we introduce a sparse projection of the genotype in our model specification so that the identification of causal variants and estimation of causal effect sizes are separated. This sparse projection avoids exhaustively evaluating the combinatorial number of causal configurations. For statistical inference, SuSiE adopts an iterative Bayesian step-wise selection algorithm that operates on the Bayes Factors (BFs) [13]. Here, we use a paired mean field variational inference algorithm [18] to jointly update the variational parameters for the causal effects and causal indicators of each SNP. Moreover, we have adapted our algorithm to directly work with GWAS summary statistics and provided appropriate estimates for the hyperparamters including trait variance and heritability estimates. To enable functionally informed fine-mapping, PolyFun uses genome-wide heritability estimates from LD score regression to set the functional priors for fine-mapping methods [16]. In contrast, we aggregate the genome-wide statistical fine-mapping evidence by maximizing the evidence lower bound of SparsePro to prioritize relevant annotations and robustly derive genome-wide functional priors.
2.3 SparsePro model specification
We assume the following data generative process (Figure 1) for a continuous polygenic trait. First, the prior probability for the g-th SNP (g ∈ {1,…, G}) being causal is defined as: where Ag is the 1 × M annotation row vector of M candidate annotations for the g-th SNP; and w is the M × 1 vector of logarithm of relative enrichment. Here, we use the softmax function to ensure the prior probabilities sum up to 1. If no functional information is provided, the prior probability of being causal is considered equal for all SNPs, i.e. .
We assume that there exist K independent causal effects, and that where and sk is a binary indicator vector of length G indicating which SNP is the causal SNP under the k-th (k ∈ {1, …, K}) causal effect.
Then, the causal effect sizes are sampled from a normal distribution, i.e.
Finally, the continuous trait yN×1 over N individuals is generated as follows: or in matrix form: where XN×G is the full genotype matrix, SG×K is the sparse projection matrix, βK×1 is the causal effect vector, and denotes the variance not attributable to the modelled genetic effects.
2.4 A variational inference algorithm for Bayesian fine-mapping
With this model specification (Figure 1), finding the causal variants is equivalent to inferring the sparse projections sk and the effect sizes βk given y and X for k ∈ {1, …, K}:
As the number of possible causal configurations grows combinatorial with G, the exact posterior solution is intractable because of the marginal likelihood in the denominator. Unlike most existing fine-mapping approaches using sampling-based methods to search through a subset of possible causal configurations [12, 19], we adopt a paired mean field factorization of variational family to approximate the posterior [18]:
This variational distribution preserves the dependency between sk and βk. It has been shown that the paired mean field variational family has similar mode and shape as the desired posterior distribution, and that such inference can achieve high accuracy with substantially improved computational efficiency [18].
To find the best approximation, we minimize the Kullback-Leibler (KL) divergence between the posterior distribution and the proposed variational distribution, which is equivalent to maximizing the evidence lower bound (ELBO) [20]:
Based on the mean field assumptions [18], this optimization can be conducted iteratively for the kth causal effect and the gth SNP with the following closed-form updates until convergence (derivation details are available in Supplementary Notes).
We update posterior effect size for the g-th SNPs in the k-th causal effect: with where o represents element-wise multiplication of vectors.
We then update the posterior probability of the g-th SNP being causal in the k-th causal - effect:
For fine-mapping, we take the maximum of these K probabilities as the PIP for SNP .
2.5 Adaptation to GWAS summary statistics
The above variational inference algorithm requires access to large datasets containing both individual-level genotype X and phenotype data y. Since a growing number of GWASs have released publicly available summary statistics (i.e., marginal effect size estimate and its standard error seg for the g-th SNP), we adapt SparsePro to directly operate on these summary statistics with additional information from an LD reference panel (i.e. estimates of pairwise SNP-SNP Pearson correlation).
Specifically, if we have reasonable surrogates for and , we can plug them into Equations (1), (2), and (3). We include two forms of reformulation depending on whether the genotypes are standardized to have zero mean and unit variance in the GWAS.
If the genotypes are standardized, we have where N is the sample size.
If the genotypes are not standardized, we have
Therefore,
Notably, if y has been standardized to have unit variance prior to a GWAS, we naturally supply var(y) = 1. Otherwise, it can be estimated as var(y) = 2Np(1 − p)se2 where N (the study sample size), p (minor allele frequencies), and se (standard errors of effect size estimates) are usually available in GWAS summary statistics.
2.6 Variational expectation-maximization for integrating functional annotations
To estimate the relative enrichment of functional annotations and further prioritize variants, we adopt a variational expectation-maximization scheme to maximize ELBO with respect to the logarithm of relative enrichment (w) first and then use the estimate ŵ to calculate (prior probability of being causal) for each SNP.
Suppose we have M candidate annotations and Agm (m ∈ {1, …, M}) is a 0/1 indicator denoting whether the g-th SNP has the m-th annotation. By setting the derivative of ELBO with respect to wm to 0 and solving for wm, we have the following estimate for the logarithm of relevant enrichment (detailed in Supplementary Notes), where
We note that this metric is equivalent to the logarithm of a relative risk, thus its standard error can be calculated as
We evaluate the significance of annotation enrichment with the log likelihood ratio test (G-test) [21]. Only annotations which demonstrate statistical significance are included in our model to update the prior probability of being causal for each SNP. Specifically,
This functionally informed prior helps prioritize causal SNPs in addition to statistical evidence.
2.7 Hyperparameter settings
We have three hyperparameters: number of causal effect K, inverse of the unexplained variance τy and inverse variance of causal effect sizes in our model. As we show in Supplementary Notes, our model is not sensitive to the setting of K as long as K is larger than the actual number of independent effects, except that increasing K marginally increases the computation time.
We set τy as where h2 is the local SNP heritability that can be estimated by a modified Heritability Estimation from Summary Statistics (HESS) [22] based on GWAS summary statistics (Supplementary Notes)
We set τβ as for each of the independent causal effects. We use k ∈ {1, …, K} to account for different effect sizes and to improve model identifiability.
2.8 Simulation studies
We conducted simulations to showcase the efficiency and utility of our method. We leveraged resources from the UK Biobank [1]. Specifically, we first retained 353,606 White British ancestry participants by excluding one individual from each pair of closely related individuals (who had a 3rd degree or closer relationship). We then retrieved the genotypes of these individuals based on 271,699 SNPs which had a minor allele frequency ≥ 0.001 and an imputation quality score ≥ 0.6 on chromosome 22. Next, we sampled 50 causal SNPs with a two-fold relative enrichment amongst SNPs that were annotated as “conserved sequences” [23], “DNase I hypersensitive sites” (DHS) [24], “non-synonymous” [25], or that overlapped with histone marks H3K27ac [26] or H3K4me3 [24]. We used the GCTA GWAS simulation pipeline [27] to simulate a continuous trait with a per-chromosome heritability of 0.01. We tested the association between each SNP and this simulated trait, and obtained GWAS summary statistics using the fastGWA software [28]. This process was replicated 22 times to imitate a GWAS. We obtained LD information calculated using the UK Biobank participants from https://alkesgroup.broadinstitute.org/UKBB_LD/ [16]. These LD matrices were generated for genome-wide SNPs binned into sliding windows of 3 Mb where two neighboring windows had a 2-Mb over-lap.
We applied SparsePro to the GWAS summary statistics with the above LD information, and iterated over all sliding windows, first without any functional annotation information. We denoted the fine-mapping results as “SparsePro-”. Next, we aggregated the results from all 22 replications to estimate the relative enrichment for ten binary functional annotations. In addition to the five annotations simulated to be enriched of causal SNPs, we also included five annotations without enrichment: “actively transcribed regions” [29], “transcription start sites” [29], “promoter regions” [30], “5’-untranslated regions” [25], and “3’-untranslated regions” [25].
Annotations with a G-test p-value < 1 × 10−6 were selected to conduct functionally informed fine-mapping, and the results were denoted as “SparsePro+”. τβ and τy were set according to aforementioned empirical estimates. PIPs for SNPs in the 1-Mb centre of each 3-Mb sliding window were extracted.
2.9 Method comparisons using simulated data
We also performed fine-mapping with some of the state-of-the-art methods. To perform fine-mapping with conditional and joint (COJO) analyses [31] and FINEMAP [12], we first selected COJO lead SNPs based on GWAS summary statistics by performing stepwise model selection implemented in the GCTA-COJO software [27]. We then applied FINEMAP with shotgun stochastic search to SNPs in a 1-MB window centered at each COJO-identified lead SNP. We wrote an in-house script using the “susie_rss” function to perform genome-wide fine-mapping with SuSiE in the same sliding windows as SparsePro. We aggregated summary statistics from 22 replications and used PolyFun with the “baselineLF2.2.UKB” model [16] to calculate functional priors. The “baselineLF2.2.UKB” model contained all annotations used in SparsePro as well as additional pre-computed LD-related annotations for optimal performance of PolyFun [16]. The estimated priors were provided to SuSiE via “prior weights” and to FINEMAP via the --prior-snps option, respectively. The maximal number of causal SNPs in each locus was set to 5 for all methods.
We compared the performance of these methods in terms of precision (1 - false discovery rate), recall, calibration of PIPs, as well as computation time, all evaluated on a 2.1 GHz CPU node on Compute Canada.
2.10 Fine-mapping genetic determinants of functional biomarkers for vital organs
To investigate the genetic coordination mechanisms of vital organs, we performed GWAS in the UK Biobank [1] for five functional biomarkers: forced expiratory volume in one second to forced vital capacity (FEV1-FVC) ratio for lung function, estimated glomerular filtration rate for kidney function, pulse rate for heart function, total protein for liver function and blood glucose level for pancreatic islet function. For each trait, we first regressed out the effects of age, age2, sex, genotyping array, recruitment centre, and the first 20 genetic principal components before inverse normal transforming the residuals to z-scores that had zero mean and unit variance. We then performed GWAS analysis on the resulting z-scores with the fastGWA software [27, 28] to obtain summary statistics.
Using the summary statistics and the matched LD information [16], we performed genome-wide fine-mapping with SparsePro-, SparsePro+, SuSiE and PolyFun-informed SuSiE as described in the simulation analyses (Section 2.9), except that the number of causal effects was set to 9 for each LD region to account for potentially more causal variants.
To evaluate the biological relevance of SNPs fine-mapped by different methods, we assessed their relative enrichment in tissue-specific expression quantitative loci (eQTL). Tissue-specific eQTL identified in the most recent release of the Genotype-Tissue Expression (GTEx) project [32, 33] were obtained from https://gtexportal.org/home/datasets. The eQTL information was not used by any functionally informed fine-mapping methods.
Additionally, we calculated trait heritability conferred by fine-mapped SNPs with SparsePro- and SparsePro+, respectively, at several commonly used PIP thresholds for determining causal variants: 0.50, 0.80, 0.90, 0.95, and 0.99. The adjusted R2 obtained from multivariate linear regression of the z-scores (i.e. inverse normal transformed trait residuals after regressing out covariate effects) against all fine-mapped SNPs was used as a surrogate of the SNP heritability. We compared these results to heritability captured by the same number of SNPs fine-mapped by SuSiE and PolyFun-informed SuSiE, separately at each PIP threshold. For instance, if SparsePro-identified J SNPs with a PIP > 0.5, we would select J SNPs with the highest PIP determined by SuSiE and compare the adjusted R2. Notably, this analysis evaluates predictive associations instead of actual causality, hence the adjusted R2 is not a direct indicator of the validity of the fine-mapping results.
We selected SNPs with a PIP > 0.8 to explore possible pleiotropic effects using phenogram [34]. Loci with potential pleiotropic effects were visualized using LocusZoom [35].
3 Results
3.1 SparsePro demonstrates superior performance in simulation
We performed simulations based on real genotype data from UK Biobank (Materials and Methods). We observed that SparsePro consistently demonstrated superior accuracy in identifying true causal variants. SparsePro without annotation (SparsePro-) achieved an area under the precision-recall curve (AUPRC) of 0.3699, higher than the AUPRC of 0.2677 by FINEMAP and the AUPRC of 0.3573 by SuSiE (Figure 2A). Notably, SparsePro had a substantially higher precision at the same recall rates (Figure 2A). For example, at the recall rate of 25%, Sparse-Pro achieved greater than 95% precision, which is highly desirable in fine-mapping because only a small number of the prioritized SNPs will be experimented validated in vivo or in vitro in practice (Figure 2A).
Moreover, SparsePro can incorporate functional priors (Supplementary Table S1) with improved fine-mapping power. SparsePro+ achieved an AUPRC of 0.4636, outperforming both functionally informed FINEMAP (AUPRC = 0.3088) and functionally informed SuSiE (AUPRC = 0.4042) with functional priors derived by PolyFun. As expected, we also found that the performance of SparsePro was not sensitive to the pre-specified number of independent causal effects (Supplementary Table S2 and Supplementary Notes).
Compared to FINEMAP and SuSiE, the PIPs yielded by SparsePro appeared to be much more calibrated. It has been shown that for a well-calibrated fine-mapping method, the mean PIP of all SNPs with a PIP above a certain threshold should be equal to the precision if these SNPs were to be considered causal variants [16]. Here, we found that the mean PIP of all SNPs considered to be causal variants by SparsePro was almost identical to the desired precision at any threshold (Figure 2B). In contrast, the PIPs generated by FINEMAP and SuSiE appeared to be inflated (Figure 2B).
For instance, if SNPs with a PIP > 0.8 were to be considered as causal variants, SparsePro- and SparsePro+ would both have a median precision across simulations of 95% and 100% respectively (Figure 2C). The selected SNPs by FINEMAP (median precision = 50% only) and SuSiE (median precision = 71% only) included an excessive proportion of false positives, even with functional priors (median precision = 77% for FINEMAP and 79% for SuSiE; Figure 2C). The high precision by SparsePro was consistent for all frequently used PIP thresholds (Figure 2C) although FINEMAP and SuSiE sometimes have a slightly higher recall.
Furthermore, SparsePro conferred not only higher fine-mapping precision, but also higher computational efficiency. In our simulation, it took only an hour to fine-map chromosome 22, which was 6.5 times faster than FINEMAP and 3 times faster than SuSiE (Figure 2D and Supplementary Table S3).
3.2 Fine-mapped SNPs by SparsePro are more enriched in tissue-specific eQTL and confer higher trait heritability
We performed GWAS in the UK Biobank [1] for five functional biomarkers: FEV1-FVC ratio (lung function), estimated glomerular filtration rate (kidney function), pulse rate (heart function), total protein (liver function) and blood glucose level (pancreatic islet function). Genome-wide fine-mapping of five functional biomarkers based on the UK Biobank population using Sparse-Pro identified multiple potentially causal variants (Supplementary Table S4). To assess biological relevance of the fine-mapping results, we estimated the relative enrichment of causal signals in tissue-specific eQTL for each trait (Materials and Methods). We found that the fine-mapped SNPs were significantly enriched in tissue-specific eQTL for all five biomarkers, while results based on SparsePro−/+ showed the strongest enrichment (Figure 3A). For example, for total protein, the fine-mapped SNPs determined by SparsePro-were 4.00-fold (95% CI: 3.25-4.92) more likely to be liver-specific eQTL than non-fine-mapped SNPs, compared to a 1.54-fold (95% CI: 1.35-1.75) enrichment based on fine-mapped SNPs by SuSiE. While SuSiE was substantially improved by functional priors derived from PolyFun with a 2.20-fold (95% CI: 1.97-2.45) enrichment, the fine-mapped SNPs by SparsePro+ exhibited the highest biological relevance, being 4.06-fold (95% CI: 3.31-4.97) more likely to be liver-specific eQTL.
Moreover, at most PIP thresholds, the SNPs fine-mapped by SparsePro- explained a higher proportion of phenotypic variance based on all UK Biobank subjects (Methods) compared to the same number of the most likely causal SNPs identified by SuSiE (Figure 3B and Supplementary Table S5). With the functional annotations (Supplementary Table S5), the fine-mapped SNPs by SparsePro+ consistently achieved a higher SNP heritability for estimated glomerular filtration rate, FEV1-FVC ratio, as well as total protein compared to the PolyFun-informed SuSiE; although for glucose and pulse rate, PolyFun-informed SuSiE was able to identify SNPs with a slightly higher predictive performance at certain PIP thresholds (Figure 3C and Supplementary Table S6).
3.3 Pleiotropic effects of SNPs rs1260326 and rs5742915 on the functions of multiple vital organs
Overall, we observed considerable polygenicity for the five biomarkers of the vital organs (Figure 4A). Interestingly, at the PIP threshold of 0.80, we found two potentially causal variants for three of the five biomarkers. Specifically, SNP rs1260326 (Figure 4B), a missense variant (Leu446Pro) in gene GCKR, was fine-mapped for glomerular filtration rate (PIP = 1.000), blood glucose level (PIP = 0.998), pulse rate (PIP = 0.823) and total protein level (PIP = 1.000). Notably, this specific variant has been found to be significantly associated with a wide variety of glycemic traits [36] and other quantitative traits for metabolic syndromes and comorbidities [37, 38], and has been implicated in the functions of liver and other vital organs [39–41].
Another SNP, rs5742915 (Figure 4C), a missense variant (Phe645Leu) in gene PML was fine-mapped for FEV1-FVC ratio (PIP = 0.858), pulse rate (PIP = 1.000) and total protein level (PIP = 0.987). This variant has also been associated with other quantitative biomarkers of polygenic traits featuring development and metabolism, including birth weight [42], height [43], appendicular lean mass [44], and age at menarche [45]. These findings, along with other SNPs exhibiting pleiotropic effects at somewhat lower PIP thresholds (Supplementary Table S4) presented promising genetic targets for experimental validations in a larger effort towards understanding the mechanisms of genetic coordination among vital organs.
4 Discussion
Accurately identifying trait-determining and disease-causing variants is fundamental in genetics and particularly important for appropriately interpreting GWAS results [5, 8]. In this work, we developed SparsePro, an efficient fine-mapping method to help prioritize causal variants for complex traits, possibly with prior functional information. Through genome-wide simulations, we showed that SparsePro was highly accurate and computationally efficient compared to existing methods. By fine-mapping genetic associations with five biomarkers for vital organ functions, we demonstrated that SparsePro identified candidate variants that were biologically relevant, including two variants with pleiotropic effects, which might indicate genetically encoded coordination among vital organs.
Compared to the existing methods, SparsePro has three important features. First of all, we use an efficient variational inference algorithm to approximate the posterior distribution of the causal variant indicators instead of exhaustively searching through all possible causal configurations or performing stepwise regression. As a result, SparsePro can be significantly faster than the existing fine-mapping methods, such as FINEMAP [12], and is more than twice as fast as SuSiE [13], which is a similar variable selection framework but implements an iterative Bayesian stepwise selection procedure. The substantially improved computational efficiency enables statistical fine-mapping of large chunks of the genome instead of analyzing genetic associations on a per-locus basis as in most existing follow-up studies of GWASs. In our simulation studies, compared to locus-wise fine-mapping based on COJO-identified lead SNPs, such a genome-wide fine-mapping requires neither a pre-specified p-value threshold (e.g. p < 5 × 10−8) for determining candidate loci nor an arbitrary number of causal effects per locus. If functional annotations are available, the estimation of functional enrichment may also be more robust by including more variants with little additional computational overhead.
Second, we utilize a paired mean field variational family, where the causal effect and the casual indicator are coupled in the variational distribution. This ensures that our approximation matches closely with the true posterior distribution of the causal variant indicators [18]. As a result, SparsePro yielded better-calibrated PIPs compared to existing fine-mapping methods.
Third, given GWAS summary statistics, we provide estimates for hyperparameters including τy and τβ that are reasonable in the context of polygenic trait genetics. Consequently, at several commonly used PIP thresholds for defining causal variants, SparsePro showed improved control of false positives, demonstrated higher precision in identifying causal variants in simulation and obtained stronger enrichment for tissue-specific eQTL in real data application.
Last, we propose and implement a probabilistic model that coherently integrates statistical evidence and functional prior information. The key difference between SparsePro+ and other methods that leverage functional priors, such as PolyFun [16] and PAINTOR [14], is that each annotation is tested for its relevance with the trait of interest before being used to derive the priors in our model. Therefore, functional annotations serve as complementary evidence when statistical evidence is not sufficient to discern causal variants. Based on our results, it seems that this approach distills better prior information from the functional annotations compared to the aforementioned methods.
We note that SparsePro can be further improved with the following future directions. First, SparsePro generally requires that the supplied LD reference panel matches well with that of the GWAS study population to guarantee proper convergence. While we advocate the public availability of the in-sample LD information along with the GWAS summary statistics, a more robust model is needed to account for mismatched LD information. Second, SparsePro currently supports only binary annotations while compatibility with continuous annotations is also desirable. Last, the current variational expectation-maximization scheme might not accurately estimate the joint enrichment of highly correlated annotations. Performing variable selection beforehand or effectively aggregating enrichment estimates may enable the inclusion of multiple correlated informative annotations, such as cell type-specific annotations to further improve the utility of SparsePro.
In summary, SparsePro is an efficient genome-wide fine-mapping method with the ability of integrate functional annotations. We envision its wide utility in understanding the genetic architecture of complex traits, identifying target genes, and increasing the yield of functional follow-up studies of GWASs.
7 Author contributions
W.Z and Y.L have conceived the study and developed the methodology. W.Z created the computational software and ran the analyses. All authors interpreted the results. W.Z. drafted the initial manuscript. H.S.N and Y.L supervised this study and revised the manuscript critically.
8 Disclosures
The authors declare no conflict of interest.
9 Data and Software Availability
SparsePro is an open-access software and publicly available at https://github.com/zhwm/SparsePro. All simulation and plotting scripts to reproduce this study are publicly available at https://github.com/zhwm/SparsePro_Paper. Individual-level phenotype and genotype data from the UK Biobank are available upon successful application to its research committee. GCTA were downloaded from https://cnsgenomics.com/software/gcta/bin/gcta_1.93.2beta.zip. FINEAMP were downloaded from http://www.christianbenner.com/finemap_v1.4_x86_64.tgz. SuSiE (version 0.11.42) were installed from CRAN. PolyFun were installed from https://github.com/omerwe/polyfun. UK Biobank LD information was downloaded from https://alkesgroup.broadinstitute.org/UKBB_LD/. Tissue-specific eQTL were obtained from https://storage.googleapis.com/gtex_analysis_v8/single_tissue_qtl_data/GTEx_Analysis_v8_eQTL_EUR.tar.
SparsePro Supplementary Information
1 Supplementary Notes
1.1 SparsePro is not sensitive to hyperparameter K
The number of causal effects K is an important hyperparameter in statistical fine-mapping. In methods that exhaustively search through causal configurations, the computation time increases combinatorially with K since the number of candidate causal configurations also grows combinatorially. In contrast, in SparsePro, the computation time increases linearly with K. In practice, most of the computation time is spent on loading LD information, thus the computation time varies only slightly with K ∈ {5, 7, 9, 11}. The output of SparsePro is not sensitive to the choice of K as long as K is greater than or equal to the actual number of causal effects. In our simulation studies, we found that with K = 7, 9, or 11, the resulting PIPs were extremely highly correlated with those based on K = 5, and the overal AUPRC metrics were also highly consistent (Supplementary Table S2).
1.2 Modified HESS estimates for hyperparameters τy and τβ
Local heritability estimates are useful in setting hyperparameters for SparsePro. Shi et al. [22] provided an unbiased estimator for local heritability estimation based on summary statistics: where R is the LD matrix, is GWAS summary effect size, N is the sample size in the GWAS and P is the number of SNPs considered in a locus. However, this estimate requires that when generating summary statistics, both genotypes and phenotypes should be standardized to have zero mean and unit variance. Since summary statistics generated by some GWAS pipelines do not specifically standardize the genotypes and phenotypes, we modified the HESS estimator to account for the non-unit variance: where o represents element-wise multiplication and v is a P × 1 vector: for the p-th SNP with genotype vector Xp. This estimate can be adapted to directly operate on summary statistics as explained in Materials and Methods.
1.3 Full derivation of variational EM algorithm
As has been described in Materials and Methods, based on the data generative process, for the k-th causal effect, we have: with . Therefore, we have the joint probability:
The goal of fine-mapping is to infer the posterior probability, and in particular, of the sparse projection S(from here we make the dependency on hyperparameters implicit for the ease of notation):
We use a paired mean field factorized [18] variational family q(S, β) to approximate the posrior:
Note that in this variational family, we do not specify the form of the distribution; rather, we only specify the dependency of βk on sk and that all K causal effects are independent of each other. Also, the form of the variational family does not depend on any observed data.
To better approximate the posterior distribution with members of the variational family, we aim to minimize the KL divergence between the posterior distribution and the proposed variational distribution, which is equivalent to maximizing the ELBO [20]:
To maximize the above ELBO, the following requirement should be satisfied for each k: where is the expectation with respect to the variational distribution excluding the k-th component. With the joint probability provided in Equation (4) we have
Taking expectation with respect to the variational distribution excluding the k-th component and plugging in skg = 1 and sk\g = 0 for all SNPs excluding the g-th SNP, we can obtain the joint distribution of the k-th effect as: where
We recognize that
By matching sufficient statistics for this normal distribution, we can obtain the following variational parameters for updates:
By integrating out βk in Equation (5), we obtain that
Therefore, the posterior probability of the g-th SNP being causal in the k-th effect can be estimated as:
This completes the variational expectation step in our inference algorithm. When functional annotations are available, we use the following maximization step to integrate relevant annotations. After the expectation step, we have that
To maximize ELBO with respect to the relative enrichment of the m-th candidate annotation, we take partial derivatives of ELBO with respect to wm and set it to 0 to solve for wm: where
We then obtain: and solve for:
Notably, this estimate is analogous to a relative risk estimate in a 2 × 2 contigency table. Suppose we consider one annotation, then k1 corresponds to the number of variants with this specific annotation while k0 corresponds to the number of variants without the annotation. Mean-while, r0 corresponds to the sum of posterior probability for variants with the annotation while r1 corresponds to the sum of posterior probability for variants without the annotation.
Similarly, the standard error of this estimate can be calculated based on the standard error of a relative risk:
Finally, we can evaluate the statistical significance of enrichment with the log likelihood ratio test (G-test) [21].
2 Supplementary Table Legends
Supplementary Table S1 Relative enrichment of functional priors in simulation studies.
Supplementary Table S2 Comparison of fine-mapping results based on different hyperparameter settings of the number of causal effects K.
Supplementary Table S3 Details of computation time by each method.
Supplementary Table S4 Fine-mapping results for five functional biomarkers based on the UK Biobank, including genetic variants with a PIP > 0.1.
Supplementary Table S5 Relative enrichment of functional annotations for five functional biomarkers.
Supplementary Table S6 Comparison of fine-mapped SNP heritability for five functional biomarkers.
6 Acknowledgements
YL is supported by Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (RGPIN-2019-0621), Fonds de recherche Nature et technologies (FRQNT) New Career (NC-268592), and Canada First Research Excellence Fund Healthy Brains for Healthy Life (HBHL) initiative New Investigator start-up award (G249591). This study has been conducted using UK Biobank Resources under Application Number 45551. This study was enabled, in part, by support from Calcul Québec and Compute Canada. WZ has been supported by a doctoral training fellowship from the Healthy Brains, Healthy Lives Program, funded by the Canada First Research Excellence Fund (CFREF), Quebec’s Ministe`re de l’Économie et de l’Innovation (MEI), and the Fonds de recherche du Québec (FRQS, FRQSC and FRQNT).H.S.N. holds a Canada Research Chair funded by the Canadian Institutes of Health Research.