Abstract
We introduce an innovative statistical framework to optimize and benchmark polygenic risk score (PRS) models using summary statistics from genome-wide association studies. This framework builds on our previous work and can fine-tune virtually all existing PRS models while accounting for linkage disequilibrium. In addition, we provide an ensemble learning strategy named PUMA-CUBS that combines multiple PRS models into an ensemble score without requiring external data for model fitting. Through extensive simulations and analysis of many complex traits in the UK Biobank, we demonstrate that this approach closely approximates gold-standard analytical strategies based on external validation and substantially outperforms state-of-the-art PRS methods. We argue that PUMA-CUBS is a powerful and general modeling technique that can continue to incorporate the best-performing PRS methods through ensemble learning and could become an integral component of future PRS applications.
Introduction
Genetic risk prediction is a main focus in human genetics research and a key step towards precision medicine1-3. Continued success in genome-wide association studies (GWAS) in the past decade has facilitated the development of polygenic risk scores (PRS) that aggregate the effects of millions of single nucleotide polymorphisms (SNPs) for many complex traits4-6. Compared to earlier statistical methods that require individual-level data for model training7-10, PRS, which relies only on GWAS summary data, is far more broadly applicable owing to the wide availability of GWAS summary statistics. Although earlier PRS models struggled to produce accurate prediction results, recent and more sophisticated PRS methods have achieved substantially improved prediction accuracy through statistical regularization and biological data integration11-17. In numerous studies, PRS has shown promising performance in stratifying disease risk and great potential in informing early lifestyle changes or medical interventions18-21.
Despite this progress, several lingering challenges create a significant gap between PRS methodology and applications. A main recurring issue we highlight (and address) throughout the paper is that PRS modelers often assume the existence of independent individual-level datasets that can be used for additional model tuning. In practice, however, GWAS summary statistics are used for PRS model training, meaning that conventional sample-splitting schemes cannot be used. Additional datasets that are independent from both training and testing samples also rarely exist. As a result, model-tuning samples must be carved out of the precious testing dataset, which inevitably reduces sample size and statistical power in downstream applications.
This disconnect between impractical method requirements and limited data availability can lead to a variety of problems. First, many PRS methods have tuning parameters that can substantially swing model performance when not chosen properly12-15,22-24. Conventionally, these parameters need to be fine-tuned on a separate dataset with individual-level genotypes and phenotypes. Although some recent methods employ fully Bayesian or empirical Bayesian techniques to bypass model fine-tuning25-27, these hyperparameter-free PRS do not always outperform fine-tuned models, trading predictive accuracy for computational feasibility28,29. Second, no PRS method universally outperforms all other approaches. The empirical performance of a PRS model depends on the GWAS sample size, the genetic architecture of the phenotype, the quality of GWAS summary statistics, and heterogeneity between training and testing samples30-33. Thus, it is of great interest to systematically and impartially benchmark various PRS methods for each trait, ideally in an independent dataset11,30,34. Third, several recent studies have applied ensemble learning, which combines multiple PRS models via a second regression28,29. This brute-force approach has shown superior performance compared to any single PRS method but is data-demanding: the second-level regression model needs to be fit on a separate dataset. Finally, it may be of interest to combine all these tasks in practice, e.g., benchmarking an ensemble learner that combines multiple PRS models which all need to be tuned separately. With the data available in practice, this quickly becomes an impossible task.
In this paper, we seek a solution to these problems. We base our statistical framework on PUMAS, a method we recently introduced to perform Monte Carlo cross-validation (MCCV) using GWAS summary statistics35. We have shown that PUMAS can effectively fine-tune PRS models with clumped SNPs36, and the approach has since been adopted in other applications37-39. Here, we first demonstrate that PUMAS can fine-tune and benchmark state-of-the-art PRS models without SNP pruning. Second, we introduce an extension of the PUMAS framework named PUMA-CUBS, a strategy for performing ensemble learning using GWAS summary data alone. Together, these tools form a statistical framework for fine-tuning, benchmarking, and combining PRS models using GWAS summary statistics as input. We demonstrate the performance of our approach through extensive simulations and analysis of 19 complex traits in UK Biobank (UKB). On average, the PUMA-CUBS ensemble PRS achieves a 6.54% relative gain in predictive R2 compared to LDpred2 and a 15.00% gain compared to PRS-CS. We also apply our method to 31 well-powered GWAS with publicly available summary statistics and provide a catalog of ensemble PRS with benchmarked predictive performance.
Results
Method Overview
First, we present an overview of the PUMA-CUBS workflow. Statistical details and technical discussions are presented in the Methods section. For illustration, first suppose individual-level data is available. In this case, we would divide the samples into 4 independent sets for PRS training, model fine-tuning, constructing the ensemble PRS, and benchmarking model performance, respectively (Figure 1A). The main goal of our new approach is to mimic this procedure when only summary statistics are available. Using PUMAS, we can sample marginal association statistics for a subset of individuals in the GWAS35. Doing this repeatedly, we can divide the full GWAS summary data into corresponding training, tuning, ensemble learning, and testing summary statistics (Figure 1B). Using these four sets of sub-sampled summary statistics, we train a series of PRS models, fine-tune each PRS model to select the best tuning parameters, apply PUMA-CUBS to combine PRS models through linear regression, and finally evaluate the predictive performance of the PRS models. The entire procedure only requires GWAS summary statistics and linkage disequilibrium (LD) references as input.
Figure 1. (A) The conventional approach divides the entire individual-level dataset into different subsets of samples for each of the 4 stages of PRS analysis. (B) PUMA-CUBS directly partitions the full summary-level data into corresponding summary statistics for the same analytical purposes.
Simulation results
We performed simulations using imputed genotype data from UKB to demonstrate that PUMAS and PUMA-CUBS can fine-tune, combine, and benchmark PRS models. We included 100,000 independent individuals of European descent and 944,547 HapMap3 SNPs in the analysis. We simulated phenotypes with heritability of 0.2, 0.5, and 0.8 and randomly assigned causal variants under sparse and polygenic settings to mimic different types of genetic architecture (Methods). We performed GWAS and obtained marginal association statistics. We then implemented PUMAS and PUMA-CUBS to conduct a 4-fold MCCV to train, optimize, and evaluate lassosum, PRS-CS, LDpred2, and an ensemble PRS that combines all three methods22,25,26. For comparison, we also implemented an MCCV procedure using individual-level UKB data. We partitioned the UKB dataset into 4 mutually exclusive datasets. We used datasets 1 and 2 to train and fine-tune each PRS method, then used the third dataset to fit a regression to combine multiple PRS. We evaluated each PRS method in the fourth dataset and reported PRS prediction accuracy quantified by R2. We describe implementation details of both summary-statistics-based and individual-level-data-based MCCV in Methods.
Overall, we observed highly consistent results between PUMAS/PUMA-CUBS and MCCV for both quantitative and binary phenotypes (Figure 2; Supplementary Figures 1-7; Supplementary Tables 1-4). In addition, the summary-statistics-based approaches closely approximated R2 values obtained from model-tuning and benchmarking techniques using individual-level data. PUMA-CUBS also constructed scores highly concordant with ensemble PRS built from individual-level data, and these ensemble scores universally outperformed all input PRS models.
Figure 2. (A and C) Simulation results for quantitative traits. (B and D) Simulation results for binary traits with balanced case-control ratio. Proportion of causal variants is 0.1% in A and B, and 20% in C and D. The heritability is set to 0.5 in all panels. Y-axis: predictive R2 across 4 repeats of MCCV; X-axis (left to right): lassosum models (red boxes) with tuning parameter settings s=0.2 and λ=0.005, s=0.2 and λ=0.01, s=0.5 and λ=0.005, s=0.5 and λ=0.01, s=0.9 and λ=0.005, s=0.9 and λ=0.01; LDpred2 models (green boxes): non-infinitesimal with p=0.1, non-infinitesimal with p=0.01, non-infinitesimal with p=0.001, non-infinitesimal with p=auto, and the infinitesimal model; PRS-CS (blue boxes): ϕ=0.01, 0.0001, and auto. Finally, the purple box shows the results of the ensemble PRS. Results for the remaining simulation settings are summarized in Supplementary Figures 1-7 and Supplementary Tables 1-4.
PUMAS can fine-tune and benchmark PRS methods
Next, we demonstrate that PUMAS effectively fine-tunes PRS models and agrees with the gold-standard external-validation approach based on individual-level data. We applied PUMAS to 16 quantitative traits and 3 diseases in UKB (Supplementary Tables 5-6). After quality control, the UKB dataset contained 375,064 independent individuals and 1,030,187 SNPs (Methods). We applied a 9-to-1 data split to hold out 10% of the samples for external validation, and performed GWAS for all traits using the remaining 90% of the samples. We applied the 4-fold MCCV implemented in PUMAS to train and fine-tune three PRS models (i.e., LDpred2, lassosum, and PRS-CS, which achieved high prediction accuracy in a recent benchmark study22,25,26,29) using only summary statistics. For external validation, we trained PRS models using summary statistics and calculated PRS prediction accuracy on the holdout dataset. We report the best tuning parameters for LDpred2, lassosum, and PRS-CS and the corresponding R2 obtained from both PUMAS and external validation.
Our summary-statistics-based approach showed highly consistent model-tuning performance for all analyzed traits compared to external validation (Figure 3; Supplementary Figures 8-26; Supplementary Tables 7-8). Among the 19 traits, PUMAS and external validation selected the same best tuning parameters 19, 17, and 11 times for lassosum, LDpred2, and PRS-CS, respectively. When the model-tuning results differed between PUMAS and external validation, both approaches still selected models with very similar prediction accuracy. In addition, PUMAS provided precise R2 estimates for all models compared to external validation, supporting the use of our summary-statistics-based approach for PRS model benchmarking.
Figure 3. Four panels show the model-tuning results for (A) height, (B) monocyte count, (C) coronary artery disease, and (D) high blood pressure. Y-axis: average predictive R2 across 4-fold replications from PUMAS; X-axis: predictive R2 evaluated by external validation on the holdout dataset. Each data point represents a PRS model with different tuning parameters, and the shape of the data point indicates the PRS method: LDpred2, PRS-CS, or lassosum. The best tuning parameter setting suggested by PUMAS for each PRS method is highlighted and colored. The dashed red line is the regression line fitted between PRS R2 from PUMAS and from external validation. Pearson correlations between the two sets of results are shown in each panel. Detailed model-tuning results for all 19 traits are summarized in Supplementary Tables 7-8 and Supplementary Figures 8-26.
We also observed that the parameter-tuning results are consistent with the analyzed traits' genetic architecture. For both height and monocyte count, PUMAS accurately selected the best tuning parameters based on external validation (Figure 3A-B), but the selected models were not the same between these two traits. Height is known to be extremely polygenic, with more than 12,000 independent GWAS signals in the latest GWAS40. In comparison, fewer loci have been found to significantly associate with monocyte count41. Our model-tuning results suggest that polygenic prediction models fit best for height (e.g., LDpred2-Infinitesimal and PRS-CS with ϕ = 0.01), while sparser PRS models with stronger regularization (e.g., PRS-CS with ϕ = 0.0001) provide better prediction accuracy for monocyte count.
Finally, PUMAS can also effectively estimate predictive R2 for binary traits (Figure 3C-D). To calculate interpretable R2 for binary outcomes, PUMAS first transforms GWAS summary statistics obtained from logistic regressions to the linear regression scale, and then computes R2 on the observed scale42-44. To show that this transformation is valid, we trained two sets of PRS models, using both transformed and original logistic regression summary statistics, for the 3 disease traits and observed nearly identical PRS performance between the two approaches (Supplementary Table 8; Supplementary Figures 24-26). Details of the implementation of the binary trait analysis and the summary statistics transformation are presented in Methods.
Ensemble learning via PUMA-CUBS substantially improves PRS prediction accuracy
Here we apply PUMA-CUBS, the ensemble learning extension of PUMAS, to UKB traits and show that the ensemble PRS has superior prediction accuracy compared to each input PRS method and that our summary-statistics-based approach is comparable to ensemble learning based on individual-level data. We constructed linearly combined scores of lassosum, PRS-CS, and LDpred2. Using individual-level data, we split the 10% UKB holdout dataset into two equally sized subsets. We fitted a multiple regression on the first holdout set to aggregate the best-performing PRS models trained and tuned from GWAS summary statistics, and then evaluated the ensemble score's prediction accuracy on the second holdout set. For comparison, we implemented PUMA-CUBS to conduct a 4-fold MCCV to perform ensemble learning using summary statistics alone and assessed its performance on the second holdout set.
Our approach showed almost identical performance to the individual-level data results (Figure 4A), showcasing PUMA-CUBS' ability to benchmark and construct ensemble PRS without requiring additional datasets. In addition, the ensemble PRS achieved the highest prediction accuracy for all traits compared with the three input PRS models (Supplementary Figure 27; Supplementary Table 9). The ensemble PRS using individual-level data as input had average relative gains in R2 of 15.77% and 7.25% compared to PRS-CS-auto and LDpred2-auto, respectively, while the PUMA-CUBS ensemble PRS delivered similar increases of 15.00% and 6.54% (Figure 4B), highlighting the substantial gain in prediction accuracy from ensemble learning.
Figure 4. (A) Comparing two sets of ensemble PRS obtained from PUMA-CUBS and from individual-level data. The gray dashed line is the diagonal line. (B) Comparing the ensemble PRS with the input PRS methods. Y-axis: relative percentage increase in R2 compared to PRS-CS-auto; X-axis: 4 sets of PRS models, including the best single PRS suggested by PUMAS, the best single PRS selected based on the first individual-level holdout set, the ensemble PRS obtained from PUMA-CUBS, and the ensemble PRS trained from individual-level data. All R2 values were computed using the second half of the holdout dataset.
Figure 5. Y-axis: average predictive R2 of the PUMA-CUBS ensemble PRS; X-axis: heritability estimates from LD score regression45. Size of data points indicates the effective sample size of each GWAS. Binary traits and continuous traits are highlighted with different colors. Detailed PRS benchmark results are presented in Supplementary Table 12.
Constructing and benchmarking ensemble PRS for 31 complex traits
Finally, we applied PUMA-CUBS to provide a comprehensive catalog of ensemble PRS for 31 sets of publicly available GWAS summary statistics with varying sample sizes and genetic architectures. The detailed information and selection criteria for the GWAS summary-level data are summarized in Methods and Supplementary Table 10. We employed extensive quality controls to pinpoint and calibrate misspecifications in GWAS summary statistics following a recent study31 (Supplementary Table 11). We also transformed logistic summary statistics to the linear scale to produce interpretable R2 for binary traits42-44. For each trait, we report the prediction accuracy of the best-performing PRS model and the ensemble PRS. The full results of the PRS catalog are presented in Supplementary Table 12. The predictive performance of the ensemble PRS is correlated with estimated trait heritability, and the predictive R2 ranged from 0.001 to 0.227 across the 31 traits, demonstrating highly diverse performance of genetic risk prediction. We also note that the ensemble PRS improved predictive R2 for every trait in the analysis, with an average increase of 31.36% compared to PRS-CS-auto (Supplementary Figure 28). Among the 31 complex diseases and traits, we observed the highest prediction improvement for rheumatoid arthritis (103.8%), Alzheimer's disease (71.96%, 83.68%, and 98.82% on three datasets), ischaemic stroke (69.48% and 78.35% on two datasets), and Parkinson's disease (75.55%).
Another observation is that the ensemble PRS R2 exceeded the estimated trait heritability for all three Alzheimer's disease GWAS. To demonstrate that this is not an artifact of overestimating predictive R2, we conducted additional analysis (Methods) using IGAP 2019 Alzheimer's GWAS summary statistics46 and compared our results with external validation based on 2,600 Alzheimer's disease cases and 5,200 healthy controls in UKB (Supplementary Table 13). The R2 of the AD PRS obtained from external validation also exceeded the estimated heritability (h2=0.072, SE=0.012), and the results were consistent with the PUMAS R2 estimation (Supplementary Figure 29; Supplementary Table 14). We hypothesized that this is driven by the APOE region, which contributes an unusually large fraction of AD risk47-49. Indeed, after removing 383 SNPs in the APOE region from the IGAP 2019 AD summary statistics (Methods), we observed a steep decline in R2 for both external validation and PUMAS. Both R2 values became substantially lower than the estimated h2 of 0.066 without the APOE region (SE=0.009; Supplementary Table 14).
Discussion
Fine-tuning and benchmarking PRS models are challenging tasks due to the need for external individual-level datasets that are independent from the input GWAS. In this work, we extended our PUMAS approach to incorporate LD and fine-tune state-of-the-art PRS methods. In both simulations and analysis of UKB traits, we observed high concordance between PUMAS and results based on external validation using holdout samples. In addition, we presented a novel framework named PUMA-CUBS to perform ensemble learning and create combined PRS using only GWAS summary statistics. We showed that the ensemble PRS created by PUMA-CUBS closely approximates scores built from holdout samples. Further, these ensemble scores substantially outperformed state-of-the-art PRS methods for all complex traits we analyzed in the study. Finally, we applied PUMA-CUBS to a collection of publicly available GWAS summary statistics and provided a comprehensive catalog of benchmarked and optimized PRS.
Our work presents several major advances that will impact future PRS applications. First, our method fills an important gap between PRS methodological research and its real-world applications. Currently, many PRS methods still have tuning parameters, and grid search on external individual-level datasets remains the most common technique for fine-tuning these models. In practice, such data are often impossible to obtain or must be split from testing samples, which hurts statistical power in PRS applications32. Our method provides a universal solution to PRS model fine-tuning. Second, model benchmarking is another major challenge in the field, which conventionally relies on external validation data. Comprehensive and unbiased benchmarking allows researchers to compare the effectiveness of different PRS methods for particular traits of interest and, importantly, to estimate PRS predictive accuracy without using testing samples. We note that although some advanced PRS approaches no longer require model fine-tuning, no existing method can benchmark model performance using a single set of GWAS summary data, which is crucial for model selection, power calculation, and study design. Our approach now provides a solution to this problem. Third, the ensemble learning approach, which combines multiple predictive models through a second-level regression, has been viewed as a highly effective but data-demanding approach28,29,33. A major advance in this study is the introduction of PUMA-CUBS, which allows ensemble learning on GWAS summary statistics. This approach not only showcased a substantial gain over existing PRS methods, but is also generally applicable to future PRS developments. If a future PRS approach shows promising improvements compared to older methods, that new approach can also be incorporated into the ensemble PRS. In our view, PUMA-CUBS is not a competitor to any existing PRS model, but instead a flexible and general modeling technique that combines the best-performing methods available and should be applied in all future PRS applications.
Our study has several limitations. First, we constrained statistical analysis in this study to the European ancestral population. PRS prediction accuracy is known to transfer poorly to non-European populations, which could exacerbate the disparity in genomic medicine between ancestral groups50,51. It is an important future direction to systematically optimize and benchmark PRS for diverse ancestral populations, which would require incorporating multiple sets of ancestry-specific GWAS and LD references. Although we did not explore this topic in this paper, our recent work introduced parallel ideas to tackle the challenges in multi-ancestry genetic risk prediction39. Second, analyses in this study were limited to GWAS summary statistics computed from independent samples. It remains to be investigated whether these approaches would be affected if the input GWAS summary statistics were obtained from linear mixed models with related samples or family-based designs52-54. Future work will focus on developing statistical methods to correct for sample relatedness or on demonstrating robustness to these issues. That said, we expect PRS model-tuning to remain valid even with sample relatedness, since any inflation in R2 should be uniform across tuning parameter settings, although biases may be introduced into the predictive R2 itself, which could affect benchmarking efforts. Third, our current analyses focused only on lassosum, PRS-CS, and LDpred2. While this suffices as a proof of concept for the superiority of ensemble PRS, more PRS methods need to be jointly modeled and evaluated in the future, including scores that leverage auxiliary information such as functional annotations13,14 or multiple phenotypes15,17,55. Finally, collinearity among PRS models could arise when using multiple regression to combine a large number of scores, since some PRS methods tend to yield similar results. Therefore, another future direction is to incorporate variable selection strategies, which could also involve penalized regression, into our ensemble learning framework.
To sum up, we presented a statistical framework to fine-tune, combine, and benchmark PRS methods using only GWAS summary statistics. This is a statistically novel and computationally efficient approach with a flexible implementation that can handle a variety of applications. We demonstrated its performance through careful and comprehensive analyses, and we argue that this framework offers innovative and generally applicable features that should become the default in many future PRS studies.
Methods
Sampling distribution of summary statistics
We adopt a commonly used linear model framework to quantify the relationship between a quantitative trait and SNP genotypes:

Y = X\beta + \epsilon

Here, Y denotes the trait, X = (X_1, ..., X_p) denotes the genotypes of p SNPs, β ∈ ℝ^p denotes their true effect sizes, and ε denotes the random error that is independent of X and follows a normal distribution with mean zero and some variance σ²_ε. Let y and x = (x_1, ..., x_p) denote the observed values of Y and X for N independent individuals. For simplicity, we assume both y and x_j (j = 1, ..., p) are centered. Then, GWAS summary statistics can be denoted as:

\hat{\beta}_j = \frac{x_j^\top y}{x_j^\top x_j} \quad (1)

\widehat{\mathrm{SE}}(\hat{\beta}_j)^2 = \frac{\hat{e}_j^\top \hat{e}_j}{(N-2)\, x_j^\top x_j} \quad (2)
where ê_j = y − x_j β̂_j are the residuals from the marginal linear regression between the trait and the j-th SNP. To train, fine-tune, combine, and benchmark PRS models, independent datasets are required to avoid overfitting. We have previously proposed a flexible statistical framework to generate training and fine-tuning datasets when only GWAS summary statistics are available35. Here, we generalize this statistical framework in two directions. First, we allow our method to incorporate LD information. We note that this extension is similar to some recent work built on our initial PUMAS paper37,39. Second, we allow the method to partition the full GWAS summary statistics into more than two datasets for various analytical purposes. Let y^{(s)} and x^{(s)} denote the phenotype and genotype data for an arbitrary subset of the N individuals, with sample size N^{(s)}. When N is large enough, we have previously shown by the central limit theorem35 that, approximately,

x^{(s)\top} y^{(s)} \sim N\!\left( N^{(s)}\,\mathrm{E}(X^\top Y),\; N^{(s)}\,\mathrm{Var}(X^\top Y) \right)

where X^T Y = (X_1Y, ..., X_pY)^T. Then, given the observed summary-level data from the GWAS, the conditional distribution of the summary statistics of a subset of GWAS samples is

x^{(s)\top} y^{(s)} \,\big|\, x^\top y \;\sim\; N\!\left( \frac{N^{(s)}}{N}\, x^\top y,\; N^{(s)}\Big(1 - \frac{N^{(s)}}{N}\Big)\, \hat{\Sigma} \right) \quad (3)
where Σ̂ is the observed variance-covariance matrix of X^T Y. To subsample summary statistics x^{(s)T} y^{(s)}, we first need to estimate x^T y and Σ. Recalling formula (1) for the marginal regression coefficient, x_j^T y can be calculated as β̂_j x_j^T x_j, where x_j^T x_j is proportional to the SNP variance and can be estimated from the minor allele frequency (MAF) reported by the GWAS or imputed from an LD reference panel. Deriving Σ is more complicated, and we discuss how Σ̂ is estimated using summary statistics and an LD reference panel in the following section.
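To make the subsampling step concrete, the sketch below draws x^{(s)T} y^{(s)} from the conditional distribution in formula (3) for a single LD block. It is a minimal illustration assuming the block-level inputs (x^T y, Σ̂, sample sizes) have already been computed; the function and variable names are ours, not from the PUMAS software.

```python
import numpy as np

def subsample_xty(xty, Sigma, N, N_s, rng):
    """Draw x^(s)T y^(s) | x^T y ~ N((N_s/N) x^T y, N_s (1 - N_s/N) Sigma),
    i.e., formula (3), for one LD block."""
    mean = (N_s / N) * xty
    cov = N_s * (1.0 - N_s / N) * Sigma
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(42)
xty_full = np.array([120.0, -80.0, 35.0])   # toy x^T y for a 3-SNP block
Sigma_hat = np.diag([1.0, 1.2, 0.9])        # independent-SNP special case
xty_sub = subsample_xty(xty_full, Sigma_hat, N=100_000, N_s=60_000, rng=rng)
xty_rest = xty_full - xty_sub               # summary data for remaining samples
```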
Estimate variance-covariance matrix of summary statistics
Let D denote the SNP correlation matrix and d_jk denote the correlation between the j-th and the k-th SNPs. Let Σ be the true covariance matrix of the summary statistics, with diagonal and off-diagonal elements denoted as Σ_j and Σ_jk, respectively. For convenience, we write Y = Xβ + ε = X_1β_1 + ... + X_pβ_p + ε = X_jβ_j + ε_j, where ε_j = Σ_{i≠j} X_iβ_i + ε. Then the diagonal terms of Σ can be written as

\Sigma_j = \mathrm{Var}(X_j Y) = \beta_j^2\,\mathrm{Var}(X_j^2) + \mathrm{Var}(X_j \epsilon_j) + 2\beta_j\,\mathrm{Cov}(X_j^2,\, X_j \epsilon_j)

We partition all SNPs in the genome into 2 sets. Let S_1 be the index set containing all SNPs that are independent of the j-th SNP, and let S_2 be the set of all remaining SNPs that are in LD with the j-th SNP. Then we can further expand Σ_j as

\Sigma_j = \beta_j^2\,\mathrm{Var}(X_j^2) + \mathrm{Var}\Big(X_j \sum_{i \in S_1} X_i\beta_i + X_j \sum_{i \in S_2} X_i\beta_i + X_j\epsilon\Big) + 2\beta_j\,\mathrm{Cov}\Big(X_j^2,\; X_j \sum_{i \in S_1} X_i\beta_i + X_j \sum_{i \in S_2} X_i\beta_i + X_j\epsilon\Big)

We can simplify Σ_j based on two commonly made assumptions. First, any given SNP should be in linkage equilibrium with the vast majority of SNPs in the genome, so we can safely assert |S_1| ≫ |S_2|. Second, each individual SNP's effect on the phenotype is typically so small that products of any effect sizes are negligible in practice. Taken together, we can reduce the expansion of Σ_j by discarding SNPs in S_2, which eventually allows us to treat X_j and ε_j as independent in practice:

\Sigma_j \approx \mathrm{Var}(X_j)\,\mathrm{Var}(\epsilon_j) = \sigma^2_{X_j}\,\sigma^2_{\epsilon_j}
Note that σ²_{X_j} can be easily approximated using an MAF-based estimator, denoted as σ̂²_{X_j}, that may be obtained either from the full GWAS summary statistics or from the LD reference data. For σ²_{ε_j}, we can estimate its value from the standard error of the effect size estimate in the GWAS summary data using formula (2). In this way we obtain an estimator of Σ_j:

\hat{\Sigma}_j = \hat{\sigma}^2_{X_j}\,\hat{\sigma}^2_{\epsilon_j} \quad (4)
To estimate the off-diagonal terms Σ_jk, we now write Y = Xβ + ε = X_1β_1 + ... + X_pβ_p + ε = X_jβ_j + X_kβ_k + ε_jk, where ε_jk = Σ_{i∉{j,k}} X_iβ_i + ε. Under the same assumption that the magnitude of SNP effects is very small, we can simplify Σ_jk as:

\Sigma_{jk} = \mathrm{Cov}(X_j Y,\, X_k Y) \approx \mathrm{Cov}(X_j \epsilon_{jk},\, X_k \epsilon_{jk})

In a similar fashion, we further partition all SNPs in the genome other than the j-th and the k-th SNPs into two sets. Let S_3 denote the collection of SNPs that are independent of both the j-th and the k-th SNPs, and let S_4 include the remaining SNPs that are in LD with either the j-th or the k-th SNP. Based on a similar rationale, we can safely assume that |S_3| ≫ |S_4|. Then, by ignoring SNPs in S_4 and thus treating X_j and X_k as independent of ε_jk, we express Σ_jk as:

\Sigma_{jk} \approx \mathrm{E}(X_j X_k)\,\mathrm{Var}(\epsilon_{jk})

where E(X_jX_k) can be directly estimated from the LD correlation matrix and the MAF-based SNP variance estimator, i.e., Ê(X_jX_k) = d_jk σ̂_{X_j} σ̂_{X_k}. The term Var(ε_jk) = σ²_{ε_jk} is the residual variance from a two-SNP regression model and should be smaller than both σ²_{ε_j} and σ²_{ε_k}. In practice, we can approximate it by the smaller of σ̂²_{ε_j} and σ̂²_{ε_k}. Therefore, the numerical approximation of Σ_jk becomes

\hat{\Sigma}_{jk} = d_{jk}\,\hat{\sigma}_{X_j}\,\hat{\sigma}_{X_k}\,\min\!\big(\hat{\sigma}^2_{\epsilon_j},\, \hat{\sigma}^2_{\epsilon_k}\big) \quad (5)
Now we can generate summary statistics x^{(s)T} y^{(s)} from the multivariate normal distribution in formula (3). Note that our earlier subsampling framework is a special case in which SNPs are independent; its only difference from the current method is the estimation of Σ. In the next section we discuss how to subsample summary statistics efficiently from a multivariate normal distribution.
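As an illustration, the sketch below assembles Σ̂ for one LD block from the two estimators above. All inputs are assumed precomputed: a reference LD correlation matrix D, MAF-based SNP variances, and per-SNP residual variances recovered from GWAS standard errors via formula (2); the names are illustrative.

```python
import numpy as np

def estimate_sigma_block(D, sigma2_x, sigma2_eps):
    """Build Sigma_hat for one LD block. Off-diagonal entries follow
    formula (5): d_jk * sd_j * sd_k * min(eps_j, eps_k); since d_jj = 1
    and min(eps_j, eps_j) = eps_j, the diagonal reduces to formula (4)."""
    sd_x = np.sqrt(sigma2_x)
    eps_min = np.minimum.outer(sigma2_eps, sigma2_eps)
    return D * np.outer(sd_x, sd_x) * eps_min

# sigma2_eps can itself be recovered from formula (2), e.g.
# sigma2_eps = N * sigma2_x * se**2 for GWAS standard errors `se`.
```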
Strategy for subsampling summary statistics
Next, we discuss how to partition the full GWAS summary statistics into K independent subsets of GWAS samples, denoted as x^{(1)T} y^{(1)}, ..., x^{(K)T} y^{(K)}, for K > 2. When K = 2, formula (3) can be directly applied to divide the GWAS summary statistics into two independent sets. Otherwise, let N^{(1)}, ..., N^{(K)} denote the corresponding sample sizes for each subset of individuals, with N^{(1)} + ... + N^{(K)} = N. By formula (3), we can subsample x^{(1)T} y^{(1)} from the x^T y observed in the complete GWAS summary data. After that, we calculate the summary statistics excluding the N^{(1)} individuals in the first subset as x^{(-1)T} y^{(-1)} = x^T y − x^{(1)T} y^{(1)}. To generate summary statistics for any following subset numbered t + 1 (i.e., x^{(t+1)T} y^{(t+1)}) for t = 1, ..., K − 2, we update the conditional distribution in (3) with the new "full" GWAS summary statistics and the corresponding total sample size:

x^{(t+1)\top} y^{(t+1)} \,\big|\, x^{(-1:t)\top} y^{(-1:t)} \;\sim\; N\!\left( \frac{N^{(t+1)}}{N - \sum_{s=1}^{t} N^{(s)}}\; x^{(-1:t)\top} y^{(-1:t)},\;\; N^{(t+1)}\Big(1 - \frac{N^{(t+1)}}{N - \sum_{s=1}^{t} N^{(s)}}\Big)\, \hat{\Sigma} \right) \quad (6)

where x^{(-1:t)T} y^{(-1:t)} = x^T y − Σ_{s=1}^{t} x^{(s)T} y^{(s)} represents the summary statistics excluding the first t subsets of individuals. This subsampling strategy guarantees that the subsets are independent of each other and avoids overfitting when K > 2. Finally, for the last subset K, we can directly calculate its summary statistics as x^{(K)T} y^{(K)} = x^T y − Σ_{t=1}^{K−1} x^{(t)T} y^{(t)}. Together, this is a flexible framework for generating summary statistics and can be used for various types of PRS analyses, as we discuss in later sections.
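A sketch of this K-way partition is shown below, with per-block inputs as before: subsets 1 through K−1 are drawn sequentially from the conditional distribution in formula (6), and the last subset is obtained by subtraction. The names are illustrative, not from the PUMAS software.

```python
import numpy as np

def partition_xty(xty, Sigma, N, subset_sizes, rng):
    """Split x^T y into len(subset_sizes) independent parts that sum to
    the input, following formula (6)."""
    assert sum(subset_sizes) == N
    remaining_xty = xty.astype(float).copy()
    remaining_N = N
    parts = []
    for N_t in subset_sizes[:-1]:
        frac = N_t / remaining_N
        draw = rng.multivariate_normal(frac * remaining_xty,
                                       N_t * (1.0 - frac) * Sigma)
        parts.append(draw)
        remaining_xty -= draw        # exclude the subsets drawn so far
        remaining_N -= N_t
    parts.append(remaining_xty)      # last subset: direct subtraction
    return parts

# e.g., a 60/20/10/10 split for training/tuning/ensemble-training/testing:
# tr, tune, etr, test = partition_xty(xty, Sigma_hat, 100_000,
#                                     [60_000, 20_000, 10_000, 10_000], rng)
```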
It is difficult to subsample summary statistics for all SNPs in the genome simultaneously, given the large dimension of genotyped and imputed data. Even if PRS modeling is restricted to HapMap3 SNPs, it remains challenging to subsample more than one million SNPs altogether26. To generate data efficiently, we partition the whole genome into approximately independent LD blocks and subsample summary statistics for the SNPs in each LD block separately56,57. Then Σ̂ becomes a sparse block-diagonal matrix, i.e., Σ̂ = diag(Σ̂_1, ..., Σ̂_B), where Σ̂_b is the covariance block for the b-th LD block. Within each LD block, the empirical SNP correlation matrix may not be positive-definite, making it impossible to randomly generate data for that block. A straightforward remedy is to conduct an eigendecomposition of any Σ̂_b that is not positive semi-definite, manually set the negative eigenvalues to 0, and obtain an approximation of Σ̂_b that is positive semi-definite. Note that this may not be the best approach, and other methods for estimating LD blocks can also be applied58,59.
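The eigenvalue fix described above amounts to a few lines; a minimal sketch:

```python
import numpy as np

def project_to_psd(Sigma_block):
    """Clip negative eigenvalues at zero so the block covariance is
    positive semi-definite and can be used for random sampling."""
    vals, vecs = np.linalg.eigh(Sigma_block)  # symmetric eigendecomposition
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T
```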
Evaluate predictive performance of PRS
Here, we generalize the summary-statistics-based PRS evaluation scheme proposed in our previous work to incorporate LD. We denote the PRS as a weighted sum of allele counts across many SNPs:

\hat{Y} = \sum_{j=1}^{p} \omega_j X_j = X\omega

where ω ∈ ℝ^p is a vector of SNP weights, which can be marginal regression coefficients from GWAS or post-hoc effect size estimates. If individual-level data is available, then R² evaluated on any holdout dataset (y^{(s)}, x^{(s)}) can be calculated as

R^2 = \frac{\Big(\sum_i \big(y_i^{(s)} - \bar{y}^{(s)}\big)\big(\hat{y}_i - \bar{\hat{y}}^{(s)}\big)\Big)^2}{\sum_i \big(y_i^{(s)} - \bar{y}^{(s)}\big)^2 \; \sum_i \big(\hat{y}_i - \bar{\hat{y}}^{(s)}\big)^2}

where ŷ_i is the PRS of the i-th person, ȳ^{(s)} is the mean phenotypic value, and \bar{\hat{y}}^{(s)} is the mean PRS value in holdout dataset s. On the other hand, we have shown that when only summary statistics of the holdout dataset are available and SNPs are independent, R² can be approximated by35

R^2 \approx \frac{\Big(\frac{1}{N^{(s)}}\,\omega^\top x^{(s)\top} y^{(s)}\Big)^2}{\widehat{\mathrm{Var}}\big(y^{(s)}\big)\, \sum_j \omega_j^2\, \hat{\sigma}^2_{X_j}}

given that x^{(s)}, y^{(s)}, and ŷ^{(s)} are centered. In practice, we use the 90% quantile of the per-SNP variance estimates to obtain a robust estimate of Var(y^{(s)}). When LD is present, the approximations for Cov(y^{(s)}, ŷ^{(s)}) and Var(y^{(s)}) remain the same. Var(ŷ^{(s)}) can now be approximated by ω^T Var(x^{(s)}) ω, with Var(x^{(s)}) estimated using the LD correlation matrix and MAF calculated from the reference panel. Taken together, we have

R^2 \approx \frac{\Big(\frac{1}{N^{(s)}}\,\omega^\top x^{(s)\top} y^{(s)}\Big)^2}{\widehat{\mathrm{Var}}\big(y^{(s)}\big)\;\omega^\top\, \widehat{\mathrm{Var}}\big(x^{(s)}\big)\,\omega} \quad (7)

Note that similar versions of this formula have been tested and applied in the literature22,37,38. In practice, we can directly calculate the PRS on the LD reference genotype data and use the sample variance of the PRS to replace ω^T Var(x^{(s)}) ω for optimal computational efficiency.
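A sketch of formula (7), assuming the PRS weights, the subsampled x^{(s)T} y^{(s)}, a robust estimate of Var(y^{(s)}), and a reference-panel SNP covariance matrix are available; names are illustrative.

```python
import numpy as np

def summary_r2(omega, xty_s, N_s, var_y, V_x):
    """Approximate predictive R^2 from summary statistics (formula (7))."""
    cov_y_prs = float(omega @ xty_s) / N_s   # ~ Cov(y, y_hat)
    var_prs = float(omega @ V_x @ omega)     # ~ Var(y_hat) = w' Var(x) w
    return cov_y_prs ** 2 / (var_y * var_prs)
```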
The PUMAS framework
Given the flexible framework introduced above for subsampling GWAS summary data and evaluating PRS based on summary statistics, PUMAS becomes a special case in which the entire GWAS summary-level data is partitioned into a training and a tuning dataset, denoted as x^{(tr)T} y^{(tr)} and x^{(t)T} y^{(t)}, respectively. PUMAS first draws x^{(tr)T} y^{(tr)} from (3) and then calculates x^{(t)T} y^{(t)} as x^{(t)T} y^{(t)} = x^T y − x^{(tr)T} y^{(tr)}. For each SNP, the marginal effect size and its standard error in the training set can be calculated as

\hat{\beta}_j^{(tr)} = \frac{x_j^{(tr)\top} y^{(tr)}}{N^{(tr)}\,\hat{\sigma}^2_{X_j}}, \qquad \widehat{\mathrm{SE}}\big(\hat{\beta}_j^{(tr)}\big)^2 = \frac{\hat{\sigma}^2_{\epsilon_j}}{N^{(tr)}\,\hat{\sigma}^2_{X_j}}

These summary statistics from the training dataset can then be used to train any PRS method that takes GWAS summary statistics as input. The R² of the PRS model assessed on the fine-tuning dataset can be approximated by replacing x^{(s)T} y^{(s)} with x^{(t)T} y^{(t)} and changing the corresponding sample size in formula (7). This procedure can be repeated k times to implement a k-fold Monte Carlo cross-validation (MCCV) and select the best-performing tuning parameter. When a PRS framework has a set of tuning parameters λ, that is, when the SNP weights take the form ω = ω(λ), PUMAS chooses the optimal tuning parameter λ* by

\lambda^* = \arg\max_{\lambda}\; \overline{R^2_{\lambda}}

where \overline{R^2_{\lambda}} denotes the mean R²_λ across the k-fold MCCV. This cross-validation technique also applies to models that are hyperparameter-free or fine-tuned in advance. When the goal is to pick the best PRS model among a total of M PRS methods, the best model m* can be selected by

m^* = \arg\max_{m}\; \overline{R^2_{m,\lambda_m^*}}

where λ*_m is the best tuning parameter for PRS framework m.
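In code, the selection rule is a simple arg-max over fold-averaged R²; a toy sketch with illustrative inputs:

```python
import numpy as np

def select_best(r2_by_model):
    """Pick the tuning parameter (or PRS method) with the largest
    mean summary-based R^2 across the k MCCV folds."""
    return max(r2_by_model, key=lambda m: float(np.mean(r2_by_model[m])))

best = select_best({"PRS-CS phi=0.01":   [0.11, 0.10, 0.12, 0.11],
                    "PRS-CS phi=0.0001": [0.13, 0.12, 0.14, 0.13]})
# -> "PRS-CS phi=0.0001"
```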
Combining multiple PRS with PUMA-CUBS
Next, we introduce PUMA-CUBS, an extension of PUMAS that applies ensemble learning to combine multiple PRS using GWAS summary statistics. To do this, PUMA-CUBS further partitions the full GWAS association results into 4 independent sets of summary statistics corresponding to training (x^{(tr)T} y^{(tr)}), tuning (x^{(t)T} y^{(t)}), ensemble training (x^{(etr)T} y^{(etr)}), and testing (x^{(test)T} y^{(test)}) summary statistics. Using formula (6), we subsample the first three sets of summary statistics iteratively and compute x^{(test)T} y^{(test)} = x^T y − x^{(tr)T} y^{(tr)} − x^{(t)T} y^{(t)} − x^{(etr)T} y^{(etr)}. Like PUMAS, PUMA-CUBS first conducts k-fold MCCV using the training and tuning summary statistics to pick the best tuning parameters for each PRS method. Then, it trains each optimal PRS model's weight on the ensemble training data and evaluates the combined PRS on the testing summary statistics. A straightforward and intuitive way of combining PRS is through multiple linear regression. However, if individual-level genotype and phenotype data are not available, we cannot fit the regression in the conventional way. Below we illustrate how to calculate the regression coefficients using summary-level data alone. We define the multiple linear regression model on the ensemble training dataset as:

y^{(etr)} = z\alpha + e

where α = [α_1 α_2 ... α_M]^T are the PRS weights for M PRS methods. We also define z = x^{(etr)} W as the observed PRS matrix with dimension N^{(etr)} × M, where W = [w_1 w_2 ... w_M] is a p × M SNP weight matrix for p SNPs from the M methods. To obtain the least squares estimator of α, that is, α̂ = (z^T z)^{-1} z^T y^{(etr)}, we need to estimate z^T z and z^T y^{(etr)} separately. In fact, under the assumption that genotype and phenotype are both centered, we can show that

z^\top z = N^{(etr)}\,\hat{\Sigma}_z \quad (8)

z^\top y^{(etr)} = W^\top\, x^{(etr)\top} y^{(etr)} \quad (9)

where Σ̂_z is the empirical covariance matrix of the PRS matrix z. In practice, we can estimate Σ̂_z by calculating the PRSs and their sample covariance matrix on a reference LD genotype dataset, or approximate it by computing W^T D W. Taking (8) and (9) together, we can estimate the PRS weights using only summary statistics. We then take the average PRS weights across the k folds, i.e., ᾱ = (1/k) Σ_f α̂_f, and use it as the weight vector to combine the optimized PRSs. Finally, we modify equation (7) to calculate the predictive R² of the ensemble PRS on the testing summary-level data:

R^2_{\mathrm{ensemble}} \approx \frac{\Big(\frac{1}{N^{(test)}}\,\bar{\alpha}^\top W^\top x^{(test)\top} y^{(test)}\Big)^2}{\widehat{\mathrm{Var}}\big(y^{(test)}\big)\;\bar{\alpha}^\top\, \hat{\Sigma}_z\, \bar{\alpha}} \quad (10)
In the end, PUMA-CUBS reports the average prediction accuracy of the ensemble PRS across the k folds. Note that PUMA-CUBS can benchmark all PRS models, in addition to the ensemble PRS, on the testing summary statistics, since the testing set is independent of the training and tuning datasets. Therefore, PUMA-CUBS is a highly flexible framework to train, fine-tune, combine, and evaluate PRS models based on GWAS summary statistics.
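The closed-form ensemble step can be sketched in a few lines, assuming the fine-tuned SNP weight matrix W, the subsampled ensemble-training statistics, and an M × M PRS covariance estimated on reference genotypes (or via W^T D W) are available; names are illustrative.

```python
import numpy as np

def ensemble_weights(W, xty_etr, N_etr, Sigma_z):
    """Least-squares PRS weights alpha = (z'z)^{-1} z'y recovered from
    summary data via formulas (8) and (9)."""
    ztz = N_etr * Sigma_z              # formula (8)
    zty = W.T @ xty_etr                # formula (9)
    return np.linalg.solve(ztz, zty)   # one weight per PRS method
```

In practice these weights would be averaged across the k MCCV folds before combining the optimized scores, as described above.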
Binary phenotypes
There are two challenges in applying PUMAS and PUMA-CUBS to binary phenotypes. First, summary statistics obtained from logistic regression violate the linear regression model assumed in our derivation. Therefore, equations (3) and (6) are not directly applicable for subsampling summary statistics for binary traits, because the X^T Y calculation is non-trivial for log odds ratios. Second, the squared Pearson correlation between a binary outcome and a PRS built from logistic regression coefficients is less interpretable and rarely reported. On the other hand, the area under the ROC curve (AUC) is often the preferred metric for quantifying PRS accuracy for binary outcomes. AUC calculation based on summary statistics has been developed but has not been generalized to handle whole-genome data, making it difficult to evaluate more sophisticated PRS methods that leverage contributions from millions of SNPs when individual-level data is not accessible60. Here we propose a simple solution that allows us to apply PUMAS and PUMA-CUBS to binary phenotypes and report an interpretable R². For binary traits, R² on the observed scale, denoted R²_obs, has been defined and discussed in the literature as an alternative metric for evaluating PRS prediction accuracy44. R²_obs is the squared correlation between the PRS and the 0-1 status, where the PRS uses effect sizes estimated from a linear probability model (LPM, i.e., linear regression between the binary response and SNP allele counts) as inputs61. If the GWAS summary-level data was generated from a linear probability model, then PUMAS and PUMA-CUBS can be directly applied to calculate R²_obs for binary traits53. When LPM summary statistics are not available, since a single SNP typically has a very weak effect on the phenotype, we can still safely approximate the LPM coefficient estimates using Z-scores from logistic regression42,43. Specifically, we can calculate

\hat{\beta}_{j,\mathrm{LPM}} \approx Z_{j,\mathrm{logistic}} \sqrt{\frac{\nu(1-\nu)}{N\,\hat{\sigma}^2_{X_j}}}

where Z_{j,logistic} is the Z-score of the j-th SNP from the logistic summary statistics and ν is the sample prevalence. Then we can use β̂_{j,LPM} and the corresponding standard error, SE(β̂_{j,LPM}) = sqrt(ν(1−ν)/(N σ̂²_{X_j})), to apply PUMAS and PUMA-CUBS to dichotomous phenotypes. Finally, if it is preferred to transform R²_obs to R² on the liability scale, R²_liab, which is comparable across studies and phenotypes, such a transformation has been developed using sample and population prevalence44.
Sample size imputation
In this section, we discuss how to handle sample size misspecification in GWAS summary statistics when applying our approach. Sample size misspecification is common in published GWAS datasets, since many studies do not report SNP-specific sample sizes and only provide a maximum sample size for the entire study. This is sub-optimal for PRS training if variant-level sample sizes differ substantially (e.g., in meta-analyses). A recent study extensively investigated sample size misspecification in marginal association statistics and observed consistently decreased PRS prediction accuracy when the issue is not properly addressed31. For PUMAS and PUMA-CUBS, incorrect sample sizes will both degrade the quality of the subsampled summary statistics and bias the estimation of predictive R². To address this issue, we employed the approach proposed in Privé et al. to impute and conduct quality control on variant-specific sample sizes31. Specifically, when the summary-level data does not provide per-SNP sample sizes, we first impute the sample size and remove SNPs with imputed sample size smaller than 70% or larger than 110% of the reported maximum sample size. For summary statistics that do provide per-SNP sample sizes, we simply remove variants with sample size smaller than 70% of the largest sample size. In addition, to make sure formulas (7) and (10) work for summary statistics with varying SNP-specific sample sizes, we enforce all sets of summary statistics other than the training summary statistics to have the same sample size for every SNP. We achieve this by subsampling all of the other sets of summary statistics first, for which the subset sample size can be specified, and calculating the training summary statistics last by subtraction.
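A hedged sketch of the per-SNP sample size imputation and QC filter: ignoring the small β² term in formula (2), SE_j² ≈ Var(y)/(N_j σ²_{X_j}), so N_j can be backed out from the reported standard error and MAF. `var_y = 1` assumes a standardized trait; this is our illustration of the idea rather than the exact procedure of Privé et al.

```python
import numpy as np

def impute_n(se, maf, var_y=1.0):
    """Back out per-SNP sample size from SE and MAF (beta^2 term ignored)."""
    sigma2_x = 2.0 * maf * (1.0 - maf)
    return var_y / (se ** 2 * sigma2_x)

def qc_by_n(n_imputed, n_max):
    """Keep SNPs whose imputed N lies within (70%, 110%) of the
    reported maximum sample size."""
    return (n_imputed > 0.7 * n_max) & (n_imputed < 1.1 * n_max)
```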
PRS training
We trained lassosum, PRS-CS, and LDpred2 models for all PRS analyses in this study22,25,26. lassosum is a penalized regression framework that trains lasso regression coefficients for SNPs in each LD block with tuning parameters s and λ, where s controls the sparsity of the LD matrix and λ is the penalty term that regularizes the shrinkage of effect sizes. PRS-CS and LDpred2 are both Bayesian PRS frameworks with different prior assumptions for the SNP effect size distribution. PRS-CS has a global shrinkage parameter ϕ that uniformly shrinks its continuous prior distribution for each SNP, and it includes a fully Bayesian approach that automatically learns ϕ during model fitting. LDpred2 is an extension of LDpred that places a point-normal prior on SNP effects based on a tuning parameter p representing the proportion of causal variants in the genome (LDpred non-inf and LDpred2_grid), or a univariate normal prior on all SNPs that does not require model tuning (LDpred/LDpred2-Inf)12. Like PRS-CS, LDpred2 can also employ an empirical Bayesian approach to optimize p on the training summary statistics. For implementation, we trained PRS-CS (v1.0.0) models using the UKB European LD reference for the simulation study and the 1000 Genomes European LD reference for the real data analysis. We followed the PGS server pipeline to implement lassosum (R package 'lassosum' v0.4.5) and LDpred2 (R package 'bigsnpr' v1.9.11)22,28,62. Due to the larger computational burden, we implemented LDpred2 on each chromosome separately and only used the estimated heritability from LD score regression as the tuning parameter h2 in LDpred245. For the real data analysis in UKB, we constructed both non-sparse and sparse versions of LDpred2 models. We employed more shrinkage in the LDpred2-auto model (shrink_corr = 0.5) and LDpred2_grid models (low_h2=0.1*h2) when analyzing publicly available GWAS summary statistics to ensure model convergence. We only trained PRS models on HapMap3 SNPs in all analyses throughout this study. The best tuning parameters for lassosum were obtained through grid search. For LDpred2 and PRS-CS, we compared grid search with the empirical Bayesian models to find the best parameters.
Simulation settings
We conducted simulations using UKB genotype data imputed to the Haplotype Reference Consortium reference. We removed individuals who are not of European ancestry and genetic variants with MAF below 0.01, imputation R2 below 0.9, Hardy-Weinberg equilibrium test p-value below 1e-6, or missing genotype call rate greater than 2%. We further extracted variants in the HapMap3 SNP list and the 1000 Genomes Project Phase III LD reference data for European ancestry from PRS-CS. 377,509 samples and 944,547 variants remained after quality control. Then, we randomly selected 100,000 samples to be the training dataset and 1,000 samples as the LD genotype reference for our summary-statistics-based approach. To generate trait values, we simulated true effect sizes from a point-normal distribution, i.e.,

\beta_j \sim (1 - p)\,\delta_0 + p\, N\!\left(0,\; \frac{h^2}{M p}\right)

where p is the proportion of causal variants, δ_0 is a point mass at 0, h² is the total heritability of the phenotype, and M is the total number of SNPs7,12. We did not simulate associations between SNP true effects on the allelic scale and MAF, since previous analysis has shown minimal difference in performance between PUMAS and PRS validation using individual-level data35,63. We chose p to be 0.1% and 20%, corresponding to sparse and polygenic genetic models, and h² = 0.2, 0.5, 0.8, to create a total of 6 simulation settings with various types of genetic architecture. Within each setting, we randomly selected causal variants across the whole genome. We then simulated quantitative traits by adding up the SNP allele counts weighted by their true effect sizes and adding randomly generated Gaussian noise scaled according to the trait heritability. We fitted marginal linear regressions in PLINK to obtain GWAS summary statistics in each setting64.
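A sketch of the generative model for one simulation setting, assuming a standardized N × M genotype matrix is in memory; in the actual study, the GWAS were run with PLINK on imputed UKB genotypes.

```python
import numpy as np

def simulate_phenotype(geno, p, h2, rng):
    """Point-normal effects: beta_j ~ (1-p) delta_0 + p N(0, h2/(M p)),
    then phenotype = genetic value + noise scaled to give heritability h2."""
    N, M = geno.shape
    beta = np.zeros(M)
    causal = rng.random(M) < p
    beta[causal] = rng.normal(0.0, np.sqrt(h2 / (M * p)), size=causal.sum())
    g = geno @ beta                                   # genetic value
    noise_sd = np.sqrt(np.var(g) * (1.0 - h2) / h2)   # fixes realized h2
    return g + rng.normal(0.0, noise_sd, size=N)
```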
We compared PUMA-CUBS with 4-fold MCCV. To implement 4-fold MCCV, in each fold we randomly selected 60% of all samples to form the training dataset (N=60,000), 20% as the tuning dataset (N=20,000), 10% as the ensemble training dataset (N=10,000), and the remaining 10% as the testing dataset (N=10,000). We conducted GWAS on the training data and used the summary statistics to train PRS models, fine-tuned the PRS methods on the tuning data, obtained the optimized PRSs' weights in the ensemble score by fitting multiple linear regression on the ensemble training data, and finally evaluated each PRS model's predictive R2 on the testing data. For PUMA-CUBS, we first used all samples (N=100,000) to fit marginal linear regressions and obtain the full summary statistics. In a similar fashion, we partitioned the full summary statistics into training summary data (N=60,000), tuning summary data (N=20,000), ensemble learning summary data (N=10,000), and testing summary data (N=10,000) for the corresponding PRS analyses. Similarly, we compared PUMAS with 4-fold MCCV by using only the training and tuning summary-level and individual-level data for the two approaches, respectively. In all simulations, we used the 1000 Genomes Project European LD dataset provided by the PRS-CS software to subsample summary statistics. Both lassosum and LDpred2 model training used the holdout UKB LD genotype data (N=1,000) as the LD reference. We implemented lassosum with s = 0.2, 0.5, 0.9 and λ = 0.005, 0.01, PRS-CS with ϕ = 0.0001, 0.01, auto, and LDpred2 with p = 0.001, 0.01, 0.1, auto and the infinitesimal model. We repeated this procedure four times and calculated the average R2 to pick the best set of tuning parameters for both approaches.
We conducted additional simulations to demonstrate that PUMAS and PUMA-CUBS can be applied to binary traits. For each setting in the quantitative simulation study, we dichotomized the continuous phenotype (i.e., the true liability value under a liability threshold model) using either the median or the 90% quantile to acquire balanced (5-to-5) and unbalanced (1-to-9) case-control ratios. Therefore, we have a total of 12 binary simulation settings. We fitted logistic regressions in PLINK to obtain GWAS summary statistics in each setting and transformed the logistic regression summary statistics to the linear scale42,43,64. We then compared PUMAS and PUMA-CUBS with MCCV. We computed R2 on the observed scale (i.e., R2 between the PRS and 0-1 status) and transformed it to R2 on the liability scale by44

R^2_{\mathrm{liability}} = R^2_{\mathrm{observed}}\; \frac{\nu(1-\nu)}{\phi\!\big(\Phi^{-1}(\nu)\big)^{2}}

where ν is the prevalence, and ϕ and Φ^{-1} are the pdf and inverse cdf of the standard normal distribution.
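The transformation is straightforward to apply; a minimal sketch:

```python
from scipy.stats import norm

def liability_r2(r2_obs, nu):
    """Convert observed-scale R^2 to liability-scale R^2 at prevalence nu."""
    z = norm.pdf(norm.ppf(nu))  # normal density at the liability threshold
    return r2_obs * nu * (1.0 - nu) / z ** 2
```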
UKB data analysis
We applied our approach to 16 quantitative traits and 3 diseases in UKB. The list of UKB phenotypes is presented in Supplementary Tables 5-6. The imputed UKB genotype data consists of 375,064 independent individuals of European ancestry and 1,030,187 variants after quality control. We used Hail (v0.2.57) to perform linear regression for quantitative traits while adjusting for sex, age polynomials to the power of two, interactions between sex and the age polynomials, and the top 20 principal components65. For the 3 disease outcomes, we obtained GWAS summary statistics via regenie (v3.0.3), accounting for sex, age polynomials to the power of 3, interactions between sex and the age polynomials, and the top 10 principal components, as recommended66.
We compared PUMAS with external validation using a holdout subset of UKB samples. For external validation of quantitative traits, we randomly selected 38,521 samples with non-missing phenotypic measurements for all traits to form the holdout dataset. The remaining samples for each phenotype were used as training data. In this way, we implemented an approximately 9-to-1 training-testing split. Similarly, for each binary outcome, we employed a 9-to-1 sample partition while matching the case-control ratio between the training and holdout datasets. Detailed sample size information for all traits is included in Supplementary Tables 5-6. Then, we conducted GWAS on the training data and obtained summary statistics. For quantitative traits, we computed and evaluated PRS models on the entire holdout set and reported predictive R2 between the PRS and phenotypes with covariates regressed out. For disease traits, we constructed PRS models and calculated R2 on the observed scale using both linear probability model summary statistics and logistic model summary statistics. For all phenotypes, the holdout set for quantitative traits (N=38,521) was also used as the LD reference data for PRS model training. For comparison, we applied PUMAS to partition the same GWAS summary-level data into 75% training and 25% tuning summary statistics. We used the holdout dataset (N=38,521) for summary statistics subsampling56 and as the LD reference for lassosum and LDpred2 model training. We estimated the variance of PRS models based on a smaller subset (N=1,000) of the holdout data when evaluating PRS performance. This procedure was repeated 4 times, and we report the average R2 for each PRS model. In all analyses, we implemented lassosum with s = 0.2, 0.5, 0.9 and λ = 0.005, 0.01, PRS-CS with ϕ = 0.0001, 0.01, auto, and LDpred2 with p = 0.001, 0.01, 0.1, auto and the infinitesimal model.
Next, we compared PUMA-CUBS with the training-testing split approach for ensemble learning on the holdout dataset. For PUMA-CUBS, we partitioned the full GWAS summary statistics into training (60%), tuning (20%), and ensemble training (10%) summary statistics to train PRS models based on a grid of tuning parameters, select the best tuning parameter setting for each PRS method, and fit a second-level regression to obtain regression weights for the fine-tuned PRS models. We then randomly partitioned the holdout dataset into two equally sized subsets. We used PUMA-CUBS to obtain the PRS models' regression weights and then constructed and evaluated the ensemble PRS on the second half of the holdout set. PRS models with negative weights were removed from the linear combination. In comparison, for the training-testing split approach based on individual-level data, we used the first half of the holdout set to fit a multiple linear regression to obtain regression coefficients for the fine-tuned lassosum, LDpred2, and PRS-CS scores. Then we computed and evaluated the ensemble PRS models on the second half of the holdout data. In all analyses, we trained lassosum with s = 0.2, 0.9 and λ = 0.001, 0.01, 0.1, PRS-CS with ϕ = 0.0001, 0.01, auto, and LDpred2 with p = 0.001, 0.01, 0.1, auto and the infinitesimal model.
Building a catalog of PUMA-CUBS ensemble scores
We applied PUMA-CUBS to a collection of publicly available GWAS summary statistics. We selected complex diseases with a minimum of 5,000 cases and complex traits with a minimum total sample size of 50,000. We excluded studies that performed GWAS on related samples and retained traits with significant heritability estimates (p-value below 0.05) from LD score regression45. In the end, we obtained a list of 31 sets of GWAS summary statistics, including 23 binary outcomes and 8 continuous traits, as summarized in Supplementary Table 10. For each set of summary statistics, we kept HapMap3 SNPs that passed a series of quality control criteria listed in Supplementary Table 11, including transformation of logistic summary statistics and imputation of per-SNP sample sizes. Then we applied PUMA-CUBS to each phenotype to implement 4-fold MCCV by partitioning the summary statistics into training (60%), tuning (20%), ensemble training (10%), and testing (10%) datasets. We used 1000 Genomes Project Phase III European samples as the LD panel for summary statistics subsampling, PRS model fitting, and benchmarking. We implemented lassosum with s = 0.2, 0.5, 0.9 and λ = 0.005, 0.01, PRS-CS with ϕ = 0.0001, 0.01, auto, and LDpred2 with p = 0.001, 0.01, 0.1, auto and the infinitesimal model. We report the average predictive R2 of the ensemble PRS, the best single PRS model, PRS-CS-auto, and LDpred2-auto on the testing summary statistics.
We conducted additional analysis to investigate the validity of the predictive R2 of the ensemble PRS for Alzheimer's disease. We used the IGAP 2019 Alzheimer's GWAS summary statistics to train PRS models and included 2,600 Alzheimer's disease cases of European ancestry from the UKB cohort in the external validation dataset46. The data fields used for extracting Alzheimer's cases are presented in Supplementary Table 13. We randomly selected 5,200 independent UKB samples not diagnosed with Alzheimer's disease as healthy controls to match the case-control ratio in the IGAP 2019 study. Together, we obtained a UKB external validation dataset with 7,800 samples in total. We applied PUMAS to the IGAP 2019 GWAS summary-level data and compared its performance with external validation. We compared R2 from both approaches with and without removing the APOE region from the GWAS summary statistics. We excluded the APOE region from PRS analysis by removing variants between base pairs 45,116,911 and 46,318,605 (hg19) on chromosome 19.
Data and code availability
PUMAS/PUMA-CUBS software is freely available at https://github.com/qlu-lab/PUMAS.
Author Contribution
Z.Z. and Q.L. conceived and designed the study.
Z.Z. developed the statistical framework.
Z.Z. and T.G. performed statistical analyses.
Z.Z. and Y.W. wrote the software.
S.Z. assisted in preparing and curating summary statistics.
J.M. assisted in developing ensemble PRS approach.
J.M. and Y.W. assisted in UKB data preparation.
J.S. assisted in developing statistical method for subsampling summary statistics.
Q.L. advised on statistical and genetic issues.
Z.Z. and Q.L. wrote the manuscript.
All authors contributed to manuscript editing and approved the manuscript.
Acknowledgements
The authors gratefully acknowledge research support from National Institutes of Health (NIH) grants U01 HG012039 and R21 AG067092, and support from the University of Wisconsin-Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation (WARF). We also acknowledge use of the facilities of the Center for Demography of Health and Aging at the University of Wisconsin-Madison, funded by NIA Center Grant P30 AG017266. We thank members of the Social Genomics Working Group at University of Wisconsin for helpful comments. This research has been conducted using the UK Biobank Resource under Application 42148.