RápidoPGS: A rapid polygenic score calculator for summary GWAS data without validation dataset

Polygenic scores (PGS) aim to predict complex traits at an individual level based on genetic data. Computation of PGS is based on readily available genome-wide association study (GWAS) summary statistics but typically requires an independent test dataset to tune PGS parameters and, for the more sophisticated methods, the computation of LD matrices. Internally tuned methods have recently been proposed that obviate the need for a test dataset, but they remain computationally intensive to run. Here we present RápidoPGS, a flexible and fast method to compute PGS, requiring only summary-level GWAS datasets and no LD matrices. Based on fine-mapping principles, RápidoPGS computes the posterior probability that each variant is causal, which in turn is used to shrink effect sizes adaptively as a function of LD and strength of association. We show by summary- and individual-level validation that RápidoPGS performs well in comparison with another well-established internally tuned method (median AUC difference [RápidoPGS − LDpred2-auto] = −0.02205, range = [−0.1184, 0.0217]), with at least 50-fold improved speed. RápidoPGS is implemented in R, and can work with user-supplied summary statistics or download them directly from the GWAS Catalog. We propose that RápidoPGS can be used to rapidly screen a set of candidate traits for utility before more computationally intensive methods are applied to selected traits.

Polygenic scores (PGS) estimate individual propensity to a phenotype (e.g. disease) by summing an individual's genotypes (coded 0, 1, 2), weighted by effect sizes estimated from GWAS data: PGS = Σᵢ βᵢGᵢ, where βᵢ is the estimated effect for variant i and Gᵢ the genotype. The challenges, which relate to the nature of GWAS data, lie in determining the set of SNPs (i) to use. Linkage disequilibrium (LD) between variants, if ignored, would mean double-counting the effects of causal variants with multiple variants in high LD. Error in the estimated effect sizes is most pronounced for small effects, for which true association may not be distinguishable from noise about a null result. Different approaches have been developed to deal with these. Initial approaches selected only the most strongly associated variant in each genome-wide significant peak, but it was soon realised that predictive accuracy could be improved by relaxing the significance threshold to capture more truly associated variants, even at the cost of including some false associations (which should add noise, but not bias, to any prediction). Methods were subsequently developed to select variants by using an external validation set to tune the significance parameter, and to replace visual inspection of Manhattan plots with automated LD-pruning algorithms for selecting independent variants 6 .
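The weighted-sum calculation itself is simple; a minimal sketch in Python (the genotype matrix and weights here are hypothetical toy values for illustration, not RápidoPGS code):

```python
import numpy as np

# Hypothetical data: 4 individuals x 3 SNPs, genotypes coded 0/1/2
# (number of effect-allele copies)
genotypes = np.array([
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 2, 2],
])

# Per-SNP weights: GWAS effect estimates, possibly shrunk
weights = np.array([0.12, -0.05, 0.30])

# PGS for individual j is the sum over i of weights[i] * genotypes[j, i]
pgs = genotypes @ weights
print(pgs)  # one score per individual
```

The whole question addressed by PGS methods is how to choose the `weights` vector; the matrix product itself is trivial.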
More statistically sophisticated approaches have since been developed to average over multiple variants in LD without double counting (which is more accurate than selecting just one), and to shrink effect estimates by a continuous weight [7][8][9]. As with the "prune and threshold" approach, such methods initially required external independent validation datasets to tune their parameters, which can present a further barrier. Recently, automated methods have been developed that tune their parameters on the same dataset used for training via hierarchical Bayesian models 7,9 , and these perform nearly as well as their externally tuned counterparts. However, the mathematics of dealing with LD while still capturing signals from nearby causal variants requires the storage and inversion of large matrices, which is computationally intensive, and tuning, whether internal or external, adds a further layer of iterative computation. Some approaches mitigate this by adding a thinning step, discarding a random subset of SNPs to reduce the burden, but constructing a PGS still takes hours.
PGS construction is related to fine mapping. If we knew exactly the causal variants for a trait, an optimal PGS would comprise the estimated effects of just those variants. Fine-mapping methods estimate the probability that a variant is causal given the observed data, and so in the absence of knowing the exact set of true causal variants, a natural PGS might be constructed as above, with wᵢ set to the probability that variant i is causal. The basic fine-mapping approach is very fast because it makes the simplifying assumption that only a single causal variant may exist in any LD-defined region. While unrealistic, this assumption has been shown to perform well and, of relevance to PGS, allows the definition of fine-mapping probabilities using predefined LD regions, without the burden of processing large LD matrices 10 . Distinguishing noise around null associations from true associations is dealt with by two parameters: a prior probability that a random SNP is causal, and the prior distribution of effect sizes at true causal variants. While in a PGS these would be open to tuning, in fine mapping sensible default values have been chosen that reflect existing knowledge from the breadth of GWAS studies already conducted 11 .
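To illustrate how such posterior probabilities can shrink effect sizes using summary statistics alone, the sketch below applies Wakefield-style approximate Bayes factors within one LD-defined region under the single-causal-variant assumption. The prior variance `W` and the normalisation over the region's variants only (ignoring the null model) are illustrative simplifications, not the package's exact implementation:

```python
import numpy as np

def shrink_region(beta, se, W=0.04):
    """Shrink GWAS effect estimates in one LD-defined region.

    beta, se : effect estimates and standard errors for the region's SNPs.
    W        : prior variance of true effect sizes (illustrative value).
    Returns beta * (posterior probability the SNP is the causal one).
    """
    beta, se = np.asarray(beta, float), np.asarray(se, float)
    V = se ** 2                     # variance of the effect estimate
    z2 = (beta / se) ** 2           # squared Z-score
    # log approximate Bayes factor (Wakefield-style), kept in log space
    # for numerical stability with very strong signals
    labf = 0.5 * (np.log(V / (V + W)) + z2 * W / (V + W))
    # single-causal-variant assumption: posterior probabilities sum to 1
    # across the region (null model ignored here for simplicity)
    pp = np.exp(labf - labf.max())
    pp /= pp.sum()
    return beta * pp

# Toy region: one strong signal and two near-null SNPs
w = shrink_region(beta=[0.25, 0.02, -0.01], se=[0.02, 0.02, 0.02])
```

The strong signal retains almost its full effect size while the near-null SNPs are shrunk towards zero; weights computed region by region can then be concatenated into a genome-wide score.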
Here we present RápidoPGS, a lightweight and fast ( rápido , in Spanish) method to compute PGS based on this basic fine mapping approach, that only requires a summary statistics dataset, and that can generate weights for millions of SNPs in only a few seconds. RápidoPGS works fully in R, requiring few dependencies. A PGS can be quickly constructed from any GWAS summary statistic dataset, or any GWAS PubMed ID, if harmonised datasets are available at GWAS catalog.
We used RápidoPGS to compute PGS models for eight case-control and two quantitative traits from publicly available GWAS summary statistics (Suppl. Table 1) and compared its performance to that of PGS generated using an automated tuning method, LDpred2-auto 6 , in GWAS data from large biobanks (UK Biobank and FinnGen, Suppl. Table 1). Note that LDpred2-auto uses SNP thinning to limit computational burden, and we found that we could not raise the thinning parameter above 10% (i.e. discard 90% of SNPs at random) without segfault errors, so all comparisons are against this setting. For binary traits, we used summaryAUC 12 , a method for PGS evaluation using summary statistics as validation datasets. SummaryAUC measures were similar for RápidoPGS and LDpred2-auto models (median difference [RápidoPGS − LDpred2-auto] = 0.0267, range = [−0.1123, 0.0566], Figure 1A), with the exception of type 1 diabetes (T1D). T1D has multiple very strong associations in the HLA region 12 , which is well known for its extensive long-range LD architecture 14 . We suspected this was the cause of the underperformance, and indeed, removing the HLA region (Chr6:20-40Mb) produced more similar (but lower) AUCs for the two methods (Figure 1B). This warns that RápidoPGS may underperform in the presence of multiple very strong signals in LD.
Both LDpred2-auto and RápidoPGS tended to underperform compared to the AUC values reported by Privé et al. 7 . We checked whether this was due to using summaryAUC, which suggests trimming to the strongest 20,000 SNPs for accuracy, but found that individual-level estimates of AUC in UK Biobank gave similar results (Figure 1C), suggesting that the issue related to our inability to raise the LDpred2-auto thinning parameter above 0.1. The RápidoPGS AUC, even trimmed to 20k SNPs, was typically only a little lower than the reported AUC values (median difference [RápidoPGS − LDpred2-auto] = −0.02205, range = [−0.1184, 0.0217]).
For the quantitative traits, we used individual-level data to calculate r² in UK Biobank, and again found similar performance for RápidoPGS compared to LDpred2-auto (Figure 1D).

Figure 1. RápidoPGS has comparable performance to LDpred2-auto. A. AUC for case-control traits in large biobank datasets (UKBB for all except FinnGen for CAD). AUC was estimated by summaryAUC on LDpred2-auto and RápidoPGS scores; summaryAUC suggests trimming to the top 20k SNPs for accuracy. For comparison purposes, we added the AUC reported by Privé et al. 6 . B. Repeated on the T1D dataset with the HLA region removed (Chr6:20-40Mb). No Privé results available for this subset. C. For asthma, we also computed AUC from individual genotypes in UKBB. D. r² results for BMI and height computed using individual-level data from UK Biobank. No Privé results available for these traits.
RápidoPGS includes an optional step to generate a sparse PGS, by identifying a subset of SNPs with the largest weights and recalculating the score. We found that, in general, the larger the number of SNPs, the higher the AUC, and that post-threshold recalculation generally, but not always, improves performance (Suppl. Fig. 1). In general, the 10⁻⁴-thresholded model (median 231,493 SNPs) with recalculation performed better than any other alternative, although in most cases the 20k threshold (i.e. the 20,000 SNPs with the largest absolute weights) performed almost as well, suggesting that a smaller PGS, which may be easier to use in subsequent calculations, may be sufficient for the rapid screening we envisage.
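The thresholding step amounts to keeping the top-k weights by absolute value; a hypothetical helper sketching this (not the package's API, and omitting RápidoPGS's recalculation of the score on the retained subset):

```python
import numpy as np

def sparsify_weights(weights, top_k):
    """Zero out all but the top_k weights with largest absolute value."""
    w = np.asarray(weights, float).copy()
    if top_k < w.size:
        drop = np.argsort(np.abs(w))[:-top_k]   # indices of smallest |w|
        w[drop] = 0.0
    return w

w = sparsify_weights([0.5, -0.3, 0.01, 0.0002, -0.2], top_k=3)
# retains 0.5, -0.3 and -0.2; the two smallest weights become 0
```

The zeroed SNPs can then simply be dropped from the score, giving a model that is cheaper to store and apply.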
We benchmarked the speed of RápidoPGS and LDpred2-auto using repeated analysis of the asthma GWAS dataset 13 . Since the number of iterations may affect LDpred2 speed, we ran LDpred2-auto using 500 (the LDpred2-auto default), 1000, and 3000 (the number we used for our computations) iterations, with 15 cores. RápidoPGS does not require parallelisation or high computational power and can be run on a regular laptop, although for comparison we ran it here on the same high-performance computing system as LDpred2. RápidoPGS took 18 seconds to finish, while the fastest LDpred2-auto run (500 iterations) took ~997 seconds and LDpred2-auto with 3000 iterations took ~1504 seconds (Figure 2). RápidoPGS is a standard R package, and does not require pre-installation or download of any external dataset (e.g. a reference panel).

Figure 2. Timings of RápidoPGS and LDpred2-auto (with variable numbers of iterations), calculating PGS for Asthma.
Having PGS computed easily and quickly can be advantageous when many traits need rapid assessment. Our work suggests that fine-mapping approaches offer an alternative method to generate the weights wᵢ in the weighted PGS calculation. Fine-mapping methods do require setting parameters (typically, prior parameters describing the frequency of causal variants and the distribution of true causal effect sizes), but sensible values for these have been learnt from analysis of the many GWAS datasets conducted to date, avoiding the need to tune for individual datasets, which may be an insurmountable barrier for rare traits where a pair of training and test GWAS do not exist. While we chose a simple fine-mapping approach here, other approaches have been developed that do not make the simplifying single-causal-variant assumption yet still work with reasonable speed on GWAS summary data 15 , and these more sophisticated approaches may increase the performance of fine-mapping PGS to be competitive with dedicated PGS methods.
We conclude that RápidoPGS is at least 50-fold faster than the most competitive automated PGS method, with only a small loss in accuracy, making it well suited to rapid screening of many potentially related traits before maximising the accuracy of scores for specific traits of interest.

Declaration of interests
The authors declare no competing interests.