Abstract
Polygenic Risk Scores (PRS) combine information across many single-nucleotide polymorphisms (SNPs) into a single score reflecting the genetic risk of developing a disease. PRS might have a major public health impact, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The “Clumping+Thresholding” (C+T) approach, the most common method to derive PRS, uses only univariate genome-wide association study (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects has the potential to significantly improve the predictive performance of PRS as compared to C+T.
In this paper, we present an efficient method to jointly estimate SNP effects, allowing for practical application of penalized logistic regression on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of penalized logistic regression chooses its hyper-parameters automatically. The choice of hyper-parameters matters because it can dramatically affect the predictive performance of a model. For example, in a model with 30 causal SNPs, AUC values range from less than 60% to 90% depending on the p-value threshold used in C+T.
We compare the performance of penalized logistic regression to the C+T method and to a method derived from random forests. Penalized logistic regression consistently achieves higher predictive performance than the two other methods while being very fast. We find that the improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; AUC values increase from 83% with the best prediction of C+T to 92.5% with penalized logistic regression. We confirm these results in a data analysis of a case-control study for celiac disease, where penalized logistic regression and the standard C+T method achieve AUCs of 89% and 82.5%, respectively.
In conclusion, our study demonstrates that penalized logistic regression is applicable to large-scale individual-level data and can achieve more discriminative polygenic risk scores. Our implementation is publicly available in the R package bigstatsr.
Contact: florian.prive{at}univ-grenoble-alpes.fr & michael.blum{at}univ-grenoble-alpes.fr