Abstract
Polygenic risk scores (PRS) are among the most popular methods for predicting complex traits and diseases from high-dimensional genome-wide association study (GWAS) data, where the sample size n is typically much smaller than the number of SNPs p. A PRS is a weighted sum of candidate SNPs in a testing dataset, where each SNP is weighted by its marginal effect estimated from a training dataset. PRS has two motivations: 1) it can be constructed from summary statistics alone, without raw data, which may not be readily available because of privacy concerns; 2) most complex traits are affected by many genes with small effects, that is, they follow a polygenic (or the newly emerging omnigenic) model. PRS aggregates information from all potential causal SNPs and thus, as its name suggests, is expected to be powerful for polygenic and omnigenic traits. However, to the disappointment of many researchers, the prediction accuracy of PRS in practice is low, even for traits with known high heritability. To resolve this puzzle, we investigate PRS both empirically and theoretically. We show how, in addition to heritability, the performance of PRS is influenced by the triplet (n, p, m), where m is the number of true causal SNPs. Our major findings are that 1) when PRS is constructed with all p SNPs (referred to as GWAS-PRS), its prediction accuracy is solely determined by the p/n ratio; 2) when PRS is built with a list of top-ranked SNPs that pass a pre-specified P-value threshold (referred to as threshold-PRS), its accuracy can vary dramatically depending on how sparse the true genetic signals are. Threshold-PRS performs well only when m is orders of magnitude smaller than n, that is, when genetic signals are sparse. In contrast, if m is much larger than n, that is, genetic signals are not sparse, which is often the case for complex polygenic traits, threshold-PRS is expected to fail.
Our results demystify the poor performance of PRS and demonstrate that the original purpose of PRS, to aggregate effects from a large number of causal SNPs for polygenic traits, is wishful thinking and leads to a practical paradox for polygenic/omnigenic traits. Our results, as it turns out, are closely related to the “spurious correlation” problem of Fan et al. [2012], which has been gaining increasing attention in the statistics community.
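
As an illustration only, the two constructions named above (GWAS-PRS and threshold-PRS) can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names (marginal_effect, train_weights, prs) are hypothetical, genotypes are assumed to be coded 0/1/2, marginal effects are taken to be simple per-SNP least-squares slopes, and P-values use a normal approximation to the test statistic.

```python
import math

def marginal_effect(x, y):
    """Slope and two-sided P-value from regressing y on a single SNP x.

    Assumption: the P-value uses a normal approximation to the z statistic.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:  # monomorphic SNP: no information, weight it zero
        return 0.0, 1.0
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    rss = sum(((yi - my) - beta * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:  # perfect fit: no residual variation
        return beta, (0.0 if beta != 0 else 1.0)
    z = beta / se
    return beta, math.erfc(abs(z) / math.sqrt(2))

def train_weights(X_train, y_train):
    """Return one (beta, P-value) pair per SNP.

    X_train is a list of individuals, each a list of p genotype codes.
    """
    p = len(X_train[0])
    columns = [[row[j] for row in X_train] for j in range(p)]
    return [marginal_effect(col, y_train) for col in columns]

def prs(X_test, weights, p_threshold=1.0):
    """Weighted sum of test genotypes over SNPs with P-value <= p_threshold.

    p_threshold=1.0 keeps all p SNPs (GWAS-PRS); a smaller value keeps only
    the top-ranked SNPs (threshold-PRS).
    """
    keep = [j for j, (_, pval) in enumerate(weights) if pval <= p_threshold]
    return [sum(weights[j][0] * row[j] for j in keep) for row in X_test]
```

Calling prs(X_test, weights) scores each test individual with every SNP, while prs(X_test, weights, p_threshold=0.05) restricts the sum to SNPs whose training P-value passes the cutoff. The cutoff 0.05 is an arbitrary example value, not one taken from the paper.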