Abstract
Polygenic risk scores (PRS) are becoming increasingly predictive of complex traits, but subpar performance in non-European populations raises concerns about their potential clinical applications. We develop a powerful and scalable method to calculate PRS using GWAS summary statistics from multi-ancestry training samples by integrating multiple techniques, including clumping and thresholding, empirical Bayes, and super learning. We evaluate the performance of the proposed method and a variety of alternatives using large-scale simulated GWAS on ~19 million common variants and large 23andMe, Inc. datasets, including up to 800K individuals from four non-European populations, across seven complex traits. Results show that the proposed method can substantially improve the performance of PRS in non-European populations relative to simple alternatives and has comparable or superior performance relative to a recent method that requires a higher order of computational time. Further, our simulation studies provide novel insights into sample size requirements and the effect of SNP density on multi-ancestry risk prediction.
Competing Interest Statement
Jianan Zhan, Yunxuan Jiang, Jared O'Connell, and Bertram L. Koelsch are employed by and hold stock or stock options in 23andMe, Inc.
Footnotes
Conflicts of interest: J.Z., Y.J., J.O., and B.L.K. are employed by and hold stock or stock options in 23andMe, Inc.
Polishing the manuscript