Current clinical use of polygenic scores will risk exacerbating health disparities

Alicia R. Martin, Masahiro Kanai, Yoichiro Kamatani, Yukinori Okada, Benjamin M. Neale, Mark J. Daly

1 Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA 02114, USA
2 Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
3 Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
4 Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
5 Laboratory for Statistical Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama 230-0045, Japan
6 Kyoto-McGill International Collaborative School in Genomic Medicine, Graduate School of Medicine, Kyoto University, Kyoto 606-8507, Japan
7 Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita 565-0871, Japan
8 Laboratory of Statistical Immunology, Immunology Frontier Research Center (WPI-IFReC), Osaka University, Suita 565-0871, Japan
9 Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland

Correspondence: Alicia R. Martin, armartin@broadinstitute.org


Common discoveries and low-hanging fruit
First, the power to discover an association in a genetic study depends on the effect size and frequency of the variant [29]. This dependence means that the most significant associations tend to be more common in the populations in which they are discovered than elsewhere [13,30]. For example, GWAS catalog variants are more common on average in European populations than in East Asian and African populations (Figure 2B), an observation not representative of genomic variants at large. Understudied populations offer low-hanging fruit for genetic discovery because variants that are common in these groups but rare or absent in European populations could not be discovered even with very large European sample sizes. Some examples include SLC16A11 and HNF1A associations with type 2 diabetes in Latino populations, as well as APOL1 associations with end-stage kidney disease and associations with prostate cancer in African descent populations [31-34]. If we assume that causal genetic variants have an equal effect across all populations (an assumption with some empirical support that offers the best-case scenario for transferability [35-40]), Eurocentric GWAS biases mean that variants associated with risk are disproportionately common and discovered in European populations, accounting for a larger fraction of the phenotypic variance there [13]. Furthermore, imputation reference panels share the same study biases as GWAS [41], creating challenges for imputing sites that are rare in European populations but common elsewhere when the catalog of non-European haplotypes is substantially smaller. These issues cannot be overcome through statistical methods alone [13]; rather, they motivate substantial investments in more diverse populations to produce similar-sized GWAS of biomedical phenotypes in other populations.
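To make the dependence on allele frequency concrete, the sketch below uses a standard additive-model approximation (an assumption brought in for illustration, not a formula from this paper): a biallelic variant with per-allele effect beta on a standardized trait and allele frequency p explains roughly 2p(1-p)beta^2 of the phenotypic variance, and the expected association test statistic grows roughly as the sample size times that quantity. The effect size, sample size, allele frequencies, and function names are all hypothetical.

```python
# Approximate chi-square (1 df) value corresponding to p = 5e-8.
GENOME_WIDE_CHI2 = 29.7


def variance_explained(freq: float, beta: float) -> float:
    """Variance explained by one additive variant on a standardized trait.

    Assumes Hardy-Weinberg equilibrium: 2 * p * (1 - p) * beta^2.
    """
    if not 0.0 <= freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * freq * (1.0 - freq) * beta ** 2


def approx_ncp(n_samples: int, freq: float, beta: float) -> float:
    """Rough non-centrality parameter of the 1-df association test.

    Uses NCP ~= N * variance_explained, which is reasonable only when the
    variance explained by the variant is small.
    """
    return n_samples * variance_explained(freq, beta)


# Hypothetical variant: identical causal effect in both populations,
# but common in population A and rare in population B.
N_SAMPLES = 100_000
BETA = 0.1
FREQ_BY_POPULATION = {"population_A": 0.30, "population_B": 0.01}

for name, freq in FREQ_BY_POPULATION.items():
    q2 = variance_explained(freq, BETA)
    ncp = approx_ncp(N_SAMPLES, freq, BETA)
    # An NCP above the threshold suggests better-than-even discovery power.
    print(f"{name}: freq={freq:.2f} variance_explained={q2:.6f} "
          f"ncp={ncp:.1f} above_threshold={ncp > GENOME_WIDE_CHI2}")
```

Under these assumed values, the same causal effect is much easier to detect where the allele is common, which is the asymmetry the paragraph above describes.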
bioRxiv preprint doi: https://doi.org/10.1101/441261; this version posted February 1, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Linkage disequilibrium
Second, LD, the correlation structure of the genome, varies across populations due to demographic history (Figure 2A,C-E). These LD differences in turn drive differences in effect size estimates (i.e. predictors) from GWAS across populations in proportion to LD between tagging and causal SNP pairs, even when causal effects are the same [35,37-40] (Supplementary Note). Differences in effect size estimates due to LD differences may typically be small for most regions of the genome (Figure 2C-E), but PRS sum across these effects, also aggregating these population differences. While it would be ideal to use causal effects rather than correlated effect size estimates to calculate PRS, it may not be feasible to fine-map most variants to a single locus to solve issues of low generalizability, even with very large GWAS. This is because complex traits are highly polygenic, meaning most of our prediction power comes from small effects that do not meet genome-wide significance and/or cannot be fine-mapped, even in many of the best-powered GWAS to date [42].
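The role of LD can be illustrated with a small sketch. It assumes standardized genotypes, a single causal variant per locus, and identical causal effects in both populations; under those assumptions the expected marginal effect at a tagging SNP is approximately the causal effect multiplied by the LD correlation r between the two, and a PRS is the sum of effect estimates times genotypes. The LD values, effects, and function names below are hypothetical and are not taken from the Supplementary Note.

```python
from typing import Sequence


def expected_tag_effect(causal_beta: float, r: float) -> float:
    """Expected marginal effect at a tag SNP in LD (correlation r) with the
    causal SNP, assuming standardized genotypes and one causal variant."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("LD correlation r must lie in [-1, 1]")
    return r * causal_beta


def polygenic_score(weights: Sequence[float], genotypes: Sequence[float]) -> float:
    """PRS as the sum over variants of effect estimate times genotype."""
    if len(weights) != len(genotypes):
        raise ValueError("weights and genotypes must have the same length")
    return sum(w * g for w, g in zip(weights, genotypes))


# Hypothetical causal effects, shared by both populations.
causal_betas = [0.10, -0.05, 0.08]
# Hypothetical tag-causal LD (r) at each locus in the two populations.
r_discovery = [0.95, 0.90, 0.85]
r_target = [0.60, 0.90, 0.30]

# Weights estimated in the discovery population versus the weights that
# would be appropriate for the target population's LD structure.
weights_discovery = [expected_tag_effect(b, r) for b, r in zip(causal_betas, r_discovery)]
weights_target = [expected_tag_effect(b, r) for b, r in zip(causal_betas, r_target)]

# Per-locus mismatches are individually small, but they add up in the sum.
per_locus_gap = [abs(d - t) for d, t in zip(weights_discovery, weights_target)]
total_gap = sum(per_locus_gap)

# One hypothetical target-population individual (standardized genotypes).
genotypes = [1.2, -0.4, 0.7]
score_with_discovery_weights = polygenic_score(weights_discovery, genotypes)
score_with_target_weights = polygenic_score(weights_target, genotypes)
print(per_locus_gap, total_gap)
print(score_with_discovery_weights, score_with_target_weights)
```

Real PRS methods work with per-allele effects and many correlated variants per locus, so this is only a toy view of why tag-based weights transfer imperfectly.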

Complexities of history, selection, and the environment
Lastly, other cohort considerations may further worsen prediction accuracy differences across populations in less predictable ways. GWAS ancestry study biases and LD differences across populations are extremely challenging to address, but these issues, as described, actually rest on favorable assumptions: that all causal loci have the same impact and are under equivalent selective pressure in all populations. In contrast, other effects on polygenic adaptation or risk scores such as long-standing environmental differences [...]

Differences in environmental exposure, gene-gene interactions, gene-environment interactions, historical population size dynamics, statistical noise, some potential causal effect differences, and/or other factors will further limit generalizability for genetic risk scores in an unpredictable, trait-specific fashion [46-49].

[...] with previous studies, we find that relative to European prediction accuracy, genetic prediction accuracy was far lower in other populations: 1.6-fold lower in Hispanic/Latino Americans, 1.7-fold lower in South Asians, 2.5-fold lower in East Asians, and 4.9-fold lower in Africans on average (Figure 3).

[...] same sample sizes in each population. We symmetrically demonstrate that prediction accuracy is consistently higher with ancestry-matched GWAS summary statistics (Figure 4, Supplementary Figures 2-6).
Keeping in mind issues of comparability described above, we note that BBJ is a hospital-based disease-ascertained cohort, whereas UKBB is a healthier than average [59] population-based cohort; thus, differences in observed heritability among these cohorts (rather than among populations) due to differences in phenotype precision likely explain lower prediction accuracy from the BBJ GWAS summary statistics for anthropometric and blood panel traits, but higher prediction accuracy for five ascertained diseases (Supplementary Table 2). Indeed, other East Asian studies have estimated higher heritability for some quantitative traits than BBJ using the same methods, such as for [...]

[...] indicate that effect sizes were mostly highly correlated across ancestries, with a few traits that were somewhat lower than expected (e.g. height and BMI, with ρ_ge = 0.69 and 0.75, respectively). Prediction accuracy was far lower in individuals of African descent in the UK Biobank (Supplementary Figures 4 and 11) using GWAS summary statistics from either European or Japanese ancestry individuals, consistent with reduced prediction accuracy with increasing genetic divergence (Figures 3 and 4).

[...] Our impetus for raising these statistical issues limiting the generalizability of PRS across populations stems from our concerns that, while they are legitimately clinically promising for improving health outcomes for many biomedical phenotypes, they may have a larger potential to raise health disparities than other clinical factors for several reasons. The opportunities they provide for improving health outcomes mean they inevitably will and should be pursued in the near term, but we urge that a concerted prioritization to make [...]

[...] clearly an important problem that can provide fundamental interpretability gains with implications for some major health benefits (e.g. need for dialysis and development of Type 2 diabetes based on ethnicity-specific serum creatinine and hemoglobin A1C reference intervals, respectively) [67]. Simply put, some biomarkers or clinical tests scale directly with health outcomes independent of ancestry, and many others may have distributional differences by ancestry but are equally valid after centering with respect to a readily collected population reference.

[...] all, compared to random (Figure 4F). They are therefore least likely to benefit from improvements in precision healthcare delivery from genetic risk scores with existing data due to human population history and study biases. This is a major concern globally and especially in the U.S., which already leads other middle- and high-income countries [...]

What can be done?
The single most important step towards parity in PRS accuracy is to vastly increase the diversity of participants included and analyzed in genetic studies, which will improve utility for all and most rapidly for underrepresented groups. Regulatory protections against genetic discrimination are necessary to accompany calls for more diverse studies; while some already exist in the U.S., including for health insurance and employment opportunities via the Genetic Information Nondiscrimination Act (GINA), stronger protections in these and other areas globally will be particularly important for minorities and/or marginalized groups. An equal investment in GWAS across all major ancestries and global populations is the most obvious solution to generate a substrate for equally informative risk scores but is not likely to occur any time soon absent a dramatic priority shift, given the current imbalance and stalled diversifying progress over the last five years (Figure 1, Supplementary Figure 1). While it may be challenging or in some cases infeasible to acquire sample sizes large enough for PRS to be equally informative in all populations, some much-needed efforts [...]

[...] funded and as many privately funded projects as possible should be made easily and publicly accessible to improve global health outcomes. Efforts to unify phenotype definitions, normalization approaches, and GWAS methods among studies will also improve comparability.

To enable progress towards parity, it will be critical that open data sharing standards be adopted for all ancestries and for genetic studies of all sample sizes, not just the largest European results. Locally appropriate and secure genetic data sharing techniques as well as equitable technology availability will need to be adopted widely in Asia and Africa as they are in Europe and North America, to ensure that maximum value is achieved from existing and ongoing efforts that are being developed to help counter the current imbalance. Simultaneously, ethical considerations require that research capacity be increased in LMICs alongside the growth of diverse population studies, balancing the benefits of these studies to scientists and patients globally versus locally so that everyone benefits. Methodological improvements that better define risk scores by accounting for population allele frequency, LD, and/or admixture differences appropriately are underway and may help considerably but will not by themselves bring equality. All of these efforts are important and should be prioritized not just for risk prediction but more generally to maximize the use and applicability of genetics to inform on the biology of disease. Given the acute recent attention on clinical use of PRS, we believe it is paramount to recognize their potential to improve health outcomes for all individuals and many complex diseases. Simultaneously, we as a field must address the disparity in utility in an ethically thoughtful and scientifically rigorous fashion, lest we [...]
[...] Table 6. Each point shows the maximum R² (i.e. best predictor) across five p-value thresholds, and lines correspond to 95% confidence intervals calculated via bootstrap. R² values for all p-value thresholds tested are shown in Supplementary Figures 2-6. Prediction accuracy tends to be higher in the UKBB for quantitative traits than in BBJ and vice versa for disease endpoints, likely because of concomitant phenotype precision and consequently observed heritability for these classes of traits (Supplementary Tables 2-4). Thalassemia and sickle cell disease are unlikely to explain a significant fraction of prediction accuracy differences for blood panels across populations, as few individuals have been diagnosed with these disorders via ICD-10 codes (Supplementary Table 9).
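The procedure described in the caption (taking the best R² across p-value thresholds and attaching a bootstrap 95% confidence interval) can be sketched as follows. This is an illustrative percentile bootstrap on simulated data, not the authors' pipeline: the five threshold values, the number of bootstrap replicates, the simulated scores, and all function names are assumptions, and R² is taken here as the squared Pearson correlation between score and phenotype with no covariate adjustment.

```python
import random


def r_squared(score, phenotype):
    """Squared Pearson correlation between a score and a phenotype."""
    n = len(score)
    if n != len(phenotype) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_s = sum(score) / n
    mean_y = sum(phenotype) / n
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(score, phenotype))
    var_s = sum((s - mean_s) ** 2 for s in score)
    var_y = sum((y - mean_y) ** 2 for y in phenotype)
    if var_s == 0.0 or var_y == 0.0:
        return 0.0  # degenerate sample: no variance to explain
    return cov * cov / (var_s * var_y)


def best_threshold(scores_by_threshold, phenotype):
    """Return (threshold, R^2) for the best-performing p-value threshold."""
    r2 = {p: r_squared(s, phenotype) for p, s in scores_by_threshold.items()}
    best = max(r2, key=r2.get)
    return best, r2[best]


def bootstrap_ci(score, phenotype, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for R^2, resampling individuals."""
    rng = random.Random(seed)
    n = len(score)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_squared([score[i] for i in idx],
                               [phenotype[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Simulated data: each hypothetical threshold yields a score that tracks the
# phenotype with a different amount of noise.
sim = random.Random(1)
phenotype = [sim.gauss(0.0, 1.0) for _ in range(300)]
noise_sd_by_threshold = {5e-8: 3.0, 1e-6: 2.5, 1e-4: 2.0, 1e-2: 1.5, 1.0: 2.2}
scores_by_threshold = {
    p: [y + sim.gauss(0.0, sd) for y in phenotype]
    for p, sd in noise_sd_by_threshold.items()
}

best_p, best_r2 = best_threshold(scores_by_threshold, phenotype)
ci_low, ci_high = bootstrap_ci(scores_by_threshold[best_p], phenotype)
print(f"best threshold={best_p} R2={best_r2:.3f} 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Because the threshold is chosen on the same individuals used to compute the interval, this simple version does not account for the selection step; a more careful analysis might repeat the selection within each bootstrap replicate.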