bioRxiv
New Results

Overestimated Polygenic Prediction due to Overlapping Subjects in Genetic Datasets

David Keetae Park, Mingshen Chen, Seungsoo Kim, Yoonjung Yoonie Joo, Rebekah K. Loving, Hyoung Seop Kim, Jiook Cha, Shinjae Yoo, Jong Hun Kim
doi: https://doi.org/10.1101/2022.01.19.476997
David Keetae Park
1Department of Biomedical Engineering, Columbia University, New York, USA
Mingshen Chen
2Department of Applied Mathematics & Statistics, Stony Brook University, New York, USA
Seungsoo Kim
3Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
Yoonjung Yoonie Joo
4Institute of Data Science, Korea University, Seoul, South Korea
5Department of Psychology, Brain and Cognitive Sciences, AI Institute, Seoul National University, Seoul, South Korea
Rebekah K. Loving
6Department of Biology, California Institute of Technology, Pasadena, USA
Hyoung Seop Kim
7Department of Physical Medicine and Rehabilitation, Dementia Center, National Health Insurance Service Ilsan Hospital, Goyang, South Korea
Jiook Cha
5Department of Psychology, Brain and Cognitive Sciences, AI Institute, Seoul National University, Seoul, South Korea
Shinjae Yoo
8Computational Science Initiative, Brookhaven National Laboratory, New York, USA
  • For correspondence: jh7521@naver.com sjyoo@bnl.gov
Jong Hun Kim
9Department of Neurology, Dementia Center, National Health Insurance Service Ilsan Hospital, Goyang, South Korea
  • For correspondence: jh7521@naver.com sjyoo@bnl.gov

ABSTRACT

Polygenic risk scores (PRS) have recently gained significant attention in studies of complex genetic diseases and traits. PRS are often derived from summary statistics, for which the independence between discovery and replication sets cannot be verified. Prior studies that strictly enforced this independence report relatively low gains from PRS in predictive models of binary traits. We hypothesize that the independence assumption may be compromised when summary statistics are used, and we suspect a resulting overestimation bias in predictive accuracy. To demonstrate this bias in the replication dataset, we compare the prediction performance of PRS models with overlapping subjects present and with them removed. We consider the task of Alzheimer’s disease (AD) prediction across genetic datasets, including the International Genomics of Alzheimer’s Project (IGAP), the AD Sequencing Project (ADSP), and the Accelerating Medicines Partnership - Alzheimer’s Disease (AMP-AD). PRS is computed either from sequencing studies for ADSP and AMP-AD (denoted rPRS) or from summary statistics for IGAP (sPRS). Two highly heritable variables in UK Biobank, hypertension and height, are used to derive an exemplary scale effect of PRS. Based on this scale effect, the expected performance of sPRS is computed for AD prediction. Using ADSP as the discovery set for rPRS on AMP-AD, ΔAUC and ΔR² (the performance gains in AUC and R² from PRS) are 0.069 and 0.11, respectively. They drop to 0.0017 and 0.0041 once overlapping subjects are removed from AMP-AD. sPRS derived from IGAP yields ΔAUC and ΔR² of 0.051±0.013 and 0.063±0.015 for ADSP and 0.060 and 0.086 for AMP-AD, respectively. On UK Biobank, rPRS performance for hypertension, assuming discovery and replication sets of similar size, is 0.0036±0.0027 (ΔAUC) and 0.0032±0.0028 (ΔR²). For height, ΔR² is 0.029±0.0037.
Considering the high heritability of hypertension and height in UK Biobank, we conclude that the sPRS results from AD databases are inflated. Performance that is high relative to the size of the discovery set has also been observed in PRS studies of several diseases. PRS performance for binary traits, such as AD and hypertension, turned out to be unexpectedly low. Together with differences in linkage disequilibrium, this may explain the high variability of PRS performance in cross-nation or cross-ethnicity applications, i.e., when there are no overlapping subjects. Hence, for sPRS, potential duplication of subjects should be carefully considered within the same ethnic group.
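
The comparison described above (the gain in AUC from PRS, measured with overlapping subjects present and then removed) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, subject IDs, labels, and scores below are hypothetical toy values chosen only to show the mechanics. It assumes subject identifiers are available, which is generally not the case for summary statistics.

```python
from typing import Hashable, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs in which the case
    has the higher score; ties count as half a pair."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def delta_auc(labels, baseline_scores, prs_scores) -> float:
    """Gain in AUC of a PRS-augmented model over a baseline model."""
    return auc(labels, prs_scores) - auc(labels, baseline_scores)


def drop_overlap(ids: Sequence[Hashable],
                 discovery_ids: Sequence[Hashable],
                 *columns: Sequence) -> List[list]:
    """Filter each column to replication subjects absent from the discovery set."""
    discovery = set(discovery_ids)
    keep = [i for i, sid in enumerate(ids) if sid not in discovery]
    return [[col[i] for i in keep] for col in columns]


# Hypothetical toy replication set: s1..s4 also appear in the discovery set,
# and the PRS-augmented scores separate those four subjects perfectly.
ids = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]
baseline = [0.6, 0.4, 0.5, 0.5, 0.4, 0.6, 0.7, 0.3]   # baseline model, no PRS
with_prs = [0.9, 0.1, 0.9, 0.1, 0.4, 0.6, 0.7, 0.3]   # baseline + PRS
discovery_ids = ["s1", "s2", "s3", "s4", "d1"]

delta_full = delta_auc(labels, baseline, with_prs)

clean_labels, clean_base, clean_prs = drop_overlap(
    ids, discovery_ids, labels, baseline, with_prs
)
delta_clean = delta_auc(clean_labels, clean_base, clean_prs)

print(f"delta AUC with overlap:    {delta_full:.4f}")
print(f"delta AUC overlap removed: {delta_clean:.4f}")
```

In this toy setup the apparent gain comes entirely from the four shared subjects, so it vanishes once they are dropped. Real data would show a partial drop, and ΔR² would be handled analogously with the chosen R² measure.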

Competing Interest Statement

The authors have declared no competing interest.

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Posted January 22, 2022.

Subject Area

  • Genomics