bioRxiv
New Results

Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models

Sahir R Bhatnagar, Yi Yang, Tianyuan Lu, Erwin Schurr, JC Loredo-Osti, Marie Forest, Karim Oualkacha, Celia MT Greenwood
doi: https://doi.org/10.1101/408484
Sahir R Bhatnagar
1 Department of Epidemiology, Biostatistics and Occupational Health, McGill University
2 Lady Davis Institute, Jewish General Hospital, Montréal, QC
  • For correspondence: sahir.bhatnagar@mcgill.ca
Yi Yang
4 Department of Mathematics and Statistics, McGill University
Tianyuan Lu
2 Lady Davis Institute, Jewish General Hospital, Montréal, QC
Erwin Schurr
6 Department of Medicine, McGill University
JC Loredo-Osti
7 Department of Mathematics and Statistics, Memorial University
Marie Forest
2 Lady Davis Institute, Jewish General Hospital, Montréal, QC
Karim Oualkacha
3 Département de Mathématiques, Université du Québec à Montréal
Celia MT Greenwood
1 Department of Epidemiology, Biostatistics and Occupational Health, McGill University
2 Lady Davis Institute, Jewish General Hospital, Montréal, QC
5 Departments of Oncology and Human Genetics, McGill University

Abstract

Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMMs) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. Two-stage approaches, in which the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model, can produce false positives or false negatives. To overcome these challenges, we develop a general penalized LMM with a single random effect, called ggmix, for simultaneous SNP selection and adjustment for population structure in high-dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models, with better prediction accuracy, than the two-stage approach or principal component adjustment. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are available in an R package (https://github.com/greenwoodlab/ggmix).
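
To make the idea of a penalized LMM with a single random effect more concrete, the following is a minimal Python sketch, not the ggmix implementation (which is an R package). It assumes the variance ratio eta is known and fixed, whereas the abstract describes an algorithm that also tunes parameters automatically. It relies on a standard device for LMMs: rotating the response and the SNP matrix by the eigenvectors of the kinship matrix makes the observations uncorrelated, so the problem becomes a weighted lasso that coordinate descent can solve. The function names, the toy data, and the values of eta and lam are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def rotate_by_kinship(y, X, K):
    """Rotate y and X by the eigenvectors of the kinship matrix K.

    If Cov(y) is proportional to eta*K + (1 - eta)*I, the rotated
    observations are uncorrelated with variances 1 + eta*(d_i - 1).
    """
    d, U = np.linalg.eigh(K)
    d = np.clip(d, 0.0, None)  # guard against tiny negative eigenvalues
    return U.T @ y, U.T @ X, d

def weighted_lasso_cd(X, y, w, lam, n_iter=200, tol=1e-8):
    """Minimise (1/2n) * sum_i w_i (y_i - x_i'b)^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    col_norm = (w[:, None] * X**2).sum(axis=0) / n
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_norm[j] == 0.0:
                continue
            # Weighted correlation of column j with the partial residual
            rho = (w * X[:, j] * resid).sum() / n + col_norm[j] * beta[j]
            new = soft_threshold(rho, lam) / col_norm[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

# Toy data (hypothetical): 100 subjects, 300 SNPs, 3 causal SNPs
rng = np.random.default_rng(1)
n, p = 100, 300
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)
K = X @ X.T / p                      # simple genotype-based kinship estimate
beta_true = np.zeros(p)
beta_true[:3] = 1.0
y = X @ beta_true + rng.normal(size=n)

eta = 0.5                            # assumed known here, for illustration only
y_rot, X_rot, d = rotate_by_kinship(y, X, K)
w = 1.0 / (1.0 + eta * (d - 1.0))    # inverse-variance weights after rotation
beta_hat = weighted_lasso_cd(X_rot, y_rot, w, lam=0.2)
print("selected SNP indices:", np.flatnonzero(beta_hat))
```

In this sketch the selection and the adjustment for relatedness happen in one fit, in contrast to a two-stage approach that first removes the relatedness signal and then runs the lasso on residuals. A fuller version would also update eta and the residual variance in alternation with the coefficients.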

Author Summary

This work addresses a recurring challenge in the analysis and interpretation of genetic association studies: which genetic variants best predict, and are independently associated with, a given phenotype in the presence of population structure? Failing to control for confounding due to geographic population structure, family and/or cryptic relatedness can lead to spurious associations. Much of the existing research has therefore focused on modeling the association between a phenotype and a single genetic variant in a linear mixed model with a random effect. However, this univariate approach may miss true associations because of the stringent significance thresholds required to reduce the number of false positives, and it ignores the correlations between markers. We propose an alternative method for fitting high-dimensional multivariable models, which selects SNPs that are independently associated with the phenotype while also accounting for population structure. We provide an efficient implementation of our algorithm and show through simulation studies and real data examples that our method outperforms existing methods in terms of prediction accuracy and control of the false discovery rate.

Footnotes

  • This revision contains results from the UK Biobank.

  • https://sahirbhatnagar.com/ggmix/

  • 1 Scripts available at https://github.com/sahirbhatnagar/ggmix/tree/pgen/manuscript

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted December 14, 2019.

Subject Area

  • Bioinformatics