bioRxiv
New Results

prewas: Data pre-processing for more informative bacterial GWAS

Katie Saund, Zena Lapp, Stephanie N. Thiede, Ali Pirani, Evan S. Snitkin
doi: https://doi.org/10.1101/2019.12.20.873158
Katie Saund
1 Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan
Zena Lapp
2 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan
Stephanie N. Thiede
1 Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan
Ali Pirani
1 Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan
Evan S. Snitkin
1 Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan
3 Department of Internal Medicine/Division of Infectious Diseases, University of Michigan, Ann Arbor, Michigan
For correspondence: esnitkin@med.umich.edu

ABSTRACT

While variant identification pipelines are becoming increasingly standardized, less attention has been paid to the pre-processing of variants prior to their use in bacterial genome-wide association studies (bGWAS). Three nuances of variant pre-processing that affect the downstream identification of genetic associations are the separation of variants at multiallelic sites, the separation of variants in overlapping genes, and the referencing of variants relative to ancestral alleles. Here we demonstrate the importance of these variant pre-processing steps on diverse bacterial genomic datasets and present prewas, an R package that standardizes the pre-processing of multiallelic sites, overlapping genes, and reference alleles before bGWAS. This package facilitates improved reproducibility and interpretability of bGWAS results. Prewas enables users to extract maximal information from bGWAS by implementing multi-line representation for multiallelic sites and for variants in overlapping genes. Prewas outputs a binary SNP matrix that can be used for SNP-based bGWAS and that prevents the masking of minor alleles during bGWAS analysis. The optional binary gene matrix output can be used for gene-based bGWAS, which enables users to maximize the power and evolutionary interpretability of their bGWAS. Prewas is available for download from GitHub.
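
To make the idea of multi-line representation concrete, here is a minimal sketch in Python of how a multiallelic site might be split into one binary row per alternate allele. It is an illustration under stated assumptions, not the prewas implementation: the function name `split_multiallelic`, the row-label format, and the toy sample calls are all hypothetical, and the reference allele is simply passed in by the caller.

```python
from typing import Dict


def split_multiallelic(site: str, ref: str, calls: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Return one binary row per non-reference allele observed at a site.

    `calls` maps sample name -> observed allele. Each output row is keyed by
    a hypothetical label "<site>|<ref>><alt>" and holds 1 for samples that
    carry that specific alternate allele and 0 for all other samples.
    """
    alt_alleles = sorted({allele for allele in calls.values() if allele != ref})
    return {
        f"{site}|{ref}>{alt}": {
            sample: int(allele == alt) for sample, allele in calls.items()
        }
        for alt in alt_alleles
    }


# Toy triallelic site: reference A, with T and G each seen in one sample.
calls = {"s1": "A", "s2": "T", "s3": "G", "s4": "A"}
rows = split_multiallelic("pos100", "A", calls)

for label, row in rows.items():
    print(label, row)
```

In this sketch the T and G alleles each get their own row, so neither is folded into a single "non-reference" category where a rarer allele could no longer be distinguished from a more common one.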

DATA SUMMARY

  1. prewas is available from GitHub under the MIT License (URL: https://github.com/Snitkin-Lab-Umich/prewas) and can be installed using the command devtools::install_github("Snitkin-Lab-Umich/prewas")

  2. Code to perform analyses is available from GitHub under the MIT License (URL: https://github.com/Snitkin-Lab-Umich/prewas_manuscript_analysis)

  3. All genomes are publicly available on NCBI (see Table S1 for more details)

IMPACT STATEMENT

Between variant calling and performing bacterial genome-wide association studies (bGWAS), there are many decisions about the processing of variants that can affect bGWAS results. We discuss the benefits and drawbacks of various variant pre-processing decisions and present the R package prewas to standardize single nucleotide polymorphism (SNP) pre-processing, specifically to incorporate multiallelic sites and to prepare the data for gene-based analyses. We demonstrate the importance of these considerations by highlighting the prevalence of multiallelic sites and of SNPs in overlapping genes within diverse bacterial genomes, and the impact of reference allele choice on gene-based analyses.
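
As a rough illustration of why SNPs in overlapping genes matter for gene-based analyses, the following Python sketch assigns a SNP to every gene it falls in and then collapses the SNP rows into a binary gene matrix. It is a sketch under assumptions rather than the package's method: the function `snps_to_gene_matrix`, the SNP and gene names, and the rule that a gene is scored 1 when a sample carries any variant in it are all hypothetical choices made for the example.

```python
from typing import Dict, List


def snps_to_gene_matrix(
    snp_rows: Dict[str, Dict[str, int]],
    snp_genes: Dict[str, List[str]],
) -> Dict[str, Dict[str, int]]:
    """Collapse binary SNP rows into binary gene rows.

    `snp_rows` maps SNP label -> {sample: 0/1}; every SNP is assumed to
    cover the same set of samples. `snp_genes` maps SNP label -> list of
    genes containing that SNP (more than one when genes overlap).
    """
    gene_rows: Dict[str, Dict[str, int]] = {}
    for snp, genes in snp_genes.items():
        for gene in genes:  # one SNP-gene pair per overlapping gene
            row = gene_rows.setdefault(gene, {s: 0 for s in snp_rows[snp]})
            for sample, value in snp_rows[snp].items():
                # Assumed rule: a gene is 1 if the sample has any variant in it.
                row[sample] = max(row[sample], value)
    return gene_rows


# Toy data: snp1 lies in the overlap of geneA and geneB, snp2 only in geneB.
snp_rows = {
    "snp1": {"s1": 1, "s2": 0, "s3": 0},
    "snp2": {"s1": 0, "s2": 1, "s3": 0},
}
snp_genes = {"snp1": ["geneA", "geneB"], "snp2": ["geneB"]}
gene_matrix = snps_to_gene_matrix(snp_rows, snp_genes)

for gene, row in gene_matrix.items():
    print(gene, row)
```

If snp1 were assigned to only one of the two overlapping genes, the other gene would lose the variant carried by sample s1, which is the kind of information loss that a multi-line representation is meant to avoid.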

Copyright 
The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Posted December 20, 2019.

Subject Area

  • Bioinformatics