The variant call format provides efficient and robust storage of GWAS summary statistics

Matthew Lyon; Shea J Andrews; Ben Elsworth; Tom R Gaunt; Gibran Hemani; Edoardo Marcora

doi:10.1101/2020.05.29.115824

Abstract

Genome-wide association study (GWAS) summary statistics are a fundamental resource for a variety of research applications ^1–6. Yet despite their widespread utility, no common storage format has been widely adopted, hindering tool development and data sharing, analysis and integration. Existing tabular formats ^7,8 often store information about genetic variants and their associations ambiguously or incompletely, and also lack essential metadata, increasing the possibility of errors in data interpretation and post-GWAS analyses. Additionally, data in these formats are typically not indexed, requiring the whole file to be read, which is computationally inefficient. To address these issues, we propose an adaptation of the variant call format⁹ (GWAS-VCF) and have produced a suite of open-source tools for using this format in downstream analyses. Simulation studies show that GWAS-VCF is 9-46x faster than tabular alternatives when extracting variant(s) by genomic position. Our results demonstrate that GWAS-VCF provides a robust and performant solution for sharing, analysis and integration of GWAS data. We provide open access to over 10,000 complete GWAS summary datasets converted to this format (available from: https://gwas.mrcieu.ac.uk).

Main

The GWAS is a powerful tool for identifying genetic loci associated with any trait, including diseases and clinical biomarkers, as well as non-clinical and molecular phenotypes such as height and gene expression ³ (eQTLs). Sharing of GWAS results as summary statistics (i.e. variant, effect size, standard error, p-value etc.) has enabled a range of important secondary research applications including: causal gene and functional variant prioritisation ¹, causal cell/tissue type nomination ², pathway analysis ³, causal inference (Mendelian randomization; MR) ⁴, risk prediction ³, genetic correlation ⁵ and heritability estimation ⁶. However, the utility of GWAS summary statistics is hampered by the absence of a universally adopted storage format and associated tools.

The historic lack of a common standard has resulted in GWAS analysis tools outputting summary statistics in different tabular formats (e.g. plink ¹⁰, GCTA ¹¹, BOLT-LMM ¹², GEMMA ¹³, Matrix eQTL ¹⁴ and meta-analysis tools e.g. METAL ¹⁵). As a consequence, various processing issues are typically encountered during secondary analysis. First, there is often inconsistency and ambiguity about which allele relates to the effect size estimate (the “effect” allele). Confusion over the effect allele can have disastrous consequences for the interpretation of GWAS findings and the validity of post-GWAS analyses. For example, MR studies may provide causal estimates with incorrect effect directionality ¹⁶. Likewise, prediction models based on polygenic risk scores might predict disease wrongly or suffer reduced power if some of the effect directionalities are incorrect. Second, the schema (i.e. which columns/fields are included and how they are named) of these tabular formats varies greatly. Absent fields can limit analyses, and although approaches exist to estimate the values of some of these missing columns (e.g. standard error from P value), imprecision is introduced, reducing subsequent test power. Varying field names are easily addressed in principle, but the process can be cumbersome and error-prone. Third, data are frequently distributed with no or insufficient metadata describing the study, trait(s), and variants (e.g., trait measurement units, variant id/annotation sources, etc.), which can lead to errors, impede integration of results from different studies and hamper reproducibility. Fourth, querying unindexed text files is slow and memory inefficient, making some potential applications computationally infeasible (e.g. systematic hypothesis-free analyses).
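
As an illustration of the second issue, the following minimal Python sketch recovers a standard error from an effect size and a P value. It assumes the P value came from a two-sided Wald (z) test, which a given GWAS tool may or may not have used; the function name `se_from_beta_and_p` is hypothetical and is not part of any tool described here.

```python
from statistics import NormalDist


def se_from_beta_and_p(beta: float, p_value: float) -> float:
    """Approximate the standard error from an effect size and a two-sided P value.

    Assumption: the P value comes from a two-sided Wald (z) test,
    so that |z| = |beta| / SE.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    if beta == 0.0:
        raise ValueError("SE cannot be recovered when beta is exactly zero")
    # The lower-tail quantile of p/2 is -|z|; negate it to obtain |z|.
    z = -NormalDist().inv_cdf(p_value / 2.0)
    return abs(beta) / z


# A P value reported to a limited number of digits gives only an approximate SE.
approx_se = se_from_beta_and_p(beta=0.1, p_value=0.05)
```

Because published P values are usually rounded, and the underlying test is not always a z test, the recovered value is an approximation. This is the imprecision referred to above.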

Some proposals for a standard tabular format have been made. The NHGRI-EBI GWAS Catalog (www.ebi.ac.uk/gwas) developed a tab-separated values (TSV) text format with a minimal set of required (and optional) columns along with standardised headings ⁷. The SMR tool ⁸ introduced a binary format for rapid querying of quantitative trait loci. These approaches are adequate for storing variant-level summary statistics but do not enforce allele consistency or support embedding of essential metadata. Learning from these examples and our experiences performing high-throughput analyses across two research centres, we developed a set of requirements for a suitable universal format (Table 1). These features place emphasis on consistency and robustness, capacity for metadata to provide a full audit trail, efficient querying and file storage, ensuring data integrity, and interoperability with existing open-source tools and across multiple datasets to support data sharing and integration. We determined that adapting the variant call format (VCF) ⁹ was a convenient and constructive solution to address these issues. We provide evidence demonstrating how the VCF meets our requirements and showcase the capabilities of this medium (Table 1).

Table 1.

Requirements for a summary statistics storage format and solutions offered by the VCF

The VCF is organised into three components: a flexible file header containing metadata (lines beginning with ‘#’) and a file body containing variant-level (one locus per row with one or more alternative alleles/variants) and sample-level information (one sample per column). We adapt this format to include GWAS-specific metadata and utilise the sample column to store variant-trait association data (Figure 1; Supplementary Table 1).

Supplementary Table 1.

Data fields in the GWAS-VCF

Figure 1.

VCF format adapted to store GWAS summary statistics (GWAS-VCF)

Figure 2.

Performance comparison for querying summary statistics in plain text and GWAS-VCF

Mean query time (log milliseconds [lower is quicker]; repetitions n=100) to extract either: a single variant using the chromosome position or dbSNP ²⁰ identifier or multiple variants using a 1 Mb interval or association P value. AWK, grep, bcftools ²⁴ and rsidx ²³ were evaluated using uncompressed and GZIP/BGZIP ²⁴ compressed unindexed text and VCF. Error bars represent the 95% confidence interval.

According to the VCF specification, the file header consists of metadata lines containing 1) the specification version number, 2) information about the reference genome assembly and contigs, and 3) information (ID, number, type, description, source and version) about the fields used to describe variants and samples (or variant-trait associations in the case of GWAS-VCF) in the file body. We take advantage of the VCF file header to store additional information about the GWAS including 1) source and date of summary statistics, 2) study IDs (e.g., PMID/DOI of publication describing the study, or accession number and repository of individual-level data), 3) description of the trait(s) studied (e.g., type, association test used, sample size, ancestry and measurement unit) as well as the source and version of trait IDs (e.g., Experimental Factor Ontology ¹⁷, Human Phenotyping Ontology ¹⁸ or Medical Subject Headings ¹⁹ IDs for clinical and other traits, or Ensembl Gene IDs for eQTL datasets).

Unlike VCF where a row can contain information about multiple alternative alleles observed at the same site/locus (and thus may store more than one variant), the GWAS-VCF specification requires that each variant is stored in a separate row of the file body. Each row contains eight mandatory fields: chromosome name (CHROM), base-pair position (POS), unique variant identifier (ID), reference/non-effect allele (REF), alternative/effect allele (ALT), quality (QUAL), filter (FILTER) and variant information (INFO). The ID, QUAL and FILTER fields can contain a null value represented by a dot. Importantly, the ID value (unless null) should not be present in more than one row. The FILTER field may be used to flag poor quality variants for exclusion in downstream analyses. The INFO column is a flexible data store for additional variant-level key-value pairs (fields) and may be used to store for example: population frequency (AF), genomic annotations and variant functional effects. We also use the INFO field to store the dbSNP ²⁰ locus identifier (rsid) for the site at which the variant resides. This is because (despite their common usage as variant identifiers) rsids uniquely identify loci (not variants!) and thus cannot be used in the ID field, as we will discuss further at the end of this manuscript. Following the INFO column is a format field (FORMAT) and one or more sample columns which we use to store variant-trait association data, with values for the fields listed in the FORMAT column for example: effect size (ES), standard error (SE) and −log10 P-value (LP).
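
The row layout described above can be sketched with a small parser. This is a minimal illustration using only the Python standard library, not the reader shipped with our libraries. The example row, the sample name `trait_1` and the assumption that every per-trait value is numeric are hypothetical choices made for the sketch.

```python
from typing import Dict, List, Optional


def parse_gwas_vcf_row(line: str, sample_names: List[str]) -> Dict[str, object]:
    """Split one tab-delimited GWAS-VCF body row into its named fields."""
    columns = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = columns[:9]

    # INFO holds semicolon-separated key=value pairs; keys without a value are flags.
    info_fields: Dict[str, object] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_fields[key] = value if value else True

    # FORMAT names the colon-separated values found in each sample (trait) column.
    keys = fmt.split(":")
    samples: Dict[str, Dict[str, Optional[float]]] = {}
    for name, column in zip(sample_names, columns[9:]):
        values = column.split(":")
        # Assumption for this sketch: all values are numeric, "." means missing.
        samples[name] = {k: (None if v == "." else float(v)) for k, v in zip(keys, values)}

    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": None if vid == "." else vid,
        "REF": ref,   # reference / non-effect allele
        "ALT": alt,   # alternative / effect allele (one per row in GWAS-VCF)
        "QUAL": None if qual == "." else qual,
        "FILTER": None if flt == "." else flt,
        "INFO": info_fields,
        "samples": samples,
    }


# Hypothetical row with one trait column holding ES, SE and LP.
example_row = "1\t49298\t.\tT\tC\t.\tPASS\tAF=0.62\tES:SE:LP\t-0.0102:0.0034:2.57"
record = parse_gwas_vcf_row(example_row, ["trait_1"])
```

In practice an established VCF parser would normally be preferred; the sketch only makes the mapping between columns and association fields explicit.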

This format has a number of advantages over existing solutions. First, the VCF provides consistent and robust approaches to storing genetic variants, annotations and metadata. Furthermore, variable type and number requirements reduce parsing errors and missing data and prevent unexpected program operation. Second, the VCF is well established and supported by existing tools providing a range of functions for querying, annotating, transforming and analysing genetic data. Third, the GWAS-VCF file header stores comprehensive metadata about the GWAS. Fourth, a GWAS-VCF file can store individual or multiple traits (in one or more sample columns) in a single file which is beneficial for the distribution of GWAS datasets where genotypes of each sample/individual have been tested for association with multiple traits (e.g., eQTL datasets).

Simulations of query performance demonstrate that compressed GWAS-VCF is substantially quicker than unindexed and uncompressed TSV format for querying by genomic position. On average, GWAS-VCF was 16x faster to extract a single variant using chromosome position (mean query duration in GWAS-VCF 0.08 seconds [95% CI 0.08, 0.08] vs mean query duration in TSV 1.29 seconds [95% CI 1.29, 1.30]) and 9x quicker using the rsid (0.09 seconds [95% CI 0.09, 0.09] vs 0.81 seconds [95% CI 0.80, 0.82]). Using a 1 Mb window of variants, GWAS-VCF was 46x quicker (0.11 seconds [95% CI 0.11, 0.11] vs 5.02 seconds [95% CI 4.99, 5.04]). Although querying on association P value was faster using TSV (mean query duration in TSV 7.18 seconds [95% CI 7.09, 7.26] vs mean query duration in GWAS-VCF 18.04 seconds [95% CI 17.92, 18.16]), GWAS-VCF could be improved by using variant flags (i.e. in the INFO field) to highlight records below prespecified thresholds if the exact value is unimportant, for example all variants below genome-wide significance (P < 5e-8) or a more relaxed threshold (e.g. P < 5e-5).
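
The threshold-flag idea can be sketched as follows. Because GWAS-VCF stores −log10 P (LP), a P value threshold is first converted to the LP scale and each record is then tagged with the thresholds it passes. The flag names `GENOME_WIDE` and `SUGGESTIVE` are hypothetical and are not defined by the specification.

```python
import math
from typing import Dict, List

# Hypothetical flag names mapped to the P value thresholds mentioned in the text.
P_THRESHOLDS: Dict[str, float] = {"GENOME_WIDE": 5e-8, "SUGGESTIVE": 5e-5}


def p_to_lp(p_value: float) -> float:
    """Convert a P value to the -log10 scale used by the LP field."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -math.log10(p_value)


def threshold_flags(lp: float) -> List[str]:
    """Return the names of all thresholds that a record with this LP passes.

    P < threshold is equivalent to LP > -log10(threshold).
    """
    return [name for name, p in P_THRESHOLDS.items() if lp > p_to_lp(p)]


genome_wide_lp = p_to_lp(5e-8)  # roughly 7.30 on the LP scale
```

With flags written once at conversion time, a later query only needs to test for the presence of a flag rather than parse and compare the LP value of every record.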

To automate the conversion of existing summary statistics files to the GWAS-VCF format, we developed open-source Python3 software (Gwas2VCF; Table 2). The application reads in metadata and variant-trait association data using a user-defined schema. During processing, variants are harmonised using a supplied reference genome file to ensure the non-effect allele matches the reference sequence enabling consistent directionality of allelic effects across studies. Insertion-deletion variants are left-aligned and trimmed for consistent representation using the vgraph library ²¹. Finally, the GWAS-VCF is indexed using tabix ²² and rsidx ²³ which enable rapid queries by genomic position and rsid, respectively. We have developed a freely available web application providing a user-friendly interface for this implementation and encourage other centres to deploy their own instance (Table 2).
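
The allele harmonisation step can be illustrated with a short sketch. This is not the Gwas2VCF implementation. It is a simplified version that assumes the caller has already looked up the reference sequence at the variant position, and it does not attempt strand flips or indel normalisation. The `Association` class and `harmonise` function are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Association:
    chrom: str
    pos: int
    non_effect_allele: str
    effect_allele: str
    effect_size: float
    effect_allele_freq: Optional[float] = None


def harmonise(assoc: Association, reference_allele: str) -> Optional[Association]:
    """Make the non-effect allele match the reference sequence.

    Returns the record unchanged if it already matches, a flipped record if
    the effect allele is the reference allele, or None if neither allele
    matches (such records would need separate handling).
    """
    ref = reference_allele.upper()
    if assoc.non_effect_allele.upper() == ref:
        return assoc
    if assoc.effect_allele.upper() == ref:
        flipped_freq = (
            None if assoc.effect_allele_freq is None else 1.0 - assoc.effect_allele_freq
        )
        # Swapping the alleles reverses the sign of the effect and
        # complements the effect allele frequency.
        return replace(
            assoc,
            non_effect_allele=assoc.effect_allele,
            effect_allele=assoc.non_effect_allele,
            effect_size=-assoc.effect_size,
            effect_allele_freq=flipped_freq,
        )
    return None


reported = Association("1", 100, non_effect_allele="C", effect_allele="T",
                       effect_size=0.2, effect_allele_freq=0.3)
harmonised = harmonise(reported, reference_allele="T")
```

After this step every retained record reports the effect of the non-reference allele, so effect directions are comparable across studies harmonised against the same genome build.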

Table 2.

Open-source tools for working with GWAS-VCF

Once stored in a GWAS-VCF file, summary statistics can be read and queried using R or Python programming languages with our open-source libraries (Table 2) or from the command line using for example: bcftools ²⁴, GATK ²⁵ or bedtools ²⁶. Alternatively, GWAS-VCF may be converted to NHGRI-EBI format ²⁷ or any other tabular format to support incompatible tools. Further, the gwasglue R package provides convenient programming functions to automate preparation of genetic association data for a range of downstream analyses (Table 2). Currently, methods exist for streamlining variant fine-mapping ^28–32, colocalization ³³, MR ³⁴ and data visualisation ³⁵. New methods are being actively added and users may request new features via the repository issues page.

To encourage adoption, we made openly available over 10,000 complete GWAS summary statistics in GWAS-VCF format as part of the IEU OpenGWAS database. These studies include a broad range of traits, diseases and molecular phenotypes building on the initial collection for the MR Base platform ³⁴.

A limitation of current summary statistics formats, including GWAS-VCF, is the lack of a widely adopted and stable representation of sequence variants that can be used as a universal unique identifier for said variants. Published summary statistics often use rsids ²⁰ to identify variants, but this practice is inappropriate because rsids are locus identifiers and do not distinguish between multiple alternative alleles observed at the same site. Moreover, rsids are not stable as they can be merged and retired over time. This is a problem because in GWAS summary statistics every record represents the effect of a specific allele on one or more traits, and a record identifier that is not unique for each allelic substitution cannot technically be considered an identifier. An alternative approach is to concatenate chromosome, base-position, reference and alternative allele field values into a single string, but this is non-standardised and genome build specific. Worse still is the common approach of mixing these types of identifiers within a single file. In version 1.1 of the GWAS-VCF specification we suggest querying variants by chromosome and base-position and filtering the output to retain the target substitution (implemented in our parsers), but we acknowledge that this approach can be cumbersome and difficult to interoperate with other software. The ideal solution would be to populate the ID column of a GWAS-VCF file using universally accepted and unique variant identifiers. We have reviewed several existing variant identifier formats as candidates for the variant identifier field, to be implemented in the next version of the specification (Supplementary Table 2). However, we refrain from making a unilateral choice at this juncture because successful implementation will require consultation with a range of stakeholders. The genetics community already uses different approaches to deal with the problem of sequence variant representation and there is a need to coalesce upon a single format.
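
The two-step lookup suggested in version 1.1 of the specification (select by chromosome and base-position, then keep only the target substitution) can be sketched as below. The sketch operates on an in-memory list for clarity, whereas a real query would obtain the first step from the positional index. The `Variant` type and `select_substitution` function are hypothetical names, not the interface of our parsers.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    effect_size: float


def select_substitution(
    records: Iterable[Variant], chrom: str, pos: int, ref: str, alt: str
) -> List[Variant]:
    """Select records at a site, then keep only the requested allelic substitution."""
    # Step 1: positional query (served by the index in an indexed file).
    at_site = [r for r in records if r.chrom == chrom and r.pos == pos]
    # Step 2: filter to the target reference/alternative allele pair.
    return [r for r in at_site if r.ref == ref and r.alt == alt]


# Two different substitutions at the same site share a locus but not an effect.
records = [
    Variant("1", 1000, "A", "G", 0.05),
    Variant("1", 1000, "A", "T", -0.12),
    Variant("1", 2000, "C", "T", 0.01),
]
target = select_substitution(records, "1", 1000, "A", "T")
```

A locus-level identifier such as an rsid would return both records at position 1000, which is why the second filtering step is needed.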

Supplementary Table 2.

Possible variant identifier schemes for the ID column of GWAS-VCF

Here we present an adaptation of the VCF specification for GWAS summary statistics storage that is amenable to high-throughput analyses and robust data sharing and integration. We implement open-source tools to convert existing summary statistics formats to GWAS-VCF, and libraries for reading or querying this format and integrating with existing analysis tools. Finally, we provide complete GWAS summary statistics for over 10,000 traits in GWAS-VCF. These resources enable convenient and efficient secondary analyses of GWAS summary statistics and support future tool development.

Code availability

Open-source code for the query performance evaluation is available from GitHub (https://github.com/MRCIEU/gwas-vcf-performance), and a pre-built image is available from DockerHub (mrcieu/gwas-vcf-performance).

Data availability

Version 1.1 of the GWAS-VCF format specification is available from: https://github.com/MRCIEU/gwas-vcf-spec/releases/tag/1.1

Full summary statistics for over 10,000 GWAS in VCF format are available from the IEU OpenGWAS Database (https://gwas.mrcieu.ac.uk)

Method

Specification

The specification was developed through experience of collecting and harmonising GWAS summary data across two research centres at scale ³⁴ and performing a range of representative high throughput analyses on these data (for example LD score regression ³⁶, MR ³⁷, genetic colocalisation analysis ³⁸ and polygenic risk scores ³⁹).

Query performance simulation

Densely imputed summary statistics (13,791,467 variants) for a large GWAS of body mass index were obtained from Neale et al ⁴⁰. The data were mapped to VCF using Gwas2VCF v1.1.1 and processed using bcftools v1.10 ²⁴ to remove multiallelic variants and records with missing dbSNP ²⁰ identifiers. A tabular (unindexed) file was prepared from the VCF to replicate a typical storage medium currently used for distributing summary statistics. Query runtime performance was compared between tabix v1.10.2 ²² and standard UNIX commands under the following conditions: single variant selection using dbSNP identifier ²⁰ or chromosome position, and multi-variant selection by association P value (thresholds: P < 5e-8, 0.2, 0.4, 0.6, 0.8) or 1 Mb genomic interval. Tests were undertaken with 100 repetitions using VCF or unindexed text formats with and without GZIP compression on an Ubuntu v18.04 server with an Intel Xeon(R) 2.0 GHz processor. All comparisons were performed using single-threaded operations and therefore differences in runtime performance were due to tool and/or file index usage.
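
A repeated-timing measurement of this kind can be sketched in a few lines. This is not the evaluation code from the repository listed under Code availability. The sketch times an arbitrary callable and reports the mean with a normal-approximation 95% confidence interval, which is an assumption, since the interval method is not stated here. The `time_query` name and the stand-in workload are hypothetical.

```python
import math
import statistics
import time
from typing import Callable, Tuple


def time_query(query: Callable[[], object], repetitions: int = 100) -> Tuple[float, float, float]:
    """Run a query repeatedly and return (mean, lower, upper) duration in seconds.

    The interval is mean +/- 1.96 standard errors (normal approximation).
    """
    if repetitions < 2:
        raise ValueError("at least two repetitions are needed to estimate the interval")
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        query()
        durations.append(time.perf_counter() - start)
    mean = statistics.fmean(durations)
    standard_error = statistics.stdev(durations) / math.sqrt(repetitions)
    half_width = 1.96 * standard_error
    return mean, mean - half_width, mean + half_width


# Stand-in workload; a real benchmark would invoke the query tool under test.
mean_s, lower_s, upper_s = time_query(lambda: sum(range(10_000)), repetitions=20)
```

Running each query in a single thread, as in the simulations, keeps differences attributable to the tool and index rather than to parallelism.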

Author contributions

All authors contributed to the manuscript and storage format specification. G.H. and E.M. designed the research. M.L. and G.H. wrote software packages and performed query performance simulations. B.E. and G.H. prepared the GWAS data.

Competing interest

T.R.G. receives funding from GlaxoSmithKline and Biogen for unrelated research.

Acknowledgments

This study was funded by the NIHR Biomedical Research Centre at University Hospitals Bristol National Health Service Foundation Trust and the University of Bristol. The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care.

M.L., B.E. and T.R.G. work in the Medical Research Council Integrative Epidemiology Unit at the University of Bristol, which is supported by the Medical Research Council and the University of Bristol (MC_UU_00011/4). G.H. is supported by the Wellcome Trust and Royal Society [208806/Z/17/Z].

E.M. and S.J.A. are supported by the JPB Foundation and by the National Institutes of Health (U01AG052411 and U01AG058635; principal investigator Alison Goate).

Footnotes

https://gwas.mrcieu.ac.uk

References

1. Hou, L. & Zhao, H. A review of post-GWAS prioritization approaches. Front. Genet. 4, 280 (2013).
2. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
3. Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics 101, 5–22 (2017).
4. Smith, G. D. & Ebrahim, S. ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology (2003). doi:10.1093/ije/dyg070
5. Bulik-Sullivan, B. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. (2015). doi:10.1038/ng.3211
6. Yang, J., Zeng, J., Goddard, M. E., Wray, N. R. & Visscher, P. M. Concepts, estimation and interpretation of SNP-based heritability. Nature Genetics 49, 1304–1310 (2017).
7. Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
8. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).
9. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
10. Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. (2007). doi:10.1086/519795
11. Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
12. Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
13. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
14. Shabalin, A. A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353–1358 (2012).
15. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
16. Hartwig, F. P., Davies, N. M., Hemani, G. & Smith, G. D. Two-sample Mendelian randomization: avoiding the downsides of a powerful, widely applicable but potentially fallible technique. Int. J. Epidemiol. 1717–1726 (2016). doi:10.1093/ije/dyx028
17. Malone, J. et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 26, 1112–1118 (2010).
18. Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027 (2019).
19. Medical Subject Headings - Home Page. Available at: https://www.nlm.nih.gov/mesh/meshhome.html. (Accessed: 16th April 2020)
20. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research 29, (2001).
21. bioinformed/vgraph: vgraph is a command line application and Python library to compare genetic variants using variant graphs. “vgraph” utilizes a graph representation of genomic variants in order to precisely compare complex variants that are refractory to comparison by conventional comparison methods. Available at: https://github.com/bioinformed/vgraph. (Accessed: 5th May 2020)
22. Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27, 718–719 (2011).
23. bioforensics/rsidx: Library for indexing VCF files for random access searches by rsID. Available at: https://github.com/bioforensics/rsidx. (Accessed: 5th March 2020)
24. Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–93 (2011).
25. McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. (2010). doi:10.1101/gr.107524.110
26. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
27. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
28. Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. doi:10.1093/bioinformatics/btw018
29. Kichaev, G. et al. Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies. PLoS Genet. 10, e1004722 (2014).
30. Kichaev, G. & Pasaniuc, B. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies. Am. J. Hum. Genet. 97, 260–271 (2015).
31. Kichaev, G. et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics 33, 248–255 (2017).
32. Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci with multiple signals of association. Genetics 198, 497–508 (2014).
33. Wallace, C. Statistical Testing of Shared Genetic Control for Potentially Related Traits. Genet. Epidemiol. 37, 802–813 (2013).
34. Hemani, G. et al. The MR-base platform supports systematic causal inference across the human phenome. Elife 7, (2018).
35. jrs95/gassocplot: Regional association plotter for genetic and epigenetic data. Available at: https://github.com/jrs95/gassocplot. (Accessed: 21st April 2020)
36. Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2017).
37. Hemani, G. et al. Automating Mendelian randomization through machine learning to construct a putative causal map of the human phenome. bioRxiv 173682 (2017). doi:10.1101/173682
38. Richardson, T. G., Hemani, G., Gaunt, T. R., Relton, C. L. & Davey Smith, G. A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome. Nat. Commun. 11, 1–11 (2020).
39. Richardson, T. G., Harrison, S., Hemani, G. & Smith, G. D. An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. Elife 8, (2019).
40. UK Biobank – Neale lab. Available at: http://www.nealelab.is/uk-biobank/. (Accessed: 25th February 2020)
41. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
42. Obenchain, V. et al. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076–2078 (2014).
43. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5, (2004).
44. Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).
45. Bioconductor - Home. Available at: https://www.bioconductor.org/. (Accessed: 27th March 2020)
46. pysam-developers/pysam: Pysam is a Python module for reading and manipulating SAM/BAM/VCF/BCF files. It’s a lightweight wrapper of the htslib C-API, the same one that powers samtools, bcftools, and tabix. Available at: https://github.com/pysam-developers/pysam. (Accessed: 10th March 2020)
47. IEU GWAS database. Available at: https://gwas.mrcieu.ac.uk/. (Accessed: 10th March 2020)
48. broadinstitute/picard: A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. Available at: https://github.com/broadinstitute/picard. (Accessed: 25th February 2020)
49. GenomicsDB/GenomicsDB: Highly performant data storage in C++ for importing, querying and transforming variant data with Java/Spark. Used in gatk4. Available at: https://github.com/GenomicsDB/GenomicsDB. (Accessed: 25th February 2020)
50. Voss, K., Gentry, J. & Auwera, G. Van Der. GATK4 + WDL + Cromwell. F1000Research 6, 4 (2017).
51. Morales, J. et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 19, 21 (2018).
52. den Dunnen, J. T. et al. HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Hum. Mutat. 37, 564–569 (2016).
53. Holmes, J. B., Moyer, E., Phan, L., Maglott, D. & Kattman, B. SPDI: data model for variants and applications at NCBI. doi:10.1093/bioinformatics/btz856
54. Wagner, A. et al. ga4gh/vr-spec: 1.0 GA4GH Approved. (2019). doi:10.5281/ZENODO.3572974

Posted May 30, 2020.

Subject Area

Genetics

Subject Areas

All Articles

Animal Behavior and Cognition (5215)
Biochemistry (11745)
Bioengineering (8752)
Bioinformatics (29200)
Biophysics (14972)
Cancer Biology (12096)
Cell Biology (17411)
Clinical Trials (138)
Developmental Biology (9421)
Ecology (14182)
Epidemiology (2067)
Evolutionary Biology (18308)
Genetics (12245)
Genomics (16803)
Immunology (11869)
Microbiology (28085)
Molecular Biology (11592)
Neuroscience (60969)
Paleontology (451)
Pathology (1871)
Pharmacology and Toxicology (3238)
Physiology (4959)
Plant Biology (10427)
Scientific Communication and Education (1683)
Synthetic Biology (2885)
Systems Biology (7340)
Zoology (1651)

[1] 1.↵
Hou, L. & Zhao, H. A review of post-GWAS prioritization approaches. Front. Genet. 4, 280 (2013).
OpenUrl PubMed

[2] 2.↵
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
OpenUrl CrossRef PubMed

[3] 3.↵
Visscher, P. M. et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics 101, 5–22 (2017).
OpenUrl CrossRef PubMed

[4] 4.↵
Smith, G. D. & Ebrahim, S. ‘Mendelian randomization’: Can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology (2003). doi:10.1093/ije/dyg070
OpenUrl CrossRef PubMed Web of Science

[5] 5.↵
Bulik-Sullivan, B. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. (2015). doi:10.1038/ng.3211
OpenUrl CrossRef PubMed

[6] 6.↵
Yang, J., Zeng, J., Goddard, M. E., Wray, N. R. & Visscher, P. M. Concepts, estimation and interpretation of SNP-based heritability. Nature Genetics 49, 1304–1310 (2017).
OpenUrl CrossRef PubMed

[7] 7.↵
Buniello, A. et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
OpenUrl CrossRef PubMed

[8] 8.↵
Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).
OpenUrl CrossRef PubMed

[9] 9.↵
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
OpenUrl CrossRef PubMed Web of Science

[10] 10.↵
Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. (2007). doi:10.1086/519795
OpenUrl CrossRef PubMed

[11] 11.↵
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82 (2011).
OpenUrl CrossRef PubMed

[12] 12.↵
Loh, P. R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
OpenUrl CrossRef PubMed

[13] 13.↵
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
OpenUrl CrossRef PubMed

[14] 14.↵
Shabalin, A. A. Gene expression Matrix eQTL: ultra fast eQTL analysis via large matrix operations. 28, 1353–1358 (2012).
OpenUrl

[15] 15.↵
Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinforma. Appl. NOTE 26, 2190–2191 (2010).
OpenUrl

[16]
Hartwig, F. P., Davies, N. M., Hemani, G. & Davey Smith, G. Two-sample Mendelian randomization: avoiding the downsides of a powerful, widely applicable but potentially fallible technique. Int. J. Epidemiol. 45, 1717–1726 (2016). doi:10.1093/ije/dyx028

[17]
Malone, J. et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 26, 1112–1118 (2010).

[18]
Köhler, S. et al. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 47, D1018–D1027 (2019).

[19]
Medical Subject Headings - Home Page. Available at: https://www.nlm.nih.gov/mesh/meshhome.html. (Accessed: 16th April 2020)

[20]
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).

[21]
bioinformed/vgraph: vgraph is a command line application and Python library to compare genetic variants using variant graphs. “vgraph” utilizes a graph representation of genomic variants in order to precisely compare complex variants that are refractory to comparison by conventional comparison methods. Available at: https://github.com/bioinformed/vgraph. (Accessed: 5th May 2020)

[22]
Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27, 718–719 (2011).

[23]
bioforensics/rsidx: Library for indexing VCF files for random access searches by rsID. Available at: https://github.com/bioforensics/rsidx. (Accessed: 5th March 2020)

[24]
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

[25]
McKenna, A. et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010). doi:10.1101/gr.107524.110

[26]
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).

[27]
MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).

[28]
Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016). doi:10.1093/bioinformatics/btw018

[29]
Kichaev, G. et al. Integrating Functional Data to Prioritize Causal Variants in Statistical Fine-Mapping Studies. PLoS Genet. 10, e1004722 (2014).

[30]
Kichaev, G. & Pasaniuc, B. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies. Am. J. Hum. Genet. 97, 260–271 (2015).

[31]
Kichaev, G. et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics 33, 248–255 (2017).

[32]
Hormozdiari, F., Kostem, E., Kang, E. Y., Pasaniuc, B. & Eskin, E. Identifying causal variants at loci with multiple signals of association. Genetics 198, 497–508 (2014).

[33]
Wallace, C. Statistical Testing of Shared Genetic Control for Potentially Related Traits. Genet. Epidemiol. 37, 802–813 (2013).

[34]
Hemani, G. et al. The MR-base platform supports systematic causal inference across the human phenome. eLife 7, e34408 (2018).

[35]
jrs95/gassocplot: Regional association plotter for genetic and epigenetic data. Available at: https://github.com/jrs95/gassocplot. (Accessed: 21st April 2020)

[36]
Zheng, J. et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics 33, 272–279 (2017).

[37]
Hemani, G. et al. Automating Mendelian randomization through machine learning to construct a putative causal map of the human phenome. bioRxiv 173682 (2017). doi:10.1101/173682

[38]
Richardson, T. G., Hemani, G., Gaunt, T. R., Relton, C. L. & Davey Smith, G. A transcriptome-wide Mendelian randomization study to uncover tissue-dependent regulatory mechanisms across the human phenome. Nat. Commun. 11, 1–11 (2020).

[39]
Richardson, T. G., Harrison, S., Hemani, G. & Davey Smith, G. An atlas of polygenic risk score associations to highlight putative causal relationships across the human phenome. eLife 8, e43657 (2019).

[40]
UK Biobank – Neale lab. Available at: http://www.nealelab.is/uk-biobank/. (Accessed: 25th February 2020)

[41]
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

[42]
Obenchain, V. et al. VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants. Bioinformatics 30, 2076–2078 (2014).

[43]
Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).

[44]
Huber, W. et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods 12, 115–121 (2015).

[45]
Bioconductor - Home. Available at: https://www.bioconductor.org/. (Accessed: 27th March 2020)

[46]
pysam-developers/pysam: Pysam is a Python module for reading and manipulating SAM/BAM/VCF/BCF files. It’s a lightweight wrapper of the htslib C-API, the same one that powers samtools, bcftools, and tabix. Available at: https://github.com/pysam-developers/pysam. (Accessed: 10th March 2020)

[47]
IEU GWAS database. Available at: https://gwas.mrcieu.ac.uk/. (Accessed: 10th March 2020)

[48]
broadinstitute/picard: A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. Available at: https://github.com/broadinstitute/picard. (Accessed: 25th February 2020)

[49]
GenomicsDB/GenomicsDB: Highly performant data storage in C++ for importing, querying and transforming variant data with Java/Spark. Used in gatk4. Available at: https://github.com/GenomicsDB/GenomicsDB. (Accessed: 25th February 2020)

[50]
Voss, K., Gentry, J. & Auwera, G. Van Der. GATK4 + WDL + Cromwell. F1000Research 6, 4 (2017).

[51]
Morales, J. et al. A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol. 19, 21 (2018).

[52]
den Dunnen, J. T. et al. HGVS Recommendations for the Description of Sequence Variants: 2016 Update. Hum. Mutat. 37, 564–569 (2016).

[53]
Holmes, J. B., Moyer, E., Phan, L., Maglott, D. & Kattman, B. SPDI: data model for variants and applications at NCBI. doi:10.1093/bioinformatics/btz856

[54]
Wagner, A. et al. ga4gh/vr-spec: 1.0 GA4GH Approved. (2019). doi:10.5281/ZENODO.3572974
