Abstract
Large-scale next-generation sequencing datasets have been transformative both for informing clinical variant interpretation and as reference panels for statistical and population genetic efforts. While such resources are often treated as ground truth, we find that in widely used reference datasets such as the Genome Aggregation Database (gnomAD), some variants pass gold-standard filters yet differ systematically in their genotype calls across genotype discovery approaches. Including such discordant sites in study designs that involve multiple genotype discovery strategies could bias results and produce false-positive hits in association studies that reflect technological artifacts rather than a true relationship to the phenotype. Here, we describe this phenomenon of discordant genotype calls across genotype discovery approaches, characterize the error modes of the incorrect calls, provide a blacklist of discordant sites identified in gnomAD that should be treated with caution in analyses, and present a metric and a machine learning classifier trained on gnomAD data to identify likely discordant variants in other datasets. We find that different genotype discovery approaches are affected at different sets of variants, but that characteristic variant features can be used to predict discordant behavior. Discordant sites are largely shared across ancestry groups, though different populations are powered for discovery of different variants. The most common error mode is a variant being called heterozygous under one approach and homozygous under the other, with calls that are heterozygous in the genomes and homozygous reference in the exomes making up the majority of miscalls.
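As an illustration of what an error mode means here, the following minimal sketch tallies disagreeing genotype pairs at a single site. It assumes, hypothetically, that calls from two approaches are available for the same individuals and are coded as alternate-allele counts. The function names and the toy data are illustrative only; this is not the metric or the classifier described in the manuscript.

```python
from collections import Counter

# Assumed coding (hypothetical): alternate-allele count per individual,
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate,
# None = missing call.
LABELS = {0: "hom_ref", 1: "het", 2: "hom_alt"}


def tally_error_modes(genome_calls, exome_calls):
    """Count disagreeing (genome, exome) genotype pairs at one site.

    Returns the Counter of disagreeing pairs and the number of
    individuals with a non-missing call under both approaches.
    """
    modes = Counter()
    compared = 0
    for g, e in zip(genome_calls, exome_calls):
        if g is None or e is None:
            continue  # skip individuals missing a call in either approach
        compared += 1
        if g != e:
            modes[(LABELS[g], LABELS[e])] += 1
    return modes, compared


def discordance_rate(genome_calls, exome_calls):
    """Fraction of compared individuals whose two calls disagree."""
    modes, compared = tally_error_modes(genome_calls, exome_calls)
    return sum(modes.values()) / compared if compared else float("nan")


# Toy data for six individuals at one site (invented for illustration).
genome = [1, 1, 0, 2, 1, None]
exome = [0, 0, 0, 2, 1, 1]

modes, compared = tally_error_modes(genome, exome)
rate = discordance_rate(genome, exome)
print(modes.most_common(), compared, rate)
```

In this toy example the only error mode observed is heterozygous in the genomes and homozygous reference in the exomes, the pattern reported above as the most common one.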
Competing Interest Statement
M.J.D. is a founder of Maze Therapeutics. B.M.N. is a member of the Deep Genomics Scientific Advisory Board and serves as a consultant for the Camp4 Therapeutics Corporation, Takeda Pharmaceutical and Biogen. K.J.K. is a consultant for Vor Biopharma. The remaining authors declare no competing interests.
Footnotes
This version of the manuscript has been revised to improve the clarity and precision of the text and figures and, notably, to add another large-scale dataset to our comparisons: the All of Us Research Program. All of Us includes >95,000 individuals who were both whole genome sequenced and genotyped on an array, allowing comparison with an additional genotype discovery strategy and extending the analysis to non-coding variation, which complements our in-depth investigation of coding variants.