Improving variant calling using population data and deep learning

Nae-Chyun Chen; Alexey Kolesnikov; Sidharth Goel; Taedong Yun; Pi-Chuan Chang; Andrew Carroll

doi:10.1101/2021.01.06.425550

Abstract

Large-scale population variant data is often used to filter and aid interpretation of variant calls in a single sample. These approaches do not incorporate population information directly into the process of variant calling, and are often limited to filtering, which trades recall for precision. In this study, we develop population-aware DeepVariant models with a new channel encoding allele frequencies from the 1000 Genomes Project. These models reduce variant calling errors, improving both precision and recall in single samples, and reduce rare homozygous and pathogenic ClinVar calls cohort-wide. We assess the use of population-specific or diverse reference panels and find the greatest accuracy with diverse panels, suggesting that large, diverse panels are preferable to individual populations, even when the population matches sample ancestry. Finally, we show that this benefit generalizes to samples with different ancestry from the training data even when the ancestry is also excluded from the reference panel.

Competing Interest Statement

AK, SG, TY, PC and AC are employees of Google LLC and own Alphabet stock as part of the standard compensation package. This study was funded by Google LLC.

Footnotes

‡ Work performed while an intern at Google Health.
1. Re-ran all the experiments using the DeepVariant v1.1 model.
2. Analyzed the performance of the model with respect to the commonness of variants.
3. Generated a 1000 Genomes call set using the calls from the allele-frequency model and performed analyses.
4. Re-plotted all the figures to make them more informative and visually clearer.
5. Compared results from an imputation pipeline with our calls.