ABSTRACT
Detection of de novo variants (DNVs) is critical for studies of disease-related variation and mutation rates. We developed HAT, a GPU-based workflow to rapidly call DNVs, and demonstrated its effectiveness by applying it to 4,216 Simons Simplex Collection (SSC) parent-child trios that were whole-genome sequenced from blood-derived DNA. In our SSC DNV data, we identified 78 ± 15 DNVs per individual, 18% ± 5% at CpG sites, 75% ± 9% phased to the paternal chromosome of origin, and an average allele balance of 0.49. These values are all in line with expectations for DNVs. We then sought to build a control DNV dataset by running HAT on 602 parent-child trios from the publicly available 1000 Genomes Project (1000G), which were whole-genome sequenced from DNA derived from lymphoblastoid cell lines (LCLs). In our 1000G DNV data, we identified 740 ± 967 DNVs per individual, 14% ± 4% at CpG sites, 61% ± 11% phased to the paternal chromosome of origin, and an average allele balance of 0.41. Of the 602 trios, 80% had > 100 DNVs, and we hypothesized that the excess DNVs were cell line artifacts. Several lines of evidence in our data support this hypothesis and suggest that 1000G is not a static reference. By mutation profile analysis, we tested whether these cell line artifacts were random and found that 40% of individuals in 1000G did not have random DNV profiles; rather, they had DNV profiles matching B-cell lymphoma. Furthermore, we saw a significant excess of protein-coding DNVs in 1000G in IGLL5, a gene that has already been implicated in this cancer. As a result of cell line artifacts, 1000G has variants present in DNA repair genes and at ClinVar pathogenic or likely-pathogenic sites. Our study highlights important implications of using sequencing data from LCLs, both for reference-building projects and for disease-related projects in which these data are used in variant filtering steps.
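As an illustration of the per-individual metrics reported above (DNV count, fraction at CpG sites, fraction phased to the paternal chromosome, and mean allele balance), the following is a minimal sketch and not the HAT implementation. The `DNV` record layout, the `summarize` function, and the choice to compute the paternal fraction over phased DNVs only are assumptions made for this example; the default threshold of 100 DNVs mirrors the cutoff quoted in the abstract.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class DNV:
    """One de novo variant call in a child (hypothetical record layout)."""
    sample: str
    alt_reads: int                   # reads supporting the alternate allele
    total_reads: int                 # total reads covering the site (assumed > 0)
    is_cpg: bool                     # True if the site is in a CpG context
    parent_of_origin: Optional[str]  # "paternal", "maternal", or None if unphased


def allele_balance(v: DNV) -> float:
    """Fraction of reads at the site that support the alternate allele."""
    return v.alt_reads / v.total_reads


def summarize(dnvs: List[DNV], excess_threshold: int = 100) -> Dict[str, dict]:
    """Compute per-sample summary metrics from a flat list of DNV calls."""
    by_sample: Dict[str, List[DNV]] = {}
    for v in dnvs:
        by_sample.setdefault(v.sample, []).append(v)

    summary: Dict[str, dict] = {}
    for sample, calls in by_sample.items():
        # Assumption: the paternal fraction is taken over phased DNVs only.
        phased = [v for v in calls if v.parent_of_origin is not None]
        paternal = [v for v in phased if v.parent_of_origin == "paternal"]
        summary[sample] = {
            "n_dnvs": len(calls),
            "cpg_fraction": sum(v.is_cpg for v in calls) / len(calls),
            "paternal_fraction": len(paternal) / len(phased) if phased else None,
            "mean_allele_balance": mean(allele_balance(v) for v in calls),
            # Flags samples whose DNV count exceeds the chosen threshold.
            "excess": len(calls) > excess_threshold,
        }
    return summary
```

A sample whose count far exceeds the threshold and whose mean allele balance falls well below 0.5 would, under this sketch, be a candidate for the kind of cell line artifact described above.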
Competing Interest Statement
Competing interests: P.V., M.W., G.V., and T.T.H. are full-time employees of NVIDIA.
Footnotes
These are the updates to the paper: We analyzed whole-genome sequencing data from an additional 4,216 trios (n = 8,922 individuals) in which the sequencing was done on DNA derived from blood. This included calling DNVs with our new workflow as well as phasing the data. This increased the sample size of the paper from 1,806 individuals to 10,728 individuals (~6x more genomes). It also gives us the opportunity to demonstrate the performance of our de novo variant caller on DNA derived from blood. With this information, we have added a new component to the study that compares the blood-derived DNA results with the cell-line-derived DNA results. We made a CPU-based version of the code that is free and publicly available. We performed Sanger sequencing, at the wet bench, on 25 randomly selected DNVs from the NA12878 trio, and 92% were confirmed as real DNVs. This adds to the manual scoring of 3,980 DNVs that we performed by visualizing read data.
https://app.globus.org/file-manager?origin_id=3eff453a-88f4-11eb-954f-752ba7b88ebe&origin_path=%2F