RT Journal Article
SR Electronic
T1 Genomic variant calling: Flexible tools and a diagnostic data set
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 027227
DO 10.1101/027227
A1 Michael Lawrence
A1 Melanie A. Huntley
A1 Eric Stawiski
A1 Art Owen
A1 Thomas D. Wu
A1 Leonard D. Goldstein
A1 Yi Cao
A1 Jeremiah Degenhardt
A1 Jason Young
A1 Joseph Guillory
A1 Sherry Heldens
A1 Marlena Jackson
A1 Somasekar Seshagiri
A1 Robert Gentleman
YR 2015
UL http://biorxiv.org/content/early/2015/09/18/027227.abstract
AB The accurate identification of low-frequency variants in tumors remains an unsolved problem. To support characterization of the issues in a realistic setting, we have developed software tools and a reference dataset for diagnosing variant calling pipelines. The dataset contains millions of variants at frequencies ranging from 0.05 to 1.0. To generate the dataset, we performed whole-genome sequencing of a mixture of two Coriell cell lines, NA19240 and NA12878, the mothers of YRI (Y) and CEU (C) HapMap trios, respectively. The cells were mixed in three different proportions, 10Y/90C, 50Y/50C and 90Y/10C, in an effort to simulate the heterogeneity found in tumor samples. We sequenced three biological replicates for each mixture, yielding approximately 1.4 billion reads per mixture for an average of 64X coverage. Using the published genotypes as our reference, we evaluate the performance of a general variant calling algorithm, constructed as a demonstration of our flexible toolset, and make comparisons to a standard GATK pipeline. We estimate the overall FDR to be 0.028 and the FNR (when coverage exceeds 20X) to be 0.019 in the 50Y/50C mixture. Interestingly, even with these relatively well-studied individuals, we predict over 475,000 new variants, validating in well-behaved coding regions at a rate of 0.97, that were not included in the published genotypes.