RT Journal Article
SR Electronic
T1 Genomic variant calling: Flexible tools and a diagnostic data set
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 027227
DO 10.1101/027227
A1 Michael Lawrence
A1 Melanie A. Huntley
A1 Eric Stawiski
A1 Art Owen
A1 Thomas D Wu
A1 Leonard D Goldstein
A1 Yi Cao
A1 Jeremiah Degenhardt
A1 Jason Young
A1 Joseph Guillory
A1 Sherry Heldens
A1 Marlena Jackson
A1 Somasekar Seshagiri
A1 Robert Gentleman
YR 2015
UL http://biorxiv.org/content/early/2015/09/18/027227.abstract
AB The accurate identification of low-frequency variants in tumors remains an unsolved problem. To support characterization of the issues in a realistic setting, we have developed software tools and a reference dataset for diagnosing variant calling pipelines. The dataset contains millions of variants at frequencies ranging from 0.05 to 1.0. To generate the dataset, we performed whole-genome sequencing of a mixture of two Coriell cell lines, NA19240 and NA12878, the mothers of YRI (Y) and CEU (C) HapMap trios, respectively. The cells were mixed in three different proportions, 10Y/90C, 50Y/50C and 90Y/10C, in an effort to simulate the heterogeneity found in tumor samples. We sequenced three biological replicates for each mixture, yielding approximately 1.4 billion reads per mixture for an average of 64X coverage. Using the published genotypes as our reference, we evaluate the performance of a general variant calling algorithm, constructed as a demonstration of our flexible toolset, and make comparisons to a standard GATK pipeline. We estimate the overall FDR to be 0.028 and the FNR (when coverage exceeds 20X) to be 0.019 in the 50Y/50C mixture. Interestingly, even with these relatively well-studied individuals, we predict over 475,000 new variants, validating in well-behaved coding regions at a rate of 0.97, that were not included in the published genotypes.