RT Journal Article
SR Electronic
T1 Genomic variant calling: Flexible tools and a diagnostic data set
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 027227
DO 10.1101/027227
A1 Lawrence, Michael
A1 Huntley, Melanie A.
A1 Stawiski, Eric
A1 Owen, Art
A1 Wu, Thomas D.
A1 Goldstein, Leonard D.
A1 Cao, Yi
A1 Degenhardt, Jeremiah
A1 Young, Jason
A1 Guillory, Joseph
A1 Heldens, Sherry
A1 Jackson, Marlena
A1 Seshagiri, Somasekar
A1 Gentleman, Robert
YR 2015
UL http://biorxiv.org/content/early/2015/09/18/027227.abstract
AB The accurate identification of low-frequency variants in tumors remains an unsolved problem. To support characterization of the issues in a realistic setting, we have developed software tools and a reference dataset for diagnosing variant calling pipelines. The dataset contains millions of variants at frequencies ranging from 0.05 to 1.0. To generate the dataset, we performed whole-genome sequencing of a mixture of two Coriell cell lines, NA19240 and NA12878, the mothers of YRI (Y) and CEU (C) HapMap trios, respectively. The cells were mixed in three different proportions, 10Y/90C, 50Y/50C and 90Y/10C, in an effort to simulate the heterogeneity found in tumor samples. We sequenced three biological replicates for each mixture, yielding approximately 1.4 billion reads per mixture for an average of 64X coverage. Using the published genotypes as our reference, we evaluate the performance of a general variant calling algorithm, constructed as a demonstration of our flexible toolset, and make comparisons to a standard GATK pipeline. We estimate the overall FDR to be 0.028 and the FNR (when coverage exceeds 20X) to be 0.019 in the 50Y/50C mixture. Interestingly, even with these relatively well-studied individuals, we predict over 475,000 new variants, validating in well-behaved coding regions at a rate of 0.97, that were not included in the published genotypes.