Abstract
The accuracy of variant calling is crucially important in clinical settings, as the misdiagnosis of a genetic disease such as cancer can compromise patient survival. Although many variant callers have been developed, variant-calling accuracy remains insufficient for clinical applications.
Here we describe UVC, a method for calling small variants of germline or somatic origin. By combining contrary assumptions with sublation, we found two principles that improve variant calling. First, we discovered the following power-law universality: allele fraction is inversely proportional to the cube root of the variant-calling error rate. Second, we found that zero inflation can combine Bayesian and frequentist models of sequencing bias.
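As a rough illustration of the first principle, the stated proportionality can be rearranged: if allele fraction is inversely proportional to the cube root of the error rate, then the error rate is inversely proportional to the cube of the allele fraction. The minimal Python sketch below works through that arithmetic only. The function names `relative_error_rate` and `phred_gain` are hypothetical and are not part of UVC, and the Phred-scale conversion is an added convenience for illustration, not a description of how UVC computes variant quality.

```python
import math


def relative_error_rate(af: float, ref_af: float) -> float:
    """Error rate at allele fraction `af` divided by the error rate at `ref_af`.

    Assumes error_rate is proportional to af ** -3, which is the abstract's
    power law (af proportional to error_rate ** (-1/3)) rearranged.
    The proportionality constant cancels in the ratio.
    """
    if af <= 0 or ref_af <= 0:
        raise ValueError("allele fractions must be positive")
    return (ref_af / af) ** 3


def phred_gain(af: float, ref_af: float) -> float:
    """Change in Phred-scaled quality (-10 * log10 of the error rate)
    when moving from allele fraction `ref_af` to `af` under the same assumption."""
    return -10.0 * math.log10(relative_error_rate(af, ref_af))


if __name__ == "__main__":
    # Doubling the allele fraction from 1% to 2% under the assumed power law.
    print(relative_error_rate(0.02, 0.01))
    print(phred_gain(0.02, 0.01))
```

Under this assumption, doubling the allele fraction divides the error rate by eight, which corresponds to a gain of about 9 on the Phred scale.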
We compared UVC with other state-of-the-art variant callers across a variety of calling modes (germline, somatic, tumor-only, and cell-free DNA with unique molecular identifiers (UMIs)), sequencing platforms (Illumina, BGI, and IonTorrent), sequencing types (whole-genome, whole-exome, and PCR-amplicon), human reference genomes (hg19, hs37d5, and GRCh38), aligners (BWA and NovoAlign), and representative sequencing depths and purities for both tumor and normal. UVC generally outperformed other germline variant callers on the GIAB germline truth sets. UVC strongly outperformed other somatic variant callers on 192 in silico mixture scenarios simulating 192 combinations of tumor/normal sequencing depths and tumor/normal purities, on the GIAB somatic truth sets derived from physical mixture, and on the SEQC2 somatic reference sets derived from the breast-cancer cell line HCC1395. UVC achieved 100% concordance with the manual review conducted by multiple independent researchers on a Qiagen 71-gene-panel dataset derived from 16 patients with colon adenoma. Additionally, UVC outperformed Mageri and smCounter2, the state-of-the-art UMI-aware variant callers, on the tumor-only datasets used in the publications of these two variant callers. Performance was measured by the sensitivity-specificity trade-off over all called variants. The improved variant calls generated by UVC from previously published UMI-based sequencing data provide additional biological insight into DNA damage repair.
UVC enables highly accurate calling of small variants from a variety of sequencing data, which can directly benefit patients in clinical settings. UVC is open-sourced under the BSD 3-Clause license at https://github.com/genetronhealth/uvc and quay.io/genetronhealth/gcc-6-3-0-uvc-0-6-0-441a694.
Competing Interest Statement
The algorithms presented in this manuscript are patent-pending. S.W. is one of the founders of Genetron Health. X.W. is a scientific advisor for Genetron Health. The remaining authors declare no competing interests.