PT - JOURNAL ARTICLE
AU - Chen-Shan Chin
AU - Justin Wagner
AU - Qiandong Zeng
AU - Erik Garrison
AU - Shilpa Garg
AU - Arkarachai Fungtammasan
AU - Mikko Rautiainen
AU - Tobias Marschall
AU - Alexander T Dilthey
AU - Justin M. Zook
TI - A Diploid Assembly-based Benchmark for Variants in the Major Histocompatibility Complex
AID - 10.1101/831792
DP - 2019 Jan 01
TA - bioRxiv
PG - 831792
4099 - http://biorxiv.org/content/early/2019/11/05/831792.short
4100 - http://biorxiv.org/content/early/2019/11/05/831792.full
AB - We develop the first human benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle/Personal Genome Project Ashkenazi son (HG002). As a proof-of-principle, we focus on a medically important, highly variable, 5 million base-pair region - the Major Histocompatibility Complex (MHC). Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct base-level accurate, phased de novo assemblies from the reads. We assemble a single haplotig (haplotype-specific contig) for each haplotype, and align reads back to each assembled haplotig to identify two regions of lower confidence. We align the haplotigs to the reference, call phased small and structural variants, and define the first small variant benchmark for the MHC, covering 21496 small variants in 4.58 million base-pairs (92 % of the MHC). The assembly-based benchmark is 99.95 % concordant with a draft mapping-based benchmark from the same long and linked reads within both benchmark regions, but covers 50 % more variants outside the mapping-based benchmark regions. The haplotigs and variant calls are completely concordant with phased clinical HLA types for HG002. This benchmark reliably identifies false positives and false negatives from mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks. These methods demonstrate a path towards future diploid assembly-based benchmarks for other complex regions of the genome.