Abstract
De novo genome assembly of outbred diploid organisms remains a challenge in computational biology because similar haplotypes are difficult to resolve. FALCON-Unzip, a phased diploid genome assembler, separates PacBio long reads by haplotype during assembly. The assembler outputs primary contigs, which are contiguous pseudohaplotypes containing both phased haplotype regions and collapsed haplotypes. The ability to phase depends on the density of heterozygous variants, the depth of coverage, and the read length. Consequently, haplotype phase information is lost wherever phase blocks are interrupted by regions of low heterozygosity, which results in phase switches. Here, we present FALCON-Phase, a new method that resolves phase switches by reconstructing contig-length phase blocks using Hi-C short reads mapped to both homozygous regions and phase blocks. Such Hi-C data contain ultra-long-range phasing information (>1 Mb). The FALCON-Phase algorithm is highly accurate (>96%) when benchmarked against a pedigree-based truth set. The FALCON-Phase pipeline can also be extended to scaffolds to generate chromosome-scale phase blocks. The code is freely available (https://github.com/phasegenomics/FALCON-Phase) under a BSD and attribution license.
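The idea of using Hi-C links to join phase blocks can be illustrated with a toy sketch. This is not the FALCON-Phase algorithm itself; it is a simple greedy heuristic written under stated assumptions. The function name `phase_blocks`, the `links` dictionary layout, and the link counts are all hypothetical. The assumption is that Hi-C read pairs link haplotigs from the same parental haplotype (cis) more often than haplotigs from different haplotypes (trans), so each phase block is kept or swapped according to which choice gives more cis links to the blocks already placed.

```python
def phase_blocks(n_blocks, links):
    """Greedy toy phasing of phase blocks along one contig.

    links maps (i, hap_i, j, hap_j) -> Hi-C link count, with i < j and
    hap in {0, 1} naming the two haplotigs of a block (hypothetical layout).
    Returns one flag per block: 0 = keep haplotig order, 1 = swap it.
    """
    orient = [0]  # the first block defines the reference phase
    for j in range(1, n_blocks):
        keep = swap = 0
        for i in range(j):
            for hap_i in (0, 1):
                # phase already assigned to this haplotig of block i
                phase_i = hap_i ^ orient[i]
                for hap_j in (0, 1):
                    count = links.get((i, hap_i, j, hap_j), 0)
                    if phase_i == hap_j:
                        keep += count  # cis link if block j is kept
                    else:
                        swap += count  # cis link if block j is swapped
        orient.append(0 if keep >= swap else 1)
    return orient


# Made-up link counts: block 1 is switched relative to blocks 0 and 2.
example_links = {
    (0, 0, 1, 1): 30, (0, 1, 1, 0): 28, (0, 0, 1, 0): 3, (0, 1, 1, 1): 2,
    (0, 0, 2, 0): 20, (0, 1, 2, 1): 22, (0, 0, 2, 1): 1,
    (1, 0, 2, 1): 25, (1, 1, 2, 0): 27, (1, 0, 2, 0): 2,
}
example_orientation = phase_blocks(3, example_links)
print(example_orientation)
```

In this made-up example the second block is flagged for swapping, which would remove the phase switch between it and its neighbours. A greedy pass like this can propagate an early mistake to later blocks, so it should be read only as an illustration of the cis versus trans signal.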