Abstract
Telomere-to-telomere phased assemblies have become the norm in genomics. To achieve them for diploid and even polyploid genomes, the contemporary approach combines two long-read sequencing technologies: high-accuracy long reads, e.g. Pacific Biosciences (PacBio) HiFi or Oxford Nanopore (ONT) ‘Duplex’ reads, and ultra-long ONT ‘Simplex’ reads. Using two different technologies increases both the cost and the required amount of genomic DNA. Here, we show that comparable results are possible by error-correcting ultra-long ONT Simplex reads and then assembling them with state-of-the-art de novo assembly methods. To achieve this, we have developed HERRO, a deep learning-based framework that corrects ONT Simplex reads while carefully preserving differences between related genomic sequences. By taking into account informative positions that differentiate haplotypes or genomic repeat copies, HERRO achieves an up to 100-fold increase in read accuracy for diploid human genomes. By combining HERRO with the Verkko assembler, we achieve high contiguity on several human genomes, reconstructing many chromosomes telomere-to-telomere, including chromosomes X and Y. HERRO supports both R9.4.1 and R10.4.1 ONT Simplex reads and generalizes well to other species. These results provide an opportunity to reduce the cost of genome sequencing and to use corrected ONT reads to analyze more complex genomes with different levels of ploidy or even aneuploidy.
Competing Interest Statement
Oxford Nanopore Technologies and AI Singapore jointly funded the AI-driven De Novo Diploid Assembler project, which resulted in this manuscript. M.S. has received travel funds to speak at events hosted by Oxford Nanopore Technologies. S.N. and P.F.S. are Oxford Nanopore Technologies employees.
Footnotes
Updated figures and tables, added Sergey as a co-author, changed Results and Discussions accordingly