Abstract
Generating the most contiguous, accurate genome assemblies given available sequencing technologies is a long-standing challenge in genome science. With the rise of long-read sequencing, assembly challenges have shifted from merely increasing contiguity to correctly assembling complex, repetitive regions of interest, ideally in a phased manner. At present, researchers largely choose between two types of long-read data: longer but less accurate reads, often generated with Oxford Nanopore (ONT) technology, or highly accurate but shorter reads, typically generated with Pacific Biosciences HiFi. To understand how both technologies influence genome assembly and to clarify how the scale of data (i.e., mean read length and sequencing depth) influences outcomes, we compared genome assemblies for a caddisfly, Hesperophylax magnus, generated with ONT and HiFi data. Despite shorter reads and lower coverage, HiFi reads outperformed ONT reads in all assembly metrics tested and allowed for accurate assembly of the repetitive ∼20-Kb H-fibroin gene. Next, we quantified the influence of data type on genome assemblies across 6,750 plant and animal genomes. We show that HiFi reads consistently outperform all other data types for both plants and animals and may represent a particularly valuable tool for assembling complex plant genomes. To realize the promise of biodiversity genomics, we call for greater uptake of highly accurate long reads in future studies.
Significance statement
Understanding how types of sequence data influence genome assembly is an important aspect of genome science. In general, more data (i.e., longer reads and greater depth of coverage) yields better genome assemblies. However, it is unclear how highly accurate long-read sequence data (e.g., PacBio HiFi) compare to noisier long-read data. We showed that HiFi outperformed noisier long-read data for a caddisfly species in terms of assembly contiguity and resolution of the highly repetitive ∼20-Kb H-fibroin gene. Using a field-wide meta-analysis, we also showed that this advantage likely extends to animals and plants more broadly. Thus, long-read accuracy should be emphasized in future genome studies.
Competing Interest Statement
The authors have declared no competing interest.