Benchmarking Oxford Nanopore read assemblers for high-quality molluscan genomes

Choosing the optimum assembly approach is essential to achieving a high-quality genome assembly suitable for comparative and evolutionary genomic investigations. Significant recent progress in long-read sequencing technologies such as PacBio and Oxford Nanopore Technologies (ONT) also brought about a large variety of assemblers. Although these have been extensively tested on model species such as Homo sapiens and Drosophila melanogaster, such benchmarking has not been done in Mollusca which lacks widely adopted model species. Molluscan genomes are notoriously rich in repeats and are often highly heterozygous, making their assembly challenging. Here, we benchmarked 10 assemblers based on ONT raw reads from two published molluscan genomes of differing properties, the gastropod Chrysomallon squamiferum (356.6Mb, 1.59% heterozygosity) and the bivalve Mytilus coruscus (1593Mb, 1.94% heterozygosity). By optimising the assembly pipeline, we greatly improved both genomes from previously published versions. Our results suggested that 40-50X of ONT reads are sufficient for high-quality genomes, with Flye being the recommended assembler for compact and less heterozygous genomes exemplified by C. squamiferum, while NextDenovo excelled for more repetitive and heterozygous molluscan genomes exemplified by M. coruscus. A phylogenomic analysis utilising the two updated genomes with other 32 published high-quality lophotrochozoan genomes resulted in maximum support across all nodes, and we show that improved genome quality also leads to more complete matrices for phylogenomic inferences. Our benchmarking will ensure the efficiency in future assemblies for molluscs and perhaps also other marine phyla with few genomes available.


Introduction
Sequencing the whole genome of an organism can be highly beneficial to understanding its biology and evolution. High-quality genome assemblies allows researchers to begin deciphering the causal connection between genotype and phenotype, explore the molecular control of traits through gene expression and regulation, shed light on the species' evolutionary process at the genomic scale, and perform comprehensive and robust phylogenomic analyses to infer evolutionary relationships [1]. Mollusca is the second largest animal phylum, and their habitats cover a dramatic range of ecosystems worldwide, from high mountains to the deep sea, from lush rainforests to arid deserts, and from freshwater streams to coral reefs. Many molluscs provide important resources for humans, being the largest aquaculture resources second only to teleost fishes [2]; a number of disease-causing human parasites also use molluscs as an intermediate host, such as blood flukes in the genus Schistosoma which causes schistosomiasis [3]. With members as different as the colossal squid and microscopic worms living between grains of sand, it is also the most morphologically disparate animal phylum [4], and the understanding of their biology and evolution provides clues to answer fundamental questions on genotypic vs phenotypic adaptation, origins of evolutionary novelties, and macroevolutionary processes.
Currently, the number of published molluscan genomes is relatively small compared to other major phyla such as Arthropoda and Chordata. At the time of writing (September 2020), there are 666 Arthropoda and 1372 Chordata genomes deposited on NCBI Genome database; only 29 highquality Mollusca genomes (i.e. BUSCO score over 80%) have been published (Supplementary  Table S1), representing only approximately <0.05% of molluscan species (considering the number of described species in Mollusca is ~65,000). Among these, only 9 are chromosome-scale. In addition, the sequenced taxa are heavily biased to the conchiferan classes of Bivalvia, Gastropoda, and, to a lesser extent, Cephalopoda, with no published high-quality genomes in other classes including the aculiferans (Polyplacophora, Solenogastres, and Caudofoveata), which are key to understanding molluscan evolution. The small number of molluscan genomes sequenced and the taxonomic bias can be attributed to a few reasons: 1) some groups, such as monoplacophorans, are rare and not easy to collect [5]; 2) extraction of adequate, high-quality genomic DNA from molluscs can be very difficult; 3) molluscan genomes tend to be very heterozygous and repetitive, which significantly hinders the assembly of reads into high-quality genomes.
Along with the rapid progress of sequencing technologies in recent years, especially so-called thirdgeneration long read sequencing technologies [PacBio and Oxford Nanopore Technologies (ONT)], a large variety of assemblers have been developed for long-read based genome assembly. Although the effectiveness of different assemblers has been extensively tested on model species such as Homo sapiens and Drosophila melanogaster, there are currently no widely adopted molluscan model species. As such, the efficacy of different genomic assemblers for de novo molluscan genome assembly has not been evaluated systematically. Considering the distinct genomic features between molluscan genomes sequenced to date and model species in other phyla, the assembly strategies and assemblers effective in those model species are unlikely to be equally effective for molluscan genomes. Since running and testing different assemblers is computationally intensive and time consuming, it is beneficial to carry out a systematic benchmarking of different assembly strategies for molluscs to identify the best strategy for assembling future molluscan genomes.
Here, we selected two species from different molluscan classes with existing genomes that were previously assembled from ONT reads as the models to benchmark different assembly strategies. Namely, we focused on the Scaly-foot Snail Chrysomallon squamiferum (Gastropoda) and the Hard-shelled Mussel Mytilus coruscus (Bivalvia). We selected ONT because 1) the low cost of the MinION instrument makes this technology more accessible to researchers studying molluscs, 2) currently more tools have been specifically designed for ONT data, such as NECAT [6], NextDenovo (https://github.com/Nextomics/NextDenovo), and Shasta [7], and 3) the potential to apply the ultra-long DNA sequencing capability of ONT. The published Scaly-foot Snail genome is relatively compact (444.4 Mb) while the M. coruscus genome is larger (1.90 Gb); both genomes are very heterozygous, with the heterozygosity of C. squamiferum being 1.38% and M. coruscus being even higher at 1.64% [8,9]. We recorded genome contiguity, completeness, as well as misassemblies resulting from up to 10 assemblers. Using these two re-assembled and improved genomes, we performed an updated phylogenomic analysis with all publicly available, high-quality molluscan genomes and explored the gene families specifically expanded in particular lineages within Mollusca, exemplifying the utility of high-quality genomes in understanding the evolution and biology of molluscs.

Methods
Raw ONT reads in the .fast5 format from the published Scaly-foot Snail genome sequencing project (PRJNA523462) [8] were re-basecalled using Guppy version 3.6.0 with the high-accuracy (HAC) mode on a GeForce® RTX 1080 Ti (NVIDIA) GPU. The Illumina and ONT reads from the Mytilus coruscus genome sequencing project were downloaded from the NCBI SRA database (ERR3415816 and ERR3431204) [9]. Illumina reads were cleaned to remove bacterial contamination using Kraken2 [10], and the genome size and heterozygosity was calculated by Jellyfish v2.3.0 with the k-mer size of 17, 19, and 21 and GenomeScope 2.0 [11].
The following assemblers were used for the benchmarking, including the long-read only assemblers (Canu [12], Flye [13] , Wtdbg2 [14], Miniasm [15], NextDenovo (https://github.com/Nextomics/NextDenovo), NECAT [6], Raven [16], and Shasta [7]) and hybrid assemblers (MaSuRCA [17] and QuickMerge [18]). Canu was not tested on the M. coruscus genome due to the extremely intensive computing time required for this large genome. Previous analyses have suggested using corrected ONT reads could improve the genome assembly [19]. To check the effect of this has on the assemblies, the ONT reads that were corrected and / or trimmed by Canu and NECAT were also tested. To check whether including the shorter ONT reads could affect the assembly. The ONT reads were also sub-sampled with different cut-off lengths (see Table  1 for the lengths used). CPU hour was calculated in the Slurm workload manager system by recording the program start and end time point. However, since the hardware configuration in each node varied, the CPU hour presented is only an indicator of the relative trend among different assemblers.
The assembled contigs were polished at least three times with Flye, and heterozygous contigs were removed with the purge_dup pipeline [20]. The resultant genomes were polished twice using Pilon version 1.23 [21] with Illumina reads. The genome completeness of each assembly was thoroughly monitored at each step using BUSCO v4.0.6 with odb10 metazoan dataset [22]. The genome quality of the Scaly-foot Snail assemblies were assessed by QUAST version 5.0.2 [23] comparing against the formerly published version of the genome as a reference [8]. QUAST calculates genome assembly characteristics such as N50 and total size, but also assesses mis-assemblies with minimap2. The detailed commands and settings used for all analyses can be found in the Supplementary Information.
Repeat content was initially predicted with RepeatModeler version 2.0.1. Genomes were hardmasked using RepeatMasker v4.1.0 with a species-specific repeat library generated by RepeatModeler and all the known repeat content in the RepeatMasker repeat database. Augustus version 3.3.3, a de novo gene predictor, was trained using Braker version 2.1.5 with the hardmasked genome and the transcriptome assemblies. Genome annotation was then performed using Maker v3.01.03 with the trained Augustus ab initio predictor plus each species' transcriptome assembly and molluscan protein sequences downloaded from the NCBI protein database (July 2020). Each gene was annotated by InterProScan-5.36-75.0.
To identify putative orthologous sequences shared among taxa, we used OrthoFinder 2.4.0 [24] with an inflation parameter of 2.1. Working from the .fasta files generated in the "Orthogroup_Sequences" directory, we removed sequences shorter than 100 amino acids and removed sequences that were identical where they overlapped, keeping the longest non-redundant sequence. We then retained only those .fasta files sampled for at least four taxa, and aligned them using MAFFT 7.310 [25] with the following options: --auto, --localpair, and --maxiterate 1000. We then removed putatively mistranslated regions with HmmCleaner [26] with the --specificity option. We deleted sequences that did not overlap with any other sequences by at least 20 amino acids, starting with the shortest sequence not meeting this overlap criterion. Then, we trimmed the alignments to remove ambiguously aligned and 'noisy' regions with BMGE v. 1.12.2 [27] and constructed "approximately maximum likelihood" trees for each alignment with FastTree 2 [28] using the -slow and -gamma options. In order to identify strictly orthologous sequences among taxa, we used PhyloPyPruner 0.9.5 (https://pypi.org/project/phylopypruner) with the following options: --min-support 0.9 --mask pdist --trim-lb 3 --trim-divergent 0.75 --min-pdist 0.01 --prune LS. Only alignments sampled for at least 75% of the taxa (i.e., 26 taxa) were retained for the final analyses. Datasets with genes sampled for at least four taxa and at least 50% of the taxa were also generated (available in the supplementary materials). In order to check if higher-quality genome assemblies lead to the recovery of more orthogroups, a regression analysis was carried out in SPSS 16.0 between BUCSO scores and the orthologue gene occupancy.
Phylogenetic analyses were conducted on the partitioned supermatrix produced by PhyloPyPruner using IQ-Tree 2 [29] with the best-fitting model of amino acid evolution for each partition (-m MFP). Topological support was assessed with 1,000 rapid bootstraps. MCMCTree version 4.8a was used to calibrate the time constraints. The "root-age" was set as 590Ma [30]. The following time constraints were applied: a soft minimum bound of 245 Ma for the first appearance of Ostreoidea [31]; a hard minimum bound of 465 Ma for the first appearance of Pteriomorpha [32]; a soft minimum bound of 125 Ma for the first appearance of Mactroidea [31]; a soft constraint of 520.5 Ma and 530 Ma for the origin of Bivalvia [31]; a hard upper bound of 150 Ma for the split of Lanistes nyassanus (representing the old world ampullariids) and the new world ampullariids [33]; a hard lower bound of 130 Ma for the first appearance of both the Stylommatophora and Hygrophila [34]; a hard lower bound of 168.6 Ma and a soft upper bound of 473.4 Ma for the split of Aplysia and Biomphalaria [35]; a hard minimum 390 Ma bound for the split of Caenogastropoda and Heterobranchia [36]; a hard lower bound of 470.2 Ma and a soft upper bound of 531.5 Ma for the first appearance of Gastropoda [35]; a hard lower bound of 532 Ma and a soft upper bound of 549 Ma for the first appearance of molluscs [37]; a hard lower bound of 550.25 Ma and a hard upper bound of 636.1 Ma for the origin of Lophotrochozoa [37]. The model of LG+I+G, which was the best-fitting model for the vast majority of the single-gene partitions, was applied with the burn-in set to 10 million and the sampling frequency set to 1000.

Results and Discussions (a) Scaly-foot Snail genome assembly
The Illumina sequencing reads used for the previously published genome assembly of Chrysomallon squamiferum were found to contain some endosymbiont contamination, which likely led to the overestimation of the genome size (444.4 Mb). With cleaned Illumina reads, the genome size was predicted to be 356.6 Mb, and the heterozygosity was estimated to be 1.59%. Although kmer count methods for genome size estimation may still be biased by the high heterozygosity, we used this genome size in the downstream analyses.
The high-accuracy mode of Guppy 3.6.0 significantly improved the ONT read accuracy, with the base quality score being improved from 13.2±1.5 when not using the high-accuracy mode to 16.9±3.5 in the case of the Scaly-foot Snail genome. Different genome assemblies were tested on the newly basecalled ONT reads.
The assembly results from different assemblers using different settings and filters are shown in Table 1. In the case of the minimap2+miniasm assembly, the best assembly resulted from >10Kb or longest 41X filtering ONT reads, suggesting the inclusion of shorter reads may actually reduce the contiguity of the assembly. A similar observation was also reported in a study assembling the Caenorhabditis elegans genome with ONT reads [38]. However, due to Miniasm lacking a sequence consensus step, the base accuracy in the assembly can only be as good as the input reads, leading to a very poor BUSCO score (1.5%) ( Table 1, Supplementary Table S2). Similarly, the best assembly using Flye resulted from filtering the raw ONT reads keeping only those longer than 10Kb; including shorter raw reads also appeared to reduce the contiguity of the assembly. Output assembly quality from Flye also seemed to be independent from whether the reads were corrected and/or trimmed or not, though the assembly with the longest 50X NECAT-corrected ONT reads had the longest contig (9.84Mb) among the assemblies done with Flye (Supplementary Table S2). These results indicate that Flye is indeed mainly optimised for raw ONT reads, as suggested by its authors [13]. For Wtdbg2, increasing the length filter also increased the assembly contiguity. The >10kb filter for raw ONT reads resulted in a higher-quality assembly than the >8kb for and >5Kb filter. Both Canu-corrected and NECAT-corrected ONT reads dramatically increased the NG50, and the trimming on the corrected reads to remove suspicious reads (e.g. adapters) was also effective in increasing the assembly quality. The assembly using the longest 49X NECATcorrected and trimmed reads had the longest contig (10.64Mb), and the assembly using the longest 40X NECAT-corrected and trimmed reads had the best NG50 value (2.44Mb).
Among the MaSuRCA assemblies, the highest quality assembly was from combining Illumina reads together with the longest 35X raw ONT reads; assemblies with Illumina reads and corrected/trimmed ONT reads did not increase the NG50. The MaSuRCA assembly had the highest BUSCO score (97.6%) at this point. This suggests a high base accuracy, as the mega-reads generated by MaSuRCA effectively combined both the more accurate Illumina reads and the longer ONT reads [17]. With the NECAT assembler, using the longest 50X raw ONT reads improved the assembly compared to when using top 30x, suggesting longest 50x reads may be the better input for NECAT. For Shasta assemblies, using the default settings and the settings of "-memoryMode filesystem and -memoryBacking 2M" resulted in assemblies of similar qualities. This result is different from what other authors reported in other genome assemblies of model species such as human and Drosophila [7] where the latter setting resulted in much improved assemblies, suggesting that the latter settings may not be beneficial for highly heterozygous genomes like those of Mollusca. The final assembler tested was Raven, an updated version of Ra [16]. The best assembly resulted from using either >8Kb or longest 60X filter ONT reads, running the assembler with Canu-corrected reads actually resulted in lower quality genome assemblies (Table 1).
With regards to the computing time (Table 1), Shasta was the fastest assembler, followed by Raven, minimap2+miniasmand, and Wtdbg2. Canu was the most computationally intensive, followed by MaSuRCA, then NECAT or NextDenovo. Canu also took more CPU hours to correct the reads than NECAT.
As different assemblers may include different (or lacking) consensus steps, and some assemblers may merge the heterozygous contigs (or 'bubble' in the assembly), the best genome assembly from each assembler (as judged by NG50, the N50 value after normalization using the predicted genome size) was polished with ONT reads using Flye. Removal of heterozygous contigs by purge_dups was carried out when the coverage histogram of the mapped ONT reads exhibited a heterozygous peak, which was often necessary except for Wtdbg2, Raven, and Shasta, suggesting that these three assemblers were able to actively merge heterozygous contigs. Furthermore, this step also helped to increase the BUSCO score by decreasing the duplicated BUSCOs (Supplementary Table S3). Among the post-polishing assemblies, the Flye assembly had the highest BUSCO score (C:97.8% [D:0.6%]). This is in line with the published finding that Flye is capable of assembling some genomic regions that may be missing in assemblies produced by other assemblers by better resolving the repetitive regions [13]. Following Flye, the next most complete assembly was that from minimap2+Miniasm (C:97.8% [D:0.7%]), and then NECAT and Shasta (both C:97.6%[D:0.4%]).
Analysis of misassembly and mismatch with QUAST revealed that NextDenovo has the least number of misassemblies, followed by QuickMerge. For the number of mismatches per 100 kb, Raven was the best performer followed by NextDenovo; for number of indels per 100 kb, MaSuRCA performed the best, followed by Shasta. Nevertheless, for genomes resulting from these 10 assemblers, the amount of misassembly, mismatches, and indels were not very different, particularly the latter two parameters (number of mismatches and number of indels per 100kb), indicating that they were similar in performance.
We selected the polished Flye assembly for the downstream analysis due to this assembly exhibiting the highest BUSCO score and also the largest size among the assemblies. The Hi-C library from the original published assembly [8] was used to further scaffold the contigs in the Flye assembly, resulting in a final assembly including 15 pseudo-chromosomal scaffolds plus 492 contigs. Annotating the genome suggested a dramatic improvement compared to the original published assembly in the number of gene models (21,469 vs 16,917), and the BUSCO score of the predicted genes increased from 87.5% to 94.1%. This assembly is one of the most complete genome assemblies in Mollusca to date.

(b) Mytilus coruscus genome assembly
The genome size of M. coruscus predicted with GenomeScope 2.0 was 1,593Mb, smaller than the size predicted by an earlier study (1.85Gb), and heterozygosity was estimated to be 1.94% [9]. A total of 158.9 Gb of ONT reads (99.8 X) with the base quality score of 13.1±0.9 were sequenced in the former study [9]. Since there is no chromosomal-scale of genome assembly for M. coruscus, the number of mis-assemblies were not documented for this species. For Wtdbg2 assemblies, the highest quality also resulted from the combination of NECAT-corrected and trimmed ONT reads, with the NG50 of 1.12Mb, like in the case of the Scaly-foot Snail above. However, the running time for M. coruscus was significantly inflated due to extensive computing required for the read error correction and trimming. Meanwhile, for Flye assemblies, the genome assembly from raw ONT reads was better than the assembly resulting from NECA-corrected and trimmed reads. However, neither of these two Flye assemblies had N50 values over 500Kb. This is rather different from the results in the Scaly-foot Snail assembly, and it may indicate the M. coruscus genome is too heterozygous or repetitive for Flye to be an effective assembler.
Among all of the M. coruscus assemblies generated in our benchmarking, the NextDenovo version exhibited the highest NG50 (3.40Mb) and BUSCO scores ( Table 3, Supplementary Table S4). The MaSuRCA version resulted in the same 'Complete' BUSCO score, but with higher 'Complete and Duplicated' score (2.2% vs 1.7%) (Supplementary Table S4), suggesting that NextDenovo performs better in merging allelic contigs. Shasta was again the speediest, followed by Wtdbg2 and then Raven. Since the NextDenovo assembly was by far the most contiguous genome, this was selected for the downstream analysis. After three rounds of polishing with ONT reads, purging redundant haplotigs with purge_dups, and two rounds of error correction with Illumina reads using Pilon, the final N50 reached 2.54Mb, and the complete BUSCO score was 95.8% (duplicated BUSCO =1.7%). This is a dramatic improvement from the original published assembly (N50 = 898.3Kb and complete BUSCO score = 91.7%, and duplicated BUSCOs = 2.5%) [9]. A total of 72,541 gene models were annotated from this genome assembly, with the BUSCO score of the gene models being 92.3%. The number of gene models is rather high among published Mollusca genomes. A recent genome assembly of its congener Mytilus galloprovincialis annotated 60,338 genes, and the authors suggested there is significant variation in gene presence / absence among Mytilus species [39]. These results collectively indicate that Mytilus is gene-rich, and a similar high number of genes was also reported from another lamellibranch bivalve, the scallop Pecten maximus with 67,741 genes [40].

(c) Overall remark on the two genome assemblies
The genomes of Chrysomallon squamiferum and Mytilus coruscus represent two genomes with drastically different genomic features: the former genome is compact and the latter is relatively large; although both genomes are very heterozygous the latter is much more so. Genome assemblies with different assemblers resulted in various levels of trade-off between time, contiguity, and completeness. In general, Shasta, Wtdbg2, and Raven are very speedy and are the recommended assemblers when a quick check of the genomic features is desired instead of a high-quality assembly. Flye is not sensitive to the read accuracy, but Wtdbg2 always performed better with corrected and trimmed reads. NextDenovo was the highest performer in terms of genome contiguity (e.g., N50), but the genome completeness assessed by the BUSCO score was not the best, indicating the assembler likely failed to assemble some parts of the genome. We also found that QuickMerge can increase the genomic contiguity without introducing misassemblies, mismatches, or erroneous indels ( Table 2). This is very impressive, since QuickMerge can be a cost-effective method for assembling a relatively high-quality genome. However, it should be noted that the BUSCO score after the QuickMerge is actually worsened, suggesting some parts of the genome have been lost during the consensus step.
Regarding the input reads, our results demonstrate that using the longest 40-50x of ONT reads is recommended for assembling molluscan genomes. In the case of NextDenovo, the authors suggested the longest 45X of ONT reads as the optimised input; in the case of Flye, the authors suggested longest 40X of ONT reads. We found that including shorter reads in the assembly in most cases resulted in lower quality assemblies. Haplotig reduction via the purge_dups pipeline, which performs best in our experience (data not shown), or a similar approach is necessary for a genome assembly larger than the predicted genome size, because heterozygous genomic regions can inflate the assembly size. Also, the BUSCO score remains the same or even improved after the purge_dups pipeline (Supplementary Table S3), suggesting the heterozygous contigs have detrimental effects on the BUSCO score.
In general, of the 10 assemblers tested, Flye and Nextdenovo performed better than the rest overall. However, their performance was vastly different in the two species tested, with Flye performing the best in the C. squamiferum genome and NextDenovo in M. coruscus. With an extremely heterozygous genome like M. coruscus, NextDenovo likely performs better than Flye and is the recommended assembler with ONT reads. When the sequencing effort is limited, we also suggest using QuickMerge to merge at least two versions of the assembly in order to increase the genome continuousness without sacrificing too much assembly accuracy loss of genomic regions as reflected by a reduced BUSCO score.

(d) Phylogenomic analyses on the available molluscan genomes
With these two updated genomes, we re-analysed the genome-level phylogeny of Mollusca including other publicly available high-quality molluscan genomes. Our pipeline recovered 5,388 orthologous genes and an alignment totalling 1,727,673 amino acid positions. All genes were sampled for at least 17/34 taxa with an average of 30 taxa sampled per alignment and 14.90% missing data overall in the resulting matrix. The resulting maximum likelihood tree exhibited maximum support at every node (Figure 1). Among the only three molluscan classes with highquality genomes available at the time these analyses were performed, Cephalopoda was recovered sister to a clade comprising Bivalvia and Gastropoda, similar to previous studies [4,8]. However, this result does not necessarily reflect sister relationships per se among clades, given the limited availability of taxa. Similarly, since only one decapodiform and one octopodiform was available, genomic insights of the relationships within Cephalopoda must await better taxon sampling in the future. In Bivalvia, the only significant difference with previously published phylogenies is the position of the order Arcida, which was previously recovered sister to the rest of Pteriomorpha [41] but here it was recovered as sister to Pectinida. Instead, our tree infers the split between Arcida and Pectinida within Pteriomorpha occurred later than the split between an Arcida/Pectinida clade with a Mytilida/Ostreida clade. The bivalve taxon sampling of high-quality genomes, however, continues to suffer from a heavy bias to the clade Pteriomorphia. Although in recent years a number of representatives of Imparidentia have been sequenced, all other major bivalve clades including Protobranchia, Paleoheterodonta, Archiheterodonta, and Anomalodesmata remain unrepresented. Since only two major clades have high-quality genomes, the relationships among major bivalve clades also remain a key topic of future genomic research. Within Gastropoda, the relationships among major subclass-level clades, as well as families, remained similar to previous phylogenomic trees [42]. A split between Patellogastropoda/Vetigastropoda/Neomphaliones and Caenogastropoda/Heterobranchia was seen, and, within the former clade, Neomphalida was sister to Vetigastropoda and this pair was in turn sister to Patellogastropoda. However, understanding of the internal relationships among gastropods continues to suffer from a lack of sufficient taxon sampling, such as the total lack of members of the subclass Neritimorpha. A mitochondrial genomebased phylogeny including all gastropod subclasses recovered Patellogastropoda sister to the rest of Gastropoda, which is in-line with evidence from fossils and morphology [43] but different from a phylogenomic study based on transcriptomes where it was recovered sister to Vetigastropoda [42]. Patellogastropods are also thought to suffer from long-branch attraction, and it is difficult to resolve this group's phylogenetic position without a dense taxon sampling, as demonstrated by the mitogenome study where the position of Patellogastropoda was only reliably resolved when multiple genera were included in addition to Lottia. We hope more high-quality genome assemblies will be published at a faster pace in the near future, especially for currently under-sampled groups in Mollusca [44], using our benchmarking presented herein as a guide to achieving high efficiency.
We found a positive, statistically significant, correlation (r = 0.342, P = 0.048) between BUSCO score and orthologue gene occupancy per genome used in the phylogenomic analysis ( Figure 2). This indicates increasing the genome quality indeed leads to better coverage per orthologue group, thereby benefitting phylogenomic analyses in increasing the completeness of the data matrix that can be used. The two genomes newly updated herein, i.e. C. squamiferum and M. coruscus, have both higher BUSCO scores and orthologue gene occupancy compared to most of other published molluscan genomes, exemplifying that re-assembly of existing genomes using improved techniques is beneficial and useful. Compared with other molluscan genomes available in Gastropoda and Bivalvia, the two cephalopod genomes exhibited comparatively low BUSCO score and also orthologue gene occupancy. More optimised, higher-quality genome assemblies are required in Cephalopoda for a better coverage in the orthogroups, in order to improve the quality of phylogenomic analyses both within Cephalopoda and Mollusca.

Conclusions
We carried out benchmarking of various genome assemblers for the ONT using data from two molluscs, which suggested a 40-50X of ONT reads to be sufficient for achieving a high-quality genome assembly. Although different assemblers showed varying performances on different scores, overall Flye appears to be the best assembler for relatively simple genomes in Mollusca exemplified by Chrysomallon squamiferum, while NextDenovo performs the best for more complicated molluscan genomes exemplified by Mytilus coruscus. Increasing the genome assembly quality is beneficial to various downstream analyses, for instance, by increasing the completeness of the sampling matrices in the phylogenetic analysis. These results may also be applicable to other important yet neglected groups of marine invertebrates, such as polychaetes, brachiopods, nemerteans, and other lophotrochozoans, which may share similar genomic features with molluscs. In the future, it is necessary to assess the genome assembly quality with ultra-long ONT reads and also PacBio HiFi reads, which are newly available techniques with very little sequencing errors, and a comprehensive comparison between these two sequencing techniques are also needed.     Figure 2