Assessing long‐read sequencing with Nanopore R9, R10 and PacBio CCS to obtain high‐quality metagenome assembled genomes from complex microbial communities

Short‐read DNA sequencing has led to a massive growth of genome databases but mainly with highly fragmented metagenome assembled genomes from environmental systems. The fragmentation is a result of closely related species, strains, and genome repeats that cannot be resolved with short reads. To confidently explore the functional potential of a microbial community, high‐quality reference genomes are needed. In this study, we evaluated the use of different combinations of short (Illumina) and long‐read technologies (Nanopore R9.4, R10.3, and PacBio CCS) for recovering high‐quality metagenome assembled genomes (HQ MAGs) from a complex microbial community (anaerobic digester). Depending on the sequencing approach, 33 to 86 HQ MAGs (encompassing up to 34 % of the assembly and 49 % of the reads) were recovered using long reads, with Nanopore R9 featuring the lowest sequencing costs per HQ MAG recovered. PacBio CCS was also found to be an effective platform for genome‐centric metagenomics (74 HQ MAGs) and produced HQ MAGs with the lowest fragmentation (median of 9 contigs) as a stand‐alone technology. Using PacBio CCS MAGs as reference, we show that, although a high number of high‐quality MAGs can be generated using Nanopore R9, systematic indel errors are still present, which


Introduction
Bacteria live in almost every environment on Earth and a recent analysis of the global microbial diversity estimated a total of 10^12 species (1). To obtain representative genomes, culturing has been used to isolate specific microbes, but the throughput is highly limited. Recently, genome recovery from metagenomes with short, high-quality reads has drastically improved genome recovery for uncultivated species (2)(3)(4), and has spurred multiple tools for automated binning of draft genomes (5)(6)(7)(8). This has led to studies reporting thousands or even hundreds of thousands of genomes (9)(10)(11)(12)(13). However, the recovered metagenome assembled genomes (MAGs) are often very fragmented, incomplete, and contaminated with genome fragments from other microbes. Hence, even though the pace of genome sequencing is increasing, only a tiny fraction of the estimated microbial diversity has representative genomes (14), with 47,894 species having representative genomes in the Genome Taxonomy Database (GTDB) version 202 (15).
Recovering genomes from metagenomes using short reads is challenging due to the presence of genetic repeat elements. From an assembly point of view, repeat elements are identical DNA segments that cannot be spanned by the length of the sequenced reads. For example, the 16S ribosomal RNA gene is highly conserved between species, can be present in multiple copies within species, and is much longer than the typical short reads. This either breaks up the assembly graph or produces chimeric segments (16,17). In complex microbial communities a large source of repeat elements originates from closely related microbial strains that share large parts of their genomic content. The repeat elements have a coverage that is different from the rest of the genome, and are thus often not picked up by automated binning tools (11). In addition, genomes can also be contaminated with fragments from other organisms that just happen to correlate with the binned genome (11,14,18). Tools have been developed to estimate the completeness and the level of contamination and even "decontaminate" the bins (16,18). However, they rely heavily on good reference databases and assumptions about universal essential single copy genes that are known not to be truly universal (19,20). Perhaps the only way to be certain about the completeness of a genome is to produce closed MAGs, as in the case of circular chromosomes (21). This is especially important as repeat regions may contain biological information that is essential to understand a microbial community (22), as has been shown for some antimicrobial resistance genes (23).
Long-range information is, to a large degree, capable of solving the genome fragmentation caused by repeat elements (24). Different approaches have been developed to generate such data, e.g. mate-pair sequencing, synthetic long-read sequencing, and long-read single molecule sequencing, such as offered by Pacific Biosciences sequencing (PacBio) and Oxford Nanopore Technologies (25)(26)(27)(28)(29)(30)(31)(32). All of these techniques have been used to recover genomes from metagenomes in environments of varying complexity and recent technological breakthroughs deliver vastly increased data yields, which makes even higher complexity samples tractable (12)(33)(34)(35)(36)(37)(38). However, long read sequencing technologies feature drawbacks of their own, as the PacBio platform is relatively costly (39,40), while Nanopore sequencing exhibits systematic errors in homopolymer regions that can lead to insertions and deletions in the assemblies (41,42).
Here, we assess the recent DNA sequencing technologies in the context of genome-centric, high-throughput metagenomics under realistic scenarios for a typical small research project. We compare metagenome assembly, binning, rRNA and protein gene recovery between single flow cell sequencing runs of Illumina MiSeq and Oxford Nanopore MinION R.9.4.1 and R.10.3 (referred to as R9 and R10 hereafter) and PacBio Sequel II (circular consensus sequencing). The Illumina data also features variation in relative abundances (Fig S1b), presumably due to GC bias (43). Only the R9 and R10 relative abundance data correlated well (Fig S1c), even though they had a 2.7-fold difference in sequencing yield (Table 1). Hence, the comparisons are influenced by differences in sequencing depth, read length, library preparation technique (44) and changes to relative abundances of the community members, but still represent a realistic scenario for a typical research project. To assist automated contig binning, we performed Illumina sequencing of 9 additional samples from the same anaerobic digester spread over 9 years (Supp Table 1) and used the coverage profiles as input for binning. Furthermore, to evaluate the impact of micro-diversity on MAG quality, we calculated the polymorphic site rates for each MAG as a simple proxy for the presence of micro-diversity (10).

Results and discussion
After performing automated contig binning it is evident that micro-diversity has a large impact on MAG fragmentation, but that long-read sequencing results in much less fragmentation (Fig 1c-d) of bins at higher amounts of micro-diversity (Fig 2a). Despite large differences in read length for Nanopore and PacBio CCS data (N50 read length 6 kbp vs. 15 kbp), only small differences in bin fragmentation were observed, as compared to the Illumina-based results.
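The read N50 values quoted above, and the contig Nx plot in Figure 1, rest on the usual Nx definition: the largest length L such that sequences of length at least L together contain at least x % of all bases. The following is a minimal sketch of that definition; the function name `nx` and the toy lengths are illustrative assumptions and are not taken from the study's workflow.

```python
def nx(lengths, x=50):
    """Return the Nx value of a collection of sequence lengths.

    Nx is the largest length L such that sequences of length >= L
    together contain at least x % of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy lengths, for illustration only (not data from this study).
toy_lengths = [2, 3, 4, 5, 6]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
```

Evaluating the same function over a range of x values gives the points of an Nx curve for one assembly.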
After performing bin quality estimates, the most HQ MAGs were recovered by using Nanopore sequencing on an R9 flow cell (n=86), supplemented with Illumina read polishing, while also featuring the second lowest laboratory costs per recovered HQ MAG (Table 2). The second highest number of HQ MAGs (n=74) was achieved via PacBio CCS, which also exhibited the least HQ MAG [...] Nanopore assemblies with Illumina reads was observed to slightly improve the mean bin completeness for both R9 (from 91.9 ± 5.4 to 93.5 ± 4.7 %, p=0.0009) and R10 (from 91.9 ± 6.0 to 93.1 ± 5.9 %, p=0.021) chemistries. The observed increase in bin completeness from Illumina read [...] estimation is based on the presence of marker proteins (18)), due to correction of Nanopore sequencing systematic errors, such as homopolymer truncation that can artificially induce frameshifts, and thus introduce early stop codons in genes (45,46). Illumina read polishing still aids in recovering more HQ MAGs (Table 2) from complex microbial communities as well as producing presumably less erroneous genome sequences. Reduction of indel rates in genomes is expected to yield fewer truncated protein sequences (49). To investigate this we used the IDEEL test (12), which estimates protein length by comparing to known sequences. In the MAGs using Nanopore R9 or R10 data alone, respectively 15 % and 14 % of protein sequences were estimated to exhibit a greater than 20 % truncation, while proteins from the hybrid Nanopore and Illumina approach were found to be of comparable length to that of PacBio CCS (3-[...] protein truncation levels than PacBio CCS and the hybrid approach, although the following is expected to be influenced by the generally shorter contig lengths, leading to a higher count of fragmented protein sequences (50). A known source of errors in Nanopore-based assemblies is systematic homopolymer miscalling.
Similarly to the IDEEL test, using MAGs from PacBio CCS as a reference, Nanopore-only MAGs were found to feature increased homopolymer errors at longer lengths, especially for cytosine and guanine homopolymers (Fig S3), which coincides with read-level error rates of Nanopore sequencing (51). As expected, sequences from Nanopore R10 featured lower rates of homopolymer miscalling than R9 and the hybrid assemblies featured reduced error profiles, similar to that of Illumina-only assemblies.
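The homopolymer comparison ultimately depends on tallying homopolymer runs by base and by length. The sketch below illustrates only that tallying step on a single sequence; it is not the Counterr-based analysis used in this study, and the function name `homopolymer_counts` and the minimum run length of 3 are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def homopolymer_counts(seq, min_len=3):
    """Count homopolymer runs in a sequence, keyed by (base, run length).

    min_len is an illustrative cut-off (assumption), below which runs
    are not counted as homopolymers.
    """
    counts = Counter()
    # groupby yields each maximal run of identical consecutive characters.
    for base, run in groupby(seq.upper()):
        run_length = sum(1 for _ in run)
        if run_length >= min_len:
            counts[(base, run_length)] += 1
    return counts


# Toy sequence, for illustration only.
toy_counts = homopolymer_counts("AAACCCCGTTTAAA")
```

Comparing such tallies between a reference bin and the corresponding aligned region of a query bin would indicate at which lengths, and for which bases, runs are shortened or extended.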
To reiterate, although the benefits of the hybrid Nanopore and Illumina approach have been reported in numerous projects (52)(53)(54)(55)(56)(57)(58)(59), our study utilized PacBio CCS as a means of establishing reference sequences for examining the effects of short read polishing on Nanopore bins that were recovered from a complex microbial community. Given that short read length can be a cause for mapping issues, often leading to errors and biases in genome recovery (60)(61)(62), we observed evidence that Illumina read polishing of Nanopore bins still significantly reduces the estimated rate of indels, especially homopolymer errors, in a high complexity metagenome. Furthermore, although multiple bioinformatics workflows for frame-shift correction of Nanopore assemblies without short reads have been developed (48)(63)(64)(65), the indel corrections are not de novo, as the workflows are based on comparisons to reference databases. Hence, erroneous sequences of novel genes might not get corrected, or proteins with biologically occurring frame-shift mutations might be falsely converted to a full-length state, making the method suboptimal for characterising novel microbial species. Nonetheless, using databases to correct Nanopore sequencing errors could still be a useful and cost-effective approach when de novo recovery of novel genomic sequences is not relevant for the research project.
Lastly, the presence of repetitive DNA sequences, including genes in multiple copies, can lead to contig breaks in the assembly (66). Ribosomal RNA genes were chosen as an example of repeat genes, as they are also the most widely used marker genes for amplicon sequencing, phylogenetic analysis and FISH (67)(68)(69). Hence, sufficient recovery and binning of rRNA genes is an important aspect of high-throughput genome-centric metagenomics in order to connect MAGs to additional information.
Comparing the 44 MAGs captured in all assemblies (see Materials and Methods), 13 out of 44 Illumina bins featured all rRNA genes, compared with 40 for Nanopore R9, 41 for Nanopore R10 and 41 for PacBio CCS (Fig S4 a-c). In the Illumina assembly, the ratio of 16S to 23S rRNA gene counts was 0.76 (Fig S4 d-e), indicating either collapsed or fragmented genes as a result of micro-diversity, whereas for long-read technologies a comparable ratio of ∼1 was observed (Supp Table 2).

Conclusion
Long read DNA sequencing improves assembly contiguity over short read assemblies and, as a result, produces significantly greater numbers of high-quality genomes from complex microbial communities. Recovery of repeated rRNA genes via long read sequencing is also vastly improved over the short read approach. Nevertheless, short read sequencing is still useful as an economical way of acquiring time series data for binning MAGs, and it can be combined with Nanopore sequencing to recover more high-quality genome bins and full protein sequences from metagenomes. The PacBio CCS platform features overall superior read metrics in terms of accuracy and length, although the relatively higher sequencing costs can act as a bottleneck for large scale genome-centric metagenomics projects. Despite the multiple differences between Nanopore and PacBio CCS platforms, recovering genomes from complex metagenomes via Nanopore R9 sequencing with supplemental short read Illumina polishing was, at the moment of writing, found to be the optimal strategy, balancing laboratory costs, adequate sequencing depth and minimal sequencing errors in the assembled genomes.

Sampling
Sludge biomass was sampled from the anaerobic digester at Fredericia wastewater treatment plant (Latitude 55.552219, Longitude 9.722003) at multiple time points and stored as frozen 2 mL aliquots at -20°C.

DNA extraction
DNA was extracted from the anaerobic digester sludge using DNeasy PowerSoil Kit (QIAGEN, Germany) following the manufacturer's protocol. The extracted DNA was then size selected using the SRE XS (Circulomics, USA), according to the manufacturer's instructions.

DNA QC
DNA concentrations were determined using Qubit dsDNA HS kit and measured with a Qubit 3.0 fluorimeter (Thermo Fisher, USA). DNA size distribution was determined using an Agilent 2200 Tapestation system with genomic screentapes (Agilent Technologies, USA). DNA purity was determined using a NanoDrop One Spectrophotometer (Thermo Fisher, USA).

Nanopore DNA sequencing
Library preparation was carried out using the SQK-LSK109 kit (Oxford Nanopore Technologies, UK) following the manufacturer's protocol. The DNA libraries were sequenced using the R.9.4.1 and the R.10.3 MinION flowcells (Oxford Nanopore Technologies, UK) on a MinION Mk1B (Oxford Nanopore Technologies, UK) device. After sequencing, the MinION flowcells were washed using the Flow Cell Wash Kit (EXP-WSH002, Oxford Nanopore Technologies, UK) and the same library was loaded again to generate additional sequencing data.

Illumina DNA sequencing
The Illumina libraries were prepared using the Nextera DNA library preparation kit (Illumina, USA) following the manufacturer's protocol and sequenced using the Illumina MiSeq platform.

PacBio CCS
The size-selected DNA sample was sent to the DNA Sequencing Center at Brigham Young University. The DNA sample was fragmented with Megaruptor (Diagenode, Belgium) to 15 kb and size-selected using the Blue Pippin (Sage Science, USA). The sample was then prepared using SMRTbell Express Template Preparation Kit 1.0 (PacBio, USA) according to the manufacturer's instructions. Sequencing was performed on the Sequel II system (PacBio, USA) using the Sequel II Sequencing Kit 1.0 (PacBio, USA) with the Sequel II SMRT Cell 8M (PacBio, USA) for a 30 hour data collection time.
predictions. Bin quality was determined following the Genomic Standards Consortium guidelines, wherein a MAG of high quality featured genome completeness of more than 90 %, less than 5 % contamination, at least 18 distinct tRNA genes and the 5S, 16S, 23S rRNA genes occurring at least once (14). MAGs with completeness above 50 % and contamination below 10 % were classified as medium quality, while low quality MAGs featured completeness below 50 % and contamination below 10 %. MAGs with contamination estimates higher than 10 % were classified as contaminated.
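The quality tiers described above can be written as a small decision function. This is a minimal sketch rather than the code used in the study: the function name `classify_mag` and its argument names are hypothetical, and because the text does not say how values lying exactly on a threshold are handled, the sketch assumes that exactly 10 % contamination is not flagged as contaminated and that exactly 50 % completeness falls into the low quality tier.

```python
def classify_mag(completeness, contamination, n_trna,
                 has_5s, has_16s, has_23s):
    """Assign a quality tier to a MAG from the thresholds in the text.

    completeness and contamination are percentages; n_trna is the
    number of distinct tRNA genes. Boundary handling is an assumption.
    """
    # Contamination above 10 % overrides every other criterion.
    if contamination > 10:
        return "contaminated"
    # High quality needs all criteria at once, including rRNA genes.
    if (completeness > 90 and contamination < 5 and n_trna >= 18
            and has_5s and has_16s and has_23s):
        return "high"
    # Bins failing any high quality criterion fall back to medium or low.
    if completeness > 50:
        return "medium"
    return "low"


# Illustrative call with made-up values (not a bin from this study).
example_tier = classify_mag(95.0, 2.0, 20, True, False, True)
```

Note that a bin with high completeness and low contamination but a missing rRNA gene, as in the illustrative call, is classified as medium quality, which is why rRNA recovery matters for the HQ MAG counts.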
Bins were clustered using dRep (v.2.6.2 (83)) with "-comp 50 -con 10 -sa 0.95" settings. Only the bins that featured a coverage higher than 10 in their respective sequencing platform, and an Illumina read coverage higher than 5 for bins from the hybrid approach, were included in downstream analysis. For the IDEEL test (12), the predicted protein sequences from clustered bins were searched against the UniProt TrEMBL (84) database (release 2021_01) using Diamond (v.2.0.6 (85)). Query matches that were not present in all datasets were omitted to reduce noise.
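The protein truncation estimate can be illustrated as follows, assuming that the IDEEL test compares the length of each predicted protein with the length of its best database hit and that a query-to-hit length ratio below 0.8 corresponds to the "greater than 20 % truncation" reported in the results. The function name `truncated_fraction`, the input format (pairs of query and hit lengths) and the toy values are all assumptions for illustration and do not reproduce the study's scripts.

```python
def truncated_fraction(length_pairs, max_truncation=0.20):
    """Return the fraction of proteins shorter than their best hit
    by more than max_truncation.

    length_pairs is an iterable of (query_length, hit_length) tuples,
    for example collected from a protein search output (assumed format).
    """
    # Skip pairs with a non-positive hit length to avoid division by zero.
    ratios = [query / hit for query, hit in length_pairs if hit > 0]
    if not ratios:
        return 0.0
    truncated = sum(1 for ratio in ratios if ratio < 1 - max_truncation)
    return truncated / len(ratios)


# Toy length pairs, for illustration only.
toy_pairs = [(100, 100), (70, 100), (90, 100), (50, 100)]
toy_fraction = truncated_fraction(toy_pairs)
```

Restricting the input to query matches shared by all datasets, as described above, keeps the fractions comparable between sequencing approaches.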
QUAST (v.4.6.3 (86)) was applied on the clustered bins with less than 0.5 % SNP rate to acquire mismatch and indel metrics. Cases with QUAST parameters "Genome Fraction" of less than 75 % and "Unaligned length" of more than 250 kb were omitted to reduce noise. For homopolymer analysis, the clustered bins were mapped to each other using "asm5" mode of Minimap2 and [...]

DNA of a microbial community from an anaerobic digester was sequenced via Illumina MiSeq (2x300 bp), PacBio CCS and the Oxford Nanopore MinION platform, using R9 and R10 chemistry flow cells (Fig 1a). The reads from different sequencing platforms were then assembled using Megahit for short reads and metaFlye for long reads. Despite being the same sample of extracted DNA, direct comparisons are difficult as the additional size selection of the PacBio CCS dataset both increased the read length (Fig 1b) and altered the relative abundances of the species in the sample (Fig S1a).

Figure 1. Overview of general sequencing results from short and long read platforms. a) Read length and estimated accuracy (inferred from Phred scores, log-scaled) distributions of the read datasets. b) Contig Nx plot for the assembled metagenomes (after 1 kb length filtering). Differential coverage plots are presented for c) Illumina and d) Nanopore R9 metagenome assemblies.

Figure 2. Comparison of bins from different sequencing approaches. a) MAG fragmentation (log-scaled) at different bin SNP rates in PacBio CCS MAGs. b) Genome bin completeness estimates for different sequencing platforms. IL - Illumina, NP - Nanopore, PBCCS - PacBio CCS. Bin c) indel and d) mismatch rates (log-scaled) for MAGs from Nanopore sequencing with and without Illumina read polishing, compared to MAGs from PacBio CCS. The presented bin coverage on the x axis (log-scaled) is for the corresponding Nanopore chemistry type. HQ MAGs are represented by circles, while triangles denote MQ MAGs. For all figures, only the bins that were clustered together between all the different sequencing platforms (see Materials and methods) are presented.
homopolymer calling errors. For QUAST and Counterr, PacBio CCS bins were used as reference sequences.

Supplemental Figure 3. Homopolymer calling estimates in metagenomes from different sequencing platforms. Values in the heatmap show observed homopolymer counts estimated to be called correctly at given sequence length. The total counts of homopolymers (called correctly and incorrectly) are in brackets. Only the contigs for bins that were clustered together between different platforms were used to generate values for the plot.

Supplemental Figure 4. Recovery of rRNA genes from complex microbial communities via different sequencing techniques. Distributions of rRNA gene counts, found in MAGs from different sequencing methods, are presented for a) 16S, b) 23S and c) 5S rRNA genes. Only the MAGs that clustered together between all the different sequencing approaches were included in the plots. d) Counts for total rRNA sequences that were recovered from the assembled metagenomes. e) Recovered rRNA gene counts, normalized to assembly size.

Table 2. Results for automated contig binning. Genome quality was defined based on the MiMAG (14) standard using CheckM (18): HQ MAGs are defined by completeness of >90 %, <5 % contamination, at least 18 distinct tRNA genes and the 5S, 16S, 23S rRNA genes occurring at least once. MQ MAGs are defined by completeness >50 % and contamination <10 %, while low quality MAGs are defined by completeness <50 % and contamination <10 %. MAGs with contamination estimates higher than 10 % were classified as contaminated. *Fraction of the assembled contigs (> 1 kb) that were placed into bins. **Sequencing costs refer to the expenses encountered at the time of conducting the experiments and may differ for other research groups, depending on agreements with sequencing service providers.
fragmentation in terms of median contigs per bin (n=9), although the overall sequencing costs per HQ MAG recovered were greater, compared to other long read approaches.