Abstract
Single-molecule long-read sequencing technologies, such as Nanopore and PacBio, may be particularly relevant for microbiome studies, since they can perform sequencing without PCR amplification or bacterial culture, and the much longer reads may facilitate assignment of operational taxonomic units (OTUs) from the genus to the species level. However, due to the relatively high per-base error rates (∼15%), the application of long-read sequencing to microbiomes remains largely unexplored, and benchmarking studies on reference materials that assess its potential utility in microbiome studies are lacking. Here we deeply sequenced two human microbiota mock community samples from the Human Microbiome Project (525× coverage on HM-276D with 20 evenly mixed strains, 1068× coverage on HM-277D with 20 unevenly mixed strains). We showed that assembly programs consistently achieved high accuracy (∼99%) and completeness (∼99%) for bacterial strains with adequate coverage (∼99% in 276D and ∼72% in 277D). For HM-277D, we also found that long-read sequencing provides accurate estimates of species-level abundance (R=0.94, for 20 bacteria with abundance ranging from 0.005% to 64%). Taxonomic binning and profiling were more accurate at higher ranks, while performance decreased at the species level. We further compared the results with data generated by Illumina short-read sequencing and PacBio long-read sequencing. Our results demonstrate the feasibility of characterizing complete microbial genomes and populations from error-prone Nanopore sequencing data, but also highlight necessary bioinformatics improvements for future metagenomics tool development. All the data sets on reference microbiomes are made publicly available to facilitate benchmarking studies on metagenomics and the development of novel software tools.
Background
The fundamental importance of the microbiota, the microbial communities that reside in the human body, is increasingly recognized. Over the past decade, a large body of evidence has suggested that the microbiota plays a crucial role in human health by modulating metabolic functions, as well as food energy harvest and storage. The microbiota, especially the gut microbiota, is associated with many chronic diseases such as obesity, diabetes, metabolic syndrome, inflammatory bowel disease (IBD), irritable bowel syndrome (IBS), liver disease, and hepatocellular and colorectal carcinoma[1–14]. Therefore, accurate profiling of complete genomes and populations is crucial to understanding the impact of the microbiota on human health. Currently, high-throughput sequencing technologies are widely used in microbial community characterization. In particular, 16S ribosomal RNA (rRNA) sequencing[15] and shotgun metagenome sequencing on Illumina platforms[16] are the two dominant approaches for describing microbiomes. Overall, the high-throughput nature of metagenomics sequencing allows us to interpret microbial communities using computational approaches such as operational taxonomic unit (OTU) identification[17], abundance quantification[18], read assembly[19–23], binning and taxonomic profiling[24–29]. Specifically, 16S rRNA sequencing targets specific regions that are highly variable between species, which makes it very cost-efficient. This is useful for examining and comparing the microbiota across a large number of samples in a large-scale project. However, this technique can only identify bacteria but not viruses or fungi, and its low resolution limits its use in microbiome studies below the genus level. In contrast to sequencing only the 16S sequences, shotgun metagenome sequencing surveys the whole genomes of all organisms in the community [30–32]. Because it captures sequences from all organisms, it allows deep investigation of the microbial community.
Despite the theoretical advantage of shotgun metagenome sequencing, metagenomes cannot be fully characterized by next-generation sequencing (NGS) data because of the short read length (150 to 300 nucleotides). In addition, short reads lack the contextual information needed to span both intra- and intergenomic repeats, which is crucial for complete de novo genome assembly of all dominant species in a microbial community. As a consequence, short-read assemblies remain highly fragmented. In comparison, long-read sequencing has the potential to facilitate complete and contiguous metagenome assembly. Lee et al. [33] sequenced a reference mock community sample using PacBio long reads and evaluated the metagenome assembly performance. Their results showed that single-molecule real-time (SMRT) long-read data offered significantly improved assembly contiguity by spanning many repetitive regions, whereas a single bacterial chromosome was assembled into more than 50 contigs from short-read data. In recent years, Oxford Nanopore Technologies (ONT) sequencing has offered advantages over traditional short-read NGS technologies in genome studies. This single-molecule sequencing platform is able to generate average read lengths of >10 kbp, spanning low-complexity and repetitive genomic regions, which provides much more contiguous assemblies. Consequently, this approach has become an attractive option in metagenomics sequencing. While ONT has great potential, complete and contiguous de novo metagenome assembly is still constrained by the high error rate (∼15%) of single-molecule long-read sequence data[34]. Therefore, a comprehensive evaluation of long-read bioinformatics tools in microbial profiling is needed[35]. Nicholls et al.[36] presented Nanopore sequencing data sets of two mock communities with 10 microbial species from ZymoBIOMICS[37]. They showed the utility of these data sets for future bioinformatics method development for long-read metagenomics.
However, publicly available data sets of these samples based on other sequencing technologies are limited, as the samples are only commercially available and have not yet been well studied by competing approaches. A study that evaluates the advantages of Nanopore sequencing for complete microbial genomes and compares it with other sequencing technologies is still lacking.
In this article, we generated two deeply sequenced Nanopore data sets from reference samples that are more commonly studied, and performed a comprehensive analysis to compare microbial community profiling performance with PacBio and Illumina technologies. We first generated 525× coverage data on the HM-276D mock community sample from the Human Microbiome Project, which is an evenly mixed DNA sample of 20 bacterial strains (each with 5% abundance). We performed de novo assembly analysis with 4 long-read assemblers at different depths of coverage. The 20 bacterial genomes were assembled with high accuracy and genome completeness. This sample has also been well studied by many groups. As mentioned above, Lee et al. [33] sequenced this mock community with PacBio to show the improvement that long-read data bring to metagenome assembly analysis. Jones et al.[5] compared the influence of different NGS platforms on genomic and functional predictions using the HM-276D sample. We downloaded these two data sets and compared their performance with that of the Nanopore data. Our results show that Nanopore consistently improved assembly contiguity and completeness compared to PacBio and Illumina across computational approaches. Next, we sequenced the HM-277D mock community sample with 1068× coverage. HM-277D is an unevenly mixed DNA sample of 20 bacterial strains. Kuleshov et al.[38] sequenced this sample with the Illumina TruSeq synthetic long-read technique and showed improvements in bacterial species identification and genome reconstruction compared to short sequences. Also, Leggett et al. [39] used this community to demonstrate that Nanopore metagenomics sequences can be reliably classified. In addition to metagenome assembly, we evaluated taxonomic binning and profiling performance across technologies (Nanopore and PacBio) and samples (HM-276D and HM-277D). High identification and classification accuracy were achieved above the species level.
Overall, we demonstrate the technical feasibility of characterizing complete microbial genomes and populations from error-prone Nanopore sequencing without any DNA amplification. We also discuss the limitations of current bioinformatics tools when dealing with error-prone long-read metagenomics sequencing data. All our data are made publicly available to benefit computational tool development on long-read based microbial genome assembly for metagenomics studies.
Results
Sequence data quality
The HM-276D DNA sample includes 20 evenly mixed bacterial strains with a total reference genome size of 70 Mb across 39 chromosomes. A total of 11,610,183 reads with 35,578,375,166 bases (525× coverage depth) were generated on the Nanopore GridION platform, with a median length of 1,374 bp. The N50 length is 6,828 bp and the median read quality is 9.39 on the Phred scale. Using minimap2, 95% of reads were successfully aligned to the reference genomes of the 20 bacterial strains, with a 13.1% error rate. As shown in Figure 1(a), read coverage across the 20 bacterial strains agrees well with the known abundances. Read depth is relatively homogeneous across bacterial strains, averaging 521.9× (sd = 524.7×). The sequencing depth of each strain is at least 150 reads, and only 0.03% of the reference is covered by fewer than 3 reads.
The HM-277D DNA sample includes 20 unevenly mixed bacterial strains. A data set of 18,254,839 reads with 72,312,638,112 bases (1068× coverage depth) was generated, with a median read length of 2,065 bp and a median read quality of 10.12. The N50 length is 7,857 bp. Of the QC-passed reads, 99.2% were mapped to the reference genomes, with an error rate of 9.8%. As shown in Figure 1(b), the read distribution is more heterogeneous across strains because the sample is unevenly mixed. The average coverage is 988.8× with a standard deviation of 1941.6×. As a result, 1.6% of the reference is covered by fewer than 3 reads and 4 strains have a sequencing depth below 10×, which makes biological interpretation of this microbial community more difficult.
De novo assembly of HM-276D mock community
To assess the ability of Nanopore sequencing to profile a microbial community, we first conducted a de novo assembly of the 525× data set from the HM-276D mock community using 4 assemblers: wtdbg2[19], OPERA-MS[20], Canu[21] and meta-flye[22]. OPERA-MS and meta-flye are designed to handle metagenome data, while wtdbg2 and Canu are broadly used for haploid or diploid genomes. Overall, the results show promise for the characterization of microbial genomes using long-read sequencing data. Canu produced the largest assembly of 69.5 Mb (99.3% of the benchmark data), comprising 83 contigs with a contig N50 length of 3.91 Mb. meta-flye assembled a 67.7 Mb genome in 89 contigs. wtdbg2 generated similar results, with a 64.9 Mb genome size, 61 contigs and a 2.97 Mb N50 length. The assembly metrics of OPERA-MS (67.9 Mb genome size, 4734 contigs with a contig N50 length of 2.94 Mb) are similar to those of Canu and wtdbg2, although many more contigs were generated because OPERA-MS utilizes both long and short sequencing reads for assembly. By mapping all contigs to the reference genomes using MUMmer v3.23, we assessed the accuracy and genome completeness of the contigs produced by the 4 assemblers. As shown in Figure 2(a), meta-flye achieved the highest genome fraction (99.99%) and 1-to-1 identity percentage (99.62%), followed by OPERA-MS (genome fraction 99.98% and accuracy 99.92%), Canu (genome fraction 99.81% and accuracy 99.4%) and wtdbg2 (genome fraction 95.94% and accuracy 98.73%). Thus, the 4 tools generated results of similarly good quality in terms of contiguity, accuracy and completeness using long-read data from evenly mixed samples at 525× coverage depth.
Next, we subsampled the 525× data set to 365× (70%), 160× (30%), 80× (15%), 40× (7.5%) and 20× (3.75%) to examine the effect of sequencing depth on de novo assembly. The assembly results of the 4 tools range from 95.95% to 99.96% in consensus accuracy and from 91.26% to 99.99% in genome fraction. Specifically, OPERA-MS outperforms the others with the highest and most consistent metrics for completeness and accuracy across sequencing depths, because its metagenomics design substantially improves robustness to low sequencing depth: its genome fractions average 99.68% (sd = 0.61%) and its consensus identities average 99.92% (sd = 0.05%). Although their metrics decline as sequencing depth decreases, meta-flye and Canu still recovered at least 96.8% of the genomes with 98.5% accuracy. Notably, the assembly metrics of wtdbg2 improved as coverage depth was reduced from 525× to 80×. In addition, we examined whether the genomes of the 20 bacterial strains can be better reconstructed with Nanopore sequencing technology than with PacBio and Illumina. As shown in Figure 2(b), assemblers using Nanopore data outperform those using the other two technologies. With the same assembler, on average, the number of contigs for Nanopore is ∼30% lower than for PacBio, and genome fraction and genome size are 1.56% and 3.1 Mb higher, respectively. Assemblies using Illumina data reach 99.9% accuracy, but contain more contigs and have a smaller total genome size than the Nanopore assemblies.
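The relation between target depth and the fraction of reads retained can be sketched as follows. This is a minimal illustration, assuming simple random subsampling of reads; the function names `keep_fraction` and `subsample_reads` are hypothetical and do not describe the tool actually used in this study. The exact ratios (e.g. 160/525 ≈ 30.5%) are close to the rounded percentages quoted above.

```python
import random

FULL_DEPTH = 525                      # depth of the full HM-276D data set (from the text)
TARGET_DEPTHS = [365, 160, 80, 40, 20]


def keep_fraction(target_depth, full_depth=FULL_DEPTH):
    """Fraction of reads to retain so that expected depth equals target_depth."""
    return target_depth / full_depth


def subsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (assumed scheme)."""
    rng = random.Random(seed)         # fixed seed makes the subsample reproducible
    return [read for read in reads if rng.random() < fraction]


if __name__ == "__main__":
    for depth in TARGET_DEPTHS:
        print(f"{depth}x -> keep {keep_fraction(depth):.1%} of reads")
```

Because each read is kept independently, the realized depth fluctuates slightly around the target; sampling a fixed number of reads would be an equally reasonable alternative.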
De novo assembly of HM-277D mock community
To evaluate metagenome reconstruction in a more realistic setting, we carried out another de novo assembly of the 1068× data set from the HM-277D mock community, an unevenly mixed DNA sample of the 20 bacterial strains. Assembly accuracy remains high, ranging from 97.78% to 99.75% across tools. However, not surprisingly, the genome fractions and genome sizes of all methods are substantially lower than for the even community. This is because 13 bacterial strains have extremely low abundances (<1%) in this unevenly mixed sample, leading to reduced genome fractions (Canu: 71.68%, OPERA-MS: 71.25%, meta-flye: 91.57%, wtdbg2: 59.7%) and genome sizes (Canu: 50.21 Mb, OPERA-MS: 47.99 Mb, meta-flye: 64.12 Mb, wtdbg2: 41.85 Mb). To assess how strain abundance affects assemblies, we calculated the strain-specific genome fraction for each tool (Figure 2(a)). Across bacterial strains, meta-flye recovered the highest percentage of the genome (median 100%), followed by OPERA-MS (median 98.75%) and Canu (median 94.78%), while the assemblies of wtdbg2 covered only 31.22% (median). For bacteria with relative abundance higher than 0.2%, at least 99.99% of the reference genome is covered by assembly contigs (meta-flye), with consensus identity reaching 99.93%. These results suggest that bacterial strains with nontrivial abundance can be accurately assembled from Nanopore data. Overall, we observed that meta-flye returned the assemblies of the 20 bacterial strains with the best completeness and accuracy. The metrics for each strain are correlated with the abundance of the corresponding bacterium. Some strains proved hard to assemble for all assemblers because of their extremely low relative abundance. For example, 13.6% of the Enterococcus faecalis genome (0.011% relative abundance) was covered by 0 or 1 read and 56.1% by fewer than 3 reads, leading to a 4.47% genome fraction for meta-flye.
Moreover, 2 contigs each contained sequence from two different bacterial species, Bacteroides vulgatus (0.19% relative abundance) and Streptococcus pneumoniae (0.05% relative abundance), indicating the difficulty of differentiating one bacterium from another at low relative abundance.
Taxon binning and identification
Metagenome assemblers construct contigs of variable length to recover the original genome of each bacterium in a microbial community. Another major challenge in studying the identity and diversity of community members is to classify sequenced reads or contigs correctly according to their taxonomic origins. Here we investigated taxonomic binning performance in 3 scenarios of long-read sequencing data, HM-276D (Nanopore, PacBio) and HM-277D (Nanopore) at 160× depth of coverage, using a state-of-the-art taxonomic binner, Megan-LR. First, all long reads were aligned to the NCBI-nr database. Then, we used Megan-LR with the interval-union LCA algorithm to assign ∼2 million aligned reads (∼4.6 Mb of bases) to taxonomic nodes (Figure 3(a,b)). Overall, 4.22 Mb (0.087%) of the Nanopore data from the HM-276D sample were mis-assigned, compared with 4.37 Mb (0.075%) for the Nanopore data of HM-277D and 4.66 Mb (0.141%) for the PacBio data of HM-276D. We then evaluated the recovery of taxon bins at different ranks. We considered two metrics to quantify read assignment accuracy: the average precision and the average sensitivity over the 20 bacterial strains. For each taxonomic bin, precision is the percentage of correctly classified reads out of all reads assigned to the bin. Sensitivity is the percentage of correctly assigned reads out of all reads that truly originate from the bin. As shown in Figure 3(c), HM-276D (Nanopore) has the highest precision, above 60% at every rank from phylum to genus. HM-277D (Nanopore) followed, with values above 50% at every rank, while HM-276D (PacBio) has the lowest average precision because of small false positive bins predicted at the species level. Sensitivity shows a similar pattern (Figure 3(d)). HM-276D (Nanopore) still appears to be a better data set for read classification than the other two, and the difference in accuracy between these 3 scenarios is similar across ranks. Nanopore is ∼8% higher than PacBio and HM-276D is 10% higher than HM-277D.
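The two binning metrics defined above can be written down compactly. The sketch below is only an illustration under stated assumptions: each read carries a known true taxon and a predicted taxon (with `None` standing for an unassigned read), and the function names are hypothetical rather than part of Megan-LR or of the evaluation scripts used here.

```python
from collections import Counter


def bin_precision_sensitivity(true_labels, predicted_labels):
    """Return {taxon: (precision, sensitivity)} for every taxon in the truth set.

    precision   = correctly assigned reads / all reads assigned to the bin
    sensitivity = correctly assigned reads / all reads truly from the bin
    """
    correct, binned, truth = Counter(), Counter(), Counter()
    for true_taxon, predicted in zip(true_labels, predicted_labels):
        truth[true_taxon] += 1
        if predicted is not None:          # None marks an unassigned read
            binned[predicted] += 1
            if predicted == true_taxon:
                correct[true_taxon] += 1
    metrics = {}
    for taxon, n_true in truth.items():
        precision = correct[taxon] / binned[taxon] if binned[taxon] else 0.0
        metrics[taxon] = (precision, correct[taxon] / n_true)
    return metrics


def average_metrics(metrics):
    """Average precision and sensitivity over all bins."""
    n = len(metrics)
    avg_precision = sum(p for p, _ in metrics.values()) / n
    avg_sensitivity = sum(s for _, s in metrics.values()) / n
    return avg_precision, avg_sensitivity
```

Applying the same functions to labels truncated at a given rank (phylum, class, and so on) would give rank-specific values of the kind summarized in Figure 3(c,d).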
To evaluate the stability of read assignment accuracy, we calculated the 95% confidence interval of precision and sensitivity for each scenario at each rank. Not surprisingly, confidence bands are narrower at higher ranks, indicating that higher taxon recovery accuracy can be reached there. Owing to the unevenly mixed bacterial strains, sensitivity is much more variable for HM-277D than for HM-276D. Overall, these results demonstrate the advantage of long-read data for accurate taxon recovery above the family level, while binning accuracy and stability were relatively lower at the species level.
In addition to assigning sequence fragments (reads or contigs) to taxon bins, it is important to determine accurately whether a taxonomic identity is present in or absent from a microbial community. Therefore, we investigated the performance of taxonomic identity prediction on the data from HM-276D (Nanopore, PacBio) and HM-277D (Nanopore). For taxon prediction, we defined a species as present in the community when at least 10 reads were assigned to it, while an identity with fewer than 10 supporting reads was marked as absent. We considered two further metrics to quantify detection accuracy, the true positive rate (TPR) and the false discovery rate (FDR), where TPR is the percentage of correctly predicted taxonomic identities out of all known existing taxa and FDR is the percentage of incorrectly predicted taxonomic identities out of all predicted taxa. TPR and FDR at different ranks are shown in Figure 3(e). TPRs were consistent across the 3 data sets from the phylum to the order level (90%-77%). Below the order level, PacBio (HM-276D) and Nanopore (HM-277D) are 22% lower than Nanopore (HM-276D) (92%-87%). From the phylum to the family level, FDRs were controlled under 15% for all 3 data sets. However, at the genus level, more than 20% of detections are false for PacBio (HM-276D) and Nanopore (HM-277D), compared with 6% for Nanopore (HM-276D). All 3 scenarios have inflated FDRs (>20%) at the species level. Across data sets, FDR increased drastically from 10%±3% at the phylum-to-family levels to 21%±5% below the family level. Similar to the binning results, the Nanopore data of HM-276D consistently performed better than the other two data sets across ranks. However, accurately predicting taxonomic profiles at the species level remains challenging because of many falsely predicted taxonomic identities, each with 10 to 100 incorrectly assigned reads.
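The presence/absence rule and the two detection metrics above can be illustrated with a short sketch. It assumes a mapping from taxon name to the number of assigned reads and a set of taxa known to be in the mock community; the function name `detection_rates` and the example taxa are hypothetical.

```python
def detection_rates(read_counts, known_taxa, min_reads=10):
    """Return (TPR, FDR) for a presence call of 'at least min_reads assigned reads'.

    read_counts: dict mapping taxon name -> number of reads assigned to it
    known_taxa:  iterable of taxa truly present in the mock community
    """
    known = set(known_taxa)
    predicted = {taxon for taxon, n in read_counts.items() if n >= min_reads}
    true_positives = predicted & known
    false_positives = predicted - known
    tpr = len(true_positives) / len(known) if known else 0.0
    fdr = len(false_positives) / len(predicted) if predicted else 0.0
    return tpr, fdr


if __name__ == "__main__":
    # Toy example: B falls below the 10-read threshold, X is a false detection.
    counts = {"A": 100, "B": 9, "C": 10, "X": 50}
    print(detection_rates(counts, {"A", "B", "C"}))
```

Raising `min_reads` would trade sensitivity for a lower FDR, which is relevant given that many false species-level identities reported above were supported by 10 to 100 reads.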
Strain profiling
Despite the challenges in assembly and binning of the HM-277D microbial community even at the species level, especially for low-abundance bacteria (relative abundance < 1%), the gold standard profile of this mock community still allows us to evaluate other unique advantages of this deeply sequenced data set at the strain level. First, we examined the ability to identify the 13 extremely rare strains based on annotated target genes. To explore the sensitivity of strain detection using this data set, we mapped raw sequenced reads to the reference genomes of the 20 bacterial strains with Minimap2. Then, for each strain-specific gene, the average coverage was estimated by summing read depth across all exonic regions and normalizing by gene length. In addition, exon coverage fractions were calculated. We declared a gene detected only when its average coverage was greater than 1 and its exon coverage fraction was greater than 50%. The results are shown in Figure 4(a). Detection rates and average coverage across all genes remain high in abundant strains (>1%), with coverage ranging from 96.4× to 4207.6×, as well as in most rare strains (<1%). All bacterial strains except Bacteroides vulgatus (69.1%) and Streptococcus pneumoniae (81.7%) achieved a gene detection rate of at least 97%.
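The gene detection criterion can be sketched as below. This is a simplified illustration, assuming that per-base read depths over the annotated region of a gene are already available (for example, extracted from the Minimap2 alignments); the function names and the interpretation of coverage fraction as the share of positions with non-zero depth are assumptions, not a description of the scripts used in this study.

```python
def gene_detected(depths, min_mean_depth=1.0, min_covered_fraction=0.5):
    """Apply the two-part rule: mean depth > 1 and covered fraction > 50%.

    depths: per-base read depths over the annotated region of one gene.
    """
    if not depths:
        return False
    mean_depth = sum(depths) / len(depths)                  # depth normalized by length
    covered = sum(1 for d in depths if d > 0) / len(depths)  # share of covered bases
    return mean_depth > min_mean_depth and covered > min_covered_fraction


def detection_rate(genes):
    """Fraction of genes (dict: gene name -> per-base depths) declared detected."""
    if not genes:
        return 0.0
    return sum(gene_detected(d) for d in genes.values()) / len(genes)
```

Requiring both conditions guards against a gene being called detected on the strength of a few deep reads that pile up on a small part of its sequence.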
Next, because 16S rRNA genes are the most commonly used marker genes for bacterial identification, we selected them for each strain based on RefSeq annotation. As shown in Figure 4(a), although about 50% of the 16S rRNA genes of Bacteroides vulgatus and Streptococcus pneumoniae remain undetected by raw sequenced reads, 18 strains have 100% detection rates and exon coverage fractions, with an average coverage of 434.77×, which demonstrates the feasibility of identifying rare strains (<1%) in a microbial community with long-read sequencing data. Additionally, we summarized the read coverage of protein-coding genes for the 20 bacterial strains, which shows similar results. Fourteen strains have an average coverage above 100× and the gene detection rates of 18 strains reach 99%, indicating the presence of these bacterial strains in the sample.
To understand the composition, diversity and spatial dynamics of microbial communities, we next evaluated the accuracy of bacterial abundance estimation based on Nanopore data. We used two metrics to measure accuracy, the Pearson correlation and the L1 norm. These two metrics assess how well Nanopore reads can reconstruct the bacterial abundances in comparison to the gold standard. Relative abundance was obtained by normalizing total read coverage by chromosome length for each taxon at different ranks. As shown in Figure 4(b), abundance estimates at the species level agree well with the known relative abundances of the mock community. However, abundance estimation at higher ranks appears to be more challenging: the correlation coefficient ranges from 0.87 to 0.85 and the L1 norm is above 0.3 from the class to the family level, while both metrics improve, with Pearson correlation > 0.9 and L1 < 0.29, below the family level. Poor abundance estimation at the class or family level may be due to the presence of extremely rare bacterial strains in the HM-277D sample, as read coverages were simply summed across species belonging to the same family or class without accounting for abundance heterogeneity.
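For concreteness, the two accuracy metrics and the length normalization can be sketched as follows. This is a minimal illustration, assuming per-taxon totals of aligned bases and reference lengths are available; the helper names are hypothetical, and the exact normalization used in this study may differ in detail.

```python
import math


def relative_abundance(total_bases, genome_lengths):
    """Length-normalized relative abundance: depth per taxon, rescaled to sum to 1."""
    depth = {t: total_bases[t] / genome_lengths[t] for t in total_bases}
    total = sum(depth.values())
    return {t: d / total for t, d in depth.items()}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def l1_norm(estimated, expected):
    """Sum of absolute differences between estimated and gold-standard abundances."""
    return sum(abs(a - b) for a, b in zip(estimated, expected))
```

The Pearson correlation is dominated by the most abundant taxa, whereas the L1 norm accumulates absolute deviations over all taxa, so reporting both gives a more balanced picture for a community whose abundances span several orders of magnitude.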
Discussion
Complete genome assembly and population profiling are critical for the interpretation of microbial community diversity. However, a benchmarking long-read data set with consistent evaluation metrics is still lacking, which has hindered our understanding of long-read sequence data in metagenome assembly. In this study, we deeply sequenced the HM-276D and HM-277D samples to assess the performance of error-prone Nanopore sequencing data and bioinformatics tools in characterizing microbial communities. Assemblers consistently achieved high accuracy and completeness for bacterial strains with nontrivial abundance, and genome binners performed well above the genus level. Furthermore, by targeting marker genes, we were able to identify rare strains with extremely low abundance in the microbial community. Overall, our results demonstrate the technical feasibility of characterizing complete microbial genomes and populations from Nanopore sequencing data with metagenomic software.
We note that despite the feasibility of characterizing complete microbial genomes from long-read sequencing data, there are still challenges to be resolved in our study. Even for evenly mixed samples, the best-performing assembler, meta-flye, achieves 99.99% consensus accuracy. However, as the reference genomes contain 70 Mb, a 0.04% error rate leads to 28 kbp of mismatches. These erroneous bases could be due to sequencing errors in low-quality reads, a major drawback of long-read sequence data, and to base modifications, which may complicate genome assembly. To prevent these errors, an unbiased, methylation-aware base caller is needed. (We also acknowledge that some of the mismatches may be due to natural differences between the reference microbiome samples and the reference genomes that were used.) In addition, there is still room for further improvement in assembly completeness by using longer reads or better designed assemblers that account for long repeats in genomes. In our study, we assembled long-read data from 20 bacterial strains of different species. However, performance at the strain level remains unknown, as closely related genomes are always a major challenge for genome assembly. In the future, we anticipate that more mock microbial communities containing multiple strains of the same species will be released for benchmarking studies.
By evaluating the performance of bioinformatics tools across different technologies, we found that third-generation sequencing generally facilitates the complete characterization of complex bacterial genomes by overcoming many limitations of second-generation sequencing. The short read length has limited the ability of Illumina sequencing in genome interpretation. For example, repetitive genomic regions are often longer than a single read. As a consequence, intra- and intergenomic diversity is unlikely to be captured by short-read sequencing data. This issue has been addressed by long-read sequencing technologies (ONT and PacBio), which are able to span low-complexity and repetitive regions by providing sequence reads of at least 10 kb in length. While generating data with a much higher error rate than PacBio, ONT has become a promising platform in many applications, especially for studies requiring large amounts of data. This is because ONT provides longer reads (up to 900 kb in length) with higher throughput compared to PacBio (10-15 kb in length). Moreover, ONT is currently more affordable, with a lower per-base cost of data generation, which is a key factor in long-read sequencing studies. Overall, the application of these two major long-read sequencing platforms in metagenomics analysis of complex communities is still restricted by their higher error rates. This problem could be addressed by improving consensus sequences. The recently released R10 chip from ONT has a longer base-contacting constriction in the pore, which improves homopolymer resolution compared to R9. This can lead to metagenome assemblies with higher accuracy and completeness, as well as more accurate OTU identification. Future metagenomics studies are expected to be changed dramatically by this approach. For example, strains UA159 and NN2025 of the species Streptococcus mutans only share 8% common regions, which can be uniquely assigned.
We then found that 20% of ONT reads can cover the unique regions of each of these two strains, which is infeasible for short reads. Therefore, with better quality long-read data, this approach may allow us to identify bacteria of interest directly at the strain level instead of performing binning analysis in the future.
In addition to illustrating the advantages brought by long-read sequence data, we also assessed the performance of four de novo assembly algorithms and a long-read genome binner. The bioinformatics challenges in interpreting the rich information from complex microbial communities include high error rates and low throughput for long-read sequencing, the fragmented nature of short-read sequencing, and large CPU-hour requirements. For the evenly mixed (each with 5% abundance) HM-276D mock community, the 4 tools consistently achieved high accuracy and completeness. No single assembler significantly outperforms the others. When subsampling the data to lower coverage depths, not surprisingly, we found that the corresponding metrics for the 4 tools decreased. In terms of speed, wtdbg2 is tens of times faster than the other tools. For the unevenly mixed mock community HM-277D, assembly accuracy remains high for all 4 tools (∼97-98%). Genome fraction was reduced because 13 rare bacterial strains (<1%) were poorly assembled. The hybrid assembler OPERA-MS, which combines the advantages of long- and short-read technologies, is more robust than the other tools for bacterial strains with extremely low abundance. However, it produced many more contigs with lower contiguity, while meta-flye, Canu and wtdbg2 returned a single contig for 18, 15 and 17 strains, respectively. Furthermore, taxonomic binning results show that Megan-LR performs well when genomes are not closely related. Taxon bins were reconstructed with acceptable accuracy down to the genus level, while performance decreased at the species and strain levels.
In summary, our results demonstrate the feasibility to characterize complete microbial genomes and populations from error-prone Nanopore sequencing data, but also highlight necessary bioinformatics improvements for future metagenomics tool development to handle specific challenges in error-prone long-read sequencing data. We believe that future metagenomics studies will benefit from this approach to assemble complete microbial genomes, while maintaining the theoretical ability to detect DNA methylations and base modifications, infer repetitive elements and structural variants, and achieve strain-level resolution within microbial communities. All the data sets on reference microbiomes are made publicly available to facilitate benchmarking studies on metagenomics and the development of novel software tools.
Methods and materials
Oxford Nanopore sequencing of HM-276D and HM-277D
DNA samples of HM-276D and HM-277D were ordered from BEI Resources. Concentration of DNA was assessed using the dsDNA HS assay on a Qubit fluorometer (Thermo Fisher).
For library preparation, 1.0 µg DNA was used as the input for each library. The library was prepared using the ligation sequencing protocol (SQK-LSK109) from ONT. Concretely, end repair, dA-tailing and DNA repair were performed using the NEBNext Ultra II End Repair/dA-tailing Module (catalog No. E7546) and NEBNext FFPE Repair Mix (M6630). In all, 3.5 µl Ultra II End-prep reaction buffer, 3 µl Ultra II End-prep enzyme mix, 3.5 µl NEBNext FFPE DNA Repair Buffer and 2 µl NEBNext FFPE DNA Repair Mix were added to the input DNA. The total volume was adjusted to 60 µl by adding nuclease-free water (NFW). The mixture was incubated at 20 °C for 5 min and 65 °C for 5 min. A 1 × volume (60 µl) AMPure XP clean-up was performed and the DNA was eluted in 61 µl NFW. One microliter of the eluted dA-tailed DNA was quantified using the Qubit fluorometer. A total of 0.7 µg DNA should be retained if the process is successful.
Adaptor ligation was performed using the following steps. Five microliters of Adaptor Mix (ONT, SQK-LSK109 Kit), 25 µl Ligation Buffer (ONT, SQK-LSK109 Kit) and 10 µl NEBNext Quick T4 DNA Ligase (NEB, catalog No. E6056) were added to the 60 µl dA-tailed DNA from the previous step. The mixture was incubated at room temperature for 10 min. The adaptor-ligated DNA was cleaned up using 40 µl AMPure XP beads. The mixture of DNA and AMPure XP beads was incubated for 5 min at room temperature and the pellet was washed twice with 250 µl Long Fragment Buffer (ONT, SQK-LSK109). The purified ligated DNA was resuspended in 15 µl Elution Buffer (ONT, SQK-LSK109). A 1-µl aliquot was quantified by fluorometry (Qubit) to ensure that ≥ 400 ng DNA was retained. The final library was prepared by mixing 37.5 µl Sequencing Buffer (ONT, SQK-LSK109), 25.5 µl Loading Beads (ONT, SQK-LSK109), and 12 µl purified ligated DNA. The library was loaded onto R9.4 flow cells (FLO-MIN106, ONT) according to the manufacturer’s guidelines. GridION sequencing was performed using default settings for the R9.4 flow cell and the SQK-LSK109 library preparation kit. The sequencing was controlled and monitored using the MinKNOW software developed by ONT.
Metagenome assembly
Genome assemblies of the 20 mixed bacteria from the HM-276D and HM-277D mock communities were generated from the long-read sequencing data using four existing assemblers: wtdbg2 (v2.4), OPERA-MS, Canu (v1.8) and meta-flye. Canu and meta-flye are designed to handle metagenomes, whereas wtdbg2 and OPERA-MS are intended for broader application. To evaluate the impact of coverage depth on genome assembly, in addition to the full 525× (HM-276D) and 1068× (HM-277D) data sets, we subsampled 5 data sets with 365×, 160×, 80×, 40× and 20× coverage for each of the two mock communities. In addition to long-read data, OPERA-MS requires short reads to improve assembly accuracy. Hence, we downloaded the Illumina-sequenced HM-276D[5] and HM-277D[38] data sets. These short-read data were likewise subsampled to depths of 160×, 80×, 40× and 20× and were provided to OPERA-MS in the analysis of the corresponding data set. We also analyzed a PacBio data set[33] of the HM-276D sample using wtdbg2, OPERA-MS, Canu and meta-flye to compare assembly performance across sequencing technologies. For a fair comparison, we applied consistent configuration settings for each tool across the different coverage depths. Specifically, we set the estimated genome size to 70M; the parameters were “-x ont -g 70m -t 20” for wtdbg2, “genomeSize=70M useGrid=True” for Canu, “CONTIG_LEN_THR 500, CONTIG_EDGE_LEN 80, CONTIG_WINDOW_LEN 340, KMER_SIZE 60, LONG_READ_MAPPER minimap2” for OPERA-MS, and “-t 40 -g 70m -o ./ --meta” for meta-flye. 40 contig output files were obtained (2 mock community samples, 6 depths of coverage, 4 assembly tools) for further evaluation.
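The text does not say which tool was used for subsampling. As an illustration only, the following minimal Python sketch shows one common way to subsample reads to a target depth: reads are drawn in random order until their combined length reaches the target coverage multiplied by the estimated genome size (70M, as stated above). The function names and the fixed random seed are assumptions made for this example and are not taken from the study.

```python
import random

# Values taken from the text; everything else below is illustrative.
GENOME_SIZE = 70_000_000                      # estimated combined genome size (70M)
FULL_COVERAGE = {"HM-276D": 525, "HM-277D": 1068}
TARGET_COVERAGES = [365, 160, 80, 40, 20]


def sampling_fractions(full_coverage, targets):
    """Approximate fraction of bases to keep for each target coverage."""
    return {t: t / full_coverage for t in targets}


def subsample_to_coverage(read_lengths, genome_size, target_coverage, seed=0):
    """Pick reads in random order until target_coverage * genome_size bases
    are reached. Returns (indices of chosen reads, total bases chosen).
    If the input holds fewer bases than requested, all reads are returned."""
    target_bases = target_coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)        # reproducible shuffle
    chosen, total = [], 0
    for idx in order:
        if total >= target_bases:
            break
        chosen.append(idx)
        total += read_lengths[idx]
    return chosen, total


if __name__ == "__main__":
    for sample, cov in FULL_COVERAGE.items():
        for target, frac in sampling_fractions(cov, TARGET_COVERAGES).items():
            print(f"{sample}: {target}x keeps about {frac:.1%} of bases")
```

Sampling by cumulative bases rather than by read count keeps the achieved depth close to the target even when read lengths vary widely, which is typical of Nanopore data.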
Metagenome assembly evaluation
Assemblies produced by each tool for the different samples and coverage depths were evaluated with metrics related to contiguity, genome completeness and accuracy. To assess assembly contiguity, we first used an in-house script to calculate the widely used N50 statistic, which is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the assembly. In addition, other related statistics, such as the number of contigs, the number of long contigs (>10 kb), the length of the longest contig and the total assembly size, were collected from the FASTA output file of each assembler. Furthermore, we computed NG50 for each method by replacing the assembly size with the estimated genome size; this quantity is the length of the shortest contig needed to cover 50% of the genome. Based on these metrics, the contiguity of the assemblies was comprehensively evaluated. Next, we downloaded the reference genome FASTA files of all 20 bacteria from the NCBI database to measure the concordance between the references and the assemblies. First, assemblies were mapped to the reference genomes using MUMmer v3.23 with parameters “-maxmatch -c 100 -p nucmer”. Then, by comparing all contigs mapped onto the reference using dnadiff, assembly accuracy was calculated as the 1-to-1 alignment identity, which is the percentage of correctly matched base pairs in contigs uniquely mapped to the reference genome (1-mismatch%). In addition, to assess assembly completeness, we calculated the percentage of the genome covered by the contigs. In real samples, bacterial strains are not evenly mixed as in the HM-276D mock community but are non-uniformly distributed, and some are likely to have extremely low abundance. Therefore, we evaluated the impact of genomic DNA abundance on genome assembly. For the unevenly mixed HM-277D mock community samples, we calculated the abundance of each bacterial strain by normalizing its concentration by the size of the corresponding reference genome. The relationship between abundance and the assessment metrics was displayed using scatter plots. For each plot, the strength of the association was measured by Spearman correlation using R v3.3.3.
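The authors' script for the contiguity statistics is not shown. The following Python sketch is one conventional way to compute N50 and NG50 from contig lengths, assuming the standard definitions (contigs are sorted from longest to shortest and accumulated until half of the assembly size, or half of the estimated genome size, is reached). The helper names `fasta_lengths`, `nx0`, `n50` and `ng50` are hypothetical.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in a FASTA file given as an
    iterable of text lines (for example, an open file handle)."""
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)           # start a new record
        elif lengths:
            lengths[-1] += len(line)    # extend the current record
    return lengths


def nx0(contig_lengths, reference_size, fraction=0.5):
    """Length of the contig at which the cumulative length of the
    longest-first sorted contigs first reaches fraction * reference_size.
    Returns 0 if the contigs never reach that threshold."""
    threshold = fraction * reference_size
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


def n50(contig_lengths):
    """N50: the threshold is half of the total assembly size."""
    return nx0(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    """NG50: the threshold is half of the estimated genome size."""
    return nx0(contig_lengths, genome_size)


if __name__ == "__main__":
    toy = [80, 70, 50, 40, 30, 20]      # toy contig lengths, not real data
    print("N50 :", n50(toy))
    print("NG50:", ng50(toy, 400))
```

Because NG50 uses a fixed genome size (70M in this study) rather than the assembly size, it stays comparable across assemblers whose total assembly sizes differ, and it is undefined (returned as 0 here) when an assembly covers less than half of the genome.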
Taxonomic binning analysis
Taxon bins of the 20 mixed bacteria from the two mock communities were recovered using the taxonomic binner MEGAN-LR[25] with 3 long-read sequencing data sets: HM-276D (Nanopore, PacBio) and HM-277D (Nanopore), each at 160× depth of coverage. We first aligned all reads against the NCBI-nr protein reference database using LAST with parameters “-P 100 -F15”. Next, the output MAF files were converted to the more compact DAA format. Then, we meganized the DAA files using MEGAN[26], which allowed us to interactively visualize and explore the taxonomic results. To evaluate taxonomic binning performance, we first counted the number of reads and bases that were correctly assigned to each taxon from the mock microbial community. We then determined four metrics: precision, sensitivity, true positive rate and false discovery rate. Precision and sensitivity assess how accurately reads are classified across different sequencing technologies. Precision is the percentage of correctly assigned reads out of all reads assigned to a particular taxon. Sensitivity is the percentage of reads correctly assigned to their corresponding taxon out of all reads. Next, we used the true positive rate (TPR) and false discovery rate (FDR) to assess the accuracy of taxonomic detection across sequencing technologies. TPR is the percentage of correctly detected taxa out of the known taxa in the microbial community. FDR is the percentage of incorrectly detected taxa out of all detected taxa. All metrics are defined at each taxonomic rank.
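To make the four metrics concrete, the sketch below computes them in Python using their conventional definitions: precision and sensitivity per taxon at the read level, and TPR and FDR at the level of detected taxa. It is a minimal illustration on toy labels, not the evaluation code used in the study; the function names and the use of `None` for unassigned reads are assumptions.

```python
def read_level_metrics(true_taxa, assigned_taxa, taxon):
    """Precision and sensitivity for one taxon at a given rank.
    true_taxa[i] is the known origin of read i; assigned_taxa[i] is the
    binner's call for read i, or None if the read was left unassigned."""
    correct = sum(1 for t, a in zip(true_taxa, assigned_taxa)
                  if t == taxon and a == taxon)
    assigned = sum(1 for a in assigned_taxa if a == taxon)
    actual = sum(1 for t in true_taxa if t == taxon)
    precision = correct / assigned if assigned else 0.0
    sensitivity = correct / actual if actual else 0.0
    return precision, sensitivity


def detection_metrics(known_taxa, detected_taxa):
    """TPR: share of known taxa that were detected.
    FDR: share of detected taxa that are not in the known community."""
    known, detected = set(known_taxa), set(detected_taxa)
    tpr = len(known & detected) / len(known) if known else 0.0
    fdr = len(detected - known) / len(detected) if detected else 0.0
    return tpr, fdr


if __name__ == "__main__":
    # Toy labels for five reads; not real data.
    truth = ["A", "A", "A", "B", "B"]
    calls = ["A", "A", "B", None, "B"]
    print("taxon A:", read_level_metrics(truth, calls, "A"))
    print("taxon B:", read_level_metrics(truth, calls, "B"))
    print("detection:", detection_metrics({"A", "B", "C", "D"}, {"A", "B", "X"}))
```

The same functions can be applied at each taxonomic rank by first mapping every read label to its ancestor at that rank.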
Competing interests
The authors declare no conflict of interest.
Supplementary Tables
Supplementary Figures
Acknowledgements
This work was supported by CHOP Research Institute to K.W.
The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Staggered, High Concentration), v5.2H, for Whole Genome Shotgun Sequencing, HM-277D.
The following reagent was obtained through BEI Resources, NIAID, NIH as part of the Human Microbiome Project: Genomic DNA from Microbial Mock Community B (Even, High Concentration), v5.1H, for Whole Genome Shotgun Sequencing, HM-276D.