Abstract
The miniaturized and portable DNA sequencer MinION™ has been released to the scientific community in the framework of an early access programme to evaluate its scope in a wide variety of genetic approaches. Although this technology is still under development and does not yet deliver error-free, high-quality reads, it has demonstrated great potential, especially in genome-wide analyses. In this study, we tested the ability of the MinION™ to perform amplicon sequencing in order to design new approaches to study microbial diversity using nearly full-length 16S rDNA sequences. Using the R7.3 chemistry, we generated more than 3.8 million events (nt) during a sequencing run. These data were enough to reconstruct more than 90% of the 16S rRNA gene sequences for the 20 different species present in the mock community used as a reference. After read mapping and 16S rRNA gene assembly, we recovered consensus sequences suitable for taxonomic assignment at the species level. Additionally, we could measure the relative abundance of all the species present in the mock community, detecting the expected homogeneous distribution for most of the species. Although nanopore-based sequencing produces reads with lower per-base accuracy than platforms such as Illumina and 454, promising results were obtained with the MinION™, indicating that this technology is useful for microbial diversity analysis. With the imminent improvement of the nanopore chemistry, better results and overall performance of the platform are expected to contribute to the specific detection of microbial species and strains in complex ecosystems.
Introduction
The third generation of DNA sequencers is based on single-molecule analysis technology that is under constant development towards delivering error-free, high-quality reads. Oxford Nanopore Technologies (ONT) released the first miniaturized and portable DNA sequencer to researchers in early 2014 in the framework of the MinION™ Access Programme. The MinION™ is a USB-stick-sized device operated from a computer via USB 3.0. Sequencing progress can be visualized in real time in terms of number of reads and length distribution. Nucleotide basecalling and quality assessment of reads require further processing in which the exchange of Hierarchical Data Format (HDF5) files, containing a large amount of numerical data, is indispensable. This data exchange is done over the Internet through the Metrichor platform, a process that can optionally be launched after the sequencing run itself. According to its theoretical capabilities, the MinION™ opens new alternatives for genomic analyses, of which the completion of fully finished bacterial genomes is one of the most attractive, as recently demonstrated by Quick and co-workers (Quick et al 2014). The moderate throughput of the MinION™ in terms of number of reads, when compared with other massive sequencing platforms, is compensated by its performance in terms of read length, making it possible to obtain reads thousands of nucleotides in length. Short-read sequencing approaches have delivered high-quality but partial genome sequences with unresolved repetitive elements, making it impossible to study genetic variation or molecular evolution directly or indirectly associated with such elements. Therefore, long-read approaches offer new insights into genomic analyses by facilitating the completion of finished genomes through hybrid assembly strategies (Utturkar et al 2014).
In addition to genome sequencing analysis, microbial diversity and taxonomy approaches are also severely limited by short-read strategies. Early massive sequencing approaches producing effective reads of 50 nt (Genome Analyzer, Solexa/Illumina) to 200 nt (454 Roche) only permitted diversity to be accurately explored at the phylum level. However, thanks to improvements in the chemistry of the most common sequencing platforms, in recent years dozens of studies have presented a vast inventory of human- or environment-associated microbial communities, reaching detail at the family or even genus level. To date, paired-end short-read approaches for massive sequencing permit the analysis of roughly 30% (∼500 nt) of the 16S rDNA, leaving taxonomic assignment of reads at the species level elusive. Therefore, the implementation of long-read sequencing approaches to study the 16S rRNA genes should be decisive for designing new studies aimed at revealing the central role of specific bacterial species in a great variety of microbial consortia. Accordingly, we present a preliminary study of 16S rDNA amplicon sequencing from a mock microbial community composed of genomic DNA from 20 different bacterial species (BEI Resources) using the MinION™, to evaluate the scope of nanopore technology in bacterial diversity and taxonomic analysis through sequence analysis of nearly full-length bacterial 16S rRNA genes.
Methods
Bacterial DNA and 16S rDNA Amplicons
Genomic DNA for the reference mock microbial community was kindly donated by BEI Resources (http://www.beiresources.org). This mock community (HM-782D) is composed of a genomic DNA mix from 20 bacterial strains containing equimolar ribosomal RNA operon counts (100,000 copies per organism per μL), as indicated by the manufacturer. Following BEI Resources instructions, 1 μL of mock community DNA was used for amplification of 16S rDNA genes. A 30-cycle PCR (95°C for 20 s, 47°C for 30 s, and 72°C for 60 s) was set up using Phusion High-Fidelity DNA Polymerase (Thermo Scientific) and the primers S-D-Bact-0008-c-S-20 and S-D-Bact-1391-a-A-17, which target a wide range of bacterial 16S rDNA (Klindworth et al 2012, Loy et al 2007). PCR reactions produced ∼1.5 kbp blunt-end fragments, which were purified using the Illustra GFX PCR DNA and Gel Band Purification Kit (GE Healthcare). Amplicon DNA was quantified using a Qubit 3.0 fluorometer (Life Technologies).
Amplicon DNA library preparation
The Genomic DNA Sequencing Kit SQK-MAP-005 was used to prepare the amplicon library to be loaded in the MinION™. Approximately 250 ng of amplicon DNA (0.25 pmol) was end-repaired using the NEBNext End Repair Module (New England Biolabs), followed by purification using Agencourt AMPure XP beads (Beckman Coulter). Subsequently, we used 200 ng of the purified amplicon DNA (∼0.2 pmol) to perform dA-tailing using the NEBNext dA-Tailing Module (New England Biolabs) in a total volume of 30 μL for 15 minutes at 37°C, according to the manufacturer’s instructions. To the 30 μL of dA-tailed amplicon DNA, 50 μL of Blunt/TA Ligase Master Mix (New England Biolabs), 10 μL of Adapter Mix, and 2 μL of HP adapter were added, and the reaction was incubated at 16°C for 15 minutes. The adapter-ligated amplicon was recovered using Dynabeads® His-Tag (Life Technologies) and the washing buffer provided with the Genomic DNA Sequencing Kit SQK-MAP-005 (Oxford Nanopore Technologies). Finally, the sample was eluted from the Dynabeads® by adding 25 μL of elution buffer and incubating for 10 minutes before pelleting in a magnetic rack.
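The mass-to-mole figures quoted above (250 ng ≈ 0.25 pmol, 200 ng ≈ 0.2 pmol) can be checked with a short calculation. The following Python sketch is illustrative only: it assumes an amplicon length of 1,500 bp (from the ∼1.5 kbp product size) and the commonly used approximation of about 660 g/mol per base pair of double-stranded DNA, neither of which is stated as an exact value in the protocol. The function name is hypothetical.

```python
# Approximate average molar mass of one base pair of dsDNA (assumption,
# a widely used rule of thumb; the true value depends on base composition).
AVG_BP_MASS_G_PER_MOL = 660.0


def ng_to_pmol(mass_ng, length_bp, bp_mass=AVG_BP_MASS_G_PER_MOL):
    """Convert a mass of dsDNA (ng) to an amount (pmol) for a given fragment length."""
    grams = mass_ng * 1e-9
    moles = grams / (length_bp * bp_mass)
    return moles * 1e12


if __name__ == "__main__":
    amplicon_bp = 1500  # assumed length of the ~1.5 kbp 16S amplicon
    for mass_ng in (250, 200):
        print(f"{mass_ng} ng of {amplicon_bp} bp dsDNA = "
              f"{ng_to_pmol(mass_ng, amplicon_bp):.2f} pmol")
```

Under these assumptions, the computed amounts agree with the values reported for the end-repair and dA-tailing inputs.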
Flowcell set-up
The brand-new, sealed R7.3 flowcell was stored at 4°C until use. The R7.3 flowcell was fitted to the MinION™, ensuring good thermal contact by means of the plastic screws. The R7.3 flowcell was primed twice with a premix of 71 μL nuclease-free water, 75 μL 2x Running Buffer, and 4 μL Fuel Mix. The flowcell was allowed to equilibrate for at least 10 minutes before each priming round and before the final DNA library loading.
Amplicon DNA Sequencing
The sequencing mix was prepared with 63 μL nuclease-free water, 75 μL 2x Running Buffer, 8 μL DNA library, and 4 μL Fuel Mix. A standard 48-hour sequencing protocol was initiated using MinKNOW™ v0.50.1.15. Basecalling was performed by data transfer through the Metrichor™ agent v2.29.1 and the 2D basecalling workflow v1.16. During the sequencing run, one additional freshly diluted aliquot of DNA library was loaded 12 hours after the initial input.
Data analysis
Quality assessment of read data and conversion to FASTA format were performed with the poretools (Loman and Quinlan 2014) and poRe (Watson et al 2014) packages. Reads were mapped against the reference 16S ribosomal RNA sequences (accessions NC_009085, NZ_ACYT00000000, NC_003909, NC_009614, NC_009617, NC_001263, NC_017316, NC_000913, NC_000915, NC_008530, NC_003210, NC_003112, NC_006085, NC_002516, NC_007493, NC_010079, NC_004461, NC_004116, NC_004350, and NC_003028) using the LAST aligner v.189 with parameters -q1 -b1 -Q0 -a1 -e45, which were chosen to give the best balance between 16S rDNA assembly length and number of variants. LAST outputs were converted to SAM files and processed with samtools (Li et al 2009) to build indexed BAM files, call variants, and obtain consensus sequences from the alignments. Read mapping statistics were calculated from the SAM files with the sam-stats function of the ea-utils package (Aronesty 2011). Comparisons, GC content correlations, and plots were produced in R v3.2.0 (http://cran.r-project.org). Species coverage was calculated as the fold-change (log2) of species-specific read counts against the expected average for the entire community. A coverage bias was assumed when the coverage deviation was lower than -1 or higher than 1. Simpson’s reciprocal index was calculated with the general formula $1/D = 1/\sum_{i} p_i^{2}$, where $p_i$ is the proportion of reads belonging to the $i$th species.
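The two community-level statistics described above can be sketched in a few lines of Python. This is only a sketch: the original analysis was carried out in R, the species names and read counts below are hypothetical, and the function names are our own. The sketch also assumes that the "expected average" is the total number of mapped reads divided by the number of species, as would follow from an equimolar community.

```python
import math


def simpson_reciprocal(counts):
    """Simpson's reciprocal index, 1 / sum(p_i^2), from per-species read counts."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def coverage_deviation(counts_by_species):
    """Log2 fold-change of each species' read count against the community average.

    Assumes an equimolar community, so the expected count per species is the
    mean of all counts. Species with zero reads are reported as -infinity.
    """
    expected = sum(counts_by_species.values()) / len(counts_by_species)
    return {
        species: math.log2(n / expected) if n > 0 else float("-inf")
        for species, n in counts_by_species.items()
    }


# Hypothetical read counts for a four-species toy community (not study data).
toy_counts = {"species_A": 100, "species_B": 100, "species_C": 25, "species_D": 175}

deviation = coverage_deviation(toy_counts)
# Coverage bias is assumed when the deviation is lower than -1 or higher than 1.
biased = [sp for sp, d in deviation.items() if d < -1 or d > 1]

if __name__ == "__main__":
    print("Simpson's reciprocal index:", round(simpson_reciprocal(toy_counts.values()), 3))
    print("Log2 coverage deviation:", {sp: round(d, 2) for sp, d in deviation.items()})
    print("Species flagged as biased:", biased)
```

For a perfectly even community of 20 species the index reaches its maximum of 20, which is the reference value against which the observed diversity is compared.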
Results
The raw data collected in this experiment were obtained from the MinKNOW software v0.50.1.15 (Oxford Nanopore Technologies) as fast5 files after conversion of electric signals into base calls via the Metrichor Agent v2.29 and the 2D Basecalling workflow v1.16. Basecalled data passing quality control and filtering were downloaded, and basic statistics of the experiment’s data were assessed with the poretools (Loman and Quinlan 2014) and poRe (Watson et al 2014) packages. FASTA sequences were filtered by size and then mapped against reference 16S rRNA gene sequences using common, publicly available sequence analysis tools (see Methods). Mapping statistics are shown in Table 1, and the fast5 raw data can be accessed at the ENA (European Nucleotide Archive) under project PRJEB8730 (sample ERS760633). Only one data set was generated, from a single MinION™ sequencing run. After the sequencing process, we obtained 3,404 reads, of which 58.5% were “template” reads (1,991), 23.8% were “complement” reads (812), and 17.7% were “2d” reads (601). Read lengths had a wide distribution, ranging from 12 nt to more than 50,000 nt, with a median of 1,100 nt. We hypothesized that extremely long reads could be products of the ligation of multiple amplicons. However, when we tried to align these long reads against the reference sequences, we could not detect any matches (data not shown). Accordingly, we performed a filtering step that retained 97% of the original dataset (3,297 reads), with read lengths between 100 and 2,000 nt, for downstream analysis. When we performed an initial analysis using separate sets of “template”, “complement”, and “2d” reads, we observed a detrimental effect of the 2d reads on the quality of the assembled sequences, which resulted in a higher number of unnatural variants along several 16S sequences (∼14%). 
This could be explained by the fact that more than 86% of the 2d reads were classified as low quality, which is critical for the aims of this study, since sequence variants after assembly are needed to distinguish very closely related species such as those included in the mock community and belonging to the genera Streptococcus (3 species) and Staphylococcus (2 species). Given that this effect was still present for some species when the full set of “template”, “complement”, and “2d” reads was combined and used for the sequence analyses, the dataset was finally reduced to contain only “template” and “complement” reads (2,696 in total). Using reference sequence information for the mock microbial community analyzed, we reconstructed more than 90% of the 16S rRNA gene sequence for all organisms included in the mock community (Table 1). We observed that even at very low coverage, such as that retrieved for the Bacteroides vulgatus 16S rRNA gene (Figure 1), it is possible to reconstruct almost 94% of the entire molecule; therefore, MinION™ sequencing shows no size limitations beyond those associated with the PCR process itself. Indeed, the maximum size of amplicons sequenced in all cases was that expected according to the PCR design (Table 1). In terms of coverage, we retrieved a notably lower number of 16S reads than expected from B. vulgatus (Figure 1). We could not determine whether this lower coverage arose during PCR amplification, even though high-coverage primers were used (Klindworth et al 2012, Loy et al 2007), or whether the coverage bias is the result of the sequencing process itself. Despite this, the B. vulgatus 16S rDNA amplicon sequence was almost fully assembled, with a low number of variants after DNA read alignment and pileup (Table 1). The data produced by the MinION™ were further assessed by calculating the level of diversity observed using Simpson’s reciprocal index. 
This diversity index was estimated to be 17.785, a value very close to the maximum expected (20) for the mock community analyzed. Read mapping statistics were analyzed in order to further measure the performance of MinION™ sequencing in microbial diversity analysis based on 16S rDNA sequences. The GC content of reads produced by the MinION™ showed a significant correlation (Pearson’s r = 0.47, p-value ≤ 0.0376) with the GC content of the reference sequences (Figure 2A), which indicates that 16S rDNA GC content is fairly well replicated during sequencing. However, we found a GC content bias in the reads obtained from the MinION™, whose GC content exceeded that of the reference in all cases (Figure 2A). To test the probable influence of this bias, and of GC content itself, on basecalling accuracy, we performed linear comparisons against mismatch rate, indel rate, and coverage deviation. We observed that coverage deviation (p-value ≤ 0.00003) and mismatch rate (p-value ≤ 0.00004) are influenced by read GC content (Figure 2B and 2C, respectively). In the first case, the influence of GC content on coverage deviation could have a minimal effect, because 95% of the species analyzed show no more than 1-fold deviation. However, with the GC bias detected in reads from the MinION™ sequencer, this effect could be magnified, especially in species where GC content is high. On the other hand, we found a strong correlation between the GC content of reads and the mismatch rate retrieved from the alignments, which again points to GC content as a factor that influences 16S amplicon sequencing on the MinION™ platform. Conversely, GC content did not appear to strongly affect the indel rate (Figure 2D). The complete assembly of the amplified 16S rRNA gene permitted the quantification of sequence variants in the consensus sequence. 
These variants were recovered after a pileup of reads against the reference sequences, and they varied in number, with a median of 8 variants per 16S rRNA gene (Table 1). This number of nucleotide substitutions means that approximately 0.5% of the 16S rDNA sequence assembled from MinION™ reads retained unnatural genetic variants generated directly by the sequencing process itself, which in theory still allows a bona fide identification and taxonomic assignment of 16S rDNA sequences at the species level. In the worst cases, where the number of variants was considerably higher (∼2.3% of the full assembly), such as those observed for Acinetobacter baumannii and Bacillus cereus (Table 1), direct BLAST comparisons of the assembled 16S rDNA sequences against the non-redundant reference database only matched homologous sequences belonging to the same respective species. As expected, a homogeneous distribution of strand mapping was observed, with roughly 50% of reads mapped against the forward strand and 50% of reads mapped against the complement strand, on average (Table 1).
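The size-filtering step applied before mapping (retaining reads between 100 and 2,000 nt) can be illustrated with a minimal Python sketch. It is not the procedure actually used in this study, for which the exact tool is not specified; the function names and the toy reads are hypothetical, and the length bounds are assumed to be inclusive.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=100, max_len=2000):
    """Keep records whose sequence length lies within [min_len, max_len] (assumed inclusive)."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy FASTA content (not study data): reads of 12 nt, 160 nt, and 2,400 nt.
toy_fasta = [
    ">read_short", "ACGTACGTACGT",
    ">read_ok", "ACGT" * 20, "ACGT" * 20,
    ">read_long", "ACGT" * 600,
]

kept = filter_by_length(read_fasta(toy_fasta))

if __name__ == "__main__":
    for header, seq in kept:
        print(header, len(seq))
```

In practice the same functions could be applied to an open FASTA file handle instead of the in-memory list used here.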
Figure 1. Species abundance in the mock community. Species coverage was calculated as the fold-change (log2) of species-specific read counts against the expected average for the entire community. A coverage bias was assumed when the coverage deviation was lower than -1 or higher than 1.
Figure 2. Per-base accuracy of the mapped reads. A - Scatter plot of the GC content observed in mapped reads against the GC content of the reference sequences. The dashed line indicates a correlation with Pearson’s r = 1. B - Correlation between the GC content observed in mapped reads and the coverage bias shown in Figure 1. C - Influence of the GC content observed in mapped reads on the mismatch rates calculated after mapping. D - Scatter plot of the observed GC content of mapped reads against the indel rates calculated after mapping. In all cases, the Pearson’s r coefficients and p-values supporting these correlations are shown inside the scatter plots, and solid lines indicate the trend of the correlations.
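The quantities plotted in these panels, GC content and Pearson’s correlation coefficient, can be computed as in the following Python sketch. The correlations reported in this study were calculated in R; the code below is only an illustration with hypothetical function names, it ignores ambiguous bases when computing GC content (an assumption), and it does not compute p-values.

```python
import math


def gc_content(seq):
    """GC content (%) of a nucleotide sequence, counting only A, C, G, and T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally sized samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal size with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy sequences (not study data) standing in for reference and read sequences.
    reference_gc = [gc_content(s) for s in ("AATTGC", "GGCCAATT", "GGGCCCAT")]
    observed_gc = [gc_content(s) for s in ("AATGGC", "GGCCGATT", "GGGCCCGT")]
    print("Reference GC (%):", [round(v, 1) for v in reference_gc])
    print("Observed GC (%):", [round(v, 1) for v in observed_gc])
    print("Pearson's r:", round(pearson_r(reference_gc, observed_gc), 3))
```

Applied to per-species values, the first list would hold the GC content of each reference 16S sequence and the second the mean GC content of the reads mapped to it.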
Discussion
Microbial diversity analyses based on 16S rDNA sequencing are frequently used in biomedical research to determine the dysbiosis associated with a great variety of gut-related human diseases. Identification of the microbial species inhabiting different organs and cavities of the human body relies on handling and processing millions of DNA sequences obtained through second-generation massively parallel sequencing methods, which still present limitations, mainly in terms of DNA read length. The inability to fully determine 16S rDNA sequences during massive sequencing has led to the development of multiple algorithms dedicated to theoretically discerning the microbial species present in samples according to their degree of sequence similarity, grouping them into Operational Taxonomic Units (OTUs). Despite the high accuracy and constant updating of the methods used in OTU-based approaches, the available algorithms do not produce consensus outputs, leaving a high degree of uncertainty when the number of theoretical species and their abundance are the subject of study (He et al 2015, Koskinen et al 2014, Schmidt et al 2014a, Schmidt et al 2014b).
The third generation of sequencing methods, based on single-molecule technology, offers a new way to study microbial diversity and taxonomic composition by overcoming read length limitations at the expense of lower throughput. The MinION™ is one of these single-molecule methodologies and has demonstrated its capacity in genome sequencing (Ashton et al 2015, Quick et al 2014). Very recent reports have shown the application of this technology in medical microbiology, using amplicon sequencing to potentially identify bacterial and viral infections (Kilianski et al 2015, Quick et al 2015). In this study, we have further explored the scope of the MinION™ in microbial diversity studies by using amplicon sequencing of nearly full-length 16S rDNA from a mock bacterial community, obtaining a Simpson’s reciprocal diversity index close to that expected. Our results indicate that the MinION™ per-base accuracy (65-70%) is in agreement with previous results (Kilianski et al 2015, Mikheyev and Tin 2014, Quick et al 2014). We found that sequence coverage was close to that expected in most cases, with only one exception, B. vulgatus (gene GC = 52%), which presented 1.84-fold less than the expected coverage. Although we could not demonstrate the definitive involvement of the sequencing process itself in this coverage bias, this effect could be associated with the PCR process, despite the use of “universal” primers with high coverage among bacterial species during amplicon synthesis (Klindworth et al 2012). In any case, this coverage was enough to reconstruct 93% of the 16S rRNA gene with a low proportion of unnatural variants. We observed a general influence of GC content on the mismatch rate but not on the indel rate, suggesting that base miscalling could be associated with the amplicon GC content. 
Moreover, a slight correlation between the observed amplicon GC content and coverage bias was found, indicating that GC content could be negatively affecting amplicon coverage to some extent. Although the MinION™ replicated fairly well the GC content expected for every amplicon sequenced, we observed a slight overrepresentation of GC in all reads obtained. This over-calling of GC bases in 16S rDNA amplicons could further aggravate the issues stated above.
The R7.3 chemistry used in the MinION™ allowed the acquisition of reads of moderate quality, which were enough to reconstruct more than 90% of the 16S rDNA molecule in all 20 bacterial species analyzed. None of the 20 16S rDNA consensus sequences analyzed showed more than 3% sequence variation, which can be considered the threshold for canonical species identification. Therefore, the assembled consensus sequences were useful for obtaining a reliable taxonomic identification at the species level. As expected, unnatural variants were associated with low-coverage regions; therefore, increasing the sequencing coverage will drastically reduce the ambiguities in the assembled sequences. Nevertheless, further analysis could help to understand whether some coverage bias might be associated with certain taxonomic groups.
We have obtained promising results regarding the study of microbial communities using 16S rDNA amplicon sequencing with the MinION™ device. Despite the modest per-base accuracy observed for this sequencing platform, we were able to reconstruct nearly full-length 16S rDNA sequences for the 20 different species analyzed from a mock bacterial community, even with modest coverage for some species. To date, the MinION™ and nanopore technology have demonstrated great potential in DNA sequencing, allowing bacterial whole-genome sequences to be retrieved with a minimum level of variation (Quick et al 2014). With the results presented here, we propose that the MinION™ platform is a reliable methodology to study the diversity of microbial communities, permitting: i) taxonomic identification at the species level through 16S rDNA sequence comparisons; and ii) quantitative determination of relative species abundance. This type of analysis will likely become more accurate over time as the nanopore chemistry is improved in future releases, together with the implementation of the “What’s In My Pot” (WIMP) Metrichor workflow, which aims at real-time taxonomic identification of sequences by comparison against different bacterial reference databases (e.g. NCBI, SILVA, GreenGenes). Accordingly, sequence studies of the entire 16S rDNA molecule could permit OTU-based analysis to be bypassed, making it feasible to obtain a direct inventory of bacterial species and their relative abundance, as well as to determine the key players at the species and/or strain level in different microbial communities of interest. The implications of the primary and secondary structure of 16S rDNA amplicons for MinION™ sequencing performance must be further explored in order to evaluate, minimize, and correct for technical bias in quantitative approaches to microbial diversity studies.
Conflict of Interest
ABP is part of the MinION™ Access Programme supported by ONT. Sequencing kits used in this research were kindly donated by ONT.
Acknowledgements
The authors thank the European 7th Framework Programme for funding ABP and KP, who were supported by EC Project no. 613979 (MyNewGut).