Abstract
High-throughput amplicon sequencing of large genomic regions remains challenging for short-read technologies. Here, we report a high-throughput amplicon sequencing approach combining unique molecular identifiers (UMIs) with Oxford Nanopore Technologies or Pacific Biosciences CCS sequencing, yielding high accuracy single-molecule consensus sequences of large genomic regions. Our approach generates amplicon and genomic sequences of >10,000 bp in length with a mean error-rate of 0.0049-0.0006% and chimera rate <0.022%.
Main
High-throughput amplicon sequencing is a ubiquitous method for studying genetic populations with low-abundance variants or high heterogeneity, including cancer driver genes1–3, virus populations4–6 and microbial communities7. Short-read Illumina sequencing has dominated amplicon-related research due to its unprecedented throughput and low native error-rate of ∼0.1%, but its maximum amplicon size of ∼500 bp8 limits important long-range information and assay resolution9. Unique molecular identifiers (UMIs) have been applied to enable sequencing of longer amplicons with short-reads via assembly of synthetic long reads10. Each template molecule in a sample is tagged with a UMI sequence consisting of 10-20 random bases, which can subsequently be used to sort and analyse reads based on their original template molecule. UMIs can enable sequencing of synthetic long-reads up to ∼11,000 bp, but this approach cannot resolve amplicons with repeats longer than the short-read length11, which limits its application. The high native error rates of Oxford Nanopore Technologies (ONT) (5-25%12) and Pacific Biosciences (PacBio) (13%13) have, until now, made it difficult to confidently identify true UMI tag sequences necessary to accurately assign raw reads to their template molecules. Furthermore, the combination of UMIs with long-read sequencing is relatively unexplored, and only recently has it been applied with PacBio CCS14–16, but without using dual UMIs for chimera filtering17 and profiling the error of the generated consensus sequences.
Here, we present a simple workflow that combines UMIs with sequencing of long amplicons on the ONT and PacBio platforms to produce highly accurate single-molecule consensus sequences with a low chimera rate. To improve recognition of UMI-tagged error-prone reads, we designed UMIs to contain recognizable internal patterns (Figure 1C and Table S1) that avoid error-prone homopolymer stretches18, which combined with filtering based on UMI length and pattern allows for robust determination of true UMI sequences in raw error-prone ONT and PacBio data.
The DNA template is initially diluted to the target number of output sequences, which is estimated based on the desired single-molecule coverage and expected sequencing yield. The genetic region of interest is then targeted using 2 cycles of PCR with a customized set of tailed primers (Table S1), which includes a target-specific primer, a UMI sequence and a synthetic priming site used for downstream amplification (Figure 1A, step 1). For PCR-free approaches, such as whole genome or metagenomic DNA sequencing, the adapters containing UMIs can be ligated to the template DNA molecules. The product from the initial UMI-tagging step is a double-stranded DNA amplicon copy of the genetic target, containing the UMIs and synthetic primer sites on both ends. The UMI-tagged molecule is subsequently amplified by PCR targeting the synthetic primer sites (Figure 1A, step 2), and prepared for long-read sequencing with ONT or PacBio CCS (Figure 1A, step 3). After sequencing, reads are binned based on both terminal UMIs (Figure 1B, steps 1 and 2). UMI sequences that have a high probability of being correct are detected based on the presence of a designated pattern, as well as an expected UMI length of 18 bp. Chimeric sequences are de novo filtered by UMI-pairs in which either terminal UMI is observed in a more abundant UMI-pair17 (Figure 1B, step 2). The filtered, high-quality UMI-pair sequences are used as a reference for binning the raw dataset according to UMIs (Figure 1B, step 3). The consensus sequence for each UMI bin is then generated by multiple rounds of polishing using the binned raw reads (Figure 1B, step 4).
To assess the effectiveness of our UMI-tagging approach for error-correcting long reads, we sequenced full-length ribosomal RNA (rRNA) operons (∼4400 bp) in a mock microbial community from ZymoBIOMICS containing eight bacterial species (Table S13). To compare sequencing approaches, we calculated the necessary read coverage to obtain a mean error rate < 0.01%/Q40 (Table S2), and data below that read coverage cut-off was removed from the analysis. Afterwards, chimeric sequences and sequences from reagent contamination were identified (Figure S8), manually curated, and removed from the dataset to enable calculations of true error-rates. On a single ONT MinION R10 flowcell, a total of 23,365 amplicon UMI consensus sequences (ONT UMI) were generated with read coverages ≥ 25x (Q40 cutoff), an average length of 4,381 bp, a mean residual error rate of 0.0049%, and a chimera rate of 0.017% (Figure 2a, b). Sequencing the same library of UMI-tagged long amplicons with a PacBio Sequel II 8M flowcell in CCS mode resulted in 39,678 UMI consensus sequences (PB UMI) with read coverages ≥ 3x (Q40 cutoff), an average length of 4,376bp, a mean residual error rate of 0.0006%, and a chimera rate of 0.022% (Figure 2a, b). For comparison, raw PacBio CCS reads without UMI clustering generated 135,823 CCS reads (PB CCS) with read coverages ≥ 40x (Q40 cutoff), which had a mean error rate of 0.0080% and a chimera rate of 1.9%. The 1.9% chimeras observed in the raw PB CCS data are most likely introduced during the PCR amplification of the UMI library, and are therefore present in the amplicon library before sequencing. The exact same amplicon library was used to produce the ONT UMI and PB UMI data. Thus, using a rigorous UMI-based filtering approach almost eliminates PCR chimeras, which otherwise can make up over 20% of the amplicons depending on the PCR conditions19. The ligation-based ONT UMI approach was tested on genomic DNA fragments of up to 10,000 bp from Escherichia coli, and the results were consistent with the rRNA operon results (Figure S3), demonstrating the flexibility of this method for improving the accuracy of long-read sequencing.
The residual errors were markedly different in non-homopolymer regions compared to homopolymer regions for both the ONT UMI and PB CCS data, while the PB UMI data was extremely low in both cases (Figure 2a, c). The error rate for all error types (deletions, insertions, and mismatches), except homopolymer deletions, stabilized for both ONT UMI and PB CCS above a coverage of 20x (Figure 2a, Table S2). The high deletion error rate was primarily due to deletions in long (>4 bp) C and G homopolymers for both data types (Figure S4-5, Table S3), and reaffirmed that homopolymer-derived errors are a remaining obstacle for lower error rates. For the ONT UMI data, G-insertions in non-homopolymer regions made up the majority of remaining errors. Both the non-homopolymer insertions and the homopolymer deletions were to some degree systematic, with some errors in specific positions being present in >50% of the sequences (Figure 2c and Figure S4). For PB CCS data, the homopolymer deletion errors were not as systematic, but still a major error contributor (Figure 2b, Figure S5). Random mismatch error is the other major source of error for PB CCS data, which probably originate from PCR errors. For the PB UMI data, there are very few errors left (1109 errors in 39678 sequences of ∼4400 bp), and thus not enough data to elucidate potential error trends (Figure S6, Table S3).
Characterizing whether the residual error is random or systematic is important for the ability to accurately call variants from single molecule consensus sequences. We naively generated variants from the three data types (data >Q40 threshold but without removing chimeras) by clustering consensus sequences, phasing single nucleotide variants (SNVs) within clusters and calling variants if present at ≥3x coverage. The 43 references in the Zymo Mock were observed for all datatypes, (Figure S7), with no errors except for 2 variants from the ONT UMI data, which each had 1 error in homopolymer regions (Figure S4). For ONT UMI, PB CCS and PB UMI data, 1.00%/6.99%/0.18% of the consensus data were respectively assigned to variants with systematic errors, and 0%/0.46%/0% were assigned to chimeric variants (Table S4).
CCS-like strategies have been attempted as an alternative to UMIs to reduce the error-rate of amplicon sequencing on the ONT platform, but these methods suffer from insufficient molecule coverage to effectively reduce mean error rates below 2%20 (Table S10). In principle, lower error rates can also be achieved with denoising strategies20, 21, but at the cost of potentially missing true low-abundance variants22, which are critical for some applications (e.g. pathogen detection and drug resistance). Furthermore, state-of-the-art clustering algorithms depend on the abundance of unique sequences to model errors23, 24, which is not suitable for datasets where population micro-heterogeneity is high and evenness low.
Amplification and sequencing of rRNA genes has become an important method for studying the diversity and taxonomic composition of human- and environment-associated microbial communities. Here, we applied the PB UMI method to generate 253,089 high-quality, full-length bacterial rRNA operon sequences from 70 human fecal samples collected by the American Gut Project25. We assessed strain-level taxonomic resolution by annotating the full length 5S, 16S and 23S within the operons, and searched for these genes against gene-specific databases from the “Web of Life” 26. Using only the full-length 16S rRNA gene, 11.3% of the sequences could be matched at the strain level to the database, and 38.4% assigned at the species level. By using both the 16S and 23S rRNA genes within an operon, we could assign 22% at the strain level and 72.2% at the species level, representing a significant increase in assignment over using the full length 16S rRNA alone (Chi-squared statistic=124,086, p-value<2.23e-308; Figure 3). These results are inline with a recent study of the taxonomic resolution of the rRNA operon27. This UMI approach should also enable direct quantification of molecules in a sample28, which would be ideal for precise relative abundance estimates. However, the mock community used here, and probably many others, contains a biased fragment size and growth-dependent coverage, preventing proper quantification (Table S5 and Figure S9-13).
The choice of library strategy and sequencing platform for high-accuracy amplicon sequencing depends on the application. An overview of time, cost and yield comparisons was compiled for the three different approaches (Table S6-8), and the current projected price per sequence for >Q40 data from rRNA operons is: 0.015 USD (ONT UMI), 0.012 USD (PB CCS), 0.007 USD (PB UMI), which should be similar for other genetic targets. For rapid testing and iterative development, the ONT UMI approach is attractive due to its low cost and portability. PB CCS sequencing also performs well for high-accuracy amplicon sequencing, but the presence of low abundant chimeric variants is problematic, especially if they propagate into reference databases29. For sensitive applications, such as detecting low-abundance variants or generating reference sequences for key databases, the PB UMI approach appears to be the best suited at this time.
Contributions
SMK and RMZ conceived the method and developed the bioinformatics pipeline. SMK performed wet lab method development and experiments. EASO performed Nanopore UMI sequencing of E. coli. RHK assembled reference genomes. SMK, RMZ and MA performed data analysis on method performance. DM, QZ, RK analysed American Gut Project samples. SMK, RMZ, MA wrote the first draft of the manuscript. All authors contributed to the content and revision of the manuscript.
Competing Interests
MA, SMK, and RHK are co-owners of DNASense ApS. The other authors declare no competing financial interests.
Methods
rRNA operon UMI sequencing of mock microbial community
Source of DNA
The ZymoBIOMICS Microbial Community DNA Standard (D6306, lot no. ZRC190811) was obtained from Zymo Research (Irvine, California). The mock community DNA contained genomic material from 10 species (8 bacteria and 2 yeasts): Bacillus subtilis, Cryptococcus neoformans, Enterococcus faecalis, Escherichia coli, Lactobacillus fermentum, Listeria monocytogenes, Pseudomonas aeruginosa, Saccharomyces cerevisiae, Salmonella enterica, Staphylococcus aureus. Note, 2 of the yeast species were not targeted by PCR amplification of rRNA operons based on the primers used (see DNA Sequence Library Preparation). The concentration of DNA in the mock community sample was measured on a Qubit 3.0 fluorometer and Qubit dsDNA HS assay kit (Thermo Fisher Scientific), and the quality of the DNA was measured by gel electrophoresis on an Agilent 2200 Tapestation using Genomic Screentapes (Agilent Technologies).
DNA Sequence Library Preparation
Online protocol
An interactive step-by-step protocol is available at protocols.io: (https://www.protocols.io/private/F5C5FE21305911EAAC0B0242AC110003).
Tagging target gene with UMIs
PCR was used to target the bacterial 16S-23S rRNA operon and simultaneously tag each template molecule with terminal unique molecular identifiers (UMIs).
The following tailed primers lu_16S_8F_v7 and lu_23S_2490R_v7 were used for the PCR (see Supplementary Table 1). The first section of both tailed primers is a synthetic priming site used for downstream amplification. The second section is the ‘patterned’ UMI consisting of a total of 12 random nucleotides (N) and 6 degenerate nucleotides (Y or R) which results in a total of 1.2×1018 possible UMI combinations when UMIs from both terminal ends of a template molecule are concatenated (412*2 × 26*2 = 1.2×1018, see Figure 1). The last section of the tailed primers consists of the rRNA operon-specific primer for 27f1 or 2490r2, respectively.
The PCR reaction contained 5 ng of ZymoBIOMICS Microbial Community DNA Standard, 1 U Platinum SuperFi DNA Polymerase High Fidelity (Thermo Fisher Scientific, USA), and a final concentration of 1X SuperFi buffer, 0.2 mM of each dNTP, 500 nM of each lu_16S_8F_v7/ lu_23S_2490R_v7 primers in 50 µL. The PCR program consisted of initial denaturation (3 minutes at 95°C) and 2 cycles of denaturation (30 seconds at 95°C), annealing (30 seconds at 55°C) and extension (6 minutes at 72°C). The PCR product was purified using a custom bead purification protocol “SPRI size selection protocol for >1.5-2 kb DNA fragments” (Oxford Nanopore, England) based on: dx.doi.org/10.17504/protocols.io.idmca46. CleanPCR (CleanNA, Netherlands) bead solution was used for preparing the custom buffer. The purification was performed according to the custom protocol with the exception of an EtOH concentration of 80% and 0.9x bead solution/sample ratio.
Amplification of UMI-tagged amplicons
A second PCR was used to amplify the UMI-tagged template molecules. All of the UMI-tagged template molecules were added to the reaction containing 2 U Platinum SuperFi DNA Polymerase High Fidelity (Thermo Fisher Scientific, USA), and a final concentration of 1X SuperFi buffer, 0.2 mM of each dNTP, 500 nM of each lu_pcr_fw_v7 and lu_pcr_rv_v7 primers (see Table S1) in 100 µL. The PCR program consisted of initial denaturation (3 minutes at 95°C) and then 25 cycles of denaturation (15 seconds at 95°C), annealing (30 seconds at 60°C) and extension (6 minutes at 72°C) followed by final extension (5 minutes at 72°C). The PCR product was purified using a custom bead purification protocol “SPRI size selection protocol for >1.5-2 kb DNA fragments” (Oxford Nanopore, England) based on: dx.doi.org/10.17504/protocols.io.idmca. CleanPCR (CleanNA, Netherlands) bead solution was used for preparing the custom buffer. The purification was performed according to the custom protocol with the exception of an EtOH concentration of 80% and 0.9x bead solution/sample ratio. The concentration and quality of the PCR amplicons was measured as described before.
To obtain sufficient PCR product for sequencing, a third PCR was performed using the amplicons generated from the second PCR and using the same procedure as before, but with 3 × 100 µl reactions and 6 cycles of amplification. A large reaction volume was used to minimize the risk of overamplification. The final amount of amplicon DNA generated was 3.5 µg in 55 µL.
Oxford Nanopore sequencing of mock rRNA operon amplicons
1200 ng of the purified amplicon DNA from the third PCR was used as template for library preparation using the protocol “1D amplicon/cDNA by ligation (version ACDE_9064_v109_revA_23May2018, SQK-LSK109)” (Oxford Nanopore, England). A MinION R10 flowcell (FLO-MIN110) was used for sequencing on a MinION and MinION software v19.10.1 (Oxford Nanopore, England). Basecalling was performed with Guppy v3.4.4 (Oxford Nanopore, England) in GPU mode with following modifications to the standard settings --trim_strategy ‘none’ --device cuda:0 --chunk_size 1500 --chunks_per_runner 1024 --config dna_r10_450bps_hac.cfg model.
PacBio Sequel II sequencing of mock rRNA operon amplicons
2000 ng of the purified amplicon DNA from the third PCR was sent for PacBio library preparation and sequencing at the DNA Sequencing Center at Brigham Young University (https://dnasc.byu.edu/). The amplicons were incubated with T4 polynucleotide kinase (New England Biolabs, USA), following the manufacturer’s instructions. The sequencing library was prepared using SMRTbell Express Template Preparation Kit 1.0 following the standard protocol. Sequencing was performed on a Sequel II using Sequel II Sequencing Kit 1.0, Sequel II Binding and Int Ctrl Kit 1.0 and Sequel II SMRT Cell 8M following standard protocol with 1 hour pre-extension and 30 hour collection time (Pacific Biosciences, USA). Circular consensus (CCS) reads were generated using CCS version 3.4.1 (https://github.com/PacificBiosciences/ccs) using standard settings.
Data generation workflow
Online scripts
Source code and analysis scripts are freely available at https://github.com/SorenKarst/longread-UMI-pipeline
Trimming and filtering of raw data (Nanopore data only)
Raw fastq sequence data was trimmed of sequencing adapters using porechop with the commands: --min_split_read_size 3500 --adaptor_threshold 80 --min_trim_size 20 --extra_end_trim 0 --extra_middle_trim_good_side 0 --extra_middle_trim_bad_side 0--middle_threshold 80 --check_reads 5000 (v0.2.4 https://github.com/rrwick/Porechop). Additionally, the adaptors.py file in porechop was modified to include possible end-to-end ligation combinations of the customized primers. The customized settings and modifications to the adaptors.py file were necessary to correctly split amplicons concatenated in the ligation step of the library preparation, which can make up a substantial amount of the data.
The adaptor trimmed data was filtered using filtlong with the settings --min_length 3500--min_mean_q 70 (v0.2.0 https://github.com/rrwick/Filtlong) and cutadapt3 (v2.7) with -m 3500–M 6000. The output from these pre-processing steps was trimmed and filtered raw read data.
Extraction of UMI reference sequences
To efficiently bin reads according to the UMIs on their terminal ends, it was critical to extract and validate true UMI sequences that could be used as references for subsequent mapping steps. UMI sequences of the correct length (18 bp) were extracted from the reads by locating the flanking sequences within the custom primers. To do so, the first 200 bp from each terminal end of all reads were extracted, and saved into individual files. UMI sequences were extracted from each terminal end file with cutadapt -e 0.2 -O 11 -m 18 -M 18 --discard-untrimmed
-g CAAGCAGAAGACGGCATACGAGAT…AGRGTTYGATYMTGGCTCAG
-g AATGATACGGCGACCACCGAGATC…CGACATCGAGGTGCCAAAC
-G GTTTGGCACCTCGATGTCG…GATCTCGGTGGTCGCCGTATCATT
-G CTGAGCCAKRATCRAACYCT…ATCTCGTATGCCGTCTTCTGCTTG in paired-end input mode. This step insured that only reads with UMIs of the correct length (18 bp) in both ends were extracted. UMI pairs were then concatenated and filtered with grep -B1 -E “NNNYRNNNYRNNNYRNNNNNNYRNNNYRNNNYRNNN” to remove UMI pairs that did not follow the expected pattern. Filtered UMI pairs were clustered using usearch4 (v11.0.667) with the commands usearch -fastx_uniques -minuniquesize 1 -strand both and usearch -cluster_fast-id 0.90 -centroids -sizein -sizeout -strand both -minsize 1. To estimate the coverage of the clustered UMI pairs, UMI sequences were re-extracted using only a single primer with cutadapt -e 0.2 -O 11 -m 18 -L 18 --discard-untrimmed
-g CAAGCAGAAGACGGCATACGAGAT -g AATGATACGGCGACCACCGAGATC
-G GATCTCGGTGGTCGCCGTATCATT -G ATCTCGTATGCCGTCTTCTGCTTG in paired-end input mode. The re-extracted UMI pairs were concatenated and mapped to the clustered UMI pairs using bwa5(v0.7.17-r1198-dirty) with the commands: bwa index, bwa aln -n 6 –N, and bwa samse –n 10000000. The mapping results were then filtered using samtools6(v1.9) view -F 20. Using the mapping results the clustered UMI pairs were filtered using gawk to remove UMI pairs with a coverage < 3x. Potential chimeras were removed by filtering clustered UMI pairs containing sub UMI that was observed in another UMI pair with a higher abundance. The output from these steps was a list of trusted UMI pairs that could be used as references for binning reads.
Binning reads according to UMIs
The first 90 bp of each terminal end of the trimmed and filtered reads were extracted with gawk and saved into individual files. The UMI pair reference sequences were split into their corresponding sub UMIs and mapped to the read terminals using bwa with the commands: bwa index, bwa aln -n 3 –N, and bwa samse –n 10000000. The mapping results were then filtered using samtools view -F 20. Mapping results from each end of the reads were merged, and a read was binned to a specific UMI pair reference if the following conditions were met: A) Sub UMIs from the same UMI pair were the best hits for both terminal UMIs in the read. B) The mapping difference between the query read and each sub UMI was ≤ 3 bp. C) The mean mapping difference between all of the query reads and the sub UMI was ≤ 3.5 (Nanopore only) or ≤ 3 (PacBio only). D) The ratio between the number of UMI binned reads to the size of the UMI reference cluster was ≤ 10 (Nanopore only). E) The read strand ratio (+/-) was in the interval 10-0.6 to 100.6, which is equivalent to the read strand fraction containing the fewest reads comprising at least 25% of the total data amount (Nanopore only). The output from this step was trimmed and filtered reads divided into UMI bins.
Generation of UMI consensus sequences
For each individual UMI bin, a consensus sequence was initially generated using usearch -cluster_fast -id 0.75 -strand both -centroids, and picking the most abundant centroid. The centroid sequence was used as template for 2 rounds of polishing using all the UMI bin reads with minimap27 (v2.17-r954-dirty) -x map-ont and racon8(v1.4.3) -m 8 -x -6 -g -8 -w 500.
Polishing of UMI consensus sequences (Nanopore data only)
The racon-polished Nanopore consensus sequences were further polished individually by using all of the reads in each UMI bin and two rounds of Medaka (v0.11.2) (https://github.com/nanoporetech/medaka) with the commands medaka mini_align -m and medaka consensus --model r10_min_high_g340 --chunk_len 6000.
Trimming of UMI consensus sequences
The consensus sequences from all UMI bins were then pooled and trimmed and filtered using cutadapt -m 3000 -M 6000 –g AGRGTTYGATYMTGGCTCAG…GTTTGGCACCTCGATGTCG. Consensus sequences not containing both primers were discarded.
Phasing of consensus sequences and variant calling
Consensus sequences were phased and used to call variants using a custom workflow. The consensus sequences were first filtered to remove any consensus sequences with a read coverage less than the minimum read coverage to obtain >Q40 data quality (25x for ONT UMI, 40x for PB CCS and 3x for PB UMI). The homopolymers were masked in the consensus sequences by converting homopolymers of length ≥3 into length 2 to minimize the effect of homopolymer errors on the phasing accuracy. The masked consensus sequences were dereplicated using usearch -fastx_uniques -strand both -sizeout -uc and clustered using two rounds of usearch -cluster_fast -id 0.995 -strand both -centroids -uc -sort length -sizeout -sizein, and removing clusters of size < 3. The reads belonging to each cluster were mapped back to the centroid sequence of the cluster using minimap2 –ax asm5. Genotype likelihoods were estimated from the mappings with bcftools 9(v1.9) mpileup –Ov –d 1000000 –L 1000000 –a “FORMAT/AD,FORMATDP”, and the results were filtered to show positions of SNPs present in ≥2x coverage using bcftools view -i ‘AD[0:1-]>2’ for each cluster. The list of SNP positions were used to phase the reads within a cluster, and a variant was called if ≥3 reads supported a combination of SNPs. Consensus reads were then grouped according to called variants, and consensus sequences were re-generated for each variant group. First, the homopolymers were unmasked in the consensus reads and a crude variant-consensus was generated using usearch -cluster_fast -id 0.99 -strand both -consout –sizeout. The crude variant-consensus was polished with a workflow using minimap2 –ax map-ont, bcftools mpileup –Ov –d 1000000 –L 1000000 –a “FORMAT/AD,FORMAT/DP”, bcftools norm –Ov, bcftools view -i ‘AD[0:1]/FORMAT/DP>0.5’ –Oz and bcftools consensus.
Pipeline parallelization
Many steps in the pipeline have been parallelized using GNU parallel10.
Data analysis
Error profiling
Detection of error was based on a mapping of the sequence data (raw reads, consensus sequences, variant consensus sequences) to curated and non-curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard (see see Generation of Reference Sequences for Mock Community). Mapping was performed with minimap2 -ax map-ont --cs and filtered using samtools view -F 2308. The references and mappings were imported into the R software environment11 (v3.6.0) in RStudio 12, where errors in the sequences were profiled using the tidyverse13 (v1.2.1) and Biostrings14 (v2.52.0) R-packages and custom scripts (see Resource availability). In brief, errors and their type (mismatch, deletion, insert) were detected from the SAM --cs tags. The relative positions of the errors was determined with respect to the reference sequence, and this was used to categorize the errors as being within homopolymers regions (hp+) or not (hp-). The error information was combined with metadata from the UMI binning (UMI bin sizes, UMI cluster sizes) and quality analysis (consensus length, UMI bin contamination estimates, ZymoBIOMICS reference-based taxonomy, SILVA taxonomy, chimera detection - see below for details) used to explore and visualize error as a function of such parameters.
Taxonomic classification of consensus sequences with mock references
Taxonomic classification of UMI/CCS consensus reads was performed by mapping the reads to curated curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard with minimap2 -ax map-ont --cs and filtered using samtools view -F 2308. Read classification was based on best hit.
Taxonomic classification of consensus sequences with SILVA database
16S rRNA sequences were extracted from the rRNA operon consensus sequences with cutadapt--discard-untrimmed -m 1200 -M 2000 -a TGYACWCACCGCCCGTC. Mapping to the curated ZymoBIOMICS Microbial Community DNA Standard rRNA operon reference sequences and the SILVA 132 SSURef Nr99 database was performed with minimap2 -ax map-ont --cs and filtered using samtools view -F 2308. Read classification was based on best hit and error rate was calculated as above (see Error profiling). The SILVA taxonomy and error rate was used to classify consensus sequences as chimeras or contamination.
Estimating UMI bin contamination
Taxonomic classification of raw reads in each UMI bin was performed by mapping the reads to curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard with minimap2 -ax map-ont --cs and filtered using samtools view -F 2308. Read classification was based on best hit. Contamination was estimated by calculating the fraction of reads not assigned to the most abundant taxonomy in each UMI bin.
Chimera detection
Chimeras in the rRNA operon consensus sequences were detected by usearch -uchime2_ref-strand plus -mode sensitive, using our curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard. Flagged chimeras were manually verified by investigating their error profiles in R (see Error profiling, Supplementary Figure 2).
Examination of relative abundance inconsistencies
We observed a difference between the relative abundance estimated with our UMI consensus data and the theoretical abundance for the rRNA operons of the mock community as reported by the manufacturer. A potential cause of this discrepancy could be due to problems with the DNA composition of the mock community. This was investigated by comparing rRNA operon relative abundance with theoretical relative abundance estimated from metagenomic Nanopore data from ZymoBIOMICS Microbial Community DNA Standard (see Oxford Nanopore sequencing of mock metagenomic DNA). Metagenome read lengths were calculated and taxonomy of metagenome reads were classified as above (see Taxonomic classification of consensus sequences with mock references) (Figure S12). Reference genome size and number of rRNA operons was obtained from the ZymoBIOMICS Microbial Community DNA Standard product manual (_d6305_d6306_zymobiomics_microbial_community_dna_standard.pdf, Ver. 1.1.5). The metagenome data along with the consensus rRNA operon data was imported into the R software environment, and analysed using the tidyverse and Biostrings R-packages along with custom scripts (see Resource availability). In short the relative abundance of the consensus rRNA operon data was calculated: Loading…. To calculate the theoretical relative abundance of the rRNA operons using the metagenome data, the metagenome data was first filtered to remove reads < 5000 bp. Read length reflects the DNA template length present in a DNA sample, and < 5000 bp templates are unlikely to contain a complete rRNA operon that can be amplified by PCR, and should therefore be disregarded in an analysis of rRNA operon relative abundance. First the theoretical number of rRNA operons was estimated per reference in the metagenome: Loading…. Then the relative abundance was calculated as above.
Analysis of genomic relative abundance and coverage skew due to growth
A bias in relative abundance could occur due to the mock species being in different growth phases at the time of sampling. To investigate the potential contribution of growth to coverage bias, we used metagenomic Nanopore data from ZymoBIOMICS Microbial Community Standards generated internally (see Oxford Nanopore sequencing of mock metagenomic DNA) and externally (see Generation of rRNA operon reference sequences for mock microbial community). Nanopore data was mapped to each species reference genome using minimap2 -ax map-ont and calculated genome read coverage per position by using samtools depth -aa. rRNA operon genome coordinates were predicted by barrnap (v.0.9) (available from: https://github.com/tseemann/barrnap) and species genomes were obtained by de novo assembly (see Generation of Reference Sequences for Mock Community). The data was imported into R, and used to create read coverage plots (Supplementary Figure 11).
Investigation of PCR primer match
A bias in relative abundance can be introduced in the first PCR where the rRNA operon is targeted with region-specific primers. If there are mismatches between primers and template, we would expect a lower annealing/amplification efficiency. Primer/template mismatches were estimated using ipcress from the package exonerate (v.2.2) (Supplementary Table 11).
E. coli whole genome sequencing with UMIs
Sources of DNA
Culturing
Escherichia coli str. K-12 substr. MG1655 (DSM 18039) was procured from DSMZ in 2010 and stored at -80 ₀C until use. A culture was grown overnight in 2 × 100 mL LB-media (10 g/L NaCl, 10g/L Tryptone, 5 g/L yeast extract) at 37 ₀C. Cells were harvested by centrifugation at 7800 RPM for 10 minutes and washed with 1X PBS buffer and finally resuspended in 1 × PBS.
DNA extraction
Genomic DNA was extracted with DNeasy PowerSoil (Qiagen, Netherlands) using standard protocol. The DNA concentration was measured on a Qubit 3.0 fluorometer with the Qubit dsDNA HS assay kit (Thermo Fisher Scientific) and the DNA quality was measured by gel electrophoresis on an Agilent 2200 Tapestation using Genomic Screentapes (Agilent Technologies).
DNA Sequence Library Preparation
Online protocol
An interactive step-by-step protocol is available at protocols.io: (https://www.protocols.io/private/D92C9DC132B111EA92DD0242AC110005).
DNA fragmentation
10 µg genomic DNA was fragmented using g-TUBE (Covaris, USA) centrifuged at 8415xg for 60 seconds. The fragmented DNA was purified using CleanPCR (CleanNA, Netherlands) following the manufacturer’s instructions (CleanPCR, manual revision v1.02) with the exception of an EtOH concentration of 80%, post wash dry time of < 3 minutes and 0.8x bead solution/sample ratio and elution 100 µL. DNA concentration and quality was assessed as described earlier.
End-repair and UMI adaptor ligation
An end-repair reaction was prepared containing 7 µL NEBNext End Repair Reaction buffer, 3 µL NEBNext End Prep Enzyme mix (New England Biolabs, USA), 2.5 µg fragmented DNA and adjusted to 50 µL with nuclease free water (Qiagen, Netherlands). The reaction was mixed by pipetting 10 times and incubated for 5 minutes at 20 °C and 5 minutes at 65 °C. The DNA was purified using CleanPCR (CleanNA, Netherlands) and its concentration and quality was assessed as described above. UMI adapters were prepared by mixing 25 µL lu_adp_l_v4 (100 µM), 25 µL lu_adp_s_v4 (100 µM), 25 µL 10x NEB T4 DNA ligase buffer (New England Biolabs, USA) and 175 µL nuclease free water (Qiagen, Netherlands) followed by incubation for 5 minutes at 94 °C and 15 minutes at room temperature. The UMI adapters ligation reaction contained 50 µL Blunt/TA ligation Master mix (New England Biolabs, USA), 20 µL adaptor mix prepared above and 1 µg end-repaired DNA in 80 µL nuclease free water. The reaction was mixed by pipetting 10 times and incubated for 10 minutes at 25 °C. The DNA was purified using CleanPCR (CleanNA, Netherlands) and the DNA concentration and quality was assessed as described above.
Amplification of UMI-tagged amplicons
4 ng of the adapter-ligated DNA was used as template for an initial PCR amplification of 8 cycles under the same conditions as in ‘Amplification of UMI-tagged amplicons’. The DNA was purified using CleanPCR (CleanNA, Netherlands) and the DNA concentration and quality was assessed as described above. The PCR amplicon was diluted to 2,000 molecules/µL with nuclease free water and used as template for PCR as described in the ‘Amplification of UMI-tagged amplicons’ section.
Oxford Nanopore sequencing of E. coli whole genome UMI amplicons
Sequencing was performed as described in ’rRNA operon UMI sequencing of mock microbial community’ section with the following exceptions. Basecalling was performed with Guppy v3.3.0 (Oxford Nanopore, England) in GPU mode with following modifications to the standard settings --config dna_r10_450bps_hac.cfg model.
Data generation workflow
Data generation was performed as described in the ’rRNA operon UMI sequencing of mock microbial community’ section with the following exceptions. Min/max read length cutoffs 2000bp/15000bp. Adaptor sequences used for locating sub UMIs with cutadapt were:
-g CAAGCAGAAGACGGCATACGAGAT…ACGTGTGCTCTTCCGATCT
-G AGATCGGAAGAGCACACGT…ATCTCGTATGCCGTCTTCTGCTTG.
Only a single round of medaka (v0.8.1) polishing was performed with the commands medaka mini_align -m and medaka consensus --model r10_min_high --chunk_len 6000. No variants were called.
Data analysis
Data analysis was performed as described in ’rRNA operon UMI sequencing of mock microbial community’ with the exception that genomic sequences Escherichia coli str_K12_MG1655 genome (NCBI: U00096.3) were used as references when profiling errors (Supplementary Figure 3).
Application of PacBio UMI sequencing of rRNA operons of American Gut
Project samples
PacBio UMI data generation and processing
PacBio library preparation, sequencing and data generation was performed as described in ’rRNA operon UMI sequencing of mock microbial community’ with the PacBio settings and the following exceptions. 1-2 ng of sample DNA was used as input for ‘Tagging target gene with UMIs’. In the third PCR in the library preparation, individual libraries were barcoded by swapping the normal amplification primers for tailed barcode primers (see Supplementary Table 1). 25 barcoded libraries were pooled and sent for Sequel II sequencing. After data generation UMI consensus sequences, the data was demultiplexed based on barcodes (see Resource Availability).
Taxonomic consistency between 16S V4 fragments and full length 16S
To test the consistency of the derived data to the existing Earth Microbiome Project15 16S V4 data, we first extracted full length 16S sequences from the operons using RNAmmer16. The sequences were then dereplicated and clustered de novo at 99% similarity using VSEARCH 2.7.017using the QIIME 2 version 2019.1018 q2-vsearch plugin (parameters: --p-perc-identity 0.99). Taxonomy was assigned against Greengenes 13_519 and the “classify-consensus-vsearch” method of q2-feature-classifier (parameters: --p-strand plus --p-query-cov 0.9 --p-perc-identity 0.9). Next, using redbiom20 we obtained EMP 16S V4 Deblur sOTU profiles21 for the samples corresponding to the same extracted DNA from Qiita22. Both tables were then aggregated to genus level relative abundance, and filtered to only the set of genera in common (n=82) between the two tables. The relative abundance of each genus, per sample, from the full length 16S and the 16S V4 data were then plotted (Supplementary Figure 14). Plotting was performed in matplotlib23, and Pearson and Spearman correlations were computed using SciPy24.
Taxonomic specificity of operons from real samples
Sequences of individual rRNA genes were identified from the full-length operon sequences using RNAmmer 1.2 under the “Bacteria” mode. The 16S and 23S rRNA sequences were concatenated with a linker of 20 “N” characters in between. Taxonomic assignment was performed by using the BLASTn algorithm as implemented in NCBI BLAST+ 2.7.1 to align query sequences against the extended “Web of Life” database25, which contains all 86,200 non-redundant bacterial and archaeal genomes from NCBI RefSeq and GenBank as of March 2017. The E-value threshold was set as 1e-5, whereas other thresholds were left as default. For each query sequence, hits with a bit score no more than 10% lower than the top hit were selected, and the lowest common ancestor (LCA) of these hits in the taxonomy tree was assigned to the query sequence. This behavior and threshold are consistent with DIAMOND’s taxonomic assignment functionality26. The percentage of query sequences assigned to any taxonomic unit at each of the eight standard taxonomic levels were calculated. The taxonomic assignment ratios at species or strain were compared using Pearson’s Chi-square test, as implemented in the “chi2_contingency” function of SciPy 1.3.1.
Generation of rRNA operon reference sequences for mock microbial community
We obtained raw fast5 files (ENA accession: ERR2887847) from a previously-reported27 sequencing effort of the ZymoBIOMICS Microbial Community Standard (D6300, batch ZRC190633) using Oxford Nanopore sequencing. The raw fast5 data was basecalled using the GPU-basecaller guppy v. 2.2.3 with “flipflop” mode. The basecalled reads were mapped to existing reference sequences (updated September 29, 2017; https://s3.amazonaws.com/zymo-files/BioPool/ZymoBIOMICS.STD.refseq.v2.zip) using minimap2 (v.2.12) with default settings. The mapped reads were assembled separately for each reference using minimap2 to create overlaps and miniasm (v.0.3) to perform the assembly with default settings. Raw reads were then mapped to the assembled genomes using minimap2 with default settings and racon (v.1.3.1) was used to construct corrected consensus sequences using default settings. The corrected sequences were subsequently polished with medaka (v.0.6.0) with the “r941_flip_model” model. rRNA operons were extracted from the draft reference genome assemblies using in silico PCR with our forward and reverse primers using ipcress, and were verified with genome coordinates for rRNA operons predicted by barrnap (Supplementary Table 11).
To remove any residual errors from the rRNA operon reference sequences after assembly and polishing, high-quality short reads generated from Illumina sequencing were downloaded from NCBI for each bacterial strain in the mock community (ENA accessions: ERR2935851, ERR2935850, ERR2935852, ERR2935857, ERR2935854, ERR2935853, ERR2935848, ERR2935849) and used for final polishing. The Illumina reads were randomly subsampled to an expected average coverage of 100 for each bacterial strain using the sample command in seqtk (v.1.0) (available from: https://github.com/lh3/seqtk). The subsampled Illumina reads were mapped to the draft rRNA operon sequences using minimap2 with the settings: -ax sr. The BAM files were sorted and indexed by samtools. We performed variant calling using bcftools (v1.9) with the commands mpileup and call using the settings: ploidy =1. Variant calls were filtered using bcftools filter with the settings: quality > 200. Variant calls were manually inspected and corrected, if needed, by visualizing mapping profiles in CLC Workbench. Polished consensus sequences were generated with bcftools consensus to generate high-quality references (zymo-ref-uniq_2019-10-28.fa) for use in benchmarking error rates in this study (see Resource availability). Intragenomic rRNA operons differed by between 0 to 380 bp for the polished rRNA reference sequences (Supplementary Table 9).
Oxford Nanopore sequencing of mock metagenomic DNA
Source of DNA
The same ZymoBIOMICS Microbial Community DNA Standard (D6306, lot no. ZRC190811) as described before was used.
Oxford Nanopore metagenome sequencing
1500 ng of the mock DNA was used as template for library preparation using the protocol “Genomic DNA by Ligation (SQK-LSK109)” (Oxford Nanopore, England). A MinION R10 flowcell (FLO-MIN110) was used for sequencing on a MinION and MinION software v19.10.1 (Oxford Nanopore, England). Basecalling was performed with Guppy v3.2.2 (Oxford Nanopore, England) in GPU mode with following modifications to the standard settings --config dna_r10_450bps_hac.cfg.
Resources availability
Protocols
Interactive protocols are available at protocols.io for the primer based longread UMI approach (https://www.protocols.io/private/F5C5FE21305911EAAC0B0242AC110003) and for the ligation based longread UMI approach https://www.protocols.io/private/D92C9DC132B111EA92DD0242AC110005
Code
Source code and analysis scripts are freely available at https://github.com/SorenKarst/longread-UMI-pipeline
Data
Raw and assembled sequencing data is available at the European Nucleotide Archive (https://www.ebi.ac.uk/ena) under the project number PRJEB32674 and a complete data overview can be found in Supplementary Table 12.
Supplementary information
Acknowledgements
The study was funded by research grants from VILLUM FONDEN (15510) and the Poul Due Jensen Foundation (Microflora Danica). RMZ was funded by grants from the Natural Sciences and Engineering Research Council of Canada (Discovery Grant), and Genome British Columbia (SIP011).
Footnotes
- Updated Nanopore data from R9 to R10 - Included PacBio CCS data - Included analysis of 70 samples from the American Gut Project (Daniel McDonald, Qiyun Zhu, Rob Knight) - Included analysis of genomic DNA (Emil A. Soerensen) - Updated and greatly revised the manuscript.