Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers and Nanopore sequencing

High-throughput amplicon sequencing of large genomic regions represents a challenge for existing short-read technologies. Long-read technologies can in theory sequence large genomic regions, but they currently suffer from high error rates. Here, we report a high-throughput amplicon sequencing approach that combines unique molecular identifiers (UMIs) with Oxford Nanopore sequencing to generate single-molecule consensus sequences of large genomic regions. We demonstrate the approach by generating nearly 10,000 full-length ribosomal RNA (rRNA) operons of roughly 4,400 bp in length from a mock microbial community consisting of eight bacterial species using a single Oxford Nanopore MinION flowcell. The mean error rate of the consensus sequences was 0.03%, with no detectable chimeras due to a rigorous UMI-barcode filtering strategy. The simplicity and accessibility of this method pave the way for widespread use of high-accuracy amplicon sequencing in a variety of genomic applications.


Introduction
High throughput amplicon sequencing is a powerful method for analysing variation in defined genetic regions when sample amounts are limited, insights into low abundant subpopulations are important, or samples need to be analysed in an economical manner.
The method is therefore ideal for studying genetic populations with low abundant variants or high heterogeneity such as cancer driver genes 1-3 , virus populations 4-6 and microbial communities 7 .
For years, short-read Illumina sequencing has dominated amplicon-related research due to its unprecedented throughput and low native error rate of 0.1%, but with a limitation in maximum amplicon size of ~500 bp (merging of 2x300 bp PE reads) 8. To enable a lower error rate and sequencing of longer amplicons, unique molecular identifiers (UMIs) have been applied extensively. Each template molecule in a sample is tagged with a UMI sequence consisting of 10-20 random bases. All derived products throughout processing and sequencing will contain the UMI tag, which can subsequently be used to sort and analyse reads based on their original template molecule. This concept has many applications in high-throughput sequencing, such as absolute quantification 9, generating molecule-level consensus sequences with a low error rate 10, and assembly of synthetic long reads 11. These applications have enabled key advances across diverse fields of research, such as absolute counting of transcripts in single cells 12, detecting low-frequency cancer mutations in plasma cell-free DNA 13, and generating full-length microbial SSU ribosomal RNA (rRNA) sequences in a high-throughput manner 14, to mention a few.
The lowest possible error rate of Illumina-based consensus sequencing is impressive (< 10^-7 %), but the upper limit of target length for UMI synthetic long reads remains approximately 2000 bp due to inefficient cluster generation of longer DNA fragments on the flowcells 15. UMI-based protocols exist that can generate longer consensus sequences from short reads 16, but they are not widely adopted due to complicated laboratory [...] profiling the error of the generated consensus sequences 23,24. For ONT sequencing the raw error rate of 5-25% 25 has, until now, made it difficult to efficiently extract UMI sequences and confidently determine the true UMI sequences necessary for read binning.
Here, we created a UMI design containing recognizable internal patterns, which together with UMI length filtering now makes it possible to robustly determine true UMI sequences in raw nanopore data. We incorporated this patterned-UMI design into a simple, generally applicable laboratory and bioinformatics protocol that combines UMIs and ONT sequencing of long amplicons (>4500 bp) from low template amounts with high accuracy.
As a proof of concept, we apply the method to sequence full-length ribosomal RNA (rRNA) operons in a mock microbial community of eight bacterial species (ZymoBIOMICS Microbial Community DNA Standard) and generate consensus sequences with a mean error rate of 0.03% and no detectable chimeras.

Results and discussion
The method is simple and comprises two PCR amplifications, Nanopore library preparation, Nanopore sequencing and custom data processing (Figure 1). First, the DNA template is diluted according to the desired number of output sequences. The final yield is impacted by the initial dilution, as well as the amplicon length and PCR efficiency; thus, the dilution should be calibrated empirically for an amplicon target of interest. For rRNA operon sequencing, we found that 5 ng of template produced ~10,000 consensus sequences, which is a good general starting point for further optimization. The genetic region of interest is targeted using 2 cycles of PCR with a custom set of tailed primers, which include a target-specific primer, a UMI sequence and a synthetic priming site used for downstream amplification (Figure 1A, step 1). Here we used the 27F (16S) 26 and the 2490R (23S) 27 primers to target the bacterial rRNA operon. The result from the initial PCR is a dsDNA amplicon copy of the genetic target with UMIs and synthetic priming sites in both ends. This template is subsequently amplified by PCR (Figure 1A, step 2) and prepared for long-read sequencing, in this case using the ONT 1D ligation kit and ONT MinION (Figure 1A, step 3), followed by base-calling. After sequencing, the data is trimmed and filtered, and reads are binned according to both terminal UMIs (Figure 1B, steps 1 and 2). To overcome the obstacle of binning UMIs in raw nanopore data with a mean error rate of ~9.5%, we designed "patterned" UMIs with the structure "NNNYRNNNYRNNNYRNNN". The YR ([C/T][A/G]) patterns limit the length of homopolymers in the UMIs to 4 bases, which mitigates the higher homopolymer error rate present in ONT sequencing 8. UMI sequences that have a high probability of being correct are detected based on the presence of the above pattern, as well as an expected UMI length of 18 bp.
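The pattern and length check described above can be illustrated with a short Python sketch. This is not the pipeline's own implementation; the function names `is_valid_umi` and `longest_homopolymer` are hypothetical, and the sketch assumes the UMI has already been extracted from the read as an upper-case A/C/G/T string.

```python
import re

# IUPAC codes used in the patterned UMI: N = any base, Y = C/T, R = A/G.
IUPAC = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}
UMI_PATTERN = "NNNYRNNNYRNNNYRNNN"  # structure given in the text (18 bp)
UMI_REGEX = re.compile("".join(IUPAC[code] for code in UMI_PATTERN))


def is_valid_umi(seq: str) -> bool:
    """Return True if seq has the expected length and matches the pattern."""
    return len(seq) == len(UMI_PATTERN) and UMI_REGEX.fullmatch(seq) is not None


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of identical bases in seq."""
    longest = run = 0
    previous = None
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest


umi = "AAATAAAATGCCCCGTTT"  # hypothetical UMI chosen to maximise homopolymer runs
print(is_valid_umi(umi), longest_homopolymer(umi))
```

Because every Y position is followed by an R position, a run of identical bases can span at most three N positions plus one adjacent Y or R, which is why the longest homopolymer in a valid UMI is 4 bases.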
The two terminal UMIs in the amplicons make up a combined UMI pair of 36 bases with a theoretical complexity of 1.2x10^18 combinations, which means it is extremely unlikely that two molecules contain the same UMI pair if aiming for 10,000-1,000,000 molecules. Chimeric amplicons will form during the later cycles of the PCR amplification step, especially if proof-reading polymerases are used 21. UMI pairs from these chimeric sequences are filtered de novo by removing reads with UMI pairs in which either UMI has been observed before in a more abundant UMI pair (Figure 1B, step 2) 28.
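The stated complexity can be checked with a few lines of Python. Each UMI has 12 N positions (4 options each) and 6 Y/R positions (2 options each), giving 4^12 x 2^6 = 2^30 sequences per UMI and 2^60 (about 1.15x10^18) per UMI pair. The collision estimate below is a sketch that assumes UMIs are drawn uniformly at random, which real primer synthesis only approximates; the function names are hypothetical.

```python
# Number of possible bases for each IUPAC code in the patterned UMI.
OPTIONS_PER_CODE = {"N": 4, "Y": 2, "R": 2}


def umi_complexity(pattern: str) -> int:
    """Count the distinct sequences that can match an IUPAC pattern."""
    total = 1
    for code in pattern:
        total *= OPTIONS_PER_CODE[code]
    return total


def expected_colliding_pairs(n_molecules: int, n_combinations: int) -> float:
    """Expected number of molecule pairs sharing the same UMI pair,
    assuming uniformly random UMIs (birthday-problem approximation)."""
    return n_molecules * (n_molecules - 1) / (2 * n_combinations)


SINGLE = umi_complexity("NNNYRNNNYRNNNYRNNN")  # one terminal UMI
PAIR = SINGLE ** 2                             # both terminal UMIs combined

print(f"single UMI: {SINGLE:.3e}, UMI pair: {PAIR:.3e}")
print(f"expected collisions at 1,000,000 molecules: "
      f"{expected_colliding_pairs(10**6, PAIR):.1e}")
```

Under this assumption, even at 1,000,000 tagged molecules the expected number of molecule pairs sharing a UMI pair stays well below one.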
The filtered, high-quality UMI pair sequences are used as a reference for binning of the raw dataset according to UMIs ( Figure 1B, step 3).
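The de novo chimera filter on UMI pairs can be sketched as follows. This is an illustrative reading of the rule stated above, not the pipeline's code: `filter_chimeric_pairs` is a hypothetical name, UMIs from both amplicon ends are tracked in a single set, UMIs from discarded pairs also count as "observed", and ties in abundance are broken alphabetically. All of these are assumptions.

```python
from typing import Dict, List, Tuple

UmiPair = Tuple[str, str]


def filter_chimeric_pairs(pair_counts: Dict[UmiPair, int]) -> List[UmiPair]:
    """Keep a UMI pair only if neither of its UMIs has already been seen
    in a more abundant UMI pair."""
    seen = set()
    kept = []
    # Visit pairs from most to least abundant (ties broken by sequence).
    ordered = sorted(pair_counts.items(), key=lambda item: (-item[1], item[0]))
    for pair, _count in ordered:
        if pair[0] not in seen and pair[1] not in seen:
            kept.append(pair)
        # Record both UMIs regardless of whether the pair was kept.
        seen.update(pair)
    return kept


# Toy example with placeholder UMI names instead of 18 bp sequences.
counts = {("A", "B"): 50, ("C", "D"): 40, ("A", "D"): 3, ("E", "F"): 2}
print(filter_chimeric_pairs(counts))
```

In the toy example the low-abundance pair ("A", "D") combines UMIs from two abundant pairs, which is the signature expected of a PCR chimera, so it is dropped.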
Sequencing of the mock community rRNA operon library resulted in 7.4 Gbp of basecalled raw data, of which 3.3 Gbp was binned based on UMIs. The mean read coverage per UMI bin was 67x. The consensus sequence for each UMI bin was generated by initially finding the centroid sequence in the bin, and polishing this centroid with all the data in the UMI bin using 5 rounds of racon 29 followed by 2 rounds of Medaka ( Figure 1B, step 4).
Initially, we observed error rates that were highly correlated with the individual rRNA operons in the Zymo mock (Supplementary Figure 4), which indicated errors in the available reference genomes, as was also reported by others 18. The reference genomes were generated using the Unicycler assembler with both Illumina and Nanopore reads and polished with pilon (personal communication with Zymo Research). As Unicycler uses a short-read assembly as starting point 30 and short-read polishing has been used for final curation, repeat regions are bound to contain errors resulting from ambiguous assembly and mapping 31. To generate improved rRNA operon references, we first used a long-read assembly approach, in which publicly available ONT sequence data of the Zymo mock community 32 was assembled into individual reference genomes with miniasm 33 followed by racon and Medaka polishing. rRNA operons were then extracted from the high-quality long-read assemblies, and SNPs with no Illumina short-read support, which were mainly indel errors in homopolymers, were manually curated. In total, we found 49 bacterial rRNA operons with 4-10 copies/species, where 44 operons were unique and had 1-379 intra-species differences (Supplementary Figure 2). The mean difference between the original references and our curated sequences was 0.063% (~2.8 SNP/operon), with a range of 0-0.47% (0-21 SNP/operon) (Supplementary Figure 3).
A total of 9759 amplicon UMI consensus sequences with an average length of 4372 bp were generated with a read coverage of ≥ 30x, a mean error rate of 0.03% and no detected chimeras ( Figure 2C). Of these sequences, 2570 were perfect with no errors. The error rate is markedly different in non-homopolymer regions compared to homopolymer regions ( Figure 2B). The non-homopolymer error rate stabilizes above a coverage of 10x for all error types (deletions, inserts and mismatches), with mismatches contributing to a majority of the remaining error ( Figure 2D). Within homopolymer regions, the error rate is higher and continues to drop beyond 100x coverage, which is primarily due to the indel errors ( Figure 2B). The mismatch error rate is similar between non-homopolymer and homopolymer regions over all coverage values. This demonstrates that the major obstacles for achieving a lower error rate are generally mismatch errors, as well as indel errors specifically in the homopolymer regions. The mismatch error rate of 0.012-0.016% is most likely derived from the 2 cycles of initial PCR performed to target the rRNA operon.
For this PCR, Platinum Taq DNA high-fidelity polymerase (Thermo Fisher) was used, which should have an error rate in the range of 0.003-0.005% (6x lower than Taq) 34,35 per duplication, which theoretically would result in a cumulative error rate over 2 PCR cycles of up to 0.01%. Other high-fidelity polymerases with lower error rates were tested, but we were unable to consistently produce amplicons, which might be due to unwanted intra- or inter-molecule annealing. The homopolymer indel error rate is a consequence of the nanopore read-head structure in the CsgG pore used in the current R9.4 chemistry 8. Generally, the homopolymer indel rate depends on homopolymer length and specific nucleotide (Supplementary Table 2), i.e. A-homopolymers have markedly lower errors than G-homopolymers. Yet, a closer inspection of the homopolymer error rates reveals a more complicated picture. For example, some positions of 3xC homopolymers contained more frequent insertions than longer C-homopolymers (Supplementary Figure 1). This problem is likely rooted in the calibration of the neural networks of the base-caller and consensus algorithms 36; it is bound to change significantly in the future and will probably be reduced with the introduction of the R10 pores. Despite residual systematic errors, the error rate presented here is the lowest documented for long-read amplicons yet (Supplementary Table 4). We did not identify any chimeras in the generated long-read amplicon data.
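The cumulative polymerase error estimate given above (up to 0.01% over 2 cycles) follows from simple arithmetic, sketched here in Python. The linear accumulation is an approximation that holds for small per-duplication error rates, and the function names are hypothetical. The amplicon length of 4372 bp is the average consensus length reported in the text.

```python
def cumulative_pcr_error(per_duplication_rate: float, cycles: int) -> float:
    """Approximate upper bound on the per-base error fraction after `cycles`
    duplications, assuming errors accumulate linearly (valid for small rates)."""
    return per_duplication_rate * cycles


def expected_errors(amplicon_length: int, error_fraction: float) -> float:
    """Expected number of polymerase errors in one amplicon."""
    return amplicon_length * error_fraction


# 0.005% per duplication is the upper end of the range quoted in the text.
upper = cumulative_pcr_error(0.005 / 100, cycles=2)
print(f"cumulative error: {upper * 100:.3f}%")
print(f"expected errors per 4372 bp amplicon: {expected_errors(4372, upper):.2f}")
```

At the upper bound this corresponds to fewer than one polymerase-derived error per operon on average.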
An important application of high-accuracy amplicon sequencing is the ability to confidently call variants, even if they are present in low relative abundance. To test our method, we [...] Table 3). An additional 26 spurious variants were detected with a mean error count of 1.4 (0.03% error rate) and a maximum of 3 (0.07% error rate). These spurious variants are supported by 1.6% of the total data, and seem to occur due to systematic errors at specific positions outside homopolymer regions.
The relative rRNA operon abundances within each species were very similar, as was expected (Figure 3C). For some species the internal coverage variance was small (E. coli percent sd=4.9) and for others it was higher (L. fermentum percent sd=12.8) (Supplementary Figure 6 and Supplementary Table 5). By investigating the read coverage of the mock community genomes within the publicly available metagenomic nanopore data 32, we found evidence of heavy coverage skew across the genome in some species, likely due to different growth rates of the cultures at the time of sampling (Figure 3D, Supplementary Table 7). This skew can impact the relative template abundances of the operons by up to +/-50% (Supplementary Table 5), depending on their distance to the origin of replication, and could to some degree explain the variance we see among inter-species operon abundances.
The observed relative abundance between species did not match the theoretical abundance for all species reported by the vendor (Supplementary Table 6). Possible explanations are erroneous mixing of the mock community, species-dependent DNA fragment size, PCR primer mismatch, operon/genome GC content, and different amplification efficiencies. To our surprise, none of these potential causes alone could explain the observed discrepancy in relative abundance (Supplementary Tables 6-7 and [...]
The data presented here was generated in 48 hrs (6 hrs lab work, 24 hrs sequencing, 6 hrs data processing) at a reagent cost of 1100 USD, which is ~0.1 USD/consensus sequence. Using this method on the PacBio Sequel system with the SMRT Cell 1m chips, we anticipate the output would be around 100,000 UMI consensus sequences at a cost of ~0.02 USD/consensus sequence with a marginally better error rate, as the PacBio errors seem more random and therefore better suited for consensus calling 37. The throughput will likely change by a factor of 10x with the introduction of Sequel II and the SMRT Cell 8m chips. The turnaround for PacBio sequencing is theoretically < 24 hrs, but as most users would need sequencing out-of-house, this is more likely > 7 days. We predict that the ease of use, fast turnaround time and accessibility will favour sequencing of high-accuracy amplicons on the ONT platform.
bioRxiv preprint doi: https://doi.org/10.1101/645903; this version posted May 28, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.
Over the past several decades, the amplification and sequencing of ribosomal RNA (rRNA) genes, primarily 16S and 18S, has become an integral method used to study the diversity and taxonomic composition of microbial communities in a variety of environments 38.
With our method, it is now possible to improve upon high-throughput sequencing of environmental samples with databases based on the full rRNA operon (SSU-ITS-LSU), which has not previously been feasible due to the length of the operon (≈ 5 kbp) and the aforementioned method limitations. A database of full-operon rRNA sequences will help improve upon rRNA phylogeny, allow higher phylogenetic resolution 39-42, which is especially critical if the method is applicable to eukaryotes 43,44, and will present a wider range of target regions for designing short-read amplicon sequencing assays and fluorescent in situ hybridization probes 45,46.
High-accuracy amplicon sequencing of long targets has many applications, and the ease and accessibility of this method now makes it possible for the wider scientific community to develop new solutions: all one needs is a modified version of their favourite primers, a few generic molecular laboratory instruments, and a MinION starter kit from Oxford Nanopore Technologies. While the residual error rate in the Nanopore consensus data is negligible, the remaining systematic indel errors could still be an issue in some contexts, such as sensitive assays where low-abundance variants are important, or if shifts in reading frames cannot be tolerated. These systematic indel errors will hopefully be solved soon, and until then, this method can be applied with the PacBio platform for the specific purposes above. By exchanging the initial PCR for a ligation step, high-accuracy amplicon sequencing could also be applied to fragmented DNA with tight size distributions (5-15 kbp) to produce long reads with a low error rate, which holds great promise for human genome sequencing 47 and resolving strain diversity in metagenomes 48.

Target gene and add UMIs
PCR was used to target the bacterial 16S-23S rRNA operon and simultaneously tag each template molecule with terminal unique molecular identifiers (UMIs).
The following primers were used for the PCR.

Amplification of UMI tagged amplicons
A second PCR was used to amplify the UMI-tagged template molecules. All of the UMI-tagged template molecules were added to the reaction along with a final concentration of 1x High Fidelity PCR buffer, 100 mM of each dNTP, 1.5 mM MgSO4, 500 nM of each ncec_pcr_fw_v7 [...]
To obtain sufficient PCR product for Oxford Nanopore sequencing, a third PCR was performed using amplicons from the second PCR and the same procedure as before, but with 4 x 100 µl reactions and 10 cycles of amplification. The final amount of amplicon generated was 10 µg in 55 µL.

DNA Sequencing
2000 ng of the purified amplicon from the third PCR was used as template for library preparation using the protocol "1D amplicon/cDNA by ligation (SQK-LSK109)" (Oxford Nanopore,

Data generation workflow
Trimming and filtering of raw data
[...] The result from these pre-processing steps was trimmed and filtered raw read data.

Extraction of UMI reference sequences
To efficiently bin reads according to UMIs, it was critical to extract and validate true UMI sequences that could be used as references. UMI sequences of the correct length (18 bp) were

Binning reads according to UMI
The first 55-65 bp of each terminal of the trimmed and filtered reads were extracted with awk and saved into individual files. The UMI pair reference sequences were split into their corresponding single UMIs and mapped to the read terminals using bwa 5 (v0.7.17-r1198-dirty) with the commands: index, aln -n 3 -N, and samse -n 10000000. The mapping results were then filtered using samtools 6 (v1.9) with the command view -F 20. Mapping results from each end of the reads were merged, and a read was assigned to a specific UMI pair reference if two conditions were met: A) the UMI was the best hit; B) the mapping difference between the query read and each sub UMI was ≤ 3 bp. Based on these designations, the trimmed and filtered reads were divided into UMI bins.
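The two assignment conditions can be illustrated with a small Python sketch. It is not the pipeline's awk/bwa/samtools implementation: `assign_read` is a hypothetical function, the inputs are assumed to be per-end dictionaries mapping a UMI pair identifier to the mapping difference of its sub-UMI at that read end, "best hit" is interpreted as the lowest summed difference over both ends, and ties are left unassigned. These interpretations are assumptions.

```python
from typing import Dict, Optional

MAX_DIFF = 3  # maximum mapping difference per sub-UMI, as stated in the text


def assign_read(end1_hits: Dict[str, int],
                end2_hits: Dict[str, int]) -> Optional[str]:
    """Return the UMI pair id a read is assigned to, or None if unassigned."""
    candidates = {
        pair_id: end1_hits[pair_id] + end2_hits[pair_id]
        for pair_id in end1_hits.keys() & end2_hits.keys()  # hit at both ends
        if end1_hits[pair_id] <= MAX_DIFF and end2_hits[pair_id] <= MAX_DIFF
    }
    if not candidates:
        return None
    best = min(candidates.values())
    winners = [pair_id for pair_id, diff in candidates.items() if diff == best]
    # An ambiguous best hit leaves the read unassigned (assumption).
    return winners[0] if len(winners) == 1 else None


# Hypothetical read with hits to two UMI pair references at each end.
print(assign_read({"umi_pair_1": 0, "umi_pair_2": 2},
                  {"umi_pair_1": 1, "umi_pair_2": 0}))
```

Requiring a hit at both read ends means that a read carrying only one recognizable UMI is never binned in this sketch.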

Generation of UMI consensus sequences
For each individual UMI bin, a consensus sequence was initially generated using usearch

Pipeline parallelization
Many steps in the pipeline have been parallelized using GNU parallel 10.

Generation of Reference Sequences for Mock Community
We obtained raw fast5 files from a previously-reported sequencing effort of the ZymoBIOMICS

Chimera detection
Chimeras in the consensus sequences were detected by usearch 12 with the commands -uchime2_ref -strand plus -mode sensitive, using our curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard (see above).

Error profiling
Detection of error was based on a mapping of the sequence data (raw reads, consensus sequences, variant consensus sequences) to our curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard (see above). Mapping was performed with minimap2 -ax map-ont --cs and filtered using samtools view -F 2308. The references and mappings were imported into R software environment 13

Exploration of relative abundance inconsistencies
We observed a difference between the relative abundance estimated with our UMI consensus data and the theoretical abundance for the rRNA operons of the mock community. We investigated several different potential causes of this discrepancy by importing relevant data and metadata into the R software environment 13 (v3.5.1), using mainly the tidyverse (v1.2.1 https://www.tidyverse.org/) and Biostrings 14 (v2.48.0) R-packages and custom scripts (see Code availability).
Validate content of ZymoBIOMICS mock. Oxford Nanopore data from the ZymoBIOMICS Microbial Community DNA Standard described above was used for the analysis. The data was divided per species and imported into R. Based on read lengths, the total bp count was estimated for each species, and used together with the theoretical genome sizes and rRNA operon copy numbers to estimate the theoretical relative abundance of 16S (equal to rRNA operons). The read length data was used to estimate the amount of DNA theoretically available for rRNA operon PCR. A DNA fragment has to be equal to or larger than the rRNA operon to be a valid PCR template. Furthermore, DNA fragments are generated randomly and break points introduced within the operon will also render the DNA fragment useless as a template for PCR. Hence, all fragments below 4500 bp were discarded and 4500 bp were subtracted from all longer fragment lengths > 4500 bp to take broken operons into account. Based on the adjusted read lengths we estimated an adjusted theoretical relative abundance of 16S rRNA.
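The read-length adjustment and the resulting abundance estimate can be sketched in Python (the original analysis was done in R). The function names, species labels and toy numbers below are hypothetical, and the abundance formula (adjusted bp divided by genome size, multiplied by operon copy number, then normalized) is one reading of the description above rather than the authors' script.

```python
from typing import Dict, Iterable

OPERON_LENGTH = 4500  # bp threshold used in the text


def usable_template_bp(read_lengths: Iterable[int],
                       operon_length: int = OPERON_LENGTH) -> int:
    """Discard fragments shorter than the operon and subtract one operon
    length from each remaining fragment to account for broken operons."""
    return sum(length - operon_length
               for length in read_lengths
               if length >= operon_length)


def relative_operon_abundance(adjusted_bp: Dict[str, float],
                              genome_size: Dict[str, float],
                              operon_copies: Dict[str, int]) -> Dict[str, float]:
    """Estimate relative rRNA operon abundance per species."""
    weights = {
        species: adjusted_bp[species] / genome_size[species] * operon_copies[species]
        for species in adjusted_bp
    }
    total = sum(weights.values())
    return {species: weight / total for species, weight in weights.items()}


print(usable_template_bp([1000, 4500, 6000, 10000]))
print(relative_operon_abundance(
    adjusted_bp={"species_a": 8000, "species_b": 4000},
    genome_size={"species_a": 4000, "species_b": 4000},
    operon_copies={"species_a": 1, "species_b": 2},
))
```

In the toy example, species_b has half the usable DNA but twice the operon copy number, so both species end up with the same estimated operon abundance.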
Investigate impact of GC and operon length. Possible impact of GC content (genome/rRNA operon) and operon read lengths was investigated by plotting relative difference between observed abundance and theoretical abundances.
Investigate PCR primer match. A bias in relative abundance can be introduced in the first PCR where the rRNA operon is targeted with region specific primers. If there are mismatches between primers and template, we would expect a lower annealing/amplification efficiency.
Primer/template mismatches were estimated using ipcress as described above.
Investigate PCR amplification bias. A bias in relative abundance can also be introduced in the second PCR where the UMI tagged amplicons are amplified with > 25 cycles of PCR. If a specific template has a relatively poor amplification efficiency we would expect this to impact the general bin size of this template. To investigate this, we imported UMI bin size statistics and UMI classifications into R and plotted bin sizes as function of species, operon and operon size.
Analysis of genomic coverage skew due to growth. A bias in relative abundance could also occur due to the mock species being in different growth phases at the time of sampling. To investigate the potential contribution of growth to coverage bias, we used the previously generated genomes of the mock community species. Nanopore data was mapped to each species genome using minimap2 -ax map-ont, and per-position genome depth was calculated using samtools. Ribosomal RNA operon genome coordinates were predicted by barrnap as described before. The data was imported into R, and used to create read coverage plots.
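The coverage summaries used in this analysis (10 kbp window averages as plotted in Figure S6, and the relative standard deviation reported as sd_pct in Table S5) can be sketched in Python. The original analysis was performed in R; the function names here are hypothetical, the input is assumed to be a list of per-position depths for one genome, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import List, Sequence


def windowed_mean_depth(depths: Sequence[float], window: int = 10_000) -> List[float]:
    """Average per-position depth in consecutive, non-overlapping windows.
    The last window may be shorter than `window`."""
    return [mean(depths[start:start + window])
            for start in range(0, len(depths), window)]


def relative_sd_pct(values: Sequence[float]) -> float:
    """Relative standard deviation in percent (sample sd / mean * 100)."""
    return stdev(values) / mean(values) * 100


# Toy depth profile with a window of 2 positions instead of 10 kbp.
toy_depths = [1, 1, 3, 3, 5]
print(windowed_mean_depth(toy_depths, window=2))
print(relative_sd_pct([10, 10, 10]))
```

A genome with a strong growth-related skew would show window means that decline with distance from the origin of replication and, as a result, a larger relative standard deviation.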

Code Availability
Source code and analysis scripts are freely available at https://github.com/SorenKarst/longread-UMI-pipeline
Figure S1: Error as function of homopolymers in reference sequences. All homopolymers in the reference sequences (44 unique curated rRNA operons from the ZymoBIOMICS Microbial Community DNA Standard) have been categorized according to nucleotide (four boxes) and length of homopolymer (x-axis). Each dot represents a specific homopolymer in a specific reference operon. The y-axis denotes the fraction of sequences from a specific operon that have a deletion, have an insertion, or are perfect. The numbers at the top of the boxes show the total number of homopolymers in each category.
Comparison between curated and non-curated reference sequences at species level. Each overview table corresponds to a species in the mock community, in which each ZymoBIOMICS reference operon is listed next to the closest corresponding curated reference, together with the differences between the two divided by type (deletion, insertion and mismatch) and the total error rate. The differences between the ZymoBIOMICS references and our curated references were determined with minimap2.
Figure S4: Number of errors in UMI consensus using non-curated references. The number of errors in the UMI consensus sequences estimated based on non-curated rRNA reference sequences. Each point represents a single polished consensus sequence that aligns to a specific reference operon. Black bars represent the median number of errors.
Figure S5: Validation of chimera detection. Chimera detection is notoriously difficult when sequencing errors are present, and uchime2_ref, which we used, will only call a chimera if a sequence is an error-free combination of the references 1. We estimate that approximately 10% of our consensus sequences are error-free, and hence the chimera detection only works as intended on that fraction of the data. To validate that closely related chimeras would be identified with uchime2_ref, we generated a mock chimera dataset from the reference sequences, which had from 1 to 842 bp differences to the closest matching references. 99.98% of the inter-species chimeras (n = 5000) were detected and 11.6% of the intra-species chimeras (n = 3123). The plot shows the test results; the data is divided by inter- and intra-species chimeras, the x-axis shows the number of differences between chimera and closest matching reference, and the y-axis shows the number of chimeras. It is mainly chimeras with few SNPs that are not classified.
Figure S6: Mock community metagenome read coverage across genomes. Coverage profile of the mock community based on shotgun Nanopore sequencing data. Each grey point is a 10 kbp average coverage value. Colored points represent the positions of the individual rRNA operons.
Figure S4: Read size distribution of mock community metagenome data. Each line plot represents the read size distribution from each mock community species estimated from the Nanopore metagenome data. Some species have significantly more high-molecular-weight DNA over 5000 bp compared to some of the other species, which is important for effective template availability in PCR. The distributions seem to be gram+/- dependent. Were different DNA extraction kits used?
Table S1: Error rate and types depending on UMI read coverage. 'Consensus coverage' refers to the UMI bin size, the error rate is averaged for all consensus sequences in that bin size range, and 'mm', 'ins', and 'del' refer to mean absolute counts of mismatches, insertions, and deletions, respectively.
Table S2: Error rate (%) divided by homopolymer type and length. 'hp-' indicates non-homopolymer regions, while homopolymer regions are separated by length (3-7 bp) and nucleotide type (A, C, G, T).
The error rates shown are relative to the curated reference database. The closest operon relative is shown in the far left column, and the assigned variant name in the far right column.
Table S5: Overview of metagenome read coverage stats for all species. The mean coverage, standard deviation (sd), maximum coverage, minimum coverage, and the relative standard deviation (sd_pct) were all based on whole-genome Nanopore sequence data (available from: https://github.com/LomanLab/mockcommunity).
Table S6: Overview of different estimates of relative abundance in mock community. 'Theoretical 16S relative abundance' is the abundance provided by the vendor. 'Theoretical 16S relative abundance based on metagenome' is estimated based on the Nanopore metagenome data and the number of rRNA operons per genome. 'Observed 16S relative abundance UMI consensus data' is based on the UMI consensus data. GC content is estimated from the genome and rRNA operon reference sequences.
Table S7: Estimation of mismatches between primers and rRNA operon sequences.