High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing

High-throughput amplicon sequencing of large genomic regions remains challenging for short-read technologies. Here, we report a high-throughput amplicon sequencing approach combining unique molecular identifiers (UMIs) with Oxford Nanopore Technologies (ONT) or Pacific Biosciences circular consensus sequencing, yielding high-accuracy single-molecule consensus sequences of large genomic regions. We applied our approach to sequence ribosomal RNA operon amplicons (~4,500 bp) and genomic sequences (>10,000 bp) of reference microbial communities in which we observed a chimera rate <0.02%. To reach a mean UMI consensus error rate <0.01%, a UMI read coverage of 15× (ONT R10.3), 25× (ONT R9.4.1) and 3× (Pacific Biosciences circular consensus sequencing) is needed, which provides a mean error rate of 0.0042%, 0.0041% and 0.0007%, respectively. This work presents a sequencing strategy based on unique molecular identifiers that improves long-read consensus sequence accuracy of targeted amplicons as well as shotgun whole-genome fragments.

H igh-throughput amplicon sequencing is a ubiquitous method for studying genetics with low-abundance variants or high heterogeneity, including cancer driver genes 1-3 , viral populations 4-6 and microbial communities 7 . Short-read Illumina sequencing has dominated amplicon-related research due to its unprecedented throughput and low native error rate of ~0.1%, but its maximum amplicon size of ~500 bp 8 limits important long-range information and assay resolution 9 . UMIs have been applied to enable sequencing of longer amplicons with short-reads via assembly of synthetic long reads 10 . Each template molecule in a sample is tagged with a UMI sequence consisting of 10-20 random bases, which can subsequently be used to sort and analyze reads based on their original template molecule. UMIs can enable sequencing of synthetic long reads up to ~11,000 bp, but this approach cannot resolve amplicons with repeats longer than the short-read length 11 , which limits its application. The high native error rates of ONT (5-25% 12 ) and Pacific Biosciences (PacBio) (~13% 13 ) have until now, made it difficult to confidently identify true UMI tag sequences that are necessary to accurately assign raw reads to their template molecules. Furthermore, the combination of UMIs with long-read sequencing is relatively unexplored and only recently has this combination been applied with PacBio circular consensus sequencing (CCS) [14][15][16] , but without using dual UMIs for chimera filtering 17 and profiling the error of the generated consensus sequences.
Here, we present a simple workflow that combines UMIs with sequencing of long amplicons on the ONT and PacBio platforms to produce highly accurate single-molecule consensus sequences with a low chimera rate. To improve recognition of UMI-tagged error-prone reads, we designed UMIs to contain recognizable internal patterns ( Fig. 1a and Supplementary Table 1) that avoid error-prone homopolymer stretches 18 and, combined with filtering based on UMI length and pattern, allow for robust determination of true UMI sequences in raw error-prone ONT and PacBio data.

Results
Applying dual-UMI tagging for long-read amplicon sequencing. The DNA template is initially diluted to a target number of output sequences, which is estimated based on the desired single-molecule coverage and expected sequencing yield. This initial dilution step must be optimized for the genomic material of a given study to ensure that the target number of amplifiable target molecules can be achieved, either by trial sequencing or using methods such as semi-quantitative PCR and a suitable amplicon product standard. The genetic region of interest is then targeted using two cycles of PCR with a customized set of tailed primers (Supplementary  Table 1), which includes a target-specific primer, a UMI sequence and a synthetic priming site used for downstream amplification (Fig. 1b, step 1). For PCR-free approaches, such as whole-genome or metagenomic DNA sequencing, adaptors containing UMIs can be ligated to template DNA molecules. The product from the initial UMI-tagging step is a double-stranded DNA amplicon copy of the genetic target, containing the UMIs and synthetic primer sites on both ends. The UMI-tagged molecule is subsequently amplified by PCR targeting the synthetic primer sites (Fig. 1b, step 2) and prepared for long-read sequencing with ONT or PacBio CCS (Fig. 1b,  step 3). After sequencing, reads are binned based on the concatenation of the two terminal UMIs (UMI pair) (Fig. 1c, steps 1 and 2). High-quality UMI sequences are detected on the basis of the presence of the designated pattern, as well as the expected UMI length of 18 bp (Fig. 1a). Chimeric sequences are filtered de novo by removing UMI pairs in which either terminal UMI is observed in a more abundant UMI pair 17 (Fig. 1c, step 2). The filtered, high-quality UMI-pair sequences are used as a reference for binning the raw dataset according to UMIs (Fig. 1c, step 3). To remove potential collisions among UMI bins, post-binning filters were used to detect UMI pairs with abnormal match errors and read coverage properties. For ONT data, a filter was applied to normalize or discard UMI bins with heavy ± read-strand bias (<30% relative abundance of the least abundant strand), which can impact consensus error rates 19 . The consensus sequence for each UMI bin is then generated by multiple rounds of polishing using the binned raw reads (Fig. 1c, step 4). By using paired UMIs, issues with chimeras 17 and to some degree GC and length-dependent amplification bias 20 can be mitigated, but other long-range PCR issues, such as primer bias, off-target products and an upper limit to amplifiable target length still applies.
Benchmarking using microbial community standards. To assess the effectiveness of our UMI-tagging approach for error-correcting long reads, we sequenced full-length ribosomal RNA (rRNA) operons (~4,400 bp) in a mock microbial community from ZymoBIOMICS containing eight bacterial species (Supplementary Table 2) with a total of 43 unique operons ( Supplementary Fig. 1). To compare sequencing approaches, we calculated the necessary read coverage in windows of 5× to obtain a mean error rate <0.01%, meaning a read-level Phred quality score above 40 (Q40) (Supplementary Table 3), and data below that read coverage cutoff were removed from the analysis. Afterwards, chimeric sequences and sequences from reagent contamination were identified (Supplementary Figs. 2 and 3), manually curated ( Supplementary Figs. 2 and 4) and removed from the dataset to enable calculations of true error rates.
On a single ONT MinION R10.3 flowcell, a total of 23,337 amplicon UMI consensus sequences (ONT R10.3 UMI) were generated with read coverages over the Q40 cutoff (≥15×), an average length of 4,378 bp, a mean residual error rate of 0.0042% and a chimera rate of 0.009% (Fig. 2a,b). The same UMI-tagged amplicon library run on an ONT MinION R.9.4.1 flowcell produced 25,839 amplicon UMI consensus sequences (ONT R9.4.1 UMI) with coverages over the Q40 cutoff (≥25×) that had a mean residual error rate of 0.0041% and a chimera rate of 0.019% (Fig. 2a,b). Sequencing the same library of UMI-tagged long amplicons with a PacBio Sequel II 8M flowcell in CCS mode resulted in 39,678 UMI consensus sequences (PacBio UMI) with CCS read coverages over the Q40 cutoff (≥3×), an average length of 4,376 bp, a mean residual error rate of 0.0007% and a chimera rate of 0.015% (Fig. 2a, Table 3). For comparison, raw PacBio CCS reads without UMI clustering generated 135,823 CCS reads (PacBio CCS) with read coverages over the Q40 cutoff (≥40×) that had a mean error rate of 0.0080% and a chimera rate of 1.9%. The 1.9% chimeras observed in the raw PacBio CCS data are most likely introduced during the PCR amplification of the UMI library and are therefore present in the amplicon library before sequencing. To obtain highly accurate consensus sequences (≥Q40) from PacBio UMI data, at least three CCS reads per UMI pair is required. Three CCS reads are necessary so that PCR errors can then be polished using the sequence information within each CCS read derived from the same template; high raw CCS subread coverage alone is not sufficient. For this reason, PacBio UMI achieves lower error rates than PacBio CCS across all raw subread coverages (Supplementary Table 3  using a rigorous UMI-based filtering approach almost eliminates PCR chimeras, which otherwise can make up >20% of amplicons depending on PCR conditions 21 . The UMI-based chimera filtering removed 4.8%, 7.1% and 6.8% of the UMI bins in the ONT R10.3 UMI, ONT R9.4.1 UMI and PacBio UMI datasets, respectively. Furthermore, the ligation-based ONT UMI approach was tested on genomic DNA fragments of up to 10,000 bp from Escherichia coli and the results were consistent with rRNA operon results ( Supplementary Fig. 5), demonstrating the flexibility of this method for improving the accuracy of long-read sequencing. The residual errors were markedly different in nonhomopolymer regions compared to homopolymer regions for both the ONT UMI and PacBio CCS data, whereas residual errors in PacBio UMI data were extremely low in both cases (Fig. 2a,c) Table 4) and reaffirmed that homopolymer-derived errors remain an obstacle for further reducing error rates. For the ONT UMI R10.3 and R9.4.1 data, G-insertions in nonhomopolymer regions made up the majority of remaining errors. Both the nonhomopolymer insertions and the homopolymer deletions were to some degree systematic, with some errors in specific positions being present in 10-50% of sequences ( Fig. 2c Table 4). Characterizing whether the residual error is random or systematic is important for the ability to accurately call variants from single-molecule consensus sequences. We naively generated variants from the four data types (data > Q40 threshold but without removing chimeras) by clustering consensus sequences, phasing single-nucleotide variants within clusters and calling variants if present at ≥3× coverage. Homopolymer masking was used to minimize the impact of systematic homopolymer errors. The 43 references in the ZymoBIOMICS mock data were observed for all datatypes ( Supplementary Fig. 10) without error except for one variant from the ONT R9.4.1 UMI data, which had one error in a homopolymer region ( Supplementary Fig. 7). For ONT R10.3 UMI, ONT R9.4 UMI, PacBio CCS and PacBio UMI data, 1.27%, 0.29%, 6.99% and 0.12% of the consensus data, respectively were assigned to variants with systematic errors and 0%, 0%, 0.43% and 0% were assigned to chimeric variants, respectively (Supplementary Table 5). We also explored the impact of strand bias on error rate and systematic errors for ONT UMI data, by simulating UMI bins with different read coverage (10-30×) and strand bias (0-0.5 relative abundance of least abundant strand). The error distribution ( Supplementary Fig. 11) and systematic errors were impacted by strand bias, as also observed by Gilpatrick et al. 19 but most effects were negligible at strand bias ratios ≥0.3 ( Supplementary Fig. 12a). The impact of strand bias and read coverage in ONT data is complex due to local sequence context and changes in chemistry and software ( Supplementary Fig. 12b). As the fraction of the data with extreme strand bias is small, we added a simple strand bias filter that removes or normalizes bins with <0.3 relative abundance of the least abundant strand. This removed 7.3% (ONT 10.3) and 3.9% (ONT 9.4.1) of UMI data.

b and Supplementary
Application to human fecal samples in the American Gut Project. Amplification and sequencing of rRNA genes has become an important method for studying the diversity and taxonomic composition of human-and environment-associated microbial communities. Here, we applied the PacBio UMI method to generate 253,089 high-quality, full-length bacterial rRNA operon sequences from 70 human fecal samples collected by the American Gut Project 22 . We assessed strain-level taxonomic resolution by annotating full-length 5S, 16S and 23S rRNA genes within the operons and searched for these genes against gene-specific databases from the 'Web of Life' 23 . Using only the full-length 16S rRNA gene, 11.3% of the sequences could be matched at the strain level to the database and 38.4% assigned at the species level. By using both the 16S and 23S rRNA genes within an operon, we could assign 22% at the strain level and 72.2% at the species level, representing a significant increase in assignment over using the full-length 16S rRNA alone (chi-squared, 124,086; P value < 2.23 × 10 −308 ; Fig. 3). Furthermore, we observed a significant correlation between 16S rRNA genes from the operons and 16S rRNA V4 Illumina short-read data generated from the same physical specimens, even though different sets of primers were used (Supplementary Fig. 13; Spearman: r = 0.527, P = 1.449 × 10 −80 ; Pearson: r = 0.553, P = 4.555 × 10 −90 ). These results are in line with a recent study of the taxonomic resolution of the rRNA operon 24 . This UMI approach should also enable direct quantification of molecules in a sample 25 , which would be ideal for precise relative abundance estimates. However, the mock community used here, and probably many others, contains a biased fragment size and growth-dependent coverage, preventing proper quantification (Supplementary Table 6 and Supplementary Figs. 14-18).

Discussion
Here, we present a UMI-based approach that generates higher single-molecule accuracy and lower chimera rates with amplicons than was previously possible on long-read sequencing platforms. CCS-like strategies have been attempted as an alternative to UMIs to reduce the error rate of amplicon sequencing on the ONT platform, but these methods cannot remove PCR errors and chimeras and suffer from insufficient molecule coverage (majority of data <20× coverage) 26,27 to effectively improve the accuracy to >Q40 (Supplementary Table 7). In principle, lower error rates can also be achieved with de-noising strategies 28,29 , but at the cost of potentially missing true low-abundance variants 30 , which are critical for some applications (for example, pathogen detection and drug resistance). Furthermore, state-of-the-art clustering algorithms depend on the abundance of unique sequences to model errors 31,32 , which is not suitable for datasets where population micro-heterogeneity is high and evenness low.
We show that systematic errors do occur, especially, but not exclusively, around long homopolymer repeats using the ONT platform, which underlines that our observed error rate cannot be extrapolated to all sequence contexts; but this is true for any sequencing technology. Moreover, the error landscape is likely to keep changing due to the rapid development of sequencing technologies and algorithms. Meanwhile, it is possible to take advantage of the fact that UMI consensus error rates will be the same as shotgun sequencing consensus error rates of the same target region at the same raw read coverage. Thus, the validation results for the different sequencing platforms [33][34][35] can be used directly, or raw shotgun data could additionally be used, to infer error rates for specific targets or types of sequence complexity.
The choice of library strategy and sequencing platform for high-accuracy long-read amplicon sequencing depends on the specific application of interest. An overview of time, cost and yield was compared for the three different approaches (Supplementary Tables  8-10); the current projected price per sequence for >Q40 data from rRNA operons is US$0.015 (ONT UMI), US$0.012 (PacBio CCS) and US$0.007 (PacBio UMI), which should be similar for other genetic targets. For rapid testing and iterative development, the ONT UMI approach is attractive due to its low cost and portability. PacBio CCS sequencing also performs well for high-accuracy amplicon sequencing, but the presence of low-abundance chimeric variants is problematic, especially if they propagate into reference databases 36 . For sensitive applications, such as detecting low-abundance variants or generating reference sequences for key databases, the PacBio UMI approach seems to be the best suited.

online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/ s41592-020-01041-y. Note that two of the yeast species were not targeted by PCR amplification of rRNA operons based on the primers used (see "Tagging target gene with UMIs"). The concentration of DNA in the mock community sample was measured on a Qubit 3.0 fluorometer and Qubit dsDNA HS assay kit (Thermo Fisher Scientific) and the quality of the DNA was measured by gel electrophoresis on an Agilent 2200 Tapestation using Genomic Screentapes (Agilent Technologies).

Methods
Tagging target gene with UMIs. PCR was used to target the bacterial 16S-23S rRNA operon and simultaneously tag each template molecule with terminal UMIs. The tailed primers lu_16S_8F_v7 and lu_23S_2490R_v7 were used for PCR (Supplementary Table 1). The first section of both tailed primers is a synthetic priming site used for downstream amplification. The second section is the 'patterned' UMI consisting of a total of 12 random nucleotides (N) and 6 degenerate nucleotides (Y or R), which results in a total of 1.2 × 10 18 possible UMI combinations when UMIs from both terminal ends of a template molecule are concatenated (4 12×2 × 2 6×2 = 1.2 × 10 18 ; Fig. 1a). The last section of the tailed primers consists of the bacterial rRNA operon-specific primer for 27f (ref. 37 ) or 2490r (ref. 38 ), respectively. The PCR reaction contained 5 ng of ZymoBIOMICS Microbial Community DNA Standard, 1 U Platinum SuperFi DNA Polymerase High Fidelity (Thermo Fisher Scientific) and a final concentration of 1× SuperFi buffer, 0.2 mM of each dNTP, 500 nM of each lu_16S_8F_v7 and lu_23S_2490R_v7 primer in 50 µl. The PCR program consisted of initial denaturation (3 min at 95 °C) and two cycles of denaturation (30 s at 95 °C), annealing (30 s at 55 °C) and extension (6 min at 72 °C). The PCR product was purified using a custom bead purification protocol 'SPRI size selection protocol for >1.5-2 kb DNA fragments' (ONT) based on https://doi.org/10.17504/protocols.io.idmca46. CleanPCR (CleanNA) bead solution was used for preparing the custom buffer. Purification was performed according to the custom protocol with the exception of an ethanol concentration of 80% and 0.9× bead solution/sample ratio.

Amplification of UMI-tagged amplicons.
A second PCR was used to amplify the UMI-tagged template molecules. All the UMI-tagged template molecules were added to the reaction containing 2 U Platinum SuperFi DNA Polymerase High Fidelity (Thermo Fisher Scientific) and a final concentration of 1× SuperFi buffer, 0.2 mM of each dNTP, 500 nM of each lu_pcr_fw_v7 and lu_pcr_rv_v7 primer (Supplementary Table 1) in 100 µl. The PCR program consisted of initial denaturation (3 min at 95 °C) and then 25 cycles of denaturation (15 s at 95 °C), annealing (30 s at 60 °C) and extension (6 min at 72 °C) followed by final extension (5 min at 72 °C). The PCR product was purified using a custom bead purification protocol 'SPRI size selection protocol for >1.5-2 kb DNA fragments' (ONT) based on https://doi.org/10.17504/protocols.io.idmca46. CleanPCR (CleanNA) bead solution was used for preparing the custom buffer. Purification was performed according to the custom protocol with the exception of an ethanol concentration of 80% and 0.9× bead solution/sample ratio. The concentration and quality of the PCR amplicons were measured as described above.
To obtain sufficient PCR product for sequencing, a third PCR was performed using the amplicons generated from the second PCR and using the same procedure as before, but with 3 × 100-µl reactions and six cycles of amplification. A large reaction volume was used to minimize the risk of overamplification. The final amount of amplicon DNA generated was 3.5 µg in 55 µl. PacBio Sequel II sequencing of mock rRNA operon amplicons. A total of 2,000 ng of the purified amplicon DNA from the third PCR was sent for PacBio library preparation and sequencing at the DNA Sequencing Center at Brigham Young University (https://dnasc.byu.edu/). Amplicons were incubated with T4 polynucleotide kinase (New England Biolabs), following the manufacturer's instructions. The sequencing library was prepared using SMRTbell Express Template Preparation kit 1.0 following the standard protocol. Sequencing was performed on a Sequel II using a Sequel II Sequencing kit 1.0, Sequel II Binding and Int Ctrl kit 1.0 and Sequel II SMRT Cell 8M, following the standard protocol with 1 h pre-extension and 30 h collection time (Pacific Biosciences). CCS reads were generated using CCS v.3.4.1 (https://github.com/PacificBiosciences/ccs) using default settings.
Data-generation workflow. Pipeline commands to generate data. The exact commands used to generate the data in this study are provided in the Supplementary Note "Pipeline commands to generate data. " Trimming and filtering of raw data (ONT data only). Raw fastq sequence data were trimmed of sequencing adaptors using Porechop with the commands --min_split_ read_size 3,500 --adaptor_threshold 80 --min_trim_size 20 --extra_end_trim 0 --extra_middle_trim_good_side 0 --extra_middle_trim_bad_side 0 --middle_ threshold 80 --check_reads 5,000 (v.0.2.4 https://github.com/rrwick/Porechop). Additionally, the adaptors.py file in Porechop was modified to include possible end-to-end ligation combinations of customized primers. Customized settings and modifications to the adaptors.py file were necessary to correctly split amplicons concatenated in the ligation step of library preparation, which can make up a substantial amount of the data.

Extraction of UMI reference sequences.
To efficiently bin reads according to UMIs on their terminal ends, it was critical to first extract and validate true UMI sequences that could be used as references for subsequent mapping steps. UMI sequences of the correct length (18 bp) were extracted from reads by locating flanking sequences within the custom primers. To do so, the first 200 bp from each terminal end of all reads were extracted and saved into individual files. UMI sequences were extracted from each terminal end file with cutadapt -e 0.2 -O 11 -m 18 -M 18 --discard-untrimmed -g CAAGCAGAAGACGGCATACGAGAT…AGRGTTYGATYMTGGCTCAG -g AATGATACGGCGACCACCGAGATC…CGACATCGAGGTGCCAAAC -G GTTTGGCACCTCGATGTCG…GATCTCGGTGGTCGCCGTATCATT -G CTGAGCCAKRATCRAACYCT…ATCTCGTATGCCGTCTTCTGCTTG in paired-end input mode. This step insured that only reads with UMIs of the correct length (18 bp) in both ends were extracted. UMI pairs were then concatenated and filtered with grep -B1 -E 'NNNYRNNNYRNNNYRNNNNNNYRNNNYRNNNYRNNN' to remove UMI pairs that did not follow the expected pattern. Filtered UMI pairs were clustered using usearch 40 (v.11.0.667) with the commands usearch -fastx_uniques -minuniquesize 1 -strand both and usearch -cluster_fast -id 0.90 -centroids -sizein -sizeout -sort size -maxaccepts 0 -maxrejects 0 -strand both -mincols 34 and all singleton clusters were discarded. Potential chimeric UMIs were flagged if clustered UMI pairs contained a sub-UMI that was observed in another UMI pair with a higher abundance. To facilitate this, sub-UMIs from clustered UMI pairs were clustered using usearch 40 (v.11.0.667) with the commands usearch -cluster_fast -id 0.94 -uc -sizein -sizeout -sort size -maxaccepts 0 -maxrejects 0 -strand both -mincols 17. UMIs that were potentially derived from correct higher-abundance UMIs due to systematic errors were flagged in a similar way to chimeric UMIs but using complete UMI pairs instead of sub-UMIs for clustering with usearch -cluster_fast -id 0.83 -uc -sizein -sizeout -sort size -maxaccepts 0 -maxrejects 0 -strand both -mincols 32. The output from these steps were lists of UMI matches that were used during the binning process to flag chimeras and UMI derivatives.
Binning reads according to UMIs. The reference UMIs of the correct length and pattern were then used to bin reads by mapping raw UMIs. Up to the first 110 bp of each terminal end of the trimmed and filtered reads were extracted with 'gawk' and saved into individual files. The UMI pair reference sequences were split into their corresponding sub-UMIs and mapped to read terminals using bwa with the commands bwa index, bwa aln -n 3 -N, and bwa samse -n 10,000,000. The mapping results were then filtered using SAMtools view -F 20. Mapping results from each end of the reads were merged and a read was binned to a specific UMI pair reference if the following conditions were met: (1) sub-UMIs from the same UMI pair were the best hits for both terminal UMIs in the read; (2) the mapping difference between the query read and each sub-UMI was ≤3 bp; (3) the mean mapping difference and s.d. between all of the query reads and the sub-UMI was ≤3.0/2.0 (ONT R9.4.1), ≤3/2.5 (ONT R10.3) or ≤0.75/1.5 (PacBio); (4) the ratio between the number of UMI binned reads to the size of the UMI reference cluster was ≤6 (ONT R9.4.1), ≤12 (ONT R10.3) and ≤2 (PacBio); and (5) the read-strand fraction of the least abundant strand out of total read coverage was ≧0.3 (ONT only). The output from this step was trimmed and filtered reads divided into UMI bins and stats (UMI bin size, strand information, UMI match error, UMI match s.d. and bin-to-cluster ratio) as well as chimera and derivate check results. A discussion about the impact of the allowed UMI mismatches on the probability of UMI collision occurring within a sequence library is presented in the Supplementary Note.

Generation of UMI consensus sequences.
For each individual UMI bin, a consensus sequence was initially generated using usearch -cluster_fast -id 0.75 -strand both -centroids -sizeout and picking the most abundant centroid. The centroid sequence was used as template for four rounds (ONT R10.3/R9.4.1) or two rounds (PacBio) of polishing using all UMI bin reads with minimap2 (ref. 41  Trimming of UMI consensus sequences. The consensus sequences from all UMI bins were then pooled and trimmed and filtered using cutadapt -m 3,000 -M 6,000 -g AGRGTTYGATYMTGGCTCAG…GTTTGGCACCTCGATGTCG. Consensus sequences not containing both primers were discarded. Phasing of consensus sequences and variant calling. Consensus sequences were phased and used to call variants using a custom workflow. The consensus sequences were first filtered to remove any consensus sequences with a read coverage less than the minimum read coverage to obtain >Q40 data quality (15× for ONT R9.4.1 UMI, 25× for ONT R10.3 UMI, 40× for PB CCS and 3× for PB UMI). The homopolymers were masked in consensus sequences by converting homopolymers of length ≥3 into length 2 to minimize the effect of homopolymer errors on phasing accuracy. The masked consensus sequences were de-replicated using usearch -fastx_uniques -strand both -sizeout -uc and clustered using two rounds of usearch -cluster_fast -id 0.995 -strand both -centroids -uc -sort length -sizeout -sizein and removing clusters of size <3. The reads belonging to each cluster were mapped back to the centroid sequence of the cluster using minimap2 -ax asm5. Genotype likelihoods were estimated from the mappings with bcftools 43 (v.1.9) mpileup -Ov -d 1,000,000 -L 1,000,000 -a 'FORMAT/ AD,FORMATDP' and results were filtered to show positions of SNPs present in ≥2× coverage using bcftools view -i ' AD[0:1-]>2' for each cluster. The list of SNP positions was used to phase the reads within a cluster and a variant was called if ≥3 reads supported a combination of SNPs. Consensus reads were then grouped according to called variants and consensus sequences were regenerated for each variant group. First, the homopolymers were unmasked in consensus reads and a crude variant consensus was generated using usearch -cluster_fast -id 0.99 -strand both -consout -sizeout. The crude variant consensus was polished with a workflow using minimap2 -ax map-ont, bcftools mpileup -Ov -d 1,000,000 -L 1,000,000 -a 'FORMAT/AD,FORMAT/DP' , bcftools norm -Ov, bcftools view -i ' AD[0:1]/ FORMAT/DP>0.5' -Oz and bcftools consensus.
Pipeline parallelization. GNU parallel 44 was used for pipeline parallelization. Data analysis. Error profiling. Detection of error was based on mapping the sequence data (raw reads, consensus sequences, variant consensus sequences) to curated and noncurated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard (see "Generation of reference sequences for mock community"). Mapping was performed with minimap2 -ax map-ont --cs and filtered using SAMtools view -F 2308. References and mappings were imported into the R software environment 45 (v.3.6.0) in RStudio 46 , where errors in sequences were profiled using tidyverse 47 (v.1.2.1) and Biostrings 48 (v.2.52.0) R packages and custom scripts (see "Resource availability"). In brief, errors and their type (mismatch, deletion, insert) were detected from the SAM --cs tags. The relative positions of errors were determined with respect to reference sequence and this was used to categorize errors as being within homopolymer regions (hp+) or not (hp−). The error information was combined with metadata from the UMI binning (UMI bin sizes, UMI cluster sizes) and quality analysis (consensus length, UMI bin contamination estimates, ZymoBIOMICS reference-based taxonomy, SILVA taxonomy, chimera detection; see below for details) used to explore and visualize error as a function of such parameters.
Taxonomic classification of consensus sequences with mock references. Taxonomic classification of UMI/CCS consensus reads was performed by mapping the reads to curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard with minimap2 -ax map-ont --cs and filtered using SAMtools view -F 2,308. Read classification was based on the best hit.
Taxonomic classification of consensus sequences with SILVA database. 16S rRNA sequences were extracted from the rRNA operon consensus sequences with cutadapt --discard-untrimmed -m 1,200 -M 2,000 -a TGYACWCACCGCCCGTC.
Mapping to the curated ZymoBIOMICS Microbial Community DNA Standard rRNA operon reference sequences and the SILVA 138 SSURef Nr99 database was performed with minimap2 -ax map-ont --cs and filtered using SAMtools view -F 2,308. Read classification was based on best hit and error rate was calculated as described above (see "Error profiling"). The SILVA taxonomy and error rate was used to classify consensus sequences as chimeras or contamination.
Estimating UMI bin contamination. Taxonomic classification of raw reads in each UMI bin was performed by mapping the reads to curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard with minimap2 -ax map-ont --cs and filtered using SAMtools view -F 2,308. Read classification was based on the best hit. Contamination was estimated by calculating the fraction of reads not assigned to the most abundant taxonomy in each UMI bin.
Chimera detection. Chimeras in the rRNA operon consensus sequences were detected by usearch -uchime2_ref -strand plus -mode sensitive, using our curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard. Flagged chimeras were manually verified by investigating their error profiles in R (see "Error profiling"; Supplementary Fig. 4).
Generation of rRNA operon reference sequences for mock microbial community. An overview of the approach taken to generate reference sequences for the ZymoBIOMICS Microbial Community DNA Standard is given in the Supplementary Note.
Examination of relative abundance inconsistencies. We observed a difference between the relative abundance estimated with our UMI consensus data and the theoretical abundance for the rRNA operons of the mock community as reported by the manufacturer. We investigated sources of error and bias introduced due to genomic relative abundance, genomic coverage skew due to growth and primer mismatch, as presented in the Supplementary Note. E. coli whole-genome sequencing with UMIs. We conducted whole-genome shotgun sequencing of E. coli str. K-12 substr. MG1655 (DSM 18039) using our UMI-tagging long-read method. Details of the whole-genome shotgun sequencing, including culturing, DNA extraction, library generation, sequencing and data analysis, are presented in the Supplementary Note.

Application of PacBio UMI sequencing of rRNA operons of American Gut
Project samples. PacBio UMI data generation and processing. PacBio library preparation, sequencing and data generation was performed as described in "rRNA operon UMI sequencing of mock Microbial Community Standard" with the PacBio settings and the following exceptions: 1-2 ng of sample DNA was used as input for "Tagging target gene with UMIs". In the third PCR in the library preparation, individual libraries were barcoded by swapping the normal amplification primers for tailed barcode primers (Supplementary Table 1). Overall, batches of 20-25 barcoded libraries were pooled and sent for Sequel II sequencing. After data generation (UMI consensus sequences), the data were de-multiplexed on the basis of barcodes (see "Resource availability").
Taxonomic consistency between 16S V4 fragments and full-length 16S. To test the consistency of the derived data to the existing Earth Microbiome Project 49 16S V4 data, we first extracted full-length 16S sequences from operons using RNAmmer 50 . Sequences were then de-replicated and clustered de novo at 99% similarity using VSEARCH 2.7.0 (ref. 51 ) using the QIIME 2 v.2019.10 (ref. 52 ) q2-vsearch plugin (parameters: --p-perc-identity 0.99). Taxonomy was assigned against Greengenes 13_5 (ref. 53 ) and the 'classify-consensus-vsearch' method of q2-feature-classifier (parameters: --p-strand plus --p-query-cov 0.9 --p-perc-identity 0.9). Next, using redbiom 54 we obtained EMP 16S V4 Deblur sOTU profiles 55 for the samples corresponding to the same extracted DNA from Qiita 56 . Both tables were then aggregated to genus-level relative abundance and filtered to only the set of genera in common (n = 82) between the two tables. The relative abundance of each genus, per sample, from the full-length 16S and the 16S V4 data was then plotted (Supplementary Fig. 13). Plotting was performed in matplotlib 57 and Pearson and Spearman correlations were computed using SciPy 58 .
Taxonomic specificity of operons from real samples. Sequences of individual rRNA genes were identified from full-length operon sequences using RNAmmer 1.2 under the 'Bacteria' mode. The 16S and 23S rRNA sequences were concatenated with a linker of 20 'N' characters in between. Taxonomic assignment was performed using the BLASTn algorithm as implemented in the National Center for Biotechnology Information (NCBI) BLAST+ 2.7.1 to align query sequences against the extended 'Web of Life' database 33 , which contains all 86,200 nonredundant bacterial and archaeal genomes from NCBI RefSeq and GenBank as of March 2017. The E-value threshold was set as 1 × 10 −5 , whereas other thresholds were left as default. For each query sequence, hits with a bit score no more than 10% lower than the top hit were selected and the lowest common ancestor of these hits in the taxonomy tree was assigned to the query sequence. This behavior and threshold