Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing

High-throughput amplicon sequencing of large genomic regions remains challenging for short-read technologies. Here, we report a high-throughput amplicon sequencing approach combining unique molecular identifiers (UMIs) with Oxford Nanopore Technologies or Pacific Biosciences CCS sequencing, yielding high accuracy single-molecule consensus sequences of large genomic regions. Our approach generates amplicon and genomic sequences of >10,000 bp in length with a mean error-rate of 0.0049-0.0006% and chimera rate <0.022%. Main High-throughput amplicon sequencing is a ubiquitous method for studying genetic populations with low-abundance variants or high heterogeneity, including cancer driver genes 1–3 , virus populations 4–6 and microbial communities 7 . Short-read Illumina sequencing . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint

45 50 55 has dominated amplicon-related research due to its unprecedented throughput and low native error-rate of ~0.1%, but its maximum amplicon size of ~500 bp 8 limits important long-range information and assay resolution 9 . Unique molecular identifiers (UMIs) have been applied to enable sequencing of longer amplicons with short-reads via assembly of synthetic long reads 10 . Each template molecule in a sample is tagged with a UMI sequence consisting of 10-20 random bases, which can subsequently be used to sort and analyse reads based on their original template molecule. UMIs can enable sequencing of synthetic long-reads up to ~11,000 bp, but this approach cannot resolve amplicons with repeats longer than the short-read length 11 , which limits its application. The high native error rates of Oxford Nanopore Technologies (ONT) (5-25% 12 ) and Pacific Biosciences (PacBio) (13% 13 ) have, until now, made it difficult to confidently identify true UMI tag sequences necessary to accurately assign raw reads to their template molecules. Furthermore, the combination of UMIs with long-read sequencing is relatively unexplored, and only recently has it been applied with PacBio CCS [14][15][16] , but without using dual UMIs for chimera filtering 17 and profiling the error of the generated consensus sequences.
Here, we present a simple workflow that combines UMIs with sequencing of long amplicons on the ONT and PacBio platforms to produce highly accurate single-molecule consensus sequences with a low chimera rate. To improve recognition of UMI-tagged error-prone reads, we designed UMIs to contain recognizable internal patterns ( Figure 1C and Table S1 ) that avoid error-prone homopolymer stretches 18 , which combined with filtering based on UMI length and pattern allows for robust determination of true UMI sequences in raw error-prone ONT and PacBio data. The residual errors were markedly different in non-homopolymer regions compared to homopolymer regions for both the ONT UMI and PB CCS data, while the PB UMI data was extremely low in both cases ( Figure 2a, c ). The error rate for all error types (deletions, insertions, and mismatches), except homopolymer deletions, stabilized for both ONT UMI and PB CCS above a coverage of 20x ( Figure 2a, Table S2) . The high deletion error rate was primarily due to deletions in long (>4 bp) C and G homopolymers for both data types ( Figure S4-5 , Table S3 ), and reaffirmed that homopolymer-derived errors are a remaining obstacle for lower error rates. For the ONT UMI data, G-insertions in non-homopolymer regions made up the majority of remaining errors. Both the non-homopolymer insertions and the homopolymer deletions were to some degree systematic, with some errors in specific positions being present in >50% of the sequences ( Figure 2c and Figure S4 ). For PB CCS data, the homopolymer deletion errors were not as systematic, but still a major error contributor ( Figure 2b, Figure S5 ). Random mismatch error is the other major source of error for PB CCS data, which probably originate from PCR errors. For the PB UMI data, there are very few errors left (1109 errors in 39678 sequences of ~4400 bp), and thus not enough data to elucidate potential error trends ( Figure S6, Table S3 ).
Characterizing whether the residual error is random or systematic is important for the ability to accurately call variants from single molecule consensus sequences. We naively generated variants from the three data types (data >Q40 threshold but without removing annotating the full length 5S, 16S and 23S within the operons, and searched for these genes against gene-specific databases from the "Web of Life" 26 . Using only the full-length 16S rRNA gene, 11.3% of the sequences could be matched at the strain level to the database, and 38.4% assigned at the species level. By using both the 16S and 23S rRNA genes within an operon, we could assign 22% at the strain level and 72.2% at the species level, representing a significant increase in assignment over using the full length 16S rRNA alone (Chi-squared statistic=124,086, p-value<2.23e-308; Figure 3 ). These results are inline with a recent study of the taxonomic resolution of the rRNA operon 27 . This UMI approach should also enable direct quantification of molecules in a sample 28 , which would be ideal for precise relative abundance estimates. However, the mock community used here, and probably many others, contains a biased fragment size and growth-dependent coverage, preventing proper quantification ( Table S5 and Figure S9-13 ).
The choice of library strategy and sequencing platform for high-accuracy amplicon sequencing depends on the application. An overview of time, cost and yield comparisons was compiled for the three different approaches ( Table S6- For rapid testing and iterative development, the ONT UMI approach is attractive due to its low cost and portability. PB CCS sequencing also performs well for high-accuracy amplicon sequencing, but the presence of low abundant chimeric variants is problematic, especially if they propagate into reference databases 29 . For sensitive applications, such as 180 185 detecting low-abundance variants or generating reference sequences for key databases, the PB UMI approach appears to be the best suited at this time. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint Figure 2: a) Left column: Error rate as a function of the number of reads in each UMI bin for the three data types. Right column: Error rate as a function of the number of reads in each UMI bin split by error type and whether the error fell inside (hp+) or outside (hp-) a homopolymer region. b) The top table shows the mean error rate (+/-standard deviation) of raw reads and consensus sequences (CCS/UMI) with a Q40 minimum and the observed chimera rate. The bottom table summarises the mean error rates for all error and homopolymer types for data with a Q40 minimum. c) Frequency of specific errors are plotted as a function of operon position (bp) for Salmonella operon 7, Bacillus operon 2 and Escherichia operon 7 for ONT UMI, PB CCS and PB UMI respectively. The error frequency is normalized as fractions of sequences containing the error in that position. Errors with ≥ 0.01 frequency have been annotated with error type. +[actg] means insertion -[actg] means deletion and *[actg][actg] means mismatch. Annotated errors in black are in non-homopolymer regions and errors in red are in homopolymer regions.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint 210 215 Figure 3: BLAST-based consensus taxonomic assignment against the Web of Life 86k reference database for whole rRNA operons, using the combination of 16S and 23S rRNAs, and the individual rRNA genes. In the dataset, 253,089 operons were available and used for assignment. Of these, n=253,087 had an annotatable 23S rRNA gene, n=253,088 had annotatable 16S rRNA gene, and n=50,560 had annotatable 5S rRNA gene. All raw and annotatable elements were used in this summary. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. and Qubit dsDNA HS assay kit (Thermo Fisher Scientific), and the quality of the DNA was measured by gel electrophoresis on an Agilent 2200 Tapestation using Genomic Screentapes (Agilent Technologies).

DNA Sequence Library Preparation
Online protocol An interactive step-by-step protocol is available at protocols.io: ( https://www.protocols.io/private/F5C5FE21305911EAAC0B0242AC110003 ). Tagging target gene with UMIs PCR was used to target the bacterial 16S-23S rRNA operon and simultaneously tag each template molecule with terminal unique molecular identifiers (UMIs).
The following tailed primers lu_16S_8F_v7 and lu_23S_2490R_v7 were used for the PCR (see Supplementary Table 1 ). The first section of both tailed primers is a synthetic priming site used for downstream amplification. The second section is the `patterned` UMI consisting of a total of 12 random nucleotides (N) and 6 degenerate nucleotides (Y or R) which results in a total of 1.2x10 18 possible UMI combinations when UMIs from both terminal ends of a template molecule are concatenated (4 12*2 x 2 6*2 = 1.2x10 18 , see Figure 1 ) . The last section of the tailed primers consists of the rRNA operon-specific primer for 27f 1 or 2490r 2 , respectively.

Amplification of UMI-tagged amplicons
A second PCR was used to amplify the UMI-tagged template molecules. All of the UMI-tagged template molecules were added to the reaction containing 2 U Platinum SuperFi DNA Polymerase High Fidelity (Thermo Fisher Scientific, USA), and a final concentration of 1X SuperFi buffer, 0.2 mM of each dNTP, 500 nM of each lu_pcr_fw_v7 and lu_pcr_rv_v7 primers (see Table S1 ) in 100 µL. The PCR program consisted of initial denaturation (3 minutes at 95 • C) and then 25 cycles of denaturation (15 seconds at 95 • C), annealing (30 seconds at 60 • C) and extension (6 minutes at 72 • C) followed by final extension (5 minutes at 72 • C). The PCR product was purified using a custom bead purification protocol "SPRI size selection protocol for >1. 5 To obtain sufficient PCR product for sequencing, a third PCR was performed using the amplicons generated from the second PCR and using the same procedure as before, but with 3 x 100 µl reactions and 6 cycles of amplification. A large reaction volume was used to minimize the risk of overamplification. The final amount of amplicon DNA generated was 3.5 µg in 55 µL.  Additionally, the adaptors.py file in porechop was modified to include possible end-to-end ligation combinations of the customized primers. The customized settings and modifications to the adaptors.py file were necessary to correctly split amplicons concatenated in the ligation step of the library preparation, which can make up a substantial amount of the data.

Extraction of UMI reference sequences
To efficiently bin reads according to the UMIs on their terminal ends, it was critical to extract and validate true UMI sequences that could be used as references for subsequent mapping steps. samse -n 10000000 . The mapping results were then filtered using samtools 6 (v1.9) view -F 20 .
Using the mapping results the clustered UMI pairs were filtered using gawk to remove UMI pairs with a coverage < 3x. Potential chimeras were removed by filtering clustered UMI pairs containing sub UMI that was observed in another UMI pair with a higher abundance. The output from these steps was a list of trusted UMI pairs that could be used as references for binning reads.

Binning reads according to UMIs
The first 90 bp of each terminal end of the trimmed and filtered reads were extracted with gawk and saved into individual files. The UMI pair reference sequences were split into their corresponding sub UMIs and mapped to the read terminals using bwa with the commands: bwa index, bwa aln -n 3 -N , and bwa samse -n 10000000 . The mapping results were then filtered using samtools view -F 20 . Mapping results from each end of the reads were merged, and a read was binned to a specific UMI pair reference if the following conditions were met: A) Sub UMIs from the same UMI pair were the best hits for both terminal UMIs in the read. B) The mapping difference between the query read and each sub UMI was ≤ 3 bp. C) The mean mapping difference between all of the query reads and the sub UMI was ≤ 3.5 (Nanopore only) or ≤ 3 (PacBio only). D) The ratio between the number of UMI binned reads to the size of the UMI reference cluster was ≤ 10 (Nanopore only). E) The read strand ratio (+/-) was in the interval 10 -0.6 to 10 0.6 , which is equivalent to the read strand fraction containing the fewest reads comprising at least 25% of the total data amount (Nanopore only). The output from this step was trimmed and filtered reads divided into UMI bins.

Generation of UMI consensus sequences
For each individual UMI bin, a consensus sequence was initially generated using usearch -cluster_fast -id 0.75 -strand both -centroids , and picking the most abundant centroid. The centroid sequence was used as template for 2 rounds of polishing using all the UMI bin reads with minimap2 7 (v2.17-r954-dirty) -x map-ont and racon 8 Polishing of UMI consensus sequences (Nanopore data only) The racon-polished Nanopore consensus sequences were further polished individually by using all of the reads in each UMI bin and two rounds of Medaka (v0.11.2) ( https://github.com/nanoporetech/medaka ) with the commands medaka mini_align -m and medaka consensus --model r10_min_high_g340 --chunk_len 6000 .

Trimming of UMI consensus sequences
The consensus sequences from all UMI bins were then pooled and trimmed and filtered using cutadapt -m 3000 -M 6000 -g AGRGTTYGATYMTGGCTCAG…GTTTGGCACCTCGATGTCG .
Consensus sequences not containing both primers were discarded.

Phasing of consensus sequences and variant calling
Consensus sequences were phased and used to call variants using a custom workflow. The consensus sequences were first filtered to remove any consensus sequences with a read coverage less than the minimum read coverage to obtain >Q40 data quality (25x for ONT UMI, 40x for PB CCS and 3x for PB UMI). The homopolymers were masked in the consensus sequences by converting homopolymers of length ≥3 into length 2 to minimize the effect of homopolymer errors on the phasing accuracy. The masked consensus sequences were dereplicated using usearch -fastx_uniques -strand both -sizeout -uc and clustered using two rounds of usearch

Pipeline parallelization
Many steps in the pipeline have been parallelized using GNU parallel 10 .

Error profiling
Detection of error was based on a mapping of the sequence data (raw reads, consensus sequences, variant consensus sequences) to curated and non-curated rRNA operon reference

Estimating UMI bin contamination
Taxonomic classification of raw reads in each UMI bin was performed by mapping the reads to curated rRNA operon reference sequences from the ZymoBIOMICS Microbial Community DNA Standard with minimap2 -ax map-ont --cs and filtered using samtools view -F 2308 . Read classification was based on best hit. Contamination was estimated by calculating the fraction of reads not assigned to the most abundant taxonomy in each UMI bin.

Chimera detection
Chimeras in the rRNA operon consensus sequences were detected by usearch -uchime2_ref -strand plus -mode sensitive, using our curated rRNA operon reference sequences from the

Examination of relative abundance inconsistencies
We observed a difference between the relative abundance estimated with our UMI consensus data and the theoretical abundance for the rRNA operons of the mock community as reported by the manufacturer. A potential cause of this discrepancy could be due to problems with the DNA composition of the mock community. This was investigated by comparing rRNA operon relative abundance with theoretical relative abundance estimated from metagenomic Nanopore data from The metagenome data along with the consensus rRNA operon data was imported into the R software environment, and analysed using the tidyverse and Biostrings R-packages along with custom scripts (see Resource availability ). In short the relative abundance of the consensus rRNA operon data was calculated: . To calculate the theoretical relative abundance of the rRNA Loading… operons using the metagenome data, the metagenome data was first filtered to remove reads < 5000 bp. Read length reflects the DNA template length present in a DNA sample, and < 5000 bp templates are unlikely to contain a complete rRNA operon that can be amplified by PCR, and should therefore be disregarded in an analysis of rRNA operon relative abundance. First the theoretical number of rRNA operons was estimated per reference in the metagenome: Loading… . Then the relative abundance was calculated as above. Analysis of genomic relative abundance and coverage skew due to growth.
A bias in relative abundance could occur due to the mock species being in different growth phases at the time of sampling. To investigate the potential contribution of growth to coverage bias, we used metagenomic Nanopore data from ZymoBIOMICS Microbial Community Standards generated internally (see Oxford Nanopore sequencing of mock metagenomic DNA ) and externally (see Generation of rRNA operon reference sequences for mock microbial community ). Nanopore data was mapped to each species reference genome using minimap2 -ax map-ont and calculated genome read coverage per position by using samtools depth -aa . rRNA operon genome coordinates were predicted by barrnap (v.0.9) (available from: https://github.com/tseemann/barrnap ) and species genomes were obtained by de novo assembly (see Generation of Reference Sequences for Mock Community ). The data was imported into R, and used to create read coverage plots ( Supplementary Figure 11 ).

Investigation of PCR primer match.
A bias in relative abundance can be introduced in the first PCR where the rRNA operon is targeted with region-specific primers. If there are mismatches between primers and template, we would expect a lower annealing/amplification efficiency. Primer/template mismatches were estimated using ipcress from the package exonerate (v.2.2) ( Supplementary Table 11 ). Culturing Escherichia coli str. K-12 substr. MG1655 (DSM 18039) was procured from DSMZ in 2010 and stored at -80 ₀ C until use. A culture was grown overnight in 2 x 100 mL LB-media (10 g/L NaCl, 10g/L Tryptone, 5 g/L yeast extract) at 37 ₀ C. Cells were harvested by centrifugation at 7800 RPM for 10 minutes and washed with 1X PBS buffer and finally resuspended in 1 x PBS.

DNA extraction
Genomic DNA was extracted with DNeasy PowerSoil (Qiagen, Netherlands) using standard protocol. The DNA concentration was measured on a Qubit 3.0 fluorometer with the Qubit dsDNA HS assay kit (Thermo Fisher Scientific) and the DNA quality was measured by gel electrophoresis on an Agilent 2200 Tapestation using Genomic Screentapes (Agilent Technologies).

DNA Sequence Library Preparation
Online protocol An interactive step-by-step protocol is available at protocols.io: ( https://www.protocols.io/private/D92C9DC132B111EA92DD0242AC110005 ).

PacBio UMI data generation and processing
PacBio library preparation, sequencing and data generation was performed as described in `rRNA operon UMI sequencing of mock microbial community` with the PacBio settings and the following exceptions. 1-2 ng of sample DNA was used as input for ` Tagging target gene with UMIs `. In the third PCR in the library preparation, individual libraries were barcoded by swapping the normal amplification primers for tailed barcode primers (see Supplementary   Table 1 ). 25 barcoded libraries were pooled and sent for Sequel II sequencing . After data generation UMI consensus sequences, the data was demultiplexed based on barcodes (see

Taxonomic consistency between 16S V4 fragments and full length 16S
To test the consistency of the derived data to the existing Earth Microbiome Project 15 16S V4 data, we first extracted full length 16S sequences from the operons using RNAmmer 16 . The sequences were then dereplicated and clustered de novo at 99% similarity using VSEARCH 410 415 420 425 2.7.0 17 using the QIIME 2 version 2019.10 18 q2-vsearch plugin (parameters: --p-perc-identity 0.99) . Taxonomy was assigned against Greengenes 13_5 19 and the "classify-consensus-vsearch" method of q2-feature-classifier (parameters: --p-strand plus --p-query-cov 0.9 --p-perc-identity 0.9). Next, using redbiom 20 we obtained EMP 16S V4 Deblur sOTU profiles 21 for the samples corresponding to the same extracted DNA from Qiita 22 . Both tables were then aggregated to genus level relative abundance, and filtered to only the set of genera in common (n=82) between the two tables. The relative abundance of each genus, per sample, from the full length 16S and the 16S V4 data were then plotted ( Supplementary Figure 14 ). Plotting was performed in matplotlib 23 , and Pearson and Spearman correlations were computed using SciPy 24 .

Taxonomic specificity of operons from real samples.
Sequences of individual rRNA genes were identified from the full-length operon sequences using RNAmmer 1.2 under the "Bacteria" mode. The 16S and 23S rRNA sequences were concatenated with a linker of 20 "N" characters in between. Taxonomic assignment was performed by using the BLASTn algorithm as implemented in NCBI BLAST+ 2.7.1 to align query sequences against the extended "Web of Life" database 25 , which contains all 86,200 non-redundant bacterial and archaeal genomes from NCBI RefSeq and GenBank as of March 2017. The E-value threshold was set as 1e-5, whereas other thresholds were left as default. For each query sequence, hits with a bit score no more than 10% lower than the top hit were selected, and the lowest common ancestor (LCA) of these hits in the taxonomy tree was assigned to the query sequence. This behavior and threshold are consistent with DIAMOND's taxonomic assignment functionality 26 . The percentage of query sequences assigned to any taxonomic unit at each of the eight standard taxonomic levels were calculated. The taxonomic assignment ratios at species or strain were compared using Pearson's Chi-square test, as implemented in the "chi2_contingency" function of SciPy 1.3.1.

Generation of rRNA operon reference sequences for mock microbial community
We obtained raw fast5 files (ENA accession: ERR2887847) from a previously-reported 27 sequencing effort of the ZymoBIOMICS Microbial Community Standard (D6300, batch ZRC190633) using Oxford Nanopore sequencing. The raw fast5 data was basecalled using the

Supplementary information for
Enabling high-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio CCS sequencing Figure S1: Unfiltered consensus error as a function of read coverage. The plots show consensus error rate as a function of the read coverage before filtering of contamination, chimeras and artefacts. The mean error rate and variance within sliding windows was used to define an error cut-off for that region. Data below the cutoff was flagged as normal ( ), and all data above the cut-off was manually inspected and flagged as either chimeric ( ), contamination ( ), CCS artefact ( ) and unknown ( ), see Figure S2 . Contamination originates from PCR reagents and was removed from the data. Chimeras and CCS artefacts were removed from the data and reported in the chimera rate. The CCS artefacts manifested themselves as long stretches of homopolymer inserts, which seem to be present in some of the raw reads and carried over through CCS processing and polishing. Unknown sequences were left in the dataset. The filtered data is presented in Figure 2 in the main article and was used to calculate error statistics. The CCS data shown has been randomly subset to 1/100 (17948 sequences) to make processing and plotting feasible.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. Figure S2: Example of manual inspection of flagged consensus sequences from Bacillus subtilis . Outlier consensus sequences are shown for the Bacillus subtilis reference. The data is divided by intragenomic operons, and dots signify errors annotated as mismatch, deletion or insert by color. Manual annotations can be seen to the left of the sequences. Chimeras, contamination and artefacts could not be reliably detected by software alone. Therefore, the outliers were flagged depending on error rate and with uchime2_ref chimera detection, and were manually inspected and curated: sequences with errors concentrated in one part of the sequence were flagged as chimeras, sequences with many errors and with a better hit in the SILVA database compared to the ZymoBIOMICS reference were flagged as contamination and sequences with long homopolymer inserts in the PacBio data were flagged as CCS artefacts. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020.  Figure S3: Statistics for ONT UMI consensus sequences from the Escherichia coli genomic shotgun library. We used a shotgun genome library from E. coli str. K-12 substr. MG1655 as a proof of concept for using a UMI approach in context of extreme sequence heterogeneity. UMI adaptors were ligated to sheared E. coli genomic DNA (mean fragment length ~8 kbp) and otherwise processed similarly to the amplicon data, generating 3,658 UMI sequences with a read coverage of 30x with a mean length of 4,476 bp (min = ≥ 2000, max = 10578) and a mean error rate of 0.009% and 0.024% chimera rate. A) Error rate of unfiltered consensus sequences versus read coverage. B) Error rate of filtered consensus sequences versus read coverage. C) Error rate divided by type as a function of read coverage and table of error type statistics for data >Q40. D) Error positions and types of flagged outliers.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint Figure S8: Validation of chimera detection. Chimera detection is notoriously difficult in the presence of sequencing errors and false positives/negatives are unavoidable 1 . For detecting chimeras in the ZymoBIOMICS Microbial Community DNA Standard rRNA operon data we used uchime2_ref in sensitive mode. To validate that closely related chimeras could be identified with this approach, we generated a chimera dataset from the reference sequences in the mock microbial community, which had between 1 to 842 bp differences to the closest matching references. 99.98% of the inter-species chimeras (n = 5000) were detected along with 11.6% of the intra-species chimeras (n = 3123). The plot shows the test results; the data is divided by inter-and intra-species chimeras, and the x-axis shows the number of differences between the chimera and closest matching reference and the y-axis shows the number of chimeras. It is mainly chimeras with few SNPs that are not classified. The chimera detection method proved to generate false positives when contaminating sequences were present. Hence, detected chimeras were validated by manual inspection (see Figure S2). Unfiltered data was used to calculate the relative abundance, as data filtered by read coverage cut-off resulted in adding additional taxa specific biases (see Figure S10 ). The trends were similar in all datasets, irrespective of using UMIs or not.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. Correlation between mean UMI bin size and rRNA operon length coloured by genus. PB UMI data was used to generate the plots. UMI bin size can be used as a proxy for PCR efficiency. UMI consensus sequences originate from a single molecule, and all tagged molecules in sample are amplified using the same synthetic primers and under the same PCR conditions. Differences in relative read coverage per molecule in the final PCR product should therefore only originate from length and nucleotide composition based PCR efficiency biases. There is a clear trend of efficiency depending on length and taxonomy. However, the UMI approach should mitigate the post UMI tagging PCR biases shown in the above plots, which means the observed overall bias in relative abundance ( Table S5 ) must be introduced in the UMI PCR or be present in the template to begin with.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint 90 95 A) B) Figure S12: Read size distribution of mock community metagenome data. Each line plot represents the read size distribution from each mock community species estimated from the Nanopore metagenome data. A) is data generated from a ZymoBIOMICS Microbial Community Standard (even) [D6300, batch ZRC190633] by the Loman lab 2 using the Nanopore GridION and a R9.4.1 flowcell. B) is in-house data generated from a ZymoBIOMICS Microbial Community DNA Standard (even) [product D6306, batch ZRC190811] using Nanopore MinION and a R10 flowcell. Some species have significantly more high molecular weight DNA over 5000 bp compared to some of the other species, which impacts the effective template availability in PCR for long amplicons. For the Loman lab data, the distinct gram+/-dependent trends fragment length is anticipated from their two step extraction protocol used on the mock community. The in-house generated data is generated from DNA standard prepared by the vendor, and here the taxa trend is not as pronounced, but the effect on the effective relative abundance is still dramatic ( Table S5 ). We can only speculate about the cause, but different extraction methods could be the reason. If all samples had been extracted with the same method, the result should be more similar fragment distributions as indicated by the Loman group data. The Nanopore library preparation and sequencing likely had an impact on the observed read fragment distributions, but the Loman group data indicates that extraction method probably is the most important factor.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint Figure S13: Genome and operon GC content (A) Mean GC content of the genome (solid lines) along with the GC content of each rRNA operon (points), arranged by species. Despite different average genome content the GC content of the intragenomic operons is very similar across species (46% to 52%), and can be very different from the genome GC content.
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint Figure S14 . A scatter plot of both molecular preparations showing the relative abundances of bacterial genera within samples (n=70). Each point represents a bacterial genus within a given sample, and shows the observed relative abundance from Earth Microbiome Project 16S V4 derived data, and the observed relative abundance of the full operon data based on taxonomic annotation of the full length 16S. The red line depicts y=x. The relative abundances are significantly correlated with existing V4 data 3 (Spearman: r=0.527, p=1.449e-80; Pearson: r=0.553, p=4.555e-90).
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. B) The number of errors in the PB UMI consensus sequences estimated based on curated rRNA reference sequences. Each point represents a UMI consensus sequence that aligns to a specific reference operon. These observations were confirmed with ONT UMI data indicated errors in the available reference genomes, as was also reported by others 4 . To generate improved rRNA operon references, we first used a long-read first assembly approach, in which publicly available ONT sequence data of the Zymo mock community 2 was assembled into individual reference genomes with Miniasm 5 followed by Racon and Medaka polishing. rRNA operons were extracted from the high-quality long-read assemblies, and SNPs with no Illumina short-read support were manually curated, which . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint were mainly indel errors in homopolymers. In total, we found 49 bacterial rRNA operons with 4-10 copies/species, where 43 operons were unique and had 1-379 intra-species differences ( Table S9 ). The mean difference between the original references and our curated sequences was 0.063% (~2.8 SNP/operon), with a range of 0 -0.47% (0 -21 SNP/operon) .
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint  Table S5: rRNA operon relative abundance estimates. rRNA relative abundance estimated for the different rRNA amplicon data sets (PB UMI, PB CCS, ONT UMI) and compared with abundances estimated from metagenome sequencing data (> 5000 bp used, ONT meta) and the theoretical abundance provided by the vendor. The relative abundance estimates of the mock community was skewed, in the same direction, for all rRNA data types (see Figure S9). If the skew was caused by general PCR bias, we would expect the UMI datasets to be different compared to the CCS dataset, but this is not the case. This indicates the skew originates from the gene specific primers and/or initial template accessibility. Many factors can possibly contribute to the observed skew -most notable are length dependent PCR efficiency ( Figure S10), reference dependent DNA fragment size distribution ( Figure S11), different growth states ( Figure S12), and operon dependent nucleotide composition. The rRNA relative abundance estimated from the ONT metagenome data (ONT meta), is more similar to the rRNA amplicon data, indicating the DNA fragment length plays a major role in the observed relative abundance. A mock community with more defined fragment lengths and genome coverage is needed to evaluate whether the relative abundance estimates of rRNA operons can be applied effectively in microbial ecology. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint Table S9: Difference between intra species rRNA operons. Each table show intra species difference between rRNA operons. Below the diagonal is total differences and above is total indels. The analysis was performed on the curated rRNA operons from the ZymoBIOMICS Microbial Community DNA Standard using CLC genomics workbench v9.5.5 (Qiagen) using the ´Create Alignment´ tool (Gap open cost = 10.0, Gap extension cost = 1.0, End gap cost = Free, Alignment mode = Very accurate (slow), Redo alignments = No, Use fixpoints = No) and the `Create pairwise comparison` tool (default settings).
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020.  . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint Table S11: Estimation of mismatches between primers and rRNA operon sequences. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint 230 was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint ´Raw´ is sequencing data directly after basecalling before any filtering. ´UMI binned´ is data that has been quality filtered and been successfully assigned to a specific UMI bin. ´Consensus´ is data in consensus sequence form, either number of CCS consensus reads or number UMI consensus reads. ´>Q40 Consensus´ is data in consensus sequence form filtered based on coverage to obtain >Q40 data. . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted January 11, 2020. ; https://doi.org/10.1101/645903 doi: bioRxiv preprint