Mining underutilized whole-genome sequencing projects to improve 16S rRNA databases

Current approaches to interpreting 16S rDNA amplicon data are hampered by several factors. Among these are database inaccuracy or incompleteness, sequencing error, and biased DNA/RNA extraction. Existing 16S rRNA databases source the majority of sequences from deposited amplicon sequences, draft genomes, and complete genomes. Most of the draft genomes available are assembled from short reads. However, repeated ribosomal regions are notoriously difficult to assemble well from short reads, and as a consequence the short-read-assembled 16S rDNA region may be an amalgamation of different loci within the genome. This complicates high-resolution community analysis, as a draft genome’s 16S rDNA sequence may be a chimera of multiple loci; in such cases, the draft-derived sequences in a database may not represent a 16S rRNA sequence as it occurs in biology. We present Focus16, a pipeline for improving 16S rRNA databases by mining NCBI’s Sequence Read Archive for whole-genome sequencing runs that could be reassembled to yield additional 16S rRNA sequences. Using riboSeed (a genome assembly tool for correcting rDNA misassembly), Focus16 provides a way to augment 16S rRNA databases with high-quality re-assembled sequences. In this study, we augmented the widely-used SILVA 16S rRNA database with the novel sequences disclosed by Focus16 and re-processed amplicon sequences from several benchmarking datasets with DADA2. Using this augmented SILVA database increased the number of amplicon sequence variants that could be assigned taxonomic annotations. Further, fine-scale classification was improved by revealing ambiguities. We observed, for example, that amplicon sequence variants (ASVs) may be assigned to a specific genus where Focus16-correction would indicate that the ASV is represented in two or more genera. Thus, we demonstrate that improvements can be made to taxonomic classification by incorporating these carefully re-assembled 16S rRNA sequences, and we invite the community to expand our work to augment existing 16S rRNA reference databases such as SILVA, GreenGenes, and RDP.

Microbial genomes have a range of 16S rRNA gene copy numbers (GCNs), from the many Mycobacteria with diversity estimates can be skewed by overestimating taxa with higher GCN and underestimating those with low 52 GCN. A trivial example would be a community of two organisms -one with five rDNA copies, one with a single 53 copy; an even sequencing of the community would show a one-to-five abundance ratio. Second, some organisms 54 have sufficient sequence variability between copies that they may be assigned different taxonomic classifications; the RDP classifier to assign taxonomy to 16S rRNA sequences (Wang et al. 2007). SILVA and Greengenes inherit a sequence's taxonomic assignment from the source database (such as NCBI or EBI). SILVA provides a shows that 9.5% of sequences come from draft genome assemblies; the vast majority (87%) are obtained as 78 amplicon sequences (usually Sanger sequenced), and the remaining 2% come from complete genomes. While 79 Sanger-sequenced amplicons are generally very accurate, a common weakness of draft assemblies from short-read 80 sequencing is incorrect assembly of repeated rDNA regions of a genome, which may be collapsed/merged into 81 a single rDNA. The resulting 16S rRNA sequence could in turn be incorporated into SILVA, GreenGenes, or 82 RDP. This compromises the quality of 16S rRNA databases, and such sequences should be treated with caution. 83 Genome assemblies from short reads are prone to errors in rDNA regions, as the length of the repeated region 84 exceeds read lengths. PCR spanning the rDNA region, followed by Sanger sequencing, or the use of long-read 85 technologies such as PacBio or Nanopore sequencing, can resolve these multiple copies but, as the majority of 86 the data generated over the last two decades comes from short read sequences, fixing collapsed regions remains a

88
The correct re-assembly of multiple rDNA regions of draft genomes can be achieved using riboSeed, which 89 uses a reference genome to help assemble the rDNA regions of a draft genome (Waters et al. 2018). riboSeed 90 exploits the observation that the flanking regions of the rDNA region are highly conserved within a taxon 91 yet variable between rDNA copies in the same genome, by using targeted subassembly to correctly place each 92 re-assembled copy of multi-locus rDNA repeats. The knock-on effect of assembling multi-copy rDNA operons is 93 acquiring highly-accurate 16S rRNA sequences, which can be incorporated into 16S rRNA databases. 94 Here, we present the results of using such an approach to augment existing 16S rRNA databases with 95 newly-assembled sequences from the SRA data corresponding to pre-existing draft genomes. The additional 96 sequences provide greater coverage of ASVs in publicly-available datasets, aiding efforts to understand microbial 97 communities.

98
The Focus16 Pipeline 100 We developed Focus16: a pipeline to augment existing 16S rRNA databases by mining the SRA database 101 for candidate whole-genome sequencing studies for re-assembly. Candidate SRAs are identified, downloaded, taxonomic assignment with Kraken, and formatted for addition to existing databases. Kraken2 assigns taxonomy 104 using exact matches to a lowest common ancestor for k-mers from the whole genome assembly, mitigating the 105 risk of misclassification compared to using a 16S rRNA classifier alone.

106
The pipeline is shown as a flowchart (Figure 1). Details of third-party tools used in Focus16 can be found in 107 the Supplementary Methods section "Third-party software." 108 Given a genus or "Genus species" binomial, the pipeline progresses as follows:  2. Barrnap (Seemann 2020) is used to screen these complete genomes by estimating the 16S rRNA count.

112
Reference genomes with a single 16S rRNA are discarded. This catches two cases: 113 a. An organism may only have a single 16S rRNA. In this case riboSeed assembly will not improve on 114 existing draft genomes.           Figure 1: Flowchart of the pipeline that resolves multi-copy 16S loci from sequenced genomes with reads in SRA (as implemented in Focus16). Candidate reference genomes are downloaded from RefSeq. Reads for each SRA are downloaded and Kraken2 is used to assign taxonomy. Corresponding reference genomes and SRA read sets are identified (using SKESA and Mash), and a new assembly constructed from these using riboSeed to resolve 16S rDNAs. The assembled 16S rRNA regions are then taken forward for phylogenetic reconstruction, or to supplement existing reference databases. Numbers refer to stages outlined in the text; gray lines signify to taxonomic information, and black lines signify to sequence information.

156
The Focus16 pipeline can be installed from PyPI or from the source hosted on GitHub at https://github.com/   Table 1.

174
Unlike the mock communities, the number of genera present in the Endobiota samples is not known a 175 priori. We estimated the abundances of community members by processing the samples through DADA2 in was assigned using DADA2's addSpecies command as described in their manual.

185
In total, 333 unique genera were identified across the four datasets; these were cross-referenced with sraFind 186 and RefSeq, filtering to retain only those with both short-read SRAs available and at least one reference genome Candidate 16S rRNA sequences for a given organism could be extracted from riboSeed's subassemblies or the 191 final de fere novo assembly, we sought to determine which was the best choice for generating sequence to extend 192 the reference databases. Generating the subassembled sequences alone is a less computationally-intensive process 193 than whole-genome assembly, but the final whole-genome assembly step acts as a further refinement of the 194 subassemblies. To find which source is the best for augmenting 16S rRNA databases, we determined the error 195 (SNP/indel) rates by comparing complete genomes to the sequences recovered from either de novo re-assembly 196 of the complete genome, Focus16's "fast" mode (sequences from riboSeed's subassemblies only), and "full" mode 197 (sequences from riboSeed's final de fere novo assembly). The de novo assemblies were accurate but failed to 198 recover many individual 16S rDNAs, as is expected due to the repeated nature of these regions. riboSeed's 199 subassemblies have low error rates and successfully reconstruct the most 16S rDNAs (Figure 2), and as such are 200 the ones we report below and recommend for augmenting a database.
201 Figure 2: Comparing assembly modes for accuracy. SRAs in our dataset that underwent genome completion were used to identify the most accurate method of 16S rRNA sequence assembly. De novo assembly resulted in highly accurate 16S rRNA sequences, but was only able to recover 66 sequences. '-fast' mode proved to be the best tradeoff between accuracy and efficiency. the highest-scoring alignment for each given reference 16S rRNA was used to identify missassemblies relative to 227 the complete genome's 16S rRNA sequence. Alignments shorter than 1400bp were rejected.

229
The DADA2 pipeline was used to process each of the four datasets in Table 1. The resulting sequence tables were 230 combined, and we assigned taxonomy with the naive Bayes classifier implemented in DADA2. This classified 231 sequences at the genus level, and DADA2's "assignSpecies" command was used to assign species-level taxonomy; 232 we enabled the "allowMultiple" parameter to view ambiguities in the assignment. This analysis was used to 233 compare assignment with SILVA 132 alone and assignment with SILVA 132 augmented with sequences generated 234 by Focus16. All scripts can be found in supplementary materials.

Results
Benchmarking Re-assembly Accuracy 237 riboSeed has been shown to generate high-quality reconstructions of each rDNA region when benchmarked Comparing the re-assembled 16S rRNA sequences to the 16S rRNA sequences from complete genomes shows 243 a low error rate, with 53% of sequences being perfect reconstructions and 95% of sequences having fewer than 5 244 errors per Kbp (Figure 3). This confirms that Focus16's best-case accuracy yields perfect reconstructions of the 245 rDNA region; those cases for which reconstruction was imperfect rarely have more than 10 errors (an error rate 246 rarely exceeding 0.7%), and 99% of sequences had fewer than 10 errors per Kbp. This suggests that sequences 247 could be used to augment existing databases; the benefits and consequences of this are presented in the discussion.

248
The error rates for amplicon data in SILVA are difficult to determine; under optimal conditions Sanger sequencing 249 has very low error rates (Shendure and Ji 2008); however when multiple sequences are inadvertently sequenced 250 at the same time (i.e. multiple copies from a single organism), the trace will reflect the differences as short, 251 imperfect, or overlapping peaks. As the trace/quality data for amplicon sequences are not typically available, it 252 is impossible to determine the accuracy of such sequences.

253
Comparing re-assembled 16S rRNA to draft 16S rRNA  In such cases, without the capacity to verify the regions with Sanger or long-read sequencing, determining 261 which sequences are the missassemblies and which should be regarded as true is an impossible task.

262
Augmenting SILVA with results from Focus16

263
Recovering Sequences from Re-assembly 264 Focus16 was used to build an extended database for the three mock datasets described in the DADA2 paper 265 and a real-world dataset from the Endobiota study. From the 85 genera considered, Focus16 processed 2387 266 approximately 23 minutes. Several factors can contribute to failing to recover 16S rRNA sequences from a given 268 SRA, and among these are a too-distant reference genome, low rDNA flanking diversity, low read length, or high 269 read error rates. In total, we recovered 5854 16S rRNA sequences, of which 3008 were unique. Pink triangles show SRAs that were rejected due to limitations in the diversity of available reference genomes, and inverted green triangles show SRAs rejected due to read length, insufficient coverage, poor read quality, etc. A few errors occurred, usually when the SRAs metadata conflicted with the associated sequencing data and caused download errors or errors from reads with incorrect pairing. In these cases, the datasets were discarded.

278
DADA2 was used to identify ASVs from four datasets (Table 1), resulting in a total of 4098 ASVs (109 sequences 279 from the "HMP" dataset, 26 from the "Extreme" dataset, 94 from the "balanced" dataset, and the rest from 280 the EndoBiota study). We then compared the taxonomic results of classification using the SILVA 132 database 281 either as-is, or augmented with novel 16S identified with Focus16.

282
Of the 4098 sequences, Focus16 changed the taxonomic assignment of 20 strains (see Table 2, or Supplementary 283 file STABLE_different_assignment.tab for the actual sequences). Changes could happen in three ways: a 284 previously unclassified ASV gained classification, a previously-assigned ASV gained more species-(or genus-) 285 level details, or a previously-classified ASV was given a different classification. In our dataset, three unclassified 286 strains gained annotations (Table 2 rows 5, 10, and 11) . The remaining 17 had more detail added to the genus 287 or species level; usually, this meant that with SILVA alone a single species classification was given, but with the 288 augmented database, it was indicated that the ASV was ambiguous and could belong to more than one species 289 (   databases, and that the increased diversity results in measurable improvements to taxonomic assignment.

304
Focus16 improved fine-scale taxonomic assignment in two ways: by assigning previously unclassified sequences, 305 and by revealing "overeager" species assignment when a 16S rDNA sequence could have come from two or more 306 species. While at face value this appears to reduce the precision of taxonomic assignment, it reveals cases where 307 species-level assignment was inappropriate. Based on the improved taxonomic assignment in this pilot study of 308 85 genera, we believe a wide-scale application of Focus16 could benefit the community.

309
A natural concern about the approach presented here is the danger of "poisoning" the database with sequences 310 that may or may not be 100% accurate. This is valid concern, but one we believe to be outweighed by the 311 potential of offsetting the known problems with existing 16S rRNA databases. Those 16S rRNA sequences in 312 SILVA originating from amplicons lack to the taxonomic confidence that comes with whole-genome sequencing.

313
Sequences from draft whole-genome assemblies have known issues with rDNA missassembly; in the best case, 314 only one accurate 16S rRNA is represented; in the worst case, the one assembled 16S rRNA sequence may 315 be an amalgamation of the different copies. Until long-read sequencing efforts sufficiently explore the same 316 microbial genomic diversity currently covered by current 16S rRNA databases, these issues must be considered 317 when attempting community analysis via 16S rRNA sequencing. A conservative approach to utilizing sequences 318 recovered with the methodology presented here may be to incorporate a measure of taxonomic assignment 319 confidence, where references sequences originating from long reads, amplicons, draft assemblies, and focus16 320 reassemblies could be weighted appropriately.  The third limitation of this approach is the availability of high-quality closed genomes to use as references.

331
With the increased adoption of long read technologies, we envisage that this limitation will decrease with time; 332 re-running the pipeline as new, complete reference genomes are generated will allow for ongoing improvements 333 to the databases. Eventually, a point will come when Focus16 will no longer be needed as all candidate SRAs