Retrieval of a million high-quality, full-length microbial 16S and 18S rRNA gene sequences without primer bias

Small subunit ribosomal RNA (SSU rRNA) genes, 16S in bacteria and 18S in eukaryotes, have been the standard phylogenetic markers used to characterize microbial diversity and evolution for decades. However, the reference databases of full-length SSU rRNA gene sequences are skewed to well-studied ecosystems and subject to primer bias and chimerism, which results in an incomplete view of the diversity present in a sample. We combine poly(A)-tailing and reverse transcription of SSU rRNA molecules with synthetic long-read sequencing to generate high-quality, full-length SSU rRNA sequences, without primer bias, at high throughput. We apply our approach to samples from seven different ecosystems and obtain more than a million SSU rRNA sequences from all domains of life, with an estimated raw error rate of 0.17%. We observe a large proportion of novel diversity, including several deeply branching phylum-level lineages putatively related to the Asgard Archaea. Our approach will enable expansion of the SSU rRNA reference databases by orders of magnitude, and contribute to a comprehensive census of the tree of life.

Small subunit ribosomal RNA (SSU rRNA) genes, 16S in bacteria and 18S in eukaryotes, have been the standard phylogenetic markers used to characterize microbial diversity and evolution for decades. However, the reference databases of full-length SSU rRNA gene sequences are skewed to wellstudied ecosystems and subject to primer bias and chimerism, which results in an incomplete view of the diversity present in a sample. We combine poly(A)-tailing and reverse transcription of SSU rRNA molecules with synthetic long-read sequencing to generate high-quality, full-length SSU rRNA sequences, without primer bias, at high throughput. We apply our approach to samples from seven different ecosystems and obtain more than a million SSU rRNA sequences from all domains of life, with an estimated raw error rate of 0.17%. We observe a large proportion of novel diversity, including several deeply branching phylum-level lineages putatively related to the Asgard Archaea. Our approach will enable expansion of the SSU rRNA reference databases by orders of magnitude, and contribute to a comprehensive census of the tree of life.
In 1990 two studies reported the first few microbial 16S rRNA gene sequences from complex environmental samples, and provided a first glimpse of the vast, unknown microbial diversity present on Earth 1,2 . In recent years, high-throughput DNA sequencing of short variable regions in the SSU rRNA gene has formed the backbone of most microbial ecology studies. However, the usefulness of SSU rRNA fragments is highly dependent on the underlying reference databases of near full-length SSU rRNA gene sequences. Over the past two decades, two million near full-length SSU rRNA gene sequences have been deposited in databases (SILVA 3 SSU Ref v. 128). These sequences represent only a fraction of the estimated microbial diversity on Earth, which has been proposed to range from millions to trillions of species 4 . Most full-length SSU rRNA gene sequences are generated by PCR, cloning and Sanger sequencing, but the cost per sequence prevents the large-scale sequencing projects that are needed to properly populate the tree of life from being carried out. Due to technological limitations, the full-length SSU rRNA genes (1,400-1,900 bp) cannot be directly sequenced with cost-effective, short-read, high-throughput sequencing methods 5 . Long-read technologies combined with circular consensus error-correction (PacBio [6][7][8] or Oxford Nanopore 9 ) can be used to generate full-length 16S rRNA gene sequences reaching an average consensus error rate as low as 0.5% 6 . Furthermore, high-quality, full-length 16S rRNA gene sequences have also been obtained by synthetic long reads using molecular tagging of shortread Illumina data 10 . Despite this recent progress, long-read methods are not high-throughput and depend on primers to amplify the SSU rRNA gene, which greatly limits our ability to discover novel diversity 11 , especially for Archaea and Eukaryota, which lack good universal full-length primers 12 .
To avoid the primer biases that have been detected in full-length SSU rRNA gene sequencing, we combined and optimized methods for producing cDNA of full-length SSU rRNA 13,14 with synthetic long-read sequencing enabled by molecular tagging 10,[15][16][17] . Fulllength SSU rRNA molecules were enriched from extracted total RNA by size selection and converted to double-stranded cDNA by poly(A) tailing and single-stranded ligation, thereby avoiding the use of conventional SSU rRNA gene PCR primers and the resulting taxonomic bias 11 ( Fig. 1a and Supplementary Fig. 1). During first-and second-strand cDNA synthesis, the individual SSU rRNA molecules were uniquely tagged on both termini. This enabled preparation of short-read sequencing libraries, in which the resulting individual sequencing reads were tagged according to the original template molecule. By sorting the short reads into separate bins based on their tags, full-length SSU rRNA molecules can be reconstructed by de novo assembly of the individual bins (Supplementary Code). Furthermore, we developed a primer-based version of our approach (Supplementary Fig. 2) to be able to directly demonstrate the benefits of not using primers.
To estimate the error and chimera rate of our method, we applied it to a mock community containing Escherichia coli MG 1655, Bacillus subtilis str. 168 and Pseudomonas aeruginosa PAO1, each with multiple 16S rRNA gene copies (4-10×) that differ internally in 0 to 19 positions (up to 1.3% internal divergence). In a single Illumina MiSeq run, we generated 10,575 16S rRNA gene sequences longer than l e t t e R S 1,200 bp each (median 1,533 bp, Fig. 1b), with an average raw error rate of 0.17% (Fig. 1c) and a chimera rate of 0.4%. The raw error rate corresponds well with the errors introduced by the Taq DNA polymerase used in the PCR steps 18 . The chimera rate of 0.4% is ~50 times lower than that observed in conventional PCR-based studies 19 . This low error rate enabled assignment of all full-length 16S rRNA sequences to their respective operons, exemplifying the resolving power of our method (Fig. 1d). Interestingly, for B. subtilis, three of the rRNA operons (rrnI, rrnH and rrnG) were not expressed (confirmed by RNA-seq data, Supplementary Table 1). These operons are located closely together in the genome and also regulated by the same promoter 20 .
To demonstrate the application of our method to complex environmental samples, we generated 2,285,691 primer-free RNA sequences (>1,200 bp, median 1,492 bp; Supplementary Fig. 3 Tables 2 and 3). On a single Illumina MiSeq run we produced up to 54,489 rRNA sequences, whereas a single Illumina HiSeq2500 Rapid Run generated up to 541,676 rRNA sequences (>1,200 bp). SSU rRNA made up 24-76% of all sequences, while large subunit (LSU) rRNA fragments made up the majority of the remaining sequences. The relatively large fraction of LSU rRNA was unexpected, as the SSU rRNA peak was enriched using size selection (Fig. 1a) Figure 1 Full-length SSU rRNA sequencing. (a) Schematic overview of the preparation of full-length SSU rRNA gene sequences from total community RNA (a more detailed version is shown in Supplementary Fig. 1, and Supplementary Fig. 2 outlines the primer-based version). First, SSU rRNA is enriched from extracted total community RNA by gel electrophoresis size selection. Enriched SSU rRNA is polyadenylated, and the polyA-tail serves as a generic priming site for first-strand cDNA synthesis (reverse transcription). A second generic priming site is added to the end of the first cDNA molecule by single-strand ligation, and serves as the priming site for second-strand cDNA synthesis. PCR adaptors that are incorporated during firstand second-strand synthesis contain unique tags (green and blue), which, in combination, function as unique 'linked-tags' for each cDNA molecule. These adaptors also contain generic priming sites that are used for PCR amplification of tagged molecules. The amplicons are size-selected by gel electrophoresis to remove incomplete or truncated products. A defined number (e.g., 100,000) of full-length SSU rRNA amplicon molecules are amplified with PCR to generate >10,000 copies of each uniquely tagged amplicon. The clonal amplicon library is split in two. The major part is used to prepare a read-tag library and the last part is used to prepare a linked-tag library. The read-tag library is prepared by fragmenting the full-length SSU rRNA amplicons using the Illumina Nextera tagmentation and library preparation kit. The resulting sequence is an internal SSU rRNA fragment read connected to a single, unique, tag read. The linked-tag library is prepared by circularizing full-length SSU rRNA amplicons by intra-molecular ligation, to physically link the tags in close proximity. PCR is used to amplify the linked-tags, which are then identified by sequencing. The linked-tags are used to bin all SSU rRNA fragment tag-reads originating from the same parent molecule. De novo assembly is used to re-create the parent SSU rRNA gene sequence. The resulting sequences are trimmed to remove adaptors, filtered and classified. (b) Size distribution of assembled SSU rRNA gene sequences from the mock community. (c) Error count distribution for raw SSU rRNA sequences from the mock community (numbers indicate percent of all 16S rRNA gene sequences). (d) The relative abundance of the different 16S rRNA molecules for B. subtilis subsp. subtilis str. 168. l e t t e R S has been observed before 13,14 and can be explained by the presence of fragmented LSU, similar in size to the SSU, contaminating the enrichment. Fragmented LSU originates from degradation of the rRNA during extraction, in situ degradation due to starvation or environmental stress 21 , and from the occurrence of bacteria and lower eukaryotes with nicked rRNA molecules in their ribosomes 22,23 .
A total of 985,266 primer-free bacterial 16S rRNA gene sequences (>1,200 bp) were obtained from samples taken in seven different environments. With the primer-based version of our method, an additional 477,055 bacterial 16S rRNA gene sequences (>1,200 bp) were generated. To generate high-quality SSU rRNA gene operational taxonomic units (OTUs), we aligned the sequences to the SILVA alignment and used only full-length sequences for clustering at 97% similarity. A total of 44,902 bacterial OTUs were produced, of which 31,125 were observed more than once. The obtained OTUs represent 65 out of the 75 named bacterial phyla (SILVA SSU Ref v. 128), including most of the known candidate phyla (Fig. 2a, Supplementary Fig. 4 and Supplementary Table 4). Compared to the SILVA database, 58% of the full-length OTU sequences represent new diversity (>3% different from closest relative) (Fig. 2b and Supplementary Table 5). The degree of novelty was highly ecosystem-specific; for example, in the sediment samples, 67% (n = 25,694) of the bacterial OTUs were novel, whereas 40% (n = 828) of the OTUs were novel in the human gut sample.
For the Archaea 61,266 16S rRNA gene sequences (>1,200 bp) were obtained, which is more sequences for this domain than is present in the entire SILVA SSU Ref v. 128 database (39,138 sequences > 1,200 bp). After clustering, this resulted in 3,410 full-length OTUs (97% similarity) of which 2,197 were identified more than once. A total of 70,883 eukaryotic 18S rRNA gene sequences (>1,200 bp) were also obtained, which resulted in 415 full-length OTUs observed more than once. The number of OTUs we present is a conservative estimate, because the length of the full-length eukaryotic 18S rRNA gene is close to the upper limit (~2,000 bp) that can be sequenced using this method 24 . The full-length Eukaryota OTUs only capture 55% of the total eukaryotic sequences generated (>1,200 bp) when mapping these to the OTUs, while this number is 90% and 94% for Archaea and Bacteria, respectively.
We assessed the coverage of SSU rRNA gene PCR primers, which are commonly applied to analyze complex microbial communities, on the full-length SSU rRNA gene sequences (clustered at 97%) using PrimerProspector 25 . This enabled us to estimate the microbial diversity that would be missed by using SSU rRNA gene PCR primers. As is to be expected, the amount of 'missing diversity' was dependent on the combination of primer set and sample type. For example, the primer pair 27F/1492R theoretically missed 8.5-14.7% of the fulllength OTUs (Supplementary  l e t t e R S with results from a recent analysis of primer bias in metagenomics data sets, which estimated that a minimum of 9.6% of the bacterial diversity in any study might be missed using conventional 16S rRNA gene PCR primers 11 . To estimate potential biases using our primer-free full-length SSU rRNA sequencing method, we sequenced conventional shotgun RNA-seq libraries from three samples (sediment, soil and human gut) and compared these data with our full-length SSU rRNA data. Since no complete database of SSU rRNA sequences exists, the potential bias was estimated by comparing the fraction of RNA-seq reads (n = 1,000,000) that could be mapped to the SILVA SSU Ref NR99 v. 128 database, but not to the primer-free SSU rRNA gene sequences (normalized to 34,000 sequences pr. sample). For Bacteria, 3.7-6.6% of the putative SSU rRNA RNA-seq reads were not mapped to the primer-free SSU rRNA gene sequences (Supplementary Table 7), and rarefaction curves indicated that this was mainly an effect of insufficient sampling depth (Supplementary Fig. 5). Hence, in these environments, no substantial bias could be detected (Supplementary Note 1).
Based on proposed sequence similarity thresholds for the Bacteria and Archaea 26 , numerous novel class, order and family level lineages were obtained in the current study, for both well-represented phyla, such as Proteobacteria, the more recently defined Patescibacteria 27 / Candidate Phyla Radiation 28 (CPR) and the Asgard Archaea superphylum 29 (Supplementary Fig. 6, and Supplementary Tables 4 and 5). Interestingly, phylogenetic analyses revealed several deeply branching lineages within the Archaea that are not affiliated with existing defined phylogenetic clades in the SILVA SSU Ref v. 128 database (Fig. 3). The similarity to the database for these sequences was below the suggested threshold of 75% for the delineation of phyla 26 . Phylogenetic analyses indicate that these sequences were related to the Asgard Archaea superphylum. However, given their small numbers and low similarity to other sequences, more data are required to confidently resolve these phylogenetic relationships (we confirmed several of these novel lineages by independent primer-based PCR). The Asgard Archaea have been proposed to represent the ancestors of the Eukaryota, so resolving their phylogeny, along with deeply branching eukaryotes, will provide insight into several important questions regarding the evolution of life 29 . While these questions are best resolved with the full repertoire of phylogenomic analyses, our method can be used to catalog the diversity of deeply branching phyla and provide an important first step to identify species to target for genome retrieval using, for example, genome-centric metagenomics 30 .
The primer-independent aspect of our method is particularly relevant to studies of the Archaea, because there is a lack of sufficient 16S rDNA primer sets to map the diversity in this domain 12 . This problem is compounded by the primer biases of commonly used archaeal primers, for which the coverage, in general, is poor, especially for those primer sets designed to span the full length of the archaeal 16S rRNA gene; for example, in silico analysis predicted that full-length primer sets miss 28-77% of the Archaea OTUs in the sediment samples (Supplementary Table 8).
It is difficult to assess the substantial diversity of eukaryotic sequences obtained in this study ( Fig. 2a and Supplementary  Fig. 7), given that sequence divergence does not consistently correlate with phylogenetic definitions for this domain. Plus, use of the SSU gene for phylogenetic delineation is still under development for this domain, especially for the unicellular micro-eukaryotes 31 . The importance of micro-eukaryote diversity to the ecology of systems is often overlooked, as the SSU rRNA primer-based molecular tools that are available provide poor coverage 32 . For example, the majority of the novel soil organism diversity that we observe is in the Amoebozoa phylum, which is known to be grossly underrepresented using conventional methods 33,34 . Given the important roles already assigned to amoebozoids in shaping the soil bacterial community (owing to their role as bacterial predators 35 ), improving the detection of this phylum is crucial to better understand soil ecology.
The 1,168,276 LSU rRNA sequences (>1,200 bp) generated in our study comprise a data set that is bigger than the complete SILVA LSU Parc v. 128 database (735,238 sequences; >300 bp). Although not full length, these sequences dramatically increase the potential of the multitude of in situ studies that rely on rRNA, for example, fluorescence in situ hybridization (FISH) as numerous new regions for probe-design are now readily available. As the Illumina synthetic long-read part of our method is limited to <2,000 bp, we tried to replace the Illumina sequencing with Oxford Nanopore sequencing, which produces longer but error-prone reads. Similarly to the Illumina approach, the unique tags were used to bin the Nanopore reads followed by consensus error correction (Supplementary Fig. 8 l e t t e R S In this proof-of-principle analysis we found that it was possible to use our method for error correction of Oxford Nanopore reads, which in the future might enable sequencing of high-quality, primer-free, full-length LSU rRNA sequences. As the reference databases for both SSU and LSU rRNA gene sequences improve, we anticipate improvements in the design of primers that target entire rRNA operons, which would greatly enhance the phylogenetic resolution. Our error-correction Oxford Nanopore sequencing strategy could be used to sequence entire rRNA operons, because Oxford Nanopore sequencing has fewer length limitations In summary, we produced >1.6 million SSU rRNA/rRNA gene sequences (>1,200 bp), which is in the same range as the number of SSU rRNA gene sequences (>1,200 bp) currently available (1.9 million in SILVA SSU Ref v. 128). Our method is scalable and is optimized for Illumina synthetic long-read sequencing, but can be used with the Oxford Nanopore platform and, in principle, with any other long-read technology. We anticipate that use of this method will rapidly increase the number and diversity of full-length SSU rRNA gene sequences. Our method will also enable improved coverage of environments that are currently poorly represented in the databases 36 . In particular, we envisage improvements in coverage of Archaea and Eukaryota, for which a lack of adequate full-length SSU primers has hampered the development of the reference databases 12 . Efforts such as the eukaryotic reference database initiative (eukref.org) underline the recognized need (and effort) to improve these databases. Finally, we note that our method is not limited to producing only reference sequences and can be used as a substitute for conventional amplicon sequencing of any marker gene <2,000 bp to better document diversity in microbiomes.
Improved SSU rRNA reference databases will improve the utility of ecosystem-specific databases, such as the human oral microbiome database 37 . Albeit decentralized, these databases are easier to curate, meaning that more information can be assigned to individual organisms based on ecosystem context 38 , and taxonomic assignments using these databases will in turn be more robust 39 . High-quality ecosystem-or study-specific databases might also enable the design of specific amplicon sequencing primers and FISH probes. For example, it would be possible to design FISH probes with better accuracy that will increase the resolution of in situ single-cell physiology studies, which are invaluable for studying in situ niche differentiation 40 .

MetHOdS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.

ONLINe MetHOdS
Detailed pictorial overviews of the workflows for the primer-free and primerbased methods are shown in Supplementary Figures 1 and 8 44 . The bacteria were streaked on LB agar plates and incubated at 30 °C (B. subtilis) or 37 °C (E. coli and P. aeruginosa) until colonies appeared. A single colony of each bacterium was used to inoculate 50 mL of LB-media in a 250-mL conical flask, and the bacteria were grown overnight at the temperature provided above with 150 r.p.m. of shaking. The cultures were divided into 1.5-mL aliquots and bacteria pelleted by centrifugation (16,200g, 4 °C, 1 min). The cell pellets were stored at −20 °C. RNA was purified from the cell pellets using the RiboPure RNA purification kit for bacteria (Thermo Fisher Scientific). Total RNA from each of the three bacteria were mixed in equal amounts by weight to create the mock community RNA sample.
Fresh water. We filtered the sample through a 5-µm polypropylene filter (Merck Millipore) to remove large particulate matter, acknowledging some eukaryotic organisms would be lost. Biomass was then collected from 100 mL of the pre-filtered water on 0.2 µm cellulose nitrate filters with a filtration area of 18 cm 2 (Sartorius Stedim Biotech). RNA and DNA were isolated from filters that had been rolled into cylinders with the deposited biomass facing inward using the RNA PowerSoil total RNA isolation kit with the RNA PowerSoil DNA elution accessory kit (MO BIO Laboratories). DNA-free total RNA was obtained by treating a subsample of the purified RNA with the DNase Max kit (MO BIO Laboratories) followed by cleanup using 0.65× RNAClean XP beads with elution into 25 µL nuclease-free water.
Soil and sediment. Soil samples were collected at various depths by horizontal coring from the inner wall of vertical holes. Sediments samples were collected at various depths (5-40 cm) of marine, brackish and freshwater sediments using a handheld piston corer. All samples were transported on ice and processed immediately or stored at −20 °C. RNA and DNA were isolated from 2 g of sample using the RNA PowerSoil total RNA isolation kit with the RNA PowerSoil DNA elution accessory kit (MO BIO Laboratories). DNA-free total RNA was obtained by treating a subsample of the purified RNA with the DNase Max kit (MO BIO Laboratories) followed by cleanup using 0.65× RNAClean XP beads with elution into 25 µL nuclease-free water.
Human gut sample. Eight grams of fresh feces was mixed with 16 mL of RNAlater (Thermo Fisher Scientific) by vigorous shaking in a 50 mL tube. The sample was centrifuged (700g, 4 °C, 5 min) to remove solid fecal matter. Bacteria from 2 mL of the supernatant were pelleted by centrifugation (9,000g, 4 °C, 5 min). Total nucleic acid was immediately extracted from the pellet using the PowerMicrobiome RNA isolation kit (MO BIO Laboratories) with the optional phenol-based lysis. Purification was carried out according to the manufacturers' recommendations except that cell lysis was performed in a FastPrep-24 instrument for 4 × 40 s at 6.0 m/s to increase the yield of nucleic acids from bacteria with tough cell walls 45 . The sample was incubated on ice for 2 min between each bead beating cycle to prevent heating of the sample. DNA-free total RNA was obtained by treating a subsample of the purified nucleic acid with the DNase Max kit (MO BIO Laboratories) followed by cleanup using 0.65× RNAClean XP beads with elution into 25 µL nuclease-free water.
Anaerobic digester sludge. Anaerobic digestion biomass from a mesophilic biogas reactor at a wastewater treatment plant was transported on ice and stored as 2-mL aliquots at −80 °C. Total nucleic acids were purified from 500 µL of sample using the PowerMicrobiome RNA isolation kit (MO BIO Laboratories) as described for the human gut sample. DNA-free total RNA was obtained by treating a subsample of the purified nucleic acid with the DNase Max kit (MO BIO Laboratories) followed by cleanup using 0.65× RNAClean XP beads with elution into 25 µL nuclease-free water.
Activated sludge. Activated sludge from a wastewater treatment plant was transported on ice and processed within 4 h. The sample was homogenized (1 min, 1,650 r.p.m.) using a tissue homogenizer (30 mL) mounted on a Heidolph RZR 2020 (Heidolph, Germany) and stored as 2-mL aliquots at −80 °C. Total nucleic acids were purified from 500 µL of sample using the PowerMicrobiome RNA isolation kit (MO BIO Laboratories) as described for the human gut sample. DNA-free total RNA was obtained by treating a subsample of the purified nucleic acid with the DNase Max kit (MO BIO Laboratories) followed by cleanup using 0.65× RNAClean XP beads with elution into 25 µL nuclease-free water.
RNA based tagged amplicon preparation SSU rRNA size selection. SSU rRNA was isolated from total RNA by size selection on an E-gel electrophoresis system with precast E-Gel CloneWell gels (Thermo Fisher Scientific). Standard operation procedure was used with the following exceptions. Total RNA was heat denatured at 70 °C for 5 min and snap-cooled on an ice block for 2 min to prevent RNA from refolding. Up to 800 ng total RNA sample was added per well. A total of 500 ng GeneRuler 1 kb DNA ladder (Thermo Fisher Scientific) was used as a reference. The gel was run until the SSU peak (~1,500 bp) was 1 mm from the elution well, after which 20-µL elution aliquots were sampled every 15 s, up to a total of 16 aliquots, and the visible passing of the SSU rRNA peak. Every two aliquots were pooled, yielding eight pooled aliquots per sample. These were then analyzed on an Agilent 2200 Tapestation using the High Sensitivity RNA Screentape. Aliquots containing SSU rRNA were pooled, and purified using 1.0× RNAClean XP beads (Beckman Coulter) and eluted in 32 µL nuclease-free water.
Poly(A) tailing of SSU rRNA. Poly(A) tails were added to the purified SSU rRNA in a reaction containing 30 µL purified SSU rRNA (10-200 ng RNA), 4 µL 10X E. coli Poly(A) Polymerase Reaction Buffer (New England Biolabs), 4 µL 10 mM ATP (New England Biolabs), 1.5 µL E. coli Poly(A) Polymerase (New England Biolabs) and 1 µL RiboLock RNase Inhibitor. The reaction was incubated at 37 °C for 30 min. The product was immediately purified using 1.0× RNAClean XP beads and eluted in 15 µL nuclease-free water. Samples were validated on an Agilent 2200 Tapestation using RNA Screentape. Sample with concentrations higher than 3 ng/µL were diluted to 3 ng/µL using nuclease-free water based on the gel electrophoresis results.
First strand cDNA synthesis. The priming adaptor was first annealed to the poly(A) SSU rRNA in a reaction containing 11.5 µL poly(A) SSU rRNA (5-30 ng), 0.5 µL 100 µM SSU_rRNA_RT, 1 µL 10 mM dNTP mix. The annealing reaction was incubated at 70 °C for 5 min and then snap-cooled for 2 min on an ice block. Annealing was followed by a reverse transcription reaction, which contained the 13 µL primed SSU rRNA sample, 4 µL 5× SSIV buffer (Thermo Fisher Scientific), 1 µL 200 U/µL SuperScript IV Reverse Transcriptase (Thermo Fisher Scientific), 1 µL 100 mM DTT (Thermo Fisher Scientific), and 1 µL RiboLock RNase Inhibitor (Thermo Fisher Scientific). The reverse transcription was carried out at 50 °C for 50 min was followed by inactivation at 80 °C for 10 min. The sample was cooled to 37 °C, and 1 µL 10 U/µL RNase H from E. coli (cloned) (Thermo Fisher Scientific) was added. The sample was then incubated at 37 °C for 20 min to digest the RNA template. The cDNA was purified using 0.65× AMPure XP beads and eluted in 11 µL nuclease-free water.
Remaining ssDNA was removed by single-strand-specific nuclease digestion to improve the efficiency of the subsequent PCR reaction. The single-stranded nuclease digestion reaction contained 23 µL purified ds cDNA product, 3 µL 10X S1 nuclease buffer (Thermo Fisher Scientific), 3 µL 3 M NaCl (Thermo Fisher Scientific), 1 µL 10 U/µL S1 nuclease (Thermo Fisher Scientific). The reaction was incubated at 25 °C for 25 min and terminated by addition of 2 µL 0.5 M EDTA and heating to 70 °C for 10 min. The digested product was diluted to 50 µL with nuclease-free water and purified using 0.6× AMPure XP bead and eluted in 12 µL nuclease-free water.
Library amplification. The ds cDNA was amplified using PCR to obtain enough product for validation and size selection. The reaction contained 10 µL ds cDNA, 63.5 µL nuclease-free water, 10 µL 10× QIAGEN PCR buffer, 2 µL 10 mM dNTP, 5 µL 10 µM SSU_rRNA_pcr_fw, 5 µL 10 µM SSU_rRNA_pcr_ rv, 4 µL 25 mM MgCl 2 , and 0.5 µL 5 U/µL Taq polymerase. The reaction was incubated with an initial denaturation at 94 °C for 3 min followed by 20 cycles of denaturation at 94 °C for 30 s, annealing at 62 °C for 30 s, and extension at 72 °C for 2 min and then a final extension at 72 °C for 5 min. The PCR product was purified using 0.6× AMPure XP bead and eluted in 20 µL nuclease-free water. The cDNA amplicon was validated with a D5000 screentape.
Library size selection. Presence of unspecific and incomplete products was frequently observed after the PCR. To increase the final yield of full-length SSU rRNA amplicons, a size selection on a precast E-Gel CloneWell gel was performed as described earlier, using 100-300 ng cDNA amplicon. Size-selection aliquots were validated on a High Sensitivity D5000 Screentape. Aliquots containing full-length SSU rRNA amplicons were pooled and purified using 0.6× AMPure XP bead and eluted in 15 µL nuclease-free water. The quality and concentration of the tagged amplicon product were analyzed on a High Sensitivity D5000 screentape and with the Qubit dsDNA HS Assay Kit.
DNA based tagged amplicon preparation. Adaptor annealing. Tagged bacteria 16S amplicons were generated using genomic DNA as the template and a method inspired by Burke et al. 10 . Adaptors containing unique molecular tags and defined primer binding sites were added to each end of the bacterial 16S rRNA genes by PCR targeting the bacteria 16S (27f/1492r primers). The reaction contained 10 µL 10× PCR Buffer (Qiagen), 2 µL 10 mM dNTP (Qiagen), 5 µL 10 µM B16S_rDNA_27F, 5 µL 10 µM B16S_rDNA_1492R, 4 µL 25 mM MgCl 2 , 72.5 µL nuclease-free water, 1 µL of template DNA (20-200 ng/µL), and 0.5 µL 5 U/µL Taq polymerase (Qiagen). The reaction was incubated with an initial denaturation at 94 °C for 3 min followed by two cycles of denaturation at 94 °C for 30 s, annealing at 56 °C for 30 s, and extension at 72 °C for 3 min and then a final extension at 72 °C for 5 min. The sample was purified using 0.6× AMPure XP bead and eluted in 11 µL nuclease-free water.
Library amplification. The tagged 16S rDNA amplicons were amplified using PCR to obtain enough product for validation and size selection. The reaction contained 10 µL of adaptor annealed sample, 63.5 µL nuclease-free water, 10 µL 10× QIAGEN PCR buffer, 2 µL 10 mM dNTP, 5 µL 10 µM SSU_ rRNA_pcr_fw, 5 µL 10 µM SSU_rRNA_pcr_rv, 4 µL 25 mM MgCl, and 0.5 µL 5 U/µL Taq polymerase. The reaction was incubated with an initial denaturation at 94 °C for 3 min followed by 20-25 cycles of denaturation at 94 °C for 30 s, annealing at 62 °C for 30 s, and extension at 72 °C for 2 min and then a final extension at 72 °C for 5 min. The PCR product was purified using 0.6× AMPure XP bead and eluted in 20 µL nuclease-free water. The amplicon was validated on a D5000 screentape.
Library size selection. The presence of unspecific products and template DNA was frequently observed after the PCR. To increase the final yield of full-length SSU rRNA amplicons, a size selection on a precast E-Gel CloneWell gel was performed as described earlier using up to 400 ng PCR product. Size-selection aliquots were validated on a High Sensitivity D5000 Screentape. Aliquots containing full-length 16S rDNA amplicons were pooled and purified using 0.6× AMPure XP bead with elution into 15 µL nucleasefree water. The quality and concentration of the tagged amplicon product were analyzed on a High Sensitivity D5000 screentape and with the Qubit dsDNA HS Assay Kit.

Read-tag and linked-tag library preparation. Read-tag library preparation.
A Nextera library preparation kit (Illumina) was used to prepare a paired-end read-tag sequencing library from the clonal library using a customized protocol. A tagmentation reaction was prepared with 100 ng of the clonal library in 22.5 µL nuclease-free water, 25 µL tagment DNA buffer (Illumina) and 2.5 µL tagment DNA enzyme (Illumina). The reaction was incubated at 55 °C for 5 min. The product was immediately diluted to 100 µL and purified using 0.6× AMPure XP bead with elution into 42 µL nuclease-free water.
The tagmentation products were PCR-amplified using two separate PCRs A and B. PCR A selectively amplified fragments containing the 5′ termini of the cDNA amplicons and PCR B selectively amplified fragments containing the 3′ termini. The reactions contained 5 µL N504 Nextera adaptor (Illumina), 5 µL 10 µM SSU_rRNA_readtag_fw (PCR A) or SSU_rRNA_readtag_rv (PCR B) adaptor, 15 µL Nextera PCR master mix (Illumina), 5 µL PCR primer cocktail (Illumina), and 20 µL purified tagmentation product. The following PCR program was used. Initial elongation at 72 °C for 3 min, initial denaturation at 98 °C for 30 s, and 10 cycles of denaturation at 98 °C for 10 s, annealing at 60 °C for 30 s and elongation at 72 °C for 3 min and finishing with final extension at 72 °C for 5 min. The raw read-tag libraries were purified using 1.0× AMPure XP bead with elution into 20 µL nuclease-free water.
To ensure even sequencing coverage across the length of the SSU amplicons, the read-tag libraries were normalized. The libraries were size fractionated on an E-Gel CloneWell gel (Thermo Fisher Scientific). A total of 500 ng GeneRuler 1 kb DNA ladder (Thermo Fisher Scientific) was used as a reference. The gel was run until the 500 bp marker was 1 mm from the elution well, after which 20 µL elution aliquots were sampled and replaced by nuclease-free water every 15 s, up to a total of 32 aliquots. Every two aliquots were pooled, yielding 16 pooled aliquots per sample. These were then analyzed on an Agilent 2200 Tapestation using the High Sensitivity D1000 Screentape. Fractions with a mean fragment length of 500-1,250 bp were used for the pooling. The effective sequencing concentration for fractions from 500-950 bp was determined based on the tapestation data and the empirical formula C seq = Peak molarity [pmol/l] * (−0.0124*(peak size [bp] − 215 bp) + 10.332). These fractions were pooled in equimolar concentrations (C seq ). For fractions between 950-1,250 bp the entire aliquot was used for pooling (40-50 µL). The pooled aliquots were then purified using 1.0× AMPure XP bead with elution into 12 µL nuclease-free water. The quality and concentration of the coverage normalized read-tag libraries were analyzed on D1000 screentapes and with the Qubit dsDNA HS Assay Kit, respectively.
Linked-tag library preparation. Clonal libraries were end-repaired in a reaction containing 1 µL of 10 ng/µL clonal library, 20.25 µL nuclease-free water, 2.5 µL 10X NEBNext End repair Reaction Buffer (New England Biolabs), 1.25 µL NEBNext End Repair Enzyme Mix (New England Biolabs). The reaction was incubated at 20 °C for 30 min. The end-repair reaction was purified using 1.0× AMPure XP bead and eluted into 10 µL nuclease-free water.
The junction sequence, which contains both unique tags, were amplified by PCR in a reaction containing 8 µL of circularized clonal library, 5 µL 10x QIAGEN PCR buffer (Qiagen), 1 µL 10 mM dNTP mix, 2.5 µL with 10 µM of each SSU_rRNA_linktag_fw and SSU_rRNA_linktag_rv, 3 µL 25 mM MgCl 2 , 30.25 µL nuclease-free water, and 0.25 µL 5U/µL Taq polymerase. The PCR reaction was initiated by denaturation at 94 °C for 3 min, followed by 20 cycles of denaturation at 94 °C for 20 s, annealing at 56 °C for 20 s, and extension at 72 °C for 20 s and finishing with a final extension at 72 °C for 3 min. The PCR product was purified using 1.0× AMPure XP beads and elution into 12 µL nuclease-free water. The quality and concentration of the linked-tag libraries were analyzed on D1000 screentapes and with the Qubit dsDNA HS Assay Kit, respectively. manual # 15050107 v02; manual # 15061846 v01) with the following changes. 10 µL library pool was denatured by adding 10 µL 0.1 N NaOH solution, mixing well by pipetting and incubating for 5 min at 25 °C. The denatured library pool was diluted by adding 980 µL cold Hybridization Buffer (Illumina). 400 µL denatured and diluted library pool was mixed with 20 µL denatured and diluted 10 pM PhiX control v3 library (Illumina), and stored on ice until loading. Custom read2 primer mix was prepared by mixing 25 µL 100µM SSU_rRNA_read2_fw and 25 µL 100 µM SSU_rRNA_read2_rv in a conical tube (15 mL) and diluted with 4,950 µL Hybridization Buffer (final concentration is 0.5 µM). When the paired-end reagent rack was loaded on the HiSeq, the Illumina primer mix in position nr. 16 was replaced with the custom read2 primer mix prepared as described above. When setting up the HiSeq run in the control software, the standard procedure was followed except for the following steps. For the "Recipe Screen", the following options were chosen: index type options = No Index. Read 1 cycles = 240. Read 2 cycles = 25. After sequencing, bcl2fastq v2.17.1.14 (Illumina) was used to generate fastq files from bcl files using standard settings (manual # 15038058 RevB).
Nanopore library preparation and sequencing. Library preparation. Nanopore sequencing was carried out following the "Lab protocol for Amplicon sequencing for the MinION device for SQK-NSK007" (Oxford Nanopore Technologies). The clonal library was end-repaired in a reaction containing 340 ng of clonal library in 45 µL nuclease-free water, 7 µL Ultra II END-Prep buffer (New England Biolabs), 3 µL End-Prep enzyme mix (New England Biolabs) and 5 µL DNA CS standard (dsDNA from the Lambda phage). The end-repair was incubated at 20 °C for 5 min, followed by 65 °C for 5 min. The end-repaired DNA was purified using 1.0× AMPure XP Bead and eluted into 31 µL nuclease-free water. Adapters were ligated to the endprepared DNA by addition of 8 µL nuclease-free water, 10 µL Adaptor Mix, 2 µL hairpin adaptor (HPA), 50 µL NEB Blunt/TA Master Mix (New England Biolabs), and incubated at 22 °C for 10 min. Then 1 µL hairpin tether (HPT) was added and incubated for 10 min. 50 µL MyOne C1 beads (Thermo Fisher Scientific) were pelleted on a magnet and washed twice with 100 µL bead binding buffer (BBB), and the washed beads were resuspended in 100 µL BBB. Ligation cleanup was performed by adding 100 µL washed MyOne C1 beads to the ligation reaction, and incubated at RT for 5 min. Then the beads were pelleted on a magnet and washed twice with 150 µL BBB. Residual BBB was removed by spinning down any liquid and pelleting on a magnet. The adapted library was resuspended in 25 µL elution buffer (ELB) and incubated at 37 °C for 10 min. The eluted DNA is termed Pre-sequencing mix. The library size distribution was checked using a D5000 screentape.
MinION sequencing. An R9 flow cell (Oxford Nanopore Technologies) was mounted in the MinION Mk1B (Oxford Nanopore Technologies), and platform QC was carried out in MinKNOW, revealing 1,354 active pores. Flow cell priming buffer was prepared by mixing 500 µL running buffer (RBF1) and 500 µL nuclease-free water. The flow cell was primed twice with 500 µL priming buffer with 10 min in between loadings. The loading mix was prepared by mixing 75 µL RBF1 with 63 µL nuclease-free water and 12 µL pre-sequencing mix. The sequencing was started in MinKNOW (v. 1.0.2) and basecalling was started using Metrichor "2D Basecalling RNN for SQK-NSK007" (v. 1.107).

RNA-seq.
RNA-seq (metatranscriptome) libraries were prepared using TruSeq Stranded mRNA Library Prep Kit (Illumina) according to the standard protocol (TruSeq Stranded mRNA Sample Preparation Guide, Part # 15031047 Rev. E) with the following changes. A total of 1-100 ng total RNA was used. The ribo-depletion part of the protocol was skipped, and instead, the protocol was started from the section "Make RFP" step 12 on page 20 of the manual. Instead of elution from beads, 9.5 µL total RNA sample was simply mixed with 9.5 µL of Fragment, Prime, Finish Mix. The Fragment, Prime, Finish reaction was incubated at 94 °C for 12 min. Subsequent steps followed the standard protocol. The quality and concentration of the resulting RNA-seq libraries were analyzed using D1000 screentapes and the Qubit dsDNA HS Assay Kit, respectively. The RNA-seq libraries were diluted to 1.4 nM, pooled and single-end (250 bp) sequenced on a HiSeq2500 instrument (Illumina) using onboard clustering and rapid run mode with a HiSeq SE Rapid Cluster Kit v2 (Illumina) and HiSeq Rapid SBS Kit v2, 200 cycles (Illumina) using standard procedure with the following exception. The SBS reagents were supplemented the SSU rRNA data from a soil (sd01), a sediment (sd04) and a human gut (hg01) sample were compared to data from shotgun RNA sequencing (RNAseq), which had been prepared from the same RNA extract. To measure bias, the fraction of missed putative SSU rRNA sequences from the RNA-seq data was estimated by individual mappings of the RNA-seq data to the RNA-based full-length SSU rRNA data set, the DNA-based full-length SSU rRNA data set and the SILVA 128 SSU Ref NR 99 database. RNA-seq sequences matching either data set were categorized as the total number of putative SSU rRNA sequences in the RNA-seq data. The putative SSU rRNA sequences were counted for each data set and divided by the sum of all putative SSU rRNA sequences to estimate percentage missed data. The main comparison is between the RNA-and DNAbased data sets, where the method missing least putative SSU rRNA sequences is the method with the least overall bias. The SILVA database was included in the analysis to recruit as many putative SSU rRNA sequences as possible.
The RNA-based data set and the DNA-based data set were prepared by taking assembled SSU rRNA sequences from this study, trimming the sequences to remove bases beyond the 1007-43230 alignment interval, filtering them to discard sequences <1,200 bp and dividing the sequences between an RNAbased data set and a DNA-based data set (see "Trimming, SSU filtering and OTU clustering"). The RNA-and DNA-based data sets were then divided into environment-specific data sets and normalized to 34,000 sequences and classified at domain level with barrnap (https://github.com/tseemann/barrnap) using barrnap-reject 0. 16S rRNA gene copies were detected in different relative abundances, and for B. subtilis three gene copies (rrnG, rrnH and rrnI) were not detected at all. To test if the observation reflected genuine differential expression or protocol bias, we compared the RNA-seq data from the mock community with the mock community reference sequences. The validation was focused on the B. subtilis 16S rRNA genes. Reference 16S rRNA gene sequences of Escherichia coli MG 1655 (DSM 18039, NC_000913), Bacillus subtilis subsp. subtilis str. 168 (DSM 402, NC_000964) and P. aeruginosa PAO1 (NC_002516) were imported into CLC Genomics workbench v9.5.1 and analyzed. The reference sequences were aligned using the "Create Alignment" function with default settings. The alignment was manually inspected to detect single-nucleotide polymorphisms (SNPs) unique to the different 16S rRNA gene copies from B. subtilis. One SNP was chosen as a unique identifier for each gene copy (Supplementary Table 1). No unique identifier SNPs were found for rrnA and rrnO. Mock community RNA-seq data were mapped to the mock community 16S rRNA references using the "Map Reads to Reference" function using default settings with the following exceptions: Length fraction = 1.0, Similarity fraction = 1.0, non-specific match handling = Ignore. The read mappings were manually inspected to determine read coverage of the unique identifier SNPs in the different B. subtilis gene copies. The read coverage was converted to relative abundance by normalizing with the summed read coverage of all the B. subtilis 16S rRNA genes, estimated by mapping the RNA-seq data to the B. subtilis reference sequences using the "Map Reads to Reference" function using default settings with the following exceptions: Length fraction = 1.0, Similarity fraction = 1.0, non-specific match handling = Map randomly.
Phylogenetic analysis. The SINA alignment of the full-length SSU rRNA OTUs was imported into the Silva SSR NR v. 128 database within the ARB software 55 . For phylogenetic analysis, selected sequences were exported using custom positional conservation filters calculated in ARB using the 'filter by base frequency' function. Maximum-likelihood phylogenetic trees were calculated using either FastTree v.  Table 13) were designed within the ARB software based on the full-length SSU rRNA sequences. The primers were then used to screen for the novel archaea in all sediment samples included in this study. The PCR reaction contained 10 µL 10X QIAGEN Coral PCR buffer (Qiagen), 2 µL 10 mM dNTP, 5 µL 10 µM forward primer, 5 µL 10 µM reverse primer, 4 µL 25 mM MgCl, 63.5 µL nuclease-free water, 0.5 µL 5 U/µL Taq polymerase (Qiagen) and 10 µL diluted template DNA (10 ng/µL). The reaction was initiated by denaturation at 94 °C for 3 min, followed by 30 cycles of denaturation at 94 °C for 30 s, annealing at 56 °C for 30 s and extension at 72 °C for 1 min and finishing with final extension at 72 °C for 5 min. 10 µL of each PCR amplicon were directly analyzed on a 2% agarose gel stained with GelRed nucleic acid stain (Biotium). Selected PCR amplicons were pooled and purified using 0.6× AMPure XP beads and eluted in 65 µL nuclease-free water. The purified PCR amplicons were validated using Tapestation 2200 and D5000 screentapes. Sequencing libraries were prepared from the purified amplicons using TruSeq DNA PCR-Free Sample Preparation Guide starting from the end-repair step of the protocol. Final library quality and concentration were measured using Tapestation 2200 with D5000 screentapes and Qubit dsDNA HS Assay Kit. The libraries were prepared and sequenced on a MiSeq using standard procedure (manual # 15027617 v01; manual # 15039740).
Life Sciences Reporting Summary. Further information on experimental design is available in the Life Sciences Reporting Summary.
Data availability. Raw and assembled sequencing data are available at the European Nucleotide Archive (https://www.ebi.ac.uk/ena) under the project number PRJEB22259; a complete data overview can be found in Supplementary Tables 10-12. All data processing scripts are available on github (https://github.com/MadsAlbertsen/fSSU) (Supplementary Code) and a data processing overview can be found in Supplementary Figure 9.