Abstract
Metabarcoding of microbial eukaryotes (collectively known as protists) has developed tremendously in the last decade, almost uniquely relying on the 18S rRNA gene. As microbial eukaryotes are extremely diverse, many primers and primer pairs have been developed. To cover a relevant and representative fraction of the protist community in a given study system, a wise primer choice is needed as no primer pair can target all protists equally well. As such, a smart primer choice is very difficult even for experts and there are very few on-line resources available to list existing primers. We built a database listing 179 primers and 76 primer pairs that have been used for eukaryotic 18S rRNA metabarcoding. In silico performance of primer pairs was tested against two sequence databases: PR2 for eukaryotes and a subset of Silva for prokaryotes. This allowed to determine the taxonomic specificity of primer pairs, the location of mismatches as well as amplicon size. We developed a R-based web application that allows to browse the database, visualize the taxonomic distribution of the amplified sequences with the number of mismatches, and to test any user-defined primer set (https://app.pr2-primers.org). This tool will provide the basis for guided primer choices that will help a wide range of ecologists to implement protists as part of their investigations.
Introduction
Microbes are key players in all Earth ecosystems. Among them are protists that encompass all unicellular or unicellular-colonial eukaryotes excluding some fungi. Protists perform a range of functions from photosynthesis to organic matter degradation. Although some eukaryotic groups such as unicellular algae (phytoplankton) have a long tradition of being studied as key players in marine primary production, the importance of protists in other processes and other environments has only been recently recognized, for example their role in nutrient cycling in soils or the complexity of symbiotic and parasitic relationships in marine waters (Geisen et al. 2018a; Worden et al. 2015). This absence of recognition stems in part from the inherent difficulties to identify them because of the lack of morphological features as well as of the difficulty to grow them in culture. In recent years the development of metabarcoding has provided new tools to study protist roles.
Metabarcoding is defined as the use of a specific marker gene to analyse the composition of natural communities in a specific environment (water, soil, animal gut, faeces, etc…). After DNA extraction, the gene is amplified using a pair of primers targeting one specific region, samples are labelled with molecular tags and the resulting DNA is sequenced using a high throughput technology, mostly Illumina currently. This approach was initially developed for prokaryotes (Sogin et al. 2006) and expanded steadily in recent years for protists. For prokaryotes, the gene most commonly used is the gene coding for the small sub-unit ribosomal RNA (SSU rRNA or 16S). SSU rRNA genes are composed of conserved and variable regions. Conserved regions can be used to design primers while variable region (V) can be used to assign taxonomy. In prokaryotes, the region targeted is very often V3/V4, although other regions have been suggested as providing better resolution (e.g. Bukin et al. 2019). For eukaryotes, earlier metabarcoding work done on microbial communities used the 18S rRNA gene (Amaral-Zettler et al. 2009; Stoeck et al. 2009), following up on what had been done using earlier cloning and Sanger sequencing approaches (López-García et al. 2001; Moon-van der Staay et al. 2001). Other genes, and in particular the mitochondrial cytochrome oxidase 1 gene (COI or cox1), have been used for Metazoa but their use is debated in particular because of the lack of universal primers (Andújar et al. 2018; Deagle et al. 2014) and the absence of this gene in lineages that have lost the mitochondrial genome (e.g. Yahalomi et al. 2020). For protists, the 18S rRNA gene appears to be most appropriate as a general marker (Pawlowski et al. 2012), although other genes such as rbcL (large unit of the RUBISCO) have been used for specific purposes such as targeting photosynthetic organisms (e.g. Pujari et al. 2019). Two variable regions of the 18S rRNA gene have been mostly targeted, V4 and V9: V4 is located in the second quarter of the 18S rRNA gene and V9 at the end of the 18S rRNA gene, near the ITS (internally transcribed spacer) region. Initially, the V4 region was favoured for 454 sequencing technology and V9 for Illumina which was then restricted to 2×75 bp. However with the development of the Illumina MiSeq (up to 2×300 bp), the V4 region is now preferred, in particular because it is longer, more variable, and better covered in reference databases (Pawlowski et al. 2012).
Primer selection is critical to obtain an accurate image of protist communities. Each primer (forward and reverse) must amplify the target community with minimal biases and the region amplified must be long enough to be taxonomically resolutive but preferably short enough to be fully sequenced by the chosen technology, although longer amplicons can be also be partially sequenced. With Illumina sequencing being now the most used technology, amplicon size must be about 50 bp smaller that the sum of the forward and reverse sequences (called R1 and R2) to allow enough overlap to reconstruct the complete amplicon: for example, the Illumina MiSeq 2×300 bp chemistry can sequence amplicons of up to 550 bp. A large diversity of primer and primer sets targeting the 18S rRNA gene have been developed over the years, although a few dominate in protist metabarcoding studies. Few resources are available that list eukaryotic 18S primers and primer pairs, provide information on their taxonomic specificity and allow to test new primer pairs. Most existing primer databases do not focus on protists. For example the primer database linked the Barcode of Life Data System web site Bold Systems (https://boldsystems.org/index.php/Public_Primer_PrimerSearch) is focusing on metazoans and Probebase (http://probebase.csb.univie.ac.at/node/8, Greuter et al. 2016) focuses on bacteria. A few programming tools have been developed to test primer set specificity, for example EcoPCR (Ficetola et al. 2010), a Python program, or R libraries such as PrimerMiner (Elbrecht and Leese 2017). Unfortunately, these tools need to be installed in a specific computing environment and require some background programming skills. Many existing online tools such as Probematch (https://rdp.cme.msu.edu/probematch/search.jsp) only allow testing primer sets against bacteria, archaea and fungi. Silva TestPrime (https://www.arb-silva.de/search/testprime) is the only tool that covers protists. It provides very detailed feedback on the taxonomy of amplified sequences, and the location of mismatches. Such detailed information comes at the expense of speed with a typical test needing a few minutes to run. Moreover the taxonomic annotation of the Silva database for protists is not optimal.
To fill this gap and to provide protist researchers with a usable tool, we constructed a database of primer and primer sets used for eukaryotic 18S rRNA metabarcoding. These primer sets were tested in silico against the PR2 database (Guillou et al. 2013) that contains more than 180,000 18S rRNA sequences with expert taxonomical annotation and a subset of the prokaryotic Silva database. We developed a R-based web application that allows to explore the database,d to visualize pre-computed in silico amplification results according to taxonomy (% of amplification, size of amplicons and location of mismatches), and to test any user-defined primer set.
Material and Methods
18S rRNA gene primers (Table S1) and primer sets (Table S2) used in metabarcoding studies were collected from the literature. Primer sequences and primer sets (knowing that several primer sets may share at least one primer) were stored in a MySQL database. Primer sets were tested by performing in silico amplification of eukaryotic sequences stored in the PR2 reference database (Guillou et al. 2013) version 4.12.0 (https://github.com/pr2database/pr2database/releases/tag/v4.12.0). We also used a small subset of the Silva database version 132 provided by the mothur web site (https://mothur.org/wiki/silva_reference_files) to test whether prokaryotes were amplified. Sequences with ambiguities were discarded (any nucleotide that is not A, C, G or T). Sequences with length shorter than 1350 bp were not considered except for the V4 region for which this threshold was lowered to 1200 bp, since most sequences from PR2 contain the V4 region. In contrast, this limit was extended to 1650 for the V9 region and since many 18S rRNA do not cover the full V9 region, we only kept sequences that contained the canonical sequence GGATC[AT] which is located at the end of the V9 region, just before the start of the internally transcribed spacer 1 (ITS1). A R (R Development Core Team 2013) script using the Biostrings package (Pagès et al. 2020) was used to compute the number of mismatches to the forward and reverse primers allowing for a maximum of 2 mismatches for each primer using the function matchPattern with the following parameters: max.mismatch=2, min.mismatch=0, with.indels=FALSE, fixed=FALSE, algorithm=“auto”. We computed the position of mismatches using the mismatch function with parameter fixed=FALSE. A faster version of the script is also available that does not compute mismatch position using the vectorized form of the matchPattern function (vmatchPattern). The latter function is used in the Shiny application (see below) allowing users to test their own prime sets. The data were tabulated using the dplyr package and plotted using the ggplot2 package (Wickham 2016). A R shiny application to interact with the database was developed using the following R packages: shiny, shinyFeedback and shinycssloaders (Sali and Attali 2020).
All scripts including those for the Shiny application are available at https://github.com/pr2database/pr2-primers.
Results and Discussion
Database of primers and primer sets
We have been able to recover from the literature a total of 102 general eukaryotic primers and 77 specific to some taxonomic groups (Tables 1 and S1, https://app.pr2-primers.org). Some of these primers were designed early on when researchers began to amplify and sequence the 18S rRNA gene (e.g. primers EukA and EukB, Medlin et al. 1988). More recently, researchers have been designing primer specific of some taxonomic groups mostly targeting the division-level (e.g. S19F and S15rF for Foraminifera, Morard et al. 2011) or class-level (e.g. primer PRYM03+3 for Prymnesiophyceae, Egge et al. 2013). Some primers are also designed to block specific taxa (e.g. 18SV1V2Block against the coral Pocillopora damicornis, Clerissi et al. 2018) to be used in combination with more general primers (18SV1V2F in this case) or to avoid amplification of some groups (e.g. EUK581-F and EUK1134-R which do not amplify Metazoa, Carnegie et al. 2003). These primers are used when looking at the eukaryotic microbiome of specific organisms (corals, oysters) to avoid amplification of host’s genes (Bass and del Campo 2020).
We identified a total of 76 primer sets that have been used in metabarcoding studies, mostly targeting protists (Table S2). Of these, most are general, i.e. not targeting specific groups. The distribution of these primer sets over the 18S rRNA gene is very heterogeneous, but the vast majority target the V4 region (Table 2 and Fig. 1). In contrast the number of primer sets targeting the other favoured metabarcoding region V9 is much lower. Most of the primer sets targeting a specific taxonomic group are located in the V4 region, and none are in the V9 region (Table 2). In terms of usage, the V4 region is much more popular (about 80% of published studies in marine systems, Lopes dos Santos et al. 2021), the three most commonly used primer sets being #8 (TAReuk454FWD1 and TAReukREV3, Stoeck et al. 2010), # 17 (E572F and E1009R, Comeau et al. 2011) and # 16 (TAReuk454FWD1 and V4 18S Next.Rev, Piredda et al. 2017), while for the V9 region the most popular sets are # 27 (1391F and EukB, Stoeck et al. 2010) and #28 (1380F and 1510R, Amaral-Zettler et al. 2009).
Testing primer sets by in silico matching
General primer sets
We used the PR2 database (Guillou et al. 2013) which currently contains about 180,000 18S rRNA sequences with a detailed taxonomic annotations to test all primer sets from the pr2-primers database. We also verified on a small set of sequences representative of the different prokaryotic groups whether these primers amplified bacteria or archaea. We only used long sequences (see Material and Methods) and allowed for a maximum of 2 mismatches on both forward and reverse primers, i.e. a maximum of 4 mismatches. For general primers, amplification success varied from 32 to more than 97% (Table 3, Fig. 2 and S1). In general, the reverse primer has a tendency to have more mismatches than the forward primer (Table 3). Primer sets targeting regions other than V4 or V9 do not perform as well in general (Fig. S1), although the best overall performance is for # 76 targeting the V7 region (F-1183 and R-1443, 97.1% of sequences amplified, Lundgreen et al. 2019). If we focus on the V4 and V9 regions (Fig. 2), the best performing primer sets overall are #6 (616*f and 1132r, 96.5%, Hugerth et al. 2014) and #29 (1389F and 1510R, 79.8%, Amaral-Zettler et al. 2009). The lower percentage observed for the V9 primers have to be interpreted with caution: many 18S reference sequences do not extend to the end of the V9 region and therefore will miss the signature of the reverse primer. To minimize this problem we retained for the analysis of V9 primer sets only sequences that contain the canonical signature GGATC[AT] located at the end of the V9 region. Despite performing well when allowing for 4 mismatches, some of these primer sets have at least one mismatch to PR2 sequences: for example primer set # 108 (545F and 1119R, Kataoka et al. 2017) amplifies only 7.9% of the sequences with zero mismatch. Another important consideration is the size of the amplicon. Since most metabarcoding studies are currently using Illumina sequencing technology, the maximum possible size to allow some overlap between the two R1 and R2 reads is about 550 bp (assuming that one uses the 2×300 bp sequencing kits), although smaller amplicons are preferable to allow more overlap. A sizeable fraction of the primer sets produce amplicon close to or larger than 600 bp (Fig. 2). The post sequencing analysis strategy in this case would be to only use one of the reads (R1 is in general less noisy) without trying to assemble R1 and R2 (as done in Lambert et al. 2019).
Another important consideration is whether amplification is similar across the whole eukaryotic taxonomic range. Taking as example the most used primer set targeting V4 (# 8, Fig. 3A) and looking at the amplification efficiency at the supergroup level, a significant fraction of Excavata and to a smaller extent of Rhizaria present at least 5 mismatches to this primer set (Fig. 3A top-left). Amplification is even more unlikely for sequences presenting mismatches with the forward primer because the mismatches are locate at the 3’ end of the primer (Fig. 3A top-right) which is the most unfavorable situation (mismatches at the 5’ end are better tolerated). The average size of the amplicon is also varying depending on the taxonomic group (Fig. 3B bottom). For example, Excavata have on average longer amplicons because of the presence of introns: amplicon size is then beyond the current range of Illumina sequencing and this may induce as well negative bias during PCR amplification (Geisen et al. 2015). For other groups such as Opisthokonta, although the average size is compatible with Illumina sequencing, there is a large number of outlier sequences with long amplicons. This will mean that taxa corresponding to these sequences (mostly Arthropoda) will be missed from surveys conducted with this primer set, although of course this is less critical when protists are targeted. The situation with V9 primer #27 (Fig. 3) is somewhat similar although there is less dissimilarity between the different supergroups. However for some supergroups, in particular Opisthokontha (Ascomycota) and Archae-plastida (Bangiophyceae), there is a number of outliers that will be missed by Illumina sequencing. Again these groups are less relevant when focusing on protists. When looking at all the general primer sets (Figure S2), some sets such as # 2, 25, and 110 appear to have more taxonomic biases than others. Overall Excavata constitute the supergroup that is most often discriminated against.
Most primer sets will not amplify prokaryotes except primers such as set # 33 (515F and Univ 926R Needham and Fuhrman 2016) that were designed to amplify both bacteria and eukaryotes (Figs. S3 and S4). However some primers specific of eukaryotes such as # 4 (563f and 1132r, Hugerth et al. 2014) amplifies quite well prokaryotes. Interestingly, set # 12 (3NDf and 1132rmod, Geisen et al. 2018b) amplify only archaea but not bacteria. In most cases, it is the reverse primer which was discriminating against prokaryotes.
Specific primer sets
In order to access a deeper diversity within a given taxonomic group primers, primer sets have been developed with specific targets (Tables S1 and S2). Target levels are most often at the division (e.g. Haptophyta) and class levels (e.g. Chrysophyceae), although some sets are targeting supergroups (e.g. SAR #84). Some primer sets are extremely specific of their targets. One example is primer # 65 targeting Cercozoa (S616F Cerco and S947R Cerco, Fiore-Donno et al. 2018) that presents at least 5 mismatches to all other divisions (Fig. S5) and amplifies all Cercozoa groups. Primer # 38 targeting Chlorophyta (ChloroF and ChloroR, Moro et al. 2009) presents at least 5 mismatches to all other divisions (Fig. S5). However, it is does not amplifies all Chlorophyta as it misses picoplanktonic green algae such as Mamiellophyceae or Chloropicophyceae (Fig. S6). In contrast several primer sets claimed to be specific of a given group are in fact quite general. For example set # 87 which targets oxymonads (Oxy 18S-F and Oxy 18S-R Michaud et al. 2020) amplifies many other groups (Figs. S1 and S5). In this case, this is not critical since oxymonads only occur in termite guts and such primers will be used in this specific context. Primer set # 21 (D512for and D978rev, Zimmermann et al. 2011) which was designed to target diatoms would amplify actually most of the Ochrophyta (brown algae) classes but also some green algae (Fig. S6).
R Shiny application
We have developed a web site based on a R Shiny application (https://app.pr2-primers.org) that allows users to visualize and download the pr2-primers database, explore at different taxonomy levels the results of in silico amplification against the PR2 and Silva seed database for the primer sets from the database and test their own primer sets. The application is composed of 6 panels. The first panel (Fig. 4A) provides information on the database as well as link to report issues or new primers. The second and third panels (Fig. 4B) provide an interface to the two tables for primers and primers sets with the option of downloading it (item 1) and of revealing/hiding specific columns. The fourth and fifth panels allow to explore the in silico amplification of the primer sets from the database. The fourth panel present a synthesis of the results (similar to Fig. 2) while the fifth panel allows for a given primer set (item 2) to look at amplification properties within a taxonomic level from the kingdom to the class (item 3). The right panel shows general amplification characteristics (item 4), the location of the mismatches, the number of mismatches for each group and the distribution of the amplicon sizes (item 5). Finally the sixth panel allows users to provide a primer set and a maximum number of mismatches (item 6) to run an in silico amplification against PR2 and Silva seed databases. For the sake of speed, only the number of mismatches is provided but not the position of the mismatches as for the primer sets from the database. Global statistics on the amplification are provided which can be explored at different taxonomic levels (item 7). The R shiny application has been incorporated into a Docker container available at https://hub.docker.com/repository/docker/vaulot/pr2-primers.
Conclusion
The combination of the pr2-primers database with the PR2 sequence database provides a very useful resource for protist metabarcoding. It will help researchers to select the most suitable primer pairs for both broadly-targeted surveys and studies focusing on target taxonomic groups, and to test and validate in silico novel primers. We emphasize that primer pairs must also be tested on reference culture material and natural samples as actual amplification may differ from in silico results. Hopefully this database will grow with time as novel primer pairs are developed and tested on samples from a range of environments. This will contribute to better design and comparability of microbiome analyses, inventories of protist diversity across environments, and increase our understanding of this functionally diverse and important group of organisms.
Author contributions statement
DV and SG conceived the study. DV, DB and FM scanned the literature for existing primers and primer sets. DV developed the database, the analysis scripts and the R shiny application. DV wrote the first draft of the paper and all co-authors edited and approved the final version.
Additional information
Competing interests
The authors declare no competing financial interests.
Supplementary Material
Acknowledgements
We thank the ABIMS platform of the FR2424 (CNRS, Sorbonne Université) for bioinformatics resources.