Abstract
To monitor the effect of nature restoration projects in North Sea ecosystems, accurate and intensive biodiversity assessments are vital. DNA-based techniques, and especially environmental DNA (eDNA) metabarcoding from seawater, are becoming powerful monitoring tools. However, current approaches are based on genetic target regions of <500 nucleotides, which offer limited taxonomic resolution. This study aims to develop and validate a long read nanopore sequencing method for eDNA that enables improved identification of fish species.
We designed a universal primer pair targeting a 2kb region covering the 12S and 16S rRNA genes of fish mitochondria. eDNA was amplified and sequenced using the Oxford Nanopore MinION. Sequence data were processed using the new pipeline Decona, and accurate consensus identities of above 99.9% were retrieved. The primer set efficiency was tested with eDNA from a 3,000,000 L zoo aquarium with 31 species of bony fish and elasmobranchs. Over 55% of the species present were identified at species level and over 75% at genus level. Next, our long read eDNA metabarcoding approach was applied to North Sea eDNA field samples collected at shipwreck sites, the Gemini Offshore Wind Farm, the Borkum Reef Grounds and a bare sand bottom. Here, location-specific fish and vertebrate communities were obtained. Incomplete reference databases still form a major bottleneck in further developing high resolution long read metabarcoding. Yet, the method has great potential for rapid and accurate fish species monitoring in marine field studies.
Introduction
North Sea fish populations are sensitive to disturbances such as fisheries, nutrient run offs and increasing sea water temperatures (Andersen et al., 2020; Capuzzo et al., 2018; Hofstede, Hiddink, & Rijnsdorp, 2010; O’Brien, Dafforn, Chariton, Johnston, & Mayer-Pinto, 2019). Combined management strategies such as reduced fishing (Couce, Schratzberger, & Engelhard, 2020), designation of marine protected areas (MPA), and placing artificial hard substrates such as off-shore wind parks are suggested to facilitate rehabilitation of the North Sea ecosystem (Claudet, 2018; Degraer et al., 2020; Didderen, Lengkeek, Bergsma, & Dongen, 2019; Kamermans, van Duren, & Kleissen, 2018). To understand how North Sea fish population dynamics are affected by these strategies, development and validation of methods that map fish population diversity and density is crucial. Conventional marine fish biomonitoring practices largely rely on destructive methods that involve netting and trapping (Daan, Gislason, Pope, & Rice, 2005; Reiss et al., 2010). These methods are costly, time-consuming and require expert taxonomic visual identification skills (Mateos-Rivera et al., 2020; Teletchea, 2009). In addition, conventional methods have limited sampling efficiencies and may be disruptive to the environment (Eggleton, Depestele, Kenny, Bolam, & Garcia, 2018). Thus, it is crucial to develop precise and non-invasive biomonitoring solutions that are time and cost efficient (Goodwin et al., 2017).
Environmental DNA (eDNA) based fish species identification has gained substantial attention in the last decade, as it can detect the presence of fish species based on a small amount of DNA present in e.g. 1 liter of seawater. It has been shown to be highly sensitive for non-indigenous species detection (Ficetola, Miaud, Pompanon, & Taberlet, 2008) and identification of spawning and migration patterns (Thalinger, Wolf, Traugott, & Wanzenböck, 2019). Short read eDNA metabarcoding has become an increasingly popular tool to perform fish community assessment for identification of ecologically relevant fish species from an array of ecosystems (Deiner et al., 2017; Miya et al., 2015; Ruppert, Kline, & Rahman, 2019; Taberlet, Coissac, Pompanon, Brochmann, & Willerslev, 2012; Thomsen et al., 2012). Also in the North Sea, eDNA metabarcoding results from a sampling effort close to fykes were shown to be comparable to the fyke catches themselves (Bleijswijk et al., 2020).
The standardization of eDNA metabarcoding as a monitoring strategy is still under development. Species-related differences in, e.g., the degree of skin cell shedding, ambient-dependent (seasonal) degradation rates and unknown dilution factors depending on currents make quantification of the results challenging (Beng & Corlett, 2020; Lacoursière-Roussel, Côté, Leclerc, & Bernatchez, 2016; Sassoubre, Yamahara, Gardner, Block, & Boehm, 2016; Seymour et al., 2018). The sample preparation, metabarcoding technique and workflow determine the quality of the results and thus the species detection quality and possible biases (Beng & Corlett, 2020; van der Loos & Nijland, 2020). Important steps in the protocol include decisions about methods of sampling and DNA extraction (Bessey et al., 2020; Hunter, Ferrante, Meigs-Friend, & Ulmer, 2019), primer and PCR settings (Doi et al., 2019; Sard et al., 2019; Zhang, Zhao, & Yao, 2020), sequencing technology (Egeter et al., 2020; Singer, Fahner, Barnes, McCarthy, & Hajibabaei, 2019; Truelove, Andruszkiewicz, & Block, 2019), post sequencing data handling (Santos, van Aerle, Barrientos, & Martinez-Urtaza, 2020) and reference databases used (Hestetun et al., 2020; McGee, Robinson, & Hajibabaei, 2019).
The choice of primer and targeted DNA region is especially crucial for successful fish detection with eDNA (Beng & Corlett, 2020). Several universal fish primers are available that mostly target regions of the mitochondrial genome, as there is a high copy number of this genome per cell (Schon, 2000). The most used primers target different short regions of 100 to 500 nucleotides of the 12S rRNA (Miya et al., 2015; Riaz et al., 2011; Taberlet, Bonin, Zinger, & Coissac, 2018), 16S rRNA (DiBattista et al., 2017; Evans et al., 2016), cytochrome B (Thomsen et al., 2012) and COI (Balasingham, Walter, Mandrak, & Heath, 2018) genes. Although primers targeting short 12S regions are the most commonly used and considered as a standard (Shu, Ludwig, & Peng, 2020), longer target amplicons yield higher taxonomic resolution (Zhang et al., 2020). The use of multiple primer sets has also been suggested to increase taxonomic resolution (Evans et al., 2016; Miya et al., 2015; Zhang et al., 2020) and has been shown to increase species level detectability in lakes (Sard et al., 2019). A method to increase the relative presence of less abundant taxa is to apply oligo blocking primers that reduce the amount of unwanted DNA in samples, such as human DNA or DNA from very abundant taxa (Liles et al. 2003; Vestheim & Jarman, 2008). In addition, using longer reads and multiple marker sets can enhance the necessary species resolution and specificity, although it remains unclear how the use of a blocking primer may affect marine vertebrate species detection.
Long read sequence analysis is not possible with commonly used Illumina platforms, which sequence with high accuracy but have a maximum read length of 500bp (Tan, Opitz, Schlapbach, & Rehrauer, 2019). Fortunately, third generation sequencing as available from Oxford Nanopore Technologies (ONT) and Pacific Biosciences enables the generation of ultra-long sequences (Bleidorn, 2016). This can be used for eDNA studies that are based on primers targeting longer regions, covering several mitochondrial marker genes. Such an approach proved successful in microbial metabarcoding studies, where it considerably increased taxonomic resolution up to species level (Johnson et al., 2019; Shin et al., 2016). Historically, the main limitation of nanopore sequencing was the relatively large error rate of 5 to 10% (Jain et al., 2015). This error rate can be overcome with bioinformatics tools that generate reliable consensus sequences and thus increase sequence accuracy (Baloğlu et al., 2020; Carradec et al., 2020; Egeter et al., 2020). To our knowledge, a bioinformatics pipeline is not yet available to generate such consensus sequences from raw sequence data in multiplexed metabarcoding experiments. Such a pipeline would greatly facilitate the development of long read metabarcoding in marine molecular biomonitoring.
This study aims to investigate the possibilities of long read metabarcoding of eDNA with a newly designed fish specific primer pair targeting a 2kb amplicon containing both the 12S and 16S mitochondrial rRNA genes, using Oxford Nanopore MinION sequencing. To increase species identification accuracy, we developed a sequence data processing pipeline. The identification resolution of the primers was tested in silico on several genetically similar species from the genus Ammodytes (sandeels). eDNA sampled from a large marine zoo aquarium was used to analyse the accuracy of species identification together with technical evaluations such as the use of replicates to better understand the species detection possibilities of the amplicon. Further, we assessed the effectiveness of a human-specific oligo blocking primer to reduce the influence of human DNA in the sample on target species detection. Finally, we applied our newly developed long read metabarcoding approach to field samples collected at different locations and timepoints in the North Sea.
Materials and Methods
Sample collection
Samples were collected at three separate sampling sites with varying sampling approaches: the Ocean aquarium of Royal Burgers’ Zoo, Arnhem, the Netherlands; around North Sea shipwrecks; and in the North Sea on a transect from Gemini Wind Park to Borkum Reef Grounds (figure 1).
For eDNA samples from the Ocean aquarium in Burgers’ Zoo, 2L water samples were collected just under the water surface using a 1L plastic container pre-sterilized with bleach. After filtering, filters were cut in quarters to serve as filter replicates (figure 1a). Demineralized tap water from Burgers’ Zoo was also filtered to serve as negative control. The Ocean aquarium has a volume of 3000 m3 artificial seawater and represents the edge of a tropical coral reef with only fish, in total 31 species varying in size from ∼ 5 cm to 2.5 m. The water temperature is kept at 24.5-26.0°C, the salinity at 33.0 ‰ and the pH at 8.00-8.10. In the North Sea, samples were collected around three different ship wrecks while SCUBA diving: wreck 1 (55.1821 N, 03.4446 E using the WGS84 reference system), wreck 2 (55.2609, 03.5117) and wreck 3 (55.0774, 02.5087). At each sample location, three replicates of approximately 2L of seawater were taken by pumping (figure 1b, dx.doi.org/10.17504/protocols.io.6yfhftn). Tap water was used as negative control. From Borkum Reef Grounds/Gemini Wind Park, samples were collected from inside Gemini Wind Park (54.0109, 6.0781), halfway Gemini Wind Park and the Borkum Reef Grounds on sandy substrate (53.8645, 6.2145) (Sandy bottom) and at Borkum Reef Grounds (53.7016, 6.3467). All samples were taken at slack tide during neap tides. Three 1L replicates were collected at each location by sampling seawater using 2.5 L Niskin bottles at 0.5-1m above the seafloor on 2, 17 and 31 July 2020 (figure 1c). Demi water was used as negative control.
All samples were immediately filtered using Thermo Scientific Nalgene Rapid-Flow sterile disposable Filter Units CN (Cellulose nitrate) with a pore size of 0.8µm. Filters were then individually placed in 2mL screwcap Eppendorf tubes. The tubes were prefilled with 400µL tissue lysis buffer (ATL, Qiagen, USA) when sampled in Burgers’ Zoo and 400µL Zymo DNA/RNA shield (Zymo, USA) preservative when sampled at the North Sea. Samples were immediately stored at -20°C for a maximum of one month before further processing.
Primer design
Primer design is based on the adjacent ribosomal genes 12S and 16S of the mitochondrial genome of bony and cartilaginous fish present in either the North Sea or the Ocean aquarium of Royal Burgers’ Zoo, Arnhem, the Netherlands. Primers were designed in silico in Geneious prime 2019.0.4 (Kearse et al., 2012), based on the mitochondrial genomes of the target species available in NCBI. Blocking primers were designed for Homo sapiens. The complete 12S and 16S region was extracted from all available mitochondrial fish genomes. When multiple genomes were available from the same species, a consensus sequence for that species was constructed using default settings of the MAFFT alignment tool (v7.450, Katoh & Standley, 2013) incorporated in Geneious. Consensus sequences of all species were aligned and primers were designed manually by locating regions with low genetic variation between target species. This resulted in a long read universal fish primer pair (table 1) targeting a 2kb fragment from 450bp downstream the 12S rRNA gene in forward direction and 300bp upstream the 16S rRNA gene in reverse direction (figure 2a). The 5’ ends of the primers were extended with an ONT tag to allow for direct PCR based sample barcoding in downstream library preparation.
The additional Homo sapiens specific oligo blocking primer is designed based on the alignment with the universal forward primer and extended with a human specific sequence (table 1). The 3’ end of the blocking primer was extended with a phosphoramidite C3 spacer to chemically prevent polymerase activity. Primers were synthesized by Integrated DNA Technologies and purified by standard desalting only (Integrated DNA technologies Inc., USA).
DNA extraction
To pre-process the samples, Proteinase K was added together with silica (Biospec products, USA), which facilitated crushing of the filters with a pestle to promote DNA extraction. Further DNA extraction was performed using a DNeasy Blood and Tissue extraction kit (Qiagen, USA) in low bind Eppendorf tubes. DNA concentrations were measured using a Qubit 2.0 Fluorometer (Invitrogen, USA). DNA from filters from the two North Sea datasets was extracted using the Quick-DNA miniprep kit (Zymo, USA) according to the manufacturer’s instructions. Details of both protocols are also given at protocols.io (dx.doi.org/10.17504/protocols.io.6yfhftn).
Amplification
For PCR amplification of the target sequences of the samples from the Ocean aquarium, 10µL 2x Phire Tissue Direct PCR Master Mix (ThermoFisher Scientific, USA) was used. To the master mix, 0.4µL of each primer (10mM), 0.5µL template and nuclease free water were added to a total volume of 20µL. To reduce the effect of stochastic heterogeneity in PCR amplification, samples were amplified using 4 separate PCR replicates, except for filter 4, from which only 1 replicate could be obtained after amplification with PCR. When applied, blocking primer was added in a 5:1 ratio to the universal forward primer. A thermocycler was used with the following settings: initial denaturation at 98°C for 180s, 35 cycles of 98°C for 8s, 59.6°C for 8s, 72°C for 30s followed by a final extension at 72°C for 180s. Prior to sequencing, quality of the amplicons was checked on gel. A sample was also amplified at the Ocean aquarium field site using a miniPCR 8 thermal cycler (miniPCR bio, UK) and 4 replicates were pooled to be able to skip downstream barcoding PCR. Samples were purified using a pre-prepared solution of Sera-Mag™ Magnetic SpeedBeads™ (Sigma-Aldrich, USA). Beads were prepared according to the original protocol (Deangelis, Wang, & Hawkins, 1995). Amplicon replicates were not pooled and were consequently sequenced separately. For the North Sea samples, three PCR replicates were used in combination with a hot start and an annealing temperature of 59°C for 8s. In addition, field samples from the Borkum Reef Grounds transect were amplified with 35-45 cycles, depending on sample (S2), to increase amplification yield. PCR replicates were pooled prior to purification with beads.
Nanopore sequencing
Amplicon sequence library preparation was performed using the SQK-LSK109 kit and PCR barcoding kit 96 (PCB-096) (Oxford Nanopore Technologies Ltd., UK), according to the manufacturer’s instructions, with the following adaptations: barcoding PCR was performed in a total volume of 15µL containing 0.3µL PCR barcode primer and 10-50ng amplicon. The applied barcode PCR program was as follows: initial denaturation at 95°C for 180s, 15 cycles of 95°C for 15s, 62°C for 15s, 65°C for 90s, followed by a final extension at 65°C for 180s.
After the barcoding PCR, sample concentration was estimated using the Qubit HS kit on the non-purified barcoded PCR products, and samples were pooled in equimolar ratios. The pooled amplicon sequence library was cleaned using magnetic beads, washed once with 70% ethanol and once with a mixture of Long Fragment Buffer (LFB) and Short Fragment Buffer (SFB) (2:1) to enrich for the 2kb target size fragments. During final clean-up, the library was again washed in a mixture of LFB and SFB in a ratio of 2:1. A maximum of 100ng DNA was loaded on a primed flow cell to prevent overloading of the flow cell. If necessary, the flow cell was refuelled using a mixture of Sequencing Buffer (SQB) and nuclease free water (1:1). Sequencing was performed until an estimated average sequencing depth of 30k reads per barcode was achieved. Base-calling was performed using Guppy (Version 4.2.2, Oxford Nanopore Technologies Ltd., UK) in high accuracy (HAC) mode. For the North Sea field samples, the estimated sequencing depth differed: 25-50k reads were obtained per barcode for the Borkum Reef Grounds/Gemini Wind Park samples and 70-100k per barcode for the shipwreck samples.
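As an illustration of the equimolar pooling step, the sketch below converts Qubit concentrations (ng/µL) into the volume of each barcoded product that contributes the same number of molecules. The 2000 bp amplicon length follows the target size in the text; the average mass of 650 g/mol per base pair, the 10 fmol target per sample and the function names are assumptions made for this example, not values from the protocol.

```python
AVG_BP_MASS = 650.0  # g/mol per base pair of dsDNA (common approximation, assumed here)

def ng_per_ul_to_fmol_per_ul(conc_ng_per_ul, amplicon_bp=2000):
    """Convert a dsDNA mass concentration (ng/uL) to a molar concentration (fmol/uL)."""
    # ng -> g is 1e-9 and mol -> fmol is 1e15, so the net factor is 1e6.
    return conc_ng_per_ul * 1e6 / (amplicon_bp * AVG_BP_MASS)

def pooling_volumes(concs_ng_per_ul, target_fmol=10.0, amplicon_bp=2000):
    """Return the volume (uL) of each sample that contains target_fmol of amplicon."""
    volumes = {}
    for name, conc in concs_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"sample {name!r} has no measurable DNA")
        volumes[name] = target_fmol / ng_per_ul_to_fmol_per_ul(conc, amplicon_bp)
    return volumes

# Hypothetical Qubit readings for two barcodes (illustrative values only).
example = pooling_volumes({"barcode01": 13.0, "barcode02": 26.0})
```

Because all amplicons share the same expected length, equimolar pooling here reduces to pooling equal masses; the molar conversion only matters when fragments of different lengths are combined.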
Bioinformatics
The bioinformatic analysis was performed with our new pipeline called Decona (v0.1.2) (https://github.com/Saskia-Oosterbroek/decona). Decona was used to demultiplex, filter on read length (1800-2200 bases) and quality (q 10), cluster (at 80%) and build Medaka consensus sequences from each cluster larger than 100 sequences (decona -d -l 1800 -m 2200 -q 10 -c 0.8 -n 100 -M). Decona demultiplexes different barcodes with Qcat v1.1.0 (2018 Oxford Nanopore Technologies Ltd., UK). Furthermore, Decona uses Nanofilt v2.3.0 to filter raw fastq reads on quality and read length (De Coster, D’Hert, Schultz, Cruts, & Van Broeckhoven, 2018). It then uses CD-hit v4.8.1, a program that clusters reads based on short words rather than sequence alignment, to cluster the reads at a set percentage of similarity (W. Li, Jaroszewski, & Godzik, 2002). The clustered reads are subsequently aligned using Minimap2 v2.17 (H. Li, 2018). Based on these alignments, Racon v1.4.13 is used to build the initial draft consensus sequence of each cluster (Vaser, Sović, Nagarajan, & Šikić, 2017), which is then polished by Medaka v1.1.2 (2018 Oxford Nanopore Technologies Ltd., UK). For read identification, Decona is integrated with NCBI’s BLAST+ command line tool.
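The length and quality filter described above can be illustrated with a minimal sketch. This is not Decona or Nanofilt code: the function names are hypothetical, the thresholds mirror the settings quoted in the text (1800-2200 bases, q 10), and the mean quality is computed by averaging per-base error probabilities, which is an assumption about how the mean is defined.

```python
import math

def mean_quality(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over error probabilities."""
    # Each FASTQ quality character encodes Q = ord(char) - offset.
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def passes_filter(sequence, quality_string, min_len=1800, max_len=2200, min_q=10):
    """Keep reads inside the expected amplicon length window and above min_q."""
    # The length check runs first, so empty reads never reach mean_quality.
    if not (min_len <= len(sequence) <= max_len):
        return False
    return mean_quality(quality_string) >= min_q

# Synthetic reads for illustration: '5' encodes Q20 and '&' encodes Q5.
good_read = ("A" * 2000, "5" * 2000)
short_read = ("A" * 1500, "5" * 1500)
noisy_read = ("A" * 2000, "&" * 2000)
kept = [passes_filter(seq, qual) for seq, qual in (good_read, short_read, noisy_read)]
```

Restricting reads to a narrow window around the 2kb target removes truncated amplicons and chimeric products before clustering, which keeps clusters homogeneous for consensus building.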
For taxonomic identification, an in-house compiled reference database was used based on sequences available in the NCBI database. Separate databases were built to analyse the datasets of the Ocean aquarium (last search April 2018) and for the North Sea (last search April 2021) datasets. When the whole mitochondrial genome was not available, available sequences of the 12S and/or the 16S from the species were added to the databases. To validate correct species identification, closely related and tropical fish species that do not occur in the Ocean aquarium were also added to the respective in-house databases, as were frequently occurring contaminants.
Species identification of the 2kb reads was obtained with the taxonomic identifier Centrifuge v1.0.4 (Kim, Song, Breitwieser, & Salzberg, 2016) with a minimal alignment length of 200 nucleotides and using only 1 primary assignment for each consensus sequence. If a consensus sequence was aligned with the same quality/score to two or more species, the sequence was assigned to genus level. Consensus reads that could not be identified on species level were labelled NO_ID. In addition, consensus sequences labelled with NO_ID were later verified using BLASTn with the online database of NCBI. No hits with >99% identity were missed, but if there was a hit with lower sequence identity (>94%), the sequence was assigned at genus level. Initially, the Decona built-in BLASTn function was used for taxonomic identification and to assess the accuracy of the amplicons. However, due to the use of the longer 2kb amplicon, Centrifuge considerably outperformed BLAST in terms of overall correct species identification (S4) and was therefore used for taxonomic identification.
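The assignment rule described above (a unique best hit gives a species, an equal best score for several species gives a genus, otherwise NO_ID) can be sketched as follows. This is an illustrative reading of the rule rather than the code used in the study: the function name and the (species, score) input format are hypothetical, and returning NO_ID when tied species belong to different genera is an assumption.

```python
def assign_taxon(hits):
    """Assign a label from a list of (species_name, score) hits.

    species_name is expected as 'Genus species'; higher scores are better.
    """
    if not hits:
        return "NO_ID"
    best_score = max(score for _, score in hits)
    top_species = sorted({name for name, score in hits if score == best_score})
    if len(top_species) == 1:
        return top_species[0]          # unique best hit: species level
    genera = {name.split()[0] for name in top_species}
    if len(genera) == 1:
        return genera.pop()            # tie within one genus: genus level
    return "NO_ID"                     # tie across genera (assumed behaviour)

# Illustrative scores only; the species names are taken from the text.
tie = assign_taxon([("Myripristis berndti", 900), ("Myripristis murdjan", 900)])
unique = assign_taxon([("Abudefduf vaigiensis", 950), ("Abudefduf sexfasciatus", 700)])
```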
Data analysis
The Centrifuge output, including the number of clusters and sequences per identification, was processed in RStudio v1.1.463. Sequence data were loaded as a data frame in R using the packages taxize v0.9.96 (Chamberlain & Szöcs, 2013) and phyloseq v1.30.0 (McMurdie & Holmes, 2013). Prior to analysis, sequences from the genera Homo, Ovis, Gallus and Bos were removed from the dataset when applicable. In the dataset of the Borkum Reef Grounds, samples that yielded one or fewer successful PCR amplification replicates were removed from the dataset. This resulted in the removal of several samples from 31 July, after which it was decided to remove all data points from that day. A non-parametric Wilcoxon rank test was performed to analyse the statistical effect of the blocking primer. Non-metric multidimensional scaling (‘jaccard’) was performed in combination with PERMANOVA to analyse the effect of location in field samples.
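To make the distance measure behind the ordination concrete, the sketch below removes the contaminant genera named above and computes the Jaccard dissimilarity between two samples from presence/absence data. The analysis in this study was carried out in R; this Python version only illustrates the calculation, and the function names and the two example samples are hypothetical.

```python
CONTAMINANT_GENERA = {"Homo", "Ovis", "Gallus", "Bos"}  # genera removed prior to analysis

def remove_contaminants(taxa):
    """Drop taxa whose genus (first word of the name) is a known contaminant."""
    return {taxon for taxon in taxa if taxon.split()[0] not in CONTAMINANT_GENERA}

def jaccard_dissimilarity(sample_a, sample_b):
    """1 - |intersection| / |union| of the taxa detected in two samples."""
    union = sample_a | sample_b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1 - len(sample_a & sample_b) / len(union)

# Illustrative samples only, not data from this study.
site_1 = remove_contaminants({"Scomber scombrus", "Ammodytes marinus", "Homo sapiens"})
site_2 = remove_contaminants({"Scomber scombrus", "Ammodytes tobianus"})
distance = jaccard_dissimilarity(site_1, site_2)
```

Because the Jaccard index uses presence/absence only, it is insensitive to differences in read counts between samples, which suits eDNA data where read abundance is difficult to interpret quantitatively.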
Results & Discussion
Designed 2kb primers cover sufficient length to distinguish between closely related Ammodytes species
Prior to use, the designed primers were tested in silico using the available full mitogenomes of North Sea and tropical marine fish species (data not shown), which indicated that both primers could anneal to the mitogenomic sequences of North Sea fish, tropical fish and further vertebrates. In silico comparison of the region covered by the 2kb primer pair of this study (12S-16S_ONT) with other commonly used 12S and 16S primers (Zhang et al., 2020) shows that the regions obtained from most of the previously used primer pairs are also obtained by the primer pair from this study (figure 2a). Exceptions are the commonly used Mifish and teleo2 primer pairs, which both anneal to a region upstream of the forward primer binding site of this study. Alignment of 5 Ammodytes (sandeel) species with low genetic variability, A. marinus, A. tobianus, A. hexapterus, A. personatus and A. dubius, together with Hyperoplus lanceolatus shows a species specific pattern of single nucleotide polymorphisms (SNPs). Alignment of the previously used primer pairs shows that only a limited number of SNPs can be detected, ranging from 0 to 5 between all species (figure 2b). In contrast, when our newly designed primer pair is used, the SNPs from all primer pairs together can be detected while also detecting SNPs upstream of the 16S rRNA gene (figure 2b). In the region of the 16S rRNA gene with a relatively high amount of variability and high primer coverage, most of the Ammodytes species only have one or two unique SNPs, which is not sufficient for identification on species level (figure 2c). This in silico comparison therefore shows that enhanced identification resolution on species level can be obtained when using long read metabarcoding.
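The comparison of SNP counts between a short marker and the full 2kb amplicon can be sketched as below. The sketch assumes two sequences taken from the same multiple alignment (equal length, gaps written as '-'); the function names, the toy sequences and the window coordinates are hypothetical and serve only to show how a longer window can capture more diagnostic positions.

```python
def snp_positions(seq_a, seq_b, gap="-"):
    """Return 0-based alignment columns where two aligned sequences differ.

    Columns containing a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [i for i, (x, y) in enumerate(pairs) if x != y and gap not in (x, y)]

def snps_in_window(seq_a, seq_b, start, end):
    """Count SNPs inside the half-open alignment window [start, end)."""
    return sum(1 for pos in snp_positions(seq_a, seq_b) if start <= pos < end)

# Toy aligned fragments (not real Ammodytes sequences).
species_1 = "ACGTACGTAC-TACGT"
species_2 = "ACGTTCGTACGTACGA"
all_snps = snp_positions(species_1, species_2)
short_marker = snps_in_window(species_1, species_2, 0, 8)
long_marker = snps_in_window(species_1, species_2, 0, 16)
```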
Accuracy and detection sensitivity testing using the Ocean aquarium
Enhanced sequence accuracy up to 100% using Decona
Samples collected in the Ocean aquarium yielded 599299 reads that were assigned to 247 different consensus sequences (Doorenspleet, K., Jansen, L., Oosterbroek S., Nijland, 2021). Barcode distribution was between 5-20k reads except for barcode 35 (sample processed at the aquarium site), which contained 286108 reads (S1). When the consensus sequences were identified with BLASTn, based on complete query coverage, more than 35% of these sequences could be correctly identified based on a percentage identity of 99.9-100%. A total of 71.6% of the consensus sequences could be correctly identified based on a percentage identity of 99.5% or higher. This demonstrates the potential of the Decona pipeline to successfully increase identification accuracy of nanopore sequence data by up to 10 percentage points compared to the raw read accuracy. This accuracy is comparable to what is currently expected from highly accurate Illumina reads (Caporaso et al., 2011). To our knowledge, a combination of bioinformatics tools in one line of code has not yet been described for nanopore based sequence read processing of metabarcoding data. The script and associated tools are available as the pipeline Decona (https://github.com/saskia-oosterbroek/decona). Decona is written in such a way that only one line of code suffices to correctly run the pipeline. As such, data processing also becomes possible for scientists with limited experience in the command line. The bioinformatics tools integrated in Decona are well established programs widely used in genomics and transcriptomic studies (Huang, Niu, Gao, Fu, & Li, 2010; H. Li, 2018). Currently, there is limited automated bioinformatics processing reported in nanopore based studies, especially for metabarcoding (Santos et al., 2020). For example, tools such as CD-Hit (Huang et al., 2010) have previously been used in nanopore studies for clustering (Voorhuijzen-Harink et al., 2019) and consensus building of fish amplicons.
Reference based polishing was successfully applied when identifying benthic organisms on autonomous reef monitoring structures (Jin et al., 2020) using minibarcoder.py (Srivathsan et al., 2018). The combination of both clustering and de novo alignment based polishing with racon (Vaser, Sović, et al., 2017) and medaka (https://nanoporetech.github.io/medaka/benchmarks.html#evaluation-across-samples-and-depths) has been used for the correction of metagenomes in the NANOclust pipeline (Rodríguez-Pérez, Ciuffreda, & Flores, 2020). Decona on the other hand combines similarity based clustering based on short word tables instead of an alignment approach in combination with alignment based polishing with racon and medaka which further increase the identification accuracies.
Blocking primer reduces Homo sapiens reads but also reduces species richness
The use of a blocking primer significantly (p= 0.0013) reduced the relative read abundance of human DNA (figure 3a) from the eDNA sample. Overall, 15 species could be identified both with and without the use of blocking primer (figure 3c). However, the species Chaetodon ulietensis (1.5%), Chelmon rostratus (0.1%) and Sufflamen chrysopterum (1.2%) were only detectable without the use of blocking primer (figure 3b). Initially, oligo blocking primers were developed to block the amplification of otherwise dominating bacterial taxa (Liles, Manske, Bintrim, Handelsman, & Goodman, 2003) or host DNA in Antarctic krill diet studies to increase the detection of rarer reads/taxa (Vestheim & Jarman, 2008). The presence of an abundant proportion of human reads as observed in this study has previously been reported (Miya et al., 2015). Using an oligo blocking primer was effective for reduction of the amount of human sequences in this study. However, a decrease in species count was found, especially in taxa with low relative read abundance. A recent eDNA study that assessed the usage of blocking primers in combination with several fish specific short read primers has also reported decreases in species richness (Zhang et al., 2020). Hence, the findings in our study agree with the suggestion to increase sequencing depth instead of using an oligo blocking primer to obtain the highest fish and vertebrate species richness (Zhang et al., 2020). Nevertheless, sequencing depth and time can be an important consideration, especially when sequencing budget is limited. Careful usage of an oligo blocking primer can then still be relevant for limited studies where a more general species pattern is more relevant than the complete diversity.
Replication increases species detection sensitivity
In the Ocean aquarium dataset, four pseudo replicates were used by cutting the filters in four. After metabarcoding without using a blocking primer, the species composition and richness differed between replicates (Figure 4). Notably, the species uniquely found on one of the filters, C. rostratus (0.4%), Epinephelus flavocaeruleus (0.6%), Labroides dimidiatus (0.3%), Odonus niger (0.4%), Scomber scombrus (1.5%) and S. chrysopterum (1.7%), all represent species with low total biomass in the aquarium. Of these, the detection of mackerel (S. scombrus) likely reflects the feed that is used for the sharks. The number of replicates used differs between studies, ranging from no replication (Gold, Sprague, Kushner, Zerecero Marin, & Barber, 2021; Stoeckle et al., 2021) and three replicates (Andruszkiewicz et al., 2017; Singer et al., 2019) to five replicates (Jeunen et al., 2019). In our study there was a considerable increase in species richness and detection of unique rare reads when using replicates. Thus, filter replication increases the detection sensitivity. This is in agreement with similar observations where the use of replication increased the detection sensitivity (Beentjes, Speksnijder, Schilthuizen, Hoogeveen, & Van Der Hoorn, 2019; Evans et al., 2017). Recently, an increase of 25% in species count was found between using one and 18 biological 1L replicates, although a limited number of replicates already detected the optimal number of species (Macher et al., 2021). The filters extracted and amplified at the Burgers’ Zoo site showed the highest species richness. However, for this sample the sequencing depth was 10 to 40 times higher than for the separate barcodes of the lab processed filter replicates. This makes direct comparison between the processing locations delicate, as it cannot be determined to what extent the sequencing depth influenced the detection of rare species unique to the field-processed sample.
On the other hand, sampling approach, processing and sequencing depth differed considerably from the samples processed in the lab. These data therefore also show the robustness of the method, independent of the processing location. Nevertheless, in this study, several species remained undetected, which could be improved by using more replicates, a greater sequencing depth or a greater volume of water where possible. Accordingly, careful consideration of the sampling design in terms of replication, sampling volume and sequencing depth is also important when conducting long read eDNA metabarcoding.
Majority of species are detected by long reads
Of the total number of consensus sequences, 92.2% of the raw reads resembling a consensus sequence could be taxonomically identified at species or genus level. The remaining 7.8% could not be taxonomically identified with the used reference database and identification tools. However, based on individual BLASTn results of the unassigned consensus sequences 2 additional genera, Plectorinchus and Glaucostegus were identified. Of all taxonomically identified sequences, 17 genera could be identified from 23 different genera present in the Ocean aquarium (figure 5a). These 23 genera represented 31 species of which 18 species could be identified (figure 5b). Specifically, the reads obtained with this long read metabarcoding method identified more than 58% of the species and 74% of genera present in the ocean aquarium. This percentage is lower than previous single marker validation studies where 80-100% of species present in a tank could be identified (Evans et al., 2017; Kelly, Port, Yamahara, & Crowder, 2014). It must be noted however that in these studies often a low (5-10) number of fish species per tank is used. A validation experiment of a complex aquarium with 100+ species using multiple markers, demonstrated a maximum of 50% identification on species level and 80% on genus level when using 12S, 16S and COI together (Morey, Bartley, & Hanner, 2020). Results presented here are better comparable in terms of species complexity of the aquarium but especially the technical approach of using multiple markers. This, because in silico validation of the primer pair of this study shows that this fragment covers both the target regions of most of the commonly used 12S and 16S primers for fish eDNA studies and large regions of the 12S and 16S rRNA gene that are usually not targeted. In silico validation also shows that the 2kb fragment is better in distinguishing between closely related species of the Ammodytes genus than using a single 12S or 16S marker. 
We therefore suggest that long read metabarcoding targeting this 2kb fragment can be as effective as, if not more effective than, a multi marker approach for detecting complex fish communities. Using multiple markers is more effective than a single marker approach, but considerably increases processing costs. Thus, an additional advantage of a long read approach that covers multiple genetic regions with a single primer pair is increased species detection sensitivity without increased laboratory costs.
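The detection percentages reported for the ocean aquarium follow directly from the counts given above (18 of 31 species, 17 of 23 genera). The short Python sketch below only reproduces that arithmetic; the function name `detection_rate` is a hypothetical helper introduced for illustration and is not part of Decona.

```python
def detection_rate(detected: int, present: int) -> float:
    """Return the percentage of taxa present that were detected."""
    if present <= 0:
        raise ValueError("present must be a positive count")
    if not 0 <= detected <= present:
        raise ValueError("detected must lie between 0 and present")
    return 100.0 * detected / present


# Counts taken from the text: 18 of 31 species and 17 of 23 genera.
species_rate = detection_rate(18, 31)
genus_rate = detection_rate(17, 23)

print(f"species level: {species_rate:.1f}%")  # species level: 58.1%
print(f"genus level:   {genus_rate:.1f}%")    # genus level:   73.9%
```

Rounded, these values correspond to the "more than 58%" of species and "74%" of genera stated in the text.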
Incomplete reference database still hampers correct identification of long reads
Of the 5 genera that were not detected in this experiment, 3 genera (Bodianus, Chrysiptera and Pomacentrus) were represented in the reference database only by a genus level 16S reference, whereas for 2 genera (Ctenochaetus and Siganus) the full mitogenome was available (Table 2). Although full mitochondrial references for the latter genera were present in the database, these species were not detected. The species belonging to these genera are very small fish, suggesting that too little shed DNA was available for identification. This is in line with the inconsistent detection of rare taxa between filters in previous reports (Evans et al., 2017; Kelly et al., 2014; Morey et al., 2020). Species names could be assigned based on either a complete reference mitogenome or partial 12S or 16S references (Table 2). For the undetected species Acanthurus tennentii, Myripristis jacobus, Plectorhinchus obscurus and Chrysiptera parasema no reference was available, which makes definitive identification of these species almost impossible. Instead of Abudefduf sexfasciatus and Myripristis murdjan, which were present but not identified, the species Abudefduf vaigiensis and Myripristis berndti were identified. For these two identified species the full mitogenome was present, whereas for the undetected species only the 16S reference was available. For primers targeting long DNA regions, the reference database can be the limiting factor, as parts of the target sequence might not (yet) be stored in it. This study shows that identification is still possible even when only a single marker is available in the reference database. However, a complete reference covering the full length of the amplified region remains preferable, as it gives more certainty of correct identification and reduces the possibility of false positives, especially among closely related species.
Biodiversity assessment of North Sea fish from different locations
Long read eDNA metabarcoding detects ship wreck specific spatial variation in species compositions
From 9 filtered water samples from 3 different shipwrecks in the Dutch part of the North Sea, a total of 262732 reads were analyzed (Doorenspleet, K., Jansen, L., Oosterbroek S., Nijland, 2021), with between 5k and 20k reads per barcode (S1). A total of 79 consensus sequences were identified, which after correction indicated 21 species. A significant difference in beta-diversity (Jaccard) between the 3 locations was found (p = 0.002) (figure 6a), which resulted in clustering of the water sample replicates per ship wreck (figure 6b). Despite the similarity between replicates, at all locations some species occurred in only 1 of the 3 filter replicates. For example, at ship wreck 1 scaldfish Arnoglossus laterna, solenette Buglossidium luteum, harbour porpoise Phocoena phocoena and plaice Pleuronectes platessa occurred only in replicate 2, while sprat Sprattus sprattus and Atlantic mackerel Scomber scombrus occurred only in replicate 3. In line with the ocean aquarium experiment and earlier findings (Beentjes et al., 2019; Evans et al., 2017; Macher et al., 2021), these results confirm that using multiple sample and filter replicates increases the detected species richness of a location. At ship wreck 1, the composition across the 3 samples consisted mainly of grey gurnard Eutrigla gurnardus, dab Limanda limanda and bull rout Myoxocephalus scorpius. Unique to this shipwreck was the occurrence of sardine Sardina pilchardus. Around wreck 2, E. gurnardus was commonly found, as were sand goby Pomatoschistus minutus and plaice Pleuronectes platessa. Turbot Scophthalmus maximus, P. minutus and common dragonet Callionymus lyra were found only at this ship wreck. The identifications from ship wreck 3 were dominated by lesser sandeel Ammodytes marinus and, to a lesser extent in terms of read count per sample, L. limanda and M. scorpius.
Clearly distinct species composition profiles could be identified for the 3 shipwrecks, and the large number of detected fish species forms a good basis for defining their ecological profiles. Hence, the designed primer pair is universal enough to target both tropical and North Sea fish and elasmobranch species. Interestingly, a moderate number of P. phocoena sequences was found in the samples from shipwreck 1. Although the 2kb primer set was designed on and for bony and cartilaginous fish, the detection of harbour porpoise in our field samples suggests that the primer pair is applicable to other vertebrate species. Ecologically, the detection of P. phocoena suggests a relatively high productivity and abundance of fish in this area, on which harbour porpoises feed. Although it is unlikely that this method has captured the entire fish and marine vertebrate biodiversity, it is sensitive enough to find relevant spatial differences between sampling sites. Location specific sampling (e.g. ship wreck sampling) in particular gives a general overview of vertebrate richness on a local scale and gives insight into visiting species such as the harbour porpoise, which are likely to remain undetected with other methods.
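The beta-diversity comparison above is based on Jaccard, which uses presence/absence only. As a minimal sketch of that measure, the Python example below computes the Jaccard dissimilarity between detected-species sets and shows how pooling replicates raises the detected richness. The taxon labels and sets are hypothetical placeholders, not the study's data, and the significance test behind the reported p-values is not reproduced here.

```python
def jaccard_dissimilarity(a: set, b: set) -> float:
    """Return 1 - |A intersect B| / |A union B| for two species sets."""
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical presence/absence sets (placeholder taxon labels).
replicate_1 = {"taxon_a", "taxon_b", "taxon_c"}
replicate_2 = {"taxon_a", "taxon_b", "taxon_c", "taxon_d"}
other_site = {"taxon_a", "taxon_e", "taxon_f"}

within_site = jaccard_dissimilarity(replicate_1, replicate_2)
between_sites = jaccard_dissimilarity(replicate_1, other_site)
pooled_richness = len(replicate_1 | replicate_2)

print(within_site)      # 0.25
print(between_sites)    # 0.8
print(pooled_richness)  # 4
```

In this toy case the replicates from one site are more similar to each other than to the other site, which is the pattern that produces clustering per location, while the pooled set contains a taxon that a single replicate would have missed.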
Long read eDNA metabarcoding detects variation in species composition in natural/artificial reef
Overall, eDNA metabarcoding of 18 filtered water samples from 3 locations, sampled at two time points along the transect from Borkum Reef Grounds to Gemini Wind Park, resulted in 913119 sequences (Doorenspleet, K., Jansen, L., Oosterbroek S., Nijland, 2021), with between 20k and 100k reads per barcode (S1). Based on 105 consensus sequences, 20 species could be identified. The species composition differed significantly between sampling locations and sampling dates (p = 0.01). The species composition of Gemini Wind Park differed significantly from that of the sandy bottom (p = 0.002). The species composition of the Borkum Reef Grounds, on the other hand, was not significantly different from that of the other locations. The most abundant species at Gemini were A. marinus, E. gurnardus and S. scombrus (figure 7). At the sandy bottom the most abundant species were P. minutus and sprat Sprattus sprattus, while at the Borkum Reef Grounds these were E. gurnardus, P. minutus and C. lyra. In Gemini Wind Park, the composition of common species hardly differed between the sampling dates, and sequences of thornback ray Raja clavata and P. phocoena were found. S. pilchardus, solenette Buglossidium luteum, L. limanda, Lozano’s goby Pomatoschistus lozanoi, P. phocoena and horse mackerel Trachurus trachurus were not found in the samples of the 17th of July 2020. At the sandy bottom, more differences occurred between the 2 sampling dates. For example, the relative read abundance of E. gurnardus was higher on the 17th of July, while S. pilchardus, eel Anguilla anguilla, B. luteum and L. limanda were undetectable in the samples of the 17th of July. At the Borkum Reef Grounds, variation between sampling dates was also found: S. scombrus, S. pilchardus and long-spined sea scorpion Taurulus bubalis were detected only on the 2nd of July, while E. gurnardus, Arnoglossus laterna, striped red mullet Mullus surmuletus and whiting Merlangius merlangus were detected only on the 17th of July.
Despite the location-specific species found in these samples, there are considerable differences between samples taken at different time points. In this study we saw clear changes in the presence of pelagic fish such as S. scombrus and S. pilchardus at the Borkum Reef Grounds. At the sandy bottom, on the other hand, bottom dwelling fish such as P. minutus and B. luteum were found on the first sampling day but not on the second. It is tempting to suggest that these results show short term dynamics, such as pelagic fish movement, at the Borkum Reef Grounds and at the sandy bottom halfway between Gemini Wind Park and Borkum Reef Grounds. However, on such a short temporal scale it remains difficult to draw meaningful ecological conclusions, as eDNA metabarcoding is highly sensitive (Beng & Corlett, 2020), even when sampling in a relatively controlled environment. On the other hand, temporal replication on short time scales is important to consider in order to obtain a more complete picture of the diversity in a given area.
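The comparison between sampling dates rests on two simple quantities: the relative read abundance of each taxon within a sample, and which taxa are detected on only one date. The Python sketch below illustrates both under stated assumptions: the read counts and taxon labels are hypothetical placeholders rather than values from this study, and the helper names are introduced for illustration only.

```python
def relative_abundance(read_counts: dict) -> dict:
    """Convert per-taxon read counts into fractions of the sample total."""
    total = sum(read_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: count / total for taxon, count in read_counts.items()}


def compare_dates(first: dict, second: dict):
    """Return taxa seen only on the first date, only on the second, and on both."""
    seen_first = {taxon for taxon, count in first.items() if count > 0}
    seen_second = {taxon for taxon, count in second.items() if count > 0}
    return seen_first - seen_second, seen_second - seen_first, seen_first & seen_second


# Hypothetical read counts for one location on two sampling dates.
date_1 = {"taxon_a": 600, "taxon_b": 300, "taxon_c": 100}
date_2 = {"taxon_a": 900, "taxon_d": 100}

rel_1 = relative_abundance(date_1)
rel_2 = relative_abundance(date_2)
only_1, only_2, shared = compare_dates(date_1, date_2)
cumulative_richness = len(only_1 | only_2 | shared)

print(rel_1["taxon_a"], rel_2["taxon_a"])  # 0.6 0.9
print(sorted(only_1), sorted(only_2))      # ['taxon_b', 'taxon_c'] ['taxon_d']
print(cumulative_richness)                 # 4
```

Because relative read abundance is a proportion of the sample total, a taxon's share can rise simply because other taxa drop out, which is one reason to interpret short term changes with caution.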
Conclusions
This study demonstrates and validates an eDNA metabarcoding approach using nanopore long read technology that enables increased resolution at species level. This increased species resolution results from the longer DNA fragments analyzed and is enhanced by our nanopore raw read processing pipeline Decona. We further show that limitations such as reduced sensitivity caused by incomplete reference databases remain in long read metabarcoding. Detection limitations also occur when using an oligo blocking primer. Further research should focus on direct comparisons of long read nanopore approaches with other sequencing platforms. Another interesting possibility is the use of long read metabarcoding to study spatio-temporal shifts based on the detection of eDNA fragments of different lengths. Moreover, it is essential that the addition of longer reference sequences, preferably full mitogenomes, to reference databases becomes a high priority in marine molecular ecology, as only then can long read based DNA metabarcoding and metagenomics develop to their full potential as monitoring tools.
Authors contributions
LJ and RN designed the experiment; LJ, PK, OB, MJ and EW were involved in sample collection and processing; LJ and RN did the laboratory work; SO, KD and RN designed the bioinformatics pipeline Decona; KD and RN conducted the data analysis; KD, LJ, SO, RN and TM interpreted the data; all authors wrote and revised the manuscript.
Data availability
All raw sequence data that support these findings are available in ENA at reference XXXXX.
Acknowledgements
We are grateful to the members of the ‘Duik de Noordzee Schoon’ foundation for providing the trip to the ship wrecks and for assisting during the dives and sampling. We would like to thank Linda Tonk of Wageningen Marine Research, Miriam Schutter of Bureau Waardenburg and the crew members of MS Vrijheid III for providing the cruise to the Borkum Reef Grounds and Gemini Wind Park and for assistance with sampling. We acknowledge GEANS, an Interreg project supported by the North Sea Programme of the European Regional Development Fund of the European Union, and JIP ECO-FRIEND (RVO reference number TEWZ118017) for funding parts of this research. An especially valuable contribution was made by Aline Joustra, who provided the illustrations of the experimental design (http://www.alinesci.com/).