ABSTRACT
Background Recently, there has been a push towards the extended barcode concept, which uses chloroplast genome (cpGenome) and nuclear ribosomal DNA (nrDNA) sequences for molecular identification of plants instead of the standard barcode regions. These extended barcodes have a wide range of applications, including biodiversity monitoring and assessment, primer design, and evolutionary studies. However, they are not well represented in global reference databases. To fill this gap, we generated cpGenome and nrDNA reference data from genome skims of 184 plant species collected in Denmark. We further explored the use of this reference data for molecular identification of plants in an environmental DNA metagenomics study.
Results We assembled partial cpGenomes for 82.1% of sequenced species and full or partial nrDNA sequences for 83.7% of species. We added all assemblies to GenBank; of these, chloroplast reference data for 101 species and nuclear reference data for 6 species were not previously represented. On average, we recovered 45 genes per species. The recovery rate of standard barcodes was higher for nuclear barcodes (>89%) than for chloroplast barcodes (<60%). Extracted DNA yield did not affect assembly outcome, whereas high GC content affected it negatively. In the in silico simulation of metagenomic reads, taxonomic assignments using the reference data generated here had better species resolution (94.9%) than those using GenBank (18.1%), without any identification errors.
Conclusions Genome skimming generates reference data for both standard barcodes and other loci, contributing to the global DNA reference database for plants.
Competing Interest Statement
The authors have declared no competing interest.
Abbreviations
- atpB: adenosine triphosphate synthase subunit beta
- BEMT: Blunt End Multi Tube
- BOLD: Barcode of Life Data Systems
- BP: base pairs
- BSA: Bovine Serum Albumin
- BWA: Burrows-Wheeler Alignment
- CDS: Coding sequences
- COI: cytochrome-c oxidase subunit 1
- cpGenome: Chloroplast genome
- DNAmark: Danish national DNA reference database
- eDNA: environmental DNA
- GATK: GenomeAnalysisTK
- GBIF: Global Biodiversity Information Facility
- HTS: high-throughput sequencing
- ITS: internal transcribed spacers
- LCA: lowest common ancestor
- LSU rRNA: large subunit ribosomal ribonucleic acid
- matK: maturase K
- nrDNA: nuclear ribosomal sequences
- ORG.Annot: Organelle Annotator
- ORG.asm: ORGanelle ASseMbler
- PE: paired-end
- PVP: Polyvinylpyrrolidone
- qPCR: quantitative PCR
- rbcL: ribulose-bisphosphate carboxylase large chain
- SD: standard deviation
- sedaDNA: Ancient sedimentary DNA