Abstract
The increasing popularity of cytochrome c oxidase subunit 1 (COI) DNA metabarcoding warrants a careful look at the reference databases that underlie high-throughput taxonomic assignments. The objectives of this study are to document trends in COI records and to assess their future usability for metabarcode identification. Over 2.5 million COI sequences were found in GenBank, half of which were fully identified to the species rank. From 2003 to 2017, the number of eukaryote COI records deposited grew by two orders of magnitude, representing a nearly 42-fold increase in unique species. Of the fully identified records, 92% are at least 500 bp in length, 74% have a country annotation, and 51% have latitude-longitude annotations. To ensure the future usability of COI records in GenBank, we suggest: 1) improving the geographic representation of COI records, 2) improving the cross-referencing of COI records between the Barcode of Life Data System and GenBank to facilitate consolidation and incorporation into existing bioinformatic pipelines, 3) adhering to the minimum information about a marker gene sequence (MIMARKS) guidelines, and 4) integrating metabarcodes from eDNA and mixed community studies with existing sequences. COI metabarcoders are normally considered consumers of taxonomic data. Here we discuss the potential for taxonomists to reverse this pattern and instead mine metabarcode data to guide species discovery. The growth of COI reference records over the past 15 years has been substantial, and these records are likely to remain a resource across many fields for years to come.