rCRUX: A Rapid and Versatile Tool for Generating Metabarcoding Reference Libraries in R

Key to making accurate taxonomic assignments are curated, comprehensive reference barcode databases. However, the generation and curation of such databases has remained challenging given the large and continuously growing volumes of DNA sequence data and novel reference barcode targets. Monitoring and research applications require a greater diversity of specialized gene regions and targeted taxa to meet taxonomic classification goals than are currently curated by professional staff. Thus, there is a growing need for an easy-to-implement tool that can generate comprehensive metabarcoding reference libraries for any bespoke locus. We address this need by reimagining CRUX from the Anacapa Toolkit and present the rCRUX package in R. The typical workflow involves searching for plausible seed amplicons (get_seeds_local() or get_seeds_remote()) by simulating in silico PCR to acquire seed sequences containing a user-defined primer set. Next, these seeds are iteratively BLAST-searched against a local NCBI-formatted database using a taxonomic-rank-based stratified random sampling approach (blast_seeds()), which results in a comprehensive set of sequence matches. This database is dereplicated and cleaned (derep_and_clean_db()) by identifying identical reference sequences and collapsing the taxonomic path to the lowest taxonomic agreement across all matching reads. This results in a curated, comprehensive database of primer-specific reference barcode sequences from NCBI. We demonstrate that rCRUX provides more comprehensive reference databases for the MiFish Universal Teleost 12S, Taberlet trnL, and fungal ITS loci than CRABS, METACURATOR, RESCRIPt, and ECOPCR reference databases. We then further demonstrate the utility of rCRUX by generating 16 reference databases for metabarcoding loci that lack dedicated reference database curation efforts.
The rCRUX package provides a simple-to-use tool for the generation of curated, comprehensive reference databases for user-defined loci, facilitating accurate and effective taxonomic classification in metabarcoding and DNA sequencing efforts broadly.


The fields of freshwater, estuarine, and marine ecology are rapidly embracing high-throughput sequencing to detect, monitor, or assess change in biological communities (Deiner et al. 2017, Takahashi et al. 2023). Fundamental to the efficacy of these molecular biomonitoring efforts, particularly metabarcoding (amplicon sequencing), is the taxonomic assignment of the sequences generated (Edgar 2018, Bik 2018). Taxonomic assignment is a complicated bioinformatic process that involves many challenges, including the uncertainty around the generated sequencing data, the comparison between those data and a reference database of sequences of known origin, and the bioinformatic decisions that land on a taxonomic [...]

This version posted June 3, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.

[...] Beng and Corlett 2020). Generating a high-quality reference database from these enormous sequence repositories requires a full accounting of all orthologous sequences, the detection and removal of mislabelled sequences, and the identification of identical sequences across taxa (Curd et al. 2019, Jeunen et al. 2023, Richardson et al. 2020). Parsing and refining these large sequence repositories into curated databases that are comprehensive for specific marker sets remains a significant challenge (Jeunen et al. 2023).

Efforts to address this challenge either rely on the dedicated maintenance and curation of reference databases for specific loci of interest or are limited in efficacy because they rely on keyword searches, are too computationally demanding, or are difficult to stand up and install because they rely on a suite of software dependencies. By far the most successful and widely used reference databases (e.g. Silva, PR2, UNITE, MitoFish) rely on dedicated staff and resources to maintain and update such repositories (Quast et [...]

[...] sequence databases in a three-step process (Figure 1): 1) identification of seed sequences that match the primers of interest, 2) finding sequences homologous and orthologous to those seeds via BLAST, and 3) dereplication of the resulting database to reduce redundancy and detect wrongly annotated sequences. This can be followed by 4) database comparison tools provided in rCRUX.
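Step 1 has to cope with primers that contain IUPAC degenerate bases, since one degenerate primer stands for several concrete sequences. The following is a minimal sketch of that expansion, followed by random sub-sampling and FASTA formatting. It is written in Python purely for illustration and is not rCRUX or primerTree code: the function names expand_primer, sample_primers, and to_fasta are hypothetical, as is the toy primer.

```python
import itertools
import random

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_primer(primer):
    """Return every non-degenerate sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in itertools.product(*choices)]

def sample_primers(primer, max_variants, seed=0):
    """Expand a primer, then keep at most max_variants randomly chosen variants."""
    variants = expand_primer(primer)
    if len(variants) <= max_variants:
        return variants
    # A seeded generator keeps the random selection reproducible.
    return random.Random(seed).sample(variants, max_variants)

def to_fasta(variants, prefix):
    """Format variants as FASTA records named prefix_1, prefix_2, ..."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(variants, 1))

# Toy primer (hypothetical): R expands to A or G.
print(expand_primer("ACRT"))  # ['ACAT', 'ACGT']
```

The number of variants grows multiplicatively with each degenerate position (a primer with three N bases already has 64 variants), which is why capping the number of sampled primers keeps the subsequent search tractable.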
Specifically, get_seeds_local() passes the forward and reverse primer sequences for a given PCR product to run_primer_blastn(). In the case of a non-degenerate primer set, only two primers will be passed to run_primer_blastn(). In the case of a degenerate primer set, get_seeds_local() will obtain all possible versions of the degenerate primer(s) (using primerTree's enumerate_primers() function), randomly sample a user-defined number of forward and reverse primers, and generate a fasta file of selected primers (Cannon et al. 2016 [...]

bioRxiv preprint doi: https://doi.org/10.1101/2023.05.31.543005

The final step of rCRUX, derep_and_clean_db(), takes the output from blast_seeds(), conducts quality control and curation, and de-replicates the dataset to identify representative sequences. First, all sequences with NA taxonomy for phylum, class, order, family, and genus are removed from the dataset and stored separately, because they typically represent environmental samples with low value for taxonomic classification. Next, all sequences with the same length and composition are collapsed to a single database entry, where the accessions and taxids (if there is more than one) are concatenated. The sequences with a clean taxonomic path (e.g. no ranks with multiple entries) are saved. In contrast, sequences with multiple entries for a given taxonomic rank are processed further by removing NAs from rank instances with more than one entry (e.g. "Chordata, NA" will mutate to "Chordata"). Any remaining instances of taxonomic ranks with more than one taxid are reduced to NA (e.g. species rank "Badis assamensis, Badis badis" will mutate to "NA", but genus rank will remain "Badis" [...] Table 1 and Supplemental Table 3).
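The dereplication and taxonomy-collapsing rules described for derep_and_clean_db() can be sketched as follows. This is an illustrative Python approximation under stated assumptions, not the package's R implementation: the record layout (one dict per sequence, with None standing in for NA), the function name derep_and_collapse, and the accessions and taxids in the example are all hypothetical.

```python
from collections import defaultdict

RANKS = ["phylum", "class", "order", "family", "genus", "species"]
REQUIRED = RANKS[:5]  # phylum through genus

def derep_and_collapse(records):
    """Return (collapsed, discarded) following the three rules in the text."""
    # Rule 1: set aside records that are NA at every rank from phylum to genus.
    kept, discarded = [], []
    for rec in records:
        if all(rec[rank] is None for rank in REQUIRED):
            discarded.append(rec)
        else:
            kept.append(rec)

    # Rule 2: identical sequences (same length and composition) share one entry.
    groups = defaultdict(list)
    for rec in kept:
        groups[rec["sequence"]].append(rec)

    collapsed = []
    for seq, members in groups.items():
        entry = {
            "sequence": seq,
            "accessions": ", ".join(m["accession"] for m in members),
            "taxids": ", ".join(sorted({str(m["taxid"]) for m in members})),
        }
        # Rule 3: per rank, drop NAs; keep the name only if exactly one remains.
        for rank in RANKS:
            names = {m[rank] for m in members if m[rank] is not None}
            entry[rank] = names.pop() if len(names) == 1 else None
        collapsed.append(entry)
    return collapsed, discarded

def make(acc, taxid, seq, **taxonomy):
    """Build a record; ranks that are not supplied default to None (NA)."""
    rec = {"accession": acc, "taxid": taxid, "sequence": seq}
    rec.update({rank: taxonomy.get(rank) for rank in RANKS})
    return rec

records = [
    make("ACC1", 1, "ACGT", phylum="Chordata", genus="Badis", species="Badis assamensis"),
    make("ACC2", 2, "ACGT", genus="Badis", species="Badis badis"),
    make("ACC3", 3, "GGCC"),  # NA at every rank, so it is set aside
]
collapsed, discarded = derep_and_collapse(records)
print(collapsed[0]["phylum"], collapsed[0]["genus"], collapsed[0]["species"])
# Chordata Badis None
```

In this example the two identical sequences merge into one entry whose phylum resolves from "Chordata, NA" to "Chordata", whose genus stays "Badis", and whose species becomes NA because two species names disagree.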

The get_seeds_local() in silico PCR consistently captured a greater number of species than ECOPCR across the MiFish 12S Universal Teleost, trnL, and FITS loci (Figure 1a,b, [...] across comparable benchmark reference databases. [...] (Figure 3).
[...] Figure 3). Limiting the seeds and database generation output comparisons to only eukaryotic reads had minimal effect on the results (Supplemental Figures S9-12). We also note that the rCRUX databases were generated after the other databases; however, they include the majority of species captured by the compared methods. Together, these results benchmark rCRUX favorably against CRABS, METACURATOR, ECOPCR, RESCRIPt, and CRUX across a diversity of metabarcoding loci.
We successfully generated a total of 16 reference databases ( [...] distinct generating strategies and combine results to obtain the most comprehensive reference database possible.

[...] the ubiquity of R users in the molecular biology and ecology fields, rCRUX provides a powerful tool that is straightforward and relatively easy to implement in any computing environment. By providing researchers with an accessible reference database generating tool, we hope to alleviate the difficulties of building and updating reference databases. Thus, the ability to generate user-specific barcode reference databases will enhance metabarcoding, eDNA, microbiome, and DNA classification research efforts broadly.

One of the motivations for making reference database generating tools that are simple and easy to install, update, and maintain was to increase access to these resources across the molecular biology and ecology fields. However, limitations in the utility of reference database generating software still remain, particularly the scale of computational resources needed.
Although the iterative blast implementation of rCRUX reduces computational needs compared to the previous iterations of CRUX, the rCRUX databases presented here still relied on high-performance computing (each run was given a maximum allotment of 250 GB of RAM, 40 cores, and one week of run time on the University of Vermont - Vermont Advanced Computing Cluster [...] computing and high-performance computing resources continue to become increasingly cost effective, we hope rCRUX and similar reference database generating tools will become more accessible (Thompson & Thielen 2023). We note that rCRUX can be successfully implemented on a personal laptop with a 1 TB hard drive, 16 GB of RAM, and 8 cores, given parameters and markers that require fewer computational resources. Importantly, we designed rCRUX to be highly scalable and easy to install through R in any compute environment, allowing for adoption in future cloud computing efforts in which rCRUX could be served to a wide audience like NCBI primerTools or BLAST.

However, to specifically help address issues of access to comprehensive reference databases in the short term, we provided 16 reference databases for commonly used or emerging metabarcoding loci. These databases will be updated and curated at least annually with a unique DOI, providing important genetic resources to the broader DNA sequencing community, including those that lack access to such computational infrastructure. Future efforts will be made to grow the list of available databases as future loci become available and widely adopted.

Lastly, we demonstrate the reproducibility of rCRUX, allowing for users to make identical databases from the same starting parameters and sequence repositories (Supplemental Table 2). Providing a reproducible and stable tool for the generation of barcode reference databases ensures high-quality genetic resources that adhere to FAIR principles.

Broader Applications of rCRUX

The most immediate application of rCRUX is the generation of reference databases to support taxonomic assignment of metabarcoding data from high-throughput sequencing. [...] Health under grant number P20GM103449.
Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIGMS or NIH. Support for the development of rCRUX was also provided by the CalCOFI program. This study is PMEL contribution 5512.
Acknowledgements

This work benefited from the amazing input of many, including Lenore Pipes, Sarah Stinson, Gaurav Kandlikar, Maura Palacios Mejia, Ryan Kelly, and Kim Parsons. We want to especially acknowledge the late, great Jesse Gomer, coding extraordinaire, rCRUX co-conspirator, and dear friend, who tragically passed away before rCRUX was completed. None of this would be possible without Jesse's endless inspiration, creativity, ingenuity, and generosity.

The rCRUX package and 16 generated reference databases are available at https://github.com/CalCOFI/rCRUX. Data and code for analysis and figures will be uploaded upon acceptance of the manuscript.