Introducing an rbcL and trnL reference library to aid in the metabarcoding analysis of foraged plants from semi-arid eastern South African savannas

Abstract


Introduction
bioRxiv preprint doi: https://doi.org/10.1101/2022.10.06.511093; this version posted October 7, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Dietary analysis is a fundamental part of constructing habitat selection and utilisation models as well as assessing the influence of land use type on the plant community and how this, in turn, can influence foraging strategies [1,2]. Determining and analysing food items in ecosystems will also aid in identifying key environmental resources for the design of reliable conservation and management strategies [3,4]. To determine the composition of animal diets and how this composition reflects the [...]

Apart from the choice of the barcoding marker to be used, the efficiency and usefulness of metabarcoding as a tool are underpinned by the extent of taxonomic coverage and the quality of records available in the DNA barcode reference database against which the query sequences will be compared and identified [17,18]. However, there is a lack of [...]

[...] identification accuracy of reference sequences via primary distance-based criteria.

The fulfilment of these aims will result in a DNA reference database that can be used confidently to delineate between species found in the faeces of herbivores foraging in a semi-arid eastern South African savanna as well as to use the successful taxonomic [...] "quality" analysis.

The aligned candidate barcode reference sequences were used to reconstruct [...]

To improve the accuracy of the DNA reference database, optimal identification thresholds were determined with the threshOpt function in the R package SPIDER.

The default species identification threshold used by identification algorithms such as BLAST and BOLD is 1%, which may not be appropriate for every reference dataset. Here, the optimal threshold was 0.4% for the rbcL dataset and 0.6% for the trnL dataset, which indicates that the default threshold value of 1% would not have accurately portrayed the identification accuracy of either dataset. These optimal threshold values will be used during the simulated taxonomic identification via the three distance-based analyses, which will aid in predicting the accuracy of the reference datasets.
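The idea behind threshold optimisation can be illustrated with a minimal Python sketch. It is not the threshOpt implementation from SPIDER, and the distance matrix, labels and candidate thresholds are hypothetical toy values rather than data from this study. The sketch assumes a simplified error definition: a query counts as a false positive when a sequence with a different label lies within the threshold, and as a false negative when sequences with the same label exist in the dataset but none lies within the threshold. The candidate with the lowest cumulative error (false positives plus false negatives) is taken as optimal.

```python
from typing import Dict, Sequence

Matrix = Sequence[Sequence[float]]

# Hypothetical toy data: pairwise distances (as proportions) between four
# reference sequences belonging to two made-up genera.
LABELS = ["genusA", "genusA", "genusB", "genusB"]
DIST = [
    [0.000, 0.003, 0.008, 0.008],
    [0.003, 0.000, 0.008, 0.008],
    [0.008, 0.008, 0.000, 0.005],
    [0.008, 0.008, 0.005, 0.000],
]
CANDIDATES = [0.001, 0.004, 0.006, 0.010]


def error_counts(dist: Matrix, labels: Sequence[str], threshold: float) -> Dict[str, int]:
    """Count false positives and false negatives at one distance threshold."""
    false_pos = 0
    false_neg = 0
    n = len(labels)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        within = [j for j in others if dist[i][j] <= threshold]
        if any(labels[j] != labels[i] for j in within):
            # A differently labelled sequence falls inside the threshold.
            false_pos += 1
        elif any(labels[j] == labels[i] for j in others) and not within:
            # Same-label sequences exist, but none is close enough.
            false_neg += 1
    return {
        "false_pos": false_pos,
        "false_neg": false_neg,
        "cumulative_error": false_pos + false_neg,
    }


def optimise_threshold(dist: Matrix, labels: Sequence[str],
                       candidates: Sequence[float]) -> float:
    """Return the candidate with the lowest cumulative error.

    Ties are resolved in favour of the candidate listed first.
    """
    return min(candidates,
               key=lambda t: error_counts(dist, labels, t)["cumulative_error"])


best = optimise_threshold(DIST, LABELS, CANDIDATES)
print(f"optimal threshold: {best:.3f} ({best * 100:.1f}%)")
```

With these toy values, thresholds that are too strict (0.001, 0.004) leave same-genus sequences unmatched, while one that is too loose (0.010) pulls in the other genus; only 0.006 gives a cumulative error of zero.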

Table 1: Discriminatory power of the rbcL and trnL reference datasets, as predicted by three distance-based measures (nearest neighbour, Meier's best close match, and the BOLD identification criterion) at both default and optimised thresholds, with singletons included and excluded for the respective datasets. All three analyses were performed to identify queries to genus level within the respective identification thresholds.

False: the nearest species index is not the same as the tested individual; True: the nearest species index is the same as the tested individual. Note that the optimal threshold could not be changed from its default in the nearestNeighbour() function.
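To make the three distance-based criteria named in Table 1 concrete, the following Python sketch re-implements them in simplified form. It follows the way these criteria are commonly described (nearest neighbour returns True or False; best close match and the BOLD-style threshold criterion return "correct", "incorrect", "ambiguous" or "no id"), but it is an illustration under those assumptions and not the SPIDER code used in this study. All labels, distances and thresholds are hypothetical toy values.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]

# Hypothetical toy data: five sequences, the last one a singleton genus.
LABELS = ["A", "A", "B", "B", "C"]
DIST = [
    [0.000, 0.002, 0.020, 0.022, 0.030],
    [0.002, 0.000, 0.021, 0.023, 0.031],
    [0.020, 0.021, 0.000, 0.008, 0.005],
    [0.022, 0.023, 0.008, 0.000, 0.012],
    [0.030, 0.031, 0.005, 0.012, 0.000],
]


def _others(i: int, n: int) -> List[int]:
    """Indices of every sequence except the query itself."""
    return [j for j in range(n) if j != i]


def _classify(query_label: str, matched_labels: set) -> str:
    """Shared outcome logic for the two threshold-based criteria."""
    if not matched_labels:
        return "no id"
    if matched_labels == {query_label}:
        return "correct"
    if query_label in matched_labels:
        return "ambiguous"
    return "incorrect"


def nearest_neighbour(dist: Matrix, labels: Sequence[str]) -> List[bool]:
    """True when the single closest sequence shares the query's label.

    No threshold is involved; ties go to the first closest index.
    """
    n = len(labels)
    return [
        labels[min(_others(i, n), key=lambda j: dist[i][j])] == labels[i]
        for i in range(n)
    ]


def best_close_match(dist: Matrix, labels: Sequence[str],
                     threshold: float) -> List[str]:
    """Judge only the closest match(es), and only if within the threshold."""
    n = len(labels)
    results = []
    for i in range(n):
        others = _others(i, n)
        best = min(dist[i][j] for j in others)
        closest = set()
        if best <= threshold:
            closest = {labels[j] for j in others if dist[i][j] == best}
        results.append(_classify(labels[i], closest))
    return results


def threshold_id(dist: Matrix, labels: Sequence[str],
                 threshold: float) -> List[str]:
    """Judge every match that falls within the threshold (BOLD-style)."""
    n = len(labels)
    return [
        _classify(labels[i],
                  {labels[j] for j in _others(i, n) if dist[i][j] <= threshold})
        for i in range(n)
    ]


print(nearest_neighbour(DIST, LABELS))
print(best_close_match(DIST, LABELS, threshold=0.006))
print(threshold_id(DIST, LABELS, threshold=0.010))
```

In this toy example the singleton (label "C") can never be identified correctly by any of the three criteria, because no other sequence with the same label exists in the dataset. This is one reason success rates are often reported both with and without singletons.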

The nearest-neighbour (k-NN) criterion performed the best among the distance-based measures during the identification simulations (Table 1). The best success rates across all three methods were seen with the exclusion of singletons, in this case, genera [...]

[...] (Table 1) was not considered in this part of the study, as this would have led to a [...]

During this study, we developed a DNA barcode reference library that is robust enough to identify taxa on the list of species that we curated for some foraged plants from the semi-arid eastern South African savanna. All the different tests used to validate the use and accuracy of this library indicate that it can be used with confidence to assign taxonomies to plants found in the eastern savannas of South Africa. We envisage that it will add not only to similar research on the local flora but also to the work done on savannas elsewhere on the African continent. The datasets for rbcL and trnL are not presented as a complete DNA reference library, but rather as two databases that should be used in unison to identify species foraged on in the semi-