naturaList: a package to classify occurrence records in levels of confidence in species identification

Biodiversity databases hold a large volume of occurrence records, but researchers should verify the quality of these records before using them in scientific studies. One problem that can compromise the quality of occurrence data is species misidentification. We address this issue by presenting naturaList, an R package designed to classify species occurrence data according to identification reliability. naturaList classifies species occurrences into up to six levels of confidence in species identification and filters occurrence data accordingly. The highest level of confidence is assigned to records identified by a specialist, whose name must be provided by the user. The other five levels of confidence are derived from the occurrence data. We demonstrate the naturaList functions using occurrences of Alsophila setosa, a tree fern species from the Atlantic Forest, as an example. We classified the data and filtered them in grid cells in order to keep only the highest-level record in each cell. Then we selected only those records classified in the two highest levels of confidence. Of 323 occurrences of Alsophila setosa with geographic coordinates, 69 (21%) were identified by a specialist. After filtering the highest-level records inside grid cells, 102 records remained. Of these grid-cell-filtered records, 38 (37%) were classified in the highest confidence level. Three records were removed using an interactive map module because they fell in the sea or outside the native range of the species. Since we selected only records classified in the two highest levels of confidence, the final dataset contained 94 occurrence records. naturaList makes the processing and cleaning of occurrence data reproducible. Macroecologists, biogeographers and taxonomists might benefit from using the naturaList package to evaluate the quality of species identification in occurrence data and to identify sites where the taxonomic classification of species needs evaluation.

Names and general description of the functions of the naturaList package.

Function         Description
create_spec_df   Creates a specialist data frame from a character vector.
get_det_names    Facilitates the search for non-taxonomist strings in the 'determined.by' column of the occurrence records dataset.
grid_filter      Selects the occurrence record with the highest confidence in species identification inside each grid cell.
map_module       Allows the occurrence records to be checked in an interactive map module.

The classification into confidence levels of species identification is done with the classify_occ function. This function requires a data frame with the occurrence data and a data frame with specialist names. Each occurrence record is compared with the criteria from lower to higher levels of confidence and is flagged with the highest criterion it meets (Fig. 1). By default, the classify_occ function uses six confidence levels (see Table 2), which, except for level 1, can be reordered to suit the study. The output is a data frame containing the occurrence records with all the original information present in the input dataset plus a column named 'naturaList_levels', indicating the confidence level of each occurrence record (codes are presented in Table 2).

The occurrence dataset required by classify_occ must have at least the columns with the information shown in Table 3. By default, these column names [...] function, which is ready to use in classify_occ.

To classify an occurrence record as identified by a specialist, the algorithm in classify_occ first checks whether the surname of one or more specialists is present in the [...] all strings that need verification by the user to ensure that they correspond to a specialist name. In the manual verification, the function asks the user whether the string printed in the R console corresponds (y) or does not correspond (n) to a specialist's name. The classification process depends both on the quality of the specialist dataset [...]

The grid_filter function selects the occurrence with the highest level of confidence inside a spatial extent, given by a cell size defined by the user or taken from a raster layer.
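The grid-cell selection described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper best_per_cell, the toy data and the column names (id, lon, lat, level) are hypothetical and chosen only for illustration. Cells are assumed to be squares of a fixed size in degrees, and level is assumed to be a number in which 1 is the most reliable. Ties within a cell are simply left in input order here, which is a simplification.

```r
# Toy occurrence table; all column names and values are illustrative only.
occ <- data.frame(
  id    = 1:6,
  lon   = c(-50.10, -50.20, -50.30, -49.40, -49.45, -48.90),
  lat   = c(-27.10, -27.20, -27.30, -27.60, -27.70, -26.20),
  level = c(3, 1, 2, 4, 2, 5)   # 1 = most reliable identification
)

# Hypothetical helper: keep the best-classified record in each grid cell.
best_per_cell <- function(occ, cell_size = 0.5) {
  # Label each record with the grid cell that contains its coordinates.
  cell <- paste(floor(occ$lon / cell_size),
                floor(occ$lat / cell_size), sep = "_")
  # Sort by cell, then by level, so the most reliable record comes first.
  ord <- order(cell, occ$level)
  occ_sorted  <- occ[ord, ]
  cell_sorted <- cell[ord]
  # Keep only the first (best) record of each cell.
  kept <- occ_sorted[!duplicated(cell_sorted), ]
  kept[order(kept$id), ]
}

filtered <- best_per_cell(occ, cell_size = 0.5)
print(filtered)
```

In this toy table, records 1 to 3 share one cell and records 4 and 5 share another, so one record per occupied cell is returned and lower-level records that share a cell with a better one are dropped.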

Thus, this function returns a data frame with only the best-classified occurrence record [...] function gives priority to the occurrence that was 1) more recently determined; 2) more [...] occurrences is randomly chosen.

The map_module function is an interactive application that allows the user to [...]

To demonstrate the use of the naturaList package, we conducted a classification of the [...] (GBIF, 2019). We assumed that the occurrence records had no issues regarding geographical coordinates and did not conduct any procedure to remove records before [...] occurrence points were inside the native range using the map_module function. We [...] species occur in South America, to define the native distribution range of A. setosa. The code used to produce this example is provided in the supporting information. Additionally, an introduction to the package can be found with the code: [...]

The dataset of Alsophila setosa has 508 occurrence records. After using classify_occ, the number of records decreased to 323 due to records without coordinates in the raw GBIF [...] The dataset with 323 classified occurrence records was then filtered in grid cells [...] The grid cell filtering increased the share of occurrences classified in level 1 (from 21.1 to 37.3%) by removing occurrences classified in lower levels that were located in the same grid cell.

Finally, we used this dataset of 102 occurrences in map_module to visually check for potential errors and to select only occurrence records in levels 1 and 2 (Fig. 2). First, we deleted three occurrence records by switching on the button 'delete points with click' (Fig. 2b-II) and clicking on them; two of them were deleted because they were located in the sea and one because it was outside the native range recognized by Flora do Brasil (the 'x' in Fig. 2b). Then, we selected levels 1 and 2 to be kept in the output dataset (Fig. 2b-I). The currently selected records can be viewed above the 'Done!' button (Fig. 2b-III). Note that the selection made in Fig. 2b-IV only controls which points are displayed on the map. Polygons drawn with the tools on the left side of the map (Fig. 2b-V) can also serve as a spatial selection. Finally, we clicked on the 'Done!' button to assign the selected points to an R object. At the end of this procedure, 94 occurrence records were kept in the occurrence dataset.

Our example demonstrated that naturaList enables researchers to classify and quantify, [...] of confidence is suitable to their study objective. Our example also showed that the proportion of high-quality records can be improved by using the grid_filter function, which might therefore enhance the quality of data used in species distribution models (SDMs). This increase, [...] from which the function will use the cells to conduct the filtering.
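The record counts reported for the Alsophila setosa example can be cross-checked with a few lines of R. The sketch below uses only numbers stated in the text and the abstract; the variable names are illustrative. The last quantity is not reported in the text and is derived here from the other counts.

```r
# Counts reported in the text for the Alsophila setosa example.
n_grid    <- 102  # records remaining after grid-cell filtering
n_level1  <- 38   # of those, records classified in level 1
n_deleted <- 3    # records deleted by hand in the interactive map
n_final   <- 94   # records kept after also selecting levels 1 and 2

# Share of level 1 records after grid-cell filtering, in percent.
pct_level1 <- round(100 * n_level1 / n_grid, 1)

# Records left after the manual deletions.
n_after_map <- n_grid - n_deleted

# Derived, not reported in the text: records that survived the manual
# deletions but were classified in levels other than 1 and 2.
n_dropped_by_level <- n_after_map - n_final

print(pct_level1)          # 37.3
print(n_after_map)         # 99
print(n_dropped_by_level)  # 5
```

The first value matches the 37.3% reported for level 1 after grid-cell filtering, and the remaining two show how the 102 filtered records reduce to the final 94.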

Although SDM is the most popular application aiming to produce accurate estimates of species distribution, occupancy models have also been used for this [...] regarding confidence in the identification might be useful for occupancy models that [...]

The tools described here could also be useful for taxonomists, who might [...] occurrence of the species, by visiting the areas or a scientific institution with preserved material, thereby improving the quality of identification.

On the one hand, the tools in naturaList enable researchers to account for the quality of species identification in big datasets and to report the steps used in data cleaning, which enhances the reproducibility of such studies. On the other hand, these tools might enhance the quality of species identification by guiding taxonomic specialists to revise specimens with low determination confidence.