nzffdr: an R package to import, clean and update data from the New Zealand Freshwater Fish Database

The New Zealand Freshwater Fish Database (NZFFD) is a repository of more than 155,000 records of freshwater fish observations from around New Zealand, maintained by the National Institute of Water and Atmospheric Research (NIWA). Records from the NZFFD can be downloaded using a web interface. The statistical computing language R is now widely used for data wrangling, analysis, and visualisation. Here, we present nzffdr, an open source R software package that: i) allows users to query and download data from the NZFFD directly in R, ii) provides functions to clean imported data, iii) facilitates the addition of information such as species names and Department of Conservation threat classification status, and iv) provides a workflow for visualising information from the NZFFD. The nzffdr package aims to standardise, simplify, and speed up a workflow likely already used in an ad hoc manner by scientists across New Zealand and abroad.


Introduction
The New Zealand Freshwater Fish Database (NZFFD) contains over 155,000 observations of freshwater fish (plus freshwater shrimp and kōura) from across New Zealand dating back to 1901 (Crow, 2017). The observations typically include information on sampling location, date, time, fishing method, and the organisation that conducted the survey; information on the number and size of individuals caught is included less frequently. The database is a remarkable asset and is widely used to inform academic and governmental research and decision making (Goodman et al., 2014; Joy & Death, 2004). A limitation of the NZFFD is that it lacks some basic variables that users need to add each time they analyse NZFFD data; for example, species' common and scientific names are not included (only a six-letter species code is), nor is any other taxonomic information (e.g. family), threat classification status, or whether the species is native or introduced. Adding this information each time data are downloaded is not trivial and can be time consuming if many records are downloaded.

The statistical computing language R (R Core Team, 2020) has increased in popularity over the last decade, and is now one of the most common programming languages used by ecologists (Lai et al., 2019). R is typically used for data wrangling, analysis, and visualisation and is a popular tool for interrogating NZFFD data (Jellyman & Harding, 2012; Jones & Closs, 2015; Leathwick et al., 2006).

Here we present the nzffdr R package. We describe the features of each of the core functions in the nzffdr package and then illustrate how the functions can be used via an analysis of NZFFD data.

Methodology
The nzffdr package has four core functions and four core datasets. The four core functions: i) import NZFFD data from within R, ii) clean up a variety of spelling inconsistencies and add a new variable "form", which describes the sampling habitat (e.g. river, stream, wetland), iii) add missing information such as family, genus and species names, common names, Department of Conservation threat classification status (Dunn et al., 2017), and whether the species is native or introduced, and iv) import and attach associated River Environment Classification (REC) data.

The four built-in datasets are: i) a subset of 200 rows from the NZFFD that can be accessed without an internet connection and used for exploratory analysis, ii) the different fishing methods included in the NZFFD; it is possible to search the database using these terms so they are provided for reference, iii) scientific and common names of all species included in the NZFFD; the database can be searched by species name (using scientific or common names) so these are provided for reference, and iv) a simplified version of the 1:150k NZ map outline available from Land Information New Zealand (https://data.linz.govt.nz/layer/50258-nz-coastlines-topo-150k/) to facilitate easy mapping of species' distributions.

Importing data: nzffdr_import()
The nzffdr_import() function is used to search the NZFFD and takes input arguments that align with the search options of the NZFFD web user-interface. There are seven search arguments:
(1) catchment: this refers to the catchment number, a 6-digit number unique to the reach of interest. Search using an individual reach number (e.g. catchment = "702.500") or, for all rivers in a catchment, use the wildcard search term (e.g. catchment = "702%").
(2) river: search for a river by name; for example, to get all records for the Clutha (river = "Clutha").
(3) location: search for a river by sampling locality; for example, to get all records from Awakino (location = "Awakino").
(4) fish_method: search by the fishing method used; for example, to get all records where fish were caught using a seine net (fish_method = "Other net -Seine"). There [...]

Not specifying the arguments will return all possible records. The nzffdr_import() function requires an internet connection to query NIWA's database.
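The difference between an exact catchment search and a wildcard search can be illustrated with a small base R sketch. This is not how nzffdr or the NZFFD server performs the search; the helper match_catchment() and the catchment numbers below are hypothetical, and the sketch assumes that "%" only appears at the end of the search term, as in the example above.

```r
# Toy catchment numbers, invented for illustration only.
catchments <- c("702.500", "702.510", "703.000", "170.200")

# Hypothetical helper (not part of nzffdr): a trailing "%" is treated as
# "match anything that starts with the preceding text"; otherwise the
# search term must match the catchment number exactly.
match_catchment <- function(x, pattern) {
  if (endsWith(pattern, "%")) {
    startsWith(x, sub("%$", "", pattern))
  } else {
    x == pattern
  }
}

# Exact search: a single reach.
exact <- catchments[match_catchment(catchments, "702.500")]

# Wildcard search: every reach whose number begins with "702".
prefix <- catchments[match_catchment(catchments, "702%")]
```

Here the exact search keeps one reach, whereas the wildcard search keeps every reach in the toy vector whose number starts with "702".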

Cleaning imported data: nzffd_clean()
While the data imported from the NZFFD generally contain few errors, there are some small inconsistencies (e.g. in the spelling of river and place names); the nzffd_clean() function aims to fix these. The first letter of every word in the columns "catchname" and "locality" is capitalised, and any non-alphanumeric characters are removed. Observations in the "time" column are converted to a standardised 24-hour format and nonsensical values (e.g. "0.677") are converted to "NA". The organisation column ("org") is converted to all lowercase and has non-alphanumeric characters removed. The NZMS260 map code ("map") is converted to lower case and any non-three-digit codes are converted to "NA". Observations in the catchment name column ("catchname") are standardised, e.g. "Clutha River", "Clutha r" and "Clutha river" all become "Clutha R". Finally, a new variable "form" is added, which defines each observation as one of the following: creek, river, tributary, stream, lake, lagoon, pond, burn, race, dam, estuary, swamp, drain, canal, tarn, wetland, reservoir, brook, spring, gully or NA. The "form" variable is created by matching the above "forms" with the "locality" column; therefore, it reflects the description given by the "locality" variable.
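Some of the cleaning steps described above can be sketched in base R on toy data. This is a minimal illustration rather than the nzffd_clean() implementation: the helper names (strip_non_alnum(), cap_words(), standardise_river(), find_form()) and the example records are hypothetical, and the rule that the first matching form in the listed order wins is an assumption.

```r
# Toy records, invented to mimic the inconsistencies described in the text.
dat <- data.frame(
  catchname = c("clutha river", "Clutha r", "waitaki R.", "Taieri river"),
  locality  = c("lake dunstan outlet", "unnamed creek", "Ohau canal",
                "bridge at SH1"),
  stringsAsFactors = FALSE
)

# Remove anything that is not a letter, a digit or a space.
strip_non_alnum <- function(x) gsub("[^[:alnum:] ]", "", x)

# Capitalise the first letter of every word.
cap_words <- function(x) gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)

# Collapse a trailing "River" or "R" (any case) to the single suffix "R".
standardise_river <- function(x) {
  sub("\\s+(river|r)$", " R", x, ignore.case = TRUE, perl = TRUE)
}

dat$catchname <- standardise_river(cap_words(strip_non_alnum(dat$catchname)))
dat$locality  <- cap_words(strip_non_alnum(dat$locality))

# Habitat forms, in the order listed in the text.
forms <- c("creek", "river", "tributary", "stream", "lake", "lagoon", "pond",
           "burn", "race", "dam", "estuary", "swamp", "drain", "canal", "tarn",
           "wetland", "reservoir", "brook", "spring", "gully")

# Assumption: the first form (in the order above) found as a whole word in
# the locality description is used; NA is returned when none is found.
find_form <- function(locality) {
  vapply(tolower(locality), function(loc) {
    is_hit <- vapply(forms, function(f) {
      grepl(paste0("\\b", f, "\\b"), loc, perl = TRUE)
    }, logical(1))
    if (any(is_hit)) forms[is_hit][1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

dat$form <- find_form(dat$locality)
```

In this toy example the four spellings of the catchment names collapse to the "Name R" pattern, and the locality without a recognisable habitat word receives NA for "form".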

Adding River Environment Classification data: nzffd_add()
Finally, network topology and environmental information from the River Environment Classification (REC) database (Snelder & Biggs, 2004) can be added to the NZFFD data using nzffd_add(). This function takes the NZFFD "nzreach" variable and matches it against the corresponding "NZREACH" variable in the REC database, and imports all the associated REC data, adding 24 new columns to the NZFFD dataset. This function requires an internet connection to query the REC database.
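The matching step described above amounts to a left join on the reach identifier, which can be sketched with base R's merge() on toy data. This is an illustration only, not the nzffd_add() implementation: the reach numbers, species codes, and the two REC column names (stream_order, dist_sea_km) are hypothetical placeholders, and only two of the 24 REC columns are mimicked.

```r
# Toy NZFFD-style records; reach numbers and species codes are invented.
fish <- data.frame(
  nzreach = c(13000001, 13000002, 13000099),
  spcode  = c("galmac", "angdie", "galbre"),
  stringsAsFactors = FALSE
)

# Toy REC-style table; the column names other than NZREACH are hypothetical.
rec <- data.frame(
  NZREACH      = c(13000001, 13000002, 13000003),
  stream_order = c(3, 1, 2),
  dist_sea_km  = c(12.5, 80.2, 45.0)
)

# Left join: keep every fish record and attach REC columns where the reach
# numbers match; records without a matching reach get NA in the REC columns.
joined <- merge(fish, rec, by.x = "nzreach", by.y = "NZREACH", all.x = TRUE)
```

Using all.x = TRUE keeps every fish record, so an observation whose reach number is absent from the REC table is retained with missing values rather than dropped.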

Illustration of nzffdr functionality
To demonstrate the utility of the nzffdr package we imported the entire NZFFD into R, cleaned up the imported data, filled in missing data, and added the REC database. We then highlight the usefulness of some of the new variables that nzffdr has added to the NZFFD dataset. Specifically, we map the distribution of native and introduced species, plot the relative proportion of records across habitat forms for each of the Galaxias species, highlighting their respective conservation status, and finally use the REC data to show the distance inland that each of the Galaxias species has been found. All analysis was carried out using R v 4.1.0 (R Core Team, 2020).

We plotted the distribution of introduced and native species records from the NZFFD (Fig. 1), where the introduced/native variable and the map of New Zealand are provided by the nzffdr R package. We then graphed the relative number of records occurring across 10 habitat forms for each of the Galaxias species, including information about each species' threat classification status (Fig. 2). Habitat form, threat classification, and species common names have all been added to NZFFD data via the nzffdr package. Finally, the distance to sea (km) at each location where Galaxias species have been observed in the NZFFD was plotted (Fig. 3). The distance to sea variable is added to the NZFFD data from the River Environment Classification database via the nzffdr package. This analysis illustrates some of the functionality offered by the nzffdr