A review of the International Seabed Authority database DeepData: challenges and opportunities in the UN Ocean Decade

There is an urgent need for quality biodiversity data in the context of rapid environmental change. Nowhere is this more urgent than in the deep ocean, with the possibility of seabed mining moving from exploration to exploitation, but where vast knowledge gaps persist. Regions of the seabed beyond national jurisdiction, managed by the International Seabed Authority (ISA), are undergoing intensive mining exploration, including the Clarion-Clipperton Zone (CCZ). In 2019 the ISA launched its database 'DeepData', publishing environmental (including biological) data; and since June 2021, DeepData records have been harvested by OBIS (Ocean Biodiversity Information System) via the ISA node. Here we explore how DeepData could support biological research and environmental policy development in the CCZ (and wider ocean regions), and whether data are Findable, Accessible, Interoperable and Reusable (FAIR). Given the direct connection of DeepData with the regulator of a rapidly developing potential industry, this review is particularly timely. We found evidence of extensive duplication of datasets, an absence of unique record identifiers, and significant taxonomic data quality issues, compromising the FAIRness of the data. The publication of DeepData records on the OBIS ISA node has led to large-scale improvements in data quality and availability. However, limitations in the usage of identifiers and issues with taxonomic information were also evident in datasets published on the node, stemming from mis-mapping of data from the ISA environmental data template to the data standard Darwin Core prior to data harvesting by OBIS. While notable data quality issues remain, these changes signal a rapid evolution for the database and significant movement towards integration with global systems through usage of data standards and publication on global aggregators. This is exactly what has been needed for biological datasets held by the ISA. We provide recommendations for future development of the database to support this evolution towards FAIR data.


Further, given the direct connection of the regulator and database, it also illustrates how these data could be synthesised and applied to environmental management, for example in developing tools such as the Regional Environmental Management Plan (REMP) or the design of Areas of Particular Environmental Interest (APEIs; see Figure 1 and Smith et al., (35)).

In this study, we provide the first review of DeepData, focussed on the biological data available for the most active area of seabed mining (the CCZ), and include recommendations for the development of this database into the future. This work is particularly timely given that DeepData has now been operational for four years; associated records are being actively pushed onto global data aggregators such as OBIS, GBIF and the INSDC (International Nucleotide Sequence Database Collaboration), and OBIS is also now publishing DeepData records via the OBIS ISA node (3). A more critical point, however, is the context of the rapid recent development of deep seabed mining regulations and the urgent need to address deep-sea biodiversity data gaps, both for the CCZ and other regions (36,37). Here we conduct an assessment of the database and wider related ISA biological/environmental data management as part of a broader study in which we synthesise the biodiversity and biogeographic data available from DeepData and associated databases for the CCZ (Rabone et al., in prep). The primary purpose of this review, therefore, is to assess the FAIRness of published biological data in DeepData, and the potential utility of the database to support both research and decision-making for environmental policy.

Overview of DeepData and description of the online data portal

The ISA DeepData website or online data portal provides biological, geochemical, and physical data collated from expeditions arranged by contractors for the CCZ and other exploration regions. The map-based interface includes boundary data (e.g., shapefiles) depicting APEIs, mining exploration contract areas, and reserved mining exploration areas (Figure 1). Here 'Points' equate to deployments (sampling events) collected from a particular point in space and time, e.g. a box core; and 'Trawl lines' to those collected from sampling between two points, e.g. an ROV or via towed gear such as a Brenke Epibenthic Sledge trawl sample.

The structure of the DeepData output had observations distributed both over rows and columns, i.e. in both 'wide' and 'long' format (38), requiring additional data processing steps. Wide format holds one record or observation per row; in 'long' format, one record or observation is split across multiple rows. All data were in wide format until the fields 'Analysis' and 'Result', where these data fields were 'paired', i.e. 'Result' data values pertain to the adjacent field 'Analysis', and these data were therefore structured in long format. The field 'Analysis' is a list of column headings, e.g. 'Taxonomist'. This structuring of data both across rows and columns has produced significant redundancy in the data: only 5 columns are shown (S Table 1), but there were 48 columns in total for 'Point' data (and 49 for 'Trawl Line' data), the majority containing redundant repeated data, 39,066,594 cells in total. This redundancy will therefore multiply as more datasets are added to the database, and is likely to significantly impact processing speeds.
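The paired long-format structure described above can be sketched in Python. The field names ('SampleID', 'Analysis', 'Result') follow the export described in the text; the example rows and values are invented for illustration, and the real export has many more columns:

```python
from collections import defaultdict

def pivot_analysis_result(rows):
    """Collapse long-format 'Analysis'/'Result' pairs into one wide-format
    record per SampleID. Each row's 'Analysis' value names a column; the
    adjacent 'Result' value is that column's content."""
    wide = defaultdict(dict)
    for row in rows:
        record = wide[row["SampleID"]]
        # Copy the wide-format fields (repeated identically on every row).
        for key, value in row.items():
            if key not in ("Analysis", "Result"):
                record[key] = value
        # Promote the Analysis name to its own column holding the Result.
        record[row["Analysis"]] = row["Result"]
    return dict(wide)

rows = [
    {"SampleID": "S1", "Phylum": "Annelida", "Analysis": "Taxonomist", "Result": "A. Smith"},
    {"SampleID": "S1", "Phylum": "Annelida", "Analysis": "IdentificationMethod", "Result": "Morphological"},
]
wide = pivot_analysis_result(rows)
# One record per sample; 'Taxonomist' and 'IdentificationMethod' are now columns.
```

This is the extra processing step the long-format pairing forces on every user of the export; a fully wide export would make it unnecessary.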
Another export option was available, 'export pivot query'; this option has all data in wide format, but was not used in the analysis, as during initial exploratory investigations it appeared to differ in visual formatting only, and 'export query' appeared to be the default format.

Data on size class ('nominalSizeCategory') were also often missing, despite this being a required field in the data template. In some cases, omission of information has produced inaccuracies. For example, in the field 'Identification Method', for recording how taxa were identified, text entries were present as 'Morphological' or 'DNA', but not as a combined entry, i.e. 'Morphological and DNA'. This data recording is an artefact of an earlier iteration of the data template, where only one method could be recorded in the field, and can give the impression that an identification was made with only one method even when this was not the case. As a wider point, data from the majority of cruises are yet to be published on the database: 103 cruises have been carried out in the CCZ (ISA Secretariat, pers. comm.), but records from 24 cruises, and ten contractors in total, have been published to date (Table 1; Rabone & Glover, in review). It is unclear if this is entirely due to a data backlog or if there are cases of active contractors who have not submitted data. While substantial data processing (and in some cases, interpretation) was required for taxonomy and, to a lesser extent, sampling information, site data in contrast required minimal processing.

The duplicates stem primarily from issues with identifiers. The database export lacks a record identifier (or primary key) and uses the specimen identifier field 'SampleID' to reconcile records (Sheldon Carter, pers. comm.). In theory, any records submitted year on year with the same ID should therefore be matched and associated data updated if changed.
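A minimal sketch of such SampleID-based reconciliation, assuming a simple list-of-dicts representation (the actual DeepData implementation is not public). It also shows why relying on a specimen identifier is fragile: two genuinely different records sharing a SampleID would silently collide.

```python
def upsert_by_sample_id(existing, submission):
    """Reconcile an annual submission against existing records using
    'SampleID' as the matching key. Records with a known SampleID are
    updated in place; new SampleIDs are appended. Illustrative only."""
    index = {rec["SampleID"]: rec for rec in existing}
    for rec in submission:
        if rec["SampleID"] in index:
            index[rec["SampleID"]].update(rec)  # refresh changed fields
        else:
            index[rec["SampleID"]] = rec        # genuinely new record
    return list(index.values())

existing = [{"SampleID": "BC01-001", "ScientificName": "Polychaeta"}]
submission = [
    {"SampleID": "BC01-001", "ScientificName": "Hesionidae"},  # updated identification
    {"SampleID": "BC01-002", "ScientificName": "Isopoda"},     # new record
]
merged = upsert_by_sample_id(existing, submission)
```

With a persistent, unique record identifier (occurrenceID) the same logic would be safe; keyed on SampleID alone, it cannot distinguish an update from a collision.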
Several fields were included in the database output that are not required. For example, backend database names were present: 'AreaKey', 'ClusterID', and 'BlockID'; for the latter two, no data entries were present in any case. While the search was for polymetallic nodule data only, the output included fields for vents and sulphide deposits, including 'HydrothermalActivity', 'HydrothermalVentAge', and 'ExtensionPMSSite'. Additional fields were present for taxonomic information, e.g. 'Subfamily', the only sub- or super-taxonomic classification field included. Both the reason for its inclusion and the rules around its usage are unclear, as it has been used not for subfamily names, but rather as a field to capture morphospecies, even though there are two separate fields for recording this in the output.

Other key fields were still absent, however, such as a record identifier field that is persistent and unique (equivalent to occurrenceID in DwC; see List of Terms), as distinct from a specimen identifier, i.e. SampleID (equivalent to catalogNumber in DwC). Another key field missing from the template is an equivalent for the DwC field 'basisOfRecord' for designating record type, for example 'machineObservation' for an ROV-derived record, or 'preservedSpecimen' for a specimen-based one. As in the database output, superfluous fields were present. 'OrgNum', for example, is a required field ('TaxaID' in the previous template) but is an arbitrary number to provide a composite key for ISA data processing. It is therefore a backend column name and as such a redundant field that does not capture any existing data in contractor datasets. It also necessitates an additional processing step by contractors and has the potential to cause confusion. Subfamily is included, but as indicated earlier, this field is not necessary.
For the other tabs within the template, superfluous data fields were also present, e.g. 'Target latitude'/'Target longitude' in the point/towed gear sample tabs.

The DwC term 'occurrenceID' is a key, required field for a persistent, unique record identifier. Here occurrenceID has been generated as a composite key, combining 'StationID'/'TrawlID' and 'SampleID'. There were duplicates present in this composite key, however. These duplicates were identified by the OBIS secretariat, and at the start of the OBIS processing pipeline records were allocated a separate unique identifier. Because of these duplicates in occurrenceID in the DeepData records, a proportion of records cannot be definitively matched between the two databases. Further, the occurrenceID as a non-unique composite key is not present in the DeepData output, only in the JSON files mapped to DwC (and therefore in the OBIS ISA node records); the composite key would therefore need to be generated with the same formatting to allow any cross-referencing between the records from DeepData and OBIS, i.e. there is not a common record identifier. Even after adding the composite key and comparing the records, they do not, of course, match, because the identifier is not unique (and there is different data processing for the DeepData output versus the records on the OBIS ISA node). Overall, the numbers of records for benthic metazoans differed: 40,518 on DeepData and 48,554 in OBIS. This appears to be due in part to slightly more datasets being published on OBIS than on DeepData at the time of download, but it could not be clearly ascertained because of the underlying identifier issue. In conclusion, standardisation of data to DwC terms to prepare the DeepData records for harvesting by OBIS has been a significant step forward, but incorrect data mapping in the process has also compromised data quality.
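To illustrate, a sketch of rebuilding such a composite key and flagging duplicates. The exact concatenation format used in the ISA JSON mapping is not documented here, so the underscore separator and field precedence below are assumptions:

```python
from collections import Counter

def composite_occurrence_id(record):
    """Rebuild a composite occurrenceID from 'StationID' (or 'TrawlID')
    plus 'SampleID'. The separator and ordering are illustrative
    assumptions, not the ISA's actual formatting."""
    station = record.get("StationID") or record.get("TrawlID")
    return f"{station}_{record['SampleID']}"

def find_duplicate_ids(records):
    """Return composite IDs that occur more than once, i.e. records that
    cannot be definitively matched between databases."""
    counts = Counter(composite_occurrence_id(r) for r in records)
    return sorted(cid for cid, n in counts.items() if n > 1)

records = [
    {"StationID": "ST04", "SampleID": "BC12"},
    {"StationID": "ST04", "SampleID": "BC12"},  # same composite key: ambiguous match
    {"StationID": "ST05", "SampleID": "BC12"},
]
dupes = find_duplicate_ids(records)  # ['ST04_BC12']
```

Because the key is built from fields that legitimately repeat, uniqueness cannot be guaranteed; a minted, persistent occurrenceID stored in both systems would remove the ambiguity entirely.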

The ISA has met a significant challenge to reconcile and publish often variable datasets from contractor annual environmental data submissions. It is a notable achievement that significant biological data holdings (>50,000 records) are now published and available on the database. The 2022 template is also an improvement on the previous version. Through publishing of DeepData records on OBIS, and in the process mapping data to DwC, some key issues have been addressed and the biological data can now, in part, be classified as FAIR (although reusability is compromised). Despite the issues detailed here, DeepData is a major step forward in developing a centralised repository of biodiversity data in ABNJ and, given that there have only been four years of development since public release, it is already of great potential value in developing local and regional environmental management plans for this region, and others of our planet that are undergoing rapid industrial exploration.

In a separate study, we have made the first attempt to survey all metazoan biodiversity data from the CCZ using DeepData and published species records (Rabone et al., in prep). These kinds of regional syntheses would not be possible without the significant efforts of the ISA DeepData team. DeepData provides a crucial source of 'raw' occurrence data that are rarely available in publications, even as supplementary files, as revealed in the parallel study. A broader point is that the timing of this work has coincided with a phase of rapid evolution of the database, and the Secretariat is aware of the limitations discussed here and is actively working to address them (ISA Secretariat, pers. comm.). There are significant improvements to be made, however, that can address the key data quality issues, with the result of greater utility of the data.
It is important to note here that the scope of our study is limited to biological data in the CCZ. Many other data types, such as geochemistry data, are collected by contractors and held by the database. The FAIRness of these data should also be assessed in depth, especially given these data are only available through DeepData itself, and not also as Darwin Core published on OBIS. Geological data, being confidential, may be a more complex case, but the potential for greater transparency could be explored, as this would have significant scope for improved understanding of ecosystems in the region. Here we provide key recommendations with the aim of improving data quality for both research and environmental policy. These recommendations are also depicted as a potential workflow in Figure 3 and summarised in Table 2.

We make a key recommendation that the ISA update the current environmental contractor data submission template with a DwC-compliant version, with all fields (column headings) in DwC format. Darwin Core is a global, community-led, well-established data standard in wide usage, and the DwC terms are clearly understood, with a readily available, easy-to-read reference guide (https://dwc.tdwg.org/terms/). To accompany this, we recommend that rules are also incorporated into the template to ensure required fields (e.g. occurrenceID) are populated. Contractors or other stakeholders should also be able to submit data as a DwC archive (DwC-A). The ISA could consider that at a later stage the environmental data template is entirely phased out in favour of a requirement of data submission as DwC-A, i.e. as is the case for OBIS and GBIF. We acknowledge the environmental data template is much broader than the biological data covered here, but data standards, including within DwC, are available to cover the relevant fields, for example the OBIS-ENV-DATA environmental DwC extension (40).
In time, usage of data standards could also be applied to geological data. Full utilisation of the global standard DwC would benefit both the contractors and the ISA data team, as well as other stakeholders and the user community, and would address the key issues we have identified with the database, as outlined here:

- All the fields included in the biological data template can be mapped to DwC terms with less ambiguity and more precision. As a result, data will adhere to a common global data standard, allowing data to meet criteria of being FAIR.
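Such a template-to-DwC field mapping could be as simple as a lookup table applied at submission or export time. Only the SampleID/catalogNumber equivalence is stated in this review; the other pairings below are plausible illustrations, not the ISA's actual mapping:

```python
# Hypothetical template-heading -> Darwin Core term mapping.
# 'SampleID' -> 'catalogNumber' follows the equivalence noted in the text;
# the remaining pairs are illustrative assumptions.
TEMPLATE_TO_DWC = {
    "SampleID": "catalogNumber",
    "Latitude": "decimalLatitude",
    "Longitude": "decimalLongitude",
    "Taxonomist": "identifiedBy",
}

def rename_to_dwc(record, mapping=TEMPLATE_TO_DWC):
    """Rename template fields to DwC terms, passing unmapped fields
    through unchanged so no data are silently dropped."""
    return {mapping.get(key, key): value for key, value in record.items()}

row = {"SampleID": "BC01-001", "Latitude": -12.5, "Taxonomist": "A. Smith"}
dwc_row = rename_to_dwc(row)
# {'catalogNumber': 'BC01-001', 'decimalLatitude': -12.5, 'identifiedBy': 'A. Smith'}
```

Making the template itself DwC-compliant would remove the need for this translation step altogether, which is the substance of the recommendation above.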

Darwin Core and usage of identifiers

We also recommend some key adjustments to data management, outlined in the following section.

To accompany this process, we recommend comprehensive field (re)mapping to DwC for the template and existing data holdings, both data submissions in template form and legacy (pre-template) data. The existing DwC data mapping is incomplete and incorrect in some cases (e.g. morphospecies names mapped to taxonRemarks rather than taxonConceptID). It is important to note that because of the current mis-handling of taxonomic data, unsupported scientific conclusions could be drawn without full cleaning and interrogation of the data. More comprehensive mapping will also result in better data capture. For example, some contractors have included non-specimen records, such as image-only records, in their datasets, which could be described using the basisOfRecord field. While a key mapping template column headings to DwC is provided, it is somewhat buried in the guidance; this documentation could be revised once the mapping is revised. Data mapping to DwC would also allow for publishing of legacy datasets. This is particularly important given the lack of legacy data available, with very few published works available prior to 2000, as ascertained in the parallel study (Rabone et al., in prep). Although data quality can be highly variable in legacy data, DeepData could here draw on lessons from natural history collections, publishing data with data quality/data completeness flags, as done in GBIF for example. An existing check flags duplicates, which is useful, but it was not comprehensive in its assessment. It is therefore important to make changes to data management, both in usage of identifiers as above, but also at the dataset level. As above, the DwC term datasetName would ideally be included in the template as a required field.
Improved versioning and documentation of datasets will assist in both preventing and identifying duplication. Communication with, and involvement of, the contractors will also facilitate this process. Contractors could also be required to make iterative data submissions rather than one-off submissions where applicable, i.e. every year the entire dataset, along with any additional new records, is submitted, and no 'one-off' data submissions are made. This would ensure that year-on-year changes to records are captured, e.g. updates to taxonomic identifications, and that the potential for harvesting of duplicate datasets is minimised. We recommend that changes are also made to the ISA data publishing strategy, so that rather than publishing contractor data received from 2015 up to the present, the reverse is applied, i.e. the latest data submissions, post QA/QC, are published. Any additional data identified from previous years' submissions but not included in the current submissions, e.g. from contracts that are no longer active, are then added. This will again reduce potential duplication. Further, once record identifiers are incorporated into the template itself, i.e. occurrenceID (Figure 4), any duplicates at the record level could be automatically flagged, for example through cross-referencing of these identifiers during the submission process. To support the DwC submission process, training and workshops for contractors, also involving the scientific community and other stakeholders, could be considered by the ISA.
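Submission-time flagging of this kind could be sketched as follows; the required-field rules and identifier format used here are illustrative assumptions, not ISA policy:

```python
def validate_submission(records, required=("occurrenceID", "scientificName")):
    """Sketch of a submission-time check: report missing required fields
    and duplicated occurrenceIDs so problems surface before records are
    published or harvested. Rules are illustrative only."""
    errors = []
    seen = set()
    for i, rec in enumerate(records):
        for field in required:
            if not rec.get(field):
                errors.append(f"row {i}: missing required field '{field}'")
        oid = rec.get("occurrenceID")
        if oid in seen:
            errors.append(f"row {i}: duplicate occurrenceID '{oid}'")
        seen.add(oid)
    return errors

records = [
    {"occurrenceID": "ISA:CCZ:0001", "scientificName": "Hesionidae"},
    {"occurrenceID": "ISA:CCZ:0001", "scientificName": "Isopoda"},  # duplicate ID
    {"occurrenceID": "ISA:CCZ:0002"},                               # missing name
]
problems = validate_submission(records)
```

Rejecting or flagging a submission at this point, rather than after harvesting by OBIS, keeps duplicates out of the published record stream in the first place.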

As DeepData reaches a more mature state, further developments would be worthwhile. Our review has focussed in the main on the data quality of the biological database output; here we turn to web functionality. It should be noted, however, that as web functionality is inherent to general usability and user experience, it is a key element of general database functionality. Also, some of the recommendations listed below, in particular provision of bathymetric data, will be critical to characterising deep-sea environments, and therefore should not necessarily be regarded as 'optional extras' but rather as core development. Extensive testing of the web interface is recommended; with data systems, usability and user testing are more critical than theory.

- Provide information on database and data updates, e.g. when the database has been updated and a list of datasets published. This will support FAIRness of data and general transparency (46,47,34). This is currently listed on the website as an upcoming feature (i.e. publication of a file catalogue) and should be straightforward to implement. It could also include a list of submitted datasets that are yet to be published, therefore clarifying which contractors are actively collecting data (Table 1).

- Provide a dynamically updated cruise inventory on the database for all cruises that have taken place, up to current cruises and potentially those in planning. This could be very simple, with research vessel and contractor name/s and cruise dates (e.g. Table 1), but would be very helpful information for all stakeholders. This could even provide a model for the cruise notification system proposed in the BBNJ draft treaty text (20).

The database has made significant progress towards integrating with global databases and global common data standards, allowing for data exchange and integration, and for data to be FAIR.
However, notable, non-trivial issues with data quality remain, particularly regarding identifiers, duplication, and treatment of taxonomic information. Our review of the database has illustrated the integral importance of global community-led data standards and persistent identifiers for biodiversity data. While the challenges of DeepData reflect those in the wider biodiversity data ecosystem, given the direct connection of the database with the regulator, and its potential to be directly utilised in development of environmental policy, it is even more urgent that these issues are addressed. It would be of great value to be able to directly interrogate the database for species distribution or diversity, for example; on a regional scale, DeepData could ultimately become critical in helping to develop the REMP for the CCZ and other seabed regions managed by the ISA. There is the potential for DeepData to provide an invaluable resource for both research and environmental management. The database is at a nascent phase of its development, and engagement and involvement of the science community, policymakers and contractors to further its development is critical. While feedback from user communities of databases, via feature requests or bug tracking for example, is common practice, more formal and comprehensive assessments of databases like the current study are rare, and we hope in the process to have provided the ISA with useful and implementable recommendations. There is a collective responsibility amongst all stakeholders to support open data efforts such as DeepData and community data curation. However, the ISA is well placed to lead and coordinate activities and encourage efforts in best practice, and may eventually even provide an exemplar for high-quality deep-sea biological datasets.
Such information could be utilised for biodiversity assessments and observing programmes, including contributions to indicators and variables such as EOVs and EBVs (3,4). These could be applied at regional scales, with DeepData contributing data.

Recommended identifier usage applies across the data template, database export, and the database itself. SampleID is currently the key identifier in DeepData and is used as a proxy record identifier (although it is neither unique nor persistent); it is currently mapped to occurrenceID (as in the ISA DwC guidance), but catalogNumber is in fact the equivalent DwC term. VoucherCode instead is currently mapped to catalogNumber, but would be correctly mapped to recordNumber (or otherCatalogNumbers).