FAIR enough? A perspective on the status of nucleotide sequence data and metadata on public archives

Knowledge derived from nucleotide sequence data is of increasing importance in the life sciences, as well as in decision making (mainly in biodiversity policy). Metadata standards have been established to facilitate sustainable sequence data management according to the FAIR principles (Findability, Accessibility, Interoperability, Reusability). Here, we review the status of metadata available for raw read Illumina amplicon and whole genome shotgun sequencing data derived from ecological metagenomic material that are accessible at the European Nucleotide Archive (ENA), as well as the compliance of the primary sequence data (fastq files) with data submission requirements. While basic metadata, such as geographic coordinates, were overall retrievable in 98% of the cases for this type of sequence data, interoperability was not always ensured, and other (mainly conditionally) mandatory parameters were often not provided at all. Metadata standards, such as the 'Minimum Information about any(x) Sequence' (MIxS), were only infrequently used despite a demonstrated positive impact on metadata quality. Furthermore, the sequence data itself did not meet the prescribed requirements in 31 out of 39 manually inspected studies. To tackle the most immediate needs for improving FAIR sequence data management, we provide a list of minimal suggestions to researchers, research institutions, funding agencies, reviewers, publishers, and databases that we believe could have a large positive impact on the FAIRness of sequence data and metadata, which is crucial for further research and its derived applications.

Next generation sequencing has gained increasing popularity and is now firmly established as a routine tool in multiple fields of the life sciences, such as ecology (foremost microbial ecology), biodiversity research, and conservation biology. Furthermore, knowledge derived from nucleotide sequence data, also referred to as digital sequence information (DSI), is becoming increasingly relevant for decision-making in natural resource management (e.g. Sustainable Development Goals) and as part of international agreements (e.g. Convention on Biological Diversity). The amount of nucleotide sequence data has been and still is growing exponentially (Harrison et al., 2021).
However, a string of nucleotides (ACTG) on its own does not contain much information: metadata and contextual parameters (Box 1) are required to describe sample origin and sequence generation for the data to be meaningful within and beyond the scope of the study for which the sequence was obtained. Capturing and communicating not only the primary (sequence) data, but also its metadata and contextual data, is a crucial part of good data management. To promote sustainable data management and usage, the FAIR principles have been introduced. They offer guidance on how to make data Findable, Accessible, Interoperable, and Reusable (Box 1; Fillinger, de la Garza, Peltzer, Kohlbacher, & Nahnsen, 2019; Wilkinson et al., 2016), to prepare for a future of more automated analyses, with the aim that the value of the data will not be restricted to a single study, but will extend to reuse and integration across multiple studies over time. Recently, the trend of an increasing data volume, which is being more and more sustainably managed, has resulted in nucleotide sequence data being used more frequently in the emerging field of data science, answering new scientific questions with existing data and as such constituting a public good for the scientific community (Box 1). One prime example is the TARA Oceans data set, which has so far resulted in hundreds of publications making secondary use of the data, a number that is constantly increasing¹.

One key aspect of FAIR data is the implementation of standards for metadata. At ENA, experiments are collected into studies, which usually use a common methodological approach. Several studies can be summarized by an umbrella project that may correspond to the larger scientific project for which the data was generated. Samples, referring to the biological material, can be associated with multiple studies through different experiments. This flexible metadata model allows complex experimental set-ups to be represented correctly, but can be hard for inexperienced submitters to navigate properly while providing the necessary information for checklist compliance.

When accessing sequence data as a data consumer (Fig. 1), all provided metadata for each level (run, experiment, study, sample) can be retrieved as XML. To simplify metadata access, the ENA advanced search offers a collection of indexed parameters that use standardized names and are searchable (i.e. usable to restrict the search) and returnable (i.e. downloadable in a user-friendly TSV format; ENA Portal API; Fig. 1). On the run level, these indexed parameters are also inherited from sample and experiment metadata. At the moment, the implementation of indexed parameters is limited to metadata parameters that are mandatory for most checklists and/or most frequently provided. As such, many conditionally mandatory, environment-specific, and optional MIxS parameters, mainly due to a lack of consistent and widespread use, are not indexed and only accessible in XML format, where no standardized nomenclature, controlled vocabulary, or specific data value format is enforced. Therefore, some of the value of MIxS is intrinsically lost, making non-indexed parameters not interoperable.
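The retrieval path via indexed parameters can be sketched as a simple query against the ENA Portal API search endpoint. This is a minimal illustration, not the authors' pipeline; the field names passed below (`lat`, `lon`, `target_gene`) are assumed to be among the indexed, returnable parameters for the `read_run` result type.

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually fetch results

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"

def build_ena_query(taxon_id, fields, limit=100):
    """Build a Portal API search URL for read_run records of one NCBI taxid."""
    params = {
        "result": "read_run",             # search at the run level
        "query": f"tax_eq({taxon_id})",   # restrict to a single NCBI taxid
        "fields": ",".join(fields),       # indexed, returnable parameters
        "format": "tsv",                  # user-friendly tabular output
        "limit": str(limit),
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"

url = build_ena_query("410657", ["run_accession", "lat", "lon", "target_gene"])
# rows = urlopen(url).read().decode().splitlines()  # header + one line per run
```

Non-indexed parameters, by contrast, are only reachable by downloading and parsing the per-record XML, which is exactly where interoperability breaks down.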

In addition to metadata requirements, ENA also standardizes the format of the submitted sequence data depending on the sequencing approach. For instance, paired-end Illumina raw reads have to be submitted as demultiplexed R1 and R2 files (fastq) without artificial sequences (e.g. adapters, linkers, barcodes/tags, primers) and prior to any quality trimming⁴. To provide sequencing data in such a format, initial sequence processing steps are necessary, starting from the multiplexed sequencer output. As bioinformatic sequence analysis pipelines vary, it is important not to deviate from the sequence format requirements, adjusting analysis workflows accordingly.
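Before submission, simple sanity checks against these format requirements can catch the most common deviations. The sketch below is illustrative only: the 50% threshold and the idea of testing the first reads against a known PCR primer are assumptions about how one might screen files, not an official ENA validation.

```python
import gzip
from itertools import islice

def _open(path):
    """Open plain or gzipped fastq transparently."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")

def read_count(fastq_path):
    """Count reads: a fastq record spans exactly four lines."""
    with _open(fastq_path) as fh:
        return sum(1 for _ in fh) // 4

def primers_still_present(fastq_path, primer, n=1000):
    """Heuristic: if most of the first n reads begin with the PCR primer,
    artificial sequences were likely not removed before submission."""
    with _open(fastq_path) as fh:
        # every second line of each 4-line record is the sequence line
        seqs = [line.strip() for i, line in enumerate(islice(fh, n * 4)) if i % 4 == 1]
    return bool(seqs) and sum(s.startswith(primer) for s in seqs) / len(seqs) > 0.5
```

For a paired submission one would additionally expect `read_count(r1) == read_count(r2)`, since R1 and R2 files must contain the same, unmerged read pairs.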

To support sustainable data management, data brokerage services, such as the German Federation for Biological Data (GFBio), have been established (Diepenbroek et al., 2014). Brokerage services offer a central entry point for data submissions, providing personal guidance (helpdesk) on FAIR data, supporting and often simplifying the data submission process, and ensuring data deposition in the most appropriate archive. As an additional checkpoint, brokerage services therefore constitute a valuable resource for each individual researcher to improve the FAIRness of their data, which is now becoming a strict requirement of many funding agencies.

To facilitate data reuse in data mining endeavors, access to and retrieval of the raw read sequencing data and metadata on the run level, together with the inherited metadata parameters describing the sample and sequencing experiment, are most crucial. However, despite the available framework for FAIR data archiving, data interoperability and reusability are still limited and often complicated by insufficient metadata (Eckert et al., 2020; Hoopen et al., 2016; Jurburg, Konzack, Eisenhauer, & Heintz-Buschart, 2020; personal observation). Therefore, we decided to conduct a review of the status of nucleotide sequence data and associated metadata accessible through ENA to (i) identify deficits in metadata quality and (ii) provide suggestions for improving FAIR data management (Fig. 1).

We restricted our analysis to a very popular example for biodiversity assessment in ecology: paired-end amplicon (metabarcoding) raw read data generated from ecological metagenomes (NCBI taxid: 410657) as source material on the Illumina platform. We focus on metadata parameters that are mandatory and/or crucial for the reuse of this kind of data, evaluating the impact of using MIxS checklists.

For an additional 18% of the cases, latitude and longitude values were retrievable from the sample XML as part of non-indexed parameters. There, this data was stored under 23 different parameter names, and was therefore not easily accessible or interoperable. Considering only cases submitted according to a MIxS checklist and specifying a MIxS environmental package, latitude and longitude were always provided in some form, and the proportion of cases with latitude and longitude available in the TSV output was slightly higher across all years (86%), although it has been declining from more than 99% to 61% since 2017.

Target gene (Fig.
2, middle): Missing information about which DNA region was targeted by the amplicon sequencing approach is one of the main obstacles for data interpretation and reuse. The ENA search output in TSV format includes the indexed parameter target_gene, which can be used to specify the amplified gene or locus name using free text. In MIxS, target_gene is further specified as a mandatory parameter for amplicon sequencing studies (MIMARKS survey), although its use is not enforced by ENA for data submissions. Additionally, the non-indexed parameters target_subfragment and pcr_primers are listed as conditionally mandatory parameters to supply additional metadata about the amplified gene region and, as such, should be supplied for all amplicon sequencing experiments. Among all cases investigated here, only 2% provided the target gene, with an additional 5% where some information about the amplified region could be retrieved from non-indexed parameters in the sample XML (stored under 58 different parameter names) and from the library construction protocol included in the experiment XML. However, such entries were extremely inconsistent, ranging from gene and gene region names (or a combination of both) to primer names, primer sequences, and references for the applied PCR protocol. To reuse this data, each entry would have to be inspected manually, making available target gene information not only difficult to access, but also not interoperable. The proportion of cases with target gene information available among those submitted according to a MIxS checklist and environmental package was considerably higher, at 35%, although still far from optimal bearing in mind that this is a mandatory parameter. The correct identification of the amplified region without any respective metadata is cumbersome and computationally expensive.
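Harvesting such non-indexed metadata means parsing each sample XML and matching TAG/VALUE attribute pairs against a list of known naming variants. The sketch below uses the SAMPLE_ATTRIBUTE structure of the archived sample XML; the synonym list is an illustrative assumption standing in for the many parameter names encountered in practice.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of naming variants; the survey found coordinates
# stored under 23 different parameter names in sample XML.
LAT_SYNONYMS = {"lat_lon", "latitude", "lat", "geographic location (latitude)"}

def harvest_latitude(sample_xml):
    """Scan TAG/VALUE pairs in SAMPLE_ATTRIBUTES for any latitude variant."""
    root = ET.fromstring(sample_xml)
    for attr in root.iter("SAMPLE_ATTRIBUTE"):
        tag = (attr.findtext("TAG") or "").strip().lower()
        if tag in LAT_SYNONYMS:
            return attr.findtext("VALUE")
    return None  # coordinates not recoverable from the known variants
```

The need for such per-archive synonym lists and manual curation is precisely the interoperability cost of non-indexed, non-standardized parameters.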
Complete and correct metadata entries, using a standardized format or even controlled vocabulary, preferably in accordance with existing ontologies, would drastically reduce the computational and man-power requirements for post-deposition data curation and data reuse.

For environment_biome, matches to ENVO term IDs, which are mandatory according to the MIxS documentation, were only found for 8% of the cases, with an additional 2% with character string matches to ENVO term names. For environment_material and environment_feature, these proportions were considerably higher, with 73% and 77% matches to ENVO term IDs and an additional 13% and 11% matches to ENVO term names, respectively. This demonstrates that the use of MIxS checklists drastically improved the availability of an environment description via the parameters environment_biome, environment_material, and environment_feature, but also that, if provided, the interoperability of this metadata is severely impaired by non-ENVO entries.

Our data mining use case required primer-trimmed, merged fasta files to be used for oligotyping (Eren et al., 2014). This analysis required that the sequences were generated from the exact same gene region to be comparable. We therefore manually inspected the archived sequence data for compliance with ENA submission requirements, i.e. that paired-end Illumina reads were archived as demultiplexed, unmerged forward and reverse reads, without artificial sequences and prior to any quality trimming. Of the 39 inspected studies, only eight were submitted as required. The majority (28 studies) did not remove the primer sequences, eight studies contained already merged sequences, and one study even provided only the sequencer output prior to demultiplexing (sample barcode information had to be obtained from the author).
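The ENVO matching described above (term ID, term name, or uncontrolled free text) can be sketched as a simple classifier. The regular expression and the mini-vocabulary are illustrative assumptions; a real audit would load term names from a full ENVO ontology release.

```python
import re

# ENVO identifiers look like ENVO:00000447 (colon or underscore separator)
ENVO_ID = re.compile(r"ENVO[:_]\d{7,8}")

# Tiny illustrative vocabulary standing in for the full set of ENVO term names
ENVO_NAMES = {"marine biome", "soil", "sea water"}

def classify_env_entry(value):
    """Bin an environment description: ENVO term ID (interoperable),
    ENVO term name (matchable by string), or uncontrolled free text."""
    if ENVO_ID.search(value):
        return "envo_id"
    if value.strip().lower() in ENVO_NAMES:
        return "envo_name"
    return "free_text"
```

Only the first two bins are machine-actionable; everything landing in `free_text` requires the manual curation lamented above.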
This data mining experience showed that even if metadata that enables the findability and accessibility of the data is provided, the raw read data itself may often not be submitted as required, making manual checks mandatory and limiting the interoperability and reusability of the data.

Sequence data submissions have been increasing steadily in ecology and biodiversity research (e.g. environmental DNA studies). Therefore, it is even more worrisome that the use of standards in data submissions has not been following this same upward trend. We further noticed that even when data was submitted according to MIxS, this metadata standard was often not used as intended or to its full potential. In part, this situation may have arisen from inconsistencies in the documentation about metadata requirements provided by separate resources. For instance, the MIxS checklists are not implemented in their entirety by INSDC due to a lack of demand to archive such parameters. Especially conditionally mandatory parameters, such as target_gene, target_subfragment, and pcr_primers in the case of amplicon data (i.e. the MIMARKS standard), are listed as optional for the MIxS checklists on ENA, the latter two also not being indexed as searchable or returnable parameters. Furthermore, the description of the parameters environment_biome, environment_material, and environment_feature in the ENA documentation specifies free text, whereas MIxS specifies the use of a controlled vocabulary and data syntax. In such cases, the more stringent standard should be communicated and adhered to, in compliance with the original standard description.

Other issues, which we did not explore in more detail here, included duplicated data submissions, contradictory primer references, and conflicting metadata entries for the same run. The latter is especially difficult to track and often only detectable after a manual check of the data, metadata, and publication. In some such instances, we discovered contradictory entries for the sequencing method, instrument model, library selection, sample and library names, and NCBI taxonomy ID. As a consequence, Terabytes to Petabytes of data may not be readily interoperable and reusable, severely limiting their added value, long-term impact, and future relevance.

While the ongoing development of standards and their integration across disciplines⁵ is an essential endeavor to increase the added value of standard-compliant (meta)data, we think that it is crucial to avoid further delays in improving (meta)data quality and FAIRness by making better use of existing standards. This task is up to each individual researcher, to voluntarily use more stringent checklists and provide optional parameters. Brokerage services, such as GFBio, fundamentally improved metadata quality and therefore data reusability, but are too personnel-intensive to solve all challenges described here. Luckily, the number of tools, platforms, and tutorials to inform and facilitate sustainable data management has increased rapidly over the last two years (Olsson & ...).

• target_gene, target_subfragment, and pcr_primers should be supplied for all sequencing read data that was generated with library_selection="PCR" AND library_strategy="AMPLICON".
• Enter data diligently and according to the specified format to facilitate interoperability. It is not only important that (meta)data is archived, but also how.
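The amplicon criterion in the first recommendation maps directly onto the ENA Portal API query syntax, which also makes it possible to audit compliance. This is a hedged sketch: the query fields follow the Portal API documentation as we understand it, and using an empty `target_gene` column as a completeness signal is our own assumption.

```python
from urllib.parse import urlencode
# from urllib.request import urlopen  # uncomment to actually fetch results

# Combine the two library descriptors from the recommendation above,
# using the quoting style of the ENA Portal API query language.
query = 'library_selection="PCR" AND library_strategy="AMPLICON"'

params = {
    "result": "read_run",
    "query": query,
    "fields": "run_accession,target_gene",  # audit whether target_gene was filled in
    "format": "tsv",
    "limit": "0",  # limit=0 is documented to return all matching records
}
url = "https://www.ebi.ac.uk/ena/portal/api/search?" + urlencode(params)
# TSV rows with an empty target_gene column would identify amplicon runs
# whose mandatory MIMARKS parameter was never provided.
```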