BioSharing: Harnessing Metadata Standards for the Data Commons

The use of community-driven metadata standards, such as minimal information guidelines, terminologies and formats/models, is essential to ensure that data and other digital research outputs are Findable, Accessible, Interoperable, and Reusable, according to the FAIR principles. As with other types of digital assets, metadata standards also need to be FAIR. Their discoverability and accessibility are ensured by BioSharing, the most comprehensive resource of metadata standards, interlinked to data repositories and policies, available in the life, environmental and biomedical sciences. With its growing content, endorsements, and collaborative network, BioSharing is part of a larger ecosystem of interoperable resources. Here we describe some of the activities under the USA National Institutes of Health (NIH)’s Big Data to Knowledge (BD2K) Initiative, illustrating how we track the evolution and use of metadata standards and work to connect them to indexes and annotation tools.


EXAMPLES OF USE AND ACTIVITIES
We describe three examples of how BioSharing contributes to the NIH BD2K community and some projects, as part of ongoing research and development activities. These exemplars are quite diverse and allow us to show some of the existing BioSharing features and future directions, serving both researchers and developers and striving to embed metadata standards into the data cycle in an 'invisible' manner.

Tracking the evolution of metadata standards
BioSharing content can be searched using simple or advanced searches, refined via our filtering options, or grouped via the 'Collection' feature, according to field of interest or focus. For example, journals and publishers are collating the metadata standards and data repositories they recommend in their data policies. Similarly, communities, projects and organizations are creating Collections by selecting and filtering standards (and data repositories) relevant to their work, and/or those they are actively developing themselves. An example of the latter case is provided by the NIH Library of Network-Based Cellular Signatures (LINCS) Program [16], which is creating a network-based understanding of biology by cataloging changes in gene expression and other cellular processes that occur when cells are exposed to a variety of perturbing agents.
As part of this multi-site program, the LINCS Data Working Group (DWG) works to develop metadata standards to describe LINCS reagents, assays and experiments [17] to ensure that key elements of experimental metadata are reported in a common manner, facilitating meta-analysis across all LINCS Centers and the release of FAIR data to the community via the LINCS Data Portal. Due to the dynamic nature of the experiments and the phased development of the standards specifications, the LINCS DWG uses BioSharing to display and track the evolution of their metadata standards.
BioSharing has enabled the LINCS DWG to create, edit and maintain their own records for their standards, and group them under a LINCS Collection [18]. Each record can have one or more maintainers, each of whom has a user profile that can be linked to their resources, publications and ORCID identifier [19]; this provides not only visibility for the individuals but also a much needed contact point for prospective users. Each metadata standard (and data repository) record in BioSharing is manually curated to ensure its description and status are up-to-date, and the validity of the information therein is checked with the maintainers and/or the community behind each effort.

Linking data repositories to relevant standards
Another example of a Collection is the one created with and for the NIH BD2K bioCADDIE project. The bioCADDIE Collection [20] groups and displays the existing metadata standards that have been used to develop the DatA Tag Suite (DATS), the metadata model [21] underpinning DataMed [22]. This data index and search engine prototype is based on metadata extracted from various data sets in a range of data repositories, and does for data what PubMed [23] has done for the literature.
One of the search use cases elicited from researchers during the DataMed development phase is to allow the searching and filtering of datasets (from data repositories) that are compliant with a given community metadata standard. For this reason the DATS model is designed around the Dataset metadata element that is linked to other digital objects, such as Publication, Software, DataRepository and DataStandard, which are the focus of other indexes, such as PubMed, BD2K Aztec [24] and BioSharing, respectively. The latter link is especially important, because knowing if a data repository uses open community standards to harmonize the reporting of its different datasets will provide researchers with some confidence that these datasets are (in principle) more comprehensible and reusable. Figure 2 shows an example of how BioSharing builds this interlinkage between metadata standards and the data repositories that implement them, as well as showing their indicators of readiness. BioSharing is therefore well placed to provide DataMed with the knowledge of the relation(s) between metadata standards and the data repositories that implement them. To realize this query, work is in progress to deliver a BioSharing 'look up service' functionality that will allow systems like DataMed to access and use the information in the context of their searches.
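The search use case described above can be illustrated with a minimal sketch. It assumes a simplified in-memory view of the standard-to-repository relations; the class names, field names and placeholder values (RepoA, StandardX, and so on) are hypothetical and do not reflect the actual DATS schema or the planned BioSharing look-up service interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class DataRepository:
    """Hypothetical, simplified stand-in for a repository record."""
    name: str
    # Names of the metadata standards the repository implements.
    implemented_standards: Set[str] = field(default_factory=set)


@dataclass
class Dataset:
    """Hypothetical, simplified stand-in for an indexed dataset."""
    title: str
    repository: str  # name of the repository holding the dataset


def repositories_implementing(
    standard: str, repositories: Iterable[DataRepository]
) -> Set[str]:
    """Return the names of repositories that implement the given standard."""
    return {
        repo.name
        for repo in repositories
        if standard in repo.implemented_standards
    }


def datasets_compliant_with(
    standard: str,
    datasets: Iterable[Dataset],
    repositories: Iterable[DataRepository],
) -> List[Dataset]:
    """Filter datasets to those held in repositories implementing the standard."""
    matching = repositories_implementing(standard, repositories)
    return [d for d in datasets if d.repository in matching]


# Placeholder data, for illustration only.
repositories = [
    DataRepository("RepoA", {"StandardX", "StandardY"}),
    DataRepository("RepoB", {"StandardY"}),
]
datasets = [Dataset("dataset-1", "RepoA"), Dataset("dataset-2", "RepoB")]

hits = datasets_compliant_with("StandardX", datasets, repositories)
print([d.title for d in hits])  # ['dataset-1']
```

In this sketch compliance is inferred at the repository level, which matches the "in principle" caveat above: a dataset is assumed to follow a standard because its repository implements it, not because the dataset itself has been validated.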

Driving annotation and validation against metadata standards
Despite the growing set of reporting guidelines, models/formats and terminologies for describing the experiments, the barriers to authoring the experimental metadata necessary for sharing and interpreting datasets are still tremendously high. The reasons are twofold. First, bound by a particular discipline or domain, metadata standards are fragmented, with gaps and duplications, thereby limiting their combined usage. For example, producers of datasets in which source material has been subjected to several kinds of analysis (e.g., genomic sequencing and clinical measurement) find it particularly challenging to describe the datasets as coherent units of research due to the diversity of metadata standards with which the parts must be formally represented.
Second, understanding how to comply with these metadata standards takes time and effort, and researchers often see them as burdensome and/or over-prescriptive, as something that may benefit other scientists, but not themselves. In addition, these guidelines are usually narrative in form and prone to ambiguity, which makes compliance with them even more difficult.
The need for tools and services that facilitate the 'invisible use' of metadata standards is widely recognized; this is something of which we have first-hand experience via the ISA framework [25,26]. Further research is ongoing in BioSharing to explore how to automatically use metadata requirements, from two or more domain specific standards, for composition in annotation templates and for validation purposes. To this end, BioSharing contributes to the NIH BD2K CEDAR project [27,28], which works to develop tools and practices to make the authoring of complete datasets smarter and faster.
Figure 3 shows how BioSharing works to define the method and process to create modularized metadata elements, tracking provenance (e.g., information about the standard(s) the metadata elements are derived from, as well as the process of derivation), conditions and dependencies (of each standard(s)-derived metadata element) and validation rules (to ensure a template meets the requirement of one or more checklists). Ultimately, the machine-readable versions of the standards-derived metadata elements will be served to inform the creation of descriptive templates in CEDAR, and/or validation of datasets in other tools like ISA.
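The idea of modularized, standards-derived metadata elements can be illustrated with a minimal sketch. It is not the BioSharing or CEDAR implementation: the `MetadataElement` class, its fields, the merge policy and the checklist names (ChecklistA, ChecklistB) are all assumptions made for illustration. Each element records its provenance, an optional dependency condition and whether it is required; elements from two checklists are composed into one template, against which a record is validated.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Iterable, List, Mapping, Tuple


@dataclass(frozen=True)
class MetadataElement:
    """Hypothetical machine-readable element derived from a standard."""
    name: str
    derived_from: Tuple[str, ...]      # provenance: source standard(s)
    required: bool = True              # validation rule: must be filled in
    depends_on: Tuple[str, ...] = ()   # applies only if these fields are filled


def compose_template(
    *standards: Iterable[MetadataElement],
) -> Dict[str, MetadataElement]:
    """Merge elements from several standards into one template.

    Elements sharing a name are merged: provenance is accumulated and the
    element is required if any source requires it. As a simplification,
    the dependencies of the first definition are kept.
    """
    template: Dict[str, MetadataElement] = {}
    for elements in standards:
        for el in elements:
            prev = template.get(el.name)
            if prev is None:
                template[el.name] = el
                continue
            extra = tuple(s for s in el.derived_from if s not in prev.derived_from)
            template[el.name] = MetadataElement(
                name=prev.name,
                derived_from=prev.derived_from + extra,
                required=prev.required or el.required,
                depends_on=prev.depends_on,
            )
    return template


def missing_elements(
    template: Mapping[str, MetadataElement], record: Mapping[str, object]
) -> List[str]:
    """List required elements whose conditions hold but which are unfilled."""
    def filled(name: str) -> bool:
        return bool(str(record.get(name, "")).strip())

    return sorted(
        el.name
        for el in template.values()
        if el.required
        and all(filled(dep) for dep in el.depends_on)
        and not filled(el.name)
    )


# Two placeholder checklists that overlap on "organism".
checklist_a = [
    MetadataElement("organism", ("ChecklistA",)),
    MetadataElement("dose", ("ChecklistA",), required=False),
    MetadataElement("dose_unit", ("ChecklistA",), depends_on=("dose",)),
]
checklist_b = [
    MetadataElement("organism", ("ChecklistB",)),
    MetadataElement("sample_id", ("ChecklistB",)),
]

template = compose_template(checklist_a, checklist_b)
print(template["organism"].derived_from)  # ('ChecklistA', 'ChecklistB')
print(missing_elements(template, {"organism": "Homo sapiens"}))  # ['sample_id']
```

Keeping the provenance on every element means that a validation failure can be traced back to the checklist(s) that impose the requirement, which is one way to make the duplications between standards visible instead of asking researchers to reconcile them by hand.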

BIOSHARING: AN ELEMENT OF THE COMMONS
To implement the FAIR principles it is necessary to: (i) have a comprehensive description of standards; and (ii) help researchers, developers, curators, funders, journals and librarians to best navigate and select the various standards, or to find the repositories that implement them and draft a data management plan; or simply to find enough information to make an informed decision on
Via its informative and educational functionalities and indicators, BioSharing provides: (i) developers of standards and data repositories with a means to increase the discoverability of their resources outside their own direct community, and (ii) prospective consumers with ways to visualize and understand the status of these resources, enabling them to make an informed decision as to which standard (database or policy) to (re)use or endorse, thus maximizing the potential for adoption and reducing the potential for unnecessary reinvention.
A recently conducted survey [29], supported by ELIXIR and NIH BD2K, provides insight into what users need from BioSharing, showing the road ahead and driving our future activities. As highlighted by the Wellcome Trust-commissioned review [4], it is essential to recognize interoperability standards as digital objects in their own right, with their associated research, development and educational activities.

Figure 1. An example of how the BioSharing indicators (of readiness for implementation or use) are used to show the evolution of one of the LINCS standards.