A protocol for adding knowledge to Wikidata: a case report

Pandemics, even more than other medical problems, require swift integration of knowledge. When caused by a new virus, understanding the underlying biology may help find solutions. In a setting with a large number of loosely related projects and initiatives, we need common ground, also known as a “commons”. Wikidata, a public knowledge graph aligned with Wikipedia, is such a commons and uses unique identifiers to link knowledge in other knowledge bases. However, Wikidata may not always have the right schema for the urgent questions. In this paper, we address this problem by showing how a data schema required for the integration can be modelled with entity schemas represented by Shape Expressions. As a telling example, we describe the process of aligning resources on the genomes and proteomes of the SARS-CoV-2 virus and related viruses, as well as how Shape Expressions can be defined for Wikidata to model the knowledge, helping others studying the SARS-CoV-2 pandemic. We demonstrate how this model makes data from various resources interoperable by integrating data from NCBI Taxonomy, NCBI Gene, UniProt, and WikiPathways. Based on that model, a set of automated applications, or bots, was written for regular updates of these sources in Wikidata and added to a platform that runs these updates automatically. Although this workflow was developed and applied in the context of the COVID-19 pandemic, to demonstrate its broader applicability it was also applied to other human coronaviruses (MERS, SARS, human coronavirus NL63, human coronavirus 229E, human coronavirus HKU1, human coronavirus OC43).


Introduction
The COVID-19 pandemic, caused by the SARS-CoV-2 virus, is leading to a burst of swiftly released scientific publications on the matter (1). In response to the pandemic, many research groups have started projects to understand the SARS-CoV-2 virus life cycle and to find solutions. Examples of the numerous projects include outbreak.info (2), VODAN around FAIR data (3), CORD-19-on-FHIR (4) and the COVID-19 Disease Map (5). Many research papers and preprints get published every week and many call for more Open Science (6). Dutch universities went a step further, committing to make openly available any previously published research related in any way to COVID-19 (7).
However, this swift release of research findings comes with an increased number of incorrect interpretations (8), which can be problematic when new research articles are picked up by mainstream media (9). Rapid evaluation of these new research findings and their integration with existing resources requires frictionless access to the underlying research data upon which the findings are based. This requires interoperable data and sophisticated integration of these resources. Part of this integration is reconciliation: the process of finding matching concepts already in Wikidata. Is a particular gene or protein already described in Wikidata?
Using a shared interoperability layer, like Wikidata, different resources can be more easily linked.
The Gene Wiki project has been linking the different research silos on genetics, biological processes, related diseases and associated drugs (10), creating a brokerage system between the research silos. The project recognises Wikidata as a sustainable infrastructure for scientific knowledge in the life sciences.
In contrast to legacy databases, where data models follow a relational data schema of connected tables, Wikidata ( https://wikidata.org/ ) uses statements to store facts (see Figure 1) (10)(11)(12)(13). This model of statements aligns well with the RDF triple model of the semantic web, and the content of Wikidata is also serialized as Resource Description Framework (RDF) triples (14,15), acting as a stepping stone for data resources to the semantic web. Through its SPARQL endpoint ( https://query.wikidata.org ), knowledge captured in Wikidata can be integrated with other nodes in the semantic web, using mappings between these resources or through federated SPARQL queries (16). Automated editing of Wikidata simplifies the process; however, quality control must be monitored carefully. This requires a clear data schema that allows the various resources to be linked together with their provenance. This schema describes the key concepts required for the integration of the resources we are interested in: NCBI Taxonomy (17), NCBI Gene (18), UniProt (19), the Protein Data Bank (PDB) (20), WikiPathways (21), and PubMed (22). Therefore, the key elements for which we need a model are viruses, virus strains, virus genes, and virus proteins. The first two provide the link to taxonomies; the models for genes and proteins link to UniProt, PDB, and WikiPathways. These key concepts are also required to annotate research output such as journal articles and datasets related to these topics. Wikidata calls such keywords 'main subjects'. The introduction of this model and the actual SARS-CoV-2 genes and proteins in Wikidata enables the integration of these resources.
This paper is a case report of a workflow/protocol for data integration and publication. The first step in this approach is to develop the data schema. Within Wikidata, Shape Expressions (ShEx) are used as the structural schema language to describe and capture schemas of concepts (23,24). With ShEx we describe the RDF structure by which Wikidata content is made available. These Shapes have the advantage that they are easily exchanged and describe linked data models as a single knowledge graph. Since the Shapes describe the model, they enable discussion, reveal inconsistencies between resources, and allow for consistency checks of the content added by automated procedures. The Semantic Web was proposed as a vision of the Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation (26). In order to achieve that goal, several technologies have emerged, such as RDF for describing resources (15), SPARQL for querying RDF data (27) and the Web Ontology Language (OWL) for representing ontologies (28).
Linked data was later proposed as a set of best practices to share and reuse data on the web (29). The linked data principles can be summarized in four rules that promote the use of uniform resource identifiers (URIs) to name things, which can be looked up to retrieve useful information for humans and for machines using RDF, as well as having links to related resources. These principles have been adopted by several projects, enabling a web of reusable data, known as the linked data cloud ( https://lod-cloud.net/ ), which has also been applied to the life sciences (30).
One prominent project is Wikidata, which has become one of the largest collections of open data on the web (16). Wikidata follows the linked data principles offering both HTML and RDF views of every item with their corresponding links to related items, and a SPARQL endpoint called the Wikidata Query Service.
Wikidata's RDF model offers a reification mechanism which enables the representation of information about statements, such as qualifiers and references (see also https://www.wikidata.org/wiki/Help:Statements ). For each statement in Wikidata, there is a direct property in the wdt namespace that indicates the direct value. In addition, the Wikidata data model adds other statements for reification purposes that allow enrichment of the declarations with references and qualifiers (for a topical treatise, see Ref. (31)). As an example, item Q14875321, which represents ACE2 (a protein-coding gene in the species Homo sapiens), has a statement specifying that it is located on a chromosome (P1057), with value chromosome X (Q29867336); in RDF Turtle this corresponds to a single triple using the wdt:P1057 direct property. That statement can be reified to add qualifiers and references. For example, a qualifier can state that the genomic assembly (P659) is GRCh38 (Q20966585), with a reference declaring that it was stated in (P248) Ensembl Release 99 (Q83867711).
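The direct statement and its reified form can be sketched in Turtle as follows; the namespaces match the Wikidata RDF serialization, while the statement node identifier is illustrative (real Wikidata statement URIs contain a UUID):

```turtle
PREFIX wd:   <http://www.wikidata.org/entity/>
PREFIX wds:  <http://www.wikidata.org/entity/statement/>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX p:    <http://www.wikidata.org/prop/>
PREFIX ps:   <http://www.wikidata.org/prop/statement/>
PREFIX pq:   <http://www.wikidata.org/prop/qualifier/>
PREFIX pr:   <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>

# Direct (truthy) statement: ACE2 gene is located on chromosome X
wd:Q14875321 wdt:P1057 wd:Q29867336 .

# Reified form of the same statement, with a qualifier and a reference
wd:Q14875321 p:P1057 wds:Q14875321-EXAMPLE .   # statement node (illustrative id)
wds:Q14875321-EXAMPLE
    ps:P1057 wd:Q29867336 ;
    pq:P659  wd:Q20966585 ;                    # qualifier: genomic assembly GRCh38
    prov:wasDerivedFrom _:ref .
_:ref pr:P248 wd:Q83867711 .                   # reference: stated in Ensembl Release 99
```

The wdt triple gives quick access to the "truthy" value, while the p/ps/pq/pr path carries the full provenance.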

Specifying data models with ShEx
Although the RDF data model is flexible, specifying an agreed structure for the data allows domain experts to identify the properties and structure of their data, facilitating integration between heterogeneous data sources. Shape Expressions were used to provide a suitable level of abstraction. YaShE ( http://www.weso.es/YASHE/ ), a ShEx editor implemented in JavaScript, was used to author these Shapes (33).
This application provides the means to associate Wikidata's natural-language labels with the corresponding identifiers. The initial entity schema was defined with YaShE as a proof of concept for virus genes and proteins. In parallel, statements already available in Wikidata were used to automatically generate an initial shape for virus strains with sheXer (34). The statements for virus strains were retrieved with SPARQL from the Wikidata Query Service (WDQS). The generated shape was then further improved through manual curation. The syntax of the Shape Expressions was continuously validated with YaShE, and the Wikidata EntitySchema namespace was used to share and collaboratively update the schema with new properties. Figure 3 gives a visual outline of these steps. Genomic information from seven human coronaviruses (HCoVs) was collected, including their NCBI Taxonomy identifiers. For six virus strains, a reference genome was available and was used to populate Wikidata. For SARS-CoV-1, the NCBI Taxonomy identifier referred to various strains, but no reference strain was available. To overcome this issue, the species taxon for SARS-related coronaviruses (SARSr-CoV) was used instead, following the practice of NCBI Gene and UniProt.
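To give an impression of what such a schema looks like, the following is an illustrative ShEx sketch of a virus-gene shape; the properties are real Wikidata properties, but the published entity schemas in the EntitySchema namespace are more complete:

```shex
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

start = @<virus_gene>

# Sketch only: a virus gene is an instance of "gene", is found in at
# least one virus strain, and may carry an NCBI Gene identifier.
<virus_gene> {
  wdt:P31  [ wd:Q7187 ] ;        # instance of: gene
  wdt:P703 @<virus_strain> + ;   # found in taxon
  wdt:P351 xsd:string ? ;        # NCBI Gene ID (external identifier)
}

<virus_strain> {
  wdt:P31  [ wd:Q855769 ] ;      # instance of: strain
  wdt:P685 xsd:string ;          # NCBI Taxonomy ID
}
```

Shapes like this can be validated against live Wikidata items, which is how inconsistencies between resources surface during discussion.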

NCBI Eutils
The Entrez Programming Utilities (EUtils) form the application programming interface (API) to the Entrez query and database system at the National Center for Biotechnology Information (NCBI). From this set of services, the scientific name of the virus under scrutiny was extracted (e.g. "Severe acute respiratory syndrome coronavirus 2").
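A minimal sketch of such a lookup is shown below; the endpoint and parameters follow the public EUtils documentation, but the exact field names in the JSON reply are an assumption and may differ between releases:

```python
"""Fetch a virus's scientific name from NCBI EUtils (esummary) -- a sketch."""
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def esummary_url(taxon_id: str) -> str:
    """Build the esummary request URL for an NCBI Taxonomy identifier."""
    params = {"db": "taxonomy", "id": taxon_id, "retmode": "json"}
    return EUTILS + "?" + urllib.parse.urlencode(params)

def scientific_name(taxon_id: str) -> str:
    """Return the scientific name for a taxon, e.g. 2697049 for SARS-CoV-2.

    The 'scientificname' key is an assumption about the JSON layout.
    """
    with urllib.request.urlopen(esummary_url(taxon_id)) as resp:
        reply = json.load(resp)
    return reply["result"][taxon_id]["scientificname"]
```

The taxon identifier is the same one that later drives the mygene.info queries, so this call is the entry point of the whole workflow.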

Mygene.info
Mygene.info is a web service that provides a REST API for obtaining up-to-date gene annotations. The first step in the process is to get a list of applicable genes for a given virus by providing the NCBI taxon identifier. The following step is to obtain gene annotations for the individual genes from mygene.info, e.g. through http://mygene.info/v3/gene/43740571 . This results in the name and a set of applicable identifiers (Figure 4).
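These two steps can be sketched as follows; the code uses the public mygene.info v3 REST API, but the `q=__all__` query syntax for selecting all genes of a taxon is an assumption based on that API's query documentation:

```python
"""Retrieve gene lists and gene annotations from mygene.info -- a sketch."""
import json
import urllib.parse
import urllib.request

BASE = "https://mygene.info/v3"

def query_url(taxon_id: int) -> str:
    """Build the query URL listing all genes for a given NCBI taxon."""
    return f"{BASE}/query?" + urllib.parse.urlencode(
        {"q": "__all__", "species": taxon_id})

def genes_for_taxon(taxon_id: int) -> list:
    """Step 1: list gene hits for a taxon, e.g. 2697049 for SARS-CoV-2."""
    with urllib.request.urlopen(query_url(taxon_id)) as resp:
        return json.load(resp)["hits"]

def gene_annotation(gene_id: str) -> dict:
    """Step 2: fetch the full annotation record for one gene identifier."""
    with urllib.request.urlopen(f"{BASE}/gene/{gene_id}") as resp:
        return json.load(resp)
```

The annotation record returned in step 2 contains the gene name and the external identifiers that drive the reconciliation described below.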

UniProt
The annotations retrieved from mygene.info also contain protein identifiers from UniProt, RefSeq and PDB; however, their respective names are lacking. To obtain names and mappings to other protein identifiers, RefSeq and UniProt were consulted. RefSeq annotations were acquired using the aforementioned NCBI EUtils. UniProt annotations were acquired using the SPARQL endpoint of UniProt, a rich resource for protein annotations provided by the SIB Swiss Institute of Bioinformatics. Figure 5 shows the SPARQL query that was applied to acquire the protein annotations.
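A query along the following lines retrieves protein names for a given taxon from the UniProt endpoint; this is a hedged sketch using the UniProt core ontology, not the exact query of Figure 5:

```sparql
PREFIX up:    <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>

# Sketch: recommended full names of reviewed proteins
# of SARS-CoV-2 (NCBI taxon 2697049)
SELECT ?protein ?name
WHERE {
  ?protein a up:Protein ;
           up:organism taxon:2697049 ;
           up:reviewed true ;
           up:recommendedName ?rn .
  ?rn up:fullName ?name .
}
```

Because UniProt exposes its data as RDF, such queries can later be federated with the Wikidata Query Service.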

Reconciliation with Wikidata
Before the aggregated information on viruses, genes and proteins can be added to Wikidata, reconciliation with Wikidata is necessary: if Wikidata items exist, they are updated; otherwise, new items are created. Reconciliation is driven by mapping identifiers that exist in both the primary resources and Wikidata. It is possible to reconcile based on strings, but this is dangerous due to the ambiguity of the labels used (37). When items on concepts that lack identifiers overlapping with the primary resource are added to Wikidata, reconciliation is challenging. Based on the Shape Expressions, the following properties are identified for use in reconciliation.

The COVID-19 related pathways from the WikiPathways COVID-19 Portal are added to Wikidata using the approach described previously (10). For this, a dedicated repository has been set up to hold the GPML files, the internal WikiPathways file format. The GPML is converted into RDF files with the WikiPathways RDF generator (39), while the files with author information are manually edited. To get the most recent GPML files, a custom Bash script was developed ( getPathways.sh in the SARS-CoV-2-WikiPathways repository). The conversion of the GPML to RDF uses the previously published tools for WikiPathways RDF (39). Here, we adapted the code with a unit test that takes the pathway identifier as a parameter. This test is available in the SARS-CoV-2-WikiPathways branch of GPML2RDF, along with a helper script ( createTurtle.sh ). Based on this earlier generated pathway RDF and using the Wikidataintegrator library, the WikiPathways bot was used to populate Wikidata with additional statements and items. The pathway bot was extended with the capability to link virus proteins to the corresponding pathways, which was essential to support the Wikidata resource. These changes can be found in the sars-cov-2-wikipathways-2 branch.
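Identifier-based reconciliation amounts to asking the Wikidata Query Service whether an item with a given external identifier already exists. A minimal sketch, using P351 (Entrez Gene ID) as an example reconciliation property:

```python
"""Reconcile an external identifier against Wikidata via WDQS -- a sketch."""
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

def reconciliation_query(prop: str, value: str) -> str:
    """SPARQL that finds items whose external identifier matches."""
    return f'SELECT ?item WHERE {{ ?item wdt:{prop} "{value}" . }}'

def find_items(prop: str, value: str) -> list:
    """Return QIDs of matching items, e.g. find_items("P351", "43740568")."""
    url = WDQS + "?" + urllib.parse.urlencode(
        {"query": reconciliation_query(prop, value), "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "recon-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        bindings = json.load(resp)["results"]["bindings"]
    # Strip the entity URI prefix, keeping only the QID
    return [b["item"]["value"].rsplit("/", 1)[-1] for b in bindings]
```

An empty result means a new item must be created; more than one result signals a potential duplicate that needs curation.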

Scholia
The second use case is to demonstrate how we can link virus gene and protein information to literature. Here, we used Scholia ( https://scholia.toolforge.org/ ) as a central tool (13). It provides a graphical interface around data in Wikidata, for example, for literature about a specific coronavirus protein (e.g. Q87917585 for the SARS-CoV-2 spike protein). Scholia uses SPARQL queries to provide information about topics. We annotated literature around the HCoVs with the specific virus strains, virus genes, and virus proteins as 'main subject'.
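The kind of query Scholia runs behind its topic pages can be sketched as follows; this is an illustrative query against the Wikidata Query Service, not one taken from Scholia's source:

```sparql
# Sketch: articles having the SARS-CoV-2 spike protein (Q87917585)
# as their main subject (P921)
SELECT ?article ?articleLabel
WHERE {
  ?article wdt:P921 wd:Q87917585 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

Because the annotation lives in Wikidata itself, every newly tagged article immediately shows up on the corresponding topic page.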

Semantic data landscape
To align the different sources in Wikidata, a common data schema is needed. We have created a collection of schemas that represent the structure of the items added to Wikidata. Input to the workflow is the NCBI taxon identifier, which is passed to mygene.info (see Figure 3). Taxon information is obtained and added to Wikidata. During this effort, which took three weeks, the bot created a number of duplicates.
These have been manually corrected. It should also be noted that for SARS-CoV-2 many proteins and protein fragments do not have RefSeq or UniProt identifiers; this holds mostly for the virus protein fragments. An overview query ( https://w.wiki/Lsq ) lists these proteins and the articles that discuss them. Scholia takes advantage of the 'main subject' annotation, allowing the creation of "topic" pages for each protein.
For example, Figure 8 shows the topic page of the SARS-CoV-2 spike protein. Wikidata provides a solution: it is part of the semantic web, and its reification mechanism exposes Wikidata items as RDF. Data in Wikidata is frequently, often almost instantaneously, synchronized with the RDF resource and available through its SPARQL endpoint ( http://query.wikidata.org ). The modelling process turns out to be an important aspect of this protocol. Wikidata contains numerous entity classes and more than 7,000 properties that are ready for (re-)use. However, that also makes it a confusing landscape to navigate. The ShEx schemas have helped us develop a clear model; they serve as a social contract between the authors of this paper, as well as documentation for future users.
Using these schemas, it was simpler to validate the correctness of the updated bots that enter data into Wikidata. The bots have been transferred to the Gene Wiki Jenkins platform, which allows them to run regularly, keeping pace with the ongoing efforts of the coronavirus and COVID-19 research communities. While the work of the bots will continue to need human oversight, potentially to correct errors, it provides a level of scalability and relieves the authors of much repetitive work.
One of the risks of using bots is the possible creation of duplicate items. Though this is also a risk in the manual addition of items, humans can apply a wider range of academic knowledge to resolve such issues. Indeed, in running the bots, duplicate Wikidata items were created, an example of which is shown in Figure 9. The Wikidataintegrator library does have functionality to prevent the creation of duplicates by comparing properties, based on the database identifiers used. However, if two items have been created using different identifiers, they cannot be easily identified as duplicates.
Close inspection of examples, such as the one in Figure 9, showed that the duplicates were created because there was a lack of overlap between the data to be added and the existing item. The UniProt identifier did not yet resolve, because it was manually extracted from information in the March 27 pre-release (it is now part of the regular releases). In this example, the Pfam protein families database (42) identifier was the only identifier upon which reconciliation could happen. However, that identifier pointed to a webpage that did not contain mappings to other identifiers.
In addition, the lack of references to the primary source hampers the curator's ability to merge duplicate items and expert knowledge was essential to identify the