Wikidata: A platform for data integration and dissemination for the life sciences and beyond

Wikidata is an open, Semantic Web-compatible database that anyone can edit. This ‘data commons’ provides structured data for Wikipedia articles and other applications. Every article on Wikipedia has a hyperlink to an editable item in this database. This unique connection to the world’s largest community of volunteer knowledge editors could help make Wikidata a key hub within the greater Semantic Web. The life sciences, as ever, face crucial challenges in disseminating and integrating knowledge. Our group is addressing these issues by populating Wikidata with the seeds of a foundational semantic network linking genes, drugs and diseases. Using this content, we are enhancing Wikipedia articles both to increase their quality and to recruit human editors to expand and improve the underlying data. We encourage the community to join us as we collaboratively create what can become the most used and most central semantic data resource for the life sciences and beyond.


Stone Data Soup
In the Stone Soup folktale [1], a group of hungry travelers arrives in a village whose inhabitants are unwilling to share their food. With a kettle of water and a stone, the travelers manage to pique the curiosity of the villagers. That curiosity finally spawns a collaborative effort to make a great soup. The story is now often used to illustrate the power of crowdsourcing and collaborative projects [2], such as Wikipedia, where many individuals each make small contributions but collectively produce something larger than the sum of its parts. Wikidata extends this collaborative model to the Web of data [3]. In this article we describe Wikidata and the ways in which this open public platform can take a central role in data sharing and management for the life science community. [...] Wikidata items, these are flagged for manual review. The next phase of the project will stitch these concepts into a richly interconnected semantic network.

Taking a sip of the data soup - Wikidata and the Semantic Web
The first application to use Wikidata extensively is Wikipedia, but this could be the tip of [...] interactions are known for the drug methadone (CHEMBL651)" [9]. Importantly, the data used to answer this query came from two groups working completely independently. Our 'drug_bot' bot added the CHEMBL identifiers (as well as many other identifiers), while another bot, developed by a team at the Medical University of Vienna, added the drug-drug interactions [10]. This happened without any direct coordination between our groups.
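To make the methadone question concrete, the following is a minimal sketch of how such a query might be posed programmatically. It is not the query used in [9]. It assumes the public Wikidata SPARQL endpoint at https://query.wikidata.org/sparql, and it assumes that the ChEMBL identifier is stored under property P592 and drug-drug interactions under property P769; these property numbers should be checked against Wikidata before use. The function names are hypothetical.

```python
import json
import re
import urllib.parse
import urllib.request

# Assumption: the public Wikidata SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_interaction_query(chembl_id: str) -> str:
    """Build a SPARQL query listing interaction partners of one drug.

    Assumptions: P592 holds the ChEMBL ID and P769 holds
    'significant drug interaction' statements.
    """
    # Reject anything that does not look like a ChEMBL identifier,
    # so the value can be placed safely inside the query string.
    if not re.fullmatch(r"CHEMBL\d+", chembl_id):
        raise ValueError(f"not a ChEMBL identifier: {chembl_id!r}")
    return f"""
SELECT ?drug ?drugLabel ?partner ?partnerLabel WHERE {{
  ?drug wdt:P592 "{chembl_id}" .
  ?drug wdt:P769 ?partner .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as a GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


def fetch_interactions(chembl_id: str) -> list:
    """Run the query over the network and return partner labels."""
    url = build_request_url(build_interaction_query(chembl_id))
    request = urllib.request.Request(
        url, headers={"User-Agent": "wikidata-example/0.1 (illustrative sketch)"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return [row["partnerLabel"]["value"] for row in data["results"]["bindings"]]
```

Calling `fetch_interactions("CHEMBL651")` would send the query for methadone. The two triple patterns in the query correspond to the two independent contributions described above: one pattern relies on the identifier, the other on the interaction statements.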

Many Cooks...
The fact that Wikidata is one centralized, community resource immediately surfaces the challenges incurred in any collaborative ontology development process. In Wikidata, the 'ontology' corresponds to its collection of linking properties used to describe items. A new property in Wikidata has to be proposed for community discussion and is only created after a consensus regarding the value of the property and its relation to existing properties has been established. For those used to controlling their own data and data models, this process can feel tedious. But this same fundamental process must be undertaken in any attempt at data integration. The fact that it happens up front, when data is first being loaded, should help to keep the data consistent and reduce the downstream identifier and ontological mapping problems that continue to plague bioinformatics.
Imagine the power of combining the structured data in Wikidata, the high accessibility and dedicated community of Wikipedia, and the knowledge of the scientific community. Contemplate further that all of this data is freely available and accessible through a stable query interface and a robust read/write API. This makes important, high-quality information easily accessible to anyone and opens up scientific knowledge to public scrutiny. Further, the built-in provenance tracking can provide detailed chains of evidence to support or refute each claim, and all of this can be discussed using the many social tools, such as 'talk pages' for every data item, baked into the MediaWiki infrastructure.
Aside from creating useful ways to disseminate data, this sociotechnical structure provides a framework for the broad community to relay feedback to the original data owners. Even at this early stage of the project, this process has already led to improvements in source data. For example, in the Disease Ontology the term 'Ollier disease' had the synonym 'Maffucci syndrome'. Upon import of the Disease Ontology into Wikidata, members of the Wikidata community pointed out that the two terms, though putative synonyms, linked to two different existing Wikidata items. Closer review determined that the two terms represent two different, albeit closely related, diseases, which led to the creation of a new term in the Disease Ontology. As Wikidata expands, additional differences in representation between it and other knowledge resources can be expected to surface. These will first be triaged by the Wikidata community to check for errors and, if consensus is reached that the error lies in the original source, the issue will be relayed to that source for consideration. In this way, the Wikidata community can become the 'many eyes' that make all ontology bugs shallow.
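The kind of check that surfaced the Ollier disease and Maffucci syndrome discrepancy can be sketched in a few lines. This is an illustrative sketch only, not the procedure the community or our bots actually follow. It assumes a lookup table from lowercased labels to Wikidata item identifiers is already available; the `OntologyTerm` class, the function name, and the placeholder identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OntologyTerm:
    """A source-ontology term with its label and declared synonyms."""
    term_id: str
    label: str
    synonyms: List[str] = field(default_factory=list)


def find_synonym_conflicts(
    terms: List[OntologyTerm], label_to_item: Dict[str, str]
) -> List[Tuple[str, str, str, str, str]]:
    """Flag synonyms that resolve to a different item than the main label.

    Returns tuples of (term_id, label, synonym, label_item, synonym_item).
    Labels and synonyms without a known item are skipped, not flagged.
    """
    conflicts = []
    for term in terms:
        label_item = label_to_item.get(term.label.lower())
        if label_item is None:
            continue
        for synonym in term.synonyms:
            synonym_item = label_to_item.get(synonym.lower())
            if synonym_item is not None and synonym_item != label_item:
                conflicts.append(
                    (term.term_id, term.label, synonym, label_item, synonym_item)
                )
    return conflicts


# Placeholder identifiers, not real Disease Ontology or Wikidata IDs.
example_terms = [
    OntologyTerm("DOID:EXAMPLE", "Ollier disease", ["Maffucci syndrome"])
]
example_items = {
    "ollier disease": "Q_OLLIER_PLACEHOLDER",
    "maffucci syndrome": "Q_MAFFUCCI_PLACEHOLDER",
}
example_conflicts = find_synonym_conflicts(example_terms, example_items)
```

Each tuple in `example_conflicts` is a candidate for human triage rather than an automatic correction, since the mismatch may stem from an error on either side.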

...Can Make a Delicious Soup
We can create a powerful commons of biomedical knowledge by building on established resources and the dedicated community to connect genes, proteins, drugs, diseases, phenotypes and symptoms. Wikipedia will be the first application to use the content in Wikidata, but certainly not the last. The fire is ready and the pot is starting to heat up. Some villagers are already peeking out of their windows ready to join us around the pot, but it will take the effort of the whole community to make a delicious biomedical data soup. We invite you to join us in this effort.