Reference ontology and database annotation of the COVID-19 Open Research Dataset (CORD-19)

The COVID-19 Open Research Dataset (CORD-19) was released in March 2020 to allow the machine learning and wider research communities to develop techniques to answer scientific questions about COVID-19. The dataset is a large collection of scientific literature, including over 100,000 full-text papers. Annotating data to normalise the variability in how biological entities are described can improve the performance of downstream analysis and interpretation. To facilitate and enhance the use of the CORD-19 data in these applications, in late March 2020 we performed a comprehensive annotation process using the named entity recognition tool TERMite, together with a number of large reference ontologies and vocabularies covering genes, proteins, drugs and virus strains. This additional annotation has identified and tagged over 45 million entity mentions within the corpus, comprising 62,746 unique biomedical entities. The latest version of the annotated data, as well as older versions, is made openly available under the GPL-2.0 License for the community to use at: https://github.com/SciBiteLabs/CORD19


Introduction
In March 2020, in response to the COVID-19 pandemic, a consortium including The White House, AI2, CZI, MSR, Georgetown and NIH released an open research dataset along with a call to action [1] for the science and technology community. They called on artificial intelligence experts to develop new text and data mining techniques to help the science community answer high-priority scientific questions related to COVID-19. The original dataset can be accessed at Kaggle: https://www.kaggle.com/allen-institute-forai/CORD-19-research-challenge

Data normalisation is an important step in the data preparation phase of many machine learning approaches [2]. The use of ontologies to normalise data has shown improvements in various machine learning applications [3,4]. Standardising descriptions of data with ontologies also offers benefits in post-analysis: using the same ontology identifiers in annotations enables the integration of results across multiple disparate but similarly annotated datasets. The value of sharing well-annotated data has never been clearer than with the emergence of COVID-19 [5,6], and it is this challenge we address here.

To contribute to this call to action, in March 2020 and the months following, we developed updated COVID-19 focused vocabularies replete with rich synonyms to identify relevant virus strain names, the proteins and genes often studied with these viruses, and related entities such as gene functions and processes. These vocabularies use public identifiers where available, drawn from sources such as NCBI Taxonomy, UniProt and the Gene Ontology. Using these vocabularies, along with others describing indications, phenotypes, species, countries, genes, proteins and drugs, we applied our named entity recognition tool, TERMite, to annotate the entire CORD-19 dataset. This has produced a richly annotated dataset with over 45 million annotations covering 62,746 unique biomedically relevant entities.
To facilitate the discovery of connections between entities in the corpus, we have produced a set of entity co-occurrences. This comprises sentences which mention at least two entities, for instance a drug and a gene. Because the vocabularies for identifying COVID-19 related terms are available separately, the set can be filtered for COVID-19 specific co-occurrences.
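The co-occurrence extraction described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the record structure, field names, and the `vocab_filter` parameter are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical annotated-sentence records: each maps a sentence to the
# (vocabulary type, preferred name) entities found in it. The field
# names here are illustrative, not the exact schema of the released files.
annotated_sentences = [
    {"sentence": "Hydroxychloroquine inhibits SARS-CoV-2 entry via ACE2.",
     "entities": [("DRUG", "hydroxychloroquine"),
                  ("SPECIES", "SARS-CoV-2"),
                  ("GENE", "ACE2")]},
]

def cooccurrences(records, vocab_filter=None):
    """Yield unordered entity pairs that share a sentence.

    vocab_filter, if given, keeps only sentences that mention at least
    one entity from that vocabulary (e.g. a COVID-19 strain vocabulary),
    mirroring the COVID-19 specific filtering described in the text.
    """
    for rec in records:
        ents = rec["entities"]
        if vocab_filter and not any(v == vocab_filter for v, _ in ents):
            continue
        for a, b in combinations(ents, 2):
            yield rec["sentence"], a, b

pairs = list(cooccurrences(annotated_sentences))
# A sentence with three entities yields three unordered pairs,
# including the drug-gene pair (hydroxychloroquine, ACE2).
```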

Method
The CORD-19 dataset is described as "the most extensive machine-readable coronavirus literature collection available for data mining" [7]. As of September 2020, the dataset contains over 200,000 scientific articles, including over 100,000 with full text, focusing on COVID-19, SARS-CoV-2, and other relevant or related coronaviruses. The dataset is made available as a set of JSON documents; each document includes the title, authors, abstract, body text and detailed references (where appropriate).

To identify relevant concepts, a focused set of vocabularies was required, with enrichments specific to concepts pertinent to COVID-19 research. We combined the new COVID-19 specific vocabularies with a set of existing vocabularies, mostly aligned to public ontologies (Table 1). To create the annotations for each document, a commercial named entity recognition (NER) engine, TERMite, was employed using these vocabularies. For each section of text in the CORD-19 JSON (titles, abstract parts and body parts), we injected a new block containing the entities identified. The JSON block for each entity consisted of:
• a preferred name
• a unique identifier, which in most cases was a URL for the relevant entity in its respective ontology
• a count of the number of times the entity was found in the section of text
• a list of the sentences in which the entity was found in the section of text
• a list of exact string locations for those sentences

By injecting these results back into the original JSON as an additional set of objects, we aimed to prevent any compatibility issues for individuals and groups who had already begun work with the initial release of this data.
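The injection step can be sketched as below. This is a simplified illustration under stated assumptions: `run_ner` is a hypothetical stand-in for the commercial TERMite engine, and the field names (`termite_hits`, `preferred_name`, etc.) are illustrative rather than the exact keys used in the released files. The key point is that annotations are added as a new block alongside the original CORD-19 fields, which are left untouched.

```python
# A minimal sketch of the annotation-injection step. run_ner() is a
# hypothetical stand-in for TERMite; field names are illustrative.
def run_ner(text):
    """Toy NER: finds a single hard-coded entity with the fields
    described in the text (name, identifier URL, count, sentences)."""
    hits = []
    if "SARS-CoV-2" in text:
        hits.append({
            "preferred_name": "SARS-CoV-2",
            # NCBI Taxonomy identifier for SARS-CoV-2, expressed as a URL
            "id": "http://purl.obolibrary.org/obo/NCBITaxon_2697049",
            "count": text.count("SARS-CoV-2"),
            "sentences": [s for s in text.split(". ") if "SARS-CoV-2" in s],
        })
    return hits

def annotate_document(doc):
    """Inject a 'termite_hits' block into each body section,
    leaving all original CORD-19 fields untouched."""
    for section in doc.get("body_text", []):
        section["termite_hits"] = run_ner(section["text"])
    return doc

doc = {"body_text": [{"text": "SARS-CoV-2 binds ACE2. SARS-CoV-2 spreads."}]}
annotated = annotate_document(doc)
```

Because the hits are injected as an additional key, code written against the original CORD-19 schema continues to work unchanged on the annotated files.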
To enable a more focused analysis, we also tabulated and transposed the annotated data (i.e. only sentences with annotations). In this case, we pivoted the data around the sentence, making the sentence the primary key, with attendant lists of the entities identified in each sentence.
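The sentence pivot described above amounts to grouping individual entity hits by their sentence. A minimal sketch, assuming an illustrative flat input shape (one row per entity hit) that is not necessarily the format of the released tables:

```python
from collections import defaultdict

# One row per entity hit; the sentence text repeats across rows.
# Row shape is an assumption made for this example.
hits = [
    {"sentence": "Ribavirin was tested against SARS-CoV.",
     "entity": "ribavirin", "vocab": "DRUG"},
    {"sentence": "Ribavirin was tested against SARS-CoV.",
     "entity": "SARS-CoV", "vocab": "SPECIES"},
]

def pivot_by_sentence(hit_rows):
    """Pivot hit rows so the sentence becomes the primary key,
    with the list of (vocabulary, entity) hits found in it."""
    table = defaultdict(list)
    for h in hit_rows:
        table[h["sentence"]].append((h["vocab"], h["entity"]))
    return dict(table)

rows = pivot_by_sentence(hits)
```

With the sentence as primary key, co-occurrence queries (e.g. every sentence mentioning both a drug and a species) become simple lookups rather than joins.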

Results
TERMite identified over 45 million entity mentions within the initial CORD-19 release, consisting of 62,746 unique identifiers. It should be noted that some terms are represented in multiple ontologies; this is particularly common with, for example, phenotypes from the Human Phenotype Ontology and indications from MeSH. Table 6 summarises the annotations resulting from the NER process for each of the vocabularies used. Figure 1 gives an example of a sentence in which three annotations were found from three vocabularies. The number of unique hits for each vocabulary type was orders of magnitude lower than the total number of hits found. This is very common in large, focused areas of research (in this case, anything COVID-19 relevant), as the target virus will be mentioned many times in a single article. Summaries of the most frequently found entities are shown in Tables 3, 4 and 5. As can be observed in Table 3, many of the drugs found both in March and more recently have been heavily investigated by researchers around the world, either as potential treatments (e.g. hydroxychloroquine [8], ribavirin [9]) or for a possible role in poorer clinical outcomes (e.g. angiotensin [10]). The ontology-annotated data has already seen use, for example, in generating entity graphs using sentence co-occurrences as edges (Knowledge Graph Hub, 2020) and in combining evidence from multiple publication sources (PubAnnotation, 2020).

Conclusions
The value of well-annotated, accessible data on a global scale has never been more acutely apparent. The CORD-19 dataset and its accompanying call to action channelled this need for rapid and innovative action from the community to help research and development tackle this global challenge. Within weeks of the call in March 2020, we released to the community annotated sentences mentioning entities of biomedical relevance to COVID-19 and coronaviruses more broadly, along with within-sentence co-occurrences. This data has already helped in developing knowledge graphs and data integration portals, and it is our hope that it will be put to further use in the data-driven approaches of 2020 and beyond.