Optimising biomedical relationship extraction with BioBERT

Text mining is widely used within the life sciences as an evidence stream for inferring relationships between biological entities. In most cases, conventional string matching is used to identify cooccurrences of given entities within sentences. This limits the utility of text mining results, as they tend to contain significant noise due to weak inclusion criteria. We show that, in the indicative case of protein-protein interactions (PPIs), the majority of sentences containing cooccurrences (~75%) do not describe any causal relationship. We further demonstrate the feasibility of fine tuning a strong domain-specific language model, BioBERT, to analyse sentences containing cooccurrences and accurately (F1 score: 88.95%) identify functional links between proteins. These strong results come in spite of the deep complexity of the language involved, which limits the accuracy even of expert curators. We establish guidelines for best practices in data creation to this end, including an examination of inter-annotator agreement, of semisupervision, and of rules-based alternatives to manual curation, and we explore the potential for downstream use of the model to accelerate curation of interactions in the SIGNOR database of causal protein interactions and the IntAct database of experimental evidence for physical protein interactions.

Introduction

Knowledge acquired through the initial training, including grammatical principles that would be nearly impossible to learn from the relatively minuscule datasets used in the fine tuning stage, can be leveraged. As a result, it is now possible to achieve strong results with relatively little curation, driving down the previously prohibitive cost of developing state of the art models within specific domains. Furthermore, models with domain-specific pretraining have also been developed, such as BioBERT [5], whereby the initial training is followed by further training on life science corpora such as MEDLINE and PubMed Central, resulting in further gains on life science specific tasks. BioBERT was released with three fine-tuned variants of the base model for performing named entity recognition, question answering and relationship extraction. Research has shown a moderate improvement in the extraction of biomedical relationships using BioBERT as opposed to base BERT [6].

The sheer volume of publications in the life sciences presents a significant challenge to researchers attempting to monitor developments and compile the consolidated resources upon which modern research is largely dependent. The potential to streamline these operations by fine tuning state of the art deep learning models to extract causal relationships from text therefore warrants research [7].

Protein-protein interactions (PPIs) are critical to a great number of disciplines, including the established, like pharmacy, and the up and coming, like synthetic biology [8]. They are often described using deeply complex and technical language and, as the majority of sentences containing multiple proteins do not describe any interaction, they are resistant to extraction with simple, traditional text mining methods.
The extraction of PPIs from text is therefore a pivotal problem to be solved, and empowering researchers to collect these data more effectively may have significant real world benefits in downstream applications. We therefore chose to focus on this particular task, but the methods described are largely transferable to the extraction of other relationships from text. As the state of the art in natural language processing has progressed over the past years, these methods have been applied to PPI extraction from text [9][10][11]. However, different use cases define interactions differently, as can be seen from the number of papers [11][12][13] which achieve disparate results on the major datasets of BioInfer [14] and AIMed [15]. For example, either causal or physical interactions may be sought, or both. There may also be other criteria for extracted sentences, such as requiring that they describe novel discoveries. As models continue to become more capable of abstracting patterns from fewer samples of data, the feasibility of creating highly specific models for niche concerns increases, and the onus of generating data shifts away from consortia and towards individuals and organisations. It is therefore key to have clear guidelines for best practices in this data creation.

In this paper, we demonstrate the feasibility of using traditional text mining as a starting point to narrow down a curation space to facilitate more rapid generation of datasets, and also to narrow down the execution space once the model is trained. We then assess whether there is a need for inter-annotator agreement in the curation of data, and whether semisupervision can reduce curation time, and suggest best practices based on the outcomes. We proceed to compare the efficacy of rules-based methods, deep learning methods bootstrapped with data collected via rules-based methods, and deep learning methods trained on curated data. Finally, we outline some initial testing undertaken to assess the value added to current PPI database curation methods both with and without further task-specific fine tuning.

To narrow down the curation space for our tasks, we defined high level minimum criteria. It is rare that a text describes an interaction between two different proteins without containing the explicit mention of both proteins. We used named entity recognition (NER) software, TERMite, to tag life science entities within papers drawn from MEDLINE. This enabled us to extract only sentences containing relevant, high level patterns. In the case of PPIs, we looked for two or more gene hits, aligned to HGNC [16], within a given sentence. In auxiliary tests for semisupervision we also looked at sentences containing pairs of drug and adverse event hits, aligned to ChEMBL [17] and MedDRA [18] respectively, and pairs of drug and gene hits. Filtering sentences in this way not only allows for more rapid curation but also results in fewer sentences being required for training, as the model only needs to comprehend the nuances of a niche subset of sentences. This same step can be used at inference such that only sentences of interest are passed to the model, resulting in significant savings of compute and time.
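As an illustration of this filtering and masking step, the sketch below stands in for TERMite with a tiny hypothetical gene lexicon (`GENE_DICT` and the `GENE_N` token format are assumptions, not the paper's actual tooling); a production NER system would of course handle tokenisation, synonyms and ambiguity far more carefully.

```python
# Illustrative stand-in for NER-based filtering: keep only sentences with
# at least two distinct gene mentions, then mask each mention with a
# generic token so a downstream model cannot memorise specific pairs.
GENE_DICT = {"BRCA1", "TP53", "MDM2", "EGFR"}  # hypothetical lexicon

def find_gene_hits(sentence):
    """Return gene-lexicon tokens found in the sentence, in order."""
    return [tok.strip(".,()") for tok in sentence.split()
            if tok.strip(".,()") in GENE_DICT]

def filter_and_mask(sentences):
    """Keep sentences with >= 2 distinct gene hits, masked as GENE_0, GENE_1, ..."""
    kept = []
    for sent in sentences:
        hits = find_gene_hits(sent)
        if len(set(hits)) < 2:
            continue  # no cooccurrence: not a PPI candidate
        masked = sent
        for i, gene in enumerate(dict.fromkeys(hits)):  # order-preserving de-dup
            masked = masked.replace(gene, f"GENE_{i}")
        kept.append(masked)
    return kept
```

The same function can be applied both before curation (to build the training pool) and at inference (so only candidate sentences reach the model).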
To assess the impact of inter-annotator agreement, three curators independently curated an initial set of 925 sentences, taken from MEDLINE, within which our NER system had identified two or more genes/proteins. Experienced curators attempted to identify PPIs according to criteria provided (see Table 1, or S1 Appendix for full detail). Our criteria were aimed at developing a recall-oriented model which could potentially then be further fine tuned to more specific use-case criteria as required. Concordance between all three curators was observed in 451 of 925 sentences (48.8%), and concordance between at least two of three curators was observed in 889 of 925 sentences (96.1%). A subset of 170 sentences enjoyed agreement from two curators while the third curator had indicated they were unsure as to the correct category. We curated these sentences with a fourth curator and found agreement with the 2:1 majority in 155 cases (88.2%). The high level of disagreement amongst curators (with at least one curator dissenting in 51.2% of cases) illustrates the complexity of the problem even as approached by human experts. We also observed that the number of sentences deemed to contain coincidental mentions of genes significantly outnumbered the number of sentences deemed to describe interactions, illustrating the need for models to differentiate between these classes in order to reliably automate the identification of PPIs within literature.

Although two curators could process more sentences than three in a given number of person-hours, the resulting number of sentences with agreement between two annotators per unit time was similar with either three curators or two curators. In our case, we found inter-annotator agreement between at least two of three curators in 96.1% of sentences, as opposed to a mean average of 64% between the possible pairings of two curators. If three curators curate n sentences in t person-hours, we would expect two curators to curate 1.5n sentences in t, and 0.64 · 1.5n = 0.96n ≈ 0.961n. It should be noted that agreement between two out of two curators represents a concordance rate of 1, whereas agreement between two out of three curators represents concordance of 0.67. We determined to proceed with three annotators, to allow us to assess the efficacy of models trained with varying degrees of concordance and to obtain the strongest possible gold standard set.

Curation of the semisupervised set identified interactions in a further 136 sentences, bringing the combined total to 254 (23.4%). In the randomly selected set of sentences, these figures were 15.2% and 26.1% respectively. More broadly, we observed similar rates of agreement to the initial set of sentences, with all three annotators agreeing in 51.3% of cases (compared to 48.8% in the initial set) and two out of three annotators agreeing in 94.9% of cases (compared to 96.1%). The one clearly observable difference was a marked increase in the identification of sentences containing coincidental mentions (increasing from 47.8%/25.3% to 58.7%/35.2% with agreement between two and three curators respectively). We repeated this with even stricter StringDB combined scores of ≥995, and once again found no improvement in the rate of identification of positive interactions.
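The agreement bookkeeping above can be made concrete with a short sketch; it assumes each sentence simply carries one label per curator, which is a simplification of the real curation workflow.

```python
from itertools import combinations

def agreement_stats(labels):
    """Given (label_a, label_b, label_c) tuples, return the rate of full
    three-way agreement, the rate of at-least-two agreement, and the mean
    pairwise agreement rate."""
    n = len(labels)
    all_three = sum(len(set(row)) == 1 for row in labels) / n
    at_least_two = sum(len(set(row)) <= 2 for row in labels) / n
    pair_rates = [sum(row[i] == row[j] for row in labels) / n
                  for i, j in combinations(range(3), 2)]
    return all_three, at_least_two, sum(pair_rates) / len(pair_rates)

# Throughput arithmetic from the text: two curators cover 1.5n sentences
# in the time three curators cover n, but agree pairwise only ~64% of the
# time, so 0.64 * 1.5n gives ~0.96n agreed sentences, essentially matching
# the 0.961n obtained with three curators.
assert abs(0.64 * 1.5 - 0.961) < 0.01
```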
This may be a result of well established interactions being assumed knowledge and therefore rarely being explicitly stated. We therefore continued using the initial randomly selected set of sentences for data preparation.

To assess whether semisupervision might be more applicable in the case of a different relationship, we attempted to apply a similar methodology to drug/adverse reaction pairs. Adverse reactions are difficult to identify using traditional string matching methods, as they are lexically identical to non-adverse reaction indications. As such, a sentence which appears to contain a drug and an adverse reaction may in fact describe the opposite, with the drug being described as a treatment for the indication in question. This indicates there is potential for the use of semisupervision in the development of models to extract this relationship.

We used TERMite to identify indications mentioned within the warnings section of FDA drug labels, and considered these indications to be adverse reactions caused by the drug to which the label belonged. We proceeded to curate 100 sentences containing this subset of drug/indication pairs and a further 100 sentences containing randomly selected pairs. In both the semisupervised and the randomly selected set, 23 sentences were deemed to likely describe a drug causing an adverse event. It is very common for a drug to list the indication it treats as a side effect (e.g. headaches being a side effect of aspirin), so we postulated that one possible way to improve on this result would be to exclude any indication mentioned on the drug label which is also known to be a condition treatable with said drug. Repeating the above methodology but excluding approved treatments, as listed in ChEMBL, resulted in a minor improvement, with 28 likely positive sentences identified in the 100 curated.

We undertook one final round, using ChEMBL to identify drug-gene pairs wherein the gene was a known target of the drug. In this case we found that 58.1% of randomly selected sentences containing a drug and a gene likely described a targeting relationship.
In the semisupervised set, this rose to 89.9%. In conclusion, semisupervision may provide a valuable means to increase the ratio of sentences containing the desired relationship to those containing coincidental mentions, but this value is case dependent.

Final curation

We continued with curation using randomly selected sentences containing two genes/proteins, with no further filtration, ultimately collecting 1408 sentences deemed by at least two curators to contain an interaction, of which 308 were randomly selected and allocated to a test set. These sentences were combined with a corresponding number of sentences deemed by at least two curators to contain coincidental mentions to create our primary training and testing sets.

Rules-based approaches

We attempted to use some simple rules-based approaches to provide baselines for deep learning comparisons. The most reliable rule we could identify was to look for two genes/proteins alongside a bioverb indicative of an interaction. We compiled a list of 41 such bioverbs (S2 Appendix). In an attempt to improve precision, we also used a second method, utilising the natural language processing library spaCy [20] to generate dependency trees of our sentences in order to confirm that the genes identified were grammatical children of the bioverb identified. If the bioverb did not appear to link the two genes, these sentences were considered to be coincidental as opposed to positive.

To define deep learning baselines, we first attempted to train a model using data collected with the rules-based methods defined above. The risk inherent in using rules-based approaches for data gathering is the introduction of bias, as one is effectively training a model to recognise one's simplified ruleset, as opposed to exposing it to a truly representative sample of the more nuanced relationship one hopes to identify.
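The dependency-based variant of the ruleset can be sketched as follows. This is a minimal illustration, not the production implementation: the bioverb list is an abridged, hypothetical subset of the 41 used, and the token head indices are supplied by hand where a real pipeline would take them from spaCy's parser.

```python
# Accept a sentence only if some bioverb is a grammatical ancestor of
# every gene mention, i.e. the verb actually governs both genes rather
# than merely cooccurring with them.
BIOVERBS = {"phosphorylates", "binds", "activates", "inhibits"}  # abridged

def ancestors(heads, i):
    """Yield ancestor indices of token i (the root points at itself)."""
    seen = set()
    while heads[i] != i and i not in seen:
        seen.add(i)
        i = heads[i]
        yield i

def linked_by_bioverb(tokens, heads, gene_idxs):
    """True if a bioverb dominates all gene mentions in the parse tree."""
    for v, tok in enumerate(tokens):
        if tok.lower() in BIOVERBS:
            if all(v in ancestors(heads, g) for g in gene_idxs):
                return True
    return False
```

In the simpler rule, a bioverb anywhere in the sentence suffices; the check above additionally rejects sentences where the verb does not dominate both genes, trading some recall for precision.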
To assess whether BioBERT was abstracting patterns from our rules-generated training sets, we removed sentences containing a subset of five bioverbs from our training set and used these as our test set. Training was then performed on sentences containing the remaining set of bioverbs, after replacing any gene hits identified by TERMite with a normalised token. We observed an F1 score of 0.584 on unseen bioverbs using the basic method, and an F1 score of 0.801 on unseen bioverbs using the spaCy method. This indicates that, while both methods led to some degree of abstraction, this was much more pronounced in the method using grammatical dependency parsing than in the method using simple term identification. This insight may be useful if curated data is not available.

In-the-wild testing

Following encouraging results from a preliminary examination of the model's output, curators at SIGNOR [21] and IntAct [22] examined two sets of data extracted from the CORD19 dataset [23]. A small, custom vocabulary of PPI measurement techniques was used to filter the documents (S3 Appendix), and pairs already represented in the SIGNOR database were excluded. One set was created by identifying sentences with two proteins using TERMite. The second set used the model prediction as an additional filter. Another round of curation was then undertaken to assess coverage of particular genes of interest, in which TERMite was used to identify proteins listed by SIGNOR/IntAct as being high priority.

From our results in Table 2, it is clear that the direct application of our simple bioverb ruleset left substantial room for improvement. We attempted to address this by using dependency parsing to increase our confidence that the two proteins mentioned in a sentence were in fact linked by the molecular bioverb we had identified. When directly applied, this second ruleset did result in an improvement.

In order to get an indication as to whether different tasks would require similar amounts of training data, we repeated this with the Genetic Association Database (GAD) [24] dataset for gene/disease associations. We observed similar accuracy curves when plotting results from different tasks (Fig 1). 95% of the peak accuracy was reached with roughly 500 sentences per class.

During curation, it can be challenging to ascertain from a single sentence whether an interaction is being described. Rather than forcing difficult sentences into an existing category, we decided to collect these sentences separately. We hypothesised that having a bin for sentences which did not clearly belong in either the coincidental bin or the positive bin may enable us to train models with an emphasis on either precision or recall, with each model suited to different use cases. To verify this, we replaced 20% of positive sentences with sentences labelled as unclear/unknown by at least two curators.

The extraction of sentences was deemed more efficient than searching and reading entire papers, and using the model as an additional filter resulted in a much higher rate of useful sentences.
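One way to realise this precision/recall lever is sketched below. Which class the unclear sentences joined is not stated in the text, so the label is exposed as a parameter and flagged as an assumption; the sentence lists are placeholders.

```python
import random

def biased_training_set(positives, coincidentals, unclear,
                        frac=0.2, unclear_label=1, seed=0):
    """Replace `frac` of the positive sentences with unclear ones.
    unclear_label=1 treats borderline sentences as positive (recall
    emphasis); unclear_label=0 treats them as coincidental (precision
    emphasis). Returns shuffled (sentence, label) pairs."""
    rng = random.Random(seed)
    n_swap = int(len(positives) * frac)
    kept_pos = rng.sample(positives, len(positives) - n_swap)
    swapped_in = rng.sample(unclear, n_swap)
    data = ([(s, 1) for s in kept_pos]
            + [(s, 0) for s in coincidentals]
            + [(s, unclear_label) for s in swapped_in])
    rng.shuffle(data)
    return data
```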

Example: We also demonstrate that the antiaging effect of Sip2 acetylation is independent of nutrient availability and TORC1 activity.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Sentences denying an interaction were extremely rare in the initial training data; all three curators agreed on only one negative example.

Example: This comprises TRF1 and TRF2 which directly bind the duplex structure and POT1 which interacts with the single-stranded overhang tail.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Binding and interacting, but with a non-protein molecule.

Example: We constructed and delivered the shRNA-resistant myc-tagged DDX1 expression plasmids into the DBT cells, within which the endogenous DDX1 had already been knocked down by using shDdx1-1 (Figure 6A, lanes 4-6).
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: An edge case: DDX1 is tagged with myc, but via recombinant DNA technology as opposed to any interaction between independent proteins.

Example: Christopher Stroh (Muenster, Germany) presented an intriguing strategy for enhancing apoptosis sensitivity of tumor cells by transfection of the NF-kB inhibitor IkBa fused with the viral protein VP22...
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Another edge case: fusion proteins do not interact as independent entities.

Example: MICAL1 colocalizes with Rab8a.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Colocalisations are not considered positive according to either curation protocol.

Many of the model's mistakes are understandable and may be addressed with targeted improvements to the training data.
It was deemed that the potential existed for the model to improve curation time at SIGNOR/IntAct, so a further test was carried out using a specific set of proteins of interest.

Table 4. Differences in curation criteria.

Example: (e) Compound 4E2RCat, which is described in literature to be an inhibitor of the eIF4E/eIF4G interaction.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Despite the drug being the subject and the interaction clearly being established elsewhere in the literature, a protein interaction is mentioned in passing.

Example: Activation of NK cells in vitro with IL-2 induced equivalent amounts of GzmB in either genotype (data not shown).

Example: The coexpression of pIF-LukTer with a plasmid expressing MDA5 (pEF-BOS MDA5) stimulated luciferase activity but this activation was not significantly modified in EPZ treated-cells (Figure S1), suggesting that MDA5 does not play a pivotal role in the Dot1L-mediated regulation of the IFN pathway.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: In this case a causal interaction between one or more of the overexpressed proteins and luciferase is implied, but there is no link between two specific proteins.

Differences in criteria for curation between our initial recall-oriented effort and the more specific aims of SIGNOR/IntAct.

TERMite was used to identify sentences containing two proteins, at least one of which was listed by SIGNOR/IntAct as being of interest. These sentences were ordered according to the proteins present, such that all sentences supporting an interaction between a pair of proteins could be considered collectively. 144 of 210 (68.6%) sentences were deemed to contain an interaction. The accuracy was likely impacted by the ordering, with certain pairs posing particular difficulty for the model. One example of this was the pair IL17 and IL17R, often captured together as 'IL17/IL17R'.

Many of the model's errors were interpretable, such as interactions between one protein and one gene (see Table 5). The number of sentences scoring either 3 or 4 increased from 29.27% to 34.31% (+5.04%). Examples can be seen in Table 6. These results were positive, especially considering the low number of training samples available. One caveat to note is that this more stringent model was notably less likely to make positive predictions, so recall was likely reduced. The importance of recall depends largely on the ambition of the curators. For example, a database with low coverage and an aim to increase this indiscriminately and quickly will benefit from a model with a focus on precision.

Inversely, a database with significant coverage, or with an aim to target specific entities, will benefit from recall to ensure those few positives not already represented within the database are not missed. This once again illustrates the importance of models being amenable to fine tuning, as even within one database, models with different emphases may be required at different stages of its life cycle.

Example: In addition, no HDAC6-derived phosphopeptide was detected in our analysis suggesting that GRK2 does not exert its proviral role through HDAC6.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Negative examples still pose a problem for the model, as they remain poorly represented in the training data.

Example: ChIP experiments confirmed that there was a strong association of STAT3 with the GFAP promoter, suggesting the existence of mechanism that facilitates access of the STAT3 complex to the GFAP promoter.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Interaction between protein and gene.

Example: To evaluate whether TRIM25 could counteract the reduction of the antiviral response mediated by Dot1L inhibition, TRIM25 overexpression experiments were carried out.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Hypothesis positing a causal interaction.

Example: Depletion of STT3A, but not STT3B, causes a modest induction of the unfolded protein response (UPR) pathway.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Pathway induction implies an interaction may exist but none is explicitly described.

Example: (50) found that WNT5a can increase fibroblast proliferation through a "noncanonical" or b-catenin/TCF-independent signaling mechanism, indicating that both canonical and noncanonical WNTs may contribute to tumorigenesis.
Model prediction: Positive. SIGNOR/IntAct score: 0.
Comment: Effectively another negative example; asserts an unspecified signalling mechanism independent of the protein mentioned.

Many of the model's mistakes are understandable and may be addressed with targeted improvements to the training data.

Example: In the first phase of the assay a known amount of FVIII is inactivated by activated protein C (APC) in the presence of a Protein-S (PS) containing plasma sample, phospholipids and calcium ions.
Model prediction: Positive. SIGNOR/IntAct score: 3.
Comment: Detailed result.

Example: These results suggest a critical role for Sin3A in regulating GFAP expression during astrocytic differentiation.
Model prediction: Positive. SIGNOR/IntAct score: 3.
Comment: Novel finding with implication of further curatable information in the full text.

Conclusions

• Named entity recognition provides a useful starting point for curation, as well as for identifying sentences to pass to the trained model at inference. It also allows for the automatic replacement of specific entities with generic tokens, preventing the model from simply remembering protein pairs, which again applies to both training and inference.

• Inter-annotator agreement is essential for high quality data. Life science text is deeply complex and individuals regularly disagree when asked to classify content. In the case of PPIs, a voting mechanism with three independent curators dissolved most of these conflicts.

• Semisupervision may be a valuable preprocessing step where significant class imbalances are present in the data available for curation. Some entity relationships seem to be more amenable to this approach than others.

• Rules-based approaches offer interpretable results, but require careful manual tuning to balance bias and variance. While rules may be indefinitely tuned, in practice it is challenging to achieve efficacy comparable to deep learning results.

• Deep learning methods appear to be capable of some degree of abstraction when bootstrapped with data collected from rules-based methods. This seems to be particularly true where the rules in question account for grammatical dependency as opposed to pure named entity recognition.

• BioBERT is capable of achieving strong results with relatively few training samples. 500 sentences per class seems to be a reasonable target for an initial round of curation, at which point the accuracy curve should be assessed.

• Capturing examples that are unclear in a separate class may allow for training two models, with emphasis either on recall or precision.

• If a model is intended for use in multiple different settings, it is possible to fine tune a recall-oriented model to more specific criteria with relatively few training samples.

The SIGNOR/IntAct curation validated the use of relationship extraction models to streamline the identification of protein pairs for curation. However, it also illustrated the importance of further fine tuning to target specific subcategories within the scope of all sentences containing PPIs, and provided ideas for future work. In particular, we feel that the feasibility of developing a model for detecting novel findings warrants investigation. This could then be used in combination with the PPI model, or any other relationship extraction model, to filter results such that only sentences describing the initial identification of a relationship were targeted. These would be more likely to be accompanied by the evidence and quantitative metrics required for comprehensive database curation. Moderate gains were made by further fine tuning of the model using a small training set curated according to the different criteria of SIGNOR/IntAct, illustrating the potential for a model with broad criteria to be fine tuned to more specific use cases.