GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates

Biomedical entity normalization unifies the language across biomedical experiments and studies, and further enables us to obtain a holistic view of the life sciences. Current approaches mainly study the normalization of more standardized entities such as diseases and drugs, while disregarding more ambiguous but crucial entities such as pathways, functions, and cell types, hindering their real-world applications. To achieve biomedical entity normalization on these under-explored entities, we first introduce OBO-syn, an expert-curated dataset encompassing 70 different types of entities and 2 million curated entity-synonym pairs. To utilize the unique graph structure in this dataset, we propose GraphPrompt, a prompt-based learning approach that creates prompt templates according to the graphs. GraphPrompt obtained 41.0% and 29.9% improvement in the zero-shot and few-shot settings respectively, indicating the effectiveness of these graph-based prompt templates. We envision that our method GraphPrompt and the OBO-syn dataset can be broadly applied to graph-based NLP tasks, and serve as a basis for analyzing diverse and accumulating biomedical data.


Introduction
Mining biomedical text data, such as scientific literature and clinical notes, to generate hypotheses and validate discoveries has led to many impactful clinical applications (Lever et al., 2019). One fundamental problem in biomedical text mining is entity normalization, which aims to map a phrase to a concept in a controlled vocabulary (Sung et al., 2020). Accurate entity normalization enables us to summarize and compare biomedical insights across studies and obtain a holistic view of biomedical knowledge.

Figure 1: Illustration of GraphPrompt. GraphPrompt classifies a test synonym (CD115 (human)) to an entity in the graph by converting the graph into prompt templates based on the zeroth-order neighbor (T^0), first-order neighbors (T^1), and second-order neighbors (T^2).

Current approaches to biomedical entity normalization (Wright, 2019; Ji et al., 2020; Sung et al., 2020) often focus on normalizing more standardized entities such as diseases (Dogan et al., 2014; Li et al., 2016), drugs (Kuhn et al., 2007; Pradhan et al., 2013), genes (Szklarczyk et al., 2016) and adverse drug reactions (Roberts et al., 2017). Despite their encouraging performance, these approaches have not yet been applied to more ambiguous entities, such as processes, pathways, cellular components, and functions (Smith et al., 2007), which lie at the center of the life sciences. As scientists rely on these entities to describe disease and drug mechanisms (Yu et al., 2016), the inconsistent terminology used across different labs inevitably hampers scientific communication and collaboration, necessitating the normalization of these entities.
The first immediate bottleneck to achieving the normalization of these under-explored entities is the lack of a high-quality and large-scale dataset, which is the prerequisite for existing entity normalization approaches (Wright, 2019; Ji et al., 2020; Sung et al., 2020). To tackle this problem, we collected 70 types of biomedical entities from the OBO Foundry (Smith et al., 2007), spanning a wide variety of biomedical areas and containing more than 2 million entity-synonym pairs. These pairs are all curated by domain experts and together form a high-quality and comprehensive controlled vocabulary for the biomedical sciences, greatly augmenting existing biomedical entity normalization datasets (Dogan et al., 2014; Li et al., 2016; Roberts et al., 2017). The tedious and onerous curation of this high-quality dataset further confirms the necessity of developing data-driven approaches to automating this process and motivates us to introduce this dataset to the NLP community.
In addition to being the first large-scale dataset encompassing many under-explored entity types, OBO-syn presents a novel setting of graph-based entity normalization. Specifically, entities of the same type form a relational directed acyclic graph (DAG), where each edge represents a relationship (e.g., is_a) between two entities. Intuitively, this DAG could assist entity normalization, since nearby entities are biologically related and thus more likely to be semantically and morphologically similar. Existing entity normalization and synonym prediction methods are incapable of exploiting the topological similarity in this rich graph structure (Wright, 2019; Ji et al., 2020; Sung et al., 2020), limiting their performance, especially in the few-shot and zero-shot settings. Recently, prompt-based learning has demonstrated many successful NLP applications (Radford et al., 2019; Schick and Schütze, 2020; Jiang et al., 2020). The key idea of prompt-based learning is to circumvent the need for large amounts of labeled data by creating masked templates and converting supervised learning tasks into a masked-language modeling task. However, it remains unknown how to convert a large graph into text templates for prompt-based learning. Representing graphs as prompt templates could effectively integrate topological similarity and textual similarity while alleviating the over-smoothing caused by propagating textual features on the graph.
In this paper, we propose GraphPrompt, a prompt-based learning method for entity normalization with the consideration of graph structures.
The key idea of our method is to convert the graph structural information into prompt templates and solve a masked-language modeling task, rather than incorporating textual features into a graph-based framework. Our graph-based templates explicitly model higher-order neighbors (e.g., neighbors of neighbors) in the graph, which enables us to correctly classify synonyms that have relatively low morphological similarity with the ground-truth entity (Figure 1). Experiments on the novel OBO-syn dataset demonstrate the superior performance of our method against existing entity normalization approaches, indicating the advantage of considering the graph structure. Case studies and the comparison to a conventional graph approach further reaffirm the effectiveness of our prompt templates, suggesting opportunities for other graph-based NLP applications. Collectively, we introduce a novel biomedical entity normalization task, a large-scale and high-quality dataset, and a novel prompt-based solution to advance biomedical entity normalization.

Related Work
Biomedical entity normalization. Biomedical entity normalization has been studied for decades because of its significance in a variety of biomedical applications. Conventional approaches mainly relied on rule-based methods (D'Souza and Ng, 2015; Sullivan et al., 2011) or probabilistic graphical models (Leaman et al., 2013) to model morphological similarity, and are incapable of normalizing functional entities that are semantically similar but morphologically different. Deep learning-based approaches (Li et al., 2017; Wright, 2019; Pujary et al., 2020; Deng et al., 2019; Luo et al., 2018) and pre-trained language models (PLMs) (Ji et al., 2020; Sung et al., 2020; Miftahutdinov et al., 2021) have obtained encouraging results in capturing the semantics of entities by leveraging human annotations or large corpora. However, these approaches focus on datasets comprising less ambiguous entity types, such as drugs and diseases, and are not able to incorporate graph structures into their frameworks. In contrast, we aim to utilize the rich graph information to assist the normalization of more ambiguous entities such as functions, pathways, and processes.
Incorporating graph structure into text modeling. Graph-based approaches, such as network embedding (Tang et al., 2015) and graph neural networks (Kipf and Welling, 2016), have been used to model the structural information in text data, such as citation networks, social networks (Masood and Abbasi, 2021; Aljohani et al., 2020), and word dependency graphs (Fu et al., 2019). Among them, Kotitsas et al. (2019) considered the DAG structure most similar to our task and proposed a two-stage approach to integrate graph structure with textual information. The key difference between our method and existing approaches is that we transform the graph structure into prompt templates and then solve a masked-language modeling task, whereas existing works represent textual information as fixed node features and then optimize a graph-based model.

Prompt-based learning. Prompt-based learning has recently shown promising results in many applications, such as text generation (Radford et al., 2019; Brown et al., 2020), text classification (Schick and Schütze, 2020; Gao et al., 2020) and question answering (Khashabi et al., 2020; Jiang et al., 2020). Prompt-based learning has not yet been applied to integrate graph information. The most related prompt-based works to our task are prompt-based relation extraction (Han et al., 2021) and prompt-based knowledge base completion (Davison et al., 2019). These approaches only consider immediate neighbors in the graph and are not able to model more distant nodes, thus being incapable of capturing the topology of the entire graph. To the best of our knowledge, ours is the first work that considers higher-order graph neighbors in a prompt-based learning framework.

Dataset Description and Analysis
We collected 70 relational graphs from the Open Biological and Biomedical Ontology Foundry (OBO) (Smith et al., 2007). Nodes in the same relational graph represent biomedical entities belonging to the same type, such as protein functions, cell types, and disease pathways. Each edge represents a relation type, such as is_a, part_of, capable_of, and regulates. We leveraged these edge types to build templates in our prompt-based learning framework. The number of nodes in each graph ranges from 113 to 2,334,910 with a median of 3,077. The number of synonyms per entity ranges from 1 to 284 with a median of 2 (ignoring entities without synonyms). On average, each graph has 34,418 entity-synonym pairs, and 72.9% of the graphs have more than 1,000 entity-synonym pairs (Figure 2a). The graph structures and entity-synonym associations are all curated by domain experts, presenting a large-scale and high-quality collection.
In comparison to other biomedical entity normalization datasets (Dogan et al., 2014; Li et al., 2016; Roberts et al., 2017), OBO-syn presents a unique graph structure among entities. Intuitively, nearby entities, as well as their synonyms, should be semantically similar, as their biological concepts are related. To validate this intuition, we investigated the consistency between graph-based entity similarity and text-based entity similarity. In particular, we used the shortest distance on the graph to calculate the graph-based similarity and Sentence-BERT (Reimers et al., 2019) to calculate the text-based similarity. We observed a strong correlation between the two similarity scores (Figure 2b), suggesting the possibility of transferring synonym annotations from nearby entities to improve entity normalization.
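To make the two similarity measures concrete, here is a stdlib-only toy sketch; the paper uses Sentence-BERT embeddings and the full OBO graphs, so the toy graph, embeddings, and function names below are illustrative assumptions:

```python
from collections import deque

def shortest_distance(edges, src, dst):
    """BFS shortest-path length on an undirected view of the entity graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # disconnected

def cosine(a, b):
    """Cosine similarity between two dense text embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Toy is_a edges: (child, parent)
edges = [("neuron", "cell"), ("motor neuron", "neuron"), ("T cell", "cell")]
print(shortest_distance(edges, "motor neuron", "T cell"))  # 3
```

In the actual analysis, the graph-based distance would be computed per entity pair and correlated with the cosine similarity of the pair's sentence embeddings.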
We next compared the OBO-syn dataset with existing biomedical entity normalization datasets. We first observed very small overlaps of 5.26%, 14.59%, and 3.29% between our dataset and three widely-used biomedical entity normalization datasets, NCBI-disease (Dogan et al., 2014), BC5CDR-disease (Li et al., 2016), and BC5CDR-chemical (Li et al., 2016), respectively. The small overlaps with existing datasets indicate the uniqueness of our dataset, and further make us question the performance of state-of-the-art entity normalization methods on this novel dataset. More importantly, we noticed a substantially larger number of out-of-vocabulary phrases in our dataset compared to existing datasets (Figure 2c). We calculated the number of mentions of each phrase in 29 million PubMed abstracts, which are used as the pre-training corpus for biomedical pre-trained models (Gu et al., 2020). The 95th percentile of the number of mentions in our dataset is only 51, substantially lower than 487 in NCBI and 582 in BC5CDR, suggesting worse generalization when using pre-trained language models and motivating us to exploit the graph structures in this dataset.

Problem Statement
The goal of entity normalization is to map a given synonym phrase s to the corresponding entity v based on their semantic similarity. One unique feature of our problem setting is that entities of the same type form a relational graph. Formally, we denote this relational graph as G = (V, R, E), where V is the set of entities, R is the set of relation types, and E ⊂ V × R × V is the set of edges. Let C be the vocabulary of the corpus; each node v_i ∈ V is represented by an entity phrase over C. In addition to the graph, we also have a set of mapped synonyms S that is used as the training data. Each synonym s_j = (s_j^1, s_j^2, ..., s_j^{|s_j|}) ∈ S, with each token s_j^k ∈ C, is mapped to one entity v_i in the graph G. Our goal is to classify a test synonym s to an entity v in the graph. Since the majority of entities have very few synonyms (e.g., 96.9% of entities have fewer than 5 synonyms), we consider a few-shot and a zero-shot setting. Specifically, in the few-shot setting, the test-set entities are included in the training-set entities. In the zero-shot setting, by contrast, the entities in the training set and the test set do not overlap, so the synonyms of test entities are unobservable during training. The small number of training synonyms per entity can exacerbate over-fitting. To mitigate this problem, we propose graph-based prompt templates, which leverage the synonyms of nearby entities in the training data.
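The problem setting above can be captured with simple data structures; a minimal sketch with illustrative identifiers and phrases:

```python
# Relational graph G = (V, R, E); identifiers and phrases are illustrative.
V = {"GO:1", "GO:2", "GO:3"}
R = {"is_a", "part_of"}
E = {("GO:2", "is_a", "GO:1"), ("GO:3", "part_of", "GO:2")}  # E ⊆ V × R × V

# Entity phrases over the vocabulary C, and training synonyms S.
phrase = {"GO:1": "cell activation", "GO:2": "T cell activation",
          "GO:3": "T cell receptor signaling"}
synonyms = {"GO:2": ["activation of T cells"]}  # each synonym maps to one entity

# Every edge must connect two known entities via a known relation type.
assert all(h in V and r in R and t in V for h, r, t in E)
print(len(E))  # 2
```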

Base model
We first introduce a base model that only considers the textual information of synonyms and entities while disregarding the graph structure. Following previous work (Sung et al., 2020), the base model uses two encoders to calculate the similarity between the queried synonym s and the candidate entity v. The first encoder Enc_s encodes the queried synonym into the dense representation x_s = Enc_s(t_s). The second encoder Enc_v encodes the candidate entity into the dense representation x_v = Enc_v(t_v). The predicted probability of choosing entity v is then calculated as:

p(v | s) = exp(Q(x_v, x_s)) / Σ_{v'∈V} exp(Q(x_{v'}, x_s)),   (1)

where Q(x_v, x_s) = x_v^T x_s.
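The scoring and training objective above reduce to a softmax cross-entropy over candidate entities. A minimal numeric sketch with toy 2-d embeddings (in the paper, x_v and x_s come from BioBERT and the similarity additionally involves a BatchNorm; the names here are illustrative):

```python
import math

def q(x_v, x_s):
    """Dot-product similarity score between entity and synonym embeddings."""
    return sum(a * b for a, b in zip(x_v, x_s))

def nll(x_s, candidates, gold):
    """-log p(gold | s), with a softmax over the candidate entity set."""
    scores = {v: q(x_v, x_s) for v, x_v in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())
    return -math.log(math.exp(scores[gold]) / z)

x_s = [1.0, 0.0]
candidates = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}
loss = nll(x_s, candidates, "v1")
print(round(loss, 4))  # 0.3133
```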

We select BioBERT with the [CLS] readout function as Enc_v and Enc_s, and share the parameters between the two encoders. Following Sung et al. (2020), the inputs t_v and t_s are designed as "[CLS] v [SEP]" and "[CLS] s [SEP]", respectively. In practice, we find that the initial [CLS] output vectors are fairly close to each other. This can result in large positive x_v^T x_s, which leads to slow convergence and potential numerical issues, yet is not addressed by BioSyn (Sung et al., 2020). To alleviate this issue, we use a trainable 1-d BatchNorm layer BN and redefine our similarity function Q as:

Q(x_v, x_s) = BN(x_v^T x_s).   (2)

When the candidate entity set is large, backpropagating through x_v incurs a high memory cost, since |V| computation graphs must be constructed to obtain the x_v's. To tackle this problem, we apply the stop-gradient trick sg(·) to x_v, following Sung et al. (2020). In addition, we adopt the hard-negative strategy of Sung et al. (2020) by sampling difficult negative candidates U ⊂ V. The loss function is the cross-entropy over the candidate set U ∪ {v}, with sg(·) applied to the entity embeddings.

Prompt model

BioSyn (Sung et al., 2020) and the base model use BioBERT with the [CLS] readout function as the encoders, which take the synonym or entity as input and use the hidden state of [CLS] as the output. However, using a bare synonym or entity phrase as the input might not fully capture its semantics, since PLMs are typically pre-trained on sentences rather than phrases. To tackle this problem, we construct two simple prompt templates T^0 for a training entity-synonym pair (v, s), which verbalize the identical relation between the phrase and a [MASK] token (the [CLS] and [SEP] tokens are omitted). We then optimize the model by solving a masked-language modeling task, where we use the output of BioBERT at the [MASK] token as the dense representation x_v (x_s) for v (s). Since the graph is not used here, we refer to x_v as the zeroth-order representation of entity v. The loss function of the prompt model is analogous to that of the base model, except that we use the whole entity set V as candidates instead of a subset:

L_p = -Σ_{(v,s)} log [ exp(Q(x_v, x_s)) / Σ_{v'∈V} exp(Q(x_{v'}, x_s)) ].   (6)

GraphPrompt model

Intuition
The observation that nearby entities are more semantically similar (Figure 2b) motivates us to integrate textual similarity with graph topological similarity to boost entity normalization. Conventional approaches often integrate text and graph information by adopting a graph-based framework and incorporating text features as node features (Kotitsas et al., 2019). However, such approaches might not fully utilize the strong generalization ability of pre-trained models, which has been crucial for a variety of NLP tasks (Devlin et al., 2018; Petroni et al., 2019). In contrast to conventional approaches, we propose a prompt-based learning framework that integrates text and graph information by representing the graph as prompt templates. To the best of our knowledge, our method is the first attempt to represent graph structure as prompt templates.

First-order GraphPrompt
GraphPrompt uses Equation 1 for inference but utilizes the graph information during training. It considers the first-order neighborhood (i.e., immediate neighbors) and the second-order neighborhood (i.e., neighbors of neighbors) to construct prompt templates for a given entity.
To model first-order neighbors, GraphPrompt defines the template T^1_r(v_i, v_j) = "v_i r̄ v_j" for an edge between entity v_i and its immediate neighbor v_j with relation type r, where the phrase r̄ is created from r with minor morphological changes, as listed in Table 3. For a given triple (v_i, r, v_j) in the graph, we create a masked-language modeling task by randomly masking v_i or v_j. We also include templates that replace the unmasked entity with one of its training synonyms. For example, when v_i is masked and v_j is replaced with its synonym s_k, we obtain the template T^1_r([MASK], s_k). We then use BioBERT to obtain the first-order representation y_{v_i} from the output at the [MASK] token:

y_{v_i} = Enc(T^1_r([MASK], v_j)).   (7)

The loss term L_1 is then calculated by comparing the first-order representation y_{v_i} with the zeroth-order representation x_{v_i}, analogously to Equation 6 with y_{v_i} as the query.
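Generating first-order templates from the edge list is mechanical; a sketch assuming a small relation-to-phrase map in the spirit of Table 3 (the exact special tokens and phrase strings are illustrative):

```python
PHRASE = {"is_a": "is a", "part_of": "is part of"}  # relation -> phrase (cf. Table 3)

def t1(head, rel, tail, mask="head"):
    """Verbalize an edge (head, rel, tail) as 'head <phrase> tail', masking one side."""
    r_bar = PHRASE[rel]
    h = "[MASK]" if mask == "head" else head
    t = "[MASK]" if mask == "tail" else tail
    return f"[CLS] {h} {r_bar} {t} [SEP]"

# Masking v_i; v_j could equally be replaced by one of its training synonyms s_k:
print(t1("motor neuron", "is_a", "nerve cell", mask="head"))
# [CLS] [MASK] is a nerve cell [SEP]
```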

Second-order GraphPrompt
To consider second-order neighbors, GraphPrompt first finds all 2-hop relational paths (v_i, r, v_j, is_a, v_k) in the graph. Since the is_a relation accounts for the majority of relation types, we fix the second relation to is_a for simplicity. The prompt template T^2_r(v_i, v_j, v_k) then verbalizes this 2-hop path by chaining the phrase forms of the two relations. Different from T^0 and T^1, there are three tokens that can be masked in T^2. We chose to mask two tokens in each template, resulting in two kinds of second-order templates: T^2_r([MASK], v_j, [MASK]) and T^2_r(v_i, [MASK], [MASK]). We do not consider the template T^2_r([MASK], [MASK], v_k) because of the DAG structure in our dataset: the numbers of child and grandchild nodes grow exponentially in a DAG, so this template would introduce too many paths and slow down the optimization.
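Enumerating the 2-hop relational paths (v_i, r, v_j, is_a, v_k) can be sketched as follows (the toy edge list and names are illustrative):

```python
def two_hop_paths(edges):
    """All paths (v_i, r, v_j, 'is_a', v_k): any first relation, is_a second."""
    is_a_children = {}
    for h, r, t in edges:
        if r == "is_a":
            is_a_children.setdefault(h, []).append(t)
    paths = []
    for h, r, t in edges:
        for k in is_a_children.get(t, []):
            paths.append((h, r, t, "is_a", k))
    return paths

edges = [("ABLK neuron", "part_of", "abdominal ganglion"),
         ("abdominal ganglion", "is_a", "ganglion")]
print(two_hop_paths(edges))
# [('ABLK neuron', 'part_of', 'abdominal ganglion', 'is_a', 'ganglion')]
```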
To calculate the loss term L_2 based on T^2_r([MASK], v_j, [MASK]), we compare the second-order dense representations z_{v_i} and z_{v_k} to the zeroth-order dense representations x_{v_i} and x_{v_k}.

Dataset     mp      cl      hp      fbbt    doid
#synonyms   26119   27242   20070   23870   21401
#entities   13752   10939   16544   17475   13313

Table 1: The performance of our method and comparison approaches on the 5 datasets under the zero-shot and few-shot settings (Acc@1 and Acc@10 for each). The best model in each column is colored in blue and the second best in light blue. See Appendix A.2 for detailed implementations of the comparison approaches.
Although we could define higher-order templates analogously, we observed limited improvement from including third-order or even higher-order templates in our experiments. This observation is consistent with conventional graph embedding approaches, where only the first-order and second-order neighborhoods are explicitly modeled (Tang et al., 2015). For the 2-hop relational paths, we did not consider sibling-based templates such as "Both [MASK] and [MASK] are a kind of v" due to the large number of sibling pairs in the DAG. Nevertheless, such templates might be worth exploring on other graphs.

In practice, different entities may have similar representations x_v, making them indistinguishable at test time. This issue can be exacerbated when the graph structure is incorporated: for two edges (v_i, is_a, v_j) and (v_{i'}, is_a, v_j), the model tends to increase the similarity between the embeddings of the siblings v_i and v_{i'}. To alleviate this problem, we add a contrastive loss term L_c that encourages the model to distinguish different entities by pushing their representations apart. The final loss of our model combines L_p, L_c, L_1 and L_2, with weights λ_p, λ_c, λ_1 and λ_2 chosen on the validation set:

L = λ_p L_p + λ_c L_c + λ_1 L_1 + λ_2 L_2.

Experimental Results

Experimental settings
We selected five graphs (mp, cl, hp, fbbt, doid) with between 10,000 and 20,000 entities from OBO-syn. We investigated a few-shot setting and a zero-shot setting. In the few-shot setting, we split the synonyms into six folds, and used four folds as the training set, one fold as the validation set, and one fold as the test set. In the zero-shot setting, we split all entities into three folds, and used two folds as the training set and one fold as the test set. All synonyms of training (test) entities are observable (unobservable) during training. Our method and all comparison approaches used the same data split.
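The two splitting procedures can be sketched as follows (fold counts follow the text; the seeding and function names are illustrative):

```python
import random

def few_shot_split(synonym_pairs, seed=0):
    """Split synonym pairs into 6 folds: 4 train, 1 validation, 1 test."""
    pairs = list(synonym_pairs)
    random.Random(seed).shuffle(pairs)
    folds = [pairs[i::6] for i in range(6)]
    return folds[:4], folds[4], folds[5]

def zero_shot_split(entities, seed=0):
    """Split entities into 3 folds: 2 train, 1 test; no entity overlap."""
    ents = list(entities)
    random.Random(seed).shuffle(ents)
    folds = [set(ents[i::3]) for i in range(3)]
    return folds[0] | folds[1], folds[2]

train_e, test_e = zero_shot_split(range(9))
print(sorted(train_e & test_e))  # [] (the two entity sets are disjoint)
```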
We compared our method to state-of-the-art entity normalization approaches: Sieve-Based (D'Souza and Ng, 2015), BNE (Phan et al., 2019), NormCo (Wright, 2019), TripletNet (Mondal et al., 2020), and BioSyn (Sung et al., 2020), as well as a graph convolutional network (GCN) (Kipf and Welling, 2016). We also compared our method with a base model (L_base), a prompt model (L_p + L_c), and a first-order GraphPrompt (w/o T^2) (L_1 + L_p + L_c). See Appendix A.2 for the implementation details of the baselines and our methods.

Improved performance in few-shot setting
We first sought to evaluate the performance of our method in the few-shot setting (Table 1). We found that our method outperformed all other approaches in all metrics on all datasets. Compared to the best-performing entity normalization approach, BioSyn, our method obtained an average 27.7% improvement on Acc@10 and 35.5% improvement on Acc@1, indicating the prominence of using the graph structure to leverage annotations from nearby entities. We found that using the graph structure leads to larger improvements on datasets with fewer training samples (39.6% improvement on doid compared to 24.9% on mp), suggesting GraphPrompt's ability to learn from limited samples. We next compared our method to the graph-based approach GCN and observed a superior performance of GraphPrompt, confirming the effectiveness of modeling graph structures using prompt templates. The base model, which does not exploit the graph structure, also performed better than GCN, partially due to the over-smoothing issue of GCN. Despite being less competitive than our method, GCN still outperformed most of the entity normalization approaches that do not consider graph structure, reaffirming the advantage of using graph structure on this dataset.
To further verify that the improvement of our method comes from using the graph structure, we compared GraphPrompt with the base prompt model and the first-order prompt model. Overall, GraphPrompt outperforms both by utilizing the second-order neighborhood, while the first-order prompt model in turn outperforms the base prompt model. Collectively, our results clearly affirm the importance of considering the graph structure and the effectiveness of modeling it using prompt templates.

Improved performance in zero-shot setting
After verifying the superior performance of our method in the few-shot setting, we next investigated the more challenging zero-shot setting, where the ground-truth entities in the test set have no synonyms in the training set (Table 1). Likewise, our method outperformed all comparison approaches in all metrics on all datasets. We found that GraphPrompt obtained a larger improvement over BioSyn in the zero-shot setting than in the few-shot setting. Since ground-truth entities do not have any observed synonyms in the zero-shot setting, graph information becomes more crucial for aggregating synonym annotations from nearby entities.
The consistent improvement of GraphPrompt over GCN in both zero-shot and few-shot settings further confirms the effectiveness of using prompt templates to capture the graph structure. GraphPrompt also shows consistent improvement over the base prompt model and the first-order prompt model, indicating the importance of considering second-order neighbors in the graph.

Improvement analysis
We sought to investigate where the superior performance of GraphPrompt comes from. We first calculated the textual similarity between the test synonyms and their ground-truth entities using Sentence-BERT (Reimers et al., 2019). We found that the improvement of GraphPrompt over the base model grows as this textual similarity decreases (Figure 3a). Entity-synonym pairs with smaller textual similarity are harder to predict correctly using only textual information, and thus benefit more from the graph structure. Moreover, the low overlap with the pre-training corpus limits the knowledge available from PLMs, necessitating the consideration of graph information.
We then sought to study the improvement of GraphPrompt over GCN. Interestingly, we found that GCN tends to perform better on Acc@10 than on Acc@1, whereas our method shows consistent improvements on both metrics.
To further verify this, we examined the improvement of our method over GCN at different depths in the graph (Figure 3b). We found that the improvement of our method over GCN becomes larger when the depth of the entity is smaller. Because of the DAG structure of our graphs, entities with smaller depth are closer to the center of the graph and could be more disturbed by the over-smoothing issue. In contrast, our method explicitly converts the graph structure into prompt templates, successfully alleviating the over-smoothing caused by propagating over the entire graph.
Next, we examined the effect of the L_c loss term in our method (Figure 3c). As expected, adding L_c greatly improved the performance on all datasets in the few-shot setting. The improvement is much larger on datasets with worse overall performance (e.g., fbbt, doid), indicating the importance of separating the embeddings of different entities. We also noticed that the accuracy of state-of-the-art entity normalization approaches, such as BioSyn and NormCo, is much worse on our OBO-syn dataset than on mainstream datasets such as BC5CDR and NCBI (see the results in Sung et al. (2020)), further confirming the difficulty of our task and dataset.
Finally, we present two case studies of how GraphPrompt utilizes the graph structure to correctly identify the entity (Figure 1 and Table 2). We found that GraphPrompt performs a 'recombination' of two nearby phrases using the graph-based prompt templates during prediction. For example, GraphPrompt correctly classified the test synonym 'adult Leucokinin ABLK neuron of the abdominal ganglion' to the entity 'adult abdominal ganglion Leucokinin neuron' by combining it with the second-order neighbor 'larval Leucokinin ABLK neuron of the abdominal ganglion', whereas comparison approaches classified it to incorrect but semantically similar entities (e.g., 'adult anterior LK Leucokinin neuron') (Table 2). Likewise, GraphPrompt correctly classified 'CD115 (human)' to 'macrophage ... receptor (human)' by recombining it with CD115 according to the first-order prompt template. These recombinations of nearby entities reaffirm the effectiveness of graph-based prompts in biomedical entity normalization.

Conclusion and Future Work
We have presented a novel biomedical entity normalization dataset, OBO-syn, that encompasses 70 biomedical entity types and 2 million entity-synonym pairs. OBO-syn has small overlaps with existing datasets and more challenging entity-synonym predictions. To leverage the unique graph structures in OBO-syn, we have proposed GraphPrompt, which converts graph structures into prompt templates and then solves a masked-language modeling task. GraphPrompt obtained superior performance to state-of-the-art entity normalization approaches in both few-shot and zero-shot settings.
Since GraphPrompt can in principle integrate other types of graphs and text information, we are interested in exploiting GraphPrompt in other graph-based NLP tasks, such as citation network analysis and graph-based text generation. The novel OBO-syn dataset can also advance tasks beyond entity normalization, such as link prediction and graph representation learning, and can be integrated with other scientific literature datasets to investigate entity linking, key phrase mining, and named entity recognition. We envision that our method GraphPrompt and OBO-syn will pave the way for comprehensively analyzing diverse and accumulating biomedical data.

A Appendix
A.1 Relations and phrases

Table 3 shows the relations among entities and their corresponding phrases. The relation identical links an entity and a synonym to state that the synonym refers to the entity. During training, the relation identical links [MASK] and a synonym or entity to extract the textual features. Among the other relations, is_a is the most common; it describes the subsumption relation between a child entity and a parent entity. We transform these relations into phrases so that they can be placed in the templates used by our prompt-based models.

A.2 Implementation details
Details about prompt-based methods For the prompt-based methods (Prompt, GraphPrompt (w/o T^2), and GraphPrompt), we trained the model with L_c for 400 iterations to warm up the entity embeddings x_v. For the zero-shot setting, we followed the bi-encoder architecture that uses two encoders for entities and synonyms; every time we updated the entity embeddings x_v, v ∈ V, we had to run the encoder for every entity. For the few-shot setting, we found that the entity embeddings can be trained directly as an embedding layer. We used the entity side of the bi-encoder to generate the initial entity embeddings x_v^0, which initialize the embedding layer, and then used the embeddings from this trainable layer to replace the sg(x_v) terms in the loss.
Details about second-order GraphPrompt The second-order GraphPrompt (GraphPrompt in Table 1) does not actually include separate zeroth-order and first-order templates, since we consider them sub-templates of the second-order templates. We achieved this by padding the shorter templates to the second-order form; to obtain x_v and y_v from such a template, one only needs to ignore the output of the second mask.
Details about the base model The base model is a BioSyn-like model with some important modifications. We trained the model for 30 epochs with an initial learning rate of 1e-5, decayed to about 1e-6 by the time the model converged. We used sparse features (Sung et al., 2020) during candidate generation. During encoding, we did not add sparse features, since we found that they had no significant impact on the results and even caused a slight decrease in accuracy. Besides, we found that BioSyn (Sung et al., 2020) sometimes failed to retrieve positive candidates due to the limited candidate size and the inaccuracy of the model. We therefore manually added positive candidates in order to make full use of the training data. The inference procedure is the same as in BioSyn (Sung et al., 2020).

Table 3 (excerpt): relations and their corresponding phrases.
develops_from → develops from
has_sensory_dendrite_in → has sensory dendrite in
sends_synaptic_output_to → sends synaptic output to
synapsed_to → is synapsed to
synapsed_by → is synapsed by
continuous_with → is continuous with
synapsed_via_type_Ib_bouton_to → is synapsed via type Ib bouton to
receives_synaptic_input_in → receives synaptic input in
overlaps → overlaps

Details about baselines NormCo (Wright, 2019) was initially introduced to perform bio-entity linking with text corpora as inputs. Central to its proposed method is the modeling of coherence leveraging concept co-mentions in each text corpus. However, as NormCo is not designed to learn the semantics of concepts, it is not capable of zero-shot learning on our dataset. Therefore, we did not report its results under the zero-shot setting.
In addition, to construct a coherence sequence analogous to the co-appearances of mentions in NormCo's original setting, we took the mentions (entities and synonyms) of the neighbor concepts of each training concept (excluding validation and test mentions during training), with the mentions ordered by their distance from the central concept for which the sequence is built.
As Sieve-Based (D'Souza and Ng, 2015) is a rule-based entity normalization method that does not require training data, we treated it as a zero-shot model. Besides, Sieve-Based does not include a scoring mechanism, so we could only report its Acc@1 results.