MACE2K: A Text-Mining Tool to Extract Literature-based Evidence for Variant Interpretation using Machine Learning

Interpretation of a given variant’s pathogenicity is one of the most profound challenges to realizing the promise of genomic medicine. A large amount of information about associations between variants and diseases, which curators and researchers use to interpret variant pathogenicity, is buried in the biomedical literature. Text-mining tools that extract relevant information from the literature can speed up and assist the variant interpretation curation process. In this work, we present MACE2K, a text-mining tool that extracts evidence sentences containing associations between variants and diseases from full-length PMC Open Access articles. We use different machine learning (ML) models, both classical and deep learning, to identify evidence sentences with variant-disease associations. Evaluation shows promising results, with a best F1-score of 82.9% and AUC-ROC of 73.9%. Classical ML models had better recall (96.6% for Random Forest) than deep learning models. The Convolutional Neural Network, a deep learning model, had the best precision (75.6%), which is essential for any curation task.


Introduction
The interpretation of any given variant's pathogenicity is one of the most profound challenges to realizing the promise of genomic medicine. It is imperative to understand how gene variants impact certain diseases and associated phenotypes. Consortium projects like ClinGen [1], gnomAD [2], and GA4GH [3] are curating knowledge bases for understanding the clinical relevance of human genetic variation, based on novel methods for assessing the clinical actionability of genes and the pathogenicity of genetic variants. Literature review to identify published assertions of associations between variants and diseases is a crucial step in the variant interpretation curation process. The volume of scientific literature describing variant-disease assertions has risen due to sequencing techniques [4]. Databases of variants and associated diseases that are manually curated through literature review, such as COSMIC [5], BioMuta [6], OMIM [7], HGMD [8], dbSNP [9], ClinVar [10], and CIViC [11], have been developed. It is becoming increasingly difficult for biocurators, clinical researchers, and clinicians to keep up with the rapidly growing volume and breadth of variant-related information in the published literature. The value of extracting and understanding genetic variations and their relationship to disease from the literature has been recognized [12], and there is a pressing need for text-mining tools that extract evidence of variant-disease associations from the literature.
Extracting such relevant information from the literature will speed up and assist the manual variant interpretation curation process. To address these curation needs, we have developed a text-mining tool named MACE2K that extracts evidence sentences indicating variant-disease associations from full-length PMC articles. We treat this extraction task as a classification problem, i.e., given a sentence with the variant and disease annotated, we train ML models to classify the sentence as positive or negative, indicating the presence or absence of a variant-disease association. We train and test classical ML models (Logistic Regression, Support Vector Machines, and Random Forest) and deep learning models (Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks) for evidence sentence classification. The ML classifiers were evaluated using 5-fold cross-validation on an in-house curated dataset and achieved average precision, recall, and F1-score of up to 75.6%, 96.6%, and 82.9%, respectively.

Methods
As indicated earlier, MACE2K is a text-mining tool that extracts evidence sentences indicating variant-disease associations from full-length PMC articles. An example positive sentence indicating a variant-disease association is provided in Example 1 below. In this instance, an association between the variant "His239Arg", the associated gene "HRAD9", and the disease "lung adenocarcinoma" is stated. We treat this extraction task as a classification problem, i.e., given a sentence with the variant and disease annotated, we train ML models to classify the sentence as positive or negative, indicating the presence or absence of a variant-disease association. We use PubTator [13] for entity typing, i.e., to detect and normalize the gene, variant, and disease mentions in a sentence to be classified. In previous work, we developed a pattern-based relation extraction system called eGARD [14] to extract relationships between variants, diseases, and drug responses from abstracts. As with any rule- or pattern-based approach, eGARD suffered from low recall. In this work, we employ ML to address the issue of low recall and additionally extend the extraction to full-length PMC articles. We present an overall workflow of the study design and methodologies in

Creation of the curated dataset
To create this dataset, we formulated an annotation protocol, which was provided to the curators. The objective of this annotation experiment was to highlight the evidence sentences in full-length PMC articles that indicate an assertion by the author of a relationship between any pair of the three entities: (1) disease, (2) gene, and/or (3) variant. Additionally, the annotator marked the entities (disease, gene, variant) in the evidence sentences with normalized identifiers (HGNC [15], MONDO [16], ClinVar [10]). If no associated ClinVar identifier existed for the annotated variant, the curator used a ClinGen Allele Registry (CA) [17] identifier to normalize the variant. An annotation tool called Hypothes.is [18] was used to assist in the creation of the curated gold set. Based on the annotation protocol, 1000 evidence annotations (gene-variant, gene-disease, or variant-disease associations) from 87 PMC Open Access articles were annotated. Statistics of the curated dataset are given in Table 1. A total of 557 evidence sentences were annotated, of which 181 contained a variant-disease association. As our aim is to extract evidence sentences containing a variant-disease association relevant for variant pathogenicity interpretation, we used these 181 evidence sentences as positive instances for our ML models.

Feature Engineering
In order to train ML models, the evidence sentences for positive and negative annotations need to be converted to structured features. The first set of features, used as input for our classical ML models, is term frequency-inverse document frequency (tf-idf) weighting [21]. The second set of features, used as input to our deep learning models, is distributed word embeddings, which have been shown to achieve better performance in NLP tasks by learning similar vectors for similar words [22,23]. For our deep learning models, we used the publicly available pre-trained word embeddings from NCBI, BioWordVec [24,25]. Words that were not present in the set of pre-trained words were set to a zero vector.
The biLSTM architecture consisted of a 64-cell bidirectional LSTM layer followed by two pooling layers (maximum and average). The outputs of the maximum and average pooling layers were concatenated and fed to a fully connected layer with 64 units (with ReLU activation). A dense layer of size 1 with a sigmoid activation function was applied over the fully connected layer to obtain the biLSTM classifier. For the deep learning models, we used binary cross-entropy as the objective loss function and the Adam algorithm [28] to optimize the loss function. To train the parameters of the CNN and biLSTM, we used mini-batch training with a batch size of 32. Between each layer of our deep learning architectures, we added a dropout layer with a dropping probability of 0.5 to avoid overfitting during training. We set the number of epochs to 100 during training.
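The two feature representations described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the tf-idf variant (raw term count multiplied by log(N/df)), the function names, and the toy sentences and two-dimensional embedding values are all assumptions made for illustration, and the actual BioWordVec vectors are much larger.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_sentences):
    """Return (vocabulary, tf-idf vectors), one vector per sentence.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(tokenized_sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for sent in tokenized_sentences for term in set(sent))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for sent in tokenized_sentences:
        tf = Counter(sent)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors


def embed(tokens, pretrained, dim):
    """Map tokens to vectors; out-of-vocabulary tokens get a zero vector."""
    return [pretrained[tok] if tok in pretrained else [0.0] * dim
            for tok in tokens]


# Toy input (hypothetical, not taken from the curated dataset).
sentences = [
    ["variant", "associated", "with", "disease"],
    ["variant", "detected", "in", "patients"],
]
vocab, vectors = tfidf_vectors(sentences)

# Toy stand-in for a pre-trained embedding table.
toy_embeddings = {"variant": [0.1, 0.2], "disease": [0.3, 0.4]}
embedded = embed(sentences[0], toy_embeddings, dim=2)
```

In this toy input, "variant" occurs in every sentence and therefore receives a tf-idf weight of zero, while "associated" and "with" have no pre-trained vector and are mapped to zero vectors.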

Results and Discussion
We compared the performance of the different ML models in classifying evidence sentences using 5-fold cross-validation. Average precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) were computed across the folds and reported in Table 2.

Table 2. Evaluation results
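For reference, the four reported metrics can be computed for a single fold as sketched below and then averaged over the five folds. This is a minimal sketch, assuming binary labels (1 = positive) and a classifier that outputs a score per sentence; the function names, the 0.5 decision threshold, and the toy values are illustrative assumptions and do not reproduce the results of this study.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random positive outscores
    a random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_over_folds(values):
    """Average a metric over cross-validation folds."""
    return sum(values) / len(values)


# Toy fold (hypothetical values).
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

Because AUC-ROC is computed from the scores rather than from thresholded predictions, a model can rank well on one measure and less well on the other, which is why both are reported.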
An initial analysis of the errors made by the deep learning models was conducted. We observed that some of the automatically generated negative instances in our dataset were labeled incorrectly. As indicated earlier, we assigned a negative label to any sentence with a variant and a disease mention that was not annotated by a curator. This approach incorrectly labels a positive sentence (with a variant-disease association) as a negative instance if it was missed during the annotation process. Subsequent verification of the automatically labeled negative instances by a curator will resolve this issue.
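The automatic labeling rule discussed above, and the way it produces false negatives, can be sketched as follows. The record layout, the field names, and the sentence identifiers are hypothetical; the only logic taken from the text is that a sentence mentioning both a variant and a disease is labeled positive if a curator annotated it and negative otherwise.

```python
def label_candidates(sentences, curated_ids):
    """Label candidate sentences for variant-disease classification.

    sentences: list of dicts with 'id' and 'entity_types' (the set of
        entity types found in the sentence by the entity tagger).
    curated_ids: ids of sentences annotated by a curator as stating a
        variant-disease association.
    """
    labeled = []
    for sent in sentences:
        # Only sentences mentioning both a variant and a disease are candidates.
        if not {"variant", "disease"} <= sent["entity_types"]:
            continue
        # Any candidate the curator did not annotate becomes a negative,
        # including true associations that the curator missed.
        label = 1 if sent["id"] in curated_ids else 0
        labeled.append((sent["id"], label))
    return labeled


# Toy input (hypothetical identifiers).
sentences = [
    {"id": "s1", "entity_types": {"gene", "variant", "disease"}},
    {"id": "s2", "entity_types": {"variant", "disease"}},
    {"id": "s3", "entity_types": {"gene", "disease"}},
]
labeled = label_candidates(sentences, curated_ids={"s1"})
```

Here "s2" is labeled negative solely because it lacks a curator annotation; if it in fact states an association, it is exactly the kind of noisy negative that curator verification would correct.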
A potential threat to the validity of our results is the small set of evidence annotation sentences (181 positive and 98 negative) with a variant and disease mention. We plan to add more curations to our dataset to validate and further improve the results. Although deep learning models can achieve good performance without complex human-engineered features, they require large amounts of data to effectively train the numerous parameters in the model. As a future step, we aim to investigate state-of-the-art ML techniques for learning from a small gold set and a large "noisy" automatically labeled dataset, such as transfer learning [29,30], distant supervision [31-33], and adversarial networks [34,35]. We will automatically generate large amounts of distantly labeled data using existing knowledge bases with known variant-disease entity pairs, such as CIViC [11], ClinGen [1], and ClinVar [10].
Noise-reduction heuristics will be used to remove noise in the distantly labeled data. We will first train our models on the large distantly labeled set and then re-train them on the small amount of human-labeled data to increase accuracy. The Convolutional Neural Network (CNN) had the best precision (75.6%), which is essential for any downstream curation task. We believe that MACE2K will assist and speed up the variant interpretation curation process by extracting relevant information from the literature.
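One common form of distant supervision, given here only as a sketch of the planned direction and not as an implemented component, labels a sentence positive when a variant-disease pair it mentions is already recorded in a knowledge base. The record layout, the function name, and the placeholder identifiers below are assumptions; no real knowledge base entries are used.

```python
def distant_label(sentences, known_pairs):
    """Assign noisy labels from a set of known (variant, disease) pairs.

    sentences: list of dicts with 'text', 'variants', and 'diseases'
        (normalized identifiers of the entities mentioned).
    known_pairs: set of (variant_id, disease_id) tuples from a knowledge base.
    """
    labeled = []
    for sent in sentences:
        pairs = {(v, d) for v in sent["variants"] for d in sent["diseases"]}
        if not pairs:
            continue  # no co-mentioned variant and disease, not a candidate
        # Positive if any co-mentioned pair is a known association.
        label = 1 if pairs & known_pairs else 0
        labeled.append((sent["text"], label))
    return labeled


# Placeholder identifiers (hypothetical).
known_pairs = {("VAR:1", "DIS:1")}
sentences = [
    {"text": "sentence A", "variants": ["VAR:1"], "diseases": ["DIS:1"]},
    {"text": "sentence B", "variants": ["VAR:2"], "diseases": ["DIS:1"]},
    {"text": "sentence C", "variants": [], "diseases": ["DIS:1"]},
]
distant = distant_label(sentences, known_pairs)
```

Labels produced this way are noisy in both directions: a sentence can mention a known pair without asserting the association, and a true association can be absent from the knowledge base.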