Text mining-based word representations for biomedical data analysis and machine learning tasks

Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth of the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted of substituting synonymous terms with their preferred terms from biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train graph convolutional neural networks (Graph-CNNs) on breast cancer gene expression data to predict the occurrence of metastatic events. The performance of the resulting models was compared to Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed best for the metastatic event prediction task compared to other networks.
Word representations produced by text mining algorithms like word2vec therefore capture biologically meaningful relations between entities.


Corpus pre-processing and training of word embeddings
Natural language text data usually require a certain amount of preparation before they can be fed into model training. For pre-processing of the text corpus we implemented a pipeline that conducted the steps depicted in Fig 1. The first phase applied classical processing steps such as lowercasing, lemmatization, and removal of punctuation and numerical forms. We assumed that replacing synonyms of biomedical terms with their main terms can affect the similarity between words in a way that better captures functional relationships between biomedical entities. Thus, we introduced an optional step before training, in which synonymous terms were substituted by externally defined main terms. This corpus was then used to train the word2vec model. For training we used the word2vec implementation in Gensim [28] with a context window of size 5, a minimum count of 5, and 300 neurons, which is also the number of generated vector dimensions.

For this study, we generated the following two word2vec embeddings: 'Embedding_v1', in which synonymous terms of genes, diseases, and drugs were substituted by their preferred terms from HUGO [29], MeSH [30], and DrugBank [31], respectively; and 'Embedding_v2', for which the same preprocessing strategies and the same training process were applied but without replacing synonyms. Moreover, we assigned type labels using the same biomedical databases mentioned above in order to filter similarities. This enabled us to compare similarities between entities in the obtained embeddings to existing knowledge in biomedical databases. […] [36] with more than 5 and less than 1000 member diseases. We calculated medians, lower and upper quartiles of cosine similarities for gene pairs within pathways and biological processes with at least 10 and not more than 3000 genes, as well as for disease pairs within the 139 disease groups. In addition, we […]
We extracted the genes associated with each drug reported in DrugBank with type 'target'.

By considering 5234 drugs and their target genes, we created drug pairs based on the common genes that the two drugs in each pair share. Drug-gene associations were obtained from DrugBank release 4.5.0, and cosine similarities of 50000 drug pairs with at least one shared target gene were compared to 50000 drug pairs without common target genes. Moreover, to examine the variability of the similarity distribution of drug pairs based on the number of genes they share, we sampled three drug pair groups (group1: no shared genes, group2: <=5 genes, group3: <=9 genes) (S6 File).

We tested how the text corpus size influences the variability of word representations and compared the similarities between given concepts obtained from four resulting embeddings.
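The drug-pair construction by shared target genes described above can be sketched as a set-intersection over drug-target sets. The drug names and gene sets below are invented placeholders; in the study, the associations came from DrugBank release 4.5.0 and 50000 pairs were sampled per group.

```python
from itertools import combinations

# Hypothetical drug -> target-gene sets (stand-ins for DrugBank 'target' associations)
drug_targets = {
    "drugA": {"egfr", "erbb2"},
    "drugB": {"egfr"},
    "drugC": {"drd2"},
}

with_shared, without_shared = [], []
for d1, d2 in combinations(sorted(drug_targets), 2):
    shared = drug_targets[d1] & drug_targets[d2]
    # record the pair together with its number of shared target genes
    (with_shared if shared else without_shared).append((d1, d2, len(shared)))
```

The shared-gene count recorded per pair is what allows grouping pairs by the number of genes they share (no genes, <=5 genes, <=9 genes) before comparing their cosine similarity distributions.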

The embeddings were produced by applying word2vec to four text corpora of different sizes.

The four text corpora were of size 4M, 8M, 12M, and >16M abstracts. We selected 10 terms of different entity types: the genes brca1, psen1 and egf; the medical terms breast neoplasms, eczema, sleep and schizophrenia; and the molecular compounds ranitidine, lactose, and cocaine. For each entity term, we calculated its first 10 nearest neighbors and selected the ones commonly present in the four resulting embeddings (S7 File).

To demonstrate the utility of our word2vec embeddings in data analytical applications, we examined the agreement of cosine similarities between words, according to their vector representations, with information extracted from biomedical knowledgebases (see Materials and methods). As a result, pairs of genes with known interactions in the Reactome database showed on average higher cosine similarities than gene pairs without known interaction in the same database (Fig 2). Similarly, cosine similarities of drugs with overlapping target gene sets were, on average, higher than similarities between drugs without common target genes. Furthermore, cosine similarities within Reactome and TRANSPATH® pathways, as well as within GO biological processes, were increased compared to median cosine similarities of randomly sampled gene pairs (Fig 2). Regression curves estimated for the medians revealed a correlation between the number of pathway or GO category members and the median similarity, with higher values for smaller gene sets. We think that gene pairs in smaller pathway networks or biological processes were more likely to correspond to direct molecular interactors that share a close functional context than in pathway or functional categories with a higher number of members. The embedding, in many cases, indeed captured these relations. While disease-disease cosine similarities within Human Disease Ontology (HDO) groups also revealed such a trend for groups with less than 25 members, median similarities within groups were often smaller than for randomly chosen disease pairs (Fig 2). Therefore, disease-disease relations captured by broader HDO groups did not correspond well with the vector representations of the embedding. Better correspondence was observed for narrower disease groups, but their similarities did not exceed those of random disease pairs.
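The comparisons above all reduce to computing cosine similarity between embedding vectors for entity pairs. A minimal NumPy sketch follows; the vectors are random stand-ins for trained 300-dimensional word2vec representations, constructed so that one pair is "related" (correlated vectors) and one is not.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
# Hypothetical 300-d vectors: a gene, a functionally related gene
# (its vector plus noise), and an unrelated term
vec_gene = rng.standard_normal(300)
vec_related = vec_gene + 0.5 * rng.standard_normal(300)
vec_unrelated = rng.standard_normal(300)

sim_related = cosine_similarity(vec_gene, vec_related)
sim_unrelated = cosine_similarity(vec_gene, vec_unrelated)
# related pairs are expected to score higher on average, mirroring the
# Reactome interacting vs. non-interacting gene-pair comparison
```

Aggregating such similarities into medians and quartiles per pathway, GO process, or HDO group yields the distributions summarized in Fig 2.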
Full plots are provided as supplementary figures S1-S6. The networks were compared based on the similarity threshold and the number of vertices included. In this study, we leveraged a state-of-the-art text-mining tool to learn the semantics of biomedical