MEAHNE: MiRNA-disease association prediction based on semantic information in heterogeneous networks

Prior studies have suggested close associations between miRNAs and diseases. Accurate computational prediction of potential miRNA-disease pairs can greatly accelerate experimental work in biomedical research. However, many methods cannot effectively learn the complex information in multi-source data, which limits the performance of the prediction model. We propose MEAHNE, a heterogeneous network prediction model that makes full use of the complex information in multi-source data. We first construct a heterogeneous network from miRNA-disease, miRNA-gene, disease-gene, and gene-gene associations. The rich semantic information in this heterogeneous network carries much of its relational information. To mine this relational information, we use neural networks to extract semantic information from metapath instances. We encode the extracted semantic information into weights using the attention mechanism and use these weights to aggregate nodes in the network. We also aggregate the semantic information of metapath instances into the nodes associated with those instances, which improves the ability of the node embeddings to represent the network. MEAHNE optimizes its parameters through end-to-end training. Compared with other state-of-the-art heterogeneous graph neural network methods, MEAHNE achieves superior areas under the precision-recall curve and the receiver operating characteristic curve. Additionally, MEAHNE predicted 50 miRNAs each for lung cancer and esophageal cancer; 49 of the miRNAs predicted for lung cancer and 44 of those predicted for esophageal cancer were verified by consulting relevant databases. These experiments indicate that MEAHNE has good performance and interpretability.

same time, the graph neural network end-to-end training method can also be used to optimize all the parameters in the model. Graph neural networks therefore have strong learning ability. Li et al. [21] combined a miRNA functional similarity matrix and a disease semantic similarity matrix into a graph and used GCN [17] to learn the structural information of the graph; they then fed this structural information into a multi-layer neural network to obtain low-dimensional representations of miRNAs and diseases. To effectively integrate heterogeneous miRNA and disease information, Li et al. [22] designed a graph encoder, which contains an aggregator function and a multi-layer perceptron that aggregates node neighborhood information to generate low-dimensional embeddings of miRNAs and diseases.
Many methods learn on homogeneous data, and homogeneous graph neural networks cannot adapt well to the complex associations of the heterogeneous networks built from multi-source heterogeneous data. To learn the semantic information generated by these complex associations, heterogeneous graph neural networks mine the heterogeneous network in multiple modes defined by metapaths. Each metapath represents one semantic type. Multiple subgraphs are sampled from the heterogeneous network according to a set of metapaths, and a graph neural network is then used to learn low-dimensional representations of the nodes on the subgraphs. The metapath concept was applied to network embedding by Metapath2vec (Dong et al. [23]).
Metapath2vec samples multiple node sequences from the heterogeneous network according to the metapath setting, and a word representation learning model turns the sequences into low-dimensional vector representations. HAN [24], a representative heterogeneous graph neural network, splits the heterogeneous network into multiple sub-networks by metapath and reduces each subgraph to a graph composed of nodes of a single type. GAT [18] is then used to learn low-dimensional representations on these homogeneous subgraphs, and semantic-level attention integrates the representations under multiple metapaths. Because HAN learns the semantic information in the network, it represents the nodes better than homogeneous graph neural networks.
However, HAN reduces the subgraphs under each metapath to homogeneous graphs and ignores all intermediate nodes, so a large amount of information is lost. This problem is also called the early-summarization problem [25]. MAGNN [26] is a heterogeneous graph neural network model based on HAN. To recover the information of the intermediate nodes on the metapath subgraph, MAGNN applies a rotation operation to the nodes of each metapath instance of the subgraph. The low-dimensional embedding obtained by the rotation is regarded as the semantic information of the instance, and this semantic information is aggregated into the target nodes. However, MAGNN fuses the information of all node types together, which reduces the discrimination between the representations of different node types.
Traditional heterogeneous graph neural networks aggregate nodes in the network indiscriminately, which wastes the semantic information in the network. In fact, semantic information can help the network aggregate nodes more efficiently. To overcome this problem, we propose a semantic-based attention aggregation heterogeneous graph neural network to predict potential miRNA-disease associations. Our main contributions are as follows:
• To fully utilize the semantic information in heterogeneous graph neural networks, we propose a semantic-based attention mechanism that uses the extracted semantic information to efficiently aggregate nodes in heterogeneous networks.
• In addition to aggregating neighbor node information, the semantic information extracted from metapath instances is also aggregated into the nodes associated with those instances. This gives nodes rich semantic information and lets them adequately express the relationships in heterogeneous networks.
• We design a semantic-based heterogeneous graph neural network model. By using the semantic information of multiple metapaths and of metapath instances, the model fully mines the relationships in the miRNA-disease-gene network. Our model can be used for mining large-scale multi-source biological data.

Experiments
In this section, we introduce several representative heterogeneous graph representation models and compare them with our model in detail. We compare our method with other heterogeneous network embedding methods under fair conditions using two metrics: the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AP). We also draw the receiver operating characteristic (ROC) and precision-recall (P-R) curves. We then analyze the advantages of our model for miRNA-disease link prediction in large-scale heterogeneous networks by comparing experimental performance. The models used for comparison are as follows:
Metapath2vec [23]: A structural learning method for heterogeneous networks. The network is sampled according to a given metapath to obtain sentences composed of nodes in the network, and the sentences are used as input to the skip-gram model to obtain the final node embeddings. We experimented with multiple metapaths and obtained the best performance under the metapath miRNA-disease-gene-miRNA.
GAT [18]: A homogeneous graph neural network. This model uses the attention mechanism to assign weights to the neighbors of each node in the spatial domain and aggregates the neighbors according to these weights. GAT uses multi-head attention to learn the network comprehensively and generate the final node representation.
HAN [24]: A heterogeneous graph neural network model that uses multiple metapaths to mine the network, separates the corresponding subgraphs, and reduces the subgraphs to homogeneous graphs. GAT is then used to learn on each reduced graph to obtain the node representation under a single metapath, and the attention mechanism is used to fuse the node representations under multiple metapaths.
MAGNN [26]: A heterogeneous graph neural network model. This model first uses multiple metapaths to sample the network and obtain multiple subgraphs under different metapaths. To preserve the instances of each subgraph, the nodes on each instance, which are of different types, are rotated into the same space, and the result is taken as the semantic information of the instance. The attention mechanism is then used to aggregate the semantic information of the instances into the nodes, which alleviates the early-summarization problem. Finally, semantic-level attention is used to fuse the node representations under multiple metapaths.
HeCo [32]: A self-supervised heterogeneous graph neural network. The node representation of the heterogeneous network is learned from two perspectives, the network architecture perspective and the metapath perspective, which together capture the information in the heterogeneous network. Collaborative contrastive learning is applied to the node embeddings from the two perspectives, so that the two views supervise each other. As training progresses, the two views guide each other and are optimized jointly.
GAEMDA [33]: An autoencoder model used on bipartite graphs. The model first projects the two types of nodes in the bipartite graph into the same space through a node transformation matrix and aggregates the features of nodes of the other type into the original embedding of each node through a graph neural network encoder. Finally, potential links between nodes are predicted using a bilinear decoder.
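The two metrics used in this comparison can be computed directly from labels and predicted scores. The sketch below is a minimal pure-Python illustration, not the evaluation code used in the experiments. `roc_auc` uses the pairwise ranking formulation of AUC, and `average_precision` uses step-wise average precision as an estimate of the area under the precision-recall curve. Both function names and the toy data are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return total / hits


# Toy data (made up): 1 = verified association, 0 = unverified pair.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))            # 0.75
print(average_precision(labels, scores))  # about 0.833
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small illustration; a sort-based implementation would be preferable on large test sets.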
The parameter settings used for the models were as follows. For Metapath2vec, the window size was set to 5 and the walk length to 100; each node performed 10 walks, and the number of negative samples was 5. In the GAT model, the hidden layer dimension was set to 64, the number of attention heads to 2, and the learning rate to 0.0001. HAN, MAGNN, and our method MEAHNE are heterogeneous graph neural networks that split the original heterogeneous network by metapath and learn on the resulting subgraphs, so we used the same parameters for all three models. Since both MAGNN and MEAHNE limit the number of nodes in the node sampling stage, for fairness we set the same limit for the HAN model: a maximum of 100 neighbors for each node. The node dimensions of the three models were all set to 64, with a learning rate of 0.005 and an L2 penalty weight of 0.001. In the GAEMDA model, the number of node aggregation layers was set to 2, the learning rate to 0.001, and the weight decay to 0.001. In the HeCo model, the learning rate was set to 0.001, the number of neighbor samples in the network architecture perspective to 10, and the dimension of the hidden layer to 64.
Table 1
We use semantic information twice so that nodes are fully aggregated. First, semantic information is encoded into weights to aggregate neighbor nodes; second, semantic information is fused into the connected nodes. In this way, our model achieves good performance.
Comparing the experimental results, we find that because Metapath2vec generates node embeddings separately from the downstream prediction task, the prediction task cannot influence the upstream embedding generation. Moreover, the upstream embedding task learns only the structural information of the nodes, which leaves the node representation incomplete; this is why Metapath2vec performed worse than the GNN models. The GAT model treats all nodes as being of the same type, so it cannot learn rich semantic information. It also aggregates all neighbors in the spatial domain, and noise from these neighbors affects the final result. The HAN model aggregates only the same-type nodes connected to the target node through the metapath; in effect, HAN discards the semantic information of the metapath instances and attends only to semantic information at the metapath level. This lack of semantic information leads to the poor performance of HAN.
In the GAEMDA model, nodes aggregate information only from connected nodes of a different type. Since the autoencoder continuously updates the node representations on the graph, a node can learn information from nodes that are multiple hops away, which helps GAEMDA achieve better results than HAN.
The MAGNN model learns semantic information on the metapath instances and aggregates this information, which enables it to better learn the complex information in the network; it therefore yielded good results. The HeCo model learns node representations from two perspectives. The contrastive learning method makes the two perspectives constrain and complement each other during learning, so the resulting node representation is very complete, and HeCo also yields good results.
Analysis of model parameters: We varied two parameters of our model, the dimension of the node vector and the number of semantic information extraction layers, to evaluate their influence on model performance.
For the dimension of the node vector, we found that the model performed best when the node vector was 64-dimensional and worse when the dimension was 128 or 256 (Fig 2). Thirty-two dimensions could not fully express the node information, resulting in loss of information, while 128 and 256 dimensions were too many and introduced a lot of noise; 256 dimensions introduced the most noise and led to the worst performance.
Our model uses nonlinear fully connected layers to extract semantic information from metapath instances, and the number of layers affects the quality of the extracted information. Fig 3 shows that performance was best with one layer; two and three layers performed similarly, and four layers performed worst.
Stacking multiple nonlinear fully connected layers causes over-fitting, so less useful semantic information is learned; the poor performance with four layers supports this view.
To test the accuracy of our model, we performed miRNA predictions for lung cancer and nasopharyngeal carcinoma. The prediction method was as follows: all of the miRNAs were paired with each of these two diseases to obtain miRNA-disease pairs, and the trained model was used to score the pairs. For each disease, we selected the 50 highest-scoring miRNAs for evaluation. Of the top 50 miRNAs predicted for lung cancer, 49 were verified using dbDEMC [34]; only hsa-mir-1-1 was not verified. Among the potential miRNAs associated with throat cancer, 44 of the top 50 miRNAs were verified using the dbDEMC database. The miRNAs that were not verified were hsa-mir-210, hsa-mir-92-1, hsa-mir-1-1, hsa-mir-9-3, hsa-mir-9-2, and hsa-mir-9-1. The prediction results are shown in Tables 2 and 3.
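The case-study procedure (pair every miRNA with one disease, score the pairs, keep the top 50, and check them against a database) can be sketched as follows. This is an illustration only: `rank_candidates`, `count_verified`, the toy miRNA names, and the toy scores are hypothetical, and `score_fn` stands in for the trained model.

```python
def rank_candidates(disease, mirnas, score_fn, k=50):
    """Score every (miRNA, disease) pair and return the k highest-scoring miRNAs."""
    scored = [(m, score_fn(m, disease)) for m in mirnas]
    # Sort by descending score; ties are broken by name so the ranking is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def count_verified(ranked, verified):
    """Count how many ranked miRNAs appear in a set of database-verified miRNAs."""
    return sum(1 for mirna, _ in ranked if mirna in verified)


# Toy usage with made-up scores (not results from the paper).
toy_scores = {"mir-a": 0.91, "mir-b": 0.15, "mir-c": 0.78, "mir-d": 0.40}
top = rank_candidates("toy disease", list(toy_scores),
                      lambda m, d: toy_scores[m], k=3)
print([m for m, _ in top])                      # ['mir-a', 'mir-c', 'mir-d']
print(count_verified(top, {"mir-a", "mir-d"}))  # 2
```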

Conclusion
Because relatively few miRNA-disease relationships have been verified, we introduced genes as a third node type and built a heterogeneous network to alleviate this problem. We also proposed a semantic-based heterogeneous graph neural network model for link prediction, which aggregates nodes using a semantic-based attention aggregation method. The model uses semantic information twice when aggregating nodes in heterogeneous networks, so that nodes can fully express the relationships in the network. Traditional heterogeneous graph neural networks often ignore intermediate nodes [25], which causes information loss. We avoid this problem by extracting the information on metapath instances as semantic information. Compared with the heterogeneous graph neural network methods of the past few years, our method achieves the best performance in both AUC and AP.
However, our model still has room for improvement. First, we randomly select a fixed number of metapath instances for each node; other selection strategies may yield better performance. Second, semantic information plays an important role in our model, which uses neural networks to extract it. Whether other extraction methods can improve the model is worth further experiments.

Materials
This section introduces the data we used, which consist of three types of nodes, namely miRNA, disease, and gene, and four types of associations between them: miRNA-disease associations, miRNA-gene associations, disease-gene associations, and protein-protein interactions.
We collected links between miRNAs and diseases from the HMDD3.2 [28] database. HMDD is a reliable database that specifically collects miRNA-disease associations. We collected 17,972 links between 1206 miRNAs and 893 diseases and integrated the miRNAs and diseases as nodes, and the miRNA-disease associations as instances, into the heterogeneous network. We collected links between miRNAs and target genes from the Circ2disease [39] database. We selected 4676 links between 202 miRNAs and 1713 genes and integrated the miRNAs and target genes as nodes, and the associations between them as instances, into the heterogeneous network. We collected links between diseases and genes from DisGeNET [30]. We selected 84,038 links between 11,181 diseases and 9703 genes and integrated the diseases and genes as nodes, and the associations between them as instances, into the heterogeneous network.
The protein-protein interaction network was obtained from the STRING [31] database, which is a reliable database that specifically collects protein interactions. We selected the genes associated with our chosen miRNAs and diseases and integrated them into our heterogeneous network. The 105,171 associations between these genes were integrated into the heterogeneous network as instances. Finally, we established a heterogeneous network with 1296 miRNAs, 11,783 diseases, 10,116 genes, and 211,857 instances (Tables 4 and 5).
Definition of metapath. A metapath $\mathcal{P}$ is a path of the form $a_1 \xrightarrow{r_1} a_2 \xrightarrow{r_2} \cdots \xrightarrow{r_l} a_{l+1}$, in which $a_i \in \mathcal{A}$ represents the $i$-th type of node in the heterogeneous network and $\mathcal{A}$ represents the collection of all node types in the heterogeneous network. $r_i \in \mathcal{R}$ represents the $i$-th type of relationship between nodes, and $\mathcal{R}$ represents the collection of all relationship types in the heterogeneous network.
Definition of metapath instance [26]. Under each metapath $\mathcal{P}_i$, there are a large number of paths in the heterogeneous network that follow $\mathcal{P}_i$. We call these paths metapath instances. For example, $v^{a_1}_1 \to v^{a_2}_5 \to v^{a_3}_3 \to v^{a_5}_2$ is a metapath instance under $\mathcal{P}_1$, in which $v^{a_i}_j$ represents the $j$-th node of type $a_i$.
Definition of metapath neighbors. The two endpoints of a metapath instance are metapath neighbors of each other, and they are connected by the metapath instance. For example, in the metapath instance $v^{a_1}_1 \to v^{a_2}_5 \to v^{a_3}_3 \to v^{a_5}_2$, the nodes $v^{a_1}_1$ and $v^{a_5}_2$ are metapath neighbors of each other.
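To make the definitions of metapath instance and metapath neighbor concrete, the sketch below enumerates all instances of a metapath in a toy heterogeneous network and derives the metapath neighbors from their endpoints. The toy nodes, the helper names (`build_adjacency`, `metapath_instances`, `metapath_neighbors`), and the choice to allow a path to revisit a node are assumptions made for illustration; they are not taken from the MEAHNE implementation.

```python
def build_adjacency(edge_list):
    """Undirected adjacency sets from a list of (node, node) associations."""
    adj = {}
    for a, b in edge_list:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def metapath_instances(adj, node_type, metapath):
    """Enumerate every path whose node types follow the given metapath."""
    paths = [[v] for v in sorted(node_type) if node_type[v] == metapath[0]]
    for wanted in metapath[1:]:
        extended = []
        for path in paths:
            for nxt in sorted(adj.get(path[-1], ())):
                if node_type[nxt] == wanted:
                    extended.append(path + [nxt])
        paths = extended
    return [tuple(p) for p in paths]


def metapath_neighbors(instances):
    """The two endpoints of each instance are metapath neighbors of each other."""
    neighbors = {}
    for inst in instances:
        neighbors.setdefault(inst[0], set()).add(inst[-1])
        neighbors.setdefault(inst[-1], set()).add(inst[0])
    return neighbors


# Toy network: two miRNAs, one gene, one disease.
node_type = {"m1": "miRNA", "m2": "miRNA", "g1": "gene", "d1": "disease"}
adj = build_adjacency([("m1", "g1"), ("m2", "g1"), ("m1", "d1"), ("g1", "d1")])

instances = metapath_instances(adj, node_type, ("miRNA", "gene", "disease"))
print(instances)  # [('m1', 'g1', 'd1'), ('m2', 'g1', 'd1')]
print(metapath_neighbors(instances)["d1"] == {"m1", "m2"})  # True
```

Full enumeration grows quickly with metapath length and node degree, which is one reason the experiments cap the number of sampled neighbors per node.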
This section introduces the main methods, ideas, and specific implementation details of the MEAHNE model. The MEAHNE model is divided into six parts: node space conversion, subgraph extraction, extraction of semantic information from metapath instances, node aggregation based on the semantic attention of metapath instances, multi-semantic information fusion, and link prediction. Fig 4 shows the overall framework of MEAHNE.

A. Node space conversion
To learn representations of heterogeneous networks, we need to perform interactive calculations on the nodes of the graph. However, heterogeneous networks have multiple types of nodes, and different types of nodes are located in different feature spaces. If the nodes are not processed, interactive calculation between nodes becomes too difficult, so we first converted all types of nodes into the same space as follows.
A trainable linear transformation matrix was set for each type of node, and the original nodes of different types were projected into the same space, as shown in formula (1):
$h_v = M_{a_i} \cdot x^{a_i}_v$ (1)
where $x^{a_i}_v$ represents the original feature vector of a node $v$ of type $a_i$, $h_v$ represents its feature vector after space conversion, and $M_{a_i} \in \mathbb{R}^{d' \times d_{a_i}}$, in which $d'$ represents the feature space dimension after space conversion and $d_{a_i}$ represents the original feature dimension of nodes of type $a_i$.
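The per-type linear projection can be illustrated with a small pure-Python sketch. The matrices below are fixed toy values standing in for the trainable matrices, and the names `matvec` and `project_nodes` are hypothetical; in a real model the matrices would be learned parameters in a deep learning framework.

```python
def matvec(matrix, vector):
    """Multiply a (d' x d) matrix, stored as a list of rows, by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def project_nodes(features, node_type, projections):
    """Project each node's raw features with the matrix of its own node type."""
    projected = {}
    for node, x in features.items():
        matrix = projections[node_type[node]]
        if any(len(row) != len(x) for row in matrix):
            raise ValueError(f"feature size mismatch for node {node!r}")
        projected[node] = matvec(matrix, x)
    return projected


# Toy setup: miRNA features are 3-dimensional, disease features 2-dimensional,
# and both are mapped into a shared 2-dimensional space (d' = 2).
node_type = {"m1": "miRNA", "d1": "disease"}
features = {"m1": [1.0, 2.0, 3.0], "d1": [1.0, 1.0]}
projections = {
    "miRNA": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],  # shape 2 x 3
    "disease": [[2.0, 0.0], [0.0, 2.0]],          # shape 2 x 2
}
print(project_nodes(features, node_type, projections))
# {'m1': [1.0, 5.0], 'd1': [2.0, 2.0]}
```

After projection, vectors of different node types have the same length, so they can be concatenated or combined in the later steps.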

B. Extract the metapath subgraph and the metapath instances
To mine the heterogeneous network under multiple metapaths, the first step is to separate the corresponding sub-networks based on specific metapaths.
We separated the sub-network $\mathcal{G}_i$ according to the metapath $\mathcal{P}_i$; $\mathcal{G}_i$ represents the sub-network mined in the $i$-th mode. In sub-network $\mathcal{G}_i$, the metapath instances corresponding to $\mathcal{P}_i$ were sampled, and each is denoted $P(v, u)$, which connects the target node $v$ and its metapath neighbor $u$.
C. Extract the semantic information contained in the metapath instances
When mining the information from the subgraph $\mathcal{G}_i$ under a single metapath $\mathcal{P}_i$, different types of nodes have already been transformed into the same space through space conversion, which allows different types of nodes to represent each other. A metapath instance is composed of different types of nodes connected to each other and contains rich semantic information. Therefore, to learn the semantic information on the metapath instance when learning the subgraph, we first integrated the information on the metapath instance. Each metapath instance was represented as a vector that represents the semantic information on the instance. All the nodes on the metapath instance were concatenated according to the order of the metapath, as shown in formula (2):
$h_{P(v,u)} = \Vert\big(\mathcal{V}(P(v,u))\big) = \Vert_{\forall t \in \{\mathcal{V}(P(v,u))\}} (h_t)$ (2)
in which $\mathcal{V}(P(v,u))$ represents the set of nodes on the metapath instance $P(v,u)$, and $h_{P(v,u)}$ represents the vector obtained by concatenating the vectors of the nodes on the metapath instance $P(v,u)$.
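As a small illustration of formula (2), the sketch below concatenates the projected vectors of the nodes on one metapath instance in metapath order. The function name `concat_instance` and the toy vectors are assumptions for illustration only.

```python
def concat_instance(instance, h):
    """Concatenate node vectors along a metapath instance, in path order."""
    vector = []
    for node in instance:
        vector.extend(h[node])
    return vector


# Projected node vectors in the shared space (toy values, d' = 2).
h = {"m1": [1.0, 5.0], "g1": [0.0, 1.0], "d1": [2.0, 2.0]}
print(concat_instance(("m1", "g1", "d1"), h))
# [1.0, 5.0, 0.0, 1.0, 2.0, 2.0]
```

The result has length $d'$ times the number of nodes on the instance, so every instance of the same metapath yields a vector of the same size.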

D. Semantic-based attention aggregation method
After obtaining the semantic information from the metapath instances, we can aggregate the semantic information into the target nodes connected to these metapath instances, but the semantic information is obtained by the fusion of different types of nodes. If the target node only aggregates semantic information, each type of node contains information about other types of nodes, causing different types of nodes to lose their distinction. To maintain the discrimination between nodes of different types, we first aggregate only neighbor nodes of the same type. For aggregating nodes of the same type, we designed a method to encode semantic information into attention weights and used the obtained attention coefficient to aggregate metapath neighbors. Then, we fused the information obtained by the aggregation of nodes of the same type and semantic information from metapath instances as the final node representation.
We encoded the semantic information on the metapath instance using the attention mechanism as a weight value, namely the correlation strength coefficient between the target node and the metapath neighbor, as shown in Fig 5 and Equations (5) and (6).

Fig 5. Encoding semantic information on the metapath instances into attention weights.
Here, $e_{vu}$ represents the value encoded by the attention mechanism, $\mathrm{LeakyReLU}(\cdot)$ is a nonlinear activation function, $a_{\mathcal{P}}$ represents the attention weight matrix under metapath $\mathcal{P}$, $N_v$ represents the set of metapath neighbors connected to the target node $v$ on the subgraph under metapath $\mathcal{P}$, and $\alpha_{vu}$ represents the weight value obtained by normalizing $e_{vu}$.
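The encoding of instance semantics into attention weights can be sketched as follows. Since Equations (5) and (6) are not reproduced here, this is a sketch under stated assumptions rather than the exact formulation: the attention parameter is simplified to a single vector `a`, normalization is assumed to be a softmax over the metapath neighbors, the LeakyReLU slope of 0.01 is an arbitrary choice, and each neighbor is assumed to be connected by one instance.

```python
import math


def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU; the slope value is an assumption, the text does not state it."""
    return x if x > 0 else negative_slope * x


def attention_weights(semantic, a):
    """Turn instance semantics into normalized weights over metapath neighbors.

    semantic: dict mapping neighbor u -> semantic vector of the instance P(v, u)
    a:        attention parameter, simplified here to a single vector
    """
    # e_vu: unnormalized score of each neighbor, computed from its instance semantics.
    e = {u: leaky_relu(sum(ai * si for ai, si in zip(a, s)))
         for u, s in semantic.items()}
    # Softmax over the neighbor set N_v (max subtracted for numerical stability).
    peak = max(e.values())
    exp_e = {u: math.exp(x - peak) for u, x in e.items()}
    total = sum(exp_e.values())
    return {u: x / total for u, x in exp_e.items()}


semantic = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
alpha = attention_weights(semantic, a=[2.0, -1.0])
print(alpha)  # the weights sum to 1, and u1 receives the larger weight
```

Because the weights depend on the whole instance rather than only on the two endpoints, two neighbors of the same type can receive different weights when the intermediate nodes on their instances differ.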
Next, the metapath neighbors were aggregated according to the weight and the semantic information was also integrated to ensure the integrity of the node embedding.
To reasonably integrate semantic information during the node aggregation stage, we performed secondary learning on semantic information. We designed a trainable matrix to optimize semantic information and added nonlinear activation operations to the optimization results as shown in formula (7).
Here, $W_{\mathcal{P}}$ represents a learnable weight matrix under metapath $\mathcal{P}$, and the content of the semantic information is continuously adjusted through end-to-end learning.
Next, the node information was aggregated. We used the learned metapath semantic weights to aggregate the metapath neighbors and added the semantic information learned twice, as shown in formula (8). The representations of the target node under all metapaths were then concatenated, as shown in formula (9), where $h_v^{\mathcal{P}_i}$ represents the embedding obtained by aggregating the target node $v$ under metapath $\mathcal{P}_i$ and $h_v$ represents the result of concatenating the representations of the target node $v$ under all metapaths.
Then the embedding $h_v$ was input into a nonlinear neural network to learn a low-dimensional embedding that fuses the target node representations under multiple metapaths, as shown in formula (10). After this step, $\mathcal{H}_v$ represents the low-dimensional embedding that fuses the representations under multiple metapaths and serves as the final representation of the target node.
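The aggregation and fusion steps can be sketched as below. The exact forms of the corresponding formulas are not reproduced here, so several choices are assumptions made only for illustration: the refined semantic vector of each instance has the same dimension as the node vectors and is added to the neighbor vector before weighting, the fusion network is a single fully connected layer, and its activation is ReLU. The function names are hypothetical.

```python
def relu(x):
    return x if x > 0 else 0.0


def aggregate_under_metapath(alpha, h, semantic):
    """Weighted sum over metapath neighbors of (neighbor vector + instance semantics)."""
    dim = len(next(iter(h.values())))
    out = [0.0] * dim
    for u, weight in alpha.items():
        for k in range(dim):
            out[k] += weight * (h[u][k] + semantic[u][k])
    return out


def fuse_metapaths(per_metapath, W, b):
    """Concatenate the per-metapath embeddings and apply one nonlinear layer."""
    concat = [x for emb in per_metapath for x in emb]
    return [relu(sum(w * x for w, x in zip(row, concat)) + bias)
            for row, bias in zip(W, b)]


# Toy values: two neighbors with equal attention weights under one metapath.
alpha = {"u1": 0.5, "u2": 0.5}
h = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
semantic = {"u1": [1.0, 1.0], "u2": [1.0, 1.0]}
h_p1 = aggregate_under_metapath(alpha, h, semantic)
print(h_p1)  # [1.5, 1.5]

# Fuse with a second (made-up) per-metapath embedding into a 2-dimensional output.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
print(fuse_metapaths([h_p1, [0.5, -0.5]], W, b=[0.0, 0.0]))  # [1.5, 0.0]
```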

F. Link prediction and optimization goals
The vector inner product is used as the score of the link strength between two nodes: the more strongly two vectors are correlated, the higher their inner product. We used this as the basis for link prediction, as shown in formula (11):
$\mathrm{score}_{md} = \sigma(\langle \mathcal{H}_m, \mathcal{H}_d \rangle)$ (11)
Our link prediction was between miRNAs and diseases. The higher the prediction score, the stronger the predicted association. This is a binary classification problem, so we used binary cross-entropy as the optimization target. Our optimization goal is shown in formula (12), where $\Phi$ represents the set of miRNA-disease pairs that have been verified to be associated and $\Phi^{-}$ represents the set of all miRNA-disease pairs that have not been experimentally verified. The goal of the optimization is to raise the scores of verified node pairs and lower the scores of unverified node pairs. Because our model is trained end to end, all of its parameters are optimized jointly toward this goal during training.
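The link score of formula (11) and a two-class cross-entropy objective over verified and unverified pairs can be sketched as follows. The sketch assumes that σ is the logistic sigmoid and that the loss is an unnormalized sum over the two sets; the text does not state whether the loss is averaged. The names `link_score` and `bce_loss` and the toy embeddings are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def link_score(h_m, h_d):
    """Sigmoid of the inner product of a miRNA embedding and a disease embedding."""
    return sigmoid(sum(a * b for a, b in zip(h_m, h_d)))


def bce_loss(pos_pairs, neg_pairs, H, eps=1e-12):
    """Binary cross-entropy: verified pairs toward 1, unverified pairs toward 0."""
    loss = 0.0
    for m, d in pos_pairs:
        loss -= math.log(link_score(H[m], H[d]) + eps)
    for m, d in neg_pairs:
        loss -= math.log(1.0 - link_score(H[m], H[d]) + eps)
    return loss


# Toy embeddings: m1 is aligned with d1 and opposed to d2.
H = {"m1": [1.0, 0.0], "d1": [1.0, 0.0], "d2": [-1.0, 0.0]}
good = bce_loss([("m1", "d1")], [("m1", "d2")], H)
bad = bce_loss([("m1", "d2")], [("m1", "d1")], H)
print(good < bad)  # True: the loss is lower when verified pairs score higher
```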