## Abstract

MicroRNAs (miRNAs) play a crucial role in many important biological processes. Currently, the validated associations between miRNAs and diseases are few compared with all underlying associations. Identifying these hidden associations through biological experiments is expensive, laborious and time consuming. Therefore, computationally inferring potential associations from biological data for further biological experiments has attracted increasing interest from communities ranging from biological to computational science. In this work, we propose an effective and flexible method for predicting the associations between miRNAs and diseases, namely linear optimization for miRNA-disease association (LOMDA). The proposed method can predict associations in three settings: when extra information such as miRNA functional similarity, gene functional similarity and known miRNA-disease associations is available; when only some associations are known; and when new miRNAs or diseases have no known associations at all. The average AUC obtained by LOMDA over 15 diseases in 5-fold cross validation is 0.997, while the AUC of 5-fold cross validation on all diseases is 0.957. Moreover, the average AUC in leave-one-out cross validation is 0.866. We compare LOMDA with state-of-the-art methods, and the results show that LOMDA outperforms the others in both cases, i.e., when extra information is combined and when only known associations are used.

## 1. Introduction

MicroRNAs (miRNAs) are short non-coding RNAs of about 22 nucleotides that regulate the expression of target genes at the post-transcriptional level [1, 2, 3, 4]. Over the last few decades, accumulating evidence has shown that miRNAs are strongly related to many critical life processes, including early cell growth, proliferation, apoptosis, differentiation and metabolism [5, 6, 7, 8, 9]. Moreover, miRNA dysregulation has been shown to be closely related to many complex human diseases [10, 11, 12, 13, 14, 15], including lung cancer [16], breast cancer [17, 18] and cardiovascular diseases [19]. Therefore, studying the associations between miRNAs and diseases from biological data has become a significant problem in biomedical research, which not only helps in investigating pathogenesis [20] but also assists diagnosis, treatment and prevention. Many cancers are much easier to treat at an early stage; therefore, identifying novel associations between miRNAs and cancers can play an important role in early investigation and treatment [21, 22, 23, 24]. Verifying new associations one by one through biological experiments would require a huge amount of time and labor. Hence, scientists need an effective and flexible tool that selects, from a large pool of candidate associations, the small portion that is most likely to be true for further experiments.

Because of the importance of miRNA-disease associations, and to help researchers study them, several comprehensive databases of miRNA-disease associations have been constructed: the Human microRNA Disease Database (HMDD) [25], which collects experimentally supported human miRNA-disease associations; dbDEMC [26], which contains differentially expressed miRNAs in human cancers detected by high-throughput methods; and miR2Disease [27], a comprehensive resource of miRNA deregulation in various human diseases. These databases have helped researchers understand disease pathogenesis and are the main resources for association identification research. Although these databases are rich, the known associations are still limited compared with all potential miRNA-disease associations. Moreover, it is believed that one miRNA can be associated with multiple diseases and vice versa. Laboratory experiments that search for such underlying associations are very costly and time consuming. Given the collected biological data, machine learning and network-based methods are appropriate and effective tools for predicting the most likely potential associations for further laboratory experiments.

In the last few years, a surge of research [28] has emerged on predicting the associations between miRNAs and diseases using machine learning methods [29, 30, 31, 32, 33] and network-based methods [34, 35, 36, 37, 38, 39, 40]. In particular, Jiang *et al.* [41] proposed a model that integrates the miRNA functional similarity network, disease phenotype similarity and the known disease-miRNA association network, and uses a discrete probability distribution, the hypergeometric distribution, to predict potential associations. Xuan *et al.* [42] combined miRNA-disease associations with miRNA functional similarity, disease semantic similarity and disease phenotype similarity in a model called HDMP, which obtains the association score between a miRNA and a disease by summing the sub-scores of the miRNA's *k* neighbors. The sub-score of a neighbor is computed by multiplying the weight of the neighbor by the functional similarity between the neighbor and the miRNA. Chen *et al.* [35] developed a method named RWRMDA to predict novel human miRNA-disease associations by using random walk with restart based on global network similarity. Their method uses two kinds of data: the miRNA-disease network and the miRNA functional similarity matrix. RWRMDA cannot predict associations for miRNAs and diseases that do not have any known associations. Chen *et al.* [36] proposed three inference methods, namely miRNA-based similarity inference, phenotype-based similarity inference and network-consistency-based inference. The three methods also use global network similarity to predict new miRNA-disease associations. Chen *et al.* [43] developed a semi-supervised learning method, regularized least squares for miRNA-disease association (RLSMDA), which integrates miRNA-disease associations, disease-disease similarity and miRNA-miRNA associations.
Chen *et al.* [33] proposed a method based on the restricted Boltzmann machine, named RBMMMDA, in which a two-layered undirected miRNA-disease graph is built. RBMMMDA is unable to handle new diseases or miRNAs that do not have known associations. Pasquier *et al.* [29] proposed a vector space model that predicts miRNA-disease associations by using miRNA-target associations, miRNA-word associations and miRNA-family associations to form a large matrix. The large matrix is then decomposed by singular value decomposition (SVD) for dimensionality reduction. Finally, the decomposed matrices are used to reconstruct a new matrix embedded in a much lower dimensional space. Gu *et al.* [44] developed a method named network consistency projection for miRNA-disease association, which integrates the miRNA family similarity network, the disease semantic similarity network, the known miRNA-disease associations and the miRNA family information to predict potential associations. In [45], a model named extreme gradient boosting machine (EGBMMDA) was proposed, which takes as input feature vectors extracted from the miRNA functional similarity, disease semantic similarity and known miRNA-disease associations. Xuan *et al.* [46] proposed a method named MIDP based on random walk on the miRNA and disease networks. Chen *et al.* [47] developed a method based on a recommendation system over a network that integrates known miRNA-disease associations, disease semantic similarity, miRNA functional similarity and Gaussian interaction profile kernel similarity. Ding *et al.* [48] proposed a method called DMHM that uses graph-based regularization on manifold heterogeneous networks integrating target information.
You *et al.* presented a method called path-based miRNA-disease association (PBMDA) prediction, which integrates known human miRNA-disease associations, miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity for miRNAs and diseases. Chen *et al.* [49] proposed a method called MDHGI, which integrates the predicted association probability obtained from matrix decomposition through sparse learning with miRNA functional similarity, disease similarity and Gaussian interaction profile kernel similarity. Zeng *et al.* [50] proposed a method that applies the structural perturbation method (SPM) to the integration of the miRNA-disease association network, the miRNA similarity network and the disease similarity network. Zeng *et al.* [51] proposed a neural network model for miRNA-disease association prediction (NNMDA), which also operates on the heterogeneous network, integrates neighborhood information in the neural network and accounts for the imbalance of the datasets. This model predicts miRNA-disease associations by integrating multiple biological data resources.

Some algorithms use only the known miRNA-disease associations. For instance, Sun *et al.* [37] proposed a method named network topological similarity (NTSMDA), which uses only the known miRNA-disease associations as a bipartite network. First, they constructed the adjacency matrix representing the associations between miRNAs and diseases. Two matrices, the miRNA network topological similarity matrix and the disease network topological similarity matrix, were then constructed by applying the Gaussian interaction profile kernel to this adjacency matrix. The two topological similarity matrices were integrated with the original adjacency matrix as linear combinations. Finally, resource allocation was applied to the two matrices to improve the network inference by incorporating them together. Li *et al.* [30] developed a matrix completion technique, MCMDA, to predict the associations between miRNAs and diseases on the miRNA-disease adjacency matrix. MCMDA cannot make predictions for new diseases or miRNAs that do not have any known associations. Methods that exploit different sources of information normally perform better than methods that use only known associations. However, extra information about the characteristics of diseases and miRNAs is not always available. Therefore, an effective method that uses only the known, experimentally validated associations is still a necessary bioinformatics tool. Ideally, an effective computational method should not only produce accurate predictions for miRNAs and diseases that have only known associations, but also be able to predict potential associations for diseases or miRNAs that do not have any known associations at all. Some miRNA-disease association prediction methods and their characteristics are summarized in Table 1.

Although many methods have been proposed to predict the associations between miRNAs and diseases, their performance is still not satisfactory, and there is room for improvement. In this study, we propose an effective and flexible method named linear optimization for miRNA-disease association (LOMDA). On the one hand, using only known associations, LOMDA performs better than the other methods compared in Section 3.2. On the other hand, extra information such as miRNA functional similarity and gene functional similarity further boosts the performance of LOMDA. The proposed model can also make predictions for miRNAs or diseases that do not have any known associations but have some characteristic information. Moreover, the case studies demonstrate the effectiveness of LOMDA in predicting novel associations between miRNAs and diseases. In short, LOMDA is a promising bioinformatics tool for biomedical research. The Matlab source code of the proposed method and the data used in this work can be obtained at https://github.com/rathapech/LOMDA.

## 2. Method

It is believed that similar diseases are associated with similar miRNAs and vice versa [46, 50]. Embedding such information, e.g., disease similarity, gene functional similarity and target information, into the model boosts the prediction performance. In this work, we propose a method based on linear optimization, LOMDA, to solve the miRNA-disease association prediction problem. Whenever extra information about the characteristics of diseases and gene functions is available, we can embed it into the model. In particular, we integrate miRNA functional similarity, gene functional similarity and miRNA-disease associations to obtain a heterogeneous matrix as

$$\mathbf{A} = \begin{bmatrix} MS & MD \\ MD^{T} & DS \end{bmatrix}, \qquad (1)$$

where *MS* ∈ ℝ^{m×m} is the miRNA functional similarity matrix, in which *m* is the number of miRNAs, *DS* ∈ ℝ^{d×d} is the disease semantic similarity matrix, in which *d* is the number of diseases, and *MD* ∈ ℝ^{m×d} is the known miRNA-disease association matrix. After integrating the three types of information into a heterogeneous matrix, we obtain a square matrix **A** ∈ ℝ^{(m+d)×(m+d)}. On the other hand, when only the associations between miRNAs and diseases are available, we can set **A** = **MD**.
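The construction of the integration matrix can be sketched in NumPy as follows. This is a minimal illustration, not the authors' Matlab code: the block layout (*MS* top-left, *MD* top-right, its transpose bottom-left, *DS* bottom-right) is an assumption that is consistent with the stated dimensions, and the function name and toy values are hypothetical.

```python
import numpy as np

def build_heterogeneous_matrix(MS, DS, MD):
    """Stack MS (m x m), MD (m x d) and DS (d x d) into an (m+d) x (m+d) matrix."""
    m, d = MD.shape
    if MS.shape != (m, m) or DS.shape != (d, d):
        raise ValueError("MS must be m x m and DS must be d x d for an m x d MD")
    # Assumed layout: similarities on the diagonal blocks, associations off-diagonal.
    return np.block([[MS, MD], [MD.T, DS]])

# Toy data (hypothetical values): 3 miRNAs and 2 diseases.
MS = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.3],
               [0.2, 0.3, 1.0]])
DS = np.array([[1.0, 0.8],
               [0.8, 1.0]])
MD = np.array([[1, 0],
               [1, 0],
               [0, 1]], dtype=float)

A = build_heterogeneous_matrix(MS, DS, MD)
print(A.shape)
```

When only known associations are available, the same downstream steps would simply take `A = MD`.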

Disease similarity is obtained from the HumanNet database [52], which contains the log likelihood score (*LLS*) of each interaction between genes. Zeng *et al.* [50] computed the gene functional similarity between two diseases *d*_{i} and *d*_{j} from these scores, where *S*(*d*_{i}) and *S*(*d*_{j}) are the gene sets related to diseases *d*_{i} and *d*_{j}, respectively, |*S*| is the cardinality of set *S*, and *LLS*(*x*, *S*(*d*_{i})) is the *LLS* between gene *x* and gene set *S*(*d*_{i}), where *x* ∈ *S*(*d*_{j}).

The miRNA functional similarity [50] is obtained from four sources of information: verified miRNA-target associations (*RST*), miRNA family information (*RSF*), cluster information (*RSC*) and verified miRNA-disease associations (*RSD*):

$$MS = \eta\, RST + \beta\, RSF + \gamma\, RSC + \theta\, RSD, \qquad (3)$$

where *η*, *β*, *γ* and *θ* are the parameters that adjust the four weights and were set to *η* = 0.2, *β* = 0.1, *γ* = 0.2 and *θ* = 0.5.
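The combination of the four miRNA similarity sources can be sketched as below. The sketch assumes that the four sources are combined as a plain weighted sum, which is what the description of the four weights suggests; the weights are the values quoted in the text, while the function name and the toy matrices are hypothetical.

```python
import numpy as np

def combine_mirna_similarity(RST, RSF, RSC, RSD, weights=(0.2, 0.1, 0.2, 0.5)):
    """Weighted sum of four miRNA similarity matrices (assumed linear combination)."""
    mats = [np.asarray(M, dtype=float) for M in (RST, RSF, RSC, RSD)]
    if len({M.shape for M in mats}) != 1:
        raise ValueError("all four similarity matrices must have the same shape")
    return sum(w * M for w, M in zip(weights, mats))

# Toy example with hypothetical 2 x 2 similarity matrices.
RST = np.array([[1.0, 0.4], [0.4, 1.0]])  # target-based similarity
RSF = np.array([[1.0, 1.0], [1.0, 1.0]])  # same family
RSC = np.array([[1.0, 0.0], [0.0, 1.0]])  # different clusters
RSD = np.array([[1.0, 0.6], [0.6, 1.0]])  # disease-based similarity

MS = combine_mirna_similarity(RST, RSF, RSC, RSD)
print(MS)
```

Because the four weights sum to one, the combined similarity stays in the same range as its inputs.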

### 2.1. LOMDA

Denoting the integration matrix by **A** as shown in Eq. (1), we assume that the likelihood of the associations between miRNAs and diseases can be written as a linear combination of **A** and a weighting matrix **Z** as

$$\mathbf{S} = \mathbf{A}\mathbf{Z}. \qquad (4)$$

Since **S** and **Z** are unknown, Eq. (4) has infinitely many solutions. However, for the likelihood matrix **S** to contain both the existing and the predicted associations, **S** should intuitively be reasonably close to **A**. Then we can write
where *ϵ* is the threshold parameter. Moreover, to avoid overfitting and simultaneously constrain the magnitude of **Z**, we can relax Eq. (5) as
where *α* is a positive free parameter and ‖·‖ is a matrix norm. Without loss of generality [53], we use the Frobenius norm and square the two terms, which gives
where the Frobenius norm is defined as $\|\mathbf{Z}\|_F = \sqrt{\sum_{i=1}^{p}\sum_{j=1}^{q} z_{ij}^{2}} = \sqrt{\sum_{i} \sigma_i^{2}}$, in which *σ*_{i} is the *i*-th singular value, *p* is the number of rows and *q* is the number of columns of **Z**. The expansion of Eq. (7) reads
with its partial derivative being

Setting ∂*E*/∂**Z** = 0, we can obtain the optimal solution of **Z** as
where **I** is the identity matrix. The likelihood matrix **S** can be obtained as

Finally, **S** is used as the score of each miRNA-disease pair, after the scores of the known associations are disregarded.
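The procedure of this section can be sketched in NumPy under one explicit assumption about the objective. The sketch assumes the relaxed problem is $\min_{\mathbf{Z}} \alpha\|\mathbf{A}-\mathbf{A}\mathbf{Z}\|_F^2 + \|\mathbf{Z}\|_F^2$, i.e., that *α* weights the residual term (as suggested by the description of *α* in Section 3.4). Setting the gradient $2\alpha\mathbf{A}^{T}(\mathbf{A}\mathbf{Z}-\mathbf{A}) + 2\mathbf{Z}$ to zero then gives $\mathbf{Z}^{*} = \alpha(\alpha\mathbf{A}^{T}\mathbf{A} + \mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{A}$ and $\mathbf{S} = \mathbf{A}\mathbf{Z}^{*}$. If *α* instead weights the norm of **Z**, the same code applies with *α* replaced by 1/*α*. The function names and toy data are hypothetical, and this is not the authors' Matlab implementation.

```python
import numpy as np

def lomda_weights(A, alpha=0.01):
    """Closed-form Z* = alpha (alpha A^T A + I)^{-1} A^T A for the assumed objective."""
    A = np.asarray(A, dtype=float)
    gram = A.T @ A
    identity = np.eye(gram.shape[0])
    # Solve the linear system rather than forming an explicit inverse.
    return np.linalg.solve(alpha * gram + identity, alpha * gram)

def lomda_scores(A, alpha=0.01):
    """Likelihood matrix S = A Z*."""
    A = np.asarray(A, dtype=float)
    return A @ lomda_weights(A, alpha)

# Toy association matrix (hypothetical): 4 miRNAs x 3 diseases, with A = MD.
MD = np.array([[1, 1, 0],
               [1, 0, 0],
               [0, 1, 1],
               [1, 1, 0]], dtype=float)

S = lomda_scores(MD, alpha=0.01)
# Known associations are disregarded before the candidates are ranked.
candidates = np.where(MD > 0, -np.inf, S)
print(candidates)
```

Solving the linear system costs one factorization of a square matrix whose size is the number of columns of **A**, which is why the same routine works for both **A** = **MD** and the heterogeneous matrix.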

## 3. Results

### 3.1. Performance evaluation

To evaluate the proposed method against others, we adopt two widely used cross-validation techniques: leave-one-out cross validation (LOOCV) and 5-fold cross validation. In LOOCV, we randomly remove one association of each disease as the testing sample and use the remaining associations as the training samples. For 5-fold cross validation, we randomly divide all the known miRNA-disease associations into five subsets. We use four of the five subsets as training samples and the remaining subset as testing samples. We repeat this five times, so that each of the five subsets is used as the testing set exactly once. The database contains 336 diseases, 577 miRNAs and 6441 associations.
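The 5-fold protocol described above can be sketched as follows. This is a minimal illustration: the function name, the fixed random seed and the toy association matrix are assumptions made for the example.

```python
import numpy as np

def five_fold_splits(MD, n_folds=5, seed=0):
    """Yield (training matrix, held-out (miRNA, disease) pairs) for each fold."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(MD)
    order = rng.permutation(len(rows))
    for fold in np.array_split(order, n_folds):
        train = MD.copy()
        train[rows[fold], cols[fold]] = 0  # hide the test associations
        test_pairs = list(zip(rows[fold].tolist(), cols[fold].tolist()))
        yield train, test_pairs

# Toy association matrix (hypothetical): 6 miRNAs x 4 diseases, 8 known pairs.
MD = (np.arange(24).reshape(6, 4) % 3 == 0).astype(float)

folds = list(five_fold_splits(MD))
for train, test_pairs in folds:
    print(int(train.sum()), test_pairs)
```

Each known association is hidden in exactly one fold, so every association is scored once as a test sample by a model that did not see it.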

For both LOOCV and 5-fold cross validation, we use the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate the prediction accuracy. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at different thresholds. Sensitivity is the percentage of test samples that rank higher than a given threshold, whereas specificity is the percentage of negative samples that rank below the threshold. AUC = 1 indicates perfect prediction, while AUC = 0.5 indicates random performance.
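The AUC can be computed without tracing the ROC curve explicitly, because it equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counted as one half). The sketch below uses this pairwise form; the function name is hypothetical and the scores are toy values.

```python
import numpy as np

def auc_score(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Compare every positive with every negative; ties count one half.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

# Held-out associations (positives) versus unknown pairs (negatives), toy scores.
print(auc_score([0.9, 0.3], [0.5, 0.1]))
```

In the toy call, three of the four positive-negative pairs are ordered correctly, which corresponds to an AUC of 0.75.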

### 3.2. Comparison with other methods

Many of the methods developed to predict potential associations between miRNAs and diseases use different sources of information, including gene expression, miRNA functional similarity, disease similarity and miRNA family information. Moreover, it is believed that methods that combine different sources of information perform better. In this study, we test the proposed method in two settings: with the miRNA and disease information embedded in the model (called LOMDA-**A**) and with only known associations (called LOMDA-**MD**). We compare LOMDA with RWRMDA, HDMP, RLSMDA, MIDP, SPM and NNMDA [51].

Table 2 shows the prediction performance of the proposed method and the others on different diseases. The highest value obtained by any method is shown in boldface. The average AUC values over the 15 diseases, shown in the bottom row of Table 2, of RWRMDA, HDMP, RLSMDA, MIDP, SPM, NNMDA, LOMDA-**MD** and LOMDA-**A** are 0.799, 0.816, 0.826, 0.862, 0.914, 0.937, 0.986 and 0.997, respectively. In other words, LOMDA-**A** and LOMDA-**MD** perform better than the others. Specifically, LOMDA-**A** outperforms RWRMDA, HDMP, RLSMDA, MIDP, SPM and NNMDA on all 15 diseases, with relative gains in average AUC of 24.7%, 22.1%, 20.7%, 15.6%, 9.1% and 6.4%, respectively. The highest AUC values obtained by LOMDA-**A**, on hepatocellular carcinoma, melanoma and stomach neoplasms, reach 0.999. Moreover, the LOOCV AUC values obtained by LOMDA-**A**, NNMDA, SPM, HDMP and RLSMDA are 0.866, 0.843, 0.811, 0.770 and 0.695, respectively.
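The percentages quoted above appear to be relative gains in average AUC, i.e., (AUC of LOMDA-**A** − AUC of the baseline) / AUC of the baseline. The short check below recomputes them from the rounded averages given in the text; this reading is an assumption, and the small differences from the quoted values presumably come from rounding in the table.

```python
# Average AUC values over the 15 diseases, as quoted in the text.
baselines = {
    "RWRMDA": 0.799,
    "HDMP": 0.816,
    "RLSMDA": 0.826,
    "MIDP": 0.862,
    "SPM": 0.914,
    "NNMDA": 0.937,
}
lomda_a = 0.997

# Relative gain of LOMDA-A over each baseline, in percent.
gains = {name: 100.0 * (lomda_a - auc) / auc for name, auc in baselines.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}%")
```

Each recomputed gain agrees with the quoted percentage to within 0.1 percentage point.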

### 3.3. Case studies

To verify the effectiveness of LOMDA, we carried out case studies on three critical diseases, breast neoplasms, colon neoplasms and kidney neoplasms, to predict their potential associations. Breast neoplasm is the most common malignant tumor in women, accounting for 25% [54], followed by prostate and colon cancer [55, 56]. Moreover, colon neoplasm is one of the most common cancers and has a high death rate [57].

In these case studies, all the known associations are used as training samples and the unknown ones are the testing samples. First, we compute the scores of all the unknown associations and sort them in descending order for each disease. We then select the 30 highest-scoring associations of the disease of interest and manually verify whether they exist in two other miRNA-disease databases, dbDEMC [26] and miR2Disease [27]. The prediction results for the three diseases are shown in Tables 3, 4 and 5, respectively.
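The candidate selection step described above can be sketched as follows, given a score matrix and the known associations. The function name, the column index used for the disease and the toy values are hypothetical.

```python
import numpy as np

def top_k_candidates(S, MD, disease, k=30):
    """Indices of the k highest-scoring miRNAs with no known link to `disease`."""
    scores = np.asarray(S, dtype=float)[:, disease].copy()
    known = np.asarray(MD)[:, disease] > 0
    scores[known] = -np.inf  # known associations are not candidates
    order = np.argsort(-scores, kind="stable")
    return [int(i) for i in order if not known[i]][:k]

# Toy scores (hypothetical): 4 miRNAs x 2 diseases.
S = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3],
              [0.4, 0.6]])
MD = np.array([[1, 0],
               [0, 1],
               [0, 0],
               [0, 0]])

print(top_k_candidates(S, MD, disease=0, k=2))
```

The returned miRNA indices are the ones that would then be checked manually against independent databases.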

In addition, to verify the predictions of LOMDA for diseases without any known associations, we remove all the associations of hepatocellular carcinoma and lung neoplasms with all the miRNAs. We then compute the likelihood scores between these diseases and all the miRNAs by using LOMDA-**A**. Finally, we select the top 30 candidates and manually check them in three databases: HMDD, dbDEMC and miR2Disease. All the predicted candidates belonging to kidney neoplasms can be confirmed in at least one of the three databases, while 29 of the 30 predicted associations of lung neoplasms are also confirmed. These results are shown in Tables 6 and 7, respectively.
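The sketch below illustrates why the similarity information matters for a disease without known associations. It assumes the closed form $\mathbf{Z}^{*} = \alpha(\alpha\mathbf{A}^{T}\mathbf{A} + \mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{A}$ (i.e., that *α* weights the residual term) and a block layout of **A** with the similarity matrices on the diagonal; both are assumptions, and all names and values are hypothetical. With **A** = **MD** an empty disease column yields all-zero scores, whereas the heterogeneous matrix propagates information through the disease similarity.

```python
import numpy as np

def lomda_scores(A, alpha=0.01):
    """S = A Z* with Z* = alpha (alpha A^T A + I)^{-1} A^T A (assumed objective)."""
    gram = A.T @ A
    Z = np.linalg.solve(alpha * gram + np.eye(gram.shape[0]), alpha * gram)
    return A @ Z

# Toy data (hypothetical): 3 miRNAs, 2 diseases.
MS = np.array([[1.0, 0.5, 0.2],
               [0.5, 1.0, 0.3],
               [0.2, 0.3, 1.0]])
DS = np.array([[1.0, 0.8],
               [0.8, 1.0]])
MD = np.array([[1, 0],
               [1, 0],
               [0, 1]], dtype=float)

new_disease = 1
MD_train = MD.copy()
MD_train[:, new_disease] = 0  # remove every association of this disease

# Only known associations: the empty column cannot receive any score.
S_md = lomda_scores(MD_train)

# Heterogeneous matrix: scores flow through the disease similarity DS.
m, d = MD.shape
A = np.block([[MS, MD_train], [MD_train.T, DS]])
S_a = lomda_scores(A)[:m, m:]  # miRNA-disease block of the score matrix

print(S_md[:, new_disease])
print(S_a[:, new_disease])
```

In this toy setting the new disease is similar to a disease with known miRNAs, so those miRNAs receive non-zero scores only in the heterogeneous setting.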

### 3.4. Parameter α

The proposed method contains a parameter *α*, which controls the residual and helps avoid overfitting. The optimal value of *α* for the integration matrix **A** is 0.01, while the optimal *α* when only known associations are used, i.e., **A** = **MD**, is 0.001. As shown in Figure 1, the performance is not sensitive to *α*, especially when extra information is embedded into the model. In other words, the more information is provided, the more stable the model is. We obtain the optimal value of this parameter by cross validation, tuning *α* from 0.001 to 0.05 in steps of 0.001. The optimal *α* is the one that produces the highest accuracy.
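The tuning procedure described above can be sketched as a grid search over *α* with 5-fold cross validation. The scoring function assumes the closed form $\mathbf{Z}^{*} = \alpha(\alpha\mathbf{A}^{T}\mathbf{A} + \mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{A}$; this form, the function names and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def lomda_scores(A, alpha):
    """S = A Z* with Z* = alpha (alpha A^T A + I)^{-1} A^T A (assumed objective)."""
    gram = A.T @ A
    Z = np.linalg.solve(alpha * gram + np.eye(gram.shape[0]), alpha * gram)
    return A @ Z

def auc_score(pos, neg):
    """Probability that a positive outranks a negative; ties count one half."""
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def cv_auc(MD, alpha, n_folds=5, seed=0):
    """Mean AUC over folds of the known associations."""
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(MD)
    order = rng.permutation(len(rows))
    aucs = []
    for fold in np.array_split(order, n_folds):
        train = MD.copy()
        train[rows[fold], cols[fold]] = 0  # hide the test associations
        S = lomda_scores(train, alpha)
        # Held-out associations are positives; originally unknown pairs are negatives.
        aucs.append(auc_score(S[rows[fold], cols[fold]], S[MD == 0]))
    return float(np.mean(aucs))

# Synthetic data (hypothetical): two miRNA groups linked to two disease groups.
rng = np.random.default_rng(1)
pattern = np.kron(np.eye(2), np.ones((15, 6)))
MD = pattern * (rng.random(pattern.shape) < 0.7)

grid = np.arange(1, 51) / 1000.0  # 0.001, 0.002, ..., 0.050
results = {float(a): cv_auc(MD, float(a)) for a in grid}
best_alpha = max(results, key=results.get)
print(best_alpha, results[best_alpha])
```

Using the same fold assignment for every value of *α* keeps the comparison between grid points fair.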

## 4. Conclusion

Predicting novel associations between miRNAs and diseases helps scientists focus first on the most likely associations rather than blindly checking all possible associations, which is extremely costly and laborious. Moreover, it can help researchers better understand the molecular mechanisms of diseases at the miRNA level. Such prediction also plays an important role in understanding the pathogenesis of human diseases at an early stage and can therefore help in diagnosis, treatment and prevention. Motivated by the need to identify novel associations between miRNAs and diseases, in this work we proposed a method, linear optimization for miRNA-disease association (LOMDA), to predict the potential associations between miRNAs and diseases. The proposed method uses a heterogeneous matrix that integrates the miRNA functional similarity, disease gene similarity and known miRNA-disease associations. The method can also be applied when only known associations are available. Moreover, it can predict the associations of new miRNAs (or diseases) by using miRNA functional similarity (or disease semantic similarity). According to the cross-validation results evaluated by AUC, the proposed method performs very well. Thus, LOMDA is an effective and flexible tool for predicting miRNA-disease associations. Researchers can apply this method to compute association scores and then choose the most promising associations for further biological experiments.

LOMDA contains one parameter, *α*. To find its optimal value, one can apply cross validation by dividing the known associations into training and testing samples and computing the AUC for different values of the parameter. The optimal *α* is the one that produces the highest AUC. When the data are large enough, the optimal parameter obtained on this division is approximately the same as that for the whole dataset.

## References

- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].