GNMFLMI: Graph Regularized Nonnegative Matrix Factorization for Predicting LncRNA-MiRNA Interactions

Long non-coding RNAs (lncRNAs) and microRNAs (miRNAs) have been involved in various biological processes. Emerging evidence suggests that the interactions between lncRNAs and miRNAs play an important role in the regulation of genes and the development of many diseases. Due to the limited scale of known lncRNA-miRNA interactions, and expensive time and labor costs for identifying them by biological experiments, more accurate and efficient lncRNA-miRNA interaction computational prediction approach urgently need to be developed. In this work, we proposed a novel computational model, GNMFLMI, to predict lncRNA-miRNA interactions using graph regularized nonnegative matrix factorization. More specifically, the similarities both lncRNA and miRNA are calculated based on known interaction information and their sequence information. Then, the affinity graphs for lncRNAs and miRNAs are constructed using the $p$ -nearest neighbors, respectively. Finally, a graph regularized nonnegative matrix factorization model is developed to accurately infer potential interactions between lncRNAs and miRNAs. To assess the performance of GNMFLMI, five-fold cross-validation experiments are carried out. The AUC values achieved by GNMFLMI on two datasets are 0.9769 and 0.8894, respectively, which outperform the compared methods. In the case studies for lncRNA nonhsat159254.1 and miRNA hsa-mir-544a, 20 and 16 of the top-20 associations predicted by GNMFLMI are confirmed, respectively. Rigorous experimental results demonstrate that GNMFLMI can effectively predict novel lncRNA-miRNA interactions, which can provide guidance for relevant biomedical research. The source code of GNMFLMI is freely available at https://github.com/haichengyi/GNMFLMI.

lncRNA-protein interactions [11]. Therefore, it is necessary to construct a network of biomolecular interactions mediated by lncRNAs, which is very useful for revealing the underlying mechanisms and biological functions of lncRNAs [12]. As a bait for miRNA, lncRNAs can inhibit the binding of miRNA to target gene mRNA, and can also act as an endogenous miRNA sponge to inhibit miRNA expression [13,14]. With the accumulation of knowledge on miRNA function, the lncRNA-miRNA interaction network can help us better understand the complex functions of lncRNA [7]. MiRNA is a class of non-coding short sequence RNAs of 18-25nt in length that are widely found in eukaryotes and are highly conservative, spatiotemporal-specific and tissue-specific [15]. A miRNA molecule can regulate the expression of up to 200 target genes, and about one-third of human genes are regulated by miRNAs [15]. Up to now, miRNAs are considered to be the most important gene regulators in cell differentiation, development, growth, and tumorigenesis, progression, metastasis, and drug resistance [16]. In the occurrence and development of human tumors, some miRNAs can act as both an oncogenes and a cancer suppressor genes [17,18]. For example, certain miRNAs are associated with the development of ductal carcinoma in situ (DCIS) to invasive carcinoma, especially miR-210, miR221 and let-7d, which are down-regulated in situ carcinoma but up-regulated in invasive carcinoma [17].
In recent years, More and more studies have shown that both lncRNA and miRNA play critical roles in various biological processes and human complex diseases [19,20]. It has been systematically studied that the lncRNA-miRNA interactions exert regulation role in some human complex diseases [21][22][23]. In many diseased cells, lncRNA is discovered to have a certain quantitative relationship with some miRNAs, this quantitative relationship is closely associated with the occurrence and development of diseases [24].
For example, in the renal cell carcinoma (RCC), after Malta silencing, miR-205 expression is up-regulated, but cell proliferation, migration and invasion are inhibited. Conversely, the expression of Malta is significantly reduced after overexpression of miR-205，experiments showed that there is a mutual regulation relationship between Malta and miR-205 [25].
The detailed understanding of interactions between lncRNAs and miRNAs in disease is very helpful for new biomarker discovery and treatment methods exploration [26].
However，identifying the interactions between lncRNAs and miRNAs is expensive and time-consuming by biological experiment.
To accelerate the process of identifying interactions between biomolecules, many computational methods have been proposed and effectively used for predicting relationships (e.g. miRNA-disease associations, protein-protein interactions and lncRNAprotein interactions), including manifold learning, manifold embedding and semisupervised learning, etc. [27][28][29]. The computational methods for predicting miRNA-target interactions usually have the following common rules, including site accessibility, seed matching, free energy and protection [10,30]. However, many miRNA-target identification methods were proposed originally for mRNA targets that may not be able to identify the interactions between lncRNAs and miRNAs, or even contradictory [31,32].
Huang and Chan proposed the EPLMI calculation model based on the assumption that lncRNAs tend to interact collaboratively with miRNAs of similar expression profiles, and constructs bipartite graph via known lncRNA-miRNA interactions for prediction [33]. Huang et al. proposed the GBCF computational model based on known interaction network to obtain a top-k probability ordering list of individual lncRNA or miRNA for prediction [34]. Although the above two methods have better predictive effects in the known lncRNA-miRNA interaction network, they cannot be applied to new lncRNA or miRNA. The predictive effect of interactions between molecules can be improved by integrating biological information from different sources [35][36][37]. In fact, prediction of lncRNA-miRNA interactions can be considered as a recommender system problem [38,39]. Accumulated studies have shown that matrix factorization is an effective method which has been successfully used in recommender system for data representation，and already widely applied in the field of bioinformatics [40][41][42].
In this paper, we propose a new calculation model, GNMFLMI, to predict lncRNA-miRNA interactions using graph regularized nonnegative matrix factorization (NMF) [43].
The model is based on the assumption that functionally similar lncRNAs (or miRNAs) are more possibility to interact with a same miRNA (or lncRNA) [44] . GNMFLMI fully exploits miRNA/lncRNA sequence information and known lncRNA-miRNA interaction network to calculate miRNA/lncRNA similarity. Subsequently, the graph spaces of miRNA and lncRNA were constructed based on the local invariance hypothesis of intrinsic geometric space [45][46][47], which promote similar lncRNAs/miRNAs to be close enough to each other in the lncRNA/miRNA space. We evaluated the performance of our method by five-fold cross validation and compared the performance with standard NMF [43], RNMF [48]. The experiment results show that GNMFLMI is better than other compared methods, and can effectively predict the novel lncRNA-miRNA interactions.

Benchmark Dataset
The lncRNA-miRNA interactions dataset used in this work were obtained from the lncRNASNP2 database in January, 2019. This dataset is collected and collated by Ya-Ru Miao et al. [49] and is accessible to academic users at

The standard nonnegative matrix factorization (NMF)
Nonnegative matrix factorization (NMF) is an effective algorithm which has been successfully used in recommender system for data representation [41]. This algorithm divides a large original matrix into two low-dimensional nonnegative matrices: a basis matrix and a coefficient matrix , and make their product as close as possible to the original matrix. Recent years, NMF has been successfully utilized to predict potential associations of lncRNA-protein [50], miRNA-disease [51], Microbe-disease [52]，drugtarget [53]，CircRNA-disease [54]，etc. In this work, the lncRNA-miRNA interaction adjacency matrix ∈ 468×262 is divided into ∈ ×468 and ∈ ×262 , is the sub-space dimensionality ( < /( + )). So that: The objective function for predicting lncRNA-miRNA interactions can be formulated as following: Where, ‖•‖ denotes the Frobenius norm. ， ≥ 0 means that all elements of and are nonnegative. According to matrix properties ‖ ‖ 2 = ( ), ( ) = ( ), and ( ) = ( ), we can obtain: Where (•) is the trace of a matrix.
Lee and Seung [43] propose the nonnegative matrix factorization algorithm and this algorithm is based on the multiplicative update rules of and , the update rules are as follows:

Constrained nonnegative matrix factorization (CNMF)
Due to the standard NMF cannot ensure the and smoothness. We can use the Tikhonov ( 2 ) in the standard NMF to solve this problem [55]. Pauca et al. proposed the following constrained nonnegative matrix factorization (CNMF) formulation [48]: where is the sparseness constraint coefficient which is used adjust the sparsity of and via 2 -norm. The Eq. (6) can be written as: The multiplicative update rules of and are as follows: the details of multiplicative update rules are in section 2.3.5.

methods overview
In this study，we propose a new calculation model, GNMFLMI, to predict lncRNA-miRNA interactions. The GNMFLMI can be summarized into three steps, and its framework is shown in Figure 1. First, the similarity matrices of lncRNA and miRNA are calculated based on lncRNA/miRNA sequence information and known lncRNA-miRNA interaction network. Second, we construct the affinity graphs for lncRNAs and miRNAs using p-nearest neighbors. Finally, graph regularized nonnegative matrix factorization is performed to calculate the lncRNA-miRNA interaction scores.

Similarity measure
In our method, we need to construct the similarity matrices for lncRNAs and miRNAs separately，so the similarity both each pair of lncRNA-lncRNA and each pair of miRNA-miRNA need to be determined. In this study, two different types of lncRNA/miRNA similarity were measured via integrating diverse sources of information. The first type of lncRNA/miRNA similarity is calculated using Gaussian interaction profile (GIP) kernel based on known lncRNA-miRNA interaction network [56]. The second type is calculated using Pearson correlation coefficient (PCC) based on lncRNA/miRNA sequence information [57]. Subsequently, the overall similarity matrices for lncRNAs and miRNAs were constructed based on above two types of similarity.

Similarity based on Gaussian interaction profile (GIP) kernel.
According to the assumption that functionally similar lncRNAs tend to interact with the similar miRNAs, and Gaussian interaction profile (GIP) kernel has been widely used to compute the molecule similarity [56,58] . The Gaussian interaction profile kernel similarity of lncRNA and miRNA can be constructed according to the topologic information of known lncRNA-miRNA interaction network. Thus, we use GIP kernel to calculate the similarity ( , ) between lncRNA and lncRNA as following: is the adjacent matrix of lncRNA-miRNA interaction based on lncRNASNP2 database. and are the number of lncRNAs and miRNAs, respectively. The size of is × , ( ) is the ℎ row of the adjacent matrix , is the kernel width parameter.
Similar to lncRNAs, the Gaussian interaction profile kernel similarity ( , ) of miRNA and miRNA can be calculated as follows: where The size of is × , ( ) is the ℎ column of the adjacent matrix , is the kernel width parameter.

Similarity based on Pearson correlation coefficient (PCC
where, is the number of attributes of the expression profile, ̅ and ̅ denote the average value of and , respectively. Generally, the larger ( , ) represents the more similarly expression of lncRNA and lncRNA .
Similar to lncRNAs, the similarity between miRNA and miRNA can be calculated by Pearson correlation coefficient as follows.

Construct the overall similarity for lncRNAs and miRNAs.
In this study, the Gaussian interaction profile kernel similarity and Pearson correlation coefficient similarity of lncRNA and miRNA are calculated, respectively. After that, the functional similarity between lncRNA and lncRNA is defined according to [61,62], the final similarity matrix of lncRNA is calculated as follows: where is -order square matrix, ( , ) represents the similarity score between lncRNA and lncRNA .
Based on the same method, the final similarity matrix of miRNA is calculated as follows: where is -order square matrix, ( , ) represents the similarity score between miRNA and miRNA .

Sparsification of the similarity matrices
Recent studies on manifold learning theories and spectral graph have shown that the scattered nearest neighbors of data points can effectively model local geometric structure [53,63]. In graph regularized matrix factorization, the nearest neighbor graph can promote close lncRNAs (or miRNAs) to be sufficiently close to each other in the lncRNA space (or miRNA space) [46,64]. That is, it can preserve the local geometries of the original data.
In this study, the affinity graphs ( * and * ) for lncRNAs and miRNAs are constructed using p-nearest neighbor graph，respectively. We set p=5, and the weight matrix is generated based on the lncRNA similarity matrix as follows: where ( ) and ( ) denote the sets of p-nearest neighbors to lncRNA and lncRNA , respectively. Subsequently, the sparse similarity matrix * for lncRNAs is defined as： The same procedure for miRNAs, the sparse similarity matrix * can be obtained by the similarity matrix of miRNA.

The model of GNMFLMI
In the Euclidean space, the standard nonnegative matrix factorization in Eq. (2) fails to discover the intrinsic geometrical and discriminating structure of the data space [65,66].
To avoid overfitting and enhance generalization capability, we use the Tikhonov ( 2 ) regularization in Eq. (2) to guarantee the and smoothness (i.e. Eq. (6)) [55]. At the same time, graph regularization is used to ensure that the relative positions of data points in the lncRNA feature space or miRNA feature space are unchanged [46]. The objective function of graph regularized nonnegative matrix factorization can be defined as follows: Similarly,

Model optimization
Because of the objective functions of GNMFLMI is not convex, finding the global minima is unrealistic by optimization algorithm. However, the local minima can be achieved via algorithm. In this study, the Lagrange Multiplier method was introduced to solve the optimization problem in Eq. (25). Let = { } and = { }, the Lagrange multipliers and are used to restrict the ≥ 0 and ≥ 0, respectively. The Lagrange function ℋ can be constructed as: The partial derivatives of the function ℋ to and are: Finally, the updating rules can be determined as follows: We update the nonnegative matrices and according to Eq. (31) and Eq. (32) until convergence or reaching the iteration upper limit. Ultimately, the new prediction lncRNA-miRNA interaction matrix * can be calculated by * = . In general, the larger value of the element in prediction matrix * , it is more likely that lncRNA/miRNA interacts with the corresponding miRNA/lncRNA. That is, for each miRNA, we can sort the lncRNA in descending order according to the value of the element, the top ranked lnRNAs in each column of * are more likely to be associated with the corresponding miRNA. The same is true for each lncRNA. Table 1 generalizes the procedure of GNMFLMI for predicting lncRNA-miRNA interactions. 9. until convergence or reaching upper limit of the iteration 10. calculate the predicted association matrix * = ; 11. return * .

Experimental settings
In this study, to estimate the performance of GNMFLMI on predicting lncRNA-miRNA interactions, the five-fold cross validation experiments were performed on the lncRNASNP2 dataset and compare our method with the following approaches: NMF and RNMF. In the five-fold cross validation，we randomly divide 8634 known lncRNA-miRNA interaction samples into five equal subsets. For each cross validation experiment, one of the subsets was used as the test set and the other four subsets were used as the training set.
The receiver operating characteristics (ROC) curve and AUC (area under the receiver operating characteristics curve) are widely used to estimate the performance [69,70]. A larger value of AUC represents the better prediction performance of model. The confusion matrix can be obtained by setting different thresholds, the sensitivity (Sen.) and specificity (Spe.) are calculated as:

Cross validation
We compared the performance of GNMFLMI with computational approaches NMF and CNMF on the lncRNASNP2 dataset. Figure 2, Figure 3 and Figure 4 plot ROC curves and calculate the average AUC values of NMF, CNMF and GNMFLMI, respectively.     In addition, the sensitivity, precision, accuracy and F1-Score for these methods were calculated at different specificity. As shown in Table 3, when specificity is 95%, the average sensitivities of GNMFLMI, NMF and CNMF are 89.40%, 80.17% and 82.08%, respectively. The sensitivity of GNMFLMI is 9.23% and 7.32% higher than NMF and CNMF. When specificity is 90%, GNMFLMI achieves the average sensitivity of 94.20%, which is still 10.63% and 8.47% higher than NMF and CNMF, respectively. In figure 5, we can also discover that the ROC curve of GNMFLMI is always above CNMF and NMF.
These results further demonstrate that the performance of GNMFLMI is better than CNMF and NMF.

Case study
Case studies are carried out to further verify the capability of GNMFLMI on predicting novel interactions between lncRNAs and miRNAs. Here, we left out an arbitrary lncRNA-miRNA interaction (i.e. removing its interactions from the lncRNASNP2 dataset) to verify if its interactions would be discovered successfully. According to [71], the prediction performance is greatly affected by nearest neighbor information. If the new lncRNA (miRNA ) and lncRNA (miRNA ) are close to each other, it may be easy to accurately predict the interactions between them, and vice versa. In this work, we select the lncRNA and miRNA which the similarity to the nearest neighbor is low (according to * and * , respectively) to validate the capability of model on predicting novel interactions. However, the standard NMF and CNMF fails to discover the novel interactions in this case.  according to the predicted interaction scores. Table 4 gives the top 20 predicted interactions for nonhsat159254.1, the top 20 predicted interactions were verified by databases. The same procedure is performed for miRNA-hsa-mir-544a, Table 5 lists the top 20 predicted interactions for hsa-mir-544a, 16 out of the top 20 candidate lncRNAs were verified by databases.
It is worth noting that the nonhsat159254.1 and hsa-mir-544a prediction can be considered two difficult cases. Specifically, the similarities both nonhsat159254.1 and hsamir-544a to their nearest neighbors lncRNA and miRNA are as low as 0.1786 (according to * ) and 0.3658 (according to * ), respectively. Such low similarity makes it more difficult to predict the interactions between them. According to the above two cases, it is shown that GNMFLMI can effectively predict novel and challenging interactions between lncRNAs and miRNAs ， which can provide valuable information for biological experiments

Conclusion
The interactions between lncRNAs and miRNAs constitute a complex molecular regulatory network, and studies have confirmed that their interactions are closely related to the occurrence and development of various diseases. Identifying lncRNA-miRNA interactions can help people better understand the complex disease mechanisms. In this paper, we propose a new method, GNMFLMI, for lncRNA-miRNA interaction prediction.
Different from other traditional methods, GNMFLMI guides the matrix factorization via constructing graph Laplacian regularizations of lncRNAs and miRNAs, and uses Lagrange multipliers method to optimize the objective function. This method can also be applied into other similar association prediction (e.g. small molecular-miRNA and mRNA-protein associations). Five-fold cross validation and case studies were conducted to validate the performance of GNMFLMI，the experiment results show that our method outperforms the other compared methods and can identify potential lncRNA-miRNA interactions. We believe that our approach is helpful for clinical research. In future work, more different biological information can be integrated to further improve prediction performance of model.