Identifying Potential miRNA-Disease Associations Based on an Improved Manifold Learning Framework

Adequate evidence has shown that miRNA-disease interactions are strongly involved in the pathological processes of complex human diseases. However, it is commonly time-consuming and labor-intensive to utilize laboratory biological experiments to reveal unknown miRNA-disease pairs. Since the previously proposed calculation model has more or fewer deficiencies, we developed the semi-supervised method called Hessian Regularized Non-negative Matrix Factorization Method for miRNA-disease Association prediction (HRNMFMDA). This model introduced Hessian regularization into the NMF framework to preserve the local manifold information and increased an $l_{2,1}$ -norm penalty term to ensure the feature selection of the coding matrix and an approximate orthogonal constraint to obtain discriminative information. In the model performance evaluation, HRNMFMDA outperformed eight existing models, achieving AUC values of 0.9074, 0.8618, and 0.9044+/−0.0080 in the global leave-one-out cross-validation (LOOCV), local LOOCV and 5-fold cross-validation, respectively. In addition, we applied HRNMFMDA to several high-incidence human carcinomas via three kinds of case studies. Almost all predicted miRNAs were confirmed by external databases derived from experimental literature. Therefore, the conclusion can be drawn that HRNMFMDA is reliable for revealing uncovered miRNA-disease pairs.

The associate editor coordinating the review of this manuscript and approving it for publication was Bin Liu . Moreover, hypercholesterolemia had great connections with the increment of miR-223 levels in atheroprone mice [15]. Moreover, studies on Alzheimer's disease (AD) have revealed that lower expression levels of miR-15 and miR-107 in the hippocampus of AD brains lead to increased phosphorylation of tau protein, progressively inducing the occurrence of AD [16]. Additionally, experiments have demonstrated that miR-21, miR-494, and miR-1973 have encouraging potential as biomarkers for the diagnosis of classical Hodgkin's lymphoma [17]. In addition, the reduction in miR-122 can repress the RNA replication of HCV in chimpanzees [18]. Accordingly, uncovering the associations between miRNAs and diseases is needed to understand the occurrence and development of human diseases. However, traditional experimental methods, such as microarray profiling [19], quantitative RT-PCR array [20] and sequencing technologies [21], are time-consuming and expensive. Fortunately, abundant VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ miRNA-related databases have been continuously established with the accumulated results of innumerable biological experiments. Therefore, it is feasible to build new computational models depending on available biological datasets to predict the underlying disease-related miRNAs.
Over the last few years, investigators have achieved significant progress in clarifying uncovered disease-miRNA associations. Most of the existing predictive models depend on the hypothesis that functionally similar miRNAs are predisposed to diseases with similar phenotypes, and vice versa [22], [23]. Jiang et al. [24] utilized cumulative hypergeometric distribution to construct a scoring system by integrating disease phenotype similarities, miRNA functional similarities, and known disease-miRNA associations. However, the prediction performance was not satisfactory since the information of other nodes, except for the miRNAs' neighbors, was not effectively exploited. Then, Jiang et al. [25] developed a genomic data integration-based naive Bayesian model to predict novel miRNAs, improving predictive outcomes and providing guidance for further experimental investigation. Moreover, Chen et al. [26] proposed Random Walk with Restart (RWR) based on the miRNA functional similarities network (MFSN) to rank candidate miRNAs of a specific disease. However, this method did not work for diseases without any verified miRNAs. The Human Disease-related MiRNA Prediction (HDMP) method was built by Xuan et al. [27] by combining the semantic similarities and phenotype similarities of diseases. Then, miRNA clusters were allocated with higher importance according to the k nearest nodes. Nonetheless, HDMP ignored topology information embedded by local neighbors and could not infer novel miRNAs of the isolated diseases. Later, Xuan et al. [28] further proposed the MIDP method based on Random Walk. In this model, the prior information of topologies was integrated effectively by assigning different weights to known and unknown nodes. Furthermore, the extended transition on the miRNA-disease bilayer network made it possible to implement the prediction of isolated disease-related miRNAs. Chen et al. [29] presented the Heterogeneous Graph Inference (HGIMDA) method to identify candidate miRNA-disease pairs. The model constructed a target equation based on a heterogeneous graph. Then, matrix multiplications were calculated iteratively until convergence to acquire the predictive probability matrix. Nonetheless, the model may bias those miRNAs with more relevant disease records. Chen et al. [30] developed Matrix Decomposition and Heterogeneous Graph Inference (MDHGI) for the identification of miRNA-disease associations. In this method, the heterogeneous network was built using a fine-tuned associations matrix adjusted by low-rank matrix decomposition. Moreover, MDHGI could achieve improved performance because the model combined matrix decomposition with heterogeneous graph inference. Inevitably, the prediction solution had several defects, such as the difficult problem of parameter optimization and the bias caused by unevenly distributed miRNA-disease pairs. As a network-based model, the Path-Based (PBMDA) predictive model was introduced by You et al. [31] by combining multiple biological datasets. The method used the depth-first search algorithm to determine the weights of each path and prioritized miRNA-disease pairs based on their aggregated scores. Chen et al. [32] presented the Bipartite Network Projection (BNPMDA) for identifying miRNA-disease associations. The method adopted hierarchical clustering to generate bias ratings between miRNAs and diseases. Then, a bipartite network recommendation algorithm was implemented in the weight reassignment step between the investigated disease and other miRNAs. Inevitably, BNPMDA also had a crucial flaw in that it was not suitable for the association prediction of isolated diseases, which limited its range of application. The Ensemble Learning and Link Prediction (ELLPMDA) model [33] combined three classic link prediction methods (common neighbors, Jaccard index and Katz index) based on ensemble learning for identifying miRNA-disease associations. The prediction accuracy was effectively improved. Furthermore, Chen et al. [34] proposed a prediction model based on Bipartite Local models and Hubness-Aware Regression (BLHARMDA). The model used Jaccard similarity and KNN methods to mitigate the adverse effects of 'bad' hubs.
Apart from the models mentioned above, many computational methods that belong to machine learning are being gradually created. Xu et al. [35] developed a novel method by utilizing the miRNA target-dysregulated network (MTDN) to discover candidate prostate cancer miRNAs. Adopting the support vector machine (SVM) classifier, Xu's model made predictions by topological characteristics and expression folds of miRNA. Although this method was creative, it could not achieve outstanding accuracy because of the high false-positive rates of miRNA-target pairs. Chen et al. [36] presented a global learning model based on Regularized Least Squares (RLSMDA) for the candidates' prediction. Furthermore, it was mainly constructed as a continuous classifier composed of regularized least squares from the perspective of miRNA and disease. Furthermore, RLSMDA could work for solitary diseases or miRNAs. A novel model based on the restricted Boltzmann machine (RBMMMDA) was developed by Chen et al. [37]. It is noteworthy that RBMM-MDA was the earliest method that could discover not only the unobserved miRNA-disease pairs but also their association types. Of course, the model had some limitations, e.g., RBMMMDA did not have the applicability to reveal the isolated diseases or miRNAs. Moreover, Liang et al. [38] constructed the 1-norm graph model. In this semisupervised model, an effective 1-norm was introduced into the objective function, and then the weights of the miRNA and disease similarity matrix were updated adaptively in the iterative calculation process until convergence. The Extreme Gradient Boosting Machine-based method (EGBMMDA) was introduced by Chen et al. [39]. In this method, informative feature vectors were acquired by combining graph theory features, statistical features, and matrix factorization from multiple biological datasets. Nevertheless, choosing negative samples inappropriately would impair the predictive 33264 VOLUME 8, 2020 performance of the model, which deserves further investigation. Chen et al. [40] built the associations inference based on Inductive Matrix Completion (IMCMDA), whose core idea is to recover the missing entries using an iterative equation from verified miRNA-disease pairs. Chen et al. [41] developed the Ensemble of Decision Tree method (EDT-MDA) for revealing potential candidate pairs. The model adopted principal components analysis (PCA) to reduce feature dimensionality and then utilized the decision trees as base learners to acquire the predictive score matrix. However, the method might cause bias to diseases associated with more miRNAs. Chen et al. [42] developed an improved prediction model based on Laplacian Regularized Sparse Subspace Learning (LRSSLMDA). In this model, statistical and graphtheoretical features of miRNAs and diseases extracted from the integrated similarities were integrated into a common subspace by the L1-norm constraint and Laplacian regularization terms to acquire the projection matrices from the miRNA and disease perspective, respectively. The final prediction scores were obtained by taking the average of predicted association scores of miRNA and disease. Recently, Zhao et al. [43] proposed a novel method based on an Adaptive Boosting algorithm (ABMDA). This model adaptively adjusted the weights of weak classifiers (decision trees) to combine them into a strong classifier. Moreover, ABMDA was suitable for predicting isolated diseases without any known related miR-NAs. Wang et al. [44] developed a new prediction method called LMTRDA. It used the Logical Model Tree (LMT) as the base classifier and then adjusted the weights of the corresponding samples. It is worth mentioning that this model uses natural language processing technology to integrate miRNA sequences into multiple data sources, improving the final prediction performance. The RFMDA model, proposed by Chen et al. [45], performed feature extractions of the integrated multiple data sources according to the miRNAdisease pair and then input them into a random forest for training. The model had the disadvantage that it was often difficult to obtain the negative samples that were required during the training process. Wang et al. [46] constructed the Negative Samples Extraction method (NSEMDA) to predict miRNA-disease association pairs. The model used the Spy and Rocchio techniques to exclude negative samples and calculated similarity for suspected negative samples. Finally, the support vector machine-similarity weight method was used to obtain predictive scores. The advantage of the model was that it could mitigate the effects of noisy data. Zhang et al. [47] created a predictive model based on variational autoencoders (VAEMDA). This model used spliced matrices generated from adjacency and similarity matrices as training data and took the average of the outputs of the two VAE models as the final prediction scores. However, this type of model lacked interpretability, which needs to be further improved.
Some progress has been made in previous research on miRNA-disease association predictive models. However, there is still room to further improve the predictive performance of the models. Therefore, we presented an improved solution called Hessian Regularized Non-negative Matrix Factorization Method for miRNA-disease Association Prediction (HRNMFMDA) to further utilize information from negative samples. This model first introduced Hessian regularization into the NMF framework to exploit the local manifold structure by recovering the inherent local information from the training data. Moreover, the method increased the l 2,1 -norm constraint and approximate orthogonal constraints to ensure the feature selection of the coding matrix, thereby improving the final predictive performance. HRN-MFMDA acquired AUCs of 0.9074, 0.8618, and 0.9044+/-0.0080 after the implementation of global LOOCV, local LOOCV and 5-fold cross-validation, respectively, which were superior to most of the previously proposed models. Moreover, HRNMFMDA was further applied to infer several common human tumors. With the top 50 miRNAs in the predictive results, most of them were confirmed through dbDEMC [48], miR2Disease [49], and HMDD V3.0 [50]. Therefore, the results of the various verification experiments indicate that HRNMFMDA is useful for identifying candidate miRNA-disease pairs.

A. ASSOCIATIONS DATASET
Over the past decades, cumulative biological experiments have discovered a considerable number of miRNA-disease associated pairs. Here, the known human associations used in this model were downloaded from HMDD V2.0 [51]. The dataset contained 495 miRNAs and 383 diseases. An adjacency matrix was adopted to formalize the 5430 experimentally verified human miRNA-disease associated pairs. Specifically, if the disease d(i) had a previously verified connection with miRNA m(j), A ij was assigned a value of 1; apart from this, the corresponding position was set to 0.

B. MIRNA FUNCTIONAL SIMILARITIES
The miRNA functional similarities of this paper were gathered from http://www.cuilab.cn [23], whose calculations were inspired by a fundamental premise that functional analogous miRNAs more likely had a link with analogous phenotypic diseases [52]. Next, an miRNA functional similarity matrix FS was constructed to store data for this association matrix, where FS(i, j) denotes the functional similarities between the i-th and j-th miRNA.

C. DISEASE SEMANTIC SIMILARITIES MEASUREMENT 1
As reported by Wang et al. [23], a directed acyclic graph (DAG) was utilized to outline the semantic correlations among diseases. For instance, disease D could be described using a triple structure such as DAG(D) = (D, T (D), E(D)). In this expression, the node-set T (D) represents the corresponding disease and all its ancestor nodes, and edge-set E(D) was defined as the directed edges from the parents to the child nodes. From the above, the semantic value of disease D VOLUME 8, 2020 can be defined by: and where is the semantic decay factor and was set to 0.5 in the iterative equation according to reference [23]. Commonly, the farther a child node is located from the root node, the smaller its contribution to the semantic value of the corresponding disease. Referencing the hypothesis that two diseases sharing a greater degree of overlap of DAGs should have higher semantic similarity, the semantic similarities between the i-th and j-th disease could be defined by: Considering that the numbers of DAGs involving different disease terms can vary greatly, another method was proposed for quantifying the semantic similarities of the diseases more objectively and accurately [27]. According to the assumption that disease terms appearing in fewer DAGs should be given a more prominent contribution, we have: Similarly, the semantic similarity of disease D and the semantic similarity between the i-th and j-th diseases can be defined as follows: and

E. GAUSSIAN SIMILARITIES FOR DISEASES AND MIRNAS
According to the literature [53], the radial basis function was adopted to measure the Gaussian similarities between disease d(i) and disease d(j). Specifically, we used the binary interaction profile vectors IP(d(i)) and IP(d(j)) to represent the i-th row and the j-th row of the miRNA-disease adjacency matrix, respectively. Then the disease Gaussian similarities matrix KD was calculated by: where γ d is the bandwidth parameter, which is calculated as follows: Analogically, Gaussian similarities of miRNAs were determined as follows: (9) and For brevity, the values of γ d and γ m were both assigned a value of 1, consistent with the previous literature [53], [54].

F. INTEGRATED SIMILARITIES FOR DISEASES AND MIRNAS
Typically, disease semantic similarities were available if the corresponding DAGs existed. If not, the same method would not work. To acquire the disease similarities without missing data, the semantic similarities of the diseases were combined with the Gaussian interaction profile kernel similarities by piecewise function referring to the literature [55]:

d(i) and d(j) has semantic similarity KD(d(i), d(j)), otherwise
where taking the arithmetic mean as the disease semantic similarity if the DAGs existed was adopted; otherwise, KD. Similarly, the integration of miRNA similarity was obtained as follows: NMF is a promising and effective decomposition method utilized in various situations, including document analysis [56], pattern recognition [57], and image processing [58].

FS(m(i), m(j)), m(i) and m(j) has semantic similarity KM
In the NMF, given the miRNA-disease association matrix A ∈ R m×n , the target of the technique is to seek P ∈ R m×r and Q ∈ R r×n by minimizing the following: where ||·|| F represents the Frobenius norm. The minimization of the equation above could be solved by an iterative updating algorithm proposed by Lee and Seung [59].

2) HESSIAN REGULARIZATION
Inspired by the literature [60], the expression of Hessian energy S H (f ) could be defined by the integral of the volume element dV(x) of a smooth manifold M⊂R n , as follows: where T x M represented point x of the tangent space of M , and ∇ a ∇ b f denoted the second covariant derivation of function f , which was a real-valued mapping: M→R m [60]. Using the conventional coordinate system, we further had: where C r and C s were normal coordinates. According to this, the second-order covariant derivative of function f happened to be the Frobenius norm expression of the Hessian regularization in standard coordinates. Thus, we obtained: Let N k (X i ) denote the k minimum distance points of X i , and the Hessian of f (X i ) on N k (X i ) could be estimated as follows: Using the least squares method to solve a second-order polynomial could obtain the H operator [61]. Therefore, the approximate representation of the Frobenius norm of Hessian regularized f on x i could be expressed as: where B rsβ and the final estimated Hessian expressionŜ Hess (f ) could be approximated as follows: where n is the number of samples, and B denotes the Hessian regularization, which is made up of the sum of all matrices B (i) [62], [63]. The specific calculation process of Hessian regularization is shown in Algorithm 1. Motivated by previous literature [64], [65], sparseness constraints [66] and discriminative constraints [67] were joined into Hessian regularized NMF. In general, the 2,1 -norm term of the matrix Q and orthogonal constraint terms on matrix Q were added to the NMF framework as penalty terms. Then, the objective equation of HRNMFMDA to be optimized could be given as follows: In the equation above, A − PQ 2 F is the reconstruction error term of the non-negative matrix factorization, also known as the least squares cost function, where matrix A represents the miRNA-disease interactions and matrices P and Q are the basic matrix and encoding matrix to be solved, respectively.
tr(PB d P T ) and tr(QB m Q T ) are the Hessian regularization terms for preserving the manifold information, where tr(M ) denotes the trace of the matrix M . where q j. represented the j-th row of matrix Q. The purpose of row sparse regularization was to retain pivotal features only by decreasing some of the rows of matrix Q to 0.
Q T Q − I k 2 F was the discriminant constraint term, where I k was a unit matrix with the dimensions k ×k, whose purpose was to make matrix Q approximately orthogonal to obtain discriminant information.
The Tikhonov (L2) regularization term P 2 was set to ensure the smoothness of matrix P to prevent overfitting. VOLUME 8, 2020 Among them, λ, µ and γ were regularization parameters, which could be further optimized using separate validation sets. For the sake of brevity, default parameters were used in the following sections.

4) OPTIMIZATION
Since Eq. (20) was non-convex, the closed-form solution was not available. Next, an iterative algorithm was utilized to obtain the local minimum instead. According to formula A F = tr(AA T ), the optimization function can be rewritten in the following form: The corresponding Lagrange function L of the above equation can be defined as: where ik and ik denote the Lagrange multipliers for the constraints P ik ≥ 0 and Q kj ≥ 0, respectively. The partial derivatives of the equation above concerning the matrix P were: Following the Karush-Kuhn-Tucker (KKT) conditions [68] and ik P ik = 0, so From the above, the updating rules were acquired as follows: Similarly, taking the partial derivatives of L above concerning matrix Q, we could obtain: Here, R was a diagonal matrix whose i-th element was: Then, using KKT conditions, ik Q kj = 0, we can obtain: Finally, the iterative rule for matrix Q was obtained from the above as follows: The main goal of HRNMFMDA was to recover the complete miRNA-disease association matrix X pred ∈ R m×n by implementing NMF with Hessian regularization on the known entries of miRNA-disease associations matrix X ∈ R m×n . In other words, the main problem we needed to solve was to find P ∈ R m×r and Q ∈ R r×n to acquire the formula X pred = PQ, where the parameter r represented the theoretical rank constrained with min(rank(P), rank(Q)). The value of r would influence the convergence speed of the NMF factorization, but it had minimal effect on the final results [40], [69]. Expressly, the value of r was set as 100, the convergence threshold was set as 10 −6 , and the above regularization parameter values take λ 1 = λ 2 = 1, µ = 1, γ 1 = γ 2 = 1 for simplicity. After preliminary exploration, the parameter settings of the Hessian regularization were k = 30 and d = 10 in Algorithm 1. After updating P and Q based on Eq. (25) and Eq. (29) until convergence, we obtained the final prediction score matrix as X pred = PQ. Finally, we ranked the corresponding miRNAs based on X pred . Generally, the miRNAs at the top of the prediction scores had a higher probability of being related to the corresponding disease. The flow diagram of HRNMFMDA is depicted in Figure 1 and the overall implementation of the model is shown in Algorithm 2. The dataset and codes of this model can be downloaded through the following GitHub address: https://github.com/Tomhappy/HRNMFMDA.

A. PERFORMANCE VALIDATIONS
In this section, LOOCV validation and five-fold crossvalidation were performed based on HMDD V2.0 to assess the predictive performance of HRNMFMDA. LOOCV validation comprised two subtypes, including global and local LOOCV. Global LOOCV focused on potential miRNAdisease associations for all diseases. In contrast, local LOOCV only considered miRNAs that target specific diseases. During the implementation, we removed the known miRNA-disease pair one by one as a test sample (total 5430 verification tests) one by one, regarding the other known pairs as training samples and all remaining unverified associations as candidates. Specifically, global LOOCV concentrated on all of the undiscovered associations, while local LOOCV only focused on these miRNAs related to the specific investigated disease. For these two kinds of LOOCV, test samples were ranked with candidates according to HRNMFMDA, and exceeding a predetermined threshold indicated a successful prediction and vice versa. Then, with the changing threshold, the receiver operating characteristic (ROC) curve could Algorithm 2 Overall Implementation of HRNMFMDA Input: DAGs of diseases, denoted as DAG(D) = (D, T (D), E(D)), miRNA functional similarity matrix FS, known miRNA-disease associated pairs matrix A, parameters and default values γ d = γ m = 1, λ 1 = λ 2 = 1, µ = 1, γ 1 = γ 2 = 1, k = 30, d = 10 Procedure: (1) For each disease d(i) in the association matrix A If there are corresponding DAGs Calculate the two semantic similarities SS1 and SS2 according to Eq. 1 to 6.
Else Calculate the Gaussian kernel similarity KD according to Eq. 7 and 8 (about parameter γ d ) (2) For each miRNA m(j) in the association matrix A Calculate the Gaussian kernel similarity KM according to Eq. 9 and 10 (about parameters γ m ); (3) Calculate integrated disease similarities S d according to Eq. 11; (4) Calculate integrated miRNA similarities S m according to Eq. 12; (5) According to Algorithm 1, input similarity matrices S d and S m respectively, and calculate Hessian regularization matrices B d and B m (about parameters k and d) (6) According to Eq. 25 and Eq. 29, iteratively update the matrix to obtain P and Q (about the parameters λ 1 , λ 2 , µ, γ 1 and γ 2 ) until the maximum number of iterations or convergence; (7) Calculate X pred = PQ Output: Predicted score matrix X pred be plotted. And the area under the ROC curve (AUC) was employed to evaluate the prediction performance of the classifier. AUC = 1 indicated the best theoretical performance was achieved, while 0.5 indicated random performance.
The verified pairs, as mentioned above, were partitioned into five equal-sized parts randomly in the 5-fold crossvalidation part. Each subgroup was then treated as the test set in turn, and the remaining subsets were considered training cases. It is worth mentioning that each cross-validation used the updated Gaussian kernel similarities based on the training data. Next, using the score matrix generated by HRNMFMDA, we ranked the miRNAs of the test set compared with the candidates. This calculation process was repeated 100 times to accomplish the stable estimates of the mean and variance of HRNMFMDA. After the implementation of validation, we acquired the average AUCs of GRNMF, HGIMDA, IMCMDA, RLSMDA, GRMDA, WBSMDA and RKNNMDA as 0.8689+/-0.00065, 0.8594+/-0.00021, 0.8335+/-0.00056, 0.8351+/-0.00048, 0.8056+/-0.0022, 0.8009+/-0.0012 and 0.6761+/-0.0020, respectively. Our proposed method, HRNMFMDA achieved an average AUC of 0.9044+/-0.0080. The results further proved the reliability of HRNMFMDA. A local ranking method, RWRMDA was not included in this global validation framework. All these results indicated that HRNMFMDA achieved high prediction accuracy and good robustness.
As described above, the ranking results for 5430 test samples were obtained during the global and local LOOCV. Then, the paired t-test was implemented to assess whether there was a significant difference in the performance of HRNMFMDA between several other methods (GRNMF, HGIMDA, IMCMDA, RLSMDA, GRMDA, WBSMDA, RKNNMDA and RWRMDA) using the p-value. Finally, we concluded that HRNMFMDA differed significantly from other algorithms (see Table 1).

B. STRUCTURE ANALYSIS
To evaluate the performance contribution of the Hessian regularization term to the model in our study, we replaced the Hessian regularization in HRNMFMDA with Laplacian regularization and kept the other parts of the model unchanged.
The AUC values corresponding to the two methods performed in 3 types of cross-validations are shown in Table 2. These results suggested that prediction using either of these two kinds of regularization could produce satisfactory performance, but Hessian regularization yielded better predictive performance than Laplacian regularization. The above results agreed with the existing theoretical research conclusions that VOLUME 8, 2020 the second-order Hessian energy could overcome the problem that the graph Laplacian lacks extrapolating ability [64], [73]. In summary, Hessian regularization had an advantage in discovering unverified miRNA-disease associations compared to Laplacian regularization.
Finally, the permutation test was performed to evaluate how much the two integrated similarities matrices and adjacency matrix contributed to the predictive performance. The core approach of the test randomized one of the three matrices and kept the other two unchanged in the 5-fold cross-validation framework (LOOCV validation framework was not adopted here for the calculations would take too long, even several months). The test for each input matrix was randomized and repeated 100 times, and the average   AUCs were taken as the test result. If a particular matrix made a more significant contribution to the predictive ability of the method, the permutation test results based on that matrix would be closer to 0.5. As described in Table 3, HRNMFMDA randomizing of the adjacency matrix had the average AUCs much closer to 0.5 than that of the others, indicating that the miRNA-disease association was of more paramount importance. In addition, the disease similarities made more significant contributions than the miRNA similarities for improving predictive power.

C. CASE STUDIES
Next, three kinds of case studies were implemented to further verify the predictive capability of HRNMFMDA for uncovering novel miRNA-disease pairs. The first type of case study included breast neoplasms, utilizing the verified pairs of HMDD V2.0 [51] as the training samples. For every predicted disease, the corresponding unverified miRNAs were ranked in accordance with the predicted scores. After that, the top-ranked 50 candidate miRNAs of the predictive list were checked based on three other prominent databases, dbDEMC [48], miR2Disease [49] and HMDD V3.0 [50]. Finally, we acquired the total frequency of the miRNAs confirmed by one of the three databases or more.
Breast neoplasms are a kind of frequently occurring female carcinoma that leads to high mortality [74]. The treatment is currently-based primarily on the clinical and pathological VOLUME 8, 2020 features of breast cancer. Targeted therapy and personalized therapy are the ultimate goals, which are highly dependent on specific molecular typing [75]. For example, the abnormal increase of miR-22 may promote the occurrence and metastasis of breast cancer and lead to a higher degree of malignancy of the tumor [76]. After the implementation of HRNMFMDA, all of the top-ranked 20 and 49 of the topranked 50 discovered miRNAs were confirmed by more than one external database (see Table 4).
Colon neoplasms are one of the most high-frequency malignancies of the digestive tract today [77]. According to statistics, in 2019, approximately 101,420 new cases and 51,020 deaths would occur in the United States [78]. Clinical experimental evidence supported that several miRNAs (miR-19a, 98, 146b, 186, 331-5p, 625) were significantly different in patients with rectal adenoma compared to the control group, which could be regarded as potential biomarkers for screening tests [79]. Overall, implementing HRNMFMDA on colon neoplasms confirmed 19 of the top-ranked 20 and 49 of the top 50 potential candidates (see Supplementary Table S1).
Esophageal cancers are one of the fastest-growing carcinomas in the Western world [80]. It is estimated that in 2019, 17,650 esophageal cancer cases will be newly diagnosed, of which 16,080 are deaths in the United States [78]. Several miRNAs have been discovered as biomarkers of esophageal neoplasms. For instance, miR-29c was downregulated in tumor and serum samples from esophageal neoplasm patients, which could be regarded as a promising non-invasive biomarker [81]. After implementing HRNMFMDA on esophageal neoplasms, all of the topranked 20 and 49 out of the top-ranked 50 predicted results were verified through one of the external databases or more (see Supplementary Table S2).

IV. DISCUSSIONS
In our paper, we proposed the Hessian Regularized NMF Method (HRNMFMDA) for revealing uncovered diseaseassociated miRNAs. In this model, the experimentally discovered miRNA-disease associations, the integrated similarities of miRNAs, and the integrated similarities of disease were merged by the Hessian regularized NMF framework with group LASSO and approximate orthogonal constraints. The proposed model achieved AUC values of 0.9074 and 0.8618 in global and local LOOCV, respectively, which were superior to those of several methods previously reported in the literature. As far as we know, HRNMFMDA is one of the rare models whose AUC of global LOOCV was higher than 0.9. Furthermore, the model achieved 0.9044+/-0.0080 in 5-fold cross-validation. Moreover, a considerable part of the highest-rated potential related miRNAs was supported by external biological datasets in the case studies. In conclusion, the model can integrate multiple biological data through Hessian regularized NMF effectively for identifying uncovered miRNA-disease pairs.
Reasons for the high predictive capability of the proposed model can be summarized as follows: First, as a semisupervised method, HRNMFMDA was not built upon negative samples. Therefore, it may reduce errors caused by generating negative samples randomly. Second, it utilizes the l 2,1 -norm constraint and approximate orthogonal constraint to select the essential features and ensure the group sparsity of the coding matrix [65], so the influence of noise data can be weakened to obtain relatively reliable prediction results. Third, it can capture the intrinsic manifold structure of data better by introducing Hessian regularization in the NMF framework because the second-order Hessian energy can make the function value change linearly with the geodesic distance [60], which is a universal improvement that deserves more attention.
However, there are still some shortcomings to be solved in our method. The first insufficiency is that the data source we used in HRNMFMDA may contain noise information. Unfortunately, the error function used in the model was based on the least squares method, which is commonly considered not to be stable enough when dealing with noisy data. Furthermore, it was troublesome to acquire the optimized parameter set of the model. Due to laboratory limitations, we were unable to perform wet experiments to verify the predicted results of the predictive calculation model. Additionally, the calculation of miRNA and disease similarity proposed in the method section of this paper may not be the ideal method. We hope to construct more methods for integrating similarity information more rationally in the future.