MISSIM: An Incremental Learning-Based Model With Applications to the Prediction of miRNA-Disease Association

In the past few years, prediction models have shown remarkable performance in most biological association prediction tasks. These tasks traditionally use a fixed dataset, and the model, once trained, is deployed as is. Such models often encounter training issues such as sensitivity to hyperparameter tuning and “catastrophic forgetting” when new data are added. However, with the development of biomedicine and the accumulation of biological data, new predictive models are required to adapt to change. To this end, we propose a computational approach based on the Broad Learning System (BLS) to predict potential disease-associated miRNAs, which retains the ability to distinguish previously learned associations when it must adapt to new data. In particular, we introduce incremental learning to the field of biological association prediction for the first time and propose a new method for quantifying sequence similarity. In the performance evaluation, the AUC in 5-fold cross-validation was 0.9400 +/- 0.0041. To better assess the effectiveness of MISSIM, we compared it with various classifiers and previous prediction models, and its performance was superior to that of the previous methods. In addition, case studies on identifying miRNAs associated with breast neoplasms, lung neoplasms and esophageal neoplasms show that 34, 36 and 35 out of the top 40 associations predicted by MISSIM are confirmed by recent biomedical resources. These results provide convincing evidence that this approach has potential value for promoting biomedical research productivity.


INTRODUCTION
MicroRNAs (miRNAs) regulate gene expression in physiological processes such as apoptosis and cell differentiation through complementary base pairing with messenger RNA (mRNA) [1], [2], [3]. Lin-4 and let-7 are miRNAs that were characterized within the past 20 years [4], [5]. Since then, the number of miRNAs discovered by various biological experimental methods has grown quickly [6]. Furthermore, abundant experimental studies have shown that miRNAs are closely related to human diseases. Exploring how miRNAs influence diseases will help transform models of diagnosis and treatment. For instance, the combination of miR-211 and TGFbeta R2 accelerated the development of head and neck cancer [7]. By targeting c-Met, miR-340 inhibited the migration and invasion of breast cancer cells [8]. Gao et al. found that miRNAs are dysregulated at an early stage by studying the expression changes of disease-related miRNAs in the early phase of HBV-associated hepatocarcinogenesis [9]. Their results also indicated that miR-145 is a candidate tumor-suppressor miRNA that may play an important role in the development of HCC [10]. However, progress may be hindered by the high cost, long cycles and noise sensitivity of experiments. Finding a more reliable miRNA-disease association prediction method has therefore become an important research topic.
In the past five years, traditional prediction models have been proposed to solve biological problems [11], [12], [13], [14], [15], [16], [17], [18], [19]. They are based primarily on similarity or on machine learning [20]. Xu et al. built a miRNA prioritization approach [21], in which potential associations were identified from target-miRNA interactions and known disease genes. Liu et al. predicted miRNA-disease associations with a heterogeneous network [22]. Later, Zeng et al. proposed a method that uses social network analysis to predict the relationships between miRNAs and diseases [23]. Zou et al. predicted disease-specific miRNAs with supervised machine learning [24], using a bootstrap aggregating algorithm to train a biased SVM classifier.
In traditional prediction models, all training data are presented to the classifier at once [25], [26], [27], [28], [29], [30]. However, because the miRNA regulation mechanism has not been thoroughly explored, biological information can hardly be acquired all at once; instead, it is gradually collected in databases. Therefore, incremental learning is of great value. In this work, we propose a prediction model called MISSIM to address the problem of learning from such incrementally available data in biological association prediction. Another innovation of the proposed method is an algorithm for quantifying sequence similarity. Specifically, we first obtain the similarity between miRNAs and between diseases from miRNA functional data and disease semantic data. Second, the feature information of each miRNA sequence is extracted with the Chaos Game Representation (CGR) technique [31], and we compute the similarity between every pair of miRNAs by Pearson's correlation to build the miRNA sequence similarity matrix. Third, we construct a feature descriptor that combines the sequence and association similarity matrices. Finally, the processed feature vectors are fed into the broad learning system classifier to obtain potential miRNA-disease associations. To assess the performance of MISSIM on the HMDD v3.0 data set [32], we computed the AUC of 5-fold cross-validation (0.9400 +/- 0.0041). Moreover, we verified MISSIM on three diseases: Breast Neoplasms, Lung Neoplasms and Esophageal Neoplasms. As a result, 34, 36 and 35 of the top 40 predicted miRNAs, respectively, were verified by other association databases. These results provide convincing evidence of the effectiveness of the method. Fig. 1 shows the workflow of the proposed method.

Evaluation Criteria
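Accuracy (Acc.), sensitivity (Sen.), precision (Pre.) and F1 score are used to assess the performance of MISSIM. They are defined by:
$$Acc. = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$Sen. = \frac{TP}{TP + FN} \tag{2}$$
$$Pre. = \frac{TP}{TP + FP} \tag{3}$$
$$F1 = \frac{2 \times Pre. \times Sen.}{Pre. + Sen.} \tag{4}$$
where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.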
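The four measures can be computed directly from confusion-matrix counts. The following is a minimal sketch, assuming the standard definitions in terms of true/false positives and negatives; the function name and the counts are hypothetical and chosen only to illustrate the arithmetic.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, precision and F1 from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)              # sensitivity, also called recall
    pre = tp / (tp + fp)              # precision
    f1 = 2 * pre * sen / (pre + sen)  # harmonic mean of precision and sensitivity
    return {"Acc": acc, "Sen": sen, "Pre": pre, "F1": f1}


# Hypothetical counts, not taken from the experiments reported here.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

In a cross-validation setting, such a function would typically be applied to each fold and the results averaged.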

Performance Evaluation
In the HMDD v3.0 dataset, 1102 miRNAs and 850 diseases form 32281 known miRNA-disease associations collected from 17412 papers. We removed the associations whose miRNA information was judged unreliable according to the public database miRBase [33]. After screening, the positive samples consisted of 32226 miRNA-disease associations, and we randomly selected the same number of pairs from the unverified miRNA-disease pairs as negative samples.
Prediction of miRNA-Disease Association. Fig. 2 shows the performance of MISSIM, which achieved an average AUC of 0.9400 +/- 0.0041. The AUCs of the five cross-validation experiments are 0.9328, 0.9418, 0.9420, 0.9443 and 0.9427, respectively, and the corresponding AUPRs are 0.9319, 0.9376, 0.9375, 0.9402 and 0.9397 (Fig. 3). Table 1 shows the average accuracy, sensitivity, precision and F1 score, which are 0.8685, 0.8871, 0.8556 and 0.8708, respectively. These results indicate that our approach is feasible and reliable and meets expectations, making it a useful tool for predicting potential miRNA-disease associations.
Comparison With Different Classifier Models. The MISSIM model performs well on the HMDD v3.0 database using the BLS classifier. Here, Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF) classifiers are selected for comparison [34], [35], [36]. The results of the four experiments are reported in Table 2. It can be directly observed that the MISSIM model based on the BLS classifier achieves the highest results in all four evaluation criteria, which indicates that MISSIM performs better than the other three classifiers, especially in terms of AUC, which represents the overall performance of the model. The results suggest that the "mapped feature" adopted by the BLS can effectively extract deep features of the data and help to improve the performance of the model.

Case Studies
To further evaluate the effectiveness of MISSIM, we applied it to three human diseases: breast, lung and esophageal neoplasms. The test samples were built from the pairs formed by these three diseases and all possible miRNAs. We checked the top 40 predictions against dbDEMC v2.0 and miR2Disease [45], [46]. Breast neoplasms, which occur in breast tissue, account for about 66 percent of breast disease. Breast cancer is a malignant breast tumor that develops from the uncontrolled growth of abnormal breast cells. Malignant neoplasms can invade and destroy surrounding tissue and spread to other parts of the body. The cause of most malignant breast tumors is unknown; however, a small proportion of them tend to cluster in families. We therefore took breast neoplasms as the first case study to assess the performance of MISSIM. As shown in Table 4, 34 associations were confirmed.
Lung cancer arises from the uncontrolled growth of cells in lung tissue. Lung neoplasms were selected as the second case study. After the candidate miRNAs were sorted by predicted score, the top 40 were validated; of these, 36 were confirmed to be associated with lung neoplasms (see Table 5). As shown in Table 6, 35 of the top 40 miRNAs predicted by the proposed model to be associated with Esophageal Neoplasms were validated.

Data Set
HMDD [47]. In the proposed method, HMDD v3.0 provides the known, experimentally verified human miRNA-disease associations. The experimental data can be downloaded from the homepage of the dataset, http://www.cuilab.cn/hmdd. After preprocessing, 32226 miRNA-disease associations were obtained, involving 1057 miRNAs and 850 diseases.
miRBase [48]. The database provides comprehensive data on miRNAs, including miRNA sequence annotation, predicted gene targets and other information. In this work, the miRNA sequence information was downloaded from the homepage of miRBase (http://www.mirbase.org).

miRNA Functional Similarity
Wang et al. developed a method for computing functional similarity scores between miRNAs, based on the assumption that phenotypically similar diseases tend to be associated with functionally similar miRNAs, and made the data available at www.cuilab.cn/files/images/cuilab/misim.zip [49], [50], [51], [52], [53]. We downloaded these data and constructed a $495 \times 495$ matrix $FS$, in which the entry $FS(m(a), m(b))$ is the functional similarity between miRNAs $m(a)$ and $m(b)$. These data are only used in the case studies.

Disease Semantic Similarity
Disease Semantic Similarity Model 1. We downloaded the disease semantic information from the MeSH database (https://www.nlm.nih.gov/). In this system, a Directed Acyclic Graph (DAG) describes the relationships between diseases: each directed edge connects two disease nodes and points from the parent node to the child node. We defined disease $D$ as $DAG_D = (D, T_D, E_D)$, where $T_D$ is the node set consisting of disease $D$ and its ancestor nodes, and $E_D$ is the set of the corresponding edges [49]. Xuan et al. offered a method to compute disease semantic similarity from MeSH disease descriptors [54]. In particular, the semantic contribution of a disease term $t$ to disease $D$ is described as follows:
$$D_D(t) = \begin{cases} 1, & t = D \\ \max\{\Delta \cdot D_D(t') \mid t' \in \text{children of } t\}, & t \neq D \end{cases} \tag{5}$$
where $\Delta$ is the semantic contribution coefficient. According to the semantic contribution, the semantic value $DV(D)$ of disease $D$ can be described as follows:
$$DV(D) = \sum_{t \in T_D} D_D(t) \tag{6}$$
If the DAGs of diseases $d(i)$ and $d(j)$ share a larger part, the two diseases are more semantically similar. According to this assumption, the semantic similarity is defined as follows:
$$Sim1(d(i), d(j)) = \frac{\sum_{t \in T_{d(i)} \cap T_{d(j)}} \left( D_{d(i)}(t) + D_{d(j)}(t) \right)}{DV(d(i)) + DV(d(j))} \tag{7}$$
$Sim1$ is the disease semantic similarity matrix, which has 850 rows and 850 columns. The element $Sim1(d(i), d(j))$ is the semantic similarity of $d(i)$ and $d(j)$.
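The DAG-based similarity of Model 1 can be illustrated on a toy hierarchy. This is a sketch under stated assumptions: the decay factor is set to 0.5 (its value is not given in the text), the four-node hierarchy is hypothetical, and the function names are illustrative only. Each ancestor receives the largest contribution reachable through its children, the semantic value sums the contributions over the DAG of a disease, and the similarity divides the contributions of shared nodes by the two semantic values.

```python
DELTA = 0.5  # assumed semantic contribution coefficient (not stated in the text)

# Hypothetical toy hierarchy: child -> list of parents.
PARENTS = {
    "root": [],
    "neoplasms": ["root"],
    "breast_neoplasms": ["neoplasms"],
    "lung_neoplasms": ["neoplasms"],
}


def contributions(disease, parents=PARENTS, delta=DELTA):
    """Semantic contribution of every node in the DAG of `disease`."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents[node]:
            value = delta * contrib[node]
            # Keep the maximum over all child paths leading to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_value(disease):
    """Sum of the contributions of all nodes in the DAG of `disease`."""
    return sum(contributions(disease).values())


def sim1(d_i, d_j):
    """Shared contributions divided by the sum of the two semantic values."""
    c_i, c_j = contributions(d_i), contributions(d_j)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (semantic_value(d_i) + semantic_value(d_j))


similarity = sim1("breast_neoplasms", "lung_neoplasms")
```

In this toy case the two diseases share the nodes `neoplasms` and `root`, so the numerator is 1.5 and the denominator is 3.5.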
Disease Semantic Similarity Model 2. The effectiveness of the prediction model can be improved by retaining the specificity of disease terms. Because information content effectively measures the specificity of a disease term, we applied it to the common ancestor nodes and the closest leaf nodes. First, the information content of every disease term is computed as the negative log probability of the term, and the information content of disease term $t$ is defined as follows [54]: Next, the semantic similarity between diseases $d(i)$ and $d(j)$ is computed as follows: where $DV(d(i))$ and $DV(d(j))$ are the semantic values of $d(i)$ and $d(j)$, computed in the same way as in formula (6).

Gaussian Interaction Profile Kernel Similarity
Gaussian Interaction Profile Kernel Similarity for Diseases.
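According to previous studies [55], we construct a binary vector $IP(d(a))$, the interaction profile of disease $d(a)$, by marking the miRNAs that are associated with $d(a)$. The kernel similarity $KD(d(a), d(b))$ between diseases $d(a)$ and $d(b)$ is defined as follows:
$$KD(d(a), d(b)) = \exp\left(-\gamma_d \, \lVert IP(d(a)) - IP(d(b)) \rVert^2\right)$$
where the parameter $\gamma_d$ is the coefficient of the kernel bandwidth and $nd$ is the number of rows of the adjacency matrix $A$. $\gamma_d$ is defined as follows:
$$\gamma_d = \gamma'_d \Big/ \left( \frac{1}{nd} \sum_{k=1}^{nd} \lVert IP(d(k)) \rVert^2 \right)$$
where $\gamma'_d$ is the original bandwidth.
Gaussian Interaction Profile Kernel Similarity for miRNAs. The kernel similarity $KM(m(a), m(b))$ between miRNAs is defined in the same way, where a column vector of the adjacency matrix $A$ is taken as $IP(m(a))$ or $IP(m(b))$ and $nm$ is the number of columns of $A$.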
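A Gaussian interaction profile kernel can be computed for diseases and miRNAs from the same adjacency matrix. The following is a minimal sketch, assuming the usual normalisation of the bandwidth by the mean squared norm of the profiles and an original bandwidth of 1 (an assumption, since no value is given here); the function name and the small matrix are hypothetical.

```python
import numpy as np


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`.

    Assumes at least one non-zero profile, otherwise the bandwidth
    normalisation would divide by zero.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    gamma = gamma_prime / np.mean(sq_norms)  # normalised bandwidth
    # Squared Euclidean distance between every pair of profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


# Hypothetical adjacency matrix A: rows are diseases, columns are miRNAs.
A = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]])

KD = gip_kernel(A)    # disease kernel, built from the rows of A
KM = gip_kernel(A.T)  # miRNA kernel, built from the columns of A
```

Using one function for both kernels reflects that the two definitions differ only in whether rows or columns of the adjacency matrix serve as interaction profiles.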

Integrated Similarity
Integrated Similarity for Diseases. To make full use of $Sim1(d(i), d(j))$, $Sim2(d(i), d(j))$ and $KD(d(a), d(b))$, we built an integrated disease similarity matrix $SD$ that combines the above similarities [56]. The element $SD(d(a), d(b))$ is the integrated similarity between diseases $d(a)$ and $d(b)$. It can be described as follows:
Integrated Similarity for miRNAs. $FS(m(a), m(b))$ and $KM(m(a), m(b))$ were used to build the integrated miRNA similarity:
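One way such an integration is often done in related work is a piecewise rule: use the semantic (or functional) similarity where it is available and fall back to the Gaussian kernel similarity elsewhere. The sketch below illustrates that rule only; it is an assumption and not necessarily the exact combination used by MISSIM. The helper name, the averaging of the two semantic models, and the toy matrices are all hypothetical.

```python
import numpy as np


def integrate(primary, kernel, available):
    """Take `primary` where `available` is True, otherwise the kernel value."""
    return np.where(available, primary, kernel)


# Hypothetical 3-disease example; NaN marks a missing semantic similarity.
sim1 = np.array([[1.0, 0.4, np.nan],
                 [0.4, 1.0, np.nan],
                 [np.nan, np.nan, 1.0]])
sim2 = np.array([[1.0, 0.2, np.nan],
                 [0.2, 1.0, np.nan],
                 [np.nan, np.nan, 1.0]])
kd = np.array([[1.0, 0.7, 0.1],
               [0.7, 1.0, 0.3],
               [0.1, 0.3, 1.0]])

semantic = (sim1 + sim2) / 2.0  # assumed: average of the two semantic models
SD = integrate(semantic, kd, ~np.isnan(semantic))
```

The same helper could be applied to miRNAs with the functional similarity as the primary matrix and the miRNA kernel as the fallback.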

Sequence Similarity for miRNAs
In 1990, Jeffrey introduced a mapping method for genomic sequences named Chaos Game Representation (CGR) [57]. CGR is an iterative mapping derived from statistical mechanics, in particular chaos theory, and it maps gene sequences uniquely into two-dimensional space. However, previous studies did not adequately explore the possibility of extracting latent features of a sequence through CGR. We assign the four possible nucleotides in a miRNA sequence to the four vertices of a unit square (Fig. 5). The position $CGR_i$ of each nucleotide can be described as follows:
$$CGR_i = CGR_{i-1} + u \, (g_i - CGR_{i-1})$$
where $g_i$ is the nucleotide coefficient: when the nucleotide is A, C, G or U, the corresponding coefficient is (0, 0), (0, 1), (1, 1) or (1, 0), respectively. According to previous research, the parameter $u$ is set to 0.5. In addition, $i = 1, \ldots, n_G$, where $n_G$ is the length of the miRNA sequence, and $CGR_0 = (0.5, 0.5)$.
Recently, a number of tools have been proposed to analyze DNA, RNA and protein sequences at the sequence level [58], [59], [60], which inspired us. However, we found few ways to uniquely map sequence information into Euclidean space. In this work, building on previous research, we quantified the nonlinear sequence information [28], [30], [61], [62], [63], [64], [65], [66], [67]. The miRNA sequence, which contains a large amount of information, is converted into a numerical vector to represent the characteristics of the miRNA more fully. First, we downloaded the precursor sequences of the required miRNAs from miRBase, because they contain richer epigenetic information. Second, each miRNA sequence is mapped into the CGR space, which is divided into equal areas, and the number of points falling in each area is counted. We used a $2^{n_c} \times 2^{n_c}$ grid to obtain the frequency matrix for oligonucleotides of length $n_c$. The nucleotide frequency matrix in Fig. 6, defined as the chaos game contents, is derived from the CGR drawn in Fig. 5. Third, the chaos game contents shown in Fig. 6 are used as the feature vector describing each miRNA. Finally, the Pearson correlation coefficient between miRNA feature vectors is used as the sequence similarity between miRNAs. We used these similarities to build the sequence similarity matrix ($1057 \times 1057$). Therefore, each miRNA sequence could be described by a 1057-dimensional vector:
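The CGR-based pipeline (map a sequence to points, count points per grid cell, correlate the resulting vectors) can be sketched as follows. This is an illustration under stated assumptions: the conventional vertex assignment A = (0, 0), C = (0, 1), G = (1, 1), U = (1, 0) is used, the step parameter is 0.5, the grid is $2^3 \times 2^3$ (giving 64 features), and the three sequences are made-up strings rather than real miRNAs. All function names are hypothetical.

```python
import numpy as np

# Assumed vertex assignment on the unit square.
VERTICES = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(seq, step=0.5):
    """Iterate CGR_i = CGR_{i-1} + step * (g_i - CGR_{i-1}) from (0.5, 0.5)."""
    point = np.array([0.5, 0.5])
    points = []
    for nucleotide in seq:
        point = point + step * (np.array(VERTICES[nucleotide]) - point)
        points.append(point)
    return np.array(points)


def chaos_game_contents(seq, n_c=3):
    """Count CGR points in each cell of a 2**n_c x 2**n_c grid, flattened."""
    size = 2 ** n_c
    counts = np.zeros((size, size))
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        counts[row, col] += 1
    return counts.ravel()


def sequence_similarity(seqs, n_c=3):
    """Pearson correlation between the chaos game contents of all sequences."""
    features = np.array([chaos_game_contents(s, n_c) for s in seqs])
    return np.corrcoef(features)


# Made-up toy sequences, for illustration only.
toy_sequences = ["ACGUACGUGGCAUUAG", "ACGUACGUGGCAUUAC", "UUGGCCAAUUGGCCAA"]
S = sequence_similarity(toy_sequences)
```

Each row of the resulting matrix can then serve as the sequence-similarity descriptor of one miRNA.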

Broad Learning System
The Broad Learning System (BLS), which is based on the Random Vector Functional Link Neural Network (RVFLNN), effectively eliminates the drawback of an excessively long training process while ensuring excellent generalization ability [37]. The core of BLS is an incremental learning algorithm: because it modifies only a part of the parameter space, it does not affect the global model and can avoid the problem of "catastrophic forgetting" [68]. BLS is a flat network in which the original inputs $A$ are transferred into 'mapped features' and the network is expanded through 'enhancement nodes'. The $i$th mapped feature $F_i$ can be projected as $\phi_i(A W_{ei} + b_{ei})$, and the concatenation of the first $i$ groups of mapped features is denoted as $F^i \equiv [F_1, \ldots, F_i]$. By fine-tuning the initial $W_{ei}$, the model can obtain better features. Meanwhile, the $j$th group of enhancement nodes, $g_j(F^i W_{hj} + b_{hj})$, is denoted as $E_j$, and the first $j$ groups of enhancement nodes are denoted as $E^j \equiv [E_1, \ldots, E_j]$. The weights of the feature maps $W_{ei}$ and of the enhancement nodes $W_{hj}$ are random weights with the proper dimensions, and the biases $b_{ei}$ and $b_{hj}$ are also randomly generated.
Assuming that $B$ is the output matrix and the input data $A$ has $N$ samples with $M$ dimensions, a broad learning system with $n$ feature mappings and $m$ groups of enhancement nodes can be presented as below.
Denote all $n$ groups of feature nodes as $F^n \equiv [F_1, \ldots, F_n]$; then the $j$th group of enhancement nodes ($j = 1, \ldots, m$) can be presented as $E_j = g_j(F^n W_{hj} + b_{hj})$. In this way, the broad learning system can be presented as the equation $B = [F^n \mid E^m] W^m$, where $W^m = [F^n \mid E^m]^{+} B$ and $a^{+} = \lim_{\lambda \to 0} (\lambda I + a^{T} a)^{-1} a^{T}$ according to the ridge regression learning algorithm. Sometimes, the learning result cannot meet expectations. One solution is to insert additional enhancement nodes to obtain better accuracy. In this way, the algorithm only needs to compute the additional enhancement nodes. To generate the additional enhancement nodes, we denote $X^m \equiv [F^n \mid E^m]$, and $X^{m+1}$ can be presented as $X^{m+1} \equiv [X^m \mid g_{m+1}(F^n W_{h_{m+1}} + b_{h_{m+1}})]$. And the pseudoinverse of the new matrix can be deduced as:
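The incremental step can be illustrated with a block pseudoinverse update. This is a sketch under stated assumptions rather than the exact formulation used here: the mapped features are a plain linear map, the enhancement nodes use `tanh`, the data and labels are random placeholders, and the names `H`, `D` and `C` are introduced only for this example. For new nodes $H$ appended to $X$, let $D = X^{+} H$ and $C = H - X D$; when $C$ has full column rank, the pseudoinverse of $[X \mid H]$ is obtained by stacking $X^{+} - D C^{+}$ on top of $C^{+}$, so the existing pseudoinverse is reused instead of being recomputed.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_enhancement_nodes(X, X_pinv, H):
    """Return [X | H] and its pseudoinverse, reusing the pseudoinverse of X.

    Valid when C = H - X X^+ H has full column rank, i.e. the new nodes are
    not linear combinations of the existing ones.
    """
    D = X_pinv @ H
    C = H - X @ D
    C_pinv = np.linalg.pinv(C)
    return np.hstack([X, H]), np.vstack([X_pinv - D @ C_pinv, C_pinv])


# Hypothetical data: N samples, M input dimensions, binary labels B.
N, M = 60, 10
A = rng.normal(size=(N, M))
B = (A[:, :1] > 0).astype(float)

# Mapped features F (linear map, assumed) and enhancement nodes E (tanh, assumed).
W_e, b_e = rng.normal(size=(M, 6)), rng.normal(size=6)
F = A @ W_e + b_e
W_h, b_h = 0.1 * rng.normal(size=(6, 8)), rng.normal(size=8)
E = np.tanh(F @ W_h + b_h)

X = np.hstack([F, E])       # X^m = [F^n | E^m]
X_pinv = np.linalg.pinv(X)  # plays the role of the ridge limit
W = X_pinv @ B              # output weights W^m

# Insert 4 more enhancement nodes without recomputing the full pseudoinverse.
W_new, b_new = 0.1 * rng.normal(size=(6, 4)), rng.normal(size=4)
H = np.tanh(F @ W_new + b_new)
X_next, X_next_pinv = add_enhancement_nodes(X, X_pinv, H)
W_next = X_next_pinv @ B    # updated output weights W^{m+1}
```

Because only `D`, `C` and the pseudoinverse of the small matrix `C` are computed, the cost of adding nodes depends mainly on the number of new nodes rather than on the size of the existing network.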

Overview
The method is based on the hypothesis that functionally similar miRNAs tend to be related to similar diseases; a hypothesis of this kind is also used when calculating the associations between drugs and target proteins. MISSIM is mainly composed of four parts: 1. selecting the positive set and the negative set; 2. combining the feature vectors of miRNAs and diseases; 3. reducing the size of the combined features; 4. building the prediction model to calculate potential associations. Each step is described in detail below. First, we built the training set. Specifically, we extracted the 32226 verified miRNA-disease pairs from HMDD v3.0 as positive samples and combined them with negative samples to construct the training set. The random selection of a negative sample consists of three steps: choosing a disease at random from the 850 diseases; selecting one of the 1057 miRNAs in the same way; and building a negative sample from the miRNA and the disease if the pair is not among the positive samples.
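The three-step negative sampling can be sketched as a simple rejection loop. This is a minimal illustration: pairs are represented as hypothetical `(miRNA index, disease index)` tuples, the function name is illustrative, and rejecting duplicate negatives is an added assumption that the text does not state.

```python
import random


def sample_negatives(positives, n_diseases, n_mirnas, n_samples, seed=0):
    """Draw unique (miRNA, disease) index pairs that are not in `positives`."""
    rng = random.Random(seed)
    positives = set(positives)
    if n_samples > n_diseases * n_mirnas - len(positives):
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < n_samples:
        disease = rng.randrange(n_diseases)  # step 1: pick a disease at random
        mirna = rng.randrange(n_mirnas)      # step 2: pick a miRNA at random
        pair = (mirna, disease)
        if pair not in positives:            # step 3: keep only unverified pairs
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical toy setting: 4 miRNAs, 3 diseases, 3 known associations.
toy_positives = {(0, 0), (1, 2), (3, 1)}
toy_negatives = sample_negatives(toy_positives, n_diseases=3, n_mirnas=4,
                                 n_samples=len(toy_positives))
```

Sampling as many negatives as positives gives a balanced training set, which matches the equal-sized positive and negative sets described above.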
Second, we described the associations as feature vectors. In detail, each disease is represented by its 850-dimensional vector in the integrated similarity matrix $SD$. By the same method, the feature vector of a miRNA in $SM$ can be defined as follows: $SM(m(a)) = (w_1, w_2, w_3, \ldots, w_{1056}, w_{1057})$. Based on the above feature vectors of the disease and the miRNA, the similarity feature vector $F_{sim}$ of each miRNA-disease pair is defined as the 1907-dimensional vector obtained by concatenating them. After that, we reduce $F_{sim}$ from 1907 to 32 dimensions through an autoencoder. Similarly, the feature matrix $F_{seq}$ is reduced from 64 to 32 dimensions in the same way. We defined the final descriptor of each miRNA-disease pair as the 64-dimensional vector formed by the two reduced vectors. Finally, the broad learning system is trained on the final descriptors to obtain a predictive model. If a sample is positive, its label is defined as 1; if it is negative, its label is defined as 0. We then fed the training set into the broad learning system and obtained a model for predicting potential miRNA-disease associations. In our prediction model, the higher the score of a miRNA-disease pair, the more likely the miRNA and the disease are to be associated.
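The dimensions of the similarity descriptor follow from concatenating one miRNA vector and one disease vector. The sketch below only checks those shapes: the similarity matrices are random placeholders, the function name is hypothetical, and the autoencoder reduction to 32 dimensions is not implemented.

```python
import numpy as np

N_MIRNA, N_DISEASE = 1057, 850
rng = np.random.default_rng(0)

# Random placeholders standing in for the integrated similarity matrices.
SM = rng.random((N_MIRNA, N_MIRNA))
SD = rng.random((N_DISEASE, N_DISEASE))


def similarity_descriptor(mirna_idx, disease_idx):
    """Concatenate the miRNA row of SM with the disease row of SD."""
    return np.concatenate([SM[mirna_idx], SD[disease_idx]])


f_sim = similarity_descriptor(mirna_idx=3, disease_idx=7)

# After reduction, two 32-dimensional vectors would form the final descriptor;
# zeros are used here purely as stand-ins for the reduced features.
final_descriptor = np.concatenate([np.zeros(32), np.zeros(32)])
```

The concatenated vector has 1057 + 850 = 1907 entries, and the final descriptor has 32 + 32 = 64.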

CONCLUSION
In this study, we propose a model based on incremental learning, called MISSIM, to predict miRNA-disease associations. This method integrates miRNA sequence information, disease semantic information, and similarity information calculated from miRNA-disease associations. In particular, we introduced incremental learning into the field of biological association prediction for the first time to learn from incrementally available biological data, thus overcoming the problem of "catastrophic forgetting", since modifying a part of the parameter space does not affect the global model. In addition, a new method of quantifying sequences was proposed, which provides a new perspective for the characterization of sequence information. In the performance evaluation, the AUC was 0.9400 +/- 0.0041. To better evaluate the effectiveness of MISSIM, it was compared with various classifiers and previous prediction models, and its performance was superior to that of the previous methods. In addition, the case studies on identifying miRNAs associated with breast neoplasms, lung neoplasms and esophageal neoplasms show that 34, 36 and 35 out of the top 40 associations predicted by MISSIM are confirmed by recent biomedical resources. These results provide convincing evidence that MISSIM can offer researchers useful computational support by supplying large-scale disease-related miRNA candidates, thereby promoting biomedical research productivity and the development of treatments for complex diseases. The next task is to explore how to better characterize biological sequence data in order to obtain better predictive performance.

Yi-Ran Li received the bachelor's degree in electrical engineering and automation from the China University of Mining and Technology, Xuzhou, China, in 2017. She is currently working toward the master's degree in the School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China. Her current research interests include hyperspectral detection.

ACKNOWLEDGMENTS
Ji-Ren Zhou received the bachelor's degree in civil engineering from the China University of Mining and Technology, Xuzhou, China, in 2019. Currently, he is working toward the master's degree in the Hong Kong University of Science and Technology. His research interests have turned to data mining, machine learning, deep learning, and bioinformatics.
Hai-Tao Zeng received the BS degree from the School of Geomatics, Shandong University of Science and Technology, Qingdao, China, in 2017. He is currently working toward the graduate degree in computer science in the School of Mechanical Electronic and Information Engineering, China University of Mining and Technology, Beijing, China. He is also an intern with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include computer vision and image processing.