MDADP: A Webserver Integrating Database and Prediction Tools for Microbe-Disease Associations

Mounting evidence shows that microbiota play important roles in the life processes of the human body. In recent years, various computational methods have been proposed to identify potentially disease-associated microbes and thereby reduce the cost of traditional biological experiments. However, the prediction performance of these methods is generally limited by outdated and incomplete datasets. Moreover, few studies so far provide visual predictive tools for inferring possible microbe-disease associations (MDAs). Hence, in this manuscript, we propose a novel webserver called MDADP to identify latent MDAs, which combines a new MDA database with interactive prediction tools for MDA studies. For the newly constructed MDA database, 2019 known MDAs between 58 diseases and 703 microbes were first manually collected. Then, eight representative computational models were integrated through the average ranking method and the co-confidence method, respectively, to identify potential disease-related microbes. As a result, MDADP provides not only interactive features for users to access and retrieve MDA entries, but also effective tools for users to identify candidate microbes for different diseases. To our knowledge, MDADP is the first online platform that incorporates a new MDA database with comprehensive MDA prediction tools. Therefore, we believe that it will be a valuable source of information for research in microbiology and disease-related fields. MDADP can be accessed at http://mdadp.leelab2997.cn.


I. INTRODUCTION
MICROORGANISMS in the human body consist mainly of bacteria, archaea, fungi, viruses and protozoa, which usually reside in various organs such as the gastrointestinal tract, respiratory tract, oral cavity, stomach, skin and genitourinary tract [1]. Since microbiota are ubiquitous in the human body, they are also regarded as another vital organ [2]. In recent years, mounting evidence has shown that microbiota play important roles in certain physiological processes, such as improving metabolism, enhancing immunity and maintaining the ecological balance of the body [3], [4]. In addition, they can be central or causative agents of many diseases [5]. For instance, studies have shown that microorganisms are associated with about 20% of human malignancies [6]. With rapid advances in clinical biotechnologies and sequencing technologies, research on the microbiome has grown exponentially, which has led to a mounting number of microbe-disease associations (MDAs) being uncovered [7], [8]. Mining potential MDAs can reveal more useful biomedical information in disease-related areas (e.g., disease-causing genes and drugs) and is expected to provide new strategies for disease diagnosis and treatment [9]. For example, in the field of drug repurposing, it has been hypothesized and verified that drugs used to treat type 2 diabetes can also be used to treat colorectal cancer, owing to the strong microbe correlation between these two diseases [10], [11]. Thus, understanding MDAs may be very useful for the diagnosis and treatment of complex diseases such as gastrointestinal inflammation, diabetes, and even cancer.
However, using traditional wet experimental methods to identify MDAs is quite expensive and time-consuming [12], [13].
In the past few years, with the rapid development of complex network technologies, machine learning and artificial intelligence techniques, computational methods have been successively proposed to infer potential MDAs and thereby reduce the time, labor and cost of traditional biological experiments. For instance, Ma et al. constructed the first human microbe-disease association database (HMDAD) in 2017 based on known microbe-disease associations reported in microbe-related studies; it contains 483 associations between 39 diseases and 292 microbes, selected from 61 publications [9]. Based on the known MDAs in HMDAD, Chen et al. proposed an MDA prediction model called KATZHMDA [14], and since then many computational methods adopting diverse strategies have been developed. However, so far few studies provide visualized MDA prediction tools, which hinders subsequent MDA research. In addition, almost all studies validate their prediction performance on the HMDAD database alone, so the reported performance of these models may be unreliable to some degree. Furthermore, because the number of possible microbe-disease combinations in HMDAD is limited (there are only 39 diseases and 292 microbes), the potentially associated microbe-disease pairs identified by different models differ very little, even when the prediction performances of the models differ significantly, which may severely restrict the applicability of the superior computational models. Finally, since the data in HMDAD are very limited and to some extent obsolete, the potential MDAs recommended by these predictive models may no longer be up to date.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Hence, to address the above-mentioned issues, in this study we designed a novel webserver called MDADP by integrating a new MDA database with effective MDA prediction tools. In MDADP, we first manually collected 2019 known associations between 58 diseases and 703 microbes to construct the new MDA database. Then, by screening and integrating eight representative MDA prediction algorithms, we proposed two different predictive tools, based on the average ranking method (MDADP_AR) and the co-confidence method (MDADP_CC), respectively, to recommend more reliable potential MDAs. To the best of our knowledge, MDADP is the first webserver that integrates a new MDA database and visualized MDA prediction tools. Hence, we believe that it may become a useful tool for future research in microbiology and disease-related fields. The major contributions of this paper are as follows:
• A new MDA database consisting of 2019 known associations between 58 diseases and 703 microbes was constructed, based on which a novel MDA dataset of 1767 non-redundant known associations was built for the identification of potential MDAs.
• By integrating and analyzing eight representative MDA prediction models, two effective identification models were designed to recommend reliable potential MDAs.
• Based on the newly constructed MDA database and identification models, a visualized platform was provided, in which functions including searching, sorting, filtering, visualization, and downloading of MDAs are implemented. To our knowledge, MDADP is the first online platform that provides visualized tools together with a new database for the prediction of potential MDAs.

A. Construction of the New MDA Database
To construct the new MDA database, a series of keywords, including but not limited to "Human", "Microbiome", "Disease", "Microbe", "Neoplasms" and "Cancer", was adopted to search the PubMed database for human microbe-related publications. After preliminary screening of publications that appeared before the start of our study (September 2020), more than 500 candidate publications were extracted. Subsequently, 261 publications were retained after reading their abstracts and results. All information from these selected studies was recorded in tabular files by meticulous manual curation according to the following rules: (1) all diseases are named and classified in a standardized way according to the vocabulary provided by the MeSH database; (2) all microbes are classified according to the NCBI taxonomy; (3) the regulatory relationship between each microbe and disease (positive or negative) is recorded in the new database; (4) information on the experimental methods and samples used in these publications is recorded in the new database. Ultimately, 2019 validated MDAs between 58 different diseases and 703 different microbes were collected from the 261 publications, of which 1012 are positive associations and 1007 are negative associations. In addition, these 58 diseases were classified into 15 categories. Meanwhile, according to the NCBI taxonomy, microbes were further classified into phylum, class, order, family, genus, species, and no rank. The numbers of different disease-associated microbes and different microbe-associated diseases in MDADP are presented in Fig. 1, while the 30 most common diseases and microbes in MDADP are shown in Fig. 2. As can be seen in Fig. 2(a), the most common disease is Type 1 Diabetes Mellitus, with more than 170 microbes associated with it.
Additionally, the second and third most common diseases are Breast Neoplasms and Lung Neoplasms, with 156 and 99 associated microbes, respectively. The other human diseases recorded in MDADP involve different human organs, such as the colon (Colorectal Neoplasms), oral cavity (Mouth Neoplasms), intestine (Irritable Bowel Syndrome), pancreas (Pancreatic Neoplasms), kidney (Kidney Disease), and so on. As can be seen in Fig. 2(b), the most common microbe is Bacteroides, and each of the 30 most common microbes has more than 13 related diseases. These experimentally supported microbe-disease associations not only can serve as a source of data for predictive models but also can inspire bioinformatics researchers to mine more useful

B. MDA Dataset
As described above, the MDADP database contains 2019 validated MDAs between 58 diseases and 703 microbes. After removing duplicated associations, we finally obtained 1767 non-redundant known MDAs. For convenience, we refer to the set of 58 human diseases as $S_D$ and to the i-th disease in $S_D$ as $d_i$; similarly, we refer to the set of 703 microbes as $S_M$ and to the j-th microbe in $S_M$ as $m_j$. A 58 × 703 association matrix $A$ can then be constructed as follows: for any given disease $d_i$ and microbe $m_j$, $A(i, j) = 1$ if and only if there is a known association between them; otherwise, $A(i, j) = 0$. Fig. 3 shows the bipartite network graph consisting of the microbe nodes, disease nodes and edges (associations) in MDADP.
To avoid chance results caused by relying on a single dataset, HMDAD is used during experiments as an alternative data source to verify the reliability of the competitive computational models. Table I shows the comparison between the MDADP and HMDAD datasets, and scatter plots of the interaction distribution for the two datasets are drawn in Fig. 4. Compared with the MDADP dataset, the HMDAD dataset contains less data and is clearly sparser.
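The construction of the association matrix can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the record layout `(disease, microbe, direction)`, the function name `build_association_matrix`, and the assumption that duplicates are records sharing the same disease-microbe pair are all hypothetical, and the toy records are placeholders rather than entries from MDADP.

```python
def build_association_matrix(records):
    """Return (diseases, microbes, A) from (disease, microbe, direction) records.

    Records that share a (disease, microbe) pair collapse into one entry,
    which is one plausible reading of "removing duplicated associations".
    """
    pairs = {(d, m) for d, m, _ in records}      # de-duplicate pairs
    diseases = sorted({d for d, _ in pairs})     # row order of A
    microbes = sorted({m for _, m in pairs})     # column order of A
    row = {d: i for i, d in enumerate(diseases)}
    col = {m: j for j, m in enumerate(microbes)}
    A = [[0] * len(microbes) for _ in diseases]
    for d, m in pairs:
        A[row[d]][col[m]] = 1                    # known association
    return diseases, microbes, A


# Toy records (hypothetical placeholders, not entries from MDADP).
records = [
    ("disease_1", "microbe_a", "positive"),
    ("disease_1", "microbe_a", "negative"),  # same pair from another study
    ("disease_1", "microbe_b", "positive"),
    ("disease_2", "microbe_c", "negative"),
]
diseases, microbes, A = build_association_matrix(records)
```

With the real data, the same procedure would yield a 58 × 703 matrix containing 1767 ones.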

C. Construction of Effective Models for Potential MDA Prediction
Recently, researchers have developed a variety of sophisticated MDA prediction models based on the HMDAD database. Zhao et al. classified current MDA computational models into four types, namely score function-based models, network algorithm-based models, machine learning-based models and experimental analysis-based models [13]. These models aim to identify potential MDAs by adopting various computational algorithms and then recommend the top-k candidate MDAs. Although considerable success has been achieved, the performance and applicability of these models on different databases still deserve further investigation.
In this section, we first select eight state-of-the-art MDA computational models and compare their performances on MDADP. After comprehensive consideration of the predictive performance, algorithm characteristics, code availability and reproducibility of existing state-of-the-art models, eight models are selected as our final competitive models: two score function-based models (BWNMHMDA [17], KATZHMDA [14]), four network algorithm-based models (BiRWHMDA [17], NBLPHMDA [18], BiWMP [19] and HMDA-Pred [20]) and two machine learning-based models (LRLSHMDA [21], BPNNHMDA [22]). For the sake of fairness, when potential similarities between diseases and microbes are calculated, all parameters of these competitive models are set to the default values given by their proposers. For convenience, we use the sequence numbers 1 to 8 to denote BWNMHMDA, KATZHMDA, BiRWHMDA, NBLPHMDA, BiWMP, LRLSHMDA, BPNNHMDA and HMDA-Pred, respectively. We define the 58 × 703 predicted score matrix obtained by the k-th competitive model as $S_k$, so that $S_k(i, j)$ is the predicted score of the potential association between disease $d_i$ and microbe $m_j$ obtained by the k-th model. By ranking all predicted scores in $S_k$ disease by disease, we obtain a ranking matrix $R_k$, where $R_k(i, j)$ denotes the rank of microbe $m_j$ among all candidate microbes for disease $d_i$. For a given unknown microbe-disease pair, the smaller its ranking value, the higher its potential relevance.
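The step from a score matrix to a ranking matrix can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `rank_matrix` is hypothetical, every microbe in a row is treated as a candidate (the source does not say whether known associations are excluded before ranking), and ties are broken by column index, which the source does not specify.

```python
def rank_matrix(S):
    """Rank the microbes of each disease by descending predicted score.

    R[i][j] is the 1-based position of microbe j among the microbes of
    disease i; rank 1 corresponds to the highest score. Ties are broken
    by column index (an assumption made for this sketch).
    """
    R = []
    for scores in S:
        order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
        ranks = [0] * len(scores)
        for position, j in enumerate(order, start=1):
            ranks[j] = position
        R.append(ranks)
    return R


# Toy score matrix for 2 diseases and 3 microbes (illustrative values only).
S_k = [[0.9, 0.1, 0.5],
       [0.2, 0.2, 0.7]]
R_k = rank_matrix(S_k)
```

Working on ranks rather than raw scores makes the outputs of different models comparable even when their score scales differ.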
For any given candidate MDA, the predicted scores obtained by different models may vary greatly, so simply summing the predicted scores of all competitive models is unlikely to improve predictive performance. Hence, in this section, we adopt the following two strategies to combine the eight representative models into two novel predictive models, called MDADP_AR and MDADP_CC, respectively:
(1) Average Ranking strategy based Model (named MDADP_AR): In MDADP_AR, the top K competitive models with the best predictive performances are first selected according to experimental results. Then, letting $M$ denote the set of sequence numbers of these K selected models, an average ranking matrix $R_{av}$ is obtained according to (1):
$$R_{av}(i, j) = \frac{1}{K} \sum_{k \in M} R_k(i, j) \quad (1)$$
The parameter K has a key impact on the predictive performance of MDADP_AR. According to experimental results, MDADP_AR achieves the best predictive performance when K is set to 3, and the top 3 competitive models with the best predictive performances are BWNMHMDA, NBLPHMDA and HMDA-Pred.
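The averaging step of MDADP_AR can be sketched as below. This is an illustration rather than the authors' implementation: the function name `average_ranking` and the dictionary layout keyed by model sequence number are assumptions, and the toy rankings are invented. The keys 1, 4 and 8 follow the sequence numbers given earlier for BWNMHMDA, NBLPHMDA and HMDA-Pred.

```python
def average_ranking(rank_matrices, selected):
    """Average the ranking matrices of the selected models element-wise.

    rank_matrices maps a model sequence number to its ranking matrix;
    selected lists the sequence numbers of the K chosen models.
    """
    K = len(selected)
    first = rank_matrices[selected[0]]
    n_rows, n_cols = len(first), len(first[0])
    return [[sum(rank_matrices[k][i][j] for k in selected) / K
             for j in range(n_cols)]
            for i in range(n_rows)]


# Toy rankings for one disease and three microbes (illustrative values only).
rank_matrices = {
    1: [[1, 2, 3]],   # BWNMHMDA
    4: [[2, 1, 3]],   # NBLPHMDA
    8: [[1, 3, 2]],   # HMDA-Pred
}
R_av = average_ranking(rank_matrices, selected=[1, 4, 8])
```

A smaller average rank indicates a more strongly recommended microbe, so candidates would be sorted in ascending order of the averaged value.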
(2) Co-confidence strategy based Model (named MDADP_CC): In MDADP_CC, for a candidate MDA between any given disease $d_i$ and microbe $m_j$, its co-confidence value $R_{cc}(i, j)$ is obtained according to (2), in which the parameter δ acts as a threshold on the rankings. Values of elements in the matrix $R_{cc}$ vary from 1 to 8, and δ determines the strictness of the co-confidence strategy. To ensure a stricter synergy confidence, δ is set to 100 (about 14% of the total number of microbes) in MDADP_CC. For a given unknown microbe-disease pair, the larger its co-confidence value, the higher its potential relevance.
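One plausible reading of the co-confidence strategy is sketched below. Because the defining formula is not reproduced in this text, the sketch rests on an assumption: the co-confidence value of a pair is taken to be the number of models that rank the microbe at or above position δ for the disease. The function name `co_confidence`, the inclusive comparison and the toy rankings are all hypothetical.

```python
def co_confidence(rank_matrices, delta):
    """Count, for each pair, the models that rank it within the top delta.

    Assumption: a model "supports" the pair (i, j) when its rank
    R_k(i, j) <= delta; the co-confidence value is the number of
    supporting models.
    """
    models = list(rank_matrices.values())
    n_rows, n_cols = len(models[0]), len(models[0][0])
    return [[sum(1 for R in models if R[i][j] <= delta)
             for j in range(n_cols)]
            for i in range(n_rows)]


# Toy rankings from three models for one disease and three microbes.
rank_matrices = {
    1: [[1, 2, 3]],
    4: [[2, 1, 3]],
    8: [[1, 3, 2]],
}
R_cc = co_confidence(rank_matrices, delta=2)
```

Under this reading, a smaller δ demands that more models place a microbe near the top before it earns a high value, which matches the description of δ as a strictness parameter.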
Comparing the two strategies, the advantage of MDADP_AR is that it achieves better error tolerance by combining the results of the top K competitive models with the best predictive performances, while the advantage of MDADP_CC is that it allows even models with poor prediction performance to contribute through the co-confidence values of candidate MDAs.
In the MDADP webserver, the above eight representative predictive models and the two strategies for ranking potential MDAs have been integrated so that researchers can query them conveniently.

D. Implementation of the Server
The MDADP webserver adopts a Model-View-Controller architecture and is implemented in Python with the Flask web framework; the platform has been deployed on Alibaba Cloud Elastic Computing Service. Microbe-disease association data are stored in a MySQL database. The webserver implements functions including searching, sorting, filtering, visualization, and downloading of MDAs. MDADP can be accessed at http://mdadp.leelab2997.cn. All MDA datasets curated in this paper can be downloaded at https://github.com/HaoLeextu/MDADP.
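The kind of search, filter and sort query that such a server exposes can be sketched with Python's standard library. This is not the MDADP implementation: the in-memory `sqlite3` database stands in for MySQL so that the example is self-contained, and the table name `mda`, its columns, the function name `search_associations` and the toy rows are all hypothetical.

```python
import sqlite3


def search_associations(conn, disease=None, direction=None):
    """Return (disease, microbe, direction) rows, filtered and sorted.

    Filters are optional; values are passed as query parameters rather
    than being interpolated into the SQL string.
    """
    clauses, params = [], []
    if disease is not None:
        clauses.append("disease = ?")
        params.append(disease)
    if direction is not None:
        clauses.append("direction = ?")
        params.append(direction)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = ("SELECT disease, microbe, direction FROM mda"
           + where + " ORDER BY microbe")
    return conn.execute(sql, params).fetchall()


# In-memory stand-in for the association table (placeholder rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mda (disease TEXT, microbe TEXT, direction TEXT)")
conn.executemany(
    "INSERT INTO mda VALUES (?, ?, ?)",
    [
        ("disease_1", "microbe_b", "positive"),
        ("disease_1", "microbe_a", "positive"),
        ("disease_1", "microbe_c", "negative"),
        ("disease_2", "microbe_a", "negative"),
    ],
)
rows = search_associations(conn, disease="disease_1", direction="positive")
```

In a Model-View-Controller layout, a function of this kind would sit in the model layer, with a controller passing user-supplied filters to it and a view rendering the returned rows.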

A. Assessment of Predictive Performance
Leave-one-out cross validation (LOOCV) and 5-fold cross validation (5-Fold CV) are two widely used frameworks for assessing the performance of predictive models. In this section, we adopt both frameworks to evaluate the above competitive methods on the MDADP dataset and the HMDAD dataset. In LOOCV, all unknown MDAs are considered as candidate samples. Each known MDA is excluded in turn as a test sample, and the remaining known associations are used as training samples. After the computational model is executed, the predicted value of the test sample is ranked against the predicted values of all candidate samples, and a test sample ranked above a given threshold is considered a successful prediction. Different true positive rates (TPR, sensitivity) and false positive rates (FPR, 1-specificity) are obtained under different thresholds. Here, the TPR is the ratio of the number of test samples ranked above a given threshold to the number of known MDAs, while the FPR is the proportion of candidate samples ranked above that threshold. Taking the FPRs and TPRs under different thresholds as the x-axis and y-axis, respectively, the receiver operating characteristic (ROC) curve can be plotted. The area under the curve (AUC) is then used to evaluate the prediction performance: the closer the AUC is to 1, the better the performance. 5-Fold CV is similar to LOOCV but divides the samples differently: all samples are divided equally into 5 parts, and in each round one part is selected as the testing set while the other 4 parts are used as the training set.
Since the division of samples is random, the 5-Fold CV is executed 100 times; the average of all AUCs is taken as the final result, and the standard deviation (STD) is calculated to assess the stability of the prediction model. As another essential evaluation metric, the area under the precision-recall (PR) curve (AUPR) reflects the balance between recall and precision, and is therefore suitable for evaluating the prediction performance of different methods on unbalanced datasets [20]. We plotted the PR curves of the eight methods and calculated their AUPR values by LOOCV.
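The ROC and fold-splitting steps described above can be sketched as follows. This is a simplified illustration, not the evaluation code used in the paper: it assumes the scores of all test samples and all candidate samples have already been pooled into two lists, it sweeps thresholds over scores rather than ranks, and the function names `roc_auc` and `five_fold_splits`, the seed handling and the toy scores are hypothetical.

```python
import random


def roc_auc(test_scores, candidate_scores):
    """Return ROC points and the AUC obtained with the trapezoidal rule.

    TPR: share of test samples scoring at or above the threshold.
    FPR: share of candidate samples scoring at or above the threshold.
    """
    thresholds = sorted(set(test_scores) | set(candidate_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in test_scores) / len(test_scores)
        fpr = sum(s >= t for s in candidate_scores) / len(candidate_scores)
        points.append((fpr, tpr))
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc


def five_fold_splits(n_samples, seed):
    """Shuffle sample indices and deal them into 5 nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::5] for f in range(5)]


# Toy scores (illustrative values only).
points, auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.1])

# One random partition of 1767 known associations into 5 folds.
folds = five_fold_splits(1767, seed=0)
```

Repeating `five_fold_splits` with 100 different seeds and collecting one AUC per repetition would give the values from which the mean and standard deviation are computed.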
ROC and PR curves for the eight candidate methods on the MDADP dataset are illustrated in Figs. 5 and 6, respectively. The three models with the best prediction performances are BWNMHMDA, NBLPHMDA, and HMDA-Pred. To compare how well these candidate methods recover known MDAs, we counted the number of true MDAs among the top 200, 500, 1000, 2000, and 5000 associations predicted by each model in LOOCV, as illustrated in Fig. 7. It can be seen from Fig. 7 that HMDA-Pred and BWNMHMDA achieved better results at all thresholds. NBLPHMDA performed poorly at the threshold of 200 but performed well at the remaining thresholds. Notably, BiRWMP achieved relatively low AUCs but performed relatively well on this metric.
To further verify the performances of these candidate models, we also present ROC curves and AUC values in LOOCV and 5-Fold CV for all candidate methods on the HMDAD dataset. Comparing Figs. 5 and 8, all candidate methods achieve higher AUCs on HMDAD. Among them, the AUCs of BPNNHMDA, BWNMHMDA and NBLPHMDA are higher than those of the other candidate methods. It is noteworthy that KATZHMDA performs poorly on MDADP. The main reason for this disparity is that the prediction results of KATZHMDA are biased towards well-investigated diseases and microbes. For instance, KATZHMDA achieves excellent prediction performance (AUC value of 0.8382) for Type 1 Diabetes Mellitus in HMDAD, since the associations relating to Type 1 Diabetes Mellitus account for more than one-third of the total records in HMDAD (167 of 483 associations). However, this biased advantage does not appear in MDADP, where the data are relatively evenly distributed.
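Counting recovered associations at several cutoffs can be sketched as below. This is an illustration under assumptions: predictions are taken to be a single list of pairs already sorted from most to least confident, and the function name `hits_at_cutoffs` and the toy pairs are hypothetical.

```python
def hits_at_cutoffs(ranked_pairs, known_pairs, cutoffs):
    """Count known associations among the first k predictions for each k."""
    known = set(known_pairs)
    hits = {}
    for k in cutoffs:
        hits[k] = sum(1 for pair in ranked_pairs[:k] if pair in known)
    return hits


# Toy predictions sorted from most to least confident (placeholders).
ranked_pairs = [("d1", "m3"), ("d2", "m1"), ("d1", "m1"),
                ("d2", "m2"), ("d1", "m2")]
known_pairs = [("d1", "m3"), ("d1", "m1"), ("d1", "m2")]
hits = hits_at_cutoffs(ranked_pairs, known_pairs, cutoffs=[1, 3, 5])
```

For the comparison described in the text, the cutoffs would be 200, 500, 1000, 2000 and 5000, computed once per model.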

B. Case Studies of MDADP_AR and MDADP_CC
To further confirm the validity of MDADP in predicting potential MDAs, Alzheimer disease was selected as the case disease. In the experiments, we selected and compared the top 5 microbes associated with the case disease as predicted by MDADP_AR and MDADP_CC. Alzheimer disease is one of the most common neurodegenerative diseases, affecting more than 50 million people worldwide. It is a serious threat to human health and is the fifth major cause of death [23]. Numerous studies have suggested that Alzheimer disease may begin in the gut and is closely linked to dysbiosis of the intestinal flora [24]. The top 5 microbes related to Alzheimer disease predicted by MDADP with the average ranking method and the co-confidence method are shown in Table II. Four of the top five microbes identified by MDADP_AR and MDADP_CC as being associated with Alzheimer disease are confirmed by publications. Guo et al. performed 16S ribosomal RNA sequencing on stool samples from patients newly diagnosed with Alzheimer disease and from healthy controls, and showed that Prevotella was significantly increased at the genus level in AD patients [25]. Ivakhniuk et al. explored the relationship between the composition of the intestinal microflora and Alzheimer disease, and found that patients with Alzheimer disease had significantly decreased Lactobacillus in their gut microflora [26].

IV. DISCUSSION AND FUTURE WORK
The role that microorganisms play in the physiological and pathological states of the human body is currently a hot research topic [27]. A growing number of studies show a close relationship between microbes and human diseases. In recent years, researchers around the world have devoted themselves to studying the complex relationship between microbes and diseases in order to provide new strategies for disease prevention, diagnosis and treatment [24], [28], [29]. Using computational models to uncover MDAs can provide new perspectives on disease mechanisms, for example by contributing to the discovery of disease-causing genes and drug therapies [9]. Recent studies have found high similarities between type 2 diabetes and colorectal cancer, and based on their association with microorganisms, it was hypothesized that both diseases could be treated with the same drug [12]. This hypothesis was successfully tested and validated by biomedical scientists [11], [30]. Hence, a high-performing MDA computational model that leverages similarity features and diverse biomedical data is expected to reduce the time, effort and cost of wet-lab projects by precisely narrowing the search space of potential MDAs for researchers. Therefore, establishing a systematic MDA database and providing reliable MDA prediction tools are of great significance to scientists working in related fields.
In this paper, we present an online platform called MDADP that incorporates a new MDA database with comprehensive MDA prediction tools. In contrast to the recently proposed Disbiome database [31] and the HMDAD database [9], our MDADP database uses structured criteria to organize disease terms. During data collection, complex aliases, extended descriptions (e.g., "new-onset untreated rheumatoid arthritis") and confusion between symptoms and diseases introduce ambiguity into disease nomenclature [12]. This is detrimental not only to database expansion but also to the integration of different databases. We organized disease nomenclature using disease terms from the MeSH database, which makes it possible to retrieve standardized disease terms from different disease repositories in a consistent way. Moreover, microbes in the MDADP database were classified and mapped according to the NCBI taxonomy. This will help designers of predictive models to predict potential microbe-disease associations at different classification levels and thus improve the reliability of predictions.
Certainly, many aspects of the current version of MDADP need to be improved. For example, although the selected MDA computational models achieve high AUCs on the HMDAD dataset, they do not perform well enough on the MDADP dataset. We hope that researchers can design more models with better predictive performance based on the MDADP database in the future. In addition, the parameters of these models were set to the optimal values given by their proposers based on the HMDAD database, and could be optimized on the MDADP database as well. Moreover, these models could introduce more diverse prior information about diseases and microbes, such as disease symptom similarity [32], disease semantic similarity [33], [34], and microbe functional similarity [35]. We observed that some researchers have recently proposed models for predicting MDAs based on graph convolutional neural networks (NinimHMDA [36]), which have achieved impressive prediction performances. In the future, we will introduce more excellent computational models such as NinimHMDA into MDADP, which will be particularly beneficial for MDADP_CC in inferring potential MDAs with higher co-confidence. Furthermore, we will keep expanding both the diseases and their associated microbes in the MDADP database. With the expansion of MDA entries and prediction methods, MDADP can rank and select potential microbe-disease pairs on a larger scale for validation experiments in downstream laboratories.
Since MDADP is a systematic online platform integrating a database and prediction tools for MDAs, we believe it will be a valuable source of information for scientists in microbiology and disease-related fields. Furthermore, MDADP promises to be a useful and effective platform in biomedical research, which may inspire bioscientists and computational scientists to form new research hypotheses about microbe-disease interactions, refine existing experimental methods, and validate their conclusions.