Explainable Drug Repurposing Approach From Biased Random Walks

Drug repurposing is a highly active research area, aiming at finding novel uses for drugs that have been previously developed for other therapeutic purposes. Despite the flourishing of methodologies, success is still partial, and different approaches offer, each, peculiar advantages. In this composite landscape, we present a novel methodology focusing on an efficient mathematical procedure based on gene similarity scores and biased random walks which rely on robust drug-gene-disease association data sets. The recommendation mechanism is further unveiled by means of the Markov chain underlying the random walk process, hence providing explainability about how findings are suggested. Performances evaluation and the analysis of a case study on rheumatoid arthritis show that our approach is accurate in providing useful recommendations and is computationally efficient, compared to the state of the art of drug repurposing approaches.


INTRODUCTION
D RUG repurposing (DR) is emerging as an essential and potentially valuable undertaking to rapidly exploit existing and tested drugs for new uses, such as emerging and neglected diseases, as well as an alternative and convenient choice as opposed to the de novo drugs development. However, in the exemplary case of the ongoing COVID-19 pandemic, despite the high number of DR attempts (a nonexhaustive PubMed search performed on January 2022 reported more than 900 results using the keywords "covid" and "drug repurposing"), the effectiveness of DR still appeared to be low: out of over 400 drugs tested, just a few, and precisely four of them, were definitively shown to be effective, i.e., graded "A" (i.e., established effectiveness; endorsement by professional societies) [1], [2], [3]. The reported success rate of $1% is certainly not satisfactory [4] and calls for better use of the increasingly robust data and for the development of more efficient methods for DR algorithms and processes capable of making the most out of previous knowledge.
DR approaches fuse well also with the concept of Synergistic Drug Combinations, where the aim is to find two or more active pharmaceutical ingredients that co-op well to target multiple conditions (e.g., [5] and [6]).
In what follows, we explain some prominent examples of current DR processes representing the starting point of our study, and we highlight some salient related limitations.

Brief Exploration of Previous Relevant Works
The pioneering work of Keiser et al. [7] introduced the possibility to infer by computational means novel drugs for neglected diseases, and suggested general computational DR systems. Since then, various sorts of computational approaches exploiting different databases and different similarity criteria have been designed to tackle the problem. In what follows we briefly report, with no claim of completeness, some of the studies that are most relevant for the work here presented (see e.g., [8] for a more in-depth review).
Luo et al. [9] exploited the concept of similarity of drugs (chemical-based) and diseases (MeSH-based [10]) to build two networks, then they linked them together exploiting data from the databases DrugBank [11] and Online Mendelian Inheritance in Man (OMIM) [12], to generate 1933 drugdisease associations among 593 drugs and 313 diseases. Finally, a random walk approach was implemented simultaneously on the two similarity networks to rank drug-disease associations and propose predictions.
Nam et al. [13] proposed a method involving three steps: (i) drug network reconstruction using drug-target protein associations, (ii) network reinforcement, i.e., a machine learning approach providing information augmentation by using drug-drug interaction knowledge, and (iii) a recommendation step via the generation of a score computed by a graph-based semi-supervised learning procedure. The authors identified and recommended 11 novel drugs for the case study of vascular dementia through their approach.
Ozsoy et al. [14] carried out DR as a recommendation process in three steps: similarity evaluation, neighbour and disease selection, actually implementing a collaborative filtering with the possibility to integrate multiple data sources and multiple features. The method produced recommendations based on the similarity and overlap between symptoms of the diseases and the effectiveness of the drugs, hence showing better performances compared to other methods available in the literature.

Limitations and Issues With Available Data
DR approaches exploit previous knowledge, data cross-linking and associations via database interoperability to infer de novo predictions and testable hypotheses. Such approaches leverage on many different large datasets developed during the last decades, when the global knowledge on drugs and diseases properties has considerably increased [15]. However, a fast growth like this often generates an inconsistency in nomenclatures, resulting in a persistent difficulty in fusing data coming from different sources [16].
Literature reports many attempts in setting up actual standards (e.g., the latest World Health Organization's ICD-11 [17]), however, different usages imply different requirements, not always addressed by all evolving standards. As such, for our purposes, we refer to the common practice in the scientific community operating on DR, that currently widely employs only a few de facto leading open-access reference sources and standards (i.e., the DrugBank knowledge base for drug-target associations [11] and the DisGeNet discovery platform containing one of the largest publicly available collections of genes and variants associated to human diseases [18]). We refer to more pragmatic reviews like [8] for an in-depth analysis of the available datasets.

Overview of the Proposed Approach
Building on previous works and with the aim of enhancing reliability and capabilities of preceding attempts, we here propose a Markov process-based similarity approach that exploits available data in the form of a knowledge graph [19] such as experimental drug-gene interactions, disease-gene associations, and drug-disease pharmacological indications.
Our approach consists of five distinctive features: (i) a careful data selection and nomenclature mapping, (ii) a database bipartite graph design, (iii) a BLAS-based data structure [20], (iv) an Ergodic Markov Process representation of the DR system, and (v) an explainable output.
A careful selection of the terms grants maximum consistency and therefore minimal loss of information, in particular when data sets are joined together. This also considerably reduces the effort spent manually curating the data sets. For this reason, we chose two de facto standards: cas-number (Chemical Abstracts Service Reference Number) for drugs [21] as provided by DrugBank, and the UMLS (Unified Medical Language System) identifier for diseases [22]. We converted all the involved entities from the various data sets through different dictionaries provided by MalaCards [23], OMIM [12], DrugBank [11], and many others and we assigned each drug/disease a unique integer identifier (used for evaluation purposes). This choice enables us to employ two widely used, highly curated, and up-to-date data sets -DrugBank [11] and DisGeNET [18] -granting nomenclature standardisation and interoperability.
The savvy mathematical formulation of the data sets as bipartite graphs allows the natural construction of useful entropy-inspired similarity measures that lays the foundations for our knowledge graph (see e.g., [24] where a similar approach is used to create a collaborative filtering for miRNA-disease associations).
We embed our graph structure in the BLAS environment, where structures are stored as sparse (Boolean) matrices. This solution yields fast and reliable performances based on matrix-matrix operation, like the ability to represent connections between entities as a sequence of multiplications.
We used the resulting knowledge graph with normalised connections to generate a Markov-process-based DR system. The underlying mathematical structure enforces the usage of the notion of ergodicity, therefore providing a mathematical proof of the stability of the recommendations with respect to small changes in the input data, see Section 3.2.
Finally, recommendations provided on the basis of a biased random walk make them self-explainable. Explainability in Artificial Intelligence methods (see [25]) like recommendation systems is particularly useful since it allows, e.g., to interpret the results and provide practical hints in testing, both features strongly demanded in recent trends.
We found that all the above-mentioned characteristics are crucial to providing effective, fast, usable, and reliable results for DR.
The remainder of this paper is organised as follows. In Section 2 we describe the databases and datasets used, their cross-mappings, the related necessary integration, and their pre-processing. Section 3 introduces the approach used to exploit such data (Section 3.1), how to build the recommender system and how it behaves (Section 3.2). In Section 4 we discuss parameters tuning (Section 4.1) and compare the methodology's performances with four existing similar methods (Section 4.2). We provide further results in Section 5, by analysing a specific DR case study, that of rheumatoid arthritis (RA), a chronic autoimmune disease with complex aetiology and no cure to date. Finally, Section 6 provides a summary of the contribution and possible further developments.

MATERIALS
The present DR approach makes use of several datasets and data sources for the reconstruction of a knowledge graph on which the recommendation system is built.
As presented in Section 1.2, we extracted drugs and diseases information from the two de facto standards DrugBank (DB) [11] and DisGeNET (DGN) [18] respectively. This helped us in finding many (five) different datasets to obtain drug-diseases association, namely: MalaCards (MC) [23], RepoDB (RDB) [26], iDrug (ID) [27], as well as data from Li and Lu article (referred to as LL) [28], and data from the Similarity-based LArge-margin learning of Multiple Sources DR framework (SLAMS) [29]. Since the latter four (described in Section 2.2) were employed by other DR published methodologies, we found them suitable to be used as benchmark datasets (see Section 4.2). On the contrary, we used the first three (described in Section 2.1) to build and tune our methodology (see Section 3). Table 1 summarises the statistics of the seven data sets reporting the number of vertices and edges of the underlying graph while in the following we briefly account them. Their usage is summarised in Fig. 1.

Main Data Sets and Data Sources
DB [11] DrugBank is a pharmaceutical knowledge base that enables major advances across the data-driven medicine industry. It is provided as a drug-oriented XML, where each drug is labelled by a DrugBank unique identifier DB (DBxxxxx). We used it to extract druggene relations (target gene polypeptides), drug names (and cas-number), and their market status (approved, illicit, experimental, ...) to provide further filtering on the final recommendations. We were able to extract 7,262 cas-number drugs (out of 13,563) connected to 4,118 Genes through 19,792 connections (from 1 to 305 connections per drug). DGN [18] DisGeNET is a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases. It is provided as an SQLite DB built upon many domain, typological and associative tables. We used it to extract the relations between diseases and target genes, along with diseases' names and UMLS identifiers. We consider 30,170 Diseases (out of 30,293) connected to 21,671 (out of 26,137) genes through 1,135,045 connections (from 1 to 10,161 connections per disease). MC [23] MalaCards is an integrated database of human maladies and their annotations, modelled on the architecture and richness of the popular GeneCards database of human genes. It provides the relations between drugs and diseases, already expressed as relations between cas-number for drugs and UMLS identifier for diseases. We also used it as a source of dictionaries for converting the test data sets. We filtered the entries, obtaining a total of 2088 cas-number (out of 12,240) and 7,009 UMLS identifiers (out of 31,642) connected via 495,060 links. The diagram in Fig. 2 shows the links between the records of the data bases. We represented relations as highly sparse Boolean matrices and records as unit sparse vectors of the corresponding size.
As described, each data set is mainly used to extract the relations between two different kinds of entities; hence it can be interpreted as a bipartite graph G ¼ ðV ; E Þ where nodes V represents entities (drugs, diseases, or genes) and edges E enforces connections among them. For ease of notation, we denote each graph with the acronym of the corresponding data set, meaning we build the recommendation system through three of them: MC relations The graphs, stored as sparse Boolean weighted adjacency matrices, are built by multiplying the matrices in Fig. 2.

Databases for Test and Performance Comparison
RDB [26] The RepoDB recommender system provides data set extracted from DrugCentral [30] and Clinical-Trials [31] originally thought as a benchmark database for drug repurposing systems testing. It counts 10,563 connections between 1,519 unique approved drugs (in DB format) and 1,229 unique diseases (in UMLS format). We filtered these data on the drugs and diseases  Table 3 and Fig. 5) and by a manual expert curation in the specific case of Rheumatoid Arthritis, hence obtaining promising recommendations (see also Table 4 and Fig. 6).  [28]. After our filtering, we obtained 1,420 useful links between 242 drugs and 194 diseases, 77% of which are not in MC.

METHODS
In the following, we describe the implementation of the recommendation engine, starting from how data gathered from the sources (described in Section 2) are used and providing details about the working parameters.

Assembling Available Data
We employed G DB and G DGN to extract similarities score s between drugs and diseases respectively, hence obtaining two complete weighted graphs G drg ¼ ðV drg ; E drg Þ and G dis ¼ ðV dis ; E dis Þ with drugs/diseases as nodes and similarity scores s as the weight of the edges. Then, G MC provides a natural way to connect V drg and V dis , hence obtaining a single graph G. A pictorial representation of these steps can be found in Fig. 3 while a pseudo-code is provided in the supplementary materials, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety. org/10.1109/TCBB.2022.3191392. We define pðtÞ as the presence ratio of the gene t in the graph G DB ¼ ðV DB ; E DB Þ, namely Here, pðtÞ is a probability distribution, i.e., P t pðtÞ ¼ 1, therefore, given a gene sets T , the Shannon Entropy is well defined. We define V drg ¼ V DB \ V MC as the set of drugs and we compare two given drugs u; v 2 V drg by means of their target genes sets T u and T v as  In particular, since P v2V drg E drg ½u; v ¼ 1 8u 2 V drg -namely, it is right-stochastic -then E drg can be interpreted as a transition matrix. 1 Following the same pattern on G DGN , we define G dis ¼ ðV dis ; E dis Þ as the disease similarity-weighted graph, hence making diseases and drugs comparable by means of the similarity score. A pseudo-code summarisig and generalising the procedure used to build the similarity graph can be found in the Supplementary materials, available in the online supplemental material.
We later define two sets of edges E drg!dis and E dis!drg to connect V drg to V dis (and vice versa) as for u 2 V drg and v 2 V dis , and for u 2 V dis and v 2 V drg . Such edges generates two right-stochastic complete bipartite graphs on the vertex set 2 A summary of the approach is provided in the Supplementary materials by means of a pseudo-code, available in the online supplemental material. The union of the four set of edges E dis , E drg , E drg!dis , and E dis!drg yields a complete weighted graph in the vertices V , namely G ¼ ðV; EÞ. In order to preserve the transition matrix interpretation, we halve the weights of each of the edge sets and we set the weight E½v; v for each vertex v 2 V as 1 À P w6 ¼v E½v; w, 3 namely The transition matrix of the graph obtained via Equation (7) represents the base of our DR. In the following, we will denote it as R ð1Þ , since it holds all the information we know from the very first step of our DR.

The Core of the Recommendation System
In this section, we exploit the transition matrix property of R ð1Þ to model our system as a recommender for the most probable destination of fixed-length biased random walks.
We define a '-length R ð1Þ -biased random walk over V as a sequence of ' þ 1 nodes sampled according R ð1Þ , that is We define the ' recommendation R ð'Þ as that can be evaluated as zfflfflfflfflfflfflfflfflfflfflfflffl ffl}|fflfflfflfflfflfflfflfflfflfflfflffl ffl{ '-times ½u; v: Given a percentage of drugs 0 < p < 1 we want to recommend, we define the drug recommendation system R p ' RðR ð1Þ ; '; pÞ, as the map where p ¼ bp Á jV drg jc (bxc is the integer part of x) and argmax p ðfðÞÞ are the pre-images of the first p maximum values of fðÞ.
The definition of the recommendation system also makes the methodology self-explainable. In fact, given a recommendation r for a disease d, we can query the system to retrieve which paths contribute most for the novel connection itself, i.e., the heaviest paths d ¼ w 0 w 1 . . .w ' ¼ r.
We can also define an analogous disease recommendation system R p ' : V drg ! V p dis that, given a specific drug, unveils its relationship with other diseases. Due to the asymmetric nature of the argmax operator and of the matrix R ð1Þ , the recommendations generated by the two methodologies may differ i.e., R p ' is not symmetric. In fact, given a disease d 2 V dis , we can have r 2 R p ' ðdÞ for some drug r 2 V drg while d 6 2 R p ' ðrÞ. In other words, it could happen that the DR does not recommend a disease d to be treated via a specific drug r even if r was recommended from the treatment of that specific disease d.

The Parameters
Assuming R ð1Þ as fixed, there are two parameters in R p ' : the length of the walk ' > 1 and the percentage of recommendations 0 < p < 1.
In order to provide an upper bound for ', we notice that the biased random walk X is a memory-less process and therefore we tackle the task to determine the argmax as a Markov process problem G. In particular, since R ð1Þ is stochastic, the graph G itself forms the basis for a '-step Markov chain (see [34] for a detailed description of Markov processes and Markov chains).
Without loss of generality, we may assume that G is irreducible, that is for all couples u; v 2 V, there exists ' > 0 such that R ð'Þ ½u; v > 0. In fact, if it is not irreducible, then there exists a disjoint partition of V and therefore we can 1. Actually, some of the rows could be empty if a drug has no common target genes with the others, hence making necessary the "otherwise" clause; we get rid of this problem, also causing the matrix not to be right-stochastic, in the next steps.
2. See Footnote 1 3. It is a good practice to reduce each positive entry E½u; v by a factor oðminfe 2 E j e > 0gÞ before evaluating E½u; u. This operation gets rid of any numerical issue caused by rounding operations on small floating-point values, hence ensuring no negative value is assigned to E½u; u. split it into independent irreducible Markov processes and work on them individually.
According to (7), for any state u 2 V we have R ð1Þ ½u; u > 0 and therefore G is positive recurrent and aperiodic. An irreducible, aperiodic and positive recurrent Markov Chain is said to be Ergodic. The Ergodic Theorem states holds for every Ergodic Markov chain R and for some stationary distribution P ¼ ðpjpj. . .jpÞ T , and, in general, Then ' is an upper bound for ' since R p ' would give the same recommendation independently from the disease considered, hence being uninformative for repurposing scope.
However, the existence of a stationary distribution p enforces the stability of the method. In fact, slightly modifying E (i.e., partially modifying the initial conditions) we are expected to reach a similar p 0 , i.e., kp À p 0 k is negligible.

PERFORMANCE EVALUATION
In this section, we provide an analysis of our method by means of the commonly used performance indicators based on the confusion matrix obtained from the method's execution. We first analyse the method on its own, hence providing parameters tuning by means of ROC curves. Later, we apply our methodology to cross-validation datasets provided by other literature recommender systems.

Parameters Tuning
As pointed out in Section 3.3, the recommendation system R p ' can be tuned by means of two parameters: The length of the Markov Process 1 < ' ( ' (discrete) The recommendations percentage 0 < p < 1 (continuous) One of the most used tools in parameter tuning is the Receiver Operating Characteristic (ROC) curve (true-positive rate versus false-positive rate) constructed via the confusion matrix: Recomm.
at the varying of the parameters. The objective is to reduce the false-positive rate. As it is common practice in the recommender systems literature, even if little or no a priori information on the ground truth (GT) negatives is available (the absence of a given recommendation might be due to its effectiveness not being assessed, yet), it is assumed that only a small fraction of the negatives are actually to be recommended.
For small values of ', we can then build a ROC curve on the parameter p, varying on 0 p 1. Fig. 4 represents such ROC curves with 1 < ' 5. The corresponding area under the curve (AUC) values approximated via the Newton-Cotes formula are reported in Table 2.
Results are listed with two indicators: in unweighted rates are evaluated with regard to the complete data set (without taking into account the difference between the drugs); weighted rates, on the contrary, are first evaluated regarding the drugs individually and then averaged amongst them.
Fewer known-recommendations are, however, more unreliable in the system as highlighted by the fact that Weighted AUC values are smaller than Unweighted ones. In fact, they have a stronger negative impact on the mean accuracy than those with more recommendations available.
According to ROCs, the best choice is ' ¼ 3. Recommendation ðu; vÞ 2 V drg Â V dis are therefore built employing the following patterns: Fig. 3. The data sets DB and DGN (on the left) are processed according to the s measure, yielding G drg and G dis respectively. These two graphs are then connected by E conn edges, obtained from MC (with weight set to 1). The recommendation shown in cyan is obtained from R ð'Þ that relies on biased random walks like yellow and purple ones.
where u; 0 u 00 2 V drg and v; 0 v 00 2 V dis and u; u 0 and u 00 might be the same state as well as v; v 0 and v 00 .
For ' ¼ 3, we evaluate the best value of p as the parameter generating the furthest point from the reference diagonal argmax params fsensitivity þ specificityg; (15) hence, obtaining p ¼ 12:5%. In our setting, such a value corresponds to 213 recommendations. Analogous results are obtained also with leave-one-out 10-fold cross-validation tests, where the AUC results in a mean value of 0.930174 and 0.862815 for weighted and unweighted approaches respectively, both with a standard deviation of 10 À4 .

Comparison With Other Methods
To provide more insightful information, we compare our method against the four benchmark data sets RDB, ID, LL, and SLAMS introduced in Section 2. To keep recommendations consistent with our method, we restricted such data sets by filtering the diseases and the drugs available in our training and then we collect the result regarding their connections. In situations like these ones, two main approaches are possible: (i) a gentle re-tuning of the method with the new data set in order to prove the model's flexibility to different data or (ii) a direct usage of the data set as a cross-validation set. While the first approach consists in performing parameter analysis again on the novel dataset, hence 'forgetting' the original set of data and therefore taking the best results of the methodology over the benchmarks, the second approach -the one we adopted -tests the benchmarks over the parameter gathered from the original dataset. In this sense, we compared the recommendations our method proposes with the ground truth obtained from the four restricted data sets. However, to keep the environment of the test as close to reality as possible, we did not restrict our method's recommendation set to the drugs available in each benchmark data set, meaning that we made our method infer no information about the validation set it was tested against; if this was the case, in fact, sensitivity and specificity values would grow above ð0:99; 0:9Þ due to the high number of recommendations, hence being uninformative.
In Fig. 5 we report, with respect to the Weighted-mean ROC curve, the true-and false-positive rates achieved on    Table 3) w.r.t. the ROC curve achieved by R p 3 (Fig. 4(b)). The best setting p ¼ 12:5% is also plotted as a star.
the benchmark datasets according to the parameters extracted in Section 4. We also collect the corresponding numeric values in Table 3 along with the Accuracy boolean indicator and the area under the curve of the underlying ROC curve.
As it is expected, better sensitivity and specificity results are obtained on LL and RDB since they share more than 50% of the recommendation with our training set MC. If we consider SLAMS and ID, while having less than 1 = 4 of common data with MC, they achieve anyway a comparable Accuracy and Specificity. Such comparable results provide a further clue of our method's stability.

CASE STUDY: RHEUMATOID ARTHRITIS
In addition to the performances assessed in Section 4 we here explore in more detail the results on an exemplar noncommunicable disease (NCD), i.e., rheumatoid arthritis (RA), a worldwide threat associated with spreading chronic inflammation [35]. NCDs provoke, in fact, more than 44 million deaths per year [36], [37] and it is estimated that RA, amongst the others, affects around 1% of the world population [38]. RA's incidence and representativeness make it an interesting case study to explore the relevance of the results offered by the recommender system also being its aetiology complex and not fully elucidated, since it includes both genetic and environmental factors [39], [40].
Standard therapy (true positives) for RA was extracted from the 2021 update of the American College of Rheumatology (ACR, [41]). Then, the relevance of the findings towards potential clinical translation was manually curated for the top-ranking novel results, using two sources of knowledge to deepen the information retrieved by each entry. The first source is represented by PubMed [42], the most comprehensive medical database. The second one is the well-known repository ClinicalTrials [31] collecting concluded and ongoing clinical trials. In both cases, the search terms were represented by the disease (rheumatoid arthritis) and the drug, as recovered by the recommender system.
The top ten novel (w.r.t. the dataset) recommendation are reported in Table 4. Score, drug name and CAS number are the output of the recommender system, while clinical trials and related publications are the results of our manual curation of the findings. Recommendations are divided according to the associated evidence (publication or clinical study), into three sets: (i) approved: that are the ones already employed in the literature (see [41]) but not in our training set MC, (ii) potential, that are the experimentally promising according to recent publications (reported in the table as well), and (iii) open, that are the ones supported by no hints of recent/ongoing related investigation. Fig. 6. Depiction of the recommendation process of the drug Gemcitabine in the case study of RA (top 25 path contributions shown, coloured from yellow, as the most probable path, to blue, as the least probable one). From left to right, the disease RA ("source") is linked in the first step of the process with similar diseases (in red) according to s scores and approved drugs (in green) through known associations. Then, a second step is performed and a different set of drugs is reached (some of the nodes are returning from the first step, marked with y). Finally, the paths reach Gemcitabine with a third (and last, ' ¼ 3) step, generating the recommendation. Approved drugs offer a measure of the validation of the system itself. Potential results, appear to be mostly identified by drugs whose validation in clinical trials is not (yet) engaged, but whose usage in animal models (mostly collagen-induced arthritis -CIA -rodents) has offered recent, promising results. This is the case for Gemcitabine (95058-81-4) [43] and Lenalidomide (191732-72-6) [45], whose identification by our system is particularly interesting due to their recent developments (i.e., after 2017).
Further inspection can be carried out thanks to the explainability of the output provided by the Markov Process representation of the DR system. As an example, Fig. 6 reports the 25 paths of length ' ¼ 3 that are contributing most to Gemcitabine novel recommendation, hence giving hints on the connections found. The connection with highest value is {RA ! Fludarabine ! Clofarabine ! Gemcitabine}. More in general, all these 25 connections share the same few drugs (14) and diseases (5) with an interesting relevance in Fludarabine (CAS 21679-14-1, present in five paths), Fluorouracil (CAS 51-21-8, present in nine paths), and Methotrexate (CAS 59-05-2, present in nine paths); these paths consists in fact in 21 out of the 25 total (do note that such drugs are present in the same path twice). All of these anti-neoplastic drugs can be found at a distance of one from RA, meaning that they are present in MC dataset.
The category open, having no current support in the literature, can represent both (expected) noise offered by such an automated system building on a meaningful but limited data base or the most innovative opportunities. In this sense, the explainability of the recommendation represents a significant added value, offering pharmacologists and rheumatologists a clear starting point for further consideration as well as the feasibility of (early) translation into clinical testing.

DISCUSSION AND CONCLUSION
Drug recommendation processes encompass experimental and computational approaches, with the latter potentially leveraging on computational resources as well as algorithmic advances, especially in the area of machine learning [46]. The power and flexibility of such tools created high expectations in the field of personalised medicine, a twenty years-long effort in the evolution of the medical paradigm [47], whose latest interpretations rely on machine learning for applications ranging from diagnosis to therapy. In this setting, however, high-level transparency and accountability are mandatory, hence promoting explainable machine learning methods [48].
In this perspective, we here proposed a similarity-based approach that exploits established associative knowledge on experimental drug-gene molecular interactions, diseasedrug and disease-gene associations, and pharmacological (disease-drug) indications to build a predictive DR system.
The way the proposed approach is built grants several advantages. Amongst the others, one is the standardisation, interoperability, quality and up-to-date information obtained thanks to the choice of robust, binary-type data. Moreover, the use of associative binary data (i.e., drug-disease, gene-disease, drug-gene) allows to overcome the problem of the trustworthiness and sensibility to experimental settings of quantitative data (such as e.g., gene expression data, valuable and extremely refined but also sensitive to noise [49]), and to take advantage of prior knowledge characterised by high robustness, reliability and stability. Another key feature is represented by the generation of explainable prediction, which can be obtained, once again, since the choice made here to integrate several binary associative data sets from reliable sources (DrugBank, DisGeNET, MalaCards) allowed to build a robust and well-grounded set of data that the recommender system is able to work on with a transparent approach. The output is hence completely traceable, from the processing of the initial input sources (diseases, drugs) up to the formulation of the drug repurposing proposal, hence promoting it as an explainable artificial intelligence method.
On the other hand, typical issues of associative data generate DR shortcomings, including noisy recommendations (e.g., false positive interactions) or incompleteness in the data sets (false negative interactions). Here, the incorporation of sources containing negative data (e.g., experimentally validated non-interacting pairs such as the Negatome for protein-protein interactions [50]) or important pharmacological side effects that impair drug usage (e.g., data from SIDER [51]) may help in refining the output for practical, clinical use. Multiple, redundant or complementary data sources (e.g., the Drug Repurposing Knowledge Graph [52]) will also help in this direction to cover missing information as well as possible.
In future implementations, we plan to extend the capability of managing information about drug-drug interactions, hence performing a step forward in the problem of synergistic drug combinations while tackling the repurposing proposals issues. To this scope, we plan to include the resources for drug synergy (e.g., DrugCombDB [53]) and adversary effects (e.g., from DrugBank). We also plan to include micro-RNA interactions, a growing field of interest in diagnostic and therapeutic pharmacology (see [54]), to further strengthen our knowledgegraph; in fact, miRNA plays an important role similar to target genes both for what concerns diseases (see [24]) and drugs (see [55]).

ACKNOWLEDGMENTS
This work has been carried out under the auspices of the Gruppo Nazionale per la Fisica Matematica (GNFM) of Istituto Nazionale di Alta Matematica (INdAM).

Author Contributions
EO executed the study, provided the mathematical background, developed the code to perform the analysis and validation of the method and wrote the first draft. FC provided the initial idea and supervised the work. CN suggested the use-case, supervised its analysis and performed its validation. MP supported with the mathematical formalisation and the writing of the code. PT provided support in data acquisition and problem formulation. All authors contributed to the writing of the manuscript, revised it, and read and approved the final version.
Filippo Castiglione received the degree in computer science from the University of Milan, Italy, and the PhD degree in scientific computing from the University of Cologne, Germany. He was a visiting researcher in various centers (IBM T.J. Watson Research Center, the Department of Molecular Biology of Princeton University, the Harvard Medical School, and the Institute for Medical Bio-Mathematics in Tel Aviv) working with the interface between computation and biology. He has coordinated several EU projects related to health informatics. He currently works as a research director with the National Research Council of Italy and also teaches Computational Biology and Machine Learning with the University of RomaTre.
Christine Nardini received the PhD degree in electronics and informatics from the University of Bologna, Italy. She is currently working as a scientist with the National Research Council of Italy (CNR), Institute for Applied Mathematics, Rome, Italy. Her research interests include systems and computational biology. She is a member of the editorial board of PLoS One and BioNanoScience, Springer Publication and associate editor for BMC Bioinformatics.
Elia Onofri received the bachelor's degree in mathematics, and the master's degree in computational sciences from Roma Tre University, Rome, Italy, in 2018 and 2020, respectively. Currently, he is working toward the PhD degree in mathematics with Roma Tre University, under a collaboration with the National Research Council of Italy. His research fields concern applied mathematics in cryptography, machine learning, biology and forecasting simulations.
Marco Pedicini. received the PhD degree in mathematics (Logic and Theoretical Computer Science) from Paris 7 University. He is an associate professor with the Department of Mathematics and Physics of Roma Tre University. Before coming to Roma Tre he was a CNR researcher. His research is on topics with the intersection of theoretical computer science and pure mathematics: logic in computer science, computational number theory, cryptography, and computational methods for systems biology. On behalf of De Cifris, he coordinates the special interest group CifrisCloud.