AI-driven Clinical Decision Support: enhancing Disease Diagnosis exploiting Patients Similarity

Detecting diseases at an early stage helps to treat them accurately, and a Clinical Decision Support System (CDS) can greatly help in identifying diseases and methods of treatment. In this paper, we propose a CDS framework able to integrate heterogeneous health data from different sources, such as laboratory test results, basic patient information, health records and social media data. Innovative machine learning and deep learning approaches can be applied to the data so collected. A neural network model for predicting patients' future health information is proposed. The approach employs word embedding to model the semantic relations of hospital admissions, symptoms and diagnoses, and it introduces a mechanism that measures the relationships among different diagnoses in terms of symptom similarity, which is then exploited for the prediction task. Several clinical decision support systems, including diagnostic decision support systems for inferring patient diagnosis, have been proposed in the literature. However, these methods typically focus on a single patient and apply manually or automatically constructed decision rules to produce a diagnosis. Even worse, they consider only a single medical condition, whereas it is not uncommon for a patient to have more than one medical condition at the same time because of the complications of the first disease. The novelty of the proposed approach is the combination of supervised and unsupervised artificial intelligence methods, which allows several heterogeneous data sources, related to a multitude of patients and concerning different medical conditions, to be combined and integrated. Furthermore, with respect to previous approaches, the diagnosis prediction problem is formulated so as to predict the exact diagnosis in terms of semantic meaning. In fact, instead of referring to the diagnosis in terms of International Classification of Diseases (ICD-9) codes, as the related literature does, we consider the diagnosis semantics by exploiting Natural Language Processing. Experimental results on a real-world EHR dataset show that the proposed approach is effective and accurate and provides clinically meaningful interpretations.

systems focus on extracting characteristics of patients and diseases, based on which they classify patients and provide corresponding clinical suggestions to the physicians.
Several clinical decision support systems, including diagnostic decision support systems for inferring patient diagnosis, have been proposed in the literature. However, these methods typically focus on a single patient and apply manually or automatically constructed decision rules to produce a diagnosis. Even worse, they consider only a single medical condition, whereas it is not uncommon for a patient to have more than one medical condition at the same time because of the complications of the first disease.
To address the above issues and challenges, in [2] we proposed a CDS framework that integrates heterogeneous health data collected from disparate sources, such as laboratory test results, medical images and electronic health records. The framework implements a set of services to support physicians in diagnosing or treating patients' health issues. To this aim, novel methods exploiting deep learning as well as machine learning tools are embedded within the framework, with the main goal of automating disease diagnosis and treatment.
Building on the CDS framework proposed in [2], the key motivation behind this work is to exploit different artificial intelligence methods (such as deep learning and machine learning) to enable automatic disease diagnosis, also taking advantage of alternative data sources like social media data and data coming from body sensor networks. Specifically, this paper addresses the challenging issue of inferring patient diagnoses by exploiting electronic health records and medical data collected upon hospitalization. To this purpose, we propose a novel framework for diagnostic prediction based on patient similarity, which uses basic patient-specific information gathered at hospital admission, including medical history, blood tests, laboratory results and demographics, to identify similar patients and subsequently predict patient outcomes. Patient similarity is defined as the similarity between patients' diagnoses and symptoms, rather than as a dichotomous problem stating the absence or presence of just one disease. We learn patient semantic models from the overall collected data by developing an AI method able to generate context-based and rich representations of health-related information. In particular, we exploit word embedding models, since neural networks have been shown to be able to learn complex feature representations. We apply the word embedding approach to categorize text fragments, at the sentence level, based on the emergent semantics extracted from a corpus of medical text.
In this context, a fundamental challenge is how to correctly model such temporal and high-dimensional EHR data so as to significantly improve prediction performance. To address this challenge, the paper puts emphasis on the implementation, within the proposed CDS architecture, of an ad hoc mechanism for searching patients' documents in a distributed health system, based on Natural Language Processing (NLP) concepts.
The proposed diagnosis prediction method has been evaluated on real-world medical data, the MIMIC-III dataset [3]. The results show that the prediction approach, based on word embedding and exploiting the semantic similarity of both symptoms and diagnoses, is effective and accurate, reaching considerable precision and recall rates. The obtained outcomes are promising for future developments and extensions of the framework, which could be a valuable means for automatically inferring disease diagnoses.
The rest of the paper is organized as follows. Section II overviews related work. Section III introduces the clinical decision support architecture together with its main functionalities. Section IV presents as case study the diagnosis prediction method, formulating the problem, the data model and introducing the key aspects of the approach. Section V describes the real-world medical data used for the experimental evaluation. The results of the evaluation are shown in Section VI. Section VII concludes the paper.

II. RELATED WORK
In the healthcare domain, EHR mining represents one of the main research fields, and topics like disease progression [4]-[7], diagnosis prediction [8]-[11], electronic genotyping and phenotyping [12]-[14], and adverse drug event detection [15] have been widely investigated. Moreover, the design of deep learning models has often provided significant performance improvements. Unplanned readmissions and risks related to electronic health records management were predicted using convolutional neural networks (CNNs) [16], [17], while multivariate time series health data, for example, can be profitably modeled through recurrent neural networks (RNNs), even when some values are missing [18], [19].
In [8], an approach to acquire knowledge from health data, named Med2Vec, is proposed with the aim of predicting future diagnosis information. The method overlooks the long-term dependencies of health codes among diagnoses. A graph-based attention model for representation learning in healthcare is proposed in [20]. In this model, medical ontologies are used to learn robust representations, while patients' visits are modeled by exploiting an RNN. In order to perform binary prediction tasks, a predictive model implementing a reverse time attention mechanism was proposed in [9]. It employs a location-based approach to predict possible future diagnoses. The attention weights related to a visit at a given time are calculated by using the latest medical information, and are exploited to build the hidden state of the RNN in order to predict the visit at the next time step. The relationships between all previous visits and the current one are not taken into account. A diagnosis prediction model, using attention-based bidirectional recurrent neural networks for learning low-dimensional representations of patient visits, was proposed in [21]. The approach embeds the high-dimensional clinical variables into a low-dimensional space and generates the hidden state of an RNN, building a code representation through an attention-based bidirectional RNN.
Medical text classification is a special case of classification for health data. In this context, Natural Language Processing (NLP) and machine learning algorithms have been applied with promising results. For example, patient record notes were successfully classified by applying Latent Dirichlet Allocation and Support Vector Machines [22], while [23] addresses the classification of documents on diabetes, also reporting satisfactory results. In general, the models proposed in the literature often exploit neural network approaches for document classification. Mikolov [24] introduces Word2Vec, a two-layer neural network that processes a text corpus in order to identify correlations among words, also capturing the semantic context. A unique real-valued vector is generated to represent each word in the corpus. The approach is extended by Le et al. [25], who, starting from Word2Vec, propose distributed representations of documents or paragraphs. Kalchbrenner et al. [26] proposed an approach to semantically model sentences by exploiting a Dynamic Convolutional Neural Network. The approach handles sentences of different lengths and provides a feature graph for each sentence that is capable of capturing short and long relations.
Srinivasu et al. [27] proposed an approach to classify skin diseases from images captured by mobile devices, exploiting a deep learning based method and the Long Short-Term Memory technique. The model relies on an application through which the image of the affected region of the skin is captured to determine the class of the skin disease. In [28], a cervical cancer prediction model (CCPM) that offers early prediction of cervical cancer using risk factors as inputs was proposed. In particular, an outlier detection approach and a density-based spatial clustering with noise method, together with the synthetic minority over-sampling technique for data balancing and a random forest for cervical cancer prediction based on risk factors, have been exploited. The novelty lies in combining data oversampling techniques, outlier detection methods and a random forest classifier to improve the prediction performance. The work in [29] proposes a personalized healthcare system to monitor diabetic patients by using real-time processing of data from BLE-based sensors. Apache Kafka is utilized to handle incoming sensor information, while MongoDB is exploited for storing unstructured sensor data. The large amount of continuous data (e.g., weight, blood pressure, heart rate, BG, etc.) coming from sensor devices can thus be managed in real time. To classify diabetes patients, a Multilayer Perceptron classification algorithm is utilized, while the prediction of the BG level relies on a Long Short-Term Memory (LSTM) method.

III. THE AI-DRIVEN CLINICAL DECISION SUPPORT SYSTEM
A general and innovative distributed CDS architecture, aiming to support physicians in formulating the diagnosis, was proposed in [2]. The architecture includes a set of modules allowing for the collection and management of huge volumes of heterogeneous clinical data originating from distributed sources. Different and non-homogeneous data are involved in the framework, including more traditional health data like EHRs, laboratory test data, patient symptoms, patient health logs and medical images, but also alternative data sources like social media data and data coming from wearable devices. Indeed, with the current revolution of the internet and social media, and the pervasiveness of IoT devices, new sources of medical information are easily accessible. Wearable technologies detect, analyze and transmit information concerning vital signs and/or ambient data, allowing for continuous monitoring of health status. Thanks to advances in artificial intelligence methods, such data can be used in a large set of medical applications, to make a diagnosis or perform a triage.
The proposed Clinical Decision Support framework, whose architecture is shown in Figure 1, relies on a set of cooperative Clinical Data Repository (CDR) hosts, which are geographically distributed healthcare providers (e.g., hospitals or health research centers). Each CDR manages its local information system, composed of EHRs, knowledge bases, clinical databases, etc. Each local system is essentially a collection of metadata containing information related to the clinical history of each single patient. For the decision-making process, the architecture integrates off-line standardized knowledge bases from domain experts and Clinical Practice Guidelines (CPG) with online knowledge that is extracted continuously from EHR databases and social media data. Based on its local knowledge bases and a patient profile, the CDS provides diagnosis suggestions relating to a hospital admission. A patient profile is modeled by taking into account the patient's medical history and current diagnosis.
The proposed framework relies on the concept of digital patient, which is a specialization of the digital twin definition. The concept of digital twin was first used in 2001 at the University of Michigan and described as the virtual and digital equivalent of a physical product. According to [30], a digital twin is a digital replica of a living or non-living physical entity, such as processes, people, places, systems and devices, that can be used for various purposes. Ideally, it contains all the information of the physical object through a three-dimensional representation of its mechanical, geometric and electronic aspects, i.e., embedded software, micro software, product data, and data associated with increasingly pervasive sensors and actuators. The success of this concept is related to the growing diffusion of the IoT, the Cloud, mobile technologies, and AI, which have made the advantages associated with digital twins accessible to many sectors, from smart fabrics to smart agriculture, from smart health to smart cities, and many others aiming to ride the digital transformation. In the healthcare industry, the digital twin was originally proposed for equipment prognostics [31]. The patient's digital twin is built from different health data sources like image records, in-person measurements, laboratory results, and genetic data. It is meant to assist during diagnosis and to build personalized models for patients, continuously adjustable based on tracked health and lifestyle parameters. Accordingly, a digital patient simulates the health status of the patient, as captured from available clinical data, and infers the missing parameters from statistical models.
In our approach, the digital patient can also benefit from non-clinical data, like those from social media and wearable sensors, to support physicians in clinical decision-making. In particular, continuous patient data collection allows symptoms to be detected at early stages, giving doctors the capacity to diagnose the patient before he or she gets ill. Besides, during treatment, it makes it possible to evaluate whether the treatment is effective. It can also support therapy planning with individual quantitative optimization of clinical output, or the prediction of disease propagation.
The architecture of the framework is composed of three layers, as can be observed in Figure 1. The first layer is responsible for the preparation (i.e., cleaning, handling and integration) of medical data generated from multiple heterogeneous sources. The artificial intelligence (AI) layer is in charge of providing integrated health information and intelligent applications, which can improve the efficiency of decision-making processes. It includes machine learning and deep learning modules, such as a Natural Language Processing (NLP) module, a radiomics module, predictive analytics, and so on. Such modules automatically generate context-based and rich representations of health-related information and implement a set of high-level clinical services, including systems that provide diagnoses based on patient symptom similarity, personalized medical measurement, and treatment prediction and monitoring.
A High Performance Computing layer, based on a CPU/GPU infrastructure, allows the platform to handle large amounts of data in a limited processing time. The module exploits a vector representation of CDR documents and defines a distance/similarity metric in order to achieve a logical sorting among them. The distance/similarity between two documents can be computed through the cosine distance/similarity between their vector representations. Given two health document vectors, $v_1$ and $v_2$, the cosine measure used to compute the similarity between them is reported in formula (1):
$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \qquad (1)$$
The aim is to obtain a sorted virtual list among CDR hosts, in which each server is logically linked to only two other hosts: the CDR host with the vector value immediately lower and the CDR host with the vector value immediately higher in the whole network. Each CDR host with vector $v_{CDR}$ performs the following algorithm:
• compute the lists $L_{CDR}$ and $H_{CDR}$ containing the currently linked servers with vector value lower and higher than $v_{CDR}$, respectively;
• if the length of $L_{CDR}$ is higher than 1, (i) identify in the list the hosts with the minimum and the sub-minimum vector values; (ii) create a virtual link between them;
• if the length of $H_{CDR}$ is higher than 1, (i) identify the hosts with the maximum and the sub-maximum vector values; (ii) create a virtual link between them;
• notify by message all involved CDR hosts (the host with the minimum/maximum vector value and the host having the sub-minimum/sub-maximum vector value) of the information related to their new virtual neighbors; the addressed CDR hosts update their lists with this information and consider these new neighbors in the successive computation;
• remove from $L_{CDR}$ the CDR host with the minimum vector value and from $H_{CDR}$ the CDR host with the maximum vector value.
At the end of the procedure, the last CDR host contained in the list $L_{CDR}$ is the linked CDR host with the highest vector value among all linked CDR hosts with vector value lower than $v_{CDR}$, while the $H_{CDR}$ list contains the linked CDR host with the lowest vector value among all linked CDR hosts with vector value higher than $v_{CDR}$. Obviously, the hosts having the vectors with the absolute minimum value and the absolute maximum value of all vectors in the network will each be linked to a unique CDR host. If no linked CDR host exists with vector value higher/lower than $v_{CDR}$, i.e., $H_{CDR}$/$L_{CDR}$ is empty, the host itself is the CDR host with the highest/lowest vector value of the whole network.
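The linking procedure above can be illustrated with a small centralized simulation. This is only a sketch under explicit assumptions: each host's vector is reduced to a distinct scalar key, links are symmetric, messages are replaced by direct updates of a shared dictionary, and each host repeats the link-and-remove steps until its lower and higher lists hold at most one host. The function name `build_sorted_links` and the example values are hypothetical and do not come from the framework's code.

```python
def build_sorted_links(values, links):
    """Simulate the neighbor-linking procedure on an undirected, connected graph.

    values: dict mapping host name -> scalar "vector value" (assumed distinct).
    links:  iterable of (host_a, host_b) pairs describing the initial virtual links.
    Returns the final set of links as (lower_host, higher_host) pairs.
    """
    neighbors = {host: set() for host in values}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    changed = True
    while changed:  # repeat rounds until no host changes its links
        changed = False
        for host in values:
            # Lower list ordered from minimum upward; higher list from maximum downward,
            # so position 0 is always the farthest host on that side.
            lower = sorted((n for n in neighbors[host] if values[n] < values[host]),
                           key=values.get)
            higher = sorted((n for n in neighbors[host] if values[n] > values[host]),
                            key=values.get, reverse=True)
            for side in (lower, higher):
                while len(side) > 1:
                    far, next_far = side[0], side[1]
                    # Link the extreme host with the sub-extreme one
                    # (stands in for the notification message).
                    neighbors[far].add(next_far)
                    neighbors[next_far].add(far)
                    # Assumption: dropping the extreme host is symmetric.
                    neighbors[host].discard(far)
                    neighbors[far].discard(host)
                    side.pop(0)
                    changed = True

    return {(a, b) for a in neighbors for b in neighbors[a] if values[a] < values[b]}


# Hypothetical example: host A starts linked to every other host (a star topology).
example_values = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.3}
example_links = [("A", "B"), ("A", "C"), ("A", "D")]
sorted_links = build_sorted_links(example_values, example_links)
```

In this example the star collapses into the chain B, D, C, A, so every host ends up linked only to the hosts with the immediately lower and immediately higher value.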

IV. DIAGNOSIS PREDICTION CASE STUDY: PROBLEM FORMULATION AND METHOD
Electronic Health Records (EHR), consisting of longitudinal patient health data, including demographics, diagnoses, procedures, and medications, have been utilized successfully in several predictive modeling tasks in healthcare. One critical task is to predict the future diagnoses based on patient's historical EHR data.
In this paper, we address the problem of leveraging EHR patient data to infer the discharge diagnosis of patients. We introduce an automated method that exploits basic patient-specific information gathered at admission, including medical history, symptoms, and preliminary diagnoses, to identify similar patients and predict the patient discharge diagnosis.
The problem addressed in the paper is formulated as follows.
Definition 1: Given a patient p, a set of symptoms s reported by the patient and a set of preliminary diagnoses pd, the objective is to predict the discharge diagnosis d for p by exploiting the similarity, in terms of symptoms and preliminary diagnoses, with other patients who have already been treated and for whom a discharge diagnosis has already been formulated.
In particular, we propose a novel patient-similarity-based framework for diagnostic prediction, where the similarity is defined as the similarity of diagnoses and symptoms among patients. The multilabel classification problem is converted to a single-value regression problem by integrating the pairwise patients' clinical features into a vector and taking the vector as the input and the patient similarity as the output.
It is worth pointing out that, unlike in previous approaches, the aim of the work proposed in this paper is to predict the exact diagnosis in terms of semantic meaning. In fact, instead of referring to the diagnosis in terms of International Classification of Diseases (ICD-9) codes, we consider the diagnosis semantics by exploiting Natural Language Processing. In this way we are able to implement semantically enhanced diagnosis prediction. In particular, we exploit word embedding models, since neural networks have been shown to be able to learn complex semantic feature representations.

A. DATA MODEL
In the proposed model, the digital patient is built by combining traditional medical sources like EHRs with external knowledge bases like social media and sensor data. For each of these sources, relevant data features are identified and extracted. The features from the disparate sources are then represented in ad hoc data structures, introduced in the remainder of this section. Specifically, in the proposed approach the symptoms, lab tests and preliminary diagnoses are integrated into the feature vectors of patients. The objective of the feature construction effort is to capture sufficient clinical nuances from the heterogeneous data of the different patients. A major challenge lies in data reduction and in summarizing the temporal event sequences in EHR data into features that can differentiate patients. The adopted feature-centered framework serves as the basis for implementing different similarity-based diagnosis prediction algorithms. Essentially, each patient is represented by a feature vector, which serves as the input to the similarity measure. Our objective is to design a similarity measure that operates on patient feature vectors and is consistent with physician feedback on whether two patients are clinically similar or not. Accordingly, the semantic similarity metric is a key aspect of the disease prediction approach.
Although the architecture shown in Figure 1 includes disparate and heterogeneous medical data sources, in this paper we focus only on classical Electronic Health Records (EHR).
For the purposes of this work we refer to the well-known MIMIC-III database [3] and, thus, we rely on its data structures and model, as will be detailed in Section V. In particular, we refer to the database schema and to the data structures maintaining information about the target features, which in the problem at hand are the diagnoses and the symptoms. Specifically, since the symptoms are not explicitly specified in the database, an ad hoc extraction module has been implemented, and for each admission of a patient a list of symptoms is extracted. Preliminary diagnoses, instead, are explicitly modeled in the database; they are established after hospital admission and recorded in the admission notes. Preliminary diagnoses are tentative hypotheses and may not be satisfactory. Conversely, discharge diagnoses are reached after treatment, confirmed by the doctors and extracted from the discharge summary notes. They are accurate and definite. Preliminary and discharge diagnoses are categorical and not directly intelligible to a computer. Several categorical clinical concepts have been classified into hierarchical taxonomies, such as the International Classification of Diseases (ICD-9). In the MIMIC-III database, discharge diagnoses are expressed in terms of ICD-9 codes, while the preliminary diagnoses recorded in the admission notes are not encoded with ICD-9.
The aim of the proposed diagnosis prediction method is to infer the exact semantic meaning of the diagnosis, since adopting diagnosis codes like ICD-9 can introduce an approximation due to the categorization step. Accordingly, in our model a diagnosis is a list of words.
We define a patient entry as a data structure storing relevant features about patients, symptoms and diseases.
Definition 2: A patient entry $p$ is defined as a tuple $p = (id, s, pd, ssv)$, where $id$ is the patient identifier; $s = (s_1, s_2, \ldots, s_s)$ is a list of symptoms, where each element $s_i$ corresponds to a symptom and consists of a list of words describing it; $pd = (pd_1, pd_2, \ldots, pd_d)$ is the list of preliminary diagnoses, which, similarly to the symptoms, is represented as a list of lists; and $ssv$ is the semantic symptom vector corresponding to the list of symptoms $s$ and preliminary diagnoses $pd$.
Definition 3: A semantic symptom vector $ssv = (ssv_1, ssv_2, \ldots, ssv_s)$ is a vector where each element is, in turn, an n-dimensional vector in a semantic space obtained through word embedding of each single symptom and preliminary diagnosis. Each feature item is represented as a real-valued vector, and each dimension contains a certain amount of semantic information.
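As an illustration of Definitions 2 and 3, a patient entry could be held in a small record type such as the one sketched below. The class name `PatientEntry`, the use of a string identifier and the example symptoms are assumptions made for this sketch only; they are not taken from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientEntry:
    """Hypothetical container mirroring the tuple p = (id, s, pd, ssv)."""
    id: str                    # patient identifier (type assumed to be a string)
    s: List[List[str]]         # symptoms, each one a list of words
    pd: List[List[str]]        # preliminary diagnoses, each one a list of words
    # Semantic symptom vector: one n-dimensional embedding per symptom or
    # preliminary diagnosis; left empty until the embedding step fills it.
    ssv: List[List[float]] = field(default_factory=list)


# Illustrative entry with made-up content.
entry = PatientEntry(
    id="patient-001",
    s=[["chest", "pain"], ["shortness", "of", "breath"]],
    pd=[["unstable", "angina"]],
)
```

Keeping `ssv` separate from the raw word lists makes it possible to recompute the embeddings with a different model without touching the extracted symptoms and diagnoses.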
We learn patient representations from EHRs by developing a supervised machine learning model that automatically generates context-based and rich representations of health-related information. In particular, we apply a word embedding approach to categorize text fragments, at the sentence level, based on the emergent semantics extracted from a corpus of medical text. With the embedding approach, symptoms and diagnoses are represented as vectors in real space and trained using artificial neural networks. Usually, the vector representations of words are computed using their contexts, so that words with similar meanings have similar vector representations. Sentence embedding maps sentences to fixed-length dense vectors, which can be generated by aggregating the embeddings of the words in the sentence.
As sentence embedding model we used Sent2Vec [32], an unsupervised model that composes sentence embeddings using word vectors along with n-gram embeddings, simultaneously training the composition and the embedding vectors themselves. Conceptually, the model can be interpreted as a natural extension of the word contexts of the C-BOW approach [25] to a larger sentence context, with the words being specifically optimized towards additive combination over the sentence by means of the unsupervised objective function.
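The idea of obtaining a fixed-length sentence vector by aggregating word embeddings can be sketched with plain averaging. This is a simplified illustration, not the trained model used in the paper: the two-dimensional word vectors are made up, tokenization is a whitespace split, and the helper name `average_embedding` is hypothetical.

```python
def average_embedding(sentence, word_vectors, dim):
    """Average the vectors of the known words in a sentence.

    word_vectors: dict mapping a lowercase word to a list of `dim` floats.
    Unknown words are skipped; a sentence with no known words maps to zeros.
    """
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(word_vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


# Toy vocabulary with invented two-dimensional vectors.
toy_vectors = {"chest": [1.0, 0.0], "pain": [0.0, 1.0]}
chest_pain_vector = average_embedding("Chest pain", toy_vectors, dim=2)
```

Whatever the sentence length, the result has `dim` components, which is what allows symptoms and diagnoses of different lengths to be compared in the same space.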

B. METHOD
The supervised prediction method proposed to compute the discharge diagnosis similarity of patients is formulated as a sequence of steps. The inputs of the method are the feature vectors of each patient, which are built by integrating two parts: symptoms and preliminary diagnosis data. Accordingly, the first step is the construction of the patient feature vectors. The second step is the development of the semantic corpus by integrating the different sources of medical knowledge. The third step is the building of the neural network and its training. The fourth step is the construction of the similarity profiles, while the final step is the prediction task.

1) Complexity
Two main trends in NLP have emerged. Recurrent neural networks (RNNs), LSTMs and attention models are widely used for NLP applications. However, even if they exhibit extremely strong expressiveness, the increased model complexity makes such models much slower to train on larger datasets. At the other end, simpler approaches like matrix factorizations or bilinear models can be successfully trained on much larger datasets, which is an important advantage in the unsupervised setting considered. In particular, it is important to note that, for constructing sentence embeddings, naively using averaged word vectors was shown to outperform LSTMs (see Wieting et al. [33] for plain averaging, and Arora et al. [34] for weighted averaging). This example shows the potential of exploiting the trade-off between model complexity and the ability to process huge amounts of text using scalable algorithms. In view of this trade-off, the adopted approach of Pagliardini et al. [35] further advances unsupervised learning of sentence embeddings. The Sent2Vec model proposed by Pagliardini et al. is a simple unsupervised model that composes sentence embeddings using word vectors along with n-gram embeddings, simultaneously training the composition and the embedding vectors themselves. Pagliardini et al. demonstrated that the empirical performance of their general-purpose sentence embeddings very significantly exceeds the state of the art, while keeping the model simplicity as well as the training and inference complexity exactly as low as in averaging methods ([33], [34]). Accordingly, the computational complexity of our embeddings is only O(1) vector operations per word processed, both during training and during inference of the sentence embeddings. This strongly contrasts with all neural network based approaches, and allows our model to learn from extremely large datasets, in a streaming fashion, which is a crucial advantage in the unsupervised setting.
Fast inference is a key benefit in downstream tasks and industry applications. In contrast to more complex neural network based models, one of the core advantages of the proposed technique is the low computational cost of both inference and training. Given a sentence S and a trained model, computing the sentence representation $v_S$ only requires |S| · h floating point operations (or |R(S)| · h, to be precise, for the n-gram case), where h is the embedding dimension.
We propose a distance-based similarity approach exploiting symptoms and preliminary diagnoses similarity. Given a set of historical admissions, the goal of the proposed algorithm is to predict the diagnosis of a new patient admission based on the similarity of symptoms and preliminary diagnoses with those of the historical admissions.
The semantic similarity metric is formulated by exploiting the cosine similarity between the semantic symptom vector of the target patient $p$ and that of another patient $p_k$ in the historical EHR. The similarity is computed by taking, for each item in the semantic symptom vector of $p$, the maximum similarity value with the semantic features of patient $p_k$. In other words, for each item in $ssv_p$, the semantically most similar term in $ssv_{p_k}$ of patient $p_k$ is determined. Let $n = |ssv_p|$ and $m = |ssv_{p_k}|$ be the sizes of the semantic symptom vectors of the target patient $p$ and of the patient $p_k$, respectively. Then, the semantic similarity between them is computed through the following function:
$$sim(p, p_k) = \frac{\sum_{x=1}^{n} \max_{y \in \{1,\ldots,m\}} \cos\big(ssv_p(x), ssv_{p_k}(y)\big)}{\max(n, m)} \qquad (2)$$
where, for each feature $x$ of $ssv_p$, the max operator gives the highest cosine similarity value computed among the $n \times m$ couples $(x, y)$ of features of $ssv_p$ and $ssv_{p_k}$.
Table 1 shows the statistics of the dataset. The dataset version used in this research is MIMIC-III v1.4, released on 2 September 2016 [3].
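Returning to the semantic similarity function (2), the sketch below shows one possible reading of it: for every item of the target patient's semantic symptom vector, the best cosine match in the other patient's vector is taken, and the sum of these best matches is divided by the larger of the two sizes. The function names and the toy two-dimensional vectors are assumptions for illustration; the paper's own implementation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity(ssv_p, ssv_pk):
    """Sum of per-item best cosine matches, normalized by max(n, m).

    ssv_p, ssv_pk: lists of embedding vectors for the target patient and for a
    historical patient. Assumed reading of the similarity function; empty
    inputs are mapped to 0.0 by convention.
    """
    if not ssv_p or not ssv_pk:
        return 0.0
    best_matches = [max(cosine(x, y) for y in ssv_pk) for x in ssv_p]
    return sum(best_matches) / max(len(ssv_p), len(ssv_pk))


# Toy example: the target has two features, the historical patient shares one of them.
target = [[1.0, 0.0], [0.0, 1.0]]
historical = [[1.0, 0.0]]
partial_overlap = semantic_similarity(target, historical)
```

Normalizing by the larger size penalizes patients whose feature lists differ greatly in length, so a historical patient that matches only a subset of the target's symptoms cannot reach a similarity of 1.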

V. DATASET AND SEMANTIC KNOWLEDGE CORPUS
The tools used in this study were PostgreSQL and Python. PostgreSQL was used as the database management system, which allowed SQL-based queries through PgAdmin4 to extract and select data from the MIMIC-III database. Python was used to process the data in a more structured way.

B. SEMANTIC CORPUS MODEL: BIOSENTVEC
We trained the word embedding on the BioSentVec corpus, the first open set of sentence embeddings trained on over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. BioSentVec is publicly available at https://github.com/ncbi-nlp/BioSentVec. BioSentVec implements the sentence-to-vector model (sent2vec), which is an unsupervised version of FastText and an extension of word2vec (CBOW) to sentences. Accordingly, BioSentVec is a pre-trained model for readily generating sentence embeddings for any arbitrary input sentence. In benchmarking, BioSentVec shows superior performance on two public datasets for computing sentence similarity, compared to the current state of the art.
Specifically, BioSentVec is created by applying sent2vec to both biological and clinical texts at a large scale to compute 700-dimensional sentence embeddings, using the bigram model with the window size set to 20 and the number of negative examples set to 10.

VI. EXPERIMENTAL EVALUATION
In this section we present the results of the experimental evaluation, which had a twofold goal: assessing the effectiveness and accuracy of the proposed diagnosis prediction approach, and assessing the scalability of the distributed discovery services. We implemented the Disease Diagnosis method using Cython, an optimizing static compiler for Python. The source code is available at http://staff.icar.cnr.it/diseaseDiagnosis.zip.

A. DIAGNOSIS PREDICTION METHOD PERFORMANCE
In this Section we show the performance of the proposed approach in predicting patient diagnoses based on symptoms similarities.
We consider the following performance metrics:
• Precision: $P = TP/(TP + FP)$. It is the fraction of successfully detected ground-truth diagnoses (TP: true positives) out of the total number of detected diagnoses (TP + FP). FP (false positives) is the number of mistakenly detected diagnoses. A ground-truth diagnosis is considered successfully detected if there exists a predicted diagnosis that matches it.
• Recall: $R = TP/(TP + FN)$. It is the percentage of ground-truth diagnoses successfully detected, where TP + FN (FN: false negatives) is the total number of ground-truth diagnoses.
• F-measure: $F = 2 \cdot P \cdot R/(P + R)$. It is the harmonic mean of the precision and recall metrics; it reaches its best value at 1 and its worst at 0.
In our experiments we are interested in calculating precision and recall at k. In particular, we take the top k predictions with the highest similarity. If one of them matches the ground-truth diagnosis, the method classifies the prediction as correct. Top-1 accuracy is a special case, in which only the prediction with the highest similarity is taken into account.
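The metrics above can be sketched in a few lines of Python. This is only an illustration: it uses the standard definitions based on true positives, false positives and false negatives, it returns 0 when a denominator is empty (an assumption), and the names `precision_recall_f1` and `hit_at_k` are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def hit_at_k(scored_predictions, ground_truth, k):
    """True if the ground truth is among the k most similar predictions.

    scored_predictions is a list of (diagnosis, similarity) pairs;
    with k = 1 this reduces to the top-1 case.
    """
    ranked = sorted(scored_predictions, key=lambda pair: pair[1], reverse=True)
    return any(diagnosis == ground_truth for diagnosis, _ in ranked[:k])
```

Here a prediction matches only when it is identical to the ground truth; a softer, similarity-based match would replace the equality test inside `hit_at_k`.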
The data consists of 50,870 hospital admissions. Different types of admissions are considered: 'ELECTIVE', 'URGENT', or 'EMERGENCY'. Emergency and urgent indicate unplanned medical care, while elective indicates a previously planned hospital admission. Each set of experiments is performed using 5-fold cross validation. For each fold, the dataset was randomly split into training (80%) and test (20%) sets; a model is then trained on the training set and validated on the test set. The performance measures reported by the 5-fold cross validation are the averages of the values computed at each fold.
In a first set of experiments we aimed to identify the best parameter settings for the prediction algorithm. The parameters are α, the symptoms similarity threshold, and β, the diagnoses similarity threshold. Thus, we considered different values of α and β and computed the precision and recall metrics. From the graph one can note that the recall takes lower values for α = 0.8 and α = 0.9. This happens because increasing the similarity threshold between the symptoms makes the similarity constraint more stringent, and consequently the number of successfully predicted diagnoses decreases. According to these results, the best configuration for the symptoms similarity threshold is α = 0.7, which is used throughout the paper. For this first experiment the semantic similarity threshold is set to β = 1, a setting that requires a 100% semantic correspondence between the predicted and ground-truth diagnoses, which is unlikely in a real-life scenario. By slightly lowering the required semantic similarity between the predicted and ground-truth diagnoses (as will be shown in Figure 4), the performance of the approach greatly increases.
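The 5-fold protocol mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the admissions are shuffled once with a fixed seed and partitioned into five disjoint folds, so that each fold in turn serves as the 20% test set; the function name `k_fold_splits` and the `seed` parameter are hypothetical.

```python
import random


def k_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random but reproducible order
    # Deal the shuffled indices into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean(values):
    """Average of the per-fold values of a metric."""
    values = list(values)
    return sum(values) / len(values)
```

With five folds every admission is used for testing exactly once, and each training set holds the remaining 80% of the data.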
Figure 3 shows the number of correct predictions at k ∈ {5, 10, 15, 20, 25, 30} with respect to the dataset size. The graph shows how the fraction of correctly detected ground-truth diagnoses increases with the size of the dataset and with the number of top predicted diagnoses. As discussed above, another key aspect of the proposed prediction method is the diagnoses similarity threshold β, which exploits diagnosis semantics and not just textual similarities between predicted and ground-truth diagnoses. In particular, as for the symptoms, we use sentence embedding techniques to represent the diagnosis semantic information as vectors in the semantic embedding space.
The MIMIC-III dataset maintains diagnosis related group (DRG) codes for each admission. More precisely, a final discharge diagnosis can consist of several terms, depending on the number of DRG codes associated with the admission. DRG codes represent the diagnoses billed for by the hospital. There are three types of DRG codes in the database, which have overlapping ranges but distinct definitions for the codes: 'HCFA' (Health Care Financing Administration), 'MS' (Medicare), and 'APR' (All Payers Registry). HCFA-DRG and MS-DRG codes have multiple descriptions, as they have changed over time. Sometimes these descriptions are similar, but sometimes they denote completely different diagnoses, so both the type and the description must be considered for a given diagnosis. All admissions have an HCFA-DRG or MS-DRG code, but not all admissions have an APR-DRG code. Note that APR-DRG is believed to be an alternative, more specific code that could be used in conjunction with the HCFA codes. Consequently, each term is a pair <DRG_TYPE, DRG_DESCRIPTION>, and each diagnosis is a set of terms. Given a ground-truth diagnosis gtd and a predicted diagnosis pd, we first extract their vectors through the sentence embedding method and then calculate the similarity between them.
The graph shows that precision and recall increase with k and decrease as β increases. This happens because β = 1 requires pd and gtd to be identical. By decreasing β, the similarity constraint is loosened, and the prediction is considered correct even if the diagnoses are not the same but are similar at the β level. This approach allows us to make a greater number of predictions with a high level of similarity between diagnoses.
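The role of the threshold β can be illustrated with a short Python sketch. It assumes that each diagnosis has already been mapped to a single embedding vector and that the similarity between pd and gtd is the cosine of these vectors; how the terms of a diagnosis are combined into one vector is not specified here, and the names `is_match` and `correct_at_k` as well as the tolerance `eps` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_match(pd_vec, gtd_vec, beta, eps=1e-9):
    """A predicted diagnosis matches the ground truth at level beta.

    eps absorbs floating-point rounding, so that identical vectors
    still match when beta = 1.
    """
    return cosine(pd_vec, gtd_vec) >= beta - eps


def correct_at_k(ranked_pd_vecs, gtd_vec, k, beta):
    """True if any of the top-k predicted diagnoses matches at level beta.

    ranked_pd_vecs must already be sorted by decreasing patient similarity.
    """
    return any(is_match(vec, gtd_vec, beta) for vec in ranked_pd_vecs[:k])
```

With β = 1 only predictions whose vectors coincide in direction with the ground truth are accepted, while lower values of β also accept semantically close diagnoses.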
We can conclude that the proposed diagnosis prediction method, based on word embedding and exploiting the semantic similarity of both symptoms and diagnoses, proved to be effective and accurate, reaching considerable precision and recall rates. The obtained outcomes are promising for future developments and extensions of the framework, which could be a valuable means for automatically inferring disease diagnoses.

B. DISTRIBUTED DISCOVERY SERVICE PERFORMANCE
To test the effectiveness of the system, we implemented a Java simulator in which the characteristics of real networks of CDR hosts were carefully considered. First, we evaluated the mean number of messages exchanged by each server, i.e. the traffic generated by the algorithm to obtain a stable and ordered situation. Figure 5 shows the average number of messages handled by each server for different network sizes, that is, for different numbers of clinical servers involved in the logical reorganization. Different mean numbers of connections/neighbors of each server were considered in the experiments. We can note that the algorithm achieves a stable overlay using a limited number of messages, therefore generating a low amount of network traffic. Figure 6 reports the total number of messages exchanged by all servers at each step to reach a stable situation. The network size is set to 5000. Notice that the algorithm converges in a finite number of steps and that the number of messages decreases exponentially.
In order to evaluate the worst case for the organization, i.e. the maximum number of steps required by each CDR host to create the final links to its neighbors in the overlay, a set of experiments was performed, and the results are reported in Figure 7. In this case too, the simulations were executed for different network sizes and for different mean numbers of neighbors. The maximum number of steps required by a server to identify its neighbors in the overlay is, in every case, very low. Moreover, the maximum number of steps to reach the sorting is necessary only for the first run, because no previous order exists and the system is completely disordered. An intuitive informed search mechanism can be designed simply by forwarding the query, at each step, towards the CDR host with the highest similarity value with the target. The procedure ensures that, at each step, the current resource is the one with the highest similarity with the target vector. The search procedure finishes when none of the linked servers improves the similarity value with respect to the current CDR host. Figure 8 reports the mean number of query steps needed to locate the best resource for different numbers of linked CDR hosts. Notice that the number of steps remains limited even for high numbers of CDR hosts involved.
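The informed search described above amounts to a greedy walk over the overlay. The following Python sketch illustrates it under simplifying assumptions: the overlay is given as a dictionary mapping each CDR host to its linked hosts, and the similarity of every host with the target vector is precomputed in a second dictionary. Both data structures and the name `greedy_search` are hypothetical and do not reflect the Java simulator.

```python
def greedy_search(neighbors, similarity_to_target, start):
    """Forward a query towards the most similar linked host until no gain.

    neighbors: dict mapping a host to the list of hosts it is linked to.
    similarity_to_target: dict mapping a host to its similarity with the target.
    Returns the host where the search stops and the number of query steps.
    """
    current = start
    steps = 0
    while True:
        linked = neighbors.get(current, [])
        best = max(linked, key=lambda host: similarity_to_target[host], default=None)
        # Stop when no linked server improves on the current host.
        if best is None or similarity_to_target[best] <= similarity_to_target[current]:
            return current, steps
        current = best
        steps += 1
```

Because the similarity strictly increases at every hop, the walk always terminates; whether it stops at the globally best host depends on how well the overlay is ordered.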

VII. CONCLUSION
In this paper, an artificial intelligence driven Clinical Decision Support system is proposed. The system is able to integrate heterogeneous health data from different sources, and implements a set of intelligent services exploiting innovative machine learning and deep learning approaches to support physicians in diagnosing and treating diseases. In particular, the paper presented a neural network model for predicting patients' future health information. The model is based on patients' similarity in terms of symptoms and diseases. The approach employs word embedding to model the semantic relations of symptoms and diagnoses, and it introduces a mechanism to measure the semantic relationship of different diagnoses in terms of symptoms similarity for the prediction. Experimental results, performed on a real-world EHR dataset, show that the proposed approach is effective and accurate and provides clinically meaningful interpretations. The proposed approach addresses the challenging issue of inferring patient diagnoses by exploiting similarity among patients' medical data. The considered similarity degree takes into account patients' diagnoses and symptoms; therefore, the patients' information gathered at hospital admission, such as blood tests or laboratory results, electronic health records and medical history, has to be available to the platform. Often, however, it is not possible to capture all of the related data automatically, and the data can be incomplete or partially missing; thus, the value presented by a DSS may not be 100% true. Indeed, the patients' semantic models learned from the collected data, which aim to generate a faithful context-based representation of patients' health information, may be inaccurate. Moreover, a fundamental role is played by the text corpus exploited to train the word embedding tool that generates the vectors from medical data.
Generating vector representations of clinical text data through word embedding methods is widely used, but it still presents some limitations. Indeed, the word embedding task involves a model trained on a pre-processed corpus, and many researchers train their own word embeddings on the clinical data available to them, even though many large, freely available datasets exist. This can lead to inaccurate models. It is important to consider several factors, including the domain of the corpus, the size of the training corpus, the hyper-parameters and the characteristics of the model, and the quality of the trained embeddings.
DEBORAH FALCONE is a research fellow at the Institute of High Performance Computing and Networking of the Italian National Research Council (ICAR-CNR), Italy. She received her Master's degree in Computer Engineering and her Ph.D. in Systems and Computer Engineering from the University of Calabria, Italy. In 2013 she was a visiting researcher at the Computer Laboratory of the University of Cambridge, UK. She has worked as a research fellow and as a teaching assistant in concurrent programming and object-oriented programming, and has experience in designing, developing and testing distributed and mobile applications, together with a good data science background and experience in social media data analysis. Her research interests include artificial intelligence, big data analysis and mining, social network data analysis and mining, and health informatics.