Multi-Source Medical Data Integration and Mining for Healthcare Services

With the advent of the Internet of Health (IoH) age, traditional medical or health services are gradually migrating to the Web or Internet and have been producing a considerable amount of medical data associated with patients, doctors, medicine, medical infrastructure and so on. Effective fusion and analyses of these IoH data are of positive significance for scientific disease diagnosis and medical care services. However, IoH data are often distributed across different departments and contain sensitive user information. Therefore, it is often a challenging task to effectively integrate or mine the sensitive IoH data without disclosing user privacy. To overcome the above difficulty, we put forward a novel multi-source medical data integration and mining solution for better healthcare services, named PDFM (Privacy-free Data Fusion and Mining). Through PDFM, we can search for similar medical records in a time-efficient and privacy-preserving manner, so as to offer patients better medical and health services. A group of experiments are designed and implemented to demonstrate the feasibility of the proposal in this work.


I. INTRODUCTION
With the ever-increasing popularity of Information Technology and the gradual adoption of digital software in the medical or health domains, various medical departments or agencies have accumulated a considerable amount of historical data (e.g., patients' medical records, health treatment solutions and so on), which form a main source of big Internet of Health (IoH) data [1]. The utilization degree of such IoH data is a key criterion to evaluate and quantify the informatization level of medical or health units or departments [2].
Generally, most historical IoH data records contain valuable information, especially for medical or health agencies, such as the past diseases of a patient at a given time point. Mining and analyzing such historical IoH data records can contribute much to doctors' scientific and reasonable diagnosis and treatment decision-making, as well as disease trend prediction and precaution [3]. Therefore, it is urgently necessary to collect, integrate, fuse and analyze these multi-source IoH data records for high-quality healthcare services suitable for patients.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
However, historical IoH data records from patients often contain sensitive patient privacy (e.g., blood pressure, temperature), as a patient is often not willing to let others know his or her historical diseases [4]. Therefore, the patients or the stakeholders of historical IoH data records dare not disclose their IoH data records to the public. In addition, they lack sufficient incentive to share IoH data records with others. These two concerns significantly block the utilization of historical IoH data records. As a consequence, although many hospitals and other medical and health agencies have accumulated a considerable amount of historical IoH data records, they seldom release the data to the outside due to privacy concerns. Furthermore, the historical IoH data records are often distributed across different platforms or agencies, the integration and fusion of which further increases the privacy disclosure concerns.
Considering the above challenge, we use hash techniques to protect private data when the multi-source IoH data are integrated for subsequent IoH data mining and analyses. As hash techniques are one-way data mappings, the goal of privacy protection can be achieved accordingly.
In summary, the major contributions of the work in this paper are three-fold.
(1) We introduce LSH (Locality-Sensitive Hashing) into multi-source IoH data fusion and integration so as to protect the sensitive patient information hidden in the historical IoH data.
(2) For the IoH data without patient privacy after LSH process, we bring forth a similar IoH data record search method for subsequent IoH data mining and analyses, so as to balance the IoH data availability and privacy.
(3) Based on a dataset collected by real-world users, we validate the advantages of the proposed work in this paper, through a set of pre-designed experiments.
The remainder of this paper is structured as follows. We review the current research status of the field in Section II to further show the novelty of our work. In Section III, we use an intuitive example to motivate our research. The proposed multi-source medical data integration and mining method is specified in Section IV. Experiment comparisons are presented in Section V. At last, in Section VI, conclusions are drawn and future research directions are described in detail.

II. RELATED WORK
Many researchers have devoted themselves to the research of multi-source big data integration, as well as the resulting sensitive data protection problems. In this section, we summarize the current research status as follows.

A. ENCRYPTION
Encryption is a classic and effective way to secure sensitive user data, and it has been investigated for a long time. In [5], Peng T. et al. brought forth a multi-keyword ranking-based secure search method, which adopts symmetric searchable encryption to permit a user to perform secure information retrieval over an encrypted dataset based on multiple keywords. The advantage is that it realizes secure service protection for cloud computing with limited resources. The disadvantage is that its computational efficiency is not high enough; besides, key disclosure risks are also present. In [6], Dai H. et al. introduced an elliptic curve encryption method to realize secure data use and proved that the elliptic-curve-based method is superior to the traditional FP-based method. The advantage is a relatively high data security performance. However, it only considers simple Boolean-value-based keyword search, which narrows the method's application scope to some extent. In [7], Phuong T. V. X. et al. employed the vector space model and homomorphic encryption to realize encrypted data ranking, as well as multi-keyword data retrieval and file retrieval. The advantage is that it can guarantee high-quality data protection. The disadvantage is that it brings additional computational time and communication costs that are often very high. For the sortable multi-keyword data encryption problem, the authors in [8] put forward a homomorphic encryption-based data retrieval method to aid data stakeholders, in which each data item ready to be searched is homomorphically encrypted during the information retrieval process. The proposal can satisfy most secure data processing requirements; however, it cannot support fuzzy retrieval.

B. DIFFERENTIAL PRIVACY
A differential privacy-based improved collaborative filtering method named IPriCF was put forward in [9] to secure user privacy involved in the collaborative data integration process. By dividing user data and item data, IPriCF can effectively eliminate the disruption brought by the noise introduced by differential privacy. This method can balance user data privacy and the accuracy of the recommended list. A stakeholder-feature-item matrix was built in [10] to analyze sparse data and provide optimal services. The authors can guarantee the privacy-preservation of the involved data while maintaining an acceptable prediction accuracy loss. A differentially private matrix factorization method named DPMF was brought forth in [11]: matrix factorization was used to convert sensitive user data into latent low-dimensional vectors, while differential privacy was used to perturb the targeted objective functions. However, when the number of dimensions grows, the prediction accuracy drops accordingly. The authors in [12] improved the TrustSVD model by introducing differential privacy and further proposed a new model named DPTrustSVD. The new model can effectively reach a tradeoff among data privacy, data sparsity and data availability. Other similar work includes [13], in which the authors combined differential privacy and Huffman coding to put forward a privacy-aware location segments publishing method, and [14], in which the authors combined differential privacy, Bayesian networks and entropy theory to bring forth a protection method for high-dimensional data.

C. ANONYMIZATION
Anonymization is an effective way to secure sensitive user data when performing big data analyses and mining [15]. By hiding certain sensitive information (e.g., name, identity card number) contained in the data, anonymization can publish the rest of the data (i.e., the data after anonymization) to the public so as to achieve a tradeoff between data privacy and availability [16]. A K-anonymity solution is adopted in [17] to hide the key sensitive information involved in the data-driven decision-making process. A K-anonymity-based user location protection method is suggested in [18] to hide the real user location or position. Although the above solutions can help hide sensitive user data when performing data-driven business analyses and applications, they cannot balance data privacy and data utilization well, as anonymized data inevitably lose certain key information.
With the above summarization, it is noticed that although many big data fusion and mining solutions (such as [19]-[24]) have been proposed, they cannot make a balanced tradeoff among multiple conflicting criteria such as data security, data availability and so on. Considering this drawback of the existing literature, we look for better multi-source data fusion methods through another kind of privacy-preservation technique, i.e., hashing. The details of our suggested solution will be described step by step in the following sections.

III. MOTIVATION
Figure 1 shows the motivation of our paper, in which the doctor-medicine-nurse medical records of patients are partially located in cloud platforms cp 1 and cp 2 , respectively. To comprehensively mine the valuable information from the IoH data distributed across platforms cp 1 and cp 2 , we need to integrate and fuse these multi-source data for uniform data analyses and make more scientific healthcare decisions.
However, in the above IoH data fusion and analyses process, privacy concerns are often raised, as the historical IoH data records often contain partially sensitive information of patients. To encourage platforms cp 1 and cp 2 to release their IoH data records and alleviate the patients' privacy disclosure concerns, it is necessary to develop a novel data fusion method that does not reveal privacy.
Therefore, we explore a multi-source IoH data fusion method without privacy concerns in the next section. Furthermore, to better describe the details of our suggested data fusion method, we summarize the used symbols and their respective meanings in TABLE 1.

IV. APPROACH
This section presents our proposed multi-source IoH data fusion and mining method, whose major procedure can be generalized with the following steps: First, the sensitive IoH data are projected based on LSH functions. Second, according to each IoH data record and its corresponding hash values derived after hash projection, we create a set of hash tables without patient privacy. Third, according to the derived hash tables, we perform similar IoH data search and mining. The three detailed steps are listed in FIGURE 2.
(1) Step-1: LSH-based IoH data projection.
We use R i .q j to denote the value of dimension q j (j = 1, 2, . . . , m) of historical IoH data record R i (i = 1, 2, . . . , n) from a patient. As R i .q j is often sensitive to the patient, we need to protect the private information of R i .q j when R i .q j is published to the public. In this step, we use the LSH strategy to achieve this goal.
For R i (i = 1, 2, . . . , n), it has m criteria q 1 , . . . , q m . Therefore, the health information corresponding to R i can be depicted by the vector − → R i = (R i .q 1 , . . . , R i .q m ). When releasing − → R i to others, we need to make an LSH projection first. Concretely, we create a new vector V = (v 1 , . . . , v m ), where v j (j = 1, 2, . . . , m) is a randomly generated value from the domain [−1, 1]. Thus, we create an LSH function f as in equation (1):

f( − → R i ) = − → R i · V = R i .q 1 * v 1 + . . . + R i .q m * v m (1)
Afterwards, we can get f( − → R i ), which may be either positive or negative. Next, we make the following mapping as in equation (2):

h(R i ) = 1 if f( − → R i ) > 0; h(R i ) = 0 otherwise (2)

Thus, through this mapping, we convert f( − → R i ) into a binary value h(R i ): 0 or 1. The concrete procedure is shown in Algorithm 1.
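As an illustrative sketch (not the authors' Algorithm 1), equations (1) and (2) can be expressed in a few lines of Python; the record values, dimensionality m and random seed below are hypothetical:

```python
import random

def lsh_project(record, v):
    # Equation (1): signed projection of the record vector onto random vector V
    return sum(r_q * v_j for r_q, v_j in zip(record, v))

def lsh_hash(record, v):
    # Equation (2): map the signed projection to a binary hash value
    return 1 if lsh_project(record, v) > 0 else 0

m = 4                                              # number of criteria q_1..q_m (hypothetical)
random.seed(7)
V = [random.uniform(-1.0, 1.0) for _ in range(m)]  # random vector V drawn from [-1, 1]^m
R_i = [120.0, 36.6, 72.0, 98.0]                    # a hypothetical IoH record (e.g., vital signs)
print(lsh_hash(R_i, V))                            # a single binary hash value
```

Because only the binary output h(R i ) is released, the raw criterion values never leave the data owner.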
The output of Step-1 can be regarded as a hash value of R i obtained through one projection. However, one projection is not enough to convert R i into a privacy-free index. Considering this, we repeat Algorithm 1 multiple times with projections f 1 , . . . , f a , after which we get an a-dimensional hash vector H(R i ) as in equation (3):

H(R i ) = (h 1 (R i ), . . . , h a (R i )) (3)

Thus the mappings ''R i → H(R i )'' (i = 1, 2, . . . , n) constitute a hash table, denoted by ''T''. In other words, through ''T'', we can query the index value of R i ; on the contrary, given an index value H(R i ), we cannot infer the real value of R i . This way, the privacy of patients contained in R i is secured.
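A minimal sketch of the hash vector H(R i ) in equation (3) and the table ''T'', assuming hypothetical record values and parameters (a = 6, m = 3):

```python
import random

def make_family(a, m, seed=0):
    # One random vector per LSH function f_1 ... f_a
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(a)]

def H(record, family):
    # Equation (3): a-dimensional binary hash vector of a record
    return tuple(1 if sum(q * v for q, v in zip(record, vec)) > 0 else 0
                 for vec in family)

# Hypothetical records; only the mapping "record id -> H(R_i)" is published, not raw values
records = {"R1": [120.0, 36.6, 72.0],
           "R2": [118.0, 36.5, 70.0],
           "R3": [-80.0, 39.1, -110.0]}
family = make_family(a=6, m=3, seed=42)
T = {rid: H(rec, family) for rid, rec in records.items()}   # hash table T
```

The table maps each record id to a binary index that can be queried but not inverted.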
(2) Step-2: Privacy-free hash table creation.
However, one hash table ''T'' cannot always accurately reflect the real index of each IoH data record. Considering this limitation, we repeat the creation process of ''T'' several (i.e., b) times, thereby generating b hash tables T 1 , . . . , T b .
(3) Step-3: Hash tables-based similar IoH data search and mining.
In Step-2, b tables T 1 , . . . , T b are generated, and each table holds a set of corresponding ''R i → H(R i )'' (i = 1, 2, . . . , n) pairs. Furthermore, H(R i ) is approximately regarded as the index of R i in the table. According to Locality-Sensitive Hashing theory, IoH data records with the same index are approximately similar. As a result, if two records R 1 and R 2 share the same index, then R 1 and R 2 are probably similar records. This way, we can mine the potentially similar IoH data records by checking their respective index values without disclosing much privacy. However, for two IoH data records R 1 and R 2 , H(R 1 ) = H(R 2 ) is a rather rigid constraint, as each dimensional value of H(R 1 ) should be exactly equal to that of H(R 2 ). Such a rigid constraint is apt to produce an empty result in the similar IoH data record search, which renders privacy-free IoH data fusion and mining meaningless.
Considering this drawback, we relax the above rigid condition by using multiple hash tables instead of only one. Concretely, considering the b tables created in Step-2, i.e., T 1 , . . . , T b , if H(R 1 ) = H(R 2 ) holds in any T y (y = 1, 2, . . . , b), then we simply conclude that R 1 and R 2 are probably similar IoH data records. Thus, the similar IoH data record search condition is relaxed accordingly. Therefore, for a specific IoH data record R x , we can look for its similar record set Sim_Set (R x ) through the above idea. Details of this step are presented in Algorithm 3. Finally, we return Sim_Set (R x ) as the final output of the proposal in this work.
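Steps 2 and 3 can be sketched as follows; the records, parameters (a = 4, b = 8) and helper names are our own illustrative assumptions, not the authors' algorithms:

```python
import random

def index_of(rec, family):
    # a-dimensional binary LSH index (Step-1 repeated a times)
    return tuple(1 if sum(q * v for q, v in zip(rec, vec)) > 0 else 0 for vec in family)

def build_tables(records, a, b, m, seed=0):
    # Step-2: b hash tables, each built from an independent family of a LSH functions
    rng = random.Random(seed)
    tables = []
    for _ in range(b):
        family = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(a)]
        table = {}
        for rid, rec in records.items():
            table.setdefault(index_of(rec, family), set()).add(rid)
        tables.append((family, table))
    return tables

def sim_set(rx_id, records, tables):
    # Step-3: records sharing R_x's index in ANY of the b tables (relaxed condition)
    result = set()
    for family, table in tables:
        result |= table.get(index_of(records[rx_id], family), set())
    result.discard(rx_id)
    return result

records = {"R1": [120.0, 36.6, 72.0],
           "R2": [119.0, 36.5, 71.0],
           "R3": [-80.0, 39.1, -110.0]}
tables = build_tables(records, a=4, b=8, m=3, seed=1)
print(sim_set("R1", records, tables))  # candidate similar records for R1
```

Only index values are compared during the search, so similar-record mining proceeds without touching the raw records.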

V. EXPERIMENTS
To validate the effectiveness of our solution in dealing with privacy-free data fusion and mining (abbreviated as PDFM), a group of experiments are designed and compared with existing approaches.

A. CONFIGURATION
We use the public data released at http://inpluslab.com/wsdream/ for simulation purposes. In the dataset, each user-item-QoS triple is taken as a patient-criterion-value triple in multi-source IoH data fusion scenarios. 90% of the dataset entries are used to train the parameters of the data fusion and mining model, while the remaining 10% are employed for test and validation.
To show the competitive advantages of PDFM, the UCF (baseline) and ICF methods are compared with PDFM. The compared metrics include missing data prediction accuracy (Mean Absolute Error) and computational time (s). The software and hardware configurations include: a 2.80 GHz processor, 8.0 GB memory, the Windows 7 operating system and JAVA 8. We run each experiment 50 times and report the average results.

1) MEAN ABSOLUTE ERROR COMPARISON
We measure and compare the Mean Absolute Error of the three methods. The parameter settings are as follows: the user volume is 339, the item volume is varied from 1000 to 5000, and a = b = 10. Two sets of experiments are conducted: first, we test the variation trend of the Mean Absolute Error of the three methods with the change of the number of items in the dataset; second, we test the variation trend with the change of the number of users in the dataset.
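For reference, the Mean Absolute Error compared here is the standard metric; a minimal sketch (with made-up numbers):

```python
def mean_absolute_error(actual, predicted):
    # MAE = (1/N) * sum of |actual_i - predicted_i| over the N test entries
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(mean_absolute_error([3.0, 1.0, 4.0], [2.5, 2.0, 4.0]))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```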
Comparison results are shown in Fig.3. In both Fig.3(a) and Fig.3(b), we can observe a clear advantage of PDFM and UCF over ICF. As UCF is the baseline method and PDFM is an approximate solution to UCF, PDFM achieves a Mean Absolute Error close to that of UCF, because the LSH strategy adopted in PDFM promises a good similarity-maintenance property. Moreover, PDFM has the advantage of privacy-preservation capability, which UCF does not have.

2) COMPUTATIONAL TIME COMPARISON
We measure and compare the computational time of three methods. The parameter settings are as follows: the user volume is varied from 100 to 300, item volume is varied from 1000 to 5000, a = b = 10. Compared data are reported in Fig.4.
As can be seen from Fig.4, the consumed time of the three methods approximately grows as the number of users or the number of items rises. Specifically, UCF and ICF consume more time than PDFM, as heavy-weight user similarity calculation or item similarity calculation is required in UCF and ICF, respectively. In PDFM, the time cost can be divided into two parts: (1) hash table creation, which can be finished offline and thus contributes little to the online response time; (2) similar IoH data record retrieval, which needs to be done online and whose time complexity is O(1). As a result, PDFM can often return similar IoH data records within a small response time and, hence, our method can be applied to big IoH data environments.

3) MEAN ABSOLUTE ERROR OF PDFM
The PDFM method is based on the LSH strategy, whose performance is often related to some key factors, including parameters a and b. As reported in Fig.5, the Mean Absolute Error of PDFM increases with the rise of parameter b and the decline of parameter a. This is due to the following reasons: (1) when there are more hash tables (i.e., b increases), the similar IoH data record retrieval condition becomes looser; as a result, more similar records are returned and, correspondingly, the Mean Absolute Error rises; (2) when there are more hash functions (i.e., a increases), the retrieval condition becomes stricter; as a result, fewer similar records are returned and, correspondingly, the Mean Absolute Error decreases. Moreover, we can observe that more hash functions (i.e., a larger a) and fewer hash tables (i.e., a smaller b) bring better prediction accuracy.

4) NUMBER OF RETURNED RESULTS OF PDFM
As analyzed above, the PDFM method is based on the LSH strategy, whose returned result volume is often related to key factors such as parameters a and b. Considering this, we observe the returned result volume of PDFM with respect to a and b. The parameter settings are as follows: the user volume is 339, the item volume is 5825, a = {2, 4, 6, 8, 10}, and b = {2, 4, 6, 8, 10}. Compared data are reported in Fig.6. As reported in Fig.6, the returned result volume of PDFM increases with the rise of parameter b and the decline of parameter a. This is due to the following reasons: (1) when there are more hash tables (i.e., b increases), the similar IoH data record retrieval condition becomes looser; as a result, more similar records are returned; (2) when there are more hash functions (i.e., a increases), the retrieval condition becomes stricter; as a result, fewer similar records are returned. Moreover, we can observe that more hash functions (i.e., a larger a) and fewer hash tables (i.e., a smaller b) bring fewer returned results.
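The opposite effects of a and b on the returned result volume can be reproduced with a small simulation; the 200 random records below are toy data of our own, not the dataset used in the experiments:

```python
import random

def index_of(rec, family):
    # a-dimensional binary LSH index of a record
    return tuple(1 if sum(q * v for q, v in zip(rec, vec)) > 0 else 0 for vec in family)

def returned_volume(records, query, a, b, seed=0):
    # Count distinct records sharing the query's index in at least one of b tables
    rng = random.Random(seed)
    found = set()
    for _ in range(b):
        family = [[rng.uniform(-1.0, 1.0) for _ in range(len(query))] for _ in range(a)]
        qi = index_of(query, family)
        found |= {rid for rid, rec in records.items() if index_of(rec, family) == qi}
    return len(found)

rng = random.Random(3)
records = {f"R{i}": [rng.uniform(-1.0, 1.0) for _ in range(5)] for i in range(200)}
query = records["R0"]
loose  = returned_volume(records, query, a=2, b=4)   # few functions -> loose condition
strict = returned_volume(records, query, a=8, b=4)   # more functions -> strict condition
wide   = returned_volume(records, query, a=2, b=10)  # more tables -> even more candidates
```

With a fixed seed, enlarging b only adds tables (and thus candidates), while enlarging a shrinks each bucket.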

C. FURTHER DISCUSSIONS
The growing popularity of big data technology has enabled a number of successful big IoH applications [25]-[27]. However, our suggested IoH data fusion and mining method, PDFM, still has several limitations.
(1) First of all, we only consider simple IoH data of continuous type, without considering the possible data type diversity (e.g., continuous data, discrete data, Boolean data) and data structure diversity.
(2) Second, how to measure the capability of securing sensitive patient information is still not introduced in PDFM.
(3) Third, how to further fuse different privacy protection solutions [28]-[31] for better performance is still an open problem that calls for future study.
(4) There is often an intrinsic tradeoff between data privacy and data availability, so reducing data availability is inevitable when protecting data privacy. As a result, our proposed LSH-based privacy-aware data fusion method cannot always guarantee 100% data availability. However, due to the inherent property of LSH, our proposal can guarantee 99.99% prediction accuracy if appropriate parameters are selected.
(5) At last, the reason we choose the LSH technique for the privacy protection goal is that LSH has a good similarity-keeping property. Concretely, if two points are neighboring points, then they will, with high probability, be projected into the same bucket after hash projection by LSH.

VI. CONCLUSION
Effective fusion and analyses of IoH data are of positive significance for scientific disease diagnosis and medical care services. However, the IoH data produced by patients are often distributed across different departments and contain sensitive patient information. Therefore, it is often a challenging task to effectively integrate or mine the sensitive IoH data without disclosing patient privacy. To tackle this challenge, we bring forth a novel multi-source medical data integration and mining solution for better healthcare services, named PDFM. Through PDFM, we can search for similar medical records in a time-efficient and privacy-preserving manner, so as to provide patients with better medical and health services. Experiments on a real dataset prove the feasibility of PDFM.
In future research, we will update the suggested PDFM method by considering the possible diversity of data types [32]-[34] and data structures [35]-[38]. In addition, how to fuse multiple existing privacy solutions for better performance is still an open problem that requires intensive and continuous study.