Secure User Privacy in Population Physique Clustering and Prediction Based on Sport Questionnaires

Population physique is a key aspect for measuring and evaluating the health level or living standard of a nation's population. Many population physique evaluation models and tools have been developed in the past decades, among which the questionnaire is an essential and promising means of measuring population physique. Typically, by designing, distributing and collecting a variety of sport questionnaires associated with people's living conditions or sport preferences, we can quantify, observe and cluster the health level of a whole nation's population objectively and scientifically. However, such sport questionnaire-based population physique analysis methods are often time-consuming, as a considerable number of sport questionnaires need to be compared. Moreover, the sport questionnaires filled out by people often contain sensitive information that is not supposed to be disclosed to others, which calls for appropriate privacy protection measures. Motivated by these two challenges, we introduce hash techniques into the population physique evaluation process and propose a hash-based population physique clustering and prediction method that is efficient in time cost and effective in privacy protection. Finally, we design experiments based on a dataset to demonstrate the effectiveness of our proposal.


I. INTRODUCTION
With the economic growth and social progress of the last decades, people's living conditions have improved considerably in both physical and mental aspects. As a result, more and more people are gradually shifting their attention from making a living to quality of life, focusing on various health-promoting measures such as a scientific and healthy diet, sufficient and high-quality sleep, and moderate sport activities [1], [2]. Therefore, in recent years, the healthcare industry has developed rapidly and has become an important part of the economic system of a whole nation or society. Correspondingly, healthcare-related research (e.g., intelligent medical care, scientific exercise) is also becoming a hot and significant research topic that draws wide attention from both academia and industry [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Gautam Srivastava.
As one of the popular healthcare-related research topics, population physique monitoring has always been a hot research direction, as it can effectively reflect the current health and development conditions of all the individuals of a nation. A classic way to achieve this goal is through statistics-related techniques. Typically, to realize population physique monitoring, we can design, distribute and collect a variety of sport questionnaires associated with people's living conditions or sport preferences. Through the collected sport questionnaires, we can then quantify, observe and evaluate the health level of the whole nation's population objectively and scientifically. Furthermore, a variety of related applications can be developed from the sport questionnaires, such as population physique prediction over time, population physique trend analysis, and population clustering based on physique.
However, several critical challenges exist in the above sport questionnaires-based population physique monitoring and utilization process. First, traditional population physique analysis solutions based on sport questionnaires often respond slowly, as a huge number of collected sport questionnaires need to be integrated and compared. In addition, sport questionnaires collected from individuals are usually sensitive, as they often include personal information. In most cases, individuals are reluctant to release their private information to a third party.
Therefore, it is becoming necessary to develop novel techniques or solutions to cope with the two abovementioned challenges in the sport questionnaires-based population physique monitoring and utilization process. Motivated by this observation, we introduce hash techniques with privacy-preservation effects into population physique analysis and prediction, to further improve the performance of present population physique monitoring and utilization.
In summary, the major contributions of this research work are three-fold.
(1) We recognize the significant importance of hash techniques in securing the sensitive information hidden in sport questionnaires from individuals.
(2) We utilize hash techniques to create the healthcare status indices for individuals and use the indices to achieve time-efficient and privacy-free population physique clustering and prediction.
(3) A variety of simulation experiments are enacted and developed based on a real-world dataset. Reported experimental results show the feasibility of our algorithm in coping with sport questionnaires-based population physique monitoring and utilization.
The rest of this paper is organized as follows. We review the up-to-date research work in the same field in Section II to better outline the research significance of our paper. Section III presents a real-world example to demonstrate the motivation of the paper more clearly. We clarify the concrete steps of our suggested sport questionnaires-based population physique clustering and prediction method in Section IV. Evaluations are made in Section V. Finally, we summarize the research significance as well as its advantages and disadvantages in Section VI.

II. RELATED WORK
In this section, we review the current research status of privacy-free data analysis and prediction, and compare the related work in the existing literature with a discussion of the respective advantages and disadvantages, so as to further outline the research significance of this work.

A. DATA ENCRYPTION FOR PRIVACY PROTECTION
As a classical means of data protection, encryption has been studied for thousands of years. The authors have focused on multiple keywords-driven information search with privacy-preservation [5]. This method uses a symmetric public key mechanism to achieve data encryption when searching for the pre-designated multiple keywords. Although secure data transmission and search are achieved, this method often falls short in time efficiency. The authors use an elliptic curve encryption mechanism to pursue privacy-aware information reuse and show the competitive advantages of the proposal compared to existing solutions [6]. Although this solution can achieve high privacy protection capability, its applicable domain is somewhat narrow, as it is specifically designed for Boolean-type data-driven information retrieval. The authors adopt a homomorphic approach for secure data transmission [7]. Although this solution offers high privacy-protection performance, additional data transmissions among different parties are inevitable. As a consequence, the computational cost increases accordingly. Similar work can be found in [8], where a homomorphic approach is adopted to secure sensitive keywords that are ready to be retrieved. Although this solution can protect data privacy better, fuzzy keyword search arising from users' undetermined requirements is not considered, which decreases the proposal's comprehensiveness.

B. DIFFERENTIAL PRIVACY FOR DATA PROTECTION
Differential privacy has recently been put forward for measurable and computational privacy protection in various applications associated with sensitive information. The authors combine differential privacy and collaborative techniques for trusted data protection when performing multiple-party information fusion and integration [9]. Although this solution can alleviate the noise issues raised by differential privacy, the computational cost is still high. Similar work is done in [10], where a tradeoff between privacy protection and data utilization is achieved. In [11], the authors combine differential privacy and Matrix Factorization techniques for missing value prediction that considers both prediction accuracy and privacy disclosure probability. This solution can secure user privacy to some extent; however, the data availability and prediction precision are reduced as the privacy-protection performance grows. Differential privacy and trust information are combined in [12] to balance the performance of privacy protection and data use. Other related work includes [13], where differential privacy and Huffman Coding are fused for location privacy protection, and [14], where the authors combine differential privacy and information entropy for sensitive information protection.
However, the abovementioned differential privacy-based data protection solutions often suffer from the noise disturbance and high computational cost raised by differential privacy itself.

C. ANONYMIZED DATA FOR PRIVACY PROTECTION
Eliminating or generalizing the sensitive user identity information from the data to be released is called data anonymization or generalization, and has been widely used in various domains [15]. Typical user identity information includes user name, user identity ID, and so on. After eliminating this sensitive information, the remaining data, which carry little privacy, can be opened to a third party for reuse [16]. Motivated by this advantage, the authors introduce a K-anonymity strategy for expert system decisions that involve a considerable amount of private information [17]. Similar work is done in [18] to secure the sensitive locations of users. In summary, the abovementioned research work can all secure private information to some extent; however, some valuable information is inevitably dropped after the data are anonymized.
VOLUME 8, 2020
The above analyses indicate that although many research works have paid attention to value extraction from massive data [19]-[24], data privacy and data utilization cannot be balanced well due to the natural tradeoff between them. Motivated by this observation, we introduce hash technology into data analysis and prediction when sport questionnaires-based population physique monitoring and utilization are performed.

III. MOTIVATION
We use the example shown in Fig. 1 to motivate our research in this article. In the scenario, Jack and Carolina both have their respective answered sport questionnaires (e.g., the question "sport frequency" with five answer choices: {Very often, Often, Sometimes, Rarely, Never}), recorded in platforms $p_1$ and $p_2$, respectively.
To perform population physique clustering and prediction, we need to compare the sport questionnaires from Jack and Carolina so as to analyze their physique similarity or physique differences. However, as seen in the scenario, the sport questionnaires often contain private information of Jack and Carolina. As a result, analyzing these distributed sport questionnaires from different people often inevitably discloses the people's privacy. In addition, because there are so many people, each with their own sport questionnaires, comparing and analyzing the existing sport questionnaires is often a time-consuming computational task that cannot be completed within a limited time period.
In view of the above limitation analyses, we employ hash techniques that can make privacy-free similarity calculation to perform the above population physique clustering and prediction task. Concrete procedure will be described in detail in the following section. Besides, for formal description and solving of the problem focused in this article, here, we specify the symbols and their meanings with Table 1.

IV. PPCP METHOD
The proposed population physique clustering and prediction method is named PPCP. The basic rationale of PPCP is as follows: we first convert each sensitive sport questionnaire into a privacy-free index value through hash. Then we find similar individuals in a privacy-free manner based on hash theory. Finally, we cluster or predict the population physiques with the obtained similar individuals. The concrete procedure of PPCP is briefly introduced in Fig. 2.
(1) Step-1: From sport questionnaires to individual indices.
As presented in the example of Fig. 1, an individual's sport questionnaire is often constituted by a set of pre-designed multiple-choice questions, represented by $q_1, \ldots, q_m$ as formalized in Table 1. Each question in the sport questionnaire is accompanied by several choices, such as the five choices in Fig. 1, i.e., {Very often, Often, Sometimes, Rarely, Never}. Such discrete textual descriptions cannot be directly used for the population physique clustering and prediction that are the focus of this research work.
Therefore, we convert the textual descriptions of the choices of the multiple-choice questions into numeric expressions, so as to ease the subsequent calculation. Typically, we can use the scale 1∼Z to represent the multiple choices of the questions in a sport questionnaire. Taking the example in Fig. 1 for illustration: the five choices in the answer set {Very often, Often, Sometimes, Rarely, Never} can be converted into the five elements in the set {5, 4, 3, 2, 1}, respectively (assuming that a larger number means more frequent sport exercise, and vice versa). In other words, parameter Z is equal to 5 here.
However, different multiple-choice questions in the same sport questionnaire do not always share the same Z value. Consider the following multiple-choice question with three answer choices. In this example, Z is equal to 3 and the three choices are converted into the three elements of the set {3, 2, 1}, respectively.
Question: sport strength ratings.
After this conversion, the value of each question is normalized by equation (1) if the question is positive and by equation (2) otherwise. Here, $Norm(q_i)$ denotes the normalized value of the original value of dimension $q_i$; $min(q_i)$ and $max(q_i)$ represent the minimal and maximal values of dimension $q_i$, respectively. For example, in the example in Fig. 1, $min(q) = 1$ and $max(q) = 5$. Besides, $value(q_i)$ denotes a concrete value of dimension $q_i$; for example, in the above example, $value(q)$ = 1, 2, 3, 4 or 5. This way, we obtain a set of normalized values for the m dimensions $\{q_1, \ldots, q_m\}$, i.e., $\{Norm(q_1), \ldots, Norm(q_m)\}$, each falling into the range [0, 1].
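The conversion and normalization described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the function names `encode_answer` and `normalize` are hypothetical, and the sketch assumes that (1) and (2) are the usual min-max formulas, i.e., (value − min)/(max − min) for a positive question and its complement (max − value)/(max − min) for a negative one.

```python
from typing import Sequence


def encode_answer(answer: str, choices: Sequence[str]) -> int:
    """Map a textual choice onto the scale Z..1 (first listed choice -> Z)."""
    z = len(choices)
    return z - choices.index(answer)  # raises ValueError for an unknown answer


def normalize(value: int, lo: int, hi: int, positive: bool = True) -> float:
    """Assumed min-max normalization into [0, 1]; negative questions are flipped."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo)
    return scaled if positive else 1.0 - scaled


# The five-choice "sport frequency" question from Fig. 1 (Z = 5).
frequency_choices = ["Very often", "Often", "Sometimes", "Rarely", "Never"]
value = encode_answer("Often", frequency_choices)
norm = normalize(value, lo=1, hi=5)
```

Whether a question counts as positive or negative is a property of the questionnaire design, so the `positive` flag would have to be supplied per question.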
Thus, through (1)-(2), we can convert the sport questionnaire of each individual $P_x$ ($1 \le x \le n$) into a normalized vector $Vec(P_x) = (Norm(q_1), \ldots, Norm(q_m))$ in which each element belongs to the range [0, 1]. $Vec(P_x)$ is typically sensitive, as its elements $Norm(q_1), \ldots, Norm(q_m)$ often contain private information about the individual. Next, we try to minimize the amount of privacy contained in $Vec(P_x)$ through the well-known hash technology.
Concretely, we generate an m-dimensional vector $X = (v_1, \ldots, v_m)$ where each dimension's value falls into the range [−1, 1]. We then use the hash function f in equations (3)-(4) to project $Vec(P_x)$, which carries privacy, onto a Boolean value of 0 or 1 that carries little privacy. In (3), "·" represents the dot product; in other words, for two vectors $A = (a_1, a_2, \ldots, a_n)$ and $B = (b_1, b_2, \ldots, b_n)$, $A \cdot B = a_1 * b_1 + a_2 * b_2 + \ldots + a_n * b_n$. Thus, through the "·" operation, the value of $f(Vec(P_x))$ is positive, negative or 0. Then, according to the projection function in (4), we can further convert $f(Vec(P_x))$, which carries partial individual privacy, into a Boolean value $h(P_x)$ with no privacy. This way, through the two-step hash projection process in (3)-(4), the privacy protection goal is partially achieved. The concrete procedure of this step is available in Algorithm 1.
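The two-step projection can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the paper: the names `make_projection`, `f` and `h` are chosen here for readability, (3) is taken to be the dot product of $Vec(P_x)$ with $X$, and (4) is assumed to map a positive result to 1 and everything else (including 0) to 0, since the treatment of the zero case is not given in the text.

```python
import random
from typing import List, Sequence


def make_projection(m: int, rng: random.Random) -> List[float]:
    """Draw an m-dimensional vector X with every entry in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(m)]


def f(vec: Sequence[float], x: Sequence[float]) -> float:
    """Dot product of the normalized questionnaire vector with X."""
    if len(vec) != len(x):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(vec, x))


def h(vec: Sequence[float], x: Sequence[float]) -> int:
    """Boolean projection; assumption: only a strictly positive f maps to 1."""
    return 1 if f(vec, x) > 0 else 0


rng = random.Random(42)          # fixed seed only to make the sketch repeatable
x = make_projection(3, rng)
bit = h([0.75, 0.0, 1.0], x)     # one Boolean value for one individual
```

Only the bit leaves the platform in this sketch; the normalized vector and the raw answers stay local.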
According to the similarity-preserving nature of hash theory, two individuals with similar index values are likely to be similar themselves. Building on this property, we can use the index values $H(P_1), \ldots, H(P_n)$, which contain less privacy, to search for similar individuals who have similar physiques, instead of using the sensitive $answer_x(q_1), \ldots, answer_x(q_m)$ ($1 \le x \le n$).
Concretely, we put the $(P_x, H(P_x))$ ($1 \le x \le n$) pairs in a hash table T. Due to the probabilistic nature of hash techniques, multiple hash tables $T_1, \ldots, T_N$ are generated by repeating the above process N times. The detailed procedure is available in Algorithm 2.
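A possible way to build the N tables is sketched below in Python. Several points are assumptions made for illustration only: the index value $H(P_x)$ is taken to be the tuple of M Boolean values produced by M independent random projections, each table is stored as a mapping from index value to the list of individuals sharing it (rather than as a flat list of pairs), and the names `index_value` and `build_tables` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Index = Tuple[int, ...]


def index_value(vec: Sequence[float], projections: Sequence[Sequence[float]]) -> Index:
    """Concatenate one Boolean value per projection into an M-bit index."""
    return tuple(
        1 if sum(a * b for a, b in zip(vec, x)) > 0 else 0 for x in projections
    )


def build_tables(
    vectors: Dict[str, Sequence[float]], M: int, N: int, seed: int = 0
) -> List[Dict[Index, List[str]]]:
    """Build N hash tables, each using M fresh random projections."""
    if not vectors:
        raise ValueError("at least one individual is required")
    rng = random.Random(seed)
    m = len(next(iter(vectors.values())))
    tables: List[Dict[Index, List[str]]] = []
    for _ in range(N):
        projections = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(M)]
        table: Dict[Index, List[str]] = defaultdict(list)
        for person, vec in vectors.items():
            table[index_value(vec, projections)].append(person)
        tables.append(dict(table))
    return tables


vectors = {
    "P1": [0.75, 0.0, 1.0],
    "P2": [0.75, 0.0, 1.0],   # identical answers to P1
    "P3": [0.0, 1.0, 0.25],
}
tables = build_tables(vectors, M=4, N=3, seed=7)
```

Grouping individuals by index value makes the later lookup of individuals with an identical index a single dictionary access per table.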

Algorithm 1
Output:
(1) h(P_x): Boolean value of individual P_x.
for x = 1, ..., n do
    for i = 1, ..., m do
        convert answer_x(q_i) into value(q_i) by the example in Step-1
        if q_i is positive
            then calculate Norm(q_i) by (1)
            else calculate Norm(q_i) by (2)

For two individuals $P_{x_1}$ and $P_{x_2}$, if their corresponding index values are the same in any of the N tables, then we conclude that $P_{x_1}$ and $P_{x_2}$ are probably similar individuals, and vice versa. More formally, the similar-individual evaluation process is based on equation (6), where $y \in \{1, 2, \ldots, N\}$.
$$P_{x_1} \text{ is similar to } P_{x_2} \iff \exists\, y:\ H_y(P_{x_1}) = H_y(P_{x_2}) \tag{6}$$
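The rule in (6) can be expressed as a short Python sketch. The function name `similar_individuals` and the representation of each table as a mapping from individual to index value are assumptions made here for illustration; the index values in the example are made up.

```python
from typing import Dict, List, Set, Tuple

Index = Tuple[int, ...]


def similar_individuals(target: str, tables: List[Dict[str, Index]]) -> Set[str]:
    """Return everyone who shares the target's index value in at least one table."""
    similar: Set[str] = set()
    for table in tables:
        key = table[target]
        similar.update(p for p, idx in table.items() if idx == key and p != target)
    return similar


# Two hand-written tables (N = 2) with 2-bit index values (M = 2).
example_tables = [
    {"P1": (1, 0), "P2": (1, 0), "P3": (0, 1)},
    {"P1": (0, 0), "P2": (1, 1), "P3": (0, 0)},
]
neighbours = similar_individuals("P1", example_tables)
```

In this example P1 matches P2 in the first table and P3 in the second, so both are returned even though P2 and P3 never collide with each other.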

(3) Step-3: Population physique clustering and prediction based on similar individuals.
Through Algorithm 2, similar individuals who share the same or similar sport physiques are discovered. Next, according to the individual similarity, we can cluster the population into different groups. Within each group, all the individuals hold approximately the same sport preferences or habits. With the derived population groups, various group-based population analysis operations can be performed further.
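The text does not fix a particular grouping procedure. One simple possibility, shown here purely as an assumption, is to treat the similarity relation as a graph and take its connected components as groups, using a union-find structure; the function name `cluster` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def cluster(
    individuals: Iterable[str], similar_pairs: Iterable[Tuple[str, str]]
) -> List[Set[str]]:
    """Group individuals into connected components of the similarity relation."""
    parent: Dict[str, str] = {p: p for p in individuals}

    def find(p: str) -> str:
        # Follow parent links to the representative, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


groups = cluster(["P1", "P2", "P3", "P4"], [("P1", "P2"), ("P2", "P3")])
```

A consequence of this choice is that similarity becomes transitive within a group: P1 and P3 end up together through P2 even if they never share an index value directly.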
Another application based on population groups is individual physique prediction. Generally, the feedback rate of sport questionnaires is not high enough. As a consequence, the collected sport physique data of the population are often sparse, which probably decreases the availability and dependability of the collected sport questionnaires.
Algorithm 2 (closing lines)
        put (P_x, H(P_x)) pair in T_y
    end for
    return T_y
end for

In this situation, to improve the quality of the collected sport questionnaires and enhance the availability of the questionnaire data, a promising way is to preprocess the collected data and predict the missing values in the questionnaires. Specifically, for an individual $P_x$ whose answer choice for question q is absent from the collected questionnaires, the missing data, denoted by $value_x(q)$, can be predicted by equation (7). Here, $Sim(P_x)$ records all the similar individuals of $P_x$. The formal procedure is available in Algorithm 3.
$$value_x(q) = \frac{1}{|Sim(P_x)|} \sum_{P_k \in Sim(P_x)} value_k(q) \tag{7}$$
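The averaging in (7) can be sketched in Python as follows. The function name `predict_missing` and the nested-dictionary layout of the answers are illustrative assumptions. The sketch also assumes that similar individuals who did not answer question q themselves are left out of the average, and that no prediction is returned when nobody in $Sim(P_x)$ answered it; the text does not say how these cases are handled.

```python
from typing import Dict, Iterable, Optional


def predict_missing(
    question: str,
    similar: Iterable[str],
    values: Dict[str, Dict[str, float]],
) -> Optional[float]:
    """Average the known values of `question` over the similar individuals."""
    known = [
        values[p][question] for p in similar if question in values.get(p, {})
    ]
    if not known:
        return None  # assumption: no similar individual answered this question
    return sum(known) / len(known)


answers = {
    "P2": {"q1": 4},
    "P3": {"q1": 2},
    "P4": {},            # similar individual who skipped q1
}
prediction = predict_missing("q1", ["P2", "P3", "P4"], answers)
```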

V. EVALUATION
In this section, we simulate a set of experiments to test the feasibility of our PPCP method in dealing with population physique clustering and prediction.

A. EXPERIMENTAL SETTING
As the answer choices of the questions in the sport questionnaires are often discrete values, we employ the classic MovieLens dataset [25] for experimental evaluation purpose. Similar to sport questionnaires, the ratings in MovieLens dataset are also discrete values scaled from 1-star to 5-star.
In the simulation, the population volume (n) is up to 6040, the question volume in the sport questionnaires (m) is up to 3900, the hash function volume (M) is up to 15 and the hash table volume (N) is up to 15. For performance comparison, we introduce two baseline methods: UPCC and IPCC. Both are classic and effective methods for similar-object search. Moreover, we mainly measure the RMSE and time cost.

1) RMSE OF THREE METHODS
RMSE is a key indicator of prediction accuracy when missing values are predicted; a lower RMSE means more accurate predictions. In this test, we measure the RMSE of the three methods with the variation of parameters m and n, respectively. Other parameters: M = 15, N = 15. The compared results are reported in Fig. 3.
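For reference, RMSE is the square root of the mean squared difference between predicted and actual values. A minimal Python sketch (the function name `rmse` and the sample numbers are illustrative only):

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


error = rmse([1.0, 5.0], [3.0, 3.0])
```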
In Fig. 3, RMSE shows no obvious trend for any of the three methods. However, one observation is clear: PPCP's RMSE is often superior to that of the other two methods. We analyze the reason as follows: in PPCP, through tuning the parameter settings, only the most similar individuals are returned for missing data prediction. As a result, the prediction RMSE decreases accordingly.

2) TIME COST OF THREE METHODS
In big data environment, response time is a key factor that influences the final satisfactions of people. Therefore, we measure the time cost of three methods to quantify their respective response speed. Other parameters: M = 15, N = 15, m = 3900, n = 6040. Compared results are reported in Fig. 4.
A clear upward trend is found in Fig. 4 for both the UPCC and IPCC methods, as these two solutions involve heavyweight similarity computation. In particular, when the population or question volume increases, their computational time grows approximately linearly. In contrast, the time efficiency of PPCP is relatively high compared to UPCC and IPCC. We analyze the reasons as follows: in PPCP, the first two steps are usually done offline, so their contribution to the online response time is O(1); in the third step, we only need to query the individual index tables produced offline to perform population physique clustering and prediction, whose time cost is O(1). As a result, the response speed of PPCP is high.

3) RMSE OF PPCP
As observed from Section IV, the proposed three algorithms are related to several parameters. In this test, we observe the relationship between the RMSE of PPCP and the involved parameters such as M and N. Other performance parameters: m = 3900, n = 6040. Compared results are reported in Fig. 5.

4) SIZE OF $Sim(P_x)$ IN PPCP
As we discussed in the last paragraph, parameters M and N determine the similar individual discovery condition to some extent. Therefore, M and N can also influence the number of similar individuals of a pre-designated individual. To evaluate this relationship between the size of $Sim(P_x)$ and the parameters, we conduct an experiment whose results are reported in Fig. 6.
The figure data show a relatively clear trend in the size of $Sim(P_x)$ when parameter M or N varies. Concretely, when M increases, the size of $Sim(P_x)$ decreases, as more hash functions indicate a stricter similar individual discovery condition. In contrast, when N increases, the size of $Sim(P_x)$ rises, as more hash tables indicate a looser similar individual discovery condition.
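The opposite effects of M and N can also be read off a standard locality-sensitive hashing estimate. As a sketch under assumptions not stated in the text (the M hash functions within a table and the N tables are drawn independently, and $p$, a symbol introduced here, is the probability that a single hash function assigns two given individuals the same Boolean value), the probability that the two individuals share an index value in at least one table is:

```latex
\Pr[\text{similar}] = 1 - \left(1 - p^{M}\right)^{N}
```

For a fixed $p < 1$, the term $p^{M}$ shrinks as M grows, which tightens the condition, while the whole expression grows with N, which loosens it.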

C. FURTHER DISCUSSIONS
In the experiments, we did not measure the privacy-protection performance, as it is determined by the inherent characteristics of hash technology [26]-[28]. In addition, only the prediction accuracy and efficiency are compared in the evaluation section. In our upcoming research, more prediction metrics (e.g., diversity [29], [30], robustness [31], [32], and so on) should be added to better evaluate the performance of the suggested PPCP method.

VI. CONCLUSION
The existence of various medical and health data has enabled intelligent Internet of Health (IoH) services. As an example IoH application, we use collected sport questionnaires to investigate and monitor population physiques. Meanwhile, we use hash techniques to secure the private data included in the sport questionnaires. Finally, we simulate a range of experiments to show that the proposed PPCP method is superior to other methods, especially in terms of RMSE and time cost.
To make the PPCP method more general and comprehensive, we will further refine PPCP to adapt to multiple data categories and structures [33]. Besides, we will study the possibility of combining different privacy protection approaches for advanced data security.