Heap Bucketization Anonymity—An Efficient Privacy-Preserving Data Publishing Model for Multiple Sensitive Attributes

The publication of patient datasets is essential for various medical investigations and decision-making. Currently, significant attention is devoted to protecting privacy during data publishing. The existing privacy models for multiple sensitive attributes do not account for the correlation among the attributes, which in turn leads to considerable utility loss. An efficient model, Heap Bucketization-anonymity (HBA), is proposed to balance privacy and utility for datasets with multiple sensitive attributes. The HBA model uses anatomization to vertically partition the dataset into (1) a quasi-identifier table and (2) a sensitive attribute table. The quasi-identifiers are anonymized by applying k-anonymity and slicing, and the sensitive attributes are anonymized by applying slicing and Heap Bucketization. The metrics Normalized Certainty Penalty and KL-divergence are used to compute the utility loss in the patient dataset. The experimental results show that HB-anonymity achieves high privacy with less utility loss than other existing models. The HB-anonymity model not only balances utility and privacy but also eradicates the (i) background knowledge attack, (ii) quasi-identifier attack, (iii) membership attack, (iv) non-membership attack and (v) fingerprint correlation attack.


I. INTRODUCTION
Information is significant to various innovations. To discover information, data are retrieved and analyzed by the research community [1]. Public and private sectors examine human behavior patterns to enhance their services. In the process of extracting knowledge, an individual's information may be leaked, which leads to privacy breaches. An adversary may use publicly available data to gather individual information. Privacy is the foremost concern in all applications and sectors. Data are used for various purposes such as statistical analysis, knowledge discovery and policy-making.
The associate editor coordinating the review of this manuscript and approving it for publication was Vivek Kumar Sehgal.
Various organizations, pharmacies and health sectors share their employee and patient details with third parties for various analysis purposes. As the data grow tremendously, the analysis of the data becomes tedious. Thus, to deal with big data, various approaches have been proposed [2,3]. The lifecycle of data has different stages: (i) data creation, (ii) data storage, (iii) pre-processing of data, (iv) data archival and (v) data purging. Existing privacy preservation techniques are still evolving, and achieving a balance between privacy and utility remains an open research issue.
Currently, the healthcare industry collects information about patients to enable better and more accurate diagnosis and treatment. Since the dataset contains sensitive attributes, it needs to be anonymized. The healthcare industry is the largest and a currently developing area of research. It is shifting from disease-oriented to patient-oriented approaches. Information and Communication Technology (ICT) is incorporated in health care practices. The volume of data in the health care industry is growing rapidly and the data are used for various analysis purposes. To achieve the best results from the data, the utility of the data needs to be maintained. Much research has been done on preserving privacy, viz. privacy of big data in the health care industry, privacy preservation in the Internet of Things (IoT), maintaining privacy in the cloud, and artificial intelligence in healthcare for maintaining privacy. Various technologies have been used in the health care industry, such as machine learning [4]-[6], IoT systems [7]-[12], data analytics [13], [14], and cloud systems [15]-[17]. For continuous monitoring of patients, wearable equipment has been introduced. The data recorded by the equipment are continuously monitored, streamed, shared and analyzed to enable various health services for the patients [18]. Due to continuous monitoring, the patient can be diagnosed earlier and the proper action can be taken. Though all this technology improves patient care, a question arises: "What about the patient's privacy?" Publishing the raw data will cause privacy breaches, which could lead to annoyance or deceit. With the released data, the intruder can cause heavy damage even to the status and life of individuals.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Privacy preservation has become essential when sharing data with researchers and third parties. Privacy-preserving data publishing (PPDP) provides various methods, models and tools for protecting against breaches while publishing data to third parties or analysts. Security and privacy are major concerns in today's digital world. Recently, PPDP has gained a lot of attention from the research community [19].
Earlier, researchers removed the explicit identifiers (e.g., name) and considered the dataset well protected. However, intruders can still easily infer an individual's sensitive attributes and complete details [20]. Such measures are insufficient because an individual record can be identified by relating it to other external sources. Later, various anonymization techniques were proposed to mask individual data, viz. generalization, suppression, permutation, encryption, role-built access control, etc.
In the proposed work, the privacy of the individual is protected from five breaches: (1) background knowledge attack (bka); (2) quasi-identifier attack (qia); (3) membership disclosure attack (mda); (4) non-membership disclosure attack (n-mda); and (5) fingerprint correlation attack (fca). Background knowledge about an individual can reveal the pattern of that particular individual, and identifying this pattern can lead to a fingerprint correlation attack. The multiple sensitive values of an individual are grouped to form the individual's fingerprint. Correlating the fingerprints among the various k-anonymized groups can help the adversary gain insights into other individuals in the dataset as well. Background knowledge paves the way for all the other attacks: the quasi-identifier attack (qia), membership disclosure attack (mda), non-membership disclosure attack (n-mda) and fingerprint correlation attack (fca). Linking an individual's quasi-identifier values helps the adversary gain insights into personal information. If, through background knowledge and the linking of QID values, an adversary can determine the existence or non-existence of an individual in the microdata, then the membership and non-membership disclosure attacks succeed. The fingerprint correlation attack is a strong privacy breach as it could expose the information of all individuals in the dataset.
In this paper, the Heap Bucketization-Anonymity model is proposed and compared with two existing approaches, (p,k)-angelization and (c,k)-anonymization. In (p,k)-angelization, p represents the fixed sensitivity levels and k represents the k-anonymous groups. The (p,k)-angelization approach is strong and eradicates the non-membership and membership attacks. However, it cannot resist the fca. In (c,k)-anonymization, the fca is eradicated, but the execution time is relatively high compared to the HBA model.
The paper is organized as follows. Section II discusses the literature on privacy-preserving data publishing with 1:1 single sensitive attribute, 1:1 multiple sensitive attributes and 1:M microdata. Section III summarizes the problem definition and the related preliminaries and their definitions. Section IV presents the contributions of the paper. Section V explains the motivation for the proposed HB-anonymity, which is an extension of (p,k)-angelization and (c,k)-anonymization. The definitions of (p,k)-angelization and (c,k)-anonymization and the complete working of the models are discussed in detail. Section VI discusses attacks on 1:1 microdata with multiple sensitive attributes and their scenarios. Section VII discusses the proposed Heap Bucketization-anonymity model and explains the various steps involved in the model. Section VIII explains the implementation of slicing on the sensitive attributes table and the merging of the quasi-identifier and sensitive attributes tables. In addition, the workflow and framework of the Heap Bucketization-anonymity model are depicted. Section IX gives a detailed step-wise algorithm for the Heap Bucketization-anonymity model. Section X describes the experimental setup, the outcomes and the utility metrics used in the experiments, with the results shown as graphs. Finally, Section XI concludes the work along with future directions.

II. RELATED WORKS
The rapid growth of electronic health care systems and the sharing of data increase the need for privacy [21]. Because data are shared with third parties, the protection of individual identity becomes a major challenge. The privacy of microdata is governed by well-defined procedures and policies for sharing an individual's health data. The Health Insurance Portability and Accountability Act defines two methodologies to achieve de-identification. The recently available privacy-preserving technologies have been analyzed and discussed [22].
A. 1:1 MICRO DATA WITH SINGLE SENSITIVE ATTRIBUTE
Samarati and Sweeney proposed k-anonymity. k-anonymity protects the dataset from record linkage: every record of the table should be indistinguishable from at least k-1 other records. In k-anonymity, anonymization methods such as generalization and suppression are applied. It ensures that the probability of re-identifying a person in the disclosed data is not more than 1/k [23]. Another model, l-diversity, was proposed as an extension of k-anonymity. It prevents attribute linkage: l-diversity ensures that there are at least l different values for the sensitive attributes in every equivalence class [24].
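The two conditions described above can be checked mechanically. The following is a minimal sketch, assuming the quasi-identifiers are already generalized and using the simplest "distinct" form of l-diversity; the function names and the toy records are hypothetical and are not taken from the cited works.

```python
from collections import defaultdict

def equivalence_classes(records, qi_attrs):
    """Group records that share the same (already generalized) QI values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[a] for a in qi_attrs)].append(rec)
    return groups

def is_k_anonymous(records, qi_attrs, k):
    """True if every equivalence class holds at least k records."""
    classes = equivalence_classes(records, qi_attrs)
    return all(len(group) >= k for group in classes.values())

def is_l_diverse(records, qi_attrs, sensitive_attr, l):
    """True if every class has at least l distinct sensitive values
    (distinct l-diversity, an assumption made for this sketch)."""
    classes = equivalence_classes(records, qi_attrs)
    return all(len({r[sensitive_attr] for r in group}) >= l
               for group in classes.values())

# Hypothetical toy table with generalized quasi-identifiers.
toy = [
    {"age": "40-49", "sex": "M", "disease": "Allergy"},
    {"age": "40-49", "sex": "M", "disease": "Anaemia"},
    {"age": "40-49", "sex": "M", "disease": "Allergy"},
    {"age": "50-59", "sex": "F", "disease": "Asthma"},
    {"age": "50-59", "sex": "F", "disease": "Asthma"},
    {"age": "50-59", "sex": "F", "disease": "Asthma"},
]

print(is_k_anonymous(toy, ["age", "sex"], 3))
print(is_l_diverse(toy, ["age", "sex"], "disease", 2))
```

In this toy table both classes contain three records, so the table is 3-anonymous, yet the second class carries a single disease value, which is exactly the attribute-linkage weakness that l-diversity targets.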
In l-diversity, the skewness attack occurs due to the skewness of sensitive attributes in the overall distribution, so the t-closeness model was proposed. t-closeness ensures that the distribution of sensitive attributes in each class is close to the distribution of sensitive attributes in the entire table [25]. Many researchers have proposed extended versions of k-anonymity, l-diversity and t-closeness. An extended version of the k-anonymity model was proposed with minimum data distortion for record suppression. The improved scalable l-diversity (ImSLD), an extension of improved scalable k-anonymity (ImSKA), was proposed to handle large amounts of data. MapReduce was used as the programming paradigm, and the use of MapReduce iterations consistently reduced the running time. To compute the utility loss in the anonymized dataset, a metric called Normalized Certainty Penalty (NCP) was used [26].
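To illustrate how NCP quantifies utility loss, the sketch below uses the commonly stated definition for numerical attributes: the width of the generalized interval divided by the width of the attribute's domain, averaged over all cells. This formula, the equal attribute weights, and the toy intervals are assumptions for illustration; the text above does not spell out the exact variant used in [26].

```python
def ncp_interval(lo, hi, domain_lo, domain_hi):
    """NCP of one generalized numeric value: interval width over domain width."""
    if domain_hi == domain_lo:
        return 0.0  # a constant attribute carries no generalization loss
    return (hi - lo) / (domain_hi - domain_lo)

def ncp_table(generalized, domains):
    """Average NCP over all cells (equal attribute weights assumed).

    generalized: list of dicts mapping attribute -> (lo, hi) interval
    domains:     dict mapping attribute -> (domain_lo, domain_hi)
    """
    total, cells = 0.0, 0
    for rec in generalized:
        for attr, (lo, hi) in rec.items():
            d_lo, d_hi = domains[attr]
            total += ncp_interval(lo, hi, d_lo, d_hi)
            cells += 1
    return total / cells if cells else 0.0

# Hypothetical domains and generalized records.
domains = {"age": (20, 80), "height": (150, 190)}
generalized = [
    {"age": (40, 50), "height": (160, 170)},
    {"age": (40, 50), "height": (160, 170)},
    {"age": (50, 80), "height": (150, 190)},  # heavily generalized record
]

print(round(ncp_table(generalized, domains), 4))
```

A value of 0 means no generalization was applied and a value of 1 means every cell was generalized to its full domain, so lower values indicate better preserved utility.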
A new versatile publishing method was proposed with a set of privacy rules for the quasi-identifier and sensitive attributes. A Guardian Normal Form (GNF) was introduced for publishing each sub-table along with the existing publishing approach. When the sub-tables are merged while publishing the entire table, the privacy rules should be able to guarantee the utmost privacy. Two algorithms, (1) Guardian Decomposition and (2) Utility-aware Decomposition, were proposed to anonymize the microdata. The main focus of the work is the versatility problem of privacy-preserving data publishing, with various privacy rules incorporated for the anonymization of data [27].
Two privacy models, (1) enhanced identity-reserved l-diversity and (2) enhanced identity (α, δ)-anonymity, have been proposed. To implement these two privacy models, the DAnonyIR generalization algorithm was designed with a clustering technique to reduce information loss. The EIR l-diversity and EIR (α, β)-anonymity work well for multiple sensitive attributes in static relational data [28]. The Sensitive Label Privacy Preservation with Anatomization (SLPPA) scheme has been proposed to protect the microdata. The scheme adopts two techniques: (i) table division and (ii) group division. In the table division procedure, the mean-square contingency coefficient and entropy are adopted for anonymization. In group division, non-overlapping groups are formed to satisfy the (α, β, γ, δ) model [29]. The (α,l)-model was implemented to achieve proper diversity requirements for datasets with multiple sensitive attributes. The two variables α and l confine the values of a sensitive attribute in the equivalence classes. The (α,l)-model is designed with k-anonymity as a foundation and has low running time and utility loss [30]. An additive noise approach was proposed that satisfies the conditions of l-diversity [31]. A privacy-preserving data publishing method known as MNSACM was proposed to handle numerical attributes. The MNSACM method comprises two approaches, clustering and multi-sensitive bucketization. A two-dimensional bucket is formed to anonymize the sensitive attributes. MNSACM aims to publish a one-time static relational table [32].
A distribution model was proposed to fix the values of sensitive attributes. For multiple sensitive attribute values, a threshold p is set to minimize the disclosure probability of the sensitive attributes [33]. A new framework, (k,p)-anonymity, was proposed to resolve the sensitive attribute disclosure problems in the k-anonymity and l-diversity models [34]. A proficient approach, (p,k)-angelization, has been proposed for anonymizing datasets with multiple sensitive attributes (MSA). The (p,k)-angelization not only preserves privacy but also enhances the utility of the disclosed dataset [35].
The Quasi-identifier-Multiple heterogeneous sensitive attribute (QI-MHSA) generalization algorithm was proposed to protect the privacy of datasets with multiple sensitive attributes. k-anonymity is applied to the quasi-identifier bucket and l-diversity to the sensitive attributes bucket. In addition, a flag is set to generalize the sensitive attributes according to their sensitivity requirements [36].
Most researchers have separated the quasi-identifiers and sensitive attributes in the microdata. Yuichi Sei [37] adopted the new privacy models (l1, . . . , lq)-diversity and (t1, . . . , tq)-closeness and stated that each attribute has a sensitive value in it. Thus, he categorized such quasi-identifiers as sensitive QIDs. Two algorithms, (1) anonymization and (2) reconstruction, were proposed to anonymize the sensitive QIDs and achieve strong privacy.
A PPDP approach for dynamic data with MSA was proposed and named KC-slice [38]. An improved version of KC-slice, named KCi-Slice, was proposed to balance privacy and utility while publishing datasets with MSA [39]. Another model used an anonymization technique called slicing, with a fuzzy method for numerical sensitive attributes and a generalization method for categorical sensitive attributes [40]. Various models have been proposed for 1:M datasets with MSA. A novel method called "MSAs generalization correlation attacks" was proposed for 1:M microdata with multiple sensitive attributes. An approach called (p,l)-angelization was proposed to anonymize the 1:M MSA dataset [41]. To preserve privacy in data publishing for 1:M microdata, a model known as the G-model was proposed. The G-model provides a proper balance between utility and privacy and protects 1:M microdata from gender-specific sensitive attribute attacks [42]. An f-slip model has been proposed for 1:M microdata. It eradicates various attacks such as bk attacks, MSAcorr attacks, QIcorr, NMcorr and Mcorr attacks. A unique approach, frequency-slip, was adopted to preserve privacy [43]. Various methods have been adopted for relational data [44], [45].

III. PROBLEM DESCRIPTION AND PRELIMINARIES

A. PROBLEM DEFINITION
Let the dataset $T_P$ consist of multiple sensitive attributes. Can the dataset be anonymized in such a way that the intruder does not get any clue about the individual? The anonymization of data should ensure an optimal balance between utility and privacy. For ease of subsequent discussion, the basic notions and descriptions used in the paper are presented briefly.

B. BASIC NOTIONS AND DESCRIPTIONS
The patient microdata is presented in a relational table. The table $T_{r \times c}$ has r rows and c columns. The attributes of the microdata can be categorized as follows.
Direct identifier: a unique identifier such as social security number, driving license or name. The direct identifiers are encrypted or removed before the data are disclosed to a third party.
Quasi-identifier ($T_P^{QI}$): a group of attributes that can be used to identify an individual by relating it to external sources. The QIs are age, sex, height and weight in Table 1.
Sensitive attribute ($T_P^{SA}$): an attribute holding confidential information about an individual, which needs to be secured during the disclosure of microdata, such as Disease, Pulse rate, etc. in Table 1. The focus of the paper is to protect the sensitive attributes from being revealed, with less utility loss and high privacy.
Definition 1 (Equivalence Class [46]): In the 1:1 microdata $T_P$, the records with the same quasi-identifier values constitute an equivalence class, i.e., each of the n subsets of $T_P$ comprises records that correspond to each other. If the tuple $t_p \in T_P$, then the generalized form of table $T_P$ that comprises the tuple $t_p$ is represented in the form: where $Gf_i$ ($1 \le i \le n$) is the unique quasi-identifier subset including $t_p$, $Gf_i[j]$ ($1 \le j \le r$) is the generalized value of the record on $T_P$ for all the records in $Gf_i$, and $T_P^{SA}$ denotes the sensitive attributes in the table $T_P$.

Definition 2 (k-Anonymity): The dataset $T_P$ satisfies k-anonymity if the record of every individual is indistinguishable from at least k-1 other records that also exist in the dataset $T_P$.
Definition 3 (Slicing [47]): The dataset $T_P$ is partitioned both vertically and horizontally. The vertical partition groups the attributes based on the high correlations among them, so that each column comprises a subset of highly correlated attributes. There may be columns $a_1, a_2, \ldots, a_n$. The horizontal partition groups the tuples into different buckets, and each tuple belongs to exactly one bucket. Consider the bucket $B_{id}$ and the number of buckets $b_{id}$.
Definition 4 (Bucketization [46]): The dataset $T_P$ is partitioned into n quasi-identifier groups and m groups of sensitive attributes. The subset of tuples in the partitioned table is called a bucket and is represented in the form $T_P^{QI}(QI, B_{id})$ and $T_P^{SA}(SA, B_{id})$, where QI and SA are the quasi-identifiers and the sensitive attributes of the table $T_P$, and $B_{id}$ represents the bucket id.
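The bucketized form in Definition 4 can be sketched as follows. This is a minimal illustration, assuming fixed-size buckets formed from consecutive records; the function name, the bucket-size parameter, and the toy records are hypothetical.

```python
def bucketize(records, qi_attrs, sa_attrs, bucket_size):
    """Split records into a QI table and an SA table linked by bucket_id.

    Buckets are formed from consecutive records (an assumption of this
    sketch); each record belongs to exactly one bucket.
    """
    qi_table, sa_table = [], []
    for idx, rec in enumerate(records):
        bucket_id = idx // bucket_size + 1
        qi_row = {a: rec[a] for a in qi_attrs}
        sa_row = {a: rec[a] for a in sa_attrs}
        qi_row["bucket_id"] = bucket_id
        sa_row["bucket_id"] = bucket_id
        qi_table.append(qi_row)
        sa_table.append(sa_row)
    return qi_table, sa_table

# Hypothetical patient records.
patients = [
    {"age": 34, "sex": "M", "disease": "Allergy", "pulse": 72},
    {"age": 51, "sex": "F", "disease": "Anaemia", "pulse": 80},
    {"age": 47, "sex": "M", "disease": "Asthma", "pulse": 76},
    {"age": 62, "sex": "F", "disease": "Diabetes", "pulse": 84},
]

qi_table, sa_table = bucketize(patients, ["age", "sex"], ["disease", "pulse"], 2)
for row in qi_table:
    print(row)
for row in sa_table:
    print(row)
```

In this sketch the two tables keep the same row order, so a published version would still need the sensitive rows to be permuted or aggregated within each bucket before release.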
Definition 5 (Heap Bucketization): Heap Bucketization is an advancement of bucketization. The dataset $T_P$ is partitioned into n quasi-identifier groups and m groups of sensitive attributes. The sensitive attributes of each record in the same bucket are cumulated and represented as the records of the same bucket.
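One possible reading of Definition 5 is sketched below: the sensitive values of all records in a bucket are pooled per attribute, so that the published heap bucket no longer shows which values belonged to the same record. This interpretation of "cumulated", the sorted-list representation, and all names are assumptions made for illustration, not the authors' algorithm.

```python
from collections import defaultdict

def heap_bucketize(sa_table, sa_attrs):
    """Pool the sensitive values of each bucket into one heap record.

    Returns {bucket_id: {attribute: sorted list of pooled values}}.
    Sorting each attribute independently discards the per-record
    alignment between attributes (an assumption of this sketch).
    """
    pooled = defaultdict(lambda: {a: [] for a in sa_attrs})
    for row in sa_table:
        for attr in sa_attrs:
            pooled[row["bucket_id"]][attr].append(row[attr])
    return {
        bucket_id: {attr: sorted(values) for attr, values in cols.items()}
        for bucket_id, cols in pooled.items()
    }

# Hypothetical bucketized sensitive attribute table.
sa_rows = [
    {"bucket_id": 1, "disease": "Flu", "pulse": 80},
    {"bucket_id": 1, "disease": "Allergy", "pulse": 72},
    {"bucket_id": 2, "disease": "Asthma", "pulse": 90},
]

heaps = heap_bucketize(sa_rows, ["disease", "pulse"])
for bucket_id, heap in heaps.items():
    print(bucket_id, heap)
```

Because every heap bucket carries one pooled record instead of one row per individual, the adversary cannot pair a specific disease with a specific pulse rate from the published bucket alone.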

C. WHY THE ADVANCEMENT OF BUCKETIZATION IS NEEDED?
Angel [48] and Anatomy [49] implemented bucketization to preserve privacy in data publishing. However, both were implemented for a single sensitive attribute. In (p,k)-angelization and (c,k)-anonymization, bucketization was adopted for MSA. The (p,k)-angelization leads to high utility loss and the (c,k)-anonymization leads to high execution time. Therefore, an advancement of bucketization named "Heap Bucketization" is proposed to prevent higher utility loss and privacy loss.

IV. CONTRIBUTION
An efficient Heap Bucketization-Anonymity (HB) model is proposed to protect privacy in data publishing with MSA. The table is anonymized using the HB-anonymity approach to achieve an optimal balance between privacy and utility.
1. A unique privacy-preserving data publishing model, Heap Bucketization-anonymity, is proposed, which achieves an optimal balance between privacy and utility. HB-Anonymity is designed for multiple sensitive attributes to achieve high privacy with less loss of information. An algorithm is also presented for Heap Bucketization-Anonymity.
2. The method has been evaluated both theoretically and experimentally to validate the proposed model. The proposed HB-Anonymity model preserves privacy under the (i) background knowledge attack, (ii) quasi-identifier attack, (iii) membership attack, (iv) non-membership attack and (v) fingerprint correlation attack.

V. MOTIVATION OF PROPOSING HB-ANONYMITY: AN EXTENSION OF (P,K)-ANGELIZATION AND (C,K)-ANONYMIZATION
Many earlier models dealt with a single sensitive attribute. However, in real scenarios, health records may have multiple sensitive attributes. A health record comprises various sensitive attributes such as disease, temperature, etc., as shown in Table 1. Handling those sensitive attributes while maintaining the privacy of the individuals is not an easy task. In PPDP, privacy and utility need to be balanced so that researchers can carry out analysis and decision-making. If utility is not preserved along with privacy, researchers will not be able to analyze the data and extract valuable information.
Definition 6 ((p,k)-Angelization [35]): The relational data $T_P$ is said to satisfy (p,k)-angelization if the table is partitioned into a category table, a quasi-identifier table and a sensitive attribute table. The anonymized table is published in two different batches, i.e., the quasi-identifier table and the sensitive attribute table. Here p represents the category of sensitivity levels and k represents the group of k-anonymous data. The maximum weighted attribute is calculated using a weight function, as (p,k)-angelization considers the maximum weighted attribute to be the most sensitive attribute.
In (p,k)-angelization, the privacy breach starts from a highly weighted attribute, and the values of the quasi-identifier table and the sensitive attribute table are correlated through the batch id. The (p,k)-angelization is an iterative process that fails to prevent the re-identification of an individual's record with his or her complete details. Through an iterative process, the adversary can also obtain the details of other individuals in the dataset. In (p,k)-angelization, intersecting the attribute values of two different buckets yields a single sensitive value for each sensitive attribute. Carrying out this intersection for all the attributes reveals the complete details of an individual.
Although the iterative process makes the identification of an individual complex, it still leads to a privacy breach through the intersection of attribute values between two buckets. As (p,k)-angelization does not completely utilize the angelization mechanism, splitting the patient table into two subtables, (1) the quasi-identifier table and (2) the sensitive attribute table, is ineffective: once the intruder finds the batch id of an individual in the generalized table, he can easily infer the sensitive details of the individual in the sensitive batch table (SBT) by correlating with the batch id. In (p,k)-angelization, the weight of the attributes is calculated to identify the highly sensitive attribute that is most likely to cause a privacy breach. To compare (p,k)-angelization with the HBA model, the attributes age, sex, height and weight form the generalized table, and the attributes temperature, pulse rate, respiratory rate, blood pressure and disease form the sensitive batch table. In our experimental work, the sensitivity level is p = 4 (i.e., Very High, High, Medium and Less), and k = 3 is used to anonymize the generalized table (i.e., the quasi-identifier table).
Definition 7 ((c,k)-Anonymization [50]): A table $T_P$ is said to satisfy (c,k)-anonymization if the table comprises a generalized table and a fingerprint bucket of sensitive attributes. The generalized table comprises the quasi-identifiers and is k-anonymized with a bucket id to prevent the linking attack. The bucket id in the quasi-identifier table is linked with the bucket id in the sensitive table. However, the sensitive table comprises c varied records to avoid fingerprint correlation attacks. Here c represents the category of the table. In the proposed work, c = 4 (Very High, High, Medium and Less) and k = 3 (i.e., the anonymized group of data). In both (p,k)-angelization and (c,k)-anonymization, the weights of the sensitive attributes are calculated to find the highly sensitive attribute.
In (c,k)-anonymization, a fingerprint bucket is created that satisfies c-diversity to prevent attacks such as fingerprint correlation. The c-diversity takes the form of l-diversity in (c,k)-anonymization. The disadvantage of (p,k)-angelization is overcome in (c,k)-anonymization by considering two factors: (1) minimizing the linking of records between two fingerprint buckets, and (2) uncorrelating the records between the fingerprint buckets. A linkability control factor ($c_f$) is introduced to minimize the repetition of the same attribute value in the fingerprint bucket.
The goal of HB-anonymity is to provide sustainable privacy with less utility loss. In HB-anonymity, the correlation between the QI and the SA is computed using the Pearson correlation coefficient. The association between the QI and the SA of the table is calculated so that highly correlated attributes can be sliced together. The purpose of slicing highly correlated attributes together is to minimize information loss. If the correlated attributes are not kept together, the information distribution will be scattered and researchers cannot gain valuable information from the anonymized data.
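The correlation step described above can be sketched with the standard sample Pearson coefficient. The threshold value, the helper names, and the toy columns are hypothetical; categorical attributes such as sex or disease would first need a numeric encoding, which this sketch does not attempt.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold):
    """Return (attr_a, attr_b, r) for attribute pairs with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical numeric columns of a patient table.
columns = {
    "age": [30, 60, 25, 50],
    "height": [150, 160, 170, 180],
    "weight": [50, 60, 70, 80],
}

for a, b, r in correlated_pairs(columns, threshold=0.8):
    print(a, b, round(r, 3))
```

Attribute pairs returned by `correlated_pairs` are candidates to be placed in the same slice, so that their joint distribution survives anonymization.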
In the HB-anonymity model, two tables are produced: (1) the quasi-identifier table ($T_P^{QI}$); and (2) the sensitive attribute table ($T_P^{SA}$). Attribute generalization is carried out only in the quasi-identifier table and not in the sensitive attribute table, to minimize the utility loss. The QI table is split into two tables, $T_P^{QI_1}$ and $T_P^{QI_2}$, based on the correlation. Because k-anonymity cannot prevent attribute disclosure, it cannot protect the sensitive attributes effectively, so only the quasi-identifiers are k-anonymized. Slicing is performed on the dataset to preserve data utility and to protect the dataset against membership disclosure and attribute disclosure attacks, since generalization and bucketization may lead to membership disclosure attacks. HB-anonymity does not compute weights for the sensitive attributes or run iterative processes, both of which increase the time complexity. In HB-anonymity, the buckets are formed in increasing order of disease, i.e., alphabetically sorted.
Due to the slicing of highly correlated attributes, data utility is well preserved in both the sensitive attribute table and the quasi-identifier table. Finally, Heap Bucketization is performed on the bucketized table to preserve privacy. As HB-anonymity releases a single anonymized table, the linking of batch ids is avoided. In the process of Heap Bucketization, all the records of a single bucket are combined to form the heap bucket. The heap bucket consists of the sensitive details of the individuals.
In the proposed work, the privacy loss was checked by varying the bucket size. If the bucket size is large, the heap bucket contains many records and, as a result, the loss of utility is also high. To minimize the utility loss, the bucket size should be small. As the heap bucket consists of all the individual records of each bucket (i.e., all the records of bucket 1 are combined to form heap bucket 1), the possibility of identifying an individual is almost zero. In Heap Bucketization, the probability of the distribution of the records will be high. Even if the intruder knows the background details of the individual, the probability of identifying his or her record is close to zero.
The proposed HB-Anonymity model preserves privacy under the (i) background knowledge attack, (ii) quasi-identifier attack, (iii) membership disclosure attack, (iv) non-membership disclosure attack and (v) fingerprint correlation attack. In the heap bucketized anonymized data, the background knowledge attack (bka) cannot be accomplished since the intruder cannot gain any individual details even with strong background knowledge of the individual. Linking the quasi-identifiers cannot provide any information to the intruder as the QIDs are k-anonymized. The membership disclosure attack (mda) and non-membership disclosure attack (n-mda) cannot be accomplished, as the existence or non-existence of an individual cannot be recognized in the proposed model due to the heap records of the buckets. The individual records cannot be identified from the buckets of the sensitive attributes, so the fingerprint correlation attack (fca) is also eradicated.

VI. 1:1 MICRODATA WITH MULTIPLE SENSITIVE ATTRIBUTE ATTACKS AND THEIR SCENARIOS
The Heap Bucketization-anonymity model anonymizes the dataset to protect it from five attacks while achieving high privacy and less information loss. The five attacks are: (1) background knowledge attack (bka); (2) quasi-identifier attack (qia); (3) membership disclosure attack (mda); (4) non-membership disclosure attack (n-mda); and (5) fingerprint correlation attack (fca). In this paper, HB-Anonymity is compared with two models, (p,k)-angelization and (c,k)-anonymization. To explain the scenarios of all five attacks, the original patient table has been anonymized using (p,k)-angelization in Tables 2 and 3. The case scenarios show that (p,k)-angelization cannot resist the five attacks.

A. SCENARIO 1
Scenario 1 discusses the background knowledge attack (bka). If an intruder can infer the sensitive information of individuals by possessing strong background knowledge, then the bka is accomplished. If the intruder knows that individual pid2 is a male, aged < 50, with medium height and weight, and is suffering from some eye problem, then he can easily infer that pid2 falls in bucket 3 in Tables 2 and 3. If the intruder has strong background knowledge about pid2, then he can also conclude that pid2 does not suffer from allergy or anaemia, and thus confirms that the individual suffers from an eye disorder.

B. SCENARIO 2
Scenario 2 discusses the quasi-identifier attack (qia). If the intruder has strong background knowledge about the quasi-identifier values of the individual, then the intruder can correlate the quasi-identifier values to identify the sensitive attribute values. If the intruder knows that individual pid1 is a female, aged > 50, with height > 170 and overweight, then the intruder can find the record in bucket 3 in Tables 2 and 3. If the intruder also has strong background knowledge that the individual falls sick often, the intruder can conclude that the individual falls in bucket 3 and the disease is Anaemia.

C. SCENARIO 3
Scenario 3 discusses the membership disclosure attack (mda). If the intruder possesses background knowledge and the quasi-identifier values of an individual, then the intruder can easily infer whether the individual is present in the dataset. If the intruder knows that pid7 is a female, aged around 60, with height and weight around 155 and 50 respectively, and knows that the individual does not suffer from a severe disease but often sneezes and has a blocked nose, the intruder can conclude that pid7 falls in bucket 4 in Tables 2 and 3.

D. SCENARIO 4
Scenario 4 discusses the non-membership disclosure attack (n-mda). If the intruder possesses background knowledge and the quasi-identifier values of an individual, then the intruder can infer whether or not the individual exists in the dataset. The main aim of the non-membership disclosure attack is to establish the non-existence of the individual. If the individual pid12 is a male, aged > 75, and suffering from a severe pandemic disease, then the intruder can easily infer the non-existence of the individual, as the record does not lie in any of the buckets in Tables 2 and 3.

E. SCENARIO 5
Scenario 5 discusses the fingerprint correlation attack (fca). When two buckets are intersected, the sensitive values common to both are derived, and these help in identifying individuals. For example, if buckets 3 and 4 are intersected, temperature = 98, pulse rate = 22 and disease = Allergic Rhinitis are the sensitive values that can be uniquely identified. Because of this privacy breach, not only are pid0 and pid7 identified, but the sensitive values of individuals pid1, pid2, pid3 and pid8 can also be identified. The definitions of the five attacks are given in Table 4.
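The intersection step in Scenario 5 can be illustrated with a minimal sketch. Tables 2 and 3 are not reproduced here, so every value below is hypothetical except the three shared values quoted in the scenario, and the helper names `intersect_buckets` and `uniquely_exposed` are assumptions made for illustration.

```python
# Toy illustration of a fingerprint correlation attack.
# Bucket contents are hypothetical; only the three shared values
# (98, 22, "Allergic Rhinitis") come from Scenario 5.

def intersect_buckets(bucket_a, bucket_b):
    """Return, per sensitive attribute, the values present in both buckets."""
    return {
        attr: bucket_a[attr] & bucket_b[attr]
        for attr in bucket_a.keys() & bucket_b.keys()
    }


def uniquely_exposed(common):
    """Keep only the attributes whose intersection is a single value."""
    return {
        attr: next(iter(values))
        for attr, values in common.items()
        if len(values) == 1
    }


bucket_3 = {
    "temperature": {98, 99, 101},
    "pulse_rate": {22, 70, 84},
    "disease": {"Allergic Rhinitis", "Anaemia", "Eye disorder"},
}
bucket_4 = {
    "temperature": {98, 97, 100},
    "pulse_rate": {22, 76, 90},
    "disease": {"Allergic Rhinitis", "Diabetes", "Asthma"},
}

exposed = uniquely_exposed(intersect_buckets(bucket_3, bucket_4))
print(exposed)
```

An attribute whose intersection contains exactly one value is the kind of fingerprint the intruder exploits; a model resists the attack when such singleton intersections do not single out a record.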

VII. HEAP BUCKETIZATION-ANONYMITY
The Heap Bucketization-anonymity model is proposed together with its architecture and algorithm. Various privacy-preserving models have been designed and proposed to carry out anonymization for multiple sensitive attributes in a 1:1 dataset. However, achieving the optimal balance between privacy and utility remains an open challenge. The proposed HB-anonymity model resists various attacks such as i) background knowledge attack, ii) quasi-identifier attack, iii) membership attack, iv) non-membership attack and v) fingerprint correlation attack.
The goal of the proposed model is to achieve stronger privacy with less information loss. The HB-anonymity model performs the following steps: i) pre-processing of the data; ii) anatomization of the table into $T_P^{QI}$ and $T_P^{SA}$; iii) calculating the correlation separately for $T_P^{QI}$ and $T_P^{SA}$; iv) employing k-anonymity and slicing on $T_P^{QI}$; v) implementing slicing on $T_P^{SA}$; vi) merging $T_P^{QI}$ and $T_P^{SA}$; and vii) Heap Bucketization.

A. PRE-PROCESSING AND ANATOMIZATION OF THE TABLE
A unique real-time dataset is used in the experimental work. As the dataset was received from the Interdisciplinary Institute of Indian System of Medicine, Ayurveda, the data is already pre-processed and in relational format. The few missing attribute values are filled with the average of the column values. In HB-anonymity, the patient table is anatomized into two different tables: i) the quasi-identifier table $T_P^{QI}$ and ii) the sensitive attribute sub-table $T_P^{SA}$. The anatomization is performed to disconnect the relationship between the sensitive attributes and the quasi-identifier. The goal of anatomization is to apply different methods to each partitioned sub-table. Both $T_P^{QI}$ and $T_P^{SA}$ are allocated a pid for future reference only. The pid is removed when the table is published.
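The pre-processing and anatomization described above can be sketched as follows. This is a minimal illustration on hypothetical records, not the authors' implementation; the column names and the helpers `fill_missing_with_mean`, `anatomize` and `publishable` are assumptions.

```python
from statistics import mean

# Attribute split taken from the paper; the column names are assumed spellings.
QI_ATTRS = ["age", "sex", "height", "weight"]
SA_ATTRS = ["temperature", "pulse_rate", "respiratory_rate", "bp", "disease"]


def fill_missing_with_mean(records, attr):
    """Replace None in a numeric column with the mean of the observed values."""
    col_mean = mean(r[attr] for r in records if r[attr] is not None)
    return [
        {**r, attr: col_mean if r[attr] is None else r[attr]} for r in records
    ]


def anatomize(records):
    """Split each record into a QI row and an SA row sharing a temporary pid."""
    qi_table, sa_table = [], []
    for pid, rec in enumerate(records):
        qi_table.append({"pid": pid, **{a: rec[a] for a in QI_ATTRS}})
        sa_table.append({"pid": pid, **{a: rec[a] for a in SA_ATTRS}})
    return qi_table, sa_table


def publishable(table):
    """Drop the pid column, which is kept only for internal reference."""
    return [{k: v for k, v in row.items() if k != "pid"} for row in table]


# Hypothetical patient records (not taken from the Ayurveda dataset).
patients = [
    {"age": 52, "sex": "M", "height": 168.0, "weight": 70.0, "temperature": 98,
     "pulse_rate": 72, "respiratory_rate": 18, "bp": "120/80", "disease": "Diabetes"},
    {"age": 61, "sex": "F", "height": None, "weight": 50.0, "temperature": 99,
     "pulse_rate": 80, "respiratory_rate": 20, "bp": "130/85", "disease": "Asthma"},
    {"age": 47, "sex": "M", "height": 172.0, "weight": 81.0, "temperature": 98,
     "pulse_rate": 76, "respiratory_rate": 17, "bp": "110/70", "disease": "Anaemia"},
]

cleaned = fill_missing_with_mean(patients, "height")
qi_table, sa_table = anatomize(cleaned)
print(qi_table[1], sa_table[1])
```

Keeping the pid in both sub-tables lets later steps re-join the rows, while `publishable` reflects the statement that the pid is removed before release.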

B. CORRELATION AMONG THE ATTRIBUTES
In the HB-anonymity model, the correlation of the attributes in both $T_P^{QI}$ and $T_P^{SA}$ is calculated. The purpose of finding the correlation among the attributes is to guide slicing. If the attributes are sliced randomly, the linking relationship between the attributes is broken, which leads to utility loss. In Table 1, age, sex, height and weight are the quasi-identifiers. Temperature, pulse rate, respiratory rate, BP (further broken into systolic and diastolic) and disease are the sensitive attributes. The Pearson correlation coefficient is used to compute the correlation among the sensitive attributes and among the quasi-identifiers. A correlation matrix is generated to find the most highly correlated attributes. In the coefficient, EXP[A] and EXP[B] denote the expected values of A and B, and $M_n$ denotes the mean value.
As per the correlation matrix, (age, sex) and (height, weight) are highly correlated in the quasi-identifier table. (sys, dys), (pulse rate, disease) and (respiratory rate, temperature) are highly correlated in the sensitive attribute table.
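The pairing of attributes by correlation can be sketched with the standard sample form of the Pearson coefficient. The toy columns below are hypothetical, and the greedy pairing in `strongest_pairs` is an assumption about how the most correlated pairs might be selected; the paper does not spell out the selection rule. Categorical attributes such as sex or disease would first need a numeric encoding, which is also not specified in the text.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def strongest_pairs(columns):
    """Greedily pair attributes in order of decreasing absolute correlation."""
    scored = sorted(
        (
            (abs(pearson(columns[a], columns[b])), a, b)
            for a, b in combinations(columns, 2)
        ),
        reverse=True,
    )
    used, pairs = set(), []
    for _, a, b in scored:
        if a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs


# Hypothetical numeric columns.
columns = {
    "height": [150, 160, 170, 180],
    "weight": [50, 58, 69, 80],
    "sys": [130, 110, 140, 120],
    "dys": [85, 70, 92, 78],
}

pairs = strongest_pairs(columns)
print(pairs)
```

Each resulting pair is a candidate column group for slicing, so that attributes that vary together stay in the same slice.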

C. K-ANONYMITY AND SLICING ON THE QUASI-IDENTIFIER TABLE
The raw patient microdata is anatomized into two sub-tables: i) the quasi-identifier table and ii) the sensitive attribute table. $\text{mean}(EC1_{Height})$ represents the mean value of the attribute height in the first equivalence class.
$\text{mean}(EC1_{Weight})$ represents the mean value of the attribute weight in the first equivalence class. Equations 8, 9 and 10 show the sample calculation of the mean value for the attributes age, height and weight, i.e., data perturbation [52]. The attribute sex is not generalized. In HB-anonymity, generalization hierarchy trees are not adopted in the anonymization process. Hence, the amount of utility loss is very low.
After implementing k-anonymity on $T_P^{QI1}$ and $T_P^{QI2}$, the two tables are merged to implement slicing, as shown in Tables 7 and 8. Anatomizing the tables based on correlation reduces the loss of utility, and slicing helps preserve the utility and the correlation among the attributes. Slicing combined with the principle of k-anonymity prevents various attacks such as non-membership disclosure, membership disclosure, quasi-identifier attack and background knowledge attack.
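Replacing quasi-identifier values with the mean of their equivalence class can be sketched as below. How the paper forms the equivalence classes is not recoverable from the text, so the sketch assumes a simple rule (sort on the first attribute and cut into groups of k); the function name `mean_anonymize` and the sample rows are hypothetical.

```python
from collections import Counter
from statistics import mean


def mean_anonymize(rows, attrs, k):
    """Replace each listed attribute with the mean of its equivalence class.

    Assumption: classes are formed by sorting on attrs[0] and cutting the
    sorted rows into consecutive groups of k; a short tail joins the last group.
    """
    ordered = sorted(rows, key=lambda r: r[attrs[0]])
    classes = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(classes) > 1 and len(classes[-1]) < k:
        tail = classes.pop()
        classes[-1].extend(tail)
    anonymized = []
    for ec_id, ec in enumerate(classes, start=1):
        means = {a: round(mean(r[a] for r in ec), 2) for a in attrs}
        for row in ec:
            # Untouched attributes (for example sex) are copied as they are.
            anonymized.append({**row, **means, "ec": ec_id})
    return anonymized


# Hypothetical (height, weight) sub-table with a temporary pid.
qi_rows = [
    {"pid": 0, "height": 172, "weight": 74},
    {"pid": 1, "height": 150, "weight": 50},
    {"pid": 2, "height": 154, "weight": 56},
    {"pid": 3, "height": 170, "weight": 70},
    {"pid": 4, "height": 152, "weight": 53},
    {"pid": 5, "height": 174, "weight": 78},
]

anonymized = mean_anonymize(qi_rows, ["height", "weight"], k=3)
counts = Counter((r["height"], r["weight"]) for r in anonymized)
print(counts)
```

Every published (height, weight) combination then occurs at least k times, which is the k-anonymity condition, while the class mean stays close to the original values.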

VIII. IMPLEMENTATION OF SLICING ON $T_P^{SA}$ AND MERGING OF $T_P^{QI}$ AND $T_P^{SA}$
Considering the utility loss caused by the anonymization process, HB-anonymity does not adopt any type of hierarchy generalization or suppression. The patient dataset has the sensitive attributes temperature, pulse rate, respiratory rate, blood pressure and disease. Blood pressure comprises systolic and diastolic values, so the attribute BP is broken into two parts as shown in Table 9, giving six sensitive attributes.
To anonymize the sensitive attributes of the dataset, HB-anonymity forms buckets by sorting the records alphabetically on the disease values. Four buckets are formed in the sample dataset, each comprising three records. After the formation of buckets, slicing is performed on $T_P^{SA}$ as shown in Table 10. The vertical slicing is based on the highly correlated attributes, identified from the correlation matrix computed using the Pearson correlation coefficient. As per the correlation matrix, Pulse Rate (PR) and disease belong to slice $sl_1$. For Heap Bucketization, $T_P^{QI}$ and $T_P^{SA}$ are merged as shown in Table 11. Heap Bucketization is formed by combining the records of each bucket as shown in Table 12. All three tuples of bucket 1 comprise the three individual records of bucket 1 itself. When an intruder tries to infer an individual record, he is not able to identify even the bucket where the record is located, because the generalized quasi-identifier is also distributed and the buckets are formed based on the disease. Finally, the anonymized data is released after sorting it by the id of the patient.
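Bucket formation and vertical slicing of the sensitive attribute table can be sketched as follows. The records are hypothetical, the bucket size of three follows the sample dataset, and only the assignment of (pulse rate, disease) to the first slice is stated in the text; the order of the other two slices in `SLICES` is an assumption.

```python
# Column groups for vertical slicing, based on the correlated pairs reported
# in the paper. Only the first slice (pulse rate, disease) is named in the
# text; the order of the remaining two is assumed.
SLICES = [
    ("pulse_rate", "disease"),
    ("sys", "dys"),
    ("respiratory_rate", "temperature"),
]


def form_buckets(sa_rows, size=3):
    """Sort sensitive rows alphabetically by disease and cut them into buckets."""
    ordered = sorted(sa_rows, key=lambda r: r["disease"])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def slice_bucket(bucket, slices=SLICES):
    """Return one list of value tuples per slice, keeping correlated columns together."""
    return [[tuple(row[c] for c in cols) for row in bucket] for cols in slices]


# Hypothetical sensitive attribute rows.
COLS = ["pid", "temperature", "pulse_rate", "respiratory_rate", "sys", "dys", "disease"]
raw = [
    (0, 98, 72, 18, 120, 80, "Diabetes"),
    (1, 99, 80, 20, 130, 85, "Asthma"),
    (2, 98, 76, 17, 110, 70, "Anaemia"),
    (3, 97, 70, 19, 125, 82, "Eye disorder"),
    (4, 98, 22, 16, 110, 70, "Allergic Rhinitis"),
    (5, 100, 84, 21, 140, 90, "Diabetes"),
]
sa_rows = [dict(zip(COLS, r)) for r in raw]

buckets = form_buckets(sa_rows)
sliced = slice_bucket(buckets[0])
print(sliced[0])
```

The sketch only groups columns; it does not permute values within a bucket, since the text does not say whether HB-anonymity does so.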
The main goal of the proposed HB-anonymity model is to perform Heap Bucketization. In conventional bucketization, the quasi-identifier and the sensitive attributes are separated, and the sensitive attribute values are randomly anonymized. The quasi-identifier values are published in their original form, so bucketization fails to protect against membership disclosure. Bucketization also needs a clear separation of QI and SA values, which may break the correlation between the quasi-identifier and the sensitive attributes. Because this linking relationship is broken, the utility loss is high [53]. The HB-anonymity model is proposed to overcome these disadvantages of bucketization, namely membership disclosure and improper anatomization. In HB-anonymity, the quasi-identifier and sensitive attributes are identified, and the QI is further broken into two sub-tables based on the correlation among the attributes. As the QI attributes are separated based on the correlation coefficient, the breaking of the linking relationship is prevented. k-anonymity is applied to the quasi-identifier, and the QI is generalized by replacing its values with the mean values of the equivalence class. The sensitive attributes are anonymized with the Heap Bucketization approach and slicing, which in turn prevent the non-membership attack and the fingerprint correlation attack.
The proposed HB-anonymity model preserves privacy under: (1) background knowledge attack; (2) quasi-identifier attack; (3) membership disclosure attack; (4) non-membership disclosure attack; and (5) fingerprint correlation attack. If the intruder knows that pid0 is male, aged > 50, the intruder can infer that the record falls in bucket 1. However, each record in bucket 1 comprises the values of all three records. Thus, the exact values of pid0 cannot be inferred. Even if the intruder knows the quasi-identifier values of an individual pid3, correlates the values of qid and concludes that the record falls in bucket 2, the exact value of any attribute cannot be retrieved. Likewise, the existence (mda) and non-existence (n-mda) of an individual cannot be inferred precisely from Table 12. If buckets 2 and 3 are intersected, only the sensitive attribute value disease = Diabetes is common and can be retrieved. Buckets 2 and 3 contain a total of 6 records, and the probability of finding the individual is 0.1, which is negligible. Thus, the Heap Bucketization-anonymity model also protects the dataset from the fca. An exhaustive evaluation of anonymization approaches for privacy-preserving data publishing has been studied and is summarized in Table 13. The complete workflow and the framework of HB-anonymity are depicted in Figure 1 and Figure 2.

IX. HEAP BUCKETIZATION-ANONYMITY ALGORITHM
The primary aim of the HB-anonymity algorithm is to achieve a balance between privacy and utility. Heap Bucketization is designed to overcome the limitations of bucketization. Generalization, slicing and bucketization are implemented together in the HB-anonymity model. The complete process of the model is explained in the HB-anonymity algorithm. In the algorithm, the patient table and the variable k are passed as input arguments in line 1. The output is the heap-bucketized table. The patient table is anatomized into two tables, 1. the quasi-identifier table and 2. the sensitive attribute table, in lines 3 and 4. The correlation among the quasi-identifier attributes is calculated using the Pearson correlation coefficient in lines 5 and 6. In line 7, the variable k and the quasi-identifier correlation table $D_1$ are passed to anonymize the quasi-identifier. The quasi-identifier table is further anatomized in line 8. The two quasi-identifier tables $T_P^{QI1}$ and $T_P^{QI2}$ are anonymized by implementing k-anonymity in lines 9 and 10. After k-anonymization, the tables are merged and slicing is applied to the highly correlated attributes in lines 11 and 12. The correlation among the sensitive attributes is calculated using the Pearson correlation coefficient in lines 13 and 14. In line 15, the sensitive attribute table is anonymized based on the correlation table $D_2$. In the sensitive attribute table, blood pressure is divided into two fields, sys and dys, in line 16. The sensitive attribute table then comprises six attributes, Temp, Pulse Rate (PR), Respiratory Rate, Sys, Dys and Disease, in line 17. To form the buckets, the records are sorted with respect to disease in line 18. In lines 19 to 23, buckets are formed with three records in each bucket. In line 24, slicing is performed on the bucketized table. In lines 25 and 26, the quasi-identifier and sensitive attribute tables are merged and the records are sorted by bucket id to implement Heap Bucketization. In lines 27 to 29, the values of the highly correlated attributes are grouped in each bucket to perform Heap Bucketization. Finally, the records are sorted by record id for data publishing.
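The final grouping step (lines 25 to 29 of the algorithm) can be sketched as below. This is one reading of the description, in which every record of a bucket is published with the pooled slice values of the whole bucket; the function name `heap_bucketize`, the `bucket` column and the sample rows are assumptions, and the pid is kept here only so the output can be checked.

```python
def heap_bucketize(merged_rows, slices, bucket_key="bucket"):
    """Publish each record with the pooled slice values of its whole bucket.

    Assumption: merged_rows already hold the anonymized QI values, the
    sensitive values and a bucket id for every record.
    """
    pools = {}
    for row in merged_rows:
        for cols in slices:
            key = (row[bucket_key], cols)
            pools.setdefault(key, []).append(tuple(row[c] for c in cols))
    published = []
    for row in sorted(merged_rows, key=lambda r: r["pid"]):
        out = {"pid": row["pid"]}
        for cols in slices:
            # Sorting hides the original order of the records in the bucket.
            out[cols] = sorted(pools[(row[bucket_key], cols)])
        published.append(out)
    return published


# Hypothetical merged table: mean-generalized age, original sex, one SA slice.
slices = [("age", "sex"), ("pulse_rate", "disease")]
merged = [
    {"pid": 0, "bucket": 1, "age": 55.0, "sex": "M", "pulse_rate": 72, "disease": "Asthma"},
    {"pid": 3, "bucket": 2, "age": 40.0, "sex": "F", "pulse_rate": 80, "disease": "Diabetes"},
    {"pid": 1, "bucket": 1, "age": 55.0, "sex": "F", "pulse_rate": 76, "disease": "Anaemia"},
    {"pid": 2, "bucket": 1, "age": 55.0, "sex": "M", "pulse_rate": 70, "disease": "Allergic Rhinitis"},
    {"pid": 4, "bucket": 2, "age": 40.0, "sex": "M", "pulse_rate": 78, "disease": "Diabetes"},
    {"pid": 5, "bucket": 2, "age": 40.0, "sex": "F", "pulse_rate": 82, "disease": "Eye disorder"},
]

released = heap_bucketize(merged, slices)
print(released[0])
```

Because all records of a bucket carry the same pooled values, knowing the bucket of an individual does not reveal which of the pooled tuples belongs to that individual.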

X. EXPERIMENTAL DETAILS AND RESULT

A. EXPERIMENTAL SETUP
The experimental setup used for the proposed model is a Windows 10 operating system with 8 GB of memory and a 1 TB hard disk. The work was implemented in Python 3. A novel dataset has been used in our work. The dataset was received from the Interdisciplinary Institute of Indian System of Medicine, Ayurveda. The total number of instances is 22,527. The dataset consists of patient information such as age, sex, height, weight, temperature, pulse rate, respiratory rate, blood pressure and disease. Age, sex, height and weight are general information, so they are categorized as quasi-identifier attributes, while temperature, pulse rate, respiratory rate, blood pressure and disease are categorized as sensitive attributes.

B. RESULTS AND DISCUSSION
The objective of the proposed model is to improve the privacy of the data while maintaining its utility. During pre-processing, the missing field values are filled with the mean value of the column and the duplicate records are removed. After the removal of duplicates, the total number of instances is 22,043. In the proposed model, generalization is carried out only on the quasi-identifier. The sensitive attributes are not generalized or suppressed. Only the slicing of the highly correlated attributes is applied to them as an anonymization process. The utility loss for the quasi-identifiers is measured using the metric Normalized Certainty Penalty (NCP) [44].

C. NCP
The utility loss for an anonymized attribute is measured using NCP as per Eq. 11. In the proposed model, NCP is used to measure the loss in the anonymized quasi-identifier attributes.
Let 'a' be an attribute value of X. The NCP is defined in terms of |a|, the number of nodes enclosed by 'a' under the corresponding generalized node, and |X|, the total number of nodes in attribute X. The original values of height, weight and age are taken as the old values, and the generalized values are taken as the new values of the attributes.

$$\mathrm{Infoloss}_{age} = \left|\, \left|\mathrm{infoloss}_{age}^{new}\right| - \left|\mathrm{infoloss}_{age}^{old}\right| \,\right| \quad (14)$$

As per Eq. 14, $\mathrm{infoloss}_{age}^{new}$ represents the information loss of the generalized value of the attribute age, and $\mathrm{infoloss}_{age}^{old}$ represents the information loss of the original value of the attribute age in the patient table.
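The ratio behind NCP can be illustrated with a minimal sketch. It assumes the commonly used form in which the penalty of a generalized value is |a| / |X| and an ungeneralized value (|a| = 1) carries no penalty; this is a sketch of that general definition, not a reproduction of the paper's Eqs. 11 to 13, and the function name and numbers are hypothetical.

```python
def ncp(enclosed_nodes, total_nodes):
    """Normalized Certainty Penalty of one generalized value.

    enclosed_nodes plays the role of |a| and total_nodes the role of |X|.
    Assumption: a value that covers a single node is not generalized and
    therefore has zero penalty.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if enclosed_nodes <= 1:
        return 0.0
    return enclosed_nodes / total_nodes


# Hypothetical attribute with 8 nodes in total.
print(ncp(1, 8))  # value left as it is
print(ncp(4, 8))  # value generalized to a node covering half the domain
print(ncp(8, 8))  # value generalized to the root
```

The penalty grows from 0 for an untouched value to 1 for a value generalized to the whole domain, which is why a scheme that avoids hierarchy generalization can keep the NCP close to zero.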
The unique values of the attributes height, weight and age are counted to find the mean deviation of each attribute across its unique values. The attribute height has 136 unique values, weight has 311, and age has 89. The mean deviation of each attribute across its unique values is calculated and expressed as a percentage (Eqs. 15 to 17). The average information loss using the NCP metric for the patient dataset is 0.083%, which is very low, as shown in Eq. 18. In the proposed model, only the quasi-identifier is generalized, and the information loss due to the generalization is 0.083%. The sensitive attributes are not generalized, so there is no information loss in the sensitive attributes.

$$\text{Average info loss} = \frac{Md_{ht} + Md_{wt} + Md_{age}}{3} \quad (18)$$

Figure 3 shows the NCP percentage obtained by varying k with a fixed number of sensitive attributes (e.g., MSA = 6) for HB-anonymity, (c,k)-anonymization and (p,k)-angelization. The NCP% of (p,k)-angelization increases steadily as k increases, so the utility of the dataset degrades with larger k. The buckets formed in the sensitive table may affect the utility in the quasi-identifier table. HB-anonymity has a utility loss of about 0.083%, which is almost zero, and the loss stays constant as k increases. (c,k)-anonymization has a utility loss of 0.9 in our execution, whereas in terms of NCP% HB-anonymity has negligible utility loss.

D. KL-DIVERGENCE
Kullback-Leibler (KL) divergence is a metric that measures the difference between one probability distribution and another. KL divergence is applied by treating the relational table as the probability distribution $d_1$, where $d_1(a)$ represents the proportion of records with value a ($a \in A$). The anonymized table obtained after applying HB-anonymity is denoted as the probability distribution $d_2$. The KL divergence between the actual distribution ($d_1$) and the estimated distribution ($d_2$) of the patient table is defined as:

$$KL(d_1 \,\|\, d_2) = \sum_{a \in A} d_1(a) \log \frac{d_1(a)}{d_2(a)}$$

In HB-anonymity, $d_1$ is the actual distribution of the sensitive attributes in the patient table $T_P$, and $d_2$ is the estimated distribution of the sensitive attributes in the patient table after Heap Bucketization. The KL divergence is computed for group sizes from 3 to 15. (p,k)-angelization yields a different score for the estimated distribution of the sensitive attribute buckets. In the proposed model, the KL divergence is calculated for the sensitive attributes only: NCP measures the utility loss in the quasi-identifier, and KL divergence measures the utility loss in the sensitive attributes. Since no generalization or suppression is applied to the sensitive attributes, the utility loss is measured through the distributions of the actual and anonymized records. In Figure 4, the KL divergence is plotted for different bucket sizes. Varying the bucket size also varies the distribution of the data. As the bucket size increases, higher privacy is achieved but the utility loss is also higher. In (p,k)-angelization, the divergence increases rapidly with bucket size, whereas (c,k)-anonymization has almost zero utility loss, with only a slight increase as the bucket size increases. HB-anonymity also results in negligible utility loss for reasonable bucket sizes, with a slight increase when the bucket size is very high. From Figure 4, the conclusion is that (p,k)-angelization has a considerable utility loss, whereas (c,k)-anonymization and HB-anonymity have negligible and consistent utility loss.
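The KL divergence between the original and the anonymized distribution of a sensitive attribute can be sketched as below. The helper names, the `eps` guard for values missing from the anonymized table and the sample disease columns are assumptions made for illustration.

```python
from collections import Counter
from math import log


def distribution(values):
    """Empirical probability distribution of a column of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def kl_divergence(d1, d2, eps=1e-12):
    """KL(d1 || d2) = sum over a of d1(a) * log(d1(a) / d2(a)).

    eps is an assumed guard against division by zero when a value of d1
    does not occur in d2.
    """
    return sum(
        p * log(p / max(d2.get(value, 0.0), eps))
        for value, p in d1.items()
        if p > 0
    )


# Hypothetical disease columns before and after anonymization.
original = ["Diabetes", "Diabetes", "Asthma", "Anaemia"]
anonymized = ["Diabetes", "Asthma", "Asthma", "Anaemia"]

d1 = distribution(original)
d2 = distribution(anonymized)
divergence = kl_divergence(d1, d2)
print(divergence)
```

A divergence of zero means that the anonymized table preserves the distribution of the attribute exactly; larger values indicate a larger utility loss in the sensitive attributes.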

E. EXECUTION TIME
HB-anonymity has a negligible execution time with respect to the number of sensitive attributes, and it protects the privacy of the data while keeping the execution time consistent. Its execution time is also very small with respect to the number of records. The execution time of (p,k)-angelization is likewise small, and there is only a slight difference between (p,k)-angelization and HB-anonymity. The execution time of (c,k)-anonymization is greater than that of both (p,k)-angelization and HB-anonymity.
The main advantage of HB-anonymity is that high privacy is achieved while the execution time is also reduced. The proposed model does not incorporate many customized rules to achieve privacy. As (c,k)-anonymization imposes many rules to achieve high privacy, its execution time increases as the number of records and the number of sensitive attributes increase. Figure 5 depicts the execution time against the number of sensitive attributes, and Figure 6 depicts the execution time against the number of records.

F. PRIVACY LOSS
The privacy loss in a dataset can be measured by the number of vulnerable records that intruders can identify. The privacy loss is directly proportional to the number of individuals identified in the released anonymized table: the more records are exposed to intruders, the higher the privacy loss. For (p,k)-angelization, (c,k)-anonymization and HB-anonymity, the privacy loss is measured by varying the value of k and the number of sensitive attributes (MSA). Figure 7a shows the privacy loss of (p,k)-angelization, (c,k)-anonymization and HB-anonymity for varying k. The number of vulnerable records in (p,k)-angelization increases gradually as k increases, because the number of records with a single sensitive value is high when fingerprint buckets are intersected. To give a clearer view of the privacy loss for HBA and (c,k)-anonymization, a logarithmic scale is used for the k value in Figure 7b. When the number of sensitive attributes increases, the vulnerable records in (p,k)-angelization also increase, owing to the increase in single sensitive values, as shown in Figure 7c. (c,k)-anonymization does not have any privacy loss, as no vulnerable records exist. HB-anonymity likewise achieves high privacy owing to Heap Bucketization. To give a clearer view of the privacy loss for HBA and (c,k)-anonymization, a logarithmic scale is used for the MSA value in Figure 7d. In HB-anonymity, all the records belonging to one bucket are combined so that the intruder is not able to identify any individual record, and the intruder cannot predict the sensitive attribute values from the intersection of any buckets. Although both (c,k)-anonymization and HB-anonymity have no privacy loss, (c,k)-anonymization has a much more complex anonymization process, so its execution time is high.

XI. CONCLUSION AND FUTURE DIRECTION
The paper has presented various related works on privacy-preserving data publishing with MSA. An efficient model, Heap Bucketization-anonymity, has been proposed to address the challenge of balancing utility loss and privacy. An HB-anonymity algorithm has been developed based on the anonymization methods adopted for the quasi-identifier and the sensitive attributes. The HB-anonymity model concentrates on preserving the relationship between the attributes, so the correlations among the quasi-identifier attributes and among the sensitive attributes are calculated using the Pearson correlation coefficient to achieve less utility loss. The quasi-identifier has been anonymized by implementing k-anonymity and slicing.
A new approach, Heap Bucketization, has been implemented to anonymize the sensitive attributes. Heap Bucketization makes the re-identification of an individual in the disclosed dataset a challenging task for the intruder. Experimental evaluation on the unique Ayurveda patient dataset shows that the proposed model achieves a balance between utility and privacy with less execution time. Moreover, HB-anonymity resists various attacks such as i) background knowledge attack, ii) quasi-identifier attack, iii) membership attack, iv) non-membership attack and v) fingerprint correlation attack. The future direction of the work is to develop models for dynamic data and unstructured data. In addition, we believe that a quasi-identifier could be a semi-sensitive attribute, and the work can be carried forward in that direction. The work could also be extended to 1:M microdata, which is a challenging research topic.