Towards Privacy Preservation using Clustering based Anonymization: Recent Advances and Future Research Outlook

With the continuous increase in avenues of personal data generation, privacy protection has become a hot research topic resulting in various proposed mechanisms to address this social issue. The main technical solutions for guaranteeing a user’s privacy are encryption, pseudonymization, anonymization, differential privacy (DP), and obfuscation. Despite the success of other solutions, anonymization has been widely used in commercial settings for privacy preservation because of its algorithmic simplicity and low computing overhead. It facilitates unconstrained analysis of published data that DP and the other latest techniques cannot offer, and it is a mainstream solution for responsible data science. In this paper, we present a comprehensive analysis of clustering-based anonymization mechanisms (CAMs) that have been recently proposed to preserve both privacy and utility in data publishing. We systematically categorize the existing CAMs based on heterogeneous types of data (tables, graphs, matrixes, etc.), and we present an up-to-date, extensive review of existing CAMs and the metrics used for their evaluation. We discuss the superiority and effectiveness of CAMs over traditional anonymization mechanisms. We highlight the significance of CAMs in different computing paradigms, such as social networks, the internet of things, cloud computing, AI, and location-based systems with regard to privacy preservation. Furthermore, we present various proposed representative CAMs that compromise individual privacy, rather than safeguarding it. Finally, we discuss the technical challenges of applying CAMs, and we suggest promising opportunities for future research. To the best of our knowledge, this is the first work to systematically cover current CAMs involving different data types and computing paradigms.


I. INTRODUCTION
With the rapid advances in information and communications technologies, personal data have become an economic resource that can assist data owners (hospitals, banks, insurance companies, social networking service providers, etc.) in fulfilling the needs/expectations of their affiliates in a seamless manner. With the huge proliferation of pervasive computing and digital tools, data owners are obtaining huge and varied amounts of personal data for financial gain. In recent years, collections of personal big data (i.e., private data produced in the daily lives/work of individuals) have become valuable assets in the data market, and have replaced oil as the most valuable economic resource [1]. The huge amount of collected personal data often encompasses information about an individual's demographics, spatial-temporal activities, photographs, finances, political/religious views, interests, hobbies, social circle information, and medical status, to name just a few types. Outsourcing the collected data to analytics firms/companies in order to extract relevant information regarding consumers can help companies sustain a competitive advantage, but privacy problems are the main hurdle in doing so [2]. Due to privacy issues, companies often prefer not to outsource their consumer/customer data to legitimate information consumers for knowledge discovery. The three common privacy issues that can occur as a result of data outsourcing based on users' attributes are disclosures of identity, sensitive information, and memberships [3].
According to a survey in the United States [4], unique identification of individuals is possible at very different rates depending on which combination of three attributes is available:
• zip code (five digits), date of birth, gender → 87%
• place of residence, date of birth, gender → 50%
• country of origin, date of birth, gender → 18%
User attributes such as date of birth, gender, zip code, race, and country of origin are called quasi-identifiers (QIDs). These QIDs in personal data can increase the chances of disclosing identities and corresponding sensitive attributes (SAs) [5]. To address these privacy issues, personal data are usually anonymized before publication. The technical solutions for protecting an individual's privacy in personal data handling are obfuscation, encryption, anonymization, and pseudonymization. However, due to low computing overhead and algorithmic simplicity, anonymization has been used extensively in commercial settings for privacy-preserving data publishing (PPDP) and was recently legislated in some advanced countries [6]. It employs many anonymization operations, such as generalization, suppression, randomization, slicing, and derived records, in order to strike a balance between privacy and utility in PPDP.
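The re-identification risk posed by QID combinations can be illustrated with a small, fabricated dataset: count how many records are uniquely singled out by a given combination of attributes. A minimal sketch (all values invented for illustration):

```python
from collections import Counter

# Toy records: (zip_code, birth_date, gender) act as quasi-identifiers (QIDs).
# All values are fabricated for illustration.
records = [
    ("13053", "1965-03-21", "F"),
    ("13053", "1971-07-04", "F"),
    ("13068", "1980-12-30", "M"),
    ("13068", "1962-05-16", "M"),
]

def reidentifiable_fraction(rows, qid_indices):
    """Fraction of rows whose QID combination is unique in the dataset."""
    keys = [tuple(r[i] for i in qid_indices) for r in rows]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(rows)

# The full triple singles out every record; dropping date of birth leaves
# each remaining combination shared by two people.
print(reidentifiable_fraction(records, (0, 1, 2)))  # 1.0
print(reidentifiable_fraction(records, (0, 2)))     # 0.0
```

The same counting logic underlies the survey percentages quoted above: the more discriminating the QID combination, the larger the uniquely identifiable fraction.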
Primarily, most anonymization approaches are applied to tabular/relational data. The well-known anonymization approaches applied to tabular data are k-anonymity [7], ℓ-diversity [8], and t-closeness [9]. These models showed remarkable results in terms of privacy preservation in the early days. However, they proved unsuccessful against certain contemporary privacy threats, and many refinements have been proposed to upgrade them [10], [11]. Some other developments (a.k.a. utility enhancements) have emerged in parallel to meet the needs of data analysts by keeping most data characteristics as close as possible to the original. For example, in 2006, differential privacy (DP) [12] was proposed for dynamic scenarios (e.g., query-answer). Afterwards, researchers extended the anonymization concepts from tabular data to social networking (SN) data in order to protect user privacy in graph publishing [13], [14]. For example, the k-anonymity concept for tabular data was modified to k-degree anonymity in order to preserve privacy in a social graph G(U, V), where U denotes SN users, and V is the set of edges modeling the relationships between users [15]. In recent years, anonymization approaches have been rigorously applied to diverse data formats (matrixes, tables, graphs, text, traces, multimedia, documents, etc.) for privacy preservation under multiple computing paradigms, such as the internet of things (IoT), artificial intelligence (AI) environments, and cloud computing. In this paper, we focus on clustering-based anonymization mechanisms (CAMs) that have shown remarkable improvements over traditional approaches in preserving both privacy and utility in recent years.
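The k-degree anonymity notion mentioned above reduces to a simple check: a graph satisfies it when every degree value occurring in the graph is shared by at least k nodes. A minimal sketch on toy graphs (illustrative only):

```python
from collections import Counter

def is_k_degree_anonymous(edges, num_nodes, k):
    """True if every node's degree is shared by at least k nodes in the graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Nodes that appear in no edge have degree 0.
    degs = [degree.get(n, 0) for n in range(num_nodes)]
    freq = Counter(degs)
    return all(freq[d] >= k for d in degs)

# A 4-cycle: every node has degree 2, so the degree value 2 is shared by
# all four nodes and the check passes for k=2.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_k_degree_anonymous(cycle, 4, k=2))  # True

# A star: the hub's degree 3 is unique, so the check fails even for k=2.
star = [(0, 1), (0, 2), (0, 3)]
print(is_k_degree_anonymous(star, 4, k=2))   # False
```

Degree-based anonymity is only one of the graph notions covered later; the check above shows why an adversary who knows a target's connection count can single out nodes with rare degrees.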
Previous reviews related to PPDP have covered important aspects, such as relational/graph anonymization techniques, privacy models and their extensions, anonymization operations, data anonymity frameworks, privacy-protection tools, and evaluation metrics used by the PPDP mechanisms. Rajendran et al. [16] discussed the strengths and weaknesses of three famous anonymity models: k-anonymity, ℓ-diversity, and t-closeness. Tran and Hu [17] provided a systematic review of big data analytics that preserves privacy. Other authors have discussed many generic privacy-preserving approaches for data querying, data publishing, and data mining. A few surveys have been published on outsourcing SN users' data while preserving privacy [18], [19]. Some authors have discussed various ways of preserving node/edge privacy when sharing G with third parties. Sharma et al. [20] discussed privacy concerns and corresponding privacy-preserving techniques for big data. Majeed and Lee [21] presented a detailed review of anonymization approaches that were applied to tabular and graph data. Cunha et al. [22] discussed various anonymization approaches for different data types, and provided a detailed taxonomy of privacy protection mechanisms and tools. Recently, a survey on privacy preservation in social media networks was published [23]. Although we fully affirm the key findings of previous reviews, the concepts/approaches covered in those reviews were limited, and CAMs were not covered thoroughly. To the best of our knowledge, none of the existing reviews covered CAMs that have been proposed for different computing paradigms and heterogeneous data formats. To close the gap, this paper presents a comprehensive review of anonymization techniques that employ clustering concepts while converting raw data into anonymized data.
The major contributions of this review to the PPDP field are summarized as follows. (i) We summarize the key findings of state-of-the-art (SOTA) clustering-based anonymization mechanisms that have been proposed for effective resolution of privacy and utility in PPDP. (ii) We systematically categorize existing CAMs into heterogeneous data formats, including SN (i.e., social graphs), relational (i.e., tabular), transactional (i.e., set), and trace, and present an up-to-date, thorough review of the latest anonymization techniques and metrics employed for evaluations. (iii) We describe the role of CAMs regarding privacy preservation in different computing paradigms, such as cloud computing, the IoT, location-based services, and application-specific SN scenarios (community clustering, collaborative filtering, privacy-aware recommendations, graph mining, etc.) that remained unexplored in the recent literature. (iv) This paper highlights various representative CAMs that are exploited by malevolent adversaries in order to compromise user privacy from published data (i.e., unique re-identification across SN sites, SA disclosures, and inferring private data). (v) We present the technical challenges in protecting user privacy by leveraging CAMs, and list potential avenues for future research to address contemporary privacy threats. (vi) This is the first work centering on CAMs from a broader perspective that can provide a solid foundation for future developments in the PPDP area.
The remainder of this article is organized as follows. Section II presents background on the privacy concept, the personal information enclosed in multiple data types, well-known privacy threats, privacy protection techniques and their operations, and the role of machine learning (ML) techniques in the information privacy domain. Section III presents an overview of the PPDP process, a conceptual overview of CAMs, and the superiority of CAMs over traditional anonymization methods. Section IV provides an overview of the 10 most widely used data formats and the corresponding SOTA CAMs for each data format. Section V discusses the significance of CAMs in emerging computing paradigms. Section VI highlights the dark side of CAMs in terms of privacy breaches. Then, we discuss the challenges of CAMs and suggest promising avenues for future research in Section VII. Finally, we conclude the paper in Section VIII. Figure 1 demonstrates the high-level structure of this survey paper.
As shown in Figure 1, we organize this survey paper by information complexity in a sequential manner (i.e., the complexity of the information increases as the sections progress). For example, in Section II, we present basic knowledge about the subject matter, including the scope of privacy as a whole, the ten different data types in which personal data are usually represented, the privacy threats specific to each data type (e.g., edge disclosure can occur only in graph data), a taxonomy of major privacy-enhancing technologies, and the role of AI in the privacy area from three perspectives. In Section III, we give an overview of the systems in which CAMs are used, followed by their working principle. In Section IV, we present an overview of different data styles and the application of CAMs to them. We then analyze the SOTA CAMs used for each data type with a detailed analysis of each technique. In Section V, we show the application of CAMs in multiple computing paradigms, along with a detailed analysis. Essentially, Sections IV and V show the bright sides of CAMs. In Section VI, we show the dark sides of CAMs, along with a critical analysis of each study. In Section VII, we highlight open challenges and future research opportunities in detail. Finally, we summarize the key points of this article and conclude in Section VIII.

II. BACKGROUND
Privacy has countless shades/definitions and is very subjective (i.e., the perception of it varies from individual to individual) [24]. In simple words, privacy is about safeguarding private information against prying eyes (a.k.a. public access) [25]. Privacy is regarded as one of the fundamental human rights and is vital for autonomy, individualism, and self-respect. The scope of privacy can be classified into four distinct categories, as shown in Figure 2. This review focuses on the first category (information privacy), which includes systems/infrastructures that gather, store, analyze, utilize, and disseminate personal data.

FIGURE 2.
Description of the scope of privacy (adapted from [26]).
Personal data can be represented in different formats, including tables, graphs, text, sets, and matrixes. For example, SN data are frequently modeled/represented with graphs. Moreover, hospitals/clinics mostly store and process personal data in a tabular form. Superstores usually manage consumer/customer data in set-valued form. In contrast, some sectors handle personal data in continuous fashion called streams. Figure 3 presents a generic overview of the four different types/styles in which personal data are encompassed.
In some cases, the same personal data can be consistently modeled in multiple formats. For instance, SN users' data can be presented in both graphs and tables. Personal information that needs privacy preservation can be of different types (diseases, photos, income, etc.) and can be encompassed in any one of the above data formats (tables, traces, set-valued, etc.). We present a detailed overview of personal information enclosed in different data types/styles in Figure 4. These can be classified as unstructured, semi-structured, and structured [22].
Privacy threats can also vary depending upon the style of the data and the corresponding personal information enclosed in each style (a.k.a. type). We provide a brief overview of privacy threats that can be executed on different data styles as follows.
• Table: identity disclosure, SA disclosure, membership disclosure, privacy-intrusive pattern revelation, group privacy theft, association rules extraction, etc.
• Graph: node re-identification, connection/relationship disclosure, edge/vertex label disclosure, affiliation disclosure, multiple SN account disclosure, community label disclosure, etc.
• Matrix/Set: sensitive itemset disclosures, purchase history theft, financial status disclosure, spatial-temporal activities disclosure, transport usage data disclosure, etc.
• Traces/Logs: location disclosure, trajectory disclosures, mobility pattern disclosures, spatial-temporal stay points disclosure, web-search disclosure, sensitive place visit disclosure, interaction disclosures, etc.
• Documents: intimate details of someone's life, medical/prescription history disclosure, income tax exposure, personal data disclosure, genomics data disclosure, etc.
• Text: intent disclosures, opinion disclosures, political party affiliation disclosure, personal preferences disclosure, social circle information disclosure, content disclosure, etc.
• Stream: diagnosis history, illegitimate data aggregation, stalking of individuals, targeted profiling, patterns in web searches, interest disclosures, mobility disclosure, location disclosure, etc.
• Multimedia: facial privacy disclosures (a.k.a. identity disclosure), SA disclosure, appearance disclosure, political affiliation disclosure, sensitive/controversial place visit disclosures, sensitive information predictions, itemset disclosures, surveillance data disclosure, hidden profiling, etc.
• Hybrid: multiple and intrusive high-privacy disclosures mentioned in the above data styles.
To safeguard user privacy against prying eyes, multiple privacy protection approaches have been proposed for secure collection, processing, analysis, utilization, and publication of personal data. We present a taxonomy of famous approaches in Figure 5, along with their concise descriptions and main operations. The main operations performed in each approach have benefits/liabilities in terms of computing complexity, conceptual simplicity, robustness, effectiveness in the privacy/usefulness trade-off, number of iterations, and resource utilization. For example, suppression and generalization operations have a distinct impact on privacy and utility, respectively. The former provides a higher level of privacy, but no utility for information consumers. In contrast, the latter sustains better utility and privacy in anonymized data. In addition, cryptography-based operations are mostly slow, but enable trans-border data flow, and provide rigorous privacy guarantees. These operations have been widely used in interactive scenarios (e.g., the IoT, SN, edge/cloud computing). Obfuscation-based approaches are highly useful in preserving the privacy of geo-spatial data (i.e., mobility and trajectories) by incorporating a fair amount of noise. The operations performed by pseudonymization-based approaches assist in hiding sensitive data by replacing them with pseudonyms. These approaches are mainly preferred in vehicular networks and smart-home environments. Finally, the hybrid approaches perform multiple operations, jointly considering the type of data, the characteristics of the attributes, and the objectives of privacy/utility in order to meet privacy/utility expectations [27]. All these approaches have been widely used in preserving both privacy and utility in different computing paradigms.
FIGURE 6.
Significance of AI in the information privacy domain (adapted from [28]).
In recent years, AI approaches have opened up new challenges and opportunities in the privacy protection domain. On one hand, they have enhanced the capabilities of existing privacy-preserving approaches in effectively preserving a user's privacy. On the other hand, they have become a target of malevolent adversaries, and can still allow disclosure of sensitive information. Majeed et al. [29] applied AI concepts to improve the performance of traditional anonymity approaches in preserving privacy and utility. In contrast, Park and Lim [30] proposed the idea of securing federated learning (FL) using homomorphic encryption. In the coming years, privacy-preserving approaches will benefit from AI-based approaches, and vice versa. In line with this trend, the synergy between AI and privacy-preserving approaches can be categorized from three aspects (as shown in Figure 6). Although many SA-specific, data-specific, application/threat-specific, domain-specific, attack-specific, sector-specific, and AI-based privacy-preserving approaches have been devised, clustering-based privacy-preserving approaches have improved traditional anonymization in different contexts. Therefore, the remainder of this review solely explores clustering-based anonymization approaches/developments in the context of PPDP.

III. OVERVIEW OF PRIVACY PRESERVING DATA PUBLISHING AND CAMS
In this section, we discuss the overview of PPDP and CAMs. Specifically, we discuss the life cycle of PPDP, the basic concepts of CAMs, and the superiority of CAMs over traditional anonymization algorithms.

A. DESCRIPTION OF THE LIFE CYCLE OF PPDP
The typical PPDP process encompasses six steps, all of which, along with their execution order, are shown in Figure 7. In Step A, appropriate data are collected from relevant individuals. Examples of data collection are account-opening procedures in a bank, or a check-up at a diagnostic center. In both of these scenarios, some basic information (i.e., QIs) as well as sensitive information (i.e., SAs) is obtained. Subsequently, the collected data are stored in safe repositories/databases for further analysis (Step B). Storage can be in graph form (e.g., SN data) or tabular form (e.g., hospital/bank data) depending upon the nature of the collected data. Due to the recent advancements in technology, storage capacity has become sufficiently large, and all types of data can be stored for utilization in multiple contexts. In Step C, preprocessing is applied to the collected data. During this step, the data are cleaned (outliers and missing values are removed, formatting and type checking are performed, and redundant records are removed). In Step D, the cleaned data from Step C are anonymized. During data anonymization, the original data are modified to preserve privacy, leaving the anonymized dataset useful for analysis. In Step E, anonymized data are published for analysis and data mining. In the final step, analytics is applied to the published data to extract useful information for hypothesis generation/verification. A conceptual overview of the anonymization process applied to raw data for PPDP is demonstrated in Figure 8.
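The six steps above can be sketched as a toy pipeline. Every stage here is a placeholder standing in for real collection, storage, cleaning, anonymization, publication, and analytics; the records and generalization rules are invented for illustration:

```python
# Minimal sketch of the six-step PPDP life cycle (Steps A-F).

def collect():            # Step A: gather QIDs and SAs from individuals
    return [{"age": 34, "zip": "13053", "disease": "flu"},
            {"age": 37, "zip": "13068", "disease": "cold"},
            {"age": None, "zip": "13053", "disease": "flu"}]

def store(data):          # Step B: persist to a repository (here, a list)
    return list(data)

def preprocess(data):     # Step C: drop records with missing values
    return [r for r in data if all(v is not None for v in r.values())]

def anonymize(data):      # Step D: generalize the QIDs (coarse example)
    return [{"age": f"{r['age'] // 10 * 10}-{r['age'] // 10 * 10 + 9}",
             "zip": r["zip"][:3] + "**",
             "disease": r["disease"]} for r in data]

def publish(data):        # Step E: release the anonymized view
    return data

def analyze(data):        # Step F: analytics on the published data
    return len(data)

published = publish(anonymize(preprocess(store(collect()))))
print(published[0])        # {'age': '30-39', 'zip': '130**', 'disease': 'flu'}
print(analyze(published))  # 2
```

Note how the record with a missing age never reaches Step D: cleaning precedes anonymization in the life cycle, so anonymization operates only on well-formed records.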

B. DESCRIPTION OF THE CLUSTERING BASED ANONYMIZATION APPROACHES USED FOR PPDP
Thus far, many anonymization approaches have been proposed to address privacy and utility issues in PPDP by leveraging clustering concepts. We illustrate a generic overview of the clustering concept in Figure 9. The anonymization of clusters is mainly the same as anonymizing QI groups (a.k.a. equivalence classes) in the traditional anonymization approaches (k-anonymity, ℓ-diversity, t-closeness, and their extensions). The CAMs have been extensively studied in the recent literature for privacy preservation due to improved privacy and utility results. Furthermore, the anonymized data produced by the CAMs are helpful for secondary purposes (e.g., demography-based disease analysis, policy-making, future event predictions).

C. THE SUPERIORITY OF CLUSTERING-BASED APPROACHES OVER TRADITIONAL ANONYMIZATION APPROACHES
The clustering-based approaches have revolutionized the information privacy domain in many aspects. For instance, the k-anonymity model enforces a hard constraint on the number of people in a QI group/class, and usually retains exactly k people per group. In contrast, CAMs can relax such hard constraints and keep similar people together in a cluster, regardless of its size, by using the similarity/distance concept. The mathematical expression used to compute similarity between two users (or between a user and the cluster center) is given in Eq. 1,
where i denotes the QIs, and p represents the total number of QIs. From Eq. 1, S values between users and cluster centers can be computed, and clusters can be formed. In CAMs, multiple checks are performed for each record in order to find the best-matching cluster that ensures homogeneity in clusters. In contrast, the traditional anonymization approaches usually assign records to a QI group by performing a single check (leading to imprecise utility results in most cases). Since traditional anonymization approaches often ignore similarity/distance concepts while making the groups/classes, the generalization intervals are very wide, which can lead to false hypothesis generation in the end. In contrast, CAMs employ distance/similarity concepts, and therefore, the possibility of false hypothesis generation is relatively low. Furthermore, CAMs can control the issue of over-generalization, and an anonymized dataset produced by them has better utility and privacy. Analytical and data mining tasks can be performed with sufficient accuracy. In contrast, traditional anonymization often leads to imprecise analysis results by introducing heavier changes in the anonymized data. Furthermore, CAMs have the ability to control the heavier changes during data anonymization, which can pave the way for better resolution of privacy versus utility.
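The multi-check cluster assignment described above can be sketched as follows. The exact form of Eq. 1 is not reproduced here, so this sketch uses one common instantiation (a sum of range-normalized per-QI distances); the cluster centers and attribute ranges are assumed values:

```python
def distance(record, center, ranges):
    """Sum of range-normalized per-QI distances over the p QIs
    (one common instantiation of a similarity/distance measure)."""
    return sum(abs(record[i] - center[i]) / ranges[i]
               for i in range(len(record)))

def assign_to_cluster(record, centers, ranges):
    """Check the record against EVERY cluster center and pick the closest,
    unlike the single-check assignment of traditional QI grouping."""
    return min(range(len(centers)),
               key=lambda c: distance(record, centers[c], ranges))

# Two numeric QIs: age and income; ranges normalize each attribute's scale.
centers = [(30, 40000), (60, 90000)]   # assumed cluster centers
ranges = (80, 100000)                  # assumed attribute ranges (max - min)
print(assign_to_cluster((28, 35000), centers, ranges))  # 0
print(assign_to_cluster((65, 95000), centers, ranges))  # 1
```

Because each record is compared against all centers, the resulting clusters are homogeneous, which in turn keeps the generalization intervals narrow and the anonymized data useful.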
The anonymized data produced by CAMs enable better understanding of differences among commonalities, and of commonalities among the differences. Furthermore, CAMs are vital for enhancing the performance of knowledge-based systems/applications. CAMs have the ability to control inaccurate decision-making, and they enable a better understanding of patterns/trends from the anonymized data. In addition, CAMs are flexible, meaning they can be applied to different data styles with minor modifications. CAMs can yield consistent performance with different data styles and domains. CAMs have the ability to produce promising results in big data platforms such as MapReduce, Spark, and Hadoop. Recently, CAMs have also been extensively applied to unsuitable/imbalanced data in order to meet analytics demands [31]. In the coming years, application areas for CAMs are likely to expand to many domains.

IV. CLUSTERING-BASED ANONYMITY MECHANISMS FOR HETEROGENEOUS DATA TYPES/STYLES
In this section, we describe the effectiveness of CAMs on heterogeneous data, and we present SOTA approaches for each data type. We chose 10 representative data styles for the analysis: tables, graphs, matrixes, traces, documents, text, streams, logs, multimedia, and hybrids. We discuss basic concepts with an example of each style before discussing SOTA CAMs in Table 1.

A. CAMS FOR TABULAR DATA
Most data owners, such as banks, hospitals, and insurance companies, maintain their patient/customer/subscriber data in tabular form. Data storage, analysis, utilization, and distribution are relatively easier in tabular form, compared to other styles. A table, T, is a combination of rows and columns. Each row of T provides complete information about an individual, whereas a column is for one item (e.g., age) concerning the individuals. A generic overview of a common structure of T for a sample of 9000 individuals is shown in Eq. 2, where each row represents complete information about a user, including basic attributes (i.e., QIDs) as well as SAs. Moreover, each column represents one item (e.g., age or salary) related to all users. Many approaches have been proposed to anonymize T. Well-known anonymization models (e.g., k-anonymity, ℓ-diversity, t-closeness, and their extensions) were primarily applied to tabular data only. Later, they were extended to other styles of data. CAMs have improved various drawbacks in these models with regard to computing efficiency, privacy, and utility preservation. Figure 10 presents an overview of tabular data anonymization using the clustering concept. The original T to be anonymized with the clustering technique is shown in Figure 10 (a). In Figure 10 (b), the clustering technique has been applied to T, and the corresponding clustered results are shown. As seen in Figure 10 (b), record placement has been changed, and users have been grouped into different clusters. In the last step, anonymized data T′ are generated. T′ can be outsourced to information consumers for analytical/data-mining purposes.
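The tabular workflow of Figure 10 can be sketched with a deliberately simplified clustering step: sort on a single numeric QI, cut into groups of at least k records, and generalize each group to its [min-max] interval. Real CAMs use multi-attribute similarity rather than this one-dimensional shortcut:

```python
def cluster_and_generalize(ages, k):
    """Sort on the QI, cut into clusters of at least k records, and replace
    each value with its cluster's [min-max] interval."""
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    clusters = [order[i:i + k] for i in range(0, len(order), k)]
    if len(clusters) > 1 and len(clusters[-1]) < k:
        clusters[-2].extend(clusters.pop())  # merge an undersized tail cluster
    anonymized = [None] * len(ages)
    for cluster in clusters:
        lo = min(ages[i] for i in cluster)
        hi = max(ages[i] for i in cluster)
        for i in cluster:
            anonymized[i] = f"[{lo}-{hi}]"
    return anonymized

ages = [23, 25, 31, 44, 46, 52, 29]
print(cluster_and_generalize(ages, k=3))
# ['[23-29]', '[23-29]', '[31-52]', '[31-52]', '[31-52]', '[31-52]', '[23-29]']
```

Sorting before cutting keeps similar values together, so the published intervals stay narrow; random grouping over the same data would yield much wider intervals and lower utility.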

B. CAMS FOR GRAPH DATA
Social network data are usually modeled/represented with the help of a social graph. Social graph G can contain n users, and each user can have m edges/connections with other users. Multiple ways exist to represent SN user data. In Figure 11, we illustrate the four most widely used representations of SN data via G. Anonymization approaches generally modify the structure of G to preserve both privacy and utility. The anonymization approaches devised for one type of representation cannot be directly applied to another type of G. The five main approaches used for privacy-preserving SN-data publishing are G modification, G generalization/clustering, DP-based approaches to G anonymization, privacy-aware G computation, and hybrid G anonymity methods [33]. In this work, our focus is on clustering-based anonymization, and therefore, we discuss concepts and examples related to clustering-based approaches. CAMs usually partition G into various non-overlapping clusters, and then generalize the clusters to either super nodes or edges. An overview of clustering-based anonymization of G is shown in Figure 12. In Figure 12, G encompasses seven vertices and two QIs used as input for CAMs. The CAM partitions this G into three non-overlapping clusters by exploiting similarities between QIs. Finally, a generalized/anonymized G is obtained with three super nodes. For the sake of simplicity, we denote only three clusters with distinct shapes. The ordered pair of numbers in each super node represents the numbers of users and intra-cluster edges. Zero in a super node indicates no edge/connection between the users. Recently, there has been an increasing focus on developing CAMs for data encompassed in G.
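The super-node generalization of Figure 12 can be sketched as follows, assuming the vertex partition has already been produced by a clustering step (real CAMs derive it from QI similarity); the toy edges and partition are illustrative:

```python
def generalize_graph(edges, partition):
    """Collapse each cluster of vertices into a super node labeled with
    (number of users, number of intra-cluster edges)."""
    cluster_of = {v: c for c, members in enumerate(partition) for v in members}
    intra = [0] * len(partition)
    super_edges = set()
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            intra[cu] += 1                            # edge inside one cluster
        else:
            super_edges.add((min(cu, cv), max(cu, cv)))  # edge between super nodes
    labels = [(len(members), intra[c]) for c, members in enumerate(partition)]
    return labels, sorted(super_edges)

# Seven vertices, partitioned into three clusters (a toy stand-in for the
# seven-vertex example of Figure 12).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 0)]
partition = [[0, 1, 2], [3, 4], [5, 6]]
labels, super_edges = generalize_graph(edges, partition)
print(labels)       # [(3, 2), (2, 1), (2, 1)]
print(super_edges)  # [(0, 1), (0, 2), (1, 2)]
```

Each published label reveals only how many users and internal connections a super node contains, hiding which specific vertices are linked.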

C. CAMS FOR SET-VALUED/MATRIX DATA
Superstores usually store and process user data in set-valued/matrix form. The complete set of data in this regard is known as a transactional database. These transactional datasets contain multiple records, called transactions, which encompass a set of items (e.g., products purchased or diagnosis codes). Such datasets have higher applicability in biomedical studies, e-commerce, and recommender systems. Many approaches have been developed for anonymizing set-valued data [34]. We demonstrate an overview of set-valued data anonymization with the clustering concept in Figure 13.
In Figure 13 (a), original data to be anonymized are shown, whereas Figure 13 (b) shows the anonymized data. Due to the significant advancements in recommender systems, transactional database anonymization has become a hot research topic.
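One way to realize set-valued clustering is greedy grouping by Jaccard similarity, then publishing each cluster's item union so that individual transactions become indistinguishable within their cluster. The similarity threshold and the carts below are illustrative assumptions, not a specific method from the literature:

```python
def jaccard(a, b):
    """Jaccard similarity between two item sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def anonymize_transactions(transactions, threshold):
    """Greedy clustering: place each transaction in the first cluster whose
    representative is similar enough, then publish each cluster's item union."""
    clusters = []  # list of (representative item set, member indices)
    for idx, t in enumerate(transactions):
        for rep, members in clusters:
            if jaccard(rep, t) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((set(t), [idx]))
    out = [None] * len(transactions)
    for rep, members in clusters:
        union = set().union(*(transactions[i] for i in members))
        for i in members:
            out[i] = sorted(union)
    return out

carts = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"phone", "charger"}]
print(anonymize_transactions(carts, threshold=0.5))
# [['bread', 'eggs', 'milk'], ['bread', 'eggs', 'milk'], ['charger', 'phone']]
```

After publication, the first two shoppers share an identical generalized record, so an adversary who knows one item cannot tell their transactions apart.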

D. CAMS FOR TRACE DATA
Web searches, network usage, and mobility data are mostly collected, stored, and processed in trace form. Trace data hold spatial-temporal and detailed information about individuals. The anonymization of trace data has become a hot research topic, especially in the era of COVID-19. The contact tracing apps used in this pandemic mainly store individuals' movements for contact tracing purposes. Furthermore, trace data have been widely used in analytics and recommender systems. However, anonymization of trace data is very challenging due to its high dimensionality: a trace typically contains multiple fields, such as time of day, protocol, and IP address. We demonstrate an overview of clustering-based anonymization of trace data with a single field (e.g., the IP address) in Figure 14, where C1 and C2 refer to cluster 1 and cluster 2, respectively.

FIGURE 14.
Overview of trace data anonymization using a CAM (adapted from [36]).
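The IP-address generalization suggested by Figure 14 can be sketched by masking each cluster of addresses to its longest shared octet prefix. The two clusters below stand in for C1 and C2 and are assumed inputs (real CAMs would first cluster the addresses by similarity):

```python
def common_prefix_mask(ips):
    """Generalize a cluster of dotted-quad IPs to their longest shared
    octet prefix, masking the remaining octets with '*'."""
    split = [ip.split(".") for ip in ips]
    kept = []
    for octets in zip(*split):       # walk octet positions left to right
        if len(set(octets)) == 1:    # all addresses agree on this octet
            kept.append(octets[0])
        else:
            break
    return ".".join(kept + ["*"] * (4 - len(kept)))

# Assumed clusters of similar addresses (stand-ins for C1 and C2).
c1 = ["192.168.1.10", "192.168.1.23", "192.168.1.200"]
c2 = ["10.0.7.1", "10.0.9.5"]
print(common_prefix_mask(c1))  # 192.168.1.*
print(common_prefix_mask(c2))  # 10.0.*.*
```

Clustering similar addresses first means less of each address must be masked, which is exactly the privacy-utility advantage CAMs claim over grouping arbitrary records together.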

E. CAMS FOR DOCUMENT DATA
Sensitive data, such as medical histories, newspapers, conversations, reports, and agreements, are mostly enclosed in document form. In recent years, anonymization of document data has become a very hot research topic [37], [38]. Various techniques from natural language processing (e.g., named entity recognition) combined with clustering concepts (e.g., k-means) are employed to anonymize the textual content of documents. We present an overview of document data anonymization in Figure 15.

FIGURE 15.
Overview of document data anonymization using a CAM (adapted from [39]).

F. CAMS FOR TEXT DATA
With the rapid adoption of SN across the globe, text data including posts/comments encompass a variety of personal data that need privacy preservation from malevolent adversaries. Due to the inclusion of personal data in texts and blogs, privacy preservation has become challenging for SN service providers. Similarly, privacy preservation in clinical text by detecting and anonymizing sensitive data items has also become a vibrant area of research in recent years [40]. We present an overview of text data anonymization in Figure 16. CAMs usually anonymize cluster names and other sensitive data items from multiple texts.
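A very reduced sketch of the detect-and-anonymize idea: replace detected sensitive terms with a category label. A real system would use named entity recognition plus clustering rather than the fixed term list assumed here, and the sample post is invented:

```python
import re

def redact(text, sensitive_terms, label):
    """Replace every occurrence of the given sensitive terms with a
    category-level label (a stand-in for NER-driven detection)."""
    pattern = re.compile("|".join(map(re.escape, sensitive_terms)))
    return pattern.sub(label, text)

post = "Alice met Bob at Mercy Hospital to discuss her diabetes treatment."
print(redact(post, ["Alice", "Bob"], "[PERSON]"))
# [PERSON] met [PERSON] at Mercy Hospital to discuss her diabetes treatment.
```

In a clustering-based pipeline, the detected entities would additionally be grouped by type or similarity so that each label reflects a cluster (e.g., all person names map to one generalized token), preserving the text's analytical value.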

G. CAMS FOR LOGS DATA
Web searches, website usage, communication frequency, and SN usage data are mostly collected, stored, and processed in log form. Web search logs are useful in many respects, but present the possibility of misuse. Since logs are distinct, compared to other data styles, many aspects (such

FIGURE 16.
Overview of text data anonymization using a CAM (adapted from [41]).
as diversity) can easily lead to privacy breaches and hidden data collection. This necessitates the need to develop anonymization methods and solutions specific to this data style/environment. CAMs are highly applicable to log data for privacy preservation [42]. We demonstrate an overview of log data anonymization in Figure 17. In clustering-based anonymization, similar attributes are combined to effectively address the privacy-utility trade-off, and storage capacity [43].
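As an illustration of how CAMs combine similar log attributes, the following hypothetical sketch clusters search queries by a crude similarity key (their first token) and publishes a cluster only if it reaches k members, suppressing the rest. The grouping rule and function name are assumptions made for illustration, not a surveyed method.

```python
from collections import defaultdict

def anonymize_query_log(queries, k=2):
    """Cluster queries by their first token; publish only clusters with >= k members,
    releasing a generalized topic label in place of each raw query."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[q.split()[0].lower()].append(q)
    published = []
    for topic, members in clusters.items():
        if len(members) >= k:
            # One generalized entry per original record keeps counts usable for analytics.
            published.extend([topic + " *"] * len(members))
        # Clusters smaller than k are suppressed entirely.
    return published

log = ["flu symptoms", "flu treatment", "rare disease clinic"]
print(anonymize_query_log(log, k=2))  # ['flu *', 'flu *']
```

Note the design choice: rare (and therefore highly identifying) queries are dropped rather than generalized, trading completeness for privacy.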

H. CAMS FOR STREAM DATA
With the emergence of the cloud and edge computing paradigms, many real-time services have been developed for healthcare and intelligent prediction. In these environments, data are usually collected in real time; such data are commonly called stream data. Stream data have many potential benefits in time-sensitive and IoT-based applications, and privacy preservation in stream data has been extensively studied in recent years [44]–[46]. Stream data are usually collected as tuples; hence, privacy preservation is more challenging compared to other types of data [47]. We present an overview of stream data anonymization using a CAM in Figure 18. With stream data, anonymization approaches generally employ the windowing concept during conversion of raw data into anonymized data.
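The windowing concept mentioned above can be sketched as follows: incoming tuples are buffered until a fixed-size window fills, then released with the quasi-identifier generalized to the window's range. This is a simplified, assumed formulation for illustration, not any particular surveyed algorithm (which would also bound delay and pick cluster members by similarity rather than arrival order).

```python
def anonymize_stream(stream, window=3):
    """Buffer incoming (id, age) tuples; once a window fills, release its records
    with the age generalized to the window's [min-max] range."""
    buffer, released = [], []
    for record_id, age in stream:
        buffer.append((record_id, age))
        if len(buffer) == window:
            lo = min(a for _, a in buffer)
            hi = max(a for _, a in buffer)
            released.extend((rid, f"[{lo}-{hi}]") for rid, _ in buffer)
            buffer.clear()  # start the next window
    return released

stream = [("u1", 23), ("u2", 29), ("u3", 25), ("u4", 40)]
print(anonymize_stream(stream))
```

Here "u4" stays buffered until its window fills, which is exactly the latency/privacy tension that makes stream anonymization harder than anonymizing static tables.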

I. CAMS FOR MULTIMEDIA DATA
With the rapid development of SN services, multimedia data (i.e., images and video) have become a rich source of interaction among people. People increasingly use multimedia for a variety of purposes, such as sharing events, historical places visited, workplace activities, and photographs. Due to a significant rise in multimedia data generation and consumption, privacy protection has become necessary on multiple platforms. We present an overview of image data anonymization using a CAM in Figure 19. Apart from image data, video also needs privacy protection from prying eyes [49]. Hence, privacy protection of multimedia data has been extensively studied in recent times.

FIGURE 19.
Overview of image data anonymization using a CAM (adapted from [50]).
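A common clustering-flavored primitive in image anonymization is to replace groups of nearby pixels with a single representative value (pixelation/blurring). The sketch below averages fixed blocks of a grayscale image, a simple stand-in for clustering pixels and substituting cluster centroids; it is an illustrative toy, not a surveyed pipeline such as [50].

```python
def pixelate(image, block=2):
    """Average each non-overlapping block x block region of a grayscale image,
    replacing every pixel in the region with the region's mean (its 'centroid')."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for by in range(0, h, block):
        for bx in range(0, w, block):
            cells = [(y, x) for y in range(by, min(by + block, h))
                             for x in range(bx, min(bx + block, w))]
            avg = sum(image[y][x] for y, x in cells) // len(cells)
            for y, x in cells:
                out[y][x] = avg
    return out

face = [[10, 20], [30, 40]]
print(pixelate(face))  # [[25, 25], [25, 25]]
```

Larger blocks destroy more identifying detail (stronger privacy) at the cost of visual utility, mirroring the generalization granularity choice in tabular CAMs.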

J. CAMS FOR HYBRID DATA
In hybrid data, more than one style is used to represent personal data. For example, SN data can be modeled with the help of tables and graphs. The anonymization of hybrid data can be performed just as it is with an individual data style; however, the number of operations can be higher with the hybrid data style than with a single data style. Mohapatra and Patra [116] discussed clustering-based anonymization of hybrid data (e.g., tables and graphs); their approach can represent data in both table and graph forms.
Recently, some CAMs have focused on securing personal data in higher-dimensional hybrid data [117]–[119]. With the rapid development of many digital infrastructures, hybrid data have been increasingly used in knowledge-based systems, and therefore their privacy preservation is more urgent than ever.
In recent years, many CAMs have been developed for each data style explained above, achieving multiple objectives (privacy-preserving data analytics, data mining, analytical tasks, securing IoT-based infrastructure, and securing personal data from AI-based systems/manipulations). We provide a systematic analysis of SOTA CAMs used for different data styles in Table 1.
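Many of the CAMs surveyed below build on some variant of k-member clustering: records are grouped so that each cluster contains at least k members, and quasi-identifier values are generalized to a cluster-level range. A minimal sketch for a one-dimensional numeric quasi-identifier (assuming distinct values and a greedy grouping of sorted values; real CAMs minimize an information-loss objective instead) is:

```python
def k_member_clusters(values, k):
    """Greedy k-member clustering of a 1-D quasi-identifier: sort the values and
    cut them into consecutive groups so every cluster holds at least k records."""
    ordered = sorted(values)
    clusters = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    if len(clusters) > 1 and len(clusters[-1]) < k:
        clusters[-2].extend(clusters.pop())  # merge an undersized tail cluster
    return clusters

def generalize(clusters):
    """Map each value to its cluster's [min-max] range, yielding k-anonymous output.
    Assumes distinct values for this illustrative mapping."""
    return {v: f"[{c[0]}-{c[-1]}]" for c in clusters for v in c}

ages = [21, 23, 45, 47, 22, 46]
print(generalize(k_member_clusters(ages, k=3)))
```

Because clustering groups *similar* values, the published ranges stay narrow, which is the intuition behind the utility gains of CAMs over partition-agnostic generalization reported throughout this survey.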

TABLE 1. Systematic analysis of SOTA CAMs used for different data styles.

Ref. | Data type | Key Assertion (e.g., problem solved) in the context of privacy-preserving data publishing | Clustering concept used
Zheng et al. [51] | Table | Improved k-anonymity using clustering by choosing cluster positions reasonably to improve utility | Community clustering
Kabiru et al. [52] | Table | Improved the privacy-utility trade-off using dimensionality reduction techniques in clustering | Self-organising maps
Khan et al. [53] | Table | Better privacy preservation over sanitization-based methods using clustering concepts | Hierarchical clustering
Pritam et al. [54] | Table | Ensured better privacy in the healthcare sector using clustering concepts and by improving t-closeness | Single-identity clustering
Zouinina et al. [55] | Table | Better privacy preservation in PPDP via constrained clustering and topological collaborative clustering | Self-organising maps
Zheng et al. [56] | Table | Significant reduction in information loss and privacy issues by using clustering-based anonymity | K-means algorithm
Ashkouti et al. [57] | Table | ℓ-diversity-based model for better privacy protection on big data platforms (e.g., Apache Spark) | K-means algorithm
Abbasi et al. [58] | Table | Addressed scalability and utility issues in PPDP involving high-dimensional data | K-means++ method
Onesimu et al. [59] | Table | Ensured privacy protection against five active privacy attacks by using k-anonymity and clustering | Bottom-up clustering
Yan et al. [60] | Table | Lowered the influence of outliers on the clustering process and improved data availability in PPDP | Weighted k-member algorithm
Rupali et al. [61] | Graph | Ensured a high level of privacy compared to SOTA methods using clustering concepts on SN data | K-means algorithm
Zhang et al. [62] | Graph | Minimized information loss in graph data publishing using k-degree and clustering concepts | Partitional clustering
Shakeel et al. [63] | Graph | Provided strong privacy guarantees against active attacks while sustaining the usefulness of data | Node clustering
Chen et al. [64] | Graph | Maintained the balance between utility of mobile SN data and performance of anonymization | Density-based clustering
Debasis et al. [65] | Graph | Provided better protection against identity disclosure and maintained higher utility | Hierarchical clustering
Skarkala et al. [66] | Graph | Ensured strong privacy preservation in weighted graphs using k-anonymity and clustering | Node and edge clustering
Langari et al. [67] | Graph | Provided better privacy in four SN datasets: Google+, Facebook, YouTube, and Twitter | K-member fuzzy clustering
Reza et al. [68] | Graph | Provided superior results in information loss, privacy, and time complexity in graph anonymization | Edge clustering
Kumar et al. [69] | Graph | Resolved the privacy-utility trade-off in anonymizing SN data enclosed in graph form | Fuzzy clustering
Heidari et al. [70] | Graph | Analyzed the distribution of nodes in the clustering process to improve running time and utility of graphs | Sequential clustering
Wang et al. [71] | Set-valued | Provided effective resolution of privacy and information loss using community clustering concepts | Community clustering
Shyue et al. [72] | Set-valued | Provided experimental analysis of different clustering techniques on set-valued data | Gray-sort clustering
Divanis et al. [73] | Set-valued | Devised a method that can fulfil diverse privacy and utility requirements in PPDP | Agglomerative clustering
Awad et al. [74] | Set-valued | Provided higher usefulness of anonymized transactional databases in terms of utility rules | Ant-based clustering
Awad et al. [75] | Set-valued | Devised a method for higher knowledge extraction from anonymized data for future analysis | Ant-based clustering
Barakat et al. [76] | Set-valued | Provided a mechanism for detecting quantitative privacy breaches on real-world disassociated datasets | Partitional clustering
Wang et al. [77] | Set-valued | Provided a first solution towards personalized privacy preservation in transactional data anonymization | k-member clustering
Can et al. [78] | Set-valued | Proposed a mechanism for lowering information loss while anonymizing data | Agglomerative and hierarchical
Meisam et al. [79] | Trace | Proposed a viable method for preventing sensitive information leakage from traces | Partitional clustering
Fan et al. [80] | Trace | Ensured strong privacy in network traffic synthesis using DP concepts and clustering | Similarity-aware clustering
Meisam et al. [81] | Trace | Retained higher utility in trace anonymization without compromising privacy via a grouping approach | Multi-view grouping
Aleroud et al. [82] | Trace | Provided a robust solution for preserving privacy and utility in publicly sharing trace data | k-means clustering
Ahmed et al. [83] | Trace | Addressed the privacy-utility trade-off and excelled prior techniques in attack prediction accuracy | k-means clustering
Velarde et al. [84] | Trace | Provided a new anonymization method for real-life traffic traces while preserving utility and privacy | k-means clustering
Shaham et al. [85] | Trace | Provided an ML-based approach for higher utility provision in spatio-temporal databases | k′-means algorithm
Mahajan et al. [86] | Documents | Developed a framework for carrying out searches over published information in cloud environments | EM and k-means
Li et al. [87] | Documents | Implemented a prototype system for privacy preservation in sharing medical documents with third parties | Recursive partitional clustering
Li et al. [88] | Documents | Developed a system to ensure privacy in documents such as pathology reports and clinical narratives | DP-based grouping
Kong et al. [89] | Documents | Devised a system for preserving privacy of contents and their extracted features in documents | Supervised clustering
Garat et al. [90] | Documents | Developed a prototype for privacy preservation of sensitive data in legal documents | Agglomerative clustering
Li et al. [91] | Text | Developed a prototype system for privacy preservation of medical data enclosed in text form | k-means clustering
Lima et al. [92] | Text | Developed a web-based framework named HITZALMED for privacy preservation in clinical text | Unsupervised clustering
Liu et al. [93] | Text | Developed a practical approach for frequent pattern mining with higher utility in PPDP | Similarity-based clustering
Liu et al. [94] | Text | Preserved privacy without compromising data utility in anonymizing incomplete medical data | k-member algorithm
Ghemri et al. [95] | Text | Ensured better utility and privacy while converting raw text data into anonymized data for data sharing | k-means clustering
Chen et al. [96] | Logs | Developed a system for privacy preservation in log data using a clustering-based generalization concept | Hierarchical clustering
Garcia et al. [97] | Logs | Developed a privacy-preserving method for log data that retains univariate statistics for data mining | Semantic similarity
Yuvaraj et al. [98] | Logs | Developed a model for accuracy enhancement of anonymized data without compromising data quality | Deep adaptive clustering
Meng et al. [99] | Logs | Developed a system for improving personalized search by preserving utility and privacy in log data | Semantic clustering
Pamies et al. [100] | Logs | Developed a method for protecting query logs from adversaries while improving search services | Semantic clustering
Ullah et al. [101] | Logs | Developed a framework for privacy preservation of web searches via clustering-based anonymization | k-means clustering
Nasab et al. [102] | Stream | Developed a framework for privacy preservation that can handle both numerical and categorical data | Adaptive clustering
Kumar et al. [103] | Stream | Developed a method for privacy preservation in IoT scenarios using stream clustering concepts | Stream clustering
Veron et al. [104] | Stream | Ensured privacy of location data in cloud/edge computing settings using a stream clustering concept | Stream clustering
Yang et al. [105] | Stream | Developed an anonymization technique for distributed data in sensor networks for privacy preservation | Similarity-aware clustering
Patil et al. [106] | Stream | Developed a privacy-preserved system for crime-related news identification by mining live data | Supervised clustering
Tekli et al. [107] | Stream | Addressed the correlation problem using a clustering concept in transactional streams with better privacy | (k, l)-clustering
Honda et al. [108] | Images | Developed a practical anonymity approach for crowd movement analysis with privacy guarantees | Fuzzy clustering
Yang et al. [109] | Images | Developed a practical privacy-preservation framework for facial recognition in online services | Eigenface algorithm
Zhang et al. [110] | Video | Developed a framework for anonymizing any class of objects of interest using a clustering approach | Semantic segmentation
Le et al. [111] | Images | Designed a methodology and full system to improve and adjust the privacy-utility trade-off in images | StyleGAN and clustering
Grossel et al. [112] | Video | Developed an anonymization pipeline for effectively preserving privacy in video data | Semantic segmentation
Ren et al. [113] | Images | Developed a complete anonymizer for privacy preservation of people's actions in human image data | Semantic segmentation
Deivanai et al. [114] | Hybrid | Developed a practical method for privacy preservation in a multiparty environment with hybrid data | Records clustering
Bazai et al. [115] | Hybrid | Developed a multi-dimensional anonymization scheme for effective resolution of privacy and utility | Spark clustering

The SOTA CAMs listed in Table 1 have resolved many privacy-related issues in different contexts while guaranteeing utility from the anonymized data. Furthermore, these approaches have been extensively used for privacy preservation in different computing paradigms, such as edge computing, the IoT, cloud computing, and SNs. In Table 2, we present an in-depth analysis of the strengths and weaknesses of each SOTA approach listed in Table 1.
This comprehensive analysis of the approaches in Table 2 can pave the way for understanding existing developments as well as improving them from multiple perspectives. In Table 2, we categorize the nature of each existing study as one of three types: theoretical, practical, or conceptual. Theoretical studies have not been deployed in any real-world scenario, and only limited experiments were conducted to prove their persuasiveness against major privacy threats. In contrast, practical approaches have been deployed in some real-world case, and their evaluation was performed rigorously using real-world benchmark datasets; furthermore, most practical approaches pay attention to both metrics (i.e., privacy and utility). Lastly, conceptual studies present only proofs of concept or ablation analyses, and their efficacy through detailed experiments is yet to be investigated.
The evaluation metrics employed by CAMs can vary based on data style, attack scenario, and target application. We present in Table 3 a generic overview of the metrics employed by SOTA CAMs. Apart from the well-known privacy and utility metrics listed in Table 3, some SOTA CAMs have improved performance time, scalability, resource consumption, and other data-related issues in the PPDP process (noise, imbalance, dimensionality, outliers, etc.) [130]–[134]. Furthermore, some studies also used application-specific metrics to quantify the level of privacy and utility [135]–[137]. In many real-world cases, the privacy measured by one evaluation metric may not be monotonic. Hence, some approaches have suggested employing a suite of metrics, rather than relying on one or two, when evaluating the performance of anonymization methods [138], [139]. With the rapid increase in the diversity of privacy threats, and the ever-changing landscape of attacker capabilities, the development of accurate privacy and utility evaluation metrics has become more urgent than ever. Lastly, we present a quantitative analysis (e.g., average results) of the SOTA studies in each category (tables, graphs, etc.) in terms of privacy preservation and utility enhancement in Figure 20. From the analysis, it can be observed that CAMs can improve privacy and utility results significantly. The higher improvements in utility are due to the adoption of distance/similarity concepts in the clustering process. Through quantitative analysis of each study, we found that the lowest and highest utility improvements were 5% and 90%, respectively. In contrast, the lowest and highest improvements in privacy were 3.1% and 35%, respectively.

TABLE 2. Detailed analysis (i.e., strengths and weaknesses) of the SOTA studies presented in Table 1.

Ref. | Study nature | Strengths | Weaknesses
Zheng et al. [51] | Practical | Exploits the distribution of QIs to enhance data quality using an improved clustering concept | SA disclosure is possible because the method does not consider the diversity of SA values
Kabiru et al. [52] | Practical | Ensures higher data utility for data mining by increasing problem size (i.e., more attributes) | Privacy breaches can be higher due to linkage attacks with auxiliary data (e.g., voter lists)
Khan et al. [53] | Practical | Ensures minimal changes in data anonymization and better data reconstruction | Prone to background-knowledge and other practical attacks (e.g., skewness, table linkage)
Pritam et al. [54] | Conceptual | Better privacy preservation using an enhanced t-closeness concept | Poor utility due to the suppression operation and more diversity in SA values
Zouinina et al. [55] | Theoretical | Constrained cluster-based k-anonymization with significantly reduced hand engineering | Prone to disclosure of personal information and data-reconstruction attacks
Zheng et al. [56] | Practical | Effective resolution of the privacy-utility trade-off in data-publishing scenarios | Fails to provide privacy and utility when data are highly imbalanced (uneven distribution)
Ashkouti et al. [57] | Practical | Higher data utility by using improved ℓ-diversity in big data environments | Prone to skewness attacks and less applicable to highly imbalanced datasets
Abbasi et al. [58] | Practical | A low-cost anonymization method with a 1.5× reduction in IL and a 3.5× reduction in time | Deletes less frequent data items, which may hinder the knowledge discovery process
Onesimu et al. [59] | Practical | Guarantees users' privacy at data collection time in healthcare sectors using clustering | Prone to data reconstruction as well as table/record linkage with data available at auxiliary sources
Yan et al. [60] | Practical | Efficiently anonymizes data in the presence of outliers and lowers IL as well as the clustering effect | Prone to attribute disclosure by not considering the diversity of SA values
Rupali et al. [61] | Practical | Privacy protection of three different elements (nodes, edges, and attributes) of social networks | Degradation of information availability when enforcing strict privacy parameters (e.g., ℓ, t)
Zhang et al. [62] | Practical | Strong privacy protection against neighborhood attacks using a k-anonymity approach on graphs | Fails to provide robust privacy against subgraph attacks as well as graph matching
Shakeel et al. [63] | Practical | Strong privacy protection in mutual-friend attack scenarios | Prone to sensitive information disclosure in attributed social networks
Chen et al. [64] | Practical | Strong privacy protection against contemporary privacy threats in graph data | Feasibility tests were conducted on relatively small graphs (privacy analysis is not stated)
Debasis et al. [65] | Practical | Strong protection against identity disclosure using the k-degree anonymity concept | Poor utility due to the addition of outside edges to fulfil k-degree requirements
Skarkala et al. [66] | Conceptual | Strong privacy guarantees against identity, attribute, and edge-weight disclosure | Feasibility evaluation on relatively small graphs, and only preliminary analysis is given
Langari et al. [67] | Practical | Robust anonymization of graph data using hybrid clustering, satisfying all syntactic methods | Prone to breaches of node-attribute and link privacy by not considering background knowledge
Reza et al. [68] | Practical | Efficient anonymization of graph data by fulfilling (k, ℓ)-anonymity properties | Limited applicability to other graph types (directed, weighted, attributed, etc.)
Kumar et al. [69] | Practical | Anonymizes graphs with better utility for social network analysis and graph mining tasks | Very high computational complexity, and privacy issues via graph linkage
Heidari et al. [70] | Practical | Strong privacy guarantees in graph data anonymity using k-edge-connected subgraph clustering | Prone to identity and attribute disclosure in attributed social networks
Wang et al. [71] | Conceptual | Better privacy guarantees in publishing personal data using the ρ-uncertainty model | Excessive disclosure of sensitive transactions by not ensuring sufficient diversity in SA values
Shyue et al. [72] | Practical | Strong privacy protection in transactional data using sensitive k-anonymity with tuple deletion/addition | Subject to deletion of important data items, which can hinder data analytics and mining
Divanis et al. [73] | Practical | Unified framework that satisfies multiple privacy requirements and incurs less IL | Less applicable to heterogeneous data types, and sensitive itemset disclosure
Awad et al. [74] | Theoretical | Provides higher utility for certain itemsets in transactional data using ant-based clustering | Vulnerability analysis of selected itemsets is not provided, which may expose individual/group privacy
Awad et al. [75] | Practical | Creates a neighbor dataset for knowledge discovery/extraction purposes using utility rules | Vulnerability analysis of selected itemsets is ignored, which may impact individual/group privacy
Barakat et al. [76] | Practical | Executed a privacy attack on the k^m-anonymity model that can explicitly expose some users' privacy | Utility analysis and formal proof of the privacy breach on large-scale datasets are not provided
Wang et al. [77] | Practical | Sufficient protection for a group of people who have distinct privacy-related preferences | Prone to lower utility on special-purpose metrics (e.g., accuracy, precision, recall, F1 score)
Can et al. [78] | Practical | Ensures protection based on users' distinct privacy-related preferences to control anonymity | Can lead to higher information loss if data are imbalanced and the values of most QIs are close
Meisam et al. [79] | Practical | Preserves both privacy and utility by creating k views of the trace data | Can lead to higher computing cost on large datasets; utility can be poor when data are skewed
Fan et al. [80] | Practical | Effectively preserves the privacy of network-flow data by creating synthetic data using GANs | Can lead to higher utility loss when the offset between original and synthetic data is high
Meisam et al. [81] | Practical | Preserves privacy of important fields in trace data using pseudonyms and a multi-view approach | Higher computing complexity from creating multiple views of data, and prone to linking attacks
Aleroud et al. [82] | Practical | A DP-based prototype to address the privacy-utility trade-off in network trace data | Subject to personal information disclosure in the presence of auxiliary information
Ahmed et al. [83] | Practical | Strong privacy protection of critical fields in network log data using a condensation-based approach | Prone to low utility on special-purpose data mining metrics (e.g., accuracy, F1)
Velarde et al. [84] | Practical | Practical solution for anonymizing traffic trace data with better privacy using entropy approaches | Yields poor utility when most data belong to distinct regions and the number of fields is large
Shaham et al. [85] | Practical | Strong privacy protection in location data sharing; applicable to medical records and web analysis | Less resilient against knowledge-graph-powered attacks as well as linking attacks using auxiliary data
Mahajan et al. [86] | Conceptual | Enables keyword searches on encrypted data with better privacy using a k-means clustering approach | Does not provide provable utility in terms of information loss, accuracy, and F1 score on diverse data
Li et al. [87] | Practical | A low-cost solution for extracting and anonymizing sensitive data items from documents | Lacks validation and testing on real-world (i.e., PHI) and large-scale medical documents
Li et al. [88] | Practical | A practical solution for identifying, summarizing, and generating reports from health data | Prone to repeated-query attacks using the same noise for some queries, which can reveal true values
Kong et al. [89] | Practical | Privacy preservation of documents and multiple data items, including features, metadata, and text | Evaluation was conducted on static data; true-value disclosure remains possible
Garat et al. [90] | Practical | A corpus-based method for privacy preservation of court documents and the sensitive data items in them | Requires a very large number of documents (e.g., up to 80K) for good performance; complexity is high
Li et al. [91] | Practical | Robust privacy protection of medical data by concealing potentially identifying health data items | Poor utility when the original data are scattered and values are highly dissimilar
Lima et al. [92] | Practical | A robust three-step privacy-preserving solution for document data, applicable to medical data | No additional support for languages other than clinical text written in Spanish
Liu et al. [93] | Practical | A practical approach for bag-valued data with better data utility using a semantic similarity concept | Prone to identity, SA, and membership disclosure by not identifying vulnerable data items
Liu et al. [94] | Practical | Can anonymize personal data in which some field values are missing, using clustering concepts | Can lead to identity and membership disclosure when the adversary has known data
Ghemri et al. [95] | Conceptual | Ensures analytics results/statistics remain the same whether computed from original or anonymized data | Formal analysis and validation were performed in limited aspects; few records were used in tests
Chen et al. [96] | Practical | Robust and efficient solution for anonymizing query log data with better utility and privacy | Prone to identity and itemset disclosure when a large number of queries map to the same user
Garcia et al. [97] | Practical | Strong privacy preservation of personal data using dependent and independent attribute information | Limited applicability to categorical data, and disclosure of multivariate statistics during data analytics
Yuvaraj et al. [98] | Practical | Strong privacy preservation of individuals using both anonymization and cryptography | Higher computing cost, and limited evaluation against major privacy risks (e.g., data reconstruction)
Meng et al. [99] | Practical | Restricts personal information disclosure without sacrificing data utility in web search data | Limited data analysis, and prone to higher privacy leakage without sensitivity analysis
Pamies et al. [100] | Practical | Privacy preservation of query logs by anonymizing sensitive data items using dynamic analysis | Efficacy in real-time environments has not been tested; computing complexity is much higher
Ullah et al. [101] | Practical | A practical solution for anonymizing search queries to protect users' privacy in search engines | Limited scalability tests; the method's efficacy was tested with only 1,000 users
Nasab et al. [102] | Practical | Ensures better privacy in anonymizing real-time IoT data of two types (categorical and numerical) | SA disclosure is possible with high probability by not considering the diversity of SA values
Kumar et al. [103] | Conceptual | Strong preservation of users' privacy in transactional data mining via a sliding-window concept | Less applicable to diverse datasets, and prone to linkage attacks in the presence of auxiliary data
Veron et al. [104] | Practical | Better privacy protection of location data using three technologies in cloud computing environments | May lead to poor data utility when the obtained data are noisy and contain skewed values for some QIs
Yang et al. [105] | Conceptual | Better resolution of the privacy-utility trade-off in data stemming from IoT environments | Anonymized data may hinder knowledge discovery as well as new hypothesis generation
Patil et al. [106] | Practical | Privacy preservation of users in mining crime-related news data streams using the k-anonymity concept | Yields constrained analysis of data, and prone to identity disclosure via background knowledge
Tekli et al. [107] | Practical | Strong privacy protection when the original data contain more than one record per individual | Poor data utility when enforcing strict privacy parameters (e.g., k, ℓ) during anonymization
Honda et al. [108] | Conceptual | A strong privacy-preservation method for facial image data using the k-anonymity concept | Validation was limited to only a few tests, which may hinder the solution's progress in complex cases
Yang et al. [109] | Conceptual | Strong privacy protection of facial images in real-time scenarios over encrypted outsourced datasets | Prone to individual and group privacy disclosures by linking extracted features with open data
Zhang et al. [110] | Practical | Ensures privacy guarantees in video data using a blurring algorithm that hides salient parts | May lead to privacy disclosure if the background contains sensitive information about individuals
Le et al. [111] | Practical | Robust and practical solution for adjusting the degree of privacy-utility while anonymizing image data | May yield inconsistent results on data partially anonymized due to policies or regulations
Grossel et al. [112] | Practical | An intelligent mechanism to selectively anonymize parts of image data for privacy protection | Less applicable to medical environments due to very high diversity in data styles and templates
Ren et al. [113] | Practical | Strong privacy preservation of facial attributes without sacrificing action detection accuracy | Prone to higher utility loss, and identity and habit disclosure via linkage attacks with auxiliary data
Deivanai et al. [114] | Conceptual | Better privacy protection by classifying data and selective anonymization | Extensive comparisons with existing methods are missing; formal aspects are weak
Bazai et al. [115] | Practical | Efficient implementation of the Mondrian algorithm on the Spark framework | Significantly high computing complexity, and prone to diversity-based attacks

Category | Famous Evaluation Metrics | Rep. Studies
Privacy | Disclosure risk, probabilistic disclosure for SA/identity, record linkage, table linkage, privacy-sensitive (PS) rules protection, inference preservation, thwarting prediction against SA/attributes, community privacy preservation, statistical disclosure control, distribution disclosure, trajectory info hiding, vertex and edge disclosure protection, entropy leakage, association rule hiding, content hiding, and privacy preservation from active attacks, etc. | [21], [125]–[129]
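To illustrate how utility loss is often quantified for numeric attributes, here is a simplified sketch in the spirit of the normalized certainty penalty (NCP): the average width of the generalized intervals, normalized by the attribute's full domain range (0 means no loss). The exact formula varies across the surveyed studies; this form is an assumption for illustration.

```python
def ncp(original, generalized_ranges):
    """Simplified NCP-style information loss for one numeric attribute:
    mean generalized-interval width divided by the attribute's domain range."""
    domain = max(original) - min(original)
    widths = [hi - lo for lo, hi in generalized_ranges]
    return sum(widths) / (domain * len(widths))

ages = [21, 22, 23, 45, 46, 47]
ranges = [(21, 23)] * 3 + [(45, 47)] * 3
print(round(ncp(ages, ranges), 3))  # domain 26, mean width 2 -> ~0.077
```

A clustering-based grouping such as the one above scores far lower (better) than generalizing all six ages to a single [21-47] interval, which would score 1.0; this is the quantitative basis for the utility improvements reported in Figure 20.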

V. SIGNIFICANCE OF CAMS IN DIFFERENT COMPUTING PARADIGMS
In this section, we emphasize the significance of CAMs in multiple computing paradigms with regard to privacy preservation in different contexts. For example, in SN, CAMs not only help with privacy preservation in data publishing, but they are also used to support multiple applications involving personal data (e.g., community detection/clustering, information diffusion, privacy-aware graph computation, and sensitive topic diffusion, to name a few). Hence, it is vital to provide thorough perspectives on CAMs in different emerging computing paradigms along with recent SOTA approaches. We demonstrate an overview of the five emerging computing paradigms along with concise details of data sources in Figure 21. In Table 4, we describe the significance of CAMs for practical applications/services in each computing paradigm shown in Figure 21.
As shown in Table 4, CAMs have played a vital role in multiple emerging computing paradigms in different contexts. Many sectors benefit from CAMs, including healthcare, SN service providers, recommender systems, third-party apps, data mining infrastructures, intelligent services, multi-party computation, policymakers, researchers, and cloud-based services. In the coming years, CAMs can play a vital role in preserving the privacy of AI-based systems, such as federated learning, swarm learning, and federated analytics; the synergy of CAMs with these emerging technologies can protect AI-based systems and the associated data (e.g., models, parameters, and underlying data). Furthermore, CAMs have been increasingly applied to heterogeneous data formats for privacy preservation and utility enhancement. Apart from the strengths and weaknesses, we present a quantitative analysis (e.g., average results) of the SOTA studies in each computing paradigm in terms of privacy preservation and utility enhancement in Figure 22. From the analysis, it can be observed that CAMs can improve both privacy and utility results significantly. Through quantitative analysis of each study, we found that the lowest and highest utility improvements were 5.3% and 96%, respectively. In contrast, the lowest and highest improvements in privacy were 2.1% and 75%, respectively.

VI. DARK SIDE OF CLUSTERING-BASED ANONYMIZATION MECHANISMS
Although CAMs have demonstrated more effectiveness in preserving privacy and utility than traditional anonymization approaches, they can also be used to jeopardize individual privacy. For example, plenty of methods have been proposed based on clustering concepts that can either reidentify people or assist in inferring private information from anonymized graphs/tables. Analysis of the dark side of CAMs can pave the way to securing personal data against prying eyes in a more practical way. Figure 23 presents a generic overview of de-anonymization of published data and inferring SAs from that data. Zhang et al. [200] described an identity-revelation method based on attributes from the anonymized G. The authors achieved accuracy of up to 80% in de-anonymization of identities. Similarly, some de-anonymizing approaches have used the community-clustering concept to group users, and users' de-anonymization was subsequently successful [201]. Some approaches have jointly used clustering and attribute  Table 4.

Computing Paradigm Rep. Studies Strengths Weaknesses Study nature
Logeswari et al. [140] Efficient clustering and enable collaborate learning of medical data with privacy preservation Evaluation was carried out using only 2K records and formal aspects are not discussed Conceptual Usha et al. [141] Effective solution towards privacy-utility trade-off in big data environment using parallelism Prone to data leakage by combining multiple SA from heterogenous data and high complexity Practical Zhang et al. [142] Highly efficient and scale solution towards big data anonymization using t-ancestor clustering Weaker assumptions towards data availability at external sources which can lead to disclosures Practical Nayahi et al. [143] Strong resistance to major privacy threats, and flexible tailoring to any domain and datasets Utility of anonymized data can be poor due to the higher noise addition in some cases Practical Singh et al. [144] Rigorous solution towards maintaining the confidentiality of the data in cloud environments Missing discussion about classifying about what constitutes as sensitive data in the cloud Theoretical Cloud Computing Lekshmy et al. [145] Cloud-simulator based implementation to preserve users privacy in data mining tasks Pre-mature convergence when data is skewed, and high difficulty in selecting parameters' values Practical Jayaraman et al. [146] Robust solution for maintaining the integrity of confidential data in cloud environments The computing complexity of the method used to identify likely security incidents is very high Theoretical Madan et al. [147] Optimization of the utility-privacy trade-off using fitness function of dragonfly algorithm Prone to identity and SA disclosure when data is imbalanced, and values range is small Practical Madan et al. [148] Better utility of anonymized data under many constant anonymization parameters in the cloud Prone to skewedness, homogeneity, and similarity attacks by not using the SA's values diversity Practical Shanmuga et al. 
[149] Effective protection of medical big data by jointly using clustering and access control Less reliable query results for data mining tasks, and low accuracy on special purpose metrics Practical Abul et al. [150] Strong privacy preservation in moving databases using k-anonymity and co-localization Insufficient to resist location disclosure of a user group, and prone to hidden profiling of users Practical Fei et al. [151] Low-cost schema to be used in client-server settings for privacy protection of trajectory data Poor utility and misleading analytics results in most cases by adding dummy location data Practical Lee et al. [152] An efficient solution for lowering identity and SA disclosures in location data using clustering Lack of assumption regarding the auxiliary data availability, and poor data utility Practical Niu et al. [153] Robust answers to dynamic queries without compromising user's privacy using tags/sequences In some cases, the data can be rendered useless due to higher noise addition (i.e., ϵ is low) Practical Yao et al. [154] Personalized privacy preservation using k-anonymity based clustering (CK) in location services Computing complexity is high, data utility can be low when # of preferences are large Practical Location-based Services Lin et al. [155] Strong privacy preservation by decoupling the requested contents and location position Inability to address group privacy issues as well as data reconstruction attack at server-side Practical Zhang et al. [156] Strong privacy preservation of queries data using semantic information and cloaking concept Prone to intent and SA disclosure by using the combine information of data and location Conceptual Altuwaiyan et al. [157] Strong privacy preservation of location data, request contents, and user's positions Limited experimental evaluation was performed, and comments sensitivity analysis is not given Conceptual Mahdavifar et al. 
[158] Personalized privacy protection of location data considering moving objects requirements Prone to identity and SA disclosure when certain users specify lower privacy preferences Practical Dritsas et al. [159] Strong privacy protection of mobile users data by defining a new metric (i.e., vulnerability) Prone to identity and SA disclosures when cluster cannot meet the diversity requirements Practical Chen et al. [160] Sufficient protection of basic and sensitive data items against background knowledge attacks Prone to record, SA, and table linkage due to the existence of auxiliary data Practical Ros et al. [161] Generic solution towards SN data mining with privacy protection for recommendation purposes Prone to linkage attack, and limited applicability to other graph's types (directed, weighted, etc.) Practical GU et al. [162] Strong resistance against background knowledge attacks and limits SA/identity disclosure Prone to higher utility loss by introducing heavier changes in the structure of graph Practical Truta et al. [163] Fostering social mining and graph analytics by making less changes in the graph structure Prone to community privacy disclosure, and SA in the case of attributed SN Practical Campan et al. [164] Ensure strong protection against community privacy disclosure, and control heavier changes Prone to individual's privacy disclosure when # of users in each community are large Practical Chen et al. [165] Personalized privacy preservation of users in the community by link perturbation techniques Decrease the reliability of extracted information from graph due to noises in the form of links Practical Ghosh et al. [166] Strong defense against identity, SA, and membership attacks by encrypting graph structure Prone to higher complexity when graph size is large, and induce heavier changes in graph Practical Social Networks Yu et al. 
[167] Provides strong protection against linkage attack and preserves the privacy of sensitive edges Can lead to node privacy breaches as well as the SA in attributed social network graph Practical Gazalian et al. [168] Protection against multiple threats such as Identity, SA, membership, and graph linkage Prone to poor utility in SN mining and analysis, structural modifications are very high Practical Zhang et al. [169] Strong protection in 1-neighbourhood attacks in privacy-preserving graph data publishing Heavier changes in the anonymized graph in order to meet the constraints values Practical Liu et al. [170] Strong defense against identity disclosure problem using k-possible anonymity concept Prone to SA and membership attacks by not ignoring the diversity of sensitive information Practical Siddula et al. [171] Strong privacy protection of the whole network against linkage attack from auxiliary data Can lead to the disclosure of identity, SA, and membership by not considering user-level privacy Practical Sai et al. [172] Provides customized setting for better control of privacy preservation in location data of SNs Prone to a # of attacks for the users who set loose privacy settings or unaware about privacy Practical Gao et al. [173] Ensures privacy protection of nodes and edges by adding exponential noises in the data Yields lower utility when graph size is relatively small and prone to data reconstruction Practical Yuan et al. [174] Ensures strong protection against attributes inference attacks and is applicable to diverse graphs Largely destroys the structure of the graph by enforcing the strict parameters (i.e., k and ℓ) Practical Zhao et al. [175] Strong privacy protection of users data, parameter, and communication in dynamic scenarios Prone to data derivation, prediction, and re-construction attacks when the adversary is in system Practical Liu et al. 
[176] Strong privacy protection in collaborative machine learning without data dissemination Prone to parameters/data reconstruction attacks and wrong utility analysis during analytics Practical Badra et al. [177] Strong protection of user's billing information in energy sector using encryption Can lead to higher computing complexity and do not assist in searching from encrypted data Conceptual Wang et al. [178] Effective solutions towards privacy preservation in heterogeneous data using federated k-means Prone to privacy breaches when data is relatively small and belong to an identical sector/domain Practical Ghahramani et al. [179] Strong solution towards hiding privacy-sensitive patterns in insurance company data User-and group level privacy breaches cannot be effectively prevented from the segmented area Practical Mohammed et al. [180] Strong privacy preservation of users from high dimensional social networks data Prone to the disclosure of group privacy as well meta data of graph that can be used to infer SA practical AI-based services Kumar et al. [181] Privacy protection of the individual as well group metadata and original data across platforms Less applicability to diverse data types and prone to identity, SA, and membership disclosures Practical Stallmann et al. [182] Strong privacy protection of local data via clustering in federated learning scenarios Can lead to the disclosure of local data or gradients by not adding noise during transfer Practical Rajesh et al. [183] Ensures privacy protection in association rule mining cases via perturbation-based approach Less resilience against SA reconstruction, derivation, and prediction in medical environments Practical Virupaksha et al. [184] Overcome the issues of invalid and ineffective data mining results by adding less noise to data Prone to privacy breaches when the adversary has auxiliary data or background knowledge Practical Bollaa et al. 
[185] Privacy preservation of SA disclosure by identifying and generalizing sensitive data in clusters Can lead to infeasible query results as well as inaccurate data-mining/analytics results Practical Khan et al. [186] Reduction in client-side privacy breaches in federated learning by not centralizing local data Fails to provide a strong defense against the data poisoning as well reconstruction attacks Practical Virupaksha et al. [187] Enhances the quality of anonymized data by adding noise along each dimension of the data Prone to explicit disclosures of user's identity and SA by not changing position of data items Practical Guo et al. [188] Strong privacy protection at user-level as well as cluster centers-level by using encryption Failed to provide resilience against probabilistic, skewness, similarity, and homogeneity attacks Practical Zhu et al. [189] Higher accuracy in reducing multiple attacks stemming from network traffic using k means Failed to address group-privacy issues, and complexity is high when data is high dimensional Practical Almusallam et al. [190] Detailed discussion of privacy attacks originating in smart healthcare (IoT & edge healthcare) Limited discussion about the real-time data processing and corresponding privacy attacks Theoretical Huang et al. [191] Robust privacy guarantees in small and large-scale interval data stemming from IoT devices Higher computing complexity, limited tests, and less defence against hidden profiling attacks Practical Elhoseny et al. [192] Effective privacy preservation of real-time data transmitting between IoT devices to gateway Prone to identity, or SA disclosure by not classifying sensitive/non-sensitive data before sending Practical Internet of Things Kumar et al. [193] Strong privacy preservation of IoT sensor nodes data without disclosing semantic information Communication cost is very high and privacy issues can occur when values range is small Practical Shuja et al. 
[194] Strong users data privacy protection by not computing similarity/distance among all data points Fails to handle complex and high dimensional data such as temporal sequences or locality traces Practical Otgonbayar et al. [195] Enable anonymization of dynamic, incomplete, and high dimensional data without privacy loss Some statistics such as vulnerability/utility-levels of data items cannot be computed in real-time Practical Patil et al. [196] Minimizes privacy violations that can stem from the personal data originating from smart homes Damage the utility of less sensitive data by not using AI methods to classify the data nature practical Ullah et al. [197] Fosters data re-usability by sharing it at a large scale without compromising user's privacy Prone to identity/SA disclosure when a higher amount of data is available at auxiliary sources Practical Li et al. [198] Better privacy preservation by solving data island problem by sharing location data of vehicles Less adoption in real-world cases by not considering the data diversity and heterogeneous styles Practical Liu et al. [199] Strong protection against SA inference attacks by incentivizing users to form cohesive clusters Prone to explicit disclosures of identity/SA when # of users in each cluster are significantly small Practical information to correctly identify individuals from privacypreserved published graphs [202]. Shao et al. [203] proposed a robust de-anonymization method based on structural information from a published graph. Figure 24 illustrates an overview of SN data de-anonymization. In figures 24 (b) and (c), users' location information can be inferred by linking anonymized and crawled networks, respectively. Similarly, clustering concepts are employed to group similar/dissimilar people in order to infer their private information by employing background knowledge or auxiliary graphs.
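To make the linkage idea behind Figure 24 concrete, the toy sketch below (with made-up graphs and node names, not data from any cited study) matches pseudonymized nodes of a published graph against an auxiliary crawled graph using a simple structural signature: a node's own degree plus the sorted degrees of its neighbors. Nodes whose signatures are unique in both graphs can be re-identified even though all labels were replaced.

```python
from collections import Counter

def signature(adj, node):
    """Structural signature: own degree plus the sorted degrees of neighbors."""
    return (len(adj[node]), tuple(sorted(len(adj[n]) for n in adj[node])))

def unique_sigs(adj):
    """Map each signature that occurs exactly once to its node."""
    sigs = {n: signature(adj, n) for n in adj}
    counts = Counter(sigs.values())
    return {s: n for n, s in sigs.items() if counts[s] == 1}

# Auxiliary graph crawled from a public SN (names are illustrative).
aux = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice"},
    "carol": {"alice"},
    "dave":  {"alice", "eve"},
    "eve":   {"dave"},
}
# Published anonymized graph: identities pseudonymized, edges intact.
anon = {
    "u1": {"u2", "u3", "u4"},
    "u2": {"u1"},
    "u3": {"u1"},
    "u4": {"u1", "u5"},
    "u5": {"u4"},
}

aux_by_sig = unique_sigs(aux)
matches = {pseudo: aux_by_sig[sig]
           for sig, pseudo in unique_sigs(anon).items()
           if sig in aux_by_sig}
print(matches)  # -> {'u1': 'alice', 'u4': 'dave', 'u5': 'eve'}
```

Note that the structurally interchangeable nodes (u2/u3, i.e., bob/carol) stay ambiguous; real attacks such as [203] use richer signatures and seed propagation to resolve exactly such ties.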
Due to the rapid developments in SN services, user de-anonymization within an SN site and across SN sites has become a very active research topic. In line with these trends, we summarize the contributions of clustering-based de-anonymization methods in different computing environments, along with their data types, in Table 6. The analysis presented in Table 6 provides another perspective on CAMs (i.e., de-anonymization of users and their corresponding personal information) that has remained unexplored in the recent literature. By understanding the dynamics of such research from the attackers' perspectives, more secure and resilient anonymity methods can be developed to preserve users' privacy. Furthermore, these kinds of analyses provide a better overview of the research gaps to aid researchers who are working on the defense side.
In Table 6, we compare various methods based on four parameters (i.e., the data/items exploited in de-anonymization, the objectives achieved in compromising user privacy, the clustering concepts employed, and the target applications/services). The last column of the table lists pertinent studies from which detailed contents can be gathered for an in-depth investigation of each method. Key limitations (i.e., weaknesses) of each study listed in Table 6 are discussed in Table 7. This extended knowledge demonstrates that CAMs can be used to infer the identity/SAs of users with significantly high percentages using various kinds of data available at external sources (i.e., online repositories, social networks, web searches, internet traffic logs, etc.). However, the use of a strong privacy mechanism and low availability of auxiliary data can restrict the re-identification rate. Apart from the strengths and weaknesses, we present in Figure 25 a quantitative analysis (e.g., average re-identification rates) of the SOTA studies included in Tables 6 and 7. From the analysis, it can be observed that de-anonymization approaches can significantly impact the privacy of users; the lowest and highest re-identification rates were 10% and 99.6%, respectively.

Table 7. Limitations of de-anonymization approaches listed in Table 6.

SOTA study | Key limitation(s)
Gambs et al. [204] | Poor performance on highly skewed (or non-i.i.d.) training data
Chiasserini et al. [205] | De-anonymization rate drops when the number of users in each cluster increases
Chiasserini et al. [206] | Less applicable to other graph types (i.e., weighted, attributed, etc.)
Chiasserini et al. [207] | Yields poor performance when total variations in graph structures are high
Fu et al. [208] | Poor convergence when node degree or variation in attribute values is high
Fu et al. [209] | Optimal mapping conditions cannot be met in non-overlapping community cases
Francia et al. [210] | De-anonymization rate drops when anonymization is performed with a DP model
Orekondy et al. [211] | Performance can be severely impacted if no external data are available
Chen et al. [212] | Yields poor performance in the presence of outliers/misaligned feature spaces
Murakami et al. [213] | Uses many variants (≃ 20) of a user's locations to perform de-anonymization
Li et al. [214] | De-anonymization rate drops significantly when no identical graphs are available
Zhen et al. [215] | Extensive comparisons are required to infer SAs when most users are similar
Wang et al. [216] | Relies heavily on exogenous records to perform de-anonymization of data
Zhang et al. [217] | Wastes computing time when the same users are not present in both graphs
Chen et al. [218] | Algorithmic complexity is high, and increases dramatically with graph size
Ma et al. [219] | Many operations are performed to infer SAs, and the solution is not generic
Nilizadeh et al. [220] | Poor performance in the case of overlapping communities or sparse graphs
Takbiri et al. [221] | Lacks numerical tests and uses multiple assumptions to perform SA inference
Shirani et al. [222] | The number of queries can increase significantly when the number of users is very large
Aliakbari et al. [223] | Poor results in terms of matching and computing time when seeds are erroneous
Xueshuo et al. [224] | Significant effort is needed to convert complex data into structured data
Li et al. [225] | Poor performance when most profile attributes are not visible externally
Wang et al. [226] | Matching rate drops significantly in the presence of noise and mismatches
Tu et al. [227] | Prone to fewer matches when correlation among aggregated data is low
Yang et al. [228] | Depends heavily on auxiliary data to perform user de-anonymization
Miculan et al. [229] | Prone to poor performance when users perform searches in dissimilar ways
Nardin et al. [230] | The number of successful matches decreases when most data are encrypted
Naini et al. [231] | Poor performance when data are overly anonymized with a strong anonymity model
Tian et al. [232] | Prone to fewer matches when the graph structure is not aligned with the external graph
Iwata et al. [233] | Yields poor performance in node-level (single-user) de-anonymization from a graph
Cecaj et al. [234] | Prone to poor performance when dummy records are present across datasets
Cecaj et al. [235] | Yields infeasible results when most data cannot be collected due to regulation
Hirschprung et al. [236] | Requires extensive matching and analysis to infer a user's identity/SA
Linoy et al. [237] | Yields infeasible results when diversity among users in terms of attributes is high
Sharad et al. [238] | The approach is not applicable to other similar datasets
Acquisti et al. [239] | Cannot guarantee consistent results when images have poor visibility or are tilted
Huang et al. [240] | Requires a significantly large amount of data in order to compromise privacy
Chen et al. [241] | Less applicable to other types of social graphs, and high time complexity
Ong et al. [242] | Requires a substantial number of records to be present in the Google repository
Lin et al. [243] | Poor performance when most data items are hidden based on privacy policies
Castro et al. [244] | Yields poor results when diversity among features is high (i.e., low similarity)
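As a hedged illustration of how auxiliary data drive the re-identification rates discussed above, the sketch below links a small made-up k-anonymized table to an attacker's auxiliary table on quasi-identifiers and counts the records that match exactly one individual; all names, zip codes, and ages are invented.

```python
# Toy linkage attack on a generalized table (all values are hypothetical).

def zip_covers(gen_zip, zip_code):
    """A wildcarded zip such as '477**' covers any zip with that prefix."""
    return zip_code.startswith(gen_zip.rstrip("*"))

def age_covers(gen_age, age):
    """An age range such as '25-34' covers any age inside it."""
    lo, hi = map(int, gen_age.split("-"))
    return lo <= age <= hi

# Attacker's auxiliary table: (name, zip, age).
aux = [("alice", "47701", 34), ("bob", "47702", 28),
       ("carol", "47901", 35), ("dave", "47902", 29)]

# Published anonymized table: (generalized zip, age range, SA).
anon = [("477**", "25-34", "flu"),
        ("477**", "25-34", "asthma"),
        ("479**", "25-29", "diabetes"),
        ("479**", "35-39", "cancer")]

reidentified = 0
for gz, ga, sa in anon:
    candidates = [name for name, z, a in aux
                  if zip_covers(gz, z) and age_covers(ga, a)]
    if len(candidates) == 1:   # unique match: identity and SA both exposed
        reidentified += 1

rate = reidentified / len(anon)
print(f"re-identification rate: {rate:.0%}")  # prints "re-identification rate: 50%"
```

The first two records share an equivalence class of size two and resist linkage, while the last two match a single auxiliary individual each, which is exactly why overly fine-grained generalization intervals undermine anonymity.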

VII. CHALLENGES OF CAMS AND FUTURE RESEARCH DIRECTIONS
In this section, we highlight the recent technical challenges of CAMs regarding the preservation of users' privacy, and we suggest promising avenues for future research, taking into account emerging computing systems.

A. OPEN CHALLENGES IN PRESERVING USER PRIVACY LEVERAGING CAMS
Due to the rapid increase in the use and adoption of digital solutions, privacy protection has become more challenging. Owing to pervasive technology developments, many users are deeply concerned about privacy and the responsible use of their personal information. Since sensitive data of all kinds about an individual's daily activities and schedules can now be easily collected, there is a risk of intimate details being disclosed. The rate of personal data collection is increasing at a significantly rapid pace, and the scale and number of privacy breaches are likely to increase in the coming years. Hence, there is an emerging need to upgrade existing defense mechanisms and to propose new, sophisticated privacy-enhancing technologies. We summarize below fourteen unique technical challenges of CAMs in protecting user privacy at present.
• Quantifying the impacts of a user's attributes on privacy and utility: Most CAMs give equal weight to all attributes in data from a privacy and utility point of view. However, recent research has shown that each item within an attribute has a distinct impact on privacy and utility [245]. For example, a zip code allows locating someone more accurately than race and/or gender. Similarly, gender is more appropriate than age for making credit-related decisions. Hence, quantifying the impacts of a user's attributes, and ensuring protection based on such statistics in CAMs, is challenging.
• Hidden disclosure of group privacy: With the advent of big data, a new threat to information privacy has emerged, named group privacy [246]. Most existing CAMs provide strong resilience against privacy threats concerning individual privacy. However, they are prone to hidden disclosure of group privacy. For example, clustering based on k-anonymity concepts can preserve the privacy of one person in a group of k users, but it can inevitably hurt group privacy. Hence, controlling group privacy issues while preserving individual privacy when leveraging CAMs is very challenging.
• Anonymization of imbalanced data: Generally, most anonymization methods, including CAMs, work well on balanced data (where the distribution of most attribute values is uniform). However, due to the rapid developments in AI (e.g., federated learning) and the enforcement of legal measures, diverse values regarding individuals cannot always be collected, leading to imbalanced datasets. In these datasets, the distribution of most attribute values is not uniform, and anonymization can be highly complex [247], [248]. In such circumstances, preserving privacy while sustaining high utility from data anonymized using CAMs is very challenging.
• Applicability to heterogeneous types of data: Most CAMs were designed for specific scenarios/applications, and extension to diverse types of data is not straightforward. For example, CAMs proposed for a single SA cannot be directly applied to multiple-SA scenarios. Similarly, CAMs proposed for tables cannot be straightforwardly applied to directed graphs. Hence, making each CAM efficient and applicable to diverse data formats is very challenging.
• Effective resolution of the privacy-equity trade-off: In the recent past, utility and privacy were regarded as two conflicting goals: optimizing for utility can degrade privacy, and vice versa. A lot of research has been conducted to resolve this universal trade-off [249]-[251]. Recently, due to significant advancements in AI techniques, a new trade-off, named privacy-equity, has emerged that can lead to biased and inaccurate decision making about some minority groups [252]. Solving the privacy-equity trade-off with CAMs is very challenging.
• Tailoring the objective function of clustering to privacy and utility expectations/goals: In most cases, the objective function of CAMs focuses on grouping similar data items in order to lessen heavy changes in the anonymized data. By doing so, only one metric (e.g., utility) can be improved, and privacy issues such as identity and SA disclosures inevitably occur [56]. Making the objective function aware of both utility and privacy goals/expectations is very challenging.
• Reducing the computational complexity of CAMs: Generally, the clustering process encompasses multiple iterations and many hyperparameters, leading to high computing complexity when processing high-dimensional datasets. In anonymization, the clustering process usually adopts anonymity requirements as well (i.e., k users in a cluster/class); hence, computational complexity increases drastically [253]. Although some efforts have been devoted to lowering the computing complexity of CAMs [254], [255], reducing the computing burdens of CAMs on high-dimensional and large datasets is still very challenging.
• Ensuring sufficient resilience against AI-powered attacks: In recent years, due to the proliferation of AI-based systems, privacy breaches have increased significantly because traditional anonymization mechanisms cannot ensure sufficient resilience against AI-powered attacks [256]-[259]. AI-powered attacks can be launched to disclose identities, SAs, and memberships from large and complex datasets with the help of hyperparameter tuning [260]. Hence, there is an emerging need to integrate AI concepts into anonymization approaches for effective resolution of privacy and utility. However, integrating AI concepts into CAMs to safeguard the privacy of individuals from multiple perspectives is very challenging.
• Adaptation of CAMs to more anonymization principles: In the published literature, most CAMs have created synergy with the k-anonymity concept in order to preserve user privacy in different settings [261]. However, the k-anonymity concept is relatively weak at resisting many contemporary privacy threats. Therefore, establishing synergy between CAMs and more anonymization principles (e.g., ϵ-DP) has become more urgent than ever, but it is challenging due to the many differences in algorithm designs. Furthermore, guaranteeing the construct validity of these synergies is challenging due to the high variation in personal data formats across domains/applications.
• Consistent performance in the presence of outliers: The presence of outliers (out-of-range values) in the data can significantly increase the complexity of the anonymization process, and the resulting anonymized dataset can yield poor utility. Most traditional algorithms, such as k-anonymity, ℓ-diversity, and t-closeness, cannot guarantee consistent performance when the original data encompass outliers [262]. Furthermore, CAM performance on data that contain outliers can be degraded, and convergence cannot be achieved in a reasonable time. Recently, some CAMs have been proposed to efficiently detect outliers and minimize their impact on the clustering process [263], [264]. However, devising low-cost CAMs that perform well on data with outliers is still very challenging and requires further work from the research community.
• Heterogeneous source data anonymization using CAMs: In some real-world computing environments (e.g., the IoT, IoMT, and IIoT), a huge amount of data is collected from heterogeneous sources for analytical purposes. These data play a vital role in pattern extraction, leading to effective and accurate decision making, and anonymization mechanisms based on clustering concepts are paramount in such environments in order to alleviate privacy concerns [141]. Recently, some parallel clustering algorithms have been devised to address data diversity and heterogeneity issues during anonymization [265]-[267]. Nevertheless, applying CAMs to data originating from heterogeneous sources is challenging due to the huge diversity in data formats and the correlations between tuples. In recent years, anonymization of personal data originating from different devices in the form of distributed streams has become a popular research topic [268], [269]. However, applying CAMs to such data is challenging due to temporal differences in the stream order.
• Privacy preservation of AI-based systems/infrastructures through CAMs: In recent years, there has been an increasing focus on privacy preservation of AI-based systems such as federated learning, deep learning, and centralized machine learning [270]-[274]. These systems have become the target of malevolent adversaries and require privacy preservation of the model's parameters, workflow, and underlying data. The DP approach has been extensively investigated for preserving the privacy of AI-based systems/infrastructures [275]-[279]. However, applying CAMs to preserve the privacy of AI-based systems/infrastructures is challenging due to the fundamental differences in workflows and data types.
• Adaptive configuration of clustering and privacy parameters in CAMs: Most CAMs developed for privacy preservation require configuration of clustering parameters (e.g., the number of clusters, the number of iterations, and the optimizing strategy) as well as anonymization parameters (the number of users in a cluster, the similarity/dissimilarity threshold, the value ranges, etc.). These have a significant impact on privacy preservation and utility enhancement, and careful selection of parameters is vital to lowering complications arising from the anonymization process. However, devising CAMs with as few parameters as possible without compromising privacy and utility is challenging. In addition, applying optimization strategies to select these parameter values in order to optimize the clustering process is challenging due to differences in data styles and application features.
• Verifying internal, external, statistical, and construct validity in CAMs: Most CAMs developed so far are threat-, domain-, and attack-specific. Hence, their internal, external, statistical, and construct validity cannot be guaranteed in most generic scenarios. Moreover, due to the various parameters and optimization goals, validation of external, statistical, and construct validity in CAMs is challenging. In addition, accurately quantifying the defense level at the time of anonymization is challenging owing to inadequate knowledge of an attacker's expertise.
Apart from the technical challenges cited above, accurate quantification of the privacy and utility levels offered by CAMs, development of low-cost evaluation metrics for CAMs, improving the interpretability of CAMs, resisting multiple AI-powered attacks, and addressing the privacy-versus-utility trade-off are all challenging tasks.
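Several of the challenges above (tailoring the objective function, computational complexity) concern the greedy clustering step at the heart of many CAMs. The following minimal sketch, which assumes one-dimensional numeric records and a range-based loss rather than any specific cited algorithm, grows clusters of at least k records around seeds and releases each record as its cluster's range; its utility-only objective makes the trade-off concrete, since the leftover-absorbing step can produce the very wide (low-utility) groups the text warns about.

```python
# Greedy k-member-style grouping sketch (illustrative data and objective).

def spread(cluster):
    """Range-based information-loss proxy for 1-D numeric records."""
    return max(cluster) - min(cluster)

def k_member_clusters(records, k):
    rest = sorted(records)
    clusters = []
    while len(rest) >= k:
        cluster = [rest.pop(0)]                    # seed with the smallest remaining record
        while len(cluster) < k:
            best = min(rest, key=lambda r: spread(cluster + [r]))
            rest.remove(best)                      # absorb the record that least widens the cluster
            cluster.append(best)
        clusters.append(cluster)
    if clusters and rest:
        clusters[-1].extend(rest)                  # leftovers join the last cluster
    return clusters

ages = [21, 23, 24, 45, 47, 52, 53, 80]
clusters = k_member_clusters(ages, k=3)
# Each record is published as its cluster's range (the generalization).
released = [(min(c), max(c)) for c in clusters]
print(clusters, released)
```

Here the outlier 80 is absorbed into the second cluster, stretching its released range to (45, 80), which illustrates both the outlier challenge and why a privacy-aware objective (not just minimal spread) is needed.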

B. POTENTIAL OPPORTUNITIES FOR FUTURE RESEARCH IN THE PRIVACY DOMAIN
Owing to rapid digitization in recent years, especially during the COVID-19 pandemic, privacy protection has become one of the most active research topics. Recently, many privacy protection techniques have been developed to secure personal data against manipulation in different digital infrastructures. Considering the latest research dynamics and emerging technologies, privacy protection will remain a concern [280]-[284]. Based on a thorough analysis of the published literature, the recent threats/challenges to information privacy, and the existing countermeasures, we highlight in Figure 27 various potential avenues for future research.
With the advent of COVID-19, location data have been used as one of the potential tools for accomplishing multiple goals (e.g., contact tracing, surveillance, and quarantine monitoring) [285], [286]. Since many apps constantly track trajectories and location data, privacy issues of various kinds can arise over matters such as targeted profiling, spatial-temporal activities, interests, preferences, and web-search patterns. Furthermore, location data published by many location-based services can lead to privacy leaks due to the availability of huge amounts of auxiliary information about users. Recently, there has been an increasing focus on devising practical anonymization methods to restrict corporate surveillance and ensure responsible use of personal data. In this line of work, devising practical, verifiable, and efficient anonymization mechanisms is a vibrant avenue for future research. To date, most research in the information privacy area has mainly focused on tabular and graph data. However, due to the increase in sources of data generation, privacy preservation mechanisms for images [287], videos [288], stream data [289], and temporal data [290] have become hot research topics. Despite many developments, this is still an emerging avenue of research. The DP model is regarded as one of the most promising solutions for privacy protection in static and dynamic scenarios. However, due to the excessive noise added by the DP model during anonymization, the utility of the anonymized data can be significantly low [291]. Hence, devising new methods that can boost the utility of the DP model in most settings, especially in the healthcare sector, is a vital research direction.
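The noise-versus-utility tension in the DP model can be made concrete with a minimal sketch of the Laplace mechanism, a standard DP primitive; the counting query and parameter values below are illustrative, not taken from any surveyed work:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Laplace(0, scale) via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0):
    """Answer a counting query under epsilon-DP with the Laplace mechanism.

    A smaller epsilon gives stronger privacy, but the expected absolute
    error grows as sensitivity / epsilon -- the utility loss noted above.
    """
    return true_count + laplace_noise(sensitivity / epsilon)

# Average error grows roughly tenfold each time epsilon shrinks tenfold.
for eps in (1.0, 0.1, 0.01):
    errors = [abs(private_count(100, eps) - 100) for _ in range(2000)]
    print(eps, round(sum(errors) / len(errors), 2))
```

This illustrates why small privacy budgets, common in healthcare settings, can leave published statistics with little analytical value.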
Generally, most anonymization methods have certain parameters to consider (k, ϵ, t, ℓ, etc.), and each parameter has a distinct impact on privacy and utility [292]. Furthermore, these parameters do not yield consistent performance across diverse applications. Similarly, combining anonymization approaches with clustering approaches brings another set of parameters. Hence, optimizing anonymization and clustering parameters by introducing adaptive learning strategies (or by exploiting the inherent statistics of the data) is an important research direction. Recently, machine learning techniques have shown potential in securing personal data from adversaries [293]. Hence, employing ML to preserve the privacy of data in diverse formats is a vibrant area of research. In the published literature, most privacy/utility evaluation metrics do not yield consistent performance and fail to account for emerging privacy threats. Their performance differs from application to application; they mainly capture only minor privacy attacks, or measure utility from only a few aspects. Recently, there has been an increasing focus on developing fine-grained evaluation metrics for PPDP [294]. Considering their necessity and significance, devising evaluation metrics that can accurately measure privacy and utility levels is an active area of research.
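To illustrate how the parameter k operates in practice, the following minimal sketch (attribute names and generalized values are hypothetical) checks whether a table satisfies k-anonymity over a chosen set of quasi-identifiers:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check k-anonymity: every combination of quasi-identifier values
    must be shared by at least k records (each equivalence class >= k)."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in classes.values())

# Hypothetical table: ages generalized to ranges, ZIP codes truncated.
table = [
    {"age": "20-30", "zip": "537**", "disease": "flu"},
    {"age": "20-30", "zip": "537**", "disease": "cold"},
    {"age": "30-40", "zip": "538**", "disease": "flu"},
    {"age": "30-40", "zip": "538**", "disease": "asthma"},
]
print(is_k_anonymous(table, ["age", "zip"], 2))  # each class holds 2 records
```

Raising k forces larger equivalence classes, which is precisely where the utility cost discussed above comes from.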
Since the emergence of COVID-19, privacy has become a main concern for most people around the globe due to the rapid proliferation of digital surveillance technologies, which collect intimate details of people's lives in order to control the pandemic. However, due to data transfer in cyberspace and the invasive use of personal data, privacy issues have been reported from different regions. In the early days of the pandemic, due to privacy issues and interference in personal lives, some people in South Korea even committed suicide [295]. Furthermore, a great deal of personal data (travel logs, mobility data, facility visits, generic personal information, etc.) has been transferred to cyberspace amid this pandemic. Hence, privacy issues will spark renewed interest in the near future. Considering these circumstances, finding practical privacy-preserving methods from the different perspectives shown in Figure 27 is a hot research area. Furthermore, devising solutions for synthetic data generation that can fulfil the data demands of researchers is also an emerging avenue of research [296].
Retaining sufficient utility in anonymized data without compromising privacy is a very active research area because, in most cases, high-quality data are preferred for data mining tasks [297], [298]. Restricting extensive changes during data anonymization is imperative to yielding high-quality anonymized data, but this is only possible by exploiting hidden characteristics of the underlying data. Considering the significance of high-utility datasets, anonymization methods that restrict heavy changes during data transformation are required to improve the performance of knowledge-based systems/applications. Recently, it has been suggested that various groups (major, minor, super minor) exist in data, leading to a new trade-off: privacy versus equity [252]. We demonstrate this trade-off in Figure 28; an effective resolution of this trade-off is imperative for decision-making. Hence, there is an emerging need to develop privacy-preserving methods in this important research direction. Recently, due to pervasive technologies such as the IoT, SNs, and fog/edge computing, a huge amount of distributed data (a.k.a. aggregated data) is available about individuals [299]. Data anonymized in one domain can be de-anonymized by linking them with another domain. To avoid these issues, finding practical anonymization methods that provide resilience for aggregated data is a vibrant area of research. Most anonymization methods published so far have mainly focused on individual privacy preservation, which can still lead to group-privacy disclosures. With the advent of big data technologies, more practical methods that can simultaneously guarantee individual privacy as well as group privacy are needed in the near future. In recent years, privacy preservation in AI-based systems has become a prominent research area that requires robust mitigation strategies to lower potential privacy risks [300].
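One simple way the utility cost of generalization is quantified is range-based information loss for numeric attributes; the following is a minimal sketch (the metric and values are illustrative, not a measure taken from a specific surveyed paper):

```python
def generalization_loss(generalized_ranges, domain):
    """Average normalized range width after generalization.

    A loss of 0 means values were left exact; 1 means every value was
    generalized to the full attribute domain (maximum utility loss).
    """
    span = domain[1] - domain[0]
    losses = [(hi - lo) / span for lo, hi in generalized_ranges]
    return sum(losses) / len(losses)

# Ages generalized to decade-wide intervals over a 0-100 domain.
ranges = [(20, 30), (20, 30), (30, 40), (30, 40)]
print(generalization_loss(ranges, domain=(0, 100)))
```

Wider intervals satisfy stricter privacy parameters but push this score toward 1, which is the trade-off that utility-aware CAMs try to minimize.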
There is a pressing need to devise privacy-preserving solutions for all critical components of AI-based systems, such as data input (client devices/sensors), data preprocessing, ML models, and output [301]. With the advent of federated learning [302], privacy in AI has become a popular topic, because FL requires privacy preservation from different perspectives. A conceptual overview of FL is shown in Figure 29. The privacy landscape of FL is relatively extensive compared to centralized learning, due to its distributed nature [303]. Recently, Ferrag et al. [304] comprehensively discussed various methods for mitigating cybersecurity issues in IoT environments using federated deep learning approaches. Through experimental analysis, the authors showed that federated deep learning approaches are superior to non-FL approaches in many ways (i.e., privacy preservation of IoT devices' data, and attack detection accuracy). Treleaven et al. [305] discussed the data ecosystem and highlighted the relationship between FL and other data science technologies, as well as various engineering issues in the FL ecosystem. Bouacida et al. [306] discussed many vulnerabilities of the FL paradigm from user/participant, server, and aggregation protocol perspectives. The authors suggested a technology stack and valuable directions to mitigate those vulnerabilities. Benmalek et al. [307] provided a holistic view of the security concerns in the FL paradigm. The authors discussed various attacks and vulnerabilities in FL, as well as recently developed promising defense mechanisms against them. Li et al. [308] discussed the challenges and characteristics of the FL ecosystem from a technical perspective. The authors provided a broad survey of the technical problems of the FL ecosystem, especially regarding privacy preservation, and pointed out that significant interdisciplinary efforts are needed to solve most of them.
Shyu et al. [309] discussed data-related challenges concerning the FL paradigm in the healthcare industry, and suggested valuable directions to solve those challenges. In recent years, FL has been thoroughly investigated from both application and threat points of view. We refer interested readers to recently published surveys [310]-[312] to learn more about the FL ecosystem.
In FL, multiple adversarial attacks can be launched, such as model inversion, data poisoning, model poisoning, and data reconstruction. Furthermore, the privacy of participating clients and their associated personal data needs to be preserved in an effective way. A large number of studies have been published on defending against adversarial attacks on FL; however, there is still a lot of room to improve privacy in such systems. Considering the need for privacy-preserving mechanisms in the FL context, devising provable privacy-preserving methods to safeguard FL systems from adversarial attacks, as shown in Figure 27, is an emerging avenue of research.
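As a rough sketch of the central aggregation step that many of these attacks target, the following shows a FedAvg-style weighted average in plain Python (model weights are flat lists of floats here, and the client data are hypothetical):

```python
def federated_average(client_updates):
    """Aggregate client model weights, weighted by local sample counts.

    client_updates: list of (weights, n_samples) pairs, where weights is
    a flat list of floats. This central aggregation point is where
    poisoning and reconstruction attacks discussed above take aim: a
    single malicious client can bias the average.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    avg = [0.0] * dim
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            avg[i] += w * n / total
    return avg

# Three hypothetical clients with unequal data volumes.
updates = [([1.0, 2.0], 10), ([3.0, 4.0], 30), ([2.0, 2.0], 60)]
print(federated_average(updates))
```

Because the server sees only these weighted updates, defenses typically act here, e.g., by clipping or adding noise to each client's contribution before averaging.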
The last six directions listed in Figure 27 are related to development. They include devising privacy policies; visualizing and monitoring data flows in digital systems; quantifying privacy and privacy loss in web data; integrating ML-based methods for analyzing the sensitivity of data in multi-party computing environments; developing anonymization methods that can run on client devices; constantly updating security patches; and integrating legal measures with other robust methods, such as privacy by design (PbD), to secure personal data in third-party applications. In this line of work, answering data analysts' queries while ensuring sufficient privacy is an emerging avenue of research. Furthermore, developing low-cost solutions to generate synthetic data from real data in order to fulfill the demands of researchers is a main focus of research these days. In addition, there is a pressing need to develop privacy-preserving methods for data originating from different computing environments, such as SNs, sensors, actuators, and wearables.
With the evolution of FL and FL-based systems, federated analytics (FA) has emerged as a new collaborative analytics paradigm that solves data-mining tasks without centralizing data from edge devices [313]. In line with these trends, it is imperative to develop prototypes and full-scale systems to realize FA on large and high-dimensional datasets. Furthermore, the integration of FA with systems used to fight the COVID-19 pandemic is a vibrant area of research. In recent years, there has been an increasing focus on the responsible use of personal data. In this line of work, some anonymization mechanisms have recently been developed for data dissemination [314]-[316]. However, this area still requires practical anonymity solutions to ensure confidentiality and transparency amid continuous data generation from different sectors. Lastly, improving the efficiency and efficacy of anonymization methods by leveraging soft computing techniques is also an emerging avenue for research [317]. Considering the ever-changing landscape of privacy threats, developing computationally efficient and robust anonymization techniques that encompass fewer parameters and steps to increase defenses against adversarial attacks without degrading data utility is a hot research area for the near future. In recent years, many real-time applications have emerged to facilitate decision-making by utilizing data produced by IoT/wearable devices. Although these applications assist in robust decision-making, privacy issues can also occur due to the personal data involved. Therefore, data protection regulations, as well as fair information principles (FIPs), are being developed/adopted across the globe for the privacy preservation of personal data. We demonstrate emerging real-time computing applications in Figure 30. All the applications listed in Figure 30 mostly work with real-time data. Recently, Shen et al.
[318] discussed a real-time pricing method for big data environments based on DP. The proposed method produces aggregated query answers with minimal noise to facilitate data owners and data buyers in a privacy-preserved way. Sanchez et al. [319] presented a cyber-security platform that restricts privacy issues in the healthcare ecosystem in an automated way. The developed platform helps in building many innovative real-time applications in the healthcare sector. In this line of work, Awotunde et al. [320] developed a real-time framework based on an IoT-based cloud system to monitor patients' conditions. The proposed framework works with real-time data obtained from IoT sensors and alerts medical staff to advise patients when their health conditions change in hospitals. Recently, searching for a desired data item (or querying) within encrypted data has been extensively investigated to preserve the privacy of the underlying original data in dynamic settings [321]. This technology has been extensively used in real-time applications for data-mining tasks without compromising users' privacy. Due to the proliferation of IoT and cloud-based smart devices, some medical jobs have been taken up by AI systems. In this regard, a real-time system that utilizes IoT devices to identify/detect patients suffering from respiratory disease was recently proposed by Akram et al. [322]. Similarly, a real-time platform named OnTimeEvidence was proposed by Alarcon et al. [323] to find multiple healthcare-related data sources in order to facilitate healthcare data consumers. In the ongoing COVID-19 pandemic, many real-time contact tracing applications have been developed that utilize IoT device data to curb the spread of COVID-19 by identifying potentially suspected COVID-19 cases [324], [325]. In these applications, the proximity and nature of contacts are analyzed in real time to identify the contacts of confirmed cases as quickly as possible and lower the spread.
In many smart city applications, real-time data is extensively used to make intelligent decisions, route suggestions, and product recommendations, to name a few. Zhang et al. [326] recently proposed a privacy-preserved real-time system for streaming traffic using the DP model. Tang et al. [327] proposed a real-time resource management scheme for cyber-physical-social systems (CPSSs). The proposed scheme maximizes the profit of the CPSS operator and incentivizes users. Recently, due to an increase in avenues of data generation, many real-time applications that can gather and process stream data have emerged [328]-[331]. These stream-based applications can assist in performing basic data mining tasks (i.e., frequency analysis, alerting, monitoring, association rules, etc.) as well as advanced analytics such as video-based analytics, rule-based correlations, event detection, and pattern recognition.
Knowledge-based systems can assist in solving complex problems using a knowledge base (i.e., a large and complex database). Chen et al. [332] proposed a real-time diagnosis method for surveillance videos based on deep learning combined with multiscale feature fusion in order to detect multiple types of anomalies. The proposed method can distinguish between normal and anomalous images with 98.52% accuracy. Recently, training/learning high-quality AI models from imbalanced data in real-time applications has become a popular research topic [333]. Vu et al. [334] developed a novel collaborative data model for semi-fully distributed settings for real-time medical applications. The proposed model employs Naive Bayes classification to provide both privacy and accuracy in many real-life applications. In this line of work, researchers have explored multiclass imbalance problems in order to improve classifiers' performance from multiple perspectives [335]. Blockchain technology has revolutionized the privacy domain by removing the heavy reliance on a central server. Li et al. [336] discussed a provably secure method for privacy preservation in real-time IoT applications. Wen et al. [337] discussed a mobile medical system that ensures the privacy and security of sensitive data while users enjoy multiple medical services. The system makes use of identity authentication and blockchain technology to restrict information sharing with the server, so that only minimal information is shared.
Recently, many fog/edge-based applications have been developed to avoid delays in real-time monitoring applications that collect and process real-time data. Sarrab et al. [337] discussed a fog computing method for preserving data privacy in IoT-based healthcare. The proposed healthcare internet of things (H-IoT) framework classifies data based on criticality, and only some data is moved to cloud environments. The H-IoT framework can restrict privacy breaches and avoid delays in time-critical real-time applications. Alzoubi et al. [338] suggested blockchain as a promising privacy-preserving mechanism for fog computing. Recently, fog and edge computing applications have been recognized as promising tools for the prognosis and diagnosis of many critical diseases in the healthcare industry [339]. Due to resource limitations and a lack of technical knowledge, many companies outsource computations to external third-party servers. Computational offloading has become a very popular trend in recent times due to the huge proliferation of IoT-based applications across the globe. Xu et al. [340] discussed a promising solution for computational offloading in cloud-enabled IoT via federated deep reinforcement learning. The proposed method separates high context-aware data from low context-aware data, and some parts of the low context-aware data are sent to edge devices for processing. Wang Jin [341] discussed a computational offloading method for computation-intensive services without sacrificing guarantees on users' privacy. The proposed method's effectiveness was tested through various workflow parameters.
In 2017, Google coined the term federated analytics (FA): performing analytics on local devices, with the data staying local, in a way analogous to FL; see Figure 31. FA is another real-time technology resulting from FL and is based on the distributed computing paradigm [342]. It has been used for online analytics in diverse fields such as medicine, supply chains, finance, and the energy sector [343]. In the near future, FA will be one of the mainstream real-time technologies in the collaborative learning domain. In the context of the ongoing COVID-19 pandemic, FA has been widely used from multiple perspectives, such as vaccine efficacy analysis for different subgroups, contact tracing [344], analytics of the impact of COVID-19 and other diseases, categorization of COVID-19 effects based on demographics, identification of COVID-19 risk factors, prediction of vulnerability indexes [345], and mortality/case predictions, to name a few. Although all the real-time technologies cited above have helped societies/communities in multiple ways, their privacy protection and real-world deployment have yet to be thoroughly investigated. From our extensive analysis of the published literature, we suggest devising practical privacy-preserving solutions for real-time technologies that can ensure the privacy of personal data enclosed in diverse formats (i.e., logs, tables, graphs, streams, images, etc.) along with service requirements (i.e., scalability, low latency, transparency, trustworthiness, ease of use, availability, etc.).
According to a recent Stonebranch report on the state of IT automation across the globe, 88% of companies plan to invest in IT orchestration and automation in 2022. However, through an in-depth survey, Stonebranch identified many challenges that hinder the adoption of IT solutions (e.g., cloud computing). The respondents of this survey were IT professionals working in different IT-related enterprises. Among many other challenges, security and privacy concerns were regarded as one of the main barriers to IT adoption. In Table 8, we present a list of the top reasons currently hindering job placement in public clouds. As shown in Table 8, 58% of respondents regard privacy and security as the main barrier when placing (or deciding to place) computing jobs in public clouds. Most existing privacy solutions are scenario-specific and cannot ensure strong privacy and security of data in cloud environments. Considering the expected boom in IT orchestration and automation, robust solutions that can ensure strong privacy and security are required in the coming years. Apart from privacy protection at the data distribution stage, approaches are needed that can secure the complete lifecycle of data handling (i.e., collection, storage, pre-processing, analytics, distribution, use, and archival), especially in cloud computing environments. From our extensive analysis of the published literature and existing developments, we found that there are still many opportunities to develop practical privacy-preserving mechanisms or tools that can ensure privacy preservation in static and dynamic scenarios [346].
Furthermore, we suggest devising technical privacy-preserving solutions for real-time technologies, heterogeneous types of personal data (i.e., logs, streams, graphs, time series data, videos, images, etc.), multiple computing paradigms (SN, IoT, CC, AI-based systems, etc.), and integrated technologies (i.e., FL and blockchain, IoT and cloud computing, anonymization and encryption, etc.). Most importantly, the need for privacy-preserving solutions has been felt greatly during the ongoing COVID-19 pandemic across the globe. Hence, proposing socio-technical and practical solutions to fight infectious diseases without sacrificing guarantees of individual privacy is also a vibrant area of research for the coming years.

VIII. CONCLUSION
In this article, we described the findings of the latest SOTA research proposing ways to overcome privacy issues in data sharing by leveraging clustering concepts. Recently, there has been an increasing focus on developing clustering-based anonymization mechanisms (CAMs) for responsible data science, and this research area is rapidly gaining researchers' interest. CAMs have demonstrated their effectiveness in improving various technical aspects of traditional anonymization methods (e.g., k-anonymity, ℓ-diversity, and t-closeness) regarding better privacy preservation and utility enhancement in privacy-preserving data publishing (PPDP). Hence, it is of paramount importance to deliver a thorough perspective on information privacy involving heterogeneous data styles along with recent CAMs. In this work, we presented detailed and systematic coverage of CAMs used for securely publishing personal data enclosed in heterogeneous formats. Specifically, we mapped the existing CAMs to ten different data styles (tables, graphs, matrixes, logs, streams, traces, multimedia, text, documents, and hybrids), and we summarized and analyzed their key features, including the clustering algorithms used in each study. Furthermore, we discussed the significance of CAMs in emerging computing paradigms (e.g., social networks, cloud computing, location-based services, IoT-based applications/services, and AI-based services), which will assist in understanding research dynamics in these paradigms as well as in developing more practical anonymization solutions for them. In addition, we discussed the dark side of CAMs, exploited by malevolent adversaries to breach individual privacy by leveraging clustering algorithms and their respective data items. We also discussed the substantial number of open challenges faced by anonymization approaches that employ clustering concepts.
Finally, we discussed various promising opportunities for future research, considering the ever-changing landscape of privacy threats amid continuous technological developments. Based on our analysis of recent developments in CAMs, we found that no single CAM can allay all types of privacy threats emanating from personal data handling in digital environments. However, CAMs that use low-cost clustering methods and that have shown better performance against major privacy threats on benchmark datasets are believed to be the most effective for preserving privacy and utility in data analysis. Moreover, considering recent research trends, the efficacy of these CAMs against AI-powered attacks in dynamic scenarios needs rigorous verification from both theoretical and experimental perspectives. The contents of this article can pave the way to improving existing CAMs as