Due to recent technological development, the amount of data generated by social networking sites, sensor networks, the Internet, healthcare applications, and many other companies is increasing drastically day by day. The huge amount of data generated from different sources, in multiple formats, and at very high speed is referred to as big data. Big data has been a very active research area for the last couple of years. The data generation rate is growing so rapidly that it is becoming extremely difficult to handle the data using traditional methods or systems. Meanwhile, big data can be structured, semi-structured, or unstructured, which adds more challenges when performing data storage and processing tasks. We therefore need new ways to store and analyse data in real time. Big data, if captured and analyzed in a timely manner, can be converted into actionable insights of significant value. It can help businesses and organizations improve their internal decision making and can create new opportunities through data analysis. It can also help to promote scientific research and the economy by transforming traditional business models and scientific values.
Big data can be defined in various ways. For the scope of this paper, we use the definition given by the International Data Corporation (IDC), which defines big data as “a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis”. Based on this definition, the properties of big data are reflected by the 3 V’s, namely volume, velocity and variety, as shown in Fig. 1. Volume refers to the amount of data generated. With the emergence of social networking sites, we have seen a dramatic increase in the size of the data. Velocity refers to the rate at which new data are generated. A common theme of big data is that the data are diverse, i.e., they may contain text, audio, image, or video. This diversity of data is denoted by variety.
Although big data can be effectively utilized to better understand the world and to innovate in various aspects of human endeavor, the exploding amount of data has increased the potential for privacy breaches. For example, Amazon and Google can learn our shopping preferences and browsing habits. Social networking sites such as Facebook store all the information about our personal life and social relationships. Popular video sharing websites such as YouTube recommend videos to us based on our search history. With all the power driven by big data, the gathering, storing and reusing of our personal information for commercial profit has put our privacy and security at risk. In 2006, AOL released 20 million search queries from about 650,000 users for research purposes, after removing the AOL id and IP address. However, it took researchers only a couple of days to re-identify the users. Users’ privacy may be breached under the following circumstances:
In order to ensure big data privacy, several mechanisms have been developed in recent years. These mechanisms can be grouped based on the stages of the big data life cycle, i.e., data generation, storage, and processing. In the data generation phase, access restriction and falsifying data techniques are used to protect privacy. While access restriction techniques try to limit the access to individuals’ private data, falsifying data techniques alter the original data before they are released to a non-trusted party. The approaches to privacy protection in the data storage phase are mainly based on encryption techniques. Encryption based techniques can be further divided into attribute based encryption (ABE), identity based encryption (IBE), and storage path encryption. In addition, to protect sensitive information, hybrid clouds are used, where sensitive data are stored in a private cloud. The data processing phase includes privacy preserving data publishing (PPDP) and knowledge extraction from the data. In PPDP, anonymization techniques such as generalization and suppression are used to protect the privacy of data. Ensuring the utility of the data while preserving privacy is a great challenge in PPDP. In the knowledge extraction process, there exist several mechanisms to extract useful information from large-scale and complex data. These mechanisms can be further divided into clustering, classification and association rule mining based techniques. While clustering and classification split the input data into different groups, association rule mining based techniques find useful relationships and trends in the input data.
Protecting privacy in big data is a fast growing research area. Although some related papers have been published, only a few of them are survey or review papers. Moreover, while these papers introduced the basic concept of privacy protection in big data, they failed to cover several important aspects of this area. For example, none of them provides a detailed discussion of big data privacy with respect to cloud computing. Besides, none of the papers discussed future challenges in detail.
In this paper, we give a comprehensive overview of the state-of-the-art technologies to preserve the privacy of big data at each stage of the big data life cycle. Moreover, we discuss privacy issues related to big data when they are stored and processed on the cloud, as cloud computing plays a very important role in the application of big data. Furthermore, we discuss potential research directions. The remainder of this paper is organized as follows. The infrastructure of big data, and the privacy issues that arise from the underlying structure of cloud computing, are discussed in section II. Privacy issues related to the data generation phase are discussed in section III. Privacy issues during the data storage and data processing phases are discussed in sections IV and V, respectively. Finally, future research directions are identified and discussed in section VI.
To handle the different dimensions of big data in terms of volume, velocity, and variety, we need to design efficient and effective systems to process large amounts of data arriving at very high speed from different sources. Big data has to go through multiple phases during its life cycle, as shown in Fig. 2. Data are distributed nowadays, and new technologies are being developed to store and process large repositories of data. For example, cloud computing technologies, such as Hadoop MapReduce, are explored for big data storage and processing.
In this section we explain the life cycle of big data. In addition, we discuss how big data benefit from cloud computing technologies, and the drawbacks associated with cloud computing when it is used for the storage and processing of big data.
Many applications use the vast amounts of available data together with domain specific analytical methods to derive meaningful information. Although different fields in data analytics require different data characteristics, some of these fields may leverage similar underlying technology to inspect, transform and model data in order to extract value from it. Emerging data analytics research can be classified into the following six technical areas: structured data analytics, text analytics, multimedia analytics, web analytics, network analytics, and mobile analytics.
Big data need massive computation and storage, which brings in the need for cloud computing. Enterprises and businesses are adopting the cloud because of the many advantages it offers, such as cost saving and scalability. It also offers huge processing power and storage capability. Technologies used in cloud computing, such as virtualization and distributed storage and processing, have made it possible to perform tasks that had been considered difficult in conventional systems. On the other hand, cloud computing also results in serious cloud specific privacy issues. People hesitate to transfer their private or sensitive data to the cloud unless they are sure that their data will be secure there. The challenges for building a trustworthy and secure big data storage and processing system on the cloud are as follows.
Data generation can be classified into active data generation and passive data generation. Active data generation means that the data owner is willing to provide the data to a third party, while passive data generation refers to situations in which the data are generated by the data owner’s online activity (e.g., browsing) and the data owner may not even be aware that the data are being collected by a third party. The major challenge for the data owner is how to protect his data from any third party who may be willing to collect them. The data owner wants to hide his personal and sensitive information as much as possible and is concerned about how much control he can have over the information. We can minimize the risk of privacy violation during data generation either by restricting access or by falsifying data.
If the data owner thinks that the data may reveal sensitive information which is not supposed to be shared, he can simply refuse to provide such data. For that, the data owner has to adopt effective access control methods so that the data cannot be stolen by a third party. If the data owner is providing the data passively, some measures can be taken to ensure privacy, such as anti-tracking extensions, advertisement/script blockers and encryption tools. By using these tools, one can effectively limit the access to sensitive data. For ease of use, most of these tools are designed as browser extensions.
In addition to these tools, there are some alternative means, such as using anti-malware and anti-virus software to protect the data stored digitally on one’s computer or laptop. These tools can help to protect a user’s personal data by limiting the access. Though there is no guarantee that one’s sensitive data are completely protected from untrustworthy sources, making a habit of clearing the online traces of one’s activity by using security tools can significantly reduce the risk.
In some circumstances, it is not possible to prevent access to sensitive data. In that case, the data can be distorted using certain tools before they are fetched by a third party. If the data are distorted, the true information cannot be easily revealed. The following techniques are used by the data owner to falsify the data.
Storing high volume data is not a big challenge, owing to advances in data storage technologies such as the boom in cloud computing. However, securing the data is very challenging. If the big data storage system is compromised, it can be very harmful, as individuals’ personal information can be disclosed. Therefore, we need to ensure that the stored data are protected against such threats. In modern information systems, data centres play an important role in performing complex computations and retrieving large amounts of data. In a distributed environment, an application may need several datasets from different data centres and therefore faces the challenge of privacy protection.
The conventional security mechanisms to protect data can be divided into four categories: file level data security schemes, database level data security schemes, media level security schemes and application level encryption schemes. The conventional mechanisms to protect data security and privacy for existing storage architectures (i.e., direct attached storage, network attached storage and storage area network) have been a very hot research area but may not be directly applicable to big data analytics platforms. In response to the 3V’s nature of big data analytics, the storage infrastructure should be scalable. It should have the ability to be configured dynamically to accommodate diverse applications. One promising technology to address these requirements is storage virtualization, enabled by the emerging cloud computing paradigm. Storage virtualization is a process in which multiple network storage devices are combined into what appears to be a single storage device. However, using a cloud service offered by a cloud provider means that the organization’s data will be outsourced to a third party, namely the cloud provider. This could affect the privacy of the data. Therefore, in this paper we limit our discussion to the privacy of data when stored on the cloud.
When data are stored on the cloud, data security mainly has three dimensions: confidentiality, integrity and availability. The first two are directly related to the privacy of the data, i.e., if data confidentiality or integrity is breached, it will have a direct effect on users’ privacy. Therefore, we also discuss privacy issues related to the confidentiality and integrity of data in this section.
A basic requirement for a big data storage system is to protect the privacy of an individual. There are some existing mechanisms to fulfil that requirement. For example, a sender can encrypt his data using public key encryption (PKE) in such a way that only the valid recipient can decrypt the data. The approaches to preserve the privacy of the user when data are stored on the cloud are as follows.
ABE is an encryption technique which ensures end to end big data privacy in a cloud storage system. In ABE, access policies are defined by the data owner and data are encrypted under those policies. The data can only be decrypted by the users whose attributes satisfy the access policies defined by the data owner. When dealing with big data, one may often need to change data access policies, as the data owner may have to share the data with different organizations. The current attribute based access control schemes do not consider policy updating. Policy updating is a very challenging task in attribute based access control systems, because once the data are outsourced to the cloud, the data owner does not keep a local copy. If the data owner wants to update the policy, he has to transfer the data back to the local system, re-encrypt the data under the new policy and store them back on the cloud server. This process incurs very high communication overhead and computational cost. To solve the problem of policy updating, Yang et al. recently proposed a secure and verifiable policy updating outsourcing method. In this method, the data owner does not need to retrieve all the data and re-encrypt them. Instead, the data owner can send queries to the cloud to update the policy, and the cloud server can update the policy directly without decrypting the data.
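The policy-matching side of ABE can be illustrated without any cryptography. The following is a minimal sketch that only shows how a set of user attributes is checked against an access policy; it performs no encryption, and the policy format, the attribute names, and the function `satisfies` are hypothetical choices made for this illustration.

```python
# Toy illustration of attribute-based access policies. NOT real ABE: no
# cryptography is performed; only the policy-matching logic is shown.
# The policy format and all names are hypothetical.

def satisfies(policy, attributes):
    """Return True if the attribute set satisfies the policy tree."""
    if isinstance(policy, str):          # leaf: a single required attribute
        return policy in attributes
    op, *children = policy               # inner node: ("AND" | "OR", child, ...)
    results = [satisfies(child, attributes) for child in children]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")

# Policy chosen by the data owner: hospital_A AND (doctor OR auditor)
policy = ("AND", "hospital_A", ("OR", "doctor", "auditor"))
doctor = {"hospital_A", "doctor"}
outsider = {"hospital_B", "doctor"}

# A policy update that additionally admits regulators
new_policy = ("OR", policy, "regulator")
```

In an actual ABE scheme the policy is bound to the ciphertext, which is why changing `policy` to `new_policy` is costly unless the update can be outsourced as described above.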
IBE is an alternative to PKE that was proposed to simplify key management in a certificate-based public key infrastructure (PKI) by using human identities, such as an email address or IP address, as public keys. To preserve the anonymity of the sender and receiver, the IBE scheme was proposed.
By employing these primitives, the source and the destination of data can be kept private. Encryption schemes like IBE and ABE do not support updating the ciphertext receiver. There are some approaches to updating the ciphertext recipient. For instance, the data owner can employ the decrypt then re-encrypt mode. However, if the data are large, as is mostly the case when dealing with big data, the decryption and re-encryption can be very time consuming and costly because of the computation overhead. Moreover, in this mode, the data owner has to be online all the time. Another approach to updating the ciphertext receiver is to delegate this task to a trusted third party that knows the decryption key of the data owner. This approach has a few drawbacks: the scheme relies on full trust in the third party, and the anonymity of the ciphertext receiver cannot be achieved, as the third party needs to know information about the recipient to carry out the re-encryption. Mambo and Okamoto introduced proxy re-encryption (PRE), which was further defined in later work. PRE was proposed to handle the problem of data sharing between different recipients. In PRE, a semi-trusted third party transforms a ciphertext intended for one user into a ciphertext of the same message intended for another user, without leaking any knowledge about the message or the decryption keys. The workload of the data owner is thus transferred to the proxy, and the data owner does not have to be online all the time.
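The transformation performed by the proxy can be made concrete with a toy example. The sketch below uses a textbook ElGamal-style construction over a tiny group; it is an assumption chosen for illustration and is not necessarily the construction used in the works cited above. The parameters are far too small to be secure, all function names are hypothetical, and for simplicity the re-encryption key is computed from both secret keys, which a real deployment would avoid.

```python
import secrets

# Toy ElGamal-style proxy re-encryption over a tiny group. Insecure parameters,
# for illustration only; all names are hypothetical.
P = 23   # small safe prime, P = 2*Q + 1
Q = 11   # prime order of the subgroup generated by G
G = 4    # generator of the order-Q subgroup of Z_P*

def keygen():
    sk = secrets.randbelow(Q - 1) + 1          # secret key in [1, Q-1]
    return sk, pow(G, sk, P)                   # (secret, public = G^sk)

def encrypt(pk, m):
    """Encrypt m (an integer in [1, P-1]) under public key pk."""
    r = secrets.randbelow(Q - 1) + 1
    return (m * pow(G, r, P)) % P, pow(pk, r, P)   # (m*G^r, G^(sk*r))

def decrypt(sk, ct):
    c1, c2 = ct
    g_r = pow(c2, pow(sk, -1, Q), P)           # recover G^r from G^(sk*r)
    return (c1 * pow(g_r, -1, P)) % P

def rekey(sk_from, sk_to):
    return (sk_to * pow(sk_from, -1, Q)) % Q   # sk_to / sk_from mod Q

def reencrypt(rk, ct):
    """Run by the proxy: it never sees the message or either secret key."""
    c1, c2 = ct
    return c1, pow(c2, rk, P)                  # G^(a*r) becomes G^(b*r)

alice_sk, alice_pk = keygen()
bob_sk, bob_pk = keygen()
ciphertext = encrypt(alice_pk, 9)
for_bob = reencrypt(rekey(alice_sk, bob_sk), ciphertext)
```

Here `decrypt(bob_sk, for_bob)` recovers the same message that `decrypt(alice_sk, ciphertext)` does, although the proxy only raised one ciphertext component to the power `rk`.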
Proxy re-encryption has also been employed in the IBE setting. Anonymous identity based proxy re-encryption (IBPRE) has been introduced, but that work only supports a one-time update of the ciphertext receiver, while in practice multiple receiver updates are desirable. In addition, that work provides an all or nothing share mode, which limits flexibility. Liang et al. proposed an anonymous identity based proxy re-encryption scheme with the following properties: the identity information of the sender and receiver is anonymous, and the ciphertext receiver can be updated multiple times, with the possibility of conditional fine grained sharing of the ciphertext.
Public cloud is more vulnerable to privacy breaches because of multi-tenancy and virtualization. Cloud users may share the same physical space, and in such a scenario the chances of data leakage are very high. One way to protect the data on the cloud is to encrypt the data, store them on the cloud, and allow the cloud to perform computations over the encrypted data. Fully homomorphic encryption is the type of encryption which allows functions to be computed on encrypted data. Given only the encryption of a message, one can obtain an encryption of a function of that message by computing directly on the ciphertext. Homomorphic encryption provides full privacy, but it comes at the cost of computational complexity and is sometimes very hard to implement with existing technologies. A comparison of different encryption schemes is shown in Table 1.
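The idea of computing directly on ciphertexts can be seen in a much weaker setting than fully homomorphic encryption. The following minimal sketch uses unpadded textbook RSA with tiny primes, which is only multiplicatively (partially) homomorphic and is insecure; it is an illustrative stand-in, and the variable names are hypothetical.

```python
# Textbook RSA with tiny primes, insecure and unpadded, used only to show a
# homomorphic property. This is partially (multiplicatively) homomorphic,
# not fully homomorphic encryption.
p, q = 61, 53
n = p * q                               # public modulus
e = 17                                  # public exponent
d = pow(e, -1, (p - 1) * (q - 1))       # private exponent

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 12
# The "cloud" multiplies two ciphertexts without ever seeing a or b.
c_product = (enc(a) * enc(b)) % n
```

Decrypting `c_product` yields `a * b` (as long as the product is smaller than `n`), so the multiplication was carried out on encrypted values. Fully homomorphic schemes extend this to arbitrary functions, at the computational cost noted above.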
Recently, Cheng et al. proposed a scheme for the secure storage of big data on clouds. In the proposed scheme, the big data are first separated into many sequenced parts, and then each part is stored on a different storage medium owned by a different cloud storage provider. To access the data, the different parts are first collected from the different data centres and then restored into their original form before being presented to the data owner. In this scheme, the big data stored on the cloud are classified into public data and confidential data. There are no extra security requirements for public data, and each tenant can access the data freely. In contrast, confidential data are always kept secure and inaccessible to irrelevant individuals and organizations. A trapdoor function has been incorporated in this scheme. It is a function which is easy to compute in one direction and difficult to compute in the opposite direction without some additional information. Trapdoor functions are used widely in cryptographic applications. In the proposed scheme, instead of encrypting the whole of the big data, only the storage path is encrypted, which is called the cryptographic virtual mapping of big data. For some special applications, the proposed scheme also encrypts the parts of the data which are considered confidential. In order to improve the availability and robustness of the big data, the scheme stores copies of each piece of data on cloud storage, so that when a data part is lost, another copy can be found. The owner of the big data keeps the storage index information.
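The general flow described above (split the data, spread replicated parts over several providers, and protect only the storage path) can be sketched as follows. This is not the scheme of Cheng et al.: the in-memory dictionaries standing in for providers, the function names, and in particular the SHA-256 keystream used as a placeholder cipher are all assumptions made for illustration, and the placeholder cipher should not be used in practice.

```python
import hashlib
import json
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher (SHA-256 counter keystream XOR). Illustration only."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(x ^ y for x, y in zip(data[i:i + 32], block))
    return bytes(out)

def store(data, providers, key, part_size=4, copies=2):
    """Split data into sequenced parts, replicate them, return the encrypted index."""
    names = sorted(providers)
    index = []                                   # ordered storage paths
    for seq, start in enumerate(range(0, len(data), part_size)):
        part = data[start:start + part_size]
        locations = []
        for c in range(copies):                  # copies land on different providers
            name = names[(seq + c) % len(names)]
            part_id = secrets.token_hex(8)       # opaque object name
            providers[name][part_id] = part      # the part itself is stored as-is
            locations.append([name, part_id])
        index.append(locations)
    return keystream_xor(key, json.dumps(index).encode())   # only the path is encrypted

def restore(enc_index, providers, key):
    """Decrypt the index, then collect the parts in order, using any surviving copy."""
    index = json.loads(keystream_xor(key, enc_index))
    parts = []
    for locations in index:
        for name, part_id in locations:
            if part_id in providers[name]:
                parts.append(providers[name][part_id])
                break
        else:
            raise KeyError("all copies of a part are missing")
    return b"".join(parts)
```

Without the key, a provider sees only unordered, opaquely named fragments; the data owner, who keeps the encrypted index, can reassemble them and can tolerate the loss of one copy of any part.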
According to the National Institute of Standards and Technology (NIST), the cloud can be deployed using the following three models: private clouds (owned and accessed only by the providing enterprise), public clouds (available and accessible to all service subscribers), and hybrid clouds (a combination of public and private clouds).
Private clouds are inherently trustworthy and secure, but some limitations hamper their use for the processing and storage of big data. The first limitation is scalability. Building a highly scalable private cloud requires a large capital investment, and it becomes very difficult to accurately plan private cloud capacity when the volume, velocity, and variety of the data are constantly changing. The second limitation is the unavailability of the analytical models and software frameworks required to manage heterogeneous data. The third limitation concerns data sharing. Sometimes, data sharing should be available among authorized collaborators who do not have access to, or who reside outside of, the private cloud. However, due to security concerns, this is not always possible. On the other hand, public clouds support scalability and easy sharing of data. However, public clouds are more prone to security and privacy attacks because of the multi-tenancy of virtual machines and data.
Hybrid cloud is the combination of public cloud and private cloud. It brings together the inherent features of public clouds (e.g., scalability and processing power) and private clouds (i.e., security), and provides potential research opportunities for the processing and storage of big data. Hybrid clouds have been deployed for the privacy preserving processing and storage of big data. We can take advantage of a hybrid cloud by separating sensitive data from non-sensitive data and storing them in the trusted private cloud and the untrusted public cloud, respectively. However, this approach has a drawback: if it is adopted directly, all the sensitive data have to be stored in the private cloud, which would require a lot of private cloud storage. Most users want to minimize the storage and computation in the private cloud and let the public cloud do most of the storage and computation. A scheme has been presented that reduces the communication overhead between the private and public cloud while also achieving privacy protection by using a hybrid cloud. Specifically, the privacy of image data is achieved by dividing the image into blocks and then shuffling these blocks directly. Each pixel of every block is mapped to another value via a random one-to-one mapping function. The transformed image is stored on the public cloud, whereas the mapping function needed to recover the image is stored on the private cloud.
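The image transformation described at the end of this paragraph can be sketched as follows. This is a simplified illustration under several assumptions: the image is treated as a flat list of 8-bit grayscale values, Python's `random` module (which is not cryptographically secure) stands in for the secret randomness, and the names `protect` and `recover` are hypothetical.

```python
import random

def protect(pixels, block_size, seed):
    """Return (transformed blocks for the public cloud, secret for the private cloud)."""
    rng = random.Random(seed)                    # stand-in for secret randomness
    value_map = list(range(256))
    rng.shuffle(value_map)                       # random one-to-one pixel value mapping
    blocks = [pixels[i:i + block_size] for i in range(0, len(pixels), block_size)]
    order = list(range(len(blocks)))
    rng.shuffle(order)                           # random block permutation
    transformed = [[value_map[v] for v in blocks[j]] for j in order]
    secret = {"value_map": value_map, "order": order}
    return transformed, secret

def recover(transformed, secret):
    """Invert the pixel mapping and undo the block shuffle."""
    inverse = [0] * 256
    for original, mapped in enumerate(secret["value_map"]):
        inverse[mapped] = original
    blocks = [None] * len(transformed)
    for position, j in enumerate(secret["order"]):
        blocks[j] = [inverse[v] for v in transformed[position]]
    return [v for block in blocks for v in block]

image = [10, 20, 30, 40, 50, 60, 70, 80, 90]
public_part, private_part = protect(image, block_size=4, seed=1)
```

Only `private_part`, which is small compared with the image, has to be kept in the private cloud, which matches the goal of keeping most storage on the public side.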
When cloud computing is used for big data storage, the data owner loses control over the data. The outsourced data are at risk, as the cloud server may not be fully trusted. The data owner needs to be strongly convinced that the cloud is storing the data properly according to the service level contract. One way to ensure privacy for the cloud user is to provide a mechanism that lets the data owner verify that his data stored on the cloud are intact. Data integrity verification is therefore of critical importance. Table 2 compares the different integrity verification schemes discussed in this paper. Numerous related research problems have been studied over the past decade. The integrity of data storage in traditional systems can be verified in a number of ways, e.g., Reed-Solomon codes, checksums, trapdoor hash functions, message authentication codes (MAC), and digital signatures. To verify the integrity of the data stored on the cloud, one straightforward approach is to retrieve all the data from the cloud. However, the great volume of big data makes this very inefficient in terms of time consumption and communication overhead. To address this problem, researchers have developed schemes to verify the integrity of data without having to retrieve the data from the cloud. In an integrity verification scheme, the cloud server can only provide a valid proof of integrity when all the data are intact. It is highly recommended that integrity verification be conducted regularly to provide the highest level of data protection. In the following, we discuss the framework of integrity verification, followed by popular integrity verification schemes for dynamic data. Note that the data in most big data applications are dynamic in nature. Fig. 3 explains the basic framework of integrity verification schemes.
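As a simple point of reference for the schemes discussed next, the following minimal sketch shows a MAC-based spot check, one of the traditional mechanisms listed above. It is an assumption-laden illustration rather than any of the surveyed schemes: the server is modelled as a Python list, the function names are hypothetical, and the challenged blocks are sent back in full, which the more advanced schemes avoid.

```python
import hashlib
import hmac
import secrets

def tag(key, index, block):
    """MAC over the block position and content, so blocks cannot be swapped."""
    return hmac.new(key, index.to_bytes(8, "big") + block, hashlib.sha256).digest()

def outsource(key, blocks):
    """What the owner uploads: each block together with its tag."""
    return [(block, tag(key, i, block)) for i, block in enumerate(blocks)]

def audit(key, server, sample_size):
    """Challenge randomly chosen block positions and verify the returned tags."""
    rng = secrets.SystemRandom()
    for i in rng.sample(range(len(server)), sample_size):
        block, stored_tag = server[i]            # server's response to the challenge
        if not hmac.compare_digest(stored_tag, tag(key, i, block)):
            return False
    return True

key = secrets.token_bytes(32)
server = outsource(key, [b"part-0", b"part-1", b"part-2", b"part-3"])
```

Sampling fewer blocks than the file contains trades detection probability for lower communication cost; checking all blocks, as in `audit(key, server, len(server))`, detects any modification but costs as much as downloading the data.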
Data owners can perform integrity verification by themselves or delegate the task to trusted third parties. The basic framework of any integrity verification scheme consists of three participating parties: the client, the cloud storage server (CSS) and the third party auditor (TPA). The client stores the data on the cloud, and the objective of the TPA is to verify the integrity of the data. The main life cycle of a remote integrity verification scheme consists of the following steps.
Proposed by Ateniese et al. in 2007 and by Juels and Kaliski, the provable data possession (PDP) scheme was built to offer block-less verification, i.e., the verifier can verify the integrity of a proportion of the outsourced file by verifying a combination of pre-computed homomorphic verifiable tags (HVT) or HVL. HVTs are used as the building block of PDP schemes, and the tag construction is based on RSA signatures. The tag is stored on the server together with the file and acts as verification metadata for the file block. The HVTs are unforgeable and have the following properties:
Proofs of retrievability (PORs) are cryptographic proofs that enable a cloud provider to prove that a user can retrieve a targeted file in its entirety. A POR basically consists of a challenge-response protocol in which the service provider proves to the user that the file is still intact and retrievable. The concept of POR and its first model were proposed by Juels and Kaliski. Unfortunately, this scheme can only be applied to static data storage such as an archive or library. Later, in 2008, Shacham and Waters proposed an improved version of POR. They proposed a construction for private verification, in which the data can only be verified with the secret key. As a result, no party other than the client can verify the data. The scheme is efficient because it admits short responses and fast computation. Armknecht et al. proposed outsourced proofs of retrievability (OPOR), in which users can task an external auditor to perform and verify POR with the cloud provider.
Data integrity verification performed by third parties is termed public auditing. Wang et al. proposed a scheme based on BLS signatures that supports public auditing and full data dynamics, and which is one of the latest works on public data auditing. However, this scheme lacks support for fine-grained updates (operations applied to a small part of a large dataset, such as a single row) and for authorized auditing. Liu et al. proposed a public auditing scheme with support for fine-grained updates over variable-sized file blocks. In addition, an authentication process between the client and the TPA is proposed to prevent the TPA from issuing endless challenges, thereby reducing the possibility of attacks that exploit multiple challenges.
The problem with public auditing schemes is that the linear combinations of blocks aggregated for auditing may reveal user information, especially if a sufficient number of linear combinations of the same blocks are collected. Wang et al. proposed a privacy preserving public auditing scheme. When computing the integrity proof, a random masking technique is incorporated to prevent parts of the original file from being extracted from several integrity proofs over the same part of the data.
Privacy protection in the data processing phase can be divided into two phases. In the first phase, the goal is to safeguard information from unsolicited disclosure, because the collected data may contain sensitive information about the data owner. In the second phase, the goal is to extract meaningful information from the data without violating privacy. We discuss the two phases in this section.
During PPDP, the collected data may contain sensitive information about the data owner. Directly releasing the information for further processing may violate the privacy of the data owner; hence, the data need to be modified in such a way that they do not disclose any personal information about the owner. On the other hand, the modified data should still be useful, so as not to defeat the original purpose of data publishing. The privacy and the utility of data are inversely related to each other, as discussed in detail later in this section. Many studies have been conducted on modifying the data before publishing or storing them for further processing. To preserve the privacy of a user, PPDP mainly uses anonymization techniques. The original data are assumed to be sensitive and private and to consist of multiple records. Each record may consist of the following four attributes.
The data are anonymized by removing the identifiers and modifying the quasi-identifiers before publishing or storing for further processing. As a result of anonymization, the identity of the data owner and the sensitive values are hidden from adversaries. How much the data should be anonymized mainly depends on how much privacy we want to preserve in those data. The privacy models are basically classified into two categories based on the ability of an attacker to identify an individual. The first category is based on the assumption that the attacker is able to identify the records of a specific user by linking the records with external data sources. The second category is based on the assumption that the attacker has enough background knowledge to conduct probabilistic attacks, i.e., the attacker is able to make a confident guess about whether a specific user’s record exists in the database or not.
Several models have been proposed to deal with the above problems. They include k-anonymity to prevent record linkage, l-diversity to prevent attribute linkage and record linkage, and t-closeness to prevent probabilistic attacks and attribute linkage.
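To make the first two models concrete, the following minimal sketch checks whether an already generalized table is k-anonymous and (distinct) l-diverse. The table contents, the column names, and the function names are hypothetical, and only the simplest "distinct values" form of l-diversity is shown.

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_identifiers)].append(record)
    return list(groups.values())

def is_k_anonymous(records, quasi_identifiers, k):
    """Every equivalence class must contain at least k records."""
    return all(len(g) >= k for g in equivalence_classes(records, quasi_identifiers))

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Every equivalence class must contain at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in g}) >= l
               for g in equivalence_classes(records, quasi_identifiers))

table = [
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "cancer"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
    {"age": "40-49", "zip": "148**", "disease": "flu"},
]
qi = ["age", "zip"]
```

This table is 2-anonymous, yet it is not 2-diverse: everyone in the second class has the same disease, so an attacker who can place a person in that class learns the sensitive value without identifying the exact record. This is the attribute linkage that l-diversity is meant to prevent.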
Before publishing, the original table is modified according to the specified privacy requirements. To preserve privacy, one of the following anonymization operations is applied to the data.
A high level of data anonymization indicates that privacy is well protected. On the other hand, it may also affect the utility of the data, which means that less value can be extracted from the data. Therefore, balancing the trade-off between privacy and utility is very important in big data applications. The reduction in data utility is represented by information loss. Various methods have been proposed in the literature for measuring information loss; examples include minimal distortion, the discernibility metric, the normalized average equivalence class size metric, the weighted certainty penalty, and information theoretic metrics. To handle the trade-off between privacy and utility, PPDP algorithms usually take a greedy approach. These algorithms work by generating multiple tables using the given metrics of privacy preservation and information loss, all of which satisfy the requirement of the specific privacy model during the anonymization process. The output of the greedy algorithm is the table with minimum information loss.
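The selection step described above can be illustrated with a small example. The sketch below generalizes a single hypothetical quasi-identifier (age) at several candidate widths, keeps the candidates that satisfy k-anonymity, and returns the one with the lowest information loss. It assumes one common form of the discernibility metric (the sum of squared equivalence class sizes, ignoring suppressed records) and simply enumerates a short candidate list rather than performing a full greedy search; all names are illustrative.

```python
from collections import Counter

def generalize(ages, width):
    """Replace each age by an interval of the given width, e.g. 23 -> '20-29'."""
    return [f"{(a // width) * width}-{(a // width) * width + width - 1}" for a in ages]

def class_sizes(values):
    """Sizes of the equivalence classes induced by the generalized values."""
    return list(Counter(values).values())

def discernibility(values):
    """Information loss: each record is penalized by the size of its class."""
    return sum(size ** 2 for size in class_sizes(values))

def best_generalization(ages, widths, k):
    """Among candidates that are k-anonymous, pick the one with minimum loss."""
    feasible = []
    for width in widths:
        table = generalize(ages, width)
        if min(class_sizes(table)) >= k:
            feasible.append((discernibility(table), width, table))
    return min(feasible) if feasible else None

ages = [21, 23, 27, 29, 34, 36]
result = best_generalization(ages, widths=[5, 10, 20], k=2)
```

For this input, width 5 violates 2-anonymity, while widths 10 and 20 both satisfy it; width 10 is selected because its classes are smaller, so less information is lost for the same privacy requirement.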
Quantifying privacy is a very hard task. For example, consider a scenario where a piece of data is collected from a data owner. The data owner is free to decide how much and what kind of information he or she wants to share with a third party. Once the data are handed over to the third party, some privacy loss may occur. Different data owners may provide the same data to the third party. However, when privacy disclosure happens, some individuals who treat privacy seriously may perceive more loss than those who have little concern about the privacy.
To extract useful information from big data without breaching the privacy, privacy preserving data mining techniques have been developed to identify patterns and trends from data. Those techniques cannot be applied straightaway to big data as big data may contain large, complex and dynamically varying data. To handle big data in an efficient manner, those techniques should be modified, or some special set of techniques should be used. In addition to this, those modified techniques should address the privacy concern. There are several techniques proposed to analyze large-scale and complex data. These techniques can be broadly grouped into clustering, classification and association rule based techniques.
Clustering is a popular data processing technique because of its capability to analyze unfamiliar data. The fundamental idea behind clustering is to separate unlabelled input data into several different groups. Conventional clustering algorithms require the data to be in the same format and to be loaded into a single processing unit, which is not suitable for big data processing. Many solutions have been presented in the recent decade. However, due to the nature of big data, they have several disadvantages, among which computational complexity and privacy concerns are the major problems. To handle the issue of computational complexity, Shirkhorshidi et al. introduced sampling and dimension reduction solutions for single-machine clustering, and parallel and map-reduce solutions for multiple-machine clustering. To improve efficiency, cloud computing based parallel processing has been proposed. To make clustering feasible for very large data sets, Feldman et al. presented a parallel processing approach in which core sets are created using a tree construction. Compared to traditional clustering algorithms, this approach significantly reduces the processing time and the required amount of energy. Nevertheless, privacy remains a major concern in all of these methods. Privacy preservation in clustering is a challenging problem when large volumes of complex data are involved. In the early days, hybrid geometric data transformation based methods were proposed to protect privacy in clustering. These methods alter numerical attributes by translation, scaling and rotation. Although a certain level of privacy can be achieved, data utility is usually reduced, so these methods are not practically feasible. Oliveira and Zaiane proposed a method for centralized data that uses dimensionality reduction and an object similarity based representation. 
Since this method is specifically designed for centralized data, it cannot be used with the more common decentralized big data. To improve the efficiency of clustering on new (unfamiliar) data, privacy-preserving clustering based on the probability distributed model was proposed. To handle complex and distributed data, a novel algorithm called distributed local clustering was presented. Elsewhere, secure multi-party computation based techniques, such as homomorphic encryption, are used to achieve privacy protection. In the above mentioned methods, clustering is done using low order statistics. When the input data are complex, these low order statistics are inadequate and can yield poor clustering results. To overcome this, Shen and Li used information theoretic measures as a cost function to develop a linear and a kernel distributed clustering algorithm. In their method, the nodes exchange only a few parameters with their neighbors instead of the original data.
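The idea of exchanging parameters instead of raw data can be illustrated with a minimal sketch. It is not Shen and Li's information theoretic algorithm: it swaps in a plain k-means update (a low order statistic) because that is the simplest way to show the communication pattern, and it assumes a central merge step instead of neighbor-to-neighbor exchange. All names are hypothetical, and the shared summaries may still leak information in practice.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def local_summary(points, centroids):
    """Runs on one node: per-cluster coordinate sums and counts.

    Only these aggregates leave the node; the raw points stay local.
    """
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in points:
        i = nearest(point, centroids)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    return sums, counts


def merge(summaries, centroids):
    """Combine node summaries into updated centroids."""
    updated = []
    for i, old in enumerate(centroids):
        total = sum(counts[i] for _, counts in summaries)
        if total == 0:
            updated.append(list(old))  # keep an empty cluster where it was
        else:
            updated.append(
                [sum(sums[i][d] for sums, _ in summaries) / total
                 for d in range(len(old))]
            )
    return updated


def distributed_kmeans(nodes, centroids, rounds=10):
    """Repeat: each node summarizes locally, then summaries are merged."""
    for _ in range(rounds):
        summaries = [local_summary(points, centroids) for points in nodes]
        centroids = merge(summaries, centroids)
    return centroids


if __name__ == "__main__":
    nodes = [[(0, 0), (0, 2)], [(10, 10), (10, 12)], [(0, 1), (10, 11)]]
    print(distributed_kmeans(nodes, [[0.0, 0.0], [10.0, 10.0]]))
```

Per round, each node sends one sum vector and one count per cluster, so the communication cost depends on the number of clusters and dimensions, not on the number of local records.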
Classification is the task of identifying the predefined group to which a new input belongs. Like clustering algorithms, classification algorithms were traditionally designed to work in centralized environments. To cope with the demands of big data, traditional classification algorithms have been modified to suit parallel computing environments. For example, one classification algorithm is designed to process data in two ways. This algorithm, known as “classify or send for classification”, lets each classifier either classify the input data itself or forward the data to another classifier. It is computationally efficient, particularly when handling large and complex data. In another novel classification algorithm, Rebentrost et al. proposed a quantum based support vector machine for big data classification. This method reduces the computational complexity and the amount of training data required. Its main limitation is the immaturity of quantum computing hardware. Even though the classification algorithms developed for big data can reach a reasonable level of performance, they do not pay much attention to data privacy. Agrawal et al. proposed a privacy preserving classification algorithm for discovering knowledge from data. The original data are altered by adding random offsets. The Bayesian formula is then used to derive the density function of the original data in order to reconstruct the decision tree. The major problem with this method is that it is only suitable for centralized data. Another privacy preserving data mining algorithm uses random reconstruction techniques. The random operation in the algorithm protects the privacy of the original data via data scrambling. However, this method is also not suitable for diverse data. Unlike these two methods, a further privacy preserving method has been proposed for distributed databases. 
In this work, a random perturbation matrix is used to protect the privacy of the data. Due to the nature of the algorithm, it requires the reconstruction of the original data set from the altered data set, which significantly reduces the accuracy of the algorithm. To improve the accuracy, a later algorithm uses a single-attribute data random matrix. This matrix is used to slightly modify the data, and the reconstruction of the original data set is improved by the use of a multi-attribute joint distribution matrix. This method improves the accuracy at the expense of privacy. Taking advantage of the multi-attribute joint distribution matrix, Zhang and Bi proposed a privacy preserving method for classification with slightly improved accuracy and privacy, but this method is incapable of handling large and complex data.
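The perturb-then-reconstruct pattern behind these methods can be sketched for the simplest possible case, a single binary attribute. This is an illustrative randomized response scheme, not the algorithm of any of the works discussed above: each bit is kept with probability `keep_prob` and flipped otherwise, which corresponds to the 2×2 perturbation matrix [[p, 1−p], [1−p, p]], and the original proportion is recovered by inverting that matrix. The function names and parameter values are assumptions.

```python
import random


def perturb(bits, keep_prob, rng):
    """Keep each bit with probability keep_prob, otherwise flip it."""
    return [b if rng.random() < keep_prob else 1 - b for b in bits]


def reconstruct_fraction(perturbed, keep_prob):
    """Estimate the original fraction of ones from perturbed bits.

    If pi is the true fraction, the expected observed fraction is
    keep_prob * pi + (1 - keep_prob) * (1 - pi); solve that for pi.
    """
    if keep_prob == 0.5:
        raise ValueError("keep_prob = 0.5 destroys all information")
    observed = sum(perturbed) / len(perturbed)
    return (observed - (1 - keep_prob)) / (2 * keep_prob - 1)


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed only to make the demo repeatable
    original = [1] * 6000 + [0] * 14000  # true fraction of ones is 0.3
    released = perturb(original, keep_prob=0.8, rng=rng)
    print(reconstruct_fraction(released, keep_prob=0.8))
```

The trade-off noted in the text is visible in `keep_prob`: values near 1 give accurate reconstruction but little protection for individual records, while values near 0.5 protect individuals but make the estimate noisy.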
While clustering and classification try to group the input data, association rules are designed to find the important relationships or patterns in the input data. Finding relationships in large data sets has been studied for many years. In the early days, tree structures such as the FP-tree were used to find patterns. These early algorithms are not suitable for large and diverse data, because such data are processed with parallel computing and cloud computing technologies. To handle large and complex data efficiently, several methods have been developed using map-reduce. The map-reduce concept is ideally suited to cloud based association rule finding algorithms. However, these map-reduce based association rule mining methods do not consider the privacy of the input data. Protecting privacy in association rule mining means preventing sensitive information from being mined. For example, privacy can be preserved by distorting the original data. In one approach, the data are distorted in such a way that the distorted data can be used to generate an approximation of the original data distribution without exposing the values in the original data. In this approach, the level of privacy is relatively low. Thus, to enhance privacy, tougher conditions have been imposed to reduce the privacy leakage. More recently, privacy protection techniques were applied to Boolean association rules. As in other methods, the original data are distorted in these works. In some methods, cryptographic techniques are used to construct the decision trees, and privacy-preserving data mining is treated as a part of secure multi-party computation. Although these methods achieve some level of privacy and accuracy, they are not fully capable of handling large and complex data.
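To make the map-reduce formulation concrete, the following single-process sketch counts itemset support over partitioned transactions. It only mimics the map and reduce phases with ordinary functions and does not use any real map-reduce framework; it also includes no privacy protection, which is exactly the limitation the text points out. The function names, the `max_size` limit, and the sample baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_partition(transactions, max_size=2):
    """Map phase: count itemsets of up to max_size items in one partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return counts


def reduce_counts(partials):
    """Reduce phase: add up the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def frequent_itemsets(partitions, min_support, max_size=2):
    """Itemsets whose relative support is at least min_support."""
    total = reduce_counts(map_partition(p, max_size) for p in partitions)
    n = sum(len(p) for p in partitions)
    return {
        itemset: count / n
        for itemset, count in total.items()
        if count / n >= min_support
    }


if __name__ == "__main__":
    partitions = [
        [["bread", "milk"], ["bread", "butter"]],
        [["bread", "milk", "butter"], ["milk"]],
    ]
    for itemset, support in sorted(frequent_itemsets(partitions, 0.5).items()):
        print(itemset, support)
```

Because each partition is counted independently and the counts are simply added, the map phase can run on separate machines; the price is that every intermediate count is visible to whoever runs the reduce phase.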
The amount of data is growing every day, and it is impossible to imagine next generation applications without producing and executing data driven algorithms. In this paper, we have conducted a comprehensive survey of the privacy issues that arise when dealing with big data. We have investigated the privacy challenges in each phase of the big data life cycle and discussed some advantages and disadvantages of existing privacy preserving technologies in the context of big data applications. Much work has been done to preserve the privacy of users from data generation to data processing, but several open issues and challenges remain. In this section, we discuss a few future research directions for big data privacy.
To ensure that data are accessible only by authorized users, and to transfer data securely end to end, access control methods and encryption techniques such as IBE, ABE, and PRE are used. The main problem with encrypting large datasets using existing techniques is that the whole dataset must be retrieved or decrypted before further operations can be performed. These techniques do not allow data owners to easily perform fine grained actions such as sharing records for data analytics. Techniques such as PRE have solved this problem to some extent. However, to obtain value from the data, the data sometimes need to be shared multiple times with different companies. As different companies have different cryptographic keys, the data need to be decrypted and then re-encrypted, which not only incurs computational overhead but also creates a possibility of data leakage. To solve these kinds of problems, we need encryption techniques that allow data sharing between different parties without a decryption and re-encryption process.
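As a rough illustration of how proxy re-encryption avoids the decrypt-then-re-encrypt cycle, the sketch below implements a toy ElGamal-style scheme in which a proxy transforms a ciphertext for one key into a ciphertext for another without ever seeing the plaintext. It is an assumption-laden teaching example, not a scheme taken from the works surveyed here: the group parameters are far too small to be secure, messages are limited to integers between 1 and 466, and the re-encryption key is derived from both secret keys for simplicity. All names are hypothetical, and Python 3.8 or later is assumed for modular inverses via `pow`.

```python
import secrets

# Toy parameters (insecure, illustration only): P = 2*Q + 1 with P, Q prime.
P = 467
Q = 233
G = 4  # generates the subgroup of order Q modulo P


def keygen():
    """Return (secret key, public key) with secret key in [1, Q-1]."""
    sk = secrets.randbelow(Q - 1) + 1
    return sk, pow(G, sk, P)


def encrypt(pk, message):
    """Encrypt an integer message in [1, P-1] under public key pk."""
    r = secrets.randbelow(Q - 1) + 1
    return (message * pow(G, r, P)) % P, pow(pk, r, P)


def rekey(sk_from, sk_to):
    """Re-encryption key sk_to / sk_from (mod Q); needs both secrets here."""
    return (sk_to * pow(sk_from, -1, Q)) % Q


def reencrypt(rk, ciphertext):
    """Proxy step: retarget the ciphertext without decrypting it."""
    c1, c2 = ciphertext
    return c1, pow(c2, rk, P)


def decrypt(sk, ciphertext):
    """Recover the message with the secret key the ciphertext targets."""
    c1, c2 = ciphertext
    g_r = pow(c2, pow(sk, -1, Q), P)  # strips the key from the exponent
    return (c1 * pow(g_r, -1, P)) % P


if __name__ == "__main__":
    sk_a, pk_a = keygen()
    sk_b, pk_b = keygen()
    ct_a = encrypt(pk_a, 123)
    ct_b = reencrypt(rekey(sk_a, sk_b), ct_a)
    print(decrypt(sk_a, ct_a), decrypt(sk_b, ct_b))
```

The proxy in `reencrypt` only raises one ciphertext component to a power, so the record itself is never exposed in the clear during sharing. A realistic deployment would need standardized groups, authenticated keys, and a way to issue re-encryption keys without pooling secret keys.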
Data are anonymized by removing personal details to preserve the privacy of users. This implies that an individual cannot be identified from the anonymized data alone. However, due to the availability of huge volumes of data and powerful data analytic tools, the existing anonymization techniques are becoming increasingly ineffective. In big data scenarios, anonymization needs to be more than just masking or generalizing certain fields. One needs to carefully analyse whether the anonymized data are vulnerable to any attacks. For that, we need to study different attack models and information loss metrics for big data anonymization. Moreover, most of the existing anonymization techniques are designed for static data, while much real-world data are dynamic. Thus, we need to propose new privacy and utility metrics. Furthermore, data anonymization is a cumbersome process, and it needs to be automated to cope with the growing 3 V’s.
As our personal data are gradually collected and stored on centralized cloud servers over time, we need to understand the associated privacy risks. The concept of centralized collection and storage of personal data should be challenged. In centralized storage, a single point of failure can mean the loss of all the data. One flaw or one breach in privacy can lead to devastating consequences, and such breaches are happening more frequently as attack methods become more sophisticated. Instead of centralizing all the computation, we can bring the computation to intelligent agents running on our own personal devices. Using such schemes, business models can still be profitable, and we can regain our privacy by hosting our data in personal encrypted clouds. Some researchers strongly advocate adopting decentralized storage. Some work has been done in projects such as OwnCloud and the IndieWeb. To adopt this distributed view of data, we need algorithms that are capable of working over extremely distributed data and of building models that learn in a big data context.
Machine learning and data mining should be adapted to unleash the full potential of collected data. Nowadays, machine learning techniques, together with the improvement of computational power (e.g., cloud computing), have come to play a vital role in big data analytics. They are employed widely to leverage the predictive power of big data. For example, the predictive power of big data is extensively used in medical science and astronomy. Most of these computations are done by third party resources on private data, which can pose a threat to the privacy of users. To protect privacy, machine learning algorithms such as classification, clustering and association rule mining need to be deployed in a privacy preserving way.
Sometimes the data owned by an organization (e.g., hospitals) does not have sufficient information to discover useful knowledge in that domain, and acquiring that data may be costly or difficult due to legal constraints and fear of privacy violation. To solve such problems, we need to design privacy preserving distributed analytic systems which are able to process different datasets from different organizations while preserving the privacy of each dataset.
Secure multiparty computation techniques such as homomorphic encryption can be deployed to solve such issues. The main challenge in deploying homomorphic encryption in the context of big data analytics is to keep the computational complexity as low as possible.
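A small example may help show what computing on encrypted data looks like and where the cost comes from. The sketch below is a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. It is offered as an assumption-based illustration only. The primes are tiny and insecure, the hospital scenario and all names are hypothetical, and Python 3.8 or later is assumed for modular inverses via `pow`.

```python
import math
import secrets

# Toy key material (insecure, illustration only).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1)
MU = pow(LAM, -1, N)  # inverse of LAM modulo N


def encrypt(m):
    """Encrypt an integer 0 <= m < N using generator N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    # (N + 1)^m mod N^2 equals 1 + m*N, which avoids one exponentiation.
    return ((1 + m * N) * pow(r, N, N2)) % N2


def add_encrypted(c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % N2


def decrypt(c):
    """Recover the plaintext with the private values LAM and MU."""
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N


if __name__ == "__main__":
    # Hypothetical scenario: three hospitals report encrypted case counts,
    # and an aggregator sums them without seeing any individual count.
    reports = [encrypt(count) for count in (120, 75, 310)]
    total = reports[0]
    for report in reports[1:]:
        total = add_encrypted(total, report)
    print(decrypt(total))
```

Every encryption and decryption requires a modular exponentiation with an exponent as large as the modulus, and realistic key sizes use moduli of thousands of bits, which is why keeping the computational complexity low is the central challenge for large datasets.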
Corresponding author: Y. Xiang