Secure Data Storage and Sharing Techniques for Data Protection in Cloud Environments: A Systematic Review, Analysis, and Future Directions

A large number of researchers, academic institutions, government sectors, and business enterprises are adopting the cloud environment because of its low upfront capital investment, high scalability, and several other features. Despite these features, the cloud environment also faces several challenges. Data protection is the primary concern in the area of information security and cloud computing. Numerous solutions have been developed to address this challenge. However, a comprehensive analysis of the existing solutions is lacking, and there is a need to explore, classify, and analyze the significant existing work to investigate how well these solutions meet the requirements. This article presents a comparative and systematic study and an in-depth analysis of leading techniques for securely sharing and protecting data in the cloud environment. The discussion of each technique covers how it protects the data, the prominent solutions in the domain, and the core information about each solution, including workflow, achievements, scope, gaps, and future directions. Furthermore, a comprehensive and comparative analysis of the discussed techniques is presented. Afterward, the applicability of the techniques is discussed with respect to the requirements, and the research gaps along with future directions in the field are reported. The authors believe that this article's contribution will act as a catalyst for potential researchers to carry out research work in the area.

due to its several advantages such as on-demand service, scalability, reliability, elasticity, measured services, disaster recovery, accessibility, and many others [5]. Cloud computing is a paradigm that enables huge memory space and massive computation capacity at a low cost. It allows users to obtain the intended services across multiple platforms irrespective of location and time and consequently conveys an extensive convenience to the cloud users [6]. By migrating the local data management system into cloud storage and using cloud-based services, users can accomplish cost savings and productivity enhancements to manage projects and establish collaborations [7]. Therefore, individuals and organizations are shifting increasingly to the cloud for their multiple services [8]. With the growing expansion of cloud computing technologies, it is not difficult to imagine that almost all the businesses will be switched to the cloud in the foreseeable future [9].
Despite the multiple features offered by cloud computing, it encounters several impediments that may obstruct its fast growth if not tackled appropriately [10]. Consider a real implementation in which an enterprise permits its staff or departments to store and share data through the cloud. By exploiting the cloud, the enterprise can be completely released from the burden of maintaining and storing the data locally [11], [12]. Nevertheless, it also faces various security threats, which are the leading concerns of cloud users [13]. Firstly, outsourcing the data to cloud servers means that the data is out of the users' control, which causes discomfort to the users because the outsourced data may contain sensitive and valuable information. Secondly, data sharing is frequently carried out in a hostile and open environment, and the cloud server becomes a target of attacks. In the worst case, users' data may be revealed by the cloud server itself for illegal profit [14], [15]. Furthermore, the data needs to be shared among distinct relevant stakeholders, for instance, business partners, employees, and customers, inside or outside the organization's premises to improve the performance of the business. However, the recipient party can misuse this data and disclose it purposefully or inadvertently to some unauthorized third party [16], [17]. Fig. 1 represents a sharing environment where the data owners need to share the organization's valuable data to the cloud platform due to the limited storage and computational capacity of the enterprises and the multiple benefits of clouds. Furthermore, the cloud data is shared with multiple users according to their different requirements. However, the recipient party may leak the data after obtaining it. The data can be leaked by the involved parties or may be stolen by an unauthorized party through illegal access.
Data leakage or loss may pose a severe threat to the organization's confidentiality. It can diminish shareholder value, lower the firm's rank and status, and damage the enterprise's goodwill and reputation [18]. As data is an important asset of an organization, it is essential to keep this asset secure. There is therefore a need for solutions that can protect the data efficiently in the sharing environment.
A number of models for data protection in the cloud environment have been explored and developed for many applications. Typically, data protection is achieved through leakage prevention and leaker detection and this article concentrates on achieving efficient protection by preventing leakage and detecting the malicious entity responsible for leakage as depicted in Fig. 2. The major approaches for preventing data leakage are tailored by utilizing cryptography, access control mechanisms, and differential privacy with machine learning techniques while leaker detection is mainly achieved through watermarking and probabilistic techniques.

A. MOTIVATION
It was reported that 83% of organizational workloads had shifted to the cloud platform by 2020, a share that rose to 90% by 2021 [9], [19]. The cloud computing industry is forecast to grow at a 14.6% compound annual growth rate to become a $300 billion industry by 2022, up from $188 billion in 2018 [20], [21]. Additionally, the number of connected IoT devices is expected to reach 75 billion by 2025, a threefold increase from 2019. IoT is the future and everything will continue to become more connected through technology that uses cloud services [22]. The data sharing and on-demand cloud access features of cloud computing have significantly reduced the data management cost while increasing storage flexibility as well as capacity [23], [24]. Despite that, cloud computing also poses a crucial security threat to data confidentiality [25]. Specifically, cloud users cannot fully trust the Cloud Service Providers (CSPs) since the data files stored in the cloud may be confidential and sensitive [26]. Moreover, the data owners have serious concerns after sharing the data with the cloud due to the unavoidable loss of control over the data, which clears the way for unauthorized data access [27]. Therefore, the security and privacy of sensitive data have become a major concern for cloud users while using cloud computing services.
Also, the number of data leakage events, together with the cost incurred as a consequence of these leakages, continues to escalate and is a serious matter of concern [28], [29]. According to Risk Based Security's (RBS) report, almost 22 billion data records were disclosed within the single year 2021, a figure that is further expected to increase by 5% in 2022 [28]. The global average total cost of a data breach reached $4.24 million in 2021, as per the IBM annual security report conducted by the Ponemon Institute, which is the highest in the past 17 years [29]. Because of COVID-19, the average cost of a data breach increased by $1.07 million due to remote work [22]. As a consequence, the data leakage problem is increasing day by day and needs to be addressed. Thus, data protection has become a challenging task in the area of information security and cloud computing. There is a need for robust mechanisms that can address the existing problem effectively. The challenge can be significantly reduced by preventing data leakage and identifying the malicious entity that causes data leakage. Several approaches have been developed to protect data in a cloud environment. Although a number of substantial solutions have been presented to mitigate the existing challenges in the domain, there is a need for a systematic study of the existing solutions in order to determine their applicability to different applications. Motivated by the significance of, and the requirement for, a better understanding of the current trends for sharing data securely in cloud computing, we present this analysis. For this purpose, a broad study is conducted and presented in this manuscript, covering descriptions of the foremost data protection techniques, the relevant and prominent state-of-the-art solutions, their research gaps, and future directions along with subsequent feasible solutions.
The authors have first defined the general mechanism followed by provisioning in-depth detail and analysis of a particular technique with the aim of better understanding the concept and furnishing all the essential information conjointly for acquiring knowledge in the area. The relevant solutions of every individual technique, their merits, and scope are reported and further, analysis is performed to explore the relevancy of each technique as per the scenario. It is reckoned this article will contribute as a foundation for the emerging applications demanding data protection.

B. OUR CONTRIBUTION
The main contributions of the article are summarized as follows:
1) This work reviews the major and significant existing techniques for data protection through secure sharing in the cloud environment.
2) We provide the following details about each technique: (a) how it works for data protection and (b) the qualitative, outstanding, and primary solutions in the area. Furthermore, we present the valuable information about each discussed solution, such as its working, implementation environment, achievement, and scope, in tabular format so that the core of the method and its applications can be grasped easily.
3) A comparative and comprehensive analysis of the discussed techniques is performed and presented in a concise form. Furthermore, it is investigated which technique is best suited to given requirements.

C. ORGANIZATION OF THE PAPER
Sections II to VI analyze the cryptography, access control, differential privacy with machine learning, watermarking, and probability techniques individually. Each section provides the following for the designated technique: (A) the functioning of the technique for protecting data in the cloud environment, explained with the help of a block diagram; (B) the notable contributions, which identify the work done and the research gaps in the domain; and (C) the core information about every described solution, summarized in tabular form so that the necessary and sufficient details for carrying out further work can be grasped easily. Section VII presents a comparative analysis of the discussed techniques and shows which techniques are best suited to which circumstances. Finally, the conclusion of the analysis and future remarks are reported in Section VIII.

II. CRYPTOGRAPHY BASED MODELS
Let $E_\tau$ be the set of entities to be encrypted, and let $S_K$, $PB_K$, and $PV_K$ be the sets of secret, public, and private keys for encryption and decryption. The symmetric cryptography technique maps $e : E_\tau \times S_K \to E^*_\tau$ and $d : E^*_\tau \times S_K \to E_\tau$ such that $d(e(E_\tau, S_k), S_k) = E_\tau$, and the asymmetric cryptography technique maps $e : E_\tau \times PB_K \to E^*_\tau$ and $d : E^*_\tau \times PV_K \to E_\tau$ such that $d(e(E_\tau, PB_k), PV_k) = E_\tau$.
The symmetric cryptography technique $(E_\tau, E^*_\tau, S_K, e, d)$ consists of three functions, defined as follows. The key generator function $K_{gen}(CG)$, as shown in Eq. (1), generates a key $S_k$ for the given security factor $S_F$.
The encryption function $e : E_\tau \times S_k \to E^*_\tau$ takes the original entity $E_\tau$ and the key $S_k$ as input and generates an encrypted entity $E^*_\tau$, as given in Eq. (2).
The decryption function $d : E^*_\tau \times S_k \to E_\tau$ takes the encrypted entity $E^*_\tau$ and the key $S_k$ as input and generates the original entity $E_\tau$ as output, as depicted in Eq. (3).
The asymmetric cryptography technique $(E_\tau, E^*_\tau, PB_K, PV_K, e, d)$ consists of three functions, defined as follows. The key generator function $K_{gen}(CG)$, given in Eq. (4), generates the keys $PB_k$ and $PV_k$ for the given security factor $S_F$.
The encryption function $e : E_\tau \times PB_k \to E^*_\tau$ takes the original entity $E_\tau$ and the key $PB_k$ as input and generates an encrypted entity $E^*_\tau$, as depicted in Eq. (5).
The decryption function $d : E^*_\tau \times PV_k \to E_\tau$ takes the encrypted entity $E^*_\tau$ and the key $PV_k$ as input and generates the original entity $E_\tau$ as output, as shown in Eq. (6).
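The two tuples can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not a construction taken from the surveyed works: the symmetric part is a toy stream cipher (XOR with a SHA-256 keystream), the asymmetric part is textbook RSA with tiny primes, and the function names are hypothetical. Neither part is secure; the sketch only shows the shape of the key generator, the encryption function $e$, and the decryption function $d$, and that $d$ inverts $e$.

```python
import hashlib
import secrets


def k_gen_symmetric(security_factor: int) -> bytes:
    """K_gen: return a random secret key S_k of `security_factor` bytes."""
    return secrets.token_bytes(security_factor)


def _keystream(key: bytes, length: int) -> bytes:
    """Expand the key into `length` bytes using SHA-256 over a counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def e_sym(entity: bytes, key: bytes) -> bytes:
    """e : E x S_k -> E*. XOR the entity with the keystream (toy cipher)."""
    return bytes(a ^ b for a, b in zip(entity, _keystream(key, len(entity))))


def d_sym(cipher: bytes, key: bytes) -> bytes:
    """d : E* x S_k -> E. XOR with the same keystream undoes e_sym."""
    return e_sym(cipher, key)


def k_gen_asymmetric(p: int = 61, q: int = 53, e: int = 17):
    """K_gen: textbook RSA key pair (PB_k, PV_k) from tiny illustrative primes."""
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse, requires Python 3.8+
    return (e, n), (d, n)


def e_asym(m: int, pb_k) -> int:
    """e : E x PB_k -> E*. Encrypt an integer m < n with the public key."""
    exponent, n = pb_k
    return pow(m, exponent, n)


def d_asym(c: int, pv_k) -> int:
    """d : E* x PV_k -> E. Decrypt with the private key."""
    exponent, n = pv_k
    return pow(c, exponent, n)


s_k = k_gen_symmetric(16)
assert d_sym(e_sym(b"payroll.csv", s_k), s_k) == b"payroll.csv"

pb_k, pv_k = k_gen_asymmetric()
assert d_asym(e_asym(65, pb_k), pv_k) == 65
```

In practice, a vetted library and standard algorithms (for example AES-GCM and RSA with padding, as used by some of the schemes discussed in this section) would replace both toy constructions.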
The building block of the cryptography technique is demonstrated in the accompanying figure.
Kao et al. [30] presented a user-centric key management scheme named uCloud to protect the cloud. In uCloud, the users' data is indirectly encrypted through RSA by utilizing the users' public keys. The users' private keys are stored on the users' mobile devices instead of on their PCs or servers. Furthermore, two-dimensional (2D) barcode images are used to represent the users' private keys, which are then employed for the decryption of the users' sensitive data. Al-Haj et al. [31] provided two crypto-based algorithms to provide confidentiality, integrity, and authenticity for the data. They introduced a cryptographic function that uses the hash code and symmetric keys to protect the data. Integrity and authenticity are provided by applying the elliptic curve digital signature algorithm. Additionally, the Advanced Encryption Standard in Galois/Counter Mode is used with the Whirlpool hash function to support authenticity and confidentiality.
Liang et al. suggested a Ciphertext-Policy Attribute-Based Proxy Re-Encryption scheme for the secure sharing of cloud data [32]. The scheme enhances the re-encryption and re-encryption key generation phases, which minimizes the communication and computational cost. A data owner is authorized in the scheme to assign the access rights of the encrypted data stored on a cloud system to others. A file hierarchy attribute-based encryption scheme is proposed by Wang et al. in [15] for securing data in the cloud environment. This scheme uses a layered access structure model to solve the problem of sharing multiple hierarchical files and demonstrates that the file hierarchy ciphertext-policy attribute-based encryption (FH-CP-ABE) scheme can effectively resist chosen-plaintext attacks (CPA) under the Decisional Bilinear Diffie-Hellman (DBDH) assumption. The results showed that the storage cost and the computational complexity of encryption and decryption are lower than those of CP-ABE. The disadvantage of this scheme is that the computation cost is increased dynamically when the common attributes and an integrated ciphertext are desirable to be computed only once by the data owner.
Liu et al. [33] proposed a fair data access control scheme for cloud storage. In the scheme, a fair key reconstruction is performed to control access to the shared data, and none of the users exchange their shares. A large number of fake keys are generated in the proposed scheme to obfuscate the decryption key of the shared data. Theoretical analysis of this scheme showed that all the shares are always contributed by their corresponding users, which enables them to reconstruct the fair decryption key each time. Moreover, the performance evaluation demonstrated that the computation delay and communication costs are reduced, but the authentication scheme was not efficient. A CP-ABE scheme is proposed by Liu et al. in [34] to reduce the heavy decryption cost at the user end, which increases with the complexity of the access policy. This system supports decryption outsourcing, attribute revocation, and policy updating when the attributes of a user change. Rigorous tests were carried out to analyze the performance of the proposed scheme in terms of storage overhead and processing power; however, the scheme is lacking in terms of privacy protection.
For mobile cloud computing, a lightweight data sharing scheme (LDSS) is proposed by Li et al. [6]. LDSS adopts the CP-ABE scheme and modifies the structure of the access control tree to make the mechanism suitable for mobile cloud environments. A large portion of the computation is moved from mobile devices to external proxy servers in this scheme. As a result, the overhead on the mobile device is reduced when users share data in mobile cloud environments. Zaghloul et al. proposed a Privilege-based Multilevel Organizational Data-sharing (P-MOD) scheme in [2]. In P-MOD, the attribute-based encryption mechanism is strengthened by incorporating a privilege-based access structure so that big data sets can be shared and managed effectively. The experimental analysis demonstrated that P-MOD is more efficient than both the CP-ABE [35] and FH-CP-ABE [15] schemes at encryption, decryption, and key generation for a hierarchical organization with many levels. Also, the cumulative total of operations is lower in the P-MOD scheme than in the hierarchical schemes HABE [36], [37] and FH-CP-ABE [15].
Li et al. [8] presented an efficient CP-ABE scheme based on a Linear Secret Sharing Scheme (LSSS) matrix access structure to update files dynamically and improve the efficiency of policy handling in the cloud environment. The objective of the scheme is to resist chosen-plaintext attacks (CPA) and to reduce the storage consumption of the proxy cloud service provider (PCSP), the communication expense, and the computing cost of the data owner. The theoretical analysis and experimental simulation of the proposed scheme showed that it outperformed Policy Update CP-ABE [38] in handling policy changes and file updates. To ensure data confidentiality and protect the personal privacy of the user, a privacy-preserving hidden access policy CP-ABE (HP-CP-ABE) scheme with efficient authority verification is proposed by Zhang et al. [13]. In this approach, an authority detection mechanism is designed to verify the authorized user and complete the decryption process. The scheme obtains a private key of constant size, which is independent of the number of the user's attributes. Though transmission and storage costs are decreased by this approach, it is regarded as a weak security model because it supports the AND policy only. A summary of the relevant models based on the cryptography technique, including their key details, is presented in Table 1.
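Several of the schemes above are built on CP-ABE, in which a ciphertext carries an access policy and a user can decrypt only if the user's attributes satisfy that policy. The following Python sketch models only this policy-satisfaction check, with no cryptography at all. The threshold-gate representation, the helper names `AND` and `OR`, and the attribute strings are hypothetical choices for illustration and do not reproduce the access structure of any cited scheme.

```python
def satisfies(policy, attributes: set) -> bool:
    """Return True if the attribute set satisfies the policy tree.

    A policy is either a leaf (an attribute name) or a threshold gate
    written as (k, [children]), satisfied when at least k children are.
    """
    if isinstance(policy, str):
        return policy in attributes
    threshold, children = policy
    satisfied = sum(1 for child in children if satisfies(child, attributes))
    return satisfied >= threshold


def AND(*children):
    """All children must be satisfied (an n-of-n threshold gate)."""
    return (len(children), list(children))


def OR(*children):
    """At least one child must be satisfied (a 1-of-n threshold gate)."""
    return (1, list(children))


# Hypothetical policy: finance staff who are either managers or auditors.
policy = AND("dept:finance", OR("role:manager", "role:auditor"))

print(satisfies(policy, {"dept:finance", "role:auditor"}))  # True
print(satisfies(policy, {"dept:finance", "role:intern"}))   # False
```

A scheme restricted to AND policies, as noted for HP-CP-ABE, corresponds to trees built from `AND` gates only, which explains why such a scheme is less expressive than one supporting general threshold gates.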

III. ACCESS CONTROL BASED MODELS
The Access Control Mechanism (ACM) allows controlled exposure of confidential data to authorized entities based on data type, user type, user privileges, and permissions. An Access Control Policy (ACP) is defined for data distribution among users. An ACP consists of a tuple $(D, U, G)$, where $D$ refers to a set of data objects $D_1, D_2, \ldots, D_n$ to be distributed, $U$ denotes a set of users $U_1, U_2, \ldots, U_m$, and $G$ is an expression or a set of expressions that decides which $D_i$ can be accessed by, or allocated to, which $U_j$. The ACP can vary depending upon the situation and application.
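The tuple $(D, U, G)$ can be made concrete with a small Python sketch. The class names, the `clearance` and `secrecy` fields, and the specific rule used for $G$ are assumptions introduced for illustration only; the text above leaves $G$ open, and real policies are typically richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    clearance: int  # hypothetical privilege level


@dataclass(frozen=True)
class DataObject:
    name: str
    secrecy: int  # hypothetical data secrecy level


def g(user: User, obj: DataObject) -> bool:
    """Hypothetical G: a user may access objects at or below their clearance."""
    return user.clearance >= obj.secrecy


def apply_acp(data, users, rule=g):
    """Evaluate the ACP (D, U, G): map each user to the data objects they may access."""
    return {u.name: [d.name for d in data if rule(u, d)] for u in users}


D = [DataObject(f"D{i}", secrecy) for i, secrecy in enumerate([1, 1, 2, 2, 3, 3], start=1)]
U = [User("U1", 3), User("U2", 2), User("U3", 1)]

for name, allowed in apply_acp(D, U).items():
    print(name, allowed)
```

Passing a different predicate as `rule` changes the policy without touching the data or user sets, which mirrors the remark that the ACP varies with the situation and application.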
The ACM provides information flow control and is suitable for any organization if access rights and data classification are properly established. Without a proper definition of access rights, it cannot be decided whether or not the data $D$ is being accessed by a legitimate $U_j$. For an effective ACM, it is important to be able to distinguish between $U_1, U_2, \ldots, U_m$ based on their type, privileges, and permissions. User privileges and data secrecy levels must be predefined for the mechanism to work properly. Access is normally granted to a $U_j$ whose credentials meet the organization's policy. Fig. 4 represents a conventional model of the access control mechanism. Three users $U_1, U_2, U_3$ send requests through the internet for the six documents $D_1, D_2, \ldots, D_6$. An access control policy is applied based on the users' attributes, data attributes, and other essential factors, and the subset of data for which the users $U_j$ qualify is transferred among them.
Nabeel and Bertino proposed a privacy-preserving policy-based content sharing scheme in public clouds [54]. The approach utilizes a privacy-preserving attribute-based key management scheme that protects the privacy of users while enforcing attribute-based ACPs. The data owner performs coarse-grained encryption, whereas the cloud performs fine-grained encryption on top of the owner-encrypted data, which minimizes the overhead at the data owner while assuring data confidentiality from the cloud. For dynamic members in the cloud, a secure data sharing scheme is presented in [27]. The users can securely obtain their private keys because their public keys are verified. To secure the scheme against collusion attacks, revoked users cannot obtain the original data even if they conspire with the untrusted cloud. To support dynamic groups, existing users do not need to update their private keys when a new user joins or a user is revoked from the group.
A threshold multi-authority CP-ABE access control scheme, TMACS, is provided in [23] for public cloud storage, in which multiple authorities jointly manage a uniform attribute set. A combination of the traditional multi-authority scheme and the TMACS scheme is employed to handle the attribute set and to achieve security and system-level robustness; in this hybrid, attributes come from different authority sets, and the multiple authorities in an authority set jointly maintain a subset of the whole attribute set.
A hierarchical access control system is designed in [17] that provides inheritance of authorization to reduce the burden and risk of relying on a single authority. The scheme adopts CP-ABE with constant-size ciphertext to remove the linear dependency of ciphertext size on the number of attributes, and it keeps the ciphertext size and the computation of encryption and decryption constant, which reduces the extra overhead of storage space, data transmission, and computation. Ali et al. [55] proposed a security scheme for data outsourced to the cloud (DaSCE) that provides (a) key management, (b) access control, and (c) assured file deletion. The scheme utilizes Shamir's threshold scheme to manage the keys. Access control is enforced on both data and keys through the validity of policies and through mutual authentication between the client and the key managers, and between the client and the cloud. Assured deletion is based on policies associated with the data file uploaded to the cloud.
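Shamir's threshold scheme, mentioned above for key management, splits a secret into $n$ shares so that any $k$ of them reconstruct it while fewer than $k$ reveal nothing. The Python sketch below shows the general technique only; it is not the DaSCE implementation, and the field prime, function names, and parameters are assumptions chosen for illustration.

```python
import secrets

# Field modulus (assumption): the Mersenne prime 2^127 - 1.
PRIME = 2**127 - 1


def split(secret: int, k: int, n: int):
    """Split `secret` into n shares with threshold k.

    A random polynomial of degree k - 1 with constant term `secret`
    is evaluated at x = 1..n; each (x, f(x)) pair is one share.
    """
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def f(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]


def reconstruct(shares) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = 123456789
shares = split(key, k=3, n=5)
assert reconstruct(shares[:3]) == key   # any 3 shares suffice
assert reconstruct(shares[2:]) == key
```

Distributing the shares across several key managers means that no single manager holds the key, which is the property a threshold-based key management design relies on.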
Almutairi et al. [56] presented virtual resource management methodologies for a cloud environment by designing Role-Based Access Control (RBAC) policy that minimizes the threat of data exposure. The concept of sensitivity is utilized in multi-tenant data centers in terms of the degree of data sharing among tenants. Limited sharing implies a high sensitivity data center and high sharing of data means a low sensitivity data center. Xu et al. [26] proposed a fine-grained access control and data sharing scheme for dynamic user groups and on-demand services by 1) defining and enforcing access policies based on the data attributes; 2) permitting the key generation center to update user credentials, and 3) allowing computation tasks to be performed by untrusted CSPs without requiring any delegation key.
A time and attribute factors combined access control on time-sensitive data for public cloud storage (TAFC) method is proposed in [57] by embedding Timed-Release Encryption (TRE) into Ciphertext-Policy Attribute-based Encryption (CP-ABE). This scheme provides data owners with the capability to flexibly release the access privilege to different users at different times according to a well-defined access policy over attributes and release time. Table 2 outlines the considerable models relying on access control involving the vital descriptions.

IV. DIFFERENTIAL PRIVACY WITH MACHINE LEARNING BASED MODELS
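A mechanism $M_N : D \to \mathrm{Range}(M_N)$ satisfies $\epsilon$-differential privacy if, for any possible output $O_P \in \mathrm{Range}(M_N)$ and every pair $D_i, D'_i \in D$ differing in only one record, the condition depicted in Eq. (7) holds, where $P_b$ denotes the probability and $e$ signifies the exponential.

$$P_b[M_N(D_i) = O_P] \le e^{\epsilon} \cdot P_b[M_N(D'_i) = O_P] \qquad (7)$$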
Differential privacy with machine learning aims to protect sensitive information by making the outputs of queries on data sets that differ in at most one record indistinguishable. $\epsilon$-differential privacy is a popular approach to privacy protection for machine learning algorithms on data sets, where $\epsilon > 0$ is a real-valued, predefined privacy parameter. It controls how much information about an individual's data is disclosed through statistical analysis and computation. The smaller the value of $\epsilon$, the stronger the privacy protection. The main idea of $\epsilon$-differential privacy in machine learning is to learn a simple rule automatically from the distributional information of the data set at hand without revealing too much about any single individual in the data set. Fig. 5 depicts a conventional example of privacy-preserving machine learning. The documents $D_1, D_2, \ldots, D_n$ of various types are first made private through $\epsilon$-differential privacy, in which statistical noise is embedded in the documents to preserve their privacy. Afterward, computation is performed over $D_1, D_2, \ldots, D_n$ through machine learning, which classifies these documents into their appropriate categories $\{A, B, C, D\}$.
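A common way to achieve $\epsilon$-differential privacy for a numeric query is the Laplace mechanism, which adds noise with scale equal to the query's sensitivity divided by $\epsilon$. The Python sketch below applies it to a counting query, whose sensitivity is 1 because adding or removing one record changes the count by at most 1. The function names and the sample data are assumptions for illustration; the surveyed works may use different mechanisms.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon: float, rng=None) -> float:
    """Counting query released under epsilon-differential privacy."""
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 29, 52, 38, 61, 47]
rng = random.Random(7)  # fixed seed only to make the demo repeatable
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(ages, lambda a: a >= 40, eps, rng)
    print(f"epsilon={eps}: noisy count = {noisy:.2f} (true count = 4)")
```

Smaller values of $\epsilon$ give a larger noise scale and therefore noisier answers, which is the privacy and utility trade-off described above.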
Let $E_\tau$ be the set of entities to which differential privacy is applied, and let $N_G \in \mathbb{R}$ be the generated noise that has to be embedded within the documents. The differential privacy technique maps $\ast_e : E_\tau \times N_G \to E^*_\tau$ and $\ast_d : E^*_\tau \to N_G$. The differential privacy technique can be represented as a tuple $(E_\tau, E^*_\tau, N_G)$ comprising three functions, delineated as follows. The noise generator function $N_{gen}(DP)$ generates the noise $N_G$ for the given security factor $S_F$, as shown in Eq. (8).
The noise embedding function $\ast_e : E_\tau \times N_G \to E^*_\tau$ takes the original entity $E_\tau$ and the generated noise $N_G$ as input and generates a noised entity $E^*_\tau$, as depicted in Eq. (9).
The noise extraction function $\ast_d : E^*_\tau \to N_G$ takes the noised entity $E^*_\tau$ as input and extracts the embedded noise $N_G$ as output, as given in Eq. (10).
Yonetani et al. developed a Doubly Permuted Homomorphic Encryption (DPHE) based privacy-preserving mechanism [65] that enables a multi-party protected scalar product and reduces the high computational cost. The experimental evaluation showed that the method achieves better performance than state-of-the-art visual recognition approaches. The major disadvantage of DPHE is that it supports only one operation at a time, i.e., either multiplication or addition. Hesamifard et al. [66] proposed a framework named CryptoDL that provides techniques for running deep neural network algorithms over encrypted data. They developed a theoretical basis for the implementation of deep neural network algorithms in the encrypted domain. Additionally, a neural network technique is established within the practical limitations of current homomorphic encryption schemes. Although the scheme operates adequately for securing private data, it does not address the requirement of protecting private data through multiple keys from individual data owners.
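The limitation of supporting a single operation can be seen in a classic additively homomorphic cryptosystem. The sketch below uses a toy Paillier construction, swapped in purely for illustration: it is not DPHE and not the scheme used in CryptoDL, the primes are far too small to be secure, and the function names are hypothetical. Multiplying two ciphertexts yields an encryption of the sum of the plaintexts, while no comparable operation on two ciphertexts yields their product.

```python
import math
import secrets


def paillier_keygen(p: int = 61, q: int = 53):
    """Toy Paillier keys with generator g = n + 1 (tiny primes, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of lambda mod n, valid for g = n + 1
    return n, (lam, mu)


def encrypt(m: int, n: int) -> int:
    """Encrypt 0 <= m < n as (n + 1)^m * r^n mod n^2 with random r."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(c: int, n: int, private_key) -> int:
    """Decrypt with L(c^lambda mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(c1: int, c2: int, n: int) -> int:
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)


n, private_key = paillier_keygen()
c_sum = add_encrypted(encrypt(20, n), encrypt(22, n), n)
print(decrypt(c_sum, n, private_key))  # 42
```

Fully homomorphic schemes remove this restriction by supporting both addition and multiplication on ciphertexts, typically at a much higher computational cost.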
Li et al. [67] introduced a privacy-preserving outsourced classification in cloud computing (POCC) framework under various public keys. To assure the confidentiality of sensitive data without leakage, they applied a fully homomorphic encryption proxy technique. But the data owner and the storage servers are considered to lie in the equivalent trustworthy area despite the fact that the storage servers are completely trusted. However, in cloud computing, this assumption is no longer applicable because the data owner and the database servers are very likely to be in different domains. Li et al. [68] proposed a scheme that lets a classifier owner delegate to a remote server the provision of a privacy-preserving classification service for users. They designed efficient classification protocols for two concrete classifiers, i.e., Naive Bayes and hyperplane decision-based classifiers. The experiments were conducted on a LAN server over testing datasets from the UCI Machine Learning Repository. A drawback of this scheme is that it requires frequent interaction from the users when launching a classification query. Li et al. [69] proposed a Privacy-Preserving Machine Learning with Multiple Data Providers (PMLM) scheme to protect the privacy of the data sets. They used public-key encryption with a double decryption algorithm (DD-PKE) and $\epsilon$-differential privacy to encrypt the data sets of different data providers and the cloud, respectively. The experiments were conducted with diverse classical machine learning algorithms to show the performance of the protocol. However, the computational cost of the proposed solution is high as a consequence of its dependency upon integer factorization.
A scheme is developed by Gao et al. in [70] to prevent information disclosure under the substitution-then-comparison (STC) attack. They adopted a double-blinding strategy and designed a functional privacy-preserving classification mechanism for the Naive Bayes classifier to protect data privacy. Most of the computations are performed in the offline phase at the server to reduce the overhead of online computation and communication. However, their approach does not achieve privacy-preserving truth discovery. A data protection scheme for privacy-preserving Naive Bayes learning over data contributed by multiple providers is proposed by Li et al. [71]; it enables the training of a Naive Bayes classifier over a dataset that is provided jointly by different data owners. The training result achieves $\epsilon$-differential privacy, while the training does not break any owner's privacy. In this approach, collusions are allowed and adversaries have the ability to forge and manipulate the data.
Ma et al. [24] provided a Privacy-Preserving Deep Learning (PDLM) method for addressing the issue of training the model over the encrypted data under multiple keys. The proposed mechanism trains the model based on stochastic gradient descent (SGD) and performs the feedforward and back-propagation procedure based on an efficient privacy-preserving calculation toolkit. This scheme reduced the overhead of the storage and computational complexity. The experimental evaluation showed that the classification model offered very little accuracy and high computation cost. A discussion incorporating the significant information about the expressive models referring to differential privacy with machine learning is presented in Table 3.

V. WATERMARKING BASED MODELS
Let $E_\tau$ be the set of entities that have to be watermarked, $W_K$ the set of keys used for watermarking, and $W_M$ the set of all feasible watermarks, comprising the information that the owner wants to embed, such that $W_M \subseteq \{0, 1\}^+$. The watermarking technique then consists of an embedding map $\varphi_e : E_\tau \times W_M \rightarrow E_\tau$ and a detection map $\delta_d$. The symmetric watermarking technique $(E_\tau, E^*_\tau, W_K, W_M, W^*_M, \varphi_e, \delta_d)$ is composed of three functions, outlined as follows. The key generator function $K_{gen}(W_M)$ generates a key $W_k$ for the given security factor $S_F$, as shown in Eq. (11).
The watermark embedding function $\varphi_e : E_\tau \times W_k \times W_m \rightarrow E^*_\tau$ takes the original entity $E_\tau$, the key $W_k$, and the watermark $W_m$ as input and generates a watermarked entity $E^*_\tau$, as depicted in Eq. (12).
The watermark detection function $\delta_d : E^*_\tau \times W_k \times E_\tau \rightarrow W^*_m$ takes the possibly watermarked entity $E^*_\tau$, the key $W_k$, and the original entity $E_\tau$ as input and extracts the watermark $W^*_m$ as output, as given in Eq. (13).
A similarity function $\xi$, depicted in Eq. (14), is then required. It takes the two objects $W_m$ and $W^*_m$ to be compared and indicates whether or not they are identified as similar.
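The roles of the key generator, the embedding function, the detection function, and the similarity function can be illustrated with a toy least-significant-bit scheme over a list of integers. This is only a sketch under stated assumptions: the LSB embedding, the key-seeded position selection, the matching-fraction similarity test, and the extra `n_bits` argument of `detect` are all hypothetical choices made for illustration and are not taken from any of the surveyed schemes. The names `k_gen`, `embed`, `detect`, and `similar` stand in for $K_{gen}$, $\varphi_e$, $\delta_d$, and $\xi$.

```python
import random
import secrets

def k_gen(security_factor):
    """K_gen: draw a random key of `security_factor` bits."""
    return secrets.randbits(security_factor)

def _positions(key, n_items, n_bits):
    """Derive distinct embedding positions from the key (requires n_bits <= n_items)."""
    return random.Random(key).sample(range(n_items), n_bits)

def embed(entity, key, watermark):
    """phi_e: write each watermark bit into the LSB of a keyed position."""
    marked = list(entity)
    for pos, bit in zip(_positions(key, len(entity), len(watermark)), watermark):
        marked[pos] = (marked[pos] & ~1) | bit
    return marked

def detect(marked, key, original, n_bits):
    """delta_d: read the LSBs back from the keyed positions.

    The original entity is accepted to mirror the signature of delta_d;
    this toy scheme only uses it for a length check.
    """
    if len(marked) != len(original):
        raise ValueError("entity lengths differ")
    return [marked[pos] & 1 for pos in _positions(key, len(marked), n_bits)]

def similar(a, b, threshold=0.9):
    """xi: True when the fraction of matching elements reaches the threshold."""
    if len(a) != len(b) or not a:
        return False
    return sum(x == y for x, y in zip(a, b)) / len(a) >= threshold

# Hypothetical usage with an 8-value entity and a 4-bit watermark.
key = k_gen(128)
entity = [52, 17, 203, 88, 91, 140, 33, 76]
watermark = [1, 0, 1, 1]
marked = embed(entity, key, watermark)
extracted = detect(marked, key, entity, len(watermark))
```

Because only least-significant bits change, each value moves by at most 1, while detection with the correct key recovers the embedded bits exactly. An LSB scheme of this kind is fragile and would not meet the robustness requirement of a practical watermark.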
A robust watermarking scheme should satisfy the following properties. The first requires that $\xi$ identify the original entity $E_\tau$ and the watermarked entity $E^*_\tau$ as similar. Robustness: For a watermarked entity $E^*_\tau = \varphi_e(E_\tau, W_k, W_m)$, where $E_\tau \in E_\tau$, $W_k \in W_K$, and $W_m \in W_M$, there does not exist any polynomial-time adversary that can compute an $E^{**}_\tau \in E_\tau$, given $E^*_\tau$ and $W_m$, such that $\xi$ identifies $E^*_\tau$ and $E^{**}_\tau$ as similar and $W^{**}_m = \delta_d(E^{**}_\tau, W_k, E_\tau)$, but $\xi$ does not identify $W_m$ and $W^{**}_m$ as similar. In other words, no adversary should be able to change or remove the watermark $W_m$ effectively without breaking the similarity, i.e., without rendering the entity unusable.
A technique is developed in [73] for fingerprinting relational data by extending the watermarking scheme given in [74]. A multi-bit watermark is combined with a collusion-resistant code. The scheme can embed arbitrary bit-string marks in the relations and detect them afterward. Quantitative models are presented for the robustness properties of the scheme, which demonstrate that the scheme is capable of detecting the embedded fingerprints under a wide range of attacks, including collusion attacks.
A scheme is provided in [75] for securely embedding an imperceptible watermark in relational data by framing the watermarking as a constrained optimization problem. For this purpose, pattern search (PS) techniques are employed in conjunction with genetic algorithms (GAs), and data partitioning and threshold-based techniques are presented. The watermarks are embedded repeatedly, and to enhance the watermark resilience, multiple attributes along with majority voting are utilized in the watermark decoding phase. The performance evaluation showed that the technique is resilient to tuple insertion, alteration, and deletion attacks, as well as to watermark synchronization errors, owing to a partitioning approach that does not require marker tuples.
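The benefit of repeated embedding with majority-vote decoding can be seen in a simplified form. The sketch below assumes a plain repetition code over watermark bits; it does not model the partitioning, the multiple attributes, or the optimization step of [75], and the function names and the attack pattern are hypothetical.

```python
def encode_repeated(bits, repeats):
    """Embed every watermark bit `repeats` times in a row."""
    return [b for b in bits for _ in range(repeats)]

def decode_majority(received, repeats):
    """Recover each bit by majority vote over its `repeats` copies."""
    if repeats % 2 == 0:
        raise ValueError("use an odd number of repeats to avoid ties")
    decoded = []
    for start in range(0, len(received), repeats):
        group = received[start:start + repeats]
        decoded.append(1 if 2 * sum(group) > len(group) else 0)
    return decoded

# Hypothetical attack: four of the fifteen embedded copies are flipped.
watermark = [1, 0, 1]
copies = encode_repeated(watermark, repeats=5)
for idx in (0, 7, 8, 12):
    copies[idx] ^= 1
recovered = decode_majority(copies, repeats=5)
```

In this run, no group of five copies has more than two flipped bits, so the vote restores every watermark bit. An attacker must corrupt a majority of the copies of a bit to change it.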
A mobile agent-based approach is developed in [76] to identify potential information leakage by automating the process of coloring and detecting the file systems of receptive hosts, as well as monitoring the colored file systems. The detection capabilities are modularized and conditionally deployed under the authority of a central control mechanism. The distributed reporting capability of mobile agent networks can support future analyses of information leakage. Kumar et al. [77] introduced a watermarking-based approach that utilizes the Bell-LaPadula model to ensure security by providing access control in the cloud environment. The approach embeds the client ID in the document whenever cloud data is shared among users. The guilty party is detected by extracting the embedded client ID from the discovered document. The model provides security against the data leakage problem and is cost-effective in terms of space and time. However, the scheme is inefficient in an environment where the data objects are frequently accessed by multiple users. A technique that uses the curvelet transform is presented in [78] to hide patient information in the ECG signal. The curvelet transform decomposes the ECG signal into frequency sub-bands. A quantization approach is used to embed patient data into the coefficients whose values are around zero in the high-frequency sub-bands. The experimental analysis showed that the proposed method performs better than a method that chooses random locations for the watermark.
A generic framework called Lineage In Malicious Environment (LIME), based on data lineage, is proposed by Backes et al. in [79] to protect data in a malicious environment through the identification of the culprit entity. In this scheme, data is shared among multiple entities that can be either owners or consumers. To preserve the data in the malicious environment, an accountable data transfer protocol is developed between the involved parties by utilizing a robust combination of watermarking, signature primitives, and oblivious transfer techniques. This method considers the probable data leakage and the associated impediments at the design stage. The execution times are measured for the distinct phases of the protocol, namely watermarking, detection, oblivious transfer, encryption, and signature creation. The framework is applied to the data leakage scenarios of social networks and data outsourcing. However, the proposed model cannot prevent unauthorized access to the data. The model can be extended to design data leakage detection mechanisms for different scenarios and types of documents. The work provides future guidance for designing a verifiable lineage protocol for derived data.
A solution that uses role- and attribute-based access control for data exchange among services, including services hosted in untrusted environments, is presented in [80] for privacy-preserving data exchange and for data leakage detection and prevention. The methodology employs Active Bundles (AB) that contain key-value pairs with values in encrypted form, metadata, access control policies, and a policy enforcement engine. The active bundle mechanism provides data integrity and confidentiality and protects the data from malicious or curious cloud administrators. The implementation demonstrated that the data leakage detection mechanism imposes a 60.8% performance overhead. Amini et al. [81] proposed a statistical watermark detector for color images based on the Hidden Markov Model (HMM) to legitimize and secure online image transactions with a high detection rate. The HMM is used to capture the inter-channel dependencies among the contourlet coefficients of the color images. The experimental results confirm the superiority of the method over state-of-the-art methods [82]-[86], including those based on Power-exponential [83], Cauchy [84], and Generalized Gaussian distributions [82].
An identity-based remote data integrity auditing approach is given by Shen et al. in [4] to preserve the integrity and protect the storage of sensitive information in the cloud. For this purpose, the integrity of files is verified through the use of signatures, and the cloud data is shared among multiple parties while hiding sensitive information. This method realizes remote data integrity auditing while allowing the files stored in the cloud to be shared and used by others, on the condition that the sensitive information remains hidden.
A Genetic Algorithm and Histogram Shifting Watermarking (GAHSW) based reversible database watermarking technique is proposed in [87] to maximize the robustness and minimize the distortion of a numerical relational database. In this approach, the watermark is embedded by applying a GA to select the best secret key for grouping the database. Compared with state-of-the-art approaches, GAHSW causes less distortion and improves the robustness of the watermark against malicious attacks while preserving data quality. However, this approach is only applicable to numerical databases. A secure and robust digital text watermarking technique is proposed in [88] to provide copyright protection for text documents in local and cloud computing paradigms with the help of data mining techniques. Data mining is applied to find suitable properties of the document for embedding the watermark. The proposed technique attained a high level of imperceptibility, where Peak Signal-to-Noise Ratio (PSNR) values are between 64.67% and 71.03% and the similarity (SIM) percentage is between 99.92% and 99.99%.
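Reversibility is the property that distinguishes histogram shifting from the other embedding methods discussed in this section. The sketch below shows the classic histogram shifting idea on a list of integers: values above the most frequent value are shifted by one to free the neighboring bin, and the freed bin then carries the watermark bits. It is a simplified illustration and not the GAHSW algorithm of [87]; the GA-based key selection and database grouping are omitted, and the function names and sample data are hypothetical.

```python
from collections import Counter

def hs_embed(values, bits):
    """Embed bits into integer values by shifting the histogram above the peak."""
    peak = Counter(values).most_common(1)[0][0]
    if len(bits) > values.count(peak):
        raise ValueError("watermark exceeds embedding capacity")
    remaining = iter(bits)
    marked = []
    for v in values:
        if v > peak:
            marked.append(v + 1)                   # free the bin peak + 1
        elif v == peak:
            marked.append(v + next(remaining, 0))  # bit 1 moves to peak + 1
        else:
            marked.append(v)
    return marked, peak

def hs_extract(marked, peak, n_bits):
    """Recover the bits and restore the original values exactly."""
    bits, restored = [], []
    for v in marked:
        if v == peak or v == peak + 1:
            bits.append(v - peak)
            restored.append(peak)
        elif v > peak + 1:
            restored.append(v - 1)                 # undo the shift
        else:
            restored.append(v)
    return bits[:n_bits], restored

# Hypothetical numerical attribute and a 3-bit watermark.
values = [30, 31, 30, 29, 30, 32, 30, 31]
watermark = [1, 0, 1]
marked, peak = hs_embed(values, watermark)
extracted, restored = hs_extract(marked, peak, len(watermark))
```

The peak value acts as side information that the detector must know, and no value changes by more than 1, which is why the distortion of such schemes stays small.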
For the protection of medical images, Haddad et al. [89] presented a joint watermarking-encryption-compression (JWEC) scheme that gives access to watermarking-based security services from both encrypted and compressed image bit-streams. This scheme combines bit-substitution watermarking with JPEG-LS and the AES block cipher in its cipher block chaining (CBC) mode in a single operation performed on the entire image. The results demonstrate that the watermark capacities are sufficient to support several watermarking-based security services at the same time. A separable robust reversible watermarking scheme for encrypted 2D vector graphics is proposed by Peng et al. [90] to accomplish robust watermark extraction in the plaintext as well as the encrypted domain. In this scheme, a watermark mapping is built on the polar coordinate system, a hash-based message authentication code (HMAC), and erasure coding, which achieves better invisibility and robustness against normal operations and malicious attacks than the existing methods. It is the first reported work on reversible watermarking in encrypted 2D vector graphics that can extract the watermark in both domains, but an attack on the reference vertex may cause the data extraction to fail. Table 4 analyzes the notable models based on the watermarking technique along with their essential details.

VI. PROBABILITY BASED MODELS
The probability technique assesses the likelihood that an agent $U_j \in U$ is accountable for exposing the given leaked data set $L$, based on the overlap of the agent's data with the leaked data and with the data of other agents, and based on the probability that objects can be guessed by other means. Since the agents $U_1, U_2, \ldots, U_m$ have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent and that the leaked data were obtained by the target through other means. For example, say that one of the objects in $L$ represents a customer $Z$. Perhaps $Z$ is also a customer of some other company, and that company provided the data to the target. Or perhaps $Z$ can be reconstructed from various publicly available sources on the web. The more data in $L$, the harder it is for the agents to argue that they did not leak anything. Similarly, the rarer the objects, the harder it is to argue that the target obtained them through other means. For instance, if one of the objects in $L$ was only given to agent $U_1$, while the other objects were given to all agents, we may suspect $U_1$ more.
To compute the probability, an estimate of the probability that values in $L$ can be guessed by the target is required. For instance, say that some of the objects in $L$ are e-mails of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the e-mails of, say, 100 individuals. If this person can find, say, 90 e-mails, then we can reasonably guess that the probability of finding one e-mail is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2.
Data distribution strategies help in improving the probability of identifying a guilty user $M_U$. To identify an $M_U$ with high confidence, the distribution strategies should distribute the data sets $W^*_1, W^*_2, \ldots, W^*_m$ with the objective given in Eqs. (15) and (16). Figs. 7a to 7c depict three distribution strategies in which four documents $D_1, D_2, D_3, D_4$ are distributed among four users $U_1, U_2, U_3, U_4$, each with a request $R_1, R_2, R_3, R_4$ for two documents. The third strategy avoids full overlap and is optimal among the three.
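The effect of overlap on leaker identification can be made concrete by counting how many user pairs hold identical document sets under a given allocation. The sketch below uses the setting described above (four documents, four users, two documents per request), but Figs. 7a to 7c are not reproduced here, so the three allocations are hypothetical examples rather than the strategies shown in those figures. The function name `overlap_stats` is likewise an illustrative choice.

```python
from itertools import combinations

def overlap_stats(allocation):
    """Return (number of user pairs with identical sets, largest pairwise overlap)."""
    pairs = list(combinations(allocation.values(), 2))
    full = sum(1 for a, b in pairs if a == b)
    worst = max(len(a & b) for a, b in pairs)
    return full, worst

# Hypothetical allocations; every user receives two of the four documents.
same_for_all = {u: {"D1", "D2"} for u in ("U1", "U2", "U3", "U4")}
two_groups = {"U1": {"D1", "D2"}, "U2": {"D1", "D2"},
              "U3": {"D3", "D4"}, "U4": {"D3", "D4"}}
staggered = {"U1": {"D1", "D2"}, "U2": {"D2", "D3"},
             "U3": {"D3", "D4"}, "U4": {"D4", "D1"}}

stats = {name: overlap_stats(alloc) for name, alloc in
         (("same_for_all", same_for_all),
          ("two_groups", two_groups),
          ("staggered", staggered))}
```

When two users hold identical sets, a leak of those documents cannot be attributed to one of them rather than the other. Only the staggered allocation has no such pair, so every leaked pair of documents points to a single user.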
Papadimitriou and Garcia-Molina [103] proposed an agent guilt model based on a probabilistic approach to evaluate the likelihood that the data was revealed by one or more agents rather than independently gathered by an unauthorized party through alternative means. This model assesses the maliciousness of various agents when the allocator finds the leaked data at an unauthorized place. It is demonstrated that the judicious distribution of objects can assist in distinguishing the malicious entities with a remarkable distinction, especially when the overlap among the data acquired by the users is large.
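A guilt estimate of this kind can be sketched as follows. The formula used here is the estimator commonly attributed to the agent guilt model of [103], $\Pr\{G_i \mid L\} = 1 - \prod_{t \in L \cap R_i} \left(1 - \frac{1 - p}{|V_t|}\right)$, where $R_i$ is the set of objects given to agent $U_i$, $V_t$ is the set of agents that received object $t$, and $p$ is the probability that the target guesses an object by other means. Since this section does not state the formula, it should be read as an assumption, as should the independence of the leaked objects, the function name, and the example data.

```python
def guilt_probability(agent, leaked, allocation, p_guess):
    """Estimate the probability that `agent` leaked at least one object of `leaked`.

    Assumptions: objects leak independently, and every agent holding an
    object is equally likely to be its source.
    """
    prob_innocent = 1.0
    for obj in leaked & allocation[agent]:
        holders = sum(1 for objs in allocation.values() if obj in objs)
        prob_innocent *= 1.0 - (1.0 - p_guess) / holders
    return 1.0 - prob_innocent

# Guessing probability estimated from an experiment: 20 of 100 objects found.
p_guess = 20 / 100

# Hypothetical allocation: t1 went to both agents, t2 only to U1.
allocation = {"U1": {"t1", "t2"}, "U2": {"t1"}}
leaked = {"t1", "t2"}

guilt = {agent: guilt_probability(agent, leaked, allocation, p_guess)
         for agent in allocation}
```

With these numbers the estimate is 0.88 for U1 and 0.4 for U2, which reflects the intuition that an object given to a single agent incriminates that agent more strongly than an object shared by all.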
Harel et al. [104] introduced a misuseability weight concept that assigns a score to data sets according to their sensitivity. The scheme evaluates the sensitivity level of the data revealed to insiders in order to reduce data misuse and data leakage incidents in the database system. Furthermore, the scheme estimates the insider's ability to exploit the sensitive data maliciously and predicts the damage that could result from the data leakage. A method is presented by Kumar et al. [105] to secure the data from unauthorized use. Allocation strategies are introduced that operate on a no-wait prototype and increase the chances of identifying the guilty party. The likelihood that an individual agent is culpable for leaking a dataset is assessed.
To prevent data leakage, Fan et al. [106] proposed a file distribution model that plans file allocation so as to minimize the overlap between the file sets received by the agents. Consequently, the model can discover the origin of a leak with high probability. The performance analysis revealed that the model is capable of detecting the sources of leakage as well as distinguishing the malicious agents efficiently. However, the achieved results are not compared with the baseline state-of-the-art distribution model [103]. For textual data, a misuseability evaluator named TM-Score is defined in [107] as an extension of the misuseability weight concept [104]. With this evaluator, enterprises can estimate the extent of the damage that results from gradual and continuous exposure of textual content, such as emails and documents, caused by an insider. The degree of damage is assessed from the quality, type, and amount of the revealed information. Sodagudi and Kurra [108] provided a method to identify malicious attackers in a mobile ad-hoc network (MANET) by integrating a routing protocol with a cryptography technique. The data distributor follows allocation strategies that leave less scope for data leakage. Guevara et al. [109] presented an algorithm for data leakage detection that exploits anomalous user behavior. To this end, the user's operations in a computer system are encoded in a dynamic structure, which permits the extraction of a user's profile by following the sequences of actions in the historical database. The efficiency of the work is demonstrated by its low false-positive rate and high detection accuracy. However, its dependency on the users' historical data for generating their behavioral patterns makes it a time-consuming process.
Ezhilchelvan and Mitrani [110] examined a system in which several virtual machines (VMs) share a common physical machine and evaluated the probability of malicious co-residency in public clouds. The allocation of VMs to the physical machine is carried out by applying random and priority block policies while considering multiple security breaches simultaneously. The simulation results indicated acceptable accuracy; however, real-life experiments are required to confirm the accuracy of the reported method. Table 5 presents the essential details of the influential models founded on the probability technique.
Although CG is a powerful technique for preserving data privacy, security, and confidentiality and for preventing leakage, it has a crucial drawback: once the applied key is revealed, the data can be compromised. Also, the technique is unable to identify the culprit entity and incurs a considerable amount of overhead. The benefit of the AC technique is that it enables monitored disclosure of sensitive data and minimizes the risk of data leakage without any transformation cost, but the technique is incapable of identifying the malicious party when data leakage occurs and does not ensure data privacy. The DP technique has the advantage that it controls information exposure, preserves privacy, security, and utility, and prevents data leakage, but it is ineffective for identifying the malicious entity. It involves a substantial amount of overhead. Also, this technique is not fit for applications where the exact data is required without any modifications. WM is a sturdy leaker detection technique that preserves high utility along with the security of the data. The strength of the technique is its effectiveness in detecting the exact entity that revealed the confidential information.
However, the technique has a crucial limitation: it becomes incapable of identifying the malicious entity when the embedded information is completely removed or altered by the malicious agent. This technique cannot prevent data leakage, incurs overhead, and does not assure privacy. Also, the technique may not be applicable to every data type. The strength of the PB technique is that the identification of the malicious entity is not affected by modifications to the shared data, unlike the watermarking technique. Also, data security is not key-dependent in the PB technique, unlike the cryptography technique, where the key can be compromised. It is a powerful leaker detection technique that is applicable to every data type and ensures high utility in addition to security without involving any transformation cost. However, this technique provides only an estimate of the malicious entity and does not assure privacy. Also, the technique does not support a leakage prevention mechanism and cannot stop illegal access to obtain and misuse the data.

VII. COMPARATIVE AND COMPREHENSIVE ANALYSIS
Overall, it is inferred that CG is the leading technique among the five for preserving privacy, security, and confidentiality as well as for leakage prevention. AC is the foremost technique to ensure privacy without involving the cost of transformation. DP is the prime technique to preserve privacy coupled with utility. WM and PB are the best techniques for assuring data utility in association with leaker detection. Furthermore, WM is the optimum leaker detection technique for exact culprit identification, while PB is the superior leaker estimation technique because it is not affected by data modification. However, no technique alone is sufficient to provide a completely secure methodology, and an integration of the techniques is necessary for an effective data protection mechanism.

VIII. CONCLUSION AND FUTURE WORK
Data protection is a challenging task in the field of cloud computing and information security. A plethora of work has been undertaken to mitigate this challenge. However, a comprehensive study of the existing solutions has been lacking. From this perspective, this paper presented a comprehensive analysis and explored the foremost techniques with respect to their functionality and the relevant solutions for sharing data securely to protect it in the cloud environment. The essential information needed to grasp the core of each method is highlighted, along with the research gaps and future directions for each discussed solution. Furthermore, an exhaustive analysis and a comparison of the reviewed techniques are performed. The relevance of every technique is analyzed according to the context.
It is observed that no technique alone is sufficient to ensure the absolute security of the data against every party directly or indirectly engaged in the system. A robust solution can be developed by integrating the techniques to provide complete security to the system in the sharing environment. Moreover, given the highlights of the notable solutions addressed, it is expected that the presented analysis will act as a milestone for potential researchers working in the area as well as for other emerging applications that demand secure data storage and sharing for data protection.